From brett at python.org  Fri Jun  1 00:51:54 2007
From: brett at python.org (Brett Cannon)
Date: Thu, 31 May 2007 15:51:54 -0700
Subject: [Python-3000] Is the --enable-unicode configure arg going anywhere?
Message-ID: <bbaeab100705311551t754be878x88f29b6c1c402f0@mail.gmail.com>

I vaguely remember a discussion about the str/unicode unification and
whether there was going to be standardization on the internal representation
of Unicode or not.  I don't remember the outcome, but I am curious as to
whether it will lead to the removal of --enable-unicode or not.

Reason I ask is that the OS X extension modules do not like it when you
compile with UCS-4 (see http://www.python.org/sf/763708).  If the option is
not going to go away I am going to try to lean on someone to address this as
Unicode is obviously going to play a bigger role in Python come 3.0.

-Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070531/043991b9/attachment.html 

From guido at python.org  Fri Jun  1 00:58:56 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Jun 2007 06:58:56 +0800
Subject: [Python-3000] [Python-Dev] PEP 367: New Super
In-Reply-To: <20070531170734.273393A40AA@sparrow.telecommunity.com>
References: <001101c79aa7$eb26c130$0201a8c0@mshome.net>
	<002d01c79f6d$ce090de0$0201a8c0@mshome.net>
	<ca471dc20705260708t952d820w7473474554c9469b@mail.gmail.com>
	<003f01c79fd9$66948ec0$0201a8c0@mshome.net>
	<ca471dc20705270259ke665af6v3b5bdbffbd926330@mail.gmail.com>
	<009c01c7a04f$7e348460$0201a8c0@mshome.net>
	<ca471dc20705270550j5e199624xd4e8f6caa9dda93d@mail.gmail.com>
	<ca471dc20705281937y48300821u840add9d5454e8d9@mail.gmail.com>
	<ca471dc20705310448p5c5cfeds41fdc75e05c21f55@mail.gmail.com>
	<20070531170734.273393A40AA@sparrow.telecommunity.com>
Message-ID: <ca471dc20705311558v6c33ae07wc29590f2c84262ac@mail.gmail.com>

Ouch. You're right. Class methods are broken by this patch. I don't
have time right now to look into a fix (thanks for the various
suggestions) but if somebody doesn't get to it first I'll look into
this in-depth on Monday.

class C:
    @classmethod
    def cm(cls): return cls.__name__
class D(C): pass
print(D.cm(), D().cm())

This prints "C C" with the patch, but "D D" without it. Clearly this
shouldn't change.

--Guido

On 6/1/07, Phillip J. Eby <pje at telecommunity.com> wrote:
> At 07:48 PM 5/31/2007 +0800, Guido van Rossum wrote:
> >I've updated the patch; the latest version now contains the grammar
> >and compiler changes needed to make super a keyword and to
> >automatically add a required parameter 'super' when super is used.
> >This requires the latest p3yk branch (r55692 or higher).
> >
> >Comments anyone? What do people think of the change of semantics for
> >the im_class field of bound (and unbound) methods?
>
> Please correct me if I'm wrong, but just looking at the patch it
> seems to me that the descriptor protocol is being changed as well --
> i.e., the 'type' argument is now the found-in-type in the case of an
> instance __get__ as well as class __get__.
>
> It would seem to me that this change would break classmethods both on
> the instance and class level, since the 'cls' argument is supposed to
> be the derived class, not the class where the method was
> defined.  There also don't seem to be any tests for the use of super
> in classmethods.
>
> This would seem to make the change unworkable, unless we are also
> getting rid of classmethods, or further change the descriptor
> protocol to add another argument.  However, by the time we get to
> that point, it seems like making 'super' a cell variable might be a
> better option.
>
> Here's a strategy that I think could resolve your difficulties with
> the cell variable approach:
>
> First, when a class is encountered during the symbol setup pass,
> allocate an extra symbol for the class as a cell variable with a
> generated name (e.g. $1, $2, etc.), and keep a pointer to this name
> in the class state information.
>
> Second, when generating code for 'super', pull out the generated
> variable name of the nearest enclosing class, and use it as if it had
> been written in the code.
>
> Third, change the MAKE_FUNCTION for the BUILD_CLASS to a
> MAKE_CLOSURE, and add code after BUILD_CLASS to also store a super
> object in the special variable.  Maybe something like:
>
>       ...
>       BUILD_CLASS
>       ... apply decorators ...
>       DUP_TOP
>       STORE_* classname
>       ... generate super object ...
>       STORE_DEREF $n
>
> Fourth, make sure that the frame initialization code can deal with a
> code object that has a locals dictionary *and* cell variables.  For
> Python 2.5, this constraint is already met as long as CO_OPTIMIZED
> isn't set, and that should already be true for the relevant cases
> (module-level code and class bodies), so we really just need to
> ensure that CO_OPTIMIZED doesn't get set as a side-effect of adding
> cell variables.
>
>
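
For what it's worth, the cell-variable idea can be emulated in plain
Python, with an ordinary closure variable standing in for the generated
"$n" cell (a sketch only; the names are illustrative, not the patch's
actual machinery):

    def _make_class():
        cell = []                      # stands in for the generated cell $n

        class C(object):
            def describe(self):
                # 'super' would compile down to roughly this lookup:
                return super(cell[0], self).__repr__()

        cell.append(C)                 # the STORE_DEREF $n after BUILD_CLASS
        return C

    C = _make_class()
    print(C().describe())              # dispatches through object.__repr__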


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From greg.ewing at canterbury.ac.nz  Fri Jun  1 01:27:37 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 01 Jun 2007 11:27:37 +1200
Subject: [Python-3000] Lines breaking
In-Reply-To: <8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <acd65fa20705280956j20409bc1qfe2f82f03ca94247@mail.gmail.com>
	<ca471dc20705281544i3be797f7ldab472dac3e1f543@mail.gmail.com>
	<acd65fa20705281649m7a7a871bw8d690456202f7b83@mail.gmail.com>
	<465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de>
	<465CD016.7050002@canterbury.ac.nz>
	<87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465E37C3.9070407@canterbury.ac.nz>
	<8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <465F59E9.4030702@canterbury.ac.nz>

Stephen J. Turnbull wrote:

> *Python* does the right thing: it leaves the line break character(s)
> in place.  It's not Python's problem if programmers go around
> stripping characters just because they happen to be at the end of the
> line.

But currently you *know* that, e.g. string.strip() will
only ever remove whitespace and \n characters, so if
those don't matter to you, it's safe to use it.

I would be worried if it started removing characters
that it didn't remove before, because that could
alter the semantics of my code.

> Those characters are
> mandatory breaks because the expectation is *very* consistent (they
> say).

I object to being told by the Unicode committee what
semantics I should be using for ASCII characters that
pre-date unicode by a long way.

--
Greg

From guido at python.org  Fri Jun  1 01:50:29 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Jun 2007 07:50:29 +0800
Subject: [Python-3000] __debug__
In-Reply-To: <f3n9f1$kqg$1@sea.gmane.org>
References: <f3n0ul$k49$1@sea.gmane.org>
	<bbaeab100705311151q5a900afeq7ab1f8988a7eecee@mail.gmail.com>
	<f3n9f1$kqg$1@sea.gmane.org>
Message-ID: <ca471dc20705311650m47f0659dpdd2ec3f960e0df71@mail.gmail.com>

Making __debug__ another keyword atom sounds great to me.

On 6/1/07, Thomas Heller <theller at ctypes.org> wrote:
> Brett Cannon schrieb:
> > On 5/31/07, Georg Brandl <g.brandl at gmx.net> wrote:
> >>
> >> Guido just fixed a case in the py3k branch where you could assign to
> >> "None" in a function call.
> >>
> >> __debug__ has similar problems: it can't be assigned to normally, but via
> >> keyword arguments it is possible.
> >>
> >> This should be fixed; or should __debug__ be thrown out anyway?
> >
> >
> >
> > I never use the flag, personally.  When I am debugging I have an
> > app-specific flag I set.  I am +1 on ditching it.
> >
> > -Brett
> >
> >
>
> I would very much wish that __debug__ stays, because I use it in nearly every larger
> program that I later wish to freeze and distribute.
>
> "if __debug__: ..." blocks have the advantage that *no* bytecode is generated
> when run or frozen with -O or -OO, so the modules imported in these blocks
> are not pulled in by modulefinder.  You cannot get this effect (AFAIK) with
> app-specific flags.
>
> Thanks,
> Thomas
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>
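
Thomas's point is easy to demonstrate; a minimal sketch (pprint merely
stands in for a real debug-only dependency):

    if __debug__:
        # Under "python -O" this whole block is dropped at compile time,
        # so no bytecode is emitted for it and modulefinder never sees
        # the import below.
        import pprint
        pprint.pprint({'debugging': True})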


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Fri Jun  1 01:55:22 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Jun 2007 07:55:22 +0800
Subject: [Python-3000] Is the --enable-unicode configure arg going
	anywhere?
In-Reply-To: <bbaeab100705311551t754be878x88f29b6c1c402f0@mail.gmail.com>
References: <bbaeab100705311551t754be878x88f29b6c1c402f0@mail.gmail.com>
Message-ID: <ca471dc20705311655g3a54be09n7adcdee1afe083d8@mail.gmail.com>

I don't know exactly what that option does; it won't be possible to
disable unicode in 3.0, but I fully plan to continue supporting both
2-byte and 4-byte storage. 4-byte storage is broken on OS X; it ought to
be fixed (unless it's a platform policy not to support it, as appears
to be the case on Windows).

--Guido

On 6/1/07, Brett Cannon <brett at python.org> wrote:
> I vaguely remember a discussion about the str/unicode unification and
> whether there was going to be standardization on the internal representation
> of Unicode or not.  I don't remember the outcome, but I am curious as to
> whether it will lead to the removal of --enable-unicode or not.
>
> Reason I ask is that the OS X extension modules do not like it when you
> compile with UCS-4 (see http://www.python.org/sf/763708).  If the option is
> not going to go away I am going to try to lean on someone to address this as
> Unicode is obviously going to play a bigger role in Python come 3.0.
>
> -Brett
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe:
> http://mail.python.org/mailman/options/python-3000/guido%40python.org
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From brett at python.org  Fri Jun  1 03:59:19 2007
From: brett at python.org (Brett Cannon)
Date: Thu, 31 May 2007 18:59:19 -0700
Subject: [Python-3000] Is the --enable-unicode configure arg going
	anywhere?
In-Reply-To: <ca471dc20705311655g3a54be09n7adcdee1afe083d8@mail.gmail.com>
References: <bbaeab100705311551t754be878x88f29b6c1c402f0@mail.gmail.com>
	<ca471dc20705311655g3a54be09n7adcdee1afe083d8@mail.gmail.com>
Message-ID: <bbaeab100705311859l52075f0bo8e35b8f4626b4b69@mail.gmail.com>

On 5/31/07, Guido van Rossum <guido at python.org> wrote:
>
> I don't know exactly what that option does;


It specifies how Unicode is stored internally in the interpreter (I
believe).

> it won't be possible to
> disable unicode in 3.0, but I fully plan to continue supporting both
> 2-byte and 4-byte storage. 4-byte storage is broken on OSX it ought to
> be fixed (unless it's a platform policy not to support it, as appears
> to be the case on Windows).


It's broken in the Mac extension modules that are auto-generated.  Otherwise
it's fine.

-Brett


> --Guido
>
> On 6/1/07, Brett Cannon <brett at python.org> wrote:
> > I vaguely remember a discussion about the str/unicode unification and
> > whether there was going to be standardization on the internal
> > representation
> > of Unicode or not.  I don't remember the outcome, but I am curious as to
> > whether it will lead to the removal of --enable-unicode or not.
> >
> > Reason I ask is that the OS X extension modules do not like it when you
> > compile with UCS-4 (see http://www.python.org/sf/763708).  If the option
> > is
> > not going to go away I am going to try to lean on someone to address
> > this as
> > Unicode is obviously going to play a bigger role in Python come 3.0.
> >
> > -Brett
> >
> > _______________________________________________
> > Python-3000 mailing list
> > Python-3000 at python.org
> > http://mail.python.org/mailman/listinfo/python-3000
> > Unsubscribe:
> > http://mail.python.org/mailman/options/python-3000/guido%40python.org
> >
> >
>
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070531/89b2d1c1/attachment.htm 

From stephen at xemacs.org  Fri Jun  1 05:23:54 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 01 Jun 2007 12:23:54 +0900
Subject: [Python-3000] Lines breaking
In-Reply-To: <465F59E9.4030702@canterbury.ac.nz>
References: <acd65fa20705280956j20409bc1qfe2f82f03ca94247@mail.gmail.com>
	<ca471dc20705281544i3be797f7ldab472dac3e1f543@mail.gmail.com>
	<acd65fa20705281649m7a7a871bw8d690456202f7b83@mail.gmail.com>
	<465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de>
	<465CD016.7050002@canterbury.ac.nz>
	<87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465E37C3.9070407@canterbury.ac.nz>
	<8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465F59E9.4030702@canterbury.ac.nz>
Message-ID: <87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp>

Greg Ewing writes:

 > Stephen J. Turnbull wrote:
 > 
 > > *Python* does the right thing: it leaves the line break character(s)
 > > in place.  It's not Python's problem if programmers go around
 > > stripping characters just because they happen to be at the end of the
 > > line.
 > 
 > But currently you *know* that, e.g. string.strip() will
 > only ever remove whitespace and \n characters, so if
 > those don't matter to you, it's safe to use it.

Yes.  Both FF and VT *are* whitespace, AFAIK that has universal
agreement, and in particular they *are* removed by string.strip().  I
don't understand what you're worried about; nothing changes with
respect to handling of generic whitespace.

The *only* thing that adoption of the Unicode recommendation for line
breaking changes is that "\x0c\n" is now two empty lines with well-
defined semantics instead of some number of lines with you-won't-know-
until-you-ask-the-implementation semantics.

 > > Those characters are mandatory breaks because the expectation is
 > > *very* consistent (they say).

 > I object to being told by the Unicode committee what
 > semantics I should be using for ASCII characters that
 > pre-date unicode by a long way.

The ASCII standard, at least as codified in ISO 646, agrees with
Unicode, by referring to ECMA-48/ISO 6429 for the definition of the 32
C0 characters.  I suspect that the ANSI standard semantics of FF and
VT haven't changed since ANSI_X3.4-1963.

You just object to adopting a standard, period, because it might force
you to change your practices.  That's reasonable, changing working
software is expensive.  But interoperability is an important goal too.

From hagenf at CoLi.Uni-SB.DE  Fri Jun  1 07:17:48 2007
From: hagenf at CoLi.Uni-SB.DE (Hagen Fürstenau)
Date: Fri, 01 Jun 2007 07:17:48 +0200
Subject: [Python-3000] Is the --enable-unicode configure arg
	going	anywhere?
In-Reply-To: <ca471dc20705311655g3a54be09n7adcdee1afe083d8@mail.gmail.com>
References: <bbaeab100705311551t754be878x88f29b6c1c402f0@mail.gmail.com>
	<ca471dc20705311655g3a54be09n7adcdee1afe083d8@mail.gmail.com>
Message-ID: <465FABFC.5040804@coli.uni-saarland.de>

Hi,

I've been reading the list for a couple of weeks, but this is my first post.

Guido van Rossum wrote:
> I don't know exactly what that option does; it won't be possible to
> disable unicode in 3.0, but I fully plan to continue supporting both
> 2-byte and 4-byte storage.

Does this still include the possibility of switching between 1-, 2- and
4-byte storage internally? I think you mentioned this in your Google
talk and I thought it was a very good compromise - and much better
than a compile-time switch.

- Hagen

From martin at v.loewis.de  Fri Jun  1 08:05:16 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Fri, 01 Jun 2007 08:05:16 +0200
Subject: [Python-3000] Is the --enable-unicode configure
	arg	going	anywhere?
In-Reply-To: <465FABFC.5040804@coli.uni-saarland.de>
References: <bbaeab100705311551t754be878x88f29b6c1c402f0@mail.gmail.com>	<ca471dc20705311655g3a54be09n7adcdee1afe083d8@mail.gmail.com>
	<465FABFC.5040804@coli.uni-saarland.de>
Message-ID: <465FB71C.5060201@v.loewis.de>

> Guido van Rossum wrote:
>> I don't know exactly what that option does; it won't be possible to
>> disable unicode in 3.0, but I fully plan to continue supporting both
>> 2-byte and 4-byte storage.
> 
> Does this still include the possibility of switching between 1-, 2- and
> 4-byte storage internally? I think you mentioned this in your Google
> talk and I thought it was a very good compromise - and much better
> than a compile-time switch.

In the current py3k-struni branch, it's still a compile time option.
I doubt that will change unless somebody contributes code to make it
change. The current compile-time option is between 2-byte and 4-byte
representation; 1-byte representation is not supported.
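
Which representation a given interpreter was built with is visible at
runtime:

    >>> import sys
    >>> sys.maxunicode   # 65535 on a 2-byte build, 1114111 on a 4-byte build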

Regards,
Martin

From jason.orendorff at gmail.com  Fri Jun  1 18:05:32 2007
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Fri, 1 Jun 2007 12:05:32 -0400
Subject: [Python-3000] map, filter, reduce
Message-ID: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>

PEP 3100 still isn't clear on the fate of these guys, except that
reduce() is gone.

How about moving all three to the functools module instead?

-j

From steven.bethard at gmail.com  Fri Jun  1 18:24:12 2007
From: steven.bethard at gmail.com (Steven Bethard)
Date: Fri, 1 Jun 2007 10:24:12 -0600
Subject: [Python-3000] map, filter, reduce
In-Reply-To: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
Message-ID: <d11dcfba0706010924q9ff8514nf9bb1e8eddc3a1a@mail.gmail.com>

On 6/1/07, Jason Orendorff <jason.orendorff at gmail.com> wrote:
> PEP 3100 still isn't clear on the fate of these guys, except that
> reduce() is gone.
>
> How about moving all three to the functools module instead?

The itertools module already has imap() and ifilter(). They can just
be renamed to map() and filter() and left where they are.
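
In current 2.x terms that is simply:

    from itertools import imap, ifilter
    list(imap(abs, [-1, -2, 3]))        # [1, 2, 3]
    list(ifilter(None, [0, 1, '', 2]))  # [1, 2]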

STeVe
-- 
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a
tiny blip on the distant coast of sanity.
        --- Bucky Katt, Get Fuzzy

From tjreedy at udel.edu  Fri Jun  1 19:12:00 2007
From: tjreedy at udel.edu (Terry Reedy)
Date: Fri, 1 Jun 2007 13:12:00 -0400
Subject: [Python-3000] map, filter, reduce
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
Message-ID: <f3pk11$607$1@sea.gmane.org>


"Jason Orendorff" <jason.orendorff at gmail.com> wrote in message 
news:bb8868b90706010905p2dae12b7qc538cf25190c7127 at mail.gmail.com...
| PEP 3100 still isn't clear on the fate of these guys, except that
| reduce() is gone.
|
| How about moving all three to the functools module instead?

The current reduce is broken: it mashes together two versions of the
function (one with 2 params, the other with 3), and the 3-param one has an
ill-formed signature (inconsistent parameter order) to allow a mashing that
should not have been done.  (The ill-formed signature is hard to remember
and is responsible for part of some people's dislike of reduce.)  I would
like a proper 3-param version in functools, but have not written the exact
proposal yet since library changes have been put off.
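
To make the complaint concrete, the current built-in behaves like this
(in 3.0 a repaired version would live in functools):

    # the function's own arguments run (accumulator, item), yet the
    # initial accumulator value trails the sequence:
    reduce(lambda acc, item: acc + item, range(5))       # -> 10
    reduce(lambda acc, item: acc + item, range(5), 100)  # -> 110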

I am also thinking about an ireduce, but need to make sure it cannot be 
easily done with current itertools.

Terry Jan Reedy

From janssen at parc.com  Fri Jun  1 19:51:17 2007
From: janssen at parc.com (Bill Janssen)
Date: Fri, 1 Jun 2007 10:51:17 PDT
Subject: [Python-3000] Lines breaking
In-Reply-To: <873b1c287v.fsf@uwakimon.sk.tsukuba.ac.jp> 
References: <acd65fa20705280956j20409bc1qfe2f82f03ca94247@mail.gmail.com>
	<ca471dc20705281544i3be797f7ldab472dac3e1f543@mail.gmail.com>
	<acd65fa20705281649m7a7a871bw8d690456202f7b83@mail.gmail.com>
	<465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de>
	<465CD016.7050002@canterbury.ac.nz>
	<87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465E37C3.9070407@canterbury.ac.nz>
	<07May31.074957pdt."57996"@synergy1.parc.xerox.com>
	<873b1c287v.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <07Jun1.105118pdt."57996"@synergy1.parc.xerox.com>

> I agree that that looks nice in my editor, but it is not Unicode-
> conforming practice, and I suspect that if you experiment with any
> printer you'll discover that you get an empty line at the top of the
> page.

This seems to me to be a non-issue; most "text" files are actually
data files (think about it), and were never intended to be printed.

> I also suspect that any program that currently is used to process
> those files' content by lines probably simply treats the FF as
> whitespace, and throws away empty lines.

Nope.  At least, my program doesn't.  And I don't think it's an
appropriate assumption, either.  Many programs are written to ignore
empty lines in their input, but many, maybe more, are not.  Blank
lines convey critical information in many contexts.

> If so, it will still work
> with FF treated as a hard line break in line-processing mode, since
> the trailing NL will now generate a (superfluous) empty line.

Nope.  The line-breaking is actually used (and this is common in data
represented as text files) as part of the parsing process, so by
turning it into two lines you've broken the program logic.

Bill

From janssen at parc.com  Fri Jun  1 20:14:32 2007
From: janssen at parc.com (Bill Janssen)
Date: Fri, 1 Jun 2007 11:14:32 PDT
Subject: [Python-3000] Lines breaking
In-Reply-To: <87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp> 
References: <acd65fa20705280956j20409bc1qfe2f82f03ca94247@mail.gmail.com>
	<ca471dc20705281544i3be797f7ldab472dac3e1f543@mail.gmail.com>
	<acd65fa20705281649m7a7a871bw8d690456202f7b83@mail.gmail.com>
	<465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de>
	<465CD016.7050002@canterbury.ac.nz>
	<87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465E37C3.9070407@canterbury.ac.nz>
	<8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465F59E9.4030702@canterbury.ac.nz>
	<87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <07Jun1.111434pdt."57996"@synergy1.parc.xerox.com>

> The *only* thing that adoption of the Unicode recommendation for line
> breaking changes is that "\x0c\n" is now two empty lines with well-
> defined semantics instead of some number of lines with you-won't-know-
> until-you-ask-the-implementation semantics.

Well, that's just the way text is.

> The ASCII standard, at least as codified in ISO 646, agrees with
> Unicode, by referring to ECMA-48/ISO 6429 for the definition of the 32
> C0 characters.  I suspect that the ANSI standard semantics of FF and
> VT haven't changed since ANSI_X3.4-1963.
> 
> You just object to adopting a standard, period, because it might force
> you to change your practices.  That's reasonable, changing working
> software is expensive.  But interoperability is an important goal too.

Where, specifically, are the breakdowns in interoperability
manifesting themselves?

I'm sort of amazed at the turn of this argument.  Greg is arguing that
it might be arbitrarily expensive to make this change, because of the
way that text is used to store data by many programs, and because it's
been the way it's been for 15 years of Python history.  So the cost of
"changing working software" could run into billions; we have no way to
know.  But Stephen is arguing that we need to do it anyway to conform
to the dictates of some post-facto standards committee (yes, I know, I
usually *like* that argument :-).

Yesterday at Google Developer's Day, Alex Martelli told me that Python
is about pragmatics; I think I know which side the pragmatics come
down on in this case.

How about a subtype of File which supports this behavior?

Bill

From collinw at gmail.com  Fri Jun  1 21:08:32 2007
From: collinw at gmail.com (Collin Winter)
Date: Fri, 1 Jun 2007 12:08:32 -0700
Subject: [Python-3000] map, filter, reduce
In-Reply-To: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
Message-ID: <43aa6ff70706011208x7db0393ewbd84d874eed9a333@mail.gmail.com>

On 6/1/07, Jason Orendorff <jason.orendorff at gmail.com> wrote:
> PEP 3100 still isn't clear on the fate of these guys, except that
> reduce() is gone.

I'm not sure what isn't clear: reduce() is listed as "to be removed",
and since map() and filter() aren't mentioned as "to be removed",
they're presumably not going to be removed. What's tripping you up?

Collin Winter

From g.brandl at gmx.net  Fri Jun  1 21:17:35 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 01 Jun 2007 21:17:35 +0200
Subject: [Python-3000] map, filter, reduce
In-Reply-To: <f3pk11$607$1@sea.gmane.org>
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
	<f3pk11$607$1@sea.gmane.org>
Message-ID: <f3prce$9c$1@sea.gmane.org>

Terry Reedy schrieb:
> "Jason Orendorff" <jason.orendorff at gmail.com> wrote in message 
> news:bb8868b90706010905p2dae12b7qc538cf25190c7127 at mail.gmail.com...
> | PEP 3100 still isn't clear on the fate of these guys, except that
> | reduce() is gone.
> |
> | How about moving all three to the functools module instead?
> 
> The current reduce is broken due to being a mashing together of two 
> versions of the function (one 2 params, the other 3), with the 3-param one 
> having an ill-formed signature (inconsistent parameter order) to allow the 
> mashing that should not have been done.  (The ill-formed signature is hard 
> to remember and is responsible for part of some peoples' dislike of 
> reduce.) I would like a proper 3-param version in functools, but have not 
> writen the exact proposal yet since library changes have been put off.
> 
> I am also thinking about an ireduce, but need to make sure it cannot be 
> easily done with current itertools.

How should an "ireduce" work? The result is not a sequence which could be
returned lazily.

Georg


-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From g.brandl at gmx.net  Fri Jun  1 22:14:38 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 01 Jun 2007 22:14:38 +0200
Subject: [Python-3000] Error in PEP 3115?
Message-ID: <f3pund$c57$1@sea.gmane.org>

In PEP 3115 (the new metaclasses PEP), there is an example metaclass:

      # The metaclass
      class OrderedClass(type):

          # The prepare function
          @classmethod
          def __prepare__(metacls, name, bases): # No keywords in this case
             return member_table()

          # The metaclass invocation
          def __init__(self, name, bases, classdict):
             # Note that we replace the classdict with a regular
             # dict before passing it to the superclass, so that we
             # don't continue to record member names after the class
             # has been created.
             result = type(name, bases, dict(classdict))
             result.member_names = classdict.member_names
             return result

Shouldn't __init__ be __new__? Also, if type(...) and not
type.__new__(self, ...) is called, the type of a class using this
metaclass will be type, not OrderedClass, but this may be intended.
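
For reference, a sketch of what the corrected example would presumably
look like (member_table as defined elsewhere in the PEP; this is my guess
at the fix, not the PEP's final text):

      # The metaclass
      class OrderedClass(type):

          # The prepare function
          @classmethod
          def __prepare__(metacls, name, bases): # No keywords in this case
             return member_table()

          # The metaclass invocation
          def __new__(metacls, name, bases, classdict):
             result = type.__new__(metacls, name, bases, dict(classdict))
             result.member_names = classdict.member_names
             return result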

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From jason.orendorff at gmail.com  Fri Jun  1 22:55:56 2007
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Fri, 1 Jun 2007 16:55:56 -0400
Subject: [Python-3000] map, filter, reduce
In-Reply-To: <43aa6ff70706011208x7db0393ewbd84d874eed9a333@mail.gmail.com>
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
	<43aa6ff70706011208x7db0393ewbd84d874eed9a333@mail.gmail.com>
Message-ID: <bb8868b90706011355m4a17b883pe91f0fc6cb24b804@mail.gmail.com>

On 6/1/07, Collin Winter <collinw at gmail.com> wrote:
> On 6/1/07, Jason Orendorff <jason.orendorff at gmail.com> wrote:
> > PEP 3100 still isn't clear on the fate of these guys, except that
> > reduce() is gone.
>
> I'm not sure what isn't clear: reduce() is listed as "to be removed",
> and since map() and filter() aren't mentioned as "to be removed",
> they're presumably not going to be removed. What's tripping you up?

  "I think these features should be cut from Python 3000. [...]
  I think dropping filter() and map() is pretty uncontroversial [...]."
  http://www.artima.com/weblogs/viewpost.jsp?thread=98196

I know it was two years ago, and just a blog post for crying out
loud, but apparently it was pretty traumatic for some people,
because I still hear people whinge about it.  I would like to have
an authoritative document to point those people toward.  Perhaps
PEP 3099?

-j

From tjreedy at udel.edu  Fri Jun  1 23:44:48 2007
From: tjreedy at udel.edu (Terry Reedy)
Date: Fri, 1 Jun 2007 17:44:48 -0400
Subject: [Python-3000] map, filter, reduce
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com><f3pk11$607$1@sea.gmane.org>
	<f3prce$9c$1@sea.gmane.org>
Message-ID: <f3q40h$td7$1@sea.gmane.org>


"Georg Brandl" <g.brandl at gmx.net> wrote in message 
news:f3prce$9c$1 at sea.gmane.org...
| How should an "ireduce" work? The result is not a sequence which could be
| returned lazily.

It would generate the sequence of partial reductions (potentially
indefinitely):
list(ireduce(summer, 0, range(5))) == [0, 1, 3, 6, 10]

This is obviously *not* the same as a reduce() which only returns the final 
value without the intermediate values.
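
As a generator it is only a few lines (a sketch of the idea, not a
concrete proposal):

    def ireduce(func, initial, iterable):
        acc = initial
        for item in iterable:
            acc = func(acc, item)
            yield acc                  # emit each partial reduction

    # list(ireduce(lambda a, b: a + b, 0, range(5))) -> [0, 1, 3, 6, 10]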

Terry Jan Reedy


From alexandre at peadrop.com  Sat Jun  2 00:57:41 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Fri, 1 Jun 2007 18:57:41 -0400
Subject: [Python-3000] Handling of wide Unicode characters
Message-ID: <acd65fa20706011557h6bed4c8dh7a920385663e86b4@mail.gmail.com>

Hi,

I was doing some testing on the new _string_io module, since I was
slightly skeptical of my handling of wide Unicode characters (32 bits
long, instead of the usual 16 bits in UTF-16). So, I ran this
little test:

   >>> s = _string_io.StringIO()
   >>> s.write(u'??')
   >>> s.tell()
   2

Like I expected, wide Unicode characters count for two. However, I was
surprised that Python treats them as two characters as well:

   >>> len(u'??')
   2
   >>> u'??'
   u'\ud87e\udccd'

Is it a bug, or only an implementation choice?

Cheers,
-- Alexandre

From jcarlson at uci.edu  Sat Jun  2 01:11:35 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 01 Jun 2007 16:11:35 -0700
Subject: [Python-3000] Handling of wide Unicode characters
In-Reply-To: <acd65fa20706011557h6bed4c8dh7a920385663e86b4@mail.gmail.com>
References: <acd65fa20706011557h6bed4c8dh7a920385663e86b4@mail.gmail.com>
Message-ID: <20070601160728.6EF5.JCARLSON@uci.edu>


"Alexandre Vassalotti" <alexandre at peadrop.com> wrote:
> Hi,
> 
> I was doing some testing on the new _string_io module, since I was
> slightly skeptical on my handling of wide Unicode characters (32-bit
> of length, instead of the usual 16-bit in UTF-16). So, I ran this
> little test:
> 
>    >>> s = _string_io.StringIO()
>    >>> s.write(u'????')
>    >>> s.tell()
>    2
> 
> Like I expected, wide Unicode characters count for two. However, I was
> surprised that Python treats them as two characters as well:
> 
>    >>> len(u'????')
>    2
>    >>> u'????'
>    u'\ud87e\udccd'
> 
> Is it a bug, or only an implementation choice?

If your Python is compiled as a UTF-16 build, then any character in the
extended plane will be seen as two characters by Python.  If you are
using a UCS-4 build (it's the same as UTF-32), then you should be seeing
the single wide character as a single wide character.  The only
exception to this rule is if you enter the wide character as a surrogate
pair, in which case Python doesn't normalize it into the single wide
character.  To get a real wide character, you would need to use a proper
escape, or decode from an encoded string.
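
Concretely, on a narrow (UTF-16) build; a UCS-4 build would answer
1114111, 1 and False respectively:

    >>> import sys
    >>> sys.maxunicode          # 65535 on a narrow build
    65535
    >>> len(u'\U0002F8CD')      # the character behind the pair above
    2
    >>> u'\U0002F8CD' == u'\ud87e\udccd'
    True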


 - Josiah


From guido at python.org  Sat Jun  2 01:11:29 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 2 Jun 2007 07:11:29 +0800
Subject: [Python-3000] map, filter, reduce
In-Reply-To: <f3q40h$td7$1@sea.gmane.org>
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
	<f3pk11$607$1@sea.gmane.org> <f3prce$9c$1@sea.gmane.org>
	<f3q40h$td7$1@sea.gmane.org>
Message-ID: <ca471dc20706011611w1649b6b5h25e0eb79cd6a2340@mail.gmail.com>

I see no benefit in ireduce(), just more ways to write obfuscated code.

Regarding map() and filter(), I don't see what's unclear about PEP 3100:

"""
* Make built-ins return an iterator where appropriate (e.g. ``range()``,
  ``zip()``, ``map()``, ``filter()``, etc.) [zip and range: done]
"""

--Guido

On 6/2/07, Terry Reedy <tjreedy at udel.edu> wrote:
>
> "Georg Brandl" <g.brandl at gmx.net> wrote in message
> news:f3prce$9c$1 at sea.gmane.org...
> | How should an "ireduce" work? The result is not a sequence which could be
> | returned lazily.
>
> It would generate the sequence of partial reductions (potentially
> indefinitely):
> list(ireduce(summer, 0, range(5))) == [0, 1, 3, 6, 10]
>
> This is obviously *not* the same as a reduce() which only returns the final
> value without the intermediate values.
>
> Terry Jan Reedy
>
>
>
>
>
>
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Sat Jun  2 01:18:47 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 2 Jun 2007 07:18:47 +0800
Subject: [Python-3000] Error in PEP 3115?
In-Reply-To: <f3pund$c57$1@sea.gmane.org>
References: <f3pund$c57$1@sea.gmane.org>
Message-ID: <ca471dc20706011618p3c3312a1k6e817c5b57df6a1c@mail.gmail.com>

You're right. Fixed now. I also fixed dict.setitem (should be
dict.__setitem__). Thanks for noticing!

--Guido

On 6/2/07, Georg Brandl <g.brandl at gmx.net> wrote:
> In PEP 3115 (the new metaclasses PEP), there is an example metaclass:
>
>       # The metaclass
>       class OrderedClass(type):
>
>           # The prepare function
>           @classmethod
>           def __prepare__(metacls, name, bases): # No keywords in this case
>              return member_table()
>
>           # The metaclass invocation
>           def __init__(self, name, bases, classdict):
>              # Note that we replace the classdict with a regular
>              # dict before passing it to the superclass, so that we
>              # don't continue to record member names after the class
>              # has been created.
>              result = type(name, bases, dict(classdict))
>              result.member_names = classdict.member_names
>              return result
>
> Shouldn't __init__ be __new__? Also, if type(...) and not
> type.__new__(self, ...) is called, the type of a class using this
> metaclass will be type, not OrderedClass, but this may be intended.
>
> Georg
>
> --
> Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
> Four shall be the number of spaces thou shalt indent, and the number of thy
> indenting shall be four. Eight shalt thou not indent, nor either indent thou
> two, excepting that thou then proceed to four. Tabs are right out.
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Sat Jun  2 01:24:57 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 2 Jun 2007 07:24:57 +0800
Subject: [Python-3000] Handling of wide Unicode characters
In-Reply-To: <20070601160728.6EF5.JCARLSON@uci.edu>
References: <acd65fa20706011557h6bed4c8dh7a920385663e86b4@mail.gmail.com>
	<20070601160728.6EF5.JCARLSON@uci.edu>
Message-ID: <ca471dc20706011624i292aa5bag8df95d8b47c9492c@mail.gmail.com>

What he said. IOW, we're treating each half of a surrogate as a
"character", at least for purposes of counting items in a string.
(Otherwise operations like len() and indexing/slicing would no longer
be O(1).)

--Guido

On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Alexandre Vassalotti" <alexandre at peadrop.com> wrote:
> > Hi,
> >
> > I was doing some testing on the new _string_io module, since I was
> > slightly skeptical on my handling of wide Unicode characters (32-bit
> > of length, instead of the usual 16-bit in UTF-16). So, I ran this
> > little test:
> >
> >    >>> s = _string_io.StringIO()
> >    >>> s.write(u'??')
> >    >>> s.tell()
> >    2
> >
> > Like I expected, wide Unicode characters count for two. However, I was
> > surprised that Python treats them as two characters as well:
> >
> >    >>> len(u'??')
> >    2
> >    >>> u'??'
> >    u'\ud87e\udccd'
> >
> > Is it a bug, or only an implementation choice?
>
> If your Python is compiled as a UTF-16 build, then any character in the
> extended plane will be seen as two characters by Python.  If you are
> using a UCS-4 build (it's the same as UTF-32), then you should be seeing
> the single wide character as a single wide character.  The only
> exception to this rule is if you enter the wide character as a surrogate
> pair, in which case Python doesn't normalize it into the single wide
> character.  To get a real wide character, you would need to use a proper
> escape, or decode from an encoded string.
>
>
>  - Josiah
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com  Sat Jun  2 01:49:01 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Fri, 1 Jun 2007 19:49:01 -0400
Subject: [Python-3000] Handling of wide Unicode characters
In-Reply-To: <20070601160728.6EF5.JCARLSON@uci.edu>
References: <acd65fa20706011557h6bed4c8dh7a920385663e86b4@mail.gmail.com>
	<20070601160728.6EF5.JCARLSON@uci.edu>
Message-ID: <acd65fa20706011649s1a7ef615nc62a23b1eee6d0c9@mail.gmail.com>

Thanks for the explanation. Anyway, it's certainly much simpler to deal
with surrogate pairs than with variable-width characters.

On 6/1/07, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Alexandre Vassalotti" <alexandre at peadrop.com> wrote:
> > Hi,
> >
> > I was doing some testing on the new _string_io module, since I was
> > slightly skeptical on my handling of wide Unicode characters (32-bit
> > of length, instead of the usual 16-bit in UTF-16). So, I ran this
> > little test:
> >
> >    >>> s = _string_io.StringIO()
> >    >>> s.write(u'??')
> >    >>> s.tell()
> >    2
> >
> > Like I expected, wide Unicode characters count for two. However, I was
> > surprised that Python treats them as two characters as well:
> >
> >    >>> len(u'??')
> >    2
> >    >>> u'??'
> >    u'\ud87e\udccd'
> >
> > Is it a bug, or only an implementation choice?
>
> If your Python is compiled as a UTF-16 build, then any character in the
> extended plane will be seen as two characters by Python.  If you are
> using a UCS-4 build (it's the same as UTF-32), then you should be seeing
> the single wide character as a single wide character.  The only
> exception to this rule is if you enter the wide character as a surrogate
> pair, in which case Python doesn't normalize it into the single wide
> character.  To get a real wide character, you would need to use a proper
> escape, or decode from an encoded string.
>
>
>  - Josiah
>
>


-- 
Alexandre Vassalotti

From greg.ewing at canterbury.ac.nz  Sat Jun  2 02:09:29 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 02 Jun 2007 12:09:29 +1200
Subject: [Python-3000] Lines breaking
In-Reply-To: <87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <acd65fa20705280956j20409bc1qfe2f82f03ca94247@mail.gmail.com>
	<ca471dc20705281544i3be797f7ldab472dac3e1f543@mail.gmail.com>
	<acd65fa20705281649m7a7a871bw8d690456202f7b83@mail.gmail.com>
	<465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de>
	<465CD016.7050002@canterbury.ac.nz>
	<87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465E37C3.9070407@canterbury.ac.nz>
	<8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465F59E9.4030702@canterbury.ac.nz>
	<87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4660B539.4080109@canterbury.ac.nz>

Stephen J. Turnbull wrote:
> Both FF and VT *are* whitespace, AFAIK that has universal
> agreement, and in particular they *are* removed by string.strip().

You're right, strip() wasn't a good example, and I
withdraw it.

However, there's a big difference between being a
whitespace character and being a line break character.
Programs that currently deal with FF and VT chars
would have their behaviour changed, because they
would suddenly start seeing lines broken in unexpected
(to them) places, and getting lines that don't end
with \n which aren't at the end of the file.
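
For example, unicode objects already follow the Unicode rule in
splitlines(), which shows exactly the kind of surprise I mean for code
expecting \n-only breaks:

    >>> u'one\x0ctwo\nthree'.splitlines()
    [u'one', u'two', u'three']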

--
Greg

From greg.ewing at canterbury.ac.nz  Sat Jun  2 02:15:24 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 02 Jun 2007 12:15:24 +1200
Subject: [Python-3000] map, filter, reduce
In-Reply-To: <f3q40h$td7$1@sea.gmane.org>
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>
	<f3pk11$607$1@sea.gmane.org> <f3prce$9c$1@sea.gmane.org>
	<f3q40h$td7$1@sea.gmane.org>
Message-ID: <4660B69C.5010806@canterbury.ac.nz>

Terry Reedy wrote:
> It would generate the sequence of partial reductions (potentially
> indefinitely):
> list(ireduce(summer, 0, range(5))) == [0, 1, 3, 6, 10]
> 
> This is obviously *not* the same as a reduce() which only returns the final 
> value without the intermediate values.

It's sufficiently different that I think calling it
'ireduce' would just be confusing.

It's more like a 'running_reduce' or something.

--
Greg

From jcarlson at uci.edu  Sat Jun  2 02:44:19 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 01 Jun 2007 17:44:19 -0700
Subject: [Python-3000] Handling of wide Unicode characters
In-Reply-To: <acd65fa20706011649s1a7ef615nc62a23b1eee6d0c9@mail.gmail.com>
References: <20070601160728.6EF5.JCARLSON@uci.edu>
	<acd65fa20706011649s1a7ef615nc62a23b1eee6d0c9@mail.gmail.com>
Message-ID: <20070601174032.6EFB.JCARLSON@uci.edu>


"Alexandre Vassalotti" <alexandre at peadrop.com> wrote:
> Thanks for the explanation. Anyway, it's certainly much simpler to deal
> with surrogate pairs than with variable-width characters.

I don't know, I really liked my tree overlay that could handle
variable-width characters of any internal encoding (utf-7, utf-8, utf-16).
Of course it takes an extra O(n/log n) space and O(log n) time to access
arbitrary characters in the worst case, but such is the case with
time/space tradeoffs.
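
Something in that spirit can be sketched with a sampled offset table over
a UTF-8 buffer (Python 3 bytes; illustrative only, not my actual overlay).
Picking k close to log n gives the space and lookup bounds above:

    class Utf8Index(object):
        # Sketch: sample every k-th character's byte offset, then walk.

        def __init__(self, data, k=32):
            self.data, self.k = data, k
            self.offsets = []          # byte offsets of chars 0, k, 2k, ...
            pos = char = 0
            while pos < len(data):
                if char % k == 0:
                    self.offsets.append(pos)
                pos += self._width(data[pos])
                char += 1

        @staticmethod
        def _width(lead):
            # length of a UTF-8 sequence, from its lead byte
            if lead < 0x80: return 1
            if lead < 0xE0: return 2
            if lead < 0xF0: return 3
            return 4

        def byte_offset(self, i):
            # O(k) scan from the nearest sampled offset
            pos = self.offsets[i // self.k]
            for _ in range(i % self.k):
                pos += self._width(self.data[pos])
            return pos

    # Utf8Index(u'naïve text'.encode('utf-8')).byte_offset(3) -> 4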

 - Josiah

> On 6/1/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Alexandre Vassalotti" <alexandre at peadrop.com> wrote:
> > > Hi,
> > >
> > > I was doing some testing on the new _string_io module, since I was
> > > slightly skeptical on my handling of wide Unicode characters (32-bit
> > > of length, instead of the usual 16-bit in UTF-16). So, I ran this
> > > little test:
> > >
> > >    >>> s = _string_io.StringIO()
> > >    >>> s.write(u'????')
> > >    >>> s.tell()
> > >    2
> > >
> > > Like I expected, wide Unicode characters count for two. However, I was
> > > surprised that Python treats them as two characters as well:
> > >
> > >    >>> len(u'????')
> > >    2
> > >    >>> u'????'
> > >    u'\ud87e\udccd'
> > >
> > > Is it a bug, or only an implementation choice?
> >
> > If your Python is compiled as a UTF-16 build, then any character in the
> > extended plane will be seen as two characters by Python.  If you are
> > using a UCS-4 build (it's the same as UTF-32), then you should be seeing
> > the single wide character as a single wide character.  The only
> > exception to this rule is if you enter the wide character as a surrogate
> > pair, in which case Python doesn't normalize it into the single wide
> > character.  To get a real wide character, you would need to use a proper
> > escape, or decode from an encoded string.
> >
> >
> >  - Josiah
> >
> >
> 
> 
> -- 
> Alexandre Vassalotti


From stephen at xemacs.org  Sat Jun  2 06:03:11 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat,  2 Jun 2007 13:03:11 +0900 (JST)
Subject: [Python-3000] Lines breaking
In-Reply-To: <07Jun1.111434pdt."57996"@synergy1.parc.xerox.com>
References: <acd65fa20705280956j20409bc1qfe2f82f03ca94247@mail.gmail.com>
	<ca471dc20705281544i3be797f7ldab472dac3e1f543@mail.gmail.com>
	<acd65fa20705281649m7a7a871bw8d690456202f7b83@mail.gmail.com>
	<465B814D.2060101@canterbury.ac.nz>
	<465BB994.9050309@v.loewis.de>
	<465CD016.7050002@canterbury.ac.nz>
	<87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465E37C3.9070407@canterbury.ac.nz>
	<8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp>
	<465F59E9.4030702@canterbury.ac.nz>
	<87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp>
	<07Jun1.111434pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <87tztrypdt.fsf@uwakimon.sk.tsukuba.ac.jp>

Bill Janssen writes:

 > > The *only* thing that adoption of the Unicode recommendation for line
 > > breaking changes is that "\x0c\n" is now two empty lines with well-
 > > defined semantics instead of some number of lines with you-won't-know-
 > > until-you-ask-the-implementation semantics.
 > 
 > Well, that's just the way text is.

If it were *text*, it wouldn't matter, you say yourself.  People would
be able to live with an empty first line.  The issue arises because
you've defined a *formal data format* embedded in text which conflicts
with long-established standards for text.  Now we have an attempt to
define a universal standard for text, which conflicts with your
practice.

 > > You just object to adopting a standard, period, because it might force
 > > you to change your practices.  That's reasonable, changing working
 > > software is expensive.  But interoperability is an important goal too.
 > 
 > Where, specifically, are the breakdowns in interoperability
 > manifesting themselves?

That's not the point; this is like the logical operations on decimal
thing.  Adopting a standard in full is reassuring to potential users;
the ones who aren't reassured won't complain, they just go away.

 > I'm sort of amazed at the turn of this argument.  Greg is arguing that
 > it might be arbitrarily expensive to make this change,

Which I've acknowledged.  But we have no data at all.  We're talking
about Python 3000, and we know that many programs will require porting
effort anyway.  How expensive is it?  "Arbitrarily" is FUD.

 > But Stephen is arguing that we need to do it anyway to conform to
 > the dictates of some post-facto standards committee (yes, I know, I
 > usually *like* that argument :-).

And you should.

We only *need* to do it if we want to claim Unicode conformance in
this area.  I think that is desirable; readline functionality is very
basic to a text-processing language.

 > How about a subtype of File which supports this behavior?

We're talking about Python 3000, right?  If we're going to claim
conformance, it should be default.  If it's not going to be default,
there's no need to talk about it until somebody writes the module and
submits it for inclusion in the stdlib.


From rauli.ruohonen at gmail.com  Sat Jun  2 06:14:21 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sat, 2 Jun 2007 07:14:21 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>
	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>

On 5/27/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>James Y Knight writes:
>> a 'pyidchar.txt' file with a list of character ranges, and now that
>> pyidchar.txt file is going to have separate sections based on module
>> name? Sorry, but are you !@# kidding me?!?
>
>The scalability issue was raised by Guido, not the ASCII advocates.

He did not say that such files or command-line options would be
scalable either. They are fine tools for auditing, but not for day-to-day
use of finished products. One should provide both auditing tools and ease
of use for already-audited code.

One possibility for providing both:

(1) Add a mandatory ASCII-only special comment at the beginning of
    each module. The comment would continue until the first empty
    line and would contain only valid directives matching some
    regular expression. Only whitespace is allowed before the
    comment. Anything else is a syntax error.
(2) Allow directives in the special comment to change encoding and
    tab/space rules. Also allow them to restrict the identifier
    character set and the string character set.
(3) Defaults: utf-8 encoding, no mixed tabs and spaces, identifier
    and string content is not restricted (beyond the restrictions
    in PEP 3131 etc., which the user can't lift, of course). One could
    change these in site.py, but the directives in (2) override
    the defaults, so they can't be used for B&D.
(4) Have a command line parameter for restricting the character sets
    of all modules. Every module must satisfy both this and its own
    directives simultaneously. A default value for this could be set
    in site.py, but it must be immutable after first assignment.

This way everything "just works" for quick hacks and for naive users
who only run code they trust. For real projects it's easy to add a couple
of lines in modules to enforce project policy. When you see code
that doesn't specify a character set you trust, then you know you
may have to be careful.

If you don't want to be careful, then you can set the command line
parameter default to e.g. ascii in site.py and nothing using
non-ascii identifiers will work for you. If you're fine with
explicit charset tags but not implicit ones, then you can set the
defaults for tagless modules to ascii in site.py.

Example 1 (the defaults, implicit):

#!/usr/bin/env python

# Real code starts here. This comment is not special and you
# can even usé whätévèr chäràctérs yôü wånt tõ hérê.

Example 2 (the defaults, explicit):

#!/usr/bin/env python
#
# coding: utf-8
# identifier_charset: 0-1fffff
# string_charset: 0-1fffff
# indentation: unmixed

# Real code.

Example 3 (strawman for some Japanese code):

# identifier_charset:
#     0-7f 3b1-3c9 "Hiragana" "Katakana" "CJK Unified Ideographs"
#     "CJK Unified Ideographs Extension A"
#     "CJK Unified Ideographs Extension B"

# The range 3b1-3c9 is lowercase Greek, which is often used in math.

Example 4 (inclusion from a file, similar to import):

# identifier_charset: fooproject.codingstyle.identifier_charset

From gproux+py3000 at gmail.com  Sat Jun  2 07:15:31 2007
From: gproux+py3000 at gmail.com (Guillaume Proux)
Date: Sat, 2 Jun 2007 14:15:31 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>
	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
Message-ID: <19dd68ba0706012215y6af9044bw98a9c7a3119795a6@mail.gmail.com>

On 6/2/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> (1) Add a mandatory ASCII-only special comment at the beginning of
>     each module. The comment would continue until the first empty
>     line and would contain only valid directives matching some
>     regular expression. Only whitespace is allowed before the
>     comment. Anything else is a syntax error.

Interesting proposal. I really like it indeed. I wonder how people
against "magic" will like it, although there is a precedent with the
first line giving the path to the interpreter.  It sounds like quite a
fair proposal and would solve some of the issues raised here before.

But I wonder if you see the security issue with some person sending
you a diff file that would (among other changes...) do something like
this:
"
1a2
>
"

(inserted a blank line)

Then you would be in trouble again...

An alternative would be to require any comment before any code that
purports to set identifier encoding restrictions etc. to be preceded
by a specific character (just like the Unix convention with " ! ").
Any comment line starting with this character would be restricted to
ASCII only.  "%" sounds like a good character for this purpose.

Like for example...

#!/usr/bin/env python
#
# %coding: utf-8
# %identifier_charset: 0-1fffff
# %string_charset: 0-1fffff
# %indentation: unmixed

Regards,

Guillaume

From jcarlson at uci.edu  Sat Jun  2 09:14:58 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 02 Jun 2007 00:14:58 -0700
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
Message-ID: <20070601235750.6F01.JCARLSON@uci.edu>


"Rauli Ruohonen" <rauli.ruohonen at gmail.com> wrote:
> On 5/27/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> >James Y Knight writes:
> >> a 'pyidchar.txt' file with a list of character ranges, and now that
> >> pyidchar.txt file is going to have separate sections based on module
> >> name? Sorry, but are you !@# kidding me?!?
> >
> >The scalability issue was raised by Guido, not the ASCII advocates.
> 
> He did not say that such files or command-line options would be
> scalable either. They are fine tools for auditing, but not for using
> finished products. One should provide both auditing tools and ease
> of use of already audited code.
> 
> One possibility for providing both:
> 
> (1) Add a mandatory ASCII-only special comment at the beginning of
>     each module. The comment would continue until the first empty
>     line and would contain only valid directives matching some
>     regular expression. Only whitespace is allowed before the
>     comment. Anything else is a syntax error.

"""
If a comment in the first or second line of the Python script matches
the regular expression coding[=:]\s*([-\w.]+), this comment is processed
as an encoding declaration; the first group of this expression names the
encoding of the source code file.
"""

Your suggestion would unnecessarily change the semantics of the encoding
declarations.  I would call this gratuitous breakage.
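
(For reference, the declaration PEP 263 describes is the familiar

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

form, which must appear on line one or two.)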

> (2) Allow directives in the special comment to change encoding and
>     tab/space rules. Also allow them to restrict the identifier
>     character set and the string character set.

Sounds like the application of vim settings as a solution to a whole
bunch of completely unrelated "problems" in Python (especially with 4
space indents being the "one true way to indent" and the encoding
declaration already being established).  Please keep your vim out of my
Python ;) .


> (3) Defaults: utf-8 encoding, no mixed tabs and spaces, identifier
>     and string content is not restricted.

Everything except the identifier restriction is already going to be the
default in Python 3.0.  I've never heard a particularly good reason to
allow mixing tabs and spaces, and the current encoding declaration works
just fine (except for the whole unicode character thing).

And as stated by basically everyone, the only *sane* default is ascii
identifiers.  Since the vast majority of users will have no use for
unicode identifiers in the short or long term, making them the default
is overzealous at best.


> (4) Have a command line parameter for restricting the character sets
>     of all modules. Every module must satisfy both this and its own
>     directives simultaneously. A default value for this could be set
>     in site.py, but it must be immutable after first assignment.
> Example 3 (inclusion from a file, similar to import):
> 
> # identifier_charset: fooproject.codingstyle.identifier_charset

I really don't like the idea of adding a *different* import-like thing. 
We already have imports (which are evaluated at run time, not compile
time), and due to their semantics they can't use a mechanism like the above.


Obviously I'm overall -1.  I don't see this as a good solution to the
character set problem, and I think it's a step back regarding encodings,
indentation, etc.

 - Josiah


From g.brandl at gmx.net  Sat Jun  2 09:24:02 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 02 Jun 2007 09:24:02 +0200
Subject: [Python-3000] map, filter, reduce
In-Reply-To: <4660B69C.5010806@canterbury.ac.nz>
References: <bb8868b90706010905p2dae12b7qc538cf25190c7127@mail.gmail.com>	<f3pk11$607$1@sea.gmane.org>
	<f3prce$9c$1@sea.gmane.org>	<f3q40h$td7$1@sea.gmane.org>
	<4660B69C.5010806@canterbury.ac.nz>
Message-ID: <f3r5ua$6um$1@sea.gmane.org>

Greg Ewing schrieb:
> Terry Reedy wrote:
>> It would generate the sequence of partial reductions (potentially 
>> indefinitely).
>> list(ireduce(summer, 0, range(5))) == [0, 1, 3, 6, 10]
>> 
>> This is obviously *not* the same as a reduce() which only returns the final 
>> value without the intermediate values.
> 
> It's sufficiently different that I think calling it
> 'ireduce' would just be confusing.
> 
> It's more like a 'running_reduce' or something.

ISTM that this application is even more suited for a plain old `for` loop.
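
Indeed. For concreteness, a minimal sketch of such a running reduce as a
generator (with `summer` assumed to be plain addition):

import operator

def running_reduce(func, initial, iterable):
    # Yield the accumulated value after each element.
    acc = initial
    for item in iterable:
        acc = func(acc, item)
        yield acc

summer = operator.add
assert list(running_reduce(summer, 0, range(5))) == [0, 1, 3, 6, 10]

It really is just a `for` loop with a yield in it.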

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From talin at acm.org  Sat Jun  2 09:38:40 2007
From: talin at acm.org (Talin)
Date: Sat, 02 Jun 2007 00:38:40 -0700
Subject: [Python-3000] Updating PEP 3101
In-Reply-To: <465E6D13.2030606@acm.org>
References: <465E6D13.2030606@acm.org>
Message-ID: <46611E80.5010002@acm.org>

Some more thoughts on this, and some questions.

PEP 3101 defines two layers of APIs for string formatting: a low-level 
formatting engine, and a high-level set of convenience methods 
(primarily str.format).

Both layers have grown complex due to the desire to satisfy feature 
requests from various folks. What I would like to do is move the design 
back to a more OOWTDI style.

The way I propose to do this is to redesign the low-level engine as a 
class, called Formatter, with overridable methods.

To support the high-level API, there will be a single, built-in global 
singleton instance of Formatter. Calls to str.format will simply be 
routed to this singleton instance.

So for example, when you call:

    "The value is {0}".format(1)

This will call:

     builtin_formatter.format("The value is {0}", 1)

I'm not sure that it makes any sense to allow the built-in formatter 
instance to be replaceable or mutable, since that would cause all string 
formatting behavior to change. Also, there's no way to negotiate 
conflicts between various library modules that might want different 
behavior. Fortunately, the base formatter has no state, so all we have 
to worry about is preventing it from being replaced.

Rather, I think it makes more sense to allow people to create their own 
Formatter instances and use them directly. This does mean, however, that 
people who want to use their own custom Formatter instance won't be able 
to use the high-level convenience methods.

The Formatter class has at least three overridable methods:

    1) The method that parses a format string into constant characters 
and replacement fields.
    2) A method that retrieves a field value given a field name or index.
    3) A method that formats an individual replacement field, given a 
value and a conversion specifier string.

So:

    -- If you want a different syntax for format strings, you override 
method #1. This satisfies the feature requests of people who wanted 
variations in the format string syntax.

    -- If you want to be able to change the way that field values are 
accessed, you override #2. This satisfies the desire of people who want 
to have it automatically access locals() or globals(). You can do this 
by passing those namespaces in as a constructor parameter, or if you 
want to get fancy, you can look at the stack frames and figure it out 
automatically. The main point is that this functionality won't be built 
in by default, but it could be a cookbook recipe.

    Another reason to override this method is to change the rules for 
tracking what field names are legal. The built-in method does not allow 
fields beginning with an underscore to be used as attributes, i.e. you 
cannot say "{0._index}" as a format string. If you override the field 
value method, however, you can change this behavior.

    Similarly, if you want to add/remove functionality to ensure that 
all positional arguments are used, or change the way errors are handled, 
you can do that here as well.

    -- If you want to change the way that built-in types are converted 
to string form, you override #3. (For non-builtin types you can just add 
a __format__ special method to the type.)
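
As a rough sketch of the shape such a class might take (the names and
details here are illustrative, not settled API):

import re

class Formatter(object):
    FIELD_RE = re.compile(r'\{([^{}]*)\}')

    def parse(self, format_string):
        # 1) Split into (literal text, field name, conversion spec).
        pos = 0
        for m in self.FIELD_RE.finditer(format_string):
            name, _, spec = m.group(1).partition(':')
            yield format_string[pos:m.start()], name, spec
            pos = m.end()
        yield format_string[pos:], None, None

    def get_value(self, name, args):
        # 2) Resolve a field name; only positional indices, for brevity.
        return args[int(name)]

    def format_field(self, value, spec):
        # 3) Render one replacement field (spec handling omitted).
        return str(value)

    def format(self, format_string, *args):
        parts = []
        for literal, name, spec in self.parse(format_string):
            parts.append(literal)
            if name is not None:
                parts.append(self.format_field(
                    self.get_value(name, args), spec))
        return ''.join(parts)

# Formatter().format("The value is {0}", 1)  ->  'The value is 1'

Overriding get_value to consult locals()/globals(), or format_field to
change how built-in types are rendered, then corresponds to points #2
and #3 above.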


The main point is, however, that none of these overrides affect the 
behavior of the built-in string.format function.

Now, in the current version of the PEP, all of the things that I just 
mentioned can be changed on a per-call basis by passing in 
specially-named parameters, i.e.:

    "The name is {0._index}".format(1, flags=ALLOW_LEADING_UNDERSCORES)

I'm proposing to eliminate all of that extra flexibility, and instead 
say that if you want to be able to do that, use a custom formatter 
class, but without the syntactical convenience of str.format.

So my first question is to get a sense of how many people would find 
that agreeable. In other words, is it reasonable to require people to 
give up the syntactical convenience of "string".format() when they want 
to do custom formatting?

My second question deals with implementation. Because 'str' is a 
built-in type, all of its methods must be built-in as well, and 
therefore implemented in C. If 'str' depends on a built-in formatter 
singleton instance, that singleton instance must also be implemented in 
C, and must be initialized in the Parser before any calls to str.format.

Since I am not an expert in the internals of the Python interpreter C 
code, I would ask: how feasible is this?

-- Talin

From eric+python-dev at trueblade.com  Sat Jun  2 13:46:06 2007
From: eric+python-dev at trueblade.com (Eric V. Smith)
Date: Sat, 02 Jun 2007 07:46:06 -0400
Subject: [Python-3000] Updating PEP 3101
In-Reply-To: <46611E80.5010002@acm.org>
References: <465E6D13.2030606@acm.org> <46611E80.5010002@acm.org>
Message-ID: <4661587E.3090807@trueblade.com>

Talin wrote:
> Some more thoughts on this, and some questions.
> 
> PEP 3101 defines two layers of APIs for string formatting: a low-level 
> formatting engine, and a high-level set of convenience methods 
> (primarily str.format).
> 
> Both layers have grown complex due to the desire to satisfy feature 
> requests from various folks. What I would like to do is move the design 
> back to a more OOWTDI style.
> 
> The way I propose to do this is to redesign the low-level engine as a 
> class, called Formatter, with overridable methods.

I think this is a good idea, in order to keep "str".format() really
simple, and thereby increase its usage.

> To support the high-level API, there will be a single, built-in global 
> singleton instance of Formatter. Calls to str.format will simply be 
> routed to this singleton instance.

I'm not so sure this is actually required, see below.

> So my first question is to get a sense of how many people would find 
> that agreeable. In other words, is it reasonable to require people to 
> give up the syntactical convenience of "string".format() when they want 
> to do custom formatting?

I like keeping "str".format() simple, because I see its main use as a
slightly more flexible version of '%' for strings.  I hope its usage
will be ubiquitous.  It's especially handy for i18n.

> My second question deals with implementation. Because 'str' is a 
> built-in type, all of its methods must be built-in as well, and 
> therefore implemented in C. If 'str' depends on a built-in formatter 
> singleton instance, that singleton instance must also be implemented in 
> C, and must be initialized in the Parser before any calls to str.format.

I don't think there actually needs to be a built-in singleton formatter,
but it just needs to appear "as-if" there is one.  Not having a
singleton makes hiding the singleton a non-issue.  It also simplifies
the C code.  str.format could be implemented in C, using the existing
code in the sandbox implementation.  The Formatter class could be
written in C or Python, and would call some of the existing code in the
sandbox implementation, or refactored versions if we need to expose
anything else (which I don't think we do).

The only real work we'd need to do to the sandbox code is to strip out
some of the code that implements the additional options (such as
multiple syntaxes) and hook up str.format.  Once we reach a consensus,
I'm ready to put some time into this.  Then we'd have to implement
Formatter, of course.  But it shouldn't be too hard.

One comment I'd like to make on your prior email is that I'd like to see
this implemented in 2.6.  To my knowledge, we're not removing any
functionality in 3.0 that will be replaced by str.format, so I can't
argue that it will make it easier to have code that runs in both 2.6
and 3.0.  But it seems to me that the fewer new features that exist only
in 3.0, the easier it will be to wrap your head around 3.0.

Eric.


From talin at acm.org  Sat Jun  2 17:19:08 2007
From: talin at acm.org (Talin)
Date: Sat, 02 Jun 2007 08:19:08 -0700
Subject: [Python-3000] Updating PEP 3101
In-Reply-To: <4661583B.5020708@trueblade.com>
References: <465E6D13.2030606@acm.org> <46611E80.5010002@acm.org>
	<4661583B.5020708@trueblade.com>
Message-ID: <46618A6C.60704@acm.org>

Eric V. Smith wrote:

> One comment I'd like to make on your prior email is that I'd like to see 
> this implemented in 2.6.  To my knowledge, we're not removing any 
> functionality in 3.0 that will be replaced by str.format, so I can't 
> argue that it will make it easier to have code that runs in both 2.6 
> and 3.0.  But it seems to me that the fewer new features that exist only 
> in 3.0, the easier it will be to wrap your head around 3.0.

I think that supporting it in 2.6 is fine. Now, the PEP in your sandbox 
also talks about making an external module for versions earlier than 
2.6. My feeling on that is if someone wants to do that, fine, but it 
doesn't need to be part of the PEP.

-- Talin

From rauli.ruohonen at gmail.com  Sat Jun  2 18:19:14 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sat, 2 Jun 2007 19:19:14 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070601235750.6F01.JCARLSON@uci.edu>
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<20070601235750.6F01.JCARLSON@uci.edu>
Message-ID: <f52584c00706020919n4c733af7w1d0f66d157f6bc46@mail.gmail.com>

On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> """
> If a comment in the first or second line of the Python script matches
> the regular expression coding[=:]\s*([-\w.]+), this comment is processed
> as an encoding declaration; the first group of this expression names the
> encoding of the source code file.
> """
>
> Your suggestion would unnecessarily change the semantics of the encoding
> declarations.  I would call this gratuitous breakage.

Depending on what the regular expression for the declarations is, the
difference may
not be big. Current code can also reliably be converted with an automated tool,
so this isn't a big deal for py3k.

It may be that the change is unnecessary. Reading Guido's writings, he seems
to be of the opinion that the Java way (no restrictions at all) is
right here, and
anything else can be delegated to pylint and similar tools.

> Sounds like the application of vim settings as a solution to a whole
> bunch of completely unrelated "problems" in Python (especially with 4
> space indents being the "one true way to indent" and the encoding
> declaration already being established).  Please keep your vim out of my
> Python ;) .

The encoding declaration stays mostly the same, I'm just suggesting adding
similar declarations for the identifier/string character sets and making them
deception-proof. You're probably right about the indentation stuff. If
you got rid
of all indentation-related options and simply forbade mixture of tabs and
spaces, I'd just say good riddance.

> And as stated by basically everyone, the only *sane* default is ascii
> identifiers.  Since the vast majority of users will have no use for
> unicode identifiers in the short or long term, making them the default
> is overzealous at best.

"Basically everyone" is not true, because it does not include Guido, who
matters the most. Some quotes from his latest posts on the topic:

Guido van Rossum (May 25):
:I still think such a command-line switch (or switches) is the wrong
:approach. What if I have *one* module that uses Cyrillic legitimately.
:A command-line switch would enable Cyrillic in *all* modules.

Guido van Rossum (May 25):
:On 5/24/07, Josiah Carlson <jcarlson at uci.edu> wrote:
:> Where else in Python have we made the default
:> behavior only desired or useful to 5% of our users?
:
:Where are you getting that statistic? This seems an extremely
:backwards, US-centric worldview.

Guido van Rossum (May 25):
:A more useful approach would seem to be a set of auditing tools that
:can be applied routinely to all new contributions (e.g. as a
:pre-commit hook when using a source control system), or to all code in
:a given directory, download, etc. I don't see this as all that
:different from using e.g. PyChecker or PyLint.
:
:While I routinely perform visual code inspections [...], I certainly don't see
:this as a security audit [...]. Scanning for stray non-ASCII characters is best
:left to automated tools.

Guido van Rossum (May 23):
:In particular very helpful was a couple of reports from the Java
:world, where Unicode letters in identifiers have been legal for a long
:time now. (JavaScript also supports this BTW.) The Java world has not
:fallen apart,

Guido van Rossum (May 17):
:As I mentioned before, I don't expect either of these will be much of
:a concern. I guess tools like pylint could optionally warn if
:non-ascii characters are used.
:
:On 5/16/07, Jim Jewett <jimjjewett at gmail.com> wrote:
:> (1)  Security concerns.
:> (2)  Obscure bugs.

Summary of what I think Guido's saying (involves some interpretation):
 - always having no restrictions (the Java way) is not a problem in practice
 - because having no restrictions has worked well with Java, Python
should follow
 - any concerns can be dealt with adequately by external tools alone
 - command line switches are a bad implementation of restriction management

It is the last one of these that I was addressing, as there was some demand
for restriction management (despite Guido's leave-it-to-pylint stance) but no
adequate proposal. The defaults are easily changed in any case.

> > # identifier_charset: fooproject.codingstyle.identifier_charset
>
> I really don't like the idea of adding a *different* import-like thing.
> We already have imports (that are evaluated at run time, not compile
> time), and due to their semantics, can't use a mechanism like the above.

I agree that import is problematic. This part could be omitted with the
rationale that it's more trouble than it's worth, and anyone who needs something
complicated can use pylint or similar. In the end, something like this
is what you'd
have most of the time in practice when you care about character sets:

# identifier_charset: 0-7f

# Real code.

When you have a file with Cyrillic, then it'd allow Cyrillic too. For
quick hacks
you could use this and everything would just work:

#!/usr/bin/env python

# Real code.

This isn't really anything more than a countermeasure against Ka-Ping's
tricky.py exploit, plus the addition of a real charset restriction method
instead of abusing the coding declaration for that purpose (which would
force you to use legacy codings just to restrict the charsets, as pointed
out a lot earlier here).

One more thing which might be removed from the suggestion is the command
line option and its associated site.py default. Such checking is more
appropriate for pylint, and is probably of little use anyway. Either you
trust the files you're importing, in which case the characters they use
do not make any difference, or you don't, in which case you shouldn't be
importing them at all and checking their character sets will not help you
at all. For audit purposes the comment directives are enough, as they
can't deceive, and if you want to be extra paranoid you can use pylint to
catch any surreptitious patches like the one in Guillaume's post.

From jcarlson at uci.edu  Sat Jun  2 19:48:49 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 02 Jun 2007 10:48:49 -0700
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f52584c00706020919n4c733af7w1d0f66d157f6bc46@mail.gmail.com>
References: <20070601235750.6F01.JCARLSON@uci.edu>
	<f52584c00706020919n4c733af7w1d0f66d157f6bc46@mail.gmail.com>
Message-ID: <20070602095920.6F04.JCARLSON@uci.edu>


"Rauli Ruohonen" <rauli.ruohonen at gmail.com> wrote:
> On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > """
> > If a comment in the first or second line of the Python script matches
> > the regular expression coding[=:]\s*([-\w.]+), this comment is processed
> > as an encoding declaration; the first group of this expression names the
> > encoding of the source code file.
> > """
> >
> > Your suggestion would unnecessarily change the semantics of the encoding
> > declarations.  I would call this gratuitous breakage.
> 
> Depending on what the regular expression for the declarations is, the
> difference may
> not be big. Current code can also reliably be converted with an automated tool,
> so this isn't a big deal for py3k.

Whether or not there exists a tool to convert from Python 2.6 to Python
3.0 (2to3), every tool that currently handles Python source code
encodings via the method specified in the documentation (just about
every Python-centric editor I know) would need to be changed.  Further,
not all code will be passed through the 2.6 to 3.0 converter, as the
tool is meant as a sort of "I don't want to go through all the trouble
of converting yet, but I want to support Python 3.0".  And even if it
*were* all passed through, the output of the converter is not meant for
future editing and consumption; it is meant as a stopgap.  People who
really want to support Python 3.0 should be doing the conversion by hand,
possibly with guidance from the converter.


> It may be that the change is unnecessary. Reading Guido's writings, he seems
> to be of the opinion that the Java way (no restrictions at all) is
> right here, and
> anything else can be delegated to pylint and similar tools.

Perhaps, but there is a growing contingent here that is of the opposite
opinion.  And even though this contingent is of differing opinions on
whether unicode identifiers should even be allowed, we all agree that if
they are allowed, they shouldn't be the default.


> > Sounds like the application of vim settings as a solution to a whole
> > bunch of completely unrelated "problems" in Python (especially with 4
> > space indents being the "one true way to indent" and the encoding
> > declaration already being established).  Please keep your vim out of my
> > Python ;) .
> 
> The encoding declaration stays mostly the same, I'm just suggesting adding
> similar declarations for the identifier/string character sets and making them
> deception-proof. You're probably right about the indentation stuff. If
> you got rid
> of all indentation-related options and simply forbade mixture of tabs and
> spaces, I'd just say good riddance.

Python 2.x has a -t option that warns people about inconsistent
tab/space usage.  In 3.0, from what I understand, that option is
automatically enabled and may result in errors instead of warnings.


> > And as stated by basically everyone, the only *sane* default is ascii
> > identifiers.  Since the vast majority of users will have no use for
> > unicode identifiers in the short or long term, making them the default
> > is overzealous at best.
> 
> "Basically everyone" is not true, because it does not include Guido, who
> matters the most. Some quotes from his latest posts on the topic:

Guido doesn't always overrule everyone.  There is quite a long history
of him changing his mind after having seen good reasoning about an issue. 
Most recently, see the dynamic attribute access thread about the o.{a}
syntax.

And when I say "basically everyone", I'm offering everyone who has
voiced their opinion recently the opportunity to be in that camp.
Please see the writings of Baptiste Carvello, Jim Jewett, Ka-Ping Yee,
Steve Howell, Ivan Krstic, and myself.

If you want to completely ignore the general consensus that was reached
by people on both sides of the issue, that's fine.  But pardon me if I
ignore you from here on out.


> Guido van Rossum (May 25):
> :I still think such a command-line switch (or switches) is the wrong
> :approach. What if I have *one* module that uses Cyrillic legitimately.
> :A command-line switch would enable Cyrillic in *all* modules.

I'm not personally a really big fan of the command-line argument
approach, but that doesn't mean that the only two solutions are
in-module with your syntax and command-line.  There are other solutions
(global registry of individual module allowed identifiers, in-module
with a different syntax, etc.). I'm just saying that I don't like *your*
solution.


> Guido van Rossum (May 25):
> :On 5/24/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> :> Where else in Python have we made the default
> :> behavior only desired or useful to 5% of our users?
> :
> :Where are you getting that statistic? This seems an extremely
> :backwards, US-centric worldview.

You will note that I actually responded to this, as have others.  The
use of unicode identifiers will be rare, and your pressure to try to
make them the default won't change that; but it will confuse the hell
out of the large numbers of users who have no use for unicode, and whose
tools are not prepared for unicode.


> Guido van Rossum (May 25):
> :A more useful approach would seem to be a set of auditing tools that
> :can be applied routinely to all new contributions (e.g. as a
> :pre-commit hook when using a source control system), or to all code in
> :a given directory, download, etc. I don't see this as all that
> :different from using e.g. PyChecker or PyLint.
> :
> :While I routinely perform visual code inspections [...], I certainly don't see
> :this as a security audit [...]. Scanning for stray non-ASCII characters is best
> :left to automated tools.

Others have also responded to this.  Adding a tool to an arbitrarily
large or small previously existing toolchain, so that the majority of
users can verify that their code doesn't contain characters that
shouldn't be allowed in the first place, isn't a very good solution.


> Guido van Rossum (May 23):
> :In particular very helpful was a couple of reports from the Java
> :world, where Unicode letters in identifiers have been legal for a long
> :time now. (JavaScript also supports this BTW.) The Java world has not
> :fallen apart,

And we reported about this.  They are rarely used, and the vast
majority of code that *does* have unicode identifiers is closed-source.
As someone else has discussed, do we want to encourage open source
(for which the only sane identifiers are ascii), or do we want to
encourage closed source and the 'ghettoization' of Python source code?


> Guido van Rossum (May 17):
> :As I mentioned before, I don't expect either of these will be much of
> :a concern. I guess tools like pylint could optionally warn if
> :non-ascii characters are used.
> :
> :On 5/16/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> :> (1)  Security concerns.
> :> (2)  Obscure bugs.
> 
> Summary of what I think Guido's saying (involves some interpretation):
>  - always having no restrictions (the Java way) is not a problem in practice
>  - because having no restrictions has worked well with Java, Python
> should follow

Only because it is so rarely used that no one really runs into unicode
identifiers.  As such, the only sane position is to require the explicit
enabling of unicode identifiers.  Also please see Nick Coghlan's
discussion about *why* this isn't as much an issue with statically typed
declarative languages as it is with Python.


>  - any concerns can be dealt with adequately by external tools alone

And having to rely on *additional* tools to verify that what the vast
majority of users want is actually happening is silly.  I'll ask again,
because you don't seem to have been paying attention to the messages you
cited, but where else in Python has the tiny minority defined the
defaults for the vast majority of users?


>  - command line switches are a bad implementation of restriction management

That's the only argument that is worth listening to.  But command line
switches aren't our only option here.

[snip]
> This isn't really anything more than a countermeasure against Ka-Ping's
> tricky.py exploit, plus the addition of a real charset restriction method
> instead of abusing the coding declaration for that purpose (which would
> force you to use legacy codings just to restrict the charsets, as pointed
> out a lot earlier here).

Thankfully, no one who has bothered to think for more than a few minutes
about this issue has seriously considered using legacy encodings.  So
it's a non-issue.


> One more thing which might be removed from the suggestion is the command
> line option and its associated site.py default. Such checking is more
> appropriate for pylint, and is probably of little use anyway. Either you
> trust the files you're importing, in which case the characters they use
> do not make any difference, or you don't, in which case you shouldn't be
> importing them at all and checking their character sets will not help you
> at all. For audit purposes the comment directives are enough, as they
> can't deceive, and if you want to be extra paranoid you can use pylint to
> catch any surreptitious patches like the one in Guillaume's post.

Adding Pylint to verify that I don't have characters that shouldn't be
allowed in the first place, when Python should tell me *the moment*
modules are being compiled, is silly.  Now, you have had the opportunity
to go through the hundreds of posts on the matter and compose a message,
yet you still don't understand that ascii is the only sane default. 
Please read posts in the 3131 thread from the authors I list above, and
please try to inform yourself on the content of postings from people
that are not Guido.


 - Josiah


From rauli.ruohonen at gmail.com  Sat Jun  2 22:39:53 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sat, 2 Jun 2007 23:39:53 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070602095920.6F04.JCARLSON@uci.edu>
References: <20070601235750.6F01.JCARLSON@uci.edu>
	<f52584c00706020919n4c733af7w1d0f66d157f6bc46@mail.gmail.com>
	<20070602095920.6F04.JCARLSON@uci.edu>
Message-ID: <f52584c00706021339p2b64f4a5qf7080a5ebf44df73@mail.gmail.com>

On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> Whether or not there exists a tool to convert from Python 2.6 to
> Python 3.0 (2to3), every tool that currently handles Python source
> code encodings via the method specified in the documentation
> (just about every Python-centric editor I know) would need to be
> changed.

How so? The old regexp can still match the encoding tag unless
the user insists on using it in an incompatible way. As syntax
changes go, this one causes little trouble for editors.

> Guido doesn't always overrule everyone.

Yet he makes the decisions. That's why I used his latest comments
on the topic to set the defaults in the suggestion. These are
easily changed when necessary, and the whole issue of
defaults is quite minor. What matters more is having a convenient
way of setting the character set restrictions of a module. The
reason I quoted him at such length was that I thought that you
might have missed some of his posts because you simply ignored
what he had to say (and no, I generally don't remember people's
names).

> There are other solutions (global registry of individual module
> allowed identifiers, in-module with a different syntax, etc.).

These are more to the point. Do you have anything concrete?
A global registry sounds unwieldy and most would probably
enable everything instead of going through the trouble of using it.
What kind of in-module syntax would you use?

> Adding a tool to an arbitrarily large or small previously existing
> toolchain, so that the majority of users can verify that their code
> doesn't contain characters that shouldn't be allowed in the first
> place, isn't a very good solution.

I doubt the majority of users care, so the verifiers would be
a minority. You're exaggerating the amount of work caused
by Guido's solution. I made my suggestion because in my opinion
it or something like it is a more convenient solution for most cases,
but Guido's isn't as bad as you make it out to be.

> Only because it is so rarely used that no one really runs into
> unicode identifiers.

It doesn't really matter why they're not a problem in practice,
just that they aren't. A non-issue is a non-issue, no matter why.

> As such, the only sane position is to require
> the explicit enabling of unicode identifiers.

Neither default would cause big problems, so there are
at least two sane positions. One may be better than the other
or they may be equally good, it's hard to say which.

> where else in Python has the tiny minority defined the defaults for
> the vast majority of users?

I'm sure you will find tinier minorities if you search for them, but
most users don't use extended slice notation to its full extent, yet
it's enabled by default even though it silently accepts a probable
typo. Confusing non-ascii characters are also accepted by
default in strings, even though only a tiny minority uses those
particular characters in strings (I'm sure you've seen the examples).

> yet you still don't understand that ascii is the only sane default.

It is not the default in Java, which is a major language, and I don't
hear constant complaints about it having to be changed, so there
are quite many people who think that the above statement is not
true for programming languages in general. The claim that
static typing makes a big enough difference here is less than
convincing.

From martin at v.loewis.de  Sun Jun  3 01:08:01 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 03 Jun 2007 01:08:01 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <43aa6ff70705271741w2b3eefcbj29921e81822d189@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<87r6p540n4.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705251047w3a27bf43nc461c728e051dc09@mail.gmail.com>	<87646g3u9q.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705260939x64cd9642qd025c9a01ef7604e@mail.gmail.com>	<87veee2wj4.fsf@uwakimon.sk.tsukuba.ac.jp>
	<43aa6ff70705271741w2b3eefcbj29921e81822d189@mail.gmail.com>
Message-ID: <4661F851.4020403@v.loewis.de>

> Sincere question: if these characters aren't needed, why are they
> provided? From what I can tell by googling, they're needed when, e.g.,
> Arabic is embedded in an otherwise left-to-right script. Do I have
> that right?

I think not. In principle, each character has a directionality
(available through unicodedata.bidirectional), and a rendering
algorithm should be able to detect runs of characters that differ
in directionality from the surrounding text, rendering it
properly. As a special case, certain characters are declared
"neutral", extending the run across, say, spaces.

So embedding Arabic in an LTR text *alone* makes no requirement for
these control characters. I'm unsure whether there are cases where
the standard BIDI algorithm would produce incorrect results;
it's certainly the case that not all tools implement it correctly,
so the control characters can help those tools (assuming the
tool implements the control character at least).

Regards,
Martin

From jcarlson at uci.edu  Sun Jun  3 02:59:27 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 02 Jun 2007 17:59:27 -0700
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f52584c00706021339p2b64f4a5qf7080a5ebf44df73@mail.gmail.com>
References: <20070602095920.6F04.JCARLSON@uci.edu>
	<f52584c00706021339p2b64f4a5qf7080a5ebf44df73@mail.gmail.com>
Message-ID: <20070602174905.6F13.JCARLSON@uci.edu>


"Rauli Ruohonen" <rauli.ruohonen at gmail.com> wrote:
> 
> On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > Whether or not there exists a tool to convert from Python 2.6 to
> > Python 3.0 (2to3), every tool that currently handles Python source
> > code encodings via the method specified in the documentation
> > (just about every Python-centric editor I know) would need to be
> > changed.
> 
> How so? The old regexp can still match the encoding tag unless
> the user insists on using it in an incompatible way. As syntax
> changes go, this one causes little trouble for editors.

As per the spec, only the first two lines need to be scanned.  By your
change, any editor of Python that wanted to follow the spec (like Vim
and Emacs, which helped *define* the spec), would need to scan until
comments stopped being found at the beginning of the source file. 
Further, some editors that don't even understand Python are currently
able to handle alternate encodings precisely because there is exactly
one true way to define encodings: the way Emacs and Vim have defined it,
which Python adopted.


> > Guido doesn't always overrule everyone.
> 
> Yet he makes the decisions. That's why i used his latest comments
> on the topic to set the defaults in the suggestion. These are
> easily changed when necessary, and the whole issue of
> defaults is quite minor. What matters more is having a convenient
> way of setting the character set restrictions of a module. The
> reason I quoted him at such length was that I thought that you
> might have missed some of his posts because you simply ignored
> what he had to say (and no, I generally don't remember people's
> names).

Guido last replied before some 30+ messages more or less closed out the
discussion, many of which addressed precisely the issues that you quoted
as "proof". If you aren't even going to be
bothered to read the thread, I'm not going to bother replying to you. As
I said before, and as I'm saying again, read the thread.  Until then,
you aren't bringing up anything new to the discussion and are just
wasting everyone's time.

 - Josiah


From jimjjewett at gmail.com  Sun Jun  3 03:31:11 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sat, 2 Jun 2007 21:31:11 -0400
Subject: [Python-3000] Lines breaking
In-Reply-To: <20070602040311.C82AA1A25CB@uwakimon.sk.tsukuba.ac.jp>
References: <acd65fa20705280956j20409bc1qfe2f82f03ca94247@mail.gmail.com>
	<20070602040311.C82AA1A25CB@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <fb6fbf560706021831h1dab611ct60a45c621d41d3ef@mail.gmail.com>

On 6/2/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> That's not the point; this is like the logical operations on decimal
> thing.  Adopting a standard in full is reassuring to potential users,
> who won't complain, they just go away.
...
> We only *need* to do it if we want to claim Unicode conformance in
> this area.  I think that is desirable; readline functionality is very
> basic to a text-processing language.

Even then, I don't think we *need* to do it.  Unicode generally allows
tailoring (so long as you specify), and the entirety of chapter 5
(Implementation Guidelines) is explicitly non-normative.

That said, it might be a sensible change anyhow, particularly if we
treat it like the CRLF combination, so that a Form Feed at the end
of a line doesn't force splitlines to produce an empty line.

-jJ

From jimjjewett at gmail.com  Sun Jun  3 05:14:38 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sat, 2 Jun 2007 23:14:38 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f52584c00706020919n4c733af7w1d0f66d157f6bc46@mail.gmail.com>
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<20070601235750.6F01.JCARLSON@uci.edu>
	<f52584c00706020919n4c733af7w1d0f66d157f6bc46@mail.gmail.com>
Message-ID: <fb6fbf560706022014r44fe8592xacf7e884ae2014f7@mail.gmail.com>

On 6/2/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:

> > Your suggestion would unnecessarily change the semantics of the encoding
> > declarations.  I would call this gratuitous breakage.

> Depending on what the regular expression for the declarations is, the
> difference may not be big.

I suspect that if coding were always still first, and the identifier
charset followed it (or were on the same line), that would take care
of this objection.

> something like this is what you'd
> have most of the time in practice when you care about character sets:

> # identifier_charset: 0-7f

Why not ASCII?
Why not be more specific, with 0x30-0x39, 0x41-0x5a, 0x5f, 0x61-0x7a

When adding characters, this isn't such a problem.  When restricting
them, a standard spelling is more important.

> For quick hacks you could use this and everything would just work:

> #!/usr/bin/env python
>
> # Real code.

> This isn't really anything more than a countermeasure against Ka-Ping's
> tricky.py exploit

uhh... I don't see any charset comment there, so his coding: with a
non-ASCII letter in "coding" would still work.

-jJ

From rauli.ruohonen at gmail.com  Sun Jun  3 10:31:30 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 3 Jun 2007 11:31:30 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706022014r44fe8592xacf7e884ae2014f7@mail.gmail.com>
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<20070601235750.6F01.JCARLSON@uci.edu>
	<f52584c00706020919n4c733af7w1d0f66d157f6bc46@mail.gmail.com>
	<fb6fbf560706022014r44fe8592xacf7e884ae2014f7@mail.gmail.com>
Message-ID: <f52584c00706030131v50250028i57e7d6d14b42e474@mail.gmail.com>

On 6/3/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 6/2/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> > # identifier_charset: 0-7f
>
> Why not ASCII?
> Why not be more specific, with 0x30-0x39, 0x41-0x5a, 0x5f, 0x61-0x7a
>
> When adding characters, this isn't such a problem.  When restricting
> them, a standard spelling is more important.

I followed Stephen Turnbull's convention of only adding additional
restrictions to those already provided by PEP 3131. Here 0-7f would
block out all non-7-bit characters, and within that range the PEP
rule is "Within the ASCII range (U+0001..U+007F), the valid characters
for identifiers are the same as in Python 2.5."

> > #!/usr/bin/env python
> >
> > # Real code.
>
> > This isn't really anything more than a countermeasure against Ka-Ping's
> > tricky.py exploit
>
> uhh... I don't see any charset comment there, so his coding: with a
> non-ASCII letter in "coding" would still work.

If it came in the comments before the first empty line, then it would cause a
syntax error, because non-ASCII wouldn't be allowed there to prevent such
trickery. The "first empty line" rule was there to make the safe area visually
clear to the reader.
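
A rough sketch of that rule (the "name: value" directive syntax here is
just an assumption):

import re

def read_header_directives(lines):
    # Directives live in the comment block before the first empty line;
    # non-ASCII anywhere in that block is a syntax error.
    directives = {}
    for line in lines:
        if not line.strip():
            break                 # first empty line ends the safe area
        if any(ord(ch) > 127 for ch in line):
            raise SyntaxError('non-ASCII character in header comment')
        m = re.match(r'#\s*(\w+)\s*:\s*(\S+)', line.strip())
        if m:
            directives[m.group(1)] = m.group(2)
    return directives

# read_header_directives(['# identifier_charset: 0-7f', '', '# ignored'])
# -> {'identifier_charset': '0-7f'}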

From stephen at xemacs.org  Sun Jun  3 14:48:38 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 03 Jun 2007 21:48:38 +0900
Subject: [Python-3000] Lines breaking
In-Reply-To: <fb6fbf560706021831h1dab611ct60a45c621d41d3ef@mail.gmail.com>
References: <acd65fa20705280956j20409bc1qfe2f82f03ca94247@mail.gmail.com>
	<20070602040311.C82AA1A25CB@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560706021831h1dab611ct60a45c621d41d3ef@mail.gmail.com>
Message-ID: <87lkf1yzix.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > Even then, I don't think we *need* to do it.  Unicode generally allows
 > tailoring (so long as you specify), and the entirety of chapter 5
 > (Implementation Guidelines) is explicitly non-normative.

"Non-normative" in this case means you can claim Unicode conformance
without conforming to UAX#14.  However, that means we have to deny
that we conform to UAX#14.  If we want to claim conformance, we have
no choice about FORM FEED; the "bk" class is not tailorable and
"support" for FORM FEED is not optional. :-(  (I don't understand why
they did that; to me Bill's example is compelling.)

 > That said, it might be a sensible change anyhow, particularly if we
 > treat it like the CRLF combination, so that a Form Feed at the the end
 > of a line doesn't force splitlines to produce an empty line.

I don't think that's conformant, but it might be a good enough
compromise to be conformant*<wink>, and is Pythonic (ie, similar to CRLF).


From rauli.ruohonen at gmail.com  Sun Jun  3 15:12:20 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 3 Jun 2007 16:12:20 +0300
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <46371BD2.7050303@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
Message-ID: <f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>

(sorry about replying to such old mail, but I didn't find a better place
to put this)

On 5/1/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> All identifiers are converted into the normal form NFC while parsing;

Actually, shouldn't the whole file be converted to NFC, instead of
only identifiers? If you have decomposable characters in strings and
your editor decides to normalize them to a different form than in the
original source, the meaning of the code will change when you save
without you noticing anything.

It's always better to be explicit when you want to make invisible
distinctions. In the rare cases anything but NFC is really needed you
can do explicit conversion or use escapes. Having to add normalization
calls around all unicode strings to code defensively is neither
convenient nor obvious.
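
For example (with the stdlib unicodedata module):

>>> from unicodedata import normalize
>>> composed = u'\xf6'         # o-umlaut as one code point (NFC)
>>> decomposed = u'o\u0308'    # 'o' + COMBINING DIAERESIS (NFD)
>>> composed == decomposed
False
>>> normalize('NFC', composed) == normalize('NFC', decomposed)
True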

From turnbull at sk.tsukuba.ac.jp  Sun Jun  3 15:42:23 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Sun, 03 Jun 2007 22:42:23 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>
	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
Message-ID: <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > He did not say that such files or command-line options would be
 > scalable either. They are fine tools for auditing, but not for using
 > finished products. One should provide both auditing tools and ease
 > of use of already audited code.

Ease of use of audited code is trivial; turn the checks off.

The question is how to do that.

 > (1) Add a mandatory ASCII-only special comment at the beginning of
 >     each module. The comment would continue until the first empty
 >     line and would contain only valid directives matching some
 >     regular expression. Only whitespace is allowed before the
 >     comment. Anything else is a syntax error.

-1

You still need command-line options or local configuration files to
decide *what* to audit.  We *don't* trust the file!  Just because it
audits to having the character sets it claims doesn't mean it doesn't
use constructs we want to prohibit.  Merely to define those is
non-trivial, and it is absolutely out of the question to expect that
the average Python user will know what the character set
"strictly-conforms-to-UTR39-restrictions-allows-confusables" is.  So
those character sets are basically meaningless for ease of use; ease
of use is "globally restrict to what my students can read = ASCII +
Japanese".

Now, the same code that would be needed to audit the declarations you
propose could easily be generalized to *generate* them.  Once you've
got that, who needs the auditing code in the Python translator?  AIUI
the implementation of PEP 263, you could just substitute an auditing
UTF-8 codec based on that code for the PEP 263 standard UTF-8 codec.
This codec is Python code, and thus could be configured using a file,
which could be generated by the codec and compared with the old
version; the possibilities are endless ... and in no way need to be
defined in the language if I'm correct about the implementation.[1]
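
(A much-simplified sketch -- it checks raw characters rather than parse
context, and the codec name is made up -- but the registration mechanics
are real:)

import codecs

def audit_decode(data, errors='strict'):
    # Decode as UTF-8, then reject anything outside ASCII.
    text, consumed = codecs.utf_8_decode(data, errors, True)
    for ch in text:
        if ord(ch) > 0x7f:
            raise UnicodeError('non-ASCII character %r not permitted' % ch)
    return text, consumed

def _search(name):
    if name == 'audit-utf-8':
        return codecs.CodecInfo(name='audit-utf-8',
                                encode=codecs.utf_8_encode,
                                decode=audit_decode)

codecs.register(_search)

# A module declaring "# -*- coding: audit-utf-8 -*-" would then be
# vetted as it is read in.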

The reason I favor the single command line flag (perhaps even
restricted to the binary choice of compatibility ASCII vs. PEP 3131
Unicode) is as a transition strategy.  I do not agree with Ka-Ping
inter alia that there are bogeymen under the bed, but then I live in
Japan, and there *is* no "under the bed" (we sleep on mats on the
floor<wink>).  I think it's quite reasonable to provide a
non-invasive, *simple* auditing facility for those who want it.  When
you're talking about security holes, the burden of proof should *not*
be on the paranoid, especially when the backward-compatibility cost of
security is *zero* (there are *no* Python programs containing
non-ASCII identifiers in the wild yet!)

As James Knight says, the "configure the world in one file" strategy
that jJ and I were batting around is a bit nuts, but it might not be a
bad strategy for configuring a loadable auditing codec or external
utility; I don't think that's wasted mental effort at all.


Footnotes: 
[1]  Caveat, the implementation will be much more heavyweight than a
standard codec since it must contain a Python parser.

From turnbull at sk.tsukuba.ac.jp  Sun Jun  3 15:51:06 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Sun, 03 Jun 2007 22:51:06 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070601235750.6F01.JCARLSON@uci.edu>
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<20070601235750.6F01.JCARLSON@uci.edu>
Message-ID: <87ira5ywmt.fsf@uwakimon.sk.tsukuba.ac.jp>

Josiah Carlson writes:

 > And as stated by basically everyone, the only *sane* default is ascii
 > identifiers.

That's a misrepresentation.  I prefer the full range of PEP 3131 as
the default for use by consenting adults.  But you should have the
right to unilaterally refuse to grant that consent, yet still enjoy
the benefits of the rest of Python.


From rauli.ruohonen at gmail.com  Sun Jun  3 17:21:43 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 3 Jun 2007 18:21:43 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706030821h1418d85ds31e22eea87853873@mail.gmail.com>

On 6/3/07, Stephen J. Turnbull <turnbull at sk.tsukuba.ac.jp> wrote:
> Merely to define those is non-trivial, and it is absolutely out
> of the question to expect that the average Python user will know
> what the character set "strictly-conforms-to-UTR39-restrictions-
> allows-confusables" is.

This is a bit of a strawman, as most of the time the charset would
be ascii or everything, which are much easier concepts. Point
taken about trying anything more complex, as the reader will
generally no longer understand that anyway. A special-purpose tool
can handle the complex cases much better.

> ease of use is "globally restrict to what my students can read =
> ASCII + Japanese".

I prefer your first definition of ease of use:

> Ease of use of audited code is trivial; turn the checks off.

This along with your another idea sounds fairly good, actually:

> The reason I favor the single command line flag (perhaps even
> restricted to the binary choice of compatibility ASCII vs. PEP
> 3131 Unicode) is as a transition strategy.

The KISS way of having a single flag for either ASCII or PEP 3131
(if the even simpler way of only PEP 3131 is too simple) should
take care of most (all?) of the use cases, and nobody's head will
explode. If it's this simple, then it's not a problem to have
it on the command line, and my suggestion is unnecessary.

> I do not agree with Ka-Ping inter alia that there are bogeymen
> under the bed,

Looks like the only ones who do agree want pure ASCII, so a binary
option is sufficient. You could also argue that it's a choice of
old behavior and new behavior, and anything else is unnecessary.
You might even use "from __future__ import unicode_identifiers"
instead of a command line flag, if you view it like that.

> but then I live in Japan, and there *is* no "under
> the bed" (we sleep on mats on the floor<wink>).

?????????????????????<wink>

> I think it's quite reasonable to provide a non-invasive, *simple*
> auditing facility for those who want it.

Emphasis on simple, indeed. If you start adding more complex
auditing systems, then it would make sense for the files to declare
which specification they conform to.

> When you're talking about security holes, the burden of proof
> should *not* be on the paranoid

The default doesn't really matter much. It's simple to use
"#!/usr/bin/env python -U" or whatever in scripts, whether that
option selects PEP 3131 or ascii.

> As James Knight says, the "configure the world in one file"
> strategy that jJ and I were batting around is a bit nuts, but it
> might not be a bad strategy for configuring a loadable auditing
> codec or external utility; I don't think that's wasted mental
> effort at all.

True, but such details have clearly gone beyond a "*simple*
auditing facility" and sound like a solution looking for a problem.

From martin at v.loewis.de  Sun Jun  3 19:11:21 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sun, 03 Jun 2007 19:11:21 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
Message-ID: <4662F639.2070806@v.loewis.de>

>> All identifiers are converted into the normal form NFC while parsing;
> 
> Actually, shouldn't the whole file be converted to NFC, instead of
> only identifiers? If you have decomposable characters in strings and
> your editor decides to normalize them to a different form than in the
> original source, the meaning of the code will change when you save
> without you noticing anything.

Sure - but how can Python tell whether a non-normalized string was
intentionally put into the source, or ended up there as a side effect
of the editor modifying it?

In most cases, it won't matter. If it does, it should be explicit in
the code, e.g. by putting an n() function around the string literal.

> It's always better to be explicit when you want to make invisible
> distinctions. In the rare cases anything but NFC is really needed you
> can do explicit conversion or use escapes. Having to add normalization
> calls around all unicode strings to code defensively is neither
> convenient nor obvious.

However, it typically isn't necessary, either.

Also, there is still room for subtle issues, e.g. concatenating
two normalized strings may produce a string that isn't normalized.
Also, in many cases, strings come from IO, not from source, so if
it is important that they are in NFC, you need to normalize anyway.

Regards,
Martin

From rauli.ruohonen at gmail.com  Sun Jun  3 20:30:01 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 3 Jun 2007 21:30:01 +0300
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4662F639.2070806@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
Message-ID: <f52584c00706031130l250c71b5tf987eed18e86e4dd@mail.gmail.com>

On 6/3/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> Sure - but how can Python tell whether a non-normalized string was
> intentionally put into the source, or as a side effect of the editor
> modifying it?

It can't, but does it really need to? It could always assume the latter.

> In most cases, it won't matter. If it does, it should be explicit
> in the code, e.g. by putting an n() function around the string
> literal.

This is only almost true. Consider these two hypothetical files
written by naive newbies:

data.py:

favorite_colors = {'Martin Löwis': 'blue'}

code.py:

import data

print data.favorite_colors['Martin Löwis']

Now if these are written by two different people using different
editors, one might be normalized in a different way than the other,
and the code would look all right but mysteriously fail to work.

Even more mysteriously, when the files are opened and saved
(possibly even automatically) by one of the people without any
changes, the code would then start to work. And magically break again
when the other person edits one of the files.

The most important thing about normalization is that it should be
consistent for internal strings. Similarly when reading in a text
file, you really should normalize it first, if you're going to
handle it as *text*, not binary.

The most common normalization is NFC, because it works best
everywhere and causes the least amount of surprise. E.g.
"L?wis"[2] results in "w", not in u'\u0308' (COMBINING DIAERESIS),
which most naive users won't expect.

> Also, there is still room for subtle issues, e.g. when concatenating
> two normalized strings will produce a string that isn't normalized.

Sure:

>>> from unicodedata import normalize as n
>>> a=n('NFD', u'ö'); n('NFC', a[0])+n('NFC', a[1:]) == n('NFC', a)
False

But a partial solution is better than no solution.

> Also, in many cases, strings come from IO, not from source, so if
> it is important that they are in NFC, you need to normalize anyway.

Indeed, and it would be best if this happened automatically, like
handling of line endings. It doesn't need to always work, just
most of the time.
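
Until then it has to be done by hand at the IO boundary; a minimal
sketch (file name and encoding below are just placeholders):

    import codecs
    from unicodedata import normalize

    f = codecs.open('input.txt', 'r', 'utf-8')
    text = normalize('NFC', f.read())  # normalize once, on the way in
    f.close()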

I haven't read the description of Python's syntax, but this happens
with Python 2.5:

test.py:

a = """
"""
print repr(a)

Output: '\n'

The line ending there is '\r\n', and Python normalizes it when
reading in the source code, even though the '\r\n' distinction
matters even less than the NFC distinction does.

From martin at v.loewis.de  Sun Jun  3 20:43:03 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sun, 03 Jun 2007 20:43:03 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <f52584c00706031130l250c71b5tf987eed18e86e4dd@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>	
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>	
	<4662F639.2070806@v.loewis.de>
	<f52584c00706031130l250c71b5tf987eed18e86e4dd@mail.gmail.com>
Message-ID: <46630BB7.2030205@v.loewis.de>

Rauli Ruohonen schrieb:
> This is only almost true. Consider these two hypothetical files
> written by naive newbies:
> 
> data.py:
> 
> favorite_colors = {'Martin Löwis': 'blue'}
> 
> code.py:
> 
> import data
> 
> print data.favorite_colors['Martin Löwis']

That is an unrealistic example. It's more likely that the
second access reads

user = find_current_user()
print data.favorite_colors[user]

To deal with that safely, I would recommend writing

favorite_colors = nfc_dict({'Martin Löwis': 'blue'})
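
nfc_dict doesn't exist, of course; a minimal sketch of it (assuming
unicode keys) could be:

    from unicodedata import normalize

    class nfc_dict(dict):
        # Hypothetical helper: normalize keys to NFC on store and lookup.
        def __init__(self, mapping=()):
            dict.__init__(self)
            for key, value in dict(mapping).items():
                self[key] = value
        def __setitem__(self, key, value):
            dict.__setitem__(self, normalize('NFC', key), value)
        def __getitem__(self, key):
            return dict.__getitem__(self, normalize('NFC', key))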

> The most important thing about normalization is that it should be
> consistent for internal strings. Similarly when reading in a text
> file, you really should normalize it first, if you're going to
> handle it as *text*, not binary.
> 
> The most common normalization is NFC, because it works best
> everywhere and causes the least amount of surprise. E.g.
> "L?wis"[2] results in "w", not in u'\u0308' (COMBINING DIAERESIS),
> which most naive users won't expect.

Sure. If you think it is worth the effort, write a PEP.
PEP 3131 is only about identifiers.

Regards,
Martin


From talin at acm.org  Sun Jun  3 21:05:32 2007
From: talin at acm.org (Talin)
Date: Sun, 03 Jun 2007 12:05:32 -0700
Subject: [Python-3000] Substantial rewrite of PEP 3101
Message-ID: <466310FC.8020707@acm.org>

I've rewritten large portions of PEP 3101, incorporating some material 
from Patrick Maupin and Eric Smith, as well as rethinking the whole 
custom formatter design as I discussed earlier. Although it isn't 
showing up on the web site yet, you can view the copy in subversion (and 
the diffs) here:

    http://svn.python.org/view/peps/trunk/pep-3101.txt

Please let me know of any errors you find. Thanks.

-- Talin

From showell30 at yahoo.com  Sun Jun  3 21:49:20 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 3 Jun 2007 12:49:20 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070602095920.6F04.JCARLSON@uci.edu>
Message-ID: <706536.22850.qm@web33508.mail.mud.yahoo.com>


--- Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> Perhaps, but there is a growing contingent here that
> are of the opposite
> opinion.  And even though this contingent is of
> differing opinions on
> whether unicode identifiers should even be allowed,
> we all agree that if
> they are allowed, they shouldn't be the default.
> 

I have always supported allowing unicode identifiers,
but as somebody who now uses ascii identifiers in all
the code that I write and all the code that I consume,
I am still 60/40 in favor of having ascii-only be the
default.

It will not be the end of the world for me if
unicode-friendly turns out to be the default behavior,
but it does seem reasonable that *some* concession
be made to my general usage, like a simple
environment variable that I could set to disable
unicode identifiers.  In my case, security is not a
complete non-issue, but I mainly want this feature
from a usability standpoint.

I think PEP 3131 could be improved in two ways:

   1) In the Objections section, summarize some of the
reservations that folks have had about allowing
Unicode identifiers into the language, and then
address those reservations with the proposed
solutions.  Rauli's excellent post a few replies back
would be a good starting point.

   2) Propose an ASCII_ONLY environment variable.







       
____________________________________________________________________________________
Yahoo! oneSearch: Finally, mobile search 
that gives answers, not web links. 
http://mobile.yahoo.com/mobileweb/onesearch?refer=1ONXIC

From showell30 at yahoo.com  Sun Jun  3 22:18:17 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 3 Jun 2007 13:18:17 -0700 (PDT)
Subject: [Python-3000] example Python code under PEP 3131?
Message-ID: <315672.7992.qm@web33512.mail.mud.yahoo.com>

There has been a lot of interesting debate about PEP
3131, but I think some perspective could be brought to
the table by showing actual code examples.

Can somebody post a few examples of what Python code
would look like under PEP 3131?  Maybe 10-to-15 line
programs that illustrate the following use cases.

  1) Dutch tax lawyer using Dutch identifiers and
English reserved words (def, import, if, while, etc.)

  2) Japanese student using Japanese identifiers and
English reserved words (re, search, match, print,
etc.).

As somebody who has never worked with a language where
I don't know the reserved words, I'm trying to imagine
this type of program:

  1) English student using English identifiers and
Japanese reserved words.

My perspective on this issue is limited by the fact
that I happen to speak English natively.  I often
wonder whether I'd be using Python if the keywords
were in Dutch, and my identifiers weren't allowed to
include certain Anglicisms (say I had to spell "y" as
"ij"), but I was allowed to use English in my strings.
 Then, I wonder how much my decision to use Python
would have been influenced by the ability to use
identifiers in Dutch.  I'm suspecting the answers
would be no and no, even though Dutch is fairly
closely related to English.

I can tell you that it would be a complete showstopper
if Matz had written a Python-like language that
required Japanese reserved words, even if he had
allowed English in other places.  Matz wisely
internationalized the whole language.  (Never mind
that I don't use Ruby much anyway--that has more to do
with other linguistic issues).

From jimjjewett at gmail.com  Mon Jun  4 02:43:44 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Jun 2007 20:43:44 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <4661F851.4020403@v.loewis.de>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>
	<87r6p540n4.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705251047w3a27bf43nc461c728e051dc09@mail.gmail.com>
	<87646g3u9q.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705260939x64cd9642qd025c9a01ef7604e@mail.gmail.com>
	<87veee2wj4.fsf@uwakimon.sk.tsukuba.ac.jp>
	<43aa6ff70705271741w2b3eefcbj29921e81822d189@mail.gmail.com>
	<4661F851.4020403@v.loewis.de>
Message-ID: <fb6fbf560706031743j2599b487p9bc118dc44f99061@mail.gmail.com>

On 6/2/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> I'm unsure whether there are cases where
> the standard BIDI algorithm would produce incorrect results;

Yes, but I'm not sure any of those cases are appropriate for
programming language identifiers.

Quoting from introduction to Unicode Annex 9:

"""
However, in the case of bidirectional text, there are circumstances
where an implicit bidirectional ordering is not sufficient to produce
comprehensible text
"""

Neither the example given (mixed-script part numbers, section 2.2),
nor those I could come up with (all involving archaic scripts) were
appropriate for variable *names*.

> it's certainly the case that not all tools implement it correctly,
> so the control characters can help those tools (assuming the
> tool implements the control character at least).

To be honest, that is probably what I would do; I'm not quite sure I
even understand the correct algorithm for numbers.

-jJ

From stephen at xemacs.org  Mon Jun  4 03:29:29 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 04 Jun 2007 10:29:29 +0900
Subject: [Python-3000]  example Python code under PEP 3131?
In-Reply-To: <315672.7992.qm@web33512.mail.mud.yahoo.com>
References: <315672.7992.qm@web33512.mail.mud.yahoo.com>
Message-ID: <87ejkszeva.fsf@uwakimon.sk.tsukuba.ac.jp>

Steve Howell writes:

 >   2) Japanese student using Japanese identifiers and
 > English reserved words (re, search, match, print,
 > etc.).

I don't have time to cook up something in Python, but I can give an
example of working code in Lisp: 
<http://cvs.xemacs.org/viewcvs.cgi/XEmacs/packages/mule-packages/edict/edict-japanese.el?rev=1.4&content-type=text/vnd.viewcvs-markup>

You probably already know enough Lisp to read this, but if not, here
are a few hints.  `define-edict-rule' is a factory function, not part
of the Lisp language.  Comments are prefixed by ";" and run to the end
of the line.  Strings are delimited by '"' and may contain newlines.
Pretty much everything else that is not punctuation is an identifier.
"-" may be embedded in an identifier.

Note that the rest of the application contains Japanese only in
comments.  This section deals with de-inflection of Japanese words
(ie, deducing dictionary form from words that occur in natural text),
and thus needs concepts not available in English, or where available,
the English word would not make sense to a Japanese.

BTW, I don't know whether the breakage is in ViewCVS, FireFox, or
both, but several places in the file result in confusion of content
and HTML markup, with more-or-less amusing results.


From jimjjewett at gmail.com  Mon Jun  4 03:27:06 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Jun 2007 21:27:06 -0400
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
Message-ID: <fb6fbf560706031827u4c13687t280f80d785c05d83@mail.gmail.com>

On 6/2/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:

> and the whole issue of defaults is quite minor.

I disagree; the defaults are the most important issue.

Those most eager for unicode identifiers are afraid that people
(particularly beginning students) won't be able to use local-script
identifiers, unless it is the default.  My feeling is that the teacher
(or the person who pointed them to python) can change the default on a
per-install basis, since it can be a one-time change.

Those of us most nervous about unicode identifiers are concerned
precisely because "anything goes" may become a default.

If national characters become the default in Sweden or Japan, that is
OK.  These national divisions are already there, and probably
unavoidable.

On the other hand, if "anything from *any* script" becomes the
default, even on a single widespread distribution, then the community
starts to splinter in a new way.  It starts to separate between people
who distribute source code (generally ASCII) and people who are
effectively distributing binaries (not for human end-users to read).

That is bad enough on its own, but even worse because the distinction
isn't clearly marked.  As the misleading examples have shown, these
(effective) binaries can pretend to be regular source code doing one
thing, even though they actually do something different.

> On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > Adding a tool to an arbitrarily large or small previously existing
> > toolchain, so that the majority of users can verify that their code
> > doesn't contain characters that shouldn't be allowed in the first
> > place, isn't a very good solution.

> I doubt the majority of users care, so the verifiers would be
> a minority.

Agreed, because the majority of users don't care about security at
all.  Outside the python context, this is one reason we have so much
spam (from compromised computers).  To protect the group at large,
security has to be the default.

Of course, security also has to be non-intrusive, or people will turn
it off.  A one-time decision to allow your own national characters,
which could be rolled into the initial install, or even a local
distribution -- that is fairly non-intrusive.

> You're exaggerating the amount of work caused [by adding to the toolchain]

No, he isn't.

My own process is often exactly:

(1)  Read or skim the code.
(2)
    (a)  Download it/save it as text, or
    (b)  Cut and paste the snippet from the webpage
(3)  Run it.

There is no external automated tool in the middle; forcing me to add
one would move python from the "things just work, and you can test
immediately" category into a compile/build/wait/test language.  I have
used python this way (when developing for a machine I could not access
directly), and ... I don't recommend it.

Hopefully, I can set my own python to enforce ASCII IDs (rather than
ASCII strings and comments).  But if too many people start to assume
that distributed code can freely mix other scripts, I'll start to get
random failures.  I'll probably allow Latin-1.  I might end up
allowing a few other scripts -- but then how should I say "script X or
script Y; not both"?  Keeping the default at ASCII for another release
or two will provide another release or two to answer this question.
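
To make that concrete, here is a hypothetical per-user checker (the
script ranges are invented purely for illustration):

    ALLOWED = {'Latin': (0x0000, 0x024F), 'Greek': (0x0370, 0x03FF)}

    def check(identifier):
        # Every character must fall in an allowed range, and a single
        # identifier may not span two of the ranges.
        names = set()
        for ch in identifier:
            for name, (lo, hi) in ALLOWED.items():
                if lo <= ord(ch) <= hi:
                    names.add(name)
                    break
            else:
                raise ValueError('disallowed character: %r' % ch)
        if len(names) > 1:
            raise ValueError('mixed scripts: %r' % identifier)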

> > Only because it is so rarely used that no one really runs into
> > unicode identifiers.

> It doesn't really matter why they're not a problem in practice,
> just that they aren't. A non-issue is a non-issue, no matter why.

Of course it matters.  If it isn't a problem only because of something
that wouldn't apply to python, then we still have to worry.

> ... Java, ... don't hear constant complaints

They aren't actually a problem because they aren't used; they aren't
used because almost no one knows about them.  Python would presumably
advertise the feature, and see more use.  (We shouldn't add it at all
*unless* we expect much more usage than unicode IDs have seen in other
programming languages.)

Also note that Java in particular already has static type checking
(which would resolve many of the objections) and is already a
compile/build/wait/test language (so the cost of additional tools is
less).  (I believe that C# is in this category too, but won't swear to
it.)

Not seeing problems in Lisp would be a valid argument -- except that
the internationalized IDs are explicitly marked.  Not just the files;
the individual IDs.  You have to write |lowercase| to get an ID made
of unexpected characters (including explicitly lower-case letters).

JavaScript would provide a legitimate example of a dynamic language
where unicode IDs caused no problem.  On the other hand, broken
javascript is already so common that I doubt anyone would have
noticed; python should (and currently does) meet a higher standard for
cross-platform interoperability.

In other words, python will be going out on a limb.  That doesn't mean
we shouldn't allow such identifiers, but it does mean that we should
be cautious.

As an analogy, remember that function decorators were added to python
in version 2.4.  The initial patch would also have handled class
decorators.  No one came up with a single reason to disallow them that
didn't also apply to function decoration -- except one.  Guido wasn't
*sure* they were needed, and it would be easier to add them later (in
2.6) than it would have been to pull them back out.

The same one-step-at-a-time reasoning applies to unicode identifiers.
Allowing IDs in your native language (or others that you explicitly
approve) is probably a good step.  Allowing IDs in *any* language by
default is probably going too far.

-jJ

From stephen at xemacs.org  Mon Jun  4 03:53:21 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 04 Jun 2007 10:53:21 +0900
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <f52584c00706031130l250c71b5tf987eed18e86e4dd@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<f52584c00706031130l250c71b5tf987eed18e86e4dd@mail.gmail.com>
Message-ID: <87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > On 6/3/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
 > > Sure - but how can Python tell whether a non-normalized string was
 > > intentionally put into the source, or as a side effect of the editor
 > > modifying it?
 > 
 > It can't, but does it really need to? It could always assume the latter.

No, it can't.  One might want to write Python code that implements
normalization algorithms, for example, and there will be "binary
strings".  Only in the context of Unicode text are you allowed to do
those things.

This would require Python to internally distinguish between Unicode
text files and other files.

[example of a dictionary application using Unicode strings]

 > Now if these are written by two different people using different
 > editors, one might be normalized in a different way than the other,
 > and the code would look all right but mysteriously fail to work.

It seems to me that once we have a proper separation between bytes
objects and unicode objects, that the latter should always be compared
internally to the dictionary using the kinds of techniques described
in UTS#10 and UTR#30.  External normalization is not the right way to
handle this issue.

 > But a partial solution is better than no solution.

Not if it leads to unexpected failures that are hard to diagnose,
especially in the face of human belief that this problem has been
"solved".

 > The line ending there is '\r\n', and Python normalizes it when
 > reading in the source code, even though '\r\n' matters even less
 > than doing NFC normalization.

That's not a Python language normalization; that's an artifact of the
line-reading function.  It's deliberate, of course, but it's not
really character-level, it's a line-level transformation.  If I start
up an interpreter and type

>>> a = """^V^M^V^J"""
>>> repr(a)
"'\\r\\n'"

(On my Mac, on other systems the quoting character for key entry of
control characters is probably different.)


From bjourne at gmail.com  Mon Jun  4 03:58:59 2007
From: bjourne at gmail.com (=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=)
Date: Mon, 4 Jun 2007 03:58:59 +0200
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP
	3131)
In-Reply-To: <fb6fbf560706031827u4c13687t280f80d785c05d83@mail.gmail.com>
References: <fb6fbf560706031827u4c13687t280f80d785c05d83@mail.gmail.com>
Message-ID: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>

> Those most eager for unicode identifiers are afraid that people
> (particularly beginning students) won't be able to use local-script
> identifiers, unless it is the default.  My feeling is that the teacher
> (or the person who pointed them to python) can change the default on a
> per-install basis, since it can be a one-time change.

What if the person discovers Python by him/herself?

> On the other hand, if "anything from *any* script" becomes the
> default, even on a single widespread distribution, then the community
> starts to splinter in a new way.  It starts to separate between people
> who distribute source code (generally ASCII) and people who are
> effectively distributing binaries (not for human end-users to read).

That is FUD.

> Hopefully, I can set my own python to enforce ASCII IDs (rather than
> ASCII strings and comments).  But if too many people start to assume
> that distributed code can freely mix other scripts, I'll start to get
> random failures.  I'll probably allow Latin-1.  I might end up
> allowing a few other scripts -- but then how should I say "script X or
> script Y; not both"?  Keeping the default at ASCII for another release
> or two will provide another release or two to answer this question.

Answer what question? If people will use the feature? Of course they
won't if it isn't default.

> > ... Java, ... don't hear constant complaints
>
> They aren't actually a problem because they aren't used; they aren't
> used because almost no one knows about them.  Python would presumably
> advertise the feature, and see more use.  (We shouldn't add it at all
> *unless* we expect much more usage than unicode IDs have seen in other
> programming languages.)

Every Swedish book I've read about Java (only 2) mentioned that feature.

> The same one-step-at-a-time reasoning applies to unicode identifiers.
> Allowing IDs in your native language (or others that you explicitly
> approve) is probably a good step.  Allowing IDs in *any* language by
> default is probably going too far.

If you set different native languages won't you get the exact same
problems that codepages caused and that unicode was invented to solve?

-- 
mvh Björn

From showell30 at yahoo.com  Mon Jun  4 04:45:57 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 3 Jun 2007 19:45:57 -0700 (PDT)
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP
	3131)
In-Reply-To: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
Message-ID: <863853.58342.qm@web33508.mail.mud.yahoo.com>


--- Björn Lindqvist <bjourne at gmail.com> wrote:

> > Those most eager for unicode identifiers are
> afraid that people
> > (particularly beginning students) won't be able to
> use local-script
> > identifiers, unless it is the default.  My feeling
> is that the teacher
> > (or the person who pointed them to python) can
> change the default on a
> > per-install basis, since it can be a one-time
> change.
> 
> What if the person discovers Python by him/herself?
> 

How many people discover Python in a cultural vacuum? 
People find out about Python from other Python users.

There are user groups all over the planet:

http://wiki.python.org/moin/LocalUserGroups

From stephen at xemacs.org  Mon Jun  4 05:45:11 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 04 Jun 2007 12:45:11 +0900
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
In-Reply-To: <fb6fbf560706031827u4c13687t280f80d785c05d83@mail.gmail.com>
References: <fb6fbf560706031827u4c13687t280f80d785c05d83@mail.gmail.com>
Message-ID: <87bqfwz8l4.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > > You're exaggerating the amount of work caused [by adding to the toolchain]
 > 
 > No, he isn't.

It is exaggeration.  AFAICS the work of auditing character sets can be
done by the same codec APIs that implement PEP 263.  The only question
is whether the additional work of parsing out the identifiers would
cause noticeable inefficiency in codec operation.  AFAIK, parsing out
the identifiers is cheap (though possibly several times as expensive
as the UTF-8 -> unicode object conversion, if it needs to be done
once in the codec and once in the compiler).
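
A rough sketch of the kind of audit I mean, run over the decoded
source rather than inside a codec (and naive in that it also scans
strings and comments):

    import re

    def nonascii_identifiers(source):
        # Yield identifier-like tokens containing non-ASCII characters.
        for match in re.finditer(r'[^\W\d]\w*', source, re.UNICODE):
            if any(ord(ch) > 127 for ch in match.group()):
                yield match.group()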

 > Hopefully, I can set my own python to enforce ASCII IDs (rather than
 > ASCII strings and comments).  But if too many people start to assume
 > that distributed code can freely mix other scripts, I'll start to get
 > random failures.

This is unlikely to be a major problem, IMHO.  It definitely is a
consideration, though, and some people will face more difficulty than
others, perhaps a lot more.

 > Not seeing problems in Lisp would be a valid argument -- except that
 > the internationalized IDs are explicitly marked.  Not just the files;
 > the individual IDs.  You have to write |lowercase| to get an ID made
 > of unexpected characters (including explicitly lower-case letters).

This is not true of Emacs Lisp, which not only accepts non-ASCII
characters, but is case-sensitive.

 > noticed; python should (and currently does) meet a higher standard for
 > cross-platform interoperability.

As does Emacs.

 > The same one-step-at-a-time reasoning applies to unicode identifiers.
 > Allowing IDs in your native language (or others that you explicitly
 > approve) is probably a good step.  Allowing IDs in *any* language by
 > default is probably going too far.

I don't really see that distinction.  IMO the scenarios where allowing
a native language makes sense are (a) localized (like a programming
class), and you won't run into anything else anyway, and (b)
internationalized, where you'll be sharing with others who have
enabled *their* native languages.

Those with stricter auditing requirements will be vetting production
code with more powerful external tools anyway.

From stephen at xemacs.org  Mon Jun  4 06:01:08 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 04 Jun 2007 13:01:08 +0900
Subject: [Python-3000] Conservative Defaults (was: Re: Support for
	PEP	3131)
In-Reply-To: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
References: <fb6fbf560706031827u4c13687t280f80d785c05d83@mail.gmail.com>
	<740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
Message-ID: <87abvgz7uj.fsf@uwakimon.sk.tsukuba.ac.jp>

Björn Lindqvist writes:

 > > On the other hand, if "anything from *any* script" becomes the
 > > default, even on a single widespread distribution, then the community
 > > starts to splinter in a new way.  It starts to separate between people
 > > who distribute source code (generally ASCII) and people who are
 > > effectively distributing binaries (not for human end-users to read).
 > 
 > That is FUD.

Not entirely.  XEmacs has found it appropriate to divide its
approximation to a standard library into "no-MULE" and "MULE-required"
groups of packages (~= Python modules).  GNU Emacs did not, and
suffered a lot of internal dissension for their decision to impose
MULE on all users.  Interestingly, they use no non-ASCII identifiers
that I know of.  (edict.el is not included in GNU Emacs due to an
assignment refusenik among the principal authors.)  The technology has
advanced dramatically since then, but there is real precedent for
balkanization.

The phrase "effectively distributing binaries (not for human end-users
to read)" is over the top, though.  Of course they're for end users to
read, they still are Python source, etc.

 > Answer what question? If people will use the feature? Ofcourse they
 > won't if it isn't default.

I assure you, my students will if it is available to my knowledge.

 > If you set different native languages won't you get the exact same
 > problems that codepages caused and that unicode was invented to solve?

No.  There is no confusion of character identity.  This is a perfectly
legitimate way to support Unicode, as long the subset of Unicode that
is allowed is properly declared.  It does not violate the principles
of Unicode in any way.


From martin at v.loewis.de  Mon Jun  4 07:12:39 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 04 Jun 2007 07:12:39 +0200
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <315672.7992.qm@web33512.mail.mud.yahoo.com>
References: <315672.7992.qm@web33512.mail.mud.yahoo.com>
Message-ID: <46639F47.4070909@v.loewis.de>

> Can somebody post a few examples of what Python code
> would look like under PEP 3131?  Maybe 10-to-15 line
> programs that illustrate the following use cases.

Attached is a class definition of the kind that students often
propose in oral exams. Halfway through they start to wonder
whether this is even allowed.

# Definition von Element sei gegeben

class Liste:
  def __init__(self):
    self.erstes_element = None

  def einfügen(self, objekt):
    if not self.erstes_element:
      self.erstes_element = Element(objekt)
    else:
      zeiger = self.erstes_element
      while zeiger.nächstes_element:
        zeiger = zeiger.nächstes_element
      zeiger.nächstes_element = Element(objekt)

  def löschen(self, objekt):
    if self.erstes_element.wert == objekt:
      self.erstes_element = self.erstes_element.nächstes_element
    else:
      zeiger = self.erstes_element
      while zeiger.nächstes_element:
        if zeiger.nächstes_element.wert == objekt:
          zeiger.nächstes_element = \
            zeiger.nächstes_element.nächstes_element
          return
        zeiger = zeiger.nächstes_element

Mit freundlichen Grüßen,
Martin

From martin at v.loewis.de  Mon Jun  4 07:26:29 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 04 Jun 2007 07:26:29 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <46371BD2.7050303@v.loewis.de>	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>	<4662F639.2070806@v.loewis.de>	<f52584c00706031130l250c71b5tf987eed18e86e4dd@mail.gmail.com>
	<87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4663A285.4090009@v.loewis.de>

Stephen J. Turnbull schrieb:
>  > > Sure - but how can Python tell whether a non-normalized string was
>  > > intentionally put into the source, or as a side effect of the editor
>  > > modifying it?
>  > 
>  > It can't, but does it really need to? It could always assume the latter.
> 
> No, it can't.  One might want to write Python code that implements
> normalization algorithms, for example, and there will be "binary
> strings".  Only in the context of Unicode text are you allowed to do
> those things.

Of course, such an algorithm really should \u-escape the relevant
characters in source, so that editors can't mess them up.

>  > Now if these are written by two different people using different
>  > editors, one might be normalized in a different way than the other,
>  > and the code would look all right but mysteriously fail to work.
> 
> It seems to me that once we have a proper separation between bytes
> objects and unicode objects, that the latter should always be compared
> internally to the dictionary using the kinds of techniques described
> in UTS#10 and UTR#30.  External normalization is not the right way to
> handle this issue.

By default, comparison and dictionary lookup won't do normalization,
as that is too expensive and too infrequently needed.

In any case, this has nothing to do with PEP 3131.

Regards,
Martin


From rauli.ruohonen at gmail.com  Mon Jun  4 07:52:14 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Mon, 4 Jun 2007 08:52:14 +0300
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<f52584c00706031130l250c71b5tf987eed18e86e4dd@mail.gmail.com>
	<87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706032252h4038776bn9d657ebf9786116b@mail.gmail.com>

On 6/4/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> No, it can't.  One might want to write Python code that implements
> normalization algorithms, for example, and there will be "binary
> strings".  Only in the context of Unicode text are you allowed to
> do those things.

But Python files are text and should be readable to humans. Invisible
differences in code that are significant aren't good practice -
I think that was well established in the PEP 3131 discussion :-)
Is there some reason normalization algorithm implementations can't
use escapes (which are ASCII and thus not normalized) for non-NFC
strings? Note that editors are allowed to normalize as they will
(though the ones I use don't). From the Unicode standard, chapter 3:

:C9 A process shall not assume that the interpretations of two
:   canonical-equivalent character sequences are distinct.
:
: - The implications of this conformance clause are twofold. First,
:   a process is never required to give different interpretations
:   to two different, but canonical-equivalent character sequences.
:   Second, no process can assume that another process will make
:   a distinction between two different, but canonical-equivalent
:   character sequences.

As other programs processing Python source code files may not be
assumed to distinguish between normalization forms, depending on
them to do so (in normalization algorithm source code or elsewhere)
is a bit disquieting.

> It seems to me that once we have a proper separation between bytes
> objects and unicode objects, that the latter should always be
> compared internally to the dictionary using the kinds of techniques
> described in UTS#10 and UTR#30.

This sounds good if it's feasible performance-wise.

> External normalization is not the right way to handle this issue.

It depends on what problem you're solving. What I'm concerned about
most is that there may be rare (because NFC is so ubiquitous) but
annoying heisenbugs whose immediate cause is an invisible difference
in the source code. Such a class of problems shouldn't exist without
a good reason, and the reason "someone might want to write code that
depends on invisible easter eggs in the source code" doesn't sound
like a good reason to me.

Collation also doesn't solve all of the problem for naive users.
E.g. is len('???') 3 or 4? It depends on the normalization.
Whether each index in it is a hiragana character or not also
depends on the normalization. Same for e.g. 'café'.
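
The café case is easy to check (Python 2.5):

>>> from unicodedata import normalize
>>> len(normalize('NFC', u'caf\xe9')), len(normalize('NFD', u'caf\xe9'))
(4, 5)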

>  > But a partial solution is better than no solution.
>
> Not if it leads to unexpected failures that are hard to diagnose,
> especially in the face of human belief that this problem has been
> "solved".

Sure, the concatenation of two normalized strings is not necessarily
a normalized string because you can have a string with a
combining character at the beginning, but people who deal with such
things know (or at least really, really, should!) how to fend for
themselves. There's nothing you can do to help them either, except
education.

There's value in keeping simple things simple and ensuring nothing
unexpected happens with simple things. In a large class of use
cases you really don't need to care that it's a complex world.
This is the case with many legacy encodings (such as Latin-1), and
the users of those will surely be surprised if switching to utf-8
causes single characters to sometimes be split into multiple parts
depending on the phase of the Moon.

> If I start up an interpreter and type
>
> >>> a = """^V^M^V^J"""
> >>> repr(a)
> "'\\r\\n'"

What the interpreter prompt does is less of an issue, as the
code is not long-lived and the programmer is there all the time
observing what the code does.

Anyway, the deadline for PEPs for py3k has passed and there's no
PEP this one would fit in, so I guess this wart will have to stay.
It's not a pressing issue, as everyone who's sane uses NFC
anyway, and if someone edits your code with a NFD-normalizing editor
you can just beat them over the head with a stick and force them to
use vim as a penance :-)

From eric+python-dev at trueblade.com  Mon Jun  4 12:37:34 2007
From: eric+python-dev at trueblade.com (Eric V. Smith)
Date: Mon, 04 Jun 2007 06:37:34 -0400
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <466310FC.8020707@acm.org>
References: <466310FC.8020707@acm.org>
Message-ID: <4663EB6E.4080302@trueblade.com>

 > Formatter Creation and Initialization
 >
 >     The Formatter class takes a single initialization argument, 'flags':
 >
 >         Formatter(flags=0)
 >
 >     The 'flags' argument is used to control certain subtle behavioral
 >     differences in formatting that would be cumbersome to change via
 >     subclassing. The flags values are defined as static variables
 >     in the "Formatter" class:
 >
 >         Formatter.ALLOW_LEADING_UNDERSCORES
 >
 >             By default, leading underscores are not allowed in identifier
 >             lookups (getattr or getitem).  Setting this flag will allow
 >             this.
 >
 >         Formatter.CHECK_UNUSED_POSITIONAL
 >
 >             If this flag is set, any positional arguments which are
 >             supplied to the 'format' method but which are not used by
 >             the format string will cause an error.
 >
 >         Formatter.CHECK_UNUSED_NAME
 >
 >             If this flag is set, any named arguments which are
 >             supplied to the 'format' method but which are not used by
 >             the format string will cause an error.

I'm not sure I'm wild about these flags which would have to be or'd
together, as opposed to discrete parameters.  I realize having a single
flag field is likely more extensible, but my impression of the
standard library is a move away from bitfield flags.  Perhaps that's
only in my own mind, though!
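
For comparison, the two calling styles side by side (the keyword
names below are hypothetical):

    f = Formatter(Formatter.CHECK_UNUSED_POSITIONAL
                  | Formatter.CHECK_UNUSED_NAME)
    f = Formatter(check_unused_positional=True, check_unused_name=True)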

Also, why put this in the base class at all?  These could all be
implemented in a derived class (or classes), which would leave the
base class state-free and therefore without a constructor.

 > Formatter Methods
 >
 >     The methods of class Formatter are as follows:
 >
 >         -- format(format_string, *args, **kwargs)
 >         -- vformat(format_string, args, kwargs)
 >         -- get_positional(args, index)
 >         -- get_named(kwds, name)
 >         -- format_field(value, conversion)

I've started a sample implementation to test this API.  For starters,
I'm writing it in pure Python, but my intention is to use the code in
the pep3101 sandbox once I have some tests written and we're happy
with the API.
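
In the meantime, here is a rough sketch (mine, not the PEP's
reference implementation) of how those methods might cooperate; it
ignores nested fields and conversion specifiers:

    import re

    class Formatter(object):
        _field = re.compile(r'\{([^{}]*)\}')

        def format(self, format_string, *args, **kwargs):
            return self.vformat(format_string, args, kwargs)

        def vformat(self, format_string, args, kwargs):
            def replace(match):
                name, _, conversion = match.group(1).partition(':')
                if name.isdigit():
                    value = self.get_positional(args, int(name))
                else:
                    value = self.get_named(kwargs, name)
                return self.format_field(value, conversion)
            return self._field.sub(replace, format_string)

        def get_positional(self, args, index):
            return args[index]

        def get_named(self, kwds, name):
            return kwds[name]

        def format_field(self, value, conversion):
            # The sketch ignores the conversion string entirely.
            return str(value)

e.g. Formatter().format('{0} and {name}', 42, name='spam') returns
'42 and spam'.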


From showell30 at yahoo.com  Mon Jun  4 12:47:04 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 03:47:04 -0700 (PDT)
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <46639F47.4070909@v.loewis.de>
Message-ID: <471885.91417.qm@web33503.mail.mud.yahoo.com>


--- "Martin v. L?wis" <martin at v.loewis.de> wrote:


> # Definition von Element sei gegeben
> 
> class Liste:
>   def __init__(self):
>     self.erstes_element = None
> 
>   def einfügen(self, objekt):
>     if not self.erstes_element:
>       self.erstes_element = Element(objekt)
>     else:
>       zeiger = self.erstes_element
>       while zeiger.nächstes_element:
>         zeiger = zeiger.nächstes_element
>       zeiger.nächstes_element = Element(objekt)
> 
>   def löschen(self, objekt):
>     if self.erstes_element.wert == objekt:
>       self.erstes_element = self.erstes_element.nächstes_element
>     else:
>       zeiger = self.erstes_element
>       while zeiger.nächstes_element:
>         if zeiger.nächstes_element.wert == objekt:
>           zeiger.nächstes_element = \
>             zeiger.nächstes_element.nächstes_element
>           return
>         zeiger = zeiger.nächstes_element
> 

Neat.

Danke für das Beispiel.  (I hope that makes sense.)

FWIW I can follow most of the above program, with a
tiny bit of help from Babelfish.

These were easy for me:

    Liste = list
    nachstes = next
    erstes = first
    objekt = object

These I looked up:

    einfugen = in joints (????)
    gegeben = given
    zeiger = pointer

From python at zesty.ca  Mon Jun  4 13:01:02 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Mon, 4 Jun 2007 06:01:02 -0500 (CDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>
	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>

On Fri, 25 May 2007, Guillaume Proux wrote:
> If you are really paranoid to see evil chars take over your
> python src dir

On Sun, 3 Jun 2007, Stephen J. Turnbull wrote:
> I do not agree with Ka-Ping inter alia that there are bogeymen
> under the bed

Sigh.  I have lost count of the number of times I (and others by
association) have been labelled "paranoid" or something similar in
this discussion, and I am now asking you all to put a stop to it.
Name-calling isn't going to do us any good.

(I am sorry that this is in reply to your message, Stephen -- your
message above is one of the gentlest of the lot; it just happens to
be the most recent, and I have finally been pushed over the edge
into saying something about it.)

Please: can we all stick to statements about usage, problems, and
solutions, not about the personalities of those who propose them?

Here is what I have to say (to everyone in this discussion, not
specifically to you, Stephen) in response to said labelling:

Many of us value a *predictable* identifier character set.
Whether "predictable" means ASCII only, or user-selectable, or
restricted by default, I think we all agree in this sentiment:

We believe that we should try to make it easier, not harder, for
programmers to understand what Python code says.  This has many
benefits (reliability, readability, transparency, reviewability,
debuggability).  I consider these core strengths of Python.

Python is a source code language.  In other languages you share
binaries, but in Python you share and directly run source code.
This is fundamental to its impact on open source, its impact on
education, and its prevalence as an extension language.

That is what makes these strengths so important.  I hope this
helps you understand why these concerns can't and shouldn't be
brushed off as "paranoia" -- this really has to do with the
core values of the language.


-- ?!ng

From python at zesty.ca  Mon Jun  4 13:08:44 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Mon, 4 Jun 2007 06:08:44 -0500 (CDT)
Subject: [Python-3000] PEP 3131 roundup
Message-ID: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>

Hi,

Here's a summary of some of the remaining open issues and unaddressed
arguments regarding PEP 3131.  These are the ones I'm familiar with,
so I don't claim this to be complete.  I hope it helps give some
perspective on this huge thread, though.


A. Should identifiers be allowed to contain any Unicode letter?

   Drawbacks of allowing non-ASCII identifiers wholesale:

   1. Python will lose the ability to make a reliable round trip to
      a human-readable display on screen or on paper.

      http://mail.python.org/pipermail/python-3000/2007-May/007855.html

   2. Python will become vulnerable to a new class of security exploits;
      code and submitted patches will be much harder to inspect.

      http://mail.python.org/pipermail/python-3000/2007-May/007855.html

   3. Humans will no longer be able to validate Python syntax.

      http://mail.python.org/pipermail/python-3000/2007-May/007855.html

   4. Unicode is young; its problems are not yet well understood and
      solved; tool support is weak.

      http://mail.python.org/pipermail/python-3000/2007-May/007855.html

   5. Languages with non-ASCII identifiers use different character sets
      and normalization schemes; PEP 3131's choices are non-obvious.

      http://mail.python.org/pipermail/python-3000/2007-May/007947.html
      http://mail.python.org/pipermail/python-3000/2007-May/007725.html

   6. The Unicode bidi algorithm yields an extremely confusing display
      order for RTL text when digits or operators are nearby.

      http://www.w3.org/International/iri-edit/draft-duerst-iri.html#anchor5
      http://mail.python.org/pipermail/python-3000/2007-May/007823.html


B. Should the default behaviour accept only ASCII identifiers, or
   should it accept identifiers containing non-ASCII characters?

   Arguments for ASCII only by default:

   1. Non-ASCII identifiers by default makes common practice/assumptions
      subtly/unknowingly wrong; rarely wrong is worse than obviously wrong.

      http://mail.python.org/pipermail/python-3000/2007-May/007992.html
      http://mail.python.org/pipermail/python-3000/2007-May/008009.html
      http://mail.python.org/pipermail/python-3000/2007-May/007961.html

   2. Better to raise a warning than to fail silently when encountering
      a probably unexpected situation.

      http://mail.python.org/pipermail/python-3000/2007-May/007993.html
      http://mail.python.org/pipermail/python-3000/2007-May/007945.html

   3. All of current usage is ASCII-only; the vast majority of future
      usage will be ASCII-only.

      http://mail.python.org/pipermail/python-3000/2007-May/007952.html
      http://mail.python.org/pipermail/python-3000/2007-May/007927.html

   4. It is the pockets of Unicode adoption that are parochial, not the
      ASCII advocates.

      http://mail.python.org/pipermail/python-3000/2007-May/008010.html

   5. Python should audit for ASCII-only identifiers for the same
      reasons that it audits for tab-space consistency

      http://mail.python.org/pipermail/python-3000/2007-May/007942.html

   6. Incremental change is safer.

      http://mail.python.org/pipermail/python-3000/2007-May/008000.html

   7. An ASCII-only default favors open-source development and sharing
      of source code.

      http://mail.python.org/pipermail/python-3000/2007-May/007988.html
      http://mail.python.org/pipermail/python-3000/2007-May/007990.html

   8. Existing projects won't have to waste any brainpower worrying
      about the implications of Unicode identifiers.

      http://mail.python.org/pipermail/python-3000/2007-May/007957.html


C. Should non-ASCII identifiers be optional?

   Various voices in support of a flag (although there's been debate
   over which should be the default, no one seems to be saying that
   there shouldn't be an off switch):

   http://mail.python.org/pipermail/python-3000/2007-May/007855.html
   http://mail.python.org/pipermail/python-3000/2007-May/007916.html
   http://mail.python.org/pipermail/python-3000/2007-May/007923.html
   http://mail.python.org/pipermail/python-3000/2007-May/007935.html
   http://mail.python.org/pipermail/python-3000/2007-May/007948.html


D. Should the identifier character set be configurable?

   Various voices proposing and supporting a selectable character set,
   so that users can get all the benefits of using their own language
   without the drawbacks of confusable/unfamiliar characters:

   http://mail.python.org/pipermail/python-3000/2007-May/007890.html
   http://mail.python.org/pipermail/python-3000/2007-May/007896.html
   http://mail.python.org/pipermail/python-3000/2007-May/007935.html
   http://mail.python.org/pipermail/python-3000/2007-May/007950.html
   http://mail.python.org/pipermail/python-3000/2007-May/007977.html
   http://mail.python.org/pipermail/python-3000/2007-May/007957.html
   http://mail.python.org/pipermail/python-3000/2007-May/008038.html
   http://mail.python.org/pipermail/python-3000/2007-June/008121.html


E. Which identifier characters should be allowed?

   1. What to do about bidi format control characters?

      http://mail.python.org/pipermail/python-3000/2007-May/007750.html
      http://mail.python.org/pipermail/python-3000/2007-May/007823.html
      http://mail.python.org/pipermail/python-3000/2007-May/007826.html

   2. What about other ID_Continue characters?  What about characters
      that look like punctuation?  What about other recommendations
      in UTS #39?  What about mixed-script identifiers?

      http://mail.python.org/pipermail/python-3000/2007-May/007836.html


F.  Which normalization form should be used, NFC or NFKC?

    http://mail.python.org/pipermail/python-3000/2007-May/007995.html


G.  Should source code be required to be in normalized form?

    http://mail.python.org/pipermail/python-3000/2007-May/007997.html
    http://mail.python.org/pipermail/python-3000/2007-June/008137.html


-- ?!ng

From python at zesty.ca  Mon Jun  4 13:11:50 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Mon, 4 Jun 2007 06:11:50 -0500 (CDT)
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4662F639.2070806@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
Message-ID: <Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>

On Sun, 3 Jun 2007, "Martin v. Löwis" wrote:
> >> All identifiers are converted into the normal form NFC while parsing;
> >
> > Actually, shouldn't the whole file be converted to NFC, instead of
> > only identifiers? If you have decomposable characters in strings and
> > your editor decides to normalize them to a different form than in the
> > original source, the meaning of the code will change when you save
> > without you noticing anything.
>
> Sure - but how can Python tell whether a non-normalized string was
> intentionally put into the source, or as a side effect of the editor
> modifying it?

It seems to me the simplest thing to do is to require that Python
source files be normalized.  Then the ambiguity just goes away.
Everyone knows what form their files should be in, and if you really
need to construct a non-normalized string, you can do that explicitly
using "\u" notation.


-- ?!ng

From ncoghlan at gmail.com  Mon Jun  4 14:12:35 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 04 Jun 2007 22:12:35 +1000
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <4663EB6E.4080302@trueblade.com>
References: <466310FC.8020707@acm.org> <4663EB6E.4080302@trueblade.com>
Message-ID: <466401B3.3060904@gmail.com>

Eric V. Smith wrote:
>  >         Formatter.ALLOW_LEADING_UNDERSCORES
>  >         Formatter.CHECK_UNUSED_POSITIONAL
>  >         Formatter.CHECK_UNUSED_NAME

> I'm not sure I'm wild about these flags which would have to be or'd
> together, as opposed to discrete parameters.  I realize having a single
> flag field is likely more extensible, but my impression of the
> standard library is a move away from bitfield flags.  Perhaps that's
> only in my own mind, though!
> 
> Also, why put this in the base class at all?  These could all be
> implemented in a derived class (or classes), which would leave the
> base class state-free and therefore without a constructor.

I think the dict/defaultdict cooperative implementation based on the 
__missing__ method is a good guide to follow here. Instead of having 
flags to the constructor, define methods that the base class
invokes to deal with the relevant checks - subclasses can then override 
them as they see fit.

A couple of possible method signatures:

   def allowed_name(self, name):
       "Return True if name is allowed, False otherwise"
       # default implementation returns False if name starts with '_'

   def allow_unused(self, unused_args, unused_kwds):
       "Return True if unused args/names are allowed, False otherwise"
       # default implementation always returns True

Subclasses can then either return False to get a standard 'disallowed' 
exception, or else raise their own exception explicitly.

A few common alternate implementations of the latter method would be:

   def allow_unused(self, unused_args, unused_kwds):
       # All positional arguments must be used
       return not unused_args

   def allow_unused(self, unused_args, unused_kwds):
       # All keyword arguments must be used
       return not unused_kwds

   def allow_unused(self, unused_args, unused_kwds):
       # All arguments must be used
       return not unused_args and not unused_kwds


Cheers,
Nick.


-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From murman at gmail.com  Mon Jun  4 15:04:58 2007
From: murman at gmail.com (Michael Urman)
Date: Mon, 4 Jun 2007 08:04:58 -0500
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
Message-ID: <dcbbbb410706040604y3923d76dl147f01e6c94032dc@mail.gmail.com>

On 6/4/07, Ka-Ping Yee <python at zesty.ca> wrote:
> Many of us value a *predictable* identifier character set.
> Whether "predictable" means ASCII only, or user-selectable, or
> restricted by default, I think we all agree in this sentiment:

As someone who would rather see non-ASCII characters gain equal footing,
even I agree with that sentiment. The rest of your message - stressing
that we should make things easier to understand and the importance of
source code - strikes a very strong chord with me. However to me it
sounds like an argument to allow Unicode identifiers, not one to
prevent them.

I think that's the biggest problem with this exchange. We have similar
goals but disagree about which option does a better job fulfilling
those goals. All the rhetoric from all sides about why the shared
goals are good won't convince anyone of anything new.

The arguments then feel reduced to "Unicode enhances readability" vs.
"Unicode impedes readability" and since clearly it does both, how do
we make the value judgement about which it does more? How do we weigh
the ability to use native language identifiers against the risk that
there will be visually indistinguishable differences introduced?

Michael
-- 
Michael Urman

From showell30 at yahoo.com  Mon Jun  4 15:28:56 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 06:28:56 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <dcbbbb410706040604y3923d76dl147f01e6c94032dc@mail.gmail.com>
Message-ID: <717682.96972.qm@web33510.mail.mud.yahoo.com>


--- Michael Urman <murman at gmail.com> wrote:
> 
> The arguments then feel reduced to "Unicode enhances
> readability" vs.
> "Unicode impedes readability" and since clearly it
> does both, how do
> we make the value judgement about which it does
> more? How do we weigh
> the ability to use native language identifiers
> against the risk that
> there will be visually indistinguishable differences
> introduced?
> 

I think offering some Unicode examples will enhance
the "Unicode enhances readability" argument.  Martin
recently posted a small example program written in
German.  As a German non-reader, I still found it
pretty easy to read, with a little bit of effort. 
Interestingly, the one word that I wasn't able to
translate, even with the help of Babelfish, was the
German word for "insert."  It turns out the thing that
threw me off was that I omitted the umlaut.  That was
a bit of an epiphany for me.

I'd also be interested in actual testimonials from
teachers, Dutch tax lawyers, etc., that they will
embrace this feature.  

I hate to make a decision by majority rule, but I
think there is the argument that you need to weigh the
population of ascii-literate people vs.
ascii-illiterate people. 

(I don't mean ascii-illiterate as any kind of a slam;
I just think that's really the target audience for
this feature.  I am kanji-illiterate, but I am also
not lobbying for any kanji programming languages to be
more ascii-friendly.)

(I also recognize that Guido did get quite a few
testimonials from Unicode users that suggest they
embrace this idea, but I haven't seen much in the last
couple weeks.)

From jimjjewett at gmail.com  Mon Jun  4 16:05:13 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 4 Jun 2007 10:05:13 -0400
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP
	3131)
In-Reply-To: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
References: <fb6fbf560706031827u4c13687t280f80d785c05d83@mail.gmail.com>
	<740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
Message-ID: <fb6fbf560706040705n38097e1aj8dc02faeb9833c49@mail.gmail.com>

On 6/3/07, Björn Lindqvist <bjourne at gmail.com> wrote:

[Most deleted, Stephen Turnbull already answered better than I knew,
let alone could write]

> > The same one-step-at-a-time reasoning applies to unicode identifers.
> > Allowing IDs in your native language (or others that you explicitly
> > approve) is probably a good step.  Allowing IDs in *any* language by
> > default is probably going too far.

> If you set different native languages won't you get the exact same
> problems that codepages caused and that unicode was invented to solve?

Not at all; if anything, it is the opposite.

(1)  Those different code pages were mainly used for text, not
programming logic.  No one has suggested (re-)limiting comments or
even (continuing to limit) strings.

(2)  The biggest problem that I saw in practice was partial overlap;
people would assume WYSIWYG, and the different code pages were close
enough (mostly overlapping in ASCII) that they didn't usually need to
use the same code page -- but then when the differences did bite, they
were harder to notice.

If you happen to use both Sanskrit and Coptic, you can set your own
computer to accept both.  The only catch is that you probably can't
share the Sanskrit with the Coptic community (or vice versa), unless
at least one of the following is true:

    (2a)  The code itself (not comments or strings) is in ASCII, so
both can read it.  Note that this is already the recommended policy for
shared code.

or (2b)  The people you are sharing with trust you enough to add your
script as an acceptable alternate.  (Again, preferably a simple
one-time step -- but an explicit decision.)

or (2c)  The people you are sharing with have already decided to
accept Sanskrit (or Coptic) because other people they trusted were
using it, and said it was safe.


The existence of 2b and 2c relies on the "consenting adults" policy, but
they encourage "informed consent".  I wouldn't be surprised to
discover that Latin-1, Sanskrit, Coptic, and the Japanese characters
were all OK with me.

That still wouldn't mean I want to allow Cyrillic (which carries more
confusable risk).

I already know I don't want to auto-allow the FF10-FF19 (fullwidth
ASCII numbers[1]), simply because I don't see any good
(non-presentational) reason to use them in place of the normal ASCII
numbers -- so the more likely result of using them is confusion.

Adding one script (or character range) at a time lets me add things
that I (or people I trust) think are reasonable.  Turning unicode on
or off with a single blunt switch does not.
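
To make that concrete, here is the sort of per-user check I have in
mind -- a rough sketch only, with coarse code-point ranges standing in
for real script data:

ALLOWED_RANGES = [
    (0x0030, 0x0039),   # ASCII digits
    (0x0041, 0x005A),   # A-Z
    (0x005F, 0x005F),   # underscore
    (0x0061, 0x007A),   # a-z
    (0x0900, 0x097F),   # Devanagari, for the Sanskrit user
    (0x1200, 0x137F),   # Ethiopic
]

def identifier_allowed(name):
    # True only if every character falls in a range I (or someone I
    # trust) have explicitly approved.
    return all(any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)
               for ch in name)

Adding Cyrillic later would then be an explicit, auditable one-line
change rather than a side effect of a global switch.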

-jJ

[1]  Yes, the fullwidth ASCII variants are allowed as ID characters
according to both the unicode ID_* and XID_ properties, which means
they are allowed by the current draft.

From talin at acm.org  Mon Jun  4 18:34:47 2007
From: talin at acm.org (Talin)
Date: Mon, 04 Jun 2007 09:34:47 -0700
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <4663EB44.1010507@trueblade.com>
References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com>
Message-ID: <46643F27.2040804@acm.org>

Eric V. Smith wrote:
>  > Formatter Creation and Initialization
>  >
>  >     The Formatter class takes a single initialization argument, 'flags':
>  >
>  >         Formatter(flags=0)
>  >
>  >     The 'flags' argument is used to control certain subtle behavioral
>  >     differences in formatting that would be cumbersome to change via
>  >     subclassing. The flags values are defined as static variables
>  >     in the "Formatter" class:
>  >
>  >         Formatter.ALLOW_LEADING_UNDERSCORES
>  >
>  >             By default, leading underscores are not allowed in 
> identifier
>  >             lookups (getattr or getitem).  Setting this flag will allow
>  >             this.
>  >
>  >         Formatter.CHECK_UNUSED_POSITIONAL
>  >
>  >             If this flag is set, any positional arguments which are
>  >             supplied to the 'format' method but which are not used by
>  >             the format string will cause an error.
>  >
>  >         Formatter.CHECK_UNUSED_NAME
>  >
>  >             If this flag is set, any named arguments which are
>  >             supplied to the 'format' method but which are not used by
>  >             the format string will cause an error.
> 
> I'm not sure I'm wild about these flags which would have to be or'd
> together, as opposed to discrete parameters.  I realize having a single
> flag field is likely more extensible, but my impression of the
> standard library is a move away from bitfield flags.  Perhaps that's
> only in my own mind, though!

Making them separate fields is fine if that's easier.

Another possibility is to make them setter methods rather than 
constructor params.
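
Just to illustrate the difference (a sketch only; none of these names
are final):

class FlagFormatter(object):                   # bitfield style
    CHECK_UNUSED_POSITIONAL = 1
    CHECK_UNUSED_NAME = 2
    def __init__(self, flags=0):
        self.check_unused_positional = bool(flags & self.CHECK_UNUSED_POSITIONAL)
        self.check_unused_name = bool(flags & self.CHECK_UNUSED_NAME)

class KwargFormatter(object):                  # discrete parameters
    def __init__(self, check_unused_positional=False,
                 check_unused_name=False):
        self.check_unused_positional = check_unused_positional
        self.check_unused_name = check_unused_name

class SetterFormatter(object):                 # setter methods
    def __init__(self):
        self.check_unused_positional = False
        self.check_unused_name = False
    def set_check_unused(self, positional=False, named=False):
        self.check_unused_positional = positional
        self.check_unused_name = named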

> Also, why put this in the base class at all?  These could all be
> implemented in a derived class (or classes), which would leave the
> base class state-free and therefore without a constructor.

My reason for doing this is as follows.

Certain kinds of customizations are pretty easy to do via subclassing. 
For example, supporting a default namespace takes only a few lines of 
code in a subclass.
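
Roughly like this, using the get_named hook from the method list below
(a sketch with a stand-in base class, not final API):

class Formatter(object):          # stand-in for the PEP's base class
    def get_named(self, kwds, name):
        return kwds[name]

class NamespaceFormatter(Formatter):
    # Fall back to a fixed default namespace for names that aren't
    # among the keyword arguments.
    def __init__(self, namespace):
        Formatter.__init__(self)
        self.namespace = namespace

    def get_named(self, kwds, name):
        if name in kwds:
            return kwds[name]
        return self.namespace[name]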

Other kinds of customization require replacing a much larger chunk of 
code. Changing the "underscores" and "check-unused" behavior requires 
overriding 'vformat', which means replacing the entire template string 
parser. I figured that there would be a lot of people who might want 
these features, but didn't want to rewrite all of vformat.

Now, some of this could be resolved by breaking up vformat into a set of 
smaller, overridable functions which controlled these behaviors. 
However, I didn't do this because I didn't want the PEP to micro-manage 
the implementation of vformat - I wanted to leave you guys some leeway 
as to design choices.

For example, I had thought perhaps to break out a separate method that 
would just do the parsing of a replacement field (the part inside the 
brackets) - so in other words, you'd have one function that recognizes 
the start of a replacement field, which then calls a method which 
consumes the contents of that field, and so on. You could also break 
that up into two pieces, one which recognizes the field reference, and 
one which recognizes the conversion string.

However, these various parsing functions aren't entirely isolated from 
each other. The various parsers would need to pass the current parse 
position (character iterator or whatever) and other state back and 
forth. Exposing this requires codifying in the API a lot of the internal 
state of parsing.

Also, the syntax defining the end of a replacement field is a mirror of 
the syntax that starts one, and conversion specs can contain 
replacement fields too. Which means that the various parsing methods 
aren't entirely independent. (Although I think that in your earlier 
proposal, the syntax for 'internal' replacement fields inside conversion 
specifiers was always the same, regardless of the markup syntax chosen.)

What I wanted to avoid in the PEP was having to specify how all of these 
different parts fit together and the exact nature of the parameters 
being passed between them.

And I think that even if we do break up vformat this way, we still end 
up with people having to replace a fairly substantial chunk of code in 
order to change the behaviors represented by these flags.

>  > Formatter Methods
>  >
>  >     The methods of class Formatter are as follows:
>  >
>  >         -- format(format_string, *args, **kwargs)
>  >         -- vformat(format_string, args, kwargs)
>  >         -- get_positional(args, index)
>  >         -- get_named(kwds, name)
>  >         -- format_field(value, conversion)
> 
> I've started a sample implementation to test this API.  For starters,
> I'm writing it in pure Python, but my intention is to use the code in
> the pep3101 sandbox once I have some tests written and we're happy
> with the API.

Cool.

-- Talin

From jcarlson at uci.edu  Mon Jun  4 21:43:13 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Mon, 04 Jun 2007 12:43:13 -0700
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <717682.96972.qm@web33510.mail.mud.yahoo.com>
References: <dcbbbb410706040604y3923d76dl147f01e6c94032dc@mail.gmail.com>
	<717682.96972.qm@web33510.mail.mud.yahoo.com>
Message-ID: <20070604090135.6F26.JCARLSON@uci.edu>


Steve Howell <showell30 at yahoo.com> wrote:
> --- Michael Urman <murman at gmail.com> wrote:
> > 
> > The arguments then feel reduced to "Unicode enhances
> > readability" vs.
> > "Unicode impedes readability" and since clearly it
> > does both, how do
> > we make the value judgement about which it does
> > more? How do we weigh
> > the ability to use native language identifiers
> > against the risk that
> > there will be visually indistinguishable differences
> > introduced?
> > 
> 
> I think offering some Unicode examples will enhance
> the "Unicode enhances readability" argument.  Martin
> recently posted a small example program written in
> German.  As a German non-reader, I still found it
> pretty easy to read, with a little bit of effort. 
> Interestingly, the one word that I wasn't able to
> translate, even with the help of Babelfish, was the
> German word for "insert."  It turns out the thing that
> threw me off was that I omitted the umlaut.  That was
> a bit of an epiphany for me.

Maybe I'm worse with languages than other people are; it wouldn't
surprise me terribly.  I had some difficulty, primarily because I didn't
try to translate it (doing so would be quite difficult with longer
programs and other languages).

Here is some code borrowed right from the Python standard library.  I've
gone ahead and mangled names in a consistent fashion using the tokenize
module.  Can you guess what it does?


class RTrCOlOrB :

    nBBjIUrB =0 

    def __init__ (self ,uX ,nBBjIUrB =1 ):
        self .uX =uX 
        self .nCIZj =[]# KAzWn ezWQ
        self .rBGBr =0 
        self .rInC =0 
        if nBBjIUrB :
            self .nBBjIUrB =1 
            self .nCIAC =self .uX .tell ()
            self .XznnCIZj =[]# KAzWn ezWQ

    def tell (self ):
        if self .rBGBr >0 :
            return self .rInCXzn 
        return self .uX .tell ()-self .nCIAC 

    def nBBj (self ,Xzn ,WDBQZB =0 ):
        DBAB =self .tell ()
        if WDBQZB :
            if WDBQZB ==1 :
                Xzn =Xzn +DBAB 
            elif WDBQZB ==2 :
                if self .rBGBr >0 :
                    Xzn =Xzn +self .rInCXzn 
                else :
                    raise Error ,"ZIQ'C TnB WDBQZB=2 yBC"
        if not 0 <=Xzn <=DBAB or self .rBGBr >0 and Xzn >self .rInCXzn :
            raise Error ,'UIe RTrCOlOrB.nBBj() ZIrr'
        self .uX .seek (Xzn +self .nCIAC )
        self .rBGBr =0 
        self .rInC =0 


> I hate to make a decision by majority rule, but I
> think there is the argument that you need to weigh the
> population of ascii-literate people vs.
> ascii-illiterate people. 

That's a very poor criterion, as not everyone in the world is a potential
programmer (despite what the BASIC folks tried to do). Further, of those
that become programmers in *any* substantial programming language today,
100% of them learn ascii. Even Java, which has been touted here as being
the premier language for allowing unicode identifiers (yes, a bit of
hyperbole), requires ascii to access the java libraries.  This will be
the case for the foreseeable future in *any* programming language of
substantial use worldwide (regardless of what Python does regarding
unicode identifiers).

Since the PEP does not discuss the localization of every name in the
Python standard library (nor the builtins, __magic__ methods, etc.),
people are *still* going to need to learn the latin alphabet, at least
as much to distinguish and use Python keywords, builtins, and the
standard library.


With that said, the only question I believe that really matters in this
discussion is:
 * Where would you use unicode identifiers if they were available in
Python? Open source, closed source, personal projects?

Since everyone needs to learn ascii to use Python anyway, for the
ability to share, ascii will continue to dominate regardless of
potentially substantial closed source and personal project use.  This
has been seen (according to various reports available in this list) in
the Java world*.

As for closed source or personal projects, as long as we offer people
the ability to use unicode identifiers (since PEP 3131 is accepted, this
will happen), I don't see that there is any problem being conservative
in our choice of default. If we discover that ascii defaults are wrong,
we can always add unicode defaults later. The converse is not the case.


As I have stated before: offer people the ability to easily add
character sets that they want to see and allow to execute (I would be
happy to write an internationalizable interactive command-line and
wxPython interface for whatever method we choose), and those who want to
use non-ascii identifiers can do so.


 - Josiah

* There also seems to be a limited amount of information (available to
us) regarding how well known Java unicode identifiers are.  We hear
reports from some that no one knows of unicode identifiers, but then we
hear about closed Java source using them in China and abroad, and Björn
Lindqvist saying that unicode identifiers were mentioned in the two
Swedish Java books he read.


From martin at v.loewis.de  Mon Jun  4 21:56:38 2007
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 04 Jun 2007 21:56:38 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
Message-ID: <46646E76.8060804@v.loewis.de>

> It seems to me the simplest thing to do is to require that Python
> source files be normalized.  Then the ambiguity just goes away.
> Everyone knows what form their files should be in, and if you really
> need to construct a non-normalized string, you can do that explicitly
> using "\u" notation.

However, what would that mean wrt. non-Unicode source encodings.

Say you have a Latin-1-encoded source code. Is that in NFC or not?

Regards,
Martin

From jimjjewett at gmail.com  Mon Jun  4 22:50:09 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 4 Jun 2007 16:50:09 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <46646E76.8060804@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
	<46646E76.8060804@v.loewis.de>
Message-ID: <fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>

On 6/4/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > It seems to me the simplest thing to do is to require that Python
> > source files be normalized.  Then the ambiguity just goes away.
> > Everyone knows what form their files should be in, and if you really
> > need to construct a non-normalized string, you can do that explicitly
> > using "\u" notation.

> However, what would that mean wrt. non-Unicode source encodings.

> Say you have a Latin-1-encoded source code. Is that in NFC or not?

Doesn't that depend on whether they happened to ever write some of the
combined characters (such as ö) using a two-character form, like o plus
a combining diaeresis?

FWIW, I would prefer "the parser will normalize" to "the parser will
reject unnormalized", to support even the dumbest of editors.

-jJ

From martin at v.loewis.de  Mon Jun  4 22:58:11 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 04 Jun 2007 22:58:11 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>	
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>	
	<4662F639.2070806@v.loewis.de>	
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>	
	<46646E76.8060804@v.loewis.de>
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
Message-ID: <46647CE3.8070200@v.loewis.de>

>> Say you have a Latin-1-encoded source code. Is that in NFC or not?
> 
> Doesn't that depend on whether they happened to ever write some of the
> combined characters (such as ö) using a two-character form, like o plus
> a combining diaeresis?

No. Latin-1 does not support that form; the concept does not exist
in that encoding. When converting to an UCS representation, it's
the codec's choice to either produce a pre-composed or decomposed
form.

Regards,
Martin

From dima at hlabs.spb.ru  Mon Jun  4 17:18:38 2007
From: dima at hlabs.spb.ru (Dmitry Vasiliev)
Date: Mon, 04 Jun 2007 19:18:38 +0400
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <46639F47.4070909@v.loewis.de>
References: <315672.7992.qm@web33512.mail.mud.yahoo.com>
	<46639F47.4070909@v.loewis.de>
Message-ID: <46642D4E.807@hlabs.spb.ru>

Martin v. Löwis wrote:
>> Can somebody post a few examples of what Python code
>> would look like under PEP 3131?  Maybe 10-to-15 line
>> programs that illustrate the following use cases.
> 
> class Liste:
>   def __init__(self):
>     self.erstes_element = None
> 
>   def einfügen(self, objekt):
>     if not self.erstes_element:
>       self.erstes_element = Element(objekt)
>     else:
>       zeiger = self.erstes_element
>       while zeiger.nächstes_element:
>         zeiger = zeiger.nächstes_element
>       zeiger.nächstes_element = Element(objekt)
> 
>   def löschen(self, objekt):
>     if self.erstes_element.wert == objekt:
>       self.erstes_element = self.erstes_element.nächstes_element
>     else:
>       zeiger = self.erstes_element
>       while zeiger.nächstes_element:
>         if zeiger.nächstes_element.wert == objekt:
>           zeiger.nächstes_element = \
>             zeiger.nächstes_element.nächstes_element
>           return
>         zeiger = zeiger.nächstes_element

I think the example above isn't so convincing, because except for three 
characters with umlauts it's just plain ASCII, so you can write almost 
the same code in current Python. I guess the following example in 
Russian is more striking:

def итератор_по_токенам_в_строках_файла(имя_файла):
     файл = open(имя_файла, "rb")
     for строка in файл:
         yield строка.split()

While I can understand the code above I have mixed feelings about it, 
but I think it is better than any code written in broken English. Many 
years ago I saw code with functions named 'wright_*', 'writi_*', 
'wrete_*' instead of 'write_*'.

-- 
Dmitry Vasiliev <dima at hlabs.spb.ru>
http://hlabs.spb.ru

From showell30 at yahoo.com  Tue Jun  5 00:28:40 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 15:28:40 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070604090135.6F26.JCARLSON@uci.edu>
Message-ID: <996481.46939.qm@web33502.mail.mud.yahoo.com>


--- Josiah Carlson <jcarlson at uci.edu> wrote:
> Here is some code borrowed right from the Python
> standard library.  I've
> gone ahead and mangled names in a consistent fashion
> using the tokenize
> module.  Can you guess what it does?
> 
> 
> class RTrCOlOrB :
> 
>     nBBjIUrB =0 
> 
>     def __init__ (self ,uX ,nBBjIUrB =1 ):
>         self .uX =uX 
>         self .nCIZj =[]# KAzWn ezWQ
>         self .rBGBr =0 
>         self .rInC =0 
>         if nBBjIUrB :
>             self .nBBjIUrB =1 
>             self .nCIAC =self .uX .tell ()
>             self .XznnCIZj =[]# KAzWn ezWQ
> 
>     [...]

At first glance, no, although obviously it has
something to do with randomly accessing a file.

If I were trying to reverse engineer this code back to
English, the first thing I'd do is use tokenize to
mangle the tokens back to consistent, easy to
pronounce, relatively meaningless English words like
aardvark, bobble, dog_chow, fredness, parplesnarper,
etc., as XznnCIZj doesn't have even a false cognate to
hook on to in my brain.

From greg.ewing at canterbury.ac.nz  Tue Jun  5 01:39:34 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 05 Jun 2007 11:39:34 +1200
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <471885.91417.qm@web33503.mail.mud.yahoo.com>
References: <471885.91417.qm@web33503.mail.mud.yahoo.com>
Message-ID: <4664A2B6.903@canterbury.ac.nz>

Steve Howell wrote:

>     einfugen = in joints (????)

Maybe "join in" (as a verb)?

--
Greg

From greg.ewing at canterbury.ac.nz  Tue Jun  5 01:50:13 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 05 Jun 2007 11:50:13 +1200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <717682.96972.qm@web33510.mail.mud.yahoo.com>
References: <717682.96972.qm@web33510.mail.mud.yahoo.com>
Message-ID: <4664A535.1070505@canterbury.ac.nz>

Steve Howell wrote:
> the one word that I wasn't able to
> translate, even with the help of Babelfish, was the
> German word for "insert."  It turns out the thing that
> threw me off was that I omitted the umlaut.

Although that probably wouldn't be such a big problem
for a native German speaker, who I guess would still
be able to recognise what was meant.

--
Greg

From showell30 at yahoo.com  Tue Jun  5 03:34:12 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 18:34:12 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <4664A535.1070505@canterbury.ac.nz>
Message-ID: <529517.11234.qm@web33502.mail.mud.yahoo.com>


--- Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:

> Steve Howell wrote:
> > the one word that I wasn't able to
> > translate, even with the help of Babelfish, was
> the
> > German word for "insert."  It turns out the thing
> that
> > threw me off was that I omitted the umlaut.
> 
> Although that probably wouldn't be such a big
> problem
> for a native German speaker, who I guess would still
> be able to recognise what was meant.
> 

Sure, but my point was not so much whether the umlaut
improved clarity for German readers; my point was that
it would also improve clarity for non-German readers
aided by Babelfish.  But I do think the experiment of
me reading the German code was weakened by the
similarity of German to English; plus, the code was
small enough that the intent of the code was just
plain obvious from the overall logical structure.

From showell30 at yahoo.com  Tue Jun  5 04:33:46 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 19:33:46 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070604090135.6F26.JCARLSON@uci.edu>
Message-ID: <762386.78677.qm@web33515.mail.mud.yahoo.com>


--- Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> > I hate to make a decision by majority rule, but I
> > think there is the argument that you need to weigh
> the
> > population of ascii-literate people vs.
> > ascii-illiterate people. 
> 
> That's a very poor criterion, as not everyone in the
> world is a potential
> programmer (despite what the BASIC folks tried to
> do). 

I didn't think that I needed to call out the criterion, for both
groups, that potential Python programmers need the aptitude/desire
to learn programming in general, but of course you're correct.

> 
> Since the PEP does not discuss the localization of
> every name in the
> Python standard library (nor the builtins, __magic__
> methods, etc.),
> people are *still* going to need to learn the latin
> alphabet, at least
> as much to distinguish and use Python keywords,
> builtins, and the
> standard library.
> 

I agree with that 100%.  Unless you internationalize
Python completely for certain languages [1], I think
anybody coming to Py3K, even with PEP 3131 accepted, 
will still need first semester familiarity with
English, or at least an English-like language, to be
able to use Python effectively.

In certain parts of the United States we have the
concept of "restaurant Spanish"  that native English
speakers need to learn when they wait tables.  I think
there's something like "Python English" that you need
to learn to start writing Python, and it's a pretty
small subset of the whole language, but the alphabet's
a pretty key part of it.

Cheers,

Steve

[1] - ...but regarding fully internationalizing Python
in Asia, see this post from Ryan Ginstrom
(Japanese-to-English translator):

http://mail.python.org/pipermail/python-list/2007-June/443862.html

From jimjjewett at gmail.com  Tue Jun  5 04:37:31 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 4 Jun 2007 22:37:31 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
Message-ID: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>

Ligatures, such as Ĳ and ĳ (unicode 0x0132, 0x0133) are considered
acceptable identifier characters unless explicitly tailored out.
(They appear in both ID and XID)

Do we really want this, or should we assume that ĳ and ij should be
equivalent?  If so, then we need to enforce this somehow.

To me, this suggests that we should use the NFKD form.  Examples at
http://www.unicode.org/reports/tr15/tr15-28.html show that only the
Decomposition forms split ﬁ (ligature 0xFB01) into the constituents f
and i.  Kompatibility form is needed to merge characters that are "the
same" except for some presentational quirk, such as being
superscripted or half-width.
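
This is easy to check with today's unicodedata module:

import unicodedata
print(unicodedata.normalize('NFD', '\ufb01'))   # the single ligature survives
print(unicodedata.normalize('NFKD', '\ufb01'))  # 'fi' -- split in two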

The PEP assumes NFC, but I haven't really understood why, unless that
is required for compatibility with other systems (in which case, it
should be made explicit).

-jJ

From talin at acm.org  Tue Jun  5 04:45:51 2007
From: talin at acm.org (Talin)
Date: Mon, 04 Jun 2007 19:45:51 -0700
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
Message-ID: <4664CE5F.3040204@acm.org>

Ka-Ping Yee wrote:
> Hi,
> 
> Here's a summary of some of the remaining open issues and unaddressed
> arguments regarding PEP 3131.  These are the ones I'm familiar with,
> so I don't claim this to be complete.  I hope it helps give some
> perspective on this huge thread, though.

Thanks so much for this excellent roundup from the RoundUp Master :)

Seriously, I've been staying well away from the PEP 3131 threads, and I 
was hoping that someone would post a summary of the issues so I could 
catch up.

I'd like to make a couple of modest proposals on the PEP 3131 issue that 
I'm hoping will short-circuit some parts of this discussion.

1) My first proposal is that someone - one of the PEP 3131 advocates 
probably - create a set of patches, or possibly a branch, that 
implements unicode identifiers in whatever manner they think is 
appropriate. Write some actual code instead of just talking about it.

This fork will consist of a Python interpreter with a different name - 
lets call it 'upython' for 'unicode python'.

These same PEP 3131 advocates should also distribute precompiled 
packages containing the upython interpreter. For simplicity, it is OK to 
assume that regular Python is already installed as a prerequisite.

The 'upython' interpreter can live in the same binary directory as 
regular python. The students who want to learn Python with Japanese 
identifiers can easily be taught to run 'upython' instead of 'python'. 
Since upython runs regular python scripts, they still have access to all 
of the regular python libraries and extension modules.

Once upython becomes available to the public, it will be the goal of the 
3131 advocates to get widespread adoption of upython. If there is much 
adoption, then that makes a strong argument for merging those features 
into regular python. On the other hand, if there is little adoption, 
then that's an argument to either maintain it as a fork, or drop it 
altogether.

In other words - instead of endless discussions of hypotheticals, let 
people vote with their feet. Because I can already tell that as far as 
this mailing list goes, there will never be a consensus on this issue, 
due to basic value differences.

2) My second proposal is to drop all discussions of bidirectional 
support, since I think it's a red herring. So far, I haven't heard 
anyone whose native language is RTL lobbying for support of their 
language. Most of the vocal proponents of 3131 have been mainly 
concerned with asian languages. The people who are mainly bringing up 
the issue of Bidi are the people arguing against 3131, using it as the 
basis of an "excluded middle" argument that says that since its too 
difficult to do Bidi properly, then it's too difficult to do unicode 
identifiers.

Yes, it may be technically "unfair" to certain ethnic groups to not 
support Bidi, but frankly, I don't see why the python-dev community has 
to solve all of the world's problems in one go.

I would even go so far as to say that it's OK to drop support for any 
languages that are "hard to do".

(Note that I've done a fair bit of work supporting Bidi in my previous 
job, so I at least have a passing familiarity with the issues involved.)

-- Talin

From foom at fuhm.net  Tue Jun  5 04:48:26 2007
From: foom at fuhm.net (James Y Knight)
Date: Mon, 4 Jun 2007 22:48:26 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070604090135.6F26.JCARLSON@uci.edu>
References: <dcbbbb410706040604y3923d76dl147f01e6c94032dc@mail.gmail.com>
	<717682.96972.qm@web33510.mail.mud.yahoo.com>
	<20070604090135.6F26.JCARLSON@uci.edu>
Message-ID: <EB3ABC3A-78F3-4FEA-A38F-F1EDB105E24E@fuhm.net>

On Jun 4, 2007, at 3:43 PM, Josiah Carlson wrote:
> Here is some code borrowed right from the Python standard library.   
> I've
> gone ahead and mangled names in a consistant fashion using the  
> tokenize
> module.  Can you guess what it does?

Nope, it's absolutely inscrutable. And actually, after I found this  
module in the stdlib and read the same excerpt in english, I *still*  
couldn't figure out what it was doing. (it's in multifile.py, btw).  
Of course, the given excerpt doesn't really do anything useful  
(without the rest of the class), which doesn't help things.

Anyhow, if it was in a human language, I'd paste it into an online  
translator.

e.g. from another recent message:
> def итератор_по_токенам_в_строках_файла 
> (имя_файла):
>      файл = open(имя_файла, "rb")
>      for строка in файл:
>          yield строка.split()

pasted verbatim right into google translator results in:

> def iterator_po_tokenam_v_strokah_fayla (filename) : file = open  
> (filename, "rb") for strings in the file : stroka.split yield ()

Not entirely successful -- it's not built to translate code, of  
course. :)

Let's try some of those phrases again:
"???????? ?? ??????? ? ??????? ?????" ->  
"standard for token lines in the file". Hm, I liked "iterator" better  
than "standard" there, but okay. so, this is supposed to iterate  
tokens from lines in a file. Okay.
"??????" -> "line".

All right, I think I've got it. In fact, translation is *much*  
*easier* when the code in the other language is spelled with the  
proper characters of that language, instead of some random  
romanization. I'd have extremely little hope of being able to convert  
a romanization of russian into real russian in order to be able to  
translate it into english.

So, all things considered, allowing russian identifiers is a huge  
plus for my ability to read russian code. +1.

James

From stephen at xemacs.org  Tue Jun  5 05:53:16 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 05 Jun 2007 12:53:16 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070604090135.6F26.JCARLSON@uci.edu>
References: <dcbbbb410706040604y3923d76dl147f01e6c94032dc@mail.gmail.com>
	<717682.96972.qm@web33510.mail.mud.yahoo.com>
	<20070604090135.6F26.JCARLSON@uci.edu>
Message-ID: <87lkezxdjn.fsf@uwakimon.sk.tsukuba.ac.jp>

Josiah Carlson writes:

 > gone ahead and mangled names in a consistant fashion using the tokenize
 > module.  Can you guess what it does?

OK, here's your straight line:

Throw a lot of "AttributeError: rInCXzn is not defined"?


From martin at v.loewis.de  Tue Jun  5 06:10:32 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 05 Jun 2007 06:10:32 +0200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
Message-ID: <4664E238.9020700@v.loewis.de>

> The PEP assumes NFC, but I haven't really understood why, unless that
> is required for compatibility with other systems (in which case, it
> should be made explicit).

It's because UAX#31 tells us to use NFC, in section 5

"Generally if the programming language has case-sensitive identifiers,
then Normalization Form C is appropriate; whereas, if the programming
language has case-insensitive identifiers, then Normalization Form KC is
more appropriate."

As Python has case-sensitive identifiers, NFC is appropriate.
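
The analogy is that NFKC folds compatibility variants together much as
case-insensitivity folds case; for example:

import unicodedata
print(unicodedata.normalize('NFC', '\u2075'))   # SUPERSCRIPT FIVE, unchanged
print(unicodedata.normalize('NFKC', '\u2075'))  # '5', folded to the plain digit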

Regards,
Martin

From rauli.ruohonen at gmail.com  Tue Jun  5 07:21:37 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 5 Jun 2007 08:21:37 +0300
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
	<46646E76.8060804@v.loewis.de>
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
Message-ID: <f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>

On 6/4/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 6/4/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > However, what would that mean wrt. non-Unicode source encodings.
>
> > Say you have a Latin-1-encoded source code. Is that in NFC or not?

The path of least surprise for legacy encodings might be for
the codecs to produce whatever is closest to the original encoding
if possible. I.e. what was one code point would remain one code
point, and if that's not possible then normalize. I don't know if
this is any different from always normalizing (it certainly is
the same for Latin-1).

Always normalizing would have the advantage of simplicity (no matter
what the encoding, the result is the same), and I think that is
the real path of least surprise if you sum over all surprises.

> FWIW, I would prefer "the parser will normalize" to "the parser will
> reject unnormalized", to support even the dumbest of editors.

Me too, as simple open-save in a dumb editor wouldn't change the
semantics of the code, and if any edits are made where the user
expects for some reason that normalization is not done then the first
trial run will immediately disabuse them of this notion. The behavior
is simple to infer and reliable (at least for "always normalize").

FWIW, I looked at what Java and XML 1.1 do, and they *don't* normalize
for some reason. Java doesn't even normalize identifiers AFAICS; it's
not even mentioned at
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html
and they even process escapes very early (those should certainly not
be normalized, as escapes are the Word of Programmer and meddling
with them will incur holy wrath).

XML 1.1 says this:

:XML processors MUST NOT transform the input to be in fully normalized
:form. XML applications that create XML 1.1 output from either XML 1.1
:or XML 1.0 input SHOULD ensure that the output is fully normalized;
:it is not necessary for internal processing forms to be fully
:normalized.
:
:The purpose of this section is to strongly encourage XML processors
:to ensure that the creators of XML documents have properly normalized
:them, so that XML applications can make tests such as identity
:comparisons of strings without having to worry about the different
:possible "spellings" of strings which Unicode allows.
:
:When entities are in a non-Unicode encoding, if the processor
:transcodes them to Unicode, it SHOULD use a normalizing transcoder.

I do not know why they've done this, but XML 1.0 does not mention
normalization at all, so perhaps they felt normalization would be
too big a change. Some random comments I read mentioned that XML 1.1
is supposed to be independent of changes to Unicode and normalization
may change for new code points in new versions, and some said that
the unavailability of normalizers to implementors would be a reason.
Verification is specified in XML 1.1, though:

:However, a document is still well-formed even if it is not fully
:normalized. XML processors SHOULD provide a user option to verify
:that the document being processed is in fully normalized form, and
:report to the application whether it is or not. The option to not
:verify SHOULD be chosen only when the input text is certified, as
:defined by B Definitions for Character Normalization.

Note that all this applies after character entity (=escape)
replacement, and applies also to what passes for "identifiers"
in XML documents.

I still think simply always normalizing the whole source code file
to NFC before any processing would be the right thing to do :-)
I'm not sure about processing of text files in Python code; it's
certainly easy to do the normalization yourself. Still, it's probably
what's wanted in most cases where line separators are normalized.

From rauli.ruohonen at gmail.com  Tue Jun  5 08:16:19 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 5 Jun 2007 09:16:19 +0300
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <4664CE5F.3040204@acm.org>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
	<4664CE5F.3040204@acm.org>
Message-ID: <f52584c00706042316u654533a1j739cf836401d7d63@mail.gmail.com>

On 6/5/07, Talin <talin at acm.org> wrote:
> Thanks so much for this excellent roundup from the RoundUp Master :)
> Seriously, I've been staying well away from the PEP 3131 threads, and I
> was hoping that someone would post a summary of the issues so I could
> catch up.

I agree that the roundup is excellent, but it fails to mention
a couple of things, the most important of which is that PEP 3131
has already been accepted. All the discussion is about details such
as what's the default, what the normalization should be, etc. A
fork is therefore not necessary.

From jcarlson at uci.edu  Tue Jun  5 09:15:23 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Tue, 05 Jun 2007 00:15:23 -0700
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <4664CE5F.3040204@acm.org>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
	<4664CE5F.3040204@acm.org>
Message-ID: <20070604235811.6F2B.JCARLSON@uci.edu>


Talin <talin at acm.org> wrote:
> In other words - instead of endless discussions of hypotheticals, let 
> people vote with their feet. Because I can already tell that as far as 
> this mailing list goes, there will never be a consensus on this issue, 
> due to basic value differences.

If the underlying runtime were written to handle unicode identifiers,
the Python runtime could be easily modified to discern the command used
to execute it.  Alternatively, if we went with a command-line option,
Python could easily ship with a script called 'upython' (on *nix,
upython.bat on Windows) that automatically runs python with the proper
arguments.
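
The wrapper itself would be trivial -- something like this, where the
option name is of course hypothetical:

#!/usr/bin/env python
# upython: run the regular interpreter with unicode identifiers enabled.
import os, sys
os.execvp('python', ['python', '--unicode-identifiers'] + sys.argv[1:])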


> 2) My second proposal is to drop all discussions of bidirectional 
> support, since I think it's a red herring. So far, I haven't heard 
> anyone whose native language is RTL lobbying for support of their 
> language. Most of the vocal proponents of 3131 have been mainly 
> concerned with asian languages. The people who are mainly bringing up 
> the issue of Bidi are the people arguing against 3131, using it as the 
> basis of an "excluded middle" argument that says that since its too 
> difficult to do Bidi properly, then it's too difficult to do unicode 
> identifiers.

While there has been discussion about how to handle bidi issues, I don't
believe I've read anything saying "since bidi is hard, lets not do
unicode at all".

 - Josiah


From martin at v.loewis.de  Tue Jun  5 09:54:30 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 09:54:30 +0200
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <4664CE5F.3040204@acm.org>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
	<4664CE5F.3040204@acm.org>
Message-ID: <466516B6.8060005@v.loewis.de>

> 1) My first proposal is that someone - one of the PEP 3131 advocates 
> probably - create a set of patches, or possibly a branch, that 
> implements unicode identifiers in whatever manner they think is 
> appropriate. Write some actual code instead of just talking about it.

I'm working on that. I want to base it on the py3k-struni branch,
where identifiers need to become Unicode (string) objects first
before this can be implemented.

Completing that will likely take several weeks.

> These same PEP 3131 advocates should also distribute precompiled 
> packages containing the upython interpreter. For simplicity, it is OK to 
> assume that regular Python is already installed as a prerequisite.

That will likely not work, as the 3k interpreter will probably break
with a 2.x installation.

> Once upython becomes available to the public, it will be the goal of the 
> 3131 advocates to get widespread adoption of upython. If there is much 
> adoption, then that makes a strong argument for merging those features 
> into regular python. On the other hand, if there is little adoption, 
> then that's an argument to either maintain it as a fork, or drop it 
> altogether.

That really isn't necessary. The PEP is already approved, so the feature
will be implemented in Python 3.

Regards,
Martin


From martin at v.loewis.de  Tue Jun  5 10:02:37 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 05 Jun 2007 10:02:37 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>	<4662F639.2070806@v.loewis.de>	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>	<46646E76.8060804@v.loewis.de>	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>
Message-ID: <4665189D.4020301@v.loewis.de>

> The path of least surprise for legacy encodings might be for
> the codecs to produce whatever is closest to the original encoding
> if possible. I.e. what was one code point would remain one code
> point, and if that's not possible then normalize. I don't know if
> this is any different from always normalizing (it certainly is
> the same for Latin-1).

Depends on the normalization form. For Latin 1, the straight-forward
codec produces output that is not in NFKC, as MICRO SIGN should get
normalized to GREEK SMALL LETTER MU. However, it is normalized under
NFC.
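
For example:

import unicodedata
micro = b'\xb5'.decode('latin-1')                     # U+00B5 MICRO SIGN
print(unicodedata.normalize('NFC', micro) == micro)   # True: NFC keeps it
print(unicodedata.normalize('NFKC', micro))           # U+03BC GREEK SMALL LETTER MU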

Not sure about other codecs; for the CJK ones, I would expect to
see all sorts of issues.

> Always normalizing would have the advantage of simplicity (no matter
> what the encoding, the result is the same), and I think that is
> the real path of least surprise if you sum over all surprises.

I'd like to repeat that this is out of scope of this PEP, though.
This PEP doesn't, and shouldn't, specify how string literals get
from source to execution.

> FWIW, I looked at what Java and XML 1.1 do, and they *don't* normalize
> for some reason.

For XML, I believe the reason is performance. It is *fairly* expensive
to compute NFC in the general case, and I'm as yet uncertain what a good
way would be to reduce execution cost in the "common case" (i.e.
data is already in NFC). For XML, enforcing this performance hit on
top of the already costly processing of XML would be unacceptable.

Regards,
Martin


From stephen at xemacs.org  Tue Jun  5 11:19:03 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 05 Jun 2007 18:19:03 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <4664E238.9020700@v.loewis.de>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
Message-ID: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > > The PEP assumes NFC, but I haven't really understood why, unless that
 > > is required for compatibility with other systems (in which case, it
 > > should be made explicit).

"Martin v. L?wis" writes:

 > It's because UAX#31 tells us to use NFC, in section 5
 > 
 > "Generally if the programming language has case-sensitive identifiers,
 > then Normalization Form C is appropriate; whereas, if the programming
 > language has case-insensitive identifiers, then Normalization Form KC is
 > more appropriate."
 > 
 > As Python has case-sensitive identifiers, NFC is appropriate.

It seems to me that what UAX#31 is saying is "Distinguishing (or not)
between 0035 DIGIT FIVE and 2075 SUPERSCRIPT FIVE should be equivalent
to distinguishing (or not) between LATIN CAPITAL LETTER A and LATIN
SMALL LETTER A."  I don't know that I agree (or disagree) in principle.

Here's what UAX#15 has to say:

----------------
Normalization Forms KC and KD must not be blindly applied to arbitrary
text. Because they erase many formatting distinctions, they will
prevent round-trip conversion to and from many legacy character sets,
and unless supplanted by formatting markup, they may remove
distinctions that are important to the semantics of the text. It is
best to think of these Normalization Forms as being like uppercase or
lowercase mappings: useful in certain contexts for identifying core
meanings, but also performing modifications to the text that may not
always be appropriate. They can be applied more freely to domains with
restricted character sets, such as in Section 13,  Programming
Language Identifiers.
----------------

Note that Section 13 == UAX#31 (from which Martin is quoting).  I
don't see this section as being at all supportive of NFC over NFKC,
though.

Some detailed observations biased by my personal tastes:

It seems to me that while I sometimes find it useful for FOO and
foo to be different identifiers, I would almost always consider R3RS
and R³RS to be the same identifier.  The contrast is just too small to
be useful.  And I would never distinguish between a three-character
ﬁne (ﬁ - n - e) and a four-character fine (f - i - n - e).  I'd
really love to see the printer's ligatures gone.

I'd love to get rid of full-width ASCII and halfwidth kana (via
compatibility decomposition).  Native Japanese speakers often use them
interchangably with the "proper" versions when correcting typos and
updating numbers in a series.  Ugly, to say the least.  I don't think
that native Japanese would care, as long as the decomposition is done
internally to Python.

A scan of the full table for Unicode Version 2.0 (what I have here in
print) suggests that problematic decompositions actually are
restricted to only a few scripts.  LATIN (CAPITAL|SMALL) LETTER L WITH
MIDDLE DOT (used in Catalan, cf sec. 5.1 of UAX#31) are compatibility
decompositions, unlike almost all other Latin decompositions (which
are canonical, and thus get recomposed in NFKC).  'n (Afrikaans), and
a half-dozen Croatian digraphs corresponding to Serbian Cyrillic would
get lost.  The Koreans would lose a truckload of partially composed
Hangul and some archaic ones, the Arabic speakers their presentation
forms.  And that's about it (but I may have missed a bunch because
that database doesn't give the character classes, so I guessed for
stuff like technical symbols -> not ID characters).

I suspect that as long as they have the precomposed Hangul, partial-
syllable "ligature" forms won't be an issue for Koreans.  I can't even
distinguish the archaic versions from their compatibility equivalents
by eye, although I'm comfortable with pronouncing Hangul.  I have no
opinion on the Latin decompositions mentioned above or the Arabic
presentation forms.

However, of the ones I can judge to some extent (Latin printer's
ligatures, width variants, non-syllabic precomposed Korean Jamo), *not
one* of the compatibility decompositions would be a loss in my
opinion. On the other hand, there are a bunch of cases where NFKC
would be a marked improvement.


From rauli.ruohonen at gmail.com  Tue Jun  5 13:06:53 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 5 Jun 2007 14:06:53 +0300
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706050406o63dc9427ub4b7cae7a2451391@mail.gmail.com>

On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> I'd love to get rid of full-width ASCII and halfwidth kana (via
> compatibility decomposition).

If you do forbid compatibility characters in identifiers, then they
should be flagged as an error, not converted silently. NFC, on the
other hand, should be applied silently. The reason is that character
equivalence is the same thing as binary equivalence of the NFC form in
Unicode, and adding extra equivalences (whether it's "FoO" == "foo",
"??" == "??" or "????" == "A123") is surprising.

In short, I would like this function to return 'OK' or be a
syntax error, but it should not fail or return something else:

def test():
    if 'A' == 'Ａ': return 'OK'
    A = 'O'
    Ａ = 'K' # as tested above, 'A' and 'Ａ' are not the same thing
    return locals()['A']+locals()['Ａ']

Note that 'A' == 'Ａ' should be false (no automatic NFKC for strings,
please).

From ncoghlan at gmail.com  Tue Jun  5 16:32:31 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 06 Jun 2007 00:32:31 +1000
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <46643F27.2040804@acm.org>
References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com>
	<46643F27.2040804@acm.org>
Message-ID: <466573FF.8060001@gmail.com>

Talin wrote:
> What I wanted to avoid in the PEP was having to specify how all of these 
> different parts fit together and the exact nature of the parameters 
> being passed between them.
> 
> And I think that even if we do break up vformat this way, we still end 
> up with people having to replace a fairly substantial chunk of code in 
> order to change the behaviors represented by these flags.

If you make the methods to be overridden simple stateless queries with a 
True/False return like the two I suggested in my other message, then it 
becomes easy to tailor these behaviours without replacing the whole parser.

For cases where changing the behaviour of those cases isn't enough then 
you would still have the option of completely overriding vformat.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From eric+python-dev at trueblade.com  Tue Jun  5 17:16:14 2007
From: eric+python-dev at trueblade.com (Eric V. Smith)
Date: Tue, 05 Jun 2007 11:16:14 -0400
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <46643F27.2040804@acm.org>
References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com>
	<46643F27.2040804@acm.org>
Message-ID: <46657E3E.2000508@trueblade.com>

Talin wrote:
> Other kinds of customization require replacing a much larger chunk of 
> code. Changing the "underscores" and "check-unused" behavior requires 
> overriding 'vformat', which means replacing the entire template string 
> parser. I figured that there would be a lot of people who might want 
> these features, but didn't want to rewrite all of vformat.

Actually you only have to replace get_positional or get_named, I think.

And I don't see how the "check-unused" behavior can be written in the 
base class, in the presence of get_positional and get_named.  If the 
list of identifiers isn't known to the base class (as in your example of 
NamespaceFormatter), then how can the base class know if they're all used?

>> I've started a sample implementation to test this API.  For starters,
>> I'm writing it in pure Python, but my intention is to use the code in
>> the pep3101 sandbox once I have some tests written and we're happy
>> with the API.
> 
> Cool.

I think we'll know more when I've made some more progress on this.

Eric.

From jimjjewett at gmail.com  Tue Jun  5 17:18:48 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 11:18:48 -0400
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <20070604235811.6F2B.JCARLSON@uci.edu>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
	<4664CE5F.3040204@acm.org> <20070604235811.6F2B.JCARLSON@uci.edu>
Message-ID: <fb6fbf560706050818k2361d9f8jeec358f490c64222@mail.gmail.com>

On 6/5/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> Talin <talin at acm.org> wrote:

> > I haven't heard anyone whose native language is RTL
> > lobbying for support of their language.

...

> I don't believe I've read anything saying "since bidi is hard,
> lets not do unicode at all".

Not in those exact words, but Tomer did say, effectively

    bidi is hard -- probably too hard to get right yet.
    The current situation is better than rushing it.
    It wouldn't be fair to add support for some languages, but to exclude his.

Note, though, that this objection is really only to "unicode as a
single-switch".

It doesn't argue against letting individuals (or system admins or
local redistributors) add one script at a time for local use, and
letting each language community work things out for themselves.

I expect the issues to be settled more easily in Swedish than in
Hebrew or Arabic, but they'll both be supported to the extent that
they *can* use their letters if they work out a local agreement on
reasonable limits.  (Also note that Arabic and probably Hebrew have
additional issues to work out beyond bidi, such as whether to allow
certain presentational forms.  The unicode consortium recommends
against them, but they are still included in the ID_ group, as they
are technically letters.)

-jJ

From jimjjewett at gmail.com  Tue Jun  5 17:37:48 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 11:37:48 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4665189D.4020301@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
	<46646E76.8060804@v.loewis.de>
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>
	<4665189D.4020301@v.loewis.de>
Message-ID: <fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>

On 6/5/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > Always normalizing would have the advantage of simplicity (no
> > matter what the encoding, the result is the same), and I think
> > that is the real path of least surprise if you sum over all
> > surprises.

> I'd like to repeat that this is out of scope of this PEP, though.
> This PEP doesn't, and shouldn't, specify how string literals get
> from source to execution.

I see that as a gray area.

Unicode does say pretty clearly that (at least) canonical equivalents
must be treated the same.

In theory, this could be done only to identifiers, but then it needs
to be done inline for getattr.

Since we don't want the results of (str1 == str2) to change based on
context, I think string equality also needs to look at canonicalized
(though probably not compatibility) forms.  This in turn means that
hashing a unicode string should first canonicalize it.  (I believe
that is a change from 2.x.)

This means that all literal unicode characters are subject to
normalization unless they appear in a comment.

At that point, it might be simpler to just canonicalize the whole
source file up front.
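
To make that concrete (a quick stdlib illustration of my own, not
part of any proposal):

    import unicodedata

    s1 = "caf\u00e9"    # precomposed: U+00E9
    s2 = "cafe\u0301"   # decomposed: 'e' + U+0301 COMBINING ACUTE ACCENT
    print(s1 == s2)     # False today, though canonically equivalent
    print(unicodedata.normalize("NFC", s1) ==
          unicodedata.normalize("NFC", s2))    # True
    print(hash(s1) == hash(s2))                # False today as well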

-jJ

From talin at acm.org  Tue Jun  5 18:01:39 2007
From: talin at acm.org (Talin)
Date: Tue, 05 Jun 2007 09:01:39 -0700
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <466573FF.8060001@gmail.com>
References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com>
	<46643F27.2040804@acm.org> <466573FF.8060001@gmail.com>
Message-ID: <466588E3.4020600@acm.org>

Nick Coghlan wrote:
> Talin wrote:
>> What I wanted to avoid in the PEP was having to specify how all of 
>> these different parts fit together and the exact nature of the 
>> parameters being passed between them.
>>
>> And I think that even if we do break up vformat this way, we still end 
>> up with people having to replace a fairly substantial chunk of code in 
>> order to change the behaviors represented by these flags.
> 
> If you make the methods to be overridden simple stateless queries with a 
> True/False return like the two I suggested in my other message, then it 
> becomes easy to tailor these behaviours without replacing the whole parser.
> 
> For cases where changing the behaviour of those cases isn't enough then 
> you would still have the option of completely overriding vformat.

I don't have a problem with this approach either.
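Something like the following, say (hook names invented here just to
make the shape concrete, not proposed spellings):

    class Formatter:
        # stateless policy queries; override these without
        # having to touch the parser itself
        def allow_leading_underscores(self):
            return False

        def check_unused_args(self):
            return True

    class LenientFormatter(Formatter):
        def allow_leading_underscores(self):
            return True

        def check_unused_args(self):
            return False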

-- Talin

From talin at acm.org  Tue Jun  5 18:15:23 2007
From: talin at acm.org (Talin)
Date: Tue, 05 Jun 2007 09:15:23 -0700
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <46657E3E.2000508@trueblade.com>
References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com>
	<46643F27.2040804@acm.org> <46657E3E.2000508@trueblade.com>
Message-ID: <46658C1B.8090008@acm.org>



Eric V. Smith wrote:
> Talin wrote:
>> Other kinds of customization require replacing a much larger chunk of 
>> code. Changing the "underscores" and "check-unused" behavior requires 
>> overriding 'vformat', which means replacing the entire template string 
>> parser. I figured that there would be a lot of people who might want 
>> these features, but didn't want to rewrite all of vformat.
> 
> Actually you only have to replace get_positional or get_named, I think.

I don't think that people writing replacements for get_positional/named 
should have to reimplement the checking code. I'd like for them to worry 
only about accessing values, and leave the usage checking out of it.

> And I don't see how the "check-unused" behavior can be written in the 
> base class, in the presence of get_positional and get_named.  If the 
> list of identifiers isn't known to the base class (as in your example of 
> NamespaceFormatter), then how can the base class know if they're all used?

Because the checking only applies to arguments that are explicitly 
passed in to vformat(). It never applies to the default namespace.

Think of it this way: Would you consider it an error if the format 
string failed to refer to every global variable? Of course not. The 
default namespace is open-ended, whereas the positional and keyword 
arguments to vformat are a bounded set. So vformat can know exactly 
which arguments are and aren't used.

The checking code is, I think, relatively simple:

    checked_args = set()
    if checking_positional:
       checked_args.update(range(len(positional)))
    if checking_named:
       checked_args.update(kwds.iterkeys())

    # now parse the template string, removing from the set
    # any arg names/indices that are referred to.

    if checked_args:  # If set non-empty
       # error

The code to populate the set of checked args could be in an overridable 
method, as suggested by Nick Coghlan. This method could simply return 
the set of args to check or None if checking is turned off.

The other way to do it would be to always build the set of 'used' names, 
and then call the method afterwards to do a set.difference operation. 
However, this means you always build a set even if you aren't checking, 
whereas with the first method you can skip creating the set if checking 
is turned off.
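
To sketch the hook itself (the method name is a placeholder, not a
proposal for the actual API):

    class Formatter:
        def args_to_check(self, positional, kwds):
            # Return the set of argument keys to verify as used,
            # or None to turn checking off entirely.
            return set(range(len(positional))) | set(kwds)

    class NoCheckFormatter(Formatter):
        def args_to_check(self, positional, kwds):
            return None    # checking disabled; no set is ever built

vformat would then discard each referenced name or index from that
set as it parses, and raise an error if anything is left at the end.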

>>> I've started a sample implementation to test this API.  For starters,
>>> I'm writing it in pure Python, but my intention is to use the code in
>>> the pep3101 sandbox once I have some tests written and we're happy
>>> with the API.
>>
>> Cool.
> 
> I think we'll know more when I've made some more progress on this.
> 
> Eric.
> 

From martin at v.loewis.de  Tue Jun  5 18:56:37 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 18:56:37 +0200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <466595C5.6070301@v.loewis.de>

> I'd love to get rid of full-width ASCII and halfwidth kana (via
> compatibility decomposition).  Native Japanese speakers often use them
> interchangably with the "proper" versions when correcting typos and
> updating numbers in a series.  Ugly, to say the least.  I don't think
> that native Japanese would care, as long as the decomposition is done
> internally to Python.

Not sure what the proposal is here. If people say "we want the PEP do
NFKC", I understand that as "instead of saying NFC, it should say
NFKC", which in turn means "all identifiers are converted into the
normal form NFKC while parsing".

With that change, the full-width ASCII characters would still be
allowed in source - they just wouldn't be different from the regular
ones anymore when comparing identifiers.
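
For example, a minimal sketch of what that normalization step does:

    import unicodedata

    fw = "\uff21\uff22\uff23"                  # full-width 'ABC'
    print(unicodedata.normalize("NFC", fw))    # unchanged; the decomposition
                                               # is compatibility, not canonical
    print(unicodedata.normalize("NFKC", fw))   # plain 'ABC'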

Another option would be to require that the source is in NFKC already,
where I then ask again what precisely that means in presence of
non-UTF source encodings.

Regards,
Martin

From jimjjewett at gmail.com  Tue Jun  5 19:10:02 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 13:10:02 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <fb6fbf560706051010u5d904e98kb34ca50599fc1087@mail.gmail.com>

On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> It seems to me that what UAX#31 is saying is "Distinguishing (or not)
> between 0035 DIGIT 3 and 2075 SUPERSCRIPT 3 should be
> equivalent to distinguishing (or not) between LATIN CAPITAL
> LETTER A and LATIN SMALL LETTER A."  I don't know that
> I agree (or disagree) in principle.

So effectively, they consider "a" and "A" to be presentational variants.

In some languages, certain presentational variants are used depending
on word position.  I think the ID_START property does exclude letters
that cannot appear in an initial position, but putting a final
character in the middle or vice versa would still be wrong.

If identifiers are only ever typed, I suppose that isn't a problem.
If identifiers are built up in the equivalent of

    handler="do_" + name

then the character will sometimes be wrong in a way that many editors
will either hide or silently "correct."  The standard also says (but I
can't verify) that replacing the presentational variant with the
generic form will generally *improve* presentation, presumably because
there are now more systems which do the font shaping correctly than
there are systems able to handle the old character formats.
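
A ligature makes a handy test case here (illustration only):

    import unicodedata

    lig = "\ufb01"                             # LATIN SMALL LIGATURE FI
    print(unicodedata.normalize("NFC", lig))   # unchanged; the mapping to
                                               # 'fi' is only a compatibility
                                               # decomposition
    print(unicodedata.normalize("NFKC", lig))  # 'fi', the generic form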

The folding rules do say that it is OK  (even good) to exclude certain
characters from certain foldings; I think we could preserve case
(including title-case?) as the only presentational variant we
recognize.

> A scan of the full table for Unicode Version 2.0 (what I have here in
> print) suggests that problematic decompositions actually are
> restricted to only a few scripts.  LATIN (CAPITAL|SMALL)
> LETTER L WITH MIDDLE DOT (used in Catalan, cf sec. 5.1 of
> UAX#31)

As best I understand it, this one would be helped by using
compatibility mappings.  There is an official way to spell l-middle
dot, but enough old texts used the "wrong" character that it has to be
special-cased for round-tripping.  Since the ID is a final
destination, we care less about round-trips, and more about "if they
switch editors, will the identifier still match".

At the very least, it is mentioned as needing special care (when used
as an identifier) in http://www.unicode.org/reports/tr31/ section 5.1
paragraph 1.

> decompositions, unlike almost all other Latin decompositions (which
> are canonical, and thus get recomposed in NFKC).  'n (Afrikaans), and
> a half-dozen Croatian digraphs corresponding to Serbian Cyrillic would
> get lost.  The Koreans would lose a truckload of partially composed
> Hangul and some archaic ones,

http://www.unicode.org/versions/corrigendum3.html suggests that many
of the Hangul are either pronunciation guide variants or even exact
duplicates (that were presumably missed when the canonicalization was
frozen?)

> the Arabic speakers their presentation forms.

http://www.unicode.org/reports/tr31/ 5.1 paragraph 3 includes:

"""It is recommended that all Arabic presentation forms be excluded
from identifiers in any event, although only a few of them must be
excluded for normalization to guarantee identifier closure."""

> And that's about it (but I may have missed a bunch because
> that database doesn't give the character classes, so I guessed for
> stuff like technical symbols -> not ID characters).

Depends on what you mean by technical symbols.  IMHO, many of them are
in fact listed as ID characters.  The math versions (generally 1D400 -
1DC7B) are included.  But
http://unicode.org/reports/tr39/data/xidmodifications.txt suggests
excluding them again.

> However, of the ones I can judge to some extent (Latin printer's
> ligatures, width variants, non-syllabic precomposed Korean Jamo), *not
> one* of the compatibility decompositions would be a loss in my
> opinion.  On the other hand, there are a bunch of cases where NKFC
> would be a marked improvement.

-jJ

From jimjjewett at gmail.com  Tue Jun  5 19:14:59 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 13:14:59 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <466595C5.6070301@v.loewis.de>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<466595C5.6070301@v.loewis.de>
Message-ID: <fb6fbf560706051014x749d53cci566ad4ad8da54dfc@mail.gmail.com>

On 6/5/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > I'd love to get rid of full-width ASCII and halfwidth kana (via
> > compatibility decomposition).  Native Japanese speakers often use them
> > interchangably with the "proper" versions when correcting typos and
> > updating numbers in a series.  Ugly, to say the least.  I don't think
> > that native Japanese would care, as long as the decomposition is done
> > internally to Python.

> Not sure what the proposal is here. If people say "we want the PEP do
> NFKC", I understand that as "instead of saying NFC, it should say
> NFKC", which in turn means "all identifiers are converted into the
> normal form NFKC while parsing".

I would prefer that.

> With that change, the full-width ASCII characters would still be
> allowed in source - they just wouldn't be different from the regular
> ones anymore when comparing identifiers.

I *think* that would be OK; so long as they mean the same thing, it is
just a quirk like using a different font.  I am slightly concerned
that it might mean "string as string" and "string as identifier" have
different tests for equality.

> Another option would be to require that the source is in NFKC already,
> where I then ask again what precisely that means in presence of
> non-UTF source encodings.

My own opinion is that it would be reasonable to put those in NFKC
form as part of the parser's internal translation to unicode.  (But I
agree that it makes sense to do that for all encodings, if it is done
for any.)

-jJ

From martin at v.loewis.de  Tue Jun  5 19:15:35 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 19:15:35 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>	
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>	
	<4662F639.2070806@v.loewis.de>	
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>	
	<46646E76.8060804@v.loewis.de>	
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>	
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>	
	<4665189D.4020301@v.loewis.de>
	<fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>
Message-ID: <46659A37.4000900@v.loewis.de>

Jim Jewett wrote:
> On 6/5/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> > Always normalizing would have the advantage of simplicity (no
>> > matter what the encoding, the result is the same), and I think
>> > that is the real path of least surprise if you sum over all
>> > surprises.
> 
>> I'd like to repeat that this is out of scope of this PEP, though.
>> This PEP doesn't, and shouldn't, specify how string literals get
>> from source to execution.
> 
> I see that as a gray area.

Please read the PEP title again. What is unclear about
"Supporting Non-ASCII Identifiers"?

> Unicode does say pretty clearly that (at least) canonical equivalents
> must be treated the same.

Chapter and verse, please?

> In theory, this could be done only to identifiers, but then it needs
> to be done inline for getattr.

Why that? The caller of getattr would need to apply normalization in
case the input isn't known to be normalized?

> Since we don't want the results of (str1 == str2) to change based on
> context, I think string equality also needs to look at canonicalized
> (though probably not compatibility) forms.  This in turn means that
> hashing a unicode string should first canonicalize it.  (I believe
> that is a change from 2.x.)

And you think this is still within the scope of the PEP?

Please, if you want that to happen, write your own PEP.

Regards,
Martin

From jimjjewett at gmail.com  Tue Jun  5 19:33:59 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 13:33:59 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <f52584c00706050406o63dc9427ub4b7cae7a2451391@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706050406o63dc9427ub4b7cae7a2451391@mail.gmail.com>
Message-ID: <fb6fbf560706051033k12a7799dpe3eaaa30e5c2090f@mail.gmail.com>

On 6/5/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > I'd love to get rid of full-width ASCII and halfwidth kana (via
> > compatibility decomposition).

> If you do forbid compatibility characters in identifiers, then they
> should be flagged as an error, not converted silently.

Forbidding them seems reasonable to me; the only catch is that it is
the first step toward making a ton of individual decisions, some of
which will be wrong.  Better than getting them all wrong, of course,
but not better than postponing.  (I don't mean "ban all unicode
characters"; I do mean to ban far more of them, or to use a
site-specific incremental whitelist, or both.)

-jJ

From jimjjewett at gmail.com  Tue Jun  5 20:48:40 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 14:48:40 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <46659A37.4000900@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
	<46646E76.8060804@v.loewis.de>
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>
	<4665189D.4020301@v.loewis.de>
	<fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>
	<46659A37.4000900@v.loewis.de>
Message-ID: <fb6fbf560706051148s4338ddb1pbf1e8db0f9793c7a@mail.gmail.com>

On 6/5/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> Jim Jewett schrieb:
> > On 6/5/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> >> > Always normalizing would have the advantage of simplicity (no
> >> > matter what the encoding, the result is the same), and I think
> >> > that is the real path of least surprise if you sum over all
> >> > surprises.

> >> I'd like to repeat that this is out of scope of this PEP, though.
> >> This PEP doesn't, and shouldn't, specify how string literals get
> >> from source to execution.

> > I see that as a gray area.

> Please read the PEP title again. What is unclear about
> "Supporting Non-ASCII Identifiers"?

That strings can also be used as identifiers.

> > Unicode does say pretty clearly that (at least) canonical equivalents
> > must be treated the same.

> Chapter and verse, please?

I am pretty sure this list is not exhaustive, but it may be helpful:

The Identifiers Annex http://www.unicode.org/reports/tr31/

"""
UAX31-C2.	An implementation claiming conformance to Level 1 of this
specification shall describe which of the following it observes:

R1 Default Identifiers
R2 Alternative Identifiers
R3 Pattern_White_Space and Pattern_Syntax Characters
R4 Normalized Identifiers
R5 Case-Insensitive Identifiers
"""

I interpret this as "If we normalize the Identifiers, then we must
observe R4."  R4 lets us exclude individual characters from
normalization, but it says that two IDs with the same Normalization
Form are equivalent, unless they include specifically excluded
characters.

"""
R4 	Normalized Identifiers
 	
To meet this requirement, an implementation shall specify the
Normalization Form and shall provide a precise list of any characters
that are excluded from normalization. If the Normalization Form is
NFKC, the implementation shall apply the modifications in Section 5.1,
NFKC Modifications, given by the properties XID_Start and
XID_Continue. Except for identifiers containing excluded characters,
any two identifiers that have the same Normalization Form shall be
treated as equivalent by the implementation.
"""

Additional Support:

The Normalization Annex http://www.unicode.org/reports/tr15/ near the
end of section 1 (but before 1.1)

"""
Normalization Forms KC and KD must not be blindly applied to arbitrary text.
""" ... """
They can be applied more freely to domains with restricted character
sets, such as in Section 13, Programming Language Identifiers.
"""
(section 13 then forwards back to UAX31)

TR 15, section 19, numbered paragraph 3
"""
Higher-level processes that transform or compare strings, or that
perform other higher-level functions, must respect canonical
equivalence or problems will result.
"""

Looking at the main standard, I revert to Unicode 4 because it is
online at http://www.unicode.org/versions/Unicode4.0.0/

2.2 Equivalent Sequences
""" ...
If an application or user attempts to distinguish non-identical
sequences which are nonetheless considered to be equivalent sequences,
as shown in the examples in Figure 2-6, it would not be guaranteed
that other applications or users would recognize the same
distinctions.  To prevent introducing interoperability problems
between applications, such distinctions must be avoided wherever
possible.
"""
which is echoed in chapter 3 (conformance)
"""
C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
...
Ideally, an implementation would always interpret two
canonical-equivalent character sequences identically. There are
practical circumstances under which implementations may reasonably
distinguish them.
"""
"""
C10 When a process purports not to modify the interpretation of a
valid coded character representation, it shall make no change to that
coded character representation other than the possible replacement of
character sequences by their canonical-equivalent sequences or the
deletion of noncharacter code points.
...
All processes and higher-level protocols are required to abide by C10
as a minimum.  However, higher-level protocols may define additional
equivalences that do not constitute modifications under that protocol.
For example, a higher-level protocol may allow a sequence of spaces to
be replaced by a single space.
"""

> > In theory, this could be done only to identifiers, but then it needs
> > to be done inline for getattr.

> Why that? The caller of getattr would need to apply normalization in
> case the input isn't known to be normalized?

OK, I suppose that might work, if documented, but ... it seems like
another piece of boilerplate; when it isn't there, it won't really be
because the input is known to be normalized so much as because the
author didn't think about normalization.
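
Concretely, the hazard I am worried about looks like this (sketch):

    import unicodedata

    class Obj:
        pass

    o = Obj()
    setattr(o, "caf\u00e9", 1)    # attribute stored precomposed (NFC)
    key = "cafe\u0301"            # the same word, decomposed spelling
    # getattr(o, key) raises AttributeError; the boilerplate would be:
    value = getattr(o, unicodedata.normalize("NFC", key))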

-jJ

From martin at v.loewis.de  Tue Jun  5 21:09:03 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 21:09:03 +0200
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
Message-ID: <4665B4CF.2050107@v.loewis.de>

> Here's a summary of some of the remaining open issues and unaddressed
> arguments regarding PEP 3131.  These are the ones I'm familiar with,
> so I don't claim this to be complete.  I hope it helps give some
> perspective on this huge thread, though.

Thanks, I added them all to the PEP. Not sure which of these you
would consider "open issues", or "unaddressed arguments"; I'll
indicate below how I see them dealt with by the PEP currently.

> A. Should identifiers be allowed to contain any Unicode letter?

Not an open issue; the PEP has been accepted.

>    1. Python will lose the ability to make a reliable round trip to
>       a human-readable display on screen or on paper.

Correct. Was already the case, though, because of comments and string
literals.

>    2. Python will become vulnerable to a new class of security exploits;
>       code and submitted patches will be much harder to inspect.

The first class is correct; I'd question the second part (in particular
the "much" part of it). It's now addressed in the PEP by being listed
in the discussion section.

>    3. Humans will no longer be able to validate Python syntax.

That's not true. Instead, they might not be able to do that for *all*
Python programs - however, that is the case already: if programs
are sufficiently complex, people cannot validate Python syntax today.
Addressed by being listed.

>    4. Unicode is young; its problems are not yet well understood and
>       solved; tool support is weak.

Now listed. I disagree that Unicode is young; it is roughly as old
as Python.

>    5. Languages with non-ASCII identifiers use different character sets
>       and normalization schemes; PEP 3131's choices are non-obvious.

I disagree. PEP 3131 follows UAX#31 literally, and makes that decision
very clear. If people still cannot see that, please provide wording to
make it more clear.

>    6. The Unicode bidi algorithm yields an extremely confusing display
>       order for RTL text when digits or operators are nearby.

Now listed.

> B. Should the default behaviour accept only ASCII identifiers, or
>    should it accept identifiers containing non-ASCII characters?

Added as an open issue.

> C. Should non-ASCII identifiers be optional?

How is that different from B?

> D. Should the identifier character set be configurable?

Still seems to be the same open issue.

> E. Which identifier characters should be allowed?
> 
>    1. What to do about bidi format control characters?

That was already listed as an open issue.

>    2. What about other ID_Continue characters?  What about characters
>       that look like punctuation?  What about other recommendations
>       in UTS #39?  What about mixed-script identifiers?
> 
>       http://mail.python.org/pipermail/python-3000/2007-May/007836.html

That was also listed as an open issue.

> F.  Which normalization form should be used, NFC or NFKC?

Now listed as an open issue.

> G.  Should source code be required to be in normalized form?

Should I add a section "Rejected ideas"? This is out of scope of the PEP.

Regards,
Martin

From martin at v.loewis.de  Tue Jun  5 21:21:59 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 21:21:59 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <fb6fbf560706051148s4338ddb1pbf1e8db0f9793c7a@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>	
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>	
	<4662F639.2070806@v.loewis.de>	
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>	
	<46646E76.8060804@v.loewis.de>	
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>	
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>	
	<4665189D.4020301@v.loewis.de>	
	<fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>	
	<46659A37.4000900@v.loewis.de>
	<fb6fbf560706051148s4338ddb1pbf1e8db0f9793c7a@mail.gmail.com>
Message-ID: <4665B7D7.6030501@v.loewis.de>

>> > Unicode does say pretty clearly that (at least) canonical equivalents
>> > must be treated the same.
> 
>> Chapter and verse, please?
> 
> I am pretty sure this list is not exhaustive, but it may be helpful:
> 
> The Identifiers Annex http://www.unicode.org/reports/tr31/

Ah, that's in the context of identifiers, not in the context of text
in general.

> """
> UAX31-C2.    An implementation claiming conformance to Level 1 of this
> specification shall describe which of the following it observes:
> 
> R1 Default Identifiers
> R2 Alternative Identifiers
> R3 Pattern_White_Space and Pattern_Syntax Characters
> R4 Normalized Identifiers
> R5 Case-Insensitive Identifiers
> """
> 
> I interpret this as "If we normalize the Identifiers, then we must
> observe R4."  R4 lets us exclude individual characters from
> normalization, but it says that two IDs with the same Normalization
> Form are equivalent, unless they include specifically excluded
> characters.

Correct, and that's indeed what PEP 3131 does.

> """
> Normalization Forms KC and KD must not be blindly applied to arbitrary
> text.
> """ ... """
> They can be applied more freely to domains with restricted character
> sets, such as in Section 13, Programming Language Identifiers.
> """
> (section 13 then forwards back to UAX31)

How is that a requirement that comparison should apply normalization?


> TR 15, section 19, numbered paragraph 3
> """
> Higher-level processes that transform or compare strings, or that
> perform other higher-level functions, must respect canonical
> equivalence or problems will result.
> """

That's not a mandatory requirement, but an "important aspect". Also,
it applies to "higher-level processes"; I would expect that string
comparison is not a higher-level function. Indeed, UAX#15 only
gives definitions, no rules.

> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

Right. What is "a process"?

> ...
> Ideally, an implementation would always interpret two
> canonical-equivalent character sequences identically. There are
> practical circumstances under which implementations may reasonably
> distinguish them.
> """

So it should be the application's choice.

> """
> C10 When a process purports not to modify the interpretation of a
> valid coded character representation, it shall make no change to that
> coded character representation other than the possible replacement of
> character sequences by their canonical-equivalent sequences or the
> deletion of noncharacter code points.
> ...
> All processes and higher-level protocols are required to abide by C10
> as a minimum.  However, higher-level protocols may define additional
> equivalences that do not constitute modifications under that protocol.
> For example, a higher-level protocol may allow a sequence of spaces to
> be replaced by a single space.
> """

So this *allows* to canonicalize strings, it doesn't *require* Python
to do so. Indeed, doing so would be fairly expensive, and therefore
it should not be done (IMO).

>> Why that? The caller of getattr would need to apply normalization in
>> case the input isn't known to be normalized?
> 
> OK, I suppose that might work, if documented, but ... it seems like
> another piece of boilerplate; when it isn't there, it won't really be
> because the input is known to be normalized so much as because the
> author didn't think about normalization.

No. It might also be because the author *knows* that the string is
already normalized.

Regards,
Martin

From martin at v.loewis.de  Tue Jun  5 22:59:55 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 22:59:55 +0200
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <4664A2B6.903@canterbury.ac.nz>
References: <471885.91417.qm@web33503.mail.mud.yahoo.com>
	<4664A2B6.903@canterbury.ac.nz>
Message-ID: <4665CECB.3000109@v.loewis.de>

Greg Ewing wrote:
> Steve Howell wrote:
> 
>>     einfügen = in joints (????)
> 
> Maybe "join in" (as a verb)?

It's actually "insert" (into the list).

Regards,
Martin

From alexandre at peadrop.com  Tue Jun  5 23:33:10 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Tue, 5 Jun 2007 17:33:10 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
Message-ID: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>

Hi,

On Ubuntu Linux, when I try to run make in the py3k-struni branch I get
a weird error about split(). However, I don't get this error when I
run ``make clean; make''.

Thanks,
-- Alexandre

% make
Traceback (most recent call last):
  File "./setup.py", line 6, in <module>
    import sys, os, imp, re, optparse
  File "/home/alex/src/python.org/py3k-struni/Lib/optparse.py", line
412, in <module>
    _builtin_cvt = { "int" : (_parse_int, _("integer")),
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line
563, in gettext
    return dgettext(_current_domain, message)
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line
527, in dgettext
    codeset=_localecodesets.get(domain))
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line
462, in translation
    mofiles = find(domain, localedir, languages, all=1)
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line 434, in find
    for nelang in _expand_lang(lang):
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line
129, in _expand_lang
    locale = normalize(locale)
  File "/home/alex/src/python.org/py3k-struni/Lib/locale.py", line
329, in normalize
    norm_encoding = encodings.normalize_encoding(encoding)
  File "/home/alex/src/python.org/py3k-struni/Lib/encodings/__init__.py",
line 68, in normalize_encoding
    return '_'.join(encoding.translate(_norm_encoding_map).split())
TypeError: split() takes at least 1 argument (0 given)
make: *** [sharedmods] Error 1

From guido at python.org  Wed Jun  6 00:04:30 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Jun 2007 15:04:30 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
Message-ID: <ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>

If "make clean" makes the problem go away, it's usually because there
were old .pyc files with incompatible byte code. We don't change the
.pyc magic number for each change to the compiler.
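
If you want to clear them by hand rather than run "make clean",
something along these lines does it (untested sketch):

    # delete stale byte-code files under the current tree
    import os

    for dirpath, dirnames, filenames in os.walk("."):
        for name in filenames:
            if name.endswith((".pyc", ".pyo")):
                os.remove(os.path.join(dirpath, name))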

--Guido

On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> Hi,
>
> On Ubuntu Linux, when I try to run make in the py3k-struni branch I get
> a weird error about split(). However, I don't get this error when I
> run ``make clean; make''.
>
> Thanks,
> -- Alexandre
>
> % make
> Traceback (most recent call last):
>   File "./setup.py", line 6, in <module>
>     import sys, os, imp, re, optparse
>   File "/home/alex/src/python.org/py3k-struni/Lib/optparse.py", line
> 412, in <module>
>     _builtin_cvt = { "int" : (_parse_int, _("integer")),
>   File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line
> 563, in gettext
>     return dgettext(_current_domain, message)
>   File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line
> 527, in dgettext
>     codeset=_localecodesets.get(domain))
>   File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line
> 462, in translation
>     mofiles = find(domain, localedir, languages, all=1)
>   File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line 434, in find
>     for nelang in _expand_lang(lang):
>   File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line
> 129, in _expand_lang
>     locale = normalize(locale)
>   File "/home/alex/src/python.org/py3k-struni/Lib/locale.py", line
> 329, in normalize
>     norm_encoding = encodings.normalize_encoding(encoding)
>   File "/home/alex/src/python.org/py3k-struni/Lib/encodings/__init__.py",
> line 68, in normalize_encoding
>     return '_'.join(encoding.translate(_norm_encoding_map).split())
> TypeError: split() takes at least 1 argument (0 given)
> make: *** [sharedmods] Error 1


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com  Wed Jun  6 00:43:29 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Tue, 5 Jun 2007 18:43:29 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
Message-ID: <acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>

On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> If "make clean" makes the problem go away, it's usually because there
> were old .pyc files with incompatible byte code. We don't change the
> .pyc magic number for each change to the compiler.

Nope. It is still not working. I just did the following, and I still
get the same error.

   % unset CC  # to turn off ccache
   % make distclean
   % svn revert -R .
   % svn up
   % ./configure
   % make  # run fine
   % make  # fail

-- Alexandre

From rrr at ronadam.com  Wed Jun  6 01:14:12 2007
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 05 Jun 2007 18:14:12 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
Message-ID: <4665EE44.2010306@ronadam.com>

Alexandre Vassalotti wrote:
> On 6/5/07, Guido van Rossum <guido at python.org> wrote:
>> If "make clean" makes the problem go away, it's usually because there
>> were old .pyc files with incompatible byte code. We don't change the
>> .pyc magic number for each change to the compiler.
> 
> Nope. It is still not working. I just did the following, and I still
> get the same error.
> 
>    % unset CC  # to turn off ccache
>    % make distclean
>    % svn revert -R .
>    % svn up
>    % ./configure
>    % make  # run fine
>    % make  # fail
> 
> -- Alexandre

I can confirm the same behavior.  Works on the first make, same error on 
the second.  I deleted the contents of the branch and did an "svn up" on an 
empty directory.  Same thing.

Ron

From jimjjewett at gmail.com  Wed Jun  6 01:18:09 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 19:18:09 -0400
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <4665B4CF.2050107@v.loewis.de>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
	<4665B4CF.2050107@v.loewis.de>
Message-ID: <fb6fbf560706051618g2fbf4cfemaf7f87170fd69743@mail.gmail.com>

On 6/5/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:

> >    1. Python will lose the ability to make a reliable round trip to
> >       a human-readable display on screen or on paper.

> Correct. Was already the case, though, because of comments
> and string literals.

But these are usually less important; when written as literals, they
are normally part of the User Interface, and if the user can't see the
difference, it doesn't matter.

There are exceptions, such as the "HELO" magic cookie in the
(externally defined) SMTP protocol, but I think these exceptions are
uncommon -- and outside python's control anyhow.

> >    5. Languages with non-ASCII identifiers use different
> >  character sets  and normalization schemes; PEP 3131's
> > choices are non-obvious.

> I disagree. PEP 3131 follows UAX#31 literally, and makes that
> decision very clear. If people still cannot see that,

I think "obvious" referred to the reasoning, not the outcome.

I can tell that the decision was "NFC, anything goes", but I don't see why.

(1)
I am not sure why it was NFC; UAX 31 seems agnostic on which
normalization form to use.

The only explicit recommendations I can find suggest using NFKC for
identifiers.  http://www.unicode.org/faq/normalization.html#2

(Outside of that recommendation for KC, it isn't even clear why we
should use the Composed form.  As of tonight, I realized that
"composed" means less than I thought, and the actual algorithm means
it should work as well as the Decomposed forms -- but I had missed
that detail the first several times I read about the different
Normalization forms, and it certainly isn't included directly in the
PEP.)

(2)
I cannot understand why ID_START/CONTINUE was chosen instead of the
newer and more recommended XID_START/CONTINUE.  From UAX31 section 2:
"""
The XID_Start and XID_Continue properties are improved lexical classes
that incorporate the changes described in Section 5.1, NFKC
Modifications. They are recommended for most purposes, especially for
security, over the original ID_Start and ID_Continue properties.
"""

Nor can I understand why the additional restrictions in
xidmodifications (from TR39) were ignored.  The reason to remove those
characters is given as
"""
The restricted characters are characters not in common use, removed so
as to further reduce the possibilities for visual confusion.
Initially, the following are being excluded: characters not in modern
use; characters only used in specialized fields, such as liturgical
characters, mathematical letter-like symbols, and certain phonetic
alphabetics; and ideographic characters that are not part of a set of
core CJK ideographs consisting of the CJK Unified Ideographs block
plus IICore (the set of characters defined by the IRG as the minimal
set of required ideographs for East Asian use). A small number of such
characters are allowed back in so that the profile includes all the
characters in the country-specific restricted IDN lists:
"""

As best I can tell, the remaining list is *still* too generous to be
called conservative, but the characters being removed are almost
certainly good choices for removal -- no one's native language
requires them.


> > B. Should the default behaviour accept only ASCII identifiers, or
> >    should it accept identifiers containing non-ASCII characters?

> > D. Should the identifier character set be configurable?

> Still seems to be the same open issue.

Defaulting to ASCII or defaulting to "accept unicode" is one issue.

A related but separate issue is whether accepting unicode is a single
on/off switch, or whether it will be possible to accept only some
unicode characters.

As written, there is no good way to accept, say, Japanese characters,
but not Cyrillic.

I would prefer to whitelist individual characters or scripts, but
there should at least be a way to exclude certain characters.
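
As a strawman, even a crude per-site check would do (the ranges below
are purely illustrative; a real version needs script data from the
UCD, which unicodedata does not expose directly):

    ALLOWED_RANGES = [
        (0x0041, 0x007A),   # ASCII letters, coarsely
        (0x3040, 0x30FF),   # Hiragana and Katakana
    ]

    def identifier_allowed(name):
        return all(any(lo <= ord(ch) <= hi
                       for lo, hi in ALLOWED_RANGES)
                   for ch in name)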

http://www.unicode.org/reports/tr39/data/intentional.txt

is a list of characters that *should* be impossible to distinguish
visually.  It isn't just that the standard representations are
identical (like some of the combining marks looking like quote
signs); it is that the (distinct abstract) characters *should* use the
same glyph, so long as they are in the same (or even harmonized)
fonts.

Several of the Greek and Cyrillic characters are glyph-identical with
ASCII letters.  I won't say that people using those scripts shouldn't
be allowed to use those letters, but *I* certainly don't want to get
code using them just because I allowed the ?.

-jJ

From alexandre at peadrop.com  Wed Jun  6 01:45:24 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Tue, 5 Jun 2007 19:45:24 -0400
Subject: [Python-3000] help() broken in the py3k-struni branch
Message-ID: <acd65fa20706051645p3f05a292u243b9623dbafda5b@mail.gmail.com>

Hi,

I found another bug to report. It seems there is a bug in
subprocess.py that makes help() fail.

-- Alexandre

Python 3.0x (py3k-struni, Jun  5 2007, 18:41:44)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> help(open)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alex/src/python.org/py3k-struni/Lib/site.py", line 350,
in __call__
    return pydoc.help(*args, **kwds)
  File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
1687, in __call__
    self.help(request)
  File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1731, in help
    else: doc(request, 'Help on %s:')
  File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1514, in doc
    pager(render_doc(thing, title, forceload))
  File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1313, in pager
    pager(text)
  File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
1333, in <lambda>
    return lambda text: pipepager(text, 'less')
  File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
1352, in pipepager
    pipe = os.popen(cmd, 'w')
  File "/home/alex/src/python.org/py3k-struni/Lib/os.py", line 717, in popen
    bufsize=buffering)
  File "/home/alex/src/python.org/py3k-struni/Lib/subprocess.py", line
476, in __init__
    raise TypeError("bufsize must be an integer")
TypeError: bufsize must be an integer

From guido at python.org  Wed Jun  6 01:47:24 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Jun 2007 16:47:24 -0700
Subject: [Python-3000] help() broken in the py3k-struni branch
In-Reply-To: <acd65fa20706051645p3f05a292u243b9623dbafda5b@mail.gmail.com>
References: <acd65fa20706051645p3f05a292u243b9623dbafda5b@mail.gmail.com>
Message-ID: <ca471dc20706051647s249727abwd10a526ad13c98bb@mail.gmail.com>

Feel free to mail me a patch to fix it.

On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> Hi,
>
> I found another bug to report. It seems there is a bug in
> subprocess.py that makes help() fail.
>
> -- Alexandre
>
> Python 3.0x (py3k-struni, Jun  5 2007, 18:41:44)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> help(open)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/alex/src/python.org/py3k-struni/Lib/site.py", line 350,
> in __call__
>     return pydoc.help(*args, **kwds)
>   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> 1687, in __call__
>     self.help(request)
>   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1731, in help
>     else: doc(request, 'Help on %s:')
>   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1514, in doc
>     pager(render_doc(thing, title, forceload))
>   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1313, in pager
>     pager(text)
>   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> 1333, in <lambda>
>     return lambda text: pipepager(text, 'less')
>   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> 1352, in pipepager
>     pipe = os.popen(cmd, 'w')
>   File "/home/alex/src/python.org/py3k-struni/Lib/os.py", line 717, in popen
>     bufsize=buffering)
>   File "/home/alex/src/python.org/py3k-struni/Lib/subprocess.py", line
> 476, in __init__
>     raise TypeError("bufsize must be an integer")
> TypeError: bufsize must be an integer


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com  Wed Jun  6 01:51:44 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Tue, 5 Jun 2007 19:51:44 -0400
Subject: [Python-3000] pdb help is broken in py3k-struni branch
Message-ID: <acd65fa20706051651k116fa579t41a51aad52f285ef@mail.gmail.com>

Hi again,

I just found yet another bug in py3k-struni branch. This one about the
pdb module.

Should I start to report these bugs to the bug tracker, instead? At
this pace, I will flood the mailing list. :)

-- Alexandre

Python 3.0x (py3k-struni, Jun  5 2007, 18:41:44)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> raise TypeError
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError
>>> import pdb
>>> pdb.pm()
> <stdin>(1)<module>()
(Pdb) help

Documented commands (type help <topic>):
========================================
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 1198, in pm
    post_mortem(sys.last_traceback)
  File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 1195,
in post_mortem
    p.interaction(t.tb_frame, t)
  File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 192,
in interaction
    self.cmdloop()
  File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 139, in cmdloop
    stop = self.onecmd(line)
  File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 242, in onecmd
    return cmd.Cmd.onecmd(self, line)
  File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 216, in onecmd
    return func(arg)
  File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 336, in do_help
    self.print_topics(self.doc_header,   cmds_doc,   15,80)
  File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 345,
in print_topics
    self.columnize(cmds, maxcol-1)
  File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 361,
in columnize
    ", ".join(map(str, nonstrings)))
TypeError: list[i] not a string for i in 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45
>>>

From guido at python.org  Wed Jun  6 02:00:33 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Jun 2007 17:00:33 -0700
Subject: [Python-3000] pdb help is broken in py3k-struni branch
In-Reply-To: <acd65fa20706051651k116fa579t41a51aad52f285ef@mail.gmail.com>
References: <acd65fa20706051651k116fa579t41a51aad52f285ef@mail.gmail.com>
Message-ID: <ca471dc20706051700m61cad7f8wb4973f6857685c96@mail.gmail.com>

I'd rather see them here than in SF, SF is a pain to use.

But unless the bugs prevent you from proceeding, you could also ignore them.

There are 96 failing unit tests right now in that branch -- no need to
report all of them.

--Guido

On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> Hi again,
>
> I just found yet another bug in py3k-struni branch. This one about the
> pdb module.
>
> Should I start to report these bugs to the bug tracker, instead? At
> this pace, I will flood the mailing list. :)
>
> -- Alexandre
>
> Python 3.0x (py3k-struni, Jun  5 2007, 18:41:44)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> raise TypeError
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError
> >>> import pdb
> >>> pdb.pm()
> > <stdin>(1)<module>()
> (Pdb) help
>
> Documented commands (type help <topic>):
> ========================================
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 1198, in pm
>     post_mortem(sys.last_traceback)
>   File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 1195,
> in post_mortem
>     p.interaction(t.tb_frame, t)
>   File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 192,
> in interaction
>     self.cmdloop()
>   File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 139, in cmdloop
>     stop = self.onecmd(line)
>   File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 242, in onecmd
>     return cmd.Cmd.onecmd(self, line)
>   File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 216, in onecmd
>     return func(arg)
>   File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 336, in do_help
>     self.print_topics(self.doc_header,   cmds_doc,   15,80)
>   File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 345,
> in print_topics
>     self.columnize(cmds, maxcol-1)
>   File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 361,
> in columnize
>     ", ".join(map(str, nonstrings)))
> TypeError: list[i] not a string for i in 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
> 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
> 44, 45
> >>>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com  Wed Jun  6 02:14:08 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Tue, 5 Jun 2007 20:14:08 -0400
Subject: [Python-3000] pdb help is broken in py3k-struni branch
In-Reply-To: <ca471dc20706051700m61cad7f8wb4973f6857685c96@mail.gmail.com>
References: <acd65fa20706051651k116fa579t41a51aad52f285ef@mail.gmail.com>
	<ca471dc20706051700m61cad7f8wb4973f6857685c96@mail.gmail.com>
Message-ID: <acd65fa20706051714v737d4d2br5b3059317d0b1210@mail.gmail.com>

On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> I'd rather see them here than in SF, SF is a pain to use.
>
> But unless the bugs prevent you from proceeding, you could also ignore them.

The first bug that I reported today (the one about `make`) stops me
from running the test suite, so I can't really test the _string_io and
_bytes_io modules.

> There are 96 failing unit tests right now in that branch -- no need to
> report all of them.

Ah, well. Then, running the test suite wouldn't really be useful, after all.

Thanks,
-- Alexandre

From guido at python.org  Wed Jun  6 02:27:45 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Jun 2007 17:27:45 -0700
Subject: [Python-3000] pdb help is broken in py3k-struni branch
In-Reply-To: <acd65fa20706051714v737d4d2br5b3059317d0b1210@mail.gmail.com>
References: <acd65fa20706051651k116fa579t41a51aad52f285ef@mail.gmail.com>
	<ca471dc20706051700m61cad7f8wb4973f6857685c96@mail.gmail.com>
	<acd65fa20706051714v737d4d2br5b3059317d0b1210@mail.gmail.com>
Message-ID: <ca471dc20706051727j6a7f1738g96f45cf9a2a2d4aa@mail.gmail.com>

On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> > I'd rather see them here than in SF, SF is a pain to use.
> >
> > But unless the bugs prevent you from proceeding, you could also ignore them.
>
> The first bug that I reported today (the one about `make`) stop me
> from running the test suite. So, can't really test the _string_io and
> _bytes_io modules.

I tried to reproduce it but it works fine for me -- I'm on Ubuntu
dapper (with some Google mods) on a 2.6.18.5-gg4 kernel.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From python at zesty.ca  Wed Jun  6 03:21:53 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Tue, 5 Jun 2007 20:21:53 -0500 (CDT)
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <4665B4CF.2050107@v.loewis.de>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
	<4665B4CF.2050107@v.loewis.de>
Message-ID: <Pine.LNX.4.58.0706052012300.7196@server1.LFW.org>

> > A. Should identifiers be allowed to contain any Unicode letter?
>
> Not an open issue; the PEP has been accepted.

The items listed under "A." are concerns that I wanted to be noted
in the PEP, so thanks for listing them.

> > B. Should the default behaviour accept only ASCII identifiers, or
> >    should it accept identifiers containing non-ASCII characters?
>
> Added as an open issue.
>
> > C. Should non-ASCII identifiers be optional?
>
> How is that different from B?

C asks "should there be an on/off switch"; B asks whether the
default should be on or off.

> > D. Should the identifier character set be configurable?
>
> Still seems to be the same open issue.

D asks "should you be able to select which character set you want",
which is finer-grained than an all-or-nothing switch.

> > G.  Should source code be required to be in normalized form?
>
> Should I add a section "Rejected ideas"? This is out of scope of the PEP.

It seems to me that the issue is directly related -- since the
PEP intends to change the definition of acceptable source code,
ought we not to settle what we're going to accept?

To your earlier question of "what about non-UTF-8 files", I imagine
that the normalization restriction would apply to the decoded characters.
That is, once you know the source code encoding, there's a one-to-one
mapping between the sequence of bytes in the source file and the
sequence of characters to be parsed.  Thus, two references to the
same identifier will be represented by exactly the same bytes in the
source file (you can't have different byte sequences in the source
file alias to the same identifier).
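
Checking that property on the decoded text is cheap; a sketch:

    import unicodedata

    def source_is_normalized(text, form="NFC"):
        # True if the decoded source is already in the given normal form
        return unicodedata.normalize(form, text) == text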


-- ?!ng

From jimjjewett at gmail.com  Wed Jun  6 03:47:40 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 21:47:40 -0400
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <Pine.LNX.4.58.0706052012300.7196@server1.LFW.org>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
	<4665B4CF.2050107@v.loewis.de>
	<Pine.LNX.4.58.0706052012300.7196@server1.LFW.org>
Message-ID: <fb6fbf560706051847n3043b18cg1bb4c25b28d85e65@mail.gmail.com>

On 6/5/07, Ka-Ping Yee <python at zesty.ca> wrote:
> > > G.  Should source code be required to be in normalized
> > > form?
...
> To your earlier question of "what about non-UTF-8 files", I
> imagine that the normalization restriction would apply to the
> decoded characters.  That is, once you know the source code
> encoding, there's a one-to-one mapping between the
> sequence of bytes in the source file and the sequence of
> characters to be parsed.

One of the unicode goals is that a given sequence of bytes in the
source encoding will round-trip to a corresponding sequence of bytes
in unicode.  But that corresponding sequence will not always be in
Normal form; normalization may prevent an (unchanged) round-trip.
Even when users can produce the "correct" form, it may not be as easy
to type.
If someone's keyboard easily produces the "wrong" form, I don't want
to give them syntax errors for something that can be automatically
corrected.

> Thus, two references to the same identifier will be
> represented by exactly the same bytes in the source
> file (you can't have different byte sequences in the source
> file alias to the same identifier).

The bytes -- and possibly even the original character -- can still be
different between different files (with different encodings), even if
they reference the same (imported) identifier.  I think (limited,
source) aliasing is something we just have to accept with unicode.  I
believe the best we can do is to say:

    Python will normalize, so if two identifiers are
    canonically equivalent, you won't get any rare
    impossible-to-debug inequality showing as an
    AttributeError.

Ideally, that "canonical equivalence" would extend to strings (or at
least be done automatically before hashing).
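
Today the two canonically equivalent spellings of "o with diaeresis"
already hash differently, which is exactly the sort of rare,
impossible-to-debug inequality I mean:

    >>> hash(u'\u00f6') == hash(u'o\u0308')
    False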

Ideally, either that equivalence would also include compatibility, or
else characters whose compatibility and canonical equivalents are
different would be banned for use in identifiers.

-jJ

From showell30 at yahoo.com  Wed Jun  6 03:49:59 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 5 Jun 2007 18:49:59 -0700 (PDT)
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <Pine.LNX.4.58.0706052012300.7196@server1.LFW.org>
Message-ID: <738986.24163.qm@web33513.mail.mud.yahoo.com>


--- Ka-Ping Yee <python at zesty.ca> wrote:

>  
> > > B. Should the default behaviour accept only
> ASCII identifiers, or
> > >    should it accept identifiers containing
> non-ASCII characters?
> >
> > Added as an open issue.
> [...]

Martin, I hope you close out this issue, and just take
a firm, explicit stance that PEP 3131 accepts
non-ascii identifiers as the default, even though I'm
60/40 against it.  Guido has already posted some
comments that suggest that he is behind the already
implicit idea from the PEP that unicode would be the
default.

Then I would change the open issue to be how best to
address ascii users who want to revert to an
ascii-only mode.  A simple environment variable like
ASCII_ONLY would do the trick.


> 
> C asks "should there be an on/off switch"; B asks
> whether the
> default should be on or off.
> 
> > > D. Should the identifier character set be
> configurable?
> >
> > Still seems to be the same open issue.
> 
> D asks "should you be able to select which character
> set you want",
> which is finer-grained than an all-or-nothing
> switch.
> 

I agree with the importance of this distinction.  

For example, in my American corporate day job, on
question (B), I'm 90/10 on ascii-only, and (D) is a
total non-issue to me, because at least in the short
term, I could probably deal with the few
Unicode-identifierified modules that I ever needed
using some kind of very coarse workaround.

In a more international context, such as trying to get
more international users for some open source app I'd
written, I'd be 90/10 on unicode-tolerance, and (D)
would be much more of an important issue for me,
because it could affect usability of the app.

(B) and (D) really address two different classes of
users, and I think both groups could reasonably
include a lot of opponents to PEP 3131 as currently
written.

Cheers,

Steve

P.S.  Martin, thanks for adding the objections to the
PEP.  I really think it's good to have it for the
records.  Maybe five years from now, we'll look
back on it and wonder what the heck we were thinking.
:)

From showell30 at yahoo.com  Wed Jun  6 04:24:18 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 5 Jun 2007 19:24:18 -0700 (PDT)
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <fb6fbf560706051847n3043b18cg1bb4c25b28d85e65@mail.gmail.com>
Message-ID: <461528.7704.qm@web33514.mail.mud.yahoo.com>


--- Jim Jewett <jimjjewett at gmail.com> wrote:
> 
> Ideally, either that equivalence would also include
> compatibility, or
> else characters whose compatibility and canonical
> equivalents are
> different would be banned for use in identifiers.
> 

Current Python has the precedent that color/colour
are treated as two separate identifiers, as are
metre/meter, despite the equivalence of "o" to "ou"
and "re" to "er," and I don't think that burns too
many people.  So I'm +1 on the unquoted third option,
that canonically equivalent, but differently encoded,
Unicode characters are allowed yet treated as
different.

Am I stretching the analogy too far?

From stephen at xemacs.org  Wed Jun  6 05:01:10 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 06 Jun 2007 12:01:10 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <f52584c00706050406o63dc9427ub4b7cae7a2451391@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706050406o63dc9427ub4b7cae7a2451391@mail.gmail.com>
Message-ID: <876461yefd.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
 > > I'd love to get rid of full-width ASCII and halfwidth kana (via
 > > compatibility decomposition).
 > 
 > If you do forbid compatibility characters in identifiers, then they
 > should be flagged as an error, not converted silently.

No.  The point is that people want to use their current tools; they
may not be able to easily specify normalization.  We should provide
tools to pick this lint from programs, but the normalization should be
done inside of Python, not by the user.

Please look through the list (I've already done so; I'm speaking from
detailed examination of the data) and state what compatibility
characters you want to keep.

On reflection, I would make an exception for LATIN L WITH MIDDLE DOT
(both cases); just don't decompose it for the sake of Catalan.  (And
there possibly should be a warning for L followed by MIDDLE DOT.)  But
as a native English speaker and one who lectures and deals with the
bureaucracy in Japanese, I can tell you unequivocally I want the fi
and ffi ligatures and full-width ASCII compatibility decomposed, and
as a daily user of several Japanese input methods, I can tell you it
would be a massive pain in the ass if Python doesn't convert those,
and errors would be an on-the-minute-every-minute annoyance.

 > Unicode, and adding extra equivalences (whether it's "FoO" == "foo",
 > "??" == "??" or "Ａ１２３" == "A123") is surprising.

How many Japanese documents do you deal with on a daily basis?  I live
with the half-width kana and full-width ASCII every day, and they are
simply an annoyance to me and to everybody I know.  They are treated
as font variants, not different characters, by *all* users.  Users are
quite happy to substitute ultra-wide ASCII fonts for JIS X 0208 ASCII,
or ultra-condensed fonts for JIS X 0201 kana.

Japanese don't expect equivalence, but that's because it's too much
effort for the programmers when nobody is asking for it; the users are
unsophisticated and don't demand it.  But where equivalence is
provided on web forms and the like, people are indeed surprised, they
are *impressed*.  "Wow!  Gaijin magic!  How'd he *do* that?!"  They
*hate* the fact that some forms want the postal code entered in JIS X
0208 full-width digits while others want ASCII (and I've even seen a
form that expected the address, including the yuubin mark, to be in
full-width JIS, but the postal code itself, embedded in the address,
had to be entered in ASCII or the form couldn't parse it).

 > In short, I would like this function to return 'OK' or be a
 > syntax error, but it should not fail or return something else:
 > 
 > def test():
 >     if 'A' == 'Ａ': return 'OK'
 >     A = 'O'
 >     Ａ = 'K' # as tested above, 'A' and 'Ａ' are not the same thing
 >     return locals()['A']+locals()['Ａ']

I would like this code to return "KK".  This might be an unpleasant
surprise, once, and there would need to be a warning on the box for
distribution in Japan (and other cultures with compatibility
decompositions).

On the other hand, diffusion of non-ASCII identifiers at best will be
moderately paced; people will have to learn about usage and will have
time to get used to it.


From stephen at xemacs.org  Wed Jun  6 05:44:59 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 06 Jun 2007 12:44:59 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <fb6fbf560706051014x749d53cci566ad4ad8da54dfc@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<466595C5.6070301@v.loewis.de>
	<fb6fbf560706051014x749d53cci566ad4ad8da54dfc@mail.gmail.com>
Message-ID: <874pllycec.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > > Not sure what the proposal is here. If people say "we want the PEP do
 > > NFKC", I understand that as "instead of saying NFC, it should say
 > > NFKC", which in turn means "all identifiers are converted into the
 > > normal form NFKC while parsing".
 > 
 > I would prefer that.

+1

 > > With that change, the full-width ASCII characters would still be
 > > allowed in source - they just wouldn't be different from the regular
 > > ones anymore when comparing identifiers.
 > 
 > I *think* that would be OK;

+1

For the case of Japanese compatibility characters, this would make it
much easier to teach use of non-ASCII identifiers ("sensei, sensei, do
I use full-width numbers or half-width numbers?"  "Whatever you like,
kid, whatever you like."), and eliminate a common source of typos for
neophytes and experienced typists alike.

Rauli Ruohonen disagrees pretty strongly.  While I suspect I have a
substantial edge over Rauli in experience with daily use of Japanese,
that worries me.  I will be polling my students (for "would you be
more interested in learning Python if ...") and my more or less
able-to-program colleagues.

BTW -- Martin, what about numeric tokens?  I don't expect ideographic
numbers to be translated to decimal, but if full-width "ABC123" is
decomposed to halfwidth as an identifier, I think Japanese will expect
a literal full-width "123" to be recognized as the decimal number 123
(and similarly for e notation for floating point).  I really think
this should be in the scope of this PEP.  (Feel free to count it as a
reason against NFKC, if that simplifies things for you.)
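
For what it's worth, the runtime conversions already accept
full-width digits, because those code points carry decimal values in
the Unicode character database; it is only the lexer that rejects
them today:

    >>> int(u'\uff11\uff12\uff13')    # full-width "123"
    123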

 > so long as they mean the same thing, it is just a quirk like using
 > a different font.  I am slightly concerned that it might mean
 > "string as string" and "string as identifier" have different tests
 > for equality.

It does mean that; see Rauli's code.  Does anybody know if this
bothers LISP users, where identifiers are case-insensitive?  (My Emacs
LISP experience is useless, since identifiers are case-sensitive.)

We will need (possibly external) tools to warn about such
decompositions, and a sophisticated tool should warn about accesses to
identifier dictionaries in the presence of such decompositions as
well.

 > > Another option would be to require that the source is in NFKC already,
 > > where I then ask again what precisely that means in presence of
 > > non-UTF source encodings.

I don't think this is a good idea.

NB: if there's substantial resistance from users of some of the other
classes of compatibility characters, I have an acceptable fallback.
NFC plus external tools to audit for NFKC would be usable, and for the
character sets I'm likely to encounter, it would be well-defined for
the usual encodings.


From turnbull at sk.tsukuba.ac.jp  Wed Jun  6 06:19:33 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Wed, 06 Jun 2007 13:19:33 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <fb6fbf560706051010u5d904e98kb34ca50599fc1087@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560706051010u5d904e98kb34ca50599fc1087@mail.gmail.com>
Message-ID: <873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
 > 
 > > It seems to me that what UAX#31 is saying is "Distinguishing (or not)
 > > between 0035 DIGIT 3 and 2075 SUPERSCRIPT 3 should be
 > > equivalent to distinguishing (or not) between LATIN CAPITAL
 > > LETTER A and LATIN SMALL LETTER A."  I don't know that
 > > I agree (or disagree) in principle.
 > 
 > So effectively, they consider "a" and "A" to be presentational variants.

Well, no, they're pretty explicit that they have semantic content, as
do superscripts.  This is different from the Arabic initial, medial,
and final forms, ligatures, the Croatian digraphs, and the Japanese
double-byte ASCII, where there is no semantic content (not even word
division for Arabic AFAIK), use is just required by "the rules" (for
Arabic) or is 100% at the discretion of the user (ASCII variants).

 > In some languages, certain presentational variants are used depending
 > on word position.  I think the ID_START property does exclude letters
 > that cannot appear in an initial position, but putting a final
 > character in the middle or vice versa would still be wrong.

Good point.  I'm going to interview some Arabic speakers who I believe
have some programming skills; I'll add that to the list.

 > If identifiers are built up in the equivalent of
 > 
 >     handler="do_" + name

I think this is pretty likely, and one of the attractions of languages
like Python.

 > The folding rules do say that it is OK  (even good) to exclude certain
 > characters from certain foldings; I think we could preserve case
 > (including title-case?) as the only presentational variant we
 > recognize.

AFAICS from looking at the V2 table, case is an *analogy* used by
UAX#31 to clarify when NFKC is useful.  NFKC itself does not fold
case; it is considered appropriate if you have a language that folds
case anyway.

 > http://www.unicode.org/versions/corrigendum3.html suggests that many
 > of the Hangul are either pronunciation guide variants or even exact
 > duplicates (that were presumably missed when the canonicalization was
 > frozen?)

I'll have to ask some Koreans what they would use.

 > """It is recommended that all Arabic presentation forms be excluded
 > from identifiers in any event, although only a few of them must be
 > excluded for normalization to guarantee identifier closure."""

Cool.  I'll ask that, too.

 > Depends on what you mean by technical symbols.

Eg, the letterlike symbols (DEGREE CELSIUS), the number forms (ROMAN
NUMERAL ONE), and the APL set (2336--237A) in the BMP.  [[ I really
need to put together some tools to access that database from
XEmacs.... ]]

 > IMHO, many of them are in fact listed as ID characters.  The math
 > versions (generally 1D400 - 1DC7B) are included.  But
 > http://unicode.org/reports/tr39/data/xidmodifications.txt suggests
 > excluding them again.

I'm not really worried about people using characters outside the BMP
very often, any more than people use an embedded comma in LISP
identifiers or file names (eg RCS ,v), unless they use a script lately
admitted to Unicode, or if they just wish to tempt the wrath of the
gods.  The former will not have a problem, and the latter can look out
for themselves, I'm sure.


From stephen at xemacs.org  Wed Jun  6 06:41:28 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 06 Jun 2007 13:41:28 +0900
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4665B7D7.6030501@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
	<46646E76.8060804@v.loewis.de>
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>
	<4665189D.4020301@v.loewis.de>
	<fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>
	<46659A37.4000900@v.loewis.de>
	<fb6fbf560706051148s4338ddb1pbf1e8db0f9793c7a@mail.gmail.com>
	<4665B7D7.6030501@v.loewis.de>
Message-ID: <871wgpy9s7.fsf@uwakimon.sk.tsukuba.ac.jp>

"Martin v. L?wis" writes:

 > > TR 15, section 19, numbered paragraph 3
 > > """
 > > Higher-level processes that transform or compare strings, or that
 > > perform other higher-level functions, must respect canonical
 > > equivalence or problems will result.
 > > """
 > 
 > That's not a mandatory requirement, but an "important aspect". Also,
 > it applies to "higher-level processes"; I would expect that string
 > comparison is not a higher-level function. Indeed, UAX#15 only
 > gives definitions, no rules.

In the language of these standards, I would expect that string
comparison is exactly the kind of higher-level process they have in
mind.  In fact, it is given as an example in what Jim quoted above.

 > > C9 A process shall not assume that the interpretations of two
 > > canonical-equivalent character sequences are distinct.
 > 
 > Right. What is "a process"?

Anything that accepts Unicode on input or produces it on output, and
claims to conform to the standard.

 > > ...
 > > Ideally, an implementation would always interpret two
 > > canonical-equivalent character sequences identically. There are
 > > practical circumstances under which implementations may reasonably
 > > distinguish them.
 > > """
 > 
 > So it should be the application's choice.

I don't think so.  I think the kind of practical circumstance they
have in mind is (eg) a Unicode document which is PGP-signed.  PGP
clearly will not be able to verify a canonicalized document, unless it
happened to be in canonical form when transmitted.  But I think it is
quite clear that they do not admit that an implementation might return
False when evaluating u"L\u00F6wis" == u"Lo\u0308wis".

 > So this *allows* to canonicalize strings, it doesn't *require* Python
 > to do so. Indeed, doing so would be fairly expensive, and therefore
 > it should not be done (IMO).

It would be much more expensive to make all string comparisons grok
canonical equivalence.  That's why it *allows* canonicalization.
Otherwise the PGP signature case would suggest that canonicalization
should be forbidden (except where that is part of the definition of
the process), and canonical equivalencing be done at the site of each
comparison.

You are correct that this is outside the scope of PEP 3131, but I
don't want your interpretation of "Unicode conformance" (which I
believe to be incorrect) to go unchallenged.

From martin at v.loewis.de  Wed Jun  6 07:15:21 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 06 Jun 2007 07:15:21 +0200
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <fb6fbf560706051618g2fbf4cfemaf7f87170fd69743@mail.gmail.com>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>	
	<4665B4CF.2050107@v.loewis.de>
	<fb6fbf560706051618g2fbf4cfemaf7f87170fd69743@mail.gmail.com>
Message-ID: <466642E9.1020505@v.loewis.de>

> I think "obvious" referred to the reasoning, not the outcome.
> 
> I can tell that the decision was "NFC, anything goes", but I don't see why.

I think I'm repeating myself: Because UAX 31 says so. That's it. There
is a standard that experts in the domain have specified, and PEP 3131
follows it. Following standards is a good thing, deviating from them
is a bad thing.

> (2)
> I cannot understand why ID_START/CONTINUE was chosen instead of the
> newer and more recommended XID_START/CONTINUE.  From UAX31 section 2:
> """
> The XID_Start and XID_Continue properties are improved lexical classes
> that incorporate the changes described in Section 5.1, NFKC
> Modifications. They are recommended for most purposes, especially for
> security, over the original ID_Start and ID_Continue properties.
> """

Right. I read it as meaning that these should be used when 5.1 is considered
in the language. This, in turn, should be used when the
normalization form is NFKC:

"""
Where programming languages are using NFKC to fold differences between
characters, they need the following modifications of the identifier
syntax from the Unicode Standard to deal with the idiosyncrasies of a
small number of characters. These modifications are reflected in the
XID_Start and XID_Continue properties.
"""

As the PEP does not use NFKC (currently), it should not use XID_Start
and XID_Continue either.

> Nor can I understand why the additional restrictions in
> xidmodifications (from TR39) were ignored. 

Consideration of UTR 39 is listed as an open issue. One problem
with it is that using it would restrict the language over time,
so that previously correct programs might not be correct anymore
in a future version. So using it might break backwards
compatibility.

Regards,
Martin


From stephen at xemacs.org  Wed Jun  6 07:28:36 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 06 Jun 2007 14:28:36 +0900
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <461528.7704.qm@web33514.mail.mud.yahoo.com>
References: <fb6fbf560706051847n3043b18cg1bb4c25b28d85e65@mail.gmail.com>
	<461528.7704.qm@web33514.mail.mud.yahoo.com>
Message-ID: <87zm3dwt17.fsf@uwakimon.sk.tsukuba.ac.jp>

Steve Howell writes:

 > So I'm +1 on the unquoted third option, that canonically
 > equivalent, but differently encoded, Unicode characters are allowed
 > yet treated as different.
 > 
 > Am I stretching the analogy too far?

Yes.  By definition, that is nonconformant to the standard.
Canonically equivalent sequences are *identical characters* in
Unicode.  The difference you are talking about is equivalent to the
differences among "7", "07", and "0x7" as C numeric literals.  They
look different, but their semantics is identical in the program.

Pragmatically, if you have an editor which normally produces NFD, and
another which normally produces NFC, those programs will not be
link-compatible under your proposal, yet both editors will present the
user with identical displays.


From rauli.ruohonen at gmail.com  Wed Jun  6 09:09:43 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Wed, 6 Jun 2007 10:09:43 +0300
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <876461yefd.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706050406o63dc9427ub4b7cae7a2451391@mail.gmail.com>
	<876461yefd.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706060009s40e051cat1141a95d89bf1199@mail.gmail.com>

On 6/6/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> No.  The point is that people want to use their current tools; they
> may not be able to easily specify normalization.

> Please look through the list (I've already done so; I'm speaking from
> detailed examination of the data) and state what compatibility
> characters you want to keep.

I cannot really say about code points I'm not familiar with, but I
wouldn't use any of the ones I do know in identifiers. The only
compatibility characters in ID_Continue I have used myself are,
I think, halfwidth katakana and fullwidth alphanumerics. Examples:

? -> ? # halfwidth katakana
ｘ -> x # fullwidth alphabetic
１ -> 1 # fullwidth numeric

Practically speaking I won't be using such things in my code. I don't
like them but if it's more pragmatic to allow them then I guess it can't
be helped.

There are some cases where users might in the future want to make
a distinction between "compatibility" characters, such as these:
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
If some day everyone writes their TeX using such things, then it'd make
sense to allow and distinguish them in Python, too. For this reason
I think that compatibility transformation, if any, should only be
applied to characters where there's a practical reason to do so, and for
other cases punting (=syntax error) is safest. When in doubt, refuse
the temptation to guess.

> as a daily user of several Japanese input methods, I can tell you it
> would be a massive pain in the ass if Python doesn't convert those,
> and errors would be an on-the-minute-every-minute annoyance.

I use two Japanese input methods (MS IME and scim/anthy), but only the
latter one daily. When I type text that mixes Japanese and other
languages, I switch the input mode off when not typing Japanese. For
code that uses a lot of Japanese this may not be convenient, but then
you'd want to set your input method to use ASCII for ASCII anyway,
as that would still be required in literals (???? or "?" won't
work) and punctuation (??????????????? won't work).
A code mixing fullwidth and halfwidth alphanumerics also looks
horrible, but that's just a coding style issue :-)

>  > Unicode, and adding extra equivalences (whether it's "FoO" == "foo",
>  > "??" ==
> "??" or "????" == "A123") is surprising.
>
> How many Japanese documents do you deal with on a daily basis?

Much fewer than you, as I don't live in Japan. I read a fair amount
but don't type long texts in Japanese. When I do type, I usually use
fullwidth alphanumerics except for foreign words that aren't acronyms.
E.g. ??? but not ????????. For code, consistently using
ASCII for ASCII would be the most predictable rule (TOOWTDI).

You have to go out of your way to type halfwidth katakana, and it
isn't really useful in identifiers IMHO.

> They are treated as font variants, not different characters, by *all*
> users.

I think programmers in general expect identifier identity to behave the
same way as string identity. In this way they are a special class of
users. (those who use case-insensitive programming languages have
all my sympathy :-)

> I would like this code to return "KK".  This might be an unpleasant
> surprise, once, and there would need to be a warning on the box for
> distribution in Japan (and other cultures with compatibility
> decompositions).

This won't have a big impact if you apply it only to carefully
selected code points, and that way it sounds like a viable choice. Asking
your students for input as you suggested is surely a good idea.

From hfuerstenau at gmx.net  Wed Jun  6 08:01:04 2007
From: hfuerstenau at gmx.net (=?ISO-8859-1?Q?Hagen_F=FCrstenau?=)
Date: Wed, 06 Jun 2007 08:01:04 +0200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>	<4664E238.9020700@v.loewis.de>	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560706051010u5d904e98kb34ca50599fc1087@mail.gmail.com>
	<873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <46664DA0.9070307@gmx.net>

Stephen J. Turnbull writes:

>  > http://www.unicode.org/versions/corrigendum3.html suggests that many
>  > of the Hangul are either pronunciation guide variants or even exact
>  > duplicates (that were presumably missed when the canonicalization was
>  > frozen?)
> 
> I'll have to ask some Koreans what they would use.

The Windows Korean Input Method chooses between Unified Han and
Compatibility characters based on the reading you use to enter them. So
I guess most Koreans won't be aware of what variant they're using at any
given moment. Seems to me that NFKC would be essential here.



From stephen at xemacs.org  Wed Jun  6 10:26:33 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 06 Jun 2007 17:26:33 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <f52584c00706060009s40e051cat1141a95d89bf1199@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706050406o63dc9427ub4b7cae7a2451391@mail.gmail.com>
	<876461yefd.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706060009s40e051cat1141a95d89bf1199@mail.gmail.com>
Message-ID: <87tztlwksm.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > There are some cases where users might in the future want to make
 > a distinction between "compatibility" characters, such as these:
 > http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols

I don't think they belong in identifiers in a general purpose
programming language, though their usefulness to mathematical printers
is obvious.  I think programs should be verbalizable, unlike math
where most of the text is not intended to correspond to any reality,
but is purely syntactic transformation.

 > For this reason I think that compatibility transformation, if any,
 > should only be applied to characters where there's a practical
 > reason to do so, and for other cases punting (=syntax error) is
 > safest.

"Banzai Python!" and all that, but even if Python is in use 10,000
years from now, I think compatibility characters will still be a
YAGNI.  I admit that's a reasonable compromise, and allows future
extension without gratuitously making existing programs illegal; I
could live with it very easily (but I'd want those full-width ASCII
decomposed :-).  I just feel it would be wiser to limit Python
identifiers to NFKC.

 > I use two Japanese input methods (MS IME and scim/anthy), but only the
 > latter one daily. When I type text that mixes Japanese and other
 > For code that uses a lot of Japanese this may not be convenient,
 > but then you'd want to set your input method to use ASCII for ASCII
 > anyway,

Both of those address the issue of the annoyance of syntax errors in
original code to a great extent, but not in debug/maintenance mode
where you only type a few characters of code at a time, and typically
enter from user mode.

 > You have to go out of your way to type halfwidth katakana, and it
 > isn't really useful in identifiers IMHO.

I agree, but then I don't work for the Japanese Social Security
Administration.


From rauli.ruohonen at gmail.com  Wed Jun  6 10:50:08 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Wed, 6 Jun 2007 11:50:08 +0300
Subject: [Python-3000] String comparison
Message-ID: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>

(Martin's right, it's not good to discuss this in the huge PEP 3131
thread, so I'm changing the subject line)

On 6/6/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> In the language of these standards, I would expect that string
> comparison is exactly the kind of higher-level process they have in
> mind.  In fact, it is given as an example in what Jim quoted above.
>
>  > > C9 A process shall not assume that the interpretations of two
>  > > canonical-equivalent character sequences are distinct.
>  >
>  > Right. What is "a process"?
>
> Anything that accepts Unicode on input or produces it on output, and
> claims to conform to the standard.

Strings are internal to Python. This is a whole separate issue from
normalization of source code or its parts (such as identifiers). Once
you have read in a text file and done the normalizations you want to,
what you have left is an internal representation in memory, which
may be anything that's convenient to the programmer. The question is,
what is convenient to the programmer?

>  > > Ideally, an implementation would always interpret two
>  > > canonical-equivalent character sequences identically. There are
>  >
>  > So it should be the application's choice.
>
> I don't think so.  I think the kind of practical circumstance they
> have in mind is (eg) a Unicode document which is PGP-signed.  PGP
> clearly will not be able to verify a canonicalized document, unless it
> happened to be in canonical form when transmitted.  But I think it is
> quite clear that they do not admit that an implementation might return
> False when evaluating u"L\u00F6wis" == u"Lo\u0308wis".

It is up to Python to define what "==" means, just like it defines
what "is" means. It may be canonical equivalence for strings, but
then again it may not. It depends on what you want, and what you
think strings are. If you think they're sequences of code points,
which is what they act like in general (assuming UTF-32 was selected
at compile time), then bitwise comparison is quite consistent whether
the string is in normalized form or not.

Handling strings as sequences of code points is the most general and
simple thing to do, but there are other options. One is to simply
change comparison to be collation (and presumably also make regexp
matching and methods like startswith consistent with that). Another is
to always keep strings in a specific normalized form. Yet another is to
have another type for strings-as-grapheme-sequences, which would
strictly follow user expectations for characters (= graphemes), such as
string length and indexing, comparison, etc.

Changing just the comparison has the drawback that many current string
invariants break. a == b would no longer imply any of len(a) == len(b),
set(a) == set(b), a[i:j] == b[i:j], repr(a) == repr(b). You'd also
have to use bytes for any processing of code point sequences (such as
XML processing), because most operations would act as if you had
normalized your strings (including dictionary and set operations), and
if you have to do contortions to avoid problems with that, then it's
easier to just use bytes. There would also be security implications with
strings comparing equal but not always quite acting equal.
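
A minimal illustration, with the usual pair of canonically equivalent
spellings: if == reported these equal, the invariants above would
silently break.

    >>> a, b = u'\u00f6', u'o\u0308'
    >>> len(a), len(b)
    (1, 2)
    >>> set(a) == set(b)
    False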

Always doing normalization would still force you to use bytes for
processing code point sequences (e.g. XML, which must not be
normalized), which is not nice. It's also not nice to force a particular
normalization on the programmer, as another one may be better for some
uses. E.g. an editor may be simpler to implement if everything is
consistently decomposed (NFD), but for most communication you'd want to
use NFC, as you would for many other processing (e.g. the "code point ==
grapheme" equation is perfectly adequate for many purposes with NFC,
but not with NFD).

Having a type for grapheme sequences would seem like the least
problematic choice, but there's little demand for such a type.
Most intelligent Unicode processing doesn't use a grapheme
representation for performance reasons, and in most other cases the
"code point == grapheme" equation or treatment of strings as atoms is
adequate. The standard library might provide this type if necessary.

>  > So this *allows* to canonicalize strings, it doesn't *require* Python
>  > to do so. Indeed, doing so would be fairly expensive, and therefore
>  > it should not be done (IMO).
>
> It would be much more expensive to make all string comparisons grok
> canonical equivalence.  That's why it *allows* canonicalization.

FWIW, I don't buy that normalization is expensive, as most strings are
in NFC form anyway, and there are fast checks for that (see UAX#15,
"Detecting Normalization Forms"). Python does not currently have
a fast path for this, but if it's added, then normalizing everything
to NFC should be fast.
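
A slow but simple stand-in for such a check (the real UAX#15
quick-check consults per-character Quick_Check properties instead of
renormalizing) might look like this:

    import unicodedata

    def is_nfc(s):
        # Naive version: renormalize and compare.  A real fast path
        # would scan for characters whose NFC_Quick_Check property
        # is anything other than 'Yes'.
        return unicodedata.normalize('NFC', s) == s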

From turnbull at sk.tsukuba.ac.jp  Wed Jun  6 14:33:19 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Wed, 06 Jun 2007 21:33:19 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
Message-ID: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > Strings are internal to Python. This is a whole separate issue from
 > normalization of source code or its parts (such as identifiers).

Agreed.  But please note that we're not talking about representation.
We're talking about the result of evaluating a comparison:

    if u"L\u00F6wis" == u"Lo\u0308wis":
        print "Python is Unicode conforming in this respect."
    else:
        print "I guess it's time to start learning Ruby."

I think it's reasonable to be astonished if Python doesn't at least
try to print "Python is Unicode conforming in this respect." for the
above snippet by default.

 > It is up to Python to define what "==" means, just like it defines
 > what "is" means.

You are of course correct.  However, if given that u prefix Python
chooses to define == in a way that does not respect canonical
equivalence, what's the point of having these things?  

 > Always doing normalization would still force you to use bytes for
 > processing code point sequences (e.g. XML, which must not be
 > normalized), which is not nice.

I'm not talking about "nice" yet, just about Unicode conformance.  How
to implement conformant behavior is of course entirely up to Python.
As is choosing *whether* to conform or not, but it seems bizarre to me
that one might choose to implement UAX#31 verbatim, and also have
u"L\u00F6wis" == u"Lo\u0308wis" evaluate to False.

 > FWIW, I don't buy that normalization is expensive, as most strings are
 > in NFC form anyway, and there are fast checks for that (see UAX#15,
 > "Detecting Normalization Forms"). Python does not currently have
 > a fast path for this, but if it's added, then normalizing everything
 > to NFC should be fast.

If O(n) is "fast".


From jcarlson at uci.edu  Wed Jun  6 17:57:45 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 06 Jun 2007 08:57:45 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20070606084543.6F3D.JCARLSON@uci.edu>


"Stephen J. Turnbull" <turnbull at sk.tsukuba.ac.jp> wrote:
> Rauli Ruohonen writes:
> 
>  > Strings are internal to Python. This is a whole separate issue from
>  > normalization of source code or its parts (such as identifiers).
> 
> Agreed.  But please note that we're not talking about representation.
> We're talking about the result of evaluating a comparison:
> 
>     if u"L\u00F6wis" == u"Lo\u0308wis":
>         print "Python is Unicode conforming in this respect."
>     else:
>         print "I guess it's time to start learning Ruby."
> 
> I think it's reasonable to be astonished if Python doesn't at least
> try to print "Python is Unicode conforming in this respect." for the
> above snippet by default.
> 
>  > It is up to Python to define what "==" means, just like it defines
>  > what "is" means.
> 
> You are of course correct.  However, if given that u prefix Python
> chooses to define == in a way that does not respect canonical
> equivalence, what's the point of having these things?  

Maybe I'm missing something, but it seems to me that there might be a
simple solution.  Don't normalize any identifiers or strings.

Hear me out for a moment.  People type what they want.  Isn't that the
whole point of PEP 3131? If they don't know what they want, then that is
as much a problem with display/representation as anything else that we
have discussed.  Any of the flagging methods could easily disable things
like u"o\u0308" for identifiers to force them to be in the "one true
form" to begin with.

As for strings, I think we should opt for keeping it as simple as
possible.  Compare by code points.  To handle normalization issues, add
a normalization method that people call if they care about normalized
unicode strings*.

If at some point we think that normalization should happen on
identifiers by default, all we need to do is to call st.normalize() on
any string that is used for getattr, and/or could use a subclass of dict
to make it happen automatically.
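
st.normalize() is hypothetical here (the current spelling is
unicodedata.normalize('NFC', s)), but a rough sketch of the
dict-subclass idea might look like this:

    import unicodedata

    class NormalizingDict(dict):
        # Sketch: NFC-normalize unicode keys on the way in and out.
        @staticmethod
        def _norm(key):
            if isinstance(key, unicode):
                return unicodedata.normalize('NFC', key)
            return key
        def __setitem__(self, key, value):
            dict.__setitem__(self, self._norm(key), value)
        def __getitem__(self, key):
            return dict.__getitem__(self, self._norm(key))
        def __contains__(self, key):
            return dict.__contains__(self, self._norm(key))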


 - Josiah

* Or leave out normalization altogether in 3.0.  I haven't heard any
complaints about the lack of normalization in Python so far (though
maybe I'm not reading the right python-list messages), and Python has
had unicode for what, almost 10 years now?


From guido at python.org  Wed Jun  6 18:46:13 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 09:46:13 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <ca471dc20706060946m48865311oa1ce59101dbcde33@mail.gmail.com>

On 6/6/07, Stephen J. Turnbull <turnbull at sk.tsukuba.ac.jp> wrote:
> Rauli Ruohonen writes:
>
>  > Strings are internal to Python. This is a whole separate issue from
>  > normalization of source code or its parts (such as identifiers).
>
> Agreed.  But please note that we're not talking about representation.
> We're talking about the result of evaluating a comparison:
>
>     if u"L\u00F6wis" == u"Lo\u0308wis":
>         print "Python is Unicode conforming in this respect."
>     else:
>         print "I guess it's time to start learning Ruby."
>
> I think it's reasonable to be astonished if Python doesn't at least
> try to print "Python is Unicode conforming in this respect." for the
> above snippet by default.

Alas, you will remain astonished for a long time, and you're welcome
to try Ruby instead. I'm all for adding a way to do normalized string
comparisons to the library. But I'm not about to change the ==
operator to apply normalization first. It would affect too much (e.g.
hashing).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com  Wed Jun  6 19:12:56 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 6 Jun 2007 10:12:56 PDT
Subject: [Python-3000] String comparison
In-Reply-To: <20070606084543.6F3D.JCARLSON@uci.edu> 
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
Message-ID: <07Jun6.101305pdt."57996"@synergy1.parc.xerox.com>

> Hear me out for a moment.  People type what they want.

I do a lot of Pythonic processing of UTF-8, which is not "typed by
people", but instead extracted from documents by automated processing.
Text is also data -- an important thing to keep in mind.

As far as normalization goes, I agree with you about identifiers, and
I use "unicodedata.normalize" extensively in the cases where I care
about normalization of data strings.  The big issue is string literals.
I think I agree with Stephen here:

    u"L\u00F6wis" == u"Lo\u0308wis"

should be True (assuming he typed it correctly in the first place :-),
because they are the same Unicode string.  I don't understand Guido's
objection here -- it's a lexer issue, right?  The underlying character
string will still be the same in both cases.

But it's complicated.  Clearly we expect

    (u"abc" + u"def") == (u"a" + u"bcdef")

to be True, so

    (u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")

should also be True.  Where I see difficulty is

    (u"L" + unchr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")

I suppose unichr(0x0308) should raise an exception -- a combining
diacritic by itself shouldn't be convertible to a character.

Bill



From guido at python.org  Wed Jun  6 19:37:47 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 10:37:47 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <-6248387165431892706@unknownmsgid>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
Message-ID: <ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>

On 6/6/07, Bill Janssen <janssen at parc.com> wrote:
> > Hear me out for a moment.  People type what they want.
>
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.
>
> As far as normalization goes, I agree with you about identifiers, and
> I use "unicodedata.normalize" extensively in the cases where I care
> about normalization of data strings.  The big issue is string literals.
> I think I agree with Stephen here:
>
>     u"L\u00F6wis" == u"Lo\u0308wis"
>
> should be True (assuming he typed it correctly in the first place :-),
> because they are the same Unicode string.  I don't understand Guido's
> objection here -- it's a lexer issue, right?  The underlying character
> string will still be the same in both cases.

So let me explain it. I see two different sequences of code points:
'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
'w', 'i', 's' on the other. Never mind that Unicode has semantics that
claim they are equivalent. They are two different sequences of code
points. We should not hide that Python's unicode string object can
store each sequence of code points equally well, and that when viewed
as a sequence they are different: the first has len() == 5, the second
has len() == 6! When read from a file they are different. Why should
the lexer apply normalization to literals behind my back? I might be
writing either literal with the expectation to get exactly that
sequence of code points, in order to use it as a test case or as input
for another program that requires specific input.
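
Concretely:

    >>> len(u"L\u00F6wis"), len(u"Lo\u0308wis")
    (5, 6)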

> But it's complicated.  Clearly we expect
>
>     (u"abc" + u"def") == (u"a" + u"bcdef")
>
> to be True, so
>
>     (u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")
>
> should also be True.  Where I see difficulty is
>
>     (u"L" + unchr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")
>
> I suppose unichr(0x0308) should raise an exception -- a combining
> diacritic by itself shouldn't be convertible to a character.

There's a simpler solution. The unicode (or str, in Py3k) data type
represents a sequence of code points, not a sequence of characters.
This has always been the case, and will continue to be the case.

Note that I'm not arguing against normalization of *identifiers*. I
see that as a necessity. I also see that there will be border cases
where getattr(x, 'XXX') and x.XXX are not equivalent for some values
of XXX where the normalized form is a different sequence of code
points. But I don't believe the solution should be to normalize all
string literals. Clearly we will have a normalization routine so the
lexer can normalize identifiers, so if you need normalized data it is
as simple as writing 'XXX'.normalize() (or whatever the spelling
should be).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rauli.ruohonen at gmail.com  Wed Jun  6 20:18:53 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Wed, 6 Jun 2007 21:18:53 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
Message-ID: <f52584c00706061118s1e017432n67ebdaba86448fc0@mail.gmail.com>

On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> Why should the lexer apply normalization to literals behind my back?

The lexer shouldn't, but NFC normalizing the source before the lexer
sees it would be slightly more robust and standards-compliant. This is
because technically an editor or any other program is allowed by the
Unicode standard to apply any normalization or other canonical
equivalent replacement it sees fit, and other programs aren't supposed
to care. The standard even says that such differences should be rendered
in an indistinguishable way. Practically everyone uses NFC, though.

> There's a simpler solution. The unicode (or str, in Py3k) data type
> represents a sequence of code points, not a sequence of characters.
> This has always been the case, and will continue to be the case.

This is how Java and ICU (http://www.icu-project.org/) do it, too.
The latter is a library specifically designed for processing Unicode
text. Both Java and ICU are even mentioned in the Unicode FAQ.

> Clearly we will have a normalization routine so the lexer can
> normalize identifiers, so if you need normalized data it is
> as simple as writing 'XXX'.normalize() (or whatever the spelling
> should be).

The routine is at the moment at unicodedata.normalize.
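
For example:

    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'Lo\u0308wis') == u'L\u00f6wis'
    True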

From martin at v.loewis.de  Wed Jun  6 20:21:10 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 06 Jun 2007 20:21:10 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
Message-ID: <4666FB16.2070209@v.loewis.de>

> FWIW, I don't buy that normalization is expensive, as most strings are
> in NFC form anyway, and there are fast checks for that (see UAX#15,
> "Detecting Normalization Forms"). Python does not currently have
> a fast path for this, but if it's added, then normalizing everything
> to NFC should be fast.

That would be useful to have, anyway. Would you like to contribute it?

Regards,
Martin

From martin at v.loewis.de  Wed Jun  6 20:26:01 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 06 Jun 2007 20:26:01 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>	<20070606084543.6F3D.JCARLSON@uci.edu>	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
Message-ID: <4666FC39.8050307@v.loewis.de>

Guido van Rossum schrieb:
> Clearly we will have a normalization routine so the
> lexer can normalize identifiers, so if you need normalized data it is
> as simple as writing 'XXX'.normalize() (or whatever the spelling
> should be).

It's actually in Python already, and spelled as
unicodedata.normalize("NFC", 'XXX')

Regards,
Martin

From stephen at xemacs.org  Wed Jun  6 20:41:28 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 07 Jun 2007 03:41:28 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706060946m48865311oa1ce59101dbcde33@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706060946m48865311oa1ce59101dbcde33@mail.gmail.com>
Message-ID: <87myzdgc2v.fsf@uwakimon.sk.tsukuba.ac.jp>

Guido van Rossum writes:

 > But I'm not about to change the == operator to apply normalization
 > first. It would affect too much (e.g. hashing).

Yah, that's one reason why Jim Jewett and I lean to normalizing on the
way in for explicitly Unicode data.  But since that's not going to
happen, I guess the thing is to get cracking on that library just in
case there's some help that Python itself could give.


From martin at v.loewis.de  Wed Jun  6 20:44:27 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 06 Jun 2007 20:44:27 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <87myzdgc2v.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>	<ca471dc20706060946m48865311oa1ce59101dbcde33@mail.gmail.com>
	<87myzdgc2v.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4667008B.5080302@v.loewis.de>

>  > But I'm not about to change the == operator to apply normalization
>  > first. It would affect too much (e.g. hashing).
> 
> Yah, that's one reason why Jim Jewett and I lean to normalizing on the
> way in for explicitly Unicode data.  But since that's not going to
> happen, I guess the thing is to get cracking on that library just in
> case there's some help that Python itself could give.

There are issues with that as well. Concatenation would need to perform
normalization, and then len(a+b) <> len(a)+len(b), for some a and b.

Regards,
Martin

From guido at python.org  Wed Jun  6 20:47:20 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 11:47:20 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706061118s1e017432n67ebdaba86448fc0@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<f52584c00706061118s1e017432n67ebdaba86448fc0@mail.gmail.com>
Message-ID: <ca471dc20706061147l26f08ca7y3ee3ca78fd2e8933@mail.gmail.com>

On 6/6/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> > Why should the lexer apply normalization to literals behind my back?
>
> The lexer shouldn't, but NFC normalizing the source before the lexer
> sees it would be slightly more robust and standards-compliant.

I have no opinion on this, but NFC normalizing the source shouldn't
affect the use of \u.... in string literals. Remember, Python's \u is
very different from \u in Java (where it happens before the lexer
starts tokenizing). Python's \u is more like \x, only valid in string
literals.
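
A quick illustration of the difference:

    >>> len(u'\u00f6')     # \u is an escape inside a unicode literal
    1
    >>> len('\u00f6')      # in a plain str it is six literal characters
    6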

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jcarlson at uci.edu  Wed Jun  6 22:05:37 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 06 Jun 2007 13:05:37 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <07Jun6.101305pdt."57996"@synergy1.parc.xerox.com>
References: <20070606084543.6F3D.JCARLSON@uci.edu>
	<07Jun6.101305pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <20070606125328.6F40.JCARLSON@uci.edu>


Bill Janssen <janssen at parc.com> wrote:
> 
> > Hear me out for a moment.  People type what they want.
> 
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.

Right, but (and this is a big but), you are reading data in from a file. 
That is different from source code identifiers and embedded strings.  If
you *want* normalization to happen on your data, that is perfectly
reasonable, and you can do so (Explicit is better than implicit?).  But
if someone didn't want normalization, and Python did it anyways, then
there would be an error that passed silently.


> As far as normalization goes, I agree with you about identifiers, and
> I use "unicodedata.normalize" extensively in the cases where I care
> about normalization of data strings.  The big issue is string literals.
> I think I agree with Stephen here:
> 
>     u"L\u00F6wis" == u"Lo\u0308wis"
> 
> should be True (assuming he typed it correctly in the first place :-),
> because they are the same Unicode string.  I don't understand Guido's
> objection here -- it's a lexer issue, right?  The underlying character
> string will still be the same in both cases.

It's the unicode character versus code point issue.  I personally prefer
code points, as a code point approach does exactly what I want it to do
by default; nothing.  If it *does* something without me asking, then
that would seem to be magic to me, and I'm a minimal magic kind of guy.

 - Josiah


From guido at python.org  Wed Jun  6 23:31:17 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 14:31:17 -0700
Subject: [Python-3000] [Python-Dev] PEP 367: New Super
In-Reply-To: <20070531170734.273393A40AA@sparrow.telecommunity.com>
References: <001101c79aa7$eb26c130$0201a8c0@mshome.net>
	<002d01c79f6d$ce090de0$0201a8c0@mshome.net>
	<ca471dc20705260708t952d820w7473474554c9469b@mail.gmail.com>
	<003f01c79fd9$66948ec0$0201a8c0@mshome.net>
	<ca471dc20705270259ke665af6v3b5bdbffbd926330@mail.gmail.com>
	<009c01c7a04f$7e348460$0201a8c0@mshome.net>
	<ca471dc20705270550j5e199624xd4e8f6caa9dda93d@mail.gmail.com>
	<ca471dc20705281937y48300821u840add9d5454e8d9@mail.gmail.com>
	<ca471dc20705310448p5c5cfeds41fdc75e05c21f55@mail.gmail.com>
	<20070531170734.273393A40AA@sparrow.telecommunity.com>
Message-ID: <ca471dc20706061431i3de7914bq14307fe7bc4f7ba7@mail.gmail.com>

On 5/31/07, Phillip J. Eby <pje at telecommunity.com> wrote:
> At 07:48 PM 5/31/2007 +0800, Guido van Rossum wrote:
> >I've updated the patch; the latest version now contains the grammar
> >and compiler changes needed to make super a keyword and to
> >automatically add a required parameter 'super' when super is used.
> >This requires the latest p3yk branch (r55692 or higher).
> >
> >Comments anyone? What do people think of the change of semantics for
> >the im_class field of bound (and unbound) methods?
>
> Please correct me if I'm wrong, but just looking at the patch it
> seems to me that the descriptor protocol is being changed as well --
> i.e., the 'type' argument is now the found-in-type in the case of an
> instance __get__ as well as class __get__.
>
> It would seem to me that this change would break classmethods both on
> the instance and class level, since the 'cls' argument is supposed to
> be the derived class, not the class where the method was
> defined.  There also don't seem to be any tests for the use of super
> in classmethods.

I've now gotten a new patch out, based on a completely different
approach. (I'm afraid I didn't quite get your suggestion, so this is
original work.)

It creates a cell named __class__ which is shared between all methods
defined in a particular class, and initialized to the class object
(before class decorators are applied). Only a small change is made to
super(): instead of making it a keyword, it can be invoked as a
function without arguments, and then it digs around in the frame to
find the __class__ cell and the first argument, and uses those as its
arguments. Example:

class B:
  def foo(self): return 'B'

class C(B):
  def foo(self): return 'C' + super().foo()

C().foo() will return 'CB'. The notation super() is equivalent to
super(C, self) or super(__class__, self). It works for class methods
too.
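
For example, a classmethod version (a sketch of the intended
behavior, not code from the patch):

class B:
  @classmethod
  def cm(cls): return cls.__name__

class C(B):
  @classmethod
  def cm(cls): return 'C' + super().cm()

C.cm() would return 'CC': super() finds the B.cm implementation,
while cls remains C, just as with an explicit super(C, cls).cm().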

I realize this is a deviation from the PEP: you need to call
super().foo() instead of super.foo(). Looking at the examples I find
that quite acceptable; in hindsight making super a keyword smelled a
bit too magical. (Yes, I know I've been flip-flopping a lot on this
issue. Working code is convincing. :-)

This __class__ variable can also be used explicitly (thereby
implementing 33% of PEP 3130):

class C:
  def f(self): print(__class__)

C().f()

I wonder if this may meet the needs for your PEP 3124? In
particular, earlier on, you wrote:

> Btw, PEP 3124 needs a way to receive the same class object at more or
> less the same moment, although in the form of a callback rather than
> a cell assignment.  Guido suggested I co-ordinate with you to design
> a mechanism for this.

Is this relevant at all?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com  Wed Jun  6 23:43:15 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 17:43:15 -0400
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <461528.7704.qm@web33514.mail.mud.yahoo.com>
References: <fb6fbf560706051847n3043b18cg1bb4c25b28d85e65@mail.gmail.com>
	<461528.7704.qm@web33514.mail.mud.yahoo.com>
Message-ID: <fb6fbf560706061443x7fd000ffse77d215992bd7e14@mail.gmail.com>

On 6/5/07, Steve Howell <showell30 at yahoo.com> wrote:
>
> --- Jim Jewett <jimjjewett at gmail.com> wrote:

> > Ideally, either that equivalence would also include
> > compatibility, or
> > else characters whose compatibility and canonical
> > equivalents are
> > different would be banned for use in identifiers.

> Current Python has the precedent that color/colour
> are treated as two separate identifiers, as are
> metre/meter, despite the equivalence of "o" to "ou"
> and "re" to "er," and I don't think that burns too
> many people.  So I'm +1 on the unquoted third option,
> that canonically equivalent, but differently encoded,
> Unicode characters are allowed yet treated as
> different.

> Am I stretching the analogy too far?

I think so.  As best I can judge,  "color/colour" is arguably a
compatibility equivalent, but is not a canonical equivalent.

A better analogy for canonical equivalence would be "color" typed on a
PC vs "color" typed on an old EBCDIC mainframe terminal.  In that
particular case, I think the re-encoding to unicode would be able to
use the same code points, but that "mostly invisible; might need it
for a round-trip" level of difference is the sort of thing expressed
by different code points with canonical equivalence.

-jJ

From baptiste13 at altern.org  Thu Jun  7 00:01:26 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Thu, 07 Jun 2007 00:01:26 +0200
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP
	3131)
In-Reply-To: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
References: <fb6fbf560706031827u4c13687t280f80d785c05d83@mail.gmail.com>
	<740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
Message-ID: <f47arp$cs9$1@sea.gmane.org>

Björn Lindqvist wrote:
>> Those most eager for unicode identifiers are afraid that people
>> (particularly beginning students) won't be able to use local-script
>> identifiers, unless it is the default.  My feeling is that the teacher
>> (or the person who pointed them to python) can change the default on a
>> per-install basis, since it can be a one-time change.
> 
> What if the person discovers Python by him/herself?
> 
Don't people read the (funky:-) manual any more? More seriously, they will
probably read some tutorials in that case. Also, the error message could
advertise the feature, as in:

SyntaxError: if you really want to use unicode identifiers, call python with -U

Also, think of it from the other side: the person who discovers python by
him/herself and reads no manuals won't know that you should avoid unicode
identifiers in code you later want to distribute, or that there can be security
issues.

>> On the other hand, if "anything from *any* script" becomes the
>> default, even on a single widespread distribution, then the community
>> starts to splinter in a new way.  It starts to separate between people
>> who distribute source code (generally ASCII) and people who are
>> effectively distributing binaries (not for human end-users to read).
> 
> That is FUD.
> 
definitely not. Big open source projects will of course do the right thing, but
the smaller ones? I doubt it. Think of all those little apps on the cheeseshop
which get updated every other year. Do you really think all of them run a test
suite?

>>> ... Java, ... don't hear constant complaints
>> They aren't actually a problem because they aren't used; they aren't
>> used because almost no one knows about them.  Python would presumably
>> advertise the feature, and see more use.  (We shouldn't add it at all
>> *unless* we expect much more usage than unicode IDs have seen in other
>> programming languages.)
> 
> Every Swedish book I've read about Java (only 2) mentioned that feature.
> 
cool, then everybody reading Swedish tutorials on python will also learn about
the feature, even if it's not the default!

>> The same one-step-at-a-time reasoning applies to unicode identifers.
>> Allowing IDs in your native language (or others that you explicitly
>> approve) is probably a good step.  Allowing IDs in *any* language by
>> default is probably going too far.
> 
> If you set different native languages won't you get the exact same
> problems that codepages caused and that unicode was invented to solve?
> 
nope, because you do not reuse the same coding for different characters in
different languages. You just turn languages (scripts, in fact) on or off.

Cheers,
BC


From jimjjewett at gmail.com  Thu Jun  7 00:22:17 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 18:22:17 -0400
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <466642E9.1020505@v.loewis.de>
References: <Pine.LNX.4.58.0706040607550.7196@server1.LFW.org>
	<4665B4CF.2050107@v.loewis.de>
	<fb6fbf560706051618g2fbf4cfemaf7f87170fd69743@mail.gmail.com>
	<466642E9.1020505@v.loewis.de>
Message-ID: <fb6fbf560706061522o5e4acd38sd69d78ced8cb76c4@mail.gmail.com>

On 6/6/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > I think "obvious" referred to the reasoning, not the outcome.

> > I can tell that the decision was "NFC, anything goes", but I don't see why.

> I think I'm repeating myself: Because UAX 31 says so. That's it. There
> is a standard that experts in the domain have specified, and PEP 3131
> follows it. Following standards is a good thing, deviating from them
> is a bad thing.

I think we are reading UAX31 very differently.

If it is (or even seems) ambiguous, then we need to specify our interpretation.

> > (2)
> > I cannot understand why ID_START/CONTINUE was chosen instead of the
> > newer and more recommended XID_START/CONTINUE.  From UAX31 section 2:
> > """
> > The XID_Start and XID_Continue properties are improved lexical classes
> > that incorporate the changes described in Section 5.1, NFKC
> > Modifications. They are recommended for most purposes, especially for
> > security, over the original ID_Start and ID_Continue properties.
> > """

> Right. I read it that these should be used when 5.1 is considered
> in the language. This, in turn, should be used when the
> normalization form is NFKC:

I read that as

XID is almost always better.  XID is better for security in
particular, but also better for other things.  And as an extra bonus,
XID even already takes care of some 5.1 issues for you.

And my personal opinion is that those 5.1 issues are not really
restricted to NFKC.  Other normalization forms won't get syntactic
errors over them, but the results could still be nonsense.

Issue 1 is that Catalan treats 0xB7 (MIDDLE DOT) as a character instead
of as punctuation.  The unicode recommendation (*required* only for NFKC,
but already supported by XID, since it is recommended) says "OK, it
isn't syntax or whitespace, and it is a character sometimes in
practice, so we'll allow it."

Issue 2 says "Technically these are characters, but they should never
be used to start a word, so don't start an identifier with them
anyhow."  If you're not using NFKC, you *can* just ignore the problem
(and produce garbage), but you probably shouldn't.  XID takes care of
it for you.  (At least for these characters.)

Issue 3 says "OK, these characters don't work with NFKC -- but you
shouldn't be using them anyhow."  It even says explicitly that

    "It is recommended that all Arabic presentation
    forms be excluded from identifiers in any event"

Note that neither ID nor XID actually remove all the Arabic
presentation forms, despite this clear recommendation.  Technically,
they are characters, and *could* be processed.  XID removes the ones
that break NFKC, and xidmodifications removes some more (hopefully,
all the rest, but I haven't verified that).

> """
> Where programming languages are using NFKC to fold differences between
> characters, they need the following modifications of the identifier
> syntax from the Unicode Standard to deal with the idiosyncrasies of a
> small number of characters. These modifications are reflected in the
> XID_Start and XID_Continue properties.
> """

> As the PEP does not use NFKC (currently), it should not use XID_Start
> and XID_Continue either.

I read that as "If you are using NFKC, then you need to do some extra
work.  But notice that if you are using the new and improved XID, then
some of this work was already done for you..."

> > Nor can I understand why the additional restrictions in
> > xidmodifications (from TR39) were ignored.

> Consideration of UTR 39 is listed as an open issue. One problem
> with it is that using it would restrict the language over time,
> so that previously correct programs might not be correct anymore
> in a future version. So using it might break backwards
> compatibility.

Then we should start with a more restricted charset, and expand it over time.

The restrictions in xidmodifications are not remotely sufficient for
security, even now.  (Doing that would require restricting some
characters that are actually needed in some languages.)

Instead, xidmodifications represents (a mechanically determined subset
of) characters that can be removed cheaply, because they shouldn't be
used in identifiers anyhow.

-jJ

From guido at python.org  Thu Jun  7 00:57:23 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 15:57:23 -0700
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
Message-ID: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>

A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
367 (new super) and PEP 344 (exception chaining). Are there any
others? I propose that we renumber these to numbers in the 3100+
range. I can see two forms of renaming:

(a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number

(b) just use the next available number

Preferences?

What other PEPs should be renumbered?

Should we renumber at all?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From collinw at gmail.com  Thu Jun  7 01:00:24 2007
From: collinw at gmail.com (Collin Winter)
Date: Wed, 6 Jun 2007 16:00:24 -0700
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
References: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
Message-ID: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>

On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> 367 (new super) and PEP 344 (exception chaining). Are there any
> others? I propose that we renumber these to numbers in the 3100+
> range. I can see two forms of renaming:
>
> (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
>
> (b) just use the next available number
>
> Preferences?
>
> What other PEPs should be renumbered?
>
> Should we renumber at all?

Renumbering, +1; using the next 31xx number, +1.

Collin Winter

From jimjjewett at gmail.com  Thu Jun  7 01:06:05 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 19:06:05 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <fb6fbf560706061606g73fe279di2843add1c48969ea@mail.gmail.com>

On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> A scan of the full table for Unicode Version 2.0 ...  'n (Afrikaans), and

I asked a friend who speaks Afrikaans; apparently it is more a word
than a letter.

"""
ʼn is derived from the Dutch word en which means "a" in English. The `
is in place of the e e.g. a woman would translate into "ʼn vrou" It is
used very often as it is an indefinite article. SMS language usually
just uses the n without the apostrophe.
""" -- Tania Adendorff

So it is common, but losing it is already sort of acceptable.  And
that is the strongest endorsement we have seen.

(There were mixed opinions on Technical symbols, and no one has spoken
up yet about the half-dozen Croatian digraphs corresponding to Serbian
Cyrillic.)

There is legitimate disagreement over whether to

(1)  forbid the Kompatibility characters in IDs
(2)  translate them to the canonical equivalents,
(3)  or just leave them alone because ID= should be the same as string=,

but I think dealing with K characters is now a "least of evils"
decision, instead of "we need them for something."

On another note, I have no idea how Martin's name (in the Cc line) ended up as:

"""
 L$(D+S(Bwis"
"""

If I knew, it *might* have a bearing on what sorts of
canonicalizations should be performed, and what sorts of warnings the
parser ought to emit for likely corrupted text.

-jJ

From jimjjewett at gmail.com  Thu Jun  7 01:29:09 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 19:29:09 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560706051010u5d904e98kb34ca50599fc1087@mail.gmail.com>
	<873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <fb6fbf560706061629k74e0f5e2tf917786be9d6ffe1@mail.gmail.com>

On 6/6/07, Stephen J. Turnbull <turnbull at sk.tsukuba.ac.jp> wrote:
> Jim Jewett writes:

>  > Depends on what you mean by technical symbols.  ... The math
>  > versions (generally 1D400 - 1DC7B) are included.  But
>  > http://unicode.org/reports/tr39/data/xidmodifications.txt suggests
>  > excluding them again.

> Eg, the letterlike symbols (DEGREE CELSIUS),

not an ID character

> the number forms (ROMAN NUMERAL ONE),

an ID_START (a letter), not excluded even by xidmodifications.
No canonical equivalent.
Will be turned into the regular ASCII letters (only) by Kompatibility
canonicalization.

> and the APL set (2336--237A) in the BMP.

not ID characters

-jJ

From jimjjewett at gmail.com  Thu Jun  7 02:09:50 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 20:09:50 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <fb6fbf560706061709x6e43cdbbvc1fa5bd300239aa@mail.gmail.com>

On 6/6/07, Stephen J. Turnbull <turnbull at sk.tsukuba.ac.jp> wrote:
> Rauli Ruohonen writes:

>  > FWIW, I don't buy that normalization is expensive, as most strings are
>  > in NFC form anyway, and there are fast checks for that (see UAX#15,
>  > "Detecting Normalization Forms"). Python does not currently have
>  > a fast path for this, but if it's added, then normalizing everything
>  > to NFC should be fast.

> If O(n) is "fast".

Normalize before hashing; then it becomes O(1) for the remaining uses.
The hash is already O(N), and most literals already end up being
interned, which requires hashing.
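
A sketch of what that could look like (a hypothetical type, not
anything that exists today):

    import unicodedata

    class NFCString(str):
        # Pay the O(n) normalization cost once, at construction time;
        # hashing and == then remain plain code-point operations.
        def __new__(cls, s):
            return str.__new__(cls, unicodedata.normalize('NFC', s))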

-jJ

From jimjjewett at gmail.com  Thu Jun  7 02:38:51 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 20:38:51 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
Message-ID: <fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>

On 6/6/07, Guido van Rossum <guido at python.org> wrote:

> > about normalization of data strings.  The big issue is string literals.
> > I think I agree with Stephen here:

> >     u"L\u00F6wis" == u"Lo\u0308wis"

> > should be True (assuming he typed it correctly in the first place :-),
> > because they are the same Unicode string.

> So let me explain it. I see two different sequences of code points:
> 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> claim they are equivalent.

Your (conforming) editor can silently replace one with the other.
A second editor can silently use one, and not replace the other.
==> Uncontrollable, invisible bugs.

> They are two different sequences of code points.

So "str" is about bytes, rather than text?
and bytes is also about bytes; it just happens to be mutable?

Then what was the point of switching to unicode?  Why not just say
"When printed, a string will be interpreted as if it were UTF-8" and
be done with it?

> We should not hide that Python's unicode string object can
> store each sequence of code points equally well, and that when viewed
> as a sequence they are different: the first has len() == 5, the second
> has len() == 6!

For a bytes object, that is true.  For unicode text, they shouldn't be
different -- at least not by the time a user can see it (or measure
it).

> I might be writing either literal with the expectation to get exactly that
> sequence of code points,

Then you are assuming non-conformance with unicode, which requires you
not to depend on that distinction.  You should have used bytes, rather
than text.

http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)

C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.

> Note that I'm not arguing against normalization of *identifiers*. I
> see that as a necessity. I also see that there will be border cases
> where getattr(x, 'XXX') and x.XXX are not equivalent for some values
> of XXX where the normalized form is a different sequence of code
> points. But I don't believe the solution should be to normalize all
> string literals.

For strings created by an extension module, that would be valid.  But
python source code is human-readable text, and should be treated that
way.  Either follow the unicode rules (at least for strings), or don't
call them unicode.

-jJ

From guido at python.org  Thu Jun  7 02:47:38 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 17:47:38 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
Message-ID: <ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>

On 6/6/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 6/6/07, Guido van Rossum <guido at python.org> wrote:
>
> > > about normalization of data strings.  The big issue is string literals.
> > > I think I agree with Stephen here:
>
> > >     u"L\u00F6wis" == u"Lo\u0308wis"
>
> > > should be True (assuming he typed it correctly in the first place :-),
> > > because they are the same Unicode string.
>
> > So let me explain it. I see two different sequences of code points:
> > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > claim they are equivalent.
>
> Your (conforming) editor can silently replace one with the other.

No it cannot. We are talking about \u escapes, not about a string
literal containing Unicode characters ("Löwis").

> A second editor can silently use one, and not replace the other.
> ==> Uncontrollable, invisible bugs.

No. Seems you're again not reading before posting. :-(

> > They are two different sequences of code points.
>
> So "str" is about bytes, rather than text?
> and bytes is also about bytes; it just happens to be mutable?

Bytes are not code points. The unicode string type has always been
about code points, not characters.

> Then what was the point of switching to unicode?  Why not just say
> "When printed, a string will be interpreted as if it were UTF-8" and
> be done with it?

Manipulating code points is a lot more convenient than manipulating UTF-8.
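
For instance (a quick illustration in py3k terms):

>>> s = 'L\u00F6wis'
>>> s[1]                   # indexing code points gives a character
'ö'
>>> s.encode('utf-8')[1]   # indexing the UTF-8 bytes gives a bare lead byte
195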

> > We should not hide that Python's unicode string object can
> > store each sequence of code points equally well, and that when viewed
> > as a sequence they are different: the first has len() == 5, the second
> > has len() == 6!
>
> For a bytes object, that is true.  For unicode text, they shouldn't be
> different -- at least not by the time a user can see it (or measure
> it).

Have you ever even used the unicode string type in Python 2?

> > I might be writing either literal with the expectation to get exactly that
> > sequence of code points,
>
> Then you are assuming non-conformance with unicode, which requires you
> not to depend on that distinction.  You should have used bytes, rather
> than text.

Again, bytes != code points.

> http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)
>
> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

That is surely contained inside all sorts of weasel words that allow
us to define a "normalized equivalence" function that works that way,
and leave the "==" operator for arrays of code points alone.

> > Note that I'm not arguing against normalization of *identifiers*. I
> > see that as a necessity. I also see that there will be border cases
> > where getattr(x, 'XXX') and x.XXX are not equivalent for some values
> > of XXX where the normalized form is a different sequence of code
> > points. But I don't believe the solution should be to normalize all
> > string literals.
>
> For strings created by an extension module, that would be valid.  But
> python source code is human-readable text, and should be treated that
> way.  Either follow the unicode rules (at least for strings), or don't
> call them unicode.

Again, did you realize that the example was about \u escapes?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com  Thu Jun  7 02:49:09 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 20:49:09 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706061147l26f08ca7y3ee3ca78fd2e8933@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<f52584c00706061118s1e017432n67ebdaba86448fc0@mail.gmail.com>
	<ca471dc20706061147l26f08ca7y3ee3ca78fd2e8933@mail.gmail.com>
Message-ID: <fb6fbf560706061749w1360bb23q35bb595d483f91bf@mail.gmail.com>

On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> On 6/6/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> > On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> > > Why should the lexer apply normalization to literals behind my back?

> > The lexer shouldn't, but NFC normalizing the source before the lexer
> > sees it would be slightly more robust and standards-compliant.

> I have no opinion on this, but NFC normalizing the source shouldn't
> affect the use of \u.... in string literals.

Agreed; normalization of the source should apply only to the code
points as written; the code sequence <0x5c, 0x75> ("\u") normalizes to
itself.  If there is a \u in a string, it will still be there after
normalization, before Python lexes.  If there is a \u outside a string,
it will still be there to cause syntax errors.
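
Concretely (illustrative):

    >>> import unicodedata
    >>> src = 'u"L\\u00F6wis"'   # source text: backslash-u is literal here
    >>> unicodedata.normalize('NFC', src) == src
    True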

-jJ

From jimjjewett at gmail.com  Thu Jun  7 03:15:57 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 21:15:57 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
Message-ID: <fb6fbf560706061815s3a67ed38m3c3d2812df09e566@mail.gmail.com>

On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> On 6/6/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> >
> > > > about normalization of data strings.  The big issue is string literals.
> > > > I think I agree with Stephen here:

> > > >     u"L\u00F6wis" == u"Lo\u0308wis"

> > > > should be True (assuming he typed it correctly in the first place :-),
> > > > because they are the same Unicode string.

> > > So let me explain it. I see two different sequences of code points:
> > > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > > claim they are equivalent.

> > Your (conforming) editor can silently replace one with the other.

> No it cannot. We are talking about \u escapes, not about a string
> literal containing Unicode characters ("L?wis").

ahh... my apologies.  I was interpreting the \u as a way of showing
the bytes in email.  I discarded the interpretation you are using
because that would require a sequence of 10 or 11 code points, rather
than the 5 or 6 you mentioned.

Python lexes it into a shorter string (just as it lexes 1.0 into a
number) at a conceptually later time.  Those later strings should
compare equal according to unicode, but I agree that you no longer
need to worry about editors introducing bugs.  (And I even agree that
this may be a valid case for ignoring the recommendation; if someone has
been explicit by writing out 6 characters to represent one, they
probably meant it.)

-jJ

From shiblon at gmail.com  Thu Jun  7 03:21:58 2007
From: shiblon at gmail.com (Chris Monson)
Date: Wed, 6 Jun 2007 21:21:58 -0400
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
References: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
	<43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
Message-ID: <da3f900e0706061821r3c39616ag8e935e96f7b3b9f2@mail.gmail.com>

On 6/6/07, Collin Winter <collinw at gmail.com> wrote:
>
> On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> > A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> > 367 (new super) and PEP 344 (exception chaining). Are there any
> > others? I propose that we renumber these to numbers in the 3100+
> > range. I can see two forms of renaming:
> >
> > (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
> >
> > (b) just use the next available number
> >
> > Preferences?
> >
> > What other PEPs should be renumbered?
> >
> > Should we renumber at all?
>
> Renumbering, +1; using the next 31xx number, +1.


Renumbering +1
Leaving (old PEP number) in place as a stripped down PEP that just points to
the new number: +1

> Collin Winter

From greg.ewing at canterbury.ac.nz  Thu Jun  7 03:38:07 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 07 Jun 2007 13:38:07 +1200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
	<46646E76.8060804@v.loewis.de>
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>
	<4665189D.4020301@v.loewis.de>
	<fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>
Message-ID: <4667617F.5060807@canterbury.ac.nz>

Jim Jewett wrote:
> Since we don't want the results of (str1 == str2) to change based on
> context, I think string equality also needs to look at canonicalized
> (though probably not compatibility) forms.

Are you suggesting that this should be done on the fly
when comparing strings? Or that all strings should be
stored in canonicalised form?

I can see some big cans of worms being opened up by
either approach. Surprising results could include
things like s1 == s2 but len(s1) <> len(s2), or
len(s1 + s2) <> len(s1) + len(s2).

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From janssen at parc.com  Thu Jun  7 03:57:40 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 6 Jun 2007 18:57:40 PDT
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com> 
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
Message-ID: <07Jun6.185746pdt."57996"@synergy1.parc.xerox.com>

> So let me explain it. I see two different sequences of code points:
> 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> claim they are equivalent. They are two different sequences of code
> points.

If they were sequences of integers, or sequences of bytes, I'd agree
with you.  But they are explicitly sequences of characters, not
sequences of codepoints.  There should be one internal normalized form
for strings.

> We should not hide that Python's unicode string object can
> store each sequence of code points equally well, and that when viewed
> as a sequence they are different: the first has len() == 5, the second
> has len() == 6!

We should definitely not expose that difference!

> When read from a file they are different.

A file is in UTF-8, or UTF-2, or whatever -- it contains a string
coerced to a sequence of bits.  Whatever reads that file should in
fact either preserve that sequence of bytes (in which case it's not a
string), or coerce it to a Unicode string, in which case the file
representation is immaterial and the Python normalized form is used
internally.

> I might be
> writing either literal with the expectation to get exactly that
> sequence of code points, in order to use it as a test case or as input
> for another program that requires specific input.

In that case you should write it as a sequence of integers, because
that's what you're dealing with.

> There's a simpler solution. The unicode (or str, in Py3k) data type
> represents a sequence of code points, not a sequence of characters.
> This has always been the case, and will continue to be the case.

Bad idea, IMO.

Bill

From janssen at parc.com  Thu Jun  7 03:59:52 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 6 Jun 2007 18:59:52 PDT
Subject: [Python-3000] String comparison
In-Reply-To: <20070606125328.6F40.JCARLSON@uci.edu> 
References: <20070606084543.6F3D.JCARLSON@uci.edu>
	<07Jun6.101305pdt."57996"@synergy1.parc.xerox.com>
	<20070606125328.6F40.JCARLSON@uci.edu>
Message-ID: <07Jun6.190001pdt."57996"@synergy1.parc.xerox.com>

> But
> if someone didn't want normalization, and Python did it anyways, then
> there would be an error that passed silently.

Then they'd read it as bytes, and do the processing themselves
explicitly (actually, what I do).
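
Something like this (a sketch of that explicit style; the filename is
made up):

    import unicodedata

    with open('data.txt', 'rb') as f:
        raw = f.read()                    # bytes, exactly as stored
    text = raw.decode('utf-8')            # code points, unnormalized
    text = unicodedata.normalize('NFC', text)   # explicit, opt-in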

> It's the unicode character versus code point issue.  I personally prefer
> code points, as a code point approach does exactly what I want it to do
> by default; nothing.  If it *does* something without me asking, then
> that would seem to be magic to me, and I'm a minimal magic kind of guy.

Strings are not code point sequences, which are available anyway for
people who want them as tuples of integer values.

Bill

From tjreedy at udel.edu  Thu Jun  7 04:00:07 2007
From: tjreedy at udel.edu (Terry Reedy)
Date: Wed, 6 Jun 2007 22:00:07 -0400
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
References: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com><43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
	<da3f900e0706061821r3c39616ag8e935e96f7b3b9f2@mail.gmail.com>
Message-ID: <f47or6$ha4$1@sea.gmane.org>


"Chris Monson" <shiblon at gmail.com> wrote in message | Leaving (old PEP 
number) in place as a stripped down PEP that just points to
| the new number: +1

Good idea.  And new number = next available.  Special PEP numbers should be 
for special PEPs.

tjr




From janssen at parc.com  Thu Jun  7 04:19:56 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 6 Jun 2007 19:19:56 PDT
Subject: [Python-3000] String comparison
In-Reply-To: <07Jun6.185746pdt."57996"@synergy1.parc.xerox.com> 
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<07Jun6.185746pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <07Jun6.191956pdt."57996"@synergy1.parc.xerox.com>

I wrote:
> Guido wrote:
> > So let me explain it. I see two different sequences of code points:
> > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > claim they are equivalent. They are two different sequences of code
> > points.
> 
> If they were sequences of integers, or sequences of bytes, I'd agree
> with you.  But they are explicitly sequences of characters, not
> sequences of codepoints.  There should be one internal normalized form
> for strings.

I meant to say that *strings* are explicitly sequences of characters,
not codepoints.  So both sequences of codepoints should collapse to
the same *string* when they are turned into a string.  While the two
sequences of codepoints should not compare equal, the strings formed
from them should compare equal.

I also believe that the literal form '\u0308' should generate a compile
error.  It's a valid Unicode codepoint, sure, but not a valid string.

 string((ord('L'), 0xF6, ord('w'), ord('i'), ord('s'))) ==
 string((ord('L'), ord('o'), 0x308, ord('w'), ord('i'), ord('s')))
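
In runnable terms, that hypothetical string() might look like this (a
sketch; the name and the choice of NFC are my assumptions):

 import unicodedata

 def string(codepoints):
     return unicodedata.normalize('NFC', ''.join(map(chr, codepoints)))

The two calls above then produce equal strings, since NFC composes
'o' + U+0308 into U+00F6.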

Bill

From greg.ewing at canterbury.ac.nz  Thu Jun  7 04:31:38 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 07 Jun 2007 14:31:38 +1200
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <461528.7704.qm@web33514.mail.mud.yahoo.com>
References: <461528.7704.qm@web33514.mail.mud.yahoo.com>
Message-ID: <46676E0A.6020506@canterbury.ac.nz>

Steve Howell wrote:
> Current Python has the precedent that color/colour
> are treated as two separate identifiers,

But there's always a clear visual difference between
"color" and "colour", and your editor is not going
to turn one into the other while you're not looking
(unless you've got some sort of automatic english-
to-american spelling correction, which would be
insane to turn on for editing code).

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Thu Jun  7 04:46:32 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 07 Jun 2007 14:46:32 +1200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <874pllycec.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<466595C5.6070301@v.loewis.de>
	<fb6fbf560706051014x749d53cci566ad4ad8da54dfc@mail.gmail.com>
	<874pllycec.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <46677188.2070809@canterbury.ac.nz>

Stephen J. Turnbull wrote:
> Jim Jewett writes:
 >
>  > I am slightly concerned that it might mean
>  > "string as string" and "string as identifier" have different tests
>  > for equality.
> 
> It does mean that; see Rauli's code.  Does anybody know if this
> bothers LISP users, where identifiers are case-insensitive?

I don't think the issue arises in Lisp, because to use
a string as an identifier you have to explicitly convert
it to a symbol, whereupon there is an opportunity for
case folding, normalisation, etc. to be done.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From nnorwitz at gmail.com  Thu Jun  7 05:18:27 2007
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Wed, 6 Jun 2007 20:18:27 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4665EE44.2010306@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
Message-ID: <ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>

On 6/5/07, Ron Adam <rrr at ronadam.com> wrote:
> Alexandre Vassalotti wrote:
> > On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> >> If "make clean" makes the problem go away, it's usually because there
> >> were old .pyc files with incompatible byte code. We don't change the
> >> .pyc magic number for each change to the compiler.
> >
> > Nope. It is still not working. I just did the following, and I still
> > get the same error.
> >
> >    % make  # run fine
> >    % make  # fail
>
> I can confirm the same behavior.  Works on the first make, same error on
> the second.  I deleted the contents of the branch and did an "svn up" on an
> empty directory.  Same thing.

This probably means there is a problem with marshalling the byte code
out.  The first run compiles the .pyc files.  Theoretically this
writes out the same thing that is in memory.  This isn't always the
case though (ie, when there are bugs).

A work around would be to just remove the .pyc files each time rather
than do a make clean.  Do:

  find . -name '*.pyc' -print0 | xargs -0 rm

Bonus points for finding the bug. :-)

A quick way to test this is to try to round-trip it.  Something like:

>>> s = '''\
... class F:
...   def foo(self, *args):
...     print(self, args)
... '''
>>> code = compile(s, 'foo', 'exec')
>>> import marshal
>>> marshal.loads(marshal.dumps(code)) == code
True

If it doesn't equal True, you found the problem.

n

From rauli.ruohonen at gmail.com  Thu Jun  7 05:32:47 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Thu, 7 Jun 2007 06:32:47 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <4846239003818249252@unknownmsgid>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<4846239003818249252@unknownmsgid>
Message-ID: <f52584c00706062032l6871dd0ag87ea1adce445b3f0@mail.gmail.com>

On 6/7/07, Bill Janssen <janssen at parc.com> wrote:
> I meant to say that *strings* are explicitly sequences of characters,
> not codepoints.

This is false. When you access the contents of a string using the
*sequence* protocol, what you get is code points, not characters
(grapheme clusters). To get those, you have to use a regexp, as
outlined in UAX#29. You could normalize at the same time so you
can do bitwise comparison instead of collation to compare graphemes
the way the user does. If you're going to do all that, then you might
as well implement your own type (which could even be provided by
the standard library).

Note that normalization alone does not produce a sequence of
grapheme clusters, because there aren't precomposed characters for
everything - for full generality you just have to deal with
combining characters.
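
For instance (illustrative):

    >>> import unicodedata
    >>> len(unicodedata.normalize('NFC', 'o\u0308'))  # composes to U+00F6
    1
    >>> len(unicodedata.normalize('NFC', 'q\u0308'))  # no precomposed form
    2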

> I also believe that the literal form '\u0308' should generate a compile
> error.  It's a valid Unicode codepoint, sure, but not a valid string.

Then you wouldn't even be able to iterate over or index strings anymore,
as that could produce such "invalid" strings, which would need to
generate exceptions if you really want to ban them. Or is there a point
in making people type 'o\u0308'[1] instead of '\u0308'?

From brett at python.org  Thu Jun  7 05:47:03 2007
From: brett at python.org (Brett Cannon)
Date: Wed, 6 Jun 2007 20:47:03 -0700
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
References: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
	<43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
Message-ID: <bbaeab100706062047q5235c801s6d37c6dbd63ac3ec@mail.gmail.com>

On 6/6/07, Collin Winter <collinw at gmail.com> wrote:
>
> On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> > A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> > 367 (new super) and PEP 344 (exception chaining). Are there any
> > others? I propose that we renumber these to numbers in the 3100+
> > range. I can see two forms of renaming:
> >
> > (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
> >
> > (b) just use the next available number
> >
> > Preferences?
> >
> > What other PEPs should be renumbered?
> >
> > Should we renumber at all?
>
> Renumbering, +1; using the next 31xx number, +1.



+1 on this vote.

-Brett

From showell30 at yahoo.com  Thu Jun  7 07:00:11 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Wed, 6 Jun 2007 22:00:11 -0700 (PDT)
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
Message-ID: <652940.60544.qm@web33501.mail.mud.yahoo.com>


--- Guido van Rossum <guido at python.org> wrote:
> > http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)
> >
> > C9 A process shall not assume that the interpretations of two
> > canonical-equivalent character sequences are distinct.
>
> That is surely contained inside all sorts of weasel words that allow
> us to define a "normalized equivalence" function that works that way,
> and leave the "==" operator for arrays of code points alone.

Regarding weasel words, my reading of the text below
(particularly the word "Ideally") is that processes
should not make assumptions about other processes, but
C9 is not strict on how processes themselves behave.

'''
C9 A process shall not assume that the interpretations
of two canonical-equivalent character
sequences are distinct.

The implications of this conformance clause are
twofold. First, a process is never
required to give different interpretations to two
different, but canonical-equivalent
character sequences. Second, no process can assume
that another process will make
a distinction between two different, but
canonical-equivalent character sequences.

*Ideally* [emphasis added], an implementation would
always interpret two canonical-equivalent character
sequences identically. There are practical
circumstances under which implementations
may reasonably distinguish them.
'''

I guess you could interpret the following tidbit to
say that Python should never assume that text editors
will distinguish canonical-equivalent sequences, but I
doubt that settles any debate about what Python should
do, and I think I'm stretching the interpretation to
begin with:

'''
Second, no process can assume that another process
will make
a distinction between two different, but
canonical-equivalent character sequences.
'''


       

From nnorwitz at gmail.com  Thu Jun  7 09:16:04 2007
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Thu, 7 Jun 2007 00:16:04 -0700
Subject: [Python-3000] problem with checking whitespace in svn pre-commit
	hook
Message-ID: <ee2a432c0706070016ha67354bi3dd882a69851e4f@mail.gmail.com>

When I originally tried to check in rev 55797, I got this exception:

Traceback (most recent call last):
  File "/data/repos/projects/hooks/checkwhitespace.py", line 50, in ?
    run_app(main)
  File "/usr/lib/python2.3/site-packages/svn/core.py", line 33, in run_app
    return apply(func, (pool,) + args, kw)
  File "/data/repos/projects/hooks/checkwhitespace.py", line 43, in main
    if reindenter.run():
  File "/data/repos/projects/hooks/reindent.py", line 166, in run
    tokenize.tokenize(self.getline, self.tokeneater)
  File "/usr/lib/python2.3/tokenize.py", line 153, in tokenize
    tokenize_loop(readline, tokeneater)
  File "/usr/lib/python2.3/tokenize.py", line 159, in tokenize_loop
    for token_info in generate_tokens(readline):
  File "/usr/lib/python2.3/tokenize.py", line 233, in generate_tokens
    raise TokenError, ("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (315, 0))

I'm guessing this is because tokenization has changed between 2.3 and
3.0.  I didn't have 2.3 on my system to test with.  I ran reindent
prior to committing, but that had no effect (ie, I still got the error).

I disabled the hook so I could check in the change which I'm pretty
sure is normalized.  However, we are likely to have this problem in
the future.  I fixed the script so it shouldn't raise an exception any
longer.  But people will still be prevented from checking in if this
happens again.

I wish I had modified the commit hook *before* checking in so I would
at least know which file caused it.  Oh well, I shouldn't do these
things at the end of the day (or the beginning, depending on how you
look at it). :-)

n

From turnbull at sk.tsukuba.ac.jp  Thu Jun  7 09:34:51 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Thu, 07 Jun 2007 16:34:51 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <20070606084543.6F3D.JCARLSON@uci.edu>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
Message-ID: <87lkewgquc.fsf@uwakimon.sk.tsukuba.ac.jp>

Josiah Carlson writes:

 > Maybe I'm missing something, but it seems to me that there might be a
 > simple solution.  Don't normalize any identifiers or strings.

That's not a solution, that's denying that there's a problem.

 > Hear me out for a moment.  People type what they want.

You're thinking in ASCII terms still, where code points == characters.

With Unicode, what they see is the *single character* they *want*, but
it may be represented by a half-dozen characters in RAM, a different
set of characters in the file, and they may have typed a dozen hard-
to-relate keystrokes to get it (eg, typing a *phonetic prefix* of a
word whose untyped trailing character is the one they want).  And if
everything that handles the text is Unicode conformant, it doesn't
matter!  In that context, just what does "people type what they want"
mean?

By analogy, suppose I want to generate a table (such as Martin's
table-331.html) algorithmically.  Then doesn't it seem reasonable that
the representation might be something like u"\u0041"?  But you know
what that sneaky ol' Python 2.5 does to me if I evaluate it?  It
returns u'A'!  And guess what else?  u"\u0041" == u'A' returns True!
And when I print either of them, I see what I expect: A.

Well, what Unicode-conformant editors are allowed to do with NFD and
NFC (and all the non-normalized forms as well) is quite analogous.
But a conformant process is expected not to distinguish among them,
just as two instances of Python are expected to compare those two
*different* string literals as equal.  Thus it doesn't matter (for
most purposes) what those editors do, just as it doesn't matter
(except as a point of style) how you spell u"A".

 > As for strings, I think we should opt for keeping it as simple as
 > possible.  Compare by code points.

If you normalize on the way in, you can do that *correctly*.  If you
don't ...

 > To handle normalization issues, add a normalization method that
 > people call if they care about normalized unicode strings*.

...you impose the normalization on application programmers who think
of unicode strings as internationalized text (but they aren't! they're
arrays of unsigned shorts), or on module writers who have weak
incentive to get 100% coverage.  Note that these programs don't crash;
they silently give false negatives.  Fixing these bugs *before*
selling the code is hard and expensive; who will care to do it?

Eg, *you*.  You clearly *don't* care in your daily work, even though
you are sincerely trying to understand on python-dev.  But your (quite
proper!) objective is to lower costs to you and your code since YAGNI.
Where *I* need it, I will cross you off my list of acceptable vendors
(of off-the-shelf modules, I can't afford your consulting rates).
Well and good, that's how it *should* work.  But your (off-the-shelf)
modules will possibly see use by the Japanese Social Security
Administration, who have demonstrated quite graphically how little
they care[1]. :-(

Furthermore, there are typically an awful lot of ways that a string
can get into the process, and if you do care, you want to catch them
all.

This is a lot easier to do *in* the Python compiler and interpreter,
which have a limited number of I/O channels, than it will be to do for
a large library of modules, not all of which even exist at this date.

 > * Or leave out normalization all together in 3.0 .  I haven't heard any
 > complaints about the lack of normalization in Python so far (though
 > maybe I'm not reading the right python-list messages), and Python has
 > had unicode for what, almost 10 years now?

I presented a personal anecdote about docutils in my response to GvR,
and a failed test from XEmacs (which, admittedly, Python already gets
right).  Strictly speaking the former is not a normalization issue,
since it's probably a fairly idiosyncratic change in docutils, but
it's the kind of problem that would be mitigated by normalization.

But you won't see a lot, because almost all text in Western European
languages is almost automatically NFC, unless somebody who knows what
they're doing deliberately denormalizes or renormalizes it (as in Mac
OS X).  Also, a lot of problems will get attributed to legacy
encodings, although proper attention to canonical (and a subset of
compatibility) equivalences would go a long way to resolve them.
These issues are going to become more prevalent as more scripts are
added to Unicode, and actually come into use.  And as their users
start deploying IT on a large scale for the first time.

Footnotes: 
[1]  About 20 million Japanese face partial or total loss of their
pensions because the Japanese SSA couldn't be bothered to canonicalize
their names accurately when the system was automated in the '90s.


From rrr at ronadam.com  Thu Jun  7 11:15:30 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 07 Jun 2007 04:15:30 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>	
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>	
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
Message-ID: <4667CCB2.6040405@ronadam.com>

Neal Norwitz wrote:
> On 6/5/07, Ron Adam <rrr at ronadam.com> wrote:
>> Alexandre Vassalotti wrote:
>> > On 6/5/07, Guido van Rossum <guido at python.org> wrote:
>> >> If "make clean" makes the problem go away, it's usually because there
>> >> were old .pyc files with incompatible byte code. We don't change the
>> >> .pyc magic number for each change to the compiler.
>> >
>> > Nope. It is still not working. I just did the following, and I still
>> > get the same error.
>> >
>> >    % make  # run fine
>> >    % make  # fail
>>
>> I can confirm the same behavior.  Works on the first make, same error on
>> the second.  I deleted the contents of the branch and did an "svn up" 
>> on an
>> empty directory.  Same thing.
> 
> This probably means there is a problem with marshalling the byte code
> out.  The first run compiles the .pyc files.  Theoretically this
> writes out the same thing that is in memory.  This isn't always the
> case though (ie, when there are bugs).
> 
> A work around would be to just remove the .pyc files each time rather
> than do a make clean.  Do:
> 
>  find . -name '*.pyc' -print0 | xargs -0 rm
> 
> Bonus points for finding the bug. :-)


Well not the bug yet, but I did find the file.  :-)


The following clears it so make will work.

     rm ./build/lib.linux-i686-3.0/_struct.so

So maybe something to do with Modules/_struct.c, or would it be something 
else that uses it?

Removing all the .pyc files wasn't enough,  nor was removing all the .o files.


BTW,  I found it by running the commands from the 'clean' section of the 
makefile one at a time, then narrowed it down from there by making it more 
and more specific.

Version info:

Python 3.0x (py3k-struni, Jun  7 2007, 03:28:43)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2

On 7.04 "Feisty Fawn"


Ron




From ncoghlan at gmail.com  Thu Jun  7 12:59:50 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 07 Jun 2007 20:59:50 +1000
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
References: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
Message-ID: <4667E526.1000503@gmail.com>

Guido van Rossum wrote:
> A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> 367 (new super) and PEP 344 (exception chaining). Are there any
> others? I propose that we renumber these to numbers in the 3100+
> range. I can see two forms of renaming:
> 
> (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
> 
> (b) just use the next available number
> 
> Preferences?
> 
> What other PEPs should be renumbered?
> 
> Should we renumber at all?
> 

+1 for renumbering to the next available 31xx number, with the old 
number kept as a pointer to the new one.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From barry at python.org  Thu Jun  7 13:45:45 2007
From: barry at python.org (Barry Warsaw)
Date: Thu, 7 Jun 2007 07:45:45 -0400
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <da3f900e0706061821r3c39616ag8e935e96f7b3b9f2@mail.gmail.com>
References: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
	<43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
	<da3f900e0706061821r3c39616ag8e935e96f7b3b9f2@mail.gmail.com>
Message-ID: <E698409C-522E-4ADE-8CD1-4F83EA6670E5@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jun 6, 2007, at 9:21 PM, Chris Monson wrote:

> Renumbering, +1; using the next 31xx number, +1.
>
> Renumbering +1
> Leaving (old PEP number) in place as a stripped down PEP that just  
> points to the new number: +1

I don't want to (accidentally) re-use the old number for some other  
PEP, and PEPs are intended to be the historical record of a feature,  
so my own preferences would be:

- - Leave the old PEP in place, with a pointer to the renumbered PEP
- - Renumber the PEP by putting a '3' in front of it instead of using  
the next available

We don't have a template for renumbered PEPs so just come up with  
something reasonable that fits the flavor of PEPs.  If we need to  
generalize later, we can.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRmfv6nEjvBPtnXfVAQL1yAP9FOGBU5TMa4HUiP8IRoqS/wemFOdotHwf
GwvNPIEthJXheUBS/lOWLpSCERUzToSfqWzUJWkOUk5JfxsDP6MgWKwfkOwhvp35
oihXrkWoc/XtK2qJipLXVWLhg/5CkPuvnjXSrVMzqpu5J26YPV/QIb2Xa0ICF90e
c2mQY0cuzWM=
=HMOs
-----END PGP SIGNATURE-----

From g.brandl at gmx.net  Thu Jun  7 15:01:29 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Thu, 07 Jun 2007 15:01:29 +0200
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <4667E526.1000503@gmail.com>
References: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
	<4667E526.1000503@gmail.com>
Message-ID: <f48vj9$vk9$1@sea.gmane.org>

Nick Coghlan schrieb:
> Guido van Rossum wrote:
>> A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
>> 367 (new super) and PEP 344 (exception chaining). Are there any
>> others? I propose that we renumber these to numbers in the 3100+
>> range. I can see two forms of renaming:
>> 
>> (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
>> 
>> (b) just use the next available number
>> 
>> Preferences?
>> 
>> What other PEPs should be renumbered?
>> 
>> Should we renumber at all?
>> 
> 
> +1 for renumbering to the next available 31xx number, with the old 
> number kept as a pointer to the new one.

That would be my vote too.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From stephen at xemacs.org  Thu Jun  7 15:30:13 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 07 Jun 2007 22:30:13 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
Message-ID: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>

Guido van Rossum writes:

 > No it cannot. We are talking about \u escapes, not about a string
 > literal containing Unicode characters ("Löwis").

Ah, good point.

I apologize for mistyping the example.  *I* *was* talking about a
string literal containing Unicode characters.  However, on my
terminal, you can't see the difference!  So I (ab)used the \u escapes
to make clear that in one case the representation used 5 characters
and in the other 6.

 > > > I might be writing either literal with the expectation to get
 > > > exactly that sequence of code points,

This should be possible, agreed.  Couldn't rawstring read syntax be
given the right semantics?  And of course you've always got tuples of
integers.

What bothers me about the "sequence of code points" way of thinking is
that len("Löwis") is nondeterministic.  To my mind, especially from
the educational standpoint, but also from the point of view of
implementing a text editor or docutils, that's much more horrible than
Martin's point that len(a) + len(b) == len(a+b) could fail if we do
NFC normalization.  (NFD would work here.)
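
To spell out that failure (a sketch with today's unicodedata):

    >>> from unicodedata import normalize
    >>> a, b = u'o', u'\u0308'        # 'o' and a combining diaeresis
    >>> len(normalize('NFC', a)) + len(normalize('NFC', b))
    2
    >>> len(normalize('NFC', a + b))  # the pair composes to one code point
    1
    >>> len(normalize('NFD', a)) + len(normalize('NFD', b)) == len(normalize('NFD', a + b))
    True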

I'm not sure what happened, but after recent upgrades to Python and
docutils (presumably the latter) a bunch of Japanese reST documents of
mine broke.  I have no idea how to count the number of characters in a
line containing Japanese any more (even having fixed the tables by
trial and error, it's not obvious), but of course tables require being
able to do that exactly.  Normalization would guarantee TOOWDTI.

But IMO the right way to do normalization in such cases is in Python
itself.  One is *never* going to be able to keep up with all the
external libraries, and it seems very unlikely that many will be high
quality from this point of view.  So even if your own code does the
right thing, you have to wrap every external module you call.  Or you
can rewrite Python to normalize in the right places once, and then you
don't have to worry about it.  (Bugs, yes, but then you fix them in
the forked Python, and all your code benefits from the fix
automatically.)

 > Bytes are not code points. The unicode string type has always been
 > about code points, not characters.

I wish you had named it "widechar", then.  I think that a language
where len("Löwis") == len("Löwis") is an invariant is one honking good
idea!

 > Have you ever even used the unicode string type in Python 2?

Yes.  On the Mac, I often have to run unicodes through normalization
NFD because some levels of Mac OS X do normalize NFD and others don't
normalize at all.  That means that file names in particular tend to be
different depending on whether I get them from the OS or from the
user.  But a test as simple as creating a file with a name containing
\u010D and trying to stat it can fail, AIUI because stdio normalizes
NFD but the raw OS stat call doesn't.  This particular test does work
in Python, I'm not sure what the difference is.
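
The test I mean is roughly this (file name arbitrary; HFS+ stores
names in a decomposed form):

    import os, unicodedata
    name = u'\u010d.txt'                         # precomposed c-caron
    open(name, 'w').close()
    os.stat(name)                                # succeeds from Python
    os.stat(unicodedata.normalize('NFD', name))  # the spelling the OS stores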

Granted that that's part of the plan and not serendipity, nonetheless,
I think the default case should be that text operations produce the
expected result in the text domain, even at the expense of array
invariants.  People who need arrays of code points have several ways
to get them, and the usual comparison operators will work on them as
desired.  While people who need operations on *text* still have no
straightforward way to get them, and no promise of one as I read your
remarks.


From jimjjewett at gmail.com  Thu Jun  7 17:24:22 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 7 Jun 2007 11:24:22 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4665B7D7.6030501@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
	<46646E76.8060804@v.loewis.de>
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>
	<4665189D.4020301@v.loewis.de>
	<fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>
	<46659A37.4000900@v.loewis.de>
	<fb6fbf560706051148s4338ddb1pbf1e8db0f9793c7a@mail.gmail.com>
	<4665B7D7.6030501@v.loewis.de>
Message-ID: <fb6fbf560706070824p32b9ad89maaf6a0586bc9a2d8@mail.gmail.com>

On 6/5/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> > Unicode does say pretty clearly that (at least) canonical
> >> > equivalents must be treated the same.

On reflection, what it actually says is that you may not assume they
are different.  They can be different in the same way that two
identical strings are different under "is", but anything stronger has
to be strictly internal.

If any code outside the python core even touches the string, then the
choice of representations becomes arbitrary, and can switch for
spurious reasons.  Immutability should prevent mid-run switching for a
single "is" string, but not for different strings that should compare
"==".

Dictionary keys need to keep working, which means hash and equality
have to do the right thing.  Ordering may technically be a
quality-of-implementation issue, but ... normalizing strings on
creation solves an awful lot of problems, including providing a "best
practice" for C extensions.  Not normalizing will save a small amount
of time, at the cost of a never-ending hunt for rare and obscure bugs.
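
To make the dict hazard concrete (today's unnormalized behavior;
normalizing on creation would make both spellings hit one entry):

    >>> import unicodedata
    >>> d = {}
    >>> d[u'L\u00f6wis'] = 1      # precomposed key
    >>> d[u'Lo\u0308wis'] = 2     # canonically equivalent, decomposed key
    >>> len(d)                    # two entries for "the same" text
    2
    >>> len(dict((unicodedata.normalize('NFC', k), v) for k, v in d.items()))
    1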

> >> Chapter and verse, please?

> > I am pretty sure this list is not exhaustive, but it may be
> > helpful:

> > The Identifiers Annex http://www.unicode.org/reports/tr31/

> Ah, that's in the context of identifiers, not in the context of text
> in general.

Yes, but that should also apply to dict and shelve keys.  If you want
an array of code points, then you want a tuple of ints, not text.

> > """
> > Normalization Forms KC and KD must not be blindly
> > applied to arbitrary text.
> > """

Note that it lists only the Kompatibility forms.  By implication,
forms NFC and NFD *can* be blindly applied to arbitrary text.  (And
conformance rule C9 means you have to assume that someone else might
do so, if, say, the text is python source code that may have been
externally edited.)

... """
> > They can be applied more freely to domains with restricted
> > character sets, such as in Section 13, Programming
> > Language Identifiers.
> > """
> > (section 13 then forwards back to UAX31)

> How is that a requirement that comparison should apply
> normalization?

It isn't a requirement that we apply normalization.  But

(1)  There is a requirement that semantics not change based on
external canonical [de]normalization of source code, including literal
strings.  (I agree that explicit python-level escapes -- made after
the file has already been converted from bytes to characters -- are
legitimate, just as changing 1.0 from a string to a number is
legitimate.)

(2)  It is a *suggestion* that we consider the stronger Kompatibility
normalizations for source code.

There are cases where strings which are equal under Kompatibility
should be treated differently, but, I think, in practice, the
difference is more likely to be from typos or difficulty entering the
proper characters.  Normalizing to the compatibility form would be
helpful for some people (Japanese and Korean input was mentioned).

I think the need to distinguish the Kompatibility characters (and not
even in data; in source literals) will be rare enough that it is worth
making the distinction explicit.  (If you need to use a compatibility
character, then use an escape, rather than the character, so that
people will know you really mean the alternate, instead of the
"normal" character looking like that.)

-jJ

From jimjjewett at gmail.com  Thu Jun  7 17:29:39 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 7 Jun 2007 11:29:39 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4667617F.5060807@canterbury.ac.nz>
References: <46371BD2.7050303@v.loewis.de>
	<f52584c00706030612g6d091631se4500a25e58fc006@mail.gmail.com>
	<4662F639.2070806@v.loewis.de>
	<Pine.LNX.4.58.0706040609010.7196@server1.LFW.org>
	<46646E76.8060804@v.loewis.de>
	<fb6fbf560706041350n4008e6a8q96c53943ae663d7d@mail.gmail.com>
	<f52584c00706042221u109cdfdcme60e52ddeac14bb4@mail.gmail.com>
	<4665189D.4020301@v.loewis.de>
	<fb6fbf560706050837v7505d12wbbfafa9ed7732b1f@mail.gmail.com>
	<4667617F.5060807@canterbury.ac.nz>
Message-ID: <fb6fbf560706070829j228d3010racf14e7d4c22fa6b@mail.gmail.com>

On 6/6/07, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Are you suggesting that this should be done on the fly
> when comparing strings? Or that all strings should be
> stored in canonicalised form?

Preferably the second; store them canonicalized.

> I can see some big cans of worms being opened up by
> either approach. Surprising results could include
> things like s1 == s2 but len(s1) <> len(s2), or
> len(s1 + s2) <> len(s1) + len(s2).

Yes, these are surprising, but that is the nature of unicode.

People will get used to it, with the same pains they face now over "1"
+ "1" = "11", or output that doesn't line up because one row had a
single-digit number.

-jJ

From alexandre at peadrop.com  Thu Jun  7 17:47:28 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 11:47:28 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
Message-ID: <acd65fa20706070847g52829cd6kaf4ffaff864874b1@mail.gmail.com>

On 6/6/07, Neal Norwitz <nnorwitz at gmail.com> wrote:
> This probably means there is a problem with marshalling the byte code
> out.  The first run compiles the .pyc files.  Theoretically this
> writes out the same thing in memory.  This isn't always the case
> though (ie, when there are bugs).
>
> A work around would be to just remove the .pyc files each time rather
> than do a make clean.  Do:
>
>   find . -name '*.pyc' -print0 | xargs -0 rm
>

Nope. Removing the byte-compiled Python files didn't change anything.

> Bonus points for finding the bug. :-)

Oh? :)

> A quick way to test this is to try to roundrip it.  Something like:
>
> >>> s = '''\
> ... class F:
> ...   def foo(self, *args):
> ...     print(self, args)
> ... '''
> >>> code = compile(s, 'foo', 'exec')
> >>> import marshal
> >>> marshal.loads(marshal.dumps(code)) == code
> True
>
> If it doesn't equal True, you found the problem.

I got True. So, the problem is probably not the byte code.

-- Alexandre

From alexandre at peadrop.com  Thu Jun  7 17:50:05 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 11:50:05 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4667CCB2.6040405@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
Message-ID: <acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>

On 6/7/07, Ron Adam <rrr at ronadam.com> wrote:
> Well not the bug yet, but I did find the file.  :-)
>
> The following clears it so make will work.
>
>      rm ./build/lib.linux-i686-3.0/_struct.so
>
> So maybe something to do with Modules/_struct.c, or would it be something
> else that uses it?

Removing any of the compiled extension files works too. So, _struct isn't
the source of the problem.

-- Alexandre

From janssen at parc.com  Thu Jun  7 18:35:55 2007
From: janssen at parc.com (Bill Janssen)
Date: Thu, 7 Jun 2007 09:35:55 PDT
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706062032l6871dd0ag87ea1adce445b3f0@mail.gmail.com> 
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<4846239003818249252@unknownmsgid>
	<f52584c00706062032l6871dd0ag87ea1adce445b3f0@mail.gmail.com>
Message-ID: <07Jun7.093602pdt."57996"@synergy1.parc.xerox.com>

> Then you wouldn't even be able to iterate over or index strings anymore,
> as that could produce such "invalid" strings, which would need to
> generate exceptions if you really want to ban them.

I don't think that's right: iterating over the string should
presumably generate an iteration of valid sub-strings, each of length
one.  It would not generate a sequence of integers.

  [x for x in "abc"] != [ord(x) for x in "abc"]

> making people type 'o\u0308'[1] instead of '\u0308'?

'o\u0308'[1] should generate an ArrayBounds exception, since you're
indexing into a string of length 1.

Bill

From rauli.ruohonen at gmail.com  Thu Jun  7 18:47:17 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Thu, 7 Jun 2007 19:47:17 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706070947g576a8177xa862efd03e71fc47@mail.gmail.com>

On 6/7/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> I apologize for mistyping the example.  *I* *was* talking about a
> string literal containing Unicode characters.

Then I misunderstood you too. To avoid such problems, I will use XML
character references to denote code points here. Wherever you see such
a thing in this e-mail, replace it in your mind with the corresponding
code point *immediately*. E.g. len(r'&#00c5;') == 1, but
len(r'\u00c5') == 6. This is not a proposal for Python syntax, it is a
device to make what I say clear.

> However, on my terminal, you can't see the difference!  So I (ab)used
> the \u escapes to make clear that in one case the representation used
> 5 characters and in the other 6.

Your code was:

> if u"L\u00F6wis" == u"Lo\u0308wis":
>     print "Python is Unicode conforming in this respect."

I take it, by your explanation above, that you meant that the (py3k)
source code is this:

if "L&#00F6;wis" == "Lo&#0308;wis":
    print "Python is Unicode conforming in this respect."

I agree that here == should be true, but only because Python should
normalize the source code to look like this before processing it:

if "L&#00F6;wis" == "L&#00F6;wis":
    print "Python is Unicode conforming in this respect."

In the following code == should be false:

if "L\u00F6wis" == "Lo\u0308wis":
    print "Python is Unicode conforming in this respect."

> I think the default case should be that text operations produce the
> expected result in the text domain, even at the expense of array
> invariants.

If you really want that, then you need a type for sequences of graphemes.
E.g. 'c\u0308' is already normalized according to all four normalization
rules, but it's still one grapheme ('c' with diaeresis, c̈) and two
code points. This type could be provided in the standard library.
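
A very rough sketch of the flavor of such a type's counting, using
combining classes (a real implementation would follow UAX #29):

    import unicodedata

    def grapheme_count(s):
        # count code points that do not combine with a preceding one
        return sum(1 for c in s if unicodedata.combining(c) == 0)

    assert grapheme_count(u'c\u0308') == 1    # one grapheme, two code points
    assert grapheme_count(u'L\u00f6wis') == 5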

> People who need arrays of code points have several ways to get them,
> and the usual comparison operators will work on them as desired.

But regexps and other string operations won't, and those are the whole
point of strings, not comparison operators. If comparisons were enough,
then the string type could be removed as redundant - there's already the
array module (or numpy) if you're only concerned about efficient storage.

> While people who need operations on *text* still have no
> straightforward way to get them, and no promise of one as I read your
> remarks.

Then you missed some of his earlier remarks:

Guido:
: I'm all for adding a way to do normalized string comparisons to the
: library. But I'm not about to change the == operator to apply
: normalization first.

From guido at python.org  Thu Jun  7 19:10:12 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 10:10:12 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>

On 6/7/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> What bothers me about the "sequence of code points" way of thinking is
> that len("Löwis") is nondeterministic.

It doesn't have to be, *for this specific example*. After what I've
read so far, I'm okay with normalization happening on the text of the
source code before it reaches the lexer, if that's what people prefer.
I'm also okay with normalization happening by default in the text I/O
layer, as long as there's a way to disable it that doesn't require me
to switch to bytes.

However, I'm *not* okay with requiring all text strings to be
normalized, or normalizing them before comparing/hashing, after
slicing/concatenation, etc. If you want to have an abstraction that
guarantees you'll never see an unnormalized text string you should
design a library for doing so. I encourage you or others to contribute
such a library (*). But the 3.0 core language's 'str' type (like
Python 2.x's 'unicode' type) will be an array of code points that is
neutral about normalization.
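
(A minimal sketch of where such a library might start -- the name and
the choice of NFC are mine, not a design:

    import unicodedata

    class nstr(str):
        "A str that is always in normalization form C.  Sketch only."
        def __new__(cls, value=''):
            return str.__new__(cls, unicodedata.normalize('NFC', value))
        def __add__(self, other):
            # re-normalize: concatenation can compose across the seam
            return nstr(str.__add__(self, other))

Slicing, formatting and the rest would need the same treatment.)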

Python is a general programming language, not a text manipulating
library. As a general programming language, it must be possible to
represent unnormalized sequences of code points -- otherwise, it could
not implement algorithms for normalization in Python! (Again, forcing
me to do this using UTF-8-encoded bytes or lists of ints is
unacceptable.)

There are also Jython and IronPython to consider. These have extensive
integration in the Java and .NET runtimes, respectively, where strings
are represented as sequences of code points. Having a correspondence
between the "natural" string type across language boundaries is very
important.

Yes, this makes text processing harder if you want to get every corner
case right. We need to educate our users about Unicode and point them
to relevant portions of the standard. I don't think that can be
avoided anyway -- the complexity is inherent to the domain of
multi-alphabet text processing, and cannot be argued away by insisting
that the language handle it.

(*) It looks like such a library will not have a way to talk about
"\u0308" at all, since it is considered unnormalized. Things like
bidirectionality will probably have to be handled in a different way
(without referencing the code points indicating text direction) as
well.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Thu Jun  7 19:34:22 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 10:34:22 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
Message-ID: <ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>

On 6/7/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> On 6/7/07, Ron Adam <rrr at ronadam.com> wrote:
> > Well not the bug yet, but I did find the file.  :-)
> >
> > The following clears it so make will work.
> >
> >      rm ./build/lib.linux-i686-3.0/_struct.so
> >
> > So maybe something to do with Modules/_struct.c, or would it be something
> > else that uses it?
>
> Removing any compiled extension files will work too. So, _struct isn't
> the source of the problem.

It's time to look at the original traceback (attached as "tb", after
fixing the formatting problems). It looks like any call to
encodings.normalize_encoding() causes this problem.

I don't know why linking an extension avoids this, and why it's only a
problem for you and not for me, but that's probably a locale setting
(if you mail me the values of all your locale-specific environment
variables I can try to reproduce it). The trail leads back to the
optparse module using the gettext module to translate its error
messages. That seems overengineered to me, but I won't argue too
strongly.

In any case, the root cause is that normalize_encoding() is badly
broken. I've attached a hack that might fix it. Can you try if that
helps?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tb
Type: application/octet-stream
Size: 1267 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070607/7b1608fb/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hack
Type: application/octet-stream
Size: 778 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070607/7b1608fb/attachment-0001.obj 

From martin at v.loewis.de  Thu Jun  7 19:37:44 2007
From: martin at v.loewis.de (martin at v.loewis.de)
Date: Thu, 07 Jun 2007 19:37:44 +0200
Subject: [Python-3000] problem with checking whitespace in svn
	pre-commit	hook
Message-ID: <20070607193744.p3ttcg6gd4wsgokc@webmail.df.eu>

> tokenize.TokenError: ('EOF in multi-line statement', (315, 0))

I analyzed that a bit further, and found that
Lib/distutils/unixccompiler.py:214 reads

if not isinstance(output_dir, (str, type(None)):

This is a syntax error; a closing parenthesis is missing.
tokenize.py chokes at the EOF as the parentheses aren't balanced.

> I ran reindent prior to committing, but that had no effect (ie,  
> still go the error).

I find that hard to believe - running reindent.py on the file
fails for me with Python 2.5 as well.

Regards,
Martin



From nnorwitz at gmail.com  Thu Jun  7 19:55:35 2007
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Thu, 7 Jun 2007 10:55:35 -0700
Subject: [Python-3000] problem with checking whitespace in svn
	pre-commit hook
In-Reply-To: <20070607193744.p3ttcg6gd4wsgokc@webmail.df.eu>
References: <20070607193744.p3ttcg6gd4wsgokc@webmail.df.eu>
Message-ID: <ee2a432c0706071055q1406b84fp586988ff03b8e8ff@mail.gmail.com>

On 6/7/07, martin at v.loewis.de <martin at v.loewis.de> wrote:
> > tokenize.TokenError: ('EOF in multi-line statement', (315, 0))
>
> I analyzed that a bit further, and found that
> Lib/distutils/unixccompiler.py:214 reads
>
> if not isinstance(output_dir, (str, type(None)):
>
> This is a syntax error; a closing parenthesis is missing.
> tokenize.py chokes at the EOF as the parentheses aren't balanced.
>
> > I ran reindent prior to committing, but that had no effect (ie,
> > still go the error).
>
> I find that hard to believe - running reindent.py on the file
> fails for me with Python 2.5 as well.

I ran reindent with py3k, something like:  ./python
Tools/scripts/reindent.py Lib
IIRC.  I don't have the command line handy.  I'll fix this when I get
home tonight.

Has anyone tried the 3k reindent?  Or did I just screw that up?

n

From guido at python.org  Thu Jun  7 20:16:41 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 11:16:41 -0700
Subject: [Python-3000] problem with checking whitespace in svn
	pre-commit hook
In-Reply-To: <ee2a432c0706071055q1406b84fp586988ff03b8e8ff@mail.gmail.com>
References: <20070607193744.p3ttcg6gd4wsgokc@webmail.df.eu>
	<ee2a432c0706071055q1406b84fp586988ff03b8e8ff@mail.gmail.com>
Message-ID: <ca471dc20706071116q12a99c69nfbfa26f5a02970c1@mail.gmail.com>

On 6/7/07, Neal Norwitz <nnorwitz at gmail.com> wrote:
> On 6/7/07, martin at v.loewis.de <martin at v.loewis.de> wrote:
> > > tokenize.TokenError: ('EOF in multi-line statement', (315, 0))
> >
> > I analyzed that a bit further, and found that
> > Lib/distutils/unixccompiler.py:214 reads
> >
> > if not isinstance(output_dir, (str, type(None)):
> >
> > This is a syntax error; a closing parenthesis is missing.
> > tokenize.py chokes at the EOF as the parentheses aren't balanced.
> >
> > > I ran reindent prior to committing, but that had no effect (ie,
> > > still go the error).
> >
> > I find that hard to believe - running reindent.py on the file
> > fails for me with Python 2.5 as well.
>
> I ran reindent with py3k, something like:  ./python
> Tools/scripts/reindent.py Lib
> IIRC.  I don't have the command line handy.  I'll fix this when I get
> home tonight.
>
> Has anyone tried the 3k reindent?  Or did I just screw that up?

The py3k reindent is just fine; you screwed up the closing paren on
line 214 in unixccompile.py. All versions of reindent that I can find
correctly complain about that. I'm curious how you managed to bypass
it! :-)

I've checked in the fix.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jcarlson at uci.edu  Thu Jun  7 20:34:16 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 07 Jun 2007 11:34:16 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87lkewgquc.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <20070606084543.6F3D.JCARLSON@uci.edu>
	<87lkewgquc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20070607084121.6F4A.JCARLSON@uci.edu>


"Stephen J. Turnbull" <turnbull at sk.tsukuba.ac.jp> wrote:
> Josiah Carlson writes:
> 
>  > Maybe I'm missing something, but it seems to me that there might be a
>  > simple solution.  Don't normalize any identifiers or strings.
> 
> That's not a solution, that's denying that there's a problem.

For core Python, there is no problem.  The standard libraries don't have
any normalization issues, nor will they have any normalization issues. 
The only place where there could be potential for normalization issues
is in to-be-written 3rd party code.

With that said, from what I understand, there are three places where we
could potentially do normalization: identifiers, literals, and data.
Identifiers and literals have the best case for normalization, data the
worst (don't change my data without me telling you to!).  From Guido's
recent post, he seems to say more or less the same thing, with
normalization applied to text read through the text IO layer.

Since I don't expect to be reading much unicode from disk (and/or I
expect to be reading bytes and decoding them to unicode manually), being
able to disable normalization on data from text IO is fine.


Regarding the rest of it, I've come to the point of exhaustion.  I no
longer have the energy to care what happens with Python 3.0 and unicode
(identifiers, literals, data, types, etc.), but I hope Ka-Ping is able
to convince people more than I have. Good luck with the decisions.

Good day,
 - Josiah


From alexandre at peadrop.com  Thu Jun  7 20:37:45 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 14:37:45 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
Message-ID: <acd65fa20706071137j67f4a805kc17f5fc11a9402a5@mail.gmail.com>

On 6/7/07, Guido van Rossum <guido at python.org> wrote:
> It's time to look at the original traceback (attached as "tb", after
> fixing the formatting problems). it looks like any call to
> encodings.normalize_encoding() causes this problem.

Don't know if it will help, but it seems adding a
debugging print() in the normalize_encoding method makes Python act
weird:

  >>> print("hello")  # no output
  [38357 refs]
  >>> hello?          # note the exception is not shown
  [30684 refs]
  >>> exit()          # does quit

> I don't know why linking an extension avoids this, and why it's only
> a problem for you and not for me, but that's probably a locale
> setting (if you mail me the values of all your locale-specific
> environment variables I can try to reproduce it).

I don't think it is related to locale settings, since even with a
minimal set of environment variables I can still reproduce the
problem.

  % sh
  $ for v in `set | egrep -v 'OPTIND|PS|PATH' | cut -d "=" -f1`
  > do unset $v; done
  $ make
  make: *** [sharedmods] Error 1

> The trail leads back to the optparse module using the gettext module
> to translate its error messages. That seems overengineered to me,
> but I won't argue too strongly.
>
> In any case, the root cause is that normalize_encoding() is badly
> broken. I've attached a hack that might fix it. Can you try if that
> helps?

Yep, that worked. What is this new str8 type for, btw? It is the
second time I have encountered it today.

-- Alexandre

From alexandre at peadrop.com  Thu Jun  7 20:46:15 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 14:46:15 -0400
Subject: [Python-3000] pdb help is broken in py3k-struni branch
In-Reply-To: <ca471dc20706051727j6a7f1738g96f45cf9a2a2d4aa@mail.gmail.com>
References: <acd65fa20706051651k116fa579t41a51aad52f285ef@mail.gmail.com>
	<ca471dc20706051700m61cad7f8wb4973f6857685c96@mail.gmail.com>
	<acd65fa20706051714v737d4d2br5b3059317d0b1210@mail.gmail.com>
	<ca471dc20706051727j6a7f1738g96f45cf9a2a2d4aa@mail.gmail.com>
Message-ID: <acd65fa20706071146r61f4354ctb1a92aee41ce5dd1@mail.gmail.com>

I found a way to fix the bug; look at the attached patch. Although, I
am not sure it was the correct way to fix it. The problem was due to
str8 being recognized as an instance of `str'.

-- Alexandre

On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> > > I'd rather see them here than in SF, SF is a pain to use.
> > >
> > > But unless the bugs prevent you from proceeding, you could also ignore them.
> >
> > The first bug that I reported today (the one about `make`) stop me
> > from running the test suite. So, can't really test the _string_io and
> > _bytes_io modules.
>
> I tried to reproduce it but it works fine for me -- I'm on Ubuntu
> dapper (with some Google mods) on a 2.6.18.5-gg4 kernel.
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdb-help.patch
Type: text/x-patch
Size: 1056 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070607/b2463067/attachment.bin 

From alexandre at peadrop.com  Thu Jun  7 20:55:08 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 14:55:08 -0400
Subject: [Python-3000] help() broken in the py3k-struni branch
In-Reply-To: <ca471dc20706051647s249727abwd10a526ad13c98bb@mail.gmail.com>
References: <acd65fa20706051645p3f05a292u243b9623dbafda5b@mail.gmail.com>
	<ca471dc20706051647s249727abwd10a526ad13c98bb@mail.gmail.com>
Message-ID: <acd65fa20706071155j6361f00avcc745f7e62a8ef87@mail.gmail.com>

On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> Feel free to mail me a patch to fix it.
>

Since you asked so politely, here a patch for you. :)

> On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > Hi,
> >
> > I found another bug to report. It seems there is a bug in
> > subprocess.py that makes help() fail.
> >
> > -- Alexandre
> >
> > Python 3.0x (py3k-struni, Jun  5 2007, 18:41:44)
> > [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> help(open)
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/home/alex/src/python.org/py3k-struni/Lib/site.py", line 350,
> > in __call__
> >     return pydoc.help(*args, **kwds)
> >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > 1687, in __call__
> >     self.help(request)
> >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1731, in help
> >     else: doc(request, 'Help on %s:')
> >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1514, in doc
> >     pager(render_doc(thing, title, forceload))
> >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1313, in pager
> >     pager(text)
> >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > 1333, in <lambda>
> >     return lambda text: pipepager(text, 'less')
> >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > 1352, in pipepager
> >     pipe = os.popen(cmd, 'w')
> >   File "/home/alex/src/python.org/py3k-struni/Lib/os.py", line 717, in popen
> >     bufsize=buffering)
> >   File "/home/alex/src/python.org/py3k-struni/Lib/subprocess.py", line
> > 476, in __init__
> >     raise TypeError("bufsize must be an integer")
> > TypeError: bufsize must be an integer
> > _______________________________________________
> > Python-3000 mailing list
> > Python-3000 at python.org
> > http://mail.python.org/mailman/listinfo/python-3000
> > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
> >
>
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>


-- 
Alexandre Vassalotti
-------------- next part --------------
A non-text attachment was scrubbed...
Name: help-buf-fix.patch
Type: text/x-patch
Size: 441 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070607/a36a4ac0/attachment.bin 

From guido at python.org  Thu Jun  7 20:55:40 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 11:55:40 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <acd65fa20706071137j67f4a805kc17f5fc11a9402a5@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
	<acd65fa20706071137j67f4a805kc17f5fc11a9402a5@mail.gmail.com>
Message-ID: <ca471dc20706071155m274a11fcy4b1e4fd57215b907@mail.gmail.com>

On 6/7/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> On 6/7/07, Guido van Rossum <guido at python.org> wrote:
> > It's time to look at the original traceback (attached as "tb", after
> > fixing the formatting problems). it looks like any call to
> > encodings.normalize_encoding() causes this problem.
>
> Don't know if it will help to know that, but it seems adding a
> debugging print() in the normalize_encoding method, makes Python act
> weird:
>
>   >>> print("hello")  # no output
>   [38357 refs]
>   >>> hello?          # note the exception is not shown
>   [30684 refs]
>   >>> exit()          # does quit

That's a bootstrapping issue. normalize_encoding() is apparently
called in order to set up stdin/stdout/stderr, so it shouldn't attempt
to touch those (or raise errors).

> > I don't know why linking an extension avoids this, and why it's only
> > a problem for you and not for me, but that's probably a locale
> > setting (if you mail me the values of all your locale-specific
> > environment variables I can try to reproduce it).
>
> I don't think it is related to locales settings. Since even with a
> minimum number of environment variables, I still can reproduce the
> problem.
>
>   % sh
>   $ for v in `set | egrep -v 'OPTIND|PS|PATH' | cut -d "=" -f1`
>   > do unset $v; done
>   $ make
>   make: *** [sharedmods] Error 1

Well, then it is up to you to come up with a hypothesis for why it
doesn't happen on my system. (I tried the above thing and it still
works.)

> > The trail leads back to the optparse module using the gettext module
> > to translate its error messages. That seems overengineered to me,
> > but I won't argue too strongly.
> >
> > In any case, the root cause is that normalize_encoding() is badly
> > broken. I've attached a hack that might fix it. Can you try if that
> > helps?
>
> Yep, that worked. What this new str8 type is for, btw? It is the second
> time I encounter it, today.

It is the temporary new name for the old 8-bit str type. The plan is
to rename unicode->str and delete the old str type, but in the short
term that doesn't quite work because there is too much C code that
requires 8-bit strings (and can't be made to work with the bytes type
either). So for the time being I've renamed the old str type to str8
rather than deleting it altogether. Once we have things 99% working
this way we'll make another pass to get rid of str8 completely -- or
perhaps keep it around under some other name with reduced
functionality (since there have been requests for an immutable bytes
type).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rrr at ronadam.com  Thu Jun  7 22:54:07 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 07 Jun 2007 15:54:07 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706071155m274a11fcy4b1e4fd57215b907@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>	<4665EE44.2010306@ronadam.com>	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>	<4667CCB2.6040405@ronadam.com>	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>	<acd65fa20706071137j67f4a805kc17f5fc11a9402a5@mail.gmail.com>
	<ca471dc20706071155m274a11fcy4b1e4fd57215b907@mail.gmail.com>
Message-ID: <4668706F.6080406@ronadam.com>

Guido van Rossum wrote:
 > On 6/7/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
 >> On 6/7/07, Guido van Rossum <guido at python.org> wrote:
 >>> It's time to look at the original traceback (attached as "tb", after
 >>> fixing the formatting problems). it looks like any call to
 >>> encodings.normalize_encoding() causes this problem.
 >> Don't know if it will help to know that, but it seems adding a
 >> debugging print() in the normalize_encoding method, makes Python act
 >> weird:
 >>
 >>   >>> print("hello")  # no output
 >>   [38357 refs]
 >>   >>> hello?          # note the exception is not shown
 >>   [30684 refs]
 >>   >>> exit()          # does quit
 >
 > That's a bootstrapping issue. normalize_encoding() is apparently
 > called in order to set up stdin/stdout/stderr, so it shouldn't attempt
 > to touch those (or raise errors).
 >
 >>> I don't know why linking an extension avoids this, and why it's only
 >>> a problem for you and not for me, but that's probably a locale
 >>> setting (if you mail me the values of all your locale-specific
 >>> environment variables I can try to reproduce it).
 >> I don't think it is related to locales settings. Since even with a
 >> minimum number of environment variables, I still can reproduce the
 >> problem.
 >>
 >>   % sh
 >>   $ for v in `set | egrep -v 'OPTIND|PS|PATH' | cut -d "=" -f1`
 >>   > do unset $v; done
 >>   $ make
 >>   make: *** [sharedmods] Error 1
 >
 > Well, then it is up to you to come up with a hypothesis for why it
 > doesn't happen on my system. (I tried the above thing and it still
 > works.)

There's a couple of things going on here.

The "sharedmods" section of the makefile doesn't execute on every make 
depending on what options are set or what targets are built.  That is why 
the error doesn't occur on the first run after a 'make clean', and why it 
doesn't occur if some targets are rebuilt like _struct.so.  I'm not sure 
why it matters which files are built in this case.  <shrug>

Also if you have some make flags set then it may be avoiding that
particular problem because the default 'all' section is never run.

Does setup.py run without an error for you?  (Without the
encodings.__init__.py patch.)   How about "make test"?


I ran across the same zero-arg split error a while back when attempting
to run 'make test'.  Below is the solution I came up with.  Is there going
to be a unicode equivalent to the str.translate() method?

Cheers,
    Ron



Index: Lib/encodings/__init__.py
===================================================================
--- Lib/encodings/__init__.py   (revision 55388)
+++ Lib/encodings/__init__.py   (working copy)
@@ -34,19 +34,16 @@
  _cache = {}
  _unknown = '--unknown--'
  _import_tail = ['*']
-_norm_encoding_map = ('                                              . '
-                      '0123456789       ABCDEFGHIJKLMNOPQRSTUVWXYZ     '
-                      ' abcdefghijklmnopqrstuvwxyz                     '
-                      '                                                '
-                      '                                                '
-                      '                ')
+_norm_encoding_map = ('.0123456789'
+                      'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
+                      'abcdefghijklmnopqrstuvwxyz')
+
  _aliases = aliases.aliases

  class CodecRegistryError(LookupError, SystemError):
      pass

  def normalize_encoding(encoding):
-
      """ Normalize an encoding name.

          Normalization works as follows: all non-alphanumeric
@@ -54,18 +51,12 @@
          collapsed and replaced with a single underscore, e.g. '  -;#'
          becomes '_'. Leading and trailing underscores are removed.

-        Note that encoding names should be ASCII only; if they do use
-        non-ASCII characters, these must be Latin-1 compatible.
+        Note that encoding names should be ASCII characters only; if they
+        do use non-ASCII characters, these must be Latin-1 compatible.

      """
-    # Make sure we have an 8-bit string, because .translate() works
-    # differently for Unicode strings.
-    if isinstance(encoding, str):
-        # Note that .encode('latin-1') does *not* use the codec
-        # registry, so this call doesn't recurse. (See unicodeobject.c
-        # PyUnicode_AsEncodedString() for details)
-        encoding = encoding.encode('latin-1')
-    return '_'.join(encoding.translate(_norm_encoding_map).split())
+    return ''.join([ch if ch in _norm_encoding_map else '_'
+                        for ch in encoding])
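
If I have it right, the rewritten function behaves like this (note
that runs of punctuation become several underscores rather than being
collapsed as the old docstring promised):

     >>> normalize_encoding('utf-8')
     'utf_8'
     >>> normalize_encoding('  -;#')
     '_____'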


 >>> The trail leads back to the optparse module using the gettext module
 >>> to translate its error messages. That seems overengineered to me,
 >>> but I won't argue too strongly.
 >>>
 >>> In any case, the root cause is that normalize_encoding() is badly
 >>> broken. I've attached a hack that might fix it. Can you try if that
 >>> helps?
 >> Yep, that worked. What this new str8 type is for, btw? It is the second
 >> time I encounter it, today.
 >
 > It is the temporary new name for the old 8-bit str type. The plan is
 > to rename unicode->str and delete the old str type, but in the short
 > term that doesn't quite work because there is too much C code that
 > requires 8-bit strings (and can't be made to work with the bytes type
 > either). So for the time being I've renamed the old str type to str8
 > rather than deleting it altogether. Once we have things 99% working
 > tis way we'll make another pass to get rid of str8 completely -- or
 > perhaps keep it around under some other name with reduced
 > functionality (since there have been requests for an immutable bytes
 > type).

From martin at v.loewis.de  Thu Jun  7 23:05:26 2007
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 07 Jun 2007 23:05:26 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>	<4665EE44.2010306@ronadam.com>	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>	<4667CCB2.6040405@ronadam.com>	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
Message-ID: <46687316.8090109@v.loewis.de>

> It's time to look at the original traceback (attached as "tb", after
> fixing the formatting problems). it looks like any call to
> encodings.normalize_encoding() causes this problem.

One problem with normalize_encoding is that it might do

  encoding = encoding.encode('latin-1')
  return '_'.join(encoding.translate(_norm_encoding_map).split())

Here, encoding is converted from a str (unicode) object
into a bytes object. That is passed to translate, and then
split, which in turn gives

py> b"Hallo, World".split()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: split() takes at least 1 argument (0 given)

So the problem is that bytes is not fully compatible with
str or str8, here: it doesn't support the parameter-less
split.

In turn, normalize_encoding encodes as latin-1 because
otherwise, translate won't work as expected.

I think the right solution would be to just fix the
translate table, replacing everything but [A-Za-z0-9]
with a space.
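
Something like this, I suppose (keeping '.' as the current table
does; untested):

  import string
  _keep = string.ascii_letters + string.digits + '.'
  _norm_encoding_map = ''.join(chr(i) if chr(i) in _keep else ' '
                               for i in range(256))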

FWIW, for me the build error goes away when I unset
LANG, so the error that occurs during build definitely
*is* a locale issue.

Regards,
Martin

From martin at v.loewis.de  Thu Jun  7 23:07:32 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 07 Jun 2007 23:07:32 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4668706F.6080406@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>	<4665EE44.2010306@ronadam.com>	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>	<4667CCB2.6040405@ronadam.com>	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>	<acd65fa20706071137j67f4a805kc17f5fc11a9402a5@mail.gmail.com>	<ca471dc20706071155m274a11fcy4b1e4fd57215b907@mail.gmail.com>
	<4668706F.6080406@ronadam.com>
Message-ID: <46687394.4070003@v.loewis.de>

> I've ran across the same zero arg split error a while back when attempting 
> to run 'make test'.  Below was the solution I came up with.  Is there going 
> to be an unicode equivalent to the str.translate() method?

The unicode type supports translate since 2.0.

Regards,
Martin

From guido at python.org  Thu Jun  7 23:47:10 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 14:47:10 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46687316.8090109@v.loewis.de>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
	<46687316.8090109@v.loewis.de>
Message-ID: <ca471dc20706071447o2398cdefs8540fbcd975b6e8a@mail.gmail.com>

On 6/7/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > It's time to look at the original traceback (attached as "tb", after
> > fixing the formatting problems). it looks like any call to
> > encodings.normalize_encoding() causes this problem.
>
> One problem with normalize_encoding is that it might do
>
>   encoding = encoding.encode('latin-1')
>   return '_'.join(encoding.translate(_norm_encoding_map).split())
>
> Here, encoding is converted from a str (unicode) object
> into a bytes object. That is passed to translate, and then
> split, which in turn gives
>
> py> b"Hallo, World".split()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: split() takes at least 1 argument (0 given)
>
> So the problem is that bytes is not fully compatible with
> str or str8, here: it doesn't support the parameter-less
> split.

Which is intentional (sort of).

> In turn, normalize_encoding encodes as latin-1 because
> otherwise, translate won't work as expected.
>
> I think the right solution would be to just fix the
> translate table, replacing everything but [A-Za-z0-9]
> with a space.

I rewrote the algorithm using more basic operations. It's slower now
-- does that matter?  Here's what I checked in:

    chars = []
    punct = False
    for c in encoding:
        if c.isalnum() or c == '.':
            if punct and chars:
                chars.append('_')
            chars.append(c)
            punct = False
        else:
            punct = True
    return ''.join(chars)
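
For example, assuming the loop above is the body of
normalize_encoding():

    >>> normalize_encoding('utf-8')
    'utf_8'
    >>> normalize_encoding('  ISO 8859-1  ')   # leading/trailing punctuation dropped
    'ISO_8859_1'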

> FWIW, for me the build error goes away when I unset
> LANG, so that the error occurs during build definitely
> *is* a locale issue.

I still can't reproduce this. Oh well. It should be gone.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Thu Jun  7 23:50:37 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 14:50:37 -0700
Subject: [Python-3000] pdb help is broken in py3k-struni branch
In-Reply-To: <acd65fa20706071146r61f4354ctb1a92aee41ce5dd1@mail.gmail.com>
References: <acd65fa20706051651k116fa579t41a51aad52f285ef@mail.gmail.com>
	<ca471dc20706051700m61cad7f8wb4973f6857685c96@mail.gmail.com>
	<acd65fa20706051714v737d4d2br5b3059317d0b1210@mail.gmail.com>
	<ca471dc20706051727j6a7f1738g96f45cf9a2a2d4aa@mail.gmail.com>
	<acd65fa20706071146r61f4354ctb1a92aee41ce5dd1@mail.gmail.com>
Message-ID: <ca471dc20706071450o192c943aya611c4f502137f9c@mail.gmail.com>

Looks great -- can you check it in yourself?

On 6/7/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> I found a way to fix the bug; look at the attached patch. Although I
> am not sure it was the correct way to fix it. The problem was due to str8
> being recognized as an instance of `str'.
>
> -- Alexandre
>
> On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> > On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > > On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> > > > I'd rather see them here than in SF, SF is a pain to use.
> > > >
> > > > But unless the bugs prevent you from proceeding, you could also ignore them.
> > >
> > > The first bug that I reported today (the one about `make`) stops me
> > > from running the test suite. So, I can't really test the _string_io and
> > > _bytes_io modules.
> >
> > I tried to reproduce it but it works fine for me -- I'm on Ubuntu
> > dapper (with some Google mods) on a 2.6.18.5-gg4 kernel.
> >
> > --
> > --Guido van Rossum (home page: http://www.python.org/~guido/)
> >
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com  Thu Jun  7 23:53:32 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 7 Jun 2007 17:53:32 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706070947g576a8177xa862efd03e71fc47@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706070947g576a8177xa862efd03e71fc47@mail.gmail.com>
Message-ID: <fb6fbf560706071453uf4d6087s56bf95ec8c8e9e8f@mail.gmail.com>

On 6/7/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:

> ... I will use XML character references to denote code points here.
> Wherever you see such a thing in this e-mail, replace it in your
> mind with the corresponding code point *immediately*. E.g.
> len(r'&#00c5;') == 1, but len(r'\u00c5') == 6.

> In the following code == should be false:

> if "L\u00F6wis" == "Lo\u0308wis":
>     print "Python is Unicode conforming in this respect."

> On 6/7/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > I think the default case should be that text operations produce the
> > expected result in the text domain, even at the expense of array
> > invariants.

(There was confusion -- an explicit escape such as \u probably stands
out enough to signal the non-default case.  But even there, it would
also be reasonable to say "use something other than text.")

> > People who need arrays of code points have several ways to
> > get them, and the usual comparison operators will work on them
> > as desired.

> But regexps and other string operations won't, and those are the
> whole point of strings,

(I was thinking that regexps would actually take a buffer interface, but...)

How would you expect them to work on arrays of code points?  What sort
of answer should the following produce?

    # matches by codepoints, but doesn't look like it
    "Lo&#0308;wis".startswith("Lo")

    # if the above did match, then people will assume ö folds to o
    "L&#00F6;wis".startswith("Lo")

    # looks like it matches.  Matches as text.  Does not match as bytes.
    "Lo&#0308;wis".startswith("L&#00F6;")

-jJ

From guido at python.org  Thu Jun  7 23:54:01 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 14:54:01 -0700
Subject: [Python-3000] help() broken in the py3k-struni branch
In-Reply-To: <acd65fa20706071155j6361f00avcc745f7e62a8ef87@mail.gmail.com>
References: <acd65fa20706051645p3f05a292u243b9623dbafda5b@mail.gmail.com>
	<ca471dc20706051647s249727abwd10a526ad13c98bb@mail.gmail.com>
	<acd65fa20706071155j6361f00avcc745f7e62a8ef87@mail.gmail.com>
Message-ID: <ca471dc20706071454w1df634d7l38885d8b746e778a@mail.gmail.com>

Thanks for finding the issue!

On this one I think subprocess.py should be changed to allow None
(like all the other open() functions).

I'll check it in.

--Guido

On 6/7/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> > Feel free to mail me a patch to fix it.
> >
>
> Since you asked so politely, here a patch for you. :)
>
> > On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > > Hi,
> > >
> > > I found another bug to report. It seems there is a bug in
> > > subprocess.py that makes help() fail.
> > >
> > > -- Alexandre
> > >
> > > Python 3.0x (py3k-struni, Jun  5 2007, 18:41:44)
> > > [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> > > Type "help", "copyright", "credits" or "license" for more information.
> > > >>> help(open)
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in <module>
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/site.py", line 350,
> > > in __call__
> > >     return pydoc.help(*args, **kwds)
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > > 1687, in __call__
> > >     self.help(request)
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1731, in help
> > >     else: doc(request, 'Help on %s:')
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1514, in doc
> > >     pager(render_doc(thing, title, forceload))
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1313, in pager
> > >     pager(text)
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > > 1333, in <lambda>
> > >     return lambda text: pipepager(text, 'less')
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > > 1352, in pipepager
> > >     pipe = os.popen(cmd, 'w')
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/os.py", line 717, in popen
> > >     bufsize=buffering)
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/subprocess.py", line
> > > 476, in __init__
> > >     raise TypeError("bufsize must be an integer")
> > > TypeError: bufsize must be an integer
> > > _______________________________________________
> > > Python-3000 mailing list
> > > Python-3000 at python.org
> > > http://mail.python.org/mailman/listinfo/python-3000
> > > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
> > >
> >
> >
> > --
> > --Guido van Rossum (home page: http://www.python.org/~guido/)
> >
>
>
> --
> Alexandre Vassalotti
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com  Thu Jun  7 23:58:57 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 17:58:57 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46687316.8090109@v.loewis.de>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
	<46687316.8090109@v.loewis.de>
Message-ID: <acd65fa20706071458p6eb912b2he6b70013f55d6614@mail.gmail.com>

On 6/7/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> FWIW, for me the build error goes away when I unset
> LANG, so that the error occurs during build definitely
> *is* a locale issue.

Ah! You're right. I needed to do a `make clean` before, though.

My LANG variable was set to "en_CA.UTF-8".

-- Alexandre

From guido at python.org  Fri Jun  8 00:19:32 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 15:19:32 -0700
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <f48vj9$vk9$1@sea.gmane.org>
References: <ca471dc20706061557g717d43b6ud61d1e7acee354c5@mail.gmail.com>
	<4667E526.1000503@gmail.com> <f48vj9$vk9$1@sea.gmane.org>
Message-ID: <ca471dc20706071519p24683fb3g4703e62a78213e71@mail.gmail.com>

On 6/7/07, Georg Brandl <g.brandl at gmx.net> wrote:
> Nick Coghlan schrieb:
> > Guido van Rossum wrote:
> >> A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> >> 367 (new super) and PEP 344 (exception chaining). Are there any
> >> others? I propose that we renumber these to numbers in the 3100+
> >> range. I can see two forms of renaming:
> >>
> >> (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
> >>
> >> (b) just use the next available number
> >>
> >> Preferences?
> >>
> >> What other PEPs should be renumbered?
> >>
> >> Should we renumber at all?
> >>
> >
> > +1 for renumbering to the next available 31xx number, with the old
> > number kept as a pointer to the new one.
>
> That would be my vote too.

And so it is done. 344 -> 3134, 367 -> 3135. I've left the old ones in
place with status "Replaced" and a "Numbering Note" in front of the
abstract.

Are there any other candidates for such a renumbering?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From pje at telecommunity.com  Fri Jun  8 00:33:12 2007
From: pje at telecommunity.com (Phillip J. Eby)
Date: Thu, 07 Jun 2007 18:33:12 -0400
Subject: [Python-3000] [Python-Dev] PEP 367: New Super
In-Reply-To: <ca471dc20706061431i3de7914bq14307fe7bc4f7ba7@mail.gmail.com>
References: <001101c79aa7$eb26c130$0201a8c0@mshome.net>
	<002d01c79f6d$ce090de0$0201a8c0@mshome.net>
	<ca471dc20705260708t952d820w7473474554c9469b@mail.gmail.com>
	<003f01c79fd9$66948ec0$0201a8c0@mshome.net>
	<ca471dc20705270259ke665af6v3b5bdbffbd926330@mail.gmail.com>
	<009c01c7a04f$7e348460$0201a8c0@mshome.net>
	<ca471dc20705270550j5e199624xd4e8f6caa9dda93d@mail.gmail.com>
	<ca471dc20705281937y48300821u840add9d5454e8d9@mail.gmail.com>
	<ca471dc20705310448p5c5cfeds41fdc75e05c21f55@mail.gmail.com>
	<20070531170734.273393A40AA@sparrow.telecommunity.com>
	<ca471dc20706061431i3de7914bq14307fe7bc4f7ba7@mail.gmail.com>
Message-ID: <20070607223114.274A73A4060@sparrow.telecommunity.com>

At 02:31 PM 6/6/2007 -0700, Guido van Rossum wrote:
>I wonder if this may meet the needs for your PEP 3124? In
>particularly, earlier on, you wrote:
>
>>Btw, PEP 3124 needs a way to receive the same class object at more or
>>less the same moment, although in the form of a callback rather than
>>a cell assignment.  Guido suggested I co-ordinate with you to design
>>a mechanism for this.
>
>Is this relevant at all?

Well, it tells us more or less where the callback would need to 
be.  :)  Although I think that __class__ should really point to the 
*decorated* class, rather than the undecorated one.  I have used 
decorators before that had to re-create the class object, but can't 
think of any use cases where I'd have wanted to use super() to refer 
to the *un*decorated class.

Btw, my thought on the keyword and __class__ thing is simply that the 
plus of having a keyword (or other compiler support) is that we don't 
have to have the cell variable cluttering up the frames for every 
single method, whether it uses super or not.

Thus, my inclination is either to require explicit use of __class__ 
(so the compiler would know whether to include the free variable), or 
to make super a keyword, so that in either case, only the functions 
that use it must pay for the overhead.

(Currently, functions that use any cell variables are invoked more 
slowly than ones without them; in 2.x at least there's a fast calling 
path for code objects with CO_NOFREE, and this change would make it 
useless for everything but top-level functions.)


From alexandre at peadrop.com  Fri Jun  8 00:38:34 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 18:38:34 -0400
Subject: [Python-3000] pdb help is broken in py3k-struni branch
In-Reply-To: <ca471dc20706071450o192c943aya611c4f502137f9c@mail.gmail.com>
References: <acd65fa20706051651k116fa579t41a51aad52f285ef@mail.gmail.com>
	<ca471dc20706051700m61cad7f8wb4973f6857685c96@mail.gmail.com>
	<acd65fa20706051714v737d4d2br5b3059317d0b1210@mail.gmail.com>
	<ca471dc20706051727j6a7f1738g96f45cf9a2a2d4aa@mail.gmail.com>
	<acd65fa20706071146r61f4354ctb1a92aee41ce5dd1@mail.gmail.com>
	<ca471dc20706071450o192c943aya611c4f502137f9c@mail.gmail.com>
Message-ID: <acd65fa20706071538k6bd03762g35d414561dcbb4ff@mail.gmail.com>

Done. Committed to r55817.

On 6/7/07, Guido van Rossum <guido at python.org> wrote:
> Looks great -- can you check it in yourself?
>
> On 6/7/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > I found a way to fix the bug; look at the attached patch. Although I
> > am not sure it was the correct way to fix it. The problem was due to str8
> > being recognized as an instance of `str'.
> >
> > -- Alexandre
> >
> > On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> > > On 6/5/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > > > On 6/5/07, Guido van Rossum <guido at python.org> wrote:
> > > > > I'd rather see them here than in SF, SF is a pain to use.
> > > > >
> > > > > But unless the bugs prevent you from proceeding, you could also ignore them.
> > > >
> > > > The first bug that I reported today (the one about `make`) stops me
> > > > from running the test suite. So, I can't really test the _string_io and
> > > > _bytes_io modules.
> > >
> > > I tried to reproduce it but it works fine for me -- I'm on Ubuntu
> > > dapper (with some Google mods) on a 2.6.18.5-gg4 kernel.
> > >
> > > --
> > > --Guido van Rossum (home page: http://www.python.org/~guido/)
> > >
> >
> >
>
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>


-- 
Alexandre Vassalotti

From guido at python.org  Fri Jun  8 00:41:09 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 15:41:09 -0700
Subject: [Python-3000] [Python-Dev] PEP 367: New Super
In-Reply-To: <20070607223114.274A73A4060@sparrow.telecommunity.com>
References: <001101c79aa7$eb26c130$0201a8c0@mshome.net>
	<003f01c79fd9$66948ec0$0201a8c0@mshome.net>
	<ca471dc20705270259ke665af6v3b5bdbffbd926330@mail.gmail.com>
	<009c01c7a04f$7e348460$0201a8c0@mshome.net>
	<ca471dc20705270550j5e199624xd4e8f6caa9dda93d@mail.gmail.com>
	<ca471dc20705281937y48300821u840add9d5454e8d9@mail.gmail.com>
	<ca471dc20705310448p5c5cfeds41fdc75e05c21f55@mail.gmail.com>
	<20070531170734.273393A40AA@sparrow.telecommunity.com>
	<ca471dc20706061431i3de7914bq14307fe7bc4f7ba7@mail.gmail.com>
	<20070607223114.274A73A4060@sparrow.telecommunity.com>
Message-ID: <ca471dc20706071541t43588314h45adb52622d6d1a5@mail.gmail.com>

On 6/7/07, Phillip J. Eby <pje at telecommunity.com> wrote:
> At 02:31 PM 6/6/2007 -0700, Guido van Rossum wrote:
> >I wonder if this may meet the needs for your PEP 3124? In
> >particularly, earlier on, you wrote:
> >
> >>Btw, PEP 3124 needs a way to receive the same class object at more or
> >>less the same moment, although in the form of a callback rather than
> >>a cell assignment.  Guido suggested I co-ordinate with you to design
> >>a mechanism for this.
> >
> >Is this relevant at all?
>
> Well, it tells us more or less where the callback would need to
> be.  :)  Although I think that __class__ should really point to the
> *decorated* class, rather than the undecorated one.  I have used
> decorators before that had to re-create the class object, but can't
> think of any use cases where I'd have wanted to use super() to refer
> to the *un*decorated class.

That's a problem, because I wouldn't know where to save a reference to
the cell until after the decorations are done. If you want to suggest
a solution, please study the patch first to see the difficulty.

> Btw, my thought on the keyword and __class__ thing is simply that the
> plus of having a keyword (or other compiler support) is that we don't
> have to have the cell variable cluttering up the frames for every
> single method, whether it uses super or not.

Oh, but the patch *does* have compiler support, and only creates the
cell when it is needed, and only passes it into those methods that
need it.

> Thus, my inclination is either to require explicit use of __class__
> (so the compiler would know whether to include the free variable), or
> to make super a keyword, so that in either case, only the functions
> that use it must pay for the overhead.

My patch uses an intermediate solution: it assumes you need __class__
whenever you use a variable named 'super'. Thus, if you (globally)
rename super to supper and use supper but not super, it won't work
without arguments (but it will still work if you pass it either
__class__ or the actual class object); if you have an unrelated
variable named super, things will work but the method will use the
slightly slower call path used for cell variables.

I believe IronPython uses a similar strategy to support locals() --
AFAIK it generates slower code that provides an accessible stack frame
when it thinks you may be using a global named 'locals'. So again,
globally renaming locals to something else won't work, but having an
unrelated variable named 'locals' will work at a slight performance
penalty.

> (Currently, functions that use any cell variables are invoked more
> slowly than ones without them; in 2.x at least there's a fast calling
> path for code objects with CO_NOFREE, and this change would make it
> useless for everything but top-level functions.)

Not true, explained above.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Fri Jun  8 00:42:08 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 15:42:08 -0700
Subject: [Python-3000] [Python-Dev] PEP 367: New Super
In-Reply-To: <ca471dc20706071541t43588314h45adb52622d6d1a5@mail.gmail.com>
References: <001101c79aa7$eb26c130$0201a8c0@mshome.net>
	<ca471dc20705270259ke665af6v3b5bdbffbd926330@mail.gmail.com>
	<009c01c7a04f$7e348460$0201a8c0@mshome.net>
	<ca471dc20705270550j5e199624xd4e8f6caa9dda93d@mail.gmail.com>
	<ca471dc20705281937y48300821u840add9d5454e8d9@mail.gmail.com>
	<ca471dc20705310448p5c5cfeds41fdc75e05c21f55@mail.gmail.com>
	<20070531170734.273393A40AA@sparrow.telecommunity.com>
	<ca471dc20706061431i3de7914bq14307fe7bc4f7ba7@mail.gmail.com>
	<20070607223114.274A73A4060@sparrow.telecommunity.com>
	<ca471dc20706071541t43588314h45adb52622d6d1a5@mail.gmail.com>
Message-ID: <ca471dc20706071542o67c65448x647e2058f5f398ca@mail.gmail.com>

BTW, from now on this is PEP 3135. http://python.org/dev/peps/pep-3135/

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rauli.ruohonen at gmail.com  Fri Jun  8 00:47:07 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Fri, 8 Jun 2007 01:47:07 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <fb6fbf560706071453uf4d6087s56bf95ec8c8e9e8f@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706070947g576a8177xa862efd03e71fc47@mail.gmail.com>
	<fb6fbf560706071453uf4d6087s56bf95ec8c8e9e8f@mail.gmail.com>
Message-ID: <f52584c00706071547o5d92d43ewea208a3111bc6d09@mail.gmail.com>

On 6/8/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> How would you expect them to work on arrays of code points?

Just like they do with Python 2.5 unicode objects, as long as the
"array of code points" is str, not e.g. a numpy array or tuple of ints,
which I don't expect to grow string methods :-)

> What sort of answer should the following produce?

That depends on what Python does when it reads in the source code.
I think it should normalize to NFC (which Python 2.5 does not do).

>     # matches by codepoints, but doesn't look like it
>     "Lo&#0308wis".startswith("Lo")
>     # if the above did match, then people will assume ? folds to o
>     "L&#00F6wis".startswith("Lo")
>     # looks like it matches.  Matches as text.  Does not match as bytes.
>     "Lo&#0308wis".startswith("L&#00F6")

Normalized to NFC:

"L&#00F6;wis".startswith("Lo")
"L&#00F6;wis".startswith("Lo")
"L&#00F6;wis".startswith("L&#00F6;")

After this Python lexes, parses and executes. The first two are false,
the last one true. All of the examples should look the same in your editor
(at least ideally). The following would, OTOH, be true, false, false:

"Lo\u0308wis".startswith("Lo")
"L\u00F6wis".startswith("Lo")
"Lo\u0308wis".startswith("L\u00F6")

As here the source code is pure ASCII, it's WYSIWYG everywhere.

Python 2.5's output with each:

>>> u"Lo?wis".startswith(u"Lo")
True
>>> u"L?wis".startswith(u"Lo")
False
>>> u"Lo?wis".startswith(u"L?")
False
>>> u"Lo\u0308wis".startswith(u"Lo")
True
>>> u"L\u00F6wis".startswith(u"Lo")
False
>>> u"Lo\u0308wis".startswith(u"L\u00F6")
False
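
(For comparison, a minimal sketch of getting the text-level answer with
explicit normalization, using the standard unicodedata module; the
helper name here is made up:)

    import unicodedata

    def nfc_startswith(s, prefix):
        # Normalize both operands to NFC so that the combining and
        # precomposed spellings compare as the same text.
        nfc = unicodedata.normalize
        return nfc('NFC', s).startswith(nfc('NFC', prefix))

    print(nfc_startswith(u"Lo\u0308wis", u"L\u00F6"))  # True
    print(nfc_startswith(u"L\u00F6wis", u"Lo"))        # False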

From rrr at ronadam.com  Fri Jun  8 01:20:23 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 07 Jun 2007 18:20:23 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46687316.8090109@v.loewis.de>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>	<4665EE44.2010306@ronadam.com>	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>	<4667CCB2.6040405@ronadam.com>	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
	<46687316.8090109@v.loewis.de>
Message-ID: <466892B7.4050108@ronadam.com>

Martin v. Löwis wrote:

> FWIW, for me the build error goes away when I unset
> LANG, so that the error occurs during build definitely
> *is* a locale issue.

Yes, and to pin it down a bit further...

This avoids the problem by setting the language to the default "C" which is 
a unicode string and has a .split method that accepts 0 args.

Also LANG is 4th on the list of possible language setting sources, so if 
one of the other 3 environment variables is set, setting or unsetting LANG 
will have no effect.


--- From gettext.py ---

# Locate a .mo file using the gettext strategy
def find(domain, localedir=None, languages=None, all=0):
     # Get some reasonable defaults for arguments that were not supplied
     if localedir is None:
         localedir = _default_localedir
     if languages is None:
         languages = []
         for envar in ('LANGUAGE', 'LC_ALL', 'LC_MESSAGES', 'LANG'):
                       # ^^^  first one is accepted.
             val = os.environ.get(envar)      #<<< should return unicode?
             if val:
                 languages = val.split(':')
                 break
         if 'C' not in languages:
             languages.append('C')                     # <<< unicode 'C'
     # now normalize and expand the languages
     nelangs = []
     for lang in languages:
         for nelang in _expand_lang(lang):    #<<< error in this call
                                              #    when it's normalized.
             if nelang not in nelangs:
                 nelangs.append(nelang)

------

Guido's patch avoids this, but that fix was also needed as unicode 
translate works differently than str.translate.

The os.environ.get() method probably should return a unicode string. (?)

Ron



From guido at python.org  Fri Jun  8 01:54:40 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 16:54:40 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <466892B7.4050108@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>
Message-ID: <ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>

On 6/7/07, Ron Adam <rrr at ronadam.com> wrote:
> Martin v. Löwis wrote:
>
> > FWIW, for me the build error goes away when I unset
> > LANG, so that the error occurs during build definitely
> > *is* a locale issue.
>
> Yes, and to pin it down a bit further...
>
> This avoids the problem by setting the language to the default "C" which is
> a unicode string and has a .split method that accepts 0 args.
>
> Also LANG is 4th on the list of possible language setting sources, so if
> one of the other 3 environment variables is set, setting or unsetting LANG
> will have no effect.
>
>
> --- From gettext.py ---
>
> # Locate a .mo file using the gettext strategy
> def find(domain, localedir=None, languages=None, all=0):
>      # Get some reasonable defaults for arguments that were not supplied
>      if localedir is None:
>          localedir = _default_localedir
>      if languages is None:
>          languages = []
>          for envar in ('LANGUAGE', 'LC_ALL', 'LC_MESSAGES', 'LANG'):
>                        # ^^^  first one is accepted.
>              val = os.environ.get(envar)      #<<< should return unicode?
>              if val:
>                  languages = val.split(':')
>                  break
>          if 'C' not in languages:
>              languages.append('C')                     # <<< unicode 'C'
>      # now normalize and expand the languages
>      nelangs = []
>      for lang in languages:
>          for nelang in _expand_lang(lang):    #<<< error in this call
>                                               #    when it's normalized.
>              if nelang not in nelangs:
>                  nelangs.append(nelang)
>
> ------
>
> Guido's patch avoids this, but that fix was also needed as unicode
> translate works differently than str.translate.
>
> The os.environ.get() method probably should return a unicode string. (?)

Indeed -- care to contribute a patch?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rauli.ruohonen at gmail.com  Fri Jun  8 02:26:41 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Fri, 8 Jun 2007 03:26:41 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <4666FB16.2070209@v.loewis.de>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<4666FB16.2070209@v.loewis.de>
Message-ID: <f52584c00706071726m515560acn9670c84d1b96943e@mail.gmail.com>

On 6/6/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > FWIW, I don't buy that normalization is expensive, as most strings are
> > in NFC form anyway, and there are fast checks for that (see UAX#15,
> > "Detecting Normalization Forms"). Python does not currently have
> > a fast path for this, but if it's added, then normalizing everything
> > to NFC should be fast.
>
> That would be useful to have, anyway. Would you like to contribute it?

I implemented it for all normalizations in the most straightforward way I
could think of, which was adding a field to _PyUnicode_DatabaseRecord,
generating data for it in makeunicodedata.py from
DerivedNormalizationProps.txt of UCD 4.1, and writing a function
is_normalized which uses it. The function is called from
unicodedata_normalized. I made the modifications against py3k-struni.
Does this sound reasonable?

I haven't made any contributions to Python before, but I heard attempting
such hazardous activity involves lots of hard knocks :-) Where should I
send the patch? I saw some patches here in other threads, but then again
http://www.python.org/dev/patches/ says to use SourceForge.

From rrr at ronadam.com  Fri Jun  8 04:31:44 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 07 Jun 2007 21:31:44 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>	
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>	
	<4665EE44.2010306@ronadam.com>	
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>	
	<4667CCB2.6040405@ronadam.com>	
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>	
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>	
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>
Message-ID: <4668BF90.5080302@ronadam.com>

Guido van Rossum wrote:

>> The os.environ.get() method probably should return a unicode string. (?)
> 
> Indeed -- care to contribute a patch?

I thought you might ask that.  :-)

It looks like the os.py module imports an 'environ' dictionary from various
sources depending on the platform.

       posix, nt, os2   <--->  posixmodule.c

       mac, ce, riscos  <--->  ?, ?, ?

Then os.py uses it to initialize the os._Environ user dict.  I can 
contribute a patch for os.py to convert the items at that point, but if 
someone imports the platform modules directly they will get surprises.

Patching posixmodule.c and the other platform files where ever they live 
may still be a bit beyond me at this time.

I'm still learning my way around Python's C code.  :-)

Cheers,
    Ron

From turnbull at sk.tsukuba.ac.jp  Fri Jun  8 05:31:44 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Fri, 08 Jun 2007 12:31:44 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706070947g576a8177xa862efd03e71fc47@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706070947g576a8177xa862efd03e71fc47@mail.gmail.com>
Message-ID: <87fy53glzz.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 Stephen wrote:

 > > I think the default case should be that text operations produce the
 > > expected result in the text domain, even at the expense of array
 > > invariants.
 > 
 > If you really want that, then you need a type for sequences of graphemes.

No.  "Text" != "sequence of graphemes".  For example:

 > E.g. 'c\u0308' is already normalized according to all four normalization
 > rules, but it's still one grapheme ('c' with diaeresis, c~)

Not on my terminal, it's not; it's two.  And what about audible
representation?

Python cannot compute graphemes, the Python user can only observe them
after some other process displays them.  So Python's definition of
"text" cannot be grapheme-based.

 > > People who need arrays of code points have several ways to get them,
 > > and the usual comparison operators will work on them as desired.
 > 
 > But regexps and other string operations won't,

I do not have any objection to treating Unicode strings as sequences
of code points, and allowing them to be unnormalized -- as an option.

The *default* should be to treat them as text, or there should be a
simple way to make it default ("import trueunicode").  I do not want
to have to check every string for normalization by hand.  I don't
object to the overhead---the overhead is already pretty high for
Unicode conformance.  It's that I know I'll make mistakes, or use
libraries that do undocumented I/O or non-Unicode-conformant
transformations, or whatever.  The right place to do such checking is
in the Unicode datatype, not in application code.

 > > While people who need operations on *text* still have no
 > > straightforward way to get them, and no promise of one as I read your
 > > remarks.
 > 
 > Then you missed some of his earlier remarks:
 > 
 > Guido:

 > : I'm all for adding a way to do normalized string comparisons to the
 > : library. But I'm not about to change the == operator to apply
 > : normalization first.

Funny, that's precisely the remark I was thinking of.

If I write a Unicode string, I want the == operator to "just work".
As quoted, Guido says it will not.  Note that we *already* have a way
to do normalized string comparisons via unicodedata, and we can even
use "==" for it.  So Guido would have every right to consider his
promise already fulfilled.

The problem is not that a code-point oriented operator won't work if
you know you have two TrueText objects; you only have to implement
them correctly, and code-point comparison Just Works.  The problem is
that it's going to be very hard to be sure that you've got TrueText as
opposed to arrays of shorts if the *language* does not provide ways to
enforce the distinction.


From martin at v.loewis.de  Fri Jun  8 06:04:05 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Fri, 08 Jun 2007 06:04:05 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<ca471dc20706051504k53d79e8i9ef7b0ae647ba89a@mail.gmail.com>	
	<acd65fa20706051543w24d24a5fo779ff781f75bff33@mail.gmail.com>	
	<4665EE44.2010306@ronadam.com>	
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>	
	<4667CCB2.6040405@ronadam.com>	
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>	
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>	
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>
Message-ID: <4668D535.7020103@v.loewis.de>

>> The os.environ.get() method probably should return a unicode string. (?)
> 
> Indeed -- care to contribute a patch?

Ideally, such a patch would make use of the Win32 Unicode API for
environment variables on Windows. People had already been complaining
that they can't have "funny characters" in the value of an environment
variable, even though the UI allows them to set the variable just fine.

Regards,
Martin


From guido at python.org  Fri Jun  8 06:06:49 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 21:06:49 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4668D535.7020103@v.loewis.de>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<4665EE44.2010306@ronadam.com>
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>
	<4668D535.7020103@v.loewis.de>
Message-ID: <ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>

On 6/7/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> The os.environ.get() method probably should return a unicode string. (?)
> >
> > Indeed -- care to contribute a patch?
>
> Ideally, such a patch would make use of the Win32 Unicode API for
> environment variables on Windows. People had already been complaining
> that they can't have "funny characters" in the value of an environment
> variable, even though the UI allows them to set the variable just fine.

Yeah, but the Windows build of py3k is currently badly broken (e.g.
the _fileio.c extension probably doesn't work at all) -- and I don't
have access to a Windows box to work on it. I'm afraid 3.0a1 will be
released without Windows support. Of course I'm counting on others to
fix that before 3.0 final is released.

I don't mind for now that the posix.environ variable contains 8-bit
strings -- people shouldn't be importing that anyway.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Fri Jun  8 06:15:51 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Fri, 08 Jun 2007 06:15:51 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706071726m515560acn9670c84d1b96943e@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	
	<4666FB16.2070209@v.loewis.de>
	<f52584c00706071726m515560acn9670c84d1b96943e@mail.gmail.com>
Message-ID: <4668D7F7.7000106@v.loewis.de>

> I implemented it for all normalizations in the most straightforward way I
> could think of, which was adding a field to _PyUnicode_DatabaseRecord,
> generating data for it in makeunicodedata.py from
> DerivedNormalizationProps.txt of UCD 4.1, and writing a function
> is_normalized which uses it. The function is called from
> unicodedata_normalized. I made the modifications against py3k-struni.
> Does this sound reasonable?

In principle, yes. What's the cost of the additional field in terms of
a size increase? If you just need another bit, could that fit into
_PyUnicode_TypeRecord.flags instead?

> I haven't made any contributions to Python before, but I heard attempting
> such hazardous activity involves lots of hard knocks :-) Where should I
> send the patch? I saw some patches here in other threads, but then again
> http://www.python.org/dev/patches/ tells to use SourceForge.

That would be best. You only need to include the patch to the generator,
not the generated data. I'd like to see it in 2.6, so ideally, you would
test it for the trunk (not that the branch should matter much).

Don't forget to include test suite and documentation changes.

Regards,
Martin

From stephen at xemacs.org  Fri Jun  8 10:21:36 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 08 Jun 2007 17:21:36 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
Message-ID: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>

Guido van Rossum writes:

 > If you want to have an abstraction that guarantees you'll never see
 > an unnormalized text string you should design a library for doing so.

OK.

 > (*) It looks like such a library will not have a way to talk about
 > "\u0308" at all, since it is considered unnormalized.

From the Unicode Standard, v4.0, p. 43: "In the Unicode Standard, all
sequences of character codes are permitted."  Since normalization only
applies to characters with decompositions, "\u0308" is indeed valid
Unicode, a one-character sequence in NFC.
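
(A quick stdlib illustration of that point -- a lone combining mark has
nothing to compose with, so it is its own NFC form:)

    import unicodedata
    s = u'\u0308'                                  # COMBINING DIAERESIS
    print(len(s))                                  # 1
    print(unicodedata.normalize('NFC', s) == s)    # True: already NFC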

AFAIK, the only strings the Unicode standard absolutely prohibits
emitting are those containing code points guaranteed not to be
characters by the standard.  And normalization is simply an internal
technique that allows text operations to be implemented code-point-
wise without fear that emitting them would result in illegal sequences
or other externally visible incompatibilities with the standard.

So there's nothing "wrong by definition" about defining strings as
sequences of code points, and string operations in code-point-wise
fashion.  It just makes that library for Unicode more expensive to
design and operate, and will require auditing and reimplementation of
common libraries (including the standard library) by every program
that requires strict Unicode conformance.


From rauli.ruohonen at gmail.com  Fri Jun  8 10:21:01 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Fri, 8 Jun 2007 11:21:01 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <4668D7F7.7000106@v.loewis.de>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<4666FB16.2070209@v.loewis.de>
	<f52584c00706071726m515560acn9670c84d1b96943e@mail.gmail.com>
	<4668D7F7.7000106@v.loewis.de>
Message-ID: <f52584c00706080121g14088d60h6876942bb7cc9fd@mail.gmail.com>

On 6/8/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> In principle, yes. What's the cost of the additional field in terms of
> a size increase? If you just need another bit, could that fit into
> _PyUnicode_TypeRecord.flags instead?

The additional field is 8 bits, two bits for each normalization (a
Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are
used, but I don't know if that's true of later versions. As
_PyUnicode_Database_Records stores only unique records, this also results
in an increase of the number of records, from 219 to 304. Each record
looks like this:

typedef struct {
    const unsigned char category;
    const unsigned char combining;
    const unsigned char bidirectional;
    const unsigned char mirrored;
    const unsigned char east_asian_width;
    const unsigned char normalization_quick_check; /* my addition */
} _PyUnicode_DatabaseRecord;

I added the field to this record because the function needs to get the
record anyway for each character (it needs the field "combining", too).
The new field combines values for the derived properties (trinary)
NFD_Quick_Check, NFKD_Quick_Check, NFC_Quick_Check and NFKC_Quick_Check.

Here's the main loop (works for all four normalizations, only the value of
quickcheck_shift changes):

    while (i < end) {
        const _PyUnicode_DatabaseRecord *record = _getrecord_ex(*i++);
        unsigned char combining = record->combining;
        unsigned char quickcheck = record->normalization_quick_check;

        if ((quickcheck>>quickcheck_shift) & 3)
            return 0; /* this character might need normalization */
        if (combining && prev_combining > combining)
            return 0; /* non-canonical order, not normalized */
        prev_combining = combining;
    }
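
(Roughly the same logic in Python, for readers who prefer it to C. The
two lookup callables are stand-ins for the _getrecord_ex() field
accesses, not real APIs:)

    YES = 0  # hypothetical quick-check encoding; non-zero means No/Maybe

    def is_normalized(codepoints, quick_check, combining_class):
        prev_combining = 0
        for cp in codepoints:
            if quick_check(cp) != YES:
                return False  # this character might need normalization
            combining = combining_class(cp)
            if combining and prev_combining > combining:
                return False  # marks out of canonical order
            prev_combining = combining
        return True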

> That would be best. You only need to include the patch to the generator,
> not the generated data. I'd like to see it in 2.6, so ideally, you would
> test it for the trunk (not that the branch should matter much)).

This is easy to do. The differences in these files between the versions
are very small, and I actually initially wrote it for 2.5, as
py3k-struni's normalization test fails at the moment.

> Don't forget to include test suite and documentation changes.

It doesn't affect behavior or the API much(*), only performance. Current
test_normalize.py uses a test suite it fetches from UCD, so it
should be adequate.

(*) You *can* test for its presence by e.g. checking whether
    unicodedata.normalize('NFC', u'a') is u'a' or not.
    The documentation does not specify either way. I'd say it's an
    implementation detail, and both tests and documentation should ignore
    it.

From rauli.ruohonen at gmail.com  Fri Jun  8 15:38:13 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Fri, 8 Jun 2007 16:38:13 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>

On 6/8/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> AFAIK, the only strings the Unicode standard absolutely prohibits
> emitting are those containing code points guaranteed not to be
> characters by the standard.

The ones it absolutely prohibits in interchange are surrogates. They
are also illegal in both UTF-16 and UTF-8. The pragmatic reason is that
if you do encode them despite their illegality (like Python codecs do),
strings won't always survive a round-trip to such pseudo-UTF-16
because multiple code point sequences necessarily map to the same byte
sequence. For some reason Python's UTF-8 encoder introduces this
ambiguity too, even though there's no need to do so with pseudo-UTF-8.

In Python UCS-2 builds even string processing in the core works
inconsistently with surrogates. Sometimes pseudo-UCS-2 is assumed,
sometimes pseudo-UTF-16, and these are incompatible because
pseudo-UTF-16 can't always represent surrogates, but pseudo-UCS-2 can.
OTOH pseudo-UCS-2 can't represent code points outside the BMP, but
pseudo-UTF-16 can. There's no way to always do the right thing as long
as these two are mixed, but somebody somewhere probably depends on this
behavior.

Other than surrogates, there are two classes of characters with
"restricted interchange". One is reserved characters, which need to
be preserved if found in text for compatibility with future versions of
the standard. Another is noncharacters, which are "reserved for
internal use, such as for sentinel values". These should obviously be
allowed, as the user may want to use them internally in their Python
program.

> So there's nothing "wrong by definition" about defining strings as
> sequences of code points, and string operations in code-point-wise
> fashion. It just makes that library for Unicode more expensive to
> design and operate, and will require auditing and reimplementation of
> common libraries (including the standard library) by every program
> that requires strict Unicode conformance.

It's not perfect, but that's the state of the art. AFAIK this (or worse)
is what the other implementations do. Even the Unicode standard
explains that strings generally work that way:

  2.7. Unicode Strings

  A Unicode string datatype is simply an ordered sequence of code
  units. Thus a Unicode 8-bit string is an ordered sequence of
  8-bit code units, a Unicode 16-bit string is an ordered sequence
  of 16-bit code units, and a Unicode 32-bit string is an ordered
  sequence of 32-bit code units.

  Depending on the programming environment, a Unicode string may or
  may not also be required to be in the corresponding Unicode encoding
  form. For example, strings in Java, C#, or ECMAScript are Unicode
  16-bit strings, but are not necessarily well-formed UTF-16 sequences.

From jimjjewett at gmail.com  Fri Jun  8 16:27:40 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 8 Jun 2007 10:27:40 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706080121g14088d60h6876942bb7cc9fd@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<4666FB16.2070209@v.loewis.de>
	<f52584c00706071726m515560acn9670c84d1b96943e@mail.gmail.com>
	<4668D7F7.7000106@v.loewis.de>
	<f52584c00706080121g14088d60h6876942bb7cc9fd@mail.gmail.com>
Message-ID: <fb6fbf560706080727w527c7d68t694f32d37c1d1dfa@mail.gmail.com>

On 6/8/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> The additional field is 8 bits, two bits for each normalization (a
> Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are
> used, but I don't know if that's true of later versions.

There are no "Maybe" values for the Decomposed forms.

It is impossible to be Compatibility without also being Canonical.
(The definition of Compatibility includes folding as much as possible
under either form.)

So there are really 3 possibilities (both, canonical only, neither)
for the decomposed, and (at most) 6 for the composed forms.  (I'm not
sure all 6 of those can occur in practice.)

But there are other normalization forms that may be added later.  The
ones I found reference to are basically orthogonal (an existing
normalization may or may not meet them).

See the proposed changes at http://www.unicode.org/reports/tr15/tr15-28.html

-jJ

From amcnabb at mcnabbs.org  Fri Jun  8 19:00:49 2007
From: amcnabb at mcnabbs.org (Andrew McNabb)
Date: Fri, 8 Jun 2007 11:00:49 -0600
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <fb6fbf560706061606g73fe279di2843add1c48969ea@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560706061606g73fe279di2843add1c48969ea@mail.gmail.com>
Message-ID: <20070608170049.GB20665@mcnabbs.org>

On Thu, Jun 07, 2007 at 06:50:57PM -0400, Jim Jewett wrote:
>  On 6/7/07, Andrew McNabb <amcnabb at mcnabbs.org> wrote:
> > On Wed, Jun 06, 2007 at 07:06:05PM -0400, Jim Jewett wrote:
> > > (There were mixed opinions on Technical symbols, and no one has spoken
> > > up yet about the half-dozen Croatian digraphs corresponding to Serbian
> > > Cyrillic.)
> 
>  If the digraphs were converted to compatibility characters, would
>  that be good, bad, or no big deal?
> 
>  I'm not entirely certain which letters Stephen was talking about, but
>  believe they are the (upper, lower, and titlecase) digraphs for Ǉ, Ǌ,
>  Ǳ, Ǆ (DZ caron)
> 
>  Would it be acceptable if (only in identifier names, not normal text)
>  python treated those the same as the two-character sequences LJ, NJ,
>  DZ, and DŽ?

I speak Serbian as a second language (and lived in Serbia for a few
years), and my opinion is that a Serbian/Croatian speaker would expect
the digraphs to be treated the same as the two-character sequences.

The issue doesn't seem to come up too often, but people using
typewriters have been typing the digraphs as separate characters for
years.  The place I noticed the issue most frequently was if there was a
vertical sign, such as a storefront.

A sign saying "bookstore" would look like this:

K
ǌ
i
ž
a
r
a

or:

K
nj
i
ž
a
r
a

The following would be incorrect:

K
n
j
i
ž
a
r
a

But even many native speakers make this mistake.

Other than that, ǌ is practically indistinguishable from nj, and the
other Croatian digraphs have the same behavior.

I hope this helps in the discussion.

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 186 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070608/867eb658/attachment.pgp 

From martin at v.loewis.de  Fri Jun  8 19:36:30 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Fri, 08 Jun 2007 19:36:30 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706080121g14088d60h6876942bb7cc9fd@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	
	<4666FB16.2070209@v.loewis.de>	
	<f52584c00706071726m515560acn9670c84d1b96943e@mail.gmail.com>	
	<4668D7F7.7000106@v.loewis.de>
	<f52584c00706080121g14088d60h6876942bb7cc9fd@mail.gmail.com>
Message-ID: <4669939E.2020700@v.loewis.de>

> The additional field is 8 bits, two bits for each normalization (a
> Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are
> used, but I don't know if that's true of later versions. As
> _PyUnicode_Database_Records stores only unique records, this also results
> in an increase of the number of records, from 219 to 304. Each record
> looks like this:

If I count correctly, this gives roughly 900 additional bytes. That's
fine.

> It doesn't affect behavior or the API much(*), only performance. Current
> test_normalize.py uses a test suite it fetches from UCD, so it
> should be adequate.

I assumed you want to expose it to Python also, as an is_normalized
function. I guess not having such a function is fine if applications
can do normalize(form, s) == s and have that be efficient as long
as the outcome is true (i.e. if it is more expensive only if it's
not normalized).
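
In code, that idiom would be just (a sketch):

    import unicodedata

    def is_normalized(form, s):
        # Cheap when s is already normalized: with the quick check in
        # place, normalize() can return without rebuilding the string.
        return unicodedata.normalize(form, s) == s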

Regards,
Martin

From martin at v.loewis.de  Fri Jun  8 22:31:28 2007
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Fri, 08 Jun 2007 22:31:28 +0200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <20070608170049.GB20665@mcnabbs.org>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>	<4664E238.9020700@v.loewis.de>	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560706061606g73fe279di2843add1c48969ea@mail.gmail.com>
	<20070608170049.GB20665@mcnabbs.org>
Message-ID: <4669BCA0.6000902@v.loewis.de>

> I hope this helps in the discussion.

Indeed it does. When I find the time, I'll propose a change
to the PEP to do NFKC.
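
(A short demonstration of what NFKC does to such a digraph, using the
stdlib unicodedata module:)

    import unicodedata
    # U+01CC LATIN SMALL LETTER NJ folds to the two-letter sequence
    # under compatibility (K) normalization:
    print(unicodedata.normalize('NFKC', u'\u01CC') == u'nj')   # True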

Regards,
Martin

From martin at v.loewis.de  Fri Jun  8 22:41:31 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Fri, 08 Jun 2007 22:41:31 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0705250341180.27740@server1.LFW.org>
References: <465615C9.4080505@v.loewis.de>	<320102.38046.qm@web33515.mail.mud.yahoo.com>	<19dd68ba0705241805y52ba93fdt284a2c696b004989@mail.gmail.com>
	<Pine.LNX.4.58.0705242126440.27740@server1.LFW.org>	<ca471dc20705242009h27882084la242b96222e28b29@mail.gmail.com>
	<Pine.LNX.4.58.0705250341180.27740@server1.LFW.org>
Message-ID: <4669BEFB.1000301@v.loewis.de>

> This keeps getting characterized as only a security argument, but
> it's much deeper; it's a basic code comprehension issue.

Despite you repeating this over and over, I still honestly, sincerely
do not understand the concern. You might be technically correct,
but I feel that the cases where these issues could really arise
in practice are so obscure that I can safely ignore them.

More specifically:

> Python will lose the ability to make a reliable round trip
> between a computer file and any human-accessible medium
> such as a visual display or a printed page.

Practically, this is just not true. *Of course* you will be
able to type in a piece of Python code written on paper,
provided you understand the natural language that the
identifiers use. That the glyphs might be ambiguous is
not an issue at all. What could really stop you from typing
in the code is that you don't know how to type the
characters, however I don't see that as a problem, either -
I rarely need to type in code from a piece of paper,
anyway, and only ever do so when I understand what the
code does (so I likely don't type it in *literally*).

> The Python language will become too large for any single
> person to fully know

Again, practically, this is not true. We both know what
PEP 3131 says about identifiers: they start with a letter,
followed by letters and digits. I know the entire language
full well. The fact that I cannot enumerate all
letters doesn't bother me in the slightest.

> Python programs that reuse other Python modules may come
> to contain a mix of character sets such that no one can
> fully read them or properly display them.

We will see. I find that unlikely to happen (although
not entirely impossible).

> Unicode is young and unfinished.

I commented on this earlier already: this is nonsense.
Unicode is as old as Python (so perhaps Python is also
young and unfinished).

Regards,
Martin

From guido at python.org  Sat Jun  9 00:27:51 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 8 Jun 2007 15:27:51 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
Message-ID: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>

PEP 3127 (Integer Literal Support and Syntax) introduces new notations
for octal and binary integers. This isn't implemented yet. Are there
any takers? It shouldn't be particularly complicated.
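
For reference, the notations the PEP specifies:

    0o777    # octal; replaces the legacy 0777 spelling
    0b1010   # binary; == 10
    0x1F     # hexadecimal, unchanged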

Separately, the 2to3 tool needs a fixer for this (and it should also
accept the new notations in its input).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From collinw at gmail.com  Sat Jun  9 00:36:30 2007
From: collinw at gmail.com (Collin Winter)
Date: Fri, 8 Jun 2007 15:36:30 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
Message-ID: <43aa6ff70706081536t73376a18t35e7c47a67bbc54a@mail.gmail.com>

On 6/8/07, Guido van Rossum <guido at python.org> wrote:
> Separately, the 2to3 tool needs a fixer for this (and it should also
> accept the new notations in its input).

I wrote a num_literals fixer when the debate over this feature was
still in progress. It's checked in, but I need to sync it with the
latest version of the PEP. I'll take care of that.

Collin Winter

From collinw at gmail.com  Sat Jun  9 00:37:55 2007
From: collinw at gmail.com (Collin Winter)
Date: Fri, 8 Jun 2007 15:37:55 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <43aa6ff70706081536t73376a18t35e7c47a67bbc54a@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<43aa6ff70706081536t73376a18t35e7c47a67bbc54a@mail.gmail.com>
Message-ID: <43aa6ff70706081537gc0b218ap379d3383fa139051@mail.gmail.com>

On 6/8/07, Collin Winter <collinw at gmail.com> wrote:
> On 6/8/07, Guido van Rossum <guido at python.org> wrote:
> > Separately, the 2to3 tool needs a fixer for this (and it should also
> > accept the new notations in its input).
>
> I wrote a num_literals fixer when the debate over this feature was
> still in progress. It's checked in, but I need to sync it with the
> latest version of the PEP. I'll take care of that.

Oops, Georg Brandl was actually the fixer's original author. Sorry, Georg!

From stephen at xemacs.org  Sat Jun  9 06:33:07 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 09 Jun 2007 13:33:07 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
Message-ID: <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > The ones it absolutely prohibits in interchange are surrogates.

Excuse me?  Surrogates are code points with a specific interpretation
if it is "purported that the stream is in UTF-16".  Otherwise, Unicode
4.0 explicitly says that there is nothing illegal about an isolated
surrogate (p.75, where an example is given of how such a surrogate
might occur).  That surrogate may not be interpreted as an abstract
character (C4, p.58), but it is not a non-character (Table 2-2, p.25).
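
To make the distinction concrete, here is a sketch of the behavior
under discussion (using CPython 3.x defaults purely for illustration;
the exact codec behavior is an implementation choice):

s = '\ud800'        # a lone surrogate is a legal code point in a str
len(s)              # -> 1; the string is just a sequence of code points
s.encode('utf-8')   # -> UnicodeEncodeError; the codec *does* purport
                    #    that its output is well-formed UTF-8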

I agree that it's unfortunate that some parts of Python treat Unicode
string objects purely as sequences of Unicode code points, and others
purport (apparently without checking) that such strings are in UTF-16.
Unicode conformance is not part of the Python language.  That's life.

But let's try to avoid creating difficulties that don't exist in the
standard.

 > > So there's nothing "wrong by definition" about defining strings as
 > > sequences of code points, and string operations in code-point-wise
 > > fashion.

 > It's not perfect, but that's the state of the art. AFAIK this (or worse)
 > is what the other implementations do.

My point was precisely that I don't object to this implementation.  I
want Unicode-ly-correct behavior to be a goal of the language, the
community disagrees, and Guido disagrees.  That's that.

Thank you for starting work on the implementation; let's concentrate on
that.

From stephen at xemacs.org  Sat Jun  9 09:45:02 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 9 Jun 2007 16:45:02 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <fb6fbf560706061606g73fe279di2843add1c48969ea@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>
	<4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560706061606g73fe279di2843add1c48969ea@mail.gmail.com>
Message-ID: <18026.23166.928863.613890@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > but I think dealing with K characters is now a "least of evils"
 > decision, instead of "we need them for something."

Agreed.

 > On another note, I have no idea how Martin's name (in the Cc line)
 > ended up as: [scrambled stuff]

That's almost surely me.  The composer part of my MUA of choice
handles Japanese fine, but doesn't like general Unicode much.  So I've
switched to a different composer, but the two MUAs differ on the
protocol for passing reply information from the reader to the
composer.  RFC 2047 headers are one thing that often gets fumbled.


From g.brandl at gmx.net  Sat Jun  9 09:39:19 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 09 Jun 2007 09:39:19 +0200
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
Message-ID: <f4dlf6$3fn$1@sea.gmane.org>

Guido van Rossum wrote:
> PEP 3127 (Integer Literal Support and Syntax) introduces new notations
> for octal and binary integers. This isn't implemented yet. Are there
> any takers? It shouldn't be particularly complicated.

I have a patch lying around here which might be quite complete...

One thing that's unclear to me though: didn't we decide to drop the uppercase
string modifiers/number suffixes/prefixes?

Also, I'm not sure what int() should do with "010".

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From martin at v.loewis.de  Sat Jun  9 09:55:42 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sat, 09 Jun 2007 09:55:42 +0200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <fb6fbf560706061606g73fe279di2843add1c48969ea@mail.gmail.com>
References: <fb6fbf560706041937v666ec520x3e69196fd5451a38@mail.gmail.com>	
	<4664E238.9020700@v.loewis.de>	
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560706061606g73fe279di2843add1c48969ea@mail.gmail.com>
Message-ID: <466A5CFE.5040906@v.loewis.de>

> On another note, I have no idea how Martin's name (in the Cc line) ended
> up as:
> 
> """
> L$(D+S(Bwis"
> """
> 
> If I knew, it *might* have a bearing on what sorts of
> canonicalizations should be performed, and what sorts of warnings the
> parser ought to emit for likely corrupted text.

That results from a faulty iso-2022-jp-1 conversion. ESC $ ( D switches
to JIS X 0212-1990 (which apparently includes ö at code position 0x25B3);
ESC ( B switches back to ASCII.
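
For the curious, the round trip can be sketched in Python (assuming
the iso2022_jp_1 codec; the two bytes '+S' are one JIS X 0212 code):

raw = b'L\x1b$(D+S\x1b(Bwis'       # ESC $ ( D ... ESC ( B around '+S'
print(raw.decode('iso2022_jp_1'))  # should print the intended 'Löwis'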

I don't think this has anything to do with normalization.

Regards,
Martin

From ncoghlan at gmail.com  Sat Jun  9 13:19:00 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 09 Jun 2007 21:19:00 +1000
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <f4dlf6$3fn$1@sea.gmane.org>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<f4dlf6$3fn$1@sea.gmane.org>
Message-ID: <466A8CA4.5030906@gmail.com>

Georg Brandl wrote:
> Also, I'm not sure what int() should do with "010".

The only change would be for int(x, 0), and that should raise a 
ValueError, just like any other invalid string.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From guido at python.org  Sat Jun  9 17:39:11 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 9 Jun 2007 08:39:11 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <f4dlf6$3fn$1@sea.gmane.org>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<f4dlf6$3fn$1@sea.gmane.org>
Message-ID: <ca471dc20706090839j51407145u543449943da9fb98@mail.gmail.com>

On 6/9/07, Georg Brandl <g.brandl at gmx.net> wrote:
> Guido van Rossum wrote:
> > PEP 3127 (Integer Literal Support and Syntax) introduces new notations
> > for octal and binary integers. This isn't implemented yet. Are there
> > any takers? It shouldn't be particularly complicated.
>
> I have a patch lying around here which might be quite complete...

Cool!

> One thing that's unclear to me though: didn't we decide to drop the uppercase
> string modifiers/number suffixes/prefixes?

In the end (doesn't the PEP confirm this?) we decided to keep them
and make it a style rule instead. Some folks have generated data sets
using uppercase.

> Also, I'm not sure what int() should do with "010".

int("010") should return (decimal) 10.
int("010", 0) should raise ValueError.

I thought that was also in the PEP.
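
Spelled out (assuming the literal-style parsing that base 0 requests):

int("010")      # -> 10; with no base given, the string is decimal
int("010", 8)   # -> 8; an explicit base still works
int("0o10", 0)  # -> 8; base 0 parses the string like a source literal
int("010", 0)   # -> ValueError; leading-zero octal literals are gone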

Anyway, with these tweaks, feel free to just check it in (well, if you
also fix the standard library to use the new notation).

--Guido

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rauli.ruohonen at gmail.com  Sat Jun  9 23:01:57 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 10 Jun 2007 00:01:57 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>

On 6/9/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Rauli Ruohonen writes:
>  > The ones it absolutely prohibits in interchange are surrogates.
>
> Excuse me?  Surrogates are code points with a specific interpretation
> if it is "purported that the stream is in UTF-16".  Otherwise, Unicode
> 4.0 explicitly says that there is nothing illegal about an isolated
> surrogate (p.75, where an example is given of how such a surrogate
> might occur).

I meant interchange instead of strings. Anything is allowed in strings.

Chapter 2 (not normative, but clear) explains on page 26:

 Restricted interchange. [...]
  - Surrogate code points cannot be conformantly interchanged using
    Unicode encoding forms. [...]
  - Noncharacter code points are reserved for internal use, such as for
    sentinel values. They should never be interchanged. [...]

> My point was precisely that I don't object to this implementation.  I
> want Unicode-ly-correct behavior to be a goal of the language, the
> community disagrees, and Guido disagrees.  That's that.

My understanding is that it is a goal, but practicality beats purity.
I think the only disagreement is on what's practical.

From tomerfiliba at gmail.com  Sun Jun 10 01:32:05 2007
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sun, 10 Jun 2007 01:32:05 +0200
Subject: [Python-3000] rethinking pep 3115
Message-ID: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com>

pep 3115 (new metaclasses) seems overly complicated imho.
it fails the "keep it simple" test, among other heuristics.

(1)
the trivial fix-up would be to extend the type constructor to take
4 arguments: (name, bases, attrs, order), where 'attrs' is a plain
old dict, and 'order' is a list, into which the names are appended
in the order they were defined in the body of the class. this way,
no new types are introduced and 99% of the use cases are covered.

things like "forward referencing in the class namespace" are evil.
and besides, it's not possible to do with functions and modules,
so why should classes be allowed such a mischief?

(2)
the second-best solution i could think of is just passing the dict as a
keyword argument to the class, like so:

class Spam(metaclass = Bacon, dict = {}):
    ...

so you could explicitly state you need a special dict.

following the cosmetic change of moving the magical __metaclass__
attribute from the class body into the class header, it makes no
sense to replace it with another magical method, __prepare__.
the straightforward-and-simple way would be to make it a keyword
argument, just like 'metaclass'.
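
for reference, a minimal sketch of the pep 3115 machinery i'm
criticizing (names here are illustrative, not from the pep):

class ordered_ns(dict):
    # a custom namespace object that records insertion order
    def __init__(self):
        dict.__init__(self)
        self.order = []
    def __setitem__(self, key, value):
        if key not in self:
            self.order.append(key)
        dict.__setitem__(self, key, value)

class OrderedMeta(type):
    @classmethod
    def __prepare__(mcs, name, bases):
        return ordered_ns()          # the extra magic pep 3115 adds
    def __new__(mcs, name, bases, ns):
        cls = type.__new__(mcs, name, bases, dict(ns))
        cls.member_order = ns.order  # what proposal (1) would pass directly
        return cls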

(3)
personally, i refrain from metaclasses. in my experience,
they just cause trouble, while the benefits of using them are marginal.
the problem is especially noticeable when trying to understand
and debug third-party code. metaclasses + bugs = blackmagic.

moreover, they introduce inheritance issues. the class hierarchy
becomes rigid and difficult to evolve as the need arises, which
contradicts my perception of agile languages. i like to view programming
as an iterative task which approaches the final objective after
several loops. rigidness makes each loop longer, which is why
i prefer dynamic languages to compiled ones.

on the other hand, i do understand the need for metaclasses,
if only for the sake of symmetry (as types are objects).
but the solution proposed by pep 3115, of making metaclasses
even more complicated and magical, seems all wrong to me.

i understand it's already been accepted, but i'm hoping there's
still time to reconsider this before 3.0 becomes final.


-tomer
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070610/a2c7369a/attachment.htm 

From stephen at xemacs.org  Sun Jun 10 10:03:19 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 10 Jun 2007 17:03:19 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<-6248387165431892706@unknownmsgid>
	<ca471dc20706061037n16d35c3cs566c9dec9fa6e1ac@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
Message-ID: <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > On 6/9/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
 > > Rauli Ruohonen writes:
 > >  > The ones it absolutely prohibits in interchange are surrogates.
 > >
 > > Excuse me?  Surrogates are code points with a specific interpretation
 > > if it is "purported that the stream is in UTF-16".  Otherwise, Unicode
 > > 4.0 explicitly says that there is nothing illegal about an isolated
 > > surrogate (p.75, where an example is given of how such a surrogate
 > > might occur).
 > 
 > I meant interchange instead of strings. Anything is allowed in
 > strings.

I think you misunderstand.  Anything in Unicode that is normative is
about interchange.  Strings are also a means of interchange---between
modules (separate Unicode processes) in a program (single OS process).
Python language and library implementation is going to be primarily
concerned with interchange in the intermodule sense.

Your complaint about Python mixing "pseudo-UTF-16" with "pseudo-UCS-2"
is precisely a statement that various modules in Python do not specify
what encoding forms they purport to accept or emit.  The purpose of
the definitions in chapter 3 is to clarify the requirements of
conformance.  The discussion of strings is implicitly about
interchange, otherwise it would be somewhere else than the chapter
about conformance.

 > My understanding is that it is a goal, but practicality beats purity.
 > I think the only disagreement is on what's practical.

It is not a goal of the *language*; there is no object in the
*language* that we can say is buggy if it doesn't conform to the
Unicode standard.  Unicode conformance for Python, as of today, is a
WIBNI.

As Guido points out, the goal is a language that can be used to write
efficient implementations of Unicode *if the users want to pay that
cost*, not to provide an implementation so the users don't have to.


From g.brandl at gmx.net  Sun Jun 10 10:30:30 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Sun, 10 Jun 2007 10:30:30 +0200
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <ca471dc20706090839j51407145u543449943da9fb98@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>	<f4dlf6$3fn$1@sea.gmane.org>
	<ca471dc20706090839j51407145u543449943da9fb98@mail.gmail.com>
Message-ID: <f4gcr4$jtc$1@sea.gmane.org>

Guido van Rossum wrote:
> On 6/9/07, Georg Brandl <g.brandl at gmx.net> wrote:
>> Guido van Rossum wrote:
>> > PEP 3127 (Integer Literal Support and Syntax) introduces new notations
>> > for octal and binary integers. This isn't implemented yet. Are there
>> > any takers? It shouldn't be particularly complicated.
>>
>> I have a patch lying around here which might be quite complete...
> 
> Cool!
> 
>> One thing that's unclear to me though: didn't we decide to drop the uppercase
>> string modifiers/number suffixes/prefixes?
> 
> In the end (doesn't the PEP confirm this?) we decided to keep them
> and make it a style rule instead. Some folks have generated data sets
> using uppercase.

The PEP lists it as an "Open Issue".

>> Also, I'm not sure what int() should do with "010".
> 
> int("010") should return (decimal) 10.
> int("010", 0) should raise ValueError.
> 
> I thought that was also in the PEP.

Yes, but rather than following the PEP blindly (it might not have been
updated with the latest discussion results), asking can't hurt :)

> Anyway, with these tweaks, feel free to just check it in (well, if you
> also fix the standard library to use the new notation).

That should be easy enough.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From martin at v.loewis.de  Sun Jun 10 10:46:12 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 10:46:12 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <73543.28835.qm@web33502.mail.mud.yahoo.com>
References: <73543.28835.qm@web33502.mail.mud.yahoo.com>
Message-ID: <466BBA54.3020300@v.loewis.de>

> To truly enable Python in a non-English teaching
> environment, I think you'd actually want to go a step
> further and just internationalize the whole program.

I don't know why that theory keeps popping up when people
have repeatedly pointed out that it is just false.

People *can* get used to the keywords of Python even if
they have no clue what they mean. There is plenty of
evidence for that. Likewise for the standard library.

Regards,
Martin

From martin at v.loewis.de  Sun Jun 10 11:00:17 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 11:00:17 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
Message-ID: <466BBDA1.7070808@v.loewis.de>

> Here is what I have to say (to everyone in this discussion, not
> specifically to you, Stephen) in response to said labelling:

Interestingly enough, we agree on the principles, and just
judge the PEP differently wrt. these principles

> Many of us value a *predictable* identifier character set.
> Whether "predictable" means ASCII only, or user-selectable, or
> restricted by default, I think we all agree in this sentiment:

Indeed, PEP 3131 gives a predictable identifier character set.
Adding per-site options to change the set of allowable characters
makes it less predictable.

> We believe that we should try to make it easier, not harder, for
> programmers to understand what Python code says.  This has many
> benefits (reliability, readability, transparency, reviewability,
> debuggability).  I consider these core strengths of Python.

Indeed. That was my primary motivation for the PEP: to make
it easier for programmers to understand Python, and to allow
people to write more transparent programs.

> That is what makes these strengths so important.  I hope this
> helps you understand why these concerns can't and shouldn't be
> brushed off as "paranoia" -- this really has to do with the
> core values of the language.

It just seems that the concerns don't directly follow from
the principles. Something else has to be added to make that
conclusion. It may not be paranoia (i.e. excessive anxiety),
but there surely is some fear, no?

Regards,
Martin

From rauli.ruohonen at gmail.com  Sun Jun 10 18:20:44 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 10 Jun 2007 19:20:44 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466BBA54.3020300@v.loewis.de>
References: <73543.28835.qm@web33502.mail.mud.yahoo.com>
	<466BBA54.3020300@v.loewis.de>
Message-ID: <f52584c00706100920yd4b9529x6c51eafabb0a19c9@mail.gmail.com>

On 6/10/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > To truly enable Python in a non-English teaching
> > environment, I think you'd actually want to go a step
> > further and just internationalize the whole program.
>
> I don't know why that theory keeps popping up when people
> have repeatedly pointed out that it is just false.

It isn't contrary to the PEP either. If somebody wants to go a step
further with syntax (keywords etc), then they can provide alternative BNF
syntaxes for different languages. It wouldn't necessitate any changes to
PEP 3131. OTOH, PEP 3131 cannot be implemented at the syntax level.

> People *can* get used to the keywords of Python even if
> they have no clue what they mean. There is plenty of
> evidence for that. Likewise for the standard library.

True, but your PEP does not preclude later implementing the "step
further". For libraries the step further would mean separate wrapped
versions, as there probably isn't any other general solution. Using
gettext() or something for identifiers would easily break with
introspection, and would in any case be complicated (which is worse than
complex, which is worse than simple, and wrappers are simple :-).

BTW, I submitted the normalization patch for 2.6, if you want to look
at it.
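
For reference, PEP 3131 mandates NFKC normalization of identifiers, so
the patch roughly amounts to applying this transformation:

import unicodedata
unicodedata.normalize('NFKC', '\ufb01le')  # ligature fi: 'ﬁle' -> 'file'
unicodedata.normalize('NFKC', '\uff24')    # fullwidth 'D' -> plain 'D'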

From martin at v.loewis.de  Sun Jun 10 18:55:18 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 18:55:18 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f37hra$sik$1@sea.gmane.org>
References: <Pine.LNX.4.58.0705241759100.8399@server1.LFW.org>	<465667AE.2090000@v.loewis.de>	<20070524215742.864E.JCARLSON@uci.edu>	<46568116.202@v.loewis.de>
	<f37hra$sik$1@sea.gmane.org>
Message-ID: <466C2CF6.30508@v.loewis.de>

>> "I know what you want, and I could easily do it, but I don't feel
>> like doing it, read these ten pages of text to learn more about the
>> problem".
>>
> in one word: exit

That's indeed close, and has caused grief for this exact property.
However, the case is actually different: exit could *not* easily
do what was requested; for that to work, exit would have to be
promoted to a keyword.

Regards,
Martin

From martin at v.loewis.de  Sun Jun 10 18:59:59 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 18:59:59 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070525095511.866D.JCARLSON@uci.edu>
References: <20070525091105.8663.JCARLSON@uci.edu>	<19dd68ba0705250945j3dadcefcu8db91b3d2c055fdf@mail.gmail.com>
	<20070525095511.866D.JCARLSON@uci.edu>
Message-ID: <466C2E0F.3010402@v.loewis.de>

> It does, but it also refuses the temptation to guess that *everyone*
> wants to use unicode identifiers by default.

Please call them non-ASCII identifiers. All identifiers are Unicode,
anyway, since Python 1.0 or so. They will be represented as Unicode
strings in Python 3.

Regards,
Martin


From jimjjewett at gmail.com  Sun Jun 10 21:40:08 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 10 Jun 2007 15:40:08 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466BBDA1.7070808@v.loewis.de>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de>
Message-ID: <fb6fbf560706101240q43c2b4a9k6c0be7ba38979d25@mail.gmail.com>

On 6/10/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > Many of us value a *predictable* identifier character set.
> > Whether "predictable" means ASCII only, or user-selectable, or
> > restricted by default, I think we all agree in this sentiment:

> Indeed, PEP 3131 gives a predictable identifier character set.
> Adding per-site options to change the set of allowable characters
> makes it less predictable.

Not in practice.

Today, identifiers are drawn from [A-Za-z0-9_], which is a fairly small set.

Under the current PEP 3131 proposal, they will be drawn from a much
larger set.  There won't normally be many more letters actually used
in any given program, but there will be many more that are possible
(with very low probability).  Unfortunately, some of these are
visually identical.  (Even with modified XID, they don't get rid of
confusables; the Unicode consortium is very unwilling to rule out
anything which might theoretically be needed for valid reasons.)  Many
more are visually indistinguishable in practice, simply because the
reader hasn't seen them before.  While Unicode is still a finite set,
it is much larger than ASCII.

By allowing site modifications, the rule becomes:

It will use ASCII.

Local code can also use local characters.

There are potential exceptions for code that gets shared beyond local
groups without ASCIIfication, but this is a strict subset of the
"unreadable" code used under "anything-goes".  Distribution without
ASCIIfication is discouraged (by the extra decision required at
installation time), users have explicit notice (by accepting it at
install time), and the expanded charset is still a tiny fraction of
what PEP 3131 currently proposes (you can accept French characters
without accepting Chinese ideographs).

-jJ

From martin at v.loewis.de  Sun Jun 10 21:51:27 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 21:51:27 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <236066.59081.qm@web33506.mail.mud.yahoo.com>
References: <236066.59081.qm@web33506.mail.mud.yahoo.com>
Message-ID: <466C563F.6090305@v.loewis.de>

> I think this whole debate could be put to rest by
> agreeing to err on the side of ascii in 3.0 beta, and
> if in real world experience, that turns out to be the
> wrong decision, simply fix it in 3.0 production, 3.1,
> or 3.2.

Likewise, this whole debate could also be put to rest
by agreeing to err on the side of unrestricted support
for the PEP, and if that turns out to be the wrong
decision, simply fix any problems discovered in 3.0
production, 3.1, or 3.2.

IOW, any debate can be put to rest by agreeing.

Regards,
Martin

From martin at v.loewis.de  Sun Jun 10 21:55:28 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 21:55:28 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <19685.97380.qm@web33511.mail.mud.yahoo.com>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>
Message-ID: <466C5730.3060003@v.loewis.de>

> That describes me perfectly.  I am self-interested to
> the extent that my employers just pay me to write
> working Python code, so I want the simplicity of ASCII
> only.  

What I don't understand is why you can't simply continue
to do so, with PEP 3131 implemented?

If you have no need for accessing the NIS database,
or for TLS sockets, you just don't use them - no
need to make these features optional in the library.

Regards,
Martin


From martin at v.loewis.de  Sun Jun 10 22:04:30 2007
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 22:04:30 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0705250454420.27740@server1.LFW.org>
References: <Pine.LNX.4.58.0705241759100.8399@server1.LFW.org>
	<465667AE.2090000@v.loewis.de>
	<20070524215742.864E.JCARLSON@uci.edu> <46568116.202@v.loewis.de>
	<Pine.LNX.4.58.0705250454420.27740@server1.LFW.org>
Message-ID: <466C594E.6030106@v.loewis.de>

>> People should not have to read long system configuration pages
>> just to run the program that they intuitively wrote correctly
>> right from the start.
> 
> It is not intuitive.  One thing I learned from the discussion here
> about Unicode identifiers in other languages is that, though this
> support exists in several other languages, it is *different* in each
> of them.  And PEP 3131 is different still.  They allow different
> sets of characters, and even worse, use different normalization rules.

This is a theoretical problem only. People intuitively know what a
"word" is in their language, and now we tell them they can use
words as identifiers, as long as there are no spaces in them.

That different (programming) languages encode that intuition
in slightly different rules makes no practical difference:
the actual differences are only in boundary cases that are
unlikely to occur in real life (unless somebody deliberately
tries to come up with border cases).

Regards,
Martin

From martin at v.loewis.de  Sun Jun 10 22:09:39 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 22:09:39 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <4656EAB5.6080405@gmail.com>
References: <465615C9.4080505@v.loewis.de>	<320102.38046.qm@web33515.mail.mud.yahoo.com>	<19dd68ba0705241805y52ba93fdt284a2c696b004989@mail.gmail.com>	<Pine.LNX.4.58.0705242126440.27740@server1.LFW.org>
	<46568CEF.2030900@v.loewis.de> <4656EAB5.6080405@gmail.com>
Message-ID: <466C5A83.8090703@v.loewis.de>

Nick Coghlan wrote:
> Martin v. Löwis wrote:
>>> I think that's a pretty strong reason for making the new, more complex
>>> behaviour optional.
>>
>> Thus making it simpler????? The more complex behavior still remains,
>> to fully understand the language, you have to understand that behavior,
>> *plus* you need to understand that it may sometimes not be present.
> 
> It's simpler because any existing automated unit tests will flag
> non-ascii identifiers without modification. Not only does it prevent
> surreptitious insertion of malicious code, but existing projects don't
> have to even waste any brainpower worrying about the implications of
> Unicode identifiers (because library code typically doesn't care about
> client code's identifiers, only about the objects the library is asked
> to deal with).

I don't understand why existing projects would worry about the
feature, for reasons different from the malicious code issue.
If you don't want to waste brainpower on it, then just don't.

> A free-for-all wasn't even proposed for strings and comments in PEP 263
> - why shouldn't we be equally conservative when it comes to
> progressively enabling Unicode identifiers?

Unfortunately, I don't understand this sentence. What is a
"free-for-all", and why could it have been proposed by PEP 263,
but wasn't?

Regards,
Martin

From martin at v.loewis.de  Sun Jun 10 22:14:47 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 22:14:47 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070525091105.8663.JCARLSON@uci.edu>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de>
	<20070525091105.8663.JCARLSON@uci.edu>
Message-ID: <466C5BB7.8050909@v.loewis.de>

>> If it is the latter, I don't understand why the 95% ascii users need
>> to run additional verification and checking tools. If they don't
>> know the full language, they won't use it - why should they run
>> any checking tools?
> 
> I drop this
> package into my tree, add the necessary imports and...
> 
> ImportError: non-ascii identifier used without -U option
> 
> Huh, apparently this 3rd party package uses non-ascii identifiers.  If I
> wanted to keep my codebase ascii-only (a not unlikely case), I can
> choose to either look for a different package, look for a variant of
> this package with only ascii identifiers, or attempt to convert the
> package myself (a tool that does the unicode -> ascii transliteration
> process would make this smoother).

I cannot imagine this scenario as realistic. It is certainly realistic
that you want to keep your own code base ASCII-only - what I don't
understand is why such a policy would extend to libraries that you use.
If the interfaces of the library are non-ASCII, you will automatically
notice; if it only has some non-ASCII identifiers inside, why would
you bother?

>  * Or I copy and paste code from the Python Cookbook, a blog, etc.

You copy code from the Python Cookbook and don't notice that it
contains Chinese characters in identifiers???

Regards,
Martin

From martin at v.loewis.de  Sun Jun 10 22:16:34 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sun, 10 Jun 2007 22:16:34 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f52584c00706100920yd4b9529x6c51eafabb0a19c9@mail.gmail.com>
References: <73543.28835.qm@web33502.mail.mud.yahoo.com>	
	<466BBA54.3020300@v.loewis.de>
	<f52584c00706100920yd4b9529x6c51eafabb0a19c9@mail.gmail.com>
Message-ID: <466C5C22.4060708@v.loewis.de>

> BTW, I submitted the normalization patch for 2.6, if you want to look
> at it.

Thanks. It might take some time until I get a chance (or somebody else
may respond more quickly); the 2.6 release is still ahead of us, so there
is still plenty of time.

Regards,
Martin

From martin at v.loewis.de  Sun Jun 10 22:23:30 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 10 Jun 2007 22:23:30 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706101240q43c2b4a9k6c0be7ba38979d25@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>	
	<466BBDA1.7070808@v.loewis.de>
	<fb6fbf560706101240q43c2b4a9k6c0be7ba38979d25@mail.gmail.com>
Message-ID: <466C5DC2.6090109@v.loewis.de>

>> Indeed, PEP 3131 gives a predictable identifier character set.
>> Adding per-site options to change the set of allowable characters
>> makes it less predictable.
> 
> Not in practice.
> 
> Today, identifiers are drawn from [A-Za-z0-9], which is a fairly small set.
> 
> Under the current PEP 3131 proposal, they will be drawn from a much
> larger set.  There won't normally be many more letters actually used
> in any given program, but there will be many more that are possible
> (with very low probability).

It's true that nobody could realistically enumerate all characters that
would be allowed in identifiers. However, it is still easy in practice
to predict whether a given string is a valid identifier
*for a speaker of the language it is written in*. The rule still is
"letters, digits, and the underscore".

It's certainly possible to come up with obscure cases where people
will guess incorrectly whether they are valid syntax, but it is
always possible to deliberately obfuscate code. Except for the
malicious-user case (which apparently needs to be addressed),
I don't see a problem with the existence of obscure cases.

> By allowing site modifications, the rule becomes:
> 
> It will use ASCII.

Not universally - only on that site. I don't know what rule is
in force on my buddy's machine, so predicting it becomes harder.

> There are potential exceptions for code that gets shared beyond local
> groups without ASCII-fication, but this is a strict subset of the
> "unreadable" code used under "anything-goes".  Distribution without
> ASCIIfication is discouraged (by the extra decision required at
> installation time), users have explicit notice (by accepting it at
> install time), and the expanded charset is still a tiny fraction of
> what PEP3131 currently proposes (you can accept French characters
> withough accepting Chinese ideographs).

I just put wording in the PEP that makes it clear that, whatever
the problem, a global flag is not an acceptable solution.

Regards,
Martin

From baptiste13 at altern.org  Sun Jun 10 22:57:47 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Sun, 10 Jun 2007 22:57:47 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466BBDA1.7070808@v.loewis.de>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de>
Message-ID: <f4hoku$i57$1@sea.gmane.org>

Martin v. Löwis wrote:
>> Here is what I have to say (to everyone in this discussion, not
>> specifically to you, Stephen) in response to said labelling:
> 
> Interestingly enough, we agree on the principles, and just
> judge the PEP differently wrt. these principles
> 
>> Many of us value a *predictable* identifier character set.
>> Whether "predictable" means ASCII only, or user-selectable, or
>> restricted by default, I think we all agree in this sentiment:
> 
> Indeed, PEP 3131 gives a predictable identifier character set.
> Adding per-site options to change the set of allowable characters
> makes it less predictable.
> 
true. However, this will only matter if you distribute code with non-ASCII
identifiers to the wider public - something we agree is a bad idea, don't we?

>> We believe that we should try to make it easier, not harder, for
>> programmers to understand what Python code says.  This has many
>> benefits (reliability, readability, transparency, reviewability,
>> debuggability).  I consider these core strengths of Python.
> 
> Indeed. That was my primary motivation for the PEP: to make
> it easier for programmers to understand Python, and to allow
> people to write more transparent programs.
> 
The real question is: transparent *to whom*? Transparent to the developer
himself when he rereads his own code (which I value as a developer), or
transparent to the user of the program when he tries to fix a bug (which I
value as a user of open-source software)? Non-ASCII identifiers are
marginally better for the first case, but can be dramatically worse for the
second one. Clearly, there is a tradeoff.

>> That is what makes these strengths so important.  I hope this
>> helps you understand why these concerns can't and shouldn't be
>> brushed off as "paranoia" -- this really has to do with the
>> core values of the language.
> 
> It just seems that the concerns don't directly follow from
> the principles. Something else has to be added to make that
> conclusion. It may not be paranoia (i.e. excessive anxiety),
> but there surely is some fear, no?
> 
That argument is not really honest :-) Every risk can be estimated
optimistically or pessimistically. In both cases, there is some part of
irrationality.

> Regards,
> Martin

Cheers,
Baptiste


From santagada at gmail.com  Mon Jun 11 00:06:20 2007
From: santagada at gmail.com (Leonardo Santagada)
Date: Sun, 10 Jun 2007 19:06:20 -0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4hoku$i57$1@sea.gmane.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de> <f4hoku$i57$1@sea.gmane.org>
Message-ID: <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com>


On 10/06/2007, at 17:57, Baptiste Carvello wrote:
>> Indeed, PEP 3131 gives a predictable identifier character set.
>> Adding per-site options to change the set of allowable characters
>> makes it less predictable.
>>
> true. However, this will only matter if you distribute code with
> non-ASCII identifiers to the wider public - something we agree is a
> bad idea, don't we?

I don't. It is a bad idea to distribute non-ASCII code for libraries
that are supposed to be used by the world as a whole. But distributing
Chinese code for doing something like taxes using Chinese rules is OK
and should be encouraged (now, I don't know whether they have taxes in
China, but that is not the point). And beyond that, with per-site
restrictions, a school would have to have all of its computers and all
of the students' computers configured for the same "locale" to make
code that works on one machine work on another.

>
>>> We believe that we should try to make it easier, not harder, for
>>> programmers to understand what Python code says.  This has many
>>> benefits (reliability, readability, transparency, reviewability,
>>> debuggability).  I consider these core strengths of Python.
>>
>> Indeed. That was my primary motivation for the PEP: to make
>> it easier for programmers to understand Python, and to allow
>> people to write more transparent programs.
>>
> The real question is: transparent *to whom*? Transparent to the
> developer himself when he rereads his own code (which I value as a
> developer), or transparent to the user of the program when he tries
> to fix a bug (which I value as a user of open-source software)?
> Non-ASCII identifiers are marginally better for the first case, but
> can be dramatically worse for the second one. Clearly, there is a
> tradeoff.

No, they are not: people doing open-source work are probably going to
keep coding in English, so that is not a problem. But if that Chinese
tax system is open-sourced, people in China can easily help fix bugs,
because the identifiers are in their own language, which they can
read.

>
>>> That is what makes these strengths so important.  I hope this
>>> helps you understand why these concerns can't and shouldn't be
>>> brushed off as "paranoia" -- this really has to do with the
>>> core values of the language.
>>
>> It just seems that the concerns don't directly follow from
>> the principles. Something else has to be added to make that
>> conclusion. It may not be paranoia (i.e. excessive anxiety),
>> but there surely is some fear, no?
>>
> That argument is not really honest :-) Every risk can be estimated
> optimistically or pessimistically. In both cases, there is some part
> of irrationality.

The thing is, people are predicting a future for Python code in the
open source world: one in which devs of open source libraries and
programs will start coding in different languages if you support
Unicode identifiers. That is not common today (people use some form of
ASCIIfication of their languages), and it didn't happen with the Java,
C#, Javascript and Common Lisp communities. In light of all that, I
think this prediction is probably wrong. We are all consenting adults,
and we know that we should code in English if we want our code to be
used and to be a first-class citizen of the open source world. What do
you have to support your prediction?

--
Leonardo Santagada
"If it looks like a duck, and quacks like a duck, we have at least to  
consider the possibility that we have a small aquatic bird of the  
family anatidae on our hands." - Douglas Adams




From g.brandl at gmx.net  Mon Jun 11 00:39:51 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Mon, 11 Jun 2007 00:39:51 +0200
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
Message-ID: <f4hujl$2dq$1@sea.gmane.org>

Guido van Rossum wrote:
> PEP 3127 (Integer Literal Support and Syntax) introduces new notations
> for octal and binary integers. This isn't implemented yet. Are there
> any takers? It shouldn't be particularly complicated.

Okay, it's done.

I'll be grateful for reviews. I've also removed traces of the "L" literal
suffix where I encountered them, but may not have gotten them all.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From foom at fuhm.net  Mon Jun 11 00:50:55 2007
From: foom at fuhm.net (James Y Knight)
Date: Sun, 10 Jun 2007 18:50:55 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4hoku$i57$1@sea.gmane.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de> <f4hoku$i57$1@sea.gmane.org>
Message-ID: <FE92696F-3499-4128-BB3E-9C98D2F8838A@fuhm.net>


On Jun 10, 2007, at 4:57 PM, Baptiste Carvello wrote:
>
>> Indeed. That was my primary motivation for the PEP: to make
>> it easier for programmers to understand Python, and to allow
>> people to write more transparent programs.
> The real question is: transparent *to whom*? Transparent to the
> developer himself when he rereads his own code (which I value as a
> developer), or transparent to the user of the program when he tries
> to fix a bug (which I value as a user of open-source software)?
> Non-ASCII identifiers are marginally better for the first case, but
> can be dramatically worse for the second one. Clearly, there is a
> tradeoff.

If another developer is planning to write code in English, this whole
debate is moot. So, let's take as a given that he is going to write a
program in his own non-English language. Now, will he write in an
ASCIIfied form of his language, or using the proper character set?
Right now, the only option is the first. The PEP proposes to also
allow the second.

So, your question should be: is it easier to understand an ASCIIfied
form of another language, or the actual language itself? For me (who
doesn't speak said language, nor perhaps even knows its character
set), I'm pretty sure the answer is still going to be the second: I'd
rather a program written in Chinese use Chinese characters than a
transliteration of Chinese into ASCII, because it is actually
feasible for me to do automatic translation of Chinese into something
resembling English. And of course, that's even more true for a
language like French, which uses an alphabet quite familiar to me,
yet online translators still fail to function if the text has been
transliterated into ASCII.

James


From guido at python.org  Mon Jun 11 00:54:09 2007
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Jun 2007 15:54:09 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <f4hujl$2dq$1@sea.gmane.org>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<f4hujl$2dq$1@sea.gmane.org>
Message-ID: <ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>

Very cool; thanks!!! No problems so far.

I wonder if we need a bin() built-in that is to 0b like oct() is to 0o
and hex() to 0x?
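
A minimal sketch of what it might do for ints, just to pin down the
analogy (not a proposed implementation):

def bin(n):
    if n < 0:
        return '-' + bin(-n)
    digits = ''
    while True:
        digits = str(n & 1) + digits
        n >>= 1
        if not n:
            return '0b' + digits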

--Guido

On 6/10/07, Georg Brandl <g.brandl at gmx.net> wrote:
> Guido van Rossum wrote:
> > PEP 3127 (Integer Literal Support and Syntax) introduces new notations
> > for octal and binary integers. This isn't implemented yet. Are there
> > any takers? It shouldn't be particularly complicated.
>
> Okay, it's done.
>
> I'll be grateful for reviews. I've also removed traces of the "L" literal
> suffix where I encountered them, but may not have gotten them all.
>
> Georg
>
> --
> Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
> Four shall be the number of spaces thou shalt indent, and the number of thy
> indenting shall be four. Eight shalt thou not indent, nor either indent thou
> two, excepting that thou then proceed to four. Tabs are right out.
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From g.brandl at gmx.net  Mon Jun 11 01:02:31 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Mon, 11 Jun 2007 01:02:31 +0200
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>	<f4hujl$2dq$1@sea.gmane.org>
	<ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>
Message-ID: <f4hvu6$5gk$1@sea.gmane.org>

Guido van Rossum wrote:
> Very cool; thanks!!! No problems so far.
> 
> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o
> and hex() to 0x?

Would that also require a __bin__() special method?

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From guido at python.org  Mon Jun 11 01:15:06 2007
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Jun 2007 16:15:06 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <f4hvu6$5gk$1@sea.gmane.org>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<f4hujl$2dq$1@sea.gmane.org>
	<ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>
	<f4hvu6$5gk$1@sea.gmane.org>
Message-ID: <ca471dc20706101615o6757beeeq57f97a696fd66d68@mail.gmail.com>

On 6/10/07, Georg Brandl <g.brandl at gmx.net> wrote:
> Guido van Rossum wrote:
> > Very cool; thanks!!! No problems so far.
> >
> > I wonder if we need a bin() built-in that is to 0b like oct() is to 0o
> > and hex() to 0x?
>
> Would that also require a __bin__() special method?

If the other two use it, we might as well model it that way.
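
Sketched out, the parallel would be something like this (hypothetical,
since no __bin__ slot exists; it mirrors the __oct__/__hex__ slots that
oct() and hex() use in 2.x):

def bin(obj):
    try:
        meth = type(obj).__bin__
    except AttributeError:
        raise TypeError('%s object cannot be converted to binary'
                        % type(obj).__name__)
    return meth(obj)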

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From showell30 at yahoo.com  Mon Jun 11 01:21:33 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 10 Jun 2007 16:21:33 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <FE92696F-3499-4128-BB3E-9C98D2F8838A@fuhm.net>
Message-ID: <912322.26284.qm@web33510.mail.mud.yahoo.com>


--- James Y Knight <foom at fuhm.net> wrote:
> 
> I'm pretty sure the answer is still going to be the second: I'd
> rather a program written in Chinese use Chinese characters than a
> transliteration of Chinese into ASCII, because it is actually
> feasible for me to do automatic translation of Chinese into something
> resembling English. And of course, that's even more true for a
> language like French, which uses an alphabet quite familiar to me,
> yet online translators still fail to function if the text has been
> transliterated into ASCII.
> 

This was exactly my experience with translating the
German program Martin posted a while back.  I used
Babelfish to translate it to English, and the one word
that I didn't translate properly was a word with an
umlaut.  (It was my own error not to use the umlaut
when looking up the translation; Martin's program did
include the umlaut, and once I was clued in to the
error of my ways, I went back to Babelfish with the
umlaut and got the exact translation I was looking
for.)








       
____________________________________________________________________________________
Pinpoint customers who are looking for what you sell. 
http://searchmarketing.yahoo.com/

From showell30 at yahoo.com  Mon Jun 11 01:13:08 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 10 Jun 2007 16:13:08 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466C563F.6090305@v.loewis.de>
Message-ID: <323254.95713.qm@web33508.mail.mud.yahoo.com>


--- "Martin v. L?wis" <martin at v.loewis.de> wrote:

> > I think this whole debate could be put to rest by
> > agreeing to err on the side of ascii in 3.0 beta, and
> > if in real world experience, that turns out to be the
> > wrong decision, simply fix it in 3.0 production, 3.1,
> > or 3.2.
> 

I wrote this a while back, and at the time I wrote it, I
felt it was a pretty reasonable statement.

Having said that, after following this thread a little
more...

> Likewise, this whole debate could also be put to rest
> by agreeing to err on the side of unrestricted support
> for the PEP, and if that turns out to be the wrong
> decision, simply fix any problems discovered in 3.0
> production, 3.1, or 3.2.
> 

...I am now in favor of the PEP, with no restrictions,
even though it now goes a little further than I'd
like.

I wish the debate would turn to actual use cases.  For
example, one of the arguments behind PEP 3131 is that
it will facilitate the use of Python in educational
environments.  It would be interesting to hear from
actual teachers what their biggest impediments to
using Python are right now.  It could be that the lack
of foreign language documentation is far bigger an
impediment to using Python in a Chinese classroom than
the current restrictions on ASCII identifiers.  It
could be that the standard library involves knowing
too much English, which PEP 3131 won't really address.
 It could be that teachers simply want error messages
to be internationalized, so that students can follow
tracebacks, and identifiers aren't really an issue. 
It could be that some foreign schools actually embrace
the use of an English alphabet in Python, as it allows
for a more integrated education opportunity (students
learn an important programming language while
simultaneously mastering one of the world's most
commercially important written languages...).


From martin at v.loewis.de  Mon Jun 11 05:07:06 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 11 Jun 2007 05:07:06 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4hoku$i57$1@sea.gmane.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>	<466BBDA1.7070808@v.loewis.de>
	<f4hoku$i57$1@sea.gmane.org>
Message-ID: <466CBC5A.3050907@v.loewis.de>

>> Indeed, PEP 3131 gives a predictable identifier character set.
>> Adding per-site options to change the set of allowable characters
>> makes it less predictable.
>>
> true. However, this will only matter if you distribute code with non-ASCII
> identifiers to the wider public.

No - it will matter for any kind of distribution, not just to the "wider
public". If I move code to the next machine it may stop working, or
if I upgrade to the next Python version, assuming the default is
to restrict identifiers.

> The real question is: transparent *to whom*. Transparent to the developer
> himself when he rereads his own code (which I value as a developer), or
> transparent to the user of the program when he tries to fix a bug (which I value
> as a user of open-source software)? Non-ASCII identifiers are marginally better
> for the first case, but can be dramatically worse for the second one. Clearly,
> there is a tradeoff.

Why do you say that? Non-ASCII identifiers significantly improve the
readability of code to speakers of the natural language from which
the identifiers are drawn. With ASCII identifiers, the reader needs
to understand the English words, or recognize the transliteration.
With non-ASCII identifiers, the intended meaning of the class or
function becomes immediately apparent, in the way identifiers have
always been self-documenting for English-speaking people.

>>> That is what makes these strengths so important.  I hope this
>>> helps you understand why these concerns can't and shouldn't be
>>> brushed off as "paranoia" -- this really has to do with the
>>> core values of the language.
>> It just seems that the concerns don't directly follow from
>> the principles. Something else has to be added to make that
>> conclusion. It may not be paranoia (i.e. excessive anxiety),
>> but there surely is some fear, no?
>>
> That argument is not really honest :-) Every risk can be estimated optimistically
> or pessimistically. In both cases, there is some part of irrationality.

Still, what is the risk being estimated? Is it that somebody
maliciously tries to provide patches that use look-alike
characters? I honestly don't know what risks you see.

Regards,
Martin

From martin at v.loewis.de  Mon Jun 11 05:27:45 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 11 Jun 2007 05:27:45 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <323254.95713.qm@web33508.mail.mud.yahoo.com>
References: <323254.95713.qm@web33508.mail.mud.yahoo.com>
Message-ID: <466CC131.3030006@v.loewis.de>

> I wish the debate would turn to actual use cases.  For
> example, one of the arguments behind PEP 3131 is that
> it will facilitate the use of Python in educational
> environments.  It would be interesting to hear from
> actual teachers what their biggest impediments to
> using Python are right now.  It could be that the lack
> of foreign language documentation is far bigger an
> impediment to using Python in a Chinese classroom than
> the current restrictions on ASCII identifiers.

I don't know whether you have seen

http://groups.google.com/group/comp.lang.python/msg/ccffec1abd4dd24d

which discusses these points precisely. See also a
few follow-up messages in that part of the thread.

FWIW, I don't think that foreign-language documentation
is lacking. I don't know about Chinese, but for
German, there is plenty of documentation. I wrote
a German Python book myself 10 years ago, and other
people have since written other books. A PowerPoint
presentation discussing Python for school can be
found at

http://ada.rg16.asn-wien.ac.at/~python/Py4KidsFolien1.ppt

Gregor Lingl is the author of "Python für Kids".

> It
> could be that the standard library involves knowing
> too much English, which PEP 3131 won't really address.
>  It could be that teachers simply want error messages
> to be internationalized, so that students can follow
> tracebacks, and identifiers aren't really an issue. 
> It could be that some foreign schools actually embrace
> the use of an English alphabet in Python, as it allows
> for a more integrated education opportunity (students
> learn an important programming language while
> simultaneously mastering one of the world's most
> commercially important written languages...).

Unfortunately, teachers don't participate
in python-3000, nor do many other Python users.
So it's unlikely that you'll find a teacher posting
*here*; it was pure luck that I found a Chinese
teacher posting on comp.lang.python. You would
need to go to the places where teachers discuss
things on the internet, which likely isn't even
Usenet. Not being a (high school) teacher myself,
I don't know how to find them.

Regards,
Martin

From showell30 at yahoo.com  Mon Jun 11 06:27:29 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 10 Jun 2007 21:27:29 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466CC131.3030006@v.loewis.de>
Message-ID: <954258.3730.qm@web33506.mail.mud.yahoo.com>

--- "Martin v. L?wis" <martin at v.loewis.de> wrote:
>
> Unfortunately, teachers don't participate
> in python-3000, as don't many other Python users.
> So it's unlikely that you find a teacher posting
> *here*, it was pure luck that I found a Chinese
> teacher posting on comp.lang.python. You would
> need to go to places where teachers discuss
> in the internet, which likely isn't even Usenet.
> Not being a (high school) teacher myself, I don't
> know how to find them.
> 

In high schools? :)

Seriously, that's where you find high school teachers.

I've been in high school environments where Python is
being taught, and that's why I'm a little skeptical
that folks arguing on either side of this argument are
maybe a bit too much in the ivory tower, and not
enough dealing with actual use cases.

The Chinese teacher that you mention made some
interesting points in his posts, and I take his
advocacy for PEP 3131 very seriously, but I think he
would actually be well served using a language more
suitable for educational purposes than Python.  I have
experience with using a learning language in the
classroom, and it was very positive for students.


From ncoghlan at gmail.com  Mon Jun 11 08:04:16 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 11 Jun 2007 16:04:16 +1000
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <ca471dc20706101615o6757beeeq57f97a696fd66d68@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>	<f4hujl$2dq$1@sea.gmane.org>	<ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>	<f4hvu6$5gk$1@sea.gmane.org>
	<ca471dc20706101615o6757beeeq57f97a696fd66d68@mail.gmail.com>
Message-ID: <466CE5E0.3020106@gmail.com>

Guido van Rossum wrote:
> On 6/10/07, Georg Brandl <g.brandl at gmx.net> wrote:
>> Guido van Rossum schrieb:
>>> Very cool; thanks!!! No problems so far.
>>>
>>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o
>>> and hex() to 0x?
>> Would that also require a __bin__() special method?
> 
> If the other two use it, we might as well model it that way.
> 

I must admit I've never understood why hex() and oct() don't just go 
through __int__() (Note that the integer formats are all defined as 
going through int() in PEP 3101).

If we only want them to work for true integers, then we have __index__() 
available now.
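
Something like the following sketch, say (hex_via_index and NodeId are
invented names, purely for illustration; operator.index() is what
__index__ already gives us):

import operator

def hex_via_index(obj):
    # Accept anything that is "really" an integer per __index__,
    # and reject floats and other lossy conversions.
    return hex(operator.index(obj))

class NodeId:
    def __init__(self, n):
        self.n = n
    def __index__(self):
        return self.n

print(hex_via_index(NodeId(255)))   # 0xff
print(hex_via_index(31))            # 0x1f
# hex_via_index(2.5) raises TypeError, as desired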

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From python at zesty.ca  Mon Jun 11 08:54:26 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Mon, 11 Jun 2007 01:54:26 -0500 (CDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466C563F.6090305@v.loewis.de>
References: <236066.59081.qm@web33506.mail.mud.yahoo.com>
	<466C563F.6090305@v.loewis.de>
Message-ID: <Pine.LNX.4.58.0706110151540.7196@server1.LFW.org>

Steve Howell wrote:
> I think this whole debate could be put to rest by
> agreeing to err on the side of ascii in 3.0 beta, and
> if in real world experience, that turns out to be the
> wrong decision, simply fix it in 3.0 production, 3.1,
> or 3.2.

On Sun, 10 Jun 2007, [ISO-8859-1] "Martin v. Löwis" wrote:
> Likewise, this whole debate could also be put to rest
> by agreeing to err on the side of unrestricted support
> for the PEP, and if that turns out to be the wrong
> decision, simply fix any problems discovered in 3.0
> production, 3.1, or 3.2.

Your attempted parallel does not match: it breaks code,
whereas Steve's does not.


-- ?!ng

From python at zesty.ca  Mon Jun 11 09:20:42 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Mon, 11 Jun 2007 02:20:42 -0500 (CDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466C5730.3060003@v.loewis.de>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>
	<466C5730.3060003@v.loewis.de>
Message-ID: <Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>

On Sun, 10 Jun 2007, [ISO-8859-1] "Martin v. Löwis" wrote:
> > That describes me perfectly.  I am self-interested to
> > the extent that my employers just pay me to write
> > working Python code, so I want the simplicity of ASCII
> > only.
>
> What I don't understand is why you can't simply continue
> to do so, with PEP 3131 implemented?
>
> If you have no need for accessing the NIS database,
> or for TLS sockets, you just don't use them - no
> need to make these features optional in the library.

Because the existence of these library modules does not make it
impossible to reliably read source code.  We're talking about
changing the definition of the language here, which is deeper
than adding or removing things in the library.

Python currently provides to everyone the restriction of
identifiers to a character set that everyone knows and trusts.
Many of us want Python to continue to provide such a restriction
for those who want identifiers to be in a character set they
know and trust.  This is not incompatible with your desire to
permit alternative character sets, as long as Python offers an
option to make that choice.  We can continue to discuss the
details of how that choice is expressed, but this general idea
is a solution that would give us both what we want.

Can we agree on that?


-- ?!ng

From aleaxit at gmail.com  Mon Jun 11 11:19:44 2007
From: aleaxit at gmail.com (Alex Martelli)
Date: Mon, 11 Jun 2007 11:19:44 +0200
Subject: [Python-3000] rethinking pep 3115
In-Reply-To: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com>
References: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com>
Message-ID: <e8a0972d0706110219r1e2a7854u6884c00881cddb1f@mail.gmail.com>

On 6/10/07, tomer filiba <tomerfiliba at gmail.com> wrote:
> pep 3115 (new metaclasses) seems overly complicated imho.

It does look over-engineered to me, too.

> it fails my understanding of "keeping it simple", among other heuristics.
>
> (1)
> the trivial fix-up would be to extend the type constructor to take
> 4 arguments: (name, bases, attrs, order), where 'attrs' is a plain
> old dict, and 'order' is a list, into which the names are appended
> in the order they were defined in the body of the class. this way,
> no new types are introduced and 99% of the use cases are covered.

Agreed, but it doesn't look very elegant.
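
The idea can at least be emulated on top of today's type(), e.g.
(make_class and __order__ are invented names, just to make the
proposal concrete):

def make_class(name, bases, attrs, order):
    # Approximates the proposed 4-argument constructor: 'attrs' is a
    # plain dict, 'order' lists the attribute names in definition order.
    cls = type(name, bases, dict(attrs))
    cls.__order__ = list(order)
    return cls

Point = make_class('Point', (object,), {'x': 1, 'y': 2}, ['x', 'y'])
print(Point.__order__)   # ['x', 'y']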

>
> things like "forward referencing in the class namespace" are evil.
> and besides, it's not possible to do with functions and modules,
> so why should classes be allowed such a mischief?
>
> (2)
> the second-best solution i could think of is just passing the dict as a
> keyword argument to the class, like so:
>
> class Spam(metaclass = Bacon, dict = {}):
>     ...
>
> so you could explicitly state you need a special dict.

I like this one, with classdict being the keyword (dict is the name of
a builtin type and we shouldn't encourage the frequent but iffy
practice of 'overriding' builtin identifiers).

>
> following the cosmetic change of moving the magical __metaclass__
> attribute from the class body into the class header, it makes no
> sense to replace it with another magical method, __prepare__.
> the straight-forward-and-simple way would be to make it a keyword
> argument, just like 'metaclass'.
>
> (3)
> personally, i refrain from metaclasses. according to my experience,
> they just cause trouble, while the benefits of using them are marginal.
> the problem is especially noticeable when trying to understand
> and debug third-party code. metaclasses + bugs = black magic.
>
> moreover, they introduce inheritance issues. the class hierarchy
> becomes rigid and difficult to evolve as the need arises, which
> contradicts my perception of agile languages. i like to view programming
> as an iterative task which approaches the final objective after
> several loops. rigidness makes each loop longer, which is why
> i prefer dynamic languages to compiled ones.
>
> on the other hand, i do understand the need for metaclasses,
> even if for the sake of symmetry (as types are objects).
> but the solution proposed by pep 3115, of making metaclasses
> even more complicated and magical, seems all wrong to me.
>
> i understand it's already been accepted, but i'm hoping there's
> still time to reconsider this before 3.0 becomes final.

I agree with your observations and with your hope.


Alex

From murman at gmail.com  Mon Jun 11 15:15:20 2007
From: murman at gmail.com (Michael Urman)
Date: Mon, 11 Jun 2007 08:15:20 -0500
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0706110151540.7196@server1.LFW.org>
References: <236066.59081.qm@web33506.mail.mud.yahoo.com>
	<466C563F.6090305@v.loewis.de>
	<Pine.LNX.4.58.0706110151540.7196@server1.LFW.org>
Message-ID: <dcbbbb410706110615r69306beew3bb34654023f9787@mail.gmail.com>

On 6/11/07, Ka-Ping Yee <python at zesty.ca> wrote:
> Your attempted parallel does not match: it breaks code,
> whereas Steve's does not.

However, the same code - which would break only if we find we need to
restrict identifier characters further than the PEP already does - is
broken off the bat in Steve's scenario, because it won't run in
differently configured environments.

Michael
-- 
Michael Urman

From murman at gmail.com  Mon Jun 11 15:29:35 2007
From: murman at gmail.com (Michael Urman)
Date: Mon, 11 Jun 2007 08:29:35 -0500
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>
	<466C5730.3060003@v.loewis.de>
	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>
Message-ID: <dcbbbb410706110629x2f5f6a08j518ecd37518ad810@mail.gmail.com>

On 6/11/07, Ka-Ping Yee <python at zesty.ca> wrote:
> Because the existence of these library modules does not make it
> impossible to reliably read source code.  We're talking about
> changing the definition of the language here, which is deeper
> than adding or removing things in the library.

This has already been demonstrated to be false - you already cannot
visually inspect a printed python program and know what it will do.
There is the risk of visually aliased identifiers, but how is that
qualitatively worse than the truly conflicting identifiers you can
import with a *, or have inserted by modules mucking with
__builtins__?

> permit alternative character sets, as long as Python offers an
> option to make that choice.  We can continue to discuss the
> details of how that choice is expressed, but this general idea
> is a solution that would give us both what we want.

I can't agree with this. The predictability of needing only to
duplicate dependencies (version of python, modules) to ensure a
program that ran over there will run over here (and vice versa) is too
important to me. When end users see a NameError or SyntaxError when
they try to run a python script, they will generally assume it is the
script at fault, not their environment.

Michael
-- 
Michael Urman

From jimjjewett at gmail.com  Mon Jun 11 15:37:00 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Jun 2007 09:37:00 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466C5BB7.8050909@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>
Message-ID: <fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>

On 6/10/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> >  * Or I copy and paste code from the Python Cookbook, a blog, etc.

> You copy code from the Python Cookbook and don't notice that it
> contains Chinese characters in identifiers???

Chinese in particular you would recognize as "not what I expected".
Cyrillic you might not recognize, because it looks like ASCII letters.
 Prime (or tone) marks, you might not recognize, because they look
like ASCII quote marks.

If you're retyping, I'm not sure how much problem this would cause in
practice.  I wouldn't want to ban those letters entirely, but I would
like some indication that I should expect characters in the Cyrillic
range.

-jJ

From jimjjewett at gmail.com  Mon Jun 11 15:58:58 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Jun 2007 09:58:58 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466C5DC2.6090109@v.loewis.de>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de>
	<fb6fbf560706101240q43c2b4a9k6c0be7ba38979d25@mail.gmail.com>
	<466C5DC2.6090109@v.loewis.de>
Message-ID: <fb6fbf560706110658o2f65537fk5adc02fd34437a3c@mail.gmail.com>

On 6/10/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> >> Indeed, PEP 3131 gives a predictable identifier character set.
> >> Adding per-site options to change the set of allowable characters
> >> makes it less predictable.

> > Not in practice.
...

> > By allowing site modifications, the rule becomes:

> > It will use ASCII.

[and clipped "programs intended only for local use will use ASCII plus
letters that local users recognize."]

> Not universally - only on that site.

Yes, universally.  By allowing "any unicode character", you have
reason to believe the next piece of code isn't doing something
strange, either by accident or by malice.

By allowing "ASCII + those listed in the site config", then the rule
will change from

    "It will use ASCII, always" (today)
to
    "It will use ASCII if it is intended for distribution."
plus
    "local programs can use ASCII + locally recognized letters"

That is slightly more complicated than ASCII-only, but only for those
who want to use the extended charsets -- and either rule is still
straightforward.

The rule proposed in PEP 3131 is

    "It will use something that is numerically a letter or number, to
someone somewhere."

Given the style guide of ASCII for internationally targeted open
source, that will degrade to

    "It should use ASCII".
    "But it might not, since there will be no feedback or apparent
downside to violating the style rule, even for distributed code."
    "In fact, it might even use something downright misleading, and
you won't have any warning, because we thought that maybe someone,
somewhere, might have wanted that character in a different context."

And no, I don't think I'm exaggerating with that last one; we aren't
proposing rules against mixed script identifiers (or even limiting
script switches to occur only at the _ character).  It will be
perfectly legitimate to apparently end a string with three consecutive
prime characters.  It will be bad style, but there will be nothing to
tip off the non-paranoid.

In theory, we could solve this by limiting the non-ASCII characters,
but I don't think we can do that in practice.  The unicode consortium hasn't
even tried; even XID + security modifications + NFKC still includes
characters that are intended to look identical; all the security
modifications do is eliminate characters that do *not* have any
expected legitimate use.  (Example:  no living language uses them.)

I don't think we want to wade too deeply into the morass of
confusables detection; the unicode consortium itself says the problem
is neither solved nor stable.

It might be a good idea to restrict (within-a-single-ID) script
switches to only occur at the "_", but I'm not sure a 95% solution is
worth doing.

By saying "Only charcacters you or your sysadmin expected", we at
least limit it to things the user will be expecting and can recognize.
 (Unless the sysadmin decides otherwise.)

> I don't know what rule is
> in force on my buddy's machine, so predicting it becomes harder.

But you know ASCII will work.

If he used the same local install (classroom peer, member of the same
user group, etc), then your local characters will probably work too.

If he is really your buddy, he probably trusts you enough to allow
your charset if you tell him about it.

> I just put wording in the PEP that makes it clear that, whatever
> the problem, a global flag is not an acceptable solution.

I agree that a single flag doesn't really solve the problem.  But a
global configuration does go a long way.

For me personally, I would be more willing to allow Latin-1 than
Hangul, because I can recognize the Latin-1 characters.  (I still
wouldn't allow them all by default; the difference between the various
lower-case i's is small enough -- to me -- that I want a warning when
one is used.)  Hangul is more acceptable than Cyrillic, because at
least it is obviously foreign; I won't mistake it for something else.

Someone who uses Cyrillic on a daily basis might well have the
opposite preferences.  I support letting her use Cyrillic if she wants
to; I just don't want it to work on my machine without my knowing
about it.  But I would like to be able to accept é and ç (French
characters) without shutting off the warning for Cyrillic or Ogham.

Allowing ASCII plus "chars specified by the site or user through a
config file" meets that goal.

-jJ

From jimjjewett at gmail.com  Mon Jun 11 16:09:25 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Jun 2007 10:09:25 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de> <f4hoku$i57$1@sea.gmane.org>
	<1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com>
Message-ID: <fb6fbf560706110709o1e071dei1c674b666a9fa436@mail.gmail.com>

On 6/10/07, Leonardo Santagada <santagada at gmail.com> wrote:

> We are all consenting
> adults and we know that we should code in english if we want our code
> to be used and to be a first class citizen of the open source world.

I have no objection to Open Source being written in Chinese.

My objection is to not knowing which scripts a file is using.

Think of it like the coding directive.

Once upon a time, if you didn't have a coding directive, but used
characters outside of ASCII, the results were system-dependent.  It
didn't cause much of a problem, because most people stuck to ASCII,
and the exceptions mostly stuck to characters that were common across
codesets.  Still, it was better to be explicit.

I want an explicit notice of which scripts are being used.  I'll
settle for an explicit choice of which scripts can be used, so that I
can just exclude the ones I wasn't expecting.

This doesn't fully cover the malicious (or careless) user case, but it
gives me the tools to set my own ease-of-use tradeoffs between "it
just runs" and "it does what I think it does".

-jJ

From ncoghlan at gmail.com  Mon Jun 11 16:10:28 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 12 Jun 2007 00:10:28 +1000
Subject: [Python-3000] rethinking pep 3115
In-Reply-To: <e8a0972d0706110219r1e2a7854u6884c00881cddb1f@mail.gmail.com>
References: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com>
	<e8a0972d0706110219r1e2a7854u6884c00881cddb1f@mail.gmail.com>
Message-ID: <466D57D4.8070702@gmail.com>

Alex Martelli wrote:
>> (2)
>> the second-best solution i could think of is just passing the dict as a
>> keyword argument to the class, like so:
>>
>> class Spam(metaclass = Bacon, dict = {}):
>>     ...
>>
>> so you could explicitly state you need a special dict.
> 
> I like this one, with classdict being the keyword (dict is the name of
> a builtin type and we shouldn't encourage the frequent but iffy
> practice of 'overriding' builtin identifiers).

So instead of being able to write:

   class MyStruct(Struct):
      first = 1
      second = 2
      third = 3

everyone defining a Struct subclass has to write:

   class MyStruct(Struct, classdict=OrderedDict()):
      first = 1
      second = 2
      third = 3

Forgive my confusion, but exactly *how* is that meant to be an improvement?

The use of a special ordered dictionary should be an internal 
implementation detail of the Struct class, and PEP 3115 makes it exactly 
that. The PEP's approach means that simple cases, while possibly being 
slightly harder to write, will 'just work' when it comes time to use 
them, while more complicated cases involving multiple metaclasses should 
still be possible.

I will also note that the PEP allows someone to write their own base 
class which accepts the 'classdict' keyword argument if they so choose.
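
Roughly, the Struct author writes something like this once, and users
never see it (a sketch only of the PEP's machinery; StructMeta and
_fields are invented names, and OrderedDict stands in for whatever
ordered mapping the implementation actually uses):

from collections import OrderedDict

class StructMeta(type):
    @classmethod
    def __prepare__(mcl, name, bases, **kwds):
        # The class body executes in this mapping, so definition
        # order is recorded without the user doing anything special.
        return OrderedDict()

    def __new__(mcl, name, bases, classdict, **kwds):
        cls = type.__new__(mcl, name, bases, dict(classdict))
        cls._fields = [k for k in classdict if not k.startswith('_')]
        return cls

class Struct(metaclass=StructMeta):
    pass

class MyStruct(Struct):
    first = 1
    second = 2
    third = 3

print(MyStruct._fields)   # ['first', 'second', 'third']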

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jimjjewett at gmail.com  Mon Jun 11 16:29:16 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Jun 2007 10:29:16 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706110658o2f65537fk5adc02fd34437a3c@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de>
	<fb6fbf560706101240q43c2b4a9k6c0be7ba38979d25@mail.gmail.com>
	<466C5DC2.6090109@v.loewis.de>
	<fb6fbf560706110658o2f65537fk5adc02fd34437a3c@mail.gmail.com>
Message-ID: <fb6fbf560706110729yfa45137p85a1fff219dbf563@mail.gmail.com>

On 6/11/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> Yes, universally.  By allowing "any unicode character", you have

(oops -- apparently this posted with only half the edits)

> reason to believe the next piece of code isn't doing something
> strange, either by accident or by malice.

By allowing "any unicode character", you have NO reason to believe...

From guido at python.org  Mon Jun 11 16:42:12 2007
From: guido at python.org (Guido van Rossum)
Date: Mon, 11 Jun 2007 07:42:12 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <466CE5E0.3020106@gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<f4hujl$2dq$1@sea.gmane.org>
	<ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>
	<f4hvu6$5gk$1@sea.gmane.org>
	<ca471dc20706101615o6757beeeq57f97a696fd66d68@mail.gmail.com>
	<466CE5E0.3020106@gmail.com>
Message-ID: <ca471dc20706110742o4acc4e48kde4567b64d385ad5@mail.gmail.com>

On 6/10/07, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Guido van Rossum wrote:
> > On 6/10/07, Georg Brandl <g.brandl at gmx.net> wrote:
> >> Guido van Rossum schrieb:
> >>> Very cool; thanks!!! No problems so far.
> >>>
> >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o
> >>> and hex() to 0x?
> >> Would that also require a __bin__() special method?
> >
> > If the other two use it, we might as well model it that way.
>
> I must admit I've never understood why hex() and oct() don't just go
> through __int__() (Note that the integer formats are all defined as
> going through int() in PEP 3101).
>
> If we only want them to work for true integers, then we have __index__()
> available now.

Well, maybe it's time to kill __oct__ and __hex__ then.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com  Mon Jun 11 16:43:35 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Jun 2007 10:43:35 -0400
Subject: [Python-3000] PEP 3131: what are the risks?
Message-ID: <fb6fbf560706110743y2d174ccl660c629906ac7f02@mail.gmail.com>

On 6/10/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> Still, what is the risk being estimated? Is it that somebody
> maliciously tries to provide patches that use look-alike
> characters? I honestly don't know what risks you see.

Here are the top three that I see; note that none of these concerns
say "Don't use non-ASCII ids".  They do all say "Don't use ids from a
script the user hasn't said to expect".

(1)  Malicious user is indeed one risk.  A small probability, but a
big enough loss that I want a warning when the door is unlocked.

(2)  Typos are another risk.  Even in mono-lingual environments, it is
possible to get a wrong letter.  If you're expecting í, it is fine.
If you're not, then it shouldn't pass silently.

(3)  "Reados".  When doing maintenance later, if I wasn't expecting ?,
I may see it as a regular i, and code that way.  Now I have two
doppelganger/dóppelganger variables (or inherited methods) serving the
same purpose, but using different memory locations.

Ideally, the test cases will catch this.  In real life, even the
python stdlib has plenty of modules with poor test coverage.  I can't
expect better of random code, particularly given that it has chosen to
ignore the style-guide (and history) about sticking to ASCII for
distributed code.  (Learning to store your tests generally comes long
after picking up the basic style guidelines.)


-jJ

From tomerfiliba at gmail.com  Mon Jun 11 16:52:47 2007
From: tomerfiliba at gmail.com (tomer filiba)
Date: Mon, 11 Jun 2007 16:52:47 +0200
Subject: [Python-3000] rethinking pep 3115
In-Reply-To: <466D57D4.8070702@gmail.com>
References: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com>
	<e8a0972d0706110219r1e2a7854u6884c00881cddb1f@mail.gmail.com>
	<466D57D4.8070702@gmail.com>
Message-ID: <1d85506f0706110752x3af8f232o9a66fb11d68b7e75@mail.gmail.com>

On 6/11/07, Nick Coghlan <ncoghlan at gmail.com> wrote:
> So instead of being able to write:
>
>    class MyStruct(Struct):
>       first = 1
>       second = 2
>       third = 3
>
[...]
>
> Forgive my confusion, but exactly *how* is that meant to be an improvement?

as your example shows, the most common use-case is an ordered dict,
so as i was saying, just "upgrading" the type() constructor to accept
four arguments solves almost all of the desired use cases. imho,
"forward name binding" is an undesired side effect.

what i'm trying to say is, this pep is an *overkill*. yes, it is "more powerful"
than what i'm suggesting, but my point is we don't want to have all that
"power". it's too complex and provides only a marginal benefit.

you're just using classes as syntactic sugar for namespaces (because
python lacks other syntactic namespaces), which is useful --
but conceptually wrong. python should have introduced a separate
namespace construct, not to be confused with classes (something like
the "make pep")

the pep at hand is basically *overloading* classes into a generic
namespace device -- to which i'm saying: (a) it's wrong and (b) it's
not that frequently used to deserve complicating the interpreter
for that.


-tomer

P.S. per your "class Something(Struct)" example above, you might want
to check out how Construct solves that (see below). Construct's
declarative approach is able to express more kinds of relations between
data structures than simple structs, such as nested structs, arrays,
switches, etc.

http://construct.wikispaces.com
http://sebulbasvn.googlecode.com/svn/trunk/construct/formats/filesystem/mbr.py
http://sebulbasvn.googlecode.com/svn/trunk/construct/formats/executable/elf32.py

From g.brandl at gmx.net  Mon Jun 11 17:29:04 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Mon, 11 Jun 2007 17:29:04 +0200
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <ca471dc20706110742o4acc4e48kde4567b64d385ad5@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>	<f4hujl$2dq$1@sea.gmane.org>	<ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>	<f4hvu6$5gk$1@sea.gmane.org>	<ca471dc20706101615o6757beeeq57f97a696fd66d68@mail.gmail.com>	<466CE5E0.3020106@gmail.com>
	<ca471dc20706110742o4acc4e48kde4567b64d385ad5@mail.gmail.com>
Message-ID: <f4jpnu$ekm$1@sea.gmane.org>

Guido van Rossum schrieb:
> On 6/10/07, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> Guido van Rossum wrote:
>> > On 6/10/07, Georg Brandl <g.brandl at gmx.net> wrote:
>> >> Guido van Rossum schrieb:
>> >>> Very cool; thanks!!! No problems so far.
>> >>>
>> >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o
>> >>> and hex() to 0x?
>> >> Would that also require a __bin__() special method?
>> >
>> > If the other two use it, we might as well model it that way.
>>
>> I must admit I've never understood why hex() and oct() don't just go
>> through __int__() (Note that the integer formats are all defined as
>> going through int() in PEP 3101).
>>
>> If we only want them to work for true integers, then we have __index__()
>> available now.
> 
> Well, maybe it's time to kill __oct__ and __hex__ then.

Sounds fine to me; using __index__ to get at the number to convert would be
ideal.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From g.brandl at gmx.net  Mon Jun 11 18:12:01 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Mon, 11 Jun 2007 18:12:01 +0200
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <ca471dc20706110742o4acc4e48kde4567b64d385ad5@mail.gmail.com>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>	<f4hujl$2dq$1@sea.gmane.org>	<ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>	<f4hvu6$5gk$1@sea.gmane.org>	<ca471dc20706101615o6757beeeq57f97a696fd66d68@mail.gmail.com>	<466CE5E0.3020106@gmail.com>
	<ca471dc20706110742o4acc4e48kde4567b64d385ad5@mail.gmail.com>
Message-ID: <f4js8f$pe9$1@sea.gmane.org>

Guido van Rossum schrieb:
> On 6/10/07, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> Guido van Rossum wrote:
>> > On 6/10/07, Georg Brandl <g.brandl at gmx.net> wrote:
>> >> Guido van Rossum schrieb:
>> >>> Very cool; thanks!!! No problems so far.
>> >>>
>> >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o
>> >>> and hex() to 0x?
>> >> Would that also require a __bin__() special method?
>> >
>> > If the other two use it, we might as well model it that way.
>>
>> I must admit I've never understood why hex() and oct() don't just go
>> through __int__() (Note that the integer formats are all defined as
>> going through int() in PEP 3101).
>>
>> If we only want them to work for true integers, then we have __index__()
>> available now.
> 
> Well, maybe it's time to kill __oct__ and __hex__ then.

Okay, attached is a patch to do that.

It adds a new abstract function, PyNumber_ToBase, that converts an __index__able
integer to an arbitrary base. bin(), oct() and hex() just use it.
(I've left the slots in the PyNumberMethods struct for now.)

There was not much library code to change: only tests used the special methods.

Though /me wonders if we shouldn't just expose PyNumber_ToBase as a single
function that mirrors int(str, base).
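
In other words, at the Python level the intent is simply this (Span is
an invented type, just to illustrate an __index__able integer):

class Span:
    def __init__(self, n):
        self.n = n
    def __index__(self):
        # any type that is "really" an integer participates
        return self.n

print(bin(Span(10)))    # 0b1010
print(oct(Span(10)))    # 0o12
print(hex(Span(255)))   # 0xff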

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: no_hexoct.diff
Type: text/x-patch
Size: 10727 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070611/5d1f29c2/attachment.bin 

From rauli.ruohonen at gmail.com  Mon Jun 11 18:43:58 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Mon, 11 Jun 2007 19:43:58 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706110658o2f65537fk5adc02fd34437a3c@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de>
	<fb6fbf560706101240q43c2b4a9k6c0be7ba38979d25@mail.gmail.com>
	<466C5DC2.6090109@v.loewis.de>
	<fb6fbf560706110658o2f65537fk5adc02fd34437a3c@mail.gmail.com>
Message-ID: <f52584c00706110943u3e28e2cegce18bdfe7fdd63d6@mail.gmail.com>

On 6/11/07, Jim Jewett <jimjjewett at gmail.com> wrote:
>     "In fact, it might even use something downright misleading, and
> you won't have any warning, because we thought that maybe someone,
> somewhere, might have wanted that character in a different context."
>
> And no, I don't think I'm exagerating with that last one; we aren't
> proposing rules against mixed script identifiers (or even limiting
> script switches to occur only at the _ character).

This isn't limited to identifiers, though. You can already write "tricky"
code in 2.5, but the coding directive and the unicode/str separation
make it obvious that something funny is going on. The former will not
be necessary in 3.0 and the latter will be gone. Won't restricting
identifiers only give you a false sense of security?

Small example using strings:

authors = ['Michèlle Mischié-Vous', 'Günther Gutenberg']
clearances = ['infrared', 'red', 'orange', 'yellow', 'green',
              'blue', 'indigo', 'violet', 'ultraviolet']

class ClearanceError(ValueError):
    pass

def validate_clearance(clearance):
    if clearance not in clearances:
        raise ClearanceError(clearance)

def big_red_button451(clearance):
    validate_clearance(clearance)
    if clearance == 'infrarеd': # cyrillic e
        # Even this button has *some* standards! -- Michèlle
        raise ClearanceError(clearance)
    # Set Günther's printer on fire

def main():
    try:
        big_red_button451('infrarеd') # cyrillic e
    except ClearanceError:
        pass
    else:
        print('BRB 451 does not check clearances properly!')

if __name__ == '__main__':
    main() # run tests

From guido at python.org  Mon Jun 11 18:45:40 2007
From: guido at python.org (Guido van Rossum)
Date: Mon, 11 Jun 2007 09:45:40 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <f4js8f$pe9$1@sea.gmane.org>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<f4hujl$2dq$1@sea.gmane.org>
	<ca471dc20706101554i266bdbb4x509ef4bdb66cfa90@mail.gmail.com>
	<f4hvu6$5gk$1@sea.gmane.org>
	<ca471dc20706101615o6757beeeq57f97a696fd66d68@mail.gmail.com>
	<466CE5E0.3020106@gmail.com>
	<ca471dc20706110742o4acc4e48kde4567b64d385ad5@mail.gmail.com>
	<f4js8f$pe9$1@sea.gmane.org>
Message-ID: <ca471dc20706110945j7a61f806wd8a55929bf8447df@mail.gmail.com>

On 6/11/07, Georg Brandl <g.brandl at gmx.net> wrote:
> Guido van Rossum schrieb:
> > On 6/10/07, Nick Coghlan <ncoghlan at gmail.com> wrote:
> >> Guido van Rossum wrote:
> >> > On 6/10/07, Georg Brandl <g.brandl at gmx.net> wrote:
> >> >> Guido van Rossum schrieb:
> >> >>> Very cool; thanks!!! No problems so far.
> >> >>>
> >> >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o
> >> >>> and hex() to 0x?
> >> >> Would that also require a __bin__() special method?
> >> >
> >> > If the other two use it, we might as well model it that way.
> >>
> >> I must admit I've never understood why hex() and oct() don't just go
> >> through __int__() (Note that the integer formats are all defined as
> >> going through int() in PEP 3101).
> >>
> >> If we only want them to work for true integers, then we have __index__()
> >> available now.
> >
> > Well, maybe it's time to kill __oct__ and __hex__ then.
>
> Okay, attached is a patch to do that.
>
> It adds a new abstract function, PyNumber_ToBase, that converts an __index__able
> integer to an arbitrary base. bin(), oct() and hex() just uses it.
> (I've left the slots in the PyNumberMethods struct for now.)
>
> There was not much library code to change: only tests used the special methods.

Beautiful. Check it in please!

> Though /me wonders if we shouldn't just expose PyNumber_ToBase as a single
> function that mirrors int(str, base).

I think not. bin(), oct(), hex() mirror the literal notations, and
because of this, they can insert the 0[box] prefix. I think the
discussions about this issue have revealed that there really isn't any
use case for other bases.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From aleaxit at gmail.com  Mon Jun 11 18:51:06 2007
From: aleaxit at gmail.com (Alex Martelli)
Date: Mon, 11 Jun 2007 18:51:06 +0200
Subject: [Python-3000] rethinking pep 3115
In-Reply-To: <466D57D4.8070702@gmail.com>
References: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com>
	<e8a0972d0706110219r1e2a7854u6884c00881cddb1f@mail.gmail.com>
	<466D57D4.8070702@gmail.com>
Message-ID: <e8a0972d0706110951u48235b6ai894ef906fce6812a@mail.gmail.com>

On Jun 11, 2007, at 4:10 PM, Nick Coghlan wrote:

> Alex Martelli wrote:
>>> (2)
>>> the second-best solution i could think of is just passing the dict as a
>>> keyword argument to the class, like so:
>>>
>>> class Spam(metaclass = Bacon, dict = {}):
>>>     ...
>>>
>>> so you could explicitly state you need a special dict.
>> I like this one, with classdict being the keyword (dict is the name of
>> a builtin type and we shouldn't encourage the frequent but iffy
>> practice of 'overriding' builtin identifiers).
>
> So instead of being able to write:
>
>    class MyStruct(Struct):
>       first = 1
>       second = 2
>       third = 3
>
> everyone defining a Struct subclass has to write:
>
>    class MyStruct(Struct, classdict=OrderedDict()):
>       first = 1
>       second = 2
>       third = 3
>
> Forgive my confusion, but exactly *how* is that meant to be an improvement?

Why can't the classdict get inherited just like the metaclass can?

I'm not sure, btw, if we want the classdict to be an _instance_ of a
mapping, exactly because of inheritance -- a type or factory seems
more natural to me.  Sure, the metaclass might deal with that if
necessary, but I don't see an advantage in making it have to do so.
Thus, I'd use classdict=dict instead of classdict={}, etc.

> The use of a special ordered dictionary should be an internal
> implementation detail of the Struct class, and PEP 3115 makes it
> exactly that. The PEP's approach means that simple cases, while
> possibly being slightly harder to write, will 'just work' when it
> comes time to use them, while more complicated cases involving
> multiple metaclasses should still be possible.
>
> I will also note that the PEP allows someone to write their own base
> class which accepts the 'classdict' keyword argument if they so
> choose.

The PEP seems to allow a whole lot of things.  I'm with Tomer in
wondering whether this lot may be too much.


Alex

From guido at python.org  Mon Jun 11 19:00:02 2007
From: guido at python.org (Guido van Rossum)
Date: Mon, 11 Jun 2007 10:00:02 -0700
Subject: [Python-3000] PEP 3135 (New Super) - what to do?
Message-ID: <ca471dc20706111000l1a8fc247vda732401e53258c0@mail.gmail.com>

I'm very tempted to check in my patch even though the PEP isn't
updated (it's been renamed from PEP 367 though). Any objections?

It is python.org/sf/1727209, use the latest (topmost) super2.diff patch.

This would make the new and improved syntax super().foo(), which gets
the class and object from the current frame, as the __class__ cell and
the first argument, respectively.

Neither __class__ nor super are keywords, but the compiler spots the
use of 'super' as a free variable and makes sure the '__class__' is
available in any frame where 'super' is used.

Code is welcome to also use __class__ directly. It is set to the class
before decoration (since this is the only way that I can figure out
how to generate the code).

The old syntax super(SomeClass, self) still works; also,
super(__class__, self) is equivalent (assuming SomeClass is the
nearest lexically containing class).
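
A minimal example of the new form, assuming the patch's semantics as
described above:

class Base:
    def greet(self):
        return 'Base'

class Derived(Base):
    def greet(self):
        # zero-argument super(): the compiler supplies __class__ and
        # the first positional argument of the current frame
        return 'Derived -> ' + super().greet()

print(Derived().greet())   # prints: Derived -> Base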

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From g.brandl at gmx.net  Mon Jun 11 20:48:56 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Mon, 11 Jun 2007 20:48:56 +0200
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <f4hujl$2dq$1@sea.gmane.org>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<f4hujl$2dq$1@sea.gmane.org>
Message-ID: <f4k5el$rvr$1@sea.gmane.org>

Georg Brandl schrieb:
> Guido van Rossum schrieb:
>> PEP 3127 (Integer Literal Support and Syntax) introduces new notations
>> for octal and binary integers. This isn't implemented yet. Are there
>> any takers? It shouldn't be particularly complicated.
> 
> Okay, it's done.
> 
> I'll be grateful for reviews. I've also removed traces of the "L" literal
> suffix where I encountered them, but may not have gotten them all.

Ah, one thing in the PEP I haven't implemented is the special helpful
syntax error message if you have an old-style octal literal in your code.

If someone who wouldn't have to dig into tokenizer/parser details could
do that, I'd be grateful :)

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From guido at python.org  Mon Jun 11 20:50:39 2007
From: guido at python.org (Guido van Rossum)
Date: Mon, 11 Jun 2007 11:50:39 -0700
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: <f4k5el$rvr$1@sea.gmane.org>
References: <ca471dc20706081527vc2e6530u298598b476eaa3d0@mail.gmail.com>
	<f4hujl$2dq$1@sea.gmane.org> <f4k5el$rvr$1@sea.gmane.org>
Message-ID: <ca471dc20706111150t636baaffm170b63eaa430c2c@mail.gmail.com>

On 6/11/07, Georg Brandl <g.brandl at gmx.net> wrote:
> Georg Brandl schrieb:
> > Guido van Rossum schrieb:
> >> PEP 3127 (Integer Literal Support and Syntax) introduces new notations
> >> for octal and binary integers. This isn't implemented yet. Are there
> >> any takers? It shouldn't be particularly complicated.
> >
> > Okay, it's done.
> >
> > I'll be grateful for reviews. I've also removed traces of the "L" literal
> > suffix where I encountered them, but may not have gotten them all.
>
> Ah, one thing in the PEP I haven't implemented is the special helpful
> syntax error message if you have an old-style octal literal in your code.
>
> If someone who wouldn't have to dig into tokenizer/parser details could
> do that, I'd be grateful :)

Or you could leave this up to the 2.6 backport team. :-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From timothy.c.delaney at gmail.com  Mon Jun 11 22:51:07 2007
From: timothy.c.delaney at gmail.com (Tim Delaney)
Date: Tue, 12 Jun 2007 06:51:07 +1000
Subject: [Python-3000] PEP 3135 (New Super) - what to do?
References: <ca471dc20706111000l1a8fc247vda732401e53258c0@mail.gmail.com>
Message-ID: <00dd01c7ac6a$3e767550$0201a8c0@mshome.net>

Guido van Rossum wrote:

> I'm very tempted to check in my patch even though the PEP isn't
> updated (it's been renamed from PEP 367 though). Any objections?

Sorry - had to go visit family on the long weekend - only got back late last 
night.

My only objection is the special-casing of the 'super' name - specifically, 
that it *won't* work if super is assigned to something else, and then called 
with the no-arg version. But I'm happy to have the changes checked in, and 
look at whether we can fix that without a performance penalty later.

I'll update the PEP when I get the chance to reflect the new direction. 
Ironically, it's now gone back more towards Calvin's original approach (and 
my original self.super recipe).

So - just clarifying the semantics for the PEP:

1. super() is a shortcut for super(__class__, first_arg).

Any reason we wouldn't just emit bytecode for the above if we detect a 
no-arg call of super()? Ah - in case 'super' had been rebound. We could 
continue to make 'super' a non-rebindable name.

2. __class__ can be called directly.

__class__ will be available in any frame that uses either 'super' or 
'__class__' (including inner functions of methods).

What if the function is *not* inside a class (lexically)? Will __class__ 
exist, or will it be None?

> It is python.org/sf/1727209, use the latest (topmost) super2.diff
> patch.
> This would make the new and improved syntax super().foo(), which gets
> the class and object from the current frame, as the __class__ cell and
> the first argument, respectively.
>
> Neither __class__ nor super are keywords, but the compiler spots the
> use of 'super' as a free variable and makes sure the '__class__' is
> available in any frame where 'super' is used.
>
> Code is welcome to also use __class__ directly. It is set to the class
> before decoration (since this is the only way that I can figure out
> how to generate the code).
>
> The old syntax super(SomeClass, self) still works; also,
> super(__class__, self) is equivalent (assuming SomeClass is the
> nearest lexically containing class).

Tim Delaney 


From guido at python.org  Mon Jun 11 23:16:44 2007
From: guido at python.org (Guido van Rossum)
Date: Mon, 11 Jun 2007 14:16:44 -0700
Subject: [Python-3000] PEP 3135 (New Super) - what to do?
In-Reply-To: <00dd01c7ac6a$3e767550$0201a8c0@mshome.net>
References: <ca471dc20706111000l1a8fc247vda732401e53258c0@mail.gmail.com>
	<00dd01c7ac6a$3e767550$0201a8c0@mshome.net>
Message-ID: <ca471dc20706111416g43cbb99cqfbe790fcec92534a@mail.gmail.com>

On 6/11/07, Tim Delaney <timothy.c.delaney at gmail.com> wrote:
> Guido van Rossum wrote:
>
> > I'm very tempted to check in my patch even though the PEP isn't
> > updated (it's been renamed from PEP 367 though). Any objections?
>
> Sorry - had to go visit family on the long weekend - only got back late last
> night.
>
> My only objection is the special-casing of the 'super' name - specifically,
> that it *won't* work if super is assigned to something else, and then called
> with the no-arg version.

Well, what's the use case? I don't see that there would ever be a
reason to alias super. So the use case seems to be only semantic
purity.

> But I'm happy to have the changes checked in, and
> look at whether we can fix that without a performance penalty later.

There's the rub -- it's easy to always add a reference to __class__ to
every method, but that means that every method call slows down a tiny
bit on account of passing the __class__ cell.

Anyway, I'll check it in.

> I'll update the PEP when I get the chance to reflect the new direction.
> Ironically, it's now gone back more towards Calvin's original approach (and
> my original self.super recipe).

Working implementations talk. :-)

> So - just clarifying the semantics for the PEP:
>
> 1. super() is a shortcut for super(__class__, first_arg).

Yes.

> Any reason we wouldn't just emit bytecode for the above if we detect a
> no-arg call of super()? Ah - in case 'super' had been rebound. We could
> continue to make 'super' a non-rebindable name.

Believe me, modifying the byte code would be much harder.

I don't like the idea of non-rebindable names that aren't keywords --
there are too many syntactic loopholes (e.g. someone found a way to
bind __debug__ via an argument).

> 2. __class__ can be called directly.

But why should you? It's not there for calling, but for referencing
(e.g. isinstance). Or maybe you meant "__class__ can be *used*
directly." Yes, in that case.

> __class__ will be available in any frame that uses either 'super' or
> '__class__' (including inner functions of methods).

Yeah, but that's only relevant to code digging around in the frame
object. More usefully, the __class__ variable will be available in all
function definitions that are lexically contained inside a class.

> What if the function is *not* inside a class (lexically)? Will __class__
> exist, or will it be None?

It will be undefined (i.e. give a NameError). super can still be used,
but you must provide arguments as before.

I should note that with the current patch, while using __class__ in a
nested function works, using super() in a nested function doesn't
really work: while it gets the __class__ variable just fine, it gets
the first argument of the nested function, which is most likely
useless. Not that I can think of a use case for super() in a nested
function anyway, but it should be noted.

> > It is python.org/sf/1727209, use the latest (topmost) super2.diff
> > patch.
> > This would make the new and improved syntax super().foo(), which gets
> > the class and object from the current frame, as the __class__ cell and
> > the first argument, respectively.
> >
> > Neither __class__ nor super are keywords, but the compiler spots the
> > use of 'super' as a free variable and makes sure the '__class__' is
> > available in any frame where 'super' is used.
> >
> > Code is welcome to also use __class__ directly. It is set to the class
> > before decoration (since this is the only way that I can figure out
> > how to generate the code).
> >
> > The old syntax super(SomeClass, self) still works; also,
> > super(__class__, self) is equivalent (assuming SomeClass is the
> > nearest lexically containing class).
>
> Tim Delaney

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Mon Jun 11 23:20:17 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Mon, 11 Jun 2007 23:20:17 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0706110151540.7196@server1.LFW.org>
References: <236066.59081.qm@web33506.mail.mud.yahoo.com>
	<466C563F.6090305@v.loewis.de>
	<Pine.LNX.4.58.0706110151540.7196@server1.LFW.org>
Message-ID: <466DBC91.7050101@v.loewis.de>

Ka-Ping Yee wrote:
> Steve Howell wrote:
>> I think this whole debate could be put to rest by
>> agreeing to err on the side of ascii in 3.0 beta, and
>> if in real world experience, that turns out to be the
>> wrong decision, simply fix it in 3.0 production, 3.1,
>> or 3.2.
> 
> On Sun, 10 Jun 2007, "Martin v. Löwis" wrote:
>> Likewise, this whole debate could also be put to rest
>> by agreeing to err on the side of unrestricted support
>> for the PEP, and if that turns out to be the wrong
>> decision, simply fix any problems discovered in 3.0
>> production, 3.1, or 3.2.
> 
> Your attempted parallel does not match: it breaks code,
> whereas Steve's does not.

PEP 3131 does not break any code.

Regards,
Martin


From martin at v.loewis.de  Mon Jun 11 23:23:32 2007
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 11 Jun 2007 23:23:32 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>
	<466C5730.3060003@v.loewis.de>
	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>
Message-ID: <466DBD54.4030006@v.loewis.de>

> Python currently provides to everyone the restriction of
> identifiers to a character set that everyone knows and trusts.
> Many of us want Python to continue to provide such restriction
> for those who want identifiers to be in a character set they
> know and trust.  This is not incompatible with your desire to
> permit alternative character sets, as long as Python offers an
> option to make that choice.  We can continue to discuss the
> details of how that choice is expressed, but this general idea
> is a solution that would give us both what we want.
> 
> Can we agree on that?

So far, all proposals I have seen *are* incompatible, or had
some other flaws, so I'm not certain that this general idea
provides a non-empty solution set.

Regards,
Martin

From martin at v.loewis.de  Mon Jun 11 23:26:40 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 11 Jun 2007 23:26:40 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
References: <20070524234516.8654.JCARLSON@uci.edu>	
	<4656920F.9040001@v.loewis.de>
	<20070525091105.8663.JCARLSON@uci.edu>	
	<466C5BB7.8050909@v.loewis.de>
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
Message-ID: <466DBE10.1070804@v.loewis.de>

> Chinese in particular you would recognize as "not what I expected".
> Cyrillic you might not recognize, because it looks like ASCII letters.

Please take a look at

http://ru.wikipedia.org/wiki/Python

In what way does that look like ASCII letters? Cyrillic is
*significantly* different from Latin.

> Prime (or tone) marks, you might not recognize, because they look
> like ASCII quote marks.

Not to me. Quote marks are before and after letters; tone marks
are above letters.

Regards,
Martin

From martin at v.loewis.de  Mon Jun 11 23:42:36 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 11 Jun 2007 23:42:36 +0200
Subject: [Python-3000] PEP 3131: what are the risks?
In-Reply-To: <fb6fbf560706110743y2d174ccl660c629906ac7f02@mail.gmail.com>
References: <fb6fbf560706110743y2d174ccl660c629906ac7f02@mail.gmail.com>
Message-ID: <466DC1CC.2020208@v.loewis.de>

> Here are the top three that I see; note that none of these concerns
> say "Don't use non-ASCII ids".  They do all say "Don't use ids from a
> script the user hasn't said to expect".
> 
> (1)  Malicious user is indeed one risk.  A small probability, but a
> big enough loss that I want a warning when the door is unlocked.
> 
> (2)  Typos are another risk.  Even in mono-lingual environments, it is
> possible to get a wrong letter.  If you're expecting ì, it is fine.
> If you're not, then it shouldn't pass silently.
> 
> (3)  "Reados".  When doing maintenance later, if I wasn't expecting ì,
> I may see it as a regular i, and code that way.  Now I have two
> doppelganger/döppelganger variables (or inherited methods) serving the
> same purpose, but using different memory locations.

I can see 1 as a risk, and I agree it has a small probability (because
the risk for the submitter of being discovered is much higher).

I can't see issues 2 or 3 as a risk. It *never* happened to me that
I mistakenly typed ì, as this just isn't on my keyboard. If it was
on my keyboard, I would be using a natural language that actually
uses that character, and then my eye would be trained to easily
recognize the typo. Likewise for 3: I could *never* confuse these
two words, and would always recognize both of them as typos
for doppelgänger (which is where the umlauts really belong).

To elaborate on the ì issue: there is a mode for German keyboards
where the accent characters are "dead", i.e. you type an accent
character, then the regular character. I usually turn that mode off,
but even if it was on, I would not *mistakenly* type ` first,
then the i. If I type ` on a keyboard with dead keys, I *always* get
puzzled about no character appearing, and then if the next vowel
eats the character, I immediately recognize - I meant to type a
backquote, but got none. If the backquote is part of the syntax,
the vowel "eating" it actually makes the entire text a syntax
error.

Regards,
Martin

From jimjjewett at gmail.com  Mon Jun 11 23:55:58 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Jun 2007 17:55:58 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466DBE10.1070804@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
	<466DBE10.1070804@v.loewis.de>
Message-ID: <fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>

On 6/11/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > Chinese in particular you would recognize as "not what I expected".
> > Cyrillic you might not recognize, because it looks like ASCII letters.

> Please take a look at http://ru.wikipedia.org/wiki/Python

> In what way does that look like ASCII letters? Cyrillic is
> *significantly* different from Latin.

In long stretches of long words, yes.  In isolated abbreviations, not
so much.  From the second key-value pair in the top box, the value is

    интерпретатор

I can tell that isn't English, but I have to slow down a bit before I
recognize that it isn't ASCII.  (The "N" is backwards, and the
"n"-looking thing between the "p"s isn't quite an n.)

    ???????????

I wouldn't recognize at all (except that for the next few weeks, I
might know to check).

One reason this matters -- even when the original author had good
intentions -- is that I edit my code as text, rather than graphics.  I
will often retype rather than cutting and pasting.  Since тор and нтер
are not the same as the visually similar Top and HTep, that will
eventually cause problems.
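
(To make that concrete -- an illustrative sketch, assuming an
identifier-liberal Python of the kind PEP 3131 proposes:

    Top = 1      # Latin T, o, p
    Тор = 2      # Cyrillic U+0422, U+043E, U+0440
    print(Top, Тор)    # two distinct names that render almost alike

Both can coexist silently in the same module.)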

If I can say "I accept Latin-1, but not Cyrillic", then I won't have
this problem; at the very least, I will be forewarned.

-jJ

From jimjjewett at gmail.com  Mon Jun 11 23:59:12 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Jun 2007 17:59:12 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466DBC91.7050101@v.loewis.de>
References: <236066.59081.qm@web33506.mail.mud.yahoo.com>
	<466C563F.6090305@v.loewis.de>
	<Pine.LNX.4.58.0706110151540.7196@server1.LFW.org>
	<466DBC91.7050101@v.loewis.de>
Message-ID: <fb6fbf560706111459o6e453daao91acbd567fadad28@mail.gmail.com>

On 6/11/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> Ka-Ping Yee schrieb:
> > Steve Howell wrote:
> >> I think this whole debate could be put to rest by
> >> agreeing to err on the side of ascii in 3.0 beta, and
> >> if in real world experience, that turns out to be the
> >> wrong decision, simply fix it in 3.0 production, 3.1,
> >> or 3.2.

> > On Sun, 10 Jun 2007, "Martin v. Löwis" wrote:
> >> Likewise, this whole debate could also be put to rest
> >> by agreeing to err on the side of unrestricted support
> >> for the PEP, and if that turns out to be the wrong
> >> decision, simply fix any problems discovered in 3.0
> >> production, 3.1, or 3.2.

> > Your attempted parallel does not match: it breaks code,
> > whereas Steve's does not.

> PEP 3131 does not break any code.

Going with the widest possible set of source characters (as PEP 3131
does) and restricting them later (in 3.1) would break code.

Going with a smaller set of possible source characters and expanding
them later would not break code.

Going with ASCII by default plus locally approved charset-extensions
would not break code unless the new restrictions overrode local
decisions.

-jJ

From martin at v.loewis.de  Tue Jun 12 00:13:34 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 12 Jun 2007 00:13:34 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>
References: <20070524234516.8654.JCARLSON@uci.edu>	
	<4656920F.9040001@v.loewis.de>
	<20070525091105.8663.JCARLSON@uci.edu>	
	<466C5BB7.8050909@v.loewis.de>	
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>	
	<466DBE10.1070804@v.loewis.de>
	<fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>
Message-ID: <466DC90E.1070009@v.loewis.de>

> One reason this matters -- even when the original author had good
> intentions -- is that I edit my code as text, rather than graphics.  I
> > will often retype rather than cutting and pasting.  Since тор and нтер
> are not the same as the visually similar Top and HTep, that will
> eventually cause problems.

It's actually unlikely that you encounter "тор" or "нтер" - they
don't mean anything in Russian (FWIW, интерпретатор means interpreter;
so "тор" is akin to "tor" and "нтер" akin to "nter").

I cannot believe that you would actually consider retyping code
that contains Cyrillic characters (you wouldn't understand what it does,
would you?), and even if you did - how would an ASCII-only flag
on the interpreter help? If you type Top and HTep (again, please
look me in the eye and tell me that you would *actually* type in
these identifiers), the error in the interpreter won't trigger.

Regards,
Martin

From bjourne at gmail.com  Tue Jun 12 00:50:22 2007
From: bjourne at gmail.com (=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=)
Date: Tue, 12 Jun 2007 00:50:22 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466DBD54.4030006@v.loewis.de>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>
	<466C5730.3060003@v.loewis.de>
	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>
	<466DBD54.4030006@v.loewis.de>
Message-ID: <740c3aec0706111550s38e32d1dsf307f4e2c16c71e4@mail.gmail.com>

On 6/11/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > Python currently provides to everyone the restriction of
> > identifiers to a character set that everyone knows and trusts.
> > Many of us want Python to continue to provide such restriction
> > for those who want identifiers to be in a character set they
> > know and trust.  This is not incompatible with your desire to
> > permit alternative character sets, as long as Python offers an
> > option to make that choice.  We can continue to discuss the
> > details of how that choice is expressed, but this general idea
> > is a solution that would give us both what we want.
> >
> > Can we agree on that?
>
> So far, all proposals I have seen *are* incompatible, or had
> some other flaws, so I'm not certain that this general idea
> provides a non-empty solution set.

python -ascii-only


-- 
mvh Björn

From jimjjewett at gmail.com  Tue Jun 12 01:13:14 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Jun 2007 19:13:14 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466DC90E.1070009@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
	<466DBE10.1070804@v.loewis.de>
	<fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>
	<466DC90E.1070009@v.loewis.de>
Message-ID: <fb6fbf560706111613o1f82354bx69ab9b1d9acb0169@mail.gmail.com>

On 6/11/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > One reason this matters -- even when the original author had good
> > intentions -- is that I edit my code as text, rather than graphics.  I
> > will often retype rather than cutting and pasting.  Since тор and нтер
> > are not the same as the visually similar Top and HTep, that will
> > eventually cause problems.

> It's actually unlikely that you encounter "тор" or "нтер" - they
> don't mean anything in Russian (FWIW, интерпретатор means interpreter;
> so "тор" is akin to "tor" and "нтер" akin to "nter").

> I cannot believe that you would actually consider retyping code
> that contains Cyrillic characters

Not if I realized they were Cyrillic -- and that is exactly my point.

By allowing any unicode letters, we would allow Cyrillic, and I might
open a file that uses Cyrillic without realizing it.

By allowing ASCII + locally approved charsets, I either won't have
Cyrillic identifiers, or I will have turned them on explicitly, and
will know to look out for them.

> would you?), and even if you did - how would an ASCII-only flag
> on the interpreter help?

With ASCII-only, I would have gotten an error when I loaded the
original module in the first place, so I would know that I'm dealing
with Cyrillic (or at least with non-ASCII).

> If you type Top and HTep (again, please
> look me in the eye and tell me that you would *actually* type in
> these identifiers), the error in the interpreter won't trigger.

To repeat:  Yes, if I thought those were the variable names, I would
type them -- and I've seen dumber variable names than those.

Of course, I wouldn't type them if I knew they were wrong.  With an
ASCII-only install, I would get that error-check because the
(remaining original uses) were in Cyrillic.  With an "any unicode
character" install, ... well, I might figure out my problem the next
morning.

-jJ

From pje at telecommunity.com  Tue Jun 12 01:18:40 2007
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon, 11 Jun 2007 19:18:40 -0400
Subject: [Python-3000] Pre-PEP on fast imports
In-Reply-To: <466DD0D8.7040407@develer.com>
References: <466DD0D8.7040407@develer.com>
Message-ID: <20070611231640.8A55E3A407F@sparrow.telecommunity.com>

At 12:46 AM 6/12/2007 +0200, Giovanni Bajo wrote:
>Hi Philip,
>
>I'm going to submit a PEP for Python 3000 (and possibly backported 
>as an option off by default in Python 2). It's related to imports 
>and how to make them faster. Given your expertise on the subject, 
>I'd appreciate it if you could review my ideas. I briefly spoke of it 
>with Alex Martelli a few days ago at PyCon Italia and he was not 
>negative about it.
>
>Problems:
>
>- A single import causes many syscalls (.pyo, .pyc, .py, in both 
>directory and .zip file).
>- Situation is getting worse and worse with the advent of 
>easy_install which produces many .pth files (longer sys.path).
>- Python startup time is slow, and a noticeable fraction of it is 
>dominated by site.py-related stuff (a simple hello world takes 
>0.012s if run without -S, and 0.008s if run with -S).
>- Many people might not be interested in this, but others are really 
>concerned. Eg: again at PyCon Italia, I spoke with one of the 
>leading Sugar programmers (OLPC) who told me that one of the biggest 
>blockers right now is the Python startup time (applications on the latest 
>OLPC prototype take 3-4 seconds to start up). He suggested that this 
>was related to the large number of syscalls made for imports.
>
>
>Proposed solution:
>
>- A site cache is introduced. It's a dictionary mapping module names 
>to absolute file paths.
>- When an import occurs, for each directory/zipfile we walk in 
>sys.path, we read all directory entries, and update the site cache 
>with all the Python modules found in it (all the Python modules 
>found in the directory/zipfile).
>- If the filepath for a certain module is found in the site cache, 
>the module is directly accessed. Otherwise, sys.path is walked.
>- The site cache can be cleared with sys.clear_site_cache(). This 
>must be used after manual editing of sys.path (or could be done 
>automatically by making sys.path a list subclass which notices each 
>modification).
>- The site cache must be manually cleared if a Python file is added 
>to a directory in sys.path after the application has started. This 
>is a rare-enough scenario to require an additional explicit call.
>- If for whatever reason a filepath found in the site cache cannot 
>be accessed (unmounted device, whatever) ImportError is raised. 
>Again, this is something which is very rare and does not require 
>much attention.

Here's a simpler solution, one that's easily testable using existing 
Python versions.  Create a subclass of pkgutil.ImpImporter 
(Python >=2.5) that caches a listdir of its contents, and uses it to 
immediately reject any find_module() requests for which matching data 
is not in its cached listdir.  Add this class to sys.path_hooks, and 
see if it speeds things up.

If it doesn't produce an improvement, your more-ambitious version of 
the idea won't work.  If it does produce an improvement, it's likely 
to be much simpler to implement at the C level than your idea 
is.  Meanwhile, it doesn't tear up the import machinery with a new 
special-purpose mechanism; it simply leverages the existing hooks.

The subclass might look something like this:

     import imp, os, sys
     from pkgutil import ImpImporter

     suffixes = set(ext for ext,mode,typ in imp.get_suffixes())

     class CachedImporter(ImpImporter):
         def __init__(self, path):
             if not os.path.isdir(path):
                 raise ImportError("Not an existing directory")
             super(CachedImporter, self).__init__(path)
             self.refresh()

         def refresh(self):
             # Cache the base names of everything importable in this
             # directory, so find_module() can reject misses without
             # touching the filesystem again.
             self.cache = set()
             for fname in os.listdir(self.path):
                 base, ext = os.path.splitext(fname)
                 if ext in suffixes and '.' not in base:
                     self.cache.add(base)

         def find_module(self, fullname, path=None):
             if fullname.split(".")[-1] not in self.cache:
                 return None  # no need to check further
             return super(CachedImporter, self).find_module(fullname, path)

     sys.path_hooks.append(CachedImporter)

Stick this at the top of your site.py and see what happens.  I'll be 
interested to hear the results.  (Notice, by the way, that with this 
implementation one can easily clear the entire cache by clearing 
sys.path_importer_cache, or deleting the entry for a specific path, 
as well as by taking the entry for that path and calling its refresh() method.)
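
For example (paths are illustrative; assumes the CachedImporter class 
above):

     import sys

     # Force a re-scan of one directory on its next import:
     sys.path_importer_cache.pop('/some/dir/on/sys-path', None)

     # Or refresh a cached directory listing in place:
     importer = sys.path_importer_cache.get('/another/dir')
     if isinstance(importer, CachedImporter):
         importer.refresh()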


From baptiste13 at altern.org  Tue Jun 12 01:48:47 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 01:48:47 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>	<466BBDA1.7070808@v.loewis.de>
	<f4hoku$i57$1@sea.gmane.org>
	<1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com>
Message-ID: <f4kn1r$q2i$1@sea.gmane.org>

Leonardo Santagada wrote:
> I don't. It is a bad idea to distribute non-ASCII code for libraries
> that are supposed to be used by the world as a whole. But
> distributing Chinese code for doing something like taxes using
> Chinese rules is ok and should be encouraged (now, I don't know if they
> have taxes in China, but that is not the point).
> 
I wouldn't be so sure. In open source, you never know in advance to whom your
code may be useful. Maybe some part of your Chinese tax software can be
refactored into a more generic library. If you write the software with non-ASCII
identifiers, this refactored library won't be usable by non-Chinese speakers. A
good opportunity will be missed, but *you won't even know*.

> No they are not; people doing open source work are probably going to
> still be coding in English, so that is not a problem. But if that Chinese
> tax system is open sourced, people in China can easily help
> fix bugs because identifiers are in their own language, which they
> can identify.
> 
good point, but I'm not sure it is so much more difficult to identify
identifiers, given that you already need to know ASCII characters in order to
identify the keywords. Sure, you won't understand what the identifiers mean, but
you'll probably be able to tell them from one another.

> The thing is, people are predicting a future for python code on the  
> open source world. One in which devs of open source libraries and  
> programs will start coding in different languages if you support  
> unicode identifiers, something that is not common today (using some  
> form of ASCIIfication of their languages) and didn't happen with the  
> Java, C#, Javascript and Common Lisp communities. In light of all  
> that I think this prediction is probably wrong. 
>
Well that's only true for the open source libraries and programs *that we know
of*. Maybe there is useful software that we don't know of, precisely because it
is not "marketed" to a global audience. That's what I call lost opportunities.

> We are all consenting  
> adults and we know that we should code in english if we want our code  
> to be used and to be a first class citizen of the open source world.  
> What do you have to support your prediction?
> 
I have experience in another community, namely the community of physicists.
Here, most people don't know in advance how you're supposed to write open source
code. They learn in the doing. And if someone starts coding with non-ASCII
identifiers, he won't have time to recode his program later. So he will simply
not publih it. Lost opportunity again.

Cheers,
BC


From baptiste13 at altern.org  Tue Jun 12 02:00:44 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 02:00:44 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <FE92696F-3499-4128-BB3E-9C98D2F8838A@fuhm.net>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>	<466BBDA1.7070808@v.loewis.de>
	<f4hoku$i57$1@sea.gmane.org>
	<FE92696F-3499-4128-BB3E-9C98D2F8838A@fuhm.net>
Message-ID: <f4kno7$rr5$1@sea.gmane.org>

James Y Knight wrote:
> If another developer is planning to write code in English, this whole  
> debate is moot. So, let's take as a given that he is going to write a  
> program in his own non-English language. Now, will he write in an
> asciified form of his language, or using the proper character set?
> Right now, the only option is the first. The PEP proposes to also  
> allow the second.
> 
that's a very nice summary of the situation.

> So, your question should be: is it easier to understand an ASCIIified  
> form of another language, or the actual language itself? For me (who  
> doesn't speak said language, nor perhaps even know its character  
> set), I'm pretty sure the answer is still going to be the second: I'd  
> rather a program written in Chinese use Chinese characters, rather  
> than a transliteration of Chinese into ASCII. 
>
This is where we strongly disagree. If an identifier is written in
transliterated Chinese, I cannot understand what it means, but I can recognise
it when it is used in the code. I will then find out the meaning from the
context. By contrast, with Chinese identifiers, I will not recognise them from
one another. So I won't be able to make any sense from the code without going
through the complex task of translating everything.

> because it is actually  
> feasible for me to do automatic translation of Chinese into something  
> resembling English. And of course, that's even more true when talking  
> about a language like French, which uses an alphabet quite familiar  
> to me, but yet online translators still fail to function if it's been  
> transliterated into ASCII.
> 
Dream on! Automatic translation won't work. For example, if you actually try
feeding Python code to a French-to-English translator, you might be surprised by
what happens to the keyword "if" (just try it :-). You would have to translate
the identifiers one by one, which is not practical.

Cheers,
BC


From baptiste13 at altern.org  Tue Jun 12 02:22:04 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 02:22:04 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466CBC5A.3050907@v.loewis.de>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>	<466BBDA1.7070808@v.loewis.de>	<f4hoku$i57$1@sea.gmane.org>
	<466CBC5A.3050907@v.loewis.de>
Message-ID: <f4kp08$uno$1@sea.gmane.org>

Martin v. Löwis wrote:
>>> Indeed, PEP 3131 gives a predictable identifier character set.
>>> Adding per-site options to change the set of allowable characters
>>> makes it less predictable.
>>>
>> true. However, this will only matter if you distribute code with non-ASCII
>> identifiers to the wider public.
> 
> No - it will matter for any kind of distribution, not just to the "wider
> public". If I move code to the next machine it may stop working, 
>
if that machine is controlled by you (or your sysadmin), you should be able to
reconfigure Python the way you like. However, I have to agree that this is
suboptimal.

> or if I upgrade to the next Python version, assuming the default is
> to restrict identifiers.
> 
That would only happen if the default changes to a more strict rule. If we start
with ASCII only, this is unlikely to ever happen!

>> The real question is: transparent *to whom*. Transparent to the developer
>> himself when he rereads his own code (which I value as a developer), or
>> transparent to the user of the program when he tries to fix a bug (which I value
>> as a user of open-source software)? Non-ASCII identifiers are marginally better
>> for the first case, but can be dramatically worse for the second one. Clearly,
>> there is a tradeoff.
> 
> Why do you say that? Non-ASCII identifiers significantly improve the
> readability of code to speakers of the natural language from which
> the identifiers are drawn. With ASCII identifiers, the reader needs
> to understand the English words, or recognize the transliteration.
> With non-ASCII identifiers, the intended meaning of the class or
> function becomes immediately apparent, in the way identifiers have
> always been self-documentation for English-speaking people.
> 
my problem is then: what happens if the reader does not speak the same language
as the author of the code? Right now, if I come across python code written in a
language I don't speak, I can still try to make sense of it. Sure, I may have to
do without the comments, sure, I may not understand what the identifier names
mean. But I can still follow the instructions flow and try to figure out what
happens. With non-ASCII identifiers, I cannot do that because I cannot recognise
the identifiers from one another.

>>>> That is what makes these strengths so important.  I hope this
>>>> helps you understand why these concerns can't and shouldn't be
>>>> brushed off as "paranoia" -- this really has to do with the
>>>> core values of the language.
>>> It just seems that the concerns don't directly follow from
>>> the principles. Something else has to be added to make that
>>> conclusion. It may not be paranoia (i.e. excessive anxiety),
>>> but there surely is some fear, no?
>>>
>> That argument is not really honest :-) Every risk can be estimated optimistically
>> or pessimistically. In both cases, there is some part of irrationality.
> 
> Still, what is the risk being estimated? Is it that somebody
> maliciously tries to provide patches that use look-alike
> characters? I honestly don't know what risks you see.
> 
Well, I have not followed the discussion about security risks very closely.
However, I see a much simpler risk: the risk that I come across code that
is technically open source, but that I can't even debug in case of need because
I cannot make sense of it. This would reduce the usefulness of such code, and
cause fragmentation for the community.

Cheers,
BC


From gproux+py3000 at gmail.com  Tue Jun 12 02:34:26 2007
From: gproux+py3000 at gmail.com (Guillaume Proux)
Date: Tue, 12 Jun 2007 09:34:26 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4kno7$rr5$1@sea.gmane.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de> <f4hoku$i57$1@sea.gmane.org>
	<FE92696F-3499-4128-BB3E-9C98D2F8838A@fuhm.net>
	<f4kno7$rr5$1@sea.gmane.org>
Message-ID: <19dd68ba0706111734t1f0f12d6y6c6584e41bf6e211@mail.gmail.com>

Hello,

On 6/12/07, Baptiste Carvello <baptiste13 at altern.org> wrote:
> context. By contrast, with Chinese identifiers, I will not recognise them from
> one another. So I won't be able to make any sense from the code without going
> through the complex task of translating everything.

You would be surprised how well you can do if you actually try
to recognize a set of Chinese characters, especially if you use
some tool to put a meaning on them. I never formally learned any
Chinese (nor any Japanese, actually), but I can now effortlessly parse
both languages.

But really, if you ever find any code with Chinese written all over it
that you believe might be very useful to you, you would have one
of the following choices:
(a) use a tokenizer and some tool to do a hanzi -> ASCII automatic
transliteration/translation
(b)  try to wrap the Chinese things with an ASCII veil (which would
make you work on your Chinese a bit)  or you could ask your Chinese
girlfriend to help you (WHAT you don't have a Chinese girlfriend yet?
:))
(c) actually contact the person who submitted the code to let him know
you are very much interested in the code....

In most cases, this would give you the chance to reach out to
different communities and to work together with people with whom you
might never have talked. From what we can see on English-only
mailing lists, these are the kind of Python users we don't normally
have access to currently, because they are simply secluded in their own
little universe, in the comfortable realm of their own linguistic
barriers.

Of course, sometimes they step out and offer a plea for help on an
English ML in broken English...
PEP 3131 is unlikely to change this. However, I can see it might have
two ethnically interesting consequences:
1) Python should find more uses in communities where ASCII has little
place, because people will become empowered with Python and able to
express themselves like never before: my bet is that, for example, the
Japanese Python community will become stronger and welcome new people,
younger and older, who do not know much English.
2) If ever a program written with non-ASCII characters finds some good
usage in ASCII-only communities, then the usual plea for help will be
reversed. People will seek out e.g. Japanese programmers and request
help, maybe in broken Japanese. From this point on, all programming
communities will be on an equal footing and able to talk together from
the same standpoint. I guess you know "Liberté Égalité Fraternité".
Maybe this should be the PEP subtitle.

> what happens to the keyword "if" (just try it :-). You would have to translate
> the identifiers one by one, which is not practical.

That would be possible with the tokenizer, actually :)

Droit comme un if !

À bientôt,

Guillaume

From baptiste13 at altern.org  Tue Jun 12 02:34:54 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 02:34:54 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <dcbbbb410706110629x2f5f6a08j518ecd37518ad810@mail.gmail.com>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>	<466C5730.3060003@v.loewis.de>	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>
	<dcbbbb410706110629x2f5f6a08j518ecd37518ad810@mail.gmail.com>
Message-ID: <f4kpoa$pv$1@sea.gmane.org>

Michael Urman wrote:
> On 6/11/07, Ka-Ping Yee <python at zesty.ca> wrote:
>> Because the existence of these library modules does not make it
>> impossible to reliably read source code.  We're talking about
>> changing the definition of the language here, which is deeper
>> than adding or removing things in the library.
> 
> This has already been demonstrated to be false - you already cannot
> visually inspect a printed python program and know what it will do.
> There is the risk of visually aliased identifiers, but how is that
> qualitatively worse than the truly conflicting identifiers you can
> import with a *, or have inserted by modules mucking with
> __builtins__?
> 
Oh come on! Imports are usually located at the top of the file, so they won't
clobber other names. And mucking with __builtins__ is rare and frowned upon. On
the contrary, non-ASCII identifiers will be encouraged, anywhere in the code.
The amount of information you get from today's Python code is most of the time
sufficient for debugging, or for using it as an inspiration. With non-ASCII
identifiers, these features will be lost to all users who cannot read the needed
characters. Denying the problem is not a good way to answer other people's concerns.

Cheers,
BC


From baptiste13 at altern.org  Tue Jun 12 02:38:06 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 02:38:06 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466C5BB7.8050909@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu>	<4656920F.9040001@v.loewis.de>	<20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>
Message-ID: <f4kpu7$pv$2@sea.gmane.org>

Martin v. Löwis wrote:
> I cannot imagine this scenario as realistic. It is certainly realistic
> that you want to keep your own code base ASCII-only - what I don't
> understand is why such a policy would extend to libraries that you use.
> If the interfaces of the library are non-ASCII, you will automatically
> notice; if it only has some non-ASCII identifiers inside, why would
> you bother?
> 
well, for the same reason I prefer to use open source software: because I can
debug it in case of need, and because I can use it as an inspiration if I need
to write a similar program.

Baptiste


From showell30 at yahoo.com  Tue Jun 12 02:45:19 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 11 Jun 2007 17:45:19 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4kn1r$q2i$1@sea.gmane.org>
Message-ID: <511504.28584.qm@web33502.mail.mud.yahoo.com>


--- Baptiste Carvello <baptiste13 at altern.org> wrote:

> Leonardo Santagada wrote:
> > I don't. It is a bad idea to distribute non-ASCII code for
> > libraries that are supposed to be used by the world as a whole.
> > But distributing Chinese code for doing something like taxes
> > using Chinese rules is ok and should be encouraged (now, I don't
> > know if they have taxes in China, but that is not the point).
> >
> I wouldn't be so sure. In open source, you never know in advance
> to whom your code may be useful. Maybe some part of your Chinese
> tax software can be refactored into a more generic library. If you
> write the software with non-ASCII identifiers, this refactored
> library won't be usable by non-Chinese speakers. A good
> opportunity will be missed, but *you won't even know*.

A couple of people have made the point that it's easier
for a non-Chinese-speaking person to translate from
Unicode Chinese to their target language than from
ASCII pseudo-Chinese, due to the current state of the
art of translation engines like Babelfish, Google,
etc.

A more likely translation scenario is that somebody
semi-literate in a language attempts the translation. 
For example, I'm not fluent in French, but I could
translate a small useful French module to English
without too much effort, assuming that the underlying
algorithms were within my capability and I had
babelfish to overcome my rusty high school French. 
Here Unicode would probably help me, unless my browser
were just completely lame and the accents somehow
encumbered my ability to copy and paste.  My French
spelling when it comes to accents is bad, but they
don't affect me when it comes to reading.

The most likely scenario of translation is that
somebody truly bilingual does the translation.  I'm
sure there are something like 50,000 people in the
U.S. alone who are Chinese/English bilingual, and then
once you get something translated to English, that
opens up even more doors.

Having said all that, I agree with the underlying
premise that the availability of Unicode will provide
some mild disincentive for the original authors to
publish their work in English, to the extent that the
author doesn't predict (or stand to benefit from?) the
utility of his module outside the Chinese-reading
community.  But you do have to weigh that against the
disincentive to write the module in the first place,
if ASCII is the only option.





From brett at python.org  Tue Jun 12 03:53:43 2007
From: brett at python.org (Brett Cannon)
Date: Mon, 11 Jun 2007 18:53:43 -0700
Subject: [Python-3000] Pre-PEP on fast imports
In-Reply-To: <20070611231640.8A55E3A407F@sparrow.telecommunity.com>
References: <466DD0D8.7040407@develer.com>
	<20070611231640.8A55E3A407F@sparrow.telecommunity.com>
Message-ID: <bbaeab100706111853j1a2b2e3by63e4d125996d54fe@mail.gmail.com>

On 6/11/07, Phillip J. Eby <pje at telecommunity.com> wrote:
>
> At 12:46 AM 6/12/2007 +0200, Giovanni Bajo wrote:
> >Hi Philip,
> >
> >I'm going to submit a PEP for Python 3000 (and possibly backported
> >as an option off by default in Python 2). It's related to imports
> >and how to make them faster. Given your expertise on the subject,
> >I'd appreciate it if you could review my ideas. I briefly spoke of it
> >with Alex Martelli a few days ago at PyCon Italia and he was not
> >negative about it.
> >
> >Problems:
> >
> >- A single import causes many syscalls (.pyo, .pyc, .py, in both
> >directory and .zip file).
> >- Situation is getting worse and worse with the advent of
> >easy_install which produces many .pth files (longer sys.path).
> >- Python startup time is slow, and a noticeable fraction of it is
> >dominated by site.py-related stuff (a simple hello world takes
> >0.012s if run without -S, and 0.008s if run with -S).
> >- Many people might not be interested in this, but others are really
> >concerned. Eg: again at PyCon Italia, I spoke with one of the
> >leading Sugar programmers (OLPC) who told me that one of the biggest
> >blockers right now is the Python startup time (applications on the latest
> >OLPC prototype take 3-4 seconds to start up). He suggested that this
> >was related to the large number of syscalls made for imports.
> >
> >
> >Proposed solution:
> >
> >- A site cache is introduced. It's a dictionary mapping module names
> >to absolute file paths.
> >- When an import occurs, for each directory/zipfile we walk in
> >sys.path, we read all directory entries, and update the site cache
> >with all the Python modules found in it (all the Python modules
> >found in the directory/zipfile).
> >- If the filepath for a certain module is found in the site cache,
> >the module is directly accessed. Otherwise, sys.path is walked.
> >- The site cache can be cleared with sys.clear_site_cache(). This
> >must be used after manual editing of sys.path (or could be done
> >automatically by making sys.path a list subclass which notices each
> >modification).
> >- The site cache must be manually cleared if a Python file is added
> >to a directory in sys.path after the application has started. This
> >is a rare-enough scenario to require an additional explicit call.
> >- If for whatever reason a filepath found in the site cache cannot
> >be accessed (unmounted device, whatever) ImportError is raised.
> >Again, this is something which is very rare and does not require
> >much attention.
>
> Here's a simpler solution, one that's easily testable using existing
> Python versions.  Create a subclass of pkgutil.ImpImporter
> (Python >=2.5) that caches a listdir of its contents, and uses it to
> immediately reject any find_module() requests for which matching data
> is not in its cached listdir.  Add this class to sys.path_hooks, and
> see if it speeds things up.



I thought about this use case when writing importlib for lowering the
penalty of importing over NFS, and this is exactly how I would do it as
well (except I would use the code from importlib instead of pkgutil =).

-Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070611/4191e4fe/attachment.html 

From stephen at xemacs.org  Tue Jun 12 05:02:24 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 12 Jun 2007 12:02:24 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706111613o1f82354bx69ab9b1d9acb0169@mail.gmail.com>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de>
	<20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
	<466DBE10.1070804@v.loewis.de>
	<fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>
	<466DC90E.1070009@v.loewis.de>
	<fb6fbf560706111613o1f82354bx69ab9b1d9acb0169@mail.gmail.com>
Message-ID: <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > Of course, I wouldn't type them if I knew they were wrong.  With an
 > ASCII-only install, I would get that error-check because the
 > (remaining original uses) were in Cyrillic.  With an "any unicode
 > character" install, ... well, I might figure out my problem the next
 > morning.

But this is something that only a small subset of developers-of-Python
seem to be concerned about.  If that's generally the case for all
developers-in-Python, shouldn't the burden be placed on those who do
care?

It seems to me that rather than *impose* restrictions on third
parties, the sensible thing to do is to provide those restrictions to
those who want them.  But as Guido points out, that's outside of the
scope of this PEP because it can easily be done by external tools.

You object that running an auditor program would "cramp your style".
I don't mean that in a pejorative way; like Josiah's desire to
continue using certain tools, a developer's style is a BCP for him and
should *not* be gratuitously undermined.

But I see no reason why that auditor program can't be run as a PEP 263
codec.  AFAICS, the following objections could be raised, and answered:

1.  PEP 263 codecs delegate the decision to the code's author; an
    auditor shouldn't do that.

    You personally could modify your Python installation to replace
    all the codecs with a wrapper codec that processes the input by
    calling the "real" codec, then audits the resulting stream as it
    passes it back to the compiler.  But it can be done with a vanilla
    Python executable today (see the sketch after this list).

    This is *proof of concept*; possibly there should be a UI to
    install such a codec via command line flag or environment
    variable, although there may be other creative ways to install it
    without altering the current interface to PEP 263 codecs.  I'm not
    yet familiar with the implementation to guess.

2.  The auditor would have to duplicate the work of the parser, and
    might get it wrong.

    AIUI, the parser is available as a module.  Use it.

3.  Parsing is expensive in time and other resources.

    No, it's not.  It's the other stuff that the compiler does that is
    expensive.<wink>  This is going to be O(size of source) like any
    codec with a somewhat higher constant than typical codecs.  More
    important, AIUI PEP 263 codecs don't get run on compiled code, so
    in a production environment it isn't an issue.
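
As a proof of concept, here is a minimal sketch of such a wrapper
codec (the codec name "ascii-ids" and the hard-coded ASCII-only policy
are placeholders of mine, and a real source codec would also need the
stream/incremental machinery, omitted here):

    import codecs, io, tokenize

    def _audit(source):
        # Reject any identifier containing a non-ASCII character.
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok[0] == tokenize.NAME and \
                    not all(ord(c) < 128 for c in tok[1]):
                raise SyntaxError("non-ASCII identifier: %r" % (tok[1],))
        return source

    def _search(name):
        if name != "ascii-ids":
            return None
        utf8 = codecs.lookup("utf-8")
        def decode(data, errors="strict"):
            # Decode with the "real" codec, then audit the result on
            # its way back to the compiler.
            text, consumed = utf8.decode(data, errors)
            return _audit(text), consumed
        return codecs.CodecInfo(utf8.encode, decode, name="ascii-ids")

    codecs.register(_search)

Registered early enough (e.g. from sitecustomize), a module could then
opt in with a "# -*- coding: ascii-ids -*-" line.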

That doesn't mollify those who think I should not be allowed to use
non-ASCII identifiers at all.  But I think that should work for you
(modulo the UI for invoking the auditor).

Does it?

From murman at gmail.com  Tue Jun 12 04:56:10 2007
From: murman at gmail.com (Michael Urman)
Date: Mon, 11 Jun 2007 21:56:10 -0500
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4kpoa$pv$1@sea.gmane.org>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>
	<466C5730.3060003@v.loewis.de>
	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>
	<dcbbbb410706110629x2f5f6a08j518ecd37518ad810@mail.gmail.com>
	<f4kpoa$pv$1@sea.gmane.org>
Message-ID: <dcbbbb410706111956s2f4ab9d3v2efaca6f59f31089@mail.gmail.com>

On 6/11/07, Baptiste Carvello <baptiste13 at altern.org> wrote:
> Michael Urman wrote:
> > There is the risk of visually aliased identifiers, but how is that
> > qualitatively worse than the truly conflicting identifiers you can
> > import with a *, or have inserted by modules mucking with
> > __builtins__?
> >
> Oh come on! Imports are usually located at the top of the file, so they won't
> clobber other names. And mucking with __builtins__ is rare and frowned upon. On
> the contrary, non-ASCII identifiers will be encouraged, anywhere in the code.
> The amount of information you get from today's Python code is most of the time
> sufficient for debugging, or for using it as an inspiration. With non-ASCII
> identifiers, these features will be lost to all users who cannot read the needed
> characters. Denying the problem is not a good way to answer other people's concerns.

I think you overestimate my understanding of "the problem". To me
there is no problem (equal parts blindness and YAGNI; neither feels
like denial). As I am not going to be interested in trying to
understand code written in Chinese, Russian, etc., I'm not bothered by
the idea that someone might write code I will have a strong
disincentive to read. Am I underrating this concern because it doesn't
bother me? I don't see transliterated-into-ASCII as any better for
comprehension.

So to me that leaves the various potential aliasing problems that have
been described, and those honestly feel to me on par with import * and
builtins hackery. Yes these are discouraged, and aren't cause for
major concern. Similarly code intentionally designed to confuse would
be discouraged. I understand that Ka-Ping and several others do see
visual aliasing as a problem, so that is why I asked how it's
qualitatively worse.

I'm hoping that seeing answers from that angle (how is the potential
for aliasing worse than the potential for overriding int or str or
__import__ in some module you import) will help me understand why what
seems to me like a non-issue can be so important to others whose
opinions I respect. Is your concern that a flood of library code you
cannot read will be the only code written for things you want to do?
Or something else entirely?

-- 
Michael Urman

From stephen at xemacs.org  Tue Jun 12 05:35:44 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 12 Jun 2007 12:35:44 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4kn1r$q2i$1@sea.gmane.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>
	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>
	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>
	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>
	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de> <f4hoku$i57$1@sea.gmane.org>
	<1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com>
	<f4kn1r$q2i$1@sea.gmane.org>
Message-ID: <87ejkhg7zj.fsf@uwakimon.sk.tsukuba.ac.jp>

Baptiste Carvello writes:

 > I wouldn't be so sure. In open source, you never know in advance to
 > whom your code may be useful. Maybe some part of your Chinese tax
 > software can be refactored into a more generic library. If you
 > write the software with non-ASCII identifiers, this refactored
 > library won't be usable by non-Chinese speakers. A good
 > opportunity will be missed, but *you won't even know*.

You won't know anyway, because you can't read the project's home
page.  You won't be able to refactor, because you can't read the
comments.

Only if the developer already has the ASCII/English discipline will it
be practical for a third party to do the refactoring.  Otherwise, it
will be easier and more reliable to just write from scratch.  Such
developers, who want a global audience, will pretty quickly learn that
external APIs need to be ASCII.

 > good point, but I'm not sure it is so much more difficult to
 > identify identifiers, given that you already need to know ASCII
 > characters in order to identify the keywords. Sure, you won't
 > understand what the identifiers mean, but you'll probably be able
 > to tell them from one another.

It is much harder to do so if your everyday language does not use a
Latin alphabet.  At least at the student level, here in Japan, Japanese
students (and to some extent Chinese and Korean exchange students) make many
more spelling typos and punctuation errors in their programs than do
exchange students from the West.  One sees a lot of right/write (or,
for Japanese, right/light) errors, as well as transpositions and
one-letter mis-entry that most native speakers automatically catch
even in a cursory review.

I don't know that it would be better if they could use Japanese
identifiers, but I'd sure like to try.

 > > We are all consenting adults and we know that we should code in
 > > english if we want our code to be used and to be a first class
 > > citizen of the open source world.  What do you have to support
 > > your prediction?

 > I have experience in another community, namely the community of
 > physicists.  Here, most people don't know in advance how you're
 > supposed to write open source code. They learn by doing. And if
 > someone starts coding with non-ASCII identifiers, he won't have
 > time to recode his program later. So he will simply not publish
 > it. Lost opportunity again.

Why won't he publish it?  The only reason I can see is that somebody
has indoctrinated him with all the FUD about how non-ASCII identifiers
make a program useless.  If you tell him the truth, "it will be more
useful to the world if you make those identifiers ASCII", won't the
great majority of physicists just say "it does what people like me
need, generalization is somebody else's job" and publish as-is?


From martin at v.loewis.de  Tue Jun 12 06:59:26 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 12 Jun 2007 06:59:26 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4kpu7$pv$2@sea.gmane.org>
References: <20070524234516.8654.JCARLSON@uci.edu>	<4656920F.9040001@v.loewis.de>	<20070525091105.8663.JCARLSON@uci.edu>	<466C5BB7.8050909@v.loewis.de>
	<f4kpu7$pv$2@sea.gmane.org>
Message-ID: <466E282E.5070603@v.loewis.de>

Baptiste Carvello wrote:
> Martin v. Löwis wrote:
>> I cannot imagine this scenario as realistic. It is certainly
>> realistic that you want to keep your own code base ASCII-only -
>> what I don't understand is why such a policy would extend to libraries
>> that you use. If the interfaces of the library are non-ASCII, you
>> will automatically notice; if it only has some non-ASCII
>> identifiers inside, why would you bother?
>> 
> well, for the same reason I prefer to use open source software:
> because I can debug it in case of need, and because I can use it as
> an inspiration if I need to write a similar program.

Ok, but why then do you need *Python* to tell you that the file has
non-ASCII identifiers? Just look inside the file, and see whether
you like its source code. It's not that non-ASCII identifiers
*necessarily* make the file unmaintainable for you; they just
do so when you cannot easily recognize or type the characters
being used. Also, that all identifiers are ASCII is not sufficient
for you to be able to debug the program in case of need: it also
needs to be commented well, and the comments also should be in
a language you understand. Furthermore, it has been demonstrated
that ASCII-only identifiers are no guarantee that you can actually
understand the code, if they happen to be non-English in a convoluted
way, e.g. through transliteration.

So I don't see that an automatic detection of non-ASCII identifiers
actually helps much in determining whether you can use the source
code as inspiration. But even if you wanted to enforce a strict
"ASCII-only" policy, I don't see why you need Python to *reject*
identifiers outside ASCII - a warning would be surely enough to
indicate to you that your policy was violated.

Regards,
Martin



From martin at v.loewis.de  Tue Jun 12 06:59:31 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 12 Jun 2007 06:59:31 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4kp08$uno$1@sea.gmane.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705230926v4aa719a4x15c4a7047f48388d@mail.gmail.com>	<87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp>	<fb6fbf560705241114i55ae23afg5b8822abe0f99560@mail.gmail.com>	<Pine.LNX.4.58.0705241552450.8399@server1.LFW.org>	<781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net>	<87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>	<466BBDA1.7070808@v.loewis.de>	<f4hoku$i57$1@sea.gmane.org>	<466CBC5A.3050907@v.loewis.de>
	<f4kp08$uno$1@sea.gmane.org>
Message-ID: <466E2833.3090202@v.loewis.de>

>> or if I upgrade to the next Python version, assuming the default is
>> to restrict identifiers.
>>
> That would only happen if the default changes to a more strict rule. If we start
> with ASCII only, this is unlikely to ever happen!

It will likely happen. In 3.0, I change the installation default to
allow for the characters I want. Then I install 3.1, and my code stops
working. I have to remember how to change the installation default
again, and locate the place in the 3.1 installation where I need to
change the same setting (assuming it is a per-installation setting).

In any case, global (application-wide) flags for restricting identifiers
already have been ruled out as solutions to whatever the problem is they
try to solve.

> my problem is then: what happens if the reader does not speak the same language
> as the author of the code? Right now, if I come across python code written in a
> language I don't speak, I can still try to make sense of it. Sure, I may have to
> do without the comments, sure, I may not understand what the identifier names
> mean. But I can still follow the instructions flow and try to figure out what
> happens. With non-ASCII identifiers, I cannot do that because I cannot recognise
> the identifiers from one another.

I think it was Ping who demonstrated that with ASCII-only identifiers,
you may not be able to reasonably analyze the code, either, so
restricting to ASCII is no *guarantee* that you can maintain the
code.

> Well, I have not followed acurately the discussion about security risks.
> However, I see a much simpler risk: the risk that I come across with code that
> is technically open source, but that I can't even debug in case of need because
> I cannot make sense of it. This would reduce the usefulness of such code, and
> cause fragmentation for the community.

How do you know this hasn't happened already? I'm *fairly* certain that
the community is *already* fragmented, and that there are open source
developers in other parts of the world writing Python programs that will
just never show up in your part of the world, because of language
and culture barriers.

In any case, the PEP advises that international projects should
constrain themselves to ASCII and English; beyond that advice, I
think there should be freedom of choice. It is not the interpreter's
job to be language police.

Regards,
Martin


From martin at v.loewis.de  Tue Jun 12 07:01:38 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 12 Jun 2007 07:01:38 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <740c3aec0706111550s38e32d1dsf307f4e2c16c71e4@mail.gmail.com>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>	
	<466C5730.3060003@v.loewis.de>	
	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>	
	<466DBD54.4030006@v.loewis.de>
	<740c3aec0706111550s38e32d1dsf307f4e2c16c71e4@mail.gmail.com>
Message-ID: <466E28B2.8090905@v.loewis.de>

>> > Python currently provides to everyone the restriction of
>> > identifiers to a character set that everyone knows and trusts.
>> > Many of us want Python to continue to provide such restriction
>> > for those who want identifiers to be in a character set they
>> > know and trust.  This is not incompatible with your desire to
>> > permit alternative character sets, as long as Python offers an
>> > option to make that choice.  We can continue to discuss the
>> > details of how that choice is expressed, but this general idea
>> > is a solution that would give us both what we want.
>> >
>> > Can we agree on that?
>>
>> So far, all proposals I have seen *are* incompatible, or had
>> some other flaws, so I'm not certain that this general idea
>> provides a non-empty solution set.
> 
> python -ascii-only

That doesn't implement the requirement "restriction for those
who want identifiers to be in a character set they know and
trust", if that character set is not ASCII.

It also fails Guido's requirement "no global options", which
is "some other flaw".

Regards,
Martin

From rauli.ruohonen at gmail.com  Tue Jun 12 08:33:35 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 12 Jun 2007 09:33:35 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4kno7$rr5$1@sea.gmane.org>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de> <f4hoku$i57$1@sea.gmane.org>
	<FE92696F-3499-4128-BB3E-9C98D2F8838A@fuhm.net>
	<f4kno7$rr5$1@sea.gmane.org>
Message-ID: <f52584c00706112333l7c756303wfeec3224e45796ec@mail.gmail.com>

On 6/12/07, Baptiste Carvello <baptiste13 at altern.org> wrote:
> This is where we strongly disagree. If an identifier is written in
> transliterated Chinese, I cannot understand what it means, but I can
> recognise it when it is used in the code. I will then find out the
> meaning from the context. By contrast, with Chinese identifiers, I
> will not recognise them from one another. So I won't be able to make
> any sense from the code without going through the complex task of
> translating everything.

I don't know any Chinese, but real Chinese is much more legible to me
than a transliterated version. Transliterations are complete gibberish
to me, but because I know Japanese and it uses many of the same
characters with the same meaning, real Chinese makes at least *some*
sense, and if I need to learn a few variable names in it then it's
easier to do so with the proper characters. It's also much easier to
look up what they mean, as others have already mentioned. The same
should be true for anyone who knows Japanese, and there's a whole
nation full of those.

From rrr at ronadam.com  Tue Jun 12 09:28:34 2007
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 12 Jun 2007 02:28:34 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<4665EE44.2010306@ronadam.com>	
	<ee2a432c0706062018n6a6a5362yb75b380cf36cede2@mail.gmail.com>	
	<4667CCB2.6040405@ronadam.com>	
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>	
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>	
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>	
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>
Message-ID: <466E4B22.6020408@ronadam.com>

Guido van Rossum wrote:
> On 6/7/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> >> The os.environ.get() method probably should return a unicode 
>> string. (?)
>> >
>> > Indeed -- care to contribute a patch?
>>
>> Ideally, such a patch would make use of the Win32 Unicode API for
>> environment variables on Windows. People had already been complaining
>> that they can't have "funny characters" in the value of an environment
>> variable, even though the UI allows them to set the variable just fine.
> 
> Yeah, but the Windows build of py3k is currently badly broken (e.g.
> the _fileio.c extension probably doesn't work at all) -- and I don't
> have access to a Windows box to work on it. I'm afraid 3.0a1 will be
> released without Windows support. Of course I'm counting on others to
> fix that before 3.0 final is released.
> 
> I don't mind for now that the posix.environ variable contains 8-bit
> strings -- people shouldn't be importing that anyway.


Here's a diff of the patch.  It looks like this may be backported to 2.6 
since it isn't Unicode specific but casts to the current str type.



Cast environ keys and values to current python str type in os.py
Added test for environ string types to test_os.py
Fixed test_update2, bug 1110478 test, that was being skipped.

Test test_tmpfile in test_os.py fails.  Haven't looked into it yet.


Index: Lib/os.py
===================================================================
--- Lib/os.py   (revision 55924)
+++ Lib/os.py   (working copy)
@@ -505,7 +505,8 @@
              def copy(self):
                  return dict(self)

-
+    # Make sure all environment keys and values are correct str type.
+    environ = dict([(str(k), str(v)) for k, v in environ.items()])
      environ = _Environ(environ)

  def getenv(key, default=None):
Index: Lib/test/test_os.py
===================================================================
--- Lib/test/test_os.py (revision 55924)
+++ Lib/test/test_os.py (working copy)
@@ -266,12 +266,25 @@
          os.environ.clear()
          os.environ.update(self.__save)

+class EnvironTests2(unittest.TestCase):
+    """Test os.environ for specific problems."""
+    def setUp(self):
+        self.__save = dict(os.environ)
+    def tearDown(self):
+        os.environ.clear()
+        os.environ.update(self.__save)
      # Bug 1110478
      def test_update2(self):
          if os.path.exists("/bin/sh"):
              os.environ.update(HELLO="World")
              value = os.popen("/bin/sh -c 'echo $HELLO'").read().strip()
              self.assertEquals(value, "World")
+    # Verify environ keys and values from the OS are of the
+    # correct str type.
+    def test_keyvalue_types(self):
+        for key, val in os.environ.items():
+            self.assertEquals(type(key), str)
+            self.assertEquals(type(val), str)

  class WalkTests(unittest.TestCase):
      """Tests for os.walk()."""
@@ -466,6 +479,7 @@
          TemporaryFileTests,
          StatAttributeTests,
          EnvironTests,
+        EnvironTests2,
          WalkTests,
          MakedirTests,
          DevNullTests,
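
A quick interactive sanity check after applying the patch (an
illustrative session, not output from a real run):

    >>> import os
    >>> all(type(k) is str and type(v) is str
    ...     for k, v in os.environ.items())
    True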



From python at zesty.ca  Tue Jun 12 09:30:33 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Tue, 12 Jun 2007 02:30:33 -0500 (CDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466E282E.5070603@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de>
	<20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de>
	<f4kpu7$pv$2@sea.gmane.org> <466E282E.5070603@v.loewis.de>
Message-ID: <Pine.LNX.4.58.0706120226230.7196@server1.LFW.org>

On Tue, 12 Jun 2007, [ISO-8859-1] "Martin v. Löwis" wrote:
> Also, that all identifiers are ASCII is not sufficient
> for you to be able to debug the program in case of need: it also
> needs to be commented well, and the comments also should be in
> a language you understand. Furthermore, it has been demonstrated
> that ASCII-only identifiers are no guarantee that you can actually
> understand the code, if they happen to be non-English in a convoluted
> way, e.g. through transliteration.

You keep making arguments of this type: that lacking a 100% guarantee
of a desirable property is reason to abandon consideration of the
property altogether.  I reject such arguments, as they should be
rejected by any sound application of logic.  They don't belong in this
debate; please stop making them.


-- ?!ng

From python at zesty.ca  Tue Jun 12 09:40:29 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Tue, 12 Jun 2007 02:40:29 -0500 (CDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de>
	<20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de>
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
	<466DBE10.1070804@v.loewis.de>
	<fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>
	<466DC90E.1070009@v.loewis.de>
	<fb6fbf560706111613o1f82354bx69ab9b1d9acb0169@mail.gmail.com>
	<87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <Pine.LNX.4.58.0706120231490.7196@server1.LFW.org>

On Tue, 12 Jun 2007, Stephen J. Turnbull wrote:
> It seems to me that rather than *impose* restrictions on third
> parties, the sensible thing to do is to provide those restrictions to
> those who want them.

Hang on a second.  No one is *imposing* new restrictions.  Python
uses ASCII-only identifiers today and has always been that way.
The proposed change is to *expand* the identifier character set,
and some of us want to have control over this expansion.

> But I see no reason why that auditor program can't be run as a PEP 263
> codec.  AFAICS, the following objections could be raised, and answered:

The big missing concern from your list is that the vast majority
won't *know* that the character set is changing on them, so they
won't know that they need to do any of these things.

> 1.  PEP 263 codecs delegate the decision to the code's author; an
>     auditor shouldn't do that.

I'd be okay with this if the rules were adjusted so the codec
declaration was guaranteed to occur within a small bounded region
at the beginning of the file (e.g. the first two lines or first
80 characters, whichever is less) and that region was required to
be in ASCII.  Then you can easily know reliably, and at a glance,
what character set you are dealing with.
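
(For reference, the PEP 263 declaration that exists today already fits
in such a bounded ASCII region:)

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-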

> 2.  The auditor would have to duplicate the work of the parser, and
>     might get it wrong.
> 3.  Parsing is expensive in time and other resources.

Both of these come down to the wastefulness of redoing something
that the Python interpreter itself already knows how to do very
well, and is, in some sense by definition, the authority on how
to do it correctly.


-- ?!ng

From rauli.ruohonen at gmail.com  Tue Jun 12 12:27:45 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 12 Jun 2007 13:27:45 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>

On 6/10/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> I think you misunderstand.  Anything in Unicode that is normative is
> about interchange.  Strings are also a means of interchange---between
> modules (separate Unicode processes) in a program (single OS process).

Like Martin said, "what is a process?" :-) If you have a module that uses
noncharacters to mean something and it documents that, then that may well
be useful to its users. In my mind everything in a Python program is
within a single Unicode process, unless you have a very high level
component which specifies otherwise in its API documentation.

> Your complaint about Python mixing "pseudo-UTF-16" with "pseudo-UCS-2"
> is precisely a statement that various modules in Python do not specify
> what encoding forms they purport to accept or emit.

Actually, I said that there's no way to always do the right thing as long
as they are mixed, but that was too theoretical an argument. Practically
speaking, there's little need to interpret surrogate pairs as two
code points instead of as one non-BMP code point. The best use case I
could come up with was reading in an ill-formed UTF-8 file to see what
makes it ill-formed, but that's best done using bytes anyway.

E.g. '\xed\xa0\x80\xed\xb0\x80\xf0\x90\x80\x80' decodes to
u'\ud800\udc00\U00010000' on both builds, but as on a UCS-2 build
u'\U00010000' == u'\ud800\udc00', the distinction is lost there.
Effectively the codec has decoded the first two code points to UCS-2
and the last code point to UTF-16, forming a string which mixes
the two interpretations instead of using one of them consistently, and
because of that you can no longer recover the original code point stream.
But what the decoder should really do is raise an exception anyway, as
the input is ill-formed.

Java and C# (and thus Jython and IronPython too) also sometimes use
UCS-2, sometimes UTF-16. As long as it works as you expect, there isn't a
problem, really.

On UCS-4 builds of CPython it's the same (either UCS-4 or UTF-32 with the
extension that surrogates work as in UTF-16), but you get the extra
complication that some equal strings don't compare equal, e.g.
u'\U00010000' != u'\ud800\udc00'. Even that doesn't cause problems in
practice, because you shouldn't have strings like u'\ud800\udc00' in the
first place.
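
To make the build difference concrete, a minimal interpreter sketch
(Python 2.x; illustrative, not a verbatim transcript):

    >>> s = '\xed\xa0\x80\xed\xb0\x80\xf0\x90\x80\x80'.decode('utf-8')
    >>> s == u'\ud800\udc00\U00010000'
    True     # on both builds, per the decoding behaviour above
    >>> u'\U00010000' == u'\ud800\udc00'
    True     # narrow (UCS-2) build: both sides are the same surrogate pair
    # on a wide (UCS-4) build the same comparison is False:
    # one code point vs. two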

From stephen at xemacs.org  Tue Jun 12 13:35:22 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 12 Jun 2007 20:35:22 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f52584c00706112333l7c756303wfeec3224e45796ec@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>
	<466BBDA1.7070808@v.loewis.de> <f4hoku$i57$1@sea.gmane.org>
	<FE92696F-3499-4128-BB3E-9C98D2F8838A@fuhm.net>
	<f4kno7$rr5$1@sea.gmane.org>
	<f52584c00706112333l7c756303wfeec3224e45796ec@mail.gmail.com>
Message-ID: <87vedte77p.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > I don't know any Chinese, but real Chinese is much more legible to me
 > than transliterated one. Transliterations are complete gibberish to me,

And will be to most Chinese, too, unless Mandarin is used, since
pronunciation varies infinitely from dialect to dialect, although the
characters and grammar are mostly the same.  You'd have to ask a
non-Beijing Chinese, but I suspect some of them feel about Mandarin as
a standard about the way Japanese feel about English.<wink>


From turnbull at sk.tsukuba.ac.jp  Tue Jun 12 14:04:35 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Tue, 12 Jun 2007 21:04:35 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0706120231490.7196@server1.LFW.org>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de>
	<20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
	<466DBE10.1070804@v.loewis.de>
	<fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>
	<466DC90E.1070009@v.loewis.de>
	<fb6fbf560706111613o1f82354bx69ab9b1d9acb0169@mail.gmail.com>
	<87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706120231490.7196@server1.LFW.org>
Message-ID: <87tztde5v0.fsf@uwakimon.sk.tsukuba.ac.jp>

Ka-Ping Yee writes:

 > On Tue, 12 Jun 2007, Stephen J. Turnbull wrote:
 > > It seems to me that rather than *impose* restrictions on third
 > > parties, the sensible thing to do is to provide those restrictions to
 > > those who want them.
 > 
 > Hang on a second.  No one is *imposing* new restrictions.  Python
 > uses ASCII-only identifiers today and has always been that way.

Who said "new"?  PEP 3131 is approved, so "reimpose", if you like.
But I don't want it, and definitely consider it an imposition, and
have done so since the discussion of PEP 263.

 > The big missing concern from your list is that the vast majority
 > won't *know* that the character set is changing on them, so they
 > won't know that they need to do any of these things.

Deliberate omission.

Such restrictions seem unacceptable to both Guido and Martin; the
*only* point of this proposal is to see if there's a way we can
achieve Jim's goal of no change to his best current practice without a
global setting unacceptable to Guido.

If you want to use this technology to change the default, fine, but
it's not part of my proposal.

 > > 1.  PEP 263 codecs delegate the decision to the code's author;
 > 
 > I'd be okay with this if [...]

[I'm not sure what you mean; I've deliberately edited to show the
meaning I took.]

I'm not OK with it.  Auditing by definition is under control of the
user, not the source code.  I don't see a real point in doing this if
the user or site can't enforce auditing, since they *can* do so by
explicitly running an external utility.

 > > 2.  The auditor would have to duplicate the work of the parser, and
 > >     might get it wrong.
 > > 3.  Parsing is expensive in time and other resources.
 > 
 > Both of these come down to the wastefulness of redoing something
 > that the Python interpreter itself already knows how to do very
 > well, and is, in some sense by definition, the authority on how
 > to do it correctly.

True.  However, Guido has already indicated that he favors some
approach like this, as an external lint utility.  My question is how
to minimize impact on users who desire flexible automatic auditing.


From jimjjewett at gmail.com  Tue Jun 12 16:03:53 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 10:03:53 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
Message-ID: <fb6fbf560706120703r1d2ed9cq4b980ab5c92ff892@mail.gmail.com>

On 6/12/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> Practically
> speaking, there's little need to interpret surrogate pairs as two
> code points instead of as one non-BMP code point.

Depends on your definition of "practically".

Python does interpret them that way to maintain O(1) positional access
within strings encoded with 16 bits/char.

-jJ

From rauli.ruohonen at gmail.com  Tue Jun 12 16:39:48 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 12 Jun 2007 17:39:48 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <fb6fbf560706120703r1d2ed9cq4b980ab5c92ff892@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<fb6fbf560706120703r1d2ed9cq4b980ab5c92ff892@mail.gmail.com>
Message-ID: <f52584c00706120739p7798c18aqbf3c456011594bac@mail.gmail.com>

On 6/12/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 6/12/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> > Practically speaking, there's little need to interpret surrogate pairs
> > as two code points instead of as one non-BMP code point.
>
> Depends on your definition of "practically".
>
> Python does interpret them that way to maintain O(1) positional access
> within strings encoded with 16 bits/char.

Indexing does not try to interpret the string as code points at all, it
works on code units. The difference is easier to see if you imagine Python
using utf-8 for strings. Indexing would still work on (8-bit) code units
instead of code points. It is higher level operations such as
unicodedata.normalize() that need to interpret strings as code points.
For 16-bit code units there are two interpretations, depending on whether
you think that surrogate pairs mean one (UTF-16) or two (UCS-2) code points.

Incidentally, unicodedata.normalize() is an example that currently does
interpret its input as UCS-2 instead of UTF-16. If you pass it a surrogate
pair it thinks of them as two code points, and won't do any normalization
for anything outside BMP on a UCS-2 build. Another example would be
unichr(), which gives you TypeError if you pass it a surrogate pair (oddly
enough, as strings of different length are of the same type).
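
A small sketch of that normalize() behaviour (narrow build assumed;
U+1D15E MUSICAL SYMBOL HALF NOTE canonically decomposes to U+1D157
followed by U+1D165):

    >>> import unicodedata
    >>> unicodedata.normalize('NFD', u'\U0001D15E')
    u'\ud834\udd5e'    # the surrogate pair passes through untouched
    # on a UCS-4 build the same call returns u'\U0001d157\U0001d165'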

From rauli.ruohonen at gmail.com  Tue Jun 12 16:45:20 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 12 Jun 2007 17:45:20 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706120739p7798c18aqbf3c456011594bac@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<fb6fbf560706120703r1d2ed9cq4b980ab5c92ff892@mail.gmail.com>
	<f52584c00706120739p7798c18aqbf3c456011594bac@mail.gmail.com>
Message-ID: <f52584c00706120745y24435962p944543b49039bc16@mail.gmail.com>

On 6/12/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> Another example would be unichr(), which gives you TypeError if you
> pass it a surrogate pair (oddly enough, as strings of different length
> are of the same type).

Sorry, I meant ord(), not unichr. Anyway, ord(unichr(i)) == i doesn't
work for all code points on a UCS-2 build.
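
A sketch of both failure modes on a narrow (UCS-2) build (error
messages approximate):

    >>> ord(u'\U00010000')    # stored as a surrogate pair, length 2
    TypeError: ord() expected a character, but string of length 2 found
    >>> unichr(0x10000)       # the reverse direction fails too
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)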

From bborcic at gmail.com  Tue Jun 12 18:27:11 2007
From: bborcic at gmail.com (Boris Borcic)
Date: Tue, 12 Jun 2007 18:27:11 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <Pine.LNX.4.58.0706120231490.7196@server1.LFW.org>
References: <20070524234516.8654.JCARLSON@uci.edu>	<4656920F.9040001@v.loewis.de>	<20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>	<466DBE10.1070804@v.loewis.de>	<fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>	<466DC90E.1070009@v.loewis.de>	<fb6fbf560706111613o1f82354bx69ab9b1d9acb0169@mail.gmail.com>	<87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
	<Pine.LNX.4.58.0706120231490.7196@server1.LFW.org>
Message-ID: <f4mhi9$dh2$1@sea.gmane.org>

Ka-Ping Yee wrote:
> 
> Hang on a second.  No one is *imposing* new restrictions.  Python
> uses ASCII-only identifiers today and has always been that way.

That restriction clearly wasn't imposed on the standard www.python.org Windows
distributions of Python - for quite a few versions already. See below.

Cheers, Boris Borcic


Python 2.4.2 (#67, Jan 17 2006, 15:36:03) [MSC v.1310 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.

     ****************************************************************
     Personal firewall software may warn about the connection IDLE
     makes to its subprocess using this computer's internal loopback
     interface.  This connection is not visible on any external
     interface and no data is sent to or received from the Internet.
     ****************************************************************

IDLE 1.1.2
 >>> ça_marchait = 1
 >>> print ça_marchait
1

==========================================================================


Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.

     ****************************************************************
     Personal firewall software may warn about the connection IDLE
     makes to its subprocess using this computer's internal loopback
     interface.  This connection is not visible on any external
     interface and no data is sent to or received from the Internet.
     ****************************************************************

IDLE 1.2
 >>> ça_marche = 2
 >>> à_l_évidence = 3
 >>> ça_marche + à_l_évidence
5


From pje at telecommunity.com  Tue Jun 12 18:30:46 2007
From: pje at telecommunity.com (Phillip J. Eby)
Date: Tue, 12 Jun 2007 12:30:46 -0400
Subject: [Python-3000] Pre-PEP on fast imports
In-Reply-To: <20070611231640.8A55E3A407F@sparrow.telecommunity.com>
References: <466DD0D8.7040407@develer.com>
	<20070611231640.8A55E3A407F@sparrow.telecommunity.com>
Message-ID: <20070612162845.5F00A3A407F@sparrow.telecommunity.com>

At 07:18 PM 6/11/2007 -0400, Phillip J. Eby wrote:
>The subclass might look something like this:
>
>      import imp, os, sys
>      from pkgutil import ImpImporter
>
>      suffixes = set(ext for ext,mode,typ in imp.get_suffixes())
>
>      class CachedImporter(ImpImporter):
>          def __init__(self, path):
>              if not os.path.isdir(path):
>                  raise ImportError("Not an existing directory")
>              super(CachedImporter, self).__init__(path)
>              self.refresh()
>
>          def refresh(self):
>              self.cache = set()
>              for fname in os.listdir(path):
>                  base, ext = os.path.splitext(fname)
>                  if ext in suffixes and '.' not in base:
>                      self.cache.add(base)
>
>          def find_module(self, fullname, path=None):
>              if fullname.split(".")[-1] not in self.cache:
>                  return None  # no need to check further
>              return super(CachedImporter, self).find_module(fullname, path)
>
>      sys.path_hooks.append(CachedImporter)

After a bit of reflection, it seems the refresh() method needs to be 
a bit different:

           def refresh(self):
               cache = set()
               for fname in os.listdir(self.path):
                   base, ext = os.path.splitext(fname)
                   if not ext or (ext in suffixes and '.' not in base):
                       cache.add(base)
               self.cache = cache

This version fixes two problems: first, a race condition could occur 
if you called refresh() while an import was taking place in another 
thread.  This version fixes that by only updating self.cache after 
the new cache is completely built.

Second, the old version didn't handle packages at all.  This version 
handles them by treating extension-less filenames as possible package 
directories.  I originally thought this should check for a 
subdirectory and __init__, but this could get very expensive if a 
sys.path directory has a lot of subdirectories (whether or not 
they're packages).  Having false positives in the cache (i.e. names 
that can't actually be imported) could slow things down a bit, but 
*only* if those names match something you're trying to import.  Thus, 
it seems like a reasonable trade-off versus needing to scan every 
subdirectory at startup or even to check whether all those names 
*are* subdirectories.
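
For completeness, a short activation sketch (CachedImporter as defined
above; note that sys.path_importer_cache has to be cleared so that
directories already seen get re-resolved through the new hook):

    import sys
    sys.path_hooks.append(CachedImporter)
    sys.path_importer_cache.clear()  # drop previously assigned importers
    import mymodule  # hypothetical; directory scans now hit the cache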


From jimjjewett at gmail.com  Tue Jun 12 18:48:46 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 12:48:46 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4kpoa$pv$1@sea.gmane.org>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>
	<466C5730.3060003@v.loewis.de>
	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>
	<dcbbbb410706110629x2f5f6a08j518ecd37518ad810@mail.gmail.com>
	<f4kpoa$pv$1@sea.gmane.org>
Message-ID: <fb6fbf560706120948p647cc58du69b2c263ab3d69c8@mail.gmail.com>

On 6/11/07, Baptiste Carvello <baptiste13 at altern.org> wrote:
> Michael Urman a écrit :

> > ... you already cannot visually inspect ...
> > There is the risk of visually aliased identifiers, but how is that
> > qualitatively worse than the truly conflicting identifiers you can
> > import with a *, or have inserted by modules mucking with
> > __builtins__?

> Oh come on! imports usually are located at the top of the file, so they won't
> clobber other names. And mucking with __builtins__ is rare and frowned upon.

Also, both are at least obvious.  Because of the (unexpected) visually
similar possibilities, a closer analogy would be a module that did

    import sys

    def fn1():
        global mydata
        mydata = sys.modules['__builtin__']  # alias the builtins module

and later changes mydata.  This is certainly possible, but it isn't
common, or accepted as good style.

> On the contrary, non-ASCII identifiers will be encouraged,
> anywhere in the code.

And that's OK with me -- but I want a warning when they are used, at
least as conspicuous as

    import *
    __builtins__

I have no objection to letting people turn that warning off locally
(preferably per-charset, rather than as a single switch), but I want
that decision to be local and explicit.

-jJ

From jimjjewett at gmail.com  Tue Jun 12 19:08:30 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 13:08:30 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706120739p7798c18aqbf3c456011594bac@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<fb6fbf560706120703r1d2ed9cq4b980ab5c92ff892@mail.gmail.com>
	<f52584c00706120739p7798c18aqbf3c456011594bac@mail.gmail.com>
Message-ID: <fb6fbf560706121008s7b23c4dav9c8a2957d5f861a7@mail.gmail.com>

On 6/12/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> On 6/12/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > On 6/12/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> > > Practically speaking, there's little need to interpret
> > > surrogate pairs as two code points instead of as one
> > > non-BMP code point.

> > Depends on your definition of "practically".

> > Python does interpret them that way to maintain O(1) positional
> > access within strings encoded with 16 bits/char.

> Indexing does not try to interpret the string as code points at all, it
> works on code units.

Even granting that (most people will assume "letters", and could
maybe understand that accent marks sometimes count), it still doesn't
quite work.

Slicing (or iterating over) a string claims to return strings of the same type.

>>> for x in u"abc": print type(x)

<type 'unicode'>
<type 'unicode'>
<type 'unicode'>

Strictly speaking, the surrogate pairs should be returned together,
rather than as separate code units.  It probably won't be fixed, since
those who care most are probably using 4-byte unicode characters.

-jJ

From g.brandl at gmx.net  Tue Jun 12 22:28:21 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Tue, 12 Jun 2007 22:28:21 +0200
Subject: [Python-3000] Unicode identifiers
In-Reply-To: <466EF680.10200@v.loewis.de>
References: <466BCB1D.2050101@v.loewis.de>	
	<ca471dc20706101557k6ef16b67t3e08e167e995ceed@mail.gmail.com>	
	<466CCF0D.5000304@v.loewis.de>	
	<ca471dc20706110751p214613a0l62fe672c890f91f@mail.gmail.com>	
	<466DBBA8.5020405@v.loewis.de>	
	<ca471dc20706111425i1549ec18qea50d5acbc09fc86@mail.gmail.com>	
	<466E3509.6020703@v.loewis.de> <466E9C08.5000704@gmx.net>
	<ca471dc20706120844q4d61b806tdb7e0ca9e8b2e29e@mail.gmail.com>
	<466EC238.1060502@gmx.net> <466EF680.10200@v.loewis.de>
Message-ID: <466F01E5.3030204@gmx.net>

[crossposting to python-3000]

Martin v. Löwis schrieb:

[removing string->string codecs]

>>> You're not losing functionality -- these conversions will remain
>>> available by importing the appropriate module. You're losing a very
>>> minor amount of convenience.
>> 
>> Of the mentioned encodings base64, uu, zlib, rot_13, hex, and quopri (bz2
>> should be in there as well) these all could work in the unicode<->bytes
>> way, except for rot13.
> 
> What does it mean to apply base64 to a string that contains characters
> with ordinals > 256? Likewise for uu, zlib, hex, and quopri.
> 
> They really encode bytes, not characters.

Perhaps I may then suggest a new bytes API, transform().

b"abc".transform("base64") would then do the same as today's "abc".encode("base64").

A unified bytestring transforming API could then make many of the
functions scattered across many modules obsolete.

Whether the opposite way would be via a different transformer name
or an untransform() method remains to be debated.
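
A hypothetical sketch of the proposed API (nothing here exists yet; the
value shown matches what today's "abc".encode("base64") produces):

    >>> b"abc".transform("base64")
    b'YWJj\n'
    >>> b'YWJj\n'.untransform("base64")
    b'abc'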

Georg


From baptiste13 at altern.org  Tue Jun 12 22:59:35 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 22:59:35 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <dcbbbb410706111956s2f4ab9d3v2efaca6f59f31089@mail.gmail.com>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>	<466C5730.3060003@v.loewis.de>	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>	<dcbbbb410706110629x2f5f6a08j518ecd37518ad810@mail.gmail.com>	<f4kpoa$pv$1@sea.gmane.org>
	<dcbbbb410706111956s2f4ab9d3v2efaca6f59f31089@mail.gmail.com>
Message-ID: <f4n1gn$9o6$1@sea.gmane.org>

Michael Urman a écrit :
> As I am not going to be interested in trying to
> understand code written in Chinese, Russian, etc., I'm not bothered by
> the idea that someone might write code I will have a strong
> disincentive to read. 
> 
The question is: is it worth it? Will the new feature allow more useful code to
be written, or will it cause unnecessary duplication of effort? Probably both,
but I cannot tell in which proportions, and neither can you, I guess.

I think it helps a lot in this regard if developers make a conscious choice as
to whether they use non-ASCII identifiers or not in a given project (as opposed
to just using the feature because it is there). Thus they will only use them
when they really feel the need. Having the feature disabled by default is a way
to make sure people take some time to think about it. However, maybe it's not
absolutely necessary and a prominent explanation in the various documentation
is sufficient? I'm not 100% sure one way or the other.

Cheers,
Baptiste


From baptiste13 at altern.org  Tue Jun 12 23:29:56 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 23:29:56 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466E282E.5070603@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu>	<4656920F.9040001@v.loewis.de>	<20070525091105.8663.JCARLSON@uci.edu>	<466C5BB7.8050909@v.loewis.de>	<f4kpu7$pv$2@sea.gmane.org>
	<466E282E.5070603@v.loewis.de>
Message-ID: <f4n39n$gbv$1@sea.gmane.org>

Martin v. Löwis a écrit :
> Baptiste Carvello schrieb:
>> Martin v. Löwis a écrit :
>>> I cannot imagine this scenario as realistic. It is certainly
>>> realistic that you want to keep your own code base ASCII-only -
>>> what I don't understand is why such a policy would extend to
>>> libraries that you use. If the interfaces of the library are
>>> non-ASCII, you will automatically notice; if it only has some
>>> non-ASCII identifiers inside, why would you bother?
>>>
>> well, for the same reason I prefer to use open source software:
>> because I can debug it in case of need, and because I can use it as
>> an inspiration if I need to write a similar program.
> 
> Ok, but why do you then need *Python* to tell you that the file has
> non-ASCII identifiers? Just look inside the file, and see whether
> you like its source code. 
>
well, doing that for all code before using it is not practical. And finding out
you can't read the code at the precise time when you have a bug you need to
solve is a really bad surprise.

> It's not that non-ASCII identifiers
> *necessarily* make the file unmaintainable for you; they just
> do so when you cannot easily recognize or type the characters
> being used. 
>
true, but better safe than sorry :-)

> Also, that all identifiers are ASCII is not sufficient
> for you to be able to debug the program in case of need: it also
> needs to be commented well, and the comments also should be in
> a language you understand. 
>
comments are nice to have, but you can often figure out what the code does
without them. It's not like all code is heavily commented...

> Furthermore, it has been demonstrated
> that ASCII-only identifiers are no guarantee that you can actually
> understand the code, if they happen to be non-English in a convoluted
> way, e.g. through transliteration.
> 
This is the same as comments: if the identifier name is gibberish, you can still
figure out what it stands for from the context (OK, at that point, it starts
getting very uncomfortable :-).

> So I don't see that an automatic detection of non-ASCII identifiers
> actually helps much in determining whether you can use the source
> code as inspiration. But even if you wanted to enforce a strict
> "ASCII-only" policy, I don't see why you need Python to *reject*
> identifiers outside ASCII - a warning would be surely enough to
> indicate to you that your policy was violated.
> 
indeed, but I doubt a warning would be acceptable as a default policy, given how
annoying they are. So there would still need to be a configuration option to
disable (resp. enable) the warning. Also note that a warning would not solve the
security problems others have discussed, because it would only be shown after
the code has been executed.

> Regards,
> Martin
> 

Cheers,
Baptiste

PS: I think I'm going to reduce my participation in this thread, as I don't have
many new thoughts to add. I'm not convinced that non-ASCII identifiers allowed
by default is the way to go, but I'm not 100% sure otherwise, so count me as a
-0 on that.


From baptiste13 at altern.org  Tue Jun 12 23:34:25 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 23:34:25 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <19dd68ba0706111734t1f0f12d6y6c6584e41bf6e211@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706012114y10a2af0es68c2bfadde6193cc@mail.gmail.com>	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>	<Pine.LNX.4.58.0706040529350.7196@server1.LFW.org>	<466BBDA1.7070808@v.loewis.de>
	<f4hoku$i57$1@sea.gmane.org>	<FE92696F-3499-4128-BB3E-9C98D2F8838A@fuhm.net>	<f4kno7$rr5$1@sea.gmane.org>
	<19dd68ba0706111734t1f0f12d6y6c6584e41bf6e211@mail.gmail.com>
Message-ID: <f4n3i3$h9v$1@sea.gmane.org>

Guillaume Proux a écrit :
> Hello,
> 
> On 6/12/07, Baptiste Carvello <baptiste13 at altern.org> wrote:
>> context. By contrast, with Chinese identifiers, I will not recognise them from
>> one another. So I won't be able to make any sense from the code without going
>> through the complex task of translating everything.
> 
> You would be surprised how well you can do if you actually try
> to recognize a set of Chinese characters, especially if you use
> some tool to put a meaning on them. Well, I never formally learned any
> Chinese (nor any Japanese actually), but I can now effortlessly parse
> both languages.
> 
> But really, if you ever find any code with Chinese written all over it
> that you would believe might be very useful to you, you would have one
> of the following choice:
> (a) use a tokenizer and use some tool to do a hanzi -> ascii automatic
> transliteration/translation
> (b)  try to wrap the Chinese things with an ASCII veil (which would
> make you work on your Chinese a bit)  or you could ask your Chinese
> girlfriend to help you (WHAT you don't have a Chinese girlfriend yet?
> :))
> (c) actually contact the person who submitted the code to let him know
> you are very much interested in the code....
> 
> In most cases, this would give you the possibility to reach out to
> different communities and to work together with people with whom you
> might never have talked to. From what we can see on English-only
> mailing lists, this is the kind of python users we don't normally have
> access to currently because they simply are secluded in their own
> little universe, in the comfortable realm of their own linguistic
> barriers.
> 
> Of course, sometimes they step out  and offer a plea for help on
> English ML in broken English...
> PEP3131 is unlikely to change this. However, I can see it might have
> two ethnically interesting consequences:
> 1) Python usage in communities where ASCII has little place should find
> more uses because people will become empowered with Python and able to
> express themselves like never before: my bet is that, for example, the
> Japanese Python community will become stronger and welcome new people
> younger and older, who do not know much English.
> 2) If ever a program written with non-ASCII characters finds some good
> usage in ASCII-only communities, then the usual plea for help will be
> reversed. People will seek out e.g. Japanese programmers and request
> help, maybe in broken Japanese. From this point on, all programming
> communities will be on an equal footing and able to talk together on
> the same standpoint. I guess you know "Liberté Egalité Fraternité".
> Maybe this should be the PEP subtitle.
> 
>> what happens to the keyword "if" (just try it:-). You would have to translate
>> the identifiers one by one, which is not practical.
> 
> would be possible with the tokenizer actually :)
> 
> Droit comme un if !
> 
> A bientôt,
> 
> Guillaume

si tu me prends par les sentiments :-) Really, you make it sound so nice I would
 almost change my mind. Still wondering how much of an effort it will be, though.

Ciao,
Baptiste


From jimjjewett at gmail.com  Tue Jun 12 23:46:32 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 17:46:32 -0400
Subject: [Python-3000] external dependencies (PEP 3131)
Message-ID: <fb6fbf560706121446t2ad33e64j52cdc7e2b2382191@mail.gmail.com>

On 6/11/07, Michael Urman <murman at gmail.com> wrote:

> I can't agree with this. The predictability of needing only to
> duplicate dependencies (version of python, modules) to ensure a
> program that ran over there will run over here (and vice versa) is too
> important to me.

This provides almost an exact analogy for locally allowed additional
scripts (scripts=writing systems, not .py files).

Your cherished assumption may already be false, depending on what is
in sitecustomize.  sitecustomize is imported by site.py; it is not
overwritten by a new install; it can do arbitrary things.

In theory, you need to add sitecustomize to your list of dependencies.
 In practice, it almost never exists, let alone does anything.  But
you could use it if you wanted to...

By exact analogy:

In theory, you need to pre-authorize the characters used in your
identifiers, just as you have to set up your sitecustomize.

In practice, you generally won't need to change anything because ASCII
is enough.  (Or if not, there will be a standard distribution for your
language community that already adds what you actually use.)

But if your local community does start using Tengwar, you are free to
add that too.

This script-permission file can last across installations, just like
sitecustomize does.  And to be honest, from my perspective, a fine
spelling would be to just add it right to sitecustomize, perhaps as:

    __allow_id_chars__("Kanji")
    __allow_id_chars__("0x1043..0x1059")

__allow_id_chars__ should be restricted in Brett's security-conscious
build, but I think it is OK to expose it to normal python.  If a
strange file does

    __allow_id_chars__("0x1043")

up near the import block, that provides about the same level of
warning as use of "__builtin__", or "import *".

(That is, less warning than I would ideally prefer, but probably
enough to prevent *accidental* charset confusion.)
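
(For concreteness, a purely hypothetical sketch of the bookkeeping such
a hook might do -- neither the builtin nor the "0xLO..0xHI" spelling
exists anywhere today:)

    _allowed_id_chars = set()

    def __allow_id_chars__(spec):
        """Accept a script name ("Kanji") or a code point range."""
        if spec.startswith("0x"):
            lo, sep, hi = spec.partition("..")
            for cp in range(int(lo, 16), int(hi or lo, 16) + 1):
                _allowed_id_chars.add(cp)
        else:
            _allowed_id_chars.add(spec)  # resolve against script tables later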

-jJ

From jimjjewett at gmail.com  Wed Jun 13 00:09:17 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 18:09:17 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
	<466DBE10.1070804@v.loewis.de>
	<fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>
	<466DC90E.1070009@v.loewis.de>
	<fb6fbf560706111613o1f82354bx69ab9b1d9acb0169@mail.gmail.com>
	<87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <fb6fbf560706121509x5dcc9f07rf6c0f0e4d672be05@mail.gmail.com>

On 6/11/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Jim Jewett writes:

>  > Of course, I wouldn't type them if I knew they were wrong.  With an
>  > ASCII-only install, I would get that error-check because the
>  > (remaining original uses) were in Cyrillic.  With an "any unicode
>  > character" install, ... well, I might figure out my problem the next
>  > morning.

> But this is something that only a small subset of developers-of-Python
> seem to be concerned about.

(Almost) no one ever cares about typos (or fire escapes, for that
matter) in advance; if non-ASCII characters were common enough (in
your local environment) that people expected and recognized them, then
they wouldn't be a problem.

That is why I have no objection to using Japanese on systems configured for it.

That is also why I want even systems configured for Japanese to be
able to still get warnings about Latin-1 (beyond ASCII).

I figure the difference between í and i may be as subtle to them as
the difference between two of their letters that happen to look similar
to me, and they might appreciate the heads-up to look carefully.

> But I see no reason why that auditor program can't be run as a PEP 263
> codec.  AFAICS, the following objections could be raised, and answered:

This can of course be turned around.

The "codec does a bit more than you expect" option has been available
since 2.3 for people who want an expanded ID charset.  (Just
transliterate the extra characters into the moral equivalent of an
escape.)  It doesn't seem to have been used.

I'll freely agree that it hasn't been used in part because the
expanded charset is aimed largely at people not ready to use the
"write or at least install a codec that cheats" level of magic.  It is
also partly because the use of non-ASCII IDS is expected to stay small
in widely distributed code.

But the same facts argue against silently allowing unrecognized
characters; the use will be rare enough that people won't be expecting
it, and the level of magic required to write (or even know to install)
such a codec ... tends to come after someone has already found a
workaround for "strange characters".

> That doesn't mollify those who think I should not be allowed to use
> non-ASCII identifiers at all.

There is a subtle distinction there.  I am among those who think you
should not use non-ASCII identifiers *without an explicit
declaration.*

Putting that declaration at the top of the file itself would be fine.
(modulo possible security issues, such as the "coding" with a Cyrillic
c.)

-jJ

From showell30 at yahoo.com  Wed Jun 13 00:15:49 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 12 Jun 2007 15:15:49 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87tztde5v0.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <636530.83129.qm@web33503.mail.mud.yahoo.com>


--- "Stephen J. Turnbull" <turnbull at sk.tsukuba.ac.jp>
wrote:
> Ka-Ping Yee writes:
>  > Both of these come down to the wastefulness of
> redoing something
>  > that the Python interpreter itself already knows
> how to do very
>  > well, and is, in some sense by definition, the
> authority on how
>  > to do it correctly.
> 
> True.  However, Guido has already indicated that he
> favors some
> approach like this, as an external lint utility.  My
> question is how
> to minimize impact on users who desire flexible
> automatic auditing.
> 

I would like to comment on both points.

I am somebody who would use such an external lint
utility, even it was just out of idle curiosity about
the code I was importing (in other words, no fear
involved).

It seems like such a utility would need to be able to
do the following.

   1) The utility would need to tokenize my code.  It
seems like this could be done by the tokenize module
pretty easily, even under PEP 3131.  tokenize.py does
not tap into Python internals right now AFAIK, and I
don't think it would need to under Py3K.  (A rough
sketch follows at the end of this list.)

   2) The utility should triage my identifiers
according to their alphabet content.  In an ideal
world, since I'm not a Unicode expert, I would like it
somewhat simplifed for me -- e.g. the utility would
classify identifiers as ascii-only, totally mixed,
definitely Cyrillic, definitely French, German, mixed
Latin variant, Japanese, etc. To the extent that
Python knows how to classify strings on those general
levels, I would hope that those functions would be
exposed at the Python level.  

   But to the extent that CPython really shouldn't
care, I don't see a big problem with some third party
library implementing some kind of routine that can
deduce languages from Unicode spellings.  It's
basically a big dictionary, and maybe a small tree
structure, and something like a forty-line algorithm
(walk through the letters, look up their most specific
language, then with all the letters, walk up the tree
until you find the most specific
species/phylum/kingdom etc. of languages that
encompasses all letters). 

   3) The utility should be able to efficiently figure
out which files I want to inspect, by statically
walking the import structure.  To Ping's point, I
think this is one area where you lose something by
having to do this outside of the interpreter, but it
doesn't seem to be a terribly difficult problem to
solve.  (To the extent that Python can dynamically
import stuff at run-time, I'm willing to accept that
limitation in an external lint utility, since even if
CPython were doing the auditing for me, I'd still only
find out at runtime.)
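
   To make step 1 and the triage in step 2 concrete, here is a rough
sketch.  It assumes a PEP 3131-aware tokenize module (today's
tokenize.py only knows ASCII names), and it approximates "alphabet"
with the first word of each character's Unicode name, which is far
cruder than real script data; the function names are just improvised:

    import token
    import tokenize
    import unicodedata

    def rough_scripts(name):
        """Classify the characters of one identifier, very roughly."""
        scripts = set()
        for ch in name:
            if ord(ch) < 128:
                scripts.add('ASCII')
            else:
                # e.g. 'CYRILLIC', 'HIRAGANA', 'LATIN'
                scripts.add(unicodedata.name(ch, 'UNKNOWN').split()[0])
        return scripts

    def audit(filename):
        f = open(filename)
        try:
            for tok in tokenize.generate_tokens(f.readline):
                tok_type, tok_str, start = tok[0], tok[1], tok[2]
                if tok_type == token.NAME:
                    scripts = rough_scripts(tok_str)
                    if scripts != set(['ASCII']):
                        print('%s:%d: %r uses %s'
                              % (filename, start[0], tok_str,
                                 ', '.join(sorted(scripts))))
        finally:
            f.close()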







From showell30 at yahoo.com  Wed Jun 13 00:36:11 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 12 Jun 2007 15:36:11 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4n3i3$h9v$1@sea.gmane.org>
Message-ID: <402199.77500.qm@web33508.mail.mud.yahoo.com>


--- Baptiste Carvello <baptiste13 at altern.org> wrote:
> 
> si tu me prends par les sentiments :-) Really, you
> make it sound so nice I would
>  almost change my mind. Still wondering how much of
> an effort it will be, though.
> 

I would again make a call out for actual examples of
what Python code would look like under PEP 3131.  Then
people would not need to speculate on the effort
involved; they could try it out. 

In my best franglais: je pense que les avocats de PEP
3131 pourrait surmonter la doute, l'incertitude, le
crainte, etc., de PEP 3131 en montrant les exemples.




From jimjjewett at gmail.com  Wed Jun 13 03:31:10 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 21:31:10 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466E282E.5070603@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de> <f4kpu7$pv$2@sea.gmane.org>
	<466E282E.5070603@v.loewis.de>
Message-ID: <fb6fbf560706121831u52bf55f8i3ced895b2f2cfd@mail.gmail.com>

On 6/12/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> Ok, but why do you then need *Python* to tell you that the file has
> non-ASCII identifiers? Just look inside the file, and see whether
> you like its source code.

That is just what many users (including, in some environments, me)
cannot do *because* of the extended charset.

I can't see whether I have an ASCII o or a Cyrillic o, because
they look the same, even though they aren't.  If the whole thing is in
Cyrillic, I may notice; if only a few identifiers are, I probably
won't notice at least until I've already saved it (and possibly broken
it, depending on how unicode-unaware my editor is).

> I don't see why you need Python to *reject*
> identifiers outside ASCII - a warning would be surely enough to
> indicate to you that your policy was violated.

A warning would indeed be sufficient.

-jJ

From jimjjewett at gmail.com  Wed Jun 13 03:44:12 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 21:44:12 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <402199.77500.qm@web33508.mail.mud.yahoo.com>
References: <f4n3i3$h9v$1@sea.gmane.org>
	<402199.77500.qm@web33508.mail.mud.yahoo.com>
Message-ID: <fb6fbf560706121844g783622a4ma0e56310bd9be52a@mail.gmail.com>

On 6/12/07, Steve Howell <showell30 at yahoo.com> wrote:

> In my best franglais: je pense que les avocats de PEP
> 3131 pourrait surmonter la doute, l'incertitude, le
> crainte, etc., de PEP 3131 en montrant les exemples.

Not really;  I think everyone agrees that you *can* produce
well-written code with non-ASCII identifiers.

The concerns are with not-so-well-written code, or code that is mostly
ASCII with a few non-ASCII characters (or ids) thrown in around line
343.

-jJ

From showell30 at yahoo.com  Wed Jun 13 04:01:43 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 12 Jun 2007 19:01:43 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706121844g783622a4ma0e56310bd9be52a@mail.gmail.com>
Message-ID: <776344.63529.qm@web33512.mail.mud.yahoo.com>

--- Jim Jewett <jimjjewett at gmail.com> wrote:
> On 6/12/07, Steve Howell <showell30 at yahoo.com>
> wrote:
> 
> > In my best franglais: je pense que les avocats de
> PEP
> > 3131 pourrait surmonter la doute, l'incertitude,
> le
> > crainte, etc., de PEP 3131 en montrant les
> exemples.
> 
> Not really;  I think everyone agrees that you *can*
> produce
> well-written code with non-ASCII identifiers.
> 
> The concerns are with not-so-well written code, or
> code that is mostly
> ASCII with a few non-ASCII characters (or ids)
> thrown in around line
> 343.
> 

But then I would extend the same challenge to you.

Post a piece of code that follows the pattern you
predict, and then see whether an actual example of
not-so-well-written, non-ASCII-pure code resonates
with folks who aren't able to see the validity of
your points without an example in front of them.

I know that Ping has produced one example of deceptive
non-ASCII-pure code, but it didn't really sway me,
even though I've been basically sympathetic to his
overall conservatism about either keeping ASCII purity
or introducing Unicode only with some proper
safeguards.




 

From martin at v.loewis.de  Wed Jun 13 05:57:36 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Wed, 13 Jun 2007 05:57:36 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <f4n1gn$9o6$1@sea.gmane.org>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>	<466C5730.3060003@v.loewis.de>	<Pine.LNX.4.58.0706110157180.7196@server1.LFW.org>	<dcbbbb410706110629x2f5f6a08j518ecd37518ad810@mail.gmail.com>	<f4kpoa$pv$1@sea.gmane.org>	<dcbbbb410706111956s2f4ab9d3v2efaca6f59f31089@mail.gmail.com>
	<f4n1gn$9o6$1@sea.gmane.org>
Message-ID: <466F6B30.3070308@v.loewis.de>

>> As I am not going to be interested in trying to
>> understand code written in Chinese, Russian, etc., I'm not bothered by
>> the idea that someone might write code I will have a strong
>> disincentive to read. 
>>
> The question is: is it worth it? Will the new feature allow more useful code to
> be written, or will it cause unnecessary duplication of effort? Probably both,
> but I cannot tell in which proportions, and neither can you, I guess.

This question has already been decided; PEP 3131 is accepted. So please
stop questioning it fundamentally.

Regards,
Martin

From turnbull at sk.tsukuba.ac.jp  Wed Jun 13 06:28:23 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Wed, 13 Jun 2007 13:28:23 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<fb6fbf560706061738s29e92ca2vfdbe867c8d013160@mail.gmail.com>
	<ca471dc20706061747r594daa58w2df9551ae726fb08@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
Message-ID: <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > In my mind everything in a Python program is within a single
 > Unicode process,

Which is a *serious* mistake.  It is *precisely* the mistake that
leads to mixing UTF-16 and UCS-2 interpretations in the standard
library.  What you are saying is that if you write a 10-line script
that claims Unicode conformance, you are responsible for the Unicode-
correctness of all modules you call implicitly as well as that of the
Python interpreter.

This is what I mean by "Unicode conformance is not a goal of the
language."

Now, it's really not so bad.  If you look at what MAL and MvL are
doing (inter alia, it's their work I'm most familiar with), what you
will see is that they are gradually implementing conformant modules
here and there.  Eg, I am sure it is not MvL's laziness or inability
to come up with a reasonable spec himself that causes PEP 3131 to be a
profile of UAX #31.

 > Actually, I said that there's no way to always do the right thing as long
 > as they are mixed, but that was a too theoretical argument. Practically
 > speaking, there's little need to interpret surrogate pairs as two
 > code points instead of as one non-BMP code point.

Again, a mistake.  In the standard library, the question is not "do I
need this?", but "what happens if somebody else does it?"  They may
receive the same answer, but then again they may not.

For example, suppose you have a supplier-consumer pair sharing a
fixed-length buffer of 2-octet code units.  If it should happen that
the supplier uses the UCS-2 interpretation, then a surrogate pair may
get split when the buffer is full.  Will a UTF-16 consumer be prepared
for this?  Almost surely some will not, because that would imply
maintaining an internal buffer, which is stupidly inefficient if you
have an external buffer protocol.

Note that a UTF-16 supplier feeding a UCS-2 consumer will have no
problems (unless the UCS-2 consumer can't handle "short reads", but
that's unlikely), and if you have a chain starting with a UTF-16
source, then none of the downstream UTF-16 processes have a problem.
The problem is, suppose somehow you get a UCS-2 source?  Whose
responsibility is it to detect that?

 > Java and C# (and thus Jython and IronPython too) also sometimes use
 > UCS-2, sometimes UTF-16. As long as it works as you expect, there
 > isn't a problem, really.

That depends on how big a penalty you face if you break a promise of
conformance to your client.  Death, taxes, and Murphy's Law are
inescapable.

 > On UCS-4 builds of CPython it's the same (either UCS-4 or UTF-32 with the
 > extension that surrogates work as in UTF-16), but you get the extra
 > complication that some equal strings don't compare equal, e.g.
 > u'\U00010000' != u'\ud800\udc00'. Even that doesn't cause problems in
 > practice, because you shouldn't have strings like u'\ud800\udc00' in the
 > first place.

But the Unicode standard itself gives (the equivalent of) u'\ud800' +
u'\udc00' as an example of the kind of thing you *should be able to
do*.  Because, you know, clients of the standard library *will* be
doing half-witted[1] things like that.


Footnotes: 
[1]  What I wanted to say was ????????? <wink>


From stephen at xemacs.org  Wed Jun 13 06:43:33 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 13 Jun 2007 13:43:33 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <fb6fbf560706121509x5dcc9f07rf6c0f0e4d672be05@mail.gmail.com>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de>
	<20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de>
	<fb6fbf560706110637w35ae1991sa16d777b7c6c49c3@mail.gmail.com>
	<466DBE10.1070804@v.loewis.de>
	<fb6fbf560706111455w64f3d33ame24bea3b318e2f8@mail.gmail.com>
	<466DC90E.1070009@v.loewis.de>
	<fb6fbf560706111613o1f82354bx69ab9b1d9acb0169@mail.gmail.com>
	<87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
	<fb6fbf560706121509x5dcc9f07rf6c0f0e4d672be05@mail.gmail.com>
Message-ID: <87ps40ea6i.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > On 6/11/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:

 > > But this is something that only a small subset of developers-of-Python
 > > seem to be concerned about.

This is a statement about the politics of changing an accepted PEP.
Without massive outcry, ain' agonna happ'm, Cap'n.

Remember, I've been in your camp w.r.t. "Python should provide
auditing" throughout.  If we can't get it in the language, I'm looking
for an *existing mechanism*.

 > The "codec does a bit more than you expect" option has been available
 > since 2.3 for people who want an expanded ID charset.  (Just
 > transliterate the extra characters into the moral equivalent of an
 > escape.)  It doesn't seem to have been used.

I would have done it immediately after PEP 263 if I had known how to
implement codecs.  It doesn't surprise me nobody else has done it.
This concept is probably sufficiently unobvious to meet the USPTO
criteria for patentability.<wink>
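
The core transformation itself is simple enough, though (a sketch
only; the _uXXXX_ escape scheme is invented here, and a real codec
would also have to leave string literals and comments alone):

    # Sketch: rewrite non-ASCII source characters as ASCII
    # identifier fragments before the compiler sees them.
    def transliterate(source):
        out = []
        for ch in source:
            if ord(ch) < 128:
                out.append(ch)
            else:
                out.append('_u%04x_' % ord(ch))
        return ''.join(out)

The fiddly part is wrapping that in a codecs.Codec and registering
it, which is exactly the part nobody has written.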

 > > That doesn't mollify those who think I should not be allowed to use
 > > non-ASCII identifiers at all.
 > 
 > There is a subtle distinction there.  I am among those who think you
 > should not use non-ASCII identifiers *without an explicit
 > declaration.*

So you're saying that you want to impose this check on all
Python *users* (not the developers; you can refuse the developers'
code yourself).  Fine, if you can get that past Guido and Martin.

All I'm trying to do here is find a way that *you* and *Josiah* can
get what you want in *your* installations, with existing mechanisms.
If you want to make it a default, discuss that with Guido and Martin;
that requires modifying the PEP.


From rauli.ruohonen at gmail.com  Wed Jun 13 12:24:33 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Wed, 13 Jun 2007 13:24:33 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>

On 6/13/07, Stephen J. Turnbull <turnbull at sk.tsukuba.ac.jp> wrote:
> What you are saying is that if you write a 10-line script that claims
> Unicode conformance, you are responsible for the Unicode-correctness of
> all modules you call implicitly as well as that of the Python interpreter.

If text files are by default normalized when read in, with noncharacters
stripped, where will you get problems in practice? A higher-level string
type may be useful, but there's no single obvious design.

>  > Practically speaking, there's little need to interpret surrogate pairs
>  > as two code points instead of as one non-BMP code point.
>
> Again, a mistake.  In the standard library, the question is not "do I
> need this?", but "what happens if somebody else does it?"  They may
> receive the same answer, but then again they may not.

What I meant is that the stdlib should only have string operations that
effectively work on (1) sequences of code units or (2) sequences of code
points, and that the choice between these two should be made reasonably.

One way to check whether a choice is reasonable is to consider what it
would mean for UTF-8, as there the difference between code units (0...ff)
and code points (0...10ffff) is the easiest to see. E.g. normalization
doesn't make any sense on code units, but slicing does.
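
Concretely, with a UTF-8 byte string (2.x literals):

    >>> s = u'\u00e9tat'.encode('utf-8')  # 4 code points, 5 code units
    >>> len(s)
    5
    >>> s[:1]   # code-unit slicing can split a code point in two
    '\xc3'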

Once you have determined that the reasonable choice is code points for some
operation in general, then you shouldn't use the UCS-2 interpretation for
16-bit strings in particular, because it muddies the underlying rule,
and Unicode is clear as mud without extra muddying already :-)

> For example, suppose you have a supplier-consumer pair sharing a
> fixed-length buffer of 2-octet code units.  If it should happen that
> the supplier uses the UCS-2 interpretation, then a surrogate pair may
> get split when the buffer is full.

I.e. you have a supplier that works on code units. If you document this,
then there's no problem, especially if that's what the user expects.

> Will a UTF-16 consumer be prepared for this?

This also needs to be documented, especially if it isn't. The consumer is
more useful if it is prepared for it. I've been excavating some Cambrian
period discussions on the topic recently, and this brings one post to mind:
http://mail.python.org/pipermail/i18n-sig/2001-June/001010.html

> Almost surely some will not, because that would imply maintaining an
> internal buffer, which is stupidly inefficient if you have an external
> buffer protocol.

You only need to buffer one code unit at most, it's not inefficient.
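
Something like this one-unit holdback (a sketch; it assumes chunks
arrive as unicode strings of UTF-16 code units on a narrow build, and
process() is whatever the consumer does with well-formed runs):

    # Sketch: hold back a lone trailing high surrogate until the
    # next chunk arrives, so downstream only sees matched pairs.
    def make_consumer(process):
        pending = [u'']
        def consume(chunk):
            units = pending[0] + chunk
            pending[0] = u''
            if units and u'\ud800' <= units[-1] <= u'\udbff':
                pending[0] = units[-1]   # wait for the low surrogate
                units = units[:-1]
            process(units)
        return consume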

> The problem is, suppose somehow you get a UCS-2 source?  Whose
> responsibility is it to detect that?

The user should check the API documentation. If the documentation is
missing, then you have to test or UTSL it (testing is good to do anyway).
If the documentation is wrong, then it's a bug.

> But the Unicode standard itself gives (the equivalent of) u'\ud800' +
> u'\udc00' as an example of the kind of thing you *should be able to
> do*.  Because, you know, clients of the standard library *will* be
> doing half-witted[1] things like that.

For UTF-16, yes, but for UTF-32, no. Any surrogate code units make
UTF-32 ill-formed, so there's no need to use them to make UTF-32 strings.
In UTF-16 surrogate pairs are allowed, and allowing isolated surrogates
makes some operations simpler. Kind of like negative integers make
calculations simpler, even if the end result is always non-negative.

Python itself has both UTF-16 and UTF-32 behavior on UCS-4 builds, but
that's an original invention probably intended to make code written for
UTF-16 work unchanged on UCS-4 builds, following the rule "be lenient
in what you accept and strict in what you emit".
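
You can see that leniency in the UTF-8 codec, which (if I read it
right) joins a matched pair even on a UCS-4 build:

    >>> u'\ud800\udc00'.encode('utf-8')   # matched pair, one code point
    '\xf0\x90\x80\x80'
    >>> u'\U00010000'.encode('utf-8')
    '\xf0\x90\x80\x80'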

> Footnotes:
> [1]  What I wanted to say was ????????? <wink>

?????????

From stephen at xemacs.org  Wed Jun 13 21:05:35 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 14 Jun 2007 04:05:35 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
Message-ID: <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > What I meant is that the stdlib should only have string operations
 > that effectively work on (1) sequences of code units or (2)
 > sequences of code points, and that the choice between these two
 > should be made reasonably.

I think we've reached a dead end.  AIUI, that's a matter for a PEP,
and the window for Python 3 is closed.  I'm pretty sure that Python 3
is going to have sequences of code units only (I know, Guido said
"code points", but I doubt he's read TR#17), except that people will
sneak in some UTF-16 behavior where it seems useful.

Until one or more of the senior developers says otherwise, I'm going
to assume that.

From guido at python.org  Wed Jun 13 22:03:39 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 13 Jun 2007 13:03:39 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>

On 6/13/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Rauli Ruohonen writes:
>
>  > What I meant is that the stdlib should only have string operations
>  > that effectively work on (1) sequences of code units or (2)
>  > sequences of code points, and that the choice between these two
>  > should be made reasonably.
>
> I think we've reached a dead end.  AIUI, that's a matter for a PEP,
> and the window for Python 3 is closed.  I'm pretty sure that Python 3
> is going to have sequences of code units only (I know, Guido said
> "code points", but I doubt he's read TR#17), except that people will
> sneak in some UTF-16 behavior where it seems useful.
>
> Until one or more of the senior developers says otherwise, I'm going
> to assume that.

Yeah, what's the difference between code units and points?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Wed Jun 13 22:30:09 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Jun 2007 22:30:09 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>	<ca471dc20706071010xee448f2m66963141747faa4@mail.gmail.com>	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <467053D1.7050107@v.loewis.de>

> I think we've reached a dead end.  AIUI, that's a matter for a PEP,
> and the window for Python 3 is closed.  I'm pretty sure that Python 3
> is going to have sequences of code units only (I know, Guido said
> "code points", but I doubt he's read TR#17), except that people will
> sneak in some UTF-16 behavior where it seems useful.
> 
> Until one or more of the senior developers says otherwise, I'm going
> to assume that.

I think it is *very* likely that Python 3 will work that way. There
isn't anything that even remotely looks like an implementation of
an alternative.

Regards,
Martin

From martin at v.loewis.de  Wed Jun 13 22:37:45 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Jun 2007 22:37:45 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
Message-ID: <46705599.8090301@v.loewis.de>

>> Until one or more of the senior developers says otherwise, I'm going
>> to assume that.
> 
> Yeah, what's the difference between code units and points?

A code unit is the atomic base in some encoding. It is a single byte
in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit
quantity in UTF-32).

A code point is something that has a 1:1 relationship with a logical
character (in particular, a Unicode character).

In UCS-2, a code point can be represented in 16 bits, and you can
represent all BMP characters. The low and high surrogates don't
encode characters and are reserved.

To go beyond the BMP, you need more than 16 bits to represent a
code point.  UCS-4 does that directly; alternatively, you might
use UTF-16, where you can use a single code unit for all BMP
characters, and two of them for code points above U+FFFF.

Ever since PEP 261, Python admits that the elements of a Unicode
string are code units, and that you might need more than one of
them (specifically, for non-BMP characters in a narrow build)
to represent a code point.
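
Concretely (2.x literals; the same expression on the two builds):

    # On a narrow (UTF-16) build:
    >>> len(u'\U00010000')
    2
    # On a wide (UCS-4) build:
    >>> len(u'\U00010000')
    1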

Regards,
Martin

From guido at python.org  Wed Jun 13 23:05:21 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 13 Jun 2007 14:05:21 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <46705599.8090301@v.loewis.de>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
	<46705599.8090301@v.loewis.de>
Message-ID: <ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>

On 6/13/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> Until one or more of the senior developers says otherwise, I'm going
> >> to assume that.
> >
> > Yeah, what's the difference between code units and points?
>
> A code unit is the atomic base in some encoding. It is a single byte
> in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit
> quantity in UTF-32).
>
> A code point is something that has a 1:1 relationship with a logical
> character (in particular, a Unicode character).
>
> In UCS-2, a code point can be represented in 16 bits, and you can
> represent all BMP characters. The low and high surrogates don't
> encode characters and are reserved.
>
> To go beyond the BMP, you need more than 16 bits to represent a
> code point.  UCS-4 does that directly; alternatively, you might
> use UTF-16, where you can use a single code unit for all BMP
> characters, and two of them for code points above U+FFFF.
>
> Ever since PEP 261, Python admits that the elements of a Unicode
> string are code units, and that you might need more than one of
> them (specifically, for non-BMP characters in a narrow build)
> to represent a code point.

Thanks for clearing that up. It sounds like we really use code units,
not code points (except when building with the 4-byte Unicode option,
when they are equivalent). Is there anywhere where we use code points,
apart from the UTF-8 codecs, which encode properly matched surrogate
pairs as a single code point?

Is it correct to say that a surrogate in UTF-16 is two code units
representing a single code point?

Apart from the surrogates, are there code points that aren't
characters? Are there characters that don't have a representation as a
single code point? (I know some characters have multiple
representations, some of which use multiple code points.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Wed Jun 13 23:53:50 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 13 Jun 2007 14:53:50 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <466E4B22.6020408@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<4667CCB2.6040405@ronadam.com>
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>
	<4668D535.7020103@v.loewis.de>
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>
	<466E4B22.6020408@ronadam.com>
Message-ID: <ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>

I couldn't get this exact patch to apply, but I implemented something
equivalent in the py3kstruni branch. See revisions 55964 and 55965.
Thanks for the suggestion!

--Guido

On 6/12/07, Ron Adam <rrr at ronadam.com> wrote:
> Guido van Rossum wrote:
> > On 6/7/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> >> The os.environ.get() method probably should return a unicode
> >> string. (?)
> >> >
> >> > Indeed -- care to contribute a patch?
> >>
> >> Ideally, such a patch would make use of the Win32 Unicode API for
> >> environment variables on Windows. People had already been complaining
> >> that they can't have "funny characters" in the value of an environment
> >> variable, even though the UI allows them to set the variable just fine.
> >
> > Yeah, but the Windows build of py3k is currently badly broken (e.g.
> > the _fileio.c extension probably doesn't work at all) -- and I don't
> > have access to a Windows box to work on it. I'm afraid 3.0a1 will be
> > released without Windows support. Of course I'm counting on others to
> > fix that before 3.0 final is released.
> >
> > I don't mind for now that the posix.environ variable contains 8-bit
> > strings -- people shouldn't be importing that anyway.
>
>
> Here's a diff of the patch.  It looks like this may be backported to 2.6
> since it isn't Unicode specific but casts to the current str type.
>
>
>
> Cast environ keys and values to current python str type in os.py
> Added test for environ string types to test_os.py
> Fixed test_update2, bug 1110478 test, that was being skipped.
>
> Test test_tmpfile in test_os.py fails.  Haven't looked into it yet.
>
>
> Index: Lib/os.py
> ===================================================================
> --- Lib/os.py   (revision 55924)
> +++ Lib/os.py   (working copy)
> @@ -505,7 +505,8 @@
>               def copy(self):
>                   return dict(self)
>
> -
> +    # Make sure all environment keys and values are correct str type.
> +    environ = dict([(str(k), str(v)) for k, v in environ.items()])
>       environ = _Environ(environ)
>
>   def getenv(key, default=None):
> Index: Lib/test/test_os.py
> ===================================================================
> --- Lib/test/test_os.py (revision 55924)
> +++ Lib/test/test_os.py (working copy)
> @@ -266,12 +266,25 @@
>           os.environ.clear()
>           os.environ.update(self.__save)
>
> +class EnvironTests2(unittest.TestCase):
> +    """Test os.environ for specific problems."""
> +    def setUp(self):
> +        self.__save = dict(os.environ)
> +    def tearDown(self):
> +        os.environ.clear()
> +        os.environ.update(self.__save)
>       # Bug 1110478
>       def test_update2(self):
>           if os.path.exists("/bin/sh"):
>               os.environ.update(HELLO="World")
>               value = os.popen("/bin/sh -c 'echo $HELLO'").read().strip()
>               self.assertEquals(value, "World")
> +    # Verify environ keys and values from the OS are of the
> +    # correct str type.
> +    def test_keyvalue_types(self):
> +        for key, val in os.environ.items():
> +            self.assertEquals(type(key), str)
> +            self.assertEquals(type(val), str)
>
>   class WalkTests(unittest.TestCase):
>       """Tests for os.walk()."""
> @@ -466,6 +479,7 @@
>           TemporaryFileTests,
>           StatAttributeTests,
>           EnvironTests,
> +        EnvironTests2,
>           WalkTests,
>           MakedirTests,
>           DevNullTests,
>
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Thu Jun 14 00:18:25 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Thu, 14 Jun 2007 00:18:25 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>	
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>	
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>	
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>	
	<46705599.8090301@v.loewis.de>
	<ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
Message-ID: <46706D31.6030001@v.loewis.de>

> Thanks for clearing that up. It sounds like we really use code units,
> not code points (except when building with the 4-byte Unicode option,
> when they are equivalent). Is there anywhere where we use code points,
> apart from the UTF-8 codecs, which encode properly matched surrogate
> pairs as a single code point?

The literal syntax also supports it: \U00010000 is supported even
in a narrow build, and gets transparently encoded to the corresponding
two code units; likewise for repr(). There is an SF patch to make
unicodedata.lookup support them also.
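
E.g., on a narrow build:

    >>> list(u'\U00010000')   # transparently two code units
    [u'\ud800', u'\udc00']
    >>> u'\ud800\udc00'       # and repr() re-joins the pair
    u'\U00010000'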

> Is it correct to say that a surrogate in UTF-16 is two code units
> representing a single code point?

That's my understanding, yes.

> Apart from the surrogates, are there code points that aren't
> characters? Are there characters that don't have a representation as a
> single code point? (I know some characters have multiple
> representations, some of which use multiple code points.)

[assuming you mean "code unit" again]
Not in the Unicode type, no. In the byte string type, this happens
all the time with multi-byte encodings.

[assuming you really mean "code point" in the first question]
There are numerous unassigned code points in Unicode, i.e. they
don't represent a character *yet*. There are also several code
points that are "noncharacters", in particular U+FFFE and
U+FFFF. These are permanently reserved and should never be
interpreted as abstract characters (rule C5). FFFE is reserved
because it is the byte-swapped BOM; I believe FFFF is reserved
so that APIs can use -1 as an error value. (FWIW, U+FFFD *is*
assigned and means "REPLACEMENT CHARACTER", �).

As for "combining characters": I think the Unicode terminology
really is that they are separate characters. They get combined
into a single grapheme, and different character sequences might
be considered as equivalent under canonical forms - but the
decomposed ö (o + combining diaeresis) actually is understood
as a two-character (i.e. two-codepoint) sequence.
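
For instance:

    >>> import unicodedata
    >>> unicodedata.normalize('NFD', u'\u00f6')   # decompose o-umlaut
    u'o\u0308'
    >>> len(_)
    2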

Whether that matches the intuitive definition of "character",
I don't know - and I'm sure somebody will correct me if I
presented it incorrectly.

Regards,
Martin

From jimjjewett at gmail.com  Thu Jun 14 00:23:24 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 13 Jun 2007 18:23:24 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
	<46705599.8090301@v.loewis.de>
	<ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
Message-ID: <fb6fbf560706131523t5c32239ei75cf638ff10321aa@mail.gmail.com>

On 6/13/07, Guido van Rossum <guido at python.org> wrote:
> On 6/13/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).

and

> > A code unit is the atomic base in some encoding. It is a single byte
> > in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit
> > quantity in UTF-32).
...

> Is it correct to say that a surrogate in UTF-16 is two code units
> representing a single code point?

Basically, assuming you meant both halves of the surrogate pair put
together.  "A" surrogate often refers to only one of them.

> Apart from the surrogates, are there code points that aren't
> characters?

Yes.  The BOM, for one.  Plenty of other code points are reserved
for private use, or not yet assigned, or never will be assigned.
There are also some that are explicitly not characters
(U+FDD0..U+FDEF), and some that might be debatable (unprinted control
characters, or U+FFFC: OBJECT REPLACEMENT CHARACTER).

> Are there characters that don't have a representation as a
> single code point? (I know some characters have multiple
> representations, some of which use multiple code points.)

There are plenty of (mostly archaic?) characters which don't (yet?)
have an assigned unicode code point.

There are also plenty of things that a native speaker may view as a
single character, but which unicode treats as (at most) a Named
Sequence.

-jJ

From rrr at ronadam.com  Thu Jun 14 01:49:26 2007
From: rrr at ronadam.com (Ron Adam)
Date: Wed, 13 Jun 2007 18:49:26 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<4667CCB2.6040405@ronadam.com>	
	<acd65fa20706070850v32de4a75q9c160261e2df2692@mail.gmail.com>	
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>	
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>	
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>
Message-ID: <46708286.6090201@ronadam.com>



Guido van Rossum wrote:
> I couldn't get this exact patch to apply, but I implemented something
> equivalent in the py3kstruni branch. See revisions 55964 and 55965.
> Thanks for the suggestion!

This is actually closer to how I started to do it, but I wasn't sure if it 
would catch everything.  Looking at it again, it looks good with the 
exception of riscos.  (The added test should catch that if it's a problem 
so it can be fixed later.)

The reason I made a new test case for added tests is that the existing test 
case based on mapping_tests.BasicTestMappingProtocol doesn't run the added 
test methods.  So I put those under a new test case based on unittest.TestCase.

I can't re-verify this currently because the latest merge broke something 
in my build process.  I'm getting a "lost stderr" message.  I've seen it 
before so it's probably something on my end.  I think the last time this 
happened to me I was able to clear it up by deleting the branch and 
re-updating it.



Another suggestion is to make a change in stringobject.c to represent 8-bit
strings as "str8('somes_tring')" or just s"some_string" so they can
more easily be distinguished from unicode strings, particularly in the tests.
This will force a few more tests to fail, but they are things that need to
be fixed.  Only about 3 or 4 additional modules fail when I tried it.

I was getting failed expect/got test cases that looked exactly the same. 
But after changing the str8 representation those became obvious str8 vs
unicode comparisons.

Using the shorter 's"string"' form will cause syntax errors in places
where eval or exec are using str8.  Which may also be helpful.


BTW,  I will make a new remove_raw_escapes patch so it applies cleanly. 
I'm trying to track down why my patched version of test_tokenize.py passes 
sometimes but not at others.  (I think it's either a tempfile or string io 
issue, or both.)  This was what initiated the above suggestion.

;-)


Cheers,
    Ron


> --Guido
> 
> On 6/12/07, Ron Adam <rrr at ronadam.com> wrote:
>> Guido van Rossum wrote:
>> > On 6/7/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> >> >> The os.environ.get() method probably should return a unicode
>> >> string. (?)
>> >> >
>> >> > Indeed -- care to contribute a patch?
>> >>
>> >> Ideally, such a patch would make use of the Win32 Unicode API for
>> >> environment variables on Windows. People had already been complaining
>> >> that they can't have "funny characters" in the value of an environment
>> >> variable, even though the UI allows them to set the variable just 
>> fine.
>> >
>> > Yeah, but the Windows build of py3k is currently badly broken (e.g.
>> > the _fileio.c extension probably doesn't work at all) -- and I don't
>> > have access to a Windows box to work on it. I'm afraid 3.0a1 will be
>> > released without Windows support. Of course I'm counting on others to
>> > fix that before 3.0 final is released.
>> >
>> > I don't mind for now that the posix.environ variable contains 8-bit
>> > strings -- people shouldn't be importing that anyway.
>>
>>
>> Here's a diff of the patch.  It looks like this may be backported to 2.6
>> since it isn't Unicode specific but casts to the current str type.
>>
>>
>>
>> Cast environ keys and values to current python str type in os.py
>> Added test for environ string types to test_os.py
>> Fixed test_update2, bug 1110478 test, that was being skipped.
>>
>> Test test_tmpfile in test_os.py fails.  Haven't looked into it yet.
>>
>>
>> Index: Lib/os.py
>> ===================================================================
>> --- Lib/os.py   (revision 55924)
>> +++ Lib/os.py   (working copy)
>> @@ -505,7 +505,8 @@
>>               def copy(self):
>>                   return dict(self)
>>
>> -
>> +    # Make sure all environment keys and values are correct str type.
>> +    environ = dict([(str(k), str(v)) for k, v in environ.items()])
>>       environ = _Environ(environ)
>>
>>   def getenv(key, default=None):
>> Index: Lib/test/test_os.py
>> ===================================================================
>> --- Lib/test/test_os.py (revision 55924)
>> +++ Lib/test/test_os.py (working copy)
>> @@ -266,12 +266,25 @@
>>           os.environ.clear()
>>           os.environ.update(self.__save)
>>
>> +class EnvironTests2(unittest.TestCase):
>> +    """Test os.environ for specific problems."""
>> +    def setUp(self):
>> +        self.__save = dict(os.environ)
>> +    def tearDown(self):
>> +        os.environ.clear()
>> +        os.environ.update(self.__save)
>>       # Bug 1110478
>>       def test_update2(self):
>>           if os.path.exists("/bin/sh"):
>>               os.environ.update(HELLO="World")
>>               value = os.popen("/bin/sh -c 'echo $HELLO'").read().strip()
>>               self.assertEquals(value, "World")
>> +    # Verify environ keys and values from the OS are of the
>> +    # correct str type.
>> +    def test_keyvalue_types(self):
>> +        for key, val in os.environ.items():
>> +            self.assertEquals(type(key), str)
>> +            self.assertEquals(type(val), str)
>>
>>   class WalkTests(unittest.TestCase):
>>       """Tests for os.walk()."""
>> @@ -466,6 +479,7 @@
>>           TemporaryFileTests,
>>           StatAttributeTests,
>>           EnvironTests,
>> +        EnvironTests2,
>>           WalkTests,
>>           MakedirTests,
>>           DevNullTests,
>>
>>
>>
> 
> 

From guido at python.org  Thu Jun 14 01:56:39 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 13 Jun 2007 16:56:39 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46708286.6090201@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>
	<4668D535.7020103@v.loewis.de>
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>
	<466E4B22.6020408@ronadam.com>
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>
	<46708286.6090201@ronadam.com>
Message-ID: <ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>

On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
>
>
> Guido van Rossum wrote:
> > I couldn't get this exact patch to apply, but I implemented something
> > equivalent in the py3kstruni branch. See revisions 55964 and 55965.
> > Thanks for the suggestion!
>
> This is actually closer to how I started to do it, but I wasn't sure if it
> would catch everything.  Looking at it again, it looks good with the
> exception of riscos.  (The added test should catch that if it's a problem
> so it can be fixed later.)

If riscos is even still supported. ;-(

> The reason I made a new test case for added tests is that the existing test
> case based on mapping_tests.BasicTestMappingProtocol doesn't run the added
> test methods.  So I put those under a new test case based on unittest.TestCase.

I don't understand this. The test_keyvalue_types() test *does* run,
regardless of whether I use regrtest.py test_os or test_os.py.

> I can't re-verify this currently because the latest merge broke something
> in my build process.  I'm getting a "lost stderr" message.  I've seen it
> before so it's probably something on my end.  I think the last time this
> happened to me I was able to clear it up by deleting the branch and
> re-updating it.

Your best bet is to remove all .pyc files under Lib: rm `find Lib -name \*.pyc`
(make clean also works)

> Another suggestion is to make a change in stringobject.c to represent 8-bit
> strings as "str8('somes_tring')" or just s"some_string" so they can
> more easily be distinguished from unicode strings, particularly in the tests.
> This will force a few more tests to fail, but they are things that need to
> be fixed.  Only about 3 or 4 additional modules fail when I tried it.

I've considered this, but then we should also support that notation on
input. I've also thought of using different string quote conventions,
e.g. "..." to mean Unicode and '...' to mean 8-bit.

> I was getting failed expect/got test cases that looked exactly the same.
> But after changing the str8 representation those became obvious str8 vs
> unicode comparisons.

Right.

> Using the shorter 's"string"' form will cause syntax errors in places
> where eval or exec are using str8.  Which may also be helpful.

Why would this help?

> BTW,  I will make a new remove_raw_escapes patch so it applies cleanly.
> I'm trying to track down why my patched version of test_tokenize.py passes
> sometimes but not at others.  (I think it's either a tempfile or string io
> issue, or both.)  This was what initiated the above suggestion.

Please send it as a proper attachment; somehow gmail doesn't make it
easy to extract patches pasted directly into the text (nor "inline"
attachments).

> ;-)
>
>
> Cheers,
>     Ron
>
>
> > --Guido
> >
> > On 6/12/07, Ron Adam <rrr at ronadam.com> wrote:
> >> Guido van Rossum wrote:
> >> > On 6/7/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> >> >> The os.environ.get() method probably should return a unicode
> >> >> string. (?)
> >> >> >
> >> >> > Indeed -- care to contribute a patch?
> >> >>
> >> >> Ideally, such a patch would make use of the Win32 Unicode API for
> >> >> environment variables on Windows. People had already been complaining
> >> >> that they can't have "funny characters" in the value of an environment
> >> >> variable, even though the UI allows them to set the variable just
> >> fine.
> >> >
> >> > Yeah, but the Windows build of py3k is currently badly broken (e.g.
> >> > the _fileio.c extension probably doesn't work at all) -- and I don't
> >> > have access to a Windows box to work on it. I'm afraid 3.0a1 will be
> >> > released without Windows support. Of course I'm counting on others to
> >> > fix that before 3.0 final is released.
> >> >
> >> > I don't mind for now that the posix.environ variable contains 8-bit
> >> > strings -- people shouldn't be importing that anyway.
> >>
> >>
> >> Here's a diff of the patch.  It looks like this may be backported to 2.6
> >> since it isn't Unicode specific but casts to the current str type.
> >>
> >>
> >>
> >> Cast environ keys and values to current python str type in os.py
> >> Added test for environ string types to test_os.py
> >> Fixed test_update2, bug 1110478 test, that was being skipped.
> >>
> >> Test test_tmpfile in test_os.py fails.  Haven't looked into it yet.
> >>
> >>
> >> Index: Lib/os.py
> >> ===================================================================
> >> --- Lib/os.py   (revision 55924)
> >> +++ Lib/os.py   (working copy)
> >> @@ -505,7 +505,8 @@
> >>               def copy(self):
> >>                   return dict(self)
> >>
> >> -
> >> +    # Make sure all environment keys and values are correct str type.
> >> +    environ = dict([(str(k), str(v)) for k, v in environ.items()])
> >>       environ = _Environ(environ)
> >>
> >>   def getenv(key, default=None):
> >> Index: Lib/test/test_os.py
> >> ===================================================================
> >> --- Lib/test/test_os.py (revision 55924)
> >> +++ Lib/test/test_os.py (working copy)
> >> @@ -266,12 +266,25 @@
> >>           os.environ.clear()
> >>           os.environ.update(self.__save)
> >>
> >> +class EnvironTests2(unittest.TestCase):
> >> +    """Test os.environ for specific problems."""
> >> +    def setUp(self):
> >> +        self.__save = dict(os.environ)
> >> +    def tearDown(self):
> >> +        os.environ.clear()
> >> +        os.environ.update(self.__save)
> >>       # Bug 1110478
> >>       def test_update2(self):
> >>           if os.path.exists("/bin/sh"):
> >>               os.environ.update(HELLO="World")
> >>               value = os.popen("/bin/sh -c 'echo $HELLO'").read().strip()
> >>               self.assertEquals(value, "World")
> >> +    # Verify environ keys and values from the OS are of the
> >> +    # correct str type.
> >> +    def test_keyvalue_types(self):
> >> +        for key, val in os.environ.items():
> >> +            self.assertEquals(type(key), str)
> >> +            self.assertEquals(type(val), str)
> >>
> >>   class WalkTests(unittest.TestCase):
> >>       """Tests for os.walk()."""
> >> @@ -466,6 +479,7 @@
> >>           TemporaryFileTests,
> >>           StatAttributeTests,
> >>           EnvironTests,
> >> +        EnvironTests2,
> >>           WalkTests,
> >>           MakedirTests,
> >>           DevNullTests,
> >>
> >>
> >>
> >
> >
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rasky at develer.com  Tue Jun 12 18:40:01 2007
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 12 Jun 2007 18:40:01 +0200
Subject: [Python-3000] Pre-PEP on fast imports
In-Reply-To: <20070612162845.5F00A3A407F@sparrow.telecommunity.com>
References: <466DD0D8.7040407@develer.com>
	<20070611231640.8A55E3A407F@sparrow.telecommunity.com>
	<20070612162845.5F00A3A407F@sparrow.telecommunity.com>
Message-ID: <466ECC61.7070505@develer.com>

On 6/12/2007 6:30 PM, Phillip J. Eby wrote:

>>      import imp, os, sys
>>      from pkgutil import ImpImporter
>>
>>      suffixes = set(ext for ext,mode,typ in imp.get_suffixes())
>>
>>      class CachedImporter(ImpImporter):
>>          def __init__(self, path):
>>              if not os.path.isdir(path):
>>                  raise ImportError("Not an existing directory")
>>              super(CachedImporter, self).__init__(path)
>>              self.refresh()
>>
>>          def refresh(self):
>>              self.cache = set()
>>              for fname in os.listdir(self.path):
>>                  base, ext = os.path.splitext(fname)
>>                  if ext in suffixes and '.' not in base:
>>                      self.cache.add(base)
>>
>>          def find_module(self, fullname, path=None):
>>              if fullname.split(".")[-1] not in self.cache:
>>                  return None  # no need to check further
>>              return super(CachedImporter, self).find_module(fullname, 
>> path)
>>
>>      sys.path_hooks.append(CachedImporter)
> 
> After a bit of reflection, it seems the refresh() method needs to be a 
> bit different:
> 
>           def refresh(self):
>               cache = set()
>               for fname in os.listdir(self.path):
>                   base, ext = os.path.splitext(fname)
>                   if not ext or (ext in suffixes and '.' not in base):
>                       cache.add(base)
>               self.cache = cache
> 
> This version fixes two problems: first, a race condition could occur if 
> you called refresh() while an import was taking place in another 
> thread.  This version fixes that by only updating self.cache after the 
> new cache is completely built.
> 
> Second, the old version didn't handle packages at all.  This version 
> handles them by treating extension-less filenames as possible package 
> directories.  I originally thought this should check for a subdirectory 
> and __init__, but this could get very expensive if a sys.path directory 
> has a lot of subdirectories (whether or not they're packages).  Having 
> false positives in the cache (i.e. names that can't actually be 
> imported) could slow things down a bit, but *only* if those names match 
> something you're trying to import.  Thus, it seems like a reasonable 
> trade-off versus needing to scan every subdirectory at startup or even 
> to check whether all those names *are* subdirectories.

There are a couple more things I'll fix as soon as I try it.  First,
I'd call refresh() lazily on the first find_module() because I don't
want to listdir() directories on sys.path that will never be accessed.
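
Roughly, against the class sketched above:

    # Sketch: defer the directory scan to the first lookup.
    def __init__(self, path):
        if not os.path.isdir(path):
            raise ImportError("Not an existing directory")
        super(CachedImporter, self).__init__(path)
        self.cache = None               # instead of self.refresh()

    def find_module(self, fullname, path=None):
        if self.cache is None:
            self.refresh()              # first lookup pays for listdir()
        if fullname.split(".")[-1] not in self.cache:
            return None
        return super(CachedImporter, self).find_module(fullname, path)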

The idea of using sys.path_hooks is very clever (I hadn't thought of
it... because I didn't know of path_hooks in the first place! It appears
to be undocumented and sparsely indexed by Google as well), and it will
probably help me a lot in my task of fixing this problem in the 2.x series.
-- 
Giovanni Bajo


From rrr at ronadam.com  Thu Jun 14 04:13:44 2007
From: rrr at ronadam.com (Ron Adam)
Date: Wed, 13 Jun 2007 21:13:44 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<ca471dc20706071034i47e1ee4dg77946f2d22cf043@mail.gmail.com>	
	<46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>	
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>
Message-ID: <4670A458.7050206@ronadam.com>



Guido van Rossum wrote:
> On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
>>
>>
>> Guido van Rossum wrote:
>> > I couldn't get this exact patch to apply, but I implemented something
>> > equivalent in the py3kstruni branch. See revisions 55964 and 55965.
>> > Thanks for the suggestion!
>>
>> This is actually closer to how I started to do it, but I wasn't sure 
>> if it
>> would catch everything.  Looking at it again, it looks good with the
>> exception of riscos.  (The added test should catch that if it's a problem
>> so it can be fixed later.)
> 
> If riscos is even still supported. ;-(

I have no idea.

Looking at the overall structure of os.py makes me think the
platform-specific code could be abstracted out a bit further.  Possibly
have one public "platform" module (or package) that is an alias for, or
built from, private _platform package files.

So instead of having "import mac" or "from mac import ..." in if-else 
structures, just do "from platform import ...".  That moves all the 
platform testing to either the build process or as part of site.py so it 
can set 'platform' to the correct platform module or package.  After that 
everything else is platform independent (or mostly).
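
Something like this at startup (module names invented here, and using
'osplatform' to avoid clashing with the existing platform module):

    import sys

    # Sketch: pick the platform implementation once; every other
    # module then does "from osplatform import ..." with no if-else.
    if sys.platform == 'riscos':
        import _platform_riscos as _impl
    elif sys.platform.startswith('win'):
        import _platform_nt as _impl
    else:
        import _platform_posix as _impl

    sys.modules['osplatform'] = _impl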


>> The reason I made a new test case for added tests is that the existing 
>> test
>> case based on mapping_tests.BasicTestMappingProtocol doesn't run the 
>> added
>> test methods.  So I put those under a new test case based on 
>> unittest.TestCase.
> 
> I don't understand this. The test_keyvalue_types() test *does* run,
> regardless of whether I use regrtest.py test_os or test_os.py.

Just tested it again and you are right.  I did test it earlier and it did 
not run those tests when I wrote the test exactly as you did. (So if it was 
broke, it got fixed someplace else.)


>> I can't re-verify this currently because the latest merge broke something
>> in my build process.  I'm getting a "lost stderr" message.  I've seen it
>> before so it's probably something on my end.  I think the last time this
>> happened to me I was able to clear it up by deleting the branch and
>> re-updating it.
> 
> Your best bet is to remove all .pyc files under Lib: rm `find Lib -name 
> \*.pyc`
> (make clean also works)

You fixed this when you added the missing abc.py file.


>> Another suggestion is to make a change in stringobject.c to represent 8-bit
>> strings as "str8('somes_tring')" or just s"some_string" so they can
>> more easily be distinguished from unicode strings, particularly in the tests.
>> This will force a few more tests to fail, but they are things that 
>> need to
>> be fixed.  Only about 3 or 4 additional modules fail when I tried it.
> 
> I've considered this, but then we should also support that notation on
> input. I've also thought of using different string quote conventions,
> e.g. "..." to mean Unicode and '...' to mean 8-bit.

Are str8 types going to be part of the final distribution?  I thought the 
goal was to eventually remove all of those wherever possible.

I think "" vs '' is too subtle.


>> I was getting failed expect/got test cases that looked exactly the same.
>> But after changing the str8 representation those became obvious str8 vs
>> unicode comparisons.
> 
> Right.
> 
>> Using the shorter 's"string"' form will cause syntax errors in places
>> where eval or exec are using str8.  Which may also be helpful.
> 
> Why would this help?

This would be only a temporary debugging aid, to be removed later.  Often
eval and exec get their inputs from temporary files or other file-like
sources, so this moves the point of failure a bit closer to the problem in
these cases.  I don't think there should be any places where a str8 string
created by a Python program will be used this way; those will be
unicode strings.

Think of it as just another test, but more general in scope than a
highly specific unit test with usually very controlled inputs.  And its
purpose is to help expose some harder-to-find problems, not the easy-to-fix
ones.


>> BTW,  I will make a new remove_raw_escapes patch so it applies cleanly.
>> I'm trying to track down why my patched version of test_tokenize.py 
>> passes
>> sometimes but not at others.  (I think it's either a tempfile or 
>> string io
>> issue, or both.)  This was what initiated the above suggestion.
> 
> Please send it as a proper attachment; somehow gmail doesn't make it
> easy to extract patches pasted directly into the text (nor "inline"
> attachments).

Ok, will do.  I'll update the patch on the patch tracker since it's already 
started as well.

Cheers,
    Ron
From guido at python.org  Thu Jun 14 04:25:15 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 13 Jun 2007 19:25:15 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4670A458.7050206@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<466892B7.4050108@ronadam.com>
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>
	<4668D535.7020103@v.loewis.de>
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>
	<466E4B22.6020408@ronadam.com>
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>
	<46708286.6090201@ronadam.com>
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>
	<4670A458.7050206@ronadam.com>
Message-ID: <ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>

On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
> Looking at the overall structure of os.py makes me think the platform
> specific code could be abstracted out a bit further.  Possibly have one
> public "platform" module (or package) that is an alias or built from
> private _platform package files.
>
> So instead of having "import mac" or "from mac import ..." in if-else
> structures, just do "from platform import ...".  That moves all the
> platform testing to either the build process or as part of site.py so it
> can set 'platform' to the correct platform module or package.  After that
> everything else is platform independent (or mostly).

Yeah, but I'm not going to rewrite the standard library -- I'm only
going to keep the current architecture working. Others will have to
help with improving the architecture. You have the right idea -- can
you make it work as a patch?

> You fixed this when you added the missing abc.py file.

Sorry about that. I think it was a svnmerge glitch; I didn't notice it
until long after the merge.

> Are str8 types going to be part of the final distribution?  I thought the
> goal was to eventually remove all of those wherever possible.

I don't know yet. There's been a cry for an "immutable bytes" type --
it could be str8 (perhaps renamed). Also, much C code doesn't deal
with Unicode strings yet and expects char* strings whose lifetime is
the same as the Unicode string. Having a str8 permanently attached to
the Unicode string is a convenient solution -- especially since it's
already implemented. :-)

> I think "" vs '' is too subtle.

Fair enough.

> >> I was getting failed expect/got test cases that looked exactly the same.
> >> But after changing the str8 representation those became obvious str8 vs
> >> unicode comparisons.
> >
> > Right.
> >
> >> Using the shorter 's"string"' form will cause syntax errors in places
> >> where eval or exec are given str8 input.  Which may also be helpful.
> >
> > Why would this help?
>
> This would be only a temporary debugging aid to be removed later.  Often
> eval and exec get their inputs from temporary files or other file-like
> sources, so this moves the point of failure a bit closer to the problem in
> these cases.  I don't think there should be any places where a str8 string
> created by a python program is used this way; those will be
> unicode strings.
>
> Think of it as just another test, but it's more general in scope than a
> highly specific unit test with usually very controlled inputs.  And its
> purpose is to help expose some harder-to-find problems, not the easy-to-fix
> ones.

Makes some sense. Could you come up with a patch?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From brett at python.org  Thu Jun 14 06:01:43 2007
From: brett at python.org (Brett Cannon)
Date: Wed, 13 Jun 2007 21:01:43 -0700
Subject: [Python-3000] Pre-PEP on fast imports
In-Reply-To: <466ECC61.7070505@develer.com>
References: <466DD0D8.7040407@develer.com>
	<20070611231640.8A55E3A407F@sparrow.telecommunity.com>
	<20070612162845.5F00A3A407F@sparrow.telecommunity.com>
	<466ECC61.7070505@develer.com>
Message-ID: <bbaeab100706132101q25c7c449nc1d0b497c54f892e@mail.gmail.com>

On 6/12/07, Giovanni Bajo <rasky at develer.com> wrote:
>
> On 6/12/2007 6:30 PM, Phillip J. Eby wrote:
>
> >>      import imp, os, sys
> >>      from pkgutil import ImpImporter
> >>
> >>      suffixes = set(ext for ext,mode,typ in imp.get_suffixes())
> >>
> >>      class CachedImporter(ImpImporter):
> >>          def __init__(self, path):
> >>              if not os.path.isdir(path):
> >>                  raise ImportError("Not an existing directory")
> >>              super(CachedImporter, self).__init__(path)
> >>              self.refresh()
> >>
> >>          def refresh(self):
> >>              self.cache = set()
> >>              for fname in os.listdir(self.path):
> >>                  base, ext = os.path.splitext(fname)
> >>                  if ext in suffixes and '.' not in base:
> >>                      self.cache.add(base)
> >>
> >>          def find_module(self, fullname, path=None):
> >>              if fullname.split(".")[-1] not in self.cache:
> >>                  return None  # no need to check further
> >>              return super(CachedImporter, self).find_module(fullname,
> >> path)
> >>
> >>      sys.path_hooks.append(CachedImporter)
> >
> > After a bit of reflection, it seems the refresh() method needs to be a
> > bit different:
> >
> >           def refresh(self):
> >               cache = set()
> >               for fname in os.listdir(self.path):
> >                   base, ext = os.path.splitext(fname)
> >                   if not ext or (ext in suffixes and '.' not in base):
> >                       cache.add(base)
> >               self.cache = cache
> >
> > This version fixes two problems: first, a race condition could occur if
> > you called refresh() while an import was taking place in another
> > thread.  This version fixes that by only updating self.cache after the
> > new cache is completely built.
> >
> > Second, the old version didn't handle packages at all.  This version
> > handles them by treating extension-less filenames as possible package
> > directories.  I originally thought this should check for a subdirectory
> > and __init__, but this could get very expensive if a sys.path directory
> > has a lot of subdirectories (whether or not they're packages).  Having
> > false positives in the cache (i.e. names that can't actually be
> > imported) could slow things down a bit, but *only* if those names match
> > something you're trying to import.  Thus, it seems like a reasonable
> > trade-off versus needing to scan every subdirectory at startup or even
> > to check whether all those names *are* subdirectories.
>
There are a couple of other things I'll fix as soon as I try it. First is
> that I'd call refresh() lazily on the first find_module because I don't
> want to listdir() directories on sys.path that will never be accessed.
>
> The idea of using sys.path_hooks is very clever (I hadn't thought of
> it... because I didn't know of path_hooks in the first place! It appears
> to be undocumented and sparsely indexed by google as well), and it will
probably help me a lot in my task of fixing this problem in the 2.x series.



PEP 302 documents all of this, but unfortunately it was never covered in the
official docs.

I also have some pseudocode of how import (roughly) works at
sandbox/trunk/import_in_py/pseudocode.py .

-Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070613/8998fc87/attachment.htm 

From rrr at ronadam.com  Thu Jun 14 06:27:49 2007
From: rrr at ronadam.com (Ron Adam)
Date: Wed, 13 Jun 2007 23:27:49 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<466892B7.4050108@ronadam.com>	
	<ca471dc20706071654n2fd72568v8ec030ecfae8ae9@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>	
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>	
	<4670A458.7050206@ronadam.com>
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>
Message-ID: <4670C3C5.4070907@ronadam.com>



Guido van Rossum wrote:
> On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
>> Looking at the overall structure of os.py makes me think the platform
>> specific code could be abstracted out a bit further.  Possibly have one
>> public "platform" module (or package) that is an alias or built from
>> private _platform package files.
>>
>> So instead of having "import mac" or "from mac import ..." in if-else
>> structures, just do "from platform import ...".  That moves all the
>> platform testing to either the build process or as part of site.py so it
>> can set 'platform' to the correct platform module or package.  After that
>> everything else is platform independent (or mostly).
> 
> Yeah, but I'm not going to rewrite the standard library -- I'm only
> going to keep the current architecture working. Others will have to
> help with improving the architecture. You have the right idea -- can
> you make it work as a patch?

Yes, I expect it would be part of the library reorganization which is still 
down the road a bit.  I'll try to look into it a bit more sometime between 
now and then.
and then.  Maybe I can get enough of it started and get others motivated to 
contribute to it.  <shrug>


>> You fixed this when you added the missing abc.py file.
> 
> Sorry about that. I think it was a svnmerge glitch; I didn't notice it
> until long after the merge.
> 
>> Are str8 types going to be part of the final distribution?  I thought the
>> goal was to eventually remove all of those wherever possible.
> 
> I don't know yet. There's been a cry for an "immutable bytes" type --
> it could be str8 (perhaps renamed). Also, much C code doesn't deal
> with Unicode strings yet and expects char* strings whose lifetime is
> the same as the Unicode string. Having a str8 permanently attached to
> the Unicode string is a convenient solution -- especially since it's
> already implemented. :-)

Well I can see where a str8() type with an __encoded_with__ attribute could 
be useful.  It would use a bit more memory, but it won't be the 
default/primary string type anymore so maybe it's ok.

Then bytes can be bytes, and unicode can be unicode, and str8 can be 
encoded strings for interfacing with the outside non-unicode world.  Or 
something like that. <shrug>


>> I think "" vs '' is too subtle.
> 
> Fair enough.
> 
>> >> I was getting failed expect/got test cases that looked exactly the 
>> same.
>> >> But after changing the str8 representation those became obvious str8 vs
>> >> unicode comparisons.
>> >
>> > Right.
>> >
>> >> Using the shorter 's"string"' form will cause syntax errors in places
>> >> where eval or exec are given str8 input.  Which may also be helpful.
>> >
>> > Why would this help?
>>
>> This would be only a temporary debugging aid to be removed later.  Often
>> eval and exec get their inputs from temporary files or other file-like
>> sources, so this moves the point of failure a bit closer to the problem
>> in these cases.  I don't think there should be any places where a str8
>> string created by a python program is used this way; those will be
>> unicode strings.
>>
>> Think of it as just another test, but it's more general in scope than a
>> highly specific unit test with usually very controlled inputs.  And its
>> purpose is to help expose some harder-to-find problems, not the
>> easy-to-fix ones.
> 
> Makes some sense. Could you come up with a patch?

Done  :-)

Attached are both the str8 repr patch (s"..." and s'...') and the latest 
no_raw_escape patch, which I think is complete now and should apply with no 
problems.

I tracked the random failures I am having in test_tokenize.py down to it 
doing a round trip on random test_*.py files.  If one of those files has a 
problem, it causes test_tokenize.py to fail also.  So I added a line to the 
test to output the file name it does the round trip on, so those can be 
fixed as they are found.

Let me know if it needs to be adjusted or if something doesn't look right.

Cheers,
    Ron


-------------- next part --------------
A non-text attachment was scrubbed...
Name: norawescape3.diff
Type: text/x-patch
Size: 18923 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070613/d4f1846d/attachment-0002.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stingobject_str8repr.diff
Type: text/x-patch
Size: 758 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070613/d4f1846d/attachment-0003.bin 

From martin at v.loewis.de  Thu Jun 14 08:28:34 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 14 Jun 2007 08:28:34 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <fb6fbf560706131523t5c32239ei75cf638ff10321aa@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>	
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>	
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>	
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>	
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>	
	<46705599.8090301@v.loewis.de>	
	<ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
	<fb6fbf560706131523t5c32239ei75cf638ff10321aa@mail.gmail.com>
Message-ID: <4670E012.2090803@v.loewis.de>

> Yes.  The BOM mark, for one.

Actually, the BOM *is* a character: ZERO WIDTH NO-BREAK SPACE,
character class Cf. This function of the code point (as a character)
is deprecated, though.

> There are also some that are explicitly not characters.
> (U+FD00..U+FDEF)

??? U+FD00 is ARABIC LIGATURE HAH WITH YEH ISOLATED FORM,
U+FDEF is unassigned.

Regards,
Martin


From stephen at xemacs.org  Thu Jun 14 09:43:55 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 14 Jun 2007 16:43:55 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <fb6fbf560706131523t5c32239ei75cf638ff10321aa@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
	<46705599.8090301@v.loewis.de>
	<ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
	<fb6fbf560706131523t5c32239ei75cf638ff10321aa@mail.gmail.com>
Message-ID: <877iq7dlqc.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > > Apart from the surrogates, are there code points that aren't
 > > characters?

 > Yes.  The BOM mark, for one.

Nitpick: The BOM *is* a character (FEFF, aka ZERO-WIDTH NO-BREAK
SPACE).  Its byte-swapped counterpart FFFE is guaranteed *not* to be a
character.  (Martin wrote that correctly.)  FFFF is guaranteed *not*
to be a character; in fact all code points U that are equal to FFFE or
FFFF modulo 0x10000 are guaranteed not to be characters (ie, the last
two characters in each plane).

 > Plenty of other code points are reserved
 > for private use, or not yet assigned,

Or reserved for use as surrogates, and therefore should never appear
in UTF-8 or UTF-32 streams -- but if they do, AIUI they must be passed
on uninterpreted unless the API explicitly says what it does with them.

 > or never will be assigned.  There are also some that are explicitly
 > not characters.  (U+FD00..U+FDEF),

Guaranteed not to be assigned == not a character.  The special range
of non-characters is quite a bit smaller, FDD0..FDEF.

 > and some that might be debatable (unprinted control
 > characters, or U+FFFC: OBJECT REPLACEMENT CHARACTER)

Not a good idea to classify this way.  Those *are* characters, and a
process may interpret them or not.  Python (the language and the
stdlib, except where it explicitly says otherwise) definitely should
*not* worry about these things.  They're characters, that's the most
Python needs to know.

 > > Are there characters that don't have a representation as a
 > > single code point? (I know some characters have multiple
 > > representations, some of which use multiple code points.)

Not a question that can be answered without reference to a specific
application.  An application may treat each code point as a character,
or it may choose to compose code points (eg, into private space).

The most Python might want to do is deal with canonical equivalence,
but even then there are issues, such as the ö in the English word
coördinate.  I would consider the diaeresis as a separate diacritic
(meaning "don't pronounce as 'oo', pronounce as 'oh-oh'"), not a
component of a single character.  There may be even clearer examples.

 > There are also plenty of things that a native speaker may view as a
 > single character, but which unicode treats as (at most) a Named
 > Sequence.

Eg, the New Line Function (Unicode's name for "universal newline"),
which can be any of the usual suspects (CR, LF, CRLF) depending on
context.


From rauli.ruohonen at gmail.com  Thu Jun 14 14:34:06 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Thu, 14 Jun 2007 15:34:06 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
	<46705599.8090301@v.loewis.de>
	<ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
Message-ID: <f52584c00706140534w26d30d54i9da108568993b97e@mail.gmail.com>

On 6/14/07, Guido van Rossum <guido at python.org> wrote:
> On 6/13/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).

As the word "character" is ambiguous, I'd put it this way:

- code point: the smallest unit Unicode deals with that's independent of
  encoding. Takes values in range(0, 0x110000)
- grapheme (or "grapheme cluster"): what users think of as characters. May
  consist of multiple code points, e.g. "ä" can be represented with one
  or two code points. Depends on the language the user speaks

> It sounds like we really use code units, not code points (except when
> building with the 4-byte Unicode option, when they are equivalent).

Not quite equivalent in current Python. From some past discussions I thought
this was by design, but now having seen this odd behavior, maybe it isn't:

>>> sys.maxunicode
1114111
>>> x = u'\ud840\udc21'
>>> marshal.loads(marshal.dumps(x)) == x
False
>>> pickle.loads(pickle.dumps(x, 2)) == x
False
>>> pickle.loads(pickle.dumps(x, 1)) == x
False
>>> pickle.loads(pickle.dumps(x)) == x
True
>>>

Pickling should work the same way regardless of protocol, right? And
probably should not modify the objects it pickles if it can help it.
The reason the above happens is that binary pickles use UTF-8 to encode
unicode, and this is what happens with codecs:

>>> u'\ud840\udc21' == u'\U00020021'
False
>>> u'\ud840\udc21'.encode('utf-8').decode('utf-8')
u'\U00020021'
>>> u'\ud840\udc21'.encode('punycode').decode('punycode')
u'\ud840\udc21'
>>> u'\ud840\udc21'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\U00020021'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\ud840\udc21'.encode('big5hkscs').decode('big5hkscs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5hkscs' codec can't encode character u'\ud840'
in position 0: illegal multibyte sequence
>>> u'\U00020021'.encode('big5hkscs').decode('big5hkscs')
u'\U00020021'
>>>

Should codecs treat u'\ud840\udc21' and u'\U00020021' the same even on
UCS-4 builds (like current UTF-8 and UTF-16 codecs do) or not (like current
punycode and big5hkscs codecs do)?
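
(For reference, the UTF-16 pairing rule behind these examples -- a small
sketch, not library code:

    def combine_surrogates(hi, lo):
        # hi is a high surrogate (D800..DBFF), lo a low one (DC00..DFFF)
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

    assert combine_surrogates(0xD840, 0xDC21) == 0x20021

so u'\ud840\udc21' and u'\U00020021' denote the same code point.)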

From rauli.ruohonen at gmail.com  Thu Jun 14 15:51:09 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Thu, 14 Jun 2007 16:51:09 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>
	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <f52584c00706140651p5d5df21fw2293625649ec8b2a@mail.gmail.com>

On 6/13/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> except that people will sneak in some UTF-16 behavior where it seems useful.

How about sneaking these in py3k-struni:

- chr(i) returns a len-1 or len-2 string for all i in range(0, 0x110000) and
  ord(chr(i)) == i for all i in range(0, 0x110000)

- unicodedata.name(chr(i)) returns the same result for all i on both UCS-2
  and UCS-4 builds (and same for bidirectional(), category(), combining(),
  decimal(), decomposition(), digit(), east_asian_width(), mirrored() and
  numeric() in unicodedata)

- return len-1 or len-2 strings on unicodedata.lookup(), instead of always
  len-1 strings (e.g. unicodedata.lookup('AEGEAN WORD SEPARATOR LINE')
  returns '\u0100' on UCS-2 builds, but '\U00010100' on UCS-4 builds)

- unicodedata.normalize(s) interprets its input as UTF-16 on UCS-2 builds

- use ValueError instead of TypeError in the above when passed an
  inappropriate string, e.g. ord('aa')

Any chances?
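
For concreteness, the first item amounts to something like this on a UCS-2
build (a sketch only; wide_chr and wide_ord are hypothetical names):

    def wide_chr(i):
        if not 0 <= i < 0x110000:
            raise ValueError('chr() arg not in range(0x110000)')
        if i < 0x10000:
            return chr(i)              # a single code unit
        i -= 0x10000                   # else encode as a surrogate pair
        return chr(0xD800 + (i >> 10)) + chr(0xDC00 + (i & 0x3FF))

    def wide_ord(s):
        if len(s) == 2 and '\ud800' <= s[0] <= '\udbff' \
                       and '\udc00' <= s[1] <= '\udfff':
            # inverse of the surrogate-pairing rule
            return 0x10000 + ((ord(s[0]) - 0xD800) << 10) + (ord(s[1]) - 0xDC00)
        if len(s) != 1:
            raise ValueError('ord() expected a character or surrogate pair')
        return ord(s)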

From jimjjewett at gmail.com  Thu Jun 14 18:54:15 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 14 Jun 2007 12:54:15 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <4670E012.2090803@v.loewis.de>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
	<46705599.8090301@v.loewis.de>
	<ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
	<fb6fbf560706131523t5c32239ei75cf638ff10321aa@mail.gmail.com>
	<4670E012.2090803@v.loewis.de>
Message-ID: <fb6fbf560706140954x15984882v2f8c41857afca91d@mail.gmail.com>

On 6/14/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > There are also some that are explicitly not characters.
> > (U+FD00..U+FDEF)

> ??? U+FD00 is ARABIC LIGATURE HAH WITH YEH ISOLATED FORM,
> U+FDEF is unassigned.

Sorry; typo on my part.  The start of the range is U+FDD0, not U+FD00.

I suspect there may be others that are guaranteed never to get an
assignment, because of their location.  (Example:  The character would
have to have certain properties or be part of a specific script, but
adding more such characters would violate some other stability rule.)

-jJ

From jimjjewett at gmail.com  Thu Jun 14 19:56:20 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 14 Jun 2007 13:56:20 -0400
Subject: [Python-3000] String comparison
In-Reply-To: <877iq7dlqc.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
	<46705599.8090301@v.loewis.de>
	<ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
	<fb6fbf560706131523t5c32239ei75cf638ff10321aa@mail.gmail.com>
	<877iq7dlqc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <fb6fbf560706141056m6c199cfbv24ebd1d82b04115d@mail.gmail.com>

On 6/14/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>  > There are also plenty of things that a native speaker may view as a
>  > single character, but which unicode treats as (at most) a Named
>  > Sequence.

> Eg, the New Line Function (Unicode's name for "universal newline"),
> which can be any of the usual suspects (CR, LF, CRLF) depending on
> context.

I hadn't even thought of such abstract characters; I was thinking of
(Normative Appendix) UAX 34: Unicode Named Character Sequences at
http://unicode.org/reports/tr34/

These are more like æ, or the NJ digraph, except that a
single-character equivalent has not been coded (and probably never
will be coded -- see
http://www.unicode.org/faq/ligature_digraph.html#3).

The current list of named sequences is available at
http://www.unicode.org/Public/UNIDATA/NamedSequences.txt

-jJ

From stephen at xemacs.org  Thu Jun 14 20:53:52 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 15 Jun 2007 03:53:52 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <fb6fbf560706140954x15984882v2f8c41857afca91d@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>
	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<ca471dc20706131303s67a93afcmf4a0a8ef4c819d16@mail.gmail.com>
	<46705599.8090301@v.loewis.de>
	<ca471dc20706131405p2f9c2abeyfed426761ab4fb45@mail.gmail.com>
	<fb6fbf560706131523t5c32239ei75cf638ff10321aa@mail.gmail.com>
	<4670E012.2090803@v.loewis.de>
	<fb6fbf560706140954x15984882v2f8c41857afca91d@mail.gmail.com>
Message-ID: <87tztacqpr.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > I suspect there may be others that are guaranteed never to get an
 > assignment, because of their location.  (Example:  The character would
 > have to have certain properties or be part of a specific script, but
 > adding more such characters would violate some other stability rule.)

In Unicode 4.1, there are precisely 66: 34 in the highest 2 positions
of each plane (nnFFFE and nnFFFF), and the 32-point gap from FDD0 to
FDEF.  The text doesn't explain that latter gap, but does say it's a
historical anomaly.  I doubt there will ever be any more.
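
A quick way to enumerate them (sketch):

    noncharacters = set(range(0xFDD0, 0xFDF0))   # the 32-point FDD0..FDEF gap
    for plane in range(17):                      # 2 per plane * 17 planes = 34
        noncharacters.update({(plane << 16) + 0xFFFE, (plane << 16) + 0xFFFF})
    assert len(noncharacters) == 66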

From guido at python.org  Fri Jun 15 01:57:28 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 14 Jun 2007 16:57:28 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4670C3C5.4070907@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<4668D535.7020103@v.loewis.de>
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>
	<466E4B22.6020408@ronadam.com>
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>
	<46708286.6090201@ronadam.com>
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>
	<4670A458.7050206@ronadam.com>
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>
	<4670C3C5.4070907@ronadam.com>
Message-ID: <ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>

On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
> Well I can see where a str8() type with an __encoded_with__ attribute could
> be useful.  It would use a bit more memory, but it won't be the
> default/primary string type anymore so maybe it's ok.
>
> Then bytes can be bytes, and unicode can be unicode, and str8 can be
> encoded strings for interfacing with the outside non-unicode world.  Or
> something like that. <shrug>

Hm... Requiring each str8 instance to have an encoding might be a
problem -- it means you can't just create one from a bytes object.
What would be the use of this information? What would happen on
concatenation? On slicing? (Slicing can break the encoding!)

> Attached are both the str8 repr patch (s"..." and s'...') and the latest
> no_raw_escape patch, which I think is complete now and should apply with no
> problems.

I like the str8 repr patch enough to check it in.
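It makes the two types distinguishable at a glance, e.g. (illustrative
session only):

    >>> str8('abc')
    s'abc'
    >>> 'abc'
    'abc'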

> I tracked the random failures I am having in test_tokenize.py down to it
> doing a round trip on random test_*.py files.  If one of those files has a
> problem, it causes test_tokenize.py to fail also.  So I added a line to the
> test to output the file name it does the round trip on, so those can be
> fixed as they are found.
>
> Let me know if it needs to be adjusted or if something doesn't look right.

Well, I'm still philosophically uneasy with r'\' being a valid string
literal, for various reasons (one being that writing a string parser
becomes harder and harder). I definitely want r'\u1234' to be a
6-character string, however. Do you have a patch that does just that?
(We can argue over the rest later in a larger forum.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rrr at ronadam.com  Fri Jun 15 06:51:10 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 14 Jun 2007 23:51:10 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>	
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>	
	<4670A458.7050206@ronadam.com>	
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>	
	<4670C3C5.4070907@ronadam.com>
	<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
Message-ID: <46721ABE.9090203@ronadam.com>



Guido van Rossum wrote:
> On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
>> Well I can see where a str8() type with an __encoded_with__ attribute 
>> could
>> be useful.  It would use a bit more memory, but it won't be the
>> default/primary string type anymore so maybe it's ok.
>>
>> Then bytes can be bytes, and unicode can be unicode, and str8 can be
>> encoded strings for interfacing with the outside non-unicode world.  Or
>> something like that. <shrug>
> 
> Hm... Requiring each str8 instance to have an encoding might be a
> problem -- it means you can't just create one from a bytes object.
> What would be the use of this information? What would happen on
> concatenation? On slicing? (Slicing can break the encoding!)

Round trips to and from bytes should work just fine.  Why would that be a 
problem?

There really is no safety in concatenation and slicing of encoded 8-bit 
strings now.  If by accident two strings of different encodings are 
combined, then all bets are off.  And since there is no way to ask a string 
what its current encoding is, it becomes an easy-to-make and hard-to-find 
silent error.  So we have to be very careful not to mix encoded strings 
with different encodings.

It's not too different from trying to find the current unicode and str8 
issues in the py3k-struni branch.

Concatenating str8 and str types is a bit safer, as long as the str8 is in 
in "the" default encoding, but it may still be an unintended implicit 
conversion.  And if it's not in the default encoding, then all bets are off 
again.

The use would be in ensuring the integrity of encoded strings. 
Concatenating strings with different encodings could then produce errors. 
Explicit casting could automatically decode and encode as needed.  Which 
would eliminate a lot of encode/decode confusion.

This morning I was thinking all of this could be done as a module that 
possibly uses metaclasses or mixins to create encoded string types.  Then 
it wouldn't need an attribute on the instances.  Possibly someone has 
already done something along those lines?

But back to the issues at hand...

>> Attached are both the str8 repr patch (s"..." and s'...') and the latest
>> no_raw_escape patch, which I think is complete now and should apply with
>> no problems.
> 
> I like the str8 repr patch enough to check it in.
> 
>> I tracked the random failures I am having in test_tokenize.py down to it
>> doing a round trip on random test_*.py files.  If one of those files has a
>> problem, it causes test_tokenize.py to fail also.  So I added a line to
>> the test to output the file name it does the round trip on, so those can
>> be fixed as they are found.
>>
>> Let me know if it needs to be adjusted or if something doesn't look right.
> 
> Well, I'm still philosophically uneasy with r'\' being a valid string
> literal, for various reasons (one being that writing a string parser
> becomes harder and harder).

Hmmm..  It looks to me the thing that makes it somewhat hard is in 
determining whether or not it's a single-quote, empty-single-quote, or 
triple-quote string.  I made some improvements to that in tokenize.c, 
although it may not be clear from just looking at the unified diff.

After that, it was just a matter of checking a !is_raw_str flag before 
always blindly accepting the following character.

Before that, it was a matter of doing that and checking the quote type 
status as well, which wasn't intuitive since the string parsing loop was 
entered before the beginning quote type was confirmed.

I can remove the raw string flag and flag-check and leave the other changes 
in, or revert the whole file back.  Any preference?  The latter makes it an 
easy approximately three-line change to add r'\' support back in.

I'll have to look at tokenize.py again to see what needs to be done there. 
It uses regular expressions to parse the file.

> I definitely want r'\u1234' to be a
> 6-character string, however. Do you have a patch that does just that?
> (We can argue over the rest later in a larger forum.)

I can split the patch into two patches, and the second one (allowing an 
escape at the end of a string) can be reviewed later.

What about br'\'?  Should that be excluded also?

Ron
From martin at v.loewis.de  Fri Jun 15 08:03:41 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Fri, 15 Jun 2007 08:03:41 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46721ABE.9090203@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>	
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>	
	<4670A458.7050206@ronadam.com>	
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>	
	<4670C3C5.4070907@ronadam.com>
	<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
	<46721ABE.9090203@ronadam.com>
Message-ID: <46722BBD.5060100@v.loewis.de>

>>> Then bytes can be bytes, and unicode can be unicode, and str8 can be
>>> encoded strings for interfacing with the outside non-unicode world.  Or
>>> something like that. <shrug>
>>
>> Hm... Requiring each str8 instance to have an encoding might be a
>> problem -- it means you can't just create one from a bytes object.
>> What would be the use of this information? What would happen on
>> concatenation? On slicing? (Slicing can break the encoding!)
> 
> Round trips to and from bytes should work just fine.  Why would that be
> a problem?

I'm strongly opposed to adding encoding information to str8 objects.
I think they will eventually go away, anyway; adding that kind of
overhead now is both a waste of developer's time and of memory
resources; plus it has all the semantic issues that Guido points
out.

As for creating str8 objects from bytes objects: If you want
the str8 object to carry an encoding, you would have to *specify*
the encoding when creating the str8 object, since the bytes object
does not have that information. This is *very* hard, as you may
not know what the encoding is when you need to create the str8
object.

>> There really is no safety in concatenation and slicing of encoded 8-bit
> strings now.  If by accident two strings of different encodings are
> combined, then all bets are off.  And since there is no way to ask a
>> string what its current encoding is, it becomes an easy-to-make and
>> hard-to-find silent error.  So we have to be very careful not to mix
> encoded strings with different encodings.

Please answer the question: what would happen on concatenation? In
particular, what is the value of the encoding for the result
of the concatenated string if one input is "latin-1", and the
other one is "utf-8"?

It's easy to tell what happens now: the bytes of those input
strings are just appended; the result string does not follow
a consistent character encoding anymore. This answer does
not apply to your proposed modification, as it does not answer
what the value of the .encoding attribute of the str8 would be
after concatenation (likewise for slicing).

> It's not too different from trying to find the current unicode and str8
> issues in the py3k-struni branch.

This sentence I do not understand. What is not too different from
trying to find issues?

> Concatenating str8 and str types is a bit safer, as long as the str8 is
> in "the" default encoding, but it may still be an unintended implicit
> conversion.  And if it's not in the default encoding, then all bets are
> off again.

Sure. However, the str8 type will go away, and along with it all these
issues.

> The use would be in ensuring the integrity of encoded strings.
> Concatenating strings with different encodings could then produce
> errors.

Ok. What about slicing?

Regards,
Martin


From martin at v.loewis.de  Fri Jun 15 08:13:29 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Fri, 15 Jun 2007 08:13:29 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <f52584c00706140651p5d5df21fw2293625649ec8b2a@mail.gmail.com>
References: <f52584c00706060150i598efe64w95e2c24ff19cfb1c@mail.gmail.com>	<87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706080638m668adf8eh3c4795a996120424@mail.gmail.com>	<877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706091401v4ae432cfxa0a4b808df919615@mail.gmail.com>	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706120327q21c1e9c6vff9b1e98ef94aa92@mail.gmail.com>	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>	<f52584c00706130324o1f7a7313m24b508671d4e4e12@mail.gmail.com>	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<f52584c00706140651p5d5df21fw2293625649ec8b2a@mail.gmail.com>
Message-ID: <46722E09.3020502@v.loewis.de>

> - chr(i) returns a len-1 or len-2 string for all i in range(0, 0x110000) and
>   ord(chr(i)) == i for all i in range(0, 0x110000)

This would contradict an explicit decision in PEP 261. I don't quite
remember the rationale for that; however, the PEP mentions that ord()
should be symmetric with chr().

Whether it would be acceptable to allow selected length-two strings
in ord, I don't know.

> - unicodedata.name(chr(i)) returns the same result for all i on both UCS-2
>   and UCS-4 builds (and same for bidirectional(), category(), combining(),
>   decimal(), decomposition(), digit(), east_asian_width(), mirrored() and
>   numeric() in unicodedata)

There is a patch on SF requesting such a change for .lookup. I think
this should be done in 2.6, not 3.0. It doesn't have the ord/unichr
issue, so I think the same concerns don't apply.

> - return len-1 or len-2 strings on unicodedata.lookup(), instead of always
>   len-1 strings (e.g. unicodedata.lookup('AEGEAN WORD SEPARATOR LINE')
>   returns '\u0100' on UCS-2 builds, but '\U00010100' on UCS-4 builds)

See the patch on SF.

> - unicodedata.normalize(s) interprets its input as UTF-16 on UCS-2 builds

Definitely; somebody would have to write the code.

> - use ValueError instead of TypeError in the above when passed an
>   inappropriate string, e.g. ord('aa')

I'm not sure about this one. The TypeError is deliberate currently.

Regards,
Martin

From rrr at ronadam.com  Sat Jun 16 00:38:58 2007
From: rrr at ronadam.com (Ron Adam)
Date: Fri, 15 Jun 2007 17:38:58 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46722BBD.5060100@v.loewis.de>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>	
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>	
	<4670A458.7050206@ronadam.com>	
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>	
	<4670C3C5.4070907@ronadam.com>
	<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
	<46721ABE.9090203@ronadam.com> <46722BBD.5060100@v.loewis.de>
Message-ID: <46731502.9050400@ronadam.com>



Martin v. Löwis wrote:
>>>> Then bytes can be bytes, and unicode can be unicode, and str8 can be
>>>> encoded strings for interfacing with the outside non-unicode world.  Or
>>>> something like that. <shrug>
>>> Hm... Requiring each str8 instance to have an encoding might be a
>>> problem -- it means you can't just create one from a bytes object.
>>> What would be the use of this information? What would happen on
>>> concatenation? On slicing? (Slicing can break the encoding!)
>> Round trips to and from bytes should work just fine.  Why would that be
>> a problem?
> 
> I'm strongly opposed to adding encoding information to str8 objects.
> I think they will eventually go away, anyway; adding that kind of
> overhead now is both a waste of developer's time and of memory
> resources; plus it has all the semantic issues that Guido points
> out.

This was in the context that it is decided by the community that a str8 type 
is needed and it does not go away.

The alternative is for str8 to be replaced by bytes objects, which I believe 
was, and still is, the plan if possible.

The same semantic issues will also be present in bytes objects in one form 
or another when handling data acquired from sources that use encoded 
strings.  They don't go away even if str8 does go away.

It sort of depends on how someone wants to handle situations where encoded 
strings are encountered.  Do they decode them and convert everything to 
unicode and then convert back as needed for any output?  Or can they keep 
the data in the encoded form for the duration?  I expect different people 
will feel differently on this.


> As for creating str8 objects from bytes objects: If you want
> the str8 object to carry an encoding, you would have to *specify*
> the encoding when creating the str8 object, since the bytes object
> does not have that information. This is *very* hard, as you may
> not know what the encoding is when you need to create the str8
> object.

True, and this also applies if you want to convert an already encoded bytes 
object to unicode as well.

>> There really is no safety in concatenation and slicing of encoded 8bit
>> strings now.  If by accident two strings of different encodings are
>> combined, then all bets are off.  And since there is no way to ask a
>> string what it's current encoding is, it becomes an easy to make and
>> hard to find silent error.  So we have to be very careful not to mix
>> encoded strings with different encodings.
> 
> Please answer the question: what would happen on concatenation? In
> particular, what is the value of the encoding for the result
> of the concatenated string if one input is "latin-1", and the
> other one is "utf-8"?

I was trying to avoid this becoming a long thread. If these ideas seem 
worth discussing, maybe we can move the reply to python-ideas and we can 
work out the details there.


But to not avoid your questions...

Concatenation of unlike-encoded objects should of course raise an error. 
It's just not possible to do that presently.

I agree that putting an attribute on a str8 object instance is only a 
partial solution and does waste some space.  (I changed my mind on this 
yesterday morning after thinking about it some more.)

So I offered an alternative suggestion that it may be possible to use 
dynamically created encoded str types, which avoids putting an attribute on 
every instance, and can handle the problems of slicing, concatenation, and 
conversion.  I didn't go into the details because it was, and is, only a 
general suggestion or thought.

One approach is to possibly use a factory function that uses metaclasses or 
mixins to create these based either on a str base type or a bytes object.

      Latin1 = get_encoded_str_type('latin-1')

      s1 = Latin1('Hello ')

      Utf8 = get_encoded_str_type('utf-8')

      s2 = Utf8('World')

      s = s1 + s2                 -> Exception Raised

      s = s1 + type(s1)(s2)       -> latin-1 string

      s = type(s2)(s1) + s2       -> utf-8 string

      lines = [s1, s2, ..., sn]
      s = Utf8.join([Utf8(s) for s in lines])

In this last case the strings in 'lines' can even be of arbitrary encoding types 
and they would still all get re-encoded to utf-8 correctly.  Chances are 
you would never have a list of strings with many different encodings, but 
you may have a list of strings with types unknown to a local function.

There can probably be various ways of creating these types that do not 
require them to be built in.  The advantage is they can be smarter about 
concatenation, slicing, and transforming to bytes and unicode and back. 
It's really just a higher level API.
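
To make that concrete, here is a rough, runnable sketch (written in
today's syntax for illustration; every name here is hypothetical) of one
way such a factory could behave:

    def get_encoded_str_type(encoding):
        """Create a bytes-based type that remembers its encoding (sketch)."""
        class EncodedStr(bytes):
            codec = encoding

            def __new__(cls, value):
                if isinstance(value, str):
                    value = value.encode(cls.codec)    # unicode -> bytes
                elif getattr(value, 'codec', cls.codec) != cls.codec:
                    # explicit cast from another encoded type: re-encode
                    value = value.decode(value.codec).encode(cls.codec)
                # plain bytes are assumed to already be in this encoding
                return bytes.__new__(cls, value)

            def __add__(self, other):
                if getattr(other, 'codec', None) != self.codec:
                    raise TypeError('cannot mix %s with %r'
                                    % (self.codec, other))
                return type(self)(bytes.__add__(self, other))

        EncodedStr.__name__ = 'EncodedStr_' + encoding
        return EncodedStr

    Latin1 = get_encoded_str_type('latin-1')
    Utf8 = get_encoded_str_type('utf-8')
    s1, s2 = Latin1('Hello '), Utf8('World')
    s1 + Latin1(s2)       # ok: explicit re-encode, result stays latin-1
    # s1 + s2             # raises TypeError: mixed encodings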

Whether it's a waste of time and effort, <shrug>, I suppose that depends on 
who is doing it and whether or not they think so.  It could also be a third 
party module as well.  Then if it becomes popular it can be included in 
python some time in a future version.


> It's easy to tell what happens now: the bytes of those input
> strings are just appended; the result string does not follow
> a consistent character encoding anymore. This answer does
> not apply to your proposed modification, as it does not answer
> what the value of the .encoding attribute of the str8 would be
> after concatenation (likewise for slicing).

And what is the use of appending unlike-encoded str8 types?  Most anything 
I can think of is a hack.

I disagree about it being easy to tell what happens.  That's only true on a 
micro level.  On a macro level, it may work out ok, or it may cause an 
error to be raised at some point, or it may be completely silent and the 
data you send out is corrupted.  In which case, something even worse may 
happen when the data is used.  Like missing mars orbiters or crashed 
landers.  That does not sound like it is "easy to tell what happens" to me.

I think what Guido is thinking is we may need to keep str8 around (for a 
while) as a 'C' compatible string type for purposes of interfacing to 'C' code.

What I was thinking about was to simplify encoding and decoding and 
avoid issues that are caused by mismatched strings of *any* type.  A 
different problem set, that may need a different solution.


>> It's not too different from trying to find the current unicode and str8
>> issues in the py3k-struni branch.
> 
> This sentence I do not understand. What is not too different from
> trying to find issues?

It was a general statement reflecting on the process of converting the 
py3k-struni branch to unicode.

As I said above:

 >> ... it becomes an easy-to-make and
 >> hard-to-find silent error ...

In this case the errors are expected, but finding them is still difficult. 
  It's not quite the same thing, but I did say "not too different", meaning 
there are some differences.


>> Concatenating str8 and str types is a bit safer, as long as the str8 is
>> in in "the" default encoding, but it may still be an unintended implicit
>> conversion.  And if it's not in the default encoding, then all bets are
>> off again.
> 
> Sure. However, the str8 type will go away, and along with it all these
> issues.

Yes, hopefully it will, eventually, along with encoded strings in the wild, 
as well. But probably not immediately.

>> The use would be in ensuring the integrity of encoded strings.
>> Concatenating strings with different encodings could then produce
>> errors.
> 
> Ok. What about slicing?

Details... for which all of these can be solved.  Encoded string types as I 
described above can also know how to slice themselves correctly.
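
For instance, the EncodedStr sketch earlier could slice in character space
so a multi-byte sequence is never split (again, hypothetical):

    def __getitem__(self, index):
        # decode, slice by characters, then re-encode
        return type(self)(self.decode(self.codec)[index])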

Cheers and Regards,
    Ron

From martin at v.loewis.de  Sat Jun 16 01:00:10 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 16 Jun 2007 01:00:10 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46731502.9050400@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>	
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>	
	<4670A458.7050206@ronadam.com>	
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>	
	<4670C3C5.4070907@ronadam.com>
	<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
	<46721ABE.9090203@ronadam.com> <46722BBD.5060100@v.loewis.de>
	<46731502.9050400@ronadam.com>
Message-ID: <467319FA.8000000@v.loewis.de>

> This was in the context that it is decided by the community that a st8
> type is needed and it does not go away.

I think *that* context has not occurred. People wanted a read-only
bytes type, not a byte-oriented character string type.

> The alternative is for str8 to be replaced by bytes objects, which I
> believe was, and still is, the plan if possible.

That type is already implemented.

> The same semantic issues will also be present in bytes objects in one
> form or another when handling data acquired from sources that use
> encoded strings.  They don't go away even if str8 does go away.

No they don't. The bytes type doesn't have an encoding associated
with it, and it shouldn't. Values may not even represent text,
but, say, image data.

> It sort of depends on how someone wants to handle situations where
> encoded strings are encountered.  Do they decode them and convert
> everything to unicode and then convert back as needed for any output. 
> Or can they keep the data in the encoded form for the duration?  I
> expect different people will feel differently on this.

In Py3k, they will use the string type, because anything else will
just be too tedious.

>> As for creating str8 objects from bytes objects: If you want
>> the str8 object to carry an encoding, you would have to *specify*
>> the encoding when creating the str8 object, since the bytes object
>> does not have that information. This is *very* hard, as you may
>> not know what the encoding is when you need to create the str8
>> object.
> 
> True, and this also applies if you want to convert an already encoded
> bytes object to unicode as well.

Right, and therefore it can never be automatic - whereas the conversion
between a bytes object and a str8 object *could* be automatic otherwise
(assuming the str8 type survives at all).

> One approach is to possibly use a factory function that uses metaclasses
> or mixins to create these based either on a str base type or a bytes
> object.
> 
>      Latin1 = get_encoded_str_type('latin-1')
> 
>      s1 = Latin1('Hello ')
[snip]

I think I lost track now what problem precisely you are trying to solve.

>> It's easy to tell what happens now: the bytes of those input
>> strings are just appended; the result string does not follow
>> a consistent character encoding anymore. This answer does
>> not apply to your proposed modification, as it does not answer
>> what the value of the .encoding attribute of the str8 would be
>> after concatenation (likewise for slicing).
> 
> And what is the use of appending unlike encoded str8 types?

You may need to put encoded text into binary data, e.g. putting
a file name into a zip file. Some of the bytes will be utf-8
encoded, others will be cp437 encoded, others will be data structures
of the zip file, and the rest will be compressed bytes.

Likewise for constructing MIME messages: different pieces will
use different encodings.

> I think what Guido is thinking is we may need keep str8 around (for a
> while) as a 'C' compatible string type for purposes of interfacing to
> 'C' code.

That might be. I hope not, and I have plans to eliminate the need for
many such places (providing Unicode-oriented APIs in some cases,
and using the bytes type in other cases).

In cases where we still have char*, I think the API should specify that
this must be ASCII most of them time, with UTF-8 in selected other
cases; arbitrary binary data only when interfacing to the bytes
type.

Regards,
Martin

From rrr at ronadam.com  Sat Jun 16 02:31:02 2007
From: rrr at ronadam.com (Ron Adam)
Date: Fri, 15 Jun 2007 19:31:02 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <467319FA.8000000@v.loewis.de>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>	
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>	
	<4670A458.7050206@ronadam.com>	
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>	
	<4670C3C5.4070907@ronadam.com>
	<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
	<46721ABE.9090203@ronadam.com> <46722BBD.5060100@v.loewis.de>
	<46731502.9050400@ronadam.com> <467319FA.8000000@v.loewis.de>
Message-ID: <46732F46.8020701@ronadam.com>



Martin v. Löwis wrote:
>> This was in the context that it is decided by the community that a str8
>> type is needed and it does not go away.
> 
> I think *that* context has not occurred. People wanted a read-only
> bytes type, not a byte-oriented character string type.
> 
>> The alternative is for str8 to be replaced by bytes objects, which I
>> believe was, and still is, the plan if possible.
> 
> That type is already implemented.

But the actual replacing of str8 by bytes is still a work in progress.


>> The same semantic issues will also be present in bytes objects in one
>> form or another when handling data acquired from sources that use
>> encoded strings.  They don't go away even if str8 does go away.
> 
> No they don't. The bytes type doesn't have an encoding associated
> with it, and it shouldn't. Values may not even represent text,
> but, say, image data.

Right, and in the cases where the bytes are an encoded form of string data, 
you will need to be very careful about how it is sliced.  But this isn't 
any different for any other byte type data.  It's a low level interface 
meant to do low level things.  Which is good, we need that.


>> It sort of depends on how someone wants to handle situations where
>> encoded strings are encountered.  Do they decode them and convert
>> everything to unicode and then convert back as needed for any output?
>> Or can they keep the data in the encoded form for the duration?  I
>> expect different people will feel differently on this.
> 
> In Py3k, they will use the string type, because anything else will
> just be too tedious.

I agree, this will be the preferred way, and should be.

>>> As for creating str8 objects from bytes objects: If you want
>>> the str8 object to carry an encoding, you would have to *specify*
>>> the encoding when creating the str8 object, since the bytes object
>>> does not have that information. This is *very* hard, as you may
>>> not know what the encoding is when you need to create the str8
>>> object.
>> True, and this also applies if you want to convert an already encoded
>> bytes object to unicode as well.
> 
> Right, and therefore it can never be automatic - whereas the conversion
> between a bytes object and a str8 object *could* be automatic otherwise
> (assuming the str8 type survives at all).

But conversion between different encodings won't be automatic.  It will 
still be as tedious and confusing as it always has been.  The improvement 
Python 3000 makes here is that it may not be needed as often, with Unicode 
strings being the default.


>> One approach is to possibly use a factory function that uses metaclasses
>> or mixins to create these based either on a str base type or a bytes
>> object.
>>
>>      Latin1 = get_encoded_str_type('latin-1')
>>
>>      s1 = Latin1('Hello ')
> [snip]
> 
> I think I lost track now what problem precisely you are trying to solve.

A case of abstract motivation prompting a very general idea, which 
elicits subjective responses, which prompt even more concrete examples, 
etc...

The original motivation wasn't explicitly stated at the beginning and got 
lost.  ;-)

My primary reason for the suggestion is that maybe it can increase string 
data integrity and make finding errors easier.

This was just a thought in that direction.

A more specific example, and an issue much more relevant at this time, 
might be: should the conversion to bytes be automatic when combining str8 
and bytes?  (str and bytes in Python 2.6+.)

The first answer might be yes, since it's a one-to-one conversion.  But 
it's implicit.

 >>> str8('hello ') + b'world'
b'hello world'

 >>> b'hello ' + str8('world')
b'hello world'

That's clear enough, but what about...

 >>> ''.join(slist)
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string or Unicode, bytes found

And so starts yet another tedious session of tracing variables back to find 
where the bytes type actually occurred, which may not be obvious since it 
could have been an unintentional and implicit conversion.


>>> It's easy to tell what happens now: the bytes of those input
>>> strings are just appended; the result string does not follow
>>> a consistent character encoding anymore. This answer does
>>> not apply to your proposed modification, as it does not answer
>>> what the value of the .encoding attribute of the str8 would be
>>> after concatenation (likewise for slicing).
>> And what is the use of appending unlike encoded str8 types?
> 
> You may need to put encoded text into binary data, e.g. putting
> a file name into a zip file. Some of the bytes will be utf-8
> encoded, others will be cp437 encoded, others will be data structures
> of the zip file, and the rest will be compressed bytes.
> 
> Likewise for constructing MIME messages: different pieces will
> use different encodings.

Wouldn't you need some sort of wrapper in these cases to indicate what the 
encoding is and where it starts and stops?

So even in binary data, extracting it to bytes and then decoding each 
section to its particular encoded type should not be a problem.  Same goes 
for the other way around.

For text encoded data within other text encoded data, it's a nested encoding 
that needs to be decoded in the correct sequence, not a sequential encoding 
that is done and appended together as is.  Is that correct?  And it still 
needs headers (or something equivalent) to indicate its encoding, start, and 
length.  What am I missing?
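
A tiny sketch of the situation being described (illustrative layout only, 
not the real zip format):

    name_utf8 = 'naïve.txt'.encode('utf-8')     # utf-8 encoded name
    name_cp437 = 'README.TXT'.encode('cp437')   # cp437 encoded name
    blob = name_utf8 + b'\x00' + name_cp437 + b'\x00' + b'<compressed>'

    # The blob carries no encoding of its own; the container format has
    # to say where each field starts and how to decode it.
    first, second, rest = blob.split(b'\x00', 2)
    assert first.decode('utf-8') == 'naïve.txt'
    assert second.decode('cp437') == 'README.TXT'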


Cheers,
   Ron






From rrr at ronadam.com  Sun Jun 17 04:38:12 2007
From: rrr at ronadam.com (Ron Adam)
Date: Sat, 16 Jun 2007 21:38:12 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<4668D535.7020103@v.loewis.de>	
	<ca471dc20706072106yd4c9045w750f8be8ee64e011@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>	
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>	
	<4670A458.7050206@ronadam.com>	
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>	
	<4670C3C5.4070907@ronadam.com>
	<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
Message-ID: <46749E94.5010301@ronadam.com>



Guido van Rossum wrote:
> On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
>> Attached both the str8 repr as s"..." and s'...', and the latest
>> no_raw_escape patch which I think is complete now and should apply 
>> with no
>> problems.
> 
> I like the str8 repr patch enough to check it in.
> 
>> I tracked the random fails I am having in test_tokenize.py down to it 
>> doing
>> a round trip on random test_*.py files.  If one of those files has a
>> problem it causes test_tokenize.py to fail also.  So I added a line to 
>> the
>> test to output the file name it does the round trip on so those can be
>> fixed as they are found.
>>
>> Let me know if it needs to be adjusted or something doesn't look right.
> 
> Well, I'm still philosophically uneasy with r'\' being a valid string
> literal, for various reasons (one being that writing a string parser
> becomes harder and harder). I definitely want r'\u1234' to be a
> 6-character string, however. Do you have a patch that does just that?
> (We can argue over the rest later in a larger forum.)

The str8 patch caused tokenize.py to fail again also.  ;-)  Those s'' == '' 
asserts, of course.

I tracked it down to cStringIO.c only returning str8 types.  Fixing that 
*may* fix a number of other modules as well, but I'm not sure how, so I put 
a str() around the returned value in tokenize.py with a note for now.

The attached patch has various minor fixes.  (But not the no raw escape 
stuff yet.)

Cheers,
    Ron



tokenize.py - Get rid of s'' errors by converting the value returned from 
cStringIO.c to unicode with str().

test_tokenize.py - Added printing of roundtrip file names to the test.  This 
is needed because the files are a random sample and if they have errors, 
they cause this module to fail too.  Without this there is no way to tell 
what is going on.

_fileio.c - Return unicode strings instead of str8 strings.  (check this one.)

smtplib - Fixed strip() without args on bytes.

test_fileinput.py - Replaced a bad writelines() call with a for loop calling 
write().  (The buffer object doesn't have a writelines method, and 
try-finally hid the error.)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: variousfixes.diff
Type: text/x-patch
Size: 3447 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070616/acfc33d9/attachment.bin 

From guido at python.org  Mon Jun 18 20:37:44 2007
From: guido at python.org (Guido van Rossum)
Date: Mon, 18 Jun 2007 11:37:44 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46749E94.5010301@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>
	<466E4B22.6020408@ronadam.com>
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>
	<46708286.6090201@ronadam.com>
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>
	<4670A458.7050206@ronadam.com>
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>
	<4670C3C5.4070907@ronadam.com>
	<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>
	<46749E94.5010301@ronadam.com>
Message-ID: <ca471dc20706181137h3b46a17eu3e49112303b22e1d@mail.gmail.com>

Thanks for the patches!  Applied, except for the change to
tokenize.py; instead, I changed test_tokenize.py to use io.StringIO.

--Guido

On 6/16/07, Ron Adam <rrr at ronadam.com> wrote:
>
>
> Guido van Rossum wrote:
> > On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
> >> Attached both the str8 repr as s"..." and s'...', and the latest
> >> no_raw_escape patch which I think is complete now and should apply
> >> with no
> >> problems.
> >
> > I like the str8 repr patch enough to check it in.
> >
> >> I tracked the random fails I am having in test_tokenize.py down to it
> >> doing
> >> a round trip on random test_*.py files.  If one of those files has a
> >> problem it causes test_tokenize.py to fail also.  So I added a line to
> >> the
> >> test to output the file name it does the round trip on so those can be
> >> fixed as they are found.
> >>
> >> Let me know if it needs to be adjusted or something doesn't look right.
> >
> > Well, I'm still philosophically uneasy with r'\' being a valid string
> > literal, for various reasons (one being that writing a string parser
> > becomes harder and harder). I definitely want r'\u1234' to be a
> > 6-character string, however. Do you have a patch that does just that?
> > (We can argue over the rest later in a larger forum.)
>
> The str8 patch caused tokenize.py to fail again also.  ;-)  Those s'' == ''
> asserts, of course.
>
> I tracked it down to cStringIO.c only returning str8 types.  Fixing that
> *may* fix a number of other modules as well, but I'm not sure how, so I put
> a str() around the returned value in tokenize.py with a note for now.
>
> The attached patch has various minor fixes.  (But not the no raw escape
> stuff yet.)
>
> Cheers,
>     Ron
>
>
>
> tokenize.py - Get rid of s'' errors by converting the value returned from
> cStringIO.c to unicode with str().
>
> test_tokenize.py - Added printing of roundtrip file names to the test.  This is
> needed because the files are a random sample and if they have errors, they cause
> this module to fail too.  Without this there is no way to tell what is going on.
>
> _fileio.c - Return unicode strings instead of str8 strings.  (check this one.)
>
> smtplib - Fixed strip() without args on bytes.
>
> test_fileinput.py - Replaced a bad writelines() call with a for loop calling
> write().  (The buffer object doesn't have a writelines method, and
> try-finally hid the error.)
>
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Tue Jun 19 08:32:59 2007
From: guido at python.org (Guido van Rossum)
Date: Mon, 18 Jun 2007 23:32:59 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
Message-ID: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>

I've written up a comprehensive status report on Python 3000. Please read:

http://www.artima.com/weblogs/viewpost.jsp?thread=208549

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rrr at ronadam.com  Tue Jun 19 12:04:27 2007
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 19 Jun 2007 05:04:27 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <ca471dc20706181137h3b46a17eu3e49112303b22e1d@mail.gmail.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>	
	<466E4B22.6020408@ronadam.com>	
	<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>	
	<46708286.6090201@ronadam.com>	
	<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>	
	<4670A458.7050206@ronadam.com>	
	<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>	
	<4670C3C5.4070907@ronadam.com>	
	<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>	
	<46749E94.5010301@ronadam.com>
	<ca471dc20706181137h3b46a17eu3e49112303b22e1d@mail.gmail.com>
Message-ID: <4677AA2B.8000704@ronadam.com>



Guido van Rossum wrote:
> Thanks for the patches!  Applied, except for the change to
> tokenize.py; instead, I changed test_tokenize.py to use io.StringIO.
> 
> --Guido

Glad to have the opportunity to help make the future happen. ;-)


This next one converts unicode literals in tokenize.py and its tests to 
byte literals.  I've also fixed some more unicode literals in a few other 
places I found.

Doing this first will keep the no raw escape patches from including 
anything else.

Cheers,
    Ron


M      Lib/tokenize.py
M      Lib/test/tokenize_tests.txt
M      Lib/test/output/test_tokenize
- Removed unicode literals from test results and tokenize.py, and made it 
pass again.


M      Lib/test/output/test_pep277
- Removed unicode literals from test results.  This is a windows only test, 
so I can't test it.

M      Lib/test/test_codeccallbacks.py
M      Objects/exceptions.c
- Removed unicode literals from test_codeccallbacks.py and removed unicode 
literal quoting from exceptions.c to make it pass again.

M      Lib/test/test_codecs.py
M      Lib/test/test_doctest.py
M      Lib/test/re_tests.py
- Removed some literals from comments.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: variousfixes2.diff
Type: text/x-patch
Size: 14354 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070619/749db319/attachment.bin 

From ncoghlan at gmail.com  Tue Jun 19 13:57:44 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 19 Jun 2007 21:57:44 +1000
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <f5856m$h2q$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<f5856m$h2q$1@sea.gmane.org>
Message-ID: <4677C4B8.8010508@gmail.com>

Georg Brandl wrote:
> Guido van Rossum schrieb:
>> I've written up a comprehensive status report on Python 3000. Please read:
>>
>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549
> 
> Thank you! Now I have something to show to interested people other than
> "read the PEPs".
> 
> A minuscule nit: the rot13 codec has no library equivalent, so it won't be
> supported anymore :)

Given that there are valid use cases for bytes-to-bytes translations, 
and a common API for them would be nice, does it make sense to have an 
additional category of codec that is invoked via specific recoding 
methods on bytes objects? For example:

   encoded = data.encode_bytes('bz2')
   decoded = encoded.decode_bytes('bz2')
   assert data == decoded
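
A minimal sketch of how such methods might work on top of the existing 
bz2 module (hypothetical helper names, not an agreed API):

    import bz2

    _byte_codecs = {
        'bz2': (bz2.compress, bz2.decompress),
    }

    def encode_bytes(data, name):
        encoder, _ = _byte_codecs[name]
        return encoder(data)

    def decode_bytes(data, name):
        _, decoder = _byte_codecs[name]
        return decoder(data)

    assert decode_bytes(encode_bytes(b'payload', 'bz2'), 'bz2') == b'payload'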

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From gabor at nekomancer.net  Tue Jun 19 13:54:32 2007
From: gabor at nekomancer.net (Gábor Farkas)
Date: Tue, 19 Jun 2007 13:54:32 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
Message-ID: <4677C3F8.3050305@nekomancer.net>

Guido van Rossum wrote:
> I've written up a comprehensive status report on Python 3000. Please read:
> 
> http://www.artima.com/weblogs/viewpost.jsp?thread=208549
> 

why do map and filter stay, but reduce leave?

i understand that some people think that an explicit for-loop is more 
understandable, but many people also claim that list-comprehensions are 
more understandable than map/filter, and map/filter can be trivially 
written using list-comprehensions.. so why  _reduce_?

gabor

From g.brandl at gmx.net  Tue Jun 19 14:20:06 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Tue, 19 Jun 2007 14:20:06 +0200
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <4677C4B8.8010508@gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<f5856m$h2q$1@sea.gmane.org>
	<4677C4B8.8010508@gmail.com>
Message-ID: <f58hlj$sri$1@sea.gmane.org>

Nick Coghlan schrieb:
> Georg Brandl wrote:
>> Guido van Rossum schrieb:
>>> I've written up a comprehensive status report on Python 3000. Please read:
>>>
>>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549
>> 
>> Thank you! Now I have something to show to interested people other than
>> "read the PEPs".
>> 
>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be
>> supported anymore :)
> 
> Given that there are valid use cases for bytes-to-bytes translations, 
> and a common API for them would be nice, does it make sense to have an 
> additional category of codec that is invoked via specific recoding 
> methods on bytes objects? For example:
> 
>    encoded = data.encode_bytes('bz2')
>    decoded = encoded.decode_bytes('bz2')
>    assert data == decoded

This is exactly what I proposed a while ago under the name
bytes.transform().

IMO it would make a common use pattern much more convenient and
should be given thought.

If a PEP is called for, I'd be happy to at least co-author it.

Georg



From walter at livinglogic.de  Tue Jun 19 14:40:57 2007
From: walter at livinglogic.de (Walter Dörwald)
Date: Tue, 19 Jun 2007 14:40:57 +0200
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <f58hlj$sri$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<f5856m$h2q$1@sea.gmane.org>	<4677C4B8.8010508@gmail.com>
	<f58hlj$sri$1@sea.gmane.org>
Message-ID: <4677CED9.1060800@livinglogic.de>

Georg Brandl wrote:
> Nick Coghlan schrieb:
>> Georg Brandl wrote:
>>> Guido van Rossum schrieb:
>>>> I've written up a comprehensive status report on Python 3000. Please read:
>>>>
>>>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549
>>> Thank you! Now I have something to show to interested people other than
>>> "read the PEPs".
>>>
>>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be
>>> supported anymore :)
>> Given that there are valid use cases for bytes-to-bytes translations, 
>> and a common API for them would be nice, does it make sense to have an 
>> additional category of codec that is invoked via specific recoding 
>> methods on bytes objects? For example:
>>
>>    encoded = data.encode_bytes('bz2')
>>    decoded = encoded.decode_bytes('bz2')
>>    assert data == decoded
> 
> This is exactly what I proposed a while ago under the name
> bytes.transform().
> 
> IMO it would make a common use pattern much more convenient and
> should be given thought.
> 
> If a PEP is called for, I'd be happy to at least co-author it.

Codecs are a major exception to Guido's law: Never have a parameter
whose value switches between completely unrelated algorithms.

Why don't we put all string transformation functions into a common
module (the string module might be a good place):

>>> import string
>>> string.rot13('abc')
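
As a stand-alone sketch of what such a function could look like 
(illustrative only; nothing here is an agreed spelling or location):

    from string import ascii_lowercase as low, ascii_uppercase as up

    _ROT13 = str.maketrans(low + up, low[13:] + low[:13] + up[13:] + up[:13])

    def rot13(s):
        return s.translate(_ROT13)

    assert rot13(rot13('abc')) == 'abc'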

Servus,
   Walter

From mal at egenix.com  Tue Jun 19 15:19:50 2007
From: mal at egenix.com (M.-A. Lemburg)
Date: Tue, 19 Jun 2007 15:19:50 +0200
Subject: [Python-3000] [Python-Dev]   Python 3000 Status Update (Long!)
In-Reply-To: <4677CED9.1060800@livinglogic.de>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<f5856m$h2q$1@sea.gmane.org>	<4677C4B8.8010508@gmail.com>	<f58hlj$sri$1@sea.gmane.org>
	<4677CED9.1060800@livinglogic.de>
Message-ID: <4677D7F6.3040304@egenix.com>

On 2007-06-19 14:40, Walter Dörwald wrote:
> Georg Brandl wrote:
>>>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be
>>>> supported anymore :)
>>> Given that there are valid use cases for bytes-to-bytes translations, 
>>> and a common API for them would be nice, does it make sense to have an 
>>> additional category of codec that is invoked via specific recoding 
>>> methods on bytes objects? For example:
>>>
>>>    encoded = data.encode_bytes('bz2')
>>>    decoded = encoded.decode_bytes('bz2')
>>>    assert data == decoded
>> This is exactly what I proposed a while ago under the name
>> bytes.transform().
>>
>> IMO it would make a common use pattern much more convenient and
>> should be given thought.
>>
>> If a PEP is called for, I'd be happy to at least co-author it.
> 
> Codecs are a major exception to Guido's law: Never have a parameter
> whose value switches between completely unrelated algorithms.

I don't see much of a problem with that. Parameters are
per se intended to change the behavior of a function or
method.

Note that you are referring to the .encode() and .decode()
methods - these are just easy to use interfaces to the codecs
registered in the system.

The codec design allows for different input and output
types as it doesn't impose restrictions on these. Codecs
are more general in that respect: they don't just deal
with Unicode encodings, it's a more general approach
that also works with other kinds of data types.

The access methods, OTOH, can impose restrictions, and probably
should, to restrict the return types to a predictable set.

> Why don't we put all string transformation functions into a common
> module (the string module might be a good place):
> 
>>>> import string
>>>> string.rot13('abc')

I think the string module will have to go away. It doesn't
really separate between text and bytes data.

Adding more confusion will not really help with making
this distinction clear, either, I'm afraid.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jun 19 2007)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2007-07-09: EuroPython 2007, Vilnius, Lithuania            19 days to go

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611

From him at online.de  Tue Jun 19 15:56:39 2007
From: him at online.de (Joachim König)
Date: Tue, 19 Jun 2007 15:56:39 +0200
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
Message-ID: <4677E097.5060205@online.de>

Guido van Rossum schrieb:
> I've written up a comprehensive status report on Python 3000. Please read:
>
> http://www.artima.com/weblogs/viewpost.jsp?thread=208549
>
>   
Nice summary, thanks.

I'm sure it has been proposed before (and I've googled for it but did 
not find it),
but could someone enlighten me why

{,}

can't be used for the empty set, analogous to the empty tuple (,)?

No, I do not want to start a new discussion about it.

Joachim

From g.brandl at gmx.net  Tue Jun 19 15:03:26 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Tue, 19 Jun 2007 15:03:26 +0200
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <4677CED9.1060800@livinglogic.de>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<f5856m$h2q$1@sea.gmane.org>	<4677C4B8.8010508@gmail.com>	<f58hlj$sri$1@sea.gmane.org>
	<4677CED9.1060800@livinglogic.de>
Message-ID: <f58k6r$6fv$1@sea.gmane.org>

Walter Dörwald schrieb:
> Georg Brandl wrote:
>> Nick Coghlan schrieb:
>>> Georg Brandl wrote:
>>>> Guido van Rossum schrieb:
>>>>> I've written up a comprehensive status report on Python 3000. Please read:
>>>>>
>>>>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549
>>>> Thank you! Now I have something to show to interested people other than
>>>> "read the PEPs".
>>>>
>>>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be
>>>> supported anymore :)
>>> Given that there are valid use cases for bytes-to-bytes translations, 
>>> and a common API for them would be nice, does it make sense to have an 
>>> additional category of codec that is invoked via specific recoding 
>>> methods on bytes objects? For example:
>>>
>>>    encoded = data.encode_bytes('bz2')
>>>    decoded = encoded.decode_bytes('bz2')
>>>    assert data == decoded
>> 
>> This is exactly what I proposed a while ago under the name
>> bytes.transform().
>> 
>> IMO it would make a common use pattern much more convenient and
>> should be given thought.
>> 
>> If a PEP is called for, I'd be happy to at least co-author it.
> 
> Codecs are a major exception to Guido's law: Never have a parameter
> whose value switches between completely unrelated algorithms.

I don't think that applies here. This is more like __import__():
depending on the first parameter, completely different things can happen.
Yes, the same import algorithm is used, but in the case of
bytes.encode_bytes, the same algorithm is used to find and execute the
codec.

Georg


From lists at cheimes.de  Tue Jun 19 15:05:42 2007
From: lists at cheimes.de (Christian Heimes)
Date: Tue, 19 Jun 2007 15:05:42 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4677C3F8.3050305@nekomancer.net>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net>
Message-ID: <f58kb7$6vp$1@sea.gmane.org>

Gábor Farkas wrote:
> why do map and filter stay, but reduce leave?
> 
> i understand that some people think that an explicit for-loop is more 
> understandable, but many people also claim that list-comprehensions are 
> more understandable than map/filter, and map/filter can be trivially 
> written using list-comprehensions.. so why  _reduce_?

Don't worry, it wasn't completely removed. Reduce was moved to functools

$ ./python
Python 3.0x (p3yk:56022, Jun 18 2007, 21:10:13)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> map
<built-in function map>
>>> filter
<built-in function filter>
>>> from functools import reduce
>>> reduce
<built-in function reduce>


From benji at benjiyork.com  Tue Jun 19 16:37:00 2007
From: benji at benjiyork.com (Benji York)
Date: Tue, 19 Jun 2007 10:37:00 -0400
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <4677E097.5060205@online.de>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677E097.5060205@online.de>
Message-ID: <4677EA0C.3020107@benjiyork.com>

Joachim König wrote:
> could someone enlighten me why
> 
> {,}
> 
> can't be used for the empty set, analogous to the empty tuple (,)?

Partially because (,) is not the empty tuple, () is.
-- 
Benji York
http://benjiyork.com

From walter at livinglogic.de  Tue Jun 19 16:45:46 2007
From: walter at livinglogic.de (Walter Dörwald)
Date: Tue, 19 Jun 2007 16:45:46 +0200
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <f58k6r$6fv$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<f5856m$h2q$1@sea.gmane.org>	<4677C4B8.8010508@gmail.com>	<f58hlj$sri$1@sea.gmane.org>	<4677CED9.1060800@livinglogic.de>
	<f58k6r$6fv$1@sea.gmane.org>
Message-ID: <4677EC1A.10306@livinglogic.de>

Georg Brandl wrote:
Walter Dörwald schrieb:
>> Georg Brandl wrote:
>>> Nick Coghlan schrieb:
>>>> Georg Brandl wrote:
>>>>> Guido van Rossum schrieb:
>>>>>> I've written up a comprehensive status report on Python 3000. Please read:
>>>>>>
>>>>>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549
>>>>> Thank you! Now I have something to show to interested people other than
>>>>> "read the PEPs".
>>>>>
>>>>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be
>>>>> supported anymore :)
>>>> Given that there are valid use cases for bytes-to-bytes translations, 
>>>> and a common API for them would be nice, does it make sense to have an 
>>>> additional category of codec that is invoked via specific recoding 
>>>> methods on bytes objects? For example:
>>>>
>>>>    encoded = data.encode_bytes('bz2')
>>>>    decoded = encoded.decode_bytes('bz2')
>>>>    assert data == decoded
>>> This is exactly what I proposed a while ago under the name
>>> bytes.transform().
>>>
>>> IMO it would make a common use pattern much more convenient and
>>> should be given thought.
>>>
>>> If a PEP is called for, I'd be happy to at least co-author it.
>> Codecs are a major exception to Guido's law: Never have a parameter
>> whose value switches between completely unrelated algorithms.
> 
> I don't think that applies here. This is more like __import__():
> depending on the first parameter, completely different things can happen.
> Yes, the same import algorithm is used, but in the case of
> bytes.encode_bytes, the same algorithm is used to find and execute the
> codec.

What would a registry of transformation algorithms buy us compared to a
module with transformation functions?

The function version is shorter:

   transform.rot13('foo')

compared to:

   'foo'.transform('rot13')

If each transformation has its own function, these functions can have
their own arguments, e.g.
   transform.bz2encode(data: bytes, level: int=6) -> bytes

Of course str.transform() could pass along all arguments to the
registered function, but that's worse from a documentation viewpoint,
because the real signature is hidden deep in the registry.

Servus,
   Walter

From brandon at rhodesmill.org  Tue Jun 19 16:43:40 2007
From: brandon at rhodesmill.org (Brandon Craig Rhodes)
Date: Tue, 19 Jun 2007 10:43:40 -0400
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <4677E097.5060205@online.de> (Joachim König's
	message of "Tue, 19 Jun 2007 15:56:39 +0200")
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677E097.5060205@online.de>
Message-ID: <87bqfcj97n.fsf@ten22.rhodesmill.org>

Joachim König <him at online.de> writes:

> ... could someone enlighten me why
>
> {,}
>
> can't be used for the empty set, analogous to the empty tuple (,)?

And now that someone else has broken the ice regarding questions that
have probably been exhausted already, I want to comment that Python 3k
seems to perpetuate a vast asymmetry.  Observe:

(a) Syntactic constructors

 [ 1,2,3 ]   works
 { 1,2,3 }   works
 { 1:1, 2:4, 3:9 }   works

(b) Generators + constructor functions

 list(i for i in (1,2,3))   works
 set(i for i in (1,2,3))   works
 dict((i,i*i) for i in (1,2,3))   works

(c) Comprehensions

 [ i for i in (1,2,3) ]   works
 { i for i in (1,2,3) }   works
 { i:i*i for i in (1,2,3) }   raises a SyntaxError!

This seems offensive.  It spells trouble for new students, who will
have to simply memorize which of the three syntactically-supported
containers support comprehensions and which do not.  It spells trouble
when trying to explain Python to seasoned programmers, who will think
that they detect trouble in a language that breaks obvious symmetries
over something so basic as instantiating container types.

The PEP for dictionary comprehensions, when I last reviewed it, argued
that dict comprehensions are unnecessary, because we have generators
now.  It seems to me that either:

 1) The grounds for rejecting dict comprehensions are valid, and
    therefore should be extended so that everything in (c) above goes
    away.  That is, if generators + built-in constructor functions are
    such a great solution for creating dicts, then list comprehensions
    and set comprehensions should both go away as well in favor of
    generators.  The language would become simpler, the parser would
    become simpler, and Python would be easier to learn.

 2) The grounds for rejecting dict comprehensions are invalid, so they
    should be introduced in Python 3k so that everything in (c) works.

Given that Python 3k is making such strides in other areas where cruft
and asymmetry needed to be removed, it would seem a shame to leave the
container types in such disarray.
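
For concreteness, the missing form and its current workaround, side by 
side (the first line is the syntax being argued for, not current Py3k):

    { i: i*i for i in (1,2,3) }         # the comprehension being asked for
    dict((i, i*i) for i in (1,2,3))     # the generator-based spelling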

-- 
Brandon Craig Rhodes   brandon at rhodesmill.org   http://rhodesmill.org/brandon

From guido at python.org  Tue Jun 19 17:20:25 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 19 Jun 2007 08:20:25 -0700
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string
	formatting)
In-Reply-To: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
Message-ID: <ca471dc20706190820n7715fc30jeafcffd14c6b5623@mail.gmail.com>

Those are valid concerns. I'm cross-posting this to the python-3000
list in the hope that the PEP's author and defendents can respond. I'm
sure we can work something out.

Please keep further discussion on the python-3000 at python.org list.

--Guido

On 6/19/07, Chris McDonough <chrism at plope.com> wrote:
> Wrt http://www.python.org/dev/peps/pep-3101/
>
> PEP 3101 says Py3K should allow item and attribute access syntax
> within string templating expressions but "to limit potential security
> issues", access to underscore prefixed names within attribute/item
> access expressions will be disallowed.
>
> I am a person who has lived with the aftermath of a framework
> designed to prevent data access by restricting access to underscore-
> prefixed names (Zope 2, ahem), and I've found it's very hard to
> explain and justify.  As a result, I feel that this is a poor default
> policy choice for a framework.
>
> In some cases, underscore names must become part of an object's
> external interface.  Consider a URL with one or more underscore-
> prefixed path segment elements (because prefixing a filename with an
> underscore is a perfectly reasonable thing to do on a filesystem, and
> path elements are often named after file names) fed to a traversal
> algorithm that attempts to resolve each path element into an object
> by calling __getitem__ against the parent found by the last path
> element's traversal result.  Perhaps this is poor design and
> __getitem__ should not be consulted here, but I doubt that highly
> because there's nothing particularly special about calling a method
> named __getitem__ as opposed to some method named "traverse".
>
> The only precedent within Python 2 for this sort of behavior is
> limiting access to variables that begin with __ and which do not end
> with __ to the scope defined by a class and its instances.  I
> personally don't believe this is a very useful feature, but it's
> still only an advisory policy and you can worm around it with enough
> gyrations.
>
> Given that security is a concern at all, the only truly reasonable
> way to "limit security issues" is to disallow item and attribute
> access completely within the string templating expression syntax.  It
> seems gratuituous to me to encourage string templating expressions
> with item/attribute access, given that you could do it within the
> format arguments just as easily in the 99% case, and we've (well...
> I've) happily been living with that restriction for years now.
>
> But if this syntax is preserved, there really should be no *default*
> restrictions on the traversable names within an expression because
> this will almost certainly become a hard-to-explain, hard-to-justify
> bug magnet as it has become in Zope.
>
> - C
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From him at online.de  Tue Jun 19 16:49:10 2007
From: him at online.de (=?ISO-8859-1?Q?Joachim_K=F6nig?=)
Date: Tue, 19 Jun 2007 16:49:10 +0200
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <4677EA0C.3020107@benjiyork.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677E097.5060205@online.de> <4677EA0C.3020107@benjiyork.com>
Message-ID: <4677ECE6.2040402@online.de>

Benji York schrieb:
> Joachim König wrote:
>> could someone enlighten me why
>>
>> {,}
>>
>> can't be used for the empty set, analogous to the empty tuple (,)?
>
> Partially because (,) is not the empty tuple, () is.
Oh, yes, of course. I was thinking of (x) vs. (x,), and that the comma
after the last element is optional if len() > 1 but required when len() 
== 1, and forgot that it is forbidden when len() == 0.

Sorry about that.

Joachim

From jimjjewett at gmail.com  Tue Jun 19 17:29:30 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 19 Jun 2007 11:29:30 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4677C3F8.3050305@nekomancer.net>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net>
Message-ID: <fb6fbf560706190829s68cb9c95v8b478e21f21ecd0a@mail.gmail.com>

On 6/19/07, Gábor Farkas <gabor at nekomancer.net> wrote:
> Guido van Rossum wrote:
> > I've written up a comprehensive status report on Python 3000. Please read:

> > http://www.artima.com/weblogs/viewpost.jsp?thread=208549

> why do map and filter stay, but reduce leave?

> i understand that some people think that an explicit for-loop is more
> understandable, but many people also claim that list-comprehensions are
> more understandable than map/filter, and map/filter can be trivially
> written using list-comprehensions.. so why  _reduce_?

Note:  these are my opinions, which may be unrelated to Guido's reasoning.

In practice, reduce is almost always difficult to read and understand.
There are counterexamples, but they tend to already be written
without reduce.  (They may use sum instead of a for loop, but they
don't use reduce, unless they are intended as an example of reduce
usage.)

filter is at least well-named; no one has any doubts over what it is doing.

map over a simple function is better written as a list comprehension,
but if the function is complicated, has side effects, sends output to
multiple places ... then map is probably a less-bad choice.

-jJ

From janssen at parc.com  Tue Jun 19 18:34:32 2007
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Jun 2007 09:34:32 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <f58kb7$6vp$1@sea.gmane.org> 
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
Message-ID: <07Jun19.093441pdt."57996"@synergy1.parc.xerox.com>

> > written using list-comprehensions.. so why  _reduce_?
> 
> Don't worry, it wasn't completely removed. Reduce was moved to functools

Though, really, same question!  There are functional equivalents (list
comprehensions) for "map" and "filter", but not for "reduce".
Shouldn't "reduce" stay in the 'built-in' space, while the other two
move to "functools"?  Or move them all to "functools"?  Bizarre
recombination, IMO.

Bill

From collinw at gmail.com  Tue Jun 19 18:46:08 2007
From: collinw at gmail.com (Collin Winter)
Date: Tue, 19 Jun 2007 09:46:08 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <7301715244131583311@unknownmsgid>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
Message-ID: <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>

On 6/19/07, Bill Janssen <janssen at parc.com> wrote:
> > > written using list-comprehensions.. so why  _reduce_?
> >
> > Don't worry, it wasn't completely removed. Reduce was moved to functools
>
> Though, really, same question!  There are functional equivalents (list
> comprehensions) for "map" and "filter", but not for "reduce".

There is a range of list comprehensions that are more
readably/concisely expressed as calls to map or filter:

[f(x) for x in y] -> map(f, y)
[x for x in y if f(x)] -> filter(f, y)

Turning a for loop into the equivalent reduce() may be more concise,
but as Guido has remarked before, someone new to your code generally
has to break out pen and paper to figure out what's going on.
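
A small illustration of the contrast (a sketch, not from any particular 
codebase):

    total = reduce(lambda acc, x: acc + x*x, y, 0)   # takes some decoding
    total = sum(x*x for x in y)                      # reads at a glance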

> Shouldn't "reduce" stay in the 'built-in' space, while the other two
> move to "functools"?  Or move them all to "functools"?  Bizarre
> recombination, IMO.

Arguing from the standpoint of purity that "these functions are
builtins, why not this other one?" isn't going to get you very far.
Another data point to consider is that map and filter are used far,
far more often than reduce (100000 and 62000 usages to 10000, says
Google Code Search), so there's more resistance to moving them.

Collin Winter

From janssen at parc.com  Tue Jun 19 19:51:15 2007
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Jun 2007 10:51:15 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> 
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
Message-ID: <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>

> > Shouldn't "reduce" stay in the 'built-in' space, while the other two
> > move to "functools"?  Or move them all to "functools"?  Bizarre
> > recombination, IMO.
> 
> Arguing from the standpoint of purity, that, "these functions are
> builtins, why not this other one" isn't going to get you very far.

If you think that's what I was arguing, you'd better re-read that
message.

Though, from the standpoint of pragmatism, removing "reduce" from the
built-in space will break code (*my* code, among others), and leaving
it in will not affect "purity", as both "map" and "filter" are being
left in.  So leaving it alone seems the more Pythonic response to me.

Guido's argument
(http://www.artima.com/weblogs/viewpost.jsp?thread=98196) is that
"any" and "all" (and "filter", of course) are better ways to do the
same thing.  I'm not sure, but it's an interesting hypothesis.  But
while we run the experiment, why not leave "reduce" where it is?

Bill

From eric+python-dev at trueblade.com  Tue Jun 19 19:55:07 2007
From: eric+python-dev at trueblade.com (Eric V. Smith)
Date: Tue, 19 Jun 2007 13:55:07 -0400
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
Message-ID: <4678187B.2060402@trueblade.com>

Guido van Rossum wrote:
> I've written up a comprehensive status report on Python 3000. Please read:
> 
> http://www.artima.com/weblogs/viewpost.jsp?thread=208549

I think this sentence:

"Python 2.6 will contain backported versions of many Py3k features, 
either enabled through __future__ statements or simply by allowing old 
and new syntax to be used side-by-side (if the new syntax would be a 
syntax error in 2.x)."

Should end with "syntax error in 2.5", not "syntax error in 2.x".  Or, 
state that x <= 5, in this sentence only.  But I think we really mean 
exactly 2.5.

Eric.

From janssen at parc.com  Tue Jun 19 20:02:32 2007
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Jun 2007 11:02:32 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> 
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
Message-ID: <07Jun19.110237pdt."57996"@synergy1.parc.xerox.com>

And, while I'm at it, why isn't there a built-in function called
"output()", which matches "input()", that is, it's equivalent to

     import sys
     sys.stdout.write(MESSAGE)

It could be easily implemented in terms of the built-in function
called "print".  The fact that it's not there is going to confuse
the heck out of the same audience "input" was designed for.

I realize that there are good individual reasons for each of these
point decisions; my fear is that by making them individually, we make
the task of keeping Python in one's head unacceptably complex.

Bill

From mike.klaas at gmail.com  Tue Jun 19 20:19:09 2007
From: mike.klaas at gmail.com (Mike Klaas)
Date: Tue, 19 Jun 2007 11:19:09 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>

On 19-Jun-07, at 10:51 AM, Bill Janssen wrote:
>

> Though, from the standpoint of pragmatism, removing "reduce" from the
> built-in space will break code (*my* code, among others), and leaving
> it in will not affect "purity", as both "map" and "reduce" are being
> left in.  So leaving it alone seems the more Pythonic response to me.

map (especially the new iterized version) is a frequently-used  
builtin, while reduce is a rarely-used builtin that requires some  
head-wrapping.  It makes sense to me to move it out of builtins.

To pick a codebase at random (mine):

$ find -name \*.py | xargs wc -l
  137952 total

$ pygrep map\( | wc -l
220

$ pygrep imap\( | wc -l
13

$ pygrep reduce\( | wc -l
2

Of the two uses of reduce(), one is in a unittest that should be  
using any():

         self.assertTrue(not reduce((lambda b1, b2: b1 or b2), ...

and the other is a tricky combination of callable "iterator filters"  
that looks something like this:

         df = lambda itr: reduce(lambda x, f: f(x), filter_list, itr)

this isn't the clearest piece of code, even with more explanation.   
It would require a multi-line inner-function generator to replace  
it.  I have no qualms importing reduce for such uses.
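
For what it's worth, a loop-based replacement for that last one might look 
like this (assuming the same filter_list):

    def apply_filters(itr):
        for f in filter_list:
            itr = f(itr)
        return itr

Longer, but arguably easier to follow.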

In contrast, partial(), which should see less use since most of the 
codebase was written pre-2.5, and which requires an import, is used four 
times:

$ pygrep partial\( | wc -l
4

-Mike


From janssen at parc.com  Tue Jun 19 20:47:46 2007
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Jun 2007 11:47:46 PDT
Subject: [Python-3000] On PEP 3116:  new I/O base classes
Message-ID: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>

A few comments here:

I'd get rid of "readinto", and just make the buffer an optional
argument to "read".  If you keep "readinto", why not rename "write" to
"writefrom"?

The "seek" method on RawIOBase is awfully quaint and UNIX-y, what with
the "whence" argument.  It could be made considerably more Pythonic by
splitting it into two methods:

  .seek(POS: int)

where positive values for POS are from the beginning of the file, and
negative values of POS are from the end of the file, and

  .nudge(POS: int)

where the value of POS, positive or negative, is from the current
location.  Or call the two methods "seek_absolute" and "seek_relative".
Of course, you don't really need "nudge" if you have "tell".

Might even rename "seek" to "position".  And I'd consider putting these
two methods in a separate mixin; lots of file-like things can't seek.
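
One possible spelling of that mixin, layered on the existing whence-style 
seek (a sketch, not part of the PEP):

    class SeekableMixin:
        def seek_absolute(self, pos):
            # negative pos counts back from the end of the file
            return self.seek(pos, 2) if pos < 0 else self.seek(pos, 0)

        def seek_relative(self, offset):
            return self.seek(offset, 1)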

===============================================

``If and only if a RawIOBase implementation operates on an underlying
file descriptor, it must additionally provide a .fileno() member
function. This could be defined specifically by the implementation, or
a mix-in class could be used (need to decide about this).''

I'd suggest a mixin.

===============================================

TextIOBase: this seems an odd mix of high-level and low-level.  I'd
remove "seek", "tell", "read", and "write".  Remember that in Python,
mixins actually work, so that you can provide a file object that
combines several different I/O classes.  And The Java-ish notion in
TextIOBase.read(), that you can specify a count for the number of
characters (or is that the number of UTF-8 bytes, etc... -- rich
source of subtle bugs), just doesn't work in practice.  And the
"codecs" module already provides a way of doing this, for those who
feel the need.  Stick to just "readline" and "writeline" for text I/O.

Bill


From guido at python.org  Tue Jun 19 20:49:10 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 19 Jun 2007 11:49:10 -0700
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <4678187B.2060402@trueblade.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4678187B.2060402@trueblade.com>
Message-ID: <ca471dc20706191149i2d6208f0v737bfd9296a04f41@mail.gmail.com>

Thanks, you're right, I've fixed it.

On 6/19/07, Eric V. Smith <eric+python-dev at trueblade.com> wrote:
> Guido van Rossum wrote:
> > I've written up a comprehensive status report on Python 3000. Please read:
> >
> > http://www.artima.com/weblogs/viewpost.jsp?thread=208549
>
> I think this sentence:
>
> "Python 2.6 will contain backported versions of many Py3k features,
> either enabled through __future__ statements or simply by allowing old
> and new syntax to be used side-by-side (if the new syntax would be a
> syntax error in 2.x)."
>
> Should end with "syntax error in 2.5", not "syntax error in 2.x".  Or,
> state that x <= 5, in this sentence only.  But I think we really mean
> exactly 2.5.
>
> Eric.
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com  Tue Jun 19 21:13:24 2007
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Jun 2007 12:13:24 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> 
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>
Message-ID: <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com>

> map (especially the new iterized version) is a frequently-used  
> builtin, while reduce is a rarely-used builtin that requires some  
> head-wrapping.  It makes sense to me to move it out of builtins.

I've never understood this kind of argument.  Because most people
don't program in Python, we should abandon the project as a whole?
For those who have "wrapped their head" around functional programming,
"reduce" is a very clear and easy-to-understand primitive.

But posting results gleaned from grepping over some random codebase
written by someone who may or may not have done that head-wrapping at
various points in time where some feature X may more may not have been
available, seems even less of an argument.  As I said, Guido's
argument that "filter" (in the guise of [x for x in y if f(x)]),
"any", and "all" are sufficient for almost every case seems like an
interesting one to me, and he may well be right, but while we find
out...

Bill

From lists at cheimes.de  Tue Jun 19 21:17:06 2007
From: lists at cheimes.de (Christian Heimes)
Date: Tue, 19 Jun 2007 21:17:06 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>
	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <f59a3o$rnn$1@sea.gmane.org>

Bill Janssen wrote:
> Though, from the standpoint of pragmatism, removing "reduce" from the
> built-in space will break code (*my* code, among others), and leaving
> it in will not affect "purity", as both "map" and "filter" are being
> left in.  So leaving it alone seems the more Pythonic response to me.

Python 3000 tries to reduce (hehe) the number of builtins, so reduce was
removed since it is rarely used. I don't understand why map and filter
weren't moved to functools, too.

You made one good point. At the moment you can't write code that
utilizes reduce and works under 2.6 and 3.0. from functools import
reduce fails in 2.6. The 2to3 suite has no fixer for reduce. My patch
removes the flaw: http://www.python.org/sf/1739906

Christian


From martin at v.loewis.de  Tue Jun 19 22:53:09 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Tue, 19 Jun 2007 22:53:09 +0200
Subject: [Python-3000] [Python-Dev]   Python 3000 Status Update (Long!)
In-Reply-To: <f58sls$983$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<f5856m$h2q$1@sea.gmane.org>	<4677C4B8.8010508@gmail.com>	<f58hlj$sri$1@sea.gmane.org>	<4677CED9.1060800@livinglogic.de>	<f58k6r$6fv$1@sea.gmane.org>	<4677EC1A.10306@livinglogic.de>
	<f58sls$983$1@sea.gmane.org>
Message-ID: <46784235.5050102@v.loewis.de>

>> What would a registry of transformation algorithms buy us compared to a
>> module with transformation functions?
> 
> Easier registering of custom transformations. Without a registry, you'd have
> to monkey-patch a module.

Or users would have to invoke the module directly.

I think a convention would be enough:

rot13.encode(foo)
rot13.decode(bar)

Then, "registration" would require to put the module on sys.path,
which it would for any other kind of registry as well.

My main objection to using an encoding is that for these,
the algorithm name will *always* be a string literal,
completely unlike "real" codecs, where the encoding name
often comes from the environment (either from the process
environment, or from some kind of input).

Regards,
Martin


From mike.klaas at gmail.com  Tue Jun 19 22:53:35 2007
From: mike.klaas at gmail.com (Mike Klaas)
Date: Tue, 19 Jun 2007 13:53:35 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>
	<07Jun19.121327pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>


On 19-Jun-07, at 12:13 PM, Bill Janssen wrote:

>> map (especially the new iterized version) is a frequently-used
>> builtin, while reduce is a rarely-used builtin that requires some
>> head-wrapping.  It makes sense to me to move it out of builtins.
>
> I've never understood this kind of argument.  Because most people
> don't program in Python, we should abandon the project as a whole?

No, but it certainly is an argument for not pre-installing a python  
dev environment on all windows boxes, for instance.

Surely frequency of use should be at least _one_ of the criteria  
involved in making decisions about including functionality as a builtin?

> For those who have "wrapped their head" around functional programming,
> "reduce" is a very clear and easy-to-understand primitive.

Granted.  However, I claim that most python users would find a
reduce()-based construct less natural than an alternative, even if they
understand what it does.  The suggestion is simply to move the  
function to the "FP toolbox".

> But posting results gleaned from grepping over some random codebase
> written by someone who may or may not have done that head-wrapping at
> various points in time where some feature X may more may not have been
> available, seems even less of an argument.

reduce() was always available.  But that isn't the point: I'm not 
presenting the statistics as evidence of the entire python world, but 
I think they still indicate _something_, if only the usage patterns of 
one type of python programmer (namely, one who is familiar with FP and 
uses many of its concepts in their python programming, though is by no 
means a disciple).

Stats from _any_ large python project are better than anecdotes. 
Perhaps it would be better to turn to the stdlib (367289 lines)?

Python2.5/Lib $ pygrep -E '\breduce\(' | wc -l
31

15 of those are tests for reduce()/iterators
7 are in pickle.py (nomenclature clash)

Which leaves a few uses over binary operators:

./test/test_random.py:    return reduce(int.__mul__, xrange(1, n), 1)
./idlelib/MultiCall.py:    _state_names = [reduce(lambda x, y: x + y,
./idlelib/MultiCall.py:    _state_codes = [reduce(lambda x, y: x | y,
./idlelib/AutoCompleteWindow.py:    elif reduce(lambda x, y: x or y,
./difflib.py:    matches = reduce(lambda sum, triple: sum + triple[-1],
                                  self.get_matching_blocks(), 0)


Some trickiness in csv.py:

         quotechar = reduce(lambda a, b, quotes = quotes:
                            (quotes[a] > quotes[b]) and a or b, quotes.keys())
             delim = reduce(lambda a, b, delims = delims:
                            (delims[a] > delims[b]) and a or b, delims.keys())
             modes[char] = reduce(lambda a, b: a[1] > b[1] and a or b, items)

(which can be replaced with max(..., key=...))

       reduce(lambda a, b: (0, a[1] + b[1]), items)[1]

(which could be written sum(x[1] for x in items))
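
Spelled out, those replacements would look roughly like this
(an untested sketch, with quotes, delims and items as in csv.py):

    quotechar = max(quotes, key=quotes.get)      # most common quote char
    delim = max(delims, key=delims.get)          # most common delimiter
    modes[char] = max(items, key=lambda pair: pair[1])
    total = sum(x[1] for x in items)             # the (0, a[1] + b[1]) reduce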

> As I said, Guido's
> argument that "filter" (in the guise of [x for x in y if f(x)]),
> "any", and "all" are sufficient for almost every case seems like an
> interesting one to me, and he may well be right, but while we find
> out...

How will we find out, if reduce() continues to be available? <wink>

Regardless, that's my 2c.  I don't think I have anything further to
add to this (settled) matter.

-Mike

From brett at python.org  Tue Jun 19 23:13:56 2007
From: brett at python.org (Brett Cannon)
Date: Tue, 19 Jun 2007 14:13:56 -0700
Subject: [Python-3000] How best to handle failing tests in struni?
Message-ID: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>

After reading Guido's blog post and noticing his comment about lack of
delegation, I decided to delegate to myself a look at struni and what
tests were failing (which turned out to be a lot).

I just started at the beginning and so that meant looking at
test_anydbm.  That's failing because _bsddb.c requires PyInt_Check or
PyString_Check to pass for keys.  That doesn't work in a world where
string constants are all Unicode.  =)

So, my question is how best to handle this test (and thus other tests
like it).  Should it just continue to fail until someone fixes
_bsddb.c to accept Unicode keys (and thus start up a FAILING file
listing the various tests that are failing and doc which ones are
expected to fail until something specific changes)?  Or do we silence
the failure by making the constants pass through str8?  Or should str8
not even be used at all since (I assume) it won't survive the merge
back into p3yk?

-Brett

From guido at python.org  Wed Jun 20 01:22:06 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 19 Jun 2007 16:22:06 -0700
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>
References: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>
Message-ID: <ca471dc20706191622k1ca313bdtb53d33475eff8632@mail.gmail.com>

Check out what the dbm-based modules do. I believe they use strings
for keys and bytes for values, and if the keys are unicode, it
converts them to UTF-8.
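
Roughly this convention, I believe (a sketch, not the actual dbm code):

    def _as_key(key):
        # text keys are encoded to UTF-8; bytes pass through unchanged
        if isinstance(key, str):
            return key.encode('utf-8')
        return key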

On 6/19/07, Brett Cannon <brett at python.org> wrote:
> After reading Guido's blog post and noticing his comment about lack of
> delegation, I decided to delegate to myself a look at struni and what
> tests were failing (which turned out to be a lot).
>
> I just started at the beginning and so that meant looking at
> test_anydbm.  That's failing because _bsddb.c requires PyInt_Check or
> PyString_Check to pass for keys.  That doesn't work in a world where
> string constants are all Unicode.  =)
>
> So, my question is how best to handle this test (and thus other tests
> like it).  Should it just continue to fail until someone fixes
> _bsddb.c to accept Unicode keys (and thus start up a FAILING file
> listing the various tests that are failing and doc which ones are
> expected to fail until something specific changes)?  Or do we silence
> the failure by making the constants pass through str8?  Or should str8
> not even be used at all since (I assume) it won't survive the merge
> back into p3yk?
>
> -Brett


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From gareth.mccaughan at pobox.com  Wed Jun 20 01:26:16 2007
From: gareth.mccaughan at pobox.com (Gareth McCaughan)
Date: Wed, 20 Jun 2007 00:26:16 +0100
Subject: [Python-3000] On PEP 3116:  new I/O base classes
In-Reply-To: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <200706200026.16568.gareth.mccaughan@pobox.com>

On Tuesday 19 June 2007 19:47, Bill Janssen wrote:

> The "seek" method on RawIOBase is awfully quaint and UNIX-y, what with
> the "whence" argument.  It could be made considerably more Pythonic by
> splitting it into two methods:
>
>   .seek(POS: int)
>
> where positive values for POS are from the beginning of the file, and
> negative values of POS are from the end of the file, and
>
>   .nudge(POS: int)
>
> where the value of POS, positive or negative, is from the current
> location.

Presumably this would go along with introducing a new "wink" method.
I wonder what it would do. (Close the file briefly?)

-- 
Gareth McCaughan

From showell30 at yahoo.com  Wed Jun 20 02:19:40 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 19 Jun 2007 17:19:40 -0700 (PDT)
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
Message-ID: <931752.18195.qm@web33504.mail.mud.yahoo.com>

+1 on deciding whether to keep builtins built in based on
popularity within actual source code.

Stats will never be perfect, and nobody can
practically sample all Python code ever written, but
anybody who measures a large codebase to argue for
keeping a builtin built in gets a +1 from me.

Regarding map and filter, I never use them myself, but
I also never collide with the keywords, even though a
lot of my code really comes down to mapping and filtering.



From lists at cheimes.de  Wed Jun 20 02:29:01 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 02:29:01 +0200
Subject: [Python-3000] On PEP 3116:  new I/O base classes
In-Reply-To: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <f59scm$m59$1@sea.gmane.org>

Bill Janssen wrote:
> The "seek" method on RawIOBase is awfully quaint and UNIX-y, what with
> the "whence" argument.  It could be made considerably more Pythonic by
> splitting it into two methods:
> 
>   .seek(POS: int)
> 
> where positive values for POS are from the beginning of the file, and
> negative values of POS are from the end of the file, and

How would I seek to EOF with your proposal? seek(-0)?

Christian


From benji at benjiyork.com  Wed Jun 20 03:11:03 2007
From: benji at benjiyork.com (Benji York)
Date: Tue, 19 Jun 2007 21:11:03 -0400
Subject: [Python-3000] On PEP 3116:  new I/O base classes
In-Reply-To: <200706200026.16568.gareth.mccaughan@pobox.com>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
	<200706200026.16568.gareth.mccaughan@pobox.com>
Message-ID: <46787EA7.6050903@benjiyork.com>

Gareth McCaughan wrote:
> On Tuesday 19 June 2007 19:47, Bill Janssen wrote:
> 
>> The "seek" method on RawIOBase is awfully quaint and UNIX-y, what with
>> the "whence" argument.  It could be made considerably more Pythonic by
>> splitting it into two methods:
>>
>>   .seek(POS: int)
>>
>> where positive values for POS are from the beginning of the file, and
>> negative values of POS are from the end of the file, and
>>
>>   .nudge(POS: int)
>>
>> where the value of POS, positive or negative, is from the current
>> location.
> 
> Presumably this would go along with introducing a new "wink" method.
> I wonder what it would do. (Close the file briefly?)

That's a great idea!  It can be called in response to a HUP to rotate 
log files.

me.wink()-ly y'rs
-- 
Benji York
http://benjiyork.com

From janssen at parc.com  Wed Jun 20 03:46:50 2007
From: janssen at parc.com (Bill Janssen)
Date: Tue, 19 Jun 2007 18:46:50 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <f59scm$m59$1@sea.gmane.org> 
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
	<f59scm$m59$1@sea.gmane.org>
Message-ID: <07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>

> How would I seek to EOF with your proposal? seek(-0)?

Good point.  Though I just grepped all my Python sources, and I never
do that, so presumably the obvious workaround of

   seek_eof = lambda fp: (fp.seek(-1), fp.nudge(+1))

would be OK for that case.

Bill

From foom at fuhm.net  Wed Jun 20 06:19:26 2007
From: foom at fuhm.net (James Y Knight)
Date: Wed, 20 Jun 2007 00:19:26 -0400
Subject: [Python-3000] On PEP 3116:  new I/O base classes
In-Reply-To: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>

On Jun 19, 2007, at 2:47 PM, Bill Janssen wrote:

> TextIOBase: this seems an odd mix of high-level and low-level.  I'd
> remove "seek", "tell", "read", and "write".  Remember that in Python,
> mixins actually work, so that you can provide a file object that
> combines several different I/O classes.

Huh? All those operations you want to remove are entirely necessary  
for a number of applications. I'm not sure what you meant about mixins?

> And The Java-ish notion in
> TextIOBase.read(), that you can specify a count for the number of
> characters (or is that the number of UTF-8 bytes, etc... -- rich
> source of subtle bugs), just doesn't work in practice.

It doesn't work? Why not? Of course read() should take the number of  
characters as a parameter, not number of bytes.

> And the
> "codecs" module already provides a way of doing this, for those who
> feel the need.  Stick to just "readline" and "writeline" for text I/O.

Ah, not everyone dealing with text is dealing with line-delimited  
text, you know...

James

From aurelien.campeas at logilab.fr  Wed Jun 20 10:57:01 2007
From: aurelien.campeas at logilab.fr (Aurélien Campéas)
Date: Wed, 20 Jun 2007 10:57:01 +0200
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string
	formatting)
In-Reply-To: <ca471dc20706190820n7715fc30jeafcffd14c6b5623@mail.gmail.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
	<ca471dc20706190820n7715fc30jeafcffd14c6b5623@mail.gmail.com>
Message-ID: <20070620085701.GA31968@crater.logilab.fr>

On Tue, Jun 19, 2007 at 08:20:25AM -0700, Guido van Rossum wrote:
> Those are valid concerns. I'm cross-posting this to the python-3000
> list in the hope that the PEP's author and defendents can respond. I'm
> sure we can work something out.

Thanks for raising this. It is horrible enough that I feel obliged to
de-lurk.

-10 on this part of PEP3101.


> 
> Please keep further discussion on the python-3000 at python.org list.
> 
> --Guido
> 
> On 6/19/07, Chris McDonough <chrism at plope.com> wrote:
> > Wrt http://www.python.org/dev/peps/pep-3101/
> >
> > PEP 3101 says Py3K should allow item and attribute access syntax
> > within string templating expressions but "to limit potential security
> > issues", access to underscore prefixed names within attribute/item
> > access expressions will be disallowed.

People talking about potential security issues should have an
obligation to show how their proposals *really* improve security (in
general); this is, of course, a hard thing to do; mere hand-waving is
not sufficient.

> > I am a person who has lived with the aftermath of a framework
> > designed to prevent data access by restricting access to underscore-
> > prefixed names (Zope 2, ahem), and I've found it's very hard to
> > explain and justify.  As a result, I feel that this is a poor default
> > policy choice for a framework.

And it's even poorer in the context of a language (for it's probably
harder to escape language-level restrictions than framework
obscurities ...).

> > In some cases, underscore names must become part of an object's
> > external interface.  Consider a URL with one or more underscore-
> > prefixed path segment elements (because prefixing a filename with an
> > underscore is a perfectly reasonable thing to do on a filesystem, and
> > path elements are often named after file names) fed to a traversal
> > algorithm that attempts to resolve each path element into an object
> > by calling __getitem__ against the parent found by the last path
> > element's traversal result.  Perhaps this is poor design and
> > __getitem__ should not be consulted here, but I doubt that highly
> > because there's nothing particularly special about calling a method
> > named __getitem__ as opposed to some method named "traverse".

This is trying to make a technical argument, but the 'consenting
adults' policy might be enough. In my experience, zope forbidding
access to _-prefixed attributes just led to working around the
limitation, thus adding more useless indirection to an already crufty
code base. The result is more obfuscation and probably even less
security (as in auditability of the code).

> >
> > The only precedent within Python 2 for this sort of behavior is
> > limiting access to variables that begin with __ and which do not end
> > with __ to the scope defined by a class and its instances.  I
> > personally don't believe this is a very useful feature, but it's
> > still only an advisory policy and you can worm around it with enough
> > gyrations.

FWIW I've come to never use __attrs. The obfuscation feature seems to
bring nothing but pain (the few times I've fallen into that trap as a
beginner python programmer).

> >
> > Given that security is a concern at all, the only truly reasonable
> > way to "limit security issues" is to disallow item and attribute
> > access completely within the string templating expression syntax.  It
> > seems gratuituous to me to encourage string templating expressions
> > with item/attribute access, given that you could do it within the
> > format arguments just as easily in the 99% case, and we've (well...
> > I've) happily been living with that restriction for years now.
> >
> > But if this syntax is preserved, there really should be no *default*
> > restrictions on the traversable names within an expression because
> > this will almost certainly become a hard-to-explain, hard-to-justify
> > bug magnet as it has become in Zope.

I'd add that Zope in general looks to me like a giant collection of
python anti-patterns and as such can be used as a clue source about
what not to do, especially what not to include in Py3k.

I don't want to offend people, well, no more than necessary (imho zope
*is* an offense to common sense in many ways), but that's the opinion
of someone who earns his living mostly from zope/plone product
development and maintenance (these days, anyway).

Regards,
Aurélien.

From walter at livinglogic.de  Wed Jun 20 11:26:59 2007
From: walter at livinglogic.de (Walter Dörwald)
Date: Wed, 20 Jun 2007 11:26:59 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4677AA2B.8000704@ronadam.com>
References: <acd65fa20706051433l5413ca33hd0a1793cedd4e675@mail.gmail.com>		<466E4B22.6020408@ronadam.com>		<ca471dc20706131453w3e073ab7qba5c8bdbf0e0812c@mail.gmail.com>		<46708286.6090201@ronadam.com>		<ca471dc20706131656q213c61c2g37da25e3a6a0e856@mail.gmail.com>		<4670A458.7050206@ronadam.com>		<ca471dc20706131925t5cf61e61j6e701a5c79e29b64@mail.gmail.com>		<4670C3C5.4070907@ronadam.com>		<ca471dc20706141657x14ad1ba7m911ac17bb1d7939@mail.gmail.com>		<46749E94.5010301@ronadam.com>	<ca471dc20706181137h3b46a17eu3e49112303b22e1d@mail.gmail.com>
	<4677AA2B.8000704@ronadam.com>
Message-ID: <4678F2E3.7080900@livinglogic.de>

Ron Adam wrote:

> [...]
> M      Lib/tokenize.py
> M      Lib/test/tokenize_tests.txt
> M      Lib/test/output/test_tokenize
> - Removed unicode literals from test results and tokenize.py, and made
> it pass again.
> 
> 
> M      Lib/test/output/test_pep277
> - Removed unicode literals from test results.  This is a windows only
> test, so I can't test it.
> 
> M      Lib/test/test_codeccallbacks.py
> M      Objects/exceptions.c
> - Removed unicode literals from test_codeccallbacks.py and removed
> unicode literal quoting from exceptions.c to make it pass again.
> 
> M      Lib/test/test_codecs.py
> M      Lib/test/test_doctest.py
> M      Lib/test/re_tests.py
> - Removed some literals from comments.

The following changes looked good to me:

M      Lib/test/test_codeccallbacks.py
M      Objects/exceptions.c
M      Lib/test/test_codecs.py

so I checked them in.

No opinion about the rest.

Servus,
   Walter

From ncoghlan at gmail.com  Wed Jun 20 12:31:38 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 20 Jun 2007 20:31:38 +1000
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <f59a3o$rnn$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
	<f59a3o$rnn$1@sea.gmane.org>
Message-ID: <4679020A.8020609@gmail.com>

Christian Heimes wrote:
> Bill Janssen wrote:
>> Though, from the standpoint of pragmatism, removing "reduce" from the
>> built-in space will break code (*my* code, among others), and leaving
>> it in will not affect "purity", as both "map" and "reduce" are being
>> left in.  So leaving it alone seems the more Pythonic response to me.
> 
> Python 3000 tries to reduce (hehe) the amount of builtins so reduce was
> removed since it is rarely used. I don't understand why map and filter
> wasn't moved to functools, too.

Because (str(x) for x in seq) is not an improvement over map(str, seq) - 
applying a single existing function to a sequence is a very common 
operation.

map() accepts any function (given an appropriate number of sequences), 
and thus has wide applicability.

filter() accepts any single argument predicate function (using bool() by 
default), and thus also has wide applicability.

reduce(), on the other hand, works only with functions that are 
specially designed to be fed to it - you are unlikely to have an 
appropriate function just lying around. Given the likely need to write a 
special function to perform the desired reduction, importing the reduce 
function itself isn't going to be much additional overhead.

From the point of view of readability, it is probably going to be 
better to hide the fact that reduce is being used at all behind a named 
reduction function (or, where possible, just use one of the builtin 
sequence reduction functions like any(), all(), sum(), min(), max()).
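
For example (a sketch; product() is a hypothetical helper, with reduce 
imported from its planned new home in functools):

    from functools import reduce
    import operator

    def product(seq):
        # the reduce() call hides behind a readable name
        return reduce(operator.mul, seq, 1)

    print(product([1, 2, 3, 4]))   # 24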

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From nicko at nicko.org  Wed Jun 20 12:22:11 2007
From: nicko at nicko.org (Nicko van Someren)
Date: Wed, 20 Jun 2007 11:22:11 +0100
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>
	<07Jun19.121327pdt."57996"@synergy1.parc.xerox.com>
	<4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
Message-ID: <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>

On 19 Jun 2007, at 21:53, Mike Klaas wrote:
> ...
> Stats from _any_ large python project is better than anecdotes.
> Perhaps it would be better to turn to the stdlib (367289 lines)?
...
>        reduce(lambda a, b: (0, a[1] + b[1]), items)[1]
>
> (which could be written sum(x[1] for x in items))

Only if the items at index 1 happen to be numbers.  That's another  
bugbear of mine.  The sum(l) built-in is NOT equivalent to
reduce(operator.add, l) in Python 2.x:
	>>> reduce(operator.add, [1,2,3])
	6
	>>> reduce(operator.add, ['a','b','c'])
	'abc'
	>>> reduce(operator.add, [["a"],[u'b'],[3]])
	['a', u'b', 3]
	>>> sum([1,2,3])
	6
	>>> sum(['a','b','c'])
	Traceback (most recent call last):
	  File "<stdin>", line 1, in <module>
	TypeError: unsupported operand type(s) for +: 'int' and 'str'
	>>> sum([["a"],[u'b'],[3]])
	Traceback (most recent call last):
	  File "<stdin>", line 1, in <module>
	TypeError: unsupported operand type(s) for +: 'int' and 'list'

Given that reduce is moving one step further away in Python 3, and  
given that its use seems to be somewhat discouraged these days  
anyway, perhaps the sum() function could be made properly polymorphic  
so as to remove one more class of use cases for reduce().
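
Something along these lines, i.e. seed the running total from the
sequence itself rather than from 0 (a sketch only):

    def polysum(iterable):
        # no start value: the first item seeds the total, so anything
        # that can be added to itself works (strings, lists, vectors, ...)
        it = iter(iterable)
        try:
            total = next(it)
        except StopIteration:
            raise ValueError("sum of an empty sequence")
        for item in it:
            total = total + item
        return total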

	Nicko


From ncoghlan at gmail.com  Wed Jun 20 16:44:10 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Jun 2007 00:44:10 +1000
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>
	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>	<07Jun19.121327pdt."57996"@synergy1.parc.xerox.com>	<4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
	<4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
Message-ID: <46793D3A.8020303@gmail.com>

Nicko van Someren wrote:
> 	>>> sum(['a','b','c'])
> 	Traceback (most recent call last):
> 	  File "<stdin>", line 1, in <module>
> 	TypeError: unsupported operand type(s) for +: 'int' and 'str'
> 	>>> sum([["a"],[u'b'],[3]])
> 	Traceback (most recent call last):
> 	  File "<stdin>", line 1, in <module>
> 	TypeError: unsupported operand type(s) for +: 'int' and 'list'

You can already make the second example work properly by supplying an 
appropriate starting value:

 >>> sum([["a"],[u'b'],[3]], [])
['a', u'b', 3]

(and a similar call will also work for the new bytes type, as well as 
other sequences)

Strings are explicitly disallowed (because Guido doesn't want a second 
way to spell ''.join(seq), as far as I know):

 >>> sum(['a','b','c'], '')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Wed Jun 20 16:49:41 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Jun 2007 00:49:41 +1000
Subject: [Python-3000] Issues with PEP 3101 (string formatting)
In-Reply-To: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
Message-ID: <46793E85.4000402@gmail.com>

Chris McDonough wrote:
> Wrt http://www.python.org/dev/peps/pep-3101/
> 
> PEP 3101 says Py3K should allow item and attribute access syntax  
> within string templating expressions but "to limit potential security  
> issues", access to underscore prefixed names within attribute/item  
> access expressions will be disallowed.

Personally, I'd be fine with leaving at least the embedded attribute 
access out of the initial implementation of the PEP. I'd even be OK with 
leaving out the embedded item access, but if we leave it in, "vars(obj)" 
and the embedded item access would still provide a shorthand notation 
for access to instance variable attributes in a format string.
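
i.e. something like this (a sketch of the PEP 3101 notation):

    class User:
        def __init__(self, name, age):
            self.name, self.age = name, age

    obj = User("Fred", 42)
    # item access on vars(obj) instead of attribute access on obj:
    print("Hello, {0[name]}, you are {0[age]}".format(vars(obj)))
    # rather than the embedded attribute form:
    print("Hello, {0.name}, you are {0.age}".format(obj))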

So +1 for leaving out embedded attribute access from the initial 
implementation of PEP 3101, and -0 for leaving out the embedded item access.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From thomas at python.org  Wed Jun 20 17:13:00 2007
From: thomas at python.org (Thomas Wouters)
Date: Wed, 20 Jun 2007 08:13:00 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <46793D3A.8020303@gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>
	<4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
	<4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
	<46793D3A.8020303@gmail.com>
Message-ID: <9e804ac0706200813y68a7664dt206e65c3fbb9fcf7@mail.gmail.com>

On 6/20/07, Nick Coghlan <ncoghlan at gmail.com> wrote:

> Strings are explicitly disallowed (because Guido doesn't want a second
> way to spell ''.join(seq), as far as I know):


More importantly, because it has positively abysmal performance, just like
the reduce() solution (and, in fact, many reduce solutions to problems
better solved otherwise :-). Like the old input(), backticks and the
mixing of tabs and spaces, while it has its uses, the ease and frequency
with which it is misused outweigh the utility enough that it should not be
in such a prominent place.

-- 
Thomas Wouters <thomas at python.org>

Hi! I'm a .signature virus! copy me into your .signature file to help me
spread!

From lists at cheimes.de  Wed Jun 20 17:32:27 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 17:32:27 +0200
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>	<f59scm$m59$1@sea.gmane.org>
	<07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <f5bhac$qga$1@sea.gmane.org>

Bill Janssen wrote:
> Good point.  Though I just grepped all my Python sources, and I never
> do that, so presumably the obvious workaround of

I'm using seek(0, 2) + tell() sometimes when I need to know the file
size and don't want to worry about buffers.

pos = fd.tell()
size = None
try:
    fd.seek(0, 2)
    size = fd.tell()
finally:
    fd.seek(pos)

IMO you made a good point. The seek() arguments are really too
UNIX-centric and hard to understand for newbies. The os module contains
three aliases for the whence values (SEEK_CUR, SEEK_END, SEEK_SET) (why
is it called SET and not START?) but they are rarely used.

What do you think about adding two additional methods which act as
aliases for whence = 1 and whence = 2?

    def seek(self, pos: int, whence: int = 0) -> int:
        """Change stream position.

        Seek to byte offset pos relative to position indicated by whence:
             0  Start of stream (the default).  pos should be >= 0;
             1  Current position - pos may be negative;
             2  End of stream - pos usually negative.
        Returns the new absolute position.
        """

    def seekcur(self, pos: int) -> int:
        """seek relative to current position

        alternative names: seekrel, seek_relative
        """
        return self.seek(pos, 1)

    def seekend(self, pos: int) -> int:
        """seek from end of stream

        alternative names: seekeof
        """
        return self.seek(pos, 2)
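
The file-size idiom above would then shrink to (sketch):

    pos = fd.tell()
    try:
        size = fd.seekend(0)  # new absolute position == file size
    finally:
        fd.seek(pos)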



From lists at cheimes.de  Wed Jun 20 17:43:51 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 17:43:51 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4679020A.8020609@gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>	<f59a3o$rnn$1@sea.gmane.org>
	<4679020A.8020609@gmail.com>
Message-ID: <f5bi01$skk$1@sea.gmane.org>

Nick Coghlan wrote:
> Because (str(x) for x in seq) is not an improvement over map(str, x) - 
> applying a single existing function to a sequence is a very common 
> operation.
> 
> map() accepts any function (given an appropriate number of sequences), 
> and thus has wide applicability.
> 
> filter() accepts any single argument predicate function (using bool() by 
> default), and thus also has wide applicability.

But map() and filter() can easily be replaced with a generator or list
comprehension expression. [str(x) for x in seq] and [x for x in seq if
func(x)] are considered easier to read these days.

IIRC map and filter are a bit slower than list comprehensions and may use
much more memory than a generator expression, since map and filter
return a list of results.

Personally I don't see why map and filter are still builtins when they
can be replaced with code that is easier to read, faster and less memory
consuming. OK, I have to type some more characters but that's not an
issue for me.

Christian


From chrism at plope.com  Wed Jun 20 17:52:47 2007
From: chrism at plope.com (Chris McDonough)
Date: Wed, 20 Jun 2007 11:52:47 -0400
Subject: [Python-3000] Issues with PEP 3101 (string formatting)
In-Reply-To: <46793E85.4000402@gmail.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
	<46793E85.4000402@gmail.com>
Message-ID: <AEAEDAF5-D8D2-4FD7-884C-5C4BD2337C80@plope.com>

Allowing attribute and/or item access within templating expressions  
has historically been the domain of full-on templating languages  
(which invariably also have a way to do repeats, conditionals,  
arbitrary method calls, etc).

I think it should probably stay that way because to me, at least,  
there's not much more compelling about being able to do item/ 
attribute access within a template expression than there is to be  
able to do replacements using results from arbitrary method calls.   
It's fairly arbitrary to allow calls to __getitem__ and __getattr__  
but prevent, say, calls to "traverse", at least if the format  
arguments are not restricted to plain lists/tuples/dicts.

That's not to say that maybe an extended templating thingy shouldn't  
ship within the stdlib though, maybe even one that extends the  
default interpolation syntax in these sorts of ways.

- C

On Jun 20, 2007, at 10:49 AM, Nick Coghlan wrote:

> Chris McDonough wrote:
>> Wrt http://www.python.org/dev/peps/pep-3101/
>> PEP 3101 says Py3K should allow item and attribute access syntax   
>> within string templating expressions but "to limit potential  
>> security  issues", access to underscore prefixed names within  
>> attribute/item  access expressions will be disallowed.
>
> Personally, I'd be fine with leaving at least the embedded  
> attribute access out of the initial implementation of the PEP. I'd  
> even be OK with leaving out the embedded item access, but if we  
> leave it in "vars(obj)" and the embedded item access would still  
> provide a shorthand notation for access to instance variable  
> attributes in a format string.
>
> So +1 for leaving out embedded attribute access from the initial  
> implementation of PEP 3101, and -0 for leaving out the embedded  
> item access.
>
> Cheers,
> Nick.
>
> -- 
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
> ---------------------------------------------------------------
>             http://www.boredomandlaziness.org
>


From veloso at verylowsodium.com  Wed Jun 20 19:00:59 2007
From: veloso at verylowsodium.com (Greg Falcon)
Date: Wed, 20 Jun 2007 13:00:59 -0400
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string
	formatting)
In-Reply-To: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
Message-ID: <3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>

On 6/19/07, Chris McDonough <chrism at plope.com> wrote:
> Given that security is a concern at all, the only truly reasonable
> way to "limit security issues" is to disallow item and attribute
> access completely within the string templating expression syntax.  It
> seems gratuituous to me to encourage string templating expressions
> with item/attribute access, given that you could do it within the
> format arguments just as easily in the 99% case, and we've (well...
> I've) happily been living with that restriction for years now.
>
> But if this syntax is preserved, there really should be no *default*
> restrictions on the traversable names within an expression because
> this will almost certainly become a hard-to-explain, hard-to-justify
> bug magnet as it has become in Zope.

This sounds exactly right to me.  I don't have strong feelings either
way about attribute lookups in formatting strings, or the security
problems they raise.  But while it seems a reasonable stance that
user-injected getattr()s may pose a security problem, what seems
indefensible is the stance that user-injected getattr()s are okay
precisely when the attribute being looked up doesn't start with an
underscore.

A single underscore prefix is a hint to human readers, not to the
language itself, and things should stay that way.

Greg F

From janssen at parc.com  Wed Jun 20 19:03:49 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 10:03:49 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> 
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
	<21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
Message-ID: <07Jun20.100358pdt."57996"@synergy1.parc.xerox.com>

> > TextIOBase: this seems an odd mix of high-level and low-level.  I'd
> > remove "seek", "tell", "read", and "write".  Remember that in Python,
> > mixins actually work, so that you can provide a file object that
> > combines several different I/O classes.
> 
> Huh? All those operations you want to remove are entirely necessary  
> for a number of applications. I'm not sure what you meant about mixins?

I meant that TextIOBase should just provide the operations for text.
The other operations would be supported, when appropriate, by mixing
in an appropriate class that provides them.  Remember that this is
a PEP about base classes.

> It doesn't work? Why not? Of course read() should take the number of  
> characters as a parameter, not number of bytes.

Unfortunately, files contain encodings of characters, and those
encodings may at times be mapped to multiple equivalent strings, at
least with respect to Unicode, the target for Python-3000.  The
standard Unicode support for Python-3000 seems to be settling on
having code-point representations of those strings exposed to the
application, which means that any specific automatic normalization is
precluded.  So any particular "readchars(1)" operation may validly
return different strings even if operating on the same underlying
file, and may require a different number of read operations to read
the same underlying bytes.  That is, I believe that the string and/or
file operations are not well-specified enough to guarantee that this
won't happen.  This is the same situation we have today, which means
that the only real way to read Unicode strings from a file will be the
same as today, that is, read raw bytes from a file, decode them and
normalize them in some specific way, and then see what string you wind
up with.  You could probably fix this in the PEP by specifying a
specific Unicode normalization to use when returning strings.
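
For example, with today's unicodedata (NFC chosen just for
illustration):

    import unicodedata
    s1 = u"caf\u00e9"     # precomposed e-acute
    s2 = u"cafe\u0301"    # 'e' followed by a combining acute
    assert s1 != s2       # different code points, same visible text
    assert unicodedata.normalize("NFC", s2) == s1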

> > feel the need.  Stick to just "readline" and "writeline" for text I/O.
> 
> Ah, not everyone dealing with text is dealing with line-delimited  
> text, you know...

It's really the only difference between text and non-text.

Bill



From janssen at parc.com  Wed Jun 20 19:09:04 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 10:09:04 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <46793D3A.8020303@gmail.com> 
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>
	<07Jun19.121327pdt."57996"@synergy1.parc.xerox.com>
	<4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
	<4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
	<46793D3A.8020303@gmail.com>
Message-ID: <07Jun20.100913pdt."57996"@synergy1.parc.xerox.com>

> Strings are explicitly disallowed (because Guido doesn't want a second 
> way to spell ''.join(seq), as far as I know):

Isn't "map(str, x)" just a second way to write "[str(x) for x in y]"?

This "second way" argument is another often-heard bogon.  There are
lots of second-way and third-way techniques in Python, and properly
so.  It's more important to make things work consistently than to only
have "one way".  "sum" should concatenate strings.

Bill

From janssen at parc.com  Wed Jun 20 19:11:08 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 10:11:08 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <f5bhac$qga$1@sea.gmane.org> 
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
	<f59scm$m59$1@sea.gmane.org>
	<07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>
	<f5bhac$qga$1@sea.gmane.org>
Message-ID: <07Jun20.101108pdt."57996"@synergy1.parc.xerox.com>

Not bad, but if you're going that route, I think I'd get rid of the
optional arguments, and just say

    seek_from_beginning(INCR: int)

    seek_from_current(INCR: int)

    seek_from_end(DECR: int)

Bill

From nicko at nicko.org  Wed Jun 20 19:12:20 2007
From: nicko at nicko.org (Nicko van Someren)
Date: Wed, 20 Jun 2007 18:12:20 +0100
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <46793D3A.8020303@gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>
	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	<07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>	<07Jun19.121327pdt."57996"@synergy1.parc.xerox.com>	<4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
	<4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
	<46793D3A.8020303@gmail.com>
Message-ID: <5DF9AC3D-CD82-4912-8C19-004131106F40@nicko.org>

On 20 Jun 2007, at 15:44, Nick Coghlan wrote:

> Nicko van Someren wrote:
>> 	>>> sum(['a','b','c'])
>> 	Traceback (most recent call last):
>> 	  File "<stdin>", line 1, in <module>
>> 	TypeError: unsupported operand type(s) for +: 'int' and 'str'
>> 	>>> sum([["a"],[u'b'],[3]])
>> 	Traceback (most recent call last):
>> 	  File "<stdin>", line 1, in <module>
>> 	TypeError: unsupported operand type(s) for +: 'int' and 'list'
>
> You can already make the second example work properly by supplying  
> an appropriate starting value:
>
> >>> sum([["a"],[u'b'],[3]], [])
> ['a', u'b', 3]
>
> (and a similar call will also work for the new bytes type, as well  
> as other sequences)

The need to have an explicit 'start' value just seems wrong.  It's  
horribly inconsistent.  Things that can be added to integers work  
without initialisers but things that can be added to each other (for  
instance numbers in number fields or vectors in vector spaces) can  
not.  I think in most people's minds the 'sum' operation is like an  
evaluation of "+".join(...), you are sticking an addition operation  
between the elements of the list.  The need to have an explicit  
initial value means that sum() is not the sum function for anyone who  
does math in any sort of non-standard number space.

> Strings are explicitly disallowed (because Guido doesn't want a  
> second way to spell ''.join(seq), as far as I know):
>
> >>> sum(['a','b','c'], '')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: sum() can't sum strings [use ''.join(seq) instead]

I can appreciate the value of TOOWTDI, and I appreciate that (in the  
absence of string concatenation by reference) the performance of  
string sum() would suck, but I still think that wilfully making  
things inconsistent in order to enforce TOOWTDI is going too far.

	Nicko



From martin at v.loewis.de  Wed Jun 20 19:20:42 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 20 Jun 2007 19:20:42 +0200
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>
References: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>
Message-ID: <467961EA.4060007@v.loewis.de>

> So, my question is how best to handle this test (and thus other tests
> like it).  Should it just continue to fail until someone fixes
> _bsddb.c to accept Unicode keys (and thus start up a FAILING file
> listing the various tests that are failing and doc which ones are
> expected to fail until something specific changes)?  Or do we silence
> the failure by making the constants pass through str8?  Or should str8
> not even be used at all since (I assume) it won't survive the merge
> back into p3yk?

This goes back to the text-vs-binary debate. I _think_ bsddb inherently
operates on binary data, i.e. neither keys nor values need to be text
in some sense.

So the most natural way would be to make it accept binary data only on
input, and always produce binary data on output. Any *usage* that
expects to be able to pass in strings is broken.

Regards,
Martin

From guido at python.org  Wed Jun 20 19:24:55 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Jun 2007 10:24:55 -0700
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <467961EA.4060007@v.loewis.de>
References: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>
	<467961EA.4060007@v.loewis.de>
Message-ID: <ca471dc20706201024m22080y168b120474205123@mail.gmail.com>

On 6/20/07, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > So, my question is how best to handle this test (and thus other tests
> > like it).  Should it just continue to fail until someone fixes
> > _bsddb.c to accept Unicode keys (and thus start up a FAILING file
> > listing the various tests that are failing and doc which ones are
> > expected to fail until something specific changes)?  Or do we silence
> > the failure by making the constants pass through str8?  Or should str8
> > not even be used at all since (I assume) it won't survive the merge
> > back into p3yk?
>
> This goes back to the text-vs-binary debate. I _think_ bsddb inherently
> operates on binary data, i.e. neither keys nor values need to be text
> in some sense.
>
> So the most natural way would be to make it accept binary data only on
> input, and always produce binary data on output. Any *usage* that
> expects to be able to pass in strings is broken.

OTOH, pragmatically, people will generally use text strings for db keys.

I'm not sure how to decide this; perhaps we need to take it public.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From aleaxit at gmail.com  Wed Jun 20 19:50:48 2007
From: aleaxit at gmail.com (Alex Martelli)
Date: Wed, 20 Jun 2007 10:50:48 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <f5bi01$skk$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<f59a3o$rnn$1@sea.gmane.org> <4679020A.8020609@gmail.com>
	<f5bi01$skk$1@sea.gmane.org>
Message-ID: <e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>

On 6/20/07, Christian Heimes <lists at cheimes.de> wrote:
> Nick Coghlan wrote:
> > Because (str(x) for x in seq) is not an improvement over map(str, x) -
> > applying a single existing function to a sequence is a very common
> > operation.
> >
> > map() accepts any function (given an appropriate number of sequences),
> > and thus has wide applicability.
> >
> > filter() accepts any single argument predicate function (using bool() by
> > default), and thus also has wide applicability.
>
> But map() and filter() can easily be replaced with a generator or list
> comprehension expression. [str(x) for x in seq] and [x for x in seq if
> func(x)] are considered easier to read these days.
>
> IIRC map and filter are a bit slower than list comprehensions and may use
> much more memory than a generator expression, since map and filter
> return a list of results.

No, in 3.0 they'll return iterables -- you really SHOULD read Guido's
blog entry referred to at the top of this thread,
<http://www.artima.com/weblogs/viewpost.jsp?thread=208549>, before
discussing Python 3.0 issues.

So, there's no reason their performance should suffer, either -- using
today's itertools.imap as a stand-in for 3.0's map, for example:

$ python -mtimeit -s'import itertools as it' -s'L=range(-7,17)' 'for x
in it.imap(abs,L): pass'
100000 loops, best of 3: 3 usec per loop
$ python -mtimeit -s'import itertools as it' -s'L=range(-7,17)' 'for x
in (abs(y) for y in L): pass'
100000 loops, best of 3: 4.47 usec per loop

(imap is faster in this case because the built-in name 'abs' is looked
up only once -- in the genexp, it's looked up each time, sigh --
possibly the biggest "we should REALLY tweak the language to let this
be optimized sensibly" gotcha in Python, IMHO).


Alex

From exarkun at divmod.com  Wed Jun 20 19:51:25 2007
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Wed, 20 Jun 2007 13:51:25 -0400
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <ca471dc20706201024m22080y168b120474205123@mail.gmail.com>
Message-ID: <20070620175125.4947.1259671454.divmod.quotient.3074@ohm>

On Wed, 20 Jun 2007 10:24:55 -0700, Guido van Rossum <guido at python.org> wrote:
>On 6/20/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> > So, my question is how best to handle this test (and thus other tests
>> > like it).  Should it just continue to fail until someone fixes
>> > _bsddb.c to accept Unicode keys (and thus start up a FAILING file
>> > listing the various tests that are failing and doc which ones are
>> > expected to fail until something specific changes)?  Or do we silence
>> > the failure by making the constants pass through str8?  Or should str8
>> > not even be used at all since (I assume) it won't survive the merge
>> > back into p3yk?
>>
>> This goes back to the text-vs-binary debate. I _think_ bsddb inherently
>> operates on binary data, i.e. neither keys nor values need to be text
>> in some sense.
>>
>> So the most natural way would be to make it accept binary data only on
>> input, and always produce binary data on output. Any *usage* that
>> expects to be able to pass in strings is broken.
>
>OTOH, pragmatically, people will generally use text strings for db keys.
>
>I'm not sure how to decide this; perhaps we need to take it public.

If it helps, after having used bsddb for a couple years and developed a
non-trivial library on top of it, what Martin said seems most sensible to
me.

Jean-Paul

From alexandre at peadrop.com  Wed Jun 20 19:58:09 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Wed, 20 Jun 2007 13:58:09 -0400
Subject: [Python-3000] Summary of the differences between StringIO and
	cStringIO for PEP-3108
Message-ID: <acd65fa20706201058o4388f8c5q5a3167a7619ebc85@mail.gmail.com>

Hi,

I've written a short summary of the differences between the StringIO and
cStringIO modules.  I attached it as a patch for PEP-3108.

-- Alexandre
-------------- next part --------------
A non-text attachment was scrubbed...
Name: semantic_diff_stringio.patch
Type: text/x-patch
Size: 1655 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070620/0d65cca8/attachment.bin 

From daniel at stutzbachenterprises.com  Wed Jun 20 20:17:14 2007
From: daniel at stutzbachenterprises.com (Daniel Stutzbach)
Date: Wed, 20 Jun 2007 13:17:14 -0500
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <6002181751375776921@unknownmsgid>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<6002181751375776921@unknownmsgid>
Message-ID: <eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com>

On 6/20/07, Bill Janssen <janssen at parc.com> wrote:
> > Ah, not everyone dealing with text is dealing with line-delimited
> > text, you know...
>
> It's really the only difference between text and non-text.

Text is a sequence of characters.  Non-text is a sequence of bytes.
Characters may be multi-byte.  It is no longer an ASCII world.

-- 
Daniel Stutzbach, Ph.D.             President, Stutzbach Enterprises LLC

From jimjjewett at gmail.com  Wed Jun 20 20:33:10 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Jun 2007 14:33:10 -0400
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <6382747756546996159@unknownmsgid>
References: <f59scm$m59$1@sea.gmane.org> <f5bhac$qga$1@sea.gmane.org>
	<6382747756546996159@unknownmsgid>
Message-ID: <fb6fbf560706201133x335a6cfld37527a8d8616699@mail.gmail.com>

On 6/20/07, Bill Janssen <janssen at parc.com> wrote:
> Not bad, but if you're going that route, I think I'd get rid of the
> optional arguments, and just say
>
>     seek_from_beginning(INCR: int)
>
>     seek_from_current(INCR: int)
>
>     seek_from_end(DECR: int)


    goto(pos)  # absolute

    move(incr:int)  # relative to current position

negative numbers can be interpreted naturally; for move they go
backwards, and for goto they count from the end.

This would require either a length, or a special value (None?) for at
least one of Start and End, because 0 == -0.

Note that this makes sense for bytes; I'm not sure exactly how unicode
characters should even be counted, without a normalization promise.
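
In terms of the current whence-based API, the pair might look like this
(a sketch, using the names above):

    def goto(f, pos):
        # negative positions count from the end; note 0 == -0, so
        # "end of file" would need its own spelling (None? a length?)
        return f.seek(pos, 0) if pos >= 0 else f.seek(pos, 2)

    def move(f, incr):
        return f.seek(incr, 1)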

-jJ

From lists at cheimes.de  Wed Jun 20 19:45:40 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 19:45:40 +0200
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <07Jun20.101108pdt."57996"@synergy1.parc.xerox.com>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>	<f59scm$m59$1@sea.gmane.org>	<07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>	<f5bhac$qga$1@sea.gmane.org>
	<07Jun20.101108pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <f5bp46$h6v$1@sea.gmane.org>

Bill Janssen wrote:
> Not bad, but if you're going that route, I think I'd get rid of the
> optional arguments, and just say
> 
>     seek_from_beginning(INCR: int)
> 
>     seek_from_current(INCR: int)
> 
>     seek_from_end(DECR: int)

I don't like it. It's too noisy and too much to type. My mini proposal
has the benefit that it is backward compatible.

Besides, your argument names aren't correct. seek_from_current's INCR can
be negative to seek backward, and seek_from_end's DECR can be positive to
enlarge a file. On some OSes a seek past EOF creates a sparse file.


From brett at python.org  Wed Jun 20 20:50:50 2007
From: brett at python.org (Brett Cannon)
Date: Wed, 20 Jun 2007 11:50:50 -0700
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <ca471dc20706201024m22080y168b120474205123@mail.gmail.com>
References: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>
	<467961EA.4060007@v.loewis.de>
	<ca471dc20706201024m22080y168b120474205123@mail.gmail.com>
Message-ID: <bbaeab100706201150m27e94b5by61134459a5015298@mail.gmail.com>

On 6/20/07, Guido van Rossum <guido at python.org> wrote:
> On 6/20/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > > So, my question is how best to handle this test (and thus other tests
> > > like it).  Should it just continue to fail until someone fixes
> > > _bsddb.c to accept Unicode keys (and thus start up a FAILING file
> > > listing the various tests that are failing and doc which ones are
> > > expected to fail until something specific changes)?  Or do we silence
> > > the failure by making the constants pass through str8?  Or should str8
> > > not even be used at all since (I assume) it won't survive the merge
> > > back into p3yk?
> >
> > This goes back to the text-vs-binary debate. I _think_ bsddb inherently
> > operates on binary data, i.e. neither keys nor values need to be text
> > in some sense.
> >
> > So the most natural way would be to make it accept binary data only on
> > input, and always produce binary data on output. Any *usage* that
> > expects to be able to pass in strings is broken.
>
> OTOH, pragmatically, people will generally use text strings for db keys.
>
> I'm not sure how to decide this; perhaps we need to take it public.

That's fine since I don't want to fix it.  =)  So kick this out to
python-dev then?

And speaking of struni, when I realized that fixing _bsddb.c was not
going to be simple, I moved on to the next test (test_asynchat) and
came across a string with an 's' prefix.  Just to make sure I got
everything straight: str8 produces a classic str instance (pure ASCII),
and a string with an 's' prefix is a str8 string.  Are there any
other differences to be aware of when working on the branch?

And I assume the PyString API is going away, so when working on a
module one should just tear out use of the API and convert it over to
PyUnicode, correct?  And do the same for "s" format characters in
Py_BuildValue and PyArg_ParseTuple?

I just want to get an idea of the basic process for doing the
conversion so that I don't have to figure it out the hard way.

-Brett

From brett at python.org  Wed Jun 20 20:58:44 2007
From: brett at python.org (Brett Cannon)
Date: Wed, 20 Jun 2007 11:58:44 -0700
Subject: [Python-3000] Summary of the differences between StringIO and
	cStringIO for PEP-3108
In-Reply-To: <acd65fa20706201058o4388f8c5q5a3167a7619ebc85@mail.gmail.com>
References: <acd65fa20706201058o4388f8c5q5a3167a7619ebc85@mail.gmail.com>
Message-ID: <bbaeab100706201158k3745aa66g8e3aa794c9570ce2@mail.gmail.com>

On 6/20/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> Hi,
>
> I've written a short summary of the differences between the StringIO and
> cStringIO modules.  I attached it as a patch for PEP-3108.

Thanks for the summary, Alexandre.  Luckily your new version for the
io library does away with all of those issues for Py3K.

-Brett

From alexandre at peadrop.com  Wed Jun 20 21:05:17 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Wed, 20 Jun 2007 15:05:17 -0400
Subject: [Python-3000] Summary of the differences between StringIO and
	cStringIO for PEP-3108
In-Reply-To: <bbaeab100706201158k3745aa66g8e3aa794c9570ce2@mail.gmail.com>
References: <acd65fa20706201058o4388f8c5q5a3167a7619ebc85@mail.gmail.com>
	<bbaeab100706201158k3745aa66g8e3aa794c9570ce2@mail.gmail.com>
Message-ID: <acd65fa20706201205k1ac66678sd57454bb738bb6bc@mail.gmail.com>

On 6/20/07, Brett Cannon <brett at python.org> wrote:
> Thanks for the summary, Alexandre.  Luckily your new version for the
> io library does away with all of those issues for Py3K.

Yes, all these issues are fixed (except the pickle thing) in my new version.

-- Alexandre

From lists at cheimes.de  Wed Jun 20 20:40:19 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 20:40:19 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>
	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	<f59a3o$rnn$1@sea.gmane.org>
	<4679020A.8020609@gmail.com>	<f5bi01$skk$1@sea.gmane.org>
	<e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>
Message-ID: <f5bsak$4ub$1@sea.gmane.org>

Alex Martelli wrote:
> No, in 3.0 they'll return iterables -- you really SHOULD read Guido's
> blog entry referred to at the top of this thread,
> <http://www.artima.com/weblogs/viewpost.jsp?thread=208549>, before
> discussing Python 3.0 issues.

I read it. I also wasn't sure if map returns a special iterable like
dict.keys() or a list so I tried it before I wrote my posting:

Python 3.0x (p3yk:56022, Jun 18 2007, 21:10:13)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> map(str, range(5))
['0', '1', '2', '3', '4']
>>> type(map(str, range(5)))
<type 'list'>

It looks like an ordinary list to me.

Christian


From eopadoan at altavix.com  Wed Jun 20 21:18:41 2007
From: eopadoan at altavix.com (Eduardo "EdCrypt" O. Padoan)
Date: Wed, 20 Jun 2007 16:18:41 -0300
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <f5bsak$4ub$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<f59a3o$rnn$1@sea.gmane.org> <4679020A.8020609@gmail.com>
	<f5bi01$skk$1@sea.gmane.org>
	<e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>
	<f5bsak$4ub$1@sea.gmane.org>
Message-ID: <dea92f560706201218r580e6134sec4a73a83f96056f@mail.gmail.com>

> It looks like an ordinary list to me.

There are many things to implement yet.

From martin at v.loewis.de  Wed Jun 20 21:25:32 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Wed, 20 Jun 2007 21:25:32 +0200
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <bbaeab100706201150m27e94b5by61134459a5015298@mail.gmail.com>
References: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>	
	<467961EA.4060007@v.loewis.de>	
	<ca471dc20706201024m22080y168b120474205123@mail.gmail.com>
	<bbaeab100706201150m27e94b5by61134459a5015298@mail.gmail.com>
Message-ID: <46797F2C.8090505@v.loewis.de>

> And speaking of struni, when I realized that fixing _bsddb.c was not
> going to be simple, I moved on to the next test (test_asynchat) and
> came across a string with an 's' prefix.  Just to make sure I got
> everything straight, str8 produces a classic str instance (pure ASCII)
> and a string with an 's' prefix is a str8 string.  Are there any
> other differences to be aware of when working on the branch?

There appears to be some disagreement on what the objective for that
branch is. I would personally like to see str8 disappear, at least
from the Python API. IOW, if a str8 shows up somewhere, check whether it
might be easy to replace it with a Unicode string.
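
For concreteness, a hypothetical before/after using the branch-only
spellings discussed in this thread (str8 and the s'...' literal are
temporary and expected to go away):

    key = s'spam'                 # str8, the old 8-bit string type
    key = 'spam'                  # usually the right fix: a unicode str
    key = 'spam'.encode('ascii')  # or bytes, when the data is binary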

> And I assume the PyString API is going away, so when working on a
> module one should just tear out use of the API and convert it over to
> PyUnicode, correct?  

The API will stay. However, it should get used less and less. Whether
to convert to Unicode depends on the use case. It might be that
converting to binary is the right answer.

> And do the same for "s" format characters in
> Py_BuildValue and PyArg_ParseTuple?

Again, it depends. For ParseTuple, the default encoding is applied, which
is supposed to always work, and always provides you with a char*.
(It currently produces a str8 internally, but eventually should
create a bytes object instead.)

For BuildValue, I would recommend that the s format code produce
a Unicode object. That might be ambiguous, as some people might
want to create bytes instead, but I would recommend designating a
different code for creating bytes in BuildValue.

> I just want to get an idea of the basic process going on to do the
> conversion so that I don't have to figure it out the hard way.

I think many questions are still open, and should be discussed
(or Guido would have to publish a policy in case he made up
his mind already).

Regards,
Martin

From g.brandl at gmx.net  Wed Jun 20 21:14:17 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Wed, 20 Jun 2007 21:14:17 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <f5bsak$4ub$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	<f59a3o$rnn$1@sea.gmane.org>	<4679020A.8020609@gmail.com>	<f5bi01$skk$1@sea.gmane.org>	<e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>
	<f5bsak$4ub$1@sea.gmane.org>
Message-ID: <f5bua6$d9a$1@sea.gmane.org>

Christian Heimes schrieb:
> Alex Martelli wrote:
>> No, in 3.0 they'll return iterables -- you really SHOULD read Guido's
>> blog entry referred to at the top of this thread,
>> <http://www.artima.com/weblogs/viewpost.jsp?thread=208549>, before
>> discussing Python 3.0 issues.
> 
> I read it. I also wasn't sure if map returns a special iterable like
> dict.keys() or a list so I tried it before I wrote my posting:
> 
> Python 3.0x (p3yk:56022, Jun 18 2007, 21:10:13)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> map(str, range(5))
> ['0', '1', '2', '3', '4']
>>>> type(map(str, range(5)))
> <type 'list'>
> 
> It looks like an ordinary list to me.

Well, not everything that's planned is implemented yet in Py3k.
So, you should really believe the *plans* rather than the *branch*.
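
For reference, the *planned* behaviour is for map() to return an
iterator, so you would wrap it in list() when you really want a list:

    >>> m = map(str, range(5))    # planned: an iterator, not a list
    >>> list(m)
    ['0', '1', '2', '3', '4']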

Georg


From rrr at ronadam.com  Wed Jun 20 21:29:10 2007
From: rrr at ronadam.com (Ron Adam)
Date: Wed, 20 Jun 2007 14:29:10 -0500
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <bbaeab100706201150m27e94b5by61134459a5015298@mail.gmail.com>
References: <bbaeab100706191413h640b6fcdg4c7ad8ce7b842ebc@mail.gmail.com>	<467961EA.4060007@v.loewis.de>	<ca471dc20706201024m22080y168b120474205123@mail.gmail.com>
	<bbaeab100706201150m27e94b5by61134459a5015298@mail.gmail.com>
Message-ID: <46798006.70407@ronadam.com>



Brett Cannon wrote:
> On 6/20/07, Guido van Rossum <guido at python.org> wrote:
>> On 6/20/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>>>> So, my question is how best to handle this test (and thus other tests
>>>> like it).  Should it just continue to fail until someone fixes
>>>> _bsddb.c to accept Unicode keys (and thus start up a FAILING file
>>>> listing the various tests that are failing and doc which ones are
>>>> expected to fail until something specific changes)?  Or do we silence
>>>> the failure by making the constants pass through str8?  Or should str8
>>>> not even be used at all since (I assume) it won't survive the merge
>>>> back into p3yk?
>>> This goes back to the text-vs-binary debate. I _think_ bsddb inherently
>>> operates on binary data, i.e. neither keys nor values need to be text
>>> in some sense.
>>>
>>> So the most natural way would be to make it accept binary data only on
>>> input, and always produce binary data on output. Any *usage* that
>>> expects to be able to pass in strings is broken.
>> OTOH, pragmatically, people will generally use text strings for db keys.
>>
>> I'm not sure how to decide this; perhaps we need to take it public.
> 
> That's fine since I don't want to fix it.  =)  So kick this out to
> python-dev then?
> 
> And speaking of struni, when I realized that fixing _bsddb.c was not
> going to be simple, I moved on to the next test (test_asynchat) and
> came across a string with an 's' prefix.  Just to make sure I got
> everything straight, str8 produces a classic str instance (pure ASCII)
> and a string with an 's' prefix is a str8 string.  Are there any
> other differences to be aware of when working on the branch?

There's no 'u' prefix on unicode strings obviously.  ;-)

The 's' prefix was my idea as a temporary way to differentiate unicode and 
str8 while the conversion is taking place.  It will most likely be removed 
after all or almost all of the str8 values are replaced by unicode or bytes.

Ron


> And I assume the PyString API is going away, so when working on a
> module one should just tear out use of the API and convert it over to
> PyUnicode, correct?  And do the same for "s" format characters in
> Py_BuildValue and PyArg_ParseTuple?
 >
> I just want to get an idea of the basic process going on to do the
> conversion so that I don't have to figure it out the hard way.



From mike.klaas at gmail.com  Wed Jun 20 22:34:15 2007
From: mike.klaas at gmail.com (Mike Klaas)
Date: Wed, 20 Jun 2007 13:34:15 -0700
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <20070620175125.4947.1259671454.divmod.quotient.3074@ohm>
References: <20070620175125.4947.1259671454.divmod.quotient.3074@ohm>
Message-ID: <30C6361F-4BDC-4638-9AF0-2BB1790BF4BD@gmail.com>

On 20-Jun-07, at 10:51 AM, Jean-Paul Calderone wrote:

> On Wed, 20 Jun 2007 10:24:55 -0700, Guido van Rossum  
> <guido at python.org> wrote:
>
> > OTOH, pragmatically, people will generally use text strings for  
> db keys.
>>
>> I'm not sure how to decide this; perhaps we need to take it public.
>
> If it helps, after having used bsddb for a couple years and  
> developed a
> non-trivial library on top of it, what Martin said seems most  
> sensible to
> me.

As an extremely heavy user of bsddb, +1.

Berkeley db is rather sensitive to how things are serialized (for
instance, big-endian is much better for ints, performance-wise), so  
it is necessary to let the developer control this on a bytestring  
level.  It is easy to write a wrapper on top of this to do the  
serialization automatically.
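
For concreteness, a minimal sketch of such a wrapper (the helper names
are made up); big-endian packing makes bsddb's lexicographic byte
order agree with numeric order for non-negative ints:

    import struct

    def int_to_key(n):
        # 8-byte big-endian unsigned int; assumes n >= 0
        return struct.pack('>Q', n)

    def key_to_int(b):
        return struct.unpack('>Q', b)[0]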

-Mike




From gareth.mccaughan at pobox.com  Thu Jun 21 00:58:15 2007
From: gareth.mccaughan at pobox.com (Gareth McCaughan)
Date: Wed, 20 Jun 2007 23:58:15 +0100
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <5DF9AC3D-CD82-4912-8C19-004131106F40@nicko.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<46793D3A.8020303@gmail.com>
	<5DF9AC3D-CD82-4912-8C19-004131106F40@nicko.org>
Message-ID: <200706202358.16087.gareth.mccaughan@pobox.com>

On Wednesday 20 June 2007 18:12, Nicko van Someren wrote
(about summing strings, etc.):

> The need to have an explicit 'start' value just seems wrong.  It's
> horribly inconsistent.  Things that can be added to integers work
> without initialisers but things that can be added to each other (for
> instance numbers in number fields or vectors in vector spaces) can
> not.

I think there's another reason for not allowing sum on arbitrary
objects, which I find more convincing (though still maybe not
convincing enough):

    sum([1,1,1]) = 3
    sum([1,1]) = 2
    sum([1]) = 1
    sum([]) = 0

    sum(["a","a","a"]) = "aaa"
    sum(["a","a"]) = "aa"
    sum(["a"]) = "a"
    sum([]) = ????

That is: if you're writing code that expects sum() to do something
sensible with lists of strings, you'll usually need it to do something
sensible with *empty* lists of strings -- but that isn't possible,
because there's only one empty list and it has to serve as the empty
list of integers too.

> I think in most people's minds the 'sum' operation is like an 
> evaluation of "+".join(...): you are sticking an addition operation
> between the elements of the list.  The need to have an explicit
> initial value means that sum() is not the sum function for anyone who
> does math in any sort of non-standard number space.

If you allow the elements of your number field to be added to
plain ol' 0, then sum() will work fine for them, no? (But that
isn't such a plausible prospect for vectors.)
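
(decimal.Decimal happens to satisfy that condition, so sum() already
works on it unmodified:

    >>> from decimal import Decimal
    >>> sum([Decimal("1.1"), Decimal("2.2")])   # Decimal + 0 is defined
    Decimal("3.3")
    >>> sum([])                                 # but the empty case is just 0
    0

which is fine for numbers, and useless for vectors.)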

-- 
Gareth McCaughan

From janssen at parc.com  Thu Jun 21 02:33:45 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 17:33:45 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com> 
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<6002181751375776921@unknownmsgid>
	<eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com>
Message-ID: <07Jun20.173354pdt."57996"@synergy1.parc.xerox.com>

Daniel Stutzbach wrote:
> On 6/20/07, Bill Janssen <janssen at parc.com> wrote:
> > > Ah, not everyone dealing with text is dealing with line-delimited
> > > text, you know...
> >
> > It's really the only difference between text and non-text.
> 
> Text is a sequence of characters.  Non-text is a sequence of bytes.
> Characters may be multi-byte.  It is no longer an ASCII world.

Yes, of course, Daniel, but I was speaking of the contents of files,
and files are inherently sequences of bytes.  If we are talking about
some layer which interprets the contents of a file, just saying "give
me N characters" isn't enough.  We need to say, "N characters assuming
a text encoding of M, with a normalization policy of Q, and a newline
policy of R".  If we don't, we can't just "read" N characters safely.
So I think it's broken to put this in the TextIOBase class; instead,
there should be some wrapper class that does buffering and can be
configured as to (M, Q, R).
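
A rough sketch of what I have in mind (names invented here, and the
boundary cases -- combining marks or \r\n split across reads -- are
glossed over):

    import codecs, unicodedata

    class TextReader:
        # Wraps a byte stream with an encoding (M), a Unicode
        # normalization form (Q), and a newline policy (R).
        def __init__(self, raw, encoding='utf-8', form='NFC', newline='\n'):
            self._raw = raw      # anything with .read(n) -> bytes
            self._decode = codecs.getincrementaldecoder(encoding)().decode
            self._form = form
            self._newline = newline
            self._buf = ''

        def read(self, n):
            # Decode incrementally until n code units are buffered.
            while len(self._buf) < n:
                chunk = self._raw.read(512)
                self._buf += self._decode(chunk, not chunk)
                if not chunk:
                    break
            out, self._buf = self._buf[:n], self._buf[n:]
            out = unicodedata.normalize(self._form, out)
            return out.replace('\r\n', self._newline)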

Bill

From janssen at parc.com  Thu Jun 21 02:46:42 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 17:46:42 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <fb6fbf560706201133x335a6cfld37527a8d8616699@mail.gmail.com> 
References: <f59scm$m59$1@sea.gmane.org> <f5bhac$qga$1@sea.gmane.org>
	<6382747756546996159@unknownmsgid>
	<fb6fbf560706201133x335a6cfld37527a8d8616699@mail.gmail.com>
Message-ID: <07Jun20.174649pdt."57996"@synergy1.parc.xerox.com>

>     goto(pos)  # absolute
> 
>     move(incr:int)  # relative to current position
> 
> negative numbers can be interpreted naturally; for move they go
> backwards, and for goto they count from the end.
> 
> This would require either a length, or a special value (None?) for at
> least one of Start and End, because 0 == -0.

I like this idea.  Define START and END as values in the "file" class.

> I'm not sure exactly how unicode
> characters even should be counted, without a normalization promise.

No one's sure.  That's why "read(N: int) => str" doesn't make sense.

Bill

From janssen at parc.com  Thu Jun 21 02:49:47 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 17:49:47 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <f5bp46$h6v$1@sea.gmane.org> 
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com>
	<f59scm$m59$1@sea.gmane.org>
	<07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>
	<f5bhac$qga$1@sea.gmane.org>
	<07Jun20.101108pdt."57996"@synergy1.parc.xerox.com>
	<f5bp46$h6v$1@sea.gmane.org>
Message-ID: <07Jun20.174955pdt."57996"@synergy1.parc.xerox.com>

Christian Heimes wrote:
> Bill Janssen wrote:
> > Not bad, but if you're going that route, I think I'd get rid of the
> > optional arguments, and just say
> > 
> >     seek_from_beginning(INCR: int)
> > 
> >     seek_from_current(INCR: int)
> > 
> >     seek_from_end(DECR: int)
>
> I don't like it. It's too noisy and too much to type.

Well, it would be noisy, and the complaint about length would apply
if these were widely used many times in one piece of code, but they
aren't.  So that doesn't matter, and in its favor, it's clear,
consistent, and easy to remember.

Bill




From daniel at stutzbachenterprises.com  Thu Jun 21 02:54:17 2007
From: daniel at stutzbachenterprises.com (Daniel Stutzbach)
Date: Wed, 20 Jun 2007 19:54:17 -0500
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <-665892861201335771@unknownmsgid>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<6002181751375776921@unknownmsgid>
	<eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com>
	<-665892861201335771@unknownmsgid>
Message-ID: <eae285400706201754s7a7d7cbfidb9feddd8ad530f9@mail.gmail.com>

On 6/20/07, Bill Janssen <janssen at parc.com> wrote:
> Yes, of course, Daniel, but I was speaking of the contents of files,
> and files are inherently sequences of bytes.  If we are talking about
> some layer which interprets the contents of a file, just saying "give
> me N characters" isn't enough.  We need to say, "N characters assuming
> a text encoding of M, with a normalization policy of Q, and a newline
> policy of R".  If we don't, we can't just "read" N characters safely.
> So I think it's broken to put this in the TextIOBase class; instead,
> there should be some wrapper class that does buffering and can be
> configured as to (M, Q, R).

The PEP specifies that TextIOWrapper objects (the primary
implementation of the TextIOBase interface) are created via the
following signature:

    .__init__(self, buffer, encoding=None, newline=None)

In other words, TextIOBase *is* the wrapper type that does the
buffering and allows the user to configure M and R.
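
Hypothetically (the file name is made up, assuming the io module from
the p3yk branch):

    import io

    raw = io.open('report.log', 'rb')       # a buffered binary stream
    text = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')
    line = text.readline()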

Are you suggesting that TextIOBase should be split into two classes,
one of which provides the (M, R) functionality and one of which does
not?  If so, how would the later be different from the RawIOBase and
BufferedIOBase classes, already described in the PEP?

I'm not sure I 100% understand what you mean by "normalization policy"
(Q).  Could you give an example?

-- 
Daniel Stutzbach, Ph.D.             President, Stutzbach Enterprises LLC

From janssen at parc.com  Thu Jun 21 04:00:57 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 19:00:57 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <eae285400706201754s7a7d7cbfidb9feddd8ad530f9@mail.gmail.com> 
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<6002181751375776921@unknownmsgid>
	<eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com>
	<-665892861201335771@unknownmsgid>
	<eae285400706201754s7a7d7cbfidb9feddd8ad530f9@mail.gmail.com>
Message-ID: <07Jun20.190105pdt."57996"@synergy1.parc.xerox.com>

> I'm not sure I 100% understand what you mean by "normalization policy"
> (Q).  Could you give an example?

I was speaking of the 4 different normalization forms for Unicode,
which can produce different code-point sequences.  Since "strings" in
Python-3000 aren't really strings, but instead are immutable
code-point sequences, this means that any byte-to-string
transformation which doesn't specify this can produce different
strings from the same bytes without violating its constraints.

Bill


From greg.ewing at canterbury.ac.nz  Thu Jun 21 04:43:00 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Jun 2007 14:43:00 +1200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<07Jun19.105121pdt.57996@synergy1.parc.xerox.com>
	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>
	<07Jun19.121327pdt.57996@synergy1.parc.xerox.com>
	<4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
	<4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
Message-ID: <4679E5B4.4040507@canterbury.ac.nz>

Nicko van Someren wrote:
> perhaps the sum() function could be made properly polymorphic  
> so as to remove one more class of use cases for reduce().

That's unlikely to happen. As I remember things,
sum() was deliberately restricted to numbers so
as not to present an attractive nuisance as an
inefficient way to concatenate a list of strings.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Thu Jun 21 05:31:18 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Jun 2007 15:31:18 +1200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <f5bi01$skk$1@sea.gmane.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<07Jun19.105121pdt.57996@synergy1.parc.xerox.com>
	<f59a3o$rnn$1@sea.gmane.org>
	<4679020A.8020609@gmail.com> <f5bi01$skk$1@sea.gmane.org>
Message-ID: <4679F106.2010307@canterbury.ac.nz>

Christian Heimes wrote:
> IIRC map and filter are a bit slower than list comprehensions

But isn't that true only when the function passed
is a Python function? Or are LCs faster now even
for C functions?

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From showell30 at yahoo.com  Thu Jun 21 11:49:46 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Thu, 21 Jun 2007 02:49:46 -0700 (PDT)
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun20.100913pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <77903.26658.qm@web33507.mail.mud.yahoo.com>


--- Bill Janssen <janssen at parc.com> wrote:

> [...]  It's more important to make things work
> consistently than to only
> have "one way".  "sum" should concatenate strings.
> 

"Sum" should sum stuff.  You can't sum strings.  It
makes no sense in English.

You can concatenate strings, or you can join them
using a connecting string.  Since concatenating is
just a degenerate case of joining, it's hard to
justify a concat() builtin when you already have
''.join(), but I'd rather have a concat() builtin than
an insensible interpretation of sum().  

Multiple additions (with "+") mean "sum" in
arithmetic, but you can't generalize that to strings
and text processing.  The "+" operator for any two
strings is not about adding--it's about
joining/concatenating.  So multiple applications of
"+" on strings aren't a sum.  They're just a longer
join/concatenation. 

Remember also that you can't have "+" operate on a
string/integer pair.  It's just practicality that
Python uses the same punctuation for addition and
concatenation.  In English it's sensible to have
punctuation for addition, so it has "+," but it needs
no punctuation for joining/concatenation, so Python
had to pick the closest match.

From ncoghlan at gmail.com  Thu Jun 21 15:40:29 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Jun 2007 23:40:29 +1000
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <9e804ac0706200813y68a7664dt206e65c3fbb9fcf7@mail.gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>	
	<7301715244131583311@unknownmsgid>	
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	
	<435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com>	
	<4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>	
	<4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>	
	<46793D3A.8020303@gmail.com>
	<9e804ac0706200813y68a7664dt206e65c3fbb9fcf7@mail.gmail.com>
Message-ID: <467A7FCD.6010506@gmail.com>

Thomas Wouters wrote:
> 
> 
> On 6/20/07, *Nick Coghlan* <ncoghlan at gmail.com 
> <mailto:ncoghlan at gmail.com>> wrote:
> 
>     Strings are explicitly disallowed (because Guido doesn't want a second
>     way to spell ''.join(seq), as far as I know):
> 
> 
> More importantly, because it has positively abysmal performance, just 
> like the reduce() solution (and, in fact, many reduce solutions to 
> problems better solved otherwise :-) Like the old input(), backticks and 
> allowing the mixing of tabs and spaces, while it has uses, the ease and 
> frequency with which it is misused outweigh the utility enough that it 
> should not be in such a prominent place.

The rejected suggestion which led to the current error message was for
sum(seq, '') to call ''.join(seq) behind the scenes to actually do the 
string concatenation - the performance would have been identical to 
calling ''.join(seq) directly. You certainly wouldn't want to use the 
same summing algorithm as is used for mutable sequences - as you say, 
the performance would be terrible.
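
A quick sketch of why the naive algorithm is so bad for strings: each
+= copies everything accumulated so far, so the total work is
quadratic, while join() makes a single pass:

    def slow_concat(strings):
        total = ''
        for s in strings:
            total = total + s    # copies len(total) characters each time
        return total

    def fast_concat(strings):
        return ''.join(strings)  # one pass, one allocation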

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Thu Jun 21 15:45:20 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Jun 2007 23:45:20 +1000
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <200706202358.16087.gareth.mccaughan@pobox.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<46793D3A.8020303@gmail.com>	<5DF9AC3D-CD82-4912-8C19-004131106F40@nicko.org>
	<200706202358.16087.gareth.mccaughan@pobox.com>
Message-ID: <467A80F0.7060904@gmail.com>

Gareth McCaughan wrote:
> That is: if you're writing code that expects sum() to do something
> sensible with lists of strings, you'll usually need it to do something
> sensible with *empty* lists of strings -- but that isn't possible,
> because there's only one empty list and it has to serve as the empty
> list of integers too.

That is indeed the reason for the explicit start value - sum() needs to 
know what to return when the supplied iterable is empty.
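
That is, the start value doubles as the answer for an empty iterable:

    >>> sum([], 2.5)
    2.5
    >>> sum([1, 1], 2.5)
    4.5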

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jimjjewett at gmail.com  Thu Jun 21 18:12:22 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 21 Jun 2007 12:12:22 -0400
Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base
	classes]
Message-ID: <fb6fbf560706210912t5e2eea2apf1b8d30559d082e7@mail.gmail.com>

Should canonicalization should be an extra feature of the Text IO, on
par with character encoding?

On 6/20/07, Daniel Stutzbach <daniel at stutzbachenterprises.com> wrote:
> On 6/20/07, Bill Janssen <janssen at parc.com> wrote:

[For the TextIO, as opposed to the raw IO, Bill originally proposed
dropping read(n), because character count is not well-defined.  Dan
objected that not all text has useful line breaks.]

> > ... just saying "give me N characters" isn't enough.
> > We need to say, "N characters assuming a text
> > encoding of M, with a normalization policy of Q,
> > and a newline policy of R".

[ Daniel points out that TextIO already handles M and R ]

> I'm not sure I 100% understand what you mean by
> "normalization policy" (Q).  Could you give an example?

How many characters are there in ö?

If I ask for just one character, do I get only the o, without the
diaeresis, or do I get both (since they are linguistically one
letter), or does it depend on how some editor happened to store it?

Distinguishing strings based on an accident of storage would violate
unicode standards.  (More precisely, it would be a violation of
standards to assume that they are distinguished.)
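
unicodedata makes the ambiguity concrete (a 2.x-style session; the
precomposed and decomposed spellings compare unequal as code points):

    >>> import unicodedata
    >>> single = u'\u00f6'                         # o-with-diaeresis
    >>> len(unicodedata.normalize('NFD', single))  # o + combining mark
    2
    >>> len(unicodedata.normalize('NFC', u'o\u0308'))
    1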

To the extent that you are treating the data as text rather than
binary, NFC or NFD normalization should always be appropriate.

In practice, binary concerns do intrude even for text data; you may
well want to save it back out in the original encoding, without any
spurious changes.

Proposal:

    open would default to NFC.

    import would open source code with NFKC.

    An explicit None canonicalization would allow round-trips without
spurious binary-level changes.

-jJ

From amcnabb at mcnabbs.org  Thu Jun 21 17:45:01 2007
From: amcnabb at mcnabbs.org (Andrew McNabb)
Date: Thu, 21 Jun 2007 09:45:01 -0600
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <77903.26658.qm@web33507.mail.mud.yahoo.com>
References: <07Jun20.100913pdt."57996"@synergy1.parc.xerox.com>
	<77903.26658.qm@web33507.mail.mud.yahoo.com>
Message-ID: <20070621154501.GA1607@mcnabbs.org>

On Thu, Jun 21, 2007 at 02:49:46AM -0700, Steve Howell wrote:
> 
> "Sum" should sum stuff.  You can't sum strings.  It makes no sense in
> English.

I think you're technically right, but I frequently find myself using the
phrase "add together a list of strings" when it would be more accurate
to say "concatenate a list of strings."  I can't say I feel bad when I
use this terminology.

> Multiple additions (with "+") mean "sum" in arithmetic, but you can't
> generalize that to strings and text processing.  The "+" operator for
> any two strings is not about adding--it's about joining/concatenating.
> So multiple applications of "+" on strings aren't a sum.  They're just
> a longer join/concatenation. 

I guess I don't find the distinction between adding and concatenating as
strong as you do.

When we write 'a' + 'b', I don't see any problem with saying that we're
adding 'a' and 'b', and I don't think there's anything unclear about
sum(['a', 'b', 'c']).


-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

From janssen at parc.com  Thu Jun 21 18:19:04 2007
From: janssen at parc.com (Bill Janssen)
Date: Thu, 21 Jun 2007 09:19:04 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <77903.26658.qm@web33507.mail.mud.yahoo.com> 
References: <77903.26658.qm@web33507.mail.mud.yahoo.com>
Message-ID: <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com>

> Multiple additions (with "+") mean "sum" in
> arithmetic, but you can't generalize that to strings
> and text processing.  The "+" operator for any two
> strings is not about adding--it's about
> joining/concatenating.  So multiple applications of
> "+" on strings aren't a sum.  They're just a longer
> join/concatenation. 

Hmmm.  Your argument would be more persuasive if you couldn't do this
in Python:

>>> a = "abc" + "def" + "ghi" + "jkl"
>>> a
'abcdefghijkl'
>>> 

The real problem with "sum", I think, is that the parameter list is
ill-conceived (perhaps because it was added before variable length
parameter lists were?).  It should be

  sum(*operands)

not

  sum(operands, initialvalue=?)

It should amount to "map(+, operands)".

Bill

From showell30 at yahoo.com  Thu Jun 21 19:12:30 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Thu, 21 Jun 2007 10:12:30 -0700 (PDT)
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <380622.43139.qm@web33515.mail.mud.yahoo.com>


--- Bill Janssen <janssen at parc.com> wrote:

> > Multiple additions (with "+") mean "sum" in
> > arithmetic, but you can't generalize that to
> strings
> > and text processing.  The "+" operator for any two
> > strings is not about adding--it's about
> > joining/concatenating.  So multiple applications
> of
> > "+" on strings aren't a sum.  They're just a
> longer
> > join/concatenation. 
> 
> Hmmm.  Your argument would be more persuasive if you
> couldn't do this
> in Python:
> 
> >>> a = "abc" + "def" + "ghi" + "jkl"
> >>> a
> 'abcdefghijkl'
> >>> 
> 
> The real problem with "sum", I think, is that the
> parameter list is
> ill-conceived (perhaps because it was added before
> variable length
> parameter lists were?).  It should be
> 
>   sum(*operands)
> 
> not
> 
>   sum(operands, initialvalue=?)
> 
> It should amount to "map(+, operands)".
> 

I think you were missing my point, which is that sum
doesn't and shouldn't necessarily have the same
semantics as map(+).

"Sum," in both Python and common English usage, is a
generalization of arithmetic addition, but it's not a
generalization of applying operators that happen to be
spelled "+."

There's no natural English punctuation for
concatenation, and Python's choice of "+" could be
called mostly arbitrary (although it's consistent with
a few other programming languages.)  The following
operators can mean concatenation in various
programming languages:

   +
   .
   &
   ||

Oddly, in English a common way to concatenate words is
with the "-" character.  It means a hyphen in English,
and it's used to create multi-word thingies, but the
operator itself is also a subtraction operator.  So
you could speciously argue that when you concatenate
words in English, you're doing a difference, but under
your proposed Python, you'd be doing a sum.

From jimjjewett at gmail.com  Thu Jun 21 19:18:21 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 21 Jun 2007 13:18:21 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <-405694068883174656@unknownmsgid>
References: <77903.26658.qm@web33507.mail.mud.yahoo.com>
	<-405694068883174656@unknownmsgid>
Message-ID: <fb6fbf560706211018t56115de7j6115b53a905d3a54@mail.gmail.com>

On 6/21/07, Bill Janssen <janssen at parc.com> wrote:
> The real problem with "sum", I think, is that the parameter list is
> ill-conceived (perhaps because it was added before variable length
> parameter lists were?).  It should be

>   sum(*operands)

> not

>   sum(operands, initialvalue=?)

Is this worth fixing in Python 3, where keyword-only parameters become
an option?

    sum(*operands, start=0)
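
A sketch of what the definition would then look like:

    def sum(*operands, start=0):    # py3k keyword-only parameter
        # (shadows the builtin; for illustration only)
        total = start
        for x in operands:
            total = total + x
        return total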

-jJ

From showell30 at yahoo.com  Thu Jun 21 19:33:41 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Thu, 21 Jun 2007 10:33:41 -0700 (PDT)
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070621154501.GA1607@mcnabbs.org>
Message-ID: <141120.84388.qm@web33510.mail.mud.yahoo.com>


--- Andrew McNabb <amcnabb at mcnabbs.org> wrote:

> 
> I think you're technically right, but I frequently
> find myself using the
> phrase "add together a list of strings" when it
> would be more accurate
> to say "concatenate a list of strings."  I can't say
> I feel bad when I
> use this terminology.
> 

Nope, and I wouldn't throw the grammar book at you if
you did.  But if you said a compound word is a "sum"
of smaller words, I might look at you a little funny.
:)

> 
> I guess I don't find the distinction between adding
> and concatenating as
> strong as you do.
> 

Fair enough.

> When we write 'a' + 'b', I don't see any problem
> with saying that we're
> adding 'a' and 'b', and I don't think there's
> anything unclear about
> sum(['a', 'b', 'c']).

I think you're right that most people would guess that
the above returns 'abc', so we're not in major
disagreement.

But I'm approaching usability from another direction,
I guess.  If I wanted to join a series of strings
together, sum() wouldn't be the most naturally
occurring method to me.  To me it would be concat() or
join().  I think ''.join(...) in Python is a tiny
wart, since it's pretty unlikely for a newcomer to
guess the syntax, but I'm not sure they'd guess sum()
either.  The one advantage of ''.join() is that you
can at least deduce it, via introspection, by doing
dir('foo').

My other concern with sum() is just the common pitfall
that you do sum(line_of_numbers.split(',')) and get
'35' when you intended to write code to get 8.  I'd
rather have that fail obviously than subtly.
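
The fix is the explicit cast, which is easy to write and easy to
forget:

    >>> sum(int(x) for x in '3,5'.split(','))
    8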






From janssen at parc.com  Thu Jun 21 19:45:22 2007
From: janssen at parc.com (Bill Janssen)
Date: Thu, 21 Jun 2007 10:45:22 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <141120.84388.qm@web33510.mail.mud.yahoo.com> 
References: <141120.84388.qm@web33510.mail.mud.yahoo.com>
Message-ID: <07Jun21.104531pdt."57996"@synergy1.parc.xerox.com>

> My other concern with sum() is just the common pitfall
> that you do sum(line_of_numbers.split(',')) and get
> '35' when you intended to write code to get 8.  I'd
> rather have that fail obviously than subtly.

Common pitfall?  I  doubt it.  Possible pitfall?  Sure.

Bill

From jjb5 at cornell.edu  Thu Jun 21 19:32:42 2007
From: jjb5 at cornell.edu (Joel Bender)
Date: Thu, 21 Jun 2007 13:32:42 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com>
References: <77903.26658.qm@web33507.mail.mud.yahoo.com>
	<07Jun21.091906pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <467AB63A.7050505@cornell.edu>

> It should be
> 
>   sum(*operands)
> 
> not
> 
>   sum(operands, initialvalue=?)
> 
> It should amount to "map(+, operands)".

Or, to be pedantic, this:

     reduce(lambda x, y: x.__add__(y), operands)


Joel


From janssen at parc.com  Thu Jun 21 19:51:32 2007
From: janssen at parc.com (Bill Janssen)
Date: Thu, 21 Jun 2007 10:51:32 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <380622.43139.qm@web33515.mail.mud.yahoo.com> 
References: <380622.43139.qm@web33515.mail.mud.yahoo.com>
Message-ID: <07Jun21.105136pdt."57996"@synergy1.parc.xerox.com>

> I think you were missing my point, which is that sum
> doesn't and shouldn't necessarily have the same
> semantics as map(+).

It's not that I don't understand your argument, Steve.

I just don't find it effective.  If we are going to distinguish
between "arithmetic addition" and "concatenation", we should find
another operator.

As long as we *don't* do that, my personal preference would be to
either remove "sum" completely, or have it work in a regular fashion,
depending on which data type is passed to it, either as arithmetic
addition or as sequence concatenation.

Bill

From showell30 at yahoo.com  Thu Jun 21 20:04:37 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Thu, 21 Jun 2007 11:04:37 -0700 (PDT)
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun21.104531pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <794697.77794.qm@web33514.mail.mud.yahoo.com>


--- Bill Janssen <janssen at parc.com> wrote:

> > My other concern with sum() is just the common
> pitfall
> > that you do sum(line_of_numbers.split(',')) and
> get
> > '35' when you intended to write code to get 8. 
> I'd
> > rather have that fail obviously than subtlely.
> 
> Common pitfall?  I  doubt it.  Possible pitfall? 
> Sure.
> 

It's a common mistake, for me anyway, to forget to
cast something that I just read from a file into an
integer before performing arithmetic on it.  But it's
usually not a pitfall now.  It's just a quick
exception that I can quickly diagnose and fix.

Try this code under Python 2:

   name, amount, tip = 'Bill,20,1.5'.split(',')
   print name + ' paid ' + sum(amount, tip)

It will throw an obvious exception.

Obviously, this is a pitfall even under current
Python:

   name, amount, tip = 'Bill,20,1.5'.split(',')
   print name + ' paid ' + amount + tip

So then you have three choices on how to improve
Python, one of which you sort of alluded to in your
other reply:

   1) Eliminate the current pitfall by introducing
another operator for concatenation.

   2) Keep sum() as it is, but make the error message
more clear when somebody uses it on strings.  Example:

Sum() cannot be used to join strings.  Perhaps you
meant to use ''.join().

   3) Make sum() have a consistent pitfall with the
"+" operator, even though English/Python has a lot
more latitude with words than punctuation when it
comes to disambiguating concepts.

IMHO #1 is too extreme, #2 is the best option, and #3
doesn't really solve any practical problems.  The
arguments for #3 seem to come from consistency/purity
vantages, which are fine, but not as important as
usability.

I concede this entire argument is based on the perhaps
shaky premise that *most* people never forget to turn
strings into integers, but I fully admit my
fallibility in this regard.

From amcnabb at mcnabbs.org  Thu Jun 21 20:46:18 2007
From: amcnabb at mcnabbs.org (Andrew McNabb)
Date: Thu, 21 Jun 2007 12:46:18 -0600
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <141120.84388.qm@web33510.mail.mud.yahoo.com>
References: <20070621154501.GA1607@mcnabbs.org>
	<141120.84388.qm@web33510.mail.mud.yahoo.com>
Message-ID: <20070621184618.GB1607@mcnabbs.org>

On Thu, Jun 21, 2007 at 10:33:41AM -0700, Steve Howell wrote:
> 
> Nope, and I wouldn't throw the grammar book at you if you did.  But if
> you said a compound word is a "sum" of smaller words, I might look at
> you a little funny.  :)

It wouldn't be the first time someone looked at me a little funny. :)

> But I'm approaching usability from another direction, I guess.  If I
> wanted to join a series of strings together, sum() wouldn't be the
> most naturally occurring method to me.

I agree that on its own, it's not the most natural method.  However,
once you've already used the + operator to join two strings, you are
much more likely to consider sum() for concatenating a list of strings.
I remember being confused the first time I tried it and found that it
didn't work.

In the end, though, it's really not that big a deal.


-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

From showell30 at yahoo.com  Thu Jun 21 20:55:46 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Thu, 21 Jun 2007 11:55:46 -0700 (PDT)
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070621184618.GB1607@mcnabbs.org>
Message-ID: <582524.76259.qm@web33515.mail.mud.yahoo.com>


--- Andrew McNabb <amcnabb at mcnabbs.org> wrote:
> 
> I agree that on its own, it's not the most natural
> method.  However,
> once you've already used the + operator to join two
> strings, you are
> much more likely to consider sum() for concatenating
> a list of strings.
> I remember being confused the first time I tried it
> and found that it
> didn't work.
>  
> In the end, though, it's really not that big a deal.
> 

Sure, I agree with that, although I think we are right
to quibble about this, because both problems, summing
numbers and summing/joining strings, are pretty darn
common, so any brainstorm to make those more natural
under the language is worthy of consideration.  I've
had ''.join() burned into my brain pretty well by now,
so I think I'm unfairly biased, and this is really a
case where newbies have more perspectives than most
people on this list.  But I also hate to have a
language be *too* driven by newbie concerns, because I
think Python also needs to be appreciated over the
long haul.

From rrr at ronadam.com  Thu Jun 21 21:09:07 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 21 Jun 2007 14:09:07 -0500
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun21.105136pdt."57996"@synergy1.parc.xerox.com>
References: <380622.43139.qm@web33515.mail.mud.yahoo.com>
	<07Jun21.105136pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <467ACCD3.6070608@ronadam.com>



Bill Janssen wrote:
>> I think you were missing my point, which is that sum
>> doesn't and shouldn't necessarily have the same
>> semantics as map(+).
> 
> It's not that I don't understand your argument, Steve.
> 
> I just don't find it effective.  If we are going to distinguish
> between "arithmetic addition" and "concatenation", we should find
> another operator.
> 
> As long as we *don't* do that, my personal preference would be to
> either remove "sum" completely, or have it work in a regular fashion,
> depending on which data type is passed to it, either as arithmetic
> addition or as sequence concatenation.

 From the standpoint of readability, and being able to know what a
particular section of code does, I believe it is better to have limits that
make sense in cases where the behavior of a function may change based on 
what the data is.

My preference would be to limit sum() to value addition only, and never do 
concatenation.  For bytes types, it could be the summing of bytes.  This 
could be useful for image data.  For all non-numeric types it would
generate an exception.

And if a general function that joins and/or extends is desired, a separate 
function possibly called merge() might be better.  Then sum() would always 
do numerical addition and merge() would always do concatenation of objects. 
That makes the code much easier to read 6 months from now with a lower 
chance of having subtle bugs.

The main thing for me is how quickly I can look at a block of code and 
determine what it does with a minimum of backtracking and data tracing.

Cheers,
    Ron






From showell30 at yahoo.com  Thu Jun 21 21:15:43 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Thu, 21 Jun 2007 12:15:43 -0700 (PDT)
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <467ACCD3.6070608@ronadam.com>
Message-ID: <121087.44934.qm@web33505.mail.mud.yahoo.com>


--- Ron Adam <rrr at ronadam.com> wrote:

> 
> 
> Bill Janssen wrote:
> >> I think you were missing my point, which is that
> sum
> >> doesn't and shouldn't necessarily have the same
> >> semantics as map(+).
> > 
> > It's not that I don't understand your argument,
> Steve.
> > 
> > I just don't find it effective.  If we are going
> to distinguish
> > between "arithmetic addition" and "concatenation",
> we should find
> > another operator.
> > 
> > As long as we *don't* do that, my personal
> preference would be to
> > either remove "sum" completely, or have it work in
> a regular fashion,
> > depending on which data type is passed to it,
> either as arithmetic
> > addition or as sequence concatenation.
> 
>  From the standpoint of readability and being able
> to know what a 
> particular section of code does I believe it is
> better to have limits that 
> make sense in cases where the behavior of a function
> may change based on 
> what the data is.
> 
> My preference would be to limit sum() to value
> addition only, and never do 
> concatenation.  For bytes types, it could be the
> summing of bytes.  This 
> could be useful for image data.  For all non numeric
> types it would 
> generate an exception.
> 
> And if a general function that joins and/or extends
> is desired, a separate 
> function possibly called merge() might be better. 
> Then sum() would always 
> do numerical addition and merge() would always do
> concatenation of objects. 
> That makes the code much easier to read 6 months
> from now with a lower 
> chance of having subtle bugs.
> 
> The main thing for me is how quickly I can look at a
> block of code and 
> determine what it does with a minimum of back
> tracking and data tracing.
> 

Ron, I obviously agree with your overriding points
100%, and thank you for expressing them better than I
did, but I would object to the name merge().  "Merge"
to me has the semantics of blending strings, not
joining them.  English already has two perfectly well
understood words for this concept:

   "abc", "def" -> "abcdef"

The two English words are "join" and "concatenate." 
Python wisely chose the shorter word, although I can
see arguments for the longer word, as "join" probably
is a tiny bit more semantically overloaded than
"concatenate."

The best slang word for the above is "mush."

From jjb5 at cornell.edu  Thu Jun 21 21:59:03 2007
From: jjb5 at cornell.edu (Joel Bender)
Date: Thu, 21 Jun 2007 15:59:03 -0400
Subject: [Python-3000] join vs. add [was: Python 3000 Status Update
	(Long!)]
In-Reply-To: <467ACCD3.6070608@ronadam.com>
References: <380622.43139.qm@web33515.mail.mud.yahoo.com>	<07Jun21.105136pdt."57996"@synergy1.parc.xerox.com>
	<467ACCD3.6070608@ronadam.com>
Message-ID: <467AD887.2000601@cornell.edu>

> My preference would be to limit sum() to value addition only, and never do 
> concatenation.

I would be happy with that, provided there was a join function and operator:

     >>> join = lambda x: reduce(lambda y, z: y.__join__(z), x)

I think this is clearer than sum():

     >>> join(['a', 'b', 'c'])
     'abc'

It wouldn't interfere with ''.join(), and ''.__add__() could be 
redirected to ''.__join__().

> For all non numeric types it would generate an exception.

How about generating an exception where __add__ isn't defined, so it 
would work on MyFunkyVector type?  I could join my vectors together as
well, since in MyNonEuclideanSpace, it doesn't mean the same thing as "add".


Joel


From fdrake at acm.org  Thu Jun 21 22:14:29 2007
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 21 Jun 2007 16:14:29 -0400
Subject: [Python-3000] join vs. add [was: Python 3000 Status Update
	(Long!)]
In-Reply-To: <467AD887.2000601@cornell.edu>
References: <380622.43139.qm@web33515.mail.mud.yahoo.com>
	<467ACCD3.6070608@ronadam.com> <467AD887.2000601@cornell.edu>
Message-ID: <200706211614.29277.fdrake@acm.org>

On Thursday 21 June 2007, Joel Bender wrote:
 > I think this is clearer than sum():
 >      >>> join(['a', 'b', 'c'])
 >      'abc'
 >
 > It wouldn't interfere with ''.join(), and ''.__add__() could be
 > redirected to ''.__join__().

And then int.__join__ could be defined in confusing ways, too:

  >>> join([4, 2])
  42

There's something appealing about that specific example.  ;-)


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>

From janssen at parc.com  Thu Jun 21 22:21:10 2007
From: janssen at parc.com (Bill Janssen)
Date: Thu, 21 Jun 2007 13:21:10 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <467AB63A.7050505@cornell.edu> 
References: <77903.26658.qm@web33507.mail.mud.yahoo.com>
	<07Jun21.091906pdt."57996"@synergy1.parc.xerox.com>
	<467AB63A.7050505@cornell.edu>
Message-ID: <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com>

> > It should amount to "map(+, operands)".
> 
> Or, to be pedantic, this:
> 
>      reduce(lambda x, y: x.__add__(y), operands)

Don't you mean:

   reduce(lambda x, y:  x.__add__(y), operands[1:], operands[0])

Bill

From rrr at ronadam.com  Fri Jun 22 01:18:04 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 21 Jun 2007 18:18:04 -0500
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <121087.44934.qm@web33505.mail.mud.yahoo.com>
References: <121087.44934.qm@web33505.mail.mud.yahoo.com>
Message-ID: <467B072C.7000906@ronadam.com>



Steve Howell wrote:
> --- Ron Adam <rrr at ronadam.com> wrote:
> 
>>
>> Bill Janssen wrote:
>>>> I think you were missing my point, which is that
>> sum
>>>> doesn't and shouldn't necessarily have the same
>>>> semantics as map(+).
>>> It's not that I don't understand your argument,
>> Steve.
>>> I just don't find it effective.  If we are going
>> to distinguish
>>> between "arithmetic addition" and "concatenation",
>> we should find
>>> another operator.
>>>
>>> As long as we *don't* do that, my personal
>> preference would be to
>>> either remove "sum" completely, or have it work in
>> a regular fashion,
>>> depending on which data type is passed to it,
>> either as arithmetic
>>> addition or as sequence concatenation.
>>  From the standpoint of readability and being able
>> to know what a 
>> particular section of code does I believe it is
>> better to have limits that 
>> make sense in cases where the behavior of a function
>> may change based on 
>> what the data is.
>>
>> My preference would be to limit sum() to value
>> addition only, and never do 
>> concatenation.  For bytes types, it could be the
>> summing of bytes.  This 
>> could be useful for image data.  For all non numeric
>> types it would 
>> generate an exception.
>>
>> And if a general function that joins and/or extends
>> is desired, a separate 
>> function possibly called merge() might be better. 
>> Then sum() would always 
>> do numerical addition and merge() would always do
>> concatenation of objects. 
>> That makes the code much easier to read 6 months
>> from now with a lower 
>> chance of having subtle bugs.
>>
>> The main thing for me is how quickly I can look at a
>> block of code and 
>> determine what it does with a minimum of back
>> tracking and data tracing.
>>
> 
> Ron, I obviously agree with your overriding points
> 100%, and thank you for expressing them better than I
> did, but I would object to the name merge().  "Merge"
> to me has the semantics of blending strings, not
> joining them.  English already has two perfectly well
> understood words for this concept:
> 
>    "abc", "def" -> "abcdef"
> 
> The two English words are "join" and "concatenate." 
> Python wisely chose the shorter word, although I can
> see arguments for the longer word, as "join" probably
> is a tiny bit more semantically overloaded than
> "concatenate."
> 
> The best slang word for the above is "mush."

Yes, join is the better choice for strings only, but I think the discussion 
was also concerned with joining sequences of other types as well. 
list.join() or tuple.join() doesn't work.


For joining sequences we have...

     str.join()
     list.extend()   # but returns None

Sets have a .union() method, but it only works on two sets at a time.

There is no equivalent of .extend() for tuples.  There is an .__add__() method.

A join() function or generator might be able to unify these operations and 
remove the need for sum() to do this.  It can't be a method as it would 
break the general rule that an object should not mutate itself and also 
return itself.
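
One hypothetical shape for such a join() (the dispatch rule here is
invented purely for illustration):

    from itertools import chain

    def join(*seqs):
        first = seqs[0]
        if isinstance(first, str):
            return ''.join(seqs)
        return type(first)(chain(*seqs))

    # join('ab', 'cd')             -> 'abcd'
    # join([1, 2], [3])            -> [1, 2, 3]
    # join((1,), (2, 3))           -> (1, 2, 3)
    # join(set([1]), set([2, 3]))  -> n-ary union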


As for the numeric cases...

There are no int or float methods to get a single value from a sequence of 
values.  So we need to use a function or some other way of doing it.  So 
sum is needed for this.  (well, it's nice to have.)

Currently our choices are:

     sum(seq)

     reduce(lambda x, y: x+y, seq)  # not limited to addition

Summing items across sequences of the same length might be a sumitems()
function.  But then we are getting into numeric territory.

The more complex uses of these are probably just as well done with a for 
loop.

Cheers,
    Ron

From guido at python.org  Fri Jun 22 02:39:34 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 21 Jun 2007 17:39:34 -0700
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <2614969285506109322@unknownmsgid>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<2614969285506109322@unknownmsgid>
Message-ID: <ca471dc20706211739r5e1a747l5c30ff60ffa0ac09@mail.gmail.com>

On 6/20/07, Bill Janssen <janssen at parc.com> wrote:
> > > TextIOBase: this seems an odd mix of high-level and low-level.  I'd
> > > remove "seek", "tell", "read", and "write".  Remember that in Python,
> > > mixins actually work, so that you can provide a file object that
> > > combines several different I/O classes.
> >
> > Huh? All those operations you want to remove are entirely necessary
> > for a number of applications. I'm not sure what you meant about mixins?
>
> I meant that TextIOBase should just provide the operations for text.
> The other operations would be supported, when appropriate, by mixing
> in an appropriate class that provides them.  Remember that this is
> a PEP about base classes.

Um, it's not meant to be just about base classes -- it's also meant to
be about the actual implementations -- both abstract and concrete
classes will be importable from the same module, 'io'. Have you
checked out io.py in the p3yk branch?

> > It doesn't work? Why not? Of course read() should take the number of
> > characters as a parameter, not number of bytes.
>
> Unfortunately, files contain encodings of characters, and those
> encodings may at times be mapped to multiple equivalent strings, at
> least with respect to Unicode, the target for Python-3000.  The
> standard Unicode support for Python-3000 seems to be settling on
> having code-point representations of those strings exposed to the
> application, which means that any specific automatic normalization is
> precluded.  So any particular "readchars(1)" operation may validly
> return different strings even if operating on the same underlying
> file, and may require a different number of read operations to read
> the same underlying bytes.  That is, I believe that the string and/or
> file operations are not well-specified enough to guarantee that this
> won't happen.  This is the same situation we have today, which means
> that the only real way to read Unicode strings from a file will be the
> same as today, that is, read raw bytes from a file, decode them and
> normalize them in some specific way, and then see what string you wind
> up with.  You could probably fix this in the PEP by specifying a
> specific Unicode normalization to use when returning strings.

I don't understand exactly what you're saying, but here's the semantic
model from which I've been operating.

A file contains a sequence of bytes. If you read it all in one fell
swoop, and then decoded it to Unicode (using a specific encoding),
you'd get a specific text string. This is a sequence of code units.
(Whether they are valid code points or characters I don't think we can
guarantee -- I use the GIGO principle.)

*Conceptually*, read(n) simply returns the next n code units;
readline() is equivalent to read(n) for some n, whose value is
determined by looking ahead until the first \n is found.

Universal newlines collapse \r\n into \n and turn lone \r into \n (or
whatever algorithm is deemed right, I'm not sure the latter is still
needed) *before* we reach the sequence of code points that read() and
readline() see.
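
In code form, that collapse amounts to something like this sketch:

    def collapse_newlines(text):
        # universal newlines: \r\n and lone \r both become \n
        return text.replace("\r\n", "\n").replace("\r", "\n")

    assert collapse_newlines("a\r\nb\rc") == "a\nb\nc"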

Files are all about making this conceptual model efficient even if the
file doesn't fit in memory. We have incremental codecs which make this
possible. (We always assume the file doesn't change while we're
reading it; if it does, certain bets are off.)

In my mind, seek() and tell() should work like getpos() and setpos()
in modern C stdio -- tell() returns a "cookie" whose only use is that
you can later pass it to seek() and it will reset the position in the
sequence of code units to where it was when tell() was called. For
many encodings, in practice, seek() and tell() can just use byte
positions since the boundaries between code points always fall on byte
boundaries (but not the other way around). For other encodings, the
implementation currently in io.py encodes the incremental codec state
in the (very) high bits of the cookie (this is convenient since we
have arbitrary precision integers).

Relative seeks (except for a few end cases) are not supported for text files.
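
As a toy illustration of the cookie packing (the bit width and state
encoding here are made up; the actual io.py scheme differs in detail):

    POS_BITS = 64    # assume byte positions fit in 64 bits (illustrative)

    def make_cookie(byte_pos, decoder_state):
        # low bits: byte position; high bits: encoded codec state
        return byte_pos | (decoder_state << POS_BITS)

    def split_cookie(cookie):
        return cookie & ((1 << POS_BITS) - 1), cookie >> POS_BITS

    assert split_cookie(make_cookie(1234, 7)) == (1234, 7)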

> > > feel the need.  Stick to just "readline" and "writeline" for text I/O.
> >
> > Ah, not everyone dealing with text is dealing with line-delimited
> > text, you know...
>
> It's really the only difference between text and non-text.

Again, I don't quite follow this.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From greg.ewing at canterbury.ac.nz  Fri Jun 22 03:27:34 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 22 Jun 2007 13:27:34 +1200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070621154501.GA1607@mcnabbs.org>
References: <07Jun20.100913pdt.57996@synergy1.parc.xerox.com>
	<77903.26658.qm@web33507.mail.mud.yahoo.com>
	<20070621154501.GA1607@mcnabbs.org>
Message-ID: <467B2586.4050901@canterbury.ac.nz>

Andrew McNabb wrote:
> I think you're technically right, but I frequently find myself using the
> phrase "add together a list of strings" when it would be more accurate
> to say "concatenate a list of strings."

The word "add" has a wider connotation in English than
"sum". Consider the following two sentences:

    I put a sandwich and an apple in my lunchbox,
    then I added a banana.

    I put the sum of a sandwich, an apple and a
    banana in my lunchbox.

--
Greg

From greg.ewing at canterbury.ac.nz  Fri Jun 22 03:34:01 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 22 Jun 2007 13:34:01 +1200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun21.091906pdt.57996@synergy1.parc.xerox.com>
References: <77903.26658.qm@web33507.mail.mud.yahoo.com>
	<07Jun21.091906pdt.57996@synergy1.parc.xerox.com>
Message-ID: <467B2709.5050808@canterbury.ac.nz>

Bill Janssen wrote:
> It should be
> 
>   sum(*operands)

That would incur copying of the sequence. It would be
justifiable only if the vast majority of use cases
involved passing the operands as separate arguments,
which I don't think is true.

--
Greg

From daniel at stutzbachenterprises.com  Fri Jun 22 05:40:38 2007
From: daniel at stutzbachenterprises.com (Daniel Stutzbach)
Date: Thu, 21 Jun 2007 22:40:38 -0500
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <ca471dc20706211739r5e1a747l5c30ff60ffa0ac09@mail.gmail.com>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<2614969285506109322@unknownmsgid>
	<ca471dc20706211739r5e1a747l5c30ff60ffa0ac09@mail.gmail.com>
Message-ID: <eae285400706212040m220728a8o957ed880f3bc8adb@mail.gmail.com>

On 6/21/07, Guido van Rossum <guido at python.org> wrote:
> In my mind, seek() and tell() should work like getpos() and setpos()
> in modern C stdio -- tell() returns a "cookie" whose only use is that
> you can later pass it to seek() and it will reset the position in the
> sequence of code units to where it was when tell() was called. For
> many encodings, in practice, seek() and tell() can just use byte
> positions since the boundaries between code points always fall on byte
> boundaries (but not the other way around). For other encodings, the
> implementation currently in io.py encodes the incremental codec state
> in the (very) high bits of the cookie (this is convenient since we
> have arbitrary precision integers).

If the cookie is meant to be opaque to the caller, is there a reason
that the cookie must be an integer?

Specifying the return type as opaque might also reduce the temptation
to perform arithmetic on them, which will work for some codecs
(ASCII), but break later in odd ways for others.

-- 
Daniel Stutzbach, Ph.D.             President, Stutzbach Enterprises LLC

From guido at python.org  Fri Jun 22 07:43:37 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 21 Jun 2007 22:43:37 -0700
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <eae285400706212040m220728a8o957ed880f3bc8adb@mail.gmail.com>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<2614969285506109322@unknownmsgid>
	<ca471dc20706211739r5e1a747l5c30ff60ffa0ac09@mail.gmail.com>
	<eae285400706212040m220728a8o957ed880f3bc8adb@mail.gmail.com>
Message-ID: <ca471dc20706212243tb113088p13783eb6053d12b5@mail.gmail.com>

On 6/21/07, Daniel Stutzbach <daniel at stutzbachenterprises.com> wrote:
> On 6/21/07, Guido van Rossum <guido at python.org> wrote:
> > In my mind, seek() and tell() should work like getpos() and setpos()
> > in modern C stdio -- tell() returns a "cookie" whose only use is that
> > you can later pass it to seek() and it will reset the position in the
> > sequence of code units to where it was when tell() was called. For
> > many encodings, in practice, seek() and tell() can just use byte
> > positions since the boundaries between code points always fall on byte
> > boundaries (but not the other way around). For other encodings, the
> > implementation currently in io.py encodes the incremental codec state
> > in the (very) high bits of the cookie (this is convenient since we
> > have arbitrary precision integers).
>
> If the cookie is meant to be opaque to the caller, is there a reason
> that the cookie must be an integer?

Yes, so the API for seek() and tell() can be the same for binary and
text files. It also makes it easier to persist cookies.

> Specifying the return type as opaque might also reduce the temptation
> to perform arithmetic on them, which will work for some codecs
> (ASCII), but break later in odd ways for others.

I actually like the "open kimono" approach where users can work around
the system if they really need to.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Fri Jun 22 07:54:23 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 21 Jun 2007 22:54:23 -0700
Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base
	classes]
In-Reply-To: <fb6fbf560706210912t5e2eea2apf1b8d30559d082e7@mail.gmail.com>
References: <fb6fbf560706210912t5e2eea2apf1b8d30559d082e7@mail.gmail.com>
Message-ID: <ca471dc20706212254j188472dbge6ef7061f0af0dac@mail.gmail.com>

On 6/21/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> Should canonicalization be an extra feature of the Text IO, on
> par with character encoding?
>
> On 6/20/07, Daniel Stutzbach <daniel at stutzbachenterprises.com> wrote:
> > On 6/20/07, Bill Janssen <janssen at parc.com> wrote:
>
> [For the TextIO, as opposed to the raw IO, Bill originally proposed
> dropping read(n), because character count is not well-defined.  Dan
> objected that not all text has useful line breaks.]
>
> > > ... just saying "give me N characters" isn't enough.
> > > We need to say, "N characters assuming a text
> > > encoding of M, with a normalization policy of Q,
> > > and a newline policy of R".
>
> [ Daniel points out that TextIO already handles M and R ]
>
> > I'm not sure I 100% understand what you mean by
> > "normalization policy" (Q).  Could you give an example?
>
> How many characters are there in ö?
>
> If I ask for just one character, do I get only the o, without the
> diaeresis, or do I get both (since they are linguistically one
> letter), or does it depend on how some editor happened to store it?

It should get you the next code unit as it comes out of the
incremental codec. (Did you see my semantic model I described in a
different thread?)

> Distinguishing strings based on an accident of storage would violate
> unicode standards.  (More precisely, it would be a violation of
> standards to assume that they are distinguished.)

I don't give a damn about this requirement of the Unicode standard. At
least, I don't think Python should enforce it at the level of the str
data type, and that includes str objects returned by the I/O library.

> To the extent that you are treating the data as text rather than
> binary, NFC or NFD normalization should always be appropriate.
>
> In practice, binary concerns do intrude even for text data; you may
> well want to save it back out in the original encoding, without any
> spurious changes.
>
> Proposal:
>
>     open would default to NFC.
>
>     import would open source code with NFKC.
>
>     An explict None canonicalization would allow round-trips without
> spurious binary-level changes.

Counter-proposal: normalization is provided as library functionality.
Applications are responsible for normalizing data when they need it
to be normalized and they can't be sure that it isn't already
normalized. The source parser used by import and a few other places is
an "application" in this sense and can certainly apply whatever
normalization is required. Have we agreed on the level of
normalization for source code yet? I'm pretty sure we have agreed on
*when* it happens, i.e. (logically) before the lexer starts scanning
the source code.
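
For concreteness, library functionality of that sort exists as
unicodedata.normalize():

    import unicodedata

    s = "o\u0308"                          # 'o' + COMBINING DIAERESIS
    t = unicodedata.normalize("NFC", s)    # composed form, one code point
    assert t == "\u00f6" and (len(s), len(t)) == (2, 1)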

I would not be against an additional optional layer in the I/O stack
that applies normalization. We could even have an optional parameter
to open() to push this onto the stack. But I don't think it should be
the default.

What is the status of normalization in Java? Does Java source code get
normalized before it is parsed? What if \u.... is used? Do the Java
I/O library classes normalize text?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Fri Jun 22 08:45:27 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Fri, 22 Jun 2007 08:45:27 +0200
Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base
 classes]
In-Reply-To: <ca471dc20706212254j188472dbge6ef7061f0af0dac@mail.gmail.com>
References: <fb6fbf560706210912t5e2eea2apf1b8d30559d082e7@mail.gmail.com>
	<ca471dc20706212254j188472dbge6ef7061f0af0dac@mail.gmail.com>
Message-ID: <467B7007.1040503@v.loewis.de>

> Counter-proposal: normalization is provided as library functionality.
> Applications are responsible for normalization data when they need it
> to be normalized and they can't be sure that it isn't already
> normalized. The source parser used by import and a few other places is
> an "application" in this sense and can certainly apply whatever
> normalization is required. Have we agreed on the level of
> normalization for source code yet? I'm pretty sure we have agreed on
> *when* it happens, i.e. (logically) before the lexer starts scanning
> the source code.

That isn't actually my view: I would apply normalization *only* to
identifiers, i.e. leave string literals unmodified. If people would
rather see normalization applied to the entire input, that would be
an option, of course (although perhaps more expensive to implement,
as you need to perform it on all source, even if that source turns
out to be ASCII only).

> What is the status of normalization in Java? Does Java source code get
> normalized before it is parsed? 

The JLS is silent on that issue, so I think the answer is "no".
A quick test (see attached file) shows that it doesn't: i.e.
it reports an error "cannot find symbol" even though the symbol
would be defined under NFC (or NFD).

> What if \u.... is used? 

It just gets inserted as-is.

> Do the Java I/O library classes normalize text?

The java.io.InputStreamReader doesn't, see attached code.
It appears that the Java JRE doesn't support normalization at all
until Java 6, where you can use java.text.Normalizer. Before that,
this class was sun.text.Normalizer, and (apparently)
only used for URI (normalizing to NFC), collation (performing
NFD on request), and regular expressions (likewise).

Apparently, Sun doesn't consider Unicode normalization
an issue.

Regards,
Martin


-------------- next part --------------
A non-text attachment was scrubbed...
Name: foo.java
Type: text/x-java
Size: 53 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070622/ec269a30/attachment.java 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: r.java
Type: text/x-java
Size: 479 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070622/ec269a30/attachment-0001.java 

From ncoghlan at gmail.com  Fri Jun 22 11:11:14 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 22 Jun 2007 19:11:14 +1000
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com>
References: <77903.26658.qm@web33507.mail.mud.yahoo.com>	<07Jun21.091906pdt."57996"@synergy1.parc.xerox.com>	<467AB63A.7050505@cornell.edu>
	<07Jun21.132117pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <467B9232.2090002@gmail.com>

Bill Janssen wrote:
>>> It should amount to "map(+, operands)".
>> Or, to be pedantic, this:
>>
>>      reduce(lambda x, y: x.__add__(y), operands)
> 
> Don't you mean:
> 
>    reduce(lambda x, y:  x.__add__(y), operands[1:], operands[0])

This is a nice illustration of a fairly significant issue with the 
usability of reduce: two attempts to rewrite sum() using reduce(), and 
both of them are buggy. Neither of the solutions above can correctly 
handle an empty sequence:

>>> operands = []
>>> reduce(lambda x, y: x.__add__(y), operands)
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: reduce() of empty sequence with no initial value
>>> reduce(lambda x, y: x.__add__(y), operands[1:], operands[0])
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
IndexError: list index out of range
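
Supplying reduce()'s optional initial value sidesteps the empty-sequence
problem; that is exactly what sum()'s start argument does. A sketch:

    from functools import reduce    # a builtin in the Python 2 of this thread

    def safe_sum(seq, start=0):
        return reduce(lambda x, y: x + y, seq, start)

    assert safe_sum([]) == 0
    assert safe_sum([1, 2, 3]) == 6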

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From aurelien.campeas at logilab.fr  Fri Jun 22 11:46:20 2007
From: aurelien.campeas at logilab.fr (=?iso-8859-1?Q?Aur=E9lien_Camp=E9as?=)
Date: Fri, 22 Jun 2007 11:46:20 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <467B9232.2090002@gmail.com>
References: <77903.26658.qm@web33507.mail.mud.yahoo.com>
	<07Jun21.091906pdt."57996"@synergy1.parc.xerox.com>
	<467AB63A.7050505@cornell.edu>
	<07Jun21.132117pdt."57996"@synergy1.parc.xerox.com>
	<467B9232.2090002@gmail.com>
Message-ID: <20070622094620.GA25641@crater.logilab.fr>

On Fri, Jun 22, 2007 at 07:11:14PM +1000, Nick Coghlan wrote:
> Bill Janssen wrote:
> >>> It should amount to "map(+, operands)".
> >> Or, to be pedantic, this:
> >>
> >>      reduce(lambda x, y: x.__add__(y), operands)
> > 
> > Don't you mean:
> > 
> >    reduce(lambda x, y:  x.__add__(y), operands[1:], operands[0])
> 
> This is a nice illustration of a fairly significant issue with the 
> usability of reduce: two attempts to rewrite sum() using reduce(), and 
> both of them are buggy. Neither of the solutions above can correctly 

Maybe the specification/documentation is missing some phrasing like this:

"The function must also be able to accept no arguments." (taken from
another language spec.) ?

Better to fix the documentation than to blame reduce. Of course, reduce was
taken from Lisp, where lambda is not castrated and thus allows one to
write the no-argument case with more ease. Castrated lambdas limit the
usefulness of reduce *in Python*, not in general.

Regards,
Aurélien.

> handle an empty sequence:
> 
> >>> operands = []
> >>> reduce(lambda x, y: x.__add__(y), operands)
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> TypeError: reduce() of empty sequence with no initial value
> >>> reduce(lambda x, y: x.__add__(y), operands[1:], operands[0])
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> IndexError: list index out of range
> 
> Cheers,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
> ---------------------------------------------------------------
>              http://www.boredomandlaziness.org
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/aurelien.campeas%40logilab.fr

From showell30 at yahoo.com  Fri Jun 22 12:24:08 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Fri, 22 Jun 2007 03:24:08 -0700 (PDT)
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <467B2586.4050901@canterbury.ac.nz>
Message-ID: <525768.82779.qm@web33502.mail.mud.yahoo.com>


--- Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> The word "add" has a wider connotation in English
> than
> "sum". [...]

Just to elaborate on the point...

And, likewise, symbolic operators have a wider
connotation in programming languages than do keywords.
 Keywords can, and should, be more specifically
spelled for a task than punctuation characters.


From ncoghlan at gmail.com  Fri Jun 22 14:12:08 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 22 Jun 2007 22:12:08 +1000
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <07Jun20.190105pdt."57996"@synergy1.parc.xerox.com>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>	<6002181751375776921@unknownmsgid>	<eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com>	<-665892861201335771@unknownmsgid>	<eae285400706201754s7a7d7cbfidb9feddd8ad530f9@mail.gmail.com>
	<07Jun20.190105pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <467BBC98.1070201@gmail.com>

Bill Janssen wrote:
>> I'm not sure I 100% understand what you mean by "normalization policy"
>> (Q).  Could you give an example?
> 
> I was speaking of the 4 different normalization forms for Unicode,
> which can produce different code-point sequences.  Since "strings" in
> Python-3000 aren't really strings, but instead are immutable
> code-point sequences, this means that any byte-to-string
> transformation which doesn't specify this can produce different
> strings from the same bytes without violating its constraints.

A given codec won't randomly decide to change its normalisation policy, 
though - so when you pick the codec, you're picking the normalisation as 
well.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Fri Jun 22 14:18:10 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 22 Jun 2007 22:18:10 +1000
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <eae285400706212040m220728a8o957ed880f3bc8adb@mail.gmail.com>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>	<2614969285506109322@unknownmsgid>	<ca471dc20706211739r5e1a747l5c30ff60ffa0ac09@mail.gmail.com>
	<eae285400706212040m220728a8o957ed880f3bc8adb@mail.gmail.com>
Message-ID: <467BBE02.9030103@gmail.com>

Daniel Stutzbach wrote:

> If the cookie is meant to be opaque to the caller, is there a reason 
> that the cookie must be an integer?
> 
> Specifying the return type as opaque might also reduce the temptation
>  to perform arithmetic on them, which will work for some codecs
> (ASCII), but break later in odd ways for others.

seek() & tell() are already documented as using opaque cookies for text
files (quote is from the documentation of file.seek()):

	If the file is opened in text mode (without 'b'), only offsets
	returned by tell() are legal. Use of other offsets causes
	undefined behavior.

(Seeking to an arbitrary byte index on a file with DOS line endings may 
put you in the middle of a \r\n sequence, which may cause weirdness)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From stephen at xemacs.org  Fri Jun 22 16:15:14 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 22 Jun 2007 23:15:14 +0900
Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O
	base	classes]
In-Reply-To: <ca471dc20706212254j188472dbge6ef7061f0af0dac@mail.gmail.com>
References: <fb6fbf560706210912t5e2eea2apf1b8d30559d082e7@mail.gmail.com>
	<ca471dc20706212254j188472dbge6ef7061f0af0dac@mail.gmail.com>
Message-ID: <87645g5b4d.fsf@uwakimon.sk.tsukuba.ac.jp>

Guido van Rossum writes:

 > > If I ask for just one character, do I get only the o, without the
 > > diaeresis, or do I get both (since they are linguistically one
 > > letter), or does it depend on how some editor happened to store it?
 > 
 > It should get you the next code unit as it comes out of the
 > incremental codec. (Did you see my semantic model I described in a
 > different thread?)

I don't like this<wink>, but since that's the way it's gonna be ...

 > > Distinguishing strings based on an accident of storage would violate
 > > unicode standards.  (More precisely, it would be a violation of
 > > standards to assume that they are distinguished.)
 > 
 > I don't give a damn about this requirement of the Unicode standard.

... this requirement does not apply to the Python str type as you have
described it.

I think at this stage we're asking for trouble to have any
normalization by default, even in the TextIO module.  str is not text,
it's an array of code units.  str is going to be used to implement
codecs, I/O buffers, all kinds of things that don't necessarily have
Unicode text semantics.  Unless the Python language itself defines the
semantics of the array of code units, EIBTI.  This accords with
Martin's statement about identifiers being the only thing he proposed
normalizing.

Even if we know a user wants text, I don't see any state of the art
that allows us to guess which normalization will be most useful to
him.  I think for identifiers, NFKC is almost a no-brainer.  But for
strings it is not at all obvious.  NFC violates useful string
invariants such as len(a) + len(b) == len(a+b).  AFAICS, NFD does
not.  OTOH, if you don't need strings to obey array invariants, NFC is
much more friendly to "dumb" UIs that just display the characters as
they get them, without trying to find an equivalent that is in the
font for missing characters.
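
A concrete instance of that invariant breaking, using unicodedata for
the NFC step:

    import unicodedata

    nfc = lambda s: unicodedata.normalize("NFC", s)
    a, b = "o", "\u0308"    # 'o', then a lone combining diaeresis
    # Normalizing each piece leaves two code points; normalizing the
    # concatenation composes them into one, so the lengths don't add up.
    assert len(nfc(a)) + len(nfc(b)) == 2
    assert len(nfc(a + b)) == 1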

And it seems plausible that some applications will mix normalizations
inside of the Python instance.  The app must handle this; Python
can't.  Even if you carry normalization information around with your
str object, what normalization is Python supposed to apply to nfd_str
+ nfc_str?  But surely that operation is permissible!

 > > In practice, binary concerns do intrude even for text data; you may
 > > well want to save it back out in the original encoding, without any
 > > spurious changes.

Then for the purposes of this discussion, it's not text, it's binary.
In many cases it will need to be read as bytes and stored that way
until written back out.

Ie, many legacy encodings do not support roundtrips, such as those
that use ISO 2022 extension techniques: there's no rule against having
a mode-changing sequence and its inverse in succession, and it's
occasionally seen in the wild.  Even UTF-8 has unnormalized
representations for many characters, and it was only recently that
Unicode came to require that they be treated as errors, and not
interpreted (producing them has always been forbidden).

From guido at python.org  Fri Jun 22 17:58:04 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 22 Jun 2007 08:58:04 -0700
Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base
	classes]
In-Reply-To: <467B7007.1040503@v.loewis.de>
References: <fb6fbf560706210912t5e2eea2apf1b8d30559d082e7@mail.gmail.com>
	<ca471dc20706212254j188472dbge6ef7061f0af0dac@mail.gmail.com>
	<467B7007.1040503@v.loewis.de>
Message-ID: <ca471dc20706220858l5bc3597l6c85e2d0d306803d@mail.gmail.com>

On 6/21/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
[Guido]
> > Have we agreed on the level of
> > normalization for source code yet? I'm pretty sure we have agreed on
> > *when* it happens, i.e. (logically) before the lexer starts scanning
> > the source code.
>
> That isn't actually my view: I would apply normalization *only* to
> identifiers, i.e. leave string literals unmodified. If people would
> rather see normalization applied to the entire input, that would be
> an option, of course (although perhaps more expensive to implement,
> as you need to perform it on all source, even if that source turns
> out to be ASCII only).

OK, sorry, I must've stopped reading that thread at the wrong moment.
No need to change it on my behalf.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com  Fri Jun 22 18:37:43 2007
From: janssen at parc.com (Bill Janssen)
Date: Fri, 22 Jun 2007 09:37:43 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <467BBC98.1070201@gmail.com> 
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<6002181751375776921@unknownmsgid>
	<eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com>
	<-665892861201335771@unknownmsgid>
	<eae285400706201754s7a7d7cbfidb9feddd8ad530f9@mail.gmail.com>
	<07Jun20.190105pdt."57996"@synergy1.parc.xerox.com>
	<467BBC98.1070201@gmail.com>
Message-ID: <07Jun22.093746pdt."57996"@synergy1.parc.xerox.com>

> A given codec won't randomly decide to change its normalisation policy, 
> though - so when you pick the codec, you're picking the normalisation as 
> well.

You're sure?  Between CPython and Jython and IronPython and
JavascriptPython and ...?  Might as well specify it up front.

Bill

From janssen at parc.com  Fri Jun 22 18:41:19 2007
From: janssen at parc.com (Bill Janssen)
Date: Fri, 22 Jun 2007 09:41:19 PDT
Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base
	classes]
In-Reply-To: <87645g5b4d.fsf@uwakimon.sk.tsukuba.ac.jp> 
References: <fb6fbf560706210912t5e2eea2apf1b8d30559d082e7@mail.gmail.com>
	<ca471dc20706212254j188472dbge6ef7061f0af0dac@mail.gmail.com>
	<87645g5b4d.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <07Jun22.094122pdt."57996"@synergy1.parc.xerox.com>

>  > > In practice, binary concerns do intrude even for text data; you may
>  > > well want to save it back out in the original encoding, without any
>  > > spurious changes.
> 
> Then for the purposes of this discussion, it's not text, it's binary.
> In many cases it will need to be read as bytes and stored that way
> until written back out.

That was more or less my original point; the string situation has
gotten complicated enough that I believe any careful coder will do any
transformations in application code, rather than relying on (and
trying to understand) the particular machinations of some text wrapper
in the I/O library.

Bill

From guido at python.org  Fri Jun 22 19:21:18 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 22 Jun 2007 10:21:18 -0700
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <-3030247401668859168@unknownmsgid>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<6002181751375776921@unknownmsgid>
	<eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com>
	<-665892861201335771@unknownmsgid>
	<eae285400706201754s7a7d7cbfidb9feddd8ad530f9@mail.gmail.com>
	<467BBC98.1070201@gmail.com> <-3030247401668859168@unknownmsgid>
Message-ID: <ca471dc20706221021l69ddc647o216f27a9f46f72c8@mail.gmail.com>

On 6/22/07, Bill Janssen <janssen at parc.com> wrote:
> > A given codec won't randomly decide to change its normalisation policy,
> > though - so when you pick the codec, you're picking the normalisation as
> > well.
>
> You're sure?  Between CPython and Jython and IronPython and
> JavascriptPython and ...?  Might as well specify it up front.

I'm not sure I see the use case.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com  Fri Jun 22 20:40:55 2007
From: janssen at parc.com (Bill Janssen)
Date: Fri, 22 Jun 2007 11:40:55 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <ca471dc20706221021l69ddc647o216f27a9f46f72c8@mail.gmail.com> 
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
	<6002181751375776921@unknownmsgid>
	<eae285400706201117t36873c7egfb8ba2c356511883@mail.gmail.com>
	<-665892861201335771@unknownmsgid>
	<eae285400706201754s7a7d7cbfidb9feddd8ad530f9@mail.gmail.com>
	<467BBC98.1070201@gmail.com> <-3030247401668859168@unknownmsgid>
	<ca471dc20706221021l69ddc647o216f27a9f46f72c8@mail.gmail.com>
Message-ID: <07Jun22.114102pdt."57996"@synergy1.parc.xerox.com>

Guido writes:
> On 6/22/07, Bill Janssen <janssen at parc.com> wrote:
> > > A given codec won't randomly decide to change its normalisation policy,
> > > though - so when you pick the codec, you're picking the normalisation as
> > > well.
> >
> > You're sure?  Between CPython and Jython and IronPython and
> > JavascriptPython and ...?  Might as well specify it up front.
> 
> I'm not sure I see the use case.

Portable Python code that reads and writes "text" files the same way
in any implementation of Python.

Bill

From ntoronto at cs.byu.edu  Fri Jun 22 21:32:42 2007
From: ntoronto at cs.byu.edu (Neil Toronto)
Date: Fri, 22 Jun 2007 13:32:42 -0600
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>
	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	<f59a3o$rnn$1@sea.gmane.org>
	<4679020A.8020609@gmail.com>	<f5bi01$skk$1@sea.gmane.org>
	<e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>
Message-ID: <467C23DA.2040507@cs.byu.edu>

Alex Martelli wrote:
> $ python -mtimeit -s'import itertools as it' -s'L=range(-7,17)' 'for x
> in it.imap(abs,L): pass'
> 100000 loops, best of 3: 3 usec per loop
> $ python -mtimeit -s'import itertools as it' -s'L=range(-7,17)' 'for x
> in (abs(y) for y in L): pass'
> 100000 loops, best of 3: 4.47 usec per loop
>
> (imap is faster in this case because the built-in name 'abs' is looked
> up only once -- in the genexp, it's looked up each time, sigh --
> possibly the biggest "we should REALLY tweak the language to let this
> be optimized sensibly" gotcha in Python, IMHO).
>   

What is it about the language as it stands that requires abs() to be 
looked up each iteration?

Neil


From amcnabb at mcnabbs.org  Fri Jun 22 21:42:19 2007
From: amcnabb at mcnabbs.org (Andrew McNabb)
Date: Fri, 22 Jun 2007 13:42:19 -0600
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <467C23DA.2040507@cs.byu.edu>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>
	<4677C3F8.3050305@nekomancer.net> <f58kb7$6vp$1@sea.gmane.org>
	<7301715244131583311@unknownmsgid>
	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>
	<f59a3o$rnn$1@sea.gmane.org> <4679020A.8020609@gmail.com>
	<f5bi01$skk$1@sea.gmane.org>
	<e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>
	<467C23DA.2040507@cs.byu.edu>
Message-ID: <20070622194219.GB26333@mcnabbs.org>

On Fri, Jun 22, 2007 at 01:32:42PM -0600, Neil Toronto wrote:
> > (imap is faster in this case because the built-in name 'abs' is looked
> > up only once -- in the genexp, it's looked up each time, sigh --
> > possibly the biggest "we should REALLY tweak the language to let this
> > be optimized sensibly" gotcha in Python, IMHO).
> 
> What is it about the language as it stands that requires abs() to be 
> looked up each iteration?

Calling abs() could change locals()['abs'], in which case a different
function would be called the next time through.  You look up 'abs' each
time just in case it's changed.
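
A small demonstration, written for today's Python 3 (where the module is
named builtins rather than __builtin__):

    import builtins

    def demo():
        real_abs = builtins.abs
        results = []
        for x in (-5, -6):
            results.append(abs(x))        # 'abs' is resolved afresh here
            builtins.abs = lambda y: 0    # a rebinding is seen next pass
        builtins.abs = real_abs           # restore the real builtin
        return results

    assert demo() == [5, 0]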

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://mail.python.org/pipermail/python-3000/attachments/20070622/4bc19635/attachment.pgp 

From ntoronto at cs.byu.edu  Fri Jun 22 22:13:39 2007
From: ntoronto at cs.byu.edu (Neil Toronto)
Date: Fri, 22 Jun 2007 14:13:39 -0600
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070622194219.GB26333@mcnabbs.org>
References: <ca471dc20706182332q18df52eaw77c3e544a65aa196@mail.gmail.com>	<4677C3F8.3050305@nekomancer.net>
	<f58kb7$6vp$1@sea.gmane.org>	<7301715244131583311@unknownmsgid>	<43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com>	<f59a3o$rnn$1@sea.gmane.org>
	<4679020A.8020609@gmail.com>	<f5bi01$skk$1@sea.gmane.org>	<e8a0972d0706201050w7174e730g33bc5acf54a0afcb@mail.gmail.com>	<467C23DA.2040507@cs.byu.edu>
	<20070622194219.GB26333@mcnabbs.org>
Message-ID: <467C2D73.3020600@cs.byu.edu>

Andrew McNabb wrote:
> On Fri, Jun 22, 2007 at 01:32:42PM -0600, Neil Toronto wrote:
>   
>>> (imap is faster in this case because the built-in name 'abs' is looked
>>> up only once -- in the genexp, it's looked up each time, sigh --
>>> possibly the biggest "we should REALLY tweak the language to let this
>>> be optimized sensibly" gotcha in Python, IMHO).
>>>       
>> What is it about the language as it stands that requires abs() to be 
>> looked up each iteration?
>>     
>
> Calling abs() could change locals()['abs'], in which case a different
> function would be called the next time through.  You look up 'abs' each
> time just in case it's changed.
>   

I can't think of a reason to allow that outside of something like an 
obfuscated Python code contest. I'm sure there exists someone who thinks 
differently...

Neil


From exarkun at divmod.com  Fri Jun 22 22:50:01 2007
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 22 Jun 2007 16:50:01 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <467C2D73.3020600@cs.byu.edu>
Message-ID: <20070622205001.4947.1910896405.divmod.quotient.3570@ohm>

On Fri, 22 Jun 2007 14:13:39 -0600, Neil Toronto <ntoronto at cs.byu.edu> wrote:
>Andrew McNabb wrote:
>> On Fri, Jun 22, 2007 at 01:32:42PM -0600, Neil Toronto wrote:
>>
>>>> (imap is faster in this case because the built-in name 'abs' is looked
>>>> up only once -- in the genexp, it's looked up each time, sigh --
>>>> possibly the biggest "we should REALLY tweak the language to let this
>>>> be optimized sensibly" gotcha in Python, IMHO).
>>>>
>>> What is it about the language as it stands that requires abs() to be
>>> looked up each iteration?
>>>
>>
>> Calling abs() could change locals()['abs'], in which case a different
>> function would be called the next time through.  You look up 'abs' each
>> time just in case it's changed.
>>
>
>I can't think of a reason to allow that outside of something like an
>obfuscated Python code contest. I'm sure there exists someone who thinks
>differently...

The perfectly good reason to allow it is that it is a completely
predictable, unsurprising consequence of how the Python language
is defined.

Making a special case for the way names are looked up in a genexp
means making it harder to learn Python and to understand programs
written in Python.

Keeping this simple isn't about letting people obfuscate code,
it's about making it _easy_ for people to understand Python
programs.

If the goal is to make it easier to write obscure code, _that_
would be a valid motivation for changing the lookup rules here.
Preventing people from writing obfuscated programs is _not_.

Jean-Paul

From aleaxit at gmail.com  Fri Jun 22 23:06:07 2007
From: aleaxit at gmail.com (Alex Martelli)
Date: Fri, 22 Jun 2007 14:06:07 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070622205001.4947.1910896405.divmod.quotient.3570@ohm>
References: <467C2D73.3020600@cs.byu.edu>
	<20070622205001.4947.1910896405.divmod.quotient.3570@ohm>
Message-ID: <e8a0972d0706221406q624c3b2je119cca0e155f273@mail.gmail.com>

On 6/22/07, Jean-Paul Calderone <exarkun at divmod.com> wrote:
   ...
> >> Calling abs() could change locals()['abs'], in which case a different
> >> function would be called the next time through.  You look up 'abs' each
> >> time just in case it's changed.
> >>
> >
> >I can't think of a reason to allow that outside of something like an
> >obfuscated Python code contest. I'm sure there exists someone who thinks
> >differently...
>
> The perfectly good reason to allow it is that it is a completely
> predictable, unsurprising consequence of how the Python language
> is defined.
>
> Making a special case for the way names are looked up in a genexp
> means making it harder to learn Python and to understand programs
> written in Python.

Absolutely: it should NOT be about special-casing genexps.  Rather, it
would be some new rule such as:
"""
If a built-in name that is used within the body of a function F is
rebound or unbound (in the builtins' module or in F's own module),
after 'def F' executes and builds a function object F', and before any
call to F' has finished executing, the resulting effect is undefined.
"""
This gives a future Python compiler a fighting chance to optimize
builtins' access and use -- quite independently from special cases such
as genexps.  (Limiting the optimization to functions is, I believe,
quite fine, because similar limitations apply to optimization of
local-variable access; IOW, people who care about the speed of some
piece of code had better make that code part of some function body,
already:-).


Alex

From exarkun at divmod.com  Fri Jun 22 23:13:26 2007
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 22 Jun 2007 17:13:26 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <e8a0972d0706221406q624c3b2je119cca0e155f273@mail.gmail.com>
Message-ID: <20070622211326.4947.551402273.divmod.quotient.3575@ohm>

On Fri, 22 Jun 2007 14:06:07 -0700, Alex Martelli <aleaxit at gmail.com> wrote:
>On 6/22/07, Jean-Paul Calderone <exarkun at divmod.com> wrote:
>   ...
>> >> Calling abs() could change locals()['abs'], in which case a different
>> >> function would be called the next time through.  You look up 'abs' each
>> >> time just in case it's changed.
>> >>
>> >
>> >I can't think of a reason to allow that outside of something like an
>> >obfuscated Python code contest. I'm sure there exists someone who thinks
>> >differently...
>>
>>The perfectly good reason to allow it is that it is a completely
>>predictable, unsurprising consequence of how the Python language
>>is defined.
>>
>>Making a special case for the way names are looked up in a genexp
>>means making it harder to learn Python and to understand programs
>>written in Python.
>
>Absolutely: it should NOT be about special-casing genexps.  Rather, it
>would be some new rule such as:
>"""
>If a built-in name that is used within the body of a function F is
>rebound or unbound (in the builtins' module or in F's own module),
>after 'def F' executes and builds a function object F', and before any
>call to F' has finished executing, the resulting effect is undefined.
>"""
>This gives a future Python compiler a fighting chance to optimize
>builtins' access and use -- quite independently from special cases such
>as genexps.  (Limiting the optimization to functions is, I believe,
>quite fine, because similar limitations apply to optimization of
>local-variable access; IOW, people who care about the speed of some
>piece of code had better make that code part of some function body,
>already:-).
>

This is more reasonable, but it's still a new rule (and I personally
find rules which include undefined behavior to be distasteful -- but
your suggestion could be modified so that the name change is never
respected to achieve roughly the same consequence).  And it's not even
a rule imposed for a good reason (good reasons are reasons of semantic
simplicity, consistency, etc), it's just imposed to make it easier to
optimize the runtime.  If the common case is to read a name repeatedly
and not care about writes to the name, then leave the language alone and
just optimize reading of names.  For example, have the runtime set up
observers for the names used in a function and require any write to a
name to notify those observers.  Now lookups are fast, the semantics
are unchanged, and there are no new rules.
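
A toy sketch of the observer idea (real module and builtin dicts can't
simply be swapped for a subclass like this, so it is purely illustrative):

    class ObservableDict(dict):
        """Toy namespace that notifies observers when a name is rebound."""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._observers = {}                # name -> [callback, ...]

        def watch(self, name, callback):
            self._observers.setdefault(name, []).append(callback)

        def __setitem__(self, name, value):
            super().__setitem__(name, value)
            for cb in self._observers.get(name, ()):
                cb(name, value)                 # e.g. invalidate a cached lookup

    ns = ObservableDict(abs=abs)
    ns.watch("abs", lambda name, value: print(name, "was rebound"))
    ns["abs"] = lambda x: 0                     # prints: abs was rebound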

No, I'm not volunteering to implement this, but if someone else is
interested in spending time speeding up CPython, then this is worth
trying first (and it is worth trying to think of other ideas that
don't complicate the language).

Jean-Paul

From aleaxit at gmail.com  Fri Jun 22 23:28:12 2007
From: aleaxit at gmail.com (Alex Martelli)
Date: Fri, 22 Jun 2007 14:28:12 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070622211326.4947.551402273.divmod.quotient.3575@ohm>
References: <e8a0972d0706221406q624c3b2je119cca0e155f273@mail.gmail.com>
	<20070622211326.4947.551402273.divmod.quotient.3575@ohm>
Message-ID: <e8a0972d0706221428j51c58370y5b8d5a5e6039eec9@mail.gmail.com>

On 6/22/07, Jean-Paul Calderone <exarkun at divmod.com> wrote:
   ...
> This is more reasonable, but it's still a new rule (and I personally
> find rules which include undefined behavior to be distasteful -- but
> your suggestion could be modified so that the name change is never
> respected to achieve roughly the same consequence).  And it's not even

That would put a potentially heavy burden on Python compilers that may
not be interested in producing speedy code but in compiling faster.
Specifically asserting that some _weird_ behavior is undefined in
order to allow the writing of compilers without excessive burden is
quite sensible to me, in general.  For example, what happens to
'import foo' statements if some foo.py appears, disappears, and/or
changes somewhere on sys.path during the program run IS "de facto"
undefined (for filesystems with sufficiently flaky behavior, such as
remote ones:-) -- I'd like that to be stated outright in the docs, to
allow a sensible and compliant import system to perform some caching
(e.g. ensuring os.listdir is called no more than once per directory in
sys.path) without lingering feelings of guilt or trickiness.

> a rule imposed for a good reason (good reasons are reasons of semantic
> simplicity, consistency, etc), it's just imposed to make it easier to
> optimize the runtime.  If the common case is to read a name repeatedly
> and not care about writes to the name, then leave the language alone and
> just optimize reading of names.  For example, have the runtime set up
> observers for the names used in a function and require any write to a
> name to notify those observers.  Now lookups are fast, the semantics
> are unchanged, and there are no new rules.

However, this would not afford the same level of optimization (e.g.
special opcodes for very lightweight builtins such as len), and if it
involved making all dicts richer to support 'observers on key rebinds',
it might possibly slow dicts by enough to more than counteract the
benefits (of course it might be possible to get away with replacing
builtin and modules' dicts with instances of an "observabledict"
subclass -- possibly worthwhile, but a HUGE workload to undertake in
order to let some weirdo reassign 'len' in builtins at random times).
Practicality beats purity.


Alex

From exarkun at divmod.com  Fri Jun 22 23:44:11 2007
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 22 Jun 2007 17:44:11 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <e8a0972d0706221428j51c58370y5b8d5a5e6039eec9@mail.gmail.com>
Message-ID: <20070622214411.4947.1977577066.divmod.quotient.3580@ohm>

On Fri, 22 Jun 2007 14:28:12 -0700, Alex Martelli <aleaxit at gmail.com> wrote:
>On 6/22/07, Jean-Paul Calderone <exarkun at divmod.com> wrote:
>   ...
>>This is more reasonable, but it's still a new rule (and I personally
>>find rules which include undefined behavior to be distasteful -- but
>>your suggestion could be modified so that the name change is never
>>respected to achieve roughly the same consequence).  And it's not even
>
>That would put a potentially heavy burden on Python compilers that may
>not be interested in producing speedy code but in compiling faster.
>Specifically asserting that some _weird_ behavior is undefined in
>order to allow the writing of compilers without excessive burden is
>quite sensible to me, in general.  For example, what happens to
>'import foo' statements if some foo.py appears, disappears, and/or
>changes somewhere on sys.path during the program run IS "de facto"
>undefined (for filesystems with sufficiently flaky behavior, such as
>remote ones:-) -- I'd like that to be stated outright in the docs, to
>allow a sensible and compliant import system to perform some caching
>(e.g. ensuring os.listdir is called no more than once per directory in
>sys.path) without lingering feelings of guilt or trickiness.

Could be.  I don't find many of my programs to be bottlenecked on
compilation time or import time, so these optimizations look like
pure lose to me.

>>a rule imposed for a good reason (good reasons are reasons of semantic
>>simplicity, consistency, etc), it's just imposed to make it easier to
>>optimize the runtime.  If the common case is to read a name repeatedly
>>and not care about writes to the name, then leave the language alone and
>>just optimize reading of names.  For example, have the runtime set up
>>observers for the names used in a function and require any write to a
>>name to notify those observers.  Now lookups are fast, the semantics
>>are unchanged, and there are no new rules.
>
>However, this would not afford the same level of optimization (e.g.
>special opcodes for very lightweight builtins such as len), and if it
>involved making all dicts richer to support 'observers on key rebinds',
>it might possibly slow dicts by enough to more than counteract the
>benefits (of course it might be possible to get away with replacing
>builtin and modules' dicts with instances of an "observabledict"
>subclass -- possibly worthwhile, but a HUGE workload to undertake in
>order to let some weirdo reassign 'len' in builtins at random times).
>Practicality beats purity.
>

I also don't find much of my code bottlenecked on local name lookup.
Function call and attribute lookup overhead is a much bigger killer,
but I can still write apps where the Python VM isn't the bottleneck
without really trying, and when something is slow, giving attention
to the minuscule fraction of my overall codebase which causes the
problem is itself not much of a problem.

Is it neat when CPython gets faster overall?  Sure.  Is it worth
complications to the language for what is ultimately a tiny
speedup?  Not on my balance sheet.

Jean-Paul

From mike.klaas at gmail.com  Sat Jun 23 00:26:37 2007
From: mike.klaas at gmail.com (Mike Klaas)
Date: Fri, 22 Jun 2007 15:26:37 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070622214411.4947.1977577066.divmod.quotient.3580@ohm>
References: <20070622214411.4947.1977577066.divmod.quotient.3580@ohm>
Message-ID: <D78EDBC8-05A7-4359-971C-8CD5375F4F60@gmail.com>

On 22-Jun-07, at 2:44 PM, Jean-Paul Calderone wrote:

> On Fri, 22 Jun 2007 14:28:12 -0700, Alex Martelli  
> <aleaxit at gmail.com> wrote:
>
> Could be.  I don't find many of my programs to be bottlenecked on
> compilation time or import time, so these optimizations look like
> pure lose to me.

Nor do mine, though this complaint is common (Python startup time in
general, esp. for short scripts).

>> However, this would not afford the same level of optimization (e.g.
>> special opcodes for very lightweight builtins such as len), and if it
>> involved making all dicts richer to support 'observers on key
>> rebinds', it
>> might possibly slow dicts by enough to more than counteract the
>> benefits (of course it might be possible to get away with replacing
>> builtin and modules' dicts with instances of an "observabledict"
>> subclass -- possibly worthwhile, but a HUGE workload to undertake in
>> order to let some weirdo reassign 'len' in builtins at random times).
>> Practicality beats purity.
>>
>
> I also don't find much of my code bottlenecked on local name lookup.
> Function call and attribute lookup overhead is a much bigger killer,
> but I can still write apps where the Python VM isn't the bottleneck
> without really trying, and when something is slow, giving attention
> to the miniscule fraction of my overall codebase which causes the
> problem is itself not much of a problem.
>
> Is it neat when CPython gets faster overall?  Sure.  Is it worth
> complications to the language for what is ultimately a tiny
> speedup?  Not on my balance sheet.

I agree that making CPython .5% faster is not compelling, but there  
is value in knowing that certain patterns of code are optimized in  
certain ways, so that less mental effort and tweaking is necessary in  
those bottleneck functions.  Further, it allows the code to remain  
clearer and truer to the original intent (rebinding globals to locals  
is _ugly_).
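
For reference, one common form of the idiom in question binds the
builtin into a local via a default argument:

    def total(values, _abs=abs):
        # _abs is a local, so the loop body does a fast local lookup
        # instead of re-resolving the global/builtin 'abs' each pass
        t = 0
        for v in values:
            t += _abs(v)
        return t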

It is like constant folding: I don't expect that it produces much by  
way of general CPython speedup, but it allows clearer code to be  
written without micro-worries about micro-optimization.

s = 0
for x in xrange(10):
    s += 10*1024*1024  # add ten MB

I _like_ being able to write that, knowing that my preferred way of  
writing the code is not costing me anything.  It would, of course, be  
even better if the whole loop disappeared :)

-Mike

From nicko at nicko.org  Sat Jun 23 10:12:30 2007
From: nicko at nicko.org (Nicko van Someren)
Date: Sat, 23 Jun 2007 09:12:30 +0100
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com>
References: <77903.26658.qm@web33507.mail.mud.yahoo.com>
	<07Jun21.091906pdt."57996"@synergy1.parc.xerox.com>
	<467AB63A.7050505@cornell.edu>
	<07Jun21.132117pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <189666FC-E9D9-4A84-9934-DDA0B51BF958@nicko.org>

On 21 Jun 2007, at 21:21, Bill Janssen wrote:

>>> It should amount to "map(+, operands)".
>>
>> Or, to be pedantic, this:
>>
>>      reduce(lambda x, y: x.__add__(y), operands)
>
> Don't you mean:
>
>    reduce(lambda x, y:  x.__add__(y), operands[1:], operands[0])

In the absence of a "start" value reduce "does the right thing", so  
you don't need to do that.  My original post was asking for sum to  
behave as Joel wrote.  At the moment sum is more like:
	def sum(operands, start=0):
		return reduce(lambda x,y: x+y, operands, start)
Since the start value defaults to 0, if you don't specify a start  
value and your items can't be added to zero you run into a problem.   
I was proposing something that behaved more like:
	def sum(operands, start=None):
		if start is None:
			operands, start = operands[1:], operands[0]
		return reduce(lambda x,y: x+y, operands, start)

The best argument against this so far, however, is the one from Gareth  
about what type is returned if no start value is given and the list  
is also empty.  Unless one is happy with the idea that sum([]) ==  
None, I concede that the current behaviour is probably the best  
compromise.  That said, I still think that the special-case rejection  
of strings is ugly!

	Cheers,
		Nicko



From alexandre at peadrop.com  Sat Jun 23 17:53:35 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Sat, 23 Jun 2007 11:53:35 -0400
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek properly
Message-ID: <acd65fa20706230853w32f8895g91b7715c456900b7@mail.gmail.com>

Hello,

I think I found a bug in the implementation of StringIO/BytesIO in the
new io module.  I would like to fix it, but I am not sure what should
be the correct behavior. Any hint on this?

And one more thing, the close method on StringIO/BytesIO objects
doesn't work.  I will try to fix that too.

Thanks,
-- Alexandre

Python 3.0x (py3k-struni:56080M, Jun 22 2007, 17:18:04)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> s1 = io.StringIO()
>>> s1.seek(10)
10
>>> s1.write('hello')
5
>>> s1.getvalue()
'hello'
>>> s1.seek(0)
0
>>> s1.write('abc')
3
>>> s1.getvalue()
'abclo'
>>> import StringIO
>>> s2 = StringIO.StringIO()
>>> s2.seek(10)
>>> s2.write('hello')
>>> s2.getvalue()
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00hello'
>>> s2.seek(0)
>>> s2.write('abc')
>>> s2.getvalue()
'abc\x00\x00\x00\x00\x00\x00\x00hello'

From guido at python.org  Sat Jun 23 19:52:19 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 23 Jun 2007 10:52:19 -0700
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek
	properly
In-Reply-To: <acd65fa20706230853w32f8895g91b7715c456900b7@mail.gmail.com>
References: <acd65fa20706230853w32f8895g91b7715c456900b7@mail.gmail.com>
Message-ID: <ca471dc20706231052x561e7acfpf84373ea670c2974@mail.gmail.com>

On 6/23/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> Hello,
>
> I think found a bug in the implementation of StringIO/BytesIO in the
> new io module.  I would like to fix it, but I am not sure what should
> be the correct behavior. Any hint on this?

BytesIO should behave the way Unix files work: seeking by itself only
sets the read/write position, but writing inserts null bytes between the
existing end of the file and the new write position. (Writing zero
bytes doesn't count; I've just experimentally verified this.)
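
A quick illustration of those semantics, using io.BytesIO as it exists
in today's Python:

    import io

    b = io.BytesIO()
    b.seek(10)             # over-seeking only moves the position...
    b.write(b"hello")      # ...and writing fills the gap with null bytes
    assert b.getvalue() == b"\x00" * 10 + b"hello"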

I think however that for StringIO this should not be allowed -- seek()
on StringIO is only allowed to accept cookies returned by tell() on
the same file object.

> And one more thing, the close method on StringIO/BytesIO objects
> doesn't work.  I will try to fix that too.

What do you want it to do? I'm thinking perhaps it doesn't need to do anything.

--Guido

> Thanks,
> -- Alexandre
>
> Python 3.0x (py3k-struni:56080M, Jun 22 2007, 17:18:04)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import io
> >>> s1 = io.StringIO()
> >>> s1.seek(10)
> 10
> >>> s1.write('hello')
> 5
> >>> s1.getvalue()
> 'hello'
> >>> s1.seek(0)
> 0
> >>> s1.write('abc')
> 3
> >>> s1.getvalue()
> 'abclo'
> >>> import StringIO
> >>> s2 = StringIO.StringIO()
> >>> s2.seek(10)
> >>> s2.write('hello')
> >>> s2.getvalue()
> '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00hello'
> >>> s2.seek(0)
> >>> s2.write('abc')
> >>> s2.getvalue()
> 'abc\x00\x00\x00\x00\x00\x00\x00hello'
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com  Sat Jun 23 20:24:14 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Sat, 23 Jun 2007 14:24:14 -0400
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek
	properly
In-Reply-To: <ca471dc20706231052x561e7acfpf84373ea670c2974@mail.gmail.com>
References: <acd65fa20706230853w32f8895g91b7715c456900b7@mail.gmail.com>
	<ca471dc20706231052x561e7acfpf84373ea670c2974@mail.gmail.com>
Message-ID: <acd65fa20706231124q4e5d5192kdc5694d52175e660@mail.gmail.com>

On 6/23/07, Guido van Rossum <guido at python.org> wrote:
> On 6/23/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > I think I found a bug in the implementation of StringIO/BytesIO in the
> > new io module.  I would like to fix it, but I am not sure what the
> > correct behavior should be. Any hint on this?
>
> BytesIO should behave the way Unix files work: seeking by itself only
> sets the read/write position, but writing inserts null bytes between
> the existing end of the file and the new write position. (Writing zero
> bytes doesn't count; I've just experimentally verified this.)

I agree with this. I will try to write a patch to fix io.BytesIO.

> I think however that for StringIO this should not be allowed -- seek()
> on StringIO is only allowed to accept cookies returned by tell() on
> the same file object.

I am not sure what you mean by "cookies" here. So, do you mean
StringIO would not be allowed to seek beyond the end-of-file?

> > And one more thing, the close method on StringIO/BytesIO objects
> > doesn't work.  I will try to fix that too.
>
> What do you want it to do? I'm thinking perhaps it doesn't need to do anything.

Free the resources held by the object, and make all methods of the
object raise a ValueError if they are used.

-- Alexandre

From guido at python.org  Sat Jun 23 20:48:11 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 23 Jun 2007 11:48:11 -0700
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek
	properly
In-Reply-To: <acd65fa20706231124q4e5d5192kdc5694d52175e660@mail.gmail.com>
References: <acd65fa20706230853w32f8895g91b7715c456900b7@mail.gmail.com>
	<ca471dc20706231052x561e7acfpf84373ea670c2974@mail.gmail.com>
	<acd65fa20706231124q4e5d5192kdc5694d52175e660@mail.gmail.com>
Message-ID: <ca471dc20706231148p7cbb9953tb31099dfe68c9a32@mail.gmail.com>

On 6/23/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> On 6/23/07, Guido van Rossum <guido at python.org> wrote:
> > On 6/23/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > > I think I found a bug in the implementation of StringIO/BytesIO in the
> > > new io module.  I would like to fix it, but I am not sure what the
> > > correct behavior should be. Any hint on this?
> >
> > BytesIO should behave the way Unix files work: seeking by itself only
> > sets the read/write position, but writing inserts null bytes between
> > the existing end of the file and the new write position. (Writing zero
> > bytes doesn't count; I've just experimentally verified this.)
>
> I agree with this. I will try to write a patch to fix io.BytesIO.

Great!

> > I think however that for StringIO this should not be allowed -- seek()
> > on StringIO is only allowed to accept cookies returned by tell() on
> > the same file object.
>
> I am not sure what you mean by "cookies" here. So, do you mean
> StringIO would not be allowed to seek beyond the end-of-file?

tell() returns a number that isn't necessarily a byte offset. It's an
abstract value that only seek() knows what to do with.

TextIOBase in general doesn't support arbitrary seeks at all.

I just realized that a different implementation of StringIO could use
"code unit" offsets, and then it could be allowed to seek beyond EOF.
But IMO it's not required to do that (and the current implementation
doesn't work that way -- it's a TextIOWrapper on top of a BytesIO).
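
Concretely, the contract being described is just this (illustrative
only):

    import io
    f = io.StringIO('spam and eggs\nknights\n')
    f.readline()
    cookie = f.tell()   # opaque; not necessarily a character count
    f.seek(0)           # rewinding to the start is always allowed
    f.seek(cookie)      # fine: the cookie came from tell() on this object
    f.seek(10)          # an arbitrary offset need not be supported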

> > > And one more thing, the close method on StringIO/BytesIO objects
> > > doesn't work.  I will try to fix that too.
> >
> > What do you want it to do? I'm thinking perhaps it doesn't need to do anything.
>
> Free the resources held by the object, and make all methods of the
> object raise a ValueError if they are used.

I'm not sure what the use case for that is (even though the 2.x
StringIO does this).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From talin at acm.org  Sun Jun 24 04:28:58 2007
From: talin at acm.org (Talin)
Date: Sat, 23 Jun 2007 19:28:58 -0700
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101
	(string	formatting)
In-Reply-To: <20070620085701.GA31968@crater.logilab.fr>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>	<ca471dc20706190820n7715fc30jeafcffd14c6b5623@mail.gmail.com>
	<20070620085701.GA31968@crater.logilab.fr>
Message-ID: <467DD6EA.6010303@acm.org>

I haven't responded to this thread because I was hoping some of the 
original proponents of the feature would come out to defend it. 
(Remember, 3101 is a synthesis of a lot of people's ideas gleaned from 
many forum postings - in some cases I am willing to defend particular 
aspects of the PEP, and in others I just write down what I think the 
general consensus is.)

That being said - from what I've read so far, the evidence on both sides 
of the argument seems anecdotal to me. I'd rather wait and see what more 
people have to say on the topic.

-- Talin

Aurélien Campéas wrote:
> On Tue, Jun 19, 2007 at 08:20:25AM -0700, Guido van Rossum wrote:
>> Those are valid concerns. I'm cross-posting this to the python-3000
>> list in the hope that the PEP's author and defenders can respond. I'm
>> sure we can work something out.
> 
> Thanks for raising this. It is horrible enough that I feel obliged to
> de-lurk.
> 
> -10 on this part of PEP3101.
> 
> 
>> Please keep further discussion on the python-3000 at python.org list.
>>
>> --Guido
>>
>> On 6/19/07, Chris McDonough <chrism at plope.com> wrote:
>>> Wrt http://www.python.org/dev/peps/pep-3101/
>>>
>>> PEP 3101 says Py3K should allow item and attribute access syntax
>>> within string templating expressions but "to limit potential security
>>> issues", access to underscore prefixed names within attribute/item
>>> access expressions will be disallowed.
> 
> People talking about potential security issues should have an
> obligation to show how their proposals *really* improve security (in
> general); this is, of course, a hard thing to do; mere hand-waving is
> not sufficient.
> 
>>> I am a person who has lived with the aftermath of a framework
>>> designed to prevent data access by restricting access to underscore-
>>> prefixed names (Zope 2, ahem), and I've found it's very hard to
>>> explain and justify.  As a result, I feel that this is a poor default
>>> policy choice for a framework.
> 
> And it's even poorer in the context of a language (for it's probably
> harder to escape language-level restrictions than framework
> obscurities ...).
> 
>>> In some cases, underscore names must become part of an object's
>>> external interface.  Consider a URL with one or more underscore-
>>> prefixed path segment elements (because prefixing a filename with an
>>> underscore is a perfectly reasonable thing to do on a filesystem, and
>>> path elements are often named after file names) fed to a traversal
>>> algorithm that attempts to resolve each path element into an object
>>> by calling __getitem__ against the parent found by the last path
>>> element's traversal result.  Perhaps this is poor design and
>>> __getitem__ should not be consulted here, but I doubt that highly
>>> because there's nothing particularly special about calling a method
>>> named __getitem__ as opposed to some method named "traverse".
> 
> This is trying to make a technical argument, but the 'consenting
> adults' policy might be enough. In my experience, zope forbidding
> access to _ prefixed attributes just led people to work around the
> limitation, thus adding more useless indirection to an already crufty
> code base. The result is more obfuscation and probably even less
> security (as in auditability of the code).
> 
>>> The only precedent within Python 2 for this sort of behavior is
>>> limiting access to variables that begin with __ and which do not end
>>> with __ to the scope defined by a class and its instances.  I
>>> personally don't believe this is a very useful feature, but it's
>>> still only an advisory policy and you can worm around it with enough
>>> gyrations.
> 
> FWIW I've come to never use __attrs. The obfuscation feature seems to
> bring nothing but pain (the few times I've fallen into that trap as a
> beginner python programmer).
> 
>>> Given that security is a concern at all, the only truly reasonable
>>> way to "limit security issues" is to disallow item and attribute
>>> access completely within the string templating expression syntax.  It
>>> seems gratuitous to me to encourage string templating expressions
>>> with item/attribute access, given that you could do it within the
>>> format arguments just as easily in the 99% case, and we've (well...
>>> I've) happily been living with that restriction for years now.
>>>
>>> But if this syntax is preserved, there really should be no *default*
>>> restrictions on the traversable names within an expression because
>>> this will almost certainly become a hard-to-explain, hard-to-justify
>>> bug magnet as it has become in Zope.
> 
> I'd add that Zope in general looks to me like a giant collection of
> python anti-patterns and as such can be used as a clue source about
> what not to do, especially what not to include in Py3k.
> 
> I don't want to offend people, well, no more than necessary (imho zope
> *is* an offense to common sense in many ways), but that's the opinion
> of someone who earns his living mostly from zope/plone products
> dev. and maintenance (these days, anyway).
> 
> Regards,
> Aurélien.
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/talin%40acm.org
> 

From brett at python.org  Sun Jun 24 05:30:40 2007
From: brett at python.org (Brett Cannon)
Date: Sat, 23 Jun 2007 20:30:40 -0700
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string
	formatting)
In-Reply-To: <3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
	<3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>
Message-ID: <bbaeab100706232030t6921fabmb020c9aa7972da89@mail.gmail.com>

On 6/20/07, Greg Falcon <veloso at verylowsodium.com> wrote:
> On 6/19/07, Chris McDonough <chrism at plope.com> wrote:
> > Given that security is a concern at all, the only truly reasonable
> > way to "limit security issues" is to disallow item and attribute
> > access completely within the string templating expression syntax.  It
> > seems gratuitous to me to encourage string templating expressions
> > with item/attribute access, given that you could do it within the
> > format arguments just as easily in the 99% case, and we've (well...
> > I've) happily been living with that restriction for years now.
> >
> > But if this syntax is preserved, there really should be no *default*
> > restrictions on the traversable names within an expression because
> > this will almost certainly become a hard-to-explain, hard-to-justify
> > bug magnet as it has become in Zope.
>
> This sounds exactly right to me.  I don't have strong feelings either
> way about attribute lookups in formatting strings, or the security
> problems they raise.  But while it seems a reasonable stance that
> user-injected getattr()s may pose a security problem, what seems
> indefensible is the stance that user-injected getattr()s are okay
> precisely when the attribute being looked up doesn't start with an
> underscore.
>
> A single underscore prefix is a hint to human readers, not to the
> language itself, and things should stay that way.

Since Talin said he wanted to see what others had to say, I am going
to say I agree with this sentiment.  I want string formatting to be
dead-simple.  That means either leaving out overly fancy formatting
abilities and keeping it simple, or making it very intuitive with as few
special cases as possible.

-Brett

From talin at acm.org  Sun Jun 24 08:01:17 2007
From: talin at acm.org (Talin)
Date: Sat, 23 Jun 2007 23:01:17 -0700
Subject: [Python-3000] Issues with PEP 3101 (string formatting)
In-Reply-To: <AEAEDAF5-D8D2-4FD7-884C-5C4BD2337C80@plope.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>	<46793E85.4000402@gmail.com>
	<AEAEDAF5-D8D2-4FD7-884C-5C4BD2337C80@plope.com>
Message-ID: <467E08AD.8020703@acm.org>

Chris McDonough wrote:
> Allowing attribute and/or item access within templating expressions  
> has historically been the domain of full-on templating languages  
> (which invariably also have a way to do repeats, conditionals,  
> arbitrary method calls, etc).
> 
> I think it should probably stay that way because to me, at least,  
> there's not much more compelling about being able to do item/ 
> attribute access within a template expression than there is to be  
> able to do replacements using results from arbitrary method calls.   
> It's fairly arbitrary to allow calls to __getitem__ and __getattr__  
> but prevent, say, calls to "traverse", at least if the format  
> arguments are not restricted to plain lists/tuples/dicts.

I don't buy this argument - in that I don't think it's arbitrary. You are 
correct that 3101 is not intended to be a full-on templating language, 
but that doesn't mean that we can't extend it beyond what, say, printf 
can do.

The current design is a mid-point between Perl's interpolated strings 
(which can contain arbitrary expressions), and C-style printf. The 
guiding rule is to allow expressions which increase convenience and 
expressiveness, and which are likely to be useful, while disallowing 
most of the types of expressions which would be likely to have side 
effects. Since this is Python, we can't guarantee that there's no side 
effects, but we can make a pretty good guess based on the assumption 
that most Python programmers are reasonable and sane.
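
For reference, the sort of thing the current design allows (a sketch;
the object names here are made up):

    "My name is {0[name]}".format({'name': 'Fred'})   # item access
    "Weight in tons {0.weight}".format(ship)          # attribute access
    # ...but arbitrary calls such as "{0.compute()}" are deliberately
    # not part of the syntax.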

From an implementation standpoint, this is not where the complexity 
lies. (The most complex part of the code is the part dealing with 
details of conversion specifiers and formatting of numbers.)

> That's not to say that maybe an extended templating thingy shouldn't  
> ship within the stdlib though, maybe even one that extends the  
> default interpolation syntax in these sorts of ways.
> 
> - C
> 
> On Jun 20, 2007, at 10:49 AM, Nick Coghlan wrote:
> 
>> Chris McDonough wrote:
>>> Wrt http://www.python.org/dev/peps/pep-3101/
>>> PEP 3101 says Py3K should allow item and attribute access syntax   
>>> within string templating expressions but "to limit potential  
>>> security  issues", access to underscore prefixed names within  
>>> attribute/item  access expressions will be disallowed.
>> Personally, I'd be fine with leaving at least the embedded  
>> attribute access out of the initial implementation of the PEP. I'd  
>> even be OK with leaving out the embedded item access, but if we  
>> leave it in, "vars(obj)" and the embedded item access would still  
>> provide a shorthand notation for access to instance variable  
>> attributes in a format string.
>>
>> So +1 for leaving out embedded attribute access from the initial  
>> implementation of PEP 3101, and -0 for leaving out the embedded  
>> item access.
>>
>> Cheers,
>> Nick.
>>
>> -- 
>> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>> ---------------------------------------------------------------
>>             http://www.boredomandlaziness.org
>>
> 
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/talin%40acm.org
> 

From chrism at plope.com  Sun Jun 24 08:32:13 2007
From: chrism at plope.com (Chris McDonough)
Date: Sun, 24 Jun 2007 02:32:13 -0400
Subject: [Python-3000] Issues with PEP 3101 (string formatting)
In-Reply-To: <467E08AD.8020703@acm.org>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>	<46793E85.4000402@gmail.com>
	<AEAEDAF5-D8D2-4FD7-884C-5C4BD2337C80@plope.com>
	<467E08AD.8020703@acm.org>
Message-ID: <CCD64BEA-D6C0-473F-99B8-C75AF3C4830B@plope.com>


On Jun 24, 2007, at 2:01 AM, Talin wrote:
> The current design is a mid-point between Perl's interpolated  
> strings (which can contain arbitrary expressions), and C-style  
> printf. The guiding rule is to allow expressions which increase  
> convenience and expressiveness, and which are likely to be useful,  
> while disallowing most of the types of expressions which would be  
> likely to have side effects. Since this is Python, we can't  
> guarantee that there's no side effects, but we can make a pretty  
> good guess based on the assumption that most Python programmers are  
> reasonable and sane.

Of course it's a judgment call whether the benefit of being able to  
do attribute/item lookup within formatting expressions is "worth  
it".  At very least it means I'll need to be more careful when  
supplying formatting arguments in order to prevent inappropriate data  
exposure.  And I won't be able to allow untrusted users to compose  
plain strings with formatting expressions in them, at least without  
imposing some restricted execution model within the objects fed to  
the formatter.  Zope currently does this inasmuch as it allows people  
to compose dynamic TALES expressions, which is "safe" right now, but  
will become unsafe.  Frankly I'd rather just not think about it,  
because leaving this feature out is way easier than dealing with  
restricted execution or coming up with a mini templating language to  
replace the current string formatting stuff, which works fine.

But, that aside, at the very least, we shouldn't restrict the names  
available to be looked up by default to those not starting with an  
underscore (for the reasons I mentioned in the original post in this  
thread).

>
> From an implementation standpoint, this is not where the complexity  
> lies. (The most complex part of the code is the part dealing with  
> details of conversion specifiers and formatting of numbers.)

I know it's not very complex; I just don't believe it's terribly  
beneficial to have in the base string formatting implementation, and  
it's potentially harmful.  Particularly to web programmers, at least  
to dumb ones like me.

- C


From p.f.moore at gmail.com  Sun Jun 24 21:10:43 2007
From: p.f.moore at gmail.com (Paul Moore)
Date: Sun, 24 Jun 2007 20:10:43 +0100
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string
	formatting)
In-Reply-To: <bbaeab100706232030t6921fabmb020c9aa7972da89@mail.gmail.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
	<3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>
	<bbaeab100706232030t6921fabmb020c9aa7972da89@mail.gmail.com>
Message-ID: <79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>

On 24/06/07, Brett Cannon <brett at python.org> wrote:
> On 6/20/07, Greg Falcon <veloso at verylowsodium.com> wrote:
> > This sounds exactly right to me.  I don't have strong feelings either
> > way about attribute lookups in formatting strings, or the security
> > problems they raise.  But while it seems a reasonable stance that
> > user-injected getattr()s may pose a security problem, what seems
> > indefensible is the stance that user-injected getattr()s are okay
> > precisely when the attribute being looked up doesn't start with an
> > underscore.
> >
> > A single underscore prefix is a hint to human readers, not to the
> > language itself, and things should stay that way.
>
> Since Talin said he wanted to see what others had to say, I am going
> to say I agree with this sentiment.  I want string formatting to be
> dead-simple.  That means either leaving out overly fancy formatting
> abilities and keeping it simple, or make it very intuitive with as few
> special cases as possible.

Again, I agree. I'd prefer to see attribute access stay, but I'm not
too bothered; I'm very strongly against any restrictions based on the
form of the name.

Count me as +0 on allowing a.b, and -1 on allowing a.b unless b
contains leading underscores.

Paul.

From p.f.moore at gmail.com  Sun Jun 24 21:13:50 2007
From: p.f.moore at gmail.com (Paul Moore)
Date: Sun, 24 Jun 2007 20:13:50 +0100
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string
	formatting)
In-Reply-To: <79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
	<3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>
	<bbaeab100706232030t6921fabmb020c9aa7972da89@mail.gmail.com>
	<79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>
Message-ID: <79990c6b0706241213o41e395den6687e7fa9af3c189@mail.gmail.com>

On 24/06/07, Paul Moore <p.f.moore at gmail.com> wrote:
> Count me as +0 on allowing a.b, and -1 on allowing a.b unless b
> contains leading underscores.

Rereading that, the second part didn't make sense. Assuming a.b is
allowed, I'm -1 on putting restrictions on b, specifically on not
allowing it to start with an underscore.

Heck, the fact that I find it so hard to describe argues that it's a
misguided restriction (ignoring the possibility that I simply can't
express myself in my native language :-))

Paul.

From talin at acm.org  Sun Jun 24 21:51:51 2007
From: talin at acm.org (Talin)
Date: Sun, 24 Jun 2007 12:51:51 -0700
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string
	formatting)
In-Reply-To: <f5lcc5$up4$1@sea.gmane.org>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>	<ca471dc20706190820n7715fc30jeafcffd14c6b5623@mail.gmail.com>
	<f5lcc5$up4$1@sea.gmane.org>
Message-ID: <467ECB57.8080209@acm.org>

Georg Brandl wrote:
> Another question w.r.t. new string formatting:
> 
> Assuming the %-operator for strings goes away as you said in the recent blog
> post, how are we going to convert string formatting (which I daresay is a very
> common operation in Python modules) in the 2to3 tool?
> 
> Of course, "abc" % anything can be converted easily.
> 
> name % tuple_or_dict can only be converted to name.format(tuple_or_dict),
> without correcting the format string.
> 
> name % name can not be converted at all without type inference.
> 
> Though probably the first type of application is the most frequent one,
> pre-building (or just loading from elsewhere) of format strings is not so
> uncommon when it comes to localization, where the format string likely
> has a _() wrapped around it.
> 
> Of course, converting format strings manually is a PITA, mainly because it's
> so common.
> 
> Georg

Actually, I was presuming that '%' would stick around for the time 
being, although it might be officially deprecated.

Given that writing a 2to3 converter for format strings would be a 
project in itself, I think it's probably best to remain backwards 
compatible for now.

-- Talin

From jcarlson at uci.edu  Sun Jun 24 23:05:30 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Jun 2007 14:05:30 -0700
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <87bqfcj97n.fsf@ten22.rhodesmill.org>
References: <4677E097.5060205@online.de> <87bqfcj97n.fsf@ten22.rhodesmill.org>
Message-ID: <20070624132756.7998.JCARLSON@uci.edu>


Brandon Craig Rhodes <brandon at rhodesmill.org> wrote:
> Joachim König <him at online.de> writes:
> 
> > ... could someone enlighten me why
> >
> > {,}
> >
> > can't be used for the empty set, analogous to the empty tuple (,)?
> 
> And now that someone else has broken the ice regarding questions that
> have probably been exhausted already, I want to comment that Python 3k
> seems to perpetuate a vast asymmetry.  Observe:

Since no one seems to have responded to this, I will go ahead and do so
(I just got back from vacation).


> (a) Syntactic constructors
> 
>  [ 1,2,3 ]   works
>  { 1,2,3 }   works
>  { 1:1, 2:4, 3:9 }   works
> 
> (b) Generators + constructor functions
> 
>  list(i for i in (1,2,3))   works
>  set(i for i in (1,2,3))   works
>  dict((i,i*i) for i in (1,2,3))   works
> 
> (c) Comprehensions
> 
>  [ i for i in (1,2,3) ]   works
>  { i for i in (1,2,3) }   works
>  { i:i*i for i in (1,2,3) ]   returns a SyntaxError!

But you forgot tuples!

    ( 1,2,3 )
    tuple(i for i in (1,2,3))
    (i for i in (1,2,3))

Oops, that last one isn't a tuple, it is a generator expression wrapped
up in parentheses.  Really though, there are two exceptions to the rule.
Honestly, if you are that concerned about teaching students the language
(to the point that they have difficulty figuring out the *two*
exceptions to the rule), teach them the single form that always works:
generators + constructors.  They may see the three different
comprehensions/expressions (list, set, generator), but it should be
fairly easy to explain that they are equivalent to the generator +
constructor version.
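
That is, the mapping to teach is just (each pair evaluates equal;
illustrative only):

    [i for i in (1, 2, 3)]   ==  list(i for i in (1, 2, 3))
    {i for i in (1, 2, 3)}   ==  set(i for i in (1, 2, 3))
    # and the "missing" dict form is spelled with the constructor:
    dict((i, i * i) for i in (1, 2, 3))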

> Given that Python 3k is making such strides in other areas where cruft
> and asymmetry needed to be removed, it would seem a shame to leave the
> container types in such disarray.

And one could make the argument that TOOTDI says that literals and
generators + constructors are the only reasonable options.
Comprehensions save perhaps 5 characters over the constructor method,
and may be a bit faster, but result in the asymmetry above.  But I will
admit that comprehension syntax is not likely to be going anywhere, and
dictionary comprehensions are not likely to be added (and neither are
tuple comprehensions).

 - Josiah


From jimjjewett at gmail.com  Mon Jun 25 16:07:26 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 25 Jun 2007 10:07:26 -0400
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string
	formatting)
In-Reply-To: <79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>
References: <A51EAB52-FA02-47DE-8A82-DF706F4ECD67@plope.com>
	<3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>
	<bbaeab100706232030t6921fabmb020c9aa7972da89@mail.gmail.com>
	<79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>
Message-ID: <fb6fbf560706250707i17b74f0bi18ee2b7db28499f2@mail.gmail.com>

On 6/24/07, Paul Moore <p.f.moore at gmail.com> wrote:
> Count me as +0 on allowing a.b, and -1 on allowing a.b
> unless b contains leading underscores.

FWIW, I do want to allow a.b, because it means I can more easily pass
locals(), instead of creating a one-use near-boilerplate dictionary,
such as

{"a"=a, "b"=b, "name"=c.name}

I do like the "no attributes with leading underscores" restriction as
the default; these shouldn't be part of the public API.  If they are
needed, there should be an alias, and if there isn't an alias, then
... make it easy to override the policy.

If the restriction were actually "no magic attributes", so that
_myfile was fine, but __file__ wasn't, that would work even better --
except that it would encourage people to use __attributes__ when they
shouldn't, just to get the protection.

-jJ

From python3now at gmail.com  Mon Jun 25 18:36:52 2007
From: python3now at gmail.com (James Thiele)
Date: Mon, 25 Jun 2007 09:36:52 -0700
Subject: [Python-3000] Bug(s) in 2to3
Message-ID: <8f01efd00706250936u18dcaa7x918fd5bebdf33b1@mail.gmail.com>

After checking out the subversion repository of 2to3 yesterday, I found
two cases where refactor.py failed.
It didn't like this line:
example.py:322:    print h.iterkeys().next()

throwing:
AttributeError: 'DelayedStrNode' object has no attribute 'set_prefix'

The attached file "dict_ex.py" is a short example which also gets this error.

refactor.py also didn't like:
lineno, line = lineno+1, f.next()

also throwing:
AttributeError: 'DelayedStrNode' object has no attribute 'get_prefix'

The attached file "tup.py" is a short example which also gets this
error. The attached file
"no_tup.py" comments out the offending line and doesn't throw the exception.

The attached file "transcript" contains a shell session with full
tracebacks. The line numbers in the tracebacks may vary slightly from
the repository versions due to debug code used to isolate the problem.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcript
Type: application/octet-stream
Size: 2315 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/3d2b45a6/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dict_ex.py
Type: application/octet-stream
Size: 33 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/3d2b45a6/attachment-0001.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tup.py
Type: application/octet-stream
Size: 226 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/3d2b45a6/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: no_tup.py
Type: application/octet-stream
Size: 228 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/3d2b45a6/attachment-0003.obj 

From collinw at gmail.com  Mon Jun 25 18:42:50 2007
From: collinw at gmail.com (Collin Winter)
Date: Mon, 25 Jun 2007 09:42:50 -0700
Subject: [Python-3000] Bug(s) in 2to3
In-Reply-To: <8f01efd00706250936u18dcaa7x918fd5bebdf33b1@mail.gmail.com>
References: <8f01efd00706250936u18dcaa7x918fd5bebdf33b1@mail.gmail.com>
Message-ID: <43aa6ff70706250942s23c0bd5fve015164de07bbbff@mail.gmail.com>

On 6/25/07, James Thiele <python3now at gmail.com> wrote:
> After checking out subversion repository of 2to3 yesterday I found two
> cases where refactor.py failed.

I'll fix this. Thanks for the bug report.

Collin Winter

From alexandre at peadrop.com  Mon Jun 25 19:18:25 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Mon, 25 Jun 2007 13:18:25 -0400
Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk
Message-ID: <acd65fa20706251018u3887c239hc1a22391f85cddd@mail.gmail.com>

Hi,

I found two small bugs in pydoc.py. The patch is rather simple, so I doubt
I have to explain it. Note, I removed the -*- coding: -*- tag, since
the encoding of pydoc.py is actually utf-8, not Latin-1 (at least, that's
what Emacs told me).

-- Alexandre
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pydoc-fix.patch
Type: text/x-patch
Size: 825 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/344fee3d/attachment.bin 

From g.brandl at gmx.net  Mon Jun 25 19:47:54 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Mon, 25 Jun 2007 19:47:54 +0200
Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk
In-Reply-To: <acd65fa20706251018u3887c239hc1a22391f85cddd@mail.gmail.com>
References: <acd65fa20706251018u3887c239hc1a22391f85cddd@mail.gmail.com>
Message-ID: <f5ov45$f8h$1@sea.gmane.org>

Alexandre Vassalotti schrieb:
> Hi,
> 
> I found two small bugs in pydoc.py. The patch is rather simple, so I doubt
> I have to explain it.

You've submitted this before; I've already committed it to SVN.

> Note, I removed the -*- coding: -*- tag, since
> the encoding of pydoc.py is actually utf-8, not Latin-1 (at least, that's
> what Emacs told me).

AFAICS, the file doesn't have any non-ascii characters in it, so actually
it's both latin1 and utf8 :)

Georg


From alexandre at peadrop.com  Mon Jun 25 20:14:16 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Mon, 25 Jun 2007 14:14:16 -0400
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek
	properly
In-Reply-To: <ca471dc20706231148p7cbb9953tb31099dfe68c9a32@mail.gmail.com>
References: <acd65fa20706230853w32f8895g91b7715c456900b7@mail.gmail.com>
	<ca471dc20706231052x561e7acfpf84373ea670c2974@mail.gmail.com>
	<acd65fa20706231124q4e5d5192kdc5694d52175e660@mail.gmail.com>
	<ca471dc20706231148p7cbb9953tb31099dfe68c9a32@mail.gmail.com>
Message-ID: <acd65fa20706251114u60bae701ve95a84ffee27e0b2@mail.gmail.com>

On 6/23/07, Guido van Rossum <guido at python.org> wrote:
> On 6/23/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > I agree with this. I will try to write a patch to fix io.BytesIO.
>
> Great!

I got the patch (it's attached to this email). The fix was simpler
than I thought.

I would like to write a unittest for it, but I am not sure where it
should go in test_io.py. From what I see, MemorySeekTestMixin is for
testing read/seek operation common to BytesIO and StringIO, so I can't
put it there. And I don't really like the idea of adding another test
in IOTest.test_raw_bytes_io.

By the way, I am having the same problem for the tests of _string_io
and _bytes_io -- i.e., I don't know exactly how to organize them with
the rest of the tests in test_io.py.

> > Free the resources held by the object, and make all methods of the
> > object raise a ValueError if they are used.
>
> I'm not sure what the use case for that is (even though the 2.x
> StringIO does this).
>

It seems the close method on TextIOWrapper objects is broken too (or at
least, bizarre):

    >>> f = open('test', 'w')
    >>> f.write('hello')
    5
    >>> f.close()
    >>> f.write('hello')
    5
    >>> ^D
    $ hd test
    00000000  68 65 6c 6c 6f                                    |hello|
    00000005


-- Alexandre
-------------- next part --------------
A non-text attachment was scrubbed...
Name: overseek-bytesio.patch
Type: text/x-patch
Size: 608 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/25c904d4/attachment.bin 

From alexandre at peadrop.com  Mon Jun 25 20:22:15 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Mon, 25 Jun 2007 14:22:15 -0400
Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk
In-Reply-To: <f5ov45$f8h$1@sea.gmane.org>
References: <acd65fa20706251018u3887c239hc1a22391f85cddd@mail.gmail.com>
	<f5ov45$f8h$1@sea.gmane.org>
Message-ID: <acd65fa20706251122o699feac4v9e6d51f0a10952f2@mail.gmail.com>

On 6/25/07, Georg Brandl <g.brandl at gmx.net> wrote:
> Alexandre Vassalotti schrieb:
> > I found two small bugs in pydoc.py. The patch is rather simple, so I doubt
> > I have to explain it.
>
> You've submitted this before; I've already committed it to SVN.
>

Really??? I don't remember this ... My last patch was against pdb.py,
not pydoc.py

> > Note, I removed the -*- coding: -*- tag, since
> > the encoding of pydoc.py is actually utf-8, not Latin-1 (at least, that's
> > what Emacs told me).
>
> AFAICS, the file doesn't have any non-ascii characters in it, so actually
> it's both latin1 and utf8 :)

Ah!

-- Alexandre

From alexandre at peadrop.com  Mon Jun 25 20:27:55 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Mon, 25 Jun 2007 14:27:55 -0400
Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk
In-Reply-To: <acd65fa20706251018u3887c239hc1a22391f85cddd@mail.gmail.com>
References: <acd65fa20706251018u3887c239hc1a22391f85cddd@mail.gmail.com>
Message-ID: <acd65fa20706251127u621b3ca1y1a30f7ebf965356f@mail.gmail.com>

Meanwhile, I found another division/range combination that could be
problematic. I attached an updated patch.

-- Alexandre

On 6/25/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> Hi,
>
> I found two small bugs in pydoc.py. The patch is rather simple, so I doubt
> I have to explain it. Note, I removed the -*- coding: -*- tag, since
> the encoding of pydoc.py is actually utf-8, not Latin-1 (at least, that's
> what Emacs told me).
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pydoc-fix-2.patch
Type: text/x-patch
Size: 1225 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/ccddc33d/attachment-0001.bin 

From alexandre at peadrop.com  Mon Jun 25 20:44:42 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Mon, 25 Jun 2007 14:44:42 -0400
Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk
In-Reply-To: <acd65fa20706251122o699feac4v9e6d51f0a10952f2@mail.gmail.com>
References: <acd65fa20706251018u3887c239hc1a22391f85cddd@mail.gmail.com>
	<f5ov45$f8h$1@sea.gmane.org>
	<acd65fa20706251122o699feac4v9e6d51f0a10952f2@mail.gmail.com>
Message-ID: <acd65fa20706251144l4da195fwefaa3337edb1197d@mail.gmail.com>

On 6/25/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> On 6/25/07, Georg Brandl <g.brandl at gmx.net> wrote:
> > You've submitted this before; I've already committed it to SVN.
>
> Really??? I don't remember this ... My last patch was against pdb.py,
> not pydoc.py

Nevermind, I just found out someone else already sent a patch (Patch #1739659).

Sorry for the noise,
-- Alexandre

From rowen at cesmail.net  Mon Jun 25 21:55:53 2007
From: rowen at cesmail.net (Russell E. Owen)
Date: Mon, 25 Jun 2007 12:55:53 -0700
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
References: <4677E097.5060205@online.de> <87bqfcj97n.fsf@ten22.rhodesmill.org>
	<20070624132756.7998.JCARLSON@uci.edu>
Message-ID: <rowen-C34545.12555325062007@sea.gmane.org>

In article <20070624132756.7998.JCARLSON at uci.edu>,
 Josiah Carlson <jcarlson at uci.edu> wrote:

> ...one could make the argument that TOOTDI says that literals and
> generators + constructors are the only reasonable options.
> Comprehensions save perhaps 5 characters over the constructor method,
> and may be a bit faster, but result in the asymmetry above.  But I will
> admit that comprehension syntax is not likely to be going anywhere, and
> dictionary comprehensions are not likely to be added (and neither are
> tuple comprehensions).

OK, I'll bite. Does Python really need both list comprehensions and 
generator expressions? Perhaps list comprehensions should go away in 
Python 3000? I'm sure it's been discussed (I'm late to this list) and a 
Google search showed a few blog entries but nothing more.

-- Russell


From jcarlson at uci.edu  Tue Jun 26 05:05:41 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Mon, 25 Jun 2007 20:05:41 -0700
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <rowen-C34545.12555325062007@sea.gmane.org>
References: <20070624132756.7998.JCARLSON@uci.edu>
	<rowen-C34545.12555325062007@sea.gmane.org>
Message-ID: <20070625193345.79AA.JCARLSON@uci.edu>


"Russell E. Owen" <rowen at cesmail.net> wrote:
> In article <20070624132756.7998.JCARLSON at uci.edu>,
>  Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> > ...one could make the argument that TOOTDI says that literals and
> > generators + constructors are the only reasonable options.
> > Comprehensions save perhaps 5 characters over the constructor method,
> > and may be a bit faster, but result in the asymmetry above.  But I will
> > admit that comprehension syntax is not likely to be going anywhere, and
> > dictionary comprehensions are not likely to be added (and neither are
> > tuple comprehensions).
> 
> OK, I'll bite. Does Python really need both list comprehensions and 
> generator expressions? Perhaps list comprehensions should go away in 
> Python 3000? I'm sure it's been discussed (I'm late to this list) and a 
> google search showed a few blog entries but nothing more.

If list comprehensions went away, then it would make sense for set
comprehensions to go away too (being that list comprehensions arguably
have far more example uses in real code, and perhaps more use-cases).

 - Josiah


From cspencer at cinci.rr.com  Tue Jun 26 20:47:14 2007
From: cspencer at cinci.rr.com (Chris Spencer)
Date: Tue, 26 Jun 2007 14:47:14 -0400
Subject: [Python-3000] An impassioned plea for true multithreading
Message-ID: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com>

	I know this is probably futile, but I'm going to ask anyway.
Since I have not the time (or ability) to code this, I am not even
submitting a PEP.  I'm throwing this out there on the wind.
	Since we're doing a lot of work that breaks backwards
compatibility, one more piece of breakage needs to happen.  We need to
have true multithreading.

Reasons:
1.  Most people who bought a computer in the past year bought a
dual-core processor with it.  Quad-cores are going to take over the
market in 2008.  To not be able to take advantage of these extra cores
is an inherent language disadvantage.  Yes, you can run more than one
process and do some sort of IPC, but it requires a lot more work for
the coder and a lot more complexity in the code (ie more bugs).

2.  It makes writing servers so much easier on Windows systems (you
know, the OS without an effective "fork" mechanism).  Simply sticking
your fingers in your ears and yelling "LA LA LA" in the hopes Windows
will go away is not effective language design.

3.  C# and Java have true multithreading.  Ruby doesn't.  Let's get it
before Ruby does.

4.  It will actually speed up the Python interpreter.  Not at first,
but I'm certain there's a level of parallelism in the Python bytecode
that can be exploited by threaded branch prediction and concurrent
processing.  For example, a generator could figure out its next value
BEFORE being called, so that each iteration is a simple return of a
value.  I speculate that with true multithreading, an optimized Python
interpreter will appear within a year to take advantage of these
possibilities.
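
(For what it's worth, the generator half of this can be approximated
today with a worker thread -- a sketch only; with the GIL it pays off
mainly when producing the next value releases the GIL, e.g. during I/O:)

    import threading
    from Queue import Queue   # renamed "queue" in Py3k

    def prefetched(gen, size=1):
        # Compute the generator's next value in the background so the
        # consumer sees a simple queue pop.
        q = Queue(size)
        done = object()
        def worker():
            for item in gen:
                q.put(item)
            q.put(done)
        threading.Thread(target=worker).start()
        while True:
            item = q.get()
            if item is done:
                return
            yield item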

	I hope the thoughts behind this email aren't outweighed by the
fact that it didn't go through the proper channels.  Thank you for
your time.

Christoper L. Spencer
CTO ROXX, LLC
4515 Leslie Ave.
Cincinnati, OH
45242
TEL: 513-545-7057
EMAIL: cspencer at cinci.rr.com

From ronaldoussoren at mac.com  Tue Jun 26 21:23:00 2007
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Tue, 26 Jun 2007 12:23:00 -0700
Subject: [Python-3000] An impassioned plea for true multithreading
In-Reply-To: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com>
References: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com>
Message-ID: <B73FC157-0113-1000-BB49-29AF471CD20B-Webmail-10015@mac.com>

 
On Tuesday, June 26, 2007, at 08:49PM, "Chris Spencer" <cspencer at cinci.rr.com> wrote:
>	I know this is probably futile, but I'm going to ask anyway.
>Since I have not the time (or ability) to code this, I am not even
>submitting a PEP.  I'm throwing this out there on the wind.
>	Since we're doing a lot of work that breaks backwards
>compatibility, one more piece of breakage needs to happen.  We need to
>have true multithreading.

This request comes up from time to time and the standard OSS mantra applies here: show us the code.   None of the core developers is interested enough to work on this, and it is far from certain that removing the GIL can be done without massive restructuring of the core interpreter or loss of performance (possibly both).

Someone tried to remove the GIL several years ago (Google should be able to tell you about this) and ended up with a working but significantly slower interpreter.

>
>Reasons:
[snip the same old reasons]

Ronald

From rowen at cesmail.net  Wed Jun 27 02:23:13 2007
From: rowen at cesmail.net (Russell E. Owen)
Date: Tue, 26 Jun 2007 17:23:13 -0700
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
References: <20070624132756.7998.JCARLSON@uci.edu>
	<rowen-C34545.12555325062007@sea.gmane.org>
	<20070625193345.79AA.JCARLSON@uci.edu>
Message-ID: <rowen-8EA8EC.17231326062007@sea.gmane.org>

In article <20070625193345.79AA.JCARLSON at uci.edu>,
 Josiah Carlson <jcarlson at uci.edu> wrote:

> "Russell E. Owen" <rowen at cesmail.net> wrote:
> > In article <20070624132756.7998.JCARLSON at uci.edu>,
> >  Josiah Carlson <jcarlson at uci.edu> wrote:
> > 
> > > ...one could make the argument that TOOTDI says that literals and
> > > generators + constructors are the only reasonable options.
> > > Comprehensions save perhaps 5 characters over the constructor method,
> > > and may be a bit faster, but result in the asymmetry above.  But I will
> > > admit that comprehension syntax is not likely to be going anywhere, and
> > > dictionary comprehensions are not likely to be added (and neither are
> > > tuple comprehensions).
> > 
> > OK, I'll bite. Does Python really need both list comprehensions and 
> > generator expressions? Perhaps list comprehensions should go away in 
> > Python 3000? I'm sure it's been discussed (I'm late to this list) and a 
> > google search showed a few blog entries but nothing more.
> 
> If list comprehensions went away, then it would make sense for set
> comprehensions to go away too (being that list comprehensions arguably
> have far more example uses in real code, and perhaps more use-cases).

I would personally be happy to lose set comprehensions and just use 
generator expressions for all comprehension-like tasks.

-- Russell


From martin at v.loewis.de  Wed Jun 27 05:43:54 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 27 Jun 2007 05:43:54 +0200
Subject: [Python-3000] An impassioned plea for true multithreading
In-Reply-To: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com>
References: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com>
Message-ID: <4681DCFA.4050000@v.loewis.de>

Chris Spencer schrieb:
> 	I know this is probably futile, but I'm going to ask anyway.
> Since I have not the time (or ability) to code this, I am not even
> submitting a PEP.  I'm throwing this out there on the wind.

Just to second Ronald's sentiment: it won't happen unless somebody
does it, and it is highly unlikely that somebody will.

> 	Since we're doing a lot of work that breaks backwards
> compatibility, one more piece of breakage needs to happen.  We need to
> have true multithreading.

Be careful when using the pronoun "we"; in the first sentence, it seems
to not include yourself, and in the second sentence, it does not include
myself.

Regards,
Martin

From rasky at develer.com  Wed Jun 27 11:26:59 2007
From: rasky at develer.com (Giovanni Bajo)
Date: Wed, 27 Jun 2007 11:26:59 +0200
Subject: [Python-3000] An impassioned plea for true multithreading
In-Reply-To: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com>
References: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com>
Message-ID: <f5tah3$ve9$1@sea.gmane.org>

On 26/06/2007 20.47, Chris Spencer wrote:

> 1.  Most people who bought a computer in the past year bought a
> dual-core processor with it.  Quad-cores are going to take over the
> market in 2008.  To not be able to take advantage of these extra cores
> is an inherent language disadvantage.  Yes, you can run more than one
> process and do some sort of IPC, but it requires a lot more work for
> the coder and a lot more complexity in the code (ie more bugs).

In my experience, it's multi-threading that gives you endless bugs without any 
hope of getting debugged and fixed. Multi-processing (carefully coupled with 
event-based programming) instead gives you a solid program
with small parts which can be run and tested individually.

In fact, I am *happy* that Python does not have true multithreading: this 
forces people to design programs the right way from the beginning (unless you 
want the typical quick, non-performance-sensitive, fast-hack thread, and in 
that case Python's multithreading with GIL is more than enough).

So please don't say that Python isn't able to exploit quad-cores: it's a false 
statement. On the contrary: it lets you use them CORRECTLY, without shared 
memory issues.

Have a look at the package called "processing" in PyPI. You might find it 
interesting.
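
For example, something along these lines (a sketch based on the
package's docs; the same API later became the stdlib multiprocessing
module):

    from processing import Process, Queue  # "multiprocessing" later on

    def worker(q):
        # Each worker runs in its own process, so there is no GIL contention.
        q.put(sum(i * i for i in xrange(10 ** 6)))

    if __name__ == '__main__':
        q = Queue()
        procs = [Process(target=worker, args=(q,)) for _ in range(4)]
        for p in procs:
            p.start()
        results = [q.get() for _ in procs]
        for p in procs:
            p.join()
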
-- 
Giovanni Bajo


From gproux+py3000 at gmail.com  Wed Jun 27 12:44:51 2007
From: gproux+py3000 at gmail.com (Guillaume Proux)
Date: Wed, 27 Jun 2007 19:44:51 +0900
Subject: [Python-3000] An impassioned plea for true multithreading
In-Reply-To: <f5tah3$ve9$1@sea.gmane.org>
References: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com>
	<f5tah3$ve9$1@sea.gmane.org>
Message-ID: <19dd68ba0706270344q27fe5e7fg3bc15f70336db23d@mail.gmail.com>

My 2 cents...

I have really felt the need for real multithreading when I have tried
programming multimedia with python (pygame).
Doing scene management at the same time as other processing that
requires quasi-realtime response (video decode) is just basically
impossible (never mind the garbage collector kicking in when the
bad guy is about to shoot you!)

Of course, one solution is to make a multithreaded scene-graph engine
in C++ and control that engine from Python, but then it just proves the
point that not everything can be scaled up through increasing the
number of processes. Some things just cannot be scaled up when it is
required to have simultaneous access to the same dataset.

Regards,

Guillaume

From greg.ewing at canterbury.ac.nz  Thu Jun 28 02:37:20 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 28 Jun 2007 12:37:20 +1200
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <rowen-8EA8EC.17231326062007@sea.gmane.org>
References: <20070624132756.7998.JCARLSON@uci.edu>
	<rowen-C34545.12555325062007@sea.gmane.org>
	<20070625193345.79AA.JCARLSON@uci.edu>
	<rowen-8EA8EC.17231326062007@sea.gmane.org>
Message-ID: <468302C0.3050808@canterbury.ac.nz>

Russell E. Owen wrote:
> I would personally be happy to lose set comprehensions and just use 
> generator expressions for all comprehension-like tasks.

One advantage of the comprehension syntaxes is that the
body can be inlined instead of relegated to a lambda,
saving the overhead of a Python function call per
loop.

It would be difficult to do that optimisation with
a generator unless things like list(generator) were
recognised and special-cased somehow.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From ncoghlan at gmail.com  Thu Jun 28 15:56:25 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 28 Jun 2007 23:56:25 +1000
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <468302C0.3050808@canterbury.ac.nz>
References: <20070624132756.7998.JCARLSON@uci.edu>	<rowen-C34545.12555325062007@sea.gmane.org>	<20070625193345.79AA.JCARLSON@uci.edu>	<rowen-8EA8EC.17231326062007@sea.gmane.org>
	<468302C0.3050808@canterbury.ac.nz>
Message-ID: <4683BE09.4010702@gmail.com>

Greg Ewing wrote:
> Russell E. Owen wrote:
>> I would personally be happy to lose set comprehensions and just use 
>> generator expressions for all comprehension-like tasks.
> 
> One advantage of the comprehension syntaxes is that the
> body can be inlined instead of relegated to a lambda,
> saving the overhead of a Python function call per
> loop.

I'm not sure what you mean by "function call per loop" in this 
paragraph. There is no function call per loop even when using a 
generator expression - a generator function is implicitly defined, and 
then called once to instantiate the generator. Iterating over this 
suspends and resumes the generator to retrieve each item, rather than 
making a Python function call as such - is that behaviour what you were 
referring to?

Regardless, what the list and set comprehension syntax saves you is that 
instead of having to suspend/resume a generator multiple times while 
iterating over it to fill the container, the implicitly defined function 
instead creates and populates the desired container type directly. These 
operations are also compiled to use special opcodes, so they should be 
significantly faster than the corresponding pure Python code would be.

(I'd provide some timing figures, but my Py3k checkout is somewhat 
stale, so the timeit module isn't working for me at the moment)
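
(A quick way to see the special-casing -- a sketch from a current
checkout; exact opcode names may vary by revision:)

    import dis
    # Comprehensions get dedicated opcodes rather than a method call
    # per item:
    dis.dis(compile("[i for i in x]", "<s>", "eval"))   # uses LIST_APPEND
    dis.dis(compile("{i for i in x}", "<s>", "eval"))   # uses SET_ADD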

To get back to the original question, I believe the point of adding set 
literal and comprehension syntax is to make it possible to easily speed 
up membership tests for items in known groups - the existing list 
literals are fast to create, but slow to search. Using a set literal 
instead of a list literal is also a good way to make it explicit that 
the order in which the items are added to the container is arbitrary and 
coincidental, rather than having any significant meaning.
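
For instance (names made up):

    # set literal: hashed membership test, ordering visibly irrelevant
    if color in {'red', 'green', 'blue'}:
        handle(color)
    # list literal: cheap to build, but membership is a linear scan
    if color in ['red', 'green', 'blue']:
        handle(color)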

Cheers,
Nick.





-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From alexandre at peadrop.com  Thu Jun 28 16:37:01 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 28 Jun 2007 10:37:01 -0400
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek
	properly
In-Reply-To: <acd65fa20706251114u60bae701ve95a84ffee27e0b2@mail.gmail.com>
References: <acd65fa20706230853w32f8895g91b7715c456900b7@mail.gmail.com>
	<ca471dc20706231052x561e7acfpf84373ea670c2974@mail.gmail.com>
	<acd65fa20706231124q4e5d5192kdc5694d52175e660@mail.gmail.com>
	<ca471dc20706231148p7cbb9953tb31099dfe68c9a32@mail.gmail.com>
	<acd65fa20706251114u60bae701ve95a84ffee27e0b2@mail.gmail.com>
Message-ID: <acd65fa20706280737n54b8dea8l5362b8545c990236@mail.gmail.com>

Can someone other than Guido review my patch? He is on vacation
right now, so he probably won't have the time to review and submit it
until August.

Thanks,
-- Alexandre

On 6/25/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> On 6/23/07, Guido van Rossum <guido at python.org> wrote:
> > On 6/23/07, Alexandre Vassalotti <alexandre at peadrop.com> wrote:
> > > I agree with this. I will try to write a patch to fix io.BytesIO.
> >
> > Great!
>
> I got the patch (it's attached to this email). The fix was simpler
> than I thought.
>
> I would like to write a unittest for it, but I am not sure where it
> should go in test_io.py. From what I see, MemorySeekTestMixin is for
> testing read/seek operation common to BytesIO and StringIO, so I can't
> put it there. And I don't really like the idea of adding another test
> in IOTest.test_raw_bytes_io.
>
> By the way, I am having the same problem for the tests of _string_io
> and _bytes_io -- i.e., I don't know exactly how to organize them with
> the rest of the tests in test_io.py.
>
> > > Free the resources held by the object, and make all methods of the
> > > object raise a ValueError if they are used.
> >
> > I'm not sure what the use case for that is (even though the 2.x
> > StringIO does this).
> >
>
> It seems the close method on TextIOWrapper objects is broken too (or at
> least, bizarre):
>
>     >>> f = open('test', 'w')
>     >>> f.write('hello')
>     5
>     >>> f.close()
>     >>> f.write('hello')
>     5
>     >>> ^D
>     $ hd test
>     00000000  68 65 6c 6c 6f                                    |hello|
>     00000005
>
>
> -- Alexandre
>
>


-- 
Alexandre Vassalotti

From tav at espians.com  Thu Jun 28 17:59:31 2007
From: tav at espians.com (tav)
Date: Thu, 28 Jun 2007 16:59:31 +0100
Subject: [Python-3000] pimp; restructuring the standard library
Message-ID: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>

rehi all,

First of all, I'd like to say "fucking great work!". Whilst initially
skeptical about Python 3.0, I really love all the decisions that have
been made so far. Python 3.0 is looking like it's going to be a great
language! Thank you ever so much to all those who've put in their time
and effort.

Now, one of the killer features of Python has always been its
batteries-included standard library. However, until recently, this has
been somewhat neglected. With 3.0, we have a chance to rectify this and
bring it up-to-date with the modern era.

I don't think what PEP 3001 currently suggests goes far enough in this
regard. It seems to be treating the change as a usual python 2.x ->
2.x+1 change.

I'd like to suggest a complete overhaul of the standard library, and
along with it, perhaps some changes to the import mechanism.

* Structured hierarchy (this seems to be something that already has support).

* Abandoning of unit tests and replacing with full doctest coverage in
the style perfected by Jim Fulton and PJE. Integration with py.test.

* Ability to import from remote networked sources, e.g. import
http://foo.com/blah/

* Authentication of sources with configurable crypto.

* Full integration with setuptools + eggs.

* Pluggable integration support for version control systems like svn/bzr.

* Builtin versioning support for modules.

* Live-update of modules/code support (in the vein of Erlang).

* Rewrite of standard library to be more adaptable, concurrent, and
pertaining to object capability. This way, we can have a secure,
composable and parallelisable standard library!

* Integration of "best-of" libraries out there. (Obviously subjective...)

* Grouped imports/exports, e.g. from module import :api, instead of
the current all-or-nothing from module import * (see the sketch below)
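
[A hypothetical approximation of the proposed grouped imports in
current Python; the :api syntax itself does not exist, and
__export_groups__ and import_group are invented names:]

    # in module.py: declare named export groups
    __export_groups__ = {
        'api': ['connect', 'Session'],
    }

    # in the importing code: pull one group into a namespace
    def import_group(module_name, group, namespace):
        module = __import__(module_name)
        for name in module.__export_groups__.get(group, []):
            namespace[name] = getattr(module, name)

    # usage: import_group('module', 'api', globals())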

Now, this might seem a bit much but if done well, I think it can
provide Python a huge leap over other languages...

I have already worked on this for my own projects by implementing an
import replacement called ``pimp`` in python 2.x. See:

  https://svn.espnow.net/24weeks/trunk/source/python/importer/pimp/pimp.py

And, have been working on structuring code for my own uses under:

  https://svn.espnow.net/24weeks/trunk/source/python/

Hope this all makes some kind of sense... your thoughts will be much
appreciated. Thanks!

-- 
love, tav
founder and ceo, esp metanational llp

plex:espians/tav | tav at espians.com | +44 (0) 7809 569 369

From pje at telecommunity.com  Thu Jun 28 18:41:30 2007
From: pje at telecommunity.com (Phillip J. Eby)
Date: Thu, 28 Jun 2007 12:41:30 -0400
Subject: [Python-3000] pimp; restructuring the standard library
In-Reply-To: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
Message-ID: <20070628163922.A40063A40AF@sparrow.telecommunity.com>

At 04:59 PM 6/28/2007 +0100, tav wrote:
>* Abandoning of unit tests and replacing with full doctest coverage in
>the style perfected by Jim Fulton and PJE. Integration with py.test.

I believe that the origination credit for this rightly falls to Tim 
Peters.  (I just copied Jim, myself.)  Meanwhile, there are quite a 
few stdlib doctests now, and unittests still more than have their place.

Indeed, I'm also wary of breaking backward compatibility of unittest 
or doctest in Python 3.0, because that will make it even harder to 
port code over.  How will 2.x users run their existing test suites to 
verify their code has been ported correctly, if they can't keep using 
unittest?  As it is, they'll have to run them through 2to3, which 
AFAIK doesn't do doctests currently.


>* Ability to import from remote networked sources, e.g. import
>http://foo.com/blah/

A strong -1 on any import system that breaks down the current strict 
separation between module names and module *locations*.  Too many 
people confuse these concepts already, and we already have a nicely 
field-tested mechanism for specifying locations and turning them into 
importer objects.
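
[For reference, the PEP 302 mechanism referred to here: a callable on
sys.path_hooks turns path entries into importer objects, keeping module
*names* separate from module *locations*. HTTPImporter below is a stub
invented for illustration:]

    import sys

    class HTTPImporter(object):
        def __init__(self, url):
            self.url = url
        def find_module(self, fullname, path=None):
            # a real importer would look for fullname at self.url
            return None

    def http_hook(path_entry):
        if not path_entry.startswith('http://'):
            raise ImportError   # decline; let other hooks try
        return HTTPImporter(path_entry)

    sys.path_hooks.append(http_hook)
    sys.path.append('http://example.com/packages/')  # a location, not a name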


>* Authentication of sources with configurable crypto.
>
>* Full integration with setuptools + eggs.
>
>* Pluggable integration support for version control systems like svn/bzr.
>
>* Builtin versioning support for modules.
>
>* Live-update of modules/code support (in the vein of Erlang).
>
>* Rewrite of standard library to be more adaptable, concurrent, and
>pertaining to object capability. This way, we can have a secure,
>composable and parallelisable standard library!

Um, and who are you volunteering to do all this work?  i.e., "you and 
what army?"  :)


>Hope this all makes some kind of sense... your thoughts will be much
>appreciated. Thanks!

My thought is that you've just proposed several major PEPs that are 
already too late for Python 3.0 and would probably have been rejected 
or deferred anyway.

I also think it's more likely that your ideas would find more 
interest/support with the PyPy project than with mainline Python, as 
some of them at least vaguely resemble some things they are working 
on, and/or would be easier to implement using PyPy object spaces.


From tav at espians.com  Thu Jun 28 19:03:31 2007
From: tav at espians.com (tav)
Date: Thu, 28 Jun 2007 18:03:31 +0100
Subject: [Python-3000] pimp; restructuring the standard library
In-Reply-To: <20070628163922.A40063A40AF@sparrow.telecommunity.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
Message-ID: <95d8c0810706281003l362e5a7ne2108132cb1a0065@mail.gmail.com>

> Indeed, I'm also wary of breaking backward compatibility of unittest
> or doctest in Python 3.0, because that will make it even harder to
> port code over.  How will 2.x users run their existing test suites to
> verify their code has been ported correctly, if they can't keep using
> unittest?  As it is, they'll have to run them through 2to3, which
> AFAIK doesn't do doctests currently.

Ah, wasn't suggesting dumping the unittest module. Just that tests in
the "standard library" should be doctest-based as these are much nicer
and more useful!

> A strong -1 on any import system that breaks down the current strict
> separation between module names and module *locations*.  Too many
> people confuse these concepts already, and we already have a nicely
> field-tested mechanism for specifying locations and turning them into
> importer objects.

I agree with your -1.

Let me rephrase that as being able to use any character in a str for
import, as opposed to the current limited set in 2.x. I understand that
Python identifiers are much broader in 3.0; how does that impact import?

> >* Authentication of sources with configurable crypto.
> >
> >* Full integration with setuptools + eggs.
> >
> >* Pluggable integration support for version control systems like svn/bzr.
> >
> >* Builtin versioning support for modules.
> >
> >* Live-update of modules/code support (in the vein of Erlang).
> >
> >* Rewrite of standard library to be more adaptable, concurrent, and
> >pertaining to object capability. This way, we can have a secure,
> >composable and parallelisable standard library!
>
> Um, and who are you volunteering to do all this work?  i.e., "you and
> what army?"  :)

Well, all that code being added to PyPI and the ASPN Python cookbook
ain't being done by imagination alone... ;p

Seriously, with:

a). clear 'lead by example' initial set of how modules should work
(with the above mentioned feature sets)

b). a provisional incentive model (say, via a gift economy model for
all those who have contributed to the standard library)

c). a simple hook in the importer which keeps track of which
modules/code are used, and which is used to remunerate the army (if
anyone ever contributes financially to it) ;p

> My thought is that you've just proposed several major PEPs that are
> already too late for Python 3.0 and would probably have been rejected
> or deferred anyway.

I understood that issues relating to the standard library were still
not fixed-in-stone for python 3.0?

> I also think it's more likely that your ideas would find more
> interest/support with the PyPy project than with mainline Python, as
> some of them at least vaguely resemble some things they are working
> on, and/or would be easier to implement using PyPy object spaces.

This is definitely true. But I want to see all this in Python 3.0...

And I don't see any of the changes proposed requiring any changes to
the language... besides the __subclasses__ and func_closure thing we
discussed on python-dev.

-- 
love, tav
founder and ceo, esp metanational llp

plex:espians/tav | tav at espians.com | +44 (0) 7809 569 369

From brett at python.org  Thu Jun 28 19:29:03 2007
From: brett at python.org (Brett Cannon)
Date: Thu, 28 Jun 2007 10:29:03 -0700
Subject: [Python-3000] pimp; restructuring the standard library
In-Reply-To: <95d8c0810706281003l362e5a7ne2108132cb1a0065@mail.gmail.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
	<95d8c0810706281003l362e5a7ne2108132cb1a0065@mail.gmail.com>
Message-ID: <bbaeab100706281029g201113a1oce97c3eca1df4d2b@mail.gmail.com>

On 6/28/07, tav <tav at espians.com> wrote:
> > Indeed, I'm also wary of breaking backward compatibility of unittest
> > or doctest in Python 3.0, because that will make it even harder to
> > port code over.  How will 2.x users run their existing test suites to
> > verify their code has been ported correctly, if they can't keep using
> > unittest?  As it is, they'll have to run them through 2to3, which
> > AFAIK doesn't do doctests currently.
>
> Ah, wasn't suggesting dumping the unittest module. Just that tests in
> the "standard library" should be doctest-based as these are much nicer
> and more useful!

But that is your opinion.  I personally prefer unittest and find them
just as useful.

If you want to get more doctest usage then you can convert some tests
over from the old stdout-comparison style to doctest.

>
> > A strong -1 on any import system that breaks down the current strict
> > separation between module names and module *locations*.  Too many
> > people confuse these concepts already, and we already have a nicely
> > field-tested mechanism for specifying locations and turning them into
> > importer objects.
>
> I agree with your -1.
>
> Let me rephrase that as being able to use any character in a str for
> import, as opposed to the current limited set in 2.x. I understand that
> Python identifiers are much broader in 3.0; how does that impact import?
>
>

You would need to change the grammar to get this to work.  And at that
point I would say you are better off developing a function to handle
the import than tweaking the grammar.
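
[A sketch of the function-based approach suggested here; the helper
name is invented:]

    def import_by_name(name):
        # __import__ accepts an arbitrary string; for dotted names it
        # returns the top-level package, so walk down to the submodule.
        module = __import__(name)
        for part in name.split('.')[1:]:
            module = getattr(module, part)
        return module

    # usage: mod = import_by_name('package.submodule')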

> > >* Authentication of sources with configurable crypto.
> > >
> > >* Full integration with setuptools + eggs.
> > >
> > >* Pluggable integration support for version control systems like svn/bzr.
> > >
> > >* Builtin versioning support for modules.
> > >
> > >* Live-update of modules/code support (in the vein of Erlang).
> > >
> > >* Rewrite of standard library to be more adaptable, concurrent, and
> > >pertaining to object capability. This way, we can have a secure,
> > >composable and parallelisable standard library!
> >
> > Um, and who are you volunteering to do all this work?  i.e., "you and
> > what army?"  :)
>
> Well, all that code being added to PyPI and the ASPN Python cookbook
> ain't being done by imagination alone... ;p
>
> Seriously, with:
>
> a). clear 'lead by example' initial set of how modules should work
> (with the above mentioned feature sets)

What is that supposed to mean?  Modules work how they work.  If you
are after specific style guidelines in terms of structure, you are not
going to get one, since each module has its own needs.  And taking
volunteer code is already hard enough; forcing a specific structure
just makes getting help that much harder.

>
> b). a provisional incentive model (say, via a gift economy model for
> all those who have contributed to the standard library)
>

Are we going to give gifts to everyone who has already contributed,
with interest?  And where is this money coming from?

I just see people tossing stuff at us just to get the money, and I
don't want that.

> c). a simple hook in the importer which keeps track of which
> modules/code are used, and which is used to remunerate the army (if
> anyone ever contributes financially to it) ;p
>

So you want every execution of a Python program to know who
contributed code to its execution?  It's called Misc/ACKS.

> > My thought is that you've just proposed several major PEPs that are
> > already too late for Python 3.0 and would probably have been rejected
> > or deferred anyway.
>
> I understood that issues relating to the standard library were still
> not fixed-in-stone for python 3.0?
>

Right, but that is mostly the reorganization, renaming, etc.  It does
not include changes to how import works, etc. that would require a
PEP.

> > I also think it's more likely that your ideas would find more
> > interest/support with the PyPy project than with mainline Python, as
> > some of them at least vaguely resemble some things they are working
> > on, and/or would be easier to implement using PyPy object spaces.
>
> This is definitely true. But I want to see all this in Python 3.0...

=)  Well, everyone wants to see everything they want in the next
version of Python.  Even core developers don't always get what they
want in a release.

-Brett

From theller at ctypes.org  Thu Jun 28 19:11:00 2007
From: theller at ctypes.org (Thomas Heller)
Date: Thu, 28 Jun 2007 19:11:00 +0200
Subject: [Python-3000] Py3k doesn't understand octal literals (on Windows)
Message-ID: <f60q34$ttf$1@sea.gmane.org>

In a break from real work, I wanted to play a little with 3.0.  Did svn update, and built on Windows.
Unfortunately, the resulting Python does not understand the new octal literals like 0o777, so
importing 'os' fails:

'import site' failed; use -v for traceback
Python 3.0x (p3yk:55071M, May  2 2007, 13:50:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\svn\p3yk\lib\os.py", line 150
    def makedirs(name, mode=0o777):
                                ^
SyntaxError: invalid syntax
>>>

Any hints on what I have to do to make this work?

Thanks,
Thomas
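
[For reference, the PEP 3127 octal literals at issue here:]

    mode = 0o777      # new-style octal literal, == 511 decimal
    # mode = 0777     # old 2.x form; a SyntaxError in 3.0
    print(oct(mode))  # prints '0o777'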


From theller at ctypes.org  Thu Jun 28 20:33:57 2007
From: theller at ctypes.org (Thomas Heller)
Date: Thu, 28 Jun 2007 20:33:57 +0200
Subject: [Python-3000] Py3k doesn't understand octal literals (on
	Windows)
In-Reply-To: <f60q34$ttf$1@sea.gmane.org>
References: <f60q34$ttf$1@sea.gmane.org>
Message-ID: <f60uum$opn$1@sea.gmane.org>

Thomas Heller schrieb:
> In a break from real work, I wanted to play a little with 3.0.  Did svn update, and built on Windows.
> Unfortunately, the resulting Python does not understand the new octal literals like 0o777, so
> importing 'os' fails:
> 
> 'import site' failed; use -v for traceback
> Python 3.0x (p3yk:55071M, May  2 2007, 13:50:45) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import os
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "c:\svn\p3yk\lib\os.py", line 150
>     def makedirs(name, mode=0o777):
>                                 ^
> SyntaxError: invalid syntax
>>>>
> 
> Any hints on what I have to do to make this work?
> 
> Thanks,
> Thomas
> 

Sorry for the false alarm, it was all my fault.

Thomas


From chrism at plope.com  Thu Jun 28 22:04:20 2007
From: chrism at plope.com (Chris McDonough)
Date: Thu, 28 Jun 2007 16:04:20 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
	restructuring the standard library)
In-Reply-To: <20070628163922.A40063A40AF@sparrow.telecommunity.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
Message-ID: <A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>

On Jun 28, 2007, at 12:41 PM, Phillip J. Eby wrote:

> At 04:59 PM 6/28/2007 +0100, tav wrote:
>> * Abandoning of unit tests and replacing with full doctest  
>> coverage in
>> the style perfected by Jim Fulton and PJE. Integration with py.test.
>
> I believe that the origination credit for this rightly falls to Tim
> Peters.  (I just copied Jim, myself.)  Meanwhile, there are quite a
> few stdlib doctests now, and unittests still more than have their  
> place.
>
> Indeed, I'm also wary of breaking backward compatibility of unittest
> or doctest in Python 3.0, because that will make it even harder to
> port code over.  How will 2.x users run their existing test suites to
> verify their code has been ported correctly, if they can't keep using
> unittest?  As it is, they'll have to run them through 2to3, which
> AFAIK doesn't do doctests currently.

I've historically not been a huge fan of doctests because (these  
things may have changed since last I used doctest in anger):

a) If one of your fixture calls or an assertion fails for some reason,
the rest of the test trips over itself trying to complete, usually
without success because an invariant hasn't been met, and you need to
scroll through a bunch of decoy output to see where the actual problem
began.

b) I often use test bodies as convenient points to put a pdb.set_trace
call if I want to debug something.  This wasn't very well supported
when I was trying to use doctest.

As a result, I still use unittest pretty much exclusively to write
tests.  I'd be sad if it went away.

- C


From fdrake at acm.org  Thu Jun 28 22:20:31 2007
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 28 Jun 2007 16:20:31 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
	restructuring the standard library)
In-Reply-To: <A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
	<A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
Message-ID: <200706281620.32051.fdrake@acm.org>

On Thursday 28 June 2007, Chris McDonough wrote:
 > a) If one of your fixture calls or an assertion fails for some
 > reason, the rest of the test trips over itself trying to complete,
 > usually without success because an invariant hasn't been met, and you
 > need to scroll through a bunch of decoy output to see where the
 > actual problem began.

The testrunner in zope.testing handles this by providing an option to hide the 
secondary failures, so only one traceback shows up per document.

 > b) I often use test bodies as convenient points to put a
 > pdb.set_trace call if I want to debug something.  This wasn't very
 > well supported when I was trying to use doctest.

The doctest in zope.testing supports this; hopefully someone sufficiently 
in-the-know can unfork that version.

 > As a result, I still use unittest pretty much exclusively to write
 > tests.  I'd be sad if it went away.

Yes; there's definitely a place for unittest, or something very like it.


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>

From pje at telecommunity.com  Thu Jun 28 22:57:04 2007
From: pje at telecommunity.com (Phillip J. Eby)
Date: Thu, 28 Jun 2007 16:57:04 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
 restructuring the standard library)
In-Reply-To: <A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
	<A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
Message-ID: <20070628205456.A94EA3A40A8@sparrow.telecommunity.com>

At 04:04 PM 6/28/2007 -0400, Chris McDonough wrote:
>a) If one of your fixture calls or an assertion fails for some
>reason, the rest of the test
>     trips over itself trying to complete, usually without success
>because an invariant
>     hasn't been met, and you need to scroll through a bunch of decoy
>output to
>     see where the actual problem began.

Use the REPORT_ONLY_FIRST_FAILURE option:

http://python.org/doc/2.4.1/lib/doctest-options.html


>b) I often use test bodies as convenient points to put a
>pdb.set_trace call if I want to
>     debug something.  This wasn't very well supported when I was
>trying to use doctest.

I believe this was fixed in 2.4.  And I *know* it's fixed in 2.5.  :)
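
[A minimal sketch of the option in use; frobnicate is an invented
example function:]

    import doctest

    def frobnicate(x):
        """
        >>> frobnicate(2)
        4
        """
        return x * 2

    if __name__ == '__main__':
        # report only the first failing example, hiding the decoy output
        doctest.testmod(optionflags=doctest.REPORT_ONLY_FIRST_FAILURE)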


From greg.ewing at canterbury.ac.nz  Fri Jun 29 01:26:13 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 29 Jun 2007 11:26:13 +1200
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <4683BE09.4010702@gmail.com>
References: <20070624132756.7998.JCARLSON@uci.edu>
	<rowen-C34545.12555325062007@sea.gmane.org>
	<20070625193345.79AA.JCARLSON@uci.edu>
	<rowen-8EA8EC.17231326062007@sea.gmane.org>
	<468302C0.3050808@canterbury.ac.nz> <4683BE09.4010702@gmail.com>
Message-ID: <46844395.4000802@canterbury.ac.nz>

Nick Coghlan wrote:
> There is no function call per loop even when using a
> generator expression - a generator function is implicitly defined, and
> then called once to instantiate the generator.

You're right -- I must have been half-thinking of
map() at the time. Resuming the generator ought to
be faster than a function call. But still a bit
slower than in-line code, perhaps.

> I believe the point of adding set 
> literal and comprehension syntax is to make it possible to easily speed 
> up membership tests for items in known groups

Yes, but set(generator) would do that just as well
as {generator} if it weren't any slower.

So the reasons for keeping the comprehension notations
are (a) slightly more convenient syntax and (b) maybe
a bit faster.
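
[The two spellings being compared -- both build the same set, the
comprehension just skips the name lookup and call overhead of set():]

    squares = {n * n for n in range(10)}      # set comprehension
    squares2 = set(n * n for n in range(10))  # set() around a genexp
    assert squares == squares2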

--
Greg

From barry at python.org  Fri Jun 29 05:46:05 2007
From: barry at python.org (Barry Warsaw)
Date: Thu, 28 Jun 2007 23:46:05 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
	restructuring the standard library)
In-Reply-To: <A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
	<A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
Message-ID: <C79C940D-D087-48AB-8212-2F0A67230819@python.org>

On Jun 28, 2007, at 4:04 PM, Chris McDonough wrote:

> I've historically not been a huge fan of doctests because (these
> things may have changed since last I used doctest in anger):

I used to think the same thing, but I've gotten the doctest  
religion.  I'm using them almost exclusively in the new Mailman code,  
and we use them at work (though both still have traditional Python  
unit tests).

The thing that convinced me was the realization (assisted by my
colleagues) that doctests are first and foremost documentation.  They
are testable documentation, sure, but the unit tests are secondary.
There's no question that for things like system documentation, the
narrative that weaves the testable bits together in a well-written
doctest is much more valuable than the tests.  Most unittest-based
tests have few or no comments, and nothing approaching the narrative
in a good doctest, so it's clear that unittests are tests first and
probably not documentation at all.

I've even experimented with writing a PEP for my enum package (as yet  
unsubmitted) that is nothing more than a doctest.  It seemed almost  
completely natural.

A good test suite can benefit from both doctests and unittests and I  
don't think unittest will ever go away (nor should it), but in my  
latest work I'm opting more and more for doctests.  That Tim Peters  
is a smart guy after all I guess. :)

-Barry


From chrism at plope.com  Fri Jun 29 07:40:39 2007
From: chrism at plope.com (Chris McDonough)
Date: Fri, 29 Jun 2007 01:40:39 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
	restructuring the standard library)
In-Reply-To: <C79C940D-D087-48AB-8212-2F0A67230819@python.org>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
	<A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
	<C79C940D-D087-48AB-8212-2F0A67230819@python.org>
Message-ID: <A5434288-0696-43BB-8B9D-30A0B3B13550@plope.com>

On Jun 28, 2007, at 11:46 PM, Barry Warsaw wrote:

> The thing that convinced me was the realization (assisted by my  
> colleagues) that doctests are first and foremost documentation.   
> They are testable documentation sure, but the unit tests are  
> secondary.  There's no question that for things like system  
> documentation, the narrative that weaves the testable bits together  
> in a well written doctest are much more valuable than the tests.

I suspect it would be even more valuable as documentation if it  
didn't give good coverage.


> Most unittest based tests have little or no comments, and nothing  
> approaching the narrative in a good doctest, so it's clear that  
> unittests are tests first and probably not documentation at all.

This probably isn't the place for this discussion but I'll give an  
explanation about why I think that's actually a good thing.

I find that I only get good test coverage when I have more test code
for a component than implementation code in the component I'm trying
to test.  At least that's been my experience.  I haven't been able to
make the tests more succinct while still testing things adequately.

When coverage gets good, "documentation-ness" of tests suffers.  You
can get good test coverage with any sort of tests.  But once you get
good test coverage, whatever framework you've chosen to write them in,
the tests are no longer very good as narrative documentation because
they're littered with bits of fixture code, edge case assertions, etc.

I don't mind doctest at all really (I just use unittest out of
inertia and personal preference; I'd probably be just as happy with
nose or whatever).  I just don't like it when folks advertise the same
doctest as both a comprehensive set of tests and a component's only
source of documentation, because I don't think it's possible for it
to be both at the same time with any sort of quality in both
directions simultaneously.

That said, having testable documentation is a good thing!  I'd just  
prefer that that documentation did not include lots of fixture noise.

> A good test suite can benefit from both doctests and unittests and  
> I don't think unittest will ever go away (nor should it), but in my  
> latest work I'm opting more and more for doctests.  That Tim Peters  
> is a smart guy after all I guess. :)

I miss "uncle Timmy". :-(

- C


From mhammond at skippinet.com.au  Fri Jun 29 07:27:53 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 29 Jun 2007 15:27:53 +1000
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
	restructuring the standard library)
In-Reply-To: <C79C940D-D087-48AB-8212-2F0A67230819@python.org>
Message-ID: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>

Barry writes:

> On Jun 28, 2007, at 4:04 PM, Chris McDonough wrote:
>
> > I've historically not been a huge fan of doctests because (these
> > things may have changed since last I used doctest in anger):
>
> I used to think the same thing, but I've gotten the doctest
> religion.  I'm using them almost exclusively in the new
> Mailman code,
> and we use them at work (though both still have traditional Python
> unit tests).
>
> The thing that convinced me was the realization (assisted by my
> colleagues) that doctests are first and foremost
> documentation.  They
> are testable documentation sure, but the unit tests are secondary.

I admit I have yet to get the doctest religion to that degree - but for
exactly the same reasons :)

My problem is that too quickly, doctests go way beyond documentation - they
turn into a full-blown test framework, and this tends to work against the
clarity of the resulting documentation.

I like doctests that give you a general introduction to what is being
tested.  They can operate as a kind of 'tutorial', allowing someone with no
experience in the code to quickly see the basics of how it is used - that is
very useful indeed.

But IMO, these too quickly morph into the territory of unittests - they
start testing all corner cases.  The simple tutorial quality gets lost as
the doctests start including reams of test data and testing against
invariants that are important to the developer of the library, but mere
noise to a casual user of it.

Another key feature of unittests is their utility in helping you *find* bugs
in the first place.  When a bug is identified "in the field", unit tests
make it easy to find a "smallest possible" reproduction of a bug in order to
identify the root cause - which is then checked in when the bug is fixed.
If only doctests are available, then either that obscure bug is also added
to the doctests (making even more noise), or a test case is extracted to a
temporary program and discarded once the bug is fixed.

> A good test suite can benefit from both doctests and unittests and I
> don't think unittest will ever go away (nor should it), but in my
> latest work I'm opting more and more for doctests.

I find myself opting for doctests when working with "new" code, but quickly
leaving the doctests in their pristine state and moving to unittests once
the bugs get a bit curlier, or coverage.py directs me to write tests I'd
never dreamt of, etc...

> That Tim Peters is a smart guy after all I guess. :)

Indeed he is - which is exactly why I use them as I described - that is my
interpretation of what he intended <wink>

Mark


From ncoghlan at gmail.com  Fri Jun 29 09:41:57 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 29 Jun 2007 17:41:57 +1000
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <46844395.4000802@canterbury.ac.nz>
References: <20070624132756.7998.JCARLSON@uci.edu>	<rowen-C34545.12555325062007@sea.gmane.org>	<20070625193345.79AA.JCARLSON@uci.edu>	<rowen-8EA8EC.17231326062007@sea.gmane.org>	<468302C0.3050808@canterbury.ac.nz>
	<4683BE09.4010702@gmail.com> <46844395.4000802@canterbury.ac.nz>
Message-ID: <4684B7C5.8020307@gmail.com>

Greg Ewing wrote:
> So the reasons for keeping the comprehension notations
> are (a) slightly more convenient syntax and (b) maybe
> a bit faster.

Yes, I was actually agreeing with you on that point (I just got 
sidetracked on a couple of technical quibbles, so my agreement may not 
have been clear...)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From guido at python.org  Fri Jun 29 16:49:05 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 29 Jun 2007 07:49:05 -0700
Subject: [Python-3000] doctests vs. unittests (was Re: pimp;
	restructuring the standard library)
In-Reply-To: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
References: <C79C940D-D087-48AB-8212-2F0A67230819@python.org>
	<003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
Message-ID: <ca471dc20706290749o407a9d39td72c6822d3383885@mail.gmail.com>

If I have any say in it, unittest isn't going away (unless replaced by
something very similar, and doctest ain't it). Religion is all fine
and well, as long as there's room for other views. I personally find
using unit tests a lot easier than using doctest, for many of the
things I tend to do (and most of my co-workers at Google see it that
way, too).

That said, I hope that the doctest community will contribute a  better
way for the 2to3 tool to find and fix doctests; the -d option is too
cumbersome to use.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From pje at telecommunity.com  Fri Jun 29 17:37:51 2007
From: pje at telecommunity.com (Phillip J. Eby)
Date: Fri, 29 Jun 2007 11:37:51 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
 restructuring the standard library)
In-Reply-To: <A5434288-0696-43BB-8B9D-30A0B3B13550@plope.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
	<A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
	<C79C940D-D087-48AB-8212-2F0A67230819@python.org>
	<A5434288-0696-43BB-8B9D-30A0B3B13550@plope.com>
Message-ID: <20070629153542.467A03A40BF@sparrow.telecommunity.com>

At 01:40 AM 6/29/2007 -0400, Chris McDonough wrote:
>When coverage gets good, "documentation-ness" of tests suffers.

The question is more one of, "documentation for whom?".  You can 
write separate documents for library users than for library 
extenders/developers.  I don't put doctests in docstrings, but if I 
did, I'd probably only put user doctests there.  As it is, I normally 
split my doctests into multiple files for different audiences, or 
under different headings in one large file.

For example, if you look at the BytecodeAssembler documentation:

    http://peak.telecommunity.com/DevCenter/BytecodeAssembler

You'll see that the assertion and invariant testing is mostly 
relegated to a separate section.

Another library I'm working on has two doctest files for users (a 
quick intro and a developer guide/reference) and a separate file that 
tests all the innards.  So, there are a lot of ways to use doctests 
effectively, at least if you're doing them in text files, rather than 
in your docstrings.  I've actually never put a doctest in a 
docstring; it always seems like overkill to me.  (Especially since 
reST doctests can be made into nice HTML pages like the above!)


From barry at python.org  Fri Jun 29 18:12:28 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 29 Jun 2007 12:12:28 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
	restructuring the standard library)
In-Reply-To: <20070629153542.467A03A40BF@sparrow.telecommunity.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<20070628163922.A40063A40AF@sparrow.telecommunity.com>
	<A7C378C1-77AA-4374-9416-F07C5AA5FBCE@plope.com>
	<C79C940D-D087-48AB-8212-2F0A67230819@python.org>
	<A5434288-0696-43BB-8B9D-30A0B3B13550@plope.com>
	<20070629153542.467A03A40BF@sparrow.telecommunity.com>
Message-ID: <41E11C29-D7BF-45C5-8E0F-FEE3FB1CE150@python.org>

Since this has stopped being on-topic for this mailing list, just one
last follow-up from me.

On Jun 29, 2007, at 11:37 AM, Phillip J. Eby wrote:

> The question is more one of, "documentation for whom?".  You can  
> write separate documents for library users than for library  
> extenders/developers.  I don't put doctests in docstrings, but if I  
> did, I'd probably only put user doctests there.  As it is, I  
> normally split my doctests into multiple files for different  
> audiences, or under different headings in one large file.
>
> Another library I'm working on has two doctest files for users (a  
> quick intro and a developer guide/reference) and a separate file  
> that tests all the innards.  So, there are a lot of ways to use  
> doctests effectively, at least if you're doing them in text files,  
> rather than in your docstrings.  I've actually never put a doctest  
> in a docstring; it always seems like overkill to me.  (Especially  
> since reST doctests can be made into nice HTML pages like the above!)

I concur with Phillip about two important points.  First, I also  
never put doctests in docstrings.  I find them unreadable, difficult  
to edit, and not conducive to a cohesive narrative.  I always put my  
doctests in a separate file, usually under a 'docs' directory  
actually.  Maybe this will make a difference for people considering  
doctests as a complement to traditional unittests.

Second, I agree that you can achieve a high degree of coverage with  
doctests if you stop to answer Phillip's question: "documentation for  
whom?"  I typically put documentation for users in a separate file  
than documentation for extenders/developers, but that's just personal  
taste.  The important insight is that explaining how to use a library  
often covers most if not all the corner cases of its use.

Lastly, an observation: I've found that using doctests has had the
surprising consequence of making test-driven development enjoyable.
I've found myself starting a new task by writing the documentation
first, which is a great way to design how you want the code to work.
Because the documentation is testable, you're left with a simple
matter of coding <wink> until the doctest passes.

Cheers,
-Barry


From srichter at cosmos.phy.tufts.edu  Fri Jun 29 18:39:34 2007
From: srichter at cosmos.phy.tufts.edu (Stephan Richter)
Date: Fri, 29 Jun 2007 12:39:34 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
	restructuring the standard library)
In-Reply-To: <A5434288-0696-43BB-8B9D-30A0B3B13550@plope.com>
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
	<C79C940D-D087-48AB-8212-2F0A67230819@python.org>
	<A5434288-0696-43BB-8B9D-30A0B3B13550@plope.com>
Message-ID: <200706291239.34485.srichter@cosmos.phy.tufts.edu>

On Friday 29 June 2007 01:40, Chris McDonough wrote:
> I don't mind doctest at all really (I just use unittest out of
> inertia and personal preference; I'd probably be just as happy with
> nose or whatever).  I just don't like it when folks advertise the same
> doctest as both a comprehensive set of tests and a component's only
> source of documentation, because I don't think it's possible for it
> to be both at the same time with any sort of quality in both
> directions simultaneously.

I could not disagree more. My personal rule is that any released code
should be 100% coverage tested. And I never write regular unittests
anymore, except for some super-specific cases. Also, people compliment
me on good documentation all the time. Have a look at
http://svn.zope.org/z3c.form/trunk/src/z3c/form/. The documentation is
example driven, yet still covers all of the API.

Having said that, writing comprehensive doctests that do not read like a CS 
thesis is very hard. It took me the last 5 years developing Zope 3 to learn 
how to do that right. 

BTW, I do agree with what Phillip and Barry wrote. I always consider it a 
challenge to see how many lines of testable documentation I can write before 
writing one line of code -- I max out at about 2k right now.

Regards,
Stephan
-- 
Stephan Richter
CBU Physics & Chemistry (B.S.) / Tufts Physics (Ph.D. student)
Web2k - Web Software Design, Development and Training

From santagada at gmail.com  Fri Jun 29 19:03:19 2007
From: santagada at gmail.com (Leonardo Santagada)
Date: Fri, 29 Jun 2007 14:03:19 -0300
Subject: [Python-3000] doctests vs. unittests (was Re: pimp;
	restructuring the standard library)
In-Reply-To: <ca471dc20706290749o407a9d39td72c6822d3383885@mail.gmail.com>
References: <C79C940D-D087-48AB-8212-2F0A67230819@python.org>
	<003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
	<ca471dc20706290749o407a9d39td72c6822d3383885@mail.gmail.com>
Message-ID: <3915BC8B-CC39-4C02-9A96-4225A7678062@gmail.com>


On 29/06/2007, at 11:49, Guido van Rossum wrote:

> If I have any say in it, unittest isn't going away (unless replaced by
> something very similar, and doctest ain't it). Religion is all fine
> and well, as long as there's room for other views. I personally find
> using unit tests a lot easier than using doctest, for many of the
> things I tend to do (and most of my co-workers at Google see it that
> way, too).

py.test is similar enough to replace unittest?

--
Leonardo Santagada



From guido at python.org  Fri Jun 29 19:09:19 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 29 Jun 2007 10:09:19 -0700
Subject: [Python-3000] doctests vs. unittests (was Re: pimp;
	restructuring the standard library)
In-Reply-To: <3915BC8B-CC39-4C02-9A96-4225A7678062@gmail.com>
References: <C79C940D-D087-48AB-8212-2F0A67230819@python.org>
	<003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
	<ca471dc20706290749o407a9d39td72c6822d3383885@mail.gmail.com>
	<3915BC8B-CC39-4C02-9A96-4225A7678062@gmail.com>
Message-ID: <ca471dc20706291009o53e4a74cjf07ab049ba64f01@mail.gmail.com>

On 6/29/07, Leonardo Santagada <santagada at gmail.com> wrote:
>
> On 29/06/2007, at 11:49, Guido van Rossum wrote:
>
> > If I have any say in it, unittest isn't going away (unless replaced by
> > something very similar, and doctest ain't it). Religion is all fine
> > and well, as long as there's room for other views. I personally find
> > using unit tests a lot easier than using doctest, for many of the
> > things I tend to do (and most of my co-workers at Google see it that
> > way, too).
>
> py.test is similar enough to replace unittest?

I've never looked at py.test, so I can't tell. There needs to be a
100% backwards-compatible API so existing unittests don't need to be
changed (as they are the cornerstone of any transition to Python
3000).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rrr at ronadam.com  Sat Jun 30 01:19:54 2007
From: rrr at ronadam.com (Ron Adam)
Date: Fri, 29 Jun 2007 18:19:54 -0500
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
 restructuring the standard library)
In-Reply-To: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
Message-ID: <4685939A.9090803@ronadam.com>


Mark Hammond wrote:
> Barry writes:
> 
>> On Jun 28, 2007, at 4:04 PM, Chris McDonough wrote:
>> A good test suite can benefit from both doctests and unittests and I
>> don't think unittest will ever go away (nor should it), but in my
>> latest work I'm opting more and more for doctests.
> 
> I find myself opting for doctests when working with "new" code, but quickly
> leaving the doctests in their pristine state and moving to unittests once
> the bugs get a bit curlier, or coverage.py directs me to write tests I'd
> never dreamt of, etc...

I agree with this completely.  Doctests are very useful for getting the
basics down and working while the code is being written.

After that, unittests are much better for testing edge cases and making 
sure everything works including the kitchen sink, the pipes to the sink, 
the quality of water, etc...  ;-)

If there is a problem, I don't think it is in the exact execution of
doctests or unittests, but in how they are organized relative to the
modules and how they are run.


Currently the unittest test suite runs tests that are in a known place
with a known name.  There can be modules in a distribution that are
completely untested, and you would not know unless you manually checked
for this.

I'd like to see this turned around a bit so that the test suite runner 
first scans the modules and then looks for tests for each of them.  And if 
no test for a particular module is found, give some sort of warning.

Possibly a module could have a __tests__ 'list' attribute with locations of
tests?  So an automatic test runner might start by first importing a module
and then running the test modules listed in __tests__.  And yes, even the
tests can have tests. ;-)  A "__tests__ = None" could explicitly turn that
off, while a "__tests__ = []" would indicate a module that does not yet
have tests but needs them.

This could also reduce the boilerplate needed to run unittests as a side
bonus.
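
[A sketch of the __tests__ idea; the attribute and this runner are
hypothetical -- nothing in the stdlib does this:]

    import unittest

    _MISSING = object()

    def run_declared_tests(module):
        declared = getattr(module, '__tests__', _MISSING)
        if declared is _MISSING:
            print('WARNING: %s has no __tests__ declaration' % module.__name__)
            return
        if declared is None:
            return  # tests explicitly turned off
        if not declared:
            print('WARNING: %s needs tests but has none yet' % module.__name__)
            return
        loader = unittest.TestLoader()
        suite = unittest.TestSuite(
            loader.loadTestsFromName(name) for name in declared)
        unittest.TextTestRunner().run(suite)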


There have been a few times where I started writing doctests for a
module with less than 100 lines of code, and by the time I was done
with the doctests it had become a module of 500 lines or more.  The
actual code then starts to get lost in the file.

It would be cool if the documentation files could also contain the
doctests instead of them being in the source code.  I'm sure this could
be done now, but there isn't a standard way to do it.  Currently I
create a separate test module, which unclutters the program modules,
but then it isn't clear these are meant to be documentation first.

Cheers,
    Ron

From percivall at gmail.com  Sat Jun 30 01:39:43 2007
From: percivall at gmail.com (Simon Percivall)
Date: Sat, 30 Jun 2007 01:39:43 +0200
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
	restructuring the standard library)
In-Reply-To: <4685939A.9090803@ronadam.com>
References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
	<4685939A.9090803@ronadam.com>
Message-ID: <E705D1E7-2218-4E5B-AA19-3B6EE419E84F@gmail.com>

On 30 jun 2007, at 01.19, Ron Adam wrote:
> It would be cool if the documentation files could also contain the
> doctests instead of them being in the source code.  I'm sure this
> could be done now, but there isn't a standard way to do it.  Currently
> I create a separate test module, which unclutters the program modules,
> but then it isn't clear these are meant to be documentation first.

Well, there is doctest.testfile, which should do that. It's been
in doctest since 2.4.
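
[Minimal use, with a made-up file name:]

    import doctest
    doctest.testfile('docs/usage.txt')  # the examples live in the text file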

//Simon

From benji at benjiyork.com  Sat Jun 30 04:20:01 2007
From: benji at benjiyork.com (Benji York)
Date: Fri, 29 Jun 2007 22:20:01 -0400
Subject: [Python-3000] doctests vs. unittests (was Re:  pimp;
 restructuring the standard library)
In-Reply-To: <4685939A.9090803@ronadam.com>
References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
	<4685939A.9090803@ronadam.com>
Message-ID: <4685BDD1.6010906@benjiyork.com>

Since this is off topic, I'm just going to do a drive-by and urge
people who are interested in following up to visit the TIP (testing in
Python) list at http://lists.idyll.org/listinfo/testing-in-python.

Ron Adam wrote:
> I agree with this completely.  Doctests are very useful for getting the
> basics down and working while the code is being written.
> 
> After that, unittests are much better for testing edge cases and making 
> sure everything works including the kitchen sink, the pipes to the sink, 
> the quality of water, etc...  ;-)

In the code bases I'm involved in right now, we use doctests almost 
exclusively, including for the "kitchen sink" tests.  We find that the
slight tendency toward more and better prose in doctests is especially
nice when trying to discern what exactly some obscure test code is
actually trying to verify (particularly important when the test fails).

> Currently the unittest test suite runs tests that are in a known place
> with a known name.  There can be modules in a distribution that are
> completely untested, and you would not know unless you manually checked
> for this.

Most test runners have coverage reporting options for both unit tests 
and doctests.

> There have been a few times where I started writing doctests for a
> module with less than 100 lines of code, and by the time I was done
> with the doctests it had become a module of 500 lines or more.  The
> actual code then starts to get lost in the file.
> 
> It would be cool if the documentation files could also contain the
> doctests instead of them being in the source code.

As mentioned elsewhere in this thread, this is already possible.  Having
separate files for them (one of which is usually named README.txt) is
quite a bit nicer.

If you write your "whole file" doctests in ReST, you can also render 
them to HTML as is done for the packages we put in pypi (here's a short 
example: http://cheeseshop.python.org/pypi/zc.queue/1.1, ReST source at 
http://svn.zope.org/*checkout*/zc.queue/trunk/src/zc/queue/queue.txt).
-- 
Benji York
http://benjiyork.com

From g.brandl at gmx.net  Sat Jun 30 09:33:22 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 30 Jun 2007 09:33:22 +0200
Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk
In-Reply-To: <acd65fa20706251127u621b3ca1y1a30f7ebf965356f@mail.gmail.com>
References: <acd65fa20706251018u3887c239hc1a22391f85cddd@mail.gmail.com>
	<acd65fa20706251127u621b3ca1y1a30f7ebf965356f@mail.gmail.com>
Message-ID: <f650vd$dp6$1@sea.gmane.org>

Alexandre Vassalotti schrieb:
> Meanwhile, I found another division/range combination that could be
> problematic. I attached an updated patch.

Thanks, committed.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From matt-python at theory.org  Sat Jun 30 22:54:44 2007
From: matt-python at theory.org (Matt Chisholm)
Date: Sat, 30 Jun 2007 13:54:44 -0700
Subject: [Python-3000] Announcing PEP 3136
Message-ID: <20070630205444.GD22221@theory.org>

Hi all. 

I've created and submitted a new PEP proposing support for labels in
Python's break and continue statements.  Georg Brandl has graciously
added it to the PEP list as PEP 3136:

http://www.python.org/dev/peps/pep-3136/

I understand that the deadline for submitting features for Python 3.0
has passed, so this PEP targets Python 3.1.  I also expect that people
might not want to take time off from the Python 3.0 effort to discuss
features that are even further off in the future.

Thanks for your time, and thanks for letting me contribute an idea to
Python.

-matt