From tim_one@email.msn.com  Mon May  1 07:31:05 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Mon, 1 May 2000 02:31:05 -0400
Subject: [Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306)
In-Reply-To: <NDBBKLNNJCFFMINBECLEOEBKCLAA.trentm@ActiveState.com>
Message-ID: <000001bfb336$d4f512a0$0f2d153f@tim>

[Guido]
> The email below is a serious bug report.  A quick analysis
> shows that UserString.count() calls the count() method on a string
> object, which calls PyArg_ParseTuple() with the format string "O|ii".
> The 'i' format code truncates integers.

For people unfamiliar w/ the details, let's be explicit:  the "i" code
implicitly converts a Python int (== a C long) to a C int (which has no
visible Python counterpart).  Overflow is not detected, so this is broken on
the face of it.

> It probably should raise an overflow exception instead.

Definitely.

> But that would still cause the test to fail -- just in a different
> way (more explicit).  Then the string methods should be fixed to
> use long ints instead -- and then something else would probably break...

Yup.  Seems inevitable.

[MAL]
> Since strings and Unicode objects use integers to describe the
> length of the object (as well as most if not all other
> builtin sequence types), the correct default value should
> thus be something like sys.maxlen which then gets set to
> INT_MAX.
>
> I'd suggest adding sys.maxlen and the modifying UserString.py,
> re.py and sre_parse.py accordingly.

I understand this, but hate it.  I'd rather get rid of the user-visible
distinction between the two int types already there, not add yet a third
artificial int limit.

[Guido]
> Hm, I'm not so sure.  It would be much better if passing sys.maxint
> would just WORK...  Since that's what people have been doing so far.

[Trent Mick]
> Possible solutions (I give 4 of them):
>
> 1. The 'i' format code could raise an overflow exception and the
> PyArg_ParseTuple() call in string_count() could catch it and truncate to
> INT_MAX (reasoning that any overflow of the end position of a
> string can be bound to INT_MAX because that is the limit for any string
> in Python).

There's stronger reason than that:  string_count's "start" and "end"
arguments are documented as "interpreted as in slice notation", and slice
notation with out-of-range indices is well defined in all cases:

    The semantics for a simple slicing are as follows. The primary
    must evaluate to a sequence object. The lower and upper bound
    expressions, if present, must evaluate to plain integers; defaults
    are zero and the sequence's length, respectively. If either bound
    is negative, the sequence's length is added to it. The slicing now
    selects all items with index k such that i <= k < j where i and j
    are the specified lower and upper bounds. This may be an empty
    sequence. It is not an error if i or j lie outside the range of
    valid indexes (such items don't exist so they aren't selected).

(From the Ref Man's section "Slicings")  That is, what string_count should
do is perfectly clear already (or will be, when you read that two more times
<wink>).  Note that you need to distinguish between positive and negative
overflow, though!
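For concreteness, here's a rough pure-Python sketch of that clamping rule
(the name is invented for illustration; the real fix belongs in C next to
PyArg_ParseTuple):

    def clamp_slice_index(i, length):
        # Map an arbitrary int onto a valid slice bound for a sequence
        # of the given length, per the Ref Man rules quoted above.
        if i < 0:
            i = i + length      # negative indices count from the end
            if i < 0:
                i = 0           # negative overflow clamps to the start
        elif i > length:
            i = length          # positive overflow clamps to the end
        return i

    # e.g. clamp_slice_index(sys.maxint, 5) == 5
    #      clamp_slice_index(-sys.maxint, 5) == 0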

> Pros:
> - This "would just WORK" for usage of sys.maxint.
>
> Cons:
> -  This overflow exception catching should then reasonably be
> propagated to other similar functions (like string.endswith(), etc).

Absolutely, but they *all* follow from what "sequence slicing" is *always*
supposed to do in case of out-of-bounds indices.

> - We have to assume that the exception raised in the
> PyArg_ParseTuple(args, "O|ii:count", &subobj, &i, &last) call is for
> the second integer (i.e. 'last'). This is subtle and ugly.

Luckily <wink>, it's not that simple:  exactly the same treatment needs to
be given to both the optional "start" and "end" arguments, and in every
function that accepts optional slice indices.  So you write one utility
function to deal with all that, called in case PyArg_ParseTuple raises an
overflow error.

> Pro or Con:
> - Do we want to start raising overflow exceptions for other conversion
> formats (i.e. 'b' and 'h' and 'l', the latter *can* overflow on
> Win64 where sizeof(long) < sizeof(void*))? I think this is a good idea
> in principle but may break code (even if it *does* identify bugs in that
> code).

The code this would break is already broken <0.1 wink>.

> 2. Just change the definitions of the UserString methods to pass
> a variable length argument list instead of default value parameters.
> For example change UserString.count() from:
>
>     def count(self, sub, start=0, end=sys.maxint):
>         return self.data.count(sub, start, end)
>
> to:
>
>     def count(self, *args):
>         return self.data.count(*args)
>
> The result is that the default value for 'end' is now set by
> string_count() rather than by the UserString implementation:
> ...

This doesn't solve anything -- users can (& do) pass sys.maxint explicitly.
That's part of what Guido means by "since that's what people have been doing
so far".

> ...
> Cons:
> - Does not fix the general problem of the (common?) usage of sys.maxint to
> mean INT_MAX rather than the actual LONG_MAX (this matters on 64-bit
> Unices).

Anyone using sys.maxint to mean INT_MAX is fatally confused; passing
sys.maxint as a slice index is not an instance of that confusion, it's just
relying on the documented behavior of out-of-bounds slice indices.

> 3. As MAL suggested: add something like sys.maxlen (set to INT_MAX) which
> breaks the logical difference with sys.maxint (set to LONG_MAX):
> ...

I hate this (see above).

> ...
> 4. Add something like sys.maxlen, but set it to SIZET_MAX (c.f.
> ANSI size_t type). It is probably not a biggie, but Python currently
> makes the assumption that strings never exceed INT_MAX in length.

It's not an assumption, it's an explicit part of the design:
PyObject_VAR_HEAD declares ob_size to be an int.  This leads to strain for
sure, partly because the *natural* limit on sizes is derived from malloc
(which indeed takes neither int nor long, but size_t), and partly because
Python exposes no corresponding integer type.  I vaguely recall that this
was deliberate, with the *intent* being to save space in object headers on
the upcoming 128-bit KSR machines <wink>.

> While this assumption is not likely to be proven false it technically
> could be on 64-bit systems.

Well, Guido once said he would take away Python's recursion overflow checks
just as soon as machines came with infinite memory <wink> -- 2Gb is a
reasonable limit for string length, and especially if it's a tradeoff
against increasing the header size for all string objects (it's almost
certainly more important to cater to oodles of small strings on smaller
machines than to one or two gigantic strings on huge machines).

> As well, when you start compiling on Win64 (where sizeof(int) ==
> sizeof(long) < sizeof(size_t)) then you are going to be annoyed
> by hundreds of warnings about implicit casts from size_t (64-bits) to
> int (32-bits) for every strlen, str*, fwrite, and sizeof call that
> you make.

Every place the code implicitly downcasts from size_t to int is plainly
broken today, so we *should* get warnings.  Python has been sloppy about
this!  In large part it's because Python was written before ANSI C, and
size_t simply wasn't supported at the time.  But as with all software,
people rarely go back to clean up; it's overdue (just be thankful you're not
working on the Perl source <0.9 wink>).

> Pros:
> - IMHO logically more correct.
> - Might clean up some subtle bugs.
> - Cleans up annoying and disconcerting warnings.
> - Will probably mean less pain down the road as 64-bit systems
> (esp. Win64) become more prevalent.
>
> Cons:
> - Lot of coding changes.
> - As Guido said: "and then something else would probably break".
> (Though, on current 32-bit systems, there should be no effective
> change).  Only 64-bit systems should be affected and, I would hope,
> the effect would be a clean up.

I support this as a long-term solution, perhaps for P3K.  Note that
ob_refcnt should also be declared size_t (no overflow-checking is done on
refcounts today; the theory is that a refcount can't possibly get bigger
than the total # of pointers in the system, and so if you declare ob_refcnt
to be large enough to hold that, refcount overflow is impossible; but, in
theory, this argument has been broken on every machine where sizeof(int) <
sizeof(void*)).

> I apologize for not being succinct.

Humbug -- it was a wonderfully concise posting, Trent!  The issues are
simply messy.

> Note that I am volunteering here.  Opinions and guidance please.

Alas, the first four letters in "guidance" spell out four-fifths of the only
one able to give you that.

opinions-are-fun-but-don't-count<wink>-ly y'rs  - tim




From mal@lemburg.com  Mon May  1 11:55:52 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 01 May 2000 12:55:52 +0200
Subject: [Python-Dev] issues with int/long on 64bit platforms - eg
 stringobject (PR#306)
References: <000001bfb336$d4f512a0$0f2d153f@tim>
Message-ID: <390D62B8.15331407@lemburg.com>

I've just posted a simple patch to the patches list which
implements the idea I posted earlier:

Silent truncation still takes place, but in a somewhat more
natural way ;-) ...

                       /* Silently truncate to INT_MAX/INT_MIN to
                          make passing sys.maxint to 'i' parser
                          markers work on 64-bit platforms just
                          like on 32-bit platforms. Overflow errors
                          are not raised. */
                       else if (ival > INT_MAX)
                               ival = INT_MAX;
                       else if (ival < INT_MIN)
                               ival = INT_MIN;
                       *p = ival;
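At the Python level, the intended effect is roughly this (a sketch,
assuming a 64-bit platform where sys.maxint > INT_MAX):

    import sys
    # With the clamping in place, the long-standing idiom of passing
    # sys.maxint as a slice-style bound keeps working unchanged:
    "hello".count("l", 0, sys.maxint)   # == 2, same as "hello".count("l")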

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake@acm.org  Mon May  1 15:04:08 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 10:04:08 -0400 (EDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <Pine.GSO.4.10.10004292105210.28387-100000@sundial>
References: <Pine.GSO.4.10.10004292105210.28387-100000@sundial>
Message-ID: <14605.36568.455646.598506@seahag.cnri.reston.va.us>

Moshe Zadka writes:
 > 1. I'm not sure what to call this function. Currently, I call it
 > __print_expr__, but I'm not sure it's a good name

  It's not.  ;)  How about printresult?
  Another thing to think about is interface; formatting a result and
"printing" it may be different, and you may want to overload them
separately in an environment like IDLE.  Some people may want to just
say:

	import sys
	sys.formatresult = str

  I'm inclined to think that level of control may be better left to
the application; if one hook is provided as you've described, the
application can build different layers as appropriate.

 > 2. I haven't yet supplied a default in __builtin__, so the user *must*
 > override this. This is unacceptable, of course.

  You're right!  But a default is easy enough to add.  I'd put it in
sys instead of __builtin__ though.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From Moshe Zadka <moshez@math.huji.ac.il>  Mon May  1 15:19:46 2000
From: Moshe Zadka <moshez@math.huji.ac.il> (Moshe Zadka)
Date: Mon, 1 May 2000 17:19:46 +0300 (IDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <14605.36568.455646.598506@seahag.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005011712410.25942-100000@sundial>

On Mon, 1 May 2000, Fred L. Drake, Jr. wrote:

>   It's not.  ;)  How about printresult?

Hmmmm...better than mine at least.

> 	import sys
> 	sys.formatresult = str

And where does the "don't print if it's None" enter? I doubt if there is a
really good way to divide functionality. Of course, specific IDEs may
provide their own hooks.

>   You're right!  But a default is easy enough to add.

I agree. It was more to spur discussion -- with the advantage that there
is already a way to include Python sessions.

> I'd put it in
> sys instead of __builtin__ though.

Hmmm.. that's a Guido Issue(TM). Guido?
--
Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com



From fdrake@acm.org  Mon May  1 16:19:10 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 11:19:10 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
Message-ID: <14605.41070.290137.787832@seahag.cnri.reston.va.us>

  The "winreg" module needs some documentation; is anyone here up to
the task?  I don't think I know enough about the registry to write
something reasonable.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From fdrake@acm.org  Mon May  1 16:23:06 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 11:23:06 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
Message-ID: <14605.41306.146320.597637@seahag.cnri.reston.va.us>

I wrote:
 >   The "winreg" module needs some documentation; is anyone here up to
 > the task?  I don't think I know enough about the registry to write
 > something reasonable.

  Of course, as soon as I sent this message I remembered that there's
also the linuxaudiodev module; that needs documentation as well!  (I
guess I'll need to add a Linux-specific chapter; ugh.)  If anyone
wants to document audiodev, perhaps I could avoid the Linux chapter
(with one module) by adding documentation for the portable interface.
  There's also the pyexpat module; Andrew/Paul, did one of you want to
contribute something for that?


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From guido@python.org  Mon May  1 16:26:44 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 11:26:44 -0400
Subject: [Python-Dev] documentation for new modules
In-Reply-To: Your message of "Mon, 01 May 2000 11:23:06 EDT."
 <14605.41306.146320.597637@seahag.cnri.reston.va.us>
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
 <14605.41306.146320.597637@seahag.cnri.reston.va.us>
Message-ID: <200005011526.LAA20332@eric.cnri.reston.va.us>

>  >   The "winreg" module needs some documentation; is anyone here up to
>  > the task?  I don't think I know enough about the registry to write
>  > something reasonable.

Maybe you could adapt the documentation for the registry functions in
Mark Hammond's win32all?  Not all the APIs are the same but they should
mostly do the same thing...

>   Of course, as soon as I sent this message I remembered that there's
> also the linuxaudiodev module; that needs documentation as well!  (I
> guess I'll need to add a Linux-specific chapter; ugh.)  If anyone
> wants to document audiodev, perhaps I could avoid the Linux chapter
> (with one module) by adding documentation for the portable interface.

There's also sunaudiodev.  Is it documented?  linuxaudiodev should be
mostly the same.

>   There's also the pyexpat module; Andrew/Paul, did one of you want to
> contribute something for that?

I would hope so!

--Guido van Rossum (home page: http://www.python.org/~guido/)


From fdrake@acm.org  Mon May  1 17:17:06 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 12:17:06 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <200005011526.LAA20332@eric.cnri.reston.va.us>
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
 <14605.41306.146320.597637@seahag.cnri.reston.va.us>
 <200005011526.LAA20332@eric.cnri.reston.va.us>
Message-ID: <14605.44546.568978.296426@seahag.cnri.reston.va.us>

Guido van Rossum writes:
 > Maybe you could adapt the documentation for the registry functions in
 > Mark Hammond's win32all?  Not all the APIs are the same but they should
 > mostly do the same thing...

  I'll take a look at it when I have time, unless anyone beats me to
it.

 > There's also sunaudiodev.  Is it documented?  linuxaudiodev should be
 > mostly the same.

  It's been documented for a long time.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From guido@python.org  Mon May  1 19:02:32 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 14:02:32 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Sat, 29 Apr 2000 09:18:05 CDT."
 <390AEF1D.253B93EF@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
Message-ID: <200005011802.OAA21612@eric.cnri.reston.va.us>

[Guido]
> > And this is exactly why encodings will remain important: entities
> > encoded in ISO-2022-JP have no compelling reason to be recoded
> > permanently into ISO10646, and there are lots of forces that make it
> > convenient to keep it encoded in ISO-2022-JP (like existing tools).

[Paul]
> You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is
> a character *set* and not an encoding. ISO-2022-JP says how you should
> represent characters in terms of bits and bytes. ISO10646 defines a
> mapping from integers to characters.

OK.  I really meant recoding in UTF-8 -- I maintain that there are
lots of forces that prevent recoding most ISO-2022-JP documents in
UTF-8.

> They are both important, but separate. I think that this automagical
> re-encoding conflates them.

Who is proposing any automagical re-encoding?

Are you sure you understand what we are arguing about?

*I* am not even sure what we are arguing about.

I am simply saying that 8-bit strings (literals or otherwise) in
Python have always been able to contain encoded strings.

Earlier, you quoted some reference documentation that defines 8-bit
strings as containing characters.  That's taken out of context -- this
was written in a time when there was (for most people anyway) no
difference between characters and bytes, and I really meant bytes.
There's plenty of use of 8-bit Python strings for non-character uses
so your "proof" that 8-bit strings should contain "characters"
according to your definition is invalid.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From tree@basistech.com  Mon May  1 19:05:33 2000
From: tree@basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:05:33 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005011802.OAA21612@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
Message-ID: <14605.51053.369016.283239@cymru.basistech.com>

Guido van Rossum writes:
 > OK.  I really meant recoding in UTF-8 -- I maintain that there are
 > lots of forces that prevent recoding most ISO-2022-JP documents in
 > UTF-8.

Such as?

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From Fredrik Lundh" <effbot@telia.com  Mon May  1 19:39:52 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Mon, 1 May 2000 20:39:52 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]><l03102800b52d80db1290@[193.78.237.154]><200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com>
Message-ID: <009f01bfb39c$a603cc00$34aab5d4@hagrid>

Tom Emerson wrote:
> Guido van Rossum writes:
>  > OK.  I really meant recoding in UTF-8 -- I maintain that there are
>  > lots of forces that prevent recoding most ISO-2022-JP documents in
>  > UTF-8.
>
> Such as?

ISO-2022-JP includes language/locale information, UTF-8 doesn't.  if
you just recode the character codes, you'll lose important information.

</F>



From tree@basistech.com  Mon May  1 19:42:40 2000
From: tree@basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:42:40 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <009f01bfb39c$a603cc00$34aab5d4@hagrid>
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <14605.51053.369016.283239@cymru.basistech.com>
 <009f01bfb39c$a603cc00$34aab5d4@hagrid>
Message-ID: <14605.53280.55595.335112@cymru.basistech.com>

Fredrik Lundh writes:
 > ISO-2022-JP includes language/locale information, UTF-8 doesn't.  if
 > you just recode the character codes, you'll lose important information.

So encode them using the Plane 14 language tags.

I won't start with whether language/locale should be encoded in a
character encoding... 8-)

          -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From guido@python.org  Mon May  1 19:52:04 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 14:52:04 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 14:05:33 EDT."
 <14605.51053.369016.283239@cymru.basistech.com>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>
 <14605.51053.369016.283239@cymru.basistech.com>
Message-ID: <200005011852.OAA21973@eric.cnri.reston.va.us>

> Guido van Rossum writes:
>  > OK.  I really meant recoding in UTF-8 -- I maintain that there are
>  > lots of forces that prevent recoding most ISO-2022-JP documents in
>  > UTF-8.

[Tom Emerson]
> Such as?

The standard forces that work against all change -- existing tools,
user habits, compatibility, etc.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From tree@basistech.com  Mon May  1 19:46:04 2000
From: tree@basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:46:04 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005011852.OAA21973@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <14605.51053.369016.283239@cymru.basistech.com>
 <200005011852.OAA21973@eric.cnri.reston.va.us>
Message-ID: <14605.53484.225980.235301@cymru.basistech.com>

Guido van Rossum writes:
 > The standard forces that work against all change -- existing tools,
 > user habits, compatibility, etc.

Ah... I misread your original statement, which I took to be a
technical reason why one couldn't convert ISO-2022-JP to UTF-8. Of
course one cannot expect everyone to switch en masse to a new
encoding, pulling their existing documents with them. I'm in full
agreement there.

          -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From paul@prescod.net  Mon May  1 21:38:29 2000
From: paul@prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 15:38:29 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>
Message-ID: <390DEB45.D8D12337@prescod.net>

Uche asked for a summary so I cc:ed the xml-sig.

Guido van Rossum wrote:
> 
> ...
>
> OK.  I really meant recoding in UTF-8 -- I maintain that there are
> lots of forces that prevent recoding most ISO-2022-JP documents in
> UTF-8.

Absolutely agree.
 
> Are you sure you understand what we are arguing about?

Here's what I thought we were arguing about:

If you put a bunch of "funny characters" into a Python string literal,
and then compare that string literal against a Unicode object, should
those funny characters be treated as logical units of text (characters)
or as bytes? And if bytes, should some transformation be automatically
performed to have those bytes be reinterpreted as characters according
to some particular encoding scheme (probably UTF-8).

I claim that we should *as far as possible* treat strings as character
lists and not add any new functionality that depends on them being byte
lists. Ideally, we could add a byte array type and start deprecating the
use of strings in that manner. Yes, it will take a long time to fix this
bug but that's what happens when good software lives a long time and the
world changes around it.

> Earlier, you quoted some reference documentation that defines 8-bit
> strings as containing characters.  That's taken out of context -- this
> was written in a time when there was (for most people anyway) no
> difference between characters and bytes, and I really meant bytes.

Actually, I think that that was Fredrik. 

Anyhow, you wrote the documentation that way because it was the most
intuitive way of thinking about strings. It remains the most intuitive
way. I think that that was the point Fredrik was trying to make.

We can't make "byte-list" strings go away soon but we can start moving
people towards the "character-list" model. In concrete terms I would
suggest that old fashioned lists be automatically coerced to Unicode by
interpreting each byte as a Unicode character. Trying to go the other
way could cause the moral equivalent of an OverflowError but that's not
a problem. 

>>> a=1000000000000000000000000000000000000L
>>> int(a)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OverflowError: long int too long to convert

And just as with ints and longs, we would expect to eventually unify
strings and unicode strings (but not byte arrays).

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From guido@python.org  Mon May  1 22:32:38 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 17:32:38 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 15:38:29 CDT."
 <390DEB45.D8D12337@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net>
Message-ID: <200005012132.RAA23319@eric.cnri.reston.va.us>

> > Are you sure you understand what we are arguing about?
> 
> Here's what I thought we were arguing about:
> 
> If you put a bunch of "funny characters" into a Python string literal,
> and then compare that string literal against a Unicode object, should
> those funny characters be treated as logical units of text (characters)
> or as bytes? And if bytes, should some transformation be automatically
> performed to have those bytes be reinterpreted as characters according
> to some particular encoding scheme (probably UTF-8).
> 
> I claim that we should *as far as possible* treat strings as character
> lists and not add any new functionality that depends on them being byte
> lists. Ideally, we could add a byte array type and start deprecating the
> use of strings in that manner. Yes, it will take a long time to fix this
> bug but that's what happens when good software lives a long time and the
> world changes around it.
> 
> > Earlier, you quoted some reference documentation that defines 8-bit
> > strings as containing characters.  That's taken out of context -- this
> > was written in a time when there was (for most people anyway) no
> > difference between characters and bytes, and I really meant bytes.
> 
> Actually, I think that that was Fredrik. 

Yes, I came across the post again later.  Sorry.

> Anyhow, you wrote the documentation that way because it was the most
> intuitive way of thinking about strings. It remains the most intuitive
> way. I think that that was the point Fredrik was trying to make.

I just wish he made the point more eloquently.  The eff-bot seems to
be in a crunchy mood lately...

> We can't make "byte-list" strings go away soon but we can start moving
> people towards the "character-list" model. In concrete terms I would
> suggest that old fashioned lists be automatically coerced to Unicode by
> interpreting each byte as a Unicode character. Trying to go the other
> way could cause the moral equivalent of an OverflowError but that's not
> a problem. 
> 
> >>> a=1000000000000000000000000000000000000L
> >>> int(a)
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too long to convert
> 
> And just as with ints and longs, we would expect to eventually unify
> strings and unicode strings (but not byte arrays).

OK, you've made your claim -- like Fredrik, you want to interpret
8-bit strings as Latin-1 when converting (not just comparing!) them to
Unicode.

I don't think I've heard a good *argument* for this rule though.  "A
character is a character is a character" sounds like an axiom to me --
something you can't prove or disprove rationally.

I have a bunch of good reasons (I think) for liking UTF-8: it allows
you to convert between Unicode and 8-bit strings without losses, Tcl
uses it (so displaying Unicode in Tkinter *just* *works*...), it is
not Western-language-centric.

Another reason: while you may claim that your (and /F's, and Just's)
preferred solution doesn't enter into the encodings issue, I claim it
does: Latin-1 is just as much an encoding as any other one.

I claim that as long as we're using an encoding we might as well use
the most accepted 8-bit encoding of Unicode as the default encoding.

I also think that the issue is blown out of proportions: this ONLY
happens when you use Unicode objects, and it ONLY matters when some
other part of the program uses 8-bit string objects containing
non-ASCII characters.  Given the long tradition of using different
encodings in 8-bit strings, at that point it is anybody's guess what
encoding is used, and UTF-8 is a better guess than Latin-1.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Mon May  1 23:17:17 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 18:17:17 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Sat, 29 Apr 2000 21:09:40 +0300."
 <Pine.GSO.4.10.10004292105210.28387-100000@sundial>
References: <Pine.GSO.4.10.10004292105210.28387-100000@sundial>
Message-ID: <200005012217.SAA23503@eric.cnri.reston.va.us>

> Continuing the recent debate about what is appropriate to the interactive
> prompt printing, and the wide agreement that whatever we decide, users
> might think otherwise, I've written up a patch to have the user control 
> via a function in __builtin__ the way things are printed at the prompt.
> This is not patches@python level stuff for two reasons:
> 
> 1. I'm not sure what to call this function. Currently, I call it
> __print_expr__, but I'm not sure it's a good name
> 
> 2. I haven't yet supplied a default in __builtin__, so the user *must*
> override this. This is unacceptable, of course.
> 
> I'd just like people to tell me if they think this is worth while, and if
> there is anything I missed.

Thanks for bringing this up again.  I think it should be called
sys.displayhook.  The default could be something like

import sys
import __builtin__
def displayhook(obj):
    if obj is None:
        return
    __builtin__._ = obj
    sys.stdout.write("%s\n" % repr(obj))

to be nearly 100% compatible with current practice; or use str(obj) to
do what most people would probably prefer.

(Note that you couldn't do "%s\n" % obj because obj might be a tuple.)
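A small illustration of that last point (the exact error text varies by
version):

    try:
        "%s\n" % (1, 2)        # the tuple is taken as an argument list
    except TypeError:          # for %, so this raises rather than formats
        pass
    "%s\n" % repr((1, 2))      # '(1, 2)\n' -- always formats one value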

--Guido van Rossum (home page: http://www.python.org/~guido/)


From Fredrik Lundh" <effbot@telia.com  Mon May  1 23:29:41 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 00:29:41 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>             <390DEB45.D8D12337@prescod.net>  <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <017d01bfb3bc$c3734c00$34aab5d4@hagrid>

Guido van Rossum <guido@python.org> wrote:
> I just wish he made the point more eloquently.  The eff-bot seems to
> be in a crunchy mood lately...

I've posted a few thousand messages on this topic, most of which
seem to have been ignored.  if you'd read all my messages, and seen
all the replies, you'd be cranky too...

> I don't think I've heard a good *argument* for this rule though.  "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

maybe, but it's a darn good axiom, and it's used by everyone else.
Perl uses it, Tcl uses it, XML uses it, etc.  see:

http://www.python.org/pipermail/python-dev/2000-April/005218.html

> I have a bunch of good reasons (I think) for liking UTF-8: it allows
> you to convert between Unicode and 8-bit strings without losses, Tcl
> uses it (so displaying Unicode in Tkinter *just* *works*...), it is
> not Western-language-centric.

the "Tcl uses it" is a red herring -- their internal implementation
uses 16-bit integers, and the external interface works very hard
to keep the "strings are character sequences" illusion.

in other words, the length of a string is *always* the number of
characters, the character at index i is *always* the i'th character
in the string, etc.

that's not true in Python 1.6a2.
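for anyone following along, a minimal sketch of the mismatch (using the
encode() spelling from the new Unicode support; this is the 1.6a2
default-encoding behaviour under debate):

    u = u"\u00e4bc"            # three characters
    s = u.encode("utf-8")      # four bytes: the first character takes two
    len(u)                     # 3 -- counts characters
    len(s)                     # 4 -- counts bytes, not characters
    s[0]                       # the first *byte*, not the first character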

(as for Tkinter, you only have to add 2-3 lines of code to make it
use 16-bit strings instead...)

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

this is another red herring: my argument is that 8-bit strings should
contain unicode characters, using unicode character codes.  there
should be only one character repertoire, and that repertoire is unicode.
for a definition of these terms, see:

http://www.python.org/pipermail/python-dev/2000-April/005225.html

obviously, you can only store 256 different values in a single 8-bit
character (just like you can only store 4294967296 different values
in a single 32-bit int).

to store larger values, use unicode strings (or long integers).

conversion from a small type to a large type always works, conversion
from a large type to a small one may result in an OverflowError.

it has nothing to do with encodings.

> I claim that as long as we're using an encoding we might as well use
> the most accepted 8-bit encoding of Unicode as the default encoding.

yeah, and I claim that it won't fly, as long as it breaks the "strings
are character sequences" rule used by all other contemporary (and
competing) systems.

(if you like, I can post more "fun with unicode" messages ;-)

and as I've mentioned before, there are (at least) two ways to solve
this:

1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and
   Perl).  make sure len(s) returns the number of characters in the
   string, make sure s[i] returns the i'th character (not necessarily
   starting at the i'th byte, and not necessarily one byte), etc.  to
   make this run reasonable fast, use as many implementation tricks
   as you can come up with (I've described three ways to implement
   this in an earlier post).

2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i])
   is a unicode character code, whether s is an 8-bit string or a unicode
   string.

for alternative 1 to work, you need to add some way to explicitly work
with binary strings (like it's done in Perl and Tcl).

alternative 2 doesn't need that; 8-bit strings can still be used to hold
any kind of binary data, as in 1.5.2.  just keep in mind you cannot use
all methods on such an object...
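to make alternative 2 concrete, here's a tiny sketch of the proposed
semantics (illustrative only -- not a description of any existing
implementation):

    s = "\344"                 # one 8-bit character, code 0xE4
    u = u"\u00e4"              # the same character, as a unicode string
    ord(s[0]) == ord(u[0])     # true: same code point, one repertoire
    len(s) == len(u) == 1      # true: length and indexing agree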

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.  Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

I still think it's very unfortunate that you think that unicode strings
are a special kind of strings.  Perl and Tcl don't, so why should we?

</F>



From gward@mems-exchange.org  Mon May  1 23:40:18 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Mon, 1 May 2000 18:40:18 -0400
Subject: [Python-Dev] Comparison inconsistency with ExtensionClass
Message-ID: <20000501184017.A1171@mems-exchange.org>

Hi all --

I seem to have discovered an inconsistency in the semantics of object
comparison between plain old Python instances and ExtensionClass
instances.  (I've cc'd python-dev because it looks as though one *could*
blame Python for the inconsistency, but I don't really understand the
guts of either Python or ExtensionClass enough to know.)

Here's a simple script that shows the difference:

    class Simple:
        def __init__ (self, data):
            self.data = data

        def __repr__ (self):
            return "<%s at %x: %s>" % (self.__class__.__name__,
                                       id(self),
                                       `self.data`)

        def __cmp__ (self, other):
            print "Simple.__cmp__: self=%s, other=%s" % (`self`, `other`)
            return cmp (self.data, other)


    if __name__ == "__main__":
        v1 = 36
        v2 = Simple (36)
        print "v1 == v2?", (v1 == v2 and "yes" or "no")
        print "v2 == v1?", (v2 == v1 and "yes" or "no")
        print "v1 == v2.data?", (v1 == v2.data and "yes" or "no")
        print "v2.data == v1?", (v2.data == v1 and "yes" or "no")

If I run this under Python 1.5.2, then all the comparisons come out true
and my '__cmp__()' method is called twice:

    v1 == v2? Simple.__cmp__: self=<Simple at 1b5148: 36>, other=36
    yes
    v2 == v1? Simple.__cmp__: self=<Simple at 1b5148: 36>, other=36
    yes
    v1 == v2.data? yes
    v2.data == v1? yes


The first one and the last two are obvious, but the second one only
works thanks to a trick in PyObject_Compare():

    if (PyInstance_Check(v) || PyInstance_Check(w)) {
        ...
        if (!PyInstance_Check(v))
	    return -PyObject_Compare(w, v);
        ...
    }

However, if I make Simple an ExtensionClass:

    from ExtensionClass import Base

    class Simple (Base):

Then the "swap v and w and use w's comparison method" no longer works.
Here's the output of the script with Simple as an ExtensionClass:

    v1 == v2? no
    v2 == v1? Simple.__cmp__: self=<Simple at 1b51c0: 36>, other=36
    yes
    v1 == v2.data? yes
    v2.data == v1? yes

It looks as though ExtensionClass would have to duplicate the trick in
PyObject_Compare() that I quoted, since Python has no idea that
ExtensionClass instances really should act like instances.  This smells
to me like a bug in ExtensionClass.  Comments?

BTW, I'm using the ExtensionClass provided with Zope 2.1.4.  Mostly
tested with Python 1.5.2, but also under the latest CVS Python and we
observed the same behaviour.

        Greg


From mhammond@skippinet.com.au  Tue May  2 00:45:02 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 09:45:02 +1000
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <14605.44546.568978.296426@seahag.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>

> Guido van Rossum writes:
>  > Maybe you could adapt the documentation for the registry functions in
>  > Mark Hammond's win32all?  Not all the APIs are the same but they should
>  > mostly do the same thing...
>
>   I'll take a look at it when I have time, unless anyone beats me to
> it.

I wonder if that anyone could be me? :-)

Note that all the win32api docs for the registry made it into
docstrings - so winreg has OK documentation as it is...

But I will try and put something together.  It will need to be plain
text or HTML, but I assume that is better than nothing!

Give me a few days...

Mark.



From paul@prescod.net  Tue May  2 01:19:20 2000
From: paul@prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 19:19:20 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <390E1F08.EA91599E@prescod.net>

Sorry for the long message. Of course you need only respond to that
which is interesting to you. I don't think that most of it is redundant.

Guido van Rossum wrote:
> 
> ...
> 
> OK, you've made your claim -- like Fredrik, you want to interpret
> 8-bit strings as Latin-1 when converting (not just comparing!) them to
> Unicode.

If the user provides an explicit conversion function (e.g. UTF-8-decode)
then of course we should use that function. Under my character is a
character is a character model, this "conversion" is morally equivalent
to ROT-13, strupr or some other text->text translation. So you could
apply UTF-8-decode even to a Unicode string as long as each character in
the string has ord()<256 (so that it could be interpreted as a character
representation for a byte).

> I don't think I've heard a good *argument* for this rule though.  "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

I don't see it as an axiom, but rather as a design decision you make to
keep your language simple. Along the lines of "all values are objects"
and (now) all integer values are representable with a single type. Are
you happy with this?

a="\244"
b=u"\244"
assert len(a)==len(b)
assert ord(a[0])==ord(b[0])

# same thing, right?
print b==a
# Traceback (most recent call last):
#  File "<stdin>", line 1, in ?
# UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a
UTF-8 escape sequence. "\244" is a string with one character. It has no
encoding. It is not latin-1. It is not UTF-8. It is a string with one
character and should compare as equal with another string with the same
character.

I would laugh my ass off if I was using Perl and it did something weird
like this to me (as long as it didn't take a month to track down the
bug!). Now it isn't so funny.

> I have a bunch of good reasons (I think) for liking UTF-8: 

I'm not against UTF-8. It could be an internal representation for some
Unicode objects.

> it allows
> you to convert between Unicode and 8-bit strings without losses, 

Here's the heart of our disagreement:

******
I don't want, in Py3K, to think about "converting between Unicode and
8-bit strings." I want strings and I want byte-arrays and I want to
worry about converting between *them*. There should be only one string
type, its characters should all live in the Unicode character repertoire
and the character numbers should all come from Unicode. "Special"
characters can be assigned to the Unicode Private Use Area. Byte arrays
would be entirely separate and would be converted to Unicode strings
with explicit conversion functions.
*****

In the meantime I'm just trying to get other people thinking in this
mode so that the transition is easier. If I see people embedding UTF-8
escape sequences in literal strings today, I'm going to hit them.

I recognize that we can't design the universe right now but we could
agree on this direction and use it to guide our decision-making.

By the way, if we DID think of 8-bit strings as essentially "byte
arrays" then let's use that terminology and imagine some future
documentation:

"Python's string type is equivalent to a list of bytes. For clarity, we
will call this type a byte list from now on. In contexts where a Unicode
character-string is desired, Python automatically converts byte lists to
character strings by doing a UTF-8 decode on them." 

What would you think if Java had a default (I say "magical") conversion
from byte arrays to character strings.

The only reason we are discussing this is because Python strings have a
dual personality which was useful in the past but will (IMHO, of course)
become increasingly confusing in the future. We want the best of both
worlds without confusing anybody and I don't think that we can have it.

If you want 8-bit strings to be really byte arrays in perpetuity then
let's be consistent in that view. We can compare them to Unicode as we
would two completely separate types. "U" comes after "S" so unicode
strings always compare greater than 8-bit strings. The use of the word
"string" for both objects can be considered just a historical accident.

> Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), 

Don't follow this entirely. Shouldn't the next version of Tkinter accept
and return Unicode strings? It would be rather ugly for two
Unicode-aware systems (Python and Tk) to talk to each other in 8-bit
strings. I mean I don't care what you do at the C level but at the
Python level arguments should be "just strings."

Consider that len() on the Tkinter side would return a different value
than on the Python side. 

What about integral indexes into buffers? I'm totally ignorant about
Tkinter, but let me ask: wouldn't Tkinter say (e.g.) that the cursor is
between the 5th and 6th character when in an 8-bit string the equivalent
index might be the 11th or 12th byte?

> it is not Western-language-centric.

If you look at encoding efficiency it is.

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

The fact that my proposal has the same effect as making Latin-1 the
"default encoding" is a near-term side effect of the definition of
Unicode. My long term proposal is to do away with the concept of 8-bit
strings (and thus, conversions from 8-bit to Unicode) altogether. One
string to rule them all!

Is Unicode going to be the canonical Py3K character set or will we have
different objects for different character sets/encodings with different
default (I say "magical") conversions between them. Such a design would
not be entirely insane though it would be a PITA to implement and
maintain. If we aren't ready to establish Unicode as the one true
character set then we should probably make no special concessions for
Unicode at all. Let a thousand string objects bloom!

Even if we agreed to allow many string objects, byte==character should
not be the default string object. Unicode should be the default.

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.  

Won't this be totally common? Most people are going to use 8-bit
literals in their program text but work with Unicode data from XML
parsers, COM, WebDAV, Tkinter, etc?

> Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

If we are guessing then we are doing something wrong. My answer to the
question of "default encoding" falls out naturally from a certain way of
looking at text, popularized in various other languages and increasingly
"the norm" on the Web. If you accept the model (a character is a
character is a character), the right behavior is obvious. 

"\244"==u"\244"

Nobody is ever going to have trouble understanding how this works.
Choose simplicity!

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From mhammond@skippinet.com.au  Tue May  2 01:34:16 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 10:34:16 +1000
Subject: [Python-Dev] Neil Hodgson on python-dev?
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au>

I'd like to propose that we invite Neil Hodgson to join the
python-dev family.

Neil is the author of the Scintilla editor control, now used by
wxPython and Pythonwin...  Smart guy, and very experienced with
Python (scintilla was originally written because he had trouble
converting Pythonwin to be a color'd editor :-)

But most relevant at the moment is his Unicode experience.  He
worked for a long time with Fujitsu, working with Japanese and all
the encoding issues there.  I have heard him echo the exact
sentiments of Andy.  He is also in the process of polishing the
recent Unicode support in Scintilla.

As this Unicode debate seems to be going nowhere fast, and appears
to simply need more people with _experience_, I think he would be
valuable.  Further, he is a pretty quiet guy - you won't find him
offering his opinion on every post that moves through here :-)

Thoughts?

Mark.



From guido@python.org  Tue May  2 01:41:43 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 20:41:43 -0400
Subject: [Python-Dev] Neil Hodgson on python-dev?
In-Reply-To: Your message of "Tue, 02 May 2000 10:34:16 +1000."
 <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au>
References: <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au>
Message-ID: <200005020041.UAA23648@eric.cnri.reston.va.us>

> I'd like to propose that we invite Neil Hodgson to join the
> python-dev family.

Excellent!

> As this Unicode debate seems to be going nowhere fast, and appears
> to simply need more people with _experience_, I think he would be
> valuable.  Further, he is a pretty quiet guy - you won't find him
> offering his opinion on every post that moves through here :-)

As long as he isn't too quiet on the Unicode thing ;-)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  2 01:53:26 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 20:53:26 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 19:19:20 CDT."
 <390E1F08.EA91599E@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net>
Message-ID: <200005020053.UAA23665@eric.cnri.reston.va.us>

Paul, we're both just saying the same thing over and over without
convincing each other.  I'll wait till someone who wasn't in this
debate before chimes in.

Have you tried using this?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From Fredrik Lundh" <effbot@telia.com  Tue May  2 02:26:06 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 03:26:06 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>              <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net>
Message-ID: <002301bfb3d5$8fd57440$34aab5d4@hagrid>

Paul Prescod <paul@prescod.net> wrote:
> I would laugh my ass off if I was using Perl and it did something weird
> like this to me.

you don't have to -- in Perl 5.6, a character is a character...

does anyone on this list follow the perl-porters list?  was this as
controversial over in Perl land as it appears to be over here?

</F>



From tpassin@home.com  Tue May  2 02:55:25 2000
From: tpassin@home.com (tpassin@home.com)
Date: Mon, 1 May 2000 21:55:25 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
Message-ID: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>

Guido van  Rossum wrote, about how to represent strings:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

I'm with Paul and Fredrik on this one - at least about characters being the
atoms of a string.  We **have** to be able to refer to **characters** in a
string, and without guessing.  Otherwise, how could you ever construct a
test, like theString[3]==[a particular japanese ideograph]?  If we do it by
having a "string" datatype, which is really a byte list, and a
"unicodeString" datatype which is a list of abstract characters, I'd say
everyone could get used to working with them.  We'd have to supply
conversion functions, of course.

This route might be the easiest to understand for users.  We'd have to be
very clear about what file.read() would return, for example, and all those
similar read and write functions.  And we'd have to work out how real 8-bit
calls (like writing to a socket?) would play with the new types.

For extra clarity, we could leave string the way it is, introduce stringU
(unicode string) **and** string8 (Latin-1 or byte list, whichever seems to
be the best equivalent to the current string).  Then we would deprecate
string in favor of string8.  Then if tcl and perl go to unicode strings we
pass them a stringU, and if they go some other way, we pass them something
else.  Come to think of it, we need some data type that will continue
to work with C and C++.  Would that be string8 or would we keep string for
that purpose?

Clarity and ease of use for the user should be primary, fast implementations
next.  If we didn't care about ease of use and clarity, we could all use
Scheme or C - don't lose sight of it.

I'd suggest we could create some use cases or scenarios for this area -
needs input from those who know encodings and low level Python stuff better
than I.  Then we could examine more systematically how well various
approaches would work out.

Regards,
Tom Passin




From mhammond@skippinet.com.au  Tue May  2 03:17:09 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 12:17:09 +1000
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEPJCJAA.mhammond@skippinet.com.au>

> Guido van  Rossum wrote, about how to represent strings:
>
> > Paul, we're both just saying the same thing over and
> over without
> > convincing each other.  I'll wait till someone who
> wasn't in this
> > debate before chimes in.

I've chimed in a little, but I'll chime in again :-)

> I'm with Paul and Federick on this one - at least about
> characters being the
> atoms of a string.  We **have** to be able to refer to
> **characters** in a
> string, and without guessing.  Otherwise, how could you

I see the point, and agree 100% with the intent.  However, reality
does bite.

As far as I can see, the following are immutable:
* There will be 2 types - a string type and a Unicode type.
* History dictates that the string type may hold binary data.

Thus, it is clear that Python simply can not treat characters as the
smallest atoms of strings.  If I understand things correctly, this
is key to Guido's point, and a bit of a communication block.

The issue, to my mind, is how we handle these facts to produce "the
principle of least surprise".  We simply need to accept that Python
1.x will never be able to treat string objects as sequences of
"characters" - only bytes.

However, with my limited understanding of the full issues, it does
appear that the proposal championed by Fredrik, Just and Paul is the
best solution - not because it magically causes Python to treat
strings as characters in all cases, but because it offers the
principle of least surprise.

As I said, I don't really have a deep enough understanding of the
issues, so this is probably (hopefully!?) my last word on the
matter - but that doesn't mean I don't share the concerns raised
here...

Mark.



From guido@python.org  Tue May  2 04:31:54 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 23:31:54 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID: <200005020331.XAA23818@eric.cnri.reston.va.us>

Tom Passin:
> I'm with Paul and Fredrik on this one - at least about characters being the
> atoms of a string.  We **have** to be able to refer to **characters** in a
> string, and without guessing.  Otherwise, how could you ever construct a
> test, like theString[3]==[a particular japanese ideograph]?  If we do it by
> having a "string" datatype, which is really a byte list, and a
> "unicodeString" datatype which is a list of abstract characters, I'd say
> everyone could get used to working with them.  We'd have to supply
> conversion functions, of course.

You seem unfamiliar with the details of the implementation we're
proposing?  We already have two datatypes, 8-bit string (call it byte
array) and Unicode string.  There are conversions between them:
explicit conversions such as u.encode("utf-8") or unicode(s,
"latin-1") and implicit conversions used in situations like u+s or
u==s.  The whole discussion is *only* about what the default
conversion in the latter cases should be -- the rest of the
implementation is rock solid and works well.

Users can accomplish what you are proposing by simply ensuring that
theString is a Unicode string.
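
(For concreteness, a minimal sketch of the two kinds of conversion being
discussed -- the codec names are the ones mentioned above, and the default
used in the implicit cases is exactly what this thread is about:)

    u = u"\u20ac"                    # a Unicode string
    s = "plain 8-bit data"           # an ordinary 8-bit string

    # explicit conversions: always unambiguous
    b = u.encode("utf-8")            # Unicode -> 8-bit
    u2 = unicode(s, "latin-1")       # 8-bit -> Unicode

    # implicit conversions: s gets decoded with the *default* encoding
    both = u + s
    same = (u == s)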

> This route might be the easiest to understand for users.  We'd have to be
> very clear about what file.read() would return, for example, and all those
> similar read and write functions.  And we'd have to work out how real 8-bit
> calls (like writing to a socket?) would play with the new types.

These are all well defined -- they all deal in 8-bit strings
internally, and all use the default conversions when given Unicode
strings.  Programs that only deal in 8-bit strings don't need to
change.  Programs that want to deal with Unicode and sockets, for
example, must know what encoding to use on the socket, and if it's not
the default encoding, must use explicit conversions.

> For extra clarity, we could leave string the way it is, introduce stringU
> (unicode string) **and** string8 (Latin-1 or byte list, whichever seems to
> be the best equivalent to the current string).  Then we would deprecate
> string in favor of string8.  Then if tcl and perl go to unicode strings we
> pass them a stringU, and if they go some other way, we pass them something
> else.  Come to think of it, we need some data type that will continue
> to work with C and C++.  Would that be string8 or would we keep string for
> that purpose?

What would be the difference between string and string8?

> Clarity and ease of use for the user should be primary, fast implementations
> next.  If we didn't care about ease of use and clarity, we could all use
> > Scheme or C, so let's not lose sight of it.
> 
> I'd suggest we could create some use cases or scenarios for this area -
> needs input from those who know encodings and low level Python stuff better
> than I.  Then we could examine more systematically how well various
> approaches would work out.

Very good.

Here's one usage scenario.

A Japanese user is reading lines from a file encoded in ISO-2022-JP.
The readline() method returns 8-bit strings in that encoding (the file
object doesn't do any decoding).  She realizes that she wants to do
some character-level processing on the file so she decides to convert
the strings to Unicode.

I believe that whether the default encoding is UTF-8 or Latin-1
doesn't matter here -- both are wrong, she needs to write explicit
unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
"better", because interpreting ISO-2022-JP data as UTF-8 will most
likely give an exception (when a \300 range byte isn't followed by a
\200 range byte) -- while interpreting it as Latin-1 will silently do
the wrong thing.  (An explicit error is always better than silent
failure.)
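
(A sketch of the scenario -- the file name is made up, and the comments
just restate the two outcomes described above:)

    f = open("mail-archive.txt")         # hypothetical ISO-2022-JP data
    line = f.readline()                  # an 8-bit string in that encoding

    u = unicode(line, "iso-2022-jp")     # explicit: always correct

    u = unicode(line)                    # implicit default conversion:
                                         #   utf-8   -> most likely an exception
                                         #   latin-1 -> silently wrong characters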

I'd love to discuss other scenarios.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From Moshe Zadka <moshez@math.huji.ac.il>  Tue May  2 05:39:12 2000
From: Moshe Zadka <moshez@math.huji.ac.il> (Moshe Zadka)
Date: Tue, 2 May 2000 07:39:12 +0300 (IDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <200005012217.SAA23503@eric.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005020732200.8759-100000@sundial>

> Thanks for bringing this up again.  I think it should be called
> sys.displayhook.

That should be the easy part -- I'll do it as soon as I'm home.

> The default could be something like
> 
> import __builtin__
import sys # Sorry, I couldn't resist
> def displayhook(obj):
>     if obj is None:
>         return
>     __builtin__._ = obj
>     sys.stdout.write("%s\n" % repr(obj))

This brings up a painful point -- the reason I haven't written the default
is that it was much easier to write it in Python.  Of course, I
shouldn't be preaching Python-is-easier-to-write-than-C here, but it
pains me that Python cannot be written with more Python and less C.
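
(To illustrate the point: once the hook exists, customizing it is pure
Python.  A sketch, using pprint instead of plain repr:)

    import sys, __builtin__
    from pprint import pprint

    def my_displayhook(obj):
        if obj is None:
            return
        __builtin__._ = obj
        pprint(obj)                  # pretty-print instead of repr()

    sys.displayhook = my_displayhook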

A while ago we started talking about the mini-interpreter idea, which
would then freeze Python code into itself, and then it sort of died out.
What has become of it?

--
Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com



From just@letterror.com  Tue May  2 06:47:35 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 06:47:35 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005020331.XAA23818@eric.cnri.reston.va.us>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID: <l03102802b534149a9639@[193.78.237.164]>

At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
>Here's one usage scenario.
>
>A Japanese user is reading lines from a file encoded in ISO-2022-JP.
>The readline() method returns 8-bit strings in that encoding (the file
>object doesn't do any decoding).  She realizes that she wants to do
>some character-level processing on the file so she decides to convert
>the strings to Unicode.
>
>I believe that whether the default encoding is UTF-8 or Latin-1
>doesn't matter here -- both are wrong, she needs to write explicit
>unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
>"better", because interpreting ISO-2022-JP data as UTF-8 will most
>likely give an exception (when a \300 range byte isn't followed by a
>\200 range byte) -- while interpreting it as Latin-1 will silently do
>the wrong thing.  (An explicit error is always better than silent
>failure.)

But then it's even better to *always* raise an exception, since it's
entirely possible a string contains valid utf-8 while not *being* utf-8. I
really think the exception argument is moot, since there can *always* be
situations that will pass silently. Encoding issues are silent by nature --
eg. there's no way any system can tell that interpreting MacRoman data as
Latin-1 is wrong, maybe even fatal -- the user will just have to deal with
it. You can argue what you want, but *any* multi-byte encoding stored in an
8-bit string is a buffer, not a string, for all the reasons Fredrik and
Paul have thrown at you, and right they are. Choosing such an encoding as a
default conversion to Unicode makes no sense at all. Recap of the main
arguments:

pro UTF-8:
always reversible when going from Unicode to 8-bit

con UTF-8:
not a string: confusing semantics

pro Latin-1:
simpler semantics

con Latin-1:
non-reversible, western-centric

Given the fact that very often *both* will be wrong, I'd go for the simpler
semantics.
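
(To make the "valid utf-8 while not *being* utf-8" point concrete -- the
two bytes below are perfectly good Latin-1, yet also decode silently as
utf-8:)

    s = "\xc3\xa9"                # meant as two Latin-1 characters
    unicode(s, "utf-8")           # silently gives u'\xe9' -- no error at all
    unicode(s, "latin-1")         # gives the two characters that were meant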

Just




From guido@python.org  Tue May  2 05:51:45 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 00:51:45 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Tue, 02 May 2000 07:39:12 +0300."
 <Pine.GSO.4.10.10005020732200.8759-100000@sundial>
References: <Pine.GSO.4.10.10005020732200.8759-100000@sundial>
Message-ID: <200005020451.AAA23940@eric.cnri.reston.va.us>

> > import __builtin__
> import sys # Sorry, I couldn't resist
> > def displayhook(obj):
> >     if obj is None:
> >         return
> >     __builtin__._ = obj
> >     sys.stdout.write("%s\n" % repr(obj))
> 
> This brings up a painful point -- the reason I haven't written the default
> is that it was much easier to write it in Python.  Of course, I
> shouldn't be preaching Python-is-easier-to-write-than-C here, but it
> pains me that Python cannot be written with more Python and less C.
> 

But the C code on how to do it was present in the code you deleted
from ceval.c!

> A while ago we started talking about the mini-interpreter idea,
> which would then freeze Python code into itself, and then it sort of
> died out.  What has become of it?

Nobody sent me a patch :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)


From nhodgson@bigpond.net.au  Tue May  2 06:04:12 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 15:04:12 +1000
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <035501bfb3f3$db87fb10$e3cb8490@neil>

   I'm dropping in a bit late in this thread but can the current problem be
summarised in an example as "how is 'literal' interpreted here"?

s = aUnicodeStringFromSomewhere
DoSomething(s + "<literal>")

   The two options being that literal is either assumed to be encoded in
Latin-1 or UTF-8. I can see some arguments for both sides.

Latin-1: more current code was written in a European locale with an implicit
assumption that all string handling was Latin-1. Current editors are more
likely to be displaying literal as it is meant to be interpreted.

UTF-8: all languages can be written in UTF-8 and more recent editors can
display this correctly. Thus people using non-Roman alphabets can write code
which is interpreted as it is seen, with no need to remember to call conversion
functions.

   Neil



From tpassin@home.com  Tue May  2 06:07:07 2000
From: tpassin@home.com (tpassin@home.com)
Date: Tue, 2 May 2000 01:07:07 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>  <200005020331.XAA23818@eric.cnri.reston.va.us>
Message-ID: <006101bfb3f4$454f99e0$7cac1218@reston1.va.home.com>

Guido van Rossum said
<snip/>
> What would be the difference between string and string8?

Probably none, except to alert people that string8 might have different
behavior than the present-day string, perhaps when interacting with
unicode - probably its behavior would be specified more tightly (i.e., is it
strictly a list of bytes or does it have some assumption about encoding?) or
changed in some way from what we have now.  Or if it turned out that a lot
of programmers in other languages (perl, tcl, perhaps?) expected "string" to
behave in particular ways, the use of a term like "string8" might reduce
confusion.   Possibly none of these apply - no need for "string8" then.

>
> > Clarity and ease of use for the user should be primary, fast implementations
> > next.  If we didn't care about ease of use and clarity, we could all use
> > Scheme or C, so let's not lose sight of it.
> >
> > I'd suggest we could create some use cases or scenarios for this area -
> > needs input from those who know encodings and low level Python stuff better
> > than I.  Then we could examine more systematically how well various
> > approaches would work out.
>
> Very good.
>
<snip/>

Tom Passin



From Fredrik Lundh" <effbot@telia.com  Tue May  2 07:59:03 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 08:59:03 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil>
Message-ID: <003b01bfb404$03cd0560$34aab5d4@hagrid>

Neil Hodgson <nhodgson@bigpond.net.au> wrote:
>    I'm dropping in a bit late in this thread but can the current problem be
> summarised in an example as "how is 'literal' interpreted here"?
>
> s = aUnicodeStringFromSomewhere
> DoSomething(s + "<literal>")

nope.  the whole discussion centers around what happens
if you type:

    # example 1

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    DoSomething(s + u)

and

    # example 2

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    if len(u) + len(s) == len(u + s):
        print "true"
    else:
        print "not true"

in Guido's design, the first example may or may not result in
an "UTF-8 decoding error: UTF-8 decoding error: unexpected
code byte" exception.  the second example may result in a
similar error, print "true", or print "not true", depending on the
contents of the 8-bit string.

(under the counter proposal, the first example will never
raise an exception, and the second will always print "true")

...

the string literal issue is a slightly different problem.

> The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8. I can see some arguments for both sides.

better make that "two options", not "the two options" ;-)

a more flexible scheme would be to borrow the design from XML
(see http://www.w3.org/TR/1998/REC-xml-19980210). for those
who haven't looked closer at XML, it basically treats the source
file as an encoded unicode character stream, and does all
processing on the decoded side.

replace "entity" with "script file" in the following excerpts, and you
get close:

section 2.2:

    A parsed entity contains text, a sequence of characters,
    which may represent markup or character data.

    A character is an atomic unit of text as specified by
    ISO/IEC 10646.

section 4.3.3:

    Each external parsed entity in an XML document may
    use a different encoding for its characters. All XML
    processors must be able to read entities in either
    UTF-8 or UTF-16.

    Entities encoded in UTF-16 must begin with the Byte
    Order Mark /.../ XML processors must be able to use
    this character to differentiate between UTF-8 and
    UTF-16 encoded documents.

    Parsed entities which are stored in an encoding other
    than UTF-8 or UTF-16 must begin with a text declaration
    containing an encoding declaration.

(also see appendix F: Autodetection of Character Encodings)

I propose that we adopt a similar scheme for Python -- but not
in 1.6.  the current "dunno, so we just copy the characters" is
good enough for now...
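
(a rough sketch of the appendix-F style autodetection, just to show how
little machinery is involved -- the function name is made up, and real
autodetection checks a few more byte patterns:)

    def sniff_encoding(prefix):
        # prefix: the first few bytes of the script file
        if prefix[:2] in ("\376\377", "\377\376"):
            return "utf-16"      # byte order mark present
        # no BOM: assume utf-8 unless an explicit encoding
        # declaration says otherwise
        return "utf-8"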

</F>



From tim_one@email.msn.com  Tue May  2 08:20:52 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 2 May 2000 03:20:52 -0400
Subject: [Python-Dev] fun with unicode, part 1
In-Reply-To: <200004271523.LAA13614@eric.cnri.reston.va.us>
Message-ID: <000201bfb406$f2f35520$df2d153f@tim>

[Guido asks good questions about how Windows deals w/ Unicode filenames,
 last Thursday, but gets no answers]

> ...
> I'd like to solve this problem, but I have some questions: what *IS*
> the encoding used for filenames on Windows?  This may differ per
> Windows version; perhaps it can differ per drive letter?  Or per
> application or per thread?  On Windows NT, filenames are supposed to
> be Unicode.  (I suppose also on Windows 2000?)  How do I open a file
> with a given Unicode string for its name, in a C program?  I suppose
> there's a Win32 API call for that which has a Unicode variant.
>
> On Windows 95/98, the Unicode variants of the Win32 API calls don't
> exist.  So what is the poor Python runtime to do there?
>
> Can Japanese people use Japanese characters in filenames on Windows
> 95/98?  Let's assume they can.  Since the filesystem isn't Unicode
> aware, the filenames must be encoded.  Which encoding is used?  Let's
> assume they use Microsoft's multibyte encoding.  If they put such a
> file on a floppy and ship it to Linköping, what will Fredrik see as
> the filename?  (I.e., is the encoding fixed by the disk volume, or by
> the operating system?)
>
> Once we have a few answers here, we can solve the problem.  Note that
> sometimes we'll have to refuse a Unicode filename because there's no
> mapping for some of the characters it contains in the filename
> encoding used.

I just thought I'd repeat the questions <wink>.  However, I don't think
you'll really want the answers -- Windows is a legacy-encrusted mess, and
there are always many ways to get a thing done in the end.  For example ...

> Question: how does Fredrik create a file with a Euro
> character (u'\u20ac') in its name?

This particular one is shallower than you were hoping:  in many of the
TrueType fonts (e.g., Courier New but not Courier), Windows extended its
Latin-1 encoding by mapping the Euro symbol to the "control character" 0x80.
So I can get a Euro symbol into a file name just by typing Alt+0+1+2+8.
This is true even on US Win98 (which has no visible Unicode support) -- but
was not supported in US Win95.

i've-been-tracking-down-what-appears-to-be-a-hw-bug-on-a-japanese-laptop-
    at-work-so-can-verify-ms-sure-got-japanese-characters-into-the-
    filenames-somehow-but-doubt-it's-via-unicode-ly y'rs  - tim




From Fredrik Lundh" <effbot@telia.com  Tue May  2 08:55:49 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 09:55:49 +0200
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim>
Message-ID: <007d01bfb40b$d7693720$34aab5d4@hagrid>

Tim Peters wrote:
> [Guido asks good questions about how Windows deals w/ Unicode filenames,
>  last Thursday, but gets no answers]

you missed Finn Bock's post on how Java does it.

here's another data point:

Tcl uses a system encoding to convert from unicode to a suitable
system API encoding, and uses the following approach to figure out
what that one is:

    windows NT/2000:
        unicode (use wide api)

    windows 95/98:
        "cp%d" % GetACP()
        (note that this is "cp1252" in us and western europe,
        not "iso-8859-1")
 =20
    macintosh:
        determine encoding for fontId 0 based on (script,
        smScriptLanguage) tuple. if that fails, assume
        "macroman"

    unix:
        figure out the locale from LC_ALL, LC_CTYPE, or LANG.
        use heuristics to map from the locale to an encoding
        (see unix/tclUnixInit). if that fails, assume "iso-8859-1"

I propose adding a similar mechanism to Python, along these lines:

    sys.getdefaultencoding() returns the right thing for windows
    and macintosh, "iso-8859-1" for other platforms.

    sys.setencoding(codec) changes the system encoding.  it's
    used from site.py to set things up properly on unix and other
    non-unicode platforms.
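
(the unix heuristic could be as simple as this sketch -- the function
name and the locale-to-codec mapping are made up for illustration only:)

    import os

    def guess_unix_encoding():
        name = os.environ.get("LC_ALL") or os.environ.get("LC_CTYPE") \
               or os.environ.get("LANG") or ""
        if name[-6:] == ".UTF-8" or name[-5:] == ".utf8":
            return "utf-8"
        if name[:3] == "ja_":
            return "euc-jp"          # illustrative guess only
        return "iso-8859-1"          # the documented fallback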

</F>



From nhodgson@bigpond.net.au  Tue May  2 09:22:36 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 18:22:36 +1000
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim>
Message-ID: <004501bfb40f$92ff0980$e3cb8490@neil>

> > I'd like to solve this problem, but I have some questions: what *IS*
> > the encoding used for filenames on Windows?  This may differ per
> > Windows version; perhaps it can differ per drive letter?  Or per
> > application or per thread?  On Windows NT, filenames are supposed to
> > be Unicode.  (I suppose also on Windows 2000?)  How do I open a file
> > with a given Unicode string for its name, in a C program?  I suppose
> > there's a Win32 API call for that which has a Unicode variant.

   Its decided by each file system.

   For FAT file systems, the OEM code page is used. The OEM code page
generally used in the United States is code page 437, which is different from
the code page Windows uses for display. I had to deal with this in a system
where people used fractions (1/4, 1/2 and 3/4) as part of names which had to
be converted into valid file names. For example 1/4 is 0xBC for display but
0xAC when used in a file name.

   In Japan, I think different manufacturers used different encodings with
NEC trying to maintain market control with their own encoding.

   VFAT stores both Unicode long file names and shortened aliases. However
the Unicode variant is hard to get to from Windows 95/98.

   NTFS stores Unicode.

> > On Windows 95/98, the Unicode variants of the Win32 API calls don't
> > exist.  So what is the poor Python runtime to do there?

   Fail the call. All existing files can be opened because they have short
non-Unicode aliases. If a file with a Unicode name can not be created
because the OS doesn't support it then you should give up. Just as you
should give up if you try to save a file with a name that includes a
character not allowed by the file system.

> > Can Japanese people use Japanese characters in filenames on Windows
> > 95/98?

   Yes.

> > Let's assume they can.  Since the filesystem isn't Unicode
> > aware, the filenames must be encoded.  Which encoding is used?  Let's
> > assume they use Microsoft's multibyte encoding.  If they put such a
> > file on a floppy and ship it to Linköping, what will Fredrik see as
> > the filename?  (I.e., is the encoding fixed by the disk volume, or by
> > the operating system?)

   If Fredrik is running a non-Japanese version of Windows 9x, he will see
some 'random' western characters replacing the Japanese.

   Neil



From Fredrik Lundh" <effbot@telia.com  Tue May  2 09:36:40 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 10:36:40 +0200
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil>
Message-ID: <008501bfb411$8e0502c0$34aab5d4@hagrid>

Neil Hodgson wrote:
>    Its decided by each file system.

...but the system API translates from the active code page to the
encoding used by the file system, right?

on my w95 box, GetACP() returns 1252, and GetOEMCP() returns
850.

if I create a file with a name containing latin-1 characters, on a
FAT drive, it shows up correctly in the file browser (cp1252), and
also shows up correctly in the MS-DOS window (under cp850).

if I print the same filename to stdout in the same DOS window, I
get gibberish.

> > > On Windows 95/98, the Unicode variants of the Win32 API calls don't
> > > exist.  So what is the poor Python runtime to do there?
>
>    Fail the call.

...if you fail to convert from unicode to the local code page.

</F>



From mal@lemburg.com  Tue May  2 09:36:43 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 10:36:43 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <390E939B.11B99B71@lemburg.com>

Just a small note on the subject of a character being atomic
which seems to have been forgotten by the discussing parties:

Unicode itself can be understood as multi-word character
encoding, just like UTF-8. The reason is that Unicode entities
can be combined to produce single display characters (e.g.
u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
Slicing such a combined Unicode string will have the same
effect as slicing UTF-8 data.

It seems that most Latin-1 proponents seem to have single
display characters in mind. While the same is true for
many Unicode entities, there are quite a few cases of
combining characters in Unicode 3.0, and the Unicode
normalization algorithm uses these as the basis for its
work.

So in the end the "UTF-8 doesn't slice" argument holds for
Unicode itself too, just as it also does for many Asian
multi-byte variable length character encodings,
image formats, audio formats, database formats, etc.

You can't really expect slicing to always "just work"
without some knowledge about the data you are slicing.
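
(A small illustration: the combined sequence is two Unicode entities, and
slicing splits them without any error being raised:)

    s = u"e" + u"\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
    len(s)                   # 2, even though it renders as one character
    s[:1]                    # u"e" -- the accent is silently dropped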

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From ping@lfw.org  Tue May  2 09:42:51 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 01:42:51 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <Pine.LNX.4.10.10005020114250.522-100000@localhost>

I'll warn you that i'm not much experienced or well-informed, but
i suppose i might as well toss in my naive opinion.

At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
> 
> I believe that whether the default encoding is UTF-8 or Latin-1
> doesn't matter here -- both are wrong, she needs to write explicit
> unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
> "better", because [this] will most likely give an exception...

On Tue, 2 May 2000, Just van Rossum wrote:
> But then it's even better to *always* raise an exception, since it's
> entirely possible a string contains valid utf-8 while not *being* utf-8.

I believe it is time for me to make a truly radical proposal:

    No automatic conversions between 8-bit "strings" and Unicode strings.

If you want to turn UTF-8 into a Unicode string, say so.
If you want to turn Latin-1 into a Unicode string, say so.
If you want to turn ISO-2022-JP into a Unicode string, say so.
Adding a Unicode string and an 8-bit "string" gives an exception.

I know this sounds tedious, but at least it stands the least possible
chance of confusing anyone -- and given all i've seen here and in
other i18n and l10n discussions, there's plenty enough confusion to
go around already.
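
(In code, the proposal amounts to no more than this -- the variable names
are made up:)

    data = "8-bit bytes from a file or socket"

    text = unicode(data, "utf-8")       # you say which encoding you meant
    text = unicode(data, "latin-1")
    text = unicode(data, "iso-2022-jp")

    text + data                         # under this proposal: an exception,
                                        # never a silent guess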


If it turns out automatic conversions *are* absolutely necessary,
then i vote in favour of the simple, direct method promoted by Paul
and Fredrik: just copy the numerical values of the bytes.  The fact
that this happens to correspond to Latin-1 is not really the point;
the main reason is that it satisfies the Principle of Least Surprise.


Okay.  Feel free to yell at me now.


-- ?!ng

P. S.  The scare-quotes when i talk about 8-bit "strings" expose my
sense of them as byte-buffers -- since that *is* all you get when you
read in some bytes from a file.  If you manipulate an 8-bit "string"
as a character string, you are implicitly making the assumption that
the byte values correspond to the character encoding of the character
repertoire you want to work with, and that's your responsibility.

P. P. S.  If always having to specify encodings is really too much,
i'd probably be willing to consider a default-encoding state on the
Unicode class, but it would have to be a stack of values, not a
single value.



From Fredrik Lundh" <effbot@telia.com  Tue May  2 10:00:07 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 11:00:07 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."             <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
Message-ID: <009701bfb414$d35d0ea0$34aab5d4@hagrid>

M.-A. Lemburg <mal@lemburg.com> wrote:
> Just a small note on the subject of a character being atomic
> which seems to have been forgotten by the discussing parties:
>
> Unicode itself can be understood as multi-word character
> encoding, just like UTF-8. The reason is that Unicode entities
> can be combined to produce single display characters (e.g.
> u"e"+u"\u0301" will print "=E9" in a Unicode aware renderer).
> Slicing such a combined Unicode string will have the same
> effect as slicing UTF-8 data.

really?  does it result in a decoder error?  or does it just result
in a rendering error, just as if you slice off any trailing character
without looking...

> It seems that most Latin-1 proponents seem to have single
> display characters in mind. While the same is true for
> many Unicode entities, there are quite a few cases of
> combining characters in Unicode 3.0 and the Unicode
> normalization algorithm uses these as the basis for its
> work.

do we support automatic normalization in 1.6?

</F>



From ping@lfw.org  Tue May  2 10:46:40 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:46:40 -0700 (PDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <Pine.GSO.4.10.10005020732200.8759-100000@sundial>
Message-ID: <Pine.LNX.4.10.10005020242270.522-100000@localhost>

On Tue, 2 May 2000, Moshe Zadka wrote:
> 
> > Thanks for bringing this up again.  I think it should be called
> > sys.displayhook.

I apologize profusely for dropping the ball on this.  I
was going to do it; i have been having a tough time lately
figuring out a Big Life Decision.  (Hate those BLDs.)

I was partway through hacking the patch and didn't get back
to it, but i wanted to at least air the plan i had in mind.
I hope you'll allow me this indulgence.

I was planning to submit a patch that adds the built-in routines

    sys.display
    sys.displaytb

    sys.__display__
    sys.__displaytb__

sys.display(obj) would be implemented as 'print repr(obj)'
and sys.displaytb(tb, exc) would call the same built-in
traceback printer we all know and love.

I assumed that sys.__stdin__ was added to make it easier to
restore sys.stdin to its original value.  In the same vein,
sys.__display__ and sys.__displaytb__ would be saved references
to the original sys.display and sys.displaytb.

I hate to contradict Guido, but i'll gently suggest why i
like "display" better than "displayhook": "display" is a verb,
and i prefer function names to be verbs rather than nouns
describing what the functions are (e.g. "read" rather than
"reader", etc.)


-- ?!ng



From ping@lfw.org  Tue May  2 10:47:34 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:47:34 -0700 (PDT)
Subject: [Python-Dev] Traceback style
Message-ID: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>

This was also going to go out after i posted the
display/displaytb patch.  But anyway, let's see what
you all think.

I propose the following stylistic changes to traceback
printing:

    1.  If there is no function name for a given level
        in the traceback, just omit the ", in ?" at the
        end of the line.

    2.  If a given level of the traceback is in a method,
        instead of just printing the method name, print
        the class and the method name.

    3.  Instead of beginning each line with:
        
            File "foo.py", line 5

        print the line first and drop the quotes:

            Line 5 of foo.py

        In the common interactive case that the file
        is a typed-in string, the current printout is
        
            File "<stdin>", line 1
        
        and the following is easier to read in my opinion:

            Line 1 of <stdin>

Here is an example:

    >>> class Spam:
    ...     def eggs(self):
    ...         return self.ham
    ... 
    >>> s = Spam()
    >>> s.eggs()
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 3, in eggs
    AttributeError: ham

With the suggested changes, this would print as

    Traceback (innermost last):
      Line 1 of <stdin>
      Line 3 of <stdin>, in Spam.eggs
    AttributeError: ham



-- ?!ng

"In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand."
    -- Gerald Holton



From ping@lfw.org  Tue May  2 10:53:01 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:53:01 -0700 (PDT)
Subject: [Python-Dev] Traceback behaviour in exceptional cases
Message-ID: <Pine.LNX.4.10.10004170045510.1157-100000@localhost>

Here is how i was planning to take care of exceptions in
sys.displaytb...


    1.  When the 'sys' module does not contain a 'stderr'
        attribute, Python currently prints 'lost sys.stderr'
        to the original stderr instead of printing the traceback.
        I propose that it proceed to try to print the traceback
        to the real stderr in this case.

    2.  If 'sys.stderr' is buffered, the traceback does not
        appear in the file.  I propose that Python flush
        'sys.stderr' immediately after printing a traceback.

    3.  Tracebacks get printed to whatever object happens to
        be in 'sys.stderr'.  If the object is not a file (or
        other problems occur during printing), nothing gets
        printed anywhere.  I propose that Python warn about
        this on stderr, then try to print the traceback to
        the real stderr as above.

    4.  Similarly, 'sys.displaytb' may cause an exception.
        I propose that when this happens, Python invoke its
        default traceback printer to print the exception from
        'sys.displaytb' as well as the original exception.

#4 may seem a little convoluted, so here is the exact logic
i suggest (described here in Python but to be implemented in C),
where 'handle_exception()' is the routine the interpreter uses
to handle an exception, 'print_exception' is the built-in
exception printer currently implemented in PyErr_PrintEx and
PyTraceBack_Print, and 'err' is the actual, original stderr.

    def print_double_exception(tb, exc, disptb, dispexc, file):
        file.write("Exception occured during traceback display:\n")
        print_exception(disptb, dispexc, file)
        file.write("\n")
        file.write("Original exception passed to display routine:\n")
        print_exception(tb, exc, file)

    def handle_double_exception(tb, exc, disptb, dispexc):
        if not hasattr(sys, 'stderr'):
            err.write("Missing sys.stderr; printing exception to stderr.\n")
            print_double_exception(tb, exc, disptb, dispexc, err)
            return
        try:
            print_double_exception(tb, exc, disptb, dispexc, sys.stderr)
        except:
            err.write("Error on sys.stderr; printing exception to stderr.\n")
            print_double_exception(tb, exc, disptb, dispexc, err)

    def handle_exception():
        tb, exc = sys.exc_traceback, sys.exc_value
        try:
            sys.displaytb(tb, exc)
        except:
            disptb, dispexc = sys.exc_traceback, sys.exc_value
            try:
                handle_double_exception(tb, exc, disptb, dispexc)
            except: pass

    def default_displaytb(tb, exc):
        if hasattr(sys, 'stderr'):
            print_exception(tb, exc, sys.stderr)
        else:
            print "Missing sys.stderr; printing exception to stderr."
            print_exception(tb, exc, err)

    sys.displaytb = sys.__displaytb__ = default_displaytb



-- ?!ng

"In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand."
    -- Gerald Holton



From mal@lemburg.com  Tue May  2 10:56:21 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 11:56:21 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."             <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>
Message-ID: <390EA645.89E3B22A@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > Just a small note on the subject of a character being atomic
> > which seems to have been forgotten by the discussing parties:
> >
> > Unicode itself can be understood as multi-word character
> > encoding, just like UTF-8. The reason is that Unicode entities
> > can be combined to produce single display characters (e.g.
> > u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> > Slicing such a combined Unicode string will have the same
> > effect as slicing UTF-8 data.
> 
> really?  does it result in a decoder error?  or does it just result
> in a rendering error, just as if you slice off any trailing character
> without looking...

In the example, if you cut off the u"\u0301", the "e" would
appear without the acute accent; cutting off the u"e" would
probably result in a rendering error, or worse, put the accent
over the next character to the left.

UTF-8 is better in this respect: it warns you about
the error by raising an exception when being converted to
Unicode.
 
> > It seems that most Latin-1 proponents seem to have single
> > display characters in mind. While the same is true for
> > many Unicode entities, there are quite a few cases of
> > combining characters in Unicode 3.0 and the Unicode
> > normalization algorithm uses these as basis for its
> > work.
> 
> do we supported automatic normalization in 1.6?

No, but it is likely to appear in 1.7... not sure about
the "automatic" though.

FYI: Normalization is needed to make comparing Unicode
strings robust, e.g. u"é" should compare equal to u"e\u0301".
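
(Without normalization the two spellings currently compare unequal:)

    u"\xe9" == u"e\u0301"            # false: no normalization is applied
    len(u"\xe9"), len(u"e\u0301")    # (1, 2)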

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From esr@thyrsus.com  Tue May  2 11:16:55 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Tue, 2 May 2000 06:16:55 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>; from ping@lfw.org on Tue, May 02, 2000 at 02:47:34AM -0700
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <20000502061655.A16999@thyrsus.com>

Ka-Ping Yee <ping@lfw.org>:
> I propose the following stylistic changes to traceback
> printing:
> 
>     1.  If there is no function name for a given level
>         in the traceback, just omit the ", in ?" at the
>         end of the line.
> 
>     2.  If a given level of the traceback is in a method,
>         instead of just printing the method name, print
>         the class and the method name.
> 
>     3.  Instead of beginning each line with:
>         
>             File "foo.py", line 5
> 
>         print the line first and drop the quotes:
> 
>             Line 5 of foo.py
> 
>         In the common interactive case that the file
>         is a typed-in string, the current printout is
>         
>             File "<stdin>", line 1
>         
>         and the following is easier to read in my opinion:
> 
>             Line 1 of <stdin>
> 
> Here is an example:
> 
>     >>> class Spam:
>     ...     def eggs(self):
>     ...         return self.ham
>     ... 
>     >>> s = Spam()
>     >>> s.eggs()
>     Traceback (innermost last):
>       File "<stdin>", line 1, in ?
>       File "<stdin>", line 3, in eggs
>     AttributeError: ham
> 
> With the suggested changes, this would print as
> 
>     Traceback (innermost last):
>       Line 1 of <stdin>
>       Line 3 of <stdin>, in Spam.eggs
>     AttributeError: ham

IMHO, this is not a good idea.  Emacs users like me want traceback
labels to be *more* like C compiler error messages, not less.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The United States is in no way founded upon the Christian religion
	-- George Washington & John Adams, in a diplomatic message to Malta.


From Moshe Zadka <moshez@math.huji.ac.il>  Tue May  2 11:12:14 2000
From: Moshe Zadka <moshez@math.huji.ac.il> (Moshe Zadka)
Date: Tue, 2 May 2000 13:12:14 +0300 (IDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>

On Mon, 1 May 2000, Guido van Rossum wrote:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

Well, I'm guessing you had someone specific in mind (Neil?), but I want to
say something too, as the only one here (I think) using ISO-8859-8
natively. I much prefer the Fredrik-Paul position, known also as the
character is a character position, to the UTF-8 as default encoding.
Unicode is western-centered -- the first 256 characters are Latin 1. UTF-8
is even more horribly western-centered (or I should say USA centered) --
ASCII documents are the same. I'd much prefer Python to reflect a
fundamental truth about Unicode, which at least makes sure binary-goop can
pass through Unicode and remain unharmed, than to reflect a nasty problem
with UTF-8 (not everything is legal). 

If I'm using Hebrew characters in my source (which I won't for a long
while), I'll use them in Unicode strings only, and make sure I use
Unicode. If I'm reading Hebrew from an ISO-8859-8 file, I'll set a
conversion to Unicode on the fly anyway, since most bidi libraries work on
Unicode. So having UTF-8 conversions magically happen won't help me at
all, and will only cause problems when I use "sort-for-uniqueness" on a
list with mixed binary-goop and Unicode strings. In short, this sounds
like a recipe for disaster.

internationally y'rs, Z.

--
Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com



From pf@artcom-gmbh.de  Tue May  2 11:12:26 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 2 May 2000 12:12:26 +0200 (MEST)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us> from "Barry A. Warsaw" at "May 1, 2000 12:18:25 pm"
Message-ID: <m12mZfG-000CnCC@artcom0.artcom-gmbh.de>

Barry A. Warsaw:
> Update of /projects/cvsroot/python/dist/src/Doc/lib
[...]
> 	libos.tex 
[...]
>   Availability: Macintosh, \UNIX{}, Windows.
>   \end{funcdesc}
> --- 703,712 ----
>   \end{funcdesc}
>   
> ! \begin{funcdesc}{utime}{path, times}
> ! Set the access and modified times of the file specified by \var{path}.
> ! If \var{times} is \code{None}, then the file's access and modified
> ! times are set to the current time.  Otherwise, \var{times} must be a
> ! 2-tuple of numbers, of the form \var{(atime, mtime)} which is used to
> ! set the access and modified times, respectively.
>   Availability: Macintosh, \UNIX{}, Windows.
>   \end{funcdesc}

I may have missed something, but I haven't seen a patch to the WinXX
and MacOS implementation of the 'utime' function.  So either the
documentation should explicitly point out that the new additional
signature is only available on Unices, or even better it should be
implemented on all platforms, so that programmers intending to write
portable Python don't have to worry about this.

I suggest an additional note saying that this signature has been
added in Python 1.6.  There used to be several such notes all over
the documentation saying for example: "New in version 1.5.2." which
I found very useful in the past!

Regards, Peter


From nhodgson@bigpond.net.au  Tue May  2 11:22:00 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 20:22:00 +1000
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil> <008501bfb411$8e0502c0$34aab5d4@hagrid>
Message-ID: <00d101bfb420$4197e510$e3cb8490@neil>

> ...but the system API translates from the active code page to the
> encoding used by the file system, right?

   Yes, although I think that wasn't the case with Win16 and there are still
some situations in which you have to deal with the differences. Copying a
file from the console on Windows 95 to a FAT volume appears to allow use of
the OEM character set with no conversion.

> if I create a file with a name containing latin-1 characters, on a
> FAT drive, it shows up correctly in the file browser (cp1252), and
> also shows up correctly in the MS-DOS window (under cp850).

   Do you have a FAT drive or a VFAT drive? If you format as FAT on 9x or NT
you will get a VFAT volume.

   Neil



From ping@lfw.org  Tue May  2 11:23:26 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 03:23:26 -0700 (PDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <20000502061655.A16999@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005020317030.522-100000@localhost>

On Tue, 2 May 2000, Eric S. Raymond wrote:
>
> Ka-Ping Yee <ping@lfw.org>:
> > 
> > With the suggested changes, this would print as
> > 
> >     Traceback (innermost last):
> >       Line 1 of <stdin>
> >       Line 3 of <stdin>, in Spam.eggs
> >     AttributeError: ham
> 
> IMHO, this is not a good idea.  Emacs users like me want traceback
> labels to be *more* like C compiler error messages, not less.

I suppose Python could go all the way and say things like

    Traceback (innermost last):
      <stdin>:3
      foo.py:25: in Spam.eggs
    AttributeError: ham

but that might be more intimidating for a beginner.

Besides, you Emacs guys have plenty of programmability anyway :)
You would have to do a little parsing to get the file name and
line number from the current format; it's no more work to get
it from the suggested format.

(What i would really like, by the way, is to see the values of
the function arguments on the stack -- but that's a lot of work
to do in C, so implementing this with the help of repr.repr
will probably be the first thing i do with sys.displaytb.)
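
(A rough sketch of what such a hook might do: walk the traceback and dump
each frame's locals through the size-limited repr module.  The function
name is just for illustration:)

    import repr                      # size-limited reprs

    def show_frames(tb):
        while tb is not None:
            frame = tb.tb_frame
            code = frame.f_code
            print "Line %d of %s, in %s" % (
                tb.tb_lineno, code.co_filename, code.co_name)
            for name, value in frame.f_locals.items():
                print "    %s = %s" % (name, repr.repr(value))
            tb = tb.tb_next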


-- ?!ng



From mal@lemburg.com  Tue May  2 11:46:06 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 12:46:06 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>
Message-ID: <390EB1EE.EA557CA9@lemburg.com>

Moshe Zadka wrote:
> 
> I'd much prefer Python to reflect a
> fundamental truth about Unicode, which at least makes sure binary-goop can
> pass through Unicode and remain unharmed, then to reflect a nasty problem
> with UTF-8 (not everything is legal).

Let's not do the same mistake again: Unicode objects should *not*
be used to hold binary data. Please use buffers instead.

BTW, I think that this behaviour should be changed:

>>> buffer('binary') + 'data'
'binarydata'

while:

>>> 'data' + buffer('binary')         
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation

IMHO, buffer objects should never coerce to strings, but instead
return a buffer object holding the combined contents. The
same applies to slicing buffer objects:

>>> buffer('binary')[2:5]
'nar'

should preferably be buffer('nar').

--

Hmm, perhaps we need something like a data string object
to get this 100% right ?!

>>> d = data("...data...")
or
>>> d = d"...data..."
>>> print type(d)
<type 'data'>

>>> 'string' + d
d"string...data..."
>>> u'string' + d
d"s\000t\000r\000i\000n\000g\000...data..."

>>> d[:5]
d"...da"

etc.

Ideally, string and Unicode objects would then be subclasses
of this type in Py3K.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From pf@artcom-gmbh.de  Tue May  2 11:59:55 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 2 May 2000 12:59:55 +0200 (MEST)
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10005020317030.522-100000@localhost> from Ka-Ping Yee at "May 2, 2000  3:23:26 am"
Message-ID: <m12maPD-000CnCC@artcom0.artcom-gmbh.de>

> > Ka-Ping Yee <ping@lfw.org>:
> > > 
> > > With the suggested changes, this would print as
> > > 
> > >     Traceback (innermost last):
> > >       Line 1 of <stdin>
> > >       Line 3 of <stdin>, in Spam.eggs
> > >     AttributeError: ham

> On Tue, 2 May 2000, Eric S. Raymond wrote:
> > IMHO, this is not a good idea.  Emacs users like me want traceback
> > labels to be *more* like C compiler error messages, not less.
> 
Ka-Ping Yee :
[...]
> Besides, you Emacs guys have plenty of programmability anyway :)
> You would have to do a little parsing to get the file name and
> line number from the current format; it's no more work to get
> it from the suggested format.

I like Ping's proposed traceback output.

But besides existing Elisp code there might be other software relying
on a particular format.  As a long time vim user I have absolutely
no idea about other IDEs.  So before changing the default format this
should be carefully checked.

> (What i would really like, by the way, is to see the values of
> the function arguments on the stack -- but that's a lot of work
> to do in C, so implementing this with the help of repr.repr
> will probably be the first thing i do with sys.displaytb.)

I'm eagerly waiting to see this. ;-)

Regards, Peter


From just@letterror.com  Tue May  2 13:34:57 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 13:34:57 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390E939B.11B99B71@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <l03102804b534772fc25b@[193.78.237.142]>

At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
>Just a small note on the subject of a character being atomic
>which seems to have been forgotten by the discussing parties:
>
>Unicode itself can be understood as multi-word character
>encoding, just like UTF-8. The reason is that Unicode entities
>can be combined to produce single display characters (e.g.
>u"e"+u"\u0301" will print "=E9" in a Unicode aware renderer).

Erm, are you sure Unicode prescribes this behavior, for this
example? I know similar behaviors are specified for certain
languages/scripts, but I didn't know it did that for latin.

>Slicing such a combined Unicode string will have the same
>effect as slicing UTF-8 data.

Not true. As Fredrik noted: no exception will be raised.

[ Speaking of exceptions,

after I sent off my previous post I realized Guido's
non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
argument can easily be turned around, backfiring at utf-8:

    Defaulting to utf-8 when going from Unicode to 8-bit and
    back only gives the *illusion* things "just work", since it
    will *silently* "work", even if utf-8 is *not* the desired
    8-bit encoding -- as shown by Fredrik's excellent "fun with
    Unicode, part 1" example. Defaulting to Latin-1 will
    warn the user *much* earlier, since it'll barf when
    converting a Unicode string that contains any character
    code > 255. So there.
]

>It seems that most Latin-1 proponents seem to have single
>display characters in mind. While the same is true for
>many Unicode entities, there are quite a few cases of
>combining characters in Unicode 3.0 and the Unicode
>normalization algorithm uses these as the basis for its
>work.

Still, two combining characters are still two input characters for
the renderer! They may result in one *glyph*, but trust me,
that's an entirely different can of worms.

However, if you'd be talking about Unicode surrogates,
you'd definitely have a point. How do Java/Perl/Tcl deal with
surrogates?

Just




From nhodgson@bigpond.net.au  Tue May  2 12:40:44 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 21:40:44 +1000
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil> <003b01bfb404$03cd0560$34aab5d4@hagrid>
Message-ID: <013e01bfb42b$41a3f200$e3cb8490@neil>

>    u = aUnicodeStringFromSomewhere
>    s = an8bitStringFromSomewhere
>
>    DoSomething(s + u)

> in Guido's design, the first example may or may not result in
> an "UTF-8 decoding error: UTF-8 decoding error: unexpected
> code byte" exception.

   I would say it is less surprising for most people for this to follow the
silent-widening of each byte - the Fredrik-Paul position. With the current
scarcity of UTF-8 code, very few people will expect an automatic UTF-8 to
UTF-16 conversion. While complete prohibition of automatic conversion has
some appeal, it will just be more noise to many.

>    u = aUnicodeStringFromSomewhere
>    s = an8bitStringFromSomewhere
>
>    if len(u) + len(s) == len(u + s):
>        print "true"
>    else:
>        print "not true"

> the second example may result in a
> similar error, print "true", or print "not true", depending on the
> contents of the 8-bit string.

   I don't see this as important, as it's trying to take "Unicode strings
are equivalent to 8-bit strings" too far. How much further before you have to
break? I always thought of len measuring the number of bytes rather than
characters when applied to strings. The same as strlen in C when you have a
DBCS string.

   I should correct some of the stuff Mark wrote about me. At Fujitsu we did
a lot more DBCS work than Unicode because that's what Japanese code uses.
Even with Java most storage is still DBCS. I was more involved with Unicode
architecture at Reuters 6 or so years ago.

   Neil



From guido@python.org  Tue May  2 12:53:10 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 07:53:10 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Tue, 02 May 2000 02:46:40 PDT."
 <Pine.LNX.4.10.10005020242270.522-100000@localhost>
References: <Pine.LNX.4.10.10005020242270.522-100000@localhost>
Message-ID: <200005021153.HAA24134@eric.cnri.reston.va.us>

> I was planning to submit a patch that adds the built-in routines
> 
>     sys.display
>     sys.displaytb
> 
>     sys.__display__
>     sys.__displaytb__
> 
> sys.display(obj) would be implemented as 'print repr(obj)'
> and sys.displaytb(tb, exc) would call the same built-in
> traceback printer we all know and love.

Sure.  Though I would recommend to separate the patch in two parts,
because their implementation is totally unrelated.

> I assumed that sys.__stdin__ was added to make it easier to
> restore sys.stdin to its original value.  In the same vein,
> sys.__display__ and sys.__displaytb__ would be saved references
> to the original sys.display and sys.displaytb.

Good idea.

> I hate to contradict Guido, but i'll gently suggest why i
> like "display" better than "displayhook": "display" is a verb,
> and i prefer function names to be verbs rather than nouns
> describing what the functions are (e.g. "read" rather than
> "reader", etc.)

Good idea.  But I hate the "displaytb" name (when I read your message
I had no idea what the "tb" stood for until you explained it).

Hm, perhaps we could do showvalue and showtraceback?
("displaytraceback" is a bit long.)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  2 13:15:28 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:15:28 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: Your message of "Tue, 02 May 2000 12:12:26 +0200."
 <m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
References: <m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
Message-ID: <200005021215.IAA24169@eric.cnri.reston.va.us>

> > ! \begin{funcdesc}{utime}{path, times}
> > ! Set the access and modified times of the file specified by \var{path}.
> > ! If \var{times} is \code{None}, then the file's access and modified
> > ! times are set to the current time.  Otherwise, \var{times} must be a
> > ! 2-tuple of numbers, of the form \var{(atime, mtime)} which is used to
> > ! set the access and modified times, respectively.
> >   Availability: Macintosh, \UNIX{}, Windows.
> >   \end{funcdesc}
> 
> I may have missed something, but I haven't seen a patch to the WinXX
> and MacOS implementations of the 'utime' function.  So either the
> documentation should explicitly point out that the new additional
> signature is only available on Unices, or (even better) it should be
> implemented on all platforms so that programmers intending to write
> portable Python do not have to worry about this.

Actually, it works on WinXX (tested on 98).  The utime()
implementation there is the same file as on Unix, so the patch fixed
both platforms.  The MS C library only seems to set the mtime, but
that's okay.

On Mac, I hope that the utime() function in GUSI 2 does this, in which
case Jack Jansen needs to copy Barry's patch.
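For reference, the two call forms being documented (spam.txt is just a
placeholder):

    import os, time

    os.utime('spam.txt', None)                 # use the current time
    now = time.time()
    os.utime('spam.txt', (now, now - 3600))    # explicit (atime, mtime)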

> I suggest an additional note saying that this signature has been
> added in Python 1.6.  There used to be several such notes all over
> the documentation saying for example: "New in version 1.5.2." which
> I found very useful in the past!

Thanks, you're right!

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  2 13:19:38 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:19:38 -0400
Subject: [Python-Dev] fun with unicode, part 1
In-Reply-To: Your message of "Tue, 02 May 2000 20:22:00 +1000."
 <00d101bfb420$4197e510$e3cb8490@neil>
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil> <008501bfb411$8e0502c0$34aab5d4@hagrid>
 <00d101bfb420$4197e510$e3cb8490@neil>
Message-ID: <200005021219.IAA24181@eric.cnri.reston.va.us>

>    Yes, although I think that wasn't the case with Win16 and there are still
> some situations in which you have to deal with the differences. Copying a
> file from the console on Windows 95 to a FAT volume appears to allow use of
> the OEM character set with no conversion.

BTW, MS's use of code pages is full of shit.  Yesterday I was
spell-checking a document that had the name Andre in it (the accent
was missing).  The popup menu suggested Andr* where the * was an upper
case slashed O.  I first thought this was because the menu character
set might be using a different code page, but no -- it must have been
bad in the database, because selecting that entry from the menu
actually inserted the slashed O character.  So they must have been
maintaining their database with a different code page.

Just to indicate that when we sort out the rest of the Unicode debate
(which I'm sure we will :-) there will still be surprises on
Windows...

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  2 13:22:24 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:22:24 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 03:23:26 PDT."
 <Pine.LNX.4.10.10005020317030.522-100000@localhost>
References: <Pine.LNX.4.10.10005020317030.522-100000@localhost>
Message-ID: <200005021222.IAA24192@eric.cnri.reston.va.us>

> > Ka-Ping Yee <ping@lfw.org>:
> > > With the suggested changes, this would print as
> > > 
> > >     Traceback (innermost last):
> > >       Line 1 of <stdin>
> > >       Line 3 of <stdin>, in Spam.eggs
> > >     AttributeError: ham

ESR:
> > IMHO, this is not a good idea.  Emacs users like me want traceback
> > labels to be *more* like C compiler error messages, not less.

Ping:
> I suppose Python could go all the way and say things like
> 
>     Traceback (innermost last):
>       <stdin>:3
>       foo.py:25: in Spam.eggs
>     AttributeError: ham
> 
> but that might be more intimidating for a beginner.
> 
> Besides, you Emacs guys have plenty of programmability anyway :)
> You would have to do a little parsing to get the file name and
> line number from the current format; it's no more work to get
> it from the suggested format.

Not sure -- I think I carefully designed the old format to be one of
the formats that Emacs parses *by default*: File "...", line ...  Your
change breaks this.

> (What i would really like, by the way, is to see the values of
> the function arguments on the stack -- but that's a lot of work
> to do in C, so implementing this with the help of repr.repr
> will probably be the first thing i do with sys.displaytb.)

Yes, this is much easier in Python.  Watch out for values that are
uncomfortably big or recursive or that cause additional exceptions on
displaying.
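A rough sketch of such a hook (the displaytb name is still just the
proposal; repr.repr does the truncation of big or recursive values):

    import sys, traceback, repr

    def showtraceback(exctype, value, tb):
        traceback.print_exception(exctype, value, tb)
        t = tb
        while t is not None:
            frame = t.tb_frame
            code = frame.f_code
            # the function's arguments are the first co_argcount varnames
            for name in code.co_varnames[:code.co_argcount]:
                if frame.f_locals.has_key(name):
                    print '    %s = %s' % (name, repr.repr(frame.f_locals[name]))
            t = t.tb_next

    sys.displaytb = showtraceback    # proposed hook name, not implemented yet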

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  2 13:26:50 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:26:50 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 12:46:06 +0200."
 <390EB1EE.EA557CA9@lemburg.com>
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>
 <390EB1EE.EA557CA9@lemburg.com>
Message-ID: <200005021226.IAA24203@eric.cnri.reston.va.us>

[MAL]
> Let's not do the same mistake again: Unicode objects should *not*
> be used to hold binary data. Please use buffers instead.

Easier said than done -- Python doesn't really have a buffer data
type.  Or do you mean the array module?  It's not trivial to read a
file into an array (although it's possible, there are even two ways).
Fact is, most of Python's standard library and built-in objects use
(8-bit) strings as buffers.

I agree there's no reason to extend this to Unicode strings.
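(For the record, two ways that work today -- data.bin is just a placeholder:)

    import array, os

    # one way: slurp the file and convert
    a = array.array('B')
    a.fromstring(open('data.bin', 'rb').read())

    # another way: read straight into the array
    b = array.array('B')
    f = open('data.bin', 'rb')
    b.fromfile(f, os.path.getsize('data.bin'))
    f.close()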

> BTW, I think that this behaviour should be changed:
> 
> >>> buffer('binary') + 'data'
> 'binarydata'
> 
> while:
> 
> >>> 'data' + buffer('binary')         
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: illegal argument type for built-in operation
> 
> IMHO, buffer objects should never coerce to strings, but instead
> return a buffer object holding the combined contents. The
> same applies to slicing buffer objects:
> 
> >>> buffer('binary')[2:5]
> 'nar'
> 
> should preferably be buffer('nar').

Note that a buffer object doesn't hold data!  It's only a pointer to
data.  I can't off-hand explain the asymmetry though.
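The pointer-like behaviour is easy to see by wrapping something mutable:

    >>> import array
    >>> a = array.array('c', 'binary')
    >>> b = buffer(a)
    >>> a[0] = 'B'
    >>> str(b)
    'Binary'

No copy is made; the buffer just looks at the array's memory.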

> --
> 
> Hmm, perhaps we need something like a data string object
> to get this 100% right ?!
> 
> >>> d = data("...data...")
> or
> >>> d = d"...data..."
> >>> print type(d)
> <type 'data'>
> 
> >>> 'string' + d
> d"string...data..."
> >>> u'string' + d
> d"s\000t\000r\000i\000n\000g\000...data..."
> 
> >>> d[:5]
> d"...da"
> 
> etc.
> 
> Ideally, string and Unicode objects would then be subclasses
> of this type in Py3K.

Not clear.  I'd rather do the equivalent of byte arrays in Java, for
which no "string literal" notations exist.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gward@mems-exchange.org  Tue May  2 13:27:51 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Tue, 2 May 2000 08:27:51 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>; from ping@lfw.org on Tue, May 02, 2000 at 02:47:34AM -0700
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <20000502082751.A1504@mems-exchange.org>

On 02 May 2000, Ka-Ping Yee said:
> I propose the following stylistic changes to traceback
> printing:
> 
>     1.  If there is no function name for a given level
>         in the traceback, just omit the ", in ?" at the
>         end of the line.

+0 on this: it doesn't really add anything, but it does neaten things
up.

>     2.  If a given level of the traceback is in a method,
>         instead of just printing the method name, print
>         the class and the method name.

+1 here too: this definitely adds utility.

>     3.  Instead of beginning each line with:
>         
>             File "foo.py", line 5
> 
>         print the line first and drop the quotes:
> 
>             Line 5 of foo.py

-0: adds nothing, cleans nothing up, and just generally breaks things
for no good reason.

>         In the common interactive case that the file
>         is a typed-in string, the current printout is
>         
>             File "<stdin>", line 1
>         
>         and the following is easier to read in my opinion:
> 
>             Line 1 of <stdin>

OK, that's a good reason.  Maybe you could special-case the "<stdin>"
case?  How about

   <stdin>, line 1

?

        Greg


From guido@python.org  Tue May  2 13:30:02 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:30:02 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 11:56:21 +0200."
 <390EA645.89E3B22A@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>
 <390EA645.89E3B22A@lemburg.com>
Message-ID: <200005021230.IAA24232@eric.cnri.reston.va.us>

[MAL]
> > > Unicode itself can be understood as multi-word character
> > > encoding, just like UTF-8. The reason is that Unicode entities
> > > can be combined to produce single display characters (e.g.
> > > u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> > > Slicing such a combined Unicode string will have the same
> > > effect as slicing UTF-8 data.
[/F]
> > really?  does it result in a decoder error?  or does it just result
> > in a rendering error, just as if you slice off any trailing character
> > without looking...
[MAL]
> In the example, if you cut off the u"\u0301", the "e" would
> appear without the acute accent, cutting off the u"e" would
> probably result in a rendering error or worse put the accent
> over the next character to the left.
> 
> UTF-8 is better in this respect: it warns you about
> the error by raising an exception when being converted to
> Unicode.

I think /F's point was that the Unicode standard prescribes different
behavior here: for UTF-8, a missing or lone continuation byte is an
error; for Unicode, accents are separate characters that may be
inserted and deleted in a string but whose display is undefined under
certain conditions.

(I just noticed that this doesn't work in Tkinter but it does work in
wish.  Strange.)

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"é" should compare equal to u"e\u0301".

Aha, then we'll see u == v even though type(u) is type(v) and len(u)
!= len(v).  /F's world will collapse. :-)
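Today, of course, no normalization happens anywhere, and the two spellings
simply compare unequal:

    >>> u"\u00e9" == u"e\u0301"
    0
    >>> len(u"e\u0301")
    2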

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  2 13:31:55 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:31:55 -0400
Subject: [Python-Dev] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 01:42:51 PDT."
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost>
Message-ID: <200005021231.IAA24249@eric.cnri.reston.va.us>

>     No automatic conversions between 8-bit "strings" and Unicode strings.
> 
> If you want to turn UTF-8 into a Unicode string, say so.
> If you want to turn Latin-1 into a Unicode string, say so.
> If you want to turn ISO-2022-JP into a Unicode string, say so.
> Adding a Unicode string and an 8-bit "string" gives an exception.

I'd accept this, with one change: mixing Unicode and 8-bit strings is
okay when the 8-bit strings contain only ASCII (byte values 0 through
127).  That does the right thing when the program is combining
ASCII data (e.g. literals or data files) with Unicode and warns you
when you are using characters for which the encoding matters.  I
believe that this is important because much existing code dealing with
strings can in fact deal with Unicode just fine under these
assumptions.  (E.g. I needed only 4 changes to htmllib/sgmllib to make
it deal with Unicode strings -- those changes were all getattr() and
setattr() calls.)
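Concretely, the proposed rule would give roughly this (the exact exception
type and message are details still to be worked out):

    >>> u"spam" + "eggs"         # pure ASCII on the 8-bit side: fine
    u'spameggs'
    >>> u"spam" + "egg\xe9"      # any non-ASCII byte: refuse to guess
    Traceback (most recent call last):
      ...
    UnicodeError: ASCII decoding error: ordinal not in range(128)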

When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
bytes in either should make the comparison fail; when ordering is
important, we can make an arbitrary choice e.g. "\377" < u"\200".

Why not Latin-1?  Because it gives us Western-alphabet users a false
sense that our code works, where in fact it is broken as soon as you
change the encoding.

> P. S.  The scare-quotes when i talk about 8-bit "strings" expose my
> sense of them as byte-buffers -- since that *is* all you get when you
> read in some bytes from a file.  If you manipulate an 8-bit "string"
> as a character string, you are implicitly making the assumption that
> the byte values correspond to the character encoding of the character
> repertoire you want to work with, and that's your responsibility.

This is how I think of them too.

> P. P. S.  If always having to specify encodings is really too much,
> i'd probably be willing to consider a default-encoding state on the
> Unicode class, but it would have to be a stack of values, not a
> single value.

Please elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From just@letterror.com  Tue May  2 14:44:30 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:44:30 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005021230.IAA24232@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 11:56:21 +0200."
 <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000
 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <009701bfb414$d35d0ea0$34aab5d4@hagrid>
 <390EA645.89E3B22A@lemburg.com>
Message-ID: <l03102807b5348b0e6e0b@[193.78.237.142]>

At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
>I think /F's point was that the Unicode standard prescribes different
>behavior here: for UTF-8, a missing or lone continuation byte is an
>error; for Unicode, accents are separate characters that may be
>inserted and deleted in a string but whose display is undefined under
>certain conditions.
>
>(I just noticed that this doesn't work in Tkinter but it does work in
>wish.  Strange.)
>
>> FYI: Normalization is needed to make comparing Unicode
>> strings robust, e.g. u"È" should compare equal to u"e\u0301".
>
>Aha, then we'll see u == v even though type(u) is type(v) and len(u)
>!= len(v).  /F's world will collapse. :-)

Does the Unicode spec *really* specify u should compare equal to v? This
behavior would be the responsibility of a layout engine, a role which is
way beyond the scope of Unicode support in Python, as it is language- and
script-dependent.

Just




From just@letterror.com  Tue May  2 14:39:24 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:39:24 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode debate
In-Reply-To: <Pine.LNX.4.10.10005020114250.522-100000@localhost>
References: <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <l03102806b534883ec4cf@[193.78.237.142]>

At 1:42 AM -0700 02-05-2000, Ka-Ping Yee wrote:
>If it turns out automatic conversions *are* absolutely necessary,
>then i vote in favour of the simple, direct method promoted by Paul
>and Fredrik: just copy the numerical values of the bytes.  The fact
>that this happens to correspond to Latin-1 is not really the point;
>the main reason is that it satisfies the Principle of Least Surprise.

Exactly.

I'm not sure if automatic conversions are absolutely necessary, but seeing
8-bit strings as Latin-1 encoded Unicode strings seems most natural to me.
Heck, even 8-bit strings should have an s.encode() method that would
behave *just* like u.encode(), and unicode(blah) could even *return* an
8-bit string if it turns out the string has no character codes > 255!

Conceptually, this gets *very* close to the ideal of "there is only one
string type", and at the same times leaves room for 8-bit strings doubling
as byte arrays for backward compatibility reasons.

(Unicode strings and 8-bit strings could even be the same type, which only
uses wide chars when necessary!)

Just




From just@letterror.com  Tue May  2 14:55:31 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:55:31 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 01:42:51 PDT."
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
Message-ID: <l03102808b5348d1eea20@[193.78.237.142]>

At 8:31 AM -0400 02-05-2000, Guido van Rossum wrote:
>When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>bytes in either should make the comparison fail; when ordering is
>important, we can make an arbitrary choice e.g. "\377" < u"\200".

Blech. Just document that 8-bit strings *are* Latin-1 unless converted
explicitly, and you're done. It's really much simpler this way -- for you as
well as for the users.

>Why not Latin-1?  Because it gives us Western-alphabet users a false
>sense that our code works, where in fact it is broken as soon as you
>change the encoding.

Yeah, and? At least it'll *show* it's broken instead of *silently* doing
the wrong thing, as utf-8 does.

It's like using Python ints all over the place, and suddenly a user of the
application enters data that causes an integer overflow. Boom. Program
needs to be fixed. What's the big deal?

Just




From Fredrik Lundh" <effbot@telia.com  Tue May  2 14:05:42 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 15:05:42 +0200
Subject: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>             <390EA645.89E3B22A@lemburg.com>  <200005021230.IAA24232@eric.cnri.reston.va.us>
Message-ID: <00f301bfb437$227bc180$34aab5d4@hagrid>

Guido van Rossum <guido@python.org> wrote:
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
>
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v).  /F's world will collapse. :-)

you're gonna do automatic normalization?  that's interesting.
will this make Python the first language to define strings as
a "sequence of graphemes"?

or was this just the cheap shot it appeared to be?

</F>



From skip@mojam.com (Skip Montanaro)  Tue May  2 14:10:22 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Tue, 2 May 2000 08:10:22 -0500 (CDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <14606.54206.559407.213584@beluga.mojam.com>

[... completely eliding Ping's note and stealing his subject ...]

On a not-quite unrelated tack, I wonder if traceback printing can be
enhanced in the case where Python code calls a function or method written in
C (possibly calling multiple C functions), which in turn calls a Python
function that raises an exception.  Currently, the Python functions on
either side of the C functions are printed, but no hint of the C function's
existence is displayed.  Any way to get some indication there's another
function in the middle?

Thanks,

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From tdickenson@geminidataloggers.com  Tue May  2 14:46:44 2000
From: tdickenson@geminidataloggers.com (Toby Dickenson)
Date: Tue, 02 May 2000 14:46:44 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>

On Tue, 02 May 2000 08:31:55 -0400, Guido van Rossum
<guido@python.org> wrote:

>>     No automatic conversions between 8-bit "strings" and Unicode strings.
>>
>> If you want to turn UTF-8 into a Unicode string, say so.
>> If you want to turn Latin-1 into a Unicode string, say so.
>> If you want to turn ISO-2022-JP into a Unicode string, say so.
>> Adding a Unicode string and an 8-bit "string" gives an exception.
>
>I'd accept this, with one change: mixing Unicode and 8-bit strings is
>okay when the 8-bit strings contain only ASCII (byte values 0 through
>127).  That does the right thing when the program is combining
>ASCII data (e.g. literals or data files) with Unicode and warns you
>when you are using characters for which the encoding matters.  I
>believe that this is important because much existing code dealing with
>strings can in fact deal with Unicode just fine under these
>assumptions.  (E.g. I needed only 4 changes to htmllib/sgmllib to make
>it deal with Unicode strings -- those changes were all getattr() and
>setattr() calls.)
>
>When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>bytes in either should make the comparison fail; when ordering is
>important, we can make an arbitrary choice e.g. "\377" < u"\200".

I assume 'fail' means 'non-equal', rather than 'raises an exception'?


Toby Dickenson
tdickenson@geminidataloggers.com


From guido@python.org  Tue May  2 14:58:51 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 09:58:51 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 08:10:22 CDT."
 <14606.54206.559407.213584@beluga.mojam.com>
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
 <14606.54206.559407.213584@beluga.mojam.com>
Message-ID: <200005021358.JAA24443@eric.cnri.reston.va.us>

[Skip]
> On a not-quite unrelated tack, I wonder if traceback printing can be
> enhanced in the case where Python code calls a function or method written in
> C (possibly calling multiple C functions), which in turn calls a Python
> function that raises an exception.  Currently, the Python functions on
> either side of the C functions are printed, but no hint of the C function's
> existence is displayed.  Any way to get some indication there's another
> function in the middle?

In some cases, that's a good thing -- in others, it's not.  There
should probably be an API that a C function can call to add an entry
onto the stack.

It's not going to be a trivial fix though -- you'd have to manufacture
a frame object.

I can see two options: you can do this "on the way out" when you catch
an exception, or you can do this "on the way in" when you are called.
The latter would require you to explicitly get rid of the frame too --
probably both on normal returns and on exception returns.  That seems
hairier than only having to make a call on exception returns; but it
means that the C function is invisible to the Python debugger unless
it fails.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  2 15:00:14 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 10:00:14 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 14:46:44 BST."
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
Message-ID: <200005021400.KAA24464@eric.cnri.reston.va.us>

[me]
> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> >bytes in either should make the comparison fail; when ordering is
> >important, we can make an arbitrary choice e.g. "\377" < u"\200".

[Toby] 
> I assume 'fail' means 'non-equal', rather than 'raises an exception'?

Yes, sorry for the ambiguity.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From fdrake@acm.org  Tue May  2 15:04:17 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 2 May 2000 10:04:17 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>
References: <14605.44546.568978.296426@seahag.cnri.reston.va.us>
 <ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>
Message-ID: <14606.57441.97184.499435@seahag.cnri.reston.va.us>

Mark Hammond writes:
 > I wonder if that anyone could be me? :-)

  I certainly wouldn't object!  ;)

 > But I will try and put something together.  It will need to be plain
 > text or HTML, but I assume that is better than nothing!

  Plain text would be better than HTML.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From just@letterror.com  Tue May  2 16:11:39 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:11:39 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 14:46:44 BST."
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
Message-ID: <l0310280fb5349fd24fc5@[193.78.237.142]>

At 10:00 AM -0400 02-05-2000, Guido van Rossum wrote:
>[me]
>> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>> >bytes in either should make the comparison fail; when ordering is
>> >important, we can make an arbitrary choice e.g. "\377" < u"\200".
>
>[Toby]
>> I assume 'fail' means 'non-equal', rather than 'raises an exception'?
>
>Yes, sorry for the ambiguity.

You're going to have a hard time explaining that "\377" != u"\377".

Again, if you define that "all strings are unicode" and that 8-bit strings
contain Unicode characters up to 255, you're all set. Clear semantics, few
surprises, simple implementation, etc. etc.

Just




From guido@python.org  Tue May  2 15:21:28 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 10:21:28 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 16:11:39 BST."
 <l0310280fb5349fd24fc5@[193.78.237.142]>
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <l0310280fb5349fd24fc5@[193.78.237.142]>
Message-ID: <200005021421.KAA24526@eric.cnri.reston.va.us>

[Just]
> You're going to have a hard time explaining that "\377" != u"\377".

I agree.  You are an example of how hard it is to explain: you still
don't understand that for a person using CJK encodings this is in fact
the truth.

> Again, if you define that "all strings are unicode" and that 8-bit strings
> contain Unicode characters up to 255, you're all set. Clear semantics, few
> surprises, simple implementation, etc. etc.

But not all 8-bit strings occurring in programs are Unicode.  Ask
Moshe.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From just@letterror.com  Tue May  2 16:42:24 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:42:24 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021421.KAA24526@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 16:11:39 BST."
 <l0310280fb5349fd24fc5@[193.78.237.142]> Your message of "Tue, 02 May
 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <l0310280fb5349fd24fc5@[193.78.237.142]>
Message-ID: <l03102812b534a7430fb6@[193.78.237.142]>

>[Just]
>> You're going to have a hard time explaining that "\377" != u"\377".
>
[GvR]
>I agree.  You are an example of how hard it is to explain: you still
>don't understand that for a person using CJK encodings this is in fact
>the truth.

That depends on the definition of truth: if you document that 8-bit strings
are Latin-1, the above is the truth. Conceptually classifying all other 8-bit
encodings as binary goop makes the semantics crystal clear.

>> Again, if you define that "all strings are unicode" and that 8-bit strings
>> contain Unicode characters up to 255, you're all set. Clear semantics, few
>> surprises, simple implementation, etc. etc.
>
>But not all 8-bit strings occurring in programs are Unicode.  Ask
>Moshe.

I know. They can be anything, even binary goop. But that's *only* an
artifact of the fact that 8-bit strings need to double as buffer objects.

Just




From just@letterror.com  Tue May  2 16:45:01 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:45:01 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <l03102812b534a7430fb6@[193.78.237.142]>
References: <200005021421.KAA24526@eric.cnri.reston.va.us> Your message of
 "Tue, 02 May 2000 16:11:39 BST."
 <l0310280fb5349fd24fc5@[193.78.237.142]> Your message of "Tue, 02 May
 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <l0310280fb5349fd24fc5@[193.78.237.142]>
Message-ID: <l03102813b534a8484cf9@[193.78.237.142]>

I wrote:
>That depends on the definition of truth: if you document that 8-bit strings
>are Latin-1, the above is the truth.

Oops, I meant of course that "\377" == u"\377" is then the truth...

Sorry,

Just




From mal@lemburg.com  Tue May  2 16:18:21 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:18:21 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Tue, 02 May 2000 11:56:21 +0200."
 <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000
 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <009701bfb414$d35d0ea0$34aab5d4@hagrid>
 <390EA645.89E3B22A@lemburg.com> <l03102807b5348b0e6e0b@[193.78.237.142]>
Message-ID: <390EF1BD.E6C7AF74@lemburg.com>

Just van Rossum wrote:
> 
> At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
> >I think /F's point was that the Unicode standard prescribes different
> >behavior here: for UTF-8, a missing or lone continuation byte is an
> >error; for Unicode, accents are separate characters that may be
> >inserted and deleted in a string but whose display is undefined under
> >certain conditions.
> >
> >(I just noticed that this doesn't work in Tkinter but it does work in
> >wish.  Strange.)
> >
> >> FYI: Normalization is needed to make comparing Unicode
> >> strings robust, e.g. u"È" should compare equal to u"e\u0301".

                            ^
                            |

Here's a good example of what encoding errors can do: the
above character was an "e" with acute accent (u"é"). Looks like
some mailer converted this to some other code page and yet
another converted it back to Latin-1, even though the
Content-Type message header clearly states that the
document uses ISO-8859-1.

> >
> >Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> >!= len(v).  /F's world will collapse. :-)
> 
> Does the Unicode spec *really* specify u should compare equal to v?

The behaviour is needed in order to implement sorting Unicode.
See the www.unicode.org site for more information and the
tech reports describing this.

Note that I haven't mentioned anything about "automatic"
normalization. This should be a method on Unicode strings
and could then be used in sorting compare callbacks.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue May  2 16:55:40 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:55:40 +0200
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <390EFA7B.F6B622F0@lemburg.com>

[Guido going ASCII]

Do you mean going ASCII all the way (using it for all
aspects where Unicode gets converted to a string and cases
where strings get converted to Unicode), or just 
for some aspect of conversion, e.g. just for the silent
conversions from strings to Unicode ?

[BTW, I'm pretty sure that the Latin-1 folks won't like
ASCII for the same reason they don't like UTF-8: it's
simply an inconvenient way to write strings in their favorite
encoding directly in Python source code. My feeling in this
whole discussion is that it's more about convenience than
anything else. Still, it's very amusing ;-) ]

FYI, here's the conversion table of (potentially) all
conversions done by the implementation:

Python:
-------
string + unicode:       unicode(string,'utf-8') + unicode
string.method(unicode): unicode(string,'utf-8').method(unicode)
print unicode:          print unicode.encode('utf-8'); with stdout
                        redirection this can be changed to any
                        other encoding
str(unicode):           unicode.encode('utf-8')
repr(unicode):          repr(unicode.encode('unicode-escape'))


C (PyArg_ParseTuple):
----------------------
"s" + unicode:          same as "s" + unicode.encode('utf-8')
"s#" + unicode:         same as "s#" + unicode.encode('unicode-internal')
"t" + unicode:          same as "t" + unicode.encode('utf-8')
"t#" + unicode:         same as "t#" + unicode.encode('utf-8')

This affects all C modules and builtins. In case a C module
wants to receive a certain predefined encoding, it can
use the new "es" and "es#" parser markers.


Ways to enter Unicode:
----------------------
u'' + string            same as unicode(string,'utf-8')
unicode(string,encname) any supported encoding
u'...unicode-escape...' unicode-escape currently accepts
                        Latin-1 chars as single-char input; using
                        escape sequences any Unicode char can be
                        entered (*)
codecs.open(filename,mode,encname)
                        opens an encoded file for
                        reading and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
                        returns UTF-8 strings based on the input
                        encoding

IO:
---
open(file,'w').write(unicode)
        same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
        same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
        same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
        same as unicode(open(file,'rb').read(),encname)
stdin + stdout
        can be redirected using StreamRecoders to handle any
        of the supported encodings
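E.g. the codecs.open() route in a quick interactive session (test.txt is
just a scratch file):

    >>> import codecs
    >>> f = codecs.open('test.txt', 'wb', 'utf-8')
    >>> f.write(u'\u0123')
    >>> f.close()
    >>> codecs.open('test.txt', 'rb', 'utf-8').read()
    u'\u0123'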

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue May  2 16:27:39 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:27:39 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>
 <390EB1EE.EA557CA9@lemburg.com> <200005021226.IAA24203@eric.cnri.reston.va.us>
Message-ID: <390EF3EB.5BCE9EC3@lemburg.com>

Guido van Rossum wrote:
> 
> [MAL]
> > Let's not do the same mistake again: Unicode objects should *not*
> > be used to hold binary data. Please use buffers instead.
> 
> Easier said than done -- Python doesn't really have a buffer data
> type.  Or do you mean the array module?  It's not trivial to read a
> file into an array (although it's possible, there are even two ways).
> Fact is, most of Python's standard library and built-in objects use
> (8-bit) strings as buffers.
> 
> I agree there's no reason to extend this to Unicode strings.
> 
> > BTW, I think that this behaviour should be changed:
> >
> > >>> buffer('binary') + 'data'
> > 'binarydata'
> >
> > while:
> >
> > >>> 'data' + buffer('binary')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > TypeError: illegal argument type for built-in operation
> >
> > IMHO, buffer objects should never coerce to strings, but instead
> > return a buffer object holding the combined contents. The
> > same applies to slicing buffer objects:
> >
> > >>> buffer('binary')[2:5]
> > 'nar'
> >
> > should preferably be buffer('nar').
> 
> Note that a buffer object doesn't hold data!  It's only a pointer to
> data.  I can't off-hand explain the asymmetry though.

Dang, you're right...
 
> > --
> >
> > Hmm, perhaps we need something like a data string object
> > to get this 100% right ?!
> >
> > >>> d = data("...data...")
> > or
> > >>> d = d"...data..."
> > >>> print type(d)
> > <type 'data'>
> >
> > >>> 'string' + d
> > d"string...data..."
> > >>> u'string' + d
> > d"s\000t\000r\000i\000n\000g\000...data..."
> >
> > >>> d[:5]
> > d"...da"
> >
> > etc.
> >
> > Ideally, string and Unicode objects would then be subclasses
> > of this type in Py3K.
> 
> Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> which no "string literal" notations exist.

Anyway, one way or another I think we should make it clear
to users that they should start using some other type for
storing binary data.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue May  2 16:24:24 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:24:24 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <l03102804b534772fc25b@[193.78.237.142]>
Message-ID: <390EF327.86D8C3D8@lemburg.com>

Just van Rossum wrote:
> 
> At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
> >Just a small note on the subject of a character being atomic
> >which seems to have been forgotten by the discussing parties:
> >
> >Unicode itself can be understood as multi-word character
> >encoding, just like UTF-8. The reason is that Unicode entities
> >can be combined to produce single display characters (e.g.
> >u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> 
> Erm, are you sure Unicode prescribes this behavior, for this
> example? I know similar behaviors are specified for certain
> languages/scripts, but I didn't know it did that for latin.

The details are on the www.unicode.org web-site buried
in some of the tech reports on normalization and
collation.
 
> >Slicing such a combined Unicode string will have the same
> >effect as slicing UTF-8 data.
> 
> Not true. As Fredrik noted: no exception will be raised.

Huh? You will always get an exception when you convert
a broken UTF-8 sequence to Unicode. This is by design
of UTF-8 itself, which uses the top bit to identify
multi-byte character encodings.
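E.g.:

    >>> unicode('spam\xe9', 'utf-8')   # stray top-bit-set byte: not valid UTF-8
    Traceback (most recent call last):
      ...
    UnicodeError: UTF-8 decoding error: ...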

Or can you give an example (perhaps you've found a bug 
that needs fixing) ?

> [ Speaking of exceptions,
> 
> after I sent off my previous post I realized Guido's
> non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
> argument can easily be turned around, backfiring at utf-8:
> 
>     Defaulting to utf-8 when going from Unicode to 8-bit and
>     back only gives the *illusion* things "just work", since it
>     will *silently* "work", even if utf-8 is *not* the desired
>     8-bit encoding -- as shown by Fredrik's excellent "fun with
>     Unicode, part 1" example. Defaulting to Latin-1 will
>     warn the user *much* earlier, since it'll barf when
>     converting a Unicode string that contains any character
>     code > 255. So there.
> ]
> 
> >It seems that most Latin-1 proponents seem to have single
> >display characters in mind. While the same is true for
> >many Unicode entities, there are quite a few cases of
> >combining characters in Unicode 3.0 and the Unicode
> >normalization algorithm uses these as the basis for its
> >work.
> 
> Still, two combining characters are still two input characters for
> the renderer! They may result in one *glyph*, but trust me,
> that's an entirely different can of worms.

No. Please see my other post on the subject...
 
> However, if you'd be talking about Unicode surrogates,
> you'd definitely have a point. How do Java/Perl/Tcl deal with
> surrogates?

Good question... anybody know the answers ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From paul@prescod.net  Tue May  2 17:05:20 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:05:20 -0500
Subject: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil>
Message-ID: <390EFCC0.240BC56B@prescod.net>

Neil, I sincerely appreciate your informed input. I want to emphasize
one ideological difference though. :)

Neil Hodgson wrote:
> 
> ...
>
>    The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8. 

I reject that characterization.

I claim that both strings contain Unicode characters, but one can contain
Unicode characters with higher character codes. UTF-8 versus Latin-1 does not
enter into it. Python strings should not be documented in terms of
encodings any more than Python ints are documented in terms of their
two's complement representation. Then we could describe the default
conversion from integers to floats in terms of their bit-representation.
Ugh!

I accept that the effect is similar to calling Latin-1 the "default", but
that's a side effect of the simple logical model that we are proposing.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From just@letterror.com  Tue May  2 18:33:56 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 18:33:56 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <390EFA7B.F6B622F0@lemburg.com>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <l03102815b534c1763aa8@[193.78.237.142]>

At 5:55 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>[BTW, I'm pretty sure that the Latin-1 folks won't like
>ASCII for the same reason they don't like UTF-8: it's
>simply an inconvenient way to write strings in their favorite
>encoding directly in Python source code. My feeling in this
>whole discussion is that it's more about convenience than
>anything else. Still, it's very amusing ;-) ]

For the record, I don't want Latin-1 because it's my favorite encoding. It
isn't. Guido's right: I can't even *use* it directly on my platform. I want
it *only* because it's the most logical 8-bit subset of Unicode -- as we
have stated over and over and over again. What's so hard to
understand about this?

Just




From paul@prescod.net  Tue May  2 17:11:13 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:11:13 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
Message-ID: <390EFE21.DAD7749B@prescod.net>

Combining characters are a whole 'nother level of complexity. Character
sets are hard. I don't accept the argument that "Unicode itself has
complexities, so that gives us license to introduce even more
complexities at the character representation level."

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"é" should compare equal to u"e\u0301".

That's a whole 'nother debate at a whole 'nother level of abstraction. I
think we need to get the bytes/characters level right and then we can
worry about display-equivalent characters (or leave that to the Python
programmer to figure out...).
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From paul@prescod.net  Tue May  2 17:13:00 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:13:00 -0500
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <l0310280fb5349fd24fc5@[193.78.237.142]> <200005021421.KAA24526@eric.cnri.reston.va.us>
Message-ID: <390EFE8C.4C10473C@prescod.net>

Guido van Rossum wrote:
> 
> ...
>
> But not all 8-bit strings occurring in programs are Unicode.  Ask
> Moshe.

Where are we going? What's our long-range vision?

Three years from now where will we be? 

1. How will we handle characters? 
2. How will we handle bytes?
3. What will unadorned literal strings "do"?
4. Will literal strings be the same type as byte arrays?

I don't see how we can make decisions today without a vision for the
future. I think that this is the central point in our disagreement. Some
of us are aiming for as much compatibility as possible with where we think
we should be going, and others are aiming for as much compatibility as
possible with where we came from.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From just@letterror.com  Tue May  2 18:37:09 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 18:37:09 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390EF327.86D8C3D8@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]>
 <l03102804b534772fc25b@[193.78.237.142]>
Message-ID: <l03102816b534c2476bce@[193.78.237.142]>

At 5:24 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>> Still, two combining characters are still two input characters for
>> the renderer! They may result in one *glyph*, but trust me,
>> that's an entirely different can of worms.
>
>No. Please see my other post on the subject...

It would help if you'd post some actual doco.

Just




From paul@prescod.net  Tue May  2 17:25:33 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:25:33 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>
 <390EA645.89E3B22A@lemburg.com> <200005021230.IAA24232@eric.cnri.reston.va.us>
Message-ID: <390F017C.91C7A8A0@prescod.net>

Guido van Rossum wrote:
> 
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v).  /F's world will collapse. :-)

There are many levels of equality that are interesting. I don't think we
would move to grapheme equivalence until "the rest of the world" (XML,
Java, W3C, SQL) did. 

If we were going to move to grapheme equivalence (some day), the right
way would be to normalize characters in the construction of the Unicode
string. This is known as "Early normalization":

http://www.w3.org/TR/charmod/#NormalizationApplication

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From ping@lfw.org  Tue May  2 17:43:25 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 09:43:25 -0700 (PDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <20000502082751.A1504@mems-exchange.org>
Message-ID: <Pine.LNX.4.10.10005020939050.522-100000@localhost>

On Tue, 2 May 2000, Greg Ward wrote:
> >         In the common interactive case that the file
> >         is a typed-in string, the current printout is
> >         
> >             File "<stdin>", line 1
> >         
> >         and the following is easier to read in my opinion:
> > 
> >             Line 1 of <stdin>
> 
> OK, that's a good reason.  Maybe you could special-case the "<stdin>"
> case?

...and "<string>", and "<console>", and perhaps others... ?

    File "<string>", line 3

just looks downright clumsy the first time you see it.
(Well, it still looks kinda clumsy to me or i wouldn't be
proposing the change.)

Can someone verify the already-parseable-by-Emacs claim, and
describe how you get Emacs to do something useful with bits
of traceback?  (Alas, i'm not an Emacs user, so understanding
just how the current format is useful would help.)


-- ?!ng



From bwarsaw@python.org  Tue May  2 18:13:03 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Tue, 2 May 2000 13:13:03 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
 <m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
Message-ID: <14607.3231.115841.262068@anthem.cnri.reston.va.us>

>>>>> "PF" == Peter Funk <pf@artcom-gmbh.de> writes:

    PF> I suggest an additional note saying that this signature has
    PF> been added in Python 1.6.  There used to be several such notes
    PF> all over the documentation saying for example: "New in version
    PF> 1.5.2." which I found very useful in the past!

Good point.  Fred, what is the Right Way to do this?

-Barry


From bwarsaw@python.org  Tue May  2 18:16:22 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Tue, 2 May 2000 13:16:22 -0400 (EDT)
Subject: [Python-Dev] Traceback style
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
 <20000502082751.A1504@mems-exchange.org>
Message-ID: <14607.3430.941026.496225@anthem.cnri.reston.va.us>

I concur with Greg's scores.


From guido@python.org  Tue May  2 18:22:02 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 13:22:02 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 08:27:51 EDT."
 <20000502082751.A1504@mems-exchange.org>
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
 <20000502082751.A1504@mems-exchange.org>
Message-ID: <200005021722.NAA25854@eric.cnri.reston.va.us>

> On 02 May 2000, Ka-Ping Yee said:
> > I propose the following stylistic changes to traceback
> > printing:
> > 
> >     1.  If there is no function name for a given level
> >         in the traceback, just omit the ", in ?" at the
> >         end of the line.

Greg Ward expresses my sentiments:

> +0 on this: it doesn't really add anything, but it does neaten things
> up.
> 
> >     2.  If a given level of the traceback is in a method,
> >         instead of just printing the method name, print
> >         the class and the method name.
> 
> +1 here too: this definitely adds utility.
> 
> >     3.  Instead of beginning each line with:
> >         
> >             File "foo.py", line 5
> > 
> >         print the line first and drop the quotes:
> > 
> >             Line 5 of foo.py
> 
> -0: adds nothing, cleans nothing up, and just generally breaks things
> for no good reason.
> 
> >         In the common interactive case that the file
> >         is a typed-in string, the current printout is
> >         
> >             File "<stdin>", line 1
> >         
> >         and the following is easier to read in my opinion:
> > 
> >             Line 1 of <stdin>
> 
> OK, that's a good reason.  Maybe you could special-case the "<stdin>"
> case?  How about
> 
>    <stdin>, line 1
> 
> ?

I'd special-case any filename that starts with < and ends with > --
those are all made-up names like <string> or <stdin>.  You can display
them however you like, perhaps

  In "<string>", line 3

For regular files I'd leave the formatting alone -- there are tools
out there that parse these.  (E.g. Emacs' Python mode jumps to the
line with the error if you run a file and it begets an exception.)
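I.e. something like this (nothing checked in anywhere, and the exact
format strings are up for grabs):

    def format_location(filename, lineno):
        # made-up filenames like <stdin> or <string> get the short form
        if filename[:1] == '<' and filename[-1:] == '>':
            return '  In "%s", line %d' % (filename, lineno)
        else:
            return '  File "%s", line %d' % (filename, lineno)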

--Guido van Rossum (home page: http://www.python.org/~guido/)


From tree@basistech.com  Tue May  2 18:14:24 2000
From: tree@basistech.com (Tom Emerson)
Date: Tue, 2 May 2000 13:14:24 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390EF327.86D8C3D8@lemburg.com>
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]>
 <l03102804b534772fc25b@[193.78.237.142]>
 <390EF327.86D8C3D8@lemburg.com>
Message-ID: <14607.3312.660077.42872@cymru.basistech.com>

M.-A. Lemburg writes:
 > The details are on the www.unicode.org web-site buried
 > in some of the tech reports on normalization and
 > collation.

This is described in the Unicode standard itself, and in UTR #15 and
UTR #10. Normalization is an issue with wider implications than just
handling glyph variants: indeed, glyph handling is irrelevant here.

The question is this: should

U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

compare equal to

U+0055 LATIN CAPITAL LETTER U
U+0308 COMBINING DIAERESIS

or not? It depends on the application. Certainly in a database system
I would want these to compare equal.

Perhaps normalization form needs to be an option of the string comparator?
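Something along these lines, perhaps, where normalize stands in for
whatever NFC/NFD implementation gets plugged in (purely hypothetical):

    def unicmp(u, v, normalize=None):
        # compare two Unicode strings, optionally normalizing both first
        if normalize is not None:
            u = normalize(u)
            v = normalize(v)
        return cmp(u, v)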

        -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From bwarsaw@python.org  Tue May  2 18:51:17 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Tue, 2 May 2000 13:51:17 -0400 (EDT)
Subject: [Python-Dev] Traceback style
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
 <20000502082751.A1504@mems-exchange.org>
 <200005021722.NAA25854@eric.cnri.reston.va.us>
Message-ID: <14607.5525.160379.760452@anthem.cnri.reston.va.us>

>>>>> "GvR" == Guido van Rossum <guido@python.org> writes:

    GvR> For regular files I'd leave the formatting alone -- there are
    GvR> tools out there that parse these.  (E.g. Emacs' Python mode
    GvR> jumps to the line with the error if you run a file and it
    GvR> begets an exception.)

py-traceback-line-re is what matches those lines.  Its current
definition is

(defconst py-traceback-line-re
  "[ \t]+File \"\\([^\"]+\\)\", line \\([0-9]+\\)"
  "Regular expression that describes tracebacks.")

There are probably also gud.el (and maybe compile.el) regexps that
need to be changed too.  I'd rather see something that outputs the
same regardless of whether it's a real file, or something "fake".
Something like

Line 1 of <stdin>
Line 12 of foo.py

should be fine.  I'm not crazy about something like

File "foo.py", line 12
In <stdin>, line 1

-Barry


From fdrake@acm.org  Tue May  2 18:59:43 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 2 May 2000 13:59:43 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: <14607.3231.115841.262068@anthem.cnri.reston.va.us>
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
 <m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
 <14607.3231.115841.262068@anthem.cnri.reston.va.us>
Message-ID: <14607.6031.770981.424012@seahag.cnri.reston.va.us>

bwarsaw@python.org writes:
 > Good point.  Fred, what is the Right Way to do this?

  Pester me night and day until it gets done (email only!).
  Unless of course you've already seen the check-in messages.  ;)


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From bwarsaw@python.org  Tue May  2 19:05:00 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Tue, 2 May 2000 14:05:00 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
 <m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
 <14607.3231.115841.262068@anthem.cnri.reston.va.us>
 <14607.6031.770981.424012@seahag.cnri.reston.va.us>
Message-ID: <14607.6348.453682.219847@anthem.cnri.reston.va.us>

>>>>> "Fred" == Fred L Drake, Jr <fdrake@acm.org> writes:

    Fred>   Pester me night and day until it gets done (email only!).

Okay, I'll cancel the daily delivery of angry rabid velcro monkeys.

    Fred> Unless of course you've already seen the check-in messages.
    Fred> ;)

Saw 'em.  Thanks.
-Barry


From: <gvwilson@nevex.com>
Date: Tue, 2 May 2000 14:04:56 -0400 (EDT)
To: sc-discuss@software-carpentry.com,
	sc-announce@software-carpentry.com,
	sc-publicity@software-carpentry.com
Subject: [Python-Dev] Software Carpentry Design Competition Finalists
Message-ID: <Pine.LNX.4.10.10005021403560.30804-100000@akbar.nevex.com>

		Software Carpentry Design Competition

			 First-Round Results

		  http://www.software-carpentry.com

			     May 2, 2000

The Software Carpentry Project is pleased to announce the selection of
finalists in its first Open Source Design Competition.  There were
many strong entries, and we would like to thank everyone who took the
time to participate.

We would also like to invite everyone who has been involved to contact
the teams listed below, and see if there is any way to collaborate in
the second round.  Many of you had excellent ideas that deserve to be
in the final tools, and the more involved you are in discussions over
the next two months, the easier it will be for you to take part in the
ensuing implementation effort.

The 12 entries that are going forward in the "Configuration", "Build",
and "Track" categories are listed below (in alphabetical order).  The
four prize-winning entries in the "Test" category are also listed, but
as is explained there, we are putting this section of the competition
on hold for a couple of months while we try to refine the requirements.
You can inspect these entries on-line at:

         http://www.software-carpentry.com/first-round.html

And so, without further ado...


== Configuration

The final four entries in the "Configuration" category are:

* BuildConf     Vassilis Virvilis

* ConfBase      Stefan Knappmann

* SapCat        Lindsay Todd

* Tan           David Ascher


== Build

The finalists in the "Build" category are:

* Black         David Ascher and Trent Mick

* PyMake        Rich Miller

* ScCons        Steven Knight

* Tromey        Tom Tromey

Honorable mentions in this category go to:

* Forge         Bill Bitner, Justin Patterson, and Gilbert Ramirez

* Quilt         David Lamb


== Track

The four entries to go forward in the "Track" category are:

* Egad          John Martin

* K2            David Belfer-Shevett

* Roundup       Ka-Ping Yee

* Tracker       Ken Manheimer

There is also an honorable mention for:

* TotalTrack    Alex Samuel, Mark Mitchell


== Test

This category was the most difficult one for the judges. First-round
prizes are being awarded to

* AppTest         Linda Timberlake

* TestTalk        Chang Liu

* Thomas          Patrick Campbell-Preston

* TotalQuality    Alex Samuel, Mark Mitchell

However, the judges did not feel that any of these tools would have an
impact on Open Source software development in general, or scientific
and numerical programming in particular.  This is due in large part to
the vagueness of the posted requirements, for which the project
coordinator (Greg Wilson) accepts full responsibility.

We will therefore not be going forward with this category at the
present time.  Instead, the judges and others will develop narrower,
more specific requirements, guidelines, and expectations.  The
category will be re-opened in July 2000.


== Contact

The aim of the Software Carpentry project is to create a new generation of
easy-to-use software engineering tools, and to document both those tools
and the working practices they are meant to support.  The Advanced
Computing Laboratory at Los Alamos National Laboratory is providing
$860,000 of funding for Software Carpentry, which is being administered by
Code Sourcery, LLC.  For more information, contact the project
coordinator, Dr. Gregory V. Wilson, at 'gvwilson@software-carpentry.com',
or on +1 (416) 504 2325 ext. 229.


== Footnote: Entries from CodeSourcery, LLC

Two entries (TotalTrack and TotalQuality) were received from employees
of CodeSourcery, LLC, the company which is hosting the Software
Carpentry web site.  We discussed this matter with Dr. Rod Oldehoeft,
Deputy Director of the Advanced Computing Laboratory at Los Alamos
National Laboratory.  His response was:

    John Reynders [Director of the ACL] and I have discussed this
    matter.  We agree that since the judges who make decisions
    are not affiliated with Code Sourcery, there is no conflict of
    interest. Code Sourcery gains no advantage by hosting the
    Software Carpentry web pages.  Please continue evaluating all
    the entries on their merits, and choose the best for further
    eligibility.

Note that the project coordinator, Greg Wilson, is neither employed by
CodeSourcery, nor a judge in the competition.



From paul@prescod.net  Tue May  2 19:23:24 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 13:23:24 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <390F1D1C.6EAF7EAD@prescod.net>

Guido van Rossum wrote:
> 
> ....
> 
> Have you tried using this?

Yes. I haven't had large problems with it.

As long as you know what is going on, it doesn't usually hurt anything
because you can just explicitly set up the decoding you want. It's like
the int division problem. You get bitten a few times and then get
careful.

It's the naive user who will be surprised by these random UTF-8 decoding
errors. 

That's why this is NOT a convenience issue (are you listening MAL???).
It's a short and long term simplicity issue. There are lots of languages
where it is de rigueur to discover and work around inconvenient and
confusing default behaviors. I just don't think that we should be ADDING
such behaviors.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From guido@python.org  Tue May  2 19:56:34 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 14:56:34 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 13:23:24 CDT."
 <390F1D1C.6EAF7EAD@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>
 <390F1D1C.6EAF7EAD@prescod.net>
Message-ID: <200005021856.OAA26104@eric.cnri.reston.va.us>

> It's the naive user who will be surprised by these random UTF-8 decoding
> errors. 
> 
> That's why this is NOT a convenience issue (are you listening MAL???).
> It's a short and long term simplicity issue. There are lots of languages
> where it is de rigueur to discover and work around inconvenient and
> confusing default behaviors. I just don't think that we should be ADDING
> such behaviors.

So what do you think of my new proposal of using ASCII as the default
"encoding"?  It takes care of "a character is a character" but also
(almost) guarantees an error message when mixing encoded 8-bit strings
with Unicode strings without specifying an explicit conversion --
*any* 8-bit byte with the top bit set is rejected by the default
conversion to Unicode.

I think this is less confusing than Latin-1: when an unsuspecting user
is reading encoded text from a file into 8-bit strings and attempts to
use it in a Unicode context, an error is raised instead of producing
garbage Unicode characters.

It encourages the use of Unicode strings for everything beyond ASCII
-- there's no way around ASCII since that's the source encoding etc.,
but Latin-1 is an inconvenient default in most parts of the world.
ASCII is accepted everywhere as the base character set (e.g. for
email and for text-based protocols like FTP and HTTP), just like
English is the one natural language that we can all sue to communicate
(to some extent).
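
A sketch of the intended behavior (hypothetical session -- the exact
exception type and message are not decided here):

    >>> u"abc" + "def"         # 8-bit side is pure ASCII: works
    u'abcdef'
    >>> u"abc" + "d\351f"      # byte with the top bit set: rejected
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII decoding error: ordinal not in range(128)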

--Guido van Rossum (home page: http://www.python.org/~guido/)


From dieter@handshake.de  Tue May  2 19:44:41 2000
From: dieter@handshake.de (Dieter Maurer)
Date: Tue,  2 May 2000 20:44:41 +0200 (CEST)
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390E1F08.EA91599E@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <390E1F08.EA91599E@prescod.net>
Message-ID: <14607.7798.510723.419556@lindm.dm>

Paul Prescod writes:
 > The fact that my proposal has the same effect as making Latin-1 the
 > "default encoding" is a near-term side effect of the definition of
 > Unicode. My long term proposal is to do away with the concept of 8-bit
 > strings (and thus, conversions from 8-bit to Unicode) altogether. One
 > string to rule them all!
Why must this be a long term proposal?

I would find it quite attractive, when
 * the old string type became an immutable list of bytes
 * automatic conversion between byte lists and unicode strings 
   were performed via user customizable conversion functions
   (a la __import__).

Dieter


From paul@prescod.net  Tue May  2 20:01:32 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 14:01:32 -0500
Subject: [Python-Dev] Unicode compromise?
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <390F260C.2314F97E@prescod.net>

Guido van Rossum wrote:
> 
> >     No automatic conversions between 8-bit "strings" and Unicode strings.
> >
> > If you want to turn UTF-8 into a Unicode string, say so.
> > If you want to turn Latin-1 into a Unicode string, say so.
> > If you want to turn ISO-2022-JP into a Unicode string, say so.
> > Adding a Unicode string and an 8-bit "string" gives an exception.
> 
> I'd accept this, with one change: mixing Unicode and 8-bit strings is
> okay when the 8-bit strings contain only ASCII (byte values 0 through
> 127).  

I could live with this compromise as long as we document that a future
version may use the "character is a character" model. I just don't want
people to start depending on a catchable exception being thrown because
that would stop us from ever unifying unmarked literal strings and
Unicode strings.

--

Are there any steps we could take to make a future divorce of strings
and byte arrays easier? What if we added a 

binary_read()

function that returns some form of byte array. The byte array type could
be just like today's string type except that its type object would be
distinct, it wouldn't have as many string-ish methods and it wouldn't
have any auto-conversion to Unicode at all.

People could start to transition code that reads non-ASCII data to the
new function. We could put big warning labels on read() to state that it
might not always be able to read data that is not in some small set of
recognized encodings (probably UTF-8 and UTF-16).

Or perhaps binary_open(). Or perhaps both.

I do not suggest just using the text/binary flag on the existing open
function because we cannot immediately change its behavior without
breaking code.
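
Just to make the shape of the proposal concrete, a rough sketch (the name
binary_read and the choice of a byte array are only illustrative):

    import array

    def binary_read(f, n):
        # read n bytes and hand them back as a byte array, not a string,
        # so no implicit conversion to Unicode can ever kick in
        return array.array('B', f.read(n))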

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From jkraai@murlmail.com  Tue May  2 20:46:49 2000
From: jkraai@murlmail.com (jkraai@murlmail.com)
Date: Tue, 2 May 2000 14:46:49 -0500
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
Message-ID: <200005021946.OAA03609@www.polytopic.com>

The ever quotable Guido:
> English is the one natural language that we can all sue to communicate





From paul@prescod.net  Tue May  2 20:23:27 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 14:23:27 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>
 <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>
Message-ID: <390F2B2F.2953C72D@prescod.net>

Guido van Rossum wrote:
> 
> ...
> 
> So what do you think of my new proposal of using ASCII as the default
> "encoding"?  

I can live with it. I am mildly uncomfortable with the idea that I could
write a whole bunch of software that works great until some European
inserts one of their name characters. Nevertheless, being hard-assed is
better than being permissive because we can loosen up later.

What do we do about str( my_unicode_string )? Perhaps escape the Unicode
characters with backslashed numbers?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From guido@python.org  Tue May  2 20:58:20 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 15:58:20 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 14:23:27 CDT."
 <390F2B2F.2953C72D@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>
 <390F2B2F.2953C72D@prescod.net>
Message-ID: <200005021958.PAA26760@eric.cnri.reston.va.us>

[me]
> > So what do you think of my new proposal of using ASCII as the default
> > "encoding"?  

[Paul]
> I can live with it. I am mildly uncomfortable with the idea that I could
> write a whole bunch of software that works great until some European
> inserts one of their name characters.

Better that than when some Japanese insert *their* name characters and
it produces gibberish instead.

> Nevertheless, being hard-assed is
> better than being permissive because we can loosen up later.

Exactly -- just as nobody should *count* on 10**10 raising
OverflowError, nobody (except maybe parts of the standard library :-)
should *count* on unicode("\347") raising ValueError.  I think that's
fine.

> What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> characters with backslashed numbers?

Hm, good question.  Tcl displays unknown characters as \x or \u
escapes.  I think this may make more sense than raising an error.

But there must be a way to turn on Unicode-awareness on e.g. stdout
and then printing a Unicode object should not use str() (as it
currently does).
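
For example, something along these lines (a sketch only; the wrapper class
and its name are made up):

    import sys

    class EncodedFile:
        # wrap a file object and encode Unicode objects on the way out
        def __init__(self, file, encoding):
            self.file = file
            self.encoding = encoding
        def write(self, s):
            if type(s) is type(u""):
                s = s.encode(self.encoding)
            self.file.write(s)

    sys.stdout = EncodedFile(sys.stdout, "utf-8")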

--Guido van Rossum (home page: http://www.python.org/~guido/)


From trentm@activestate.com  Tue May  2 21:47:17 2000
From: trentm@activestate.com (Trent Mick)
Date: Tue, 2 May 2000 13:47:17 -0700
Subject: [Python-Dev] Cannot declare the largest integer literal.
Message-ID: <20000502134717.A16825@activestate.com>

>>> i = -2147483648
OverflowError: integer literal too large
>>> i = -2147483648L
>>> int(i)   # it *is* a valid integer literal
-2147483648


As far as I traced back:

Python/compile.c::com_atom() calls
Python/compile.c::parsenumber(s = "2147483648") calls
Python/mystrtoul.c::PyOS_strtol() which

returns the ERANGE errno because it is given 2147483648 (which *is* out of
range) rather than -2147483648.


My question: Why is the minus sign not considered part of the "atom", i.e.
the integer literal? Should it be? PyOS_strtol() can properly parse this
integer literal if it is given the whole number with the minus sign.
Otherwise the special case largest negative number will always erroneously be
considered out of range.

I don't know how the tokenizer works in Python. Was there a design decision
to separate the integer literal and the leading sign? And was the effect on
functions like PyOS_strtol() down the pipe missed?


Trent

--
Trent Mick
trentm@activestate.com








From guido@python.org  Tue May  2 21:47:30 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 16:47:30 -0400
Subject: [Python-Dev] Unicode compromise?
In-Reply-To: Your message of "Tue, 02 May 2000 14:01:32 CDT."
 <390F260C.2314F97E@prescod.net>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
 <390F260C.2314F97E@prescod.net>
Message-ID: <200005022047.QAA26828@eric.cnri.reston.va.us>

> I could live with this compromise as long as we document that a future
> version may use the "character is a character" model. I just don't want
> people to start depending on a catchable exception being thrown because
> that would stop us from ever unifying unmarked literal strings and
> Unicode strings.

Agreed (as I've said before).

> --
> 
> Are there any steps we could take to make a future divorce of strings
> and byte arrays easier? What if we added a 
> 
> binary_read()
> 
> function that returns some form of byte array. The byte array type could
> be just like today's string type except that its type object would be
> distinct, it wouldn't have as many string-ish methods and it wouldn't
> have any auto-conversion to Unicode at all.

You can do this now with the array module, although clumsily:

  >>> import array
  >>> f = open("/core", "rb")
  >>> a = array.array('B', [0]) * 1000
  >>> f.readinto(a)
  1000
  >>>

Or if you wanted to read raw Unicode (UTF-16):

  >>> a = array.array('H', [0]) * 1000
  >>> f.readinto(a)
  2000
  >>> u = unicode(a, "utf-16")
  >>> 

There are some performance issues, e.g. you have to initialize the
buffer somehow and that seems a bit wasteful.

> People could start to transition code that reads non-ASCII data to the
> new function. We could put big warning labels on read() to state that it
> might not always be able to read data that is not in some small set of
> recognized encodings (probably UTF-8 and UTF-16).
> 
> Or perhaps binary_open(). Or perhaps both.
> 
> I do not suggest just using the text/binary flag on the existing open
> function because we cannot immediately change its behavior without
> breaking code.

A new method makes most sense -- there are definitely situations where
you want to read in text mode for a while and then switch to binary
mode (e.g. HTTP).

I'd like to put this off until after Python 1.6 -- but it deserves
attention.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From trentm@activestate.com  Wed May  3 00:03:22 2000
From: trentm@activestate.com (Trent Mick)
Date: Tue, 2 May 2000 16:03:22 -0700
Subject: [Python-Dev] PROPOSAL: exposure of values in limits.h and float.h
Message-ID: <20000502160322.A19101@activestate.com>

I apologize if I am hitting covered ground. What about a module (called
limits or something like that) that would expose some appropriate #define's
in limits.h and float.h.

For example:

limits.FLT_EPSILON could expose the C DBL_EPSILON
limits.FLT_MAX could expose the C DBL_MAX
limits.INT_MAX could expose the C LONG_MAX (although that particular name
would cause confusion with the actual C INT_MAX)


- Does this kind of thing already exist somewhere? Maybe in NumPy.

- If we ever (perhaps in Py3K) turn the basic types into classes then these
  could turn into constant attributes of those classes, i.e.:
  f = 3.14159
  f.EPSILON = <as set by C's DBL_EPSILON>

- I thought of these values being useful when I thought of comparing two
  floats for equality. Doing a straight comparison of floats is
  dangerous/wrong but is it not okay to consider two floats reasonably equal
  iff:
  	-EPSILON < float2 - float1 < EPSILON
  Or maybe that should be two or three EPSILONs. It has been a while since
  I've done any numerical analysis stuff.

  I suppose the answer to my question is: "It depends on the situation."
  Could this algorithm for float comparison be a better default than the
  status quo? I know that Mark H. and others have suggested that Python
  should maybe not provide a float comparison operator at all to beginners.



Trent

--
Trent Mick
trentm@activestate.com



From mal@lemburg.com  Wed May  3 00:11:37 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 01:11:37 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>
 <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us>
Message-ID: <390F60A9.A3AA53A9@lemburg.com>

Guido van Rossum wrote:
> 
> > > So what do you think of my new proposal of using ASCII as the default
> > > "encoding"?

How about using unicode-escape or raw-unicode-escape as
default encoding ? (They would have to be adapted to disallow
Latin-1 char input, though.)

The advantage would be that they are compatible with ASCII
while still providing loss-less conversion and since they
use escape characters, you can even read them using an
ASCII based editor.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mhammond@skippinet.com.au  Wed May  3 00:12:18 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 3 May 2000 09:12:18 +1000
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <20000502134717.A16825@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBCEBGCKAA.mhammond@skippinet.com.au>

> >>> i = -2147483648
> OverflowError: integer literal too large
> >>> i = -2147483648L
> >>> int(i)   # it *is* a valid integer literal
> -2147483648

I struck this years ago!  At the time, the answer was "yes, it's an
implementation flaw that's not worth fixing".

Interestingly, it _does_ work as a hex literal:

>>> 0x80000000
-2147483648
>>> -2147483648
OverflowError: integer literal too large
>>>

Mark.



From mal@lemburg.com  Wed May  3 00:05:28 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 01:05:28 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net>
Message-ID: <390F5F38.DD76CAF4@lemburg.com>

Paul Prescod wrote:
> 
> Combining characters are a whole 'nother level of complexity. Character
> sets are hard. I don't accept that the argument that "Unicode itself has
> complexities so that gives us license to introduce even more
> complexities at the character representation level."
> 
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
> 
> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> think we need to get the bytes/characters level right and then we can
> worry about display-equivalent characters (or leave that to the Python
> programmer to figure out...).

I just wanted to point out that the argument "slicing doesn't
work with UTF-8" is moot.

I do see a point against UTF-8 auto-conversion given the example
that Guido mailed me:

"""
s = 'ab\341\210\264def'        # == str(u"ab\u1234def")
s.find(u"def")

This prints 3 -- the wrong result since "def" is found at s[5:8], not
at s[3:6].
"""

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From tim_one@email.msn.com  Wed May  3 03:20:20 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 2 May 2000 22:20:20 -0400
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <20000502134717.A16825@activestate.com>
Message-ID: <000001bfb4a6$21da7900$922d153f@tim>

[Trent Mick]
> >>> i = -2147483648
> OverflowError: integer literal too large
> >>> i = -2147483648L
> >>> int(i)   # it *is* a valid integer literal
> -2147483648

Python's grammar is such that negative integer literals don't exist; what
you actually have there is the unary minus operator applied to positive
integer literals; indeed,

>>> def f():
	return -42

>>> import dis
>>> dis.dis(f)
          0 SET_LINENO               1

          3 SET_LINENO               2
          6 LOAD_CONST               1 (42)
          9 UNARY_NEGATIVE
         10 RETURN_VALUE
         11 LOAD_CONST               0 (None)
         14 RETURN_VALUE
>>>

Note that, at runtime, the example loads +42, then negates it:  this wart
has deep roots!

> ...
> And was the effect on functions like PyOS_strtol() down the pipe
> missed?

More that it was considered an inconsequential endcase.  It's sure not worth
changing the grammar for <wink>.  I'd rather see Python erase the visible
distinction between ints and longs.




From guido@python.org  Wed May  3 03:31:21 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 22:31:21 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 01:11:37 +0200."
 <390F60A9.A3AA53A9@lemburg.com>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us>
 <390F60A9.A3AA53A9@lemburg.com>
Message-ID: <200005030231.WAA02678@eric.cnri.reston.va.us>

> Guido van Rossum wrote:
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?

[MAL]
> How about using unicode-escape or raw-unicode-escape as
> default encoding ? (They would have to be adapted to disallow
> Latin-1 char input, though.)
> 
> The advantage would be that they are compatible with ASCII
> while still providing loss-less conversion and since they
> use escape characters, you can even read them using an
> ASCII based editor.

No, the backslash should mean itself when encoding from ASCII to
Unicode.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From esr@thyrsus.com  Wed May  3 04:22:20 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Tue, 2 May 2000 23:22:20 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <390EFE8C.4C10473C@prescod.net>; from paul@prescod.net on Tue, May 02, 2000 at 11:13:00AM -0500
References: <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <l0310280fb5349fd24fc5@[193.78.237.142]> <200005021421.KAA24526@eric.cnri.reston.va.us> <390EFE8C.4C10473C@prescod.net>
Message-ID: <20000502232220.B18638@thyrsus.com>

Paul Prescod <paul@prescod.net>:
> Where are we going? What's our long-range vision?
> 
> Three years from now where will we be? 
> 
> 1. How will we handle characters? 
> 2. How will we handle bytes?
> 3. What will unadorned literal strings "do"?
> 4. Will literal strings be the same type as byte arrays?
> 
> I don't see how we can make decisions today without a vision for the
> future. I think that this is the central point in our disagreement. Some
> of us are aiming for as much compatibility with where we think we should
> be going and others are aiming for as much compatibility as possible
> with where we came from.

And *that* is the most insightful statement I have seen in this entire 
foofaraw (which I have carefully been staying right the hell out of). 

Everybody meditate on the above, please.  Then declare your objectives *at
this level* so our Fearless Leader can make an informed decision *at this
level*.  Only then will it make sense to argue encoding theology...
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

"Extremism in the defense of liberty is no vice; moderation in the
pursuit of justice is no virtue."
	-- Barry Goldwater (actually written by Karl Hess)


From tim_one@email.msn.com  Wed May  3 06:05:59 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:05:59 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us>
Message-ID: <000301bfb4bd$463ec280$622d153f@tim>

[Guido]
> When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> bytes in either should make the comparison fail; when ordering is
> important, we can make an arbitrary choice e.g. "\377" < u"\200".

[Toby]
> I assume 'fail' means 'non-equal', rather than 'raises an exception'?

[Guido]
> Yes, sorry for the ambiguity.

Huh!  You sure about that?  If we're setting up a case where meaningful
comparison is impossible, isn't an exception more appropriate?  The current

>>> 83479278 < "42"
1
>>>

probably traps more people than it helps.




From tim_one@email.msn.com  Wed May  3 06:19:28 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:19:28 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <017d01bfb3bc$c3734c00$34aab5d4@hagrid>
Message-ID: <000401bfb4bf$27ec1600$622d153f@tim>

[Fredrik Lundh]
> ...
> (if you like, I can post more "fun with unicode" messages ;-)

By all means!  Exposing a gotcha to ridicule does more good than a dozen
abstract arguments.  But next time stoop to explaining what it is that's
surprising <wink>.




From just@letterror.com  Wed May  3 07:47:07 2000
From: just@letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 07:47:07 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390F5F38.DD76CAF4@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <390EFE21.DAD7749B@prescod.net>
Message-ID: <l03102800b53572ee87ad@[193.78.237.142]>

[MAL vs. PP]
>> > FYI: Normalization is needed to make comparing Unicode
>> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
>>
>> That's a whole 'nother debate at a whole 'nother level of abstraction. I
>> think we need to get the bytes/characters level right and then we can
>> worry about display-equivalent characters (or leave that to the Python
>> programmer to figure out...).
>
>I just wanted to point out that the argument "slicing doesn't
>work with UTF-8" is moot.

And failed...

I asked two Unicode gurus I happen to know about the normalization issue
(which is indeed not relevant to the current discussion, but it's
fascinating nevertheless!).

(Sorry about the possibly wrong email encoding... "è" is u"\350", "ö" is
u"\366")

John Jenkins replied:
"""
Well, I'm not sure you want to hear the answer -- but it really depends on
what the language is attempting to do.

By and large, Unicode takes the position that "e`" should always be treated
the same as "è". This is a *semantic* equivalence -- that is, they *mean*
the same thing -- and doesn't depend on the display engine to be true.
Unicode also provides a default collation algorithm
(http://www.unicode.org/unicode/reports/tr10/).

At the same time, the standard acknowledges that in real life, string
comparison and collation are complicated, language-specific problems
requiring a lot of work and interaction with the user to do right.

From the perspective of a programming language, it would best be served IMHO
by implementing the contents of TR10 for string comparison and collation.
That would make "e`" and "è" come out as equivalent.
"""


Dave Opstad replied:
"""
Unicode talks about "canonical decomposition" in order to make it easier
to answer questions like yours. Specifically, in the Unicode 3.0
standard, rule D24 in section 3.6 (page 44) states that:

"Two character sequences are said to be canonical equivalents if their
full canonical decompositions are identical. For example, the sequences
<o, combining-diaeresis> and <ö> are canonical equivalents. Canonical
equivalence is a Unicode property. It should not be confused with
language-specific collation or matching, which may add additional
equivalencies."

So they still have language-specific differences, even if Unicode sees
them as canonically equivalent.

You might want to check this out:

http://www.unicode.org/unicode/reports/tr15/tr15-18.html

It's the latest technical report on these issues, which may help clarify
things further.
"""


It's very deep stuff, which seems more appropriate for an extension than
for builtin comparisons to me.

Just




From tim_one@email.msn.com  Wed May  3 06:47:37 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:47:37 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>
Message-ID: <000501bfb4c3$16743480$622d153f@tim>

[Moshe Zadka]
> ...
> I'd much prefer Python to reflect a fundamental truth about Unicode,
> which at least makes sure binary-goop can pass through Unicode and
> remain unharmed, then to reflect a nasty problem with UTF-8 (not
> everything is legal).

Then you don't want Unicode at all, Moshe.  All the official encoding
schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff
is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of
Unicode not yet having assigned a character to this position, it's that the
standard explicitly makes this sequence illegal and guarantees it will
always be illegal!  the other place this comes up is with surrogates, where
what's legal depends on both parts of a character pair; and, again, the
illegalities here are guaranteed illegal for all time).  UCS-4 is the
closest thing to binary-transparent Unicode encodings get, but even there
the length of a thing is constrained to be a multiple of 4 bytes.  Unicode
and binary goop will never coexist peacefully.




From ping@lfw.org  Wed May  3 06:56:12 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 22:56:12 -0700 (PDT)
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <000301bfb4bd$463ec280$622d153f@tim>
Message-ID: <Pine.LNX.4.10.10005022249330.522-100000@localhost>

On Wed, 3 May 2000, Tim Peters wrote:
> [Toby]
> > I assume 'fail' means 'non-equal', rather than 'raises an exception'?
> 
> [Guido]
> > Yes, sorry for the ambiguity.
> 
> Huh!  You sure about that?  If we're setting up a case where meaningful
> comparison is impossible, isn't an exception more appropriate?  The current
> 
> >>> 83479278 < "42"
> 1
> 
> probably traps more people than it helps.

Yeah, when i said

    No automatic conversions between Unicode strings and 8-bit "strings".

i was about to say

    Raise an exception on any operation attempting to combine or
    compare Unicode strings and 8-bit "strings".

...and then i thought, oh crap, but everything in Python is supposed
to be comparable.

What happens when you have some lists with arbitrary objects in them
and you want to sort them for printing, or to canonicalize them so
you can compare?  It might be too troublesome for list.sort() to
throw an exception because e.g. strings and ints were incomparable,
or 8-bit "strings" and Unicode strings were incomparable...

So -- what's the philosophy, Guido?  Are we committed to "everything
is comparable" (well, "all built-in types are comparable") or not?


-- ?!ng



From tim_one@email.msn.com  Wed May  3 07:40:54 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 02:40:54 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <000701bfb4ca$87b765c0$622d153f@tim>

[MAL]
> I just wanted to point out that the argument "slicing doesn't
> work with UTF-8" is moot.

[Just]
> And failed...

He succeeded for me.  Blind slicing doesn't always "work right" no matter
what encoding you use, because "work right" depends on semantics beyond the
level of encoding.  UTF-8 is no worse than anything else in this respect.




From just@letterror.com  Wed May  3 08:50:11 2000
From: just@letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 08:50:11 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <000701bfb4ca$87b765c0$622d153f@tim>
References: <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <l03102804b5358971d413@[193.78.237.152]>

[MAL]
> I just wanted to point out that the argument "slicing doesn't
> work with UTF-8" is moot.

[Just]
> And failed...

[Tim]
>He succeeded for me.  Blind slicing doesn't always "work right" no matter
>what encoding you use, because "work right" depends on semantics beyond the
>level of encoding.  UTF-8 is no worse than anything else in this respect.

But the discussion *was* at the level of encoding! Still it is worse, since
an arbitrary utf-8 slice may result in two illegal strings -- slicing "e`"
results in two perfectly legal strings, at the encoding level. Had he used
surrogates as an example, he would've been right... (But even that is an
encoding issue.)
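
To make that concrete (reusing Guido's example string; the exact error
message is a guess):

    >>> s = 'ab\341\210\264def'      # UTF-8 for u"ab\u1234def"
    >>> s[:4]                        # cuts the three-byte sequence in half
    'ab\341\210'
    >>> unicode(s[:4], "utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: UTF-8 decoding error: unexpected end of data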

Just




From tim_one@email.msn.com  Wed May  3 08:11:12 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 03:11:12 -0400
Subject: [Python-Dev] PROPOSAL: exposure of values in limits.h and float.h
In-Reply-To: <20000502160322.A19101@activestate.com>
Message-ID: <000801bfb4ce$c361ea60$622d153f@tim>

[Trent Mick]
> I apologize if I am hitting covered ground. What about a module (called
> limits or something like that) that would expose some appropriate
> #define's
> in limits.h and float.h.

I personally have little use for these.

> For example:
>
> limits.FLT_EPSILON could expose the C DBL_EPSILON
> limits.FLT_MAX could expose the C DBL_MAX

Hmm -- all evidence suggests that your "O" and "A" keys work fine, so where
did the absurdly abbreviated FLT come from <wink>?

> limits.INT_MAX could expose the C LONG_MAX (although that particular name
> would cause confusion with the actual C INT_MAX)

That one is available as sys.maxint.

> - Does this kind of thing already exist somewhere? Maybe in NumPy.

Dunno.  I compute the floating-point limits when needed with Python code,
and observing what the hardware actually does is a heck of a lot more
trustworthy than platform C header files (and especially when
cross-compiling).
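
E.g. a quick sketch (not anyone's production code) that observes the
hardware directly:

    def machine_epsilon():
        # halve eps until adding it to 1.0 no longer makes a difference
        eps = 1.0
        while 1.0 + eps / 2.0 != 1.0:
            eps = eps / 2.0
        return eps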

> - If we ever (perhaps in Py3K) turn the basic types into classes
> then these could turn into constant attributes of those classes, i.e.:
>   f = 3.14159
>   f.EPSILON = <as set by C's DBL_EPSILON>

That sounds better.

> - I thought of these values being useful when I thought of comparing
>   two floats for equality. Doing a straight comparison of floats is
>   dangerous/wrong

This is a myth whose only claim to veracity is the frequency and intensity
with which it's mechanically repeated <0.6 wink>.  It's no more dangerous
than adding two floats:  you're potentially screwed if you don't know what
you're doing in either case, but you're in no trouble at all if you do.

> but is it not okay to consider two floats reasonably equal iff:
>   	-EPSILON < float2 - float1 < EPSILON

Knuth (Vol 2) gives a reasonable defn of approximate float equality.  Yours
is measuring absolute error, which is almost never reasonable; relative
error is the measure of interest, but then 0.0 is an especially irksome
comparand.
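
A sketch in that spirit (the tolerance is arbitrary, and 0.0 still needs
special handling, as noted):

    def close_enough(x, y, rel_err=1e-9):
        # relative error: scale the tolerance by the larger magnitude
        return abs(x - y) <= rel_err * max(abs(x), abs(y))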

> ...
>   I suppose the answer to my question is: "It depends on the situation."

Yes.

>   Could this algorithm for float comparison be a better default than the
>   status quo?

No.

> I know that Mark H. and others have suggested that Python should maybe
> not provide a float comparison operator at all to beginners.

There's a good case to be made for not exposing *anything* about fp to
beginners, but comparisons aren't especially surprising.  This usually gets
suggested when a newbie is surprised that e.g. 1./49*49 != 1.  Telling them
they *are* equal is simply a lie, and they'll pay for that false comfort
twice over a little bit later down the fp road.  For example, int(1./49*49)
is 0 on IEEE-754 platforms, which is awfully surprising for an expression
that "equals" 1(!).  The next suggestion is then to fudge int() too, and so
on and so on.  It's like the arcade Whack-A-Mole game:  each mole you knock
into its hole pops up two more where you weren't looking.  Before you know
it, not even a bona fide expert can guess what code will actually do
anymore.

the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
    that-can-be-done-ly y'rs  - tim




From Fredrik Lundh" <effbot@telia.com  Wed May  3 08:34:51 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:34:51 +0200
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: <Pine.LNX.4.10.10005022249330.522-100000@localhost>
Message-ID: <00b201bfb4d3$07a95420$34aab5d4@hagrid>

Ka-Ping Yee <ping@lfw.org> wrote:
> So -- what's the philosophy, Guido?  Are we committed to "everything
> is comparable" (well, "all built-in types are comparable") or not?

in 1.6a2, obviously not:

>>> aUnicodeString < an8bitString
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

in 1.6a3, maybe.

</F>



From Fredrik Lundh" <effbot@telia.com  Wed May  3 08:48:56 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:48:56 +0200
Subject: [Python-Dev] Unicode debate
References: <000501bfb4c3$16743480$622d153f@tim>
Message-ID: <00ce01bfb4d4$0a7d1820$34aab5d4@hagrid>

Tim Peters <tim_one@email.msn.com> wrote:
> [Moshe Zadka]
> > ...
> > I'd much prefer Python to reflect a fundamental truth about Unicode,
> > which at least makes sure binary-goop can pass through Unicode and
> > remain unharmed, then to reflect a nasty problem with UTF-8 (not
> > everything is legal).
>=20
> Then you don't want Unicode at all, Moshe.  All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences (for example, =
0xffff
> is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of
> Unicode not yet having assigned a character to this position, it's =
that the
> standard explicitly makes this sequence illegal and guarantees it will
> always be illegal!

in context, I think what Moshe meant was that with a straight
character code mapping, any 8-bit string can always be mapped
to a unicode string and back again.

given a byte array "b":

    u = unicode(b, "default")
    assert map(ord, u) == map(ord, b)

again, this is no different from casting an integer to a long integer
and back again.  (imagine having to do that on the bits and bytes
level!).

and again, the internal unicode encoding used by the unicode string
type itself, or when serializing that string type, has nothing to do
with that.

</F>



From jack@oratrix.nl  Wed May  3 08:58:31 2000
From: jack@oratrix.nl (Jack Jansen)
Date: Wed, 03 May 2000 09:58:31 +0200
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Python
 bltinmodule.c,2.154,2.155
In-Reply-To: Message by bwarsaw@cnri.reston.va.us (Barry A. Warsaw) ,
 Tue, 2 May 2000 15:24:09 -0400 (EDT) , <20000502192409.8C44E6636B@anthem.cnri.reston.va.us>
Message-ID: <20000503075832.18574370CF2@snelboot.oratrix.nl>

> _PyBuiltin_Init_2(): Don't test Py_UseClassExceptionsFlag, just go
> ahead and initialize the class-based standard exceptions.  If this
> fails, we throw a Py_FatalError.

Isn't a Py_FatalError overkill? Or will not having the class-based standard 
exceptions lead to so much havoc later on that it is better than limping on?
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 




From just@letterror.com  Wed May  3 10:03:16 2000
From: just@letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 10:03:16 +0100
Subject: [Python-Dev] Unicode comparisons & normalization
Message-ID: <l03102806b535964edb26@[193.78.237.152]>

After quickly browsing through the unicode.org URLs I posted earlier, I
reach the following (possibly wrong) conclusions:

- there is a script and language independent canonical form (but automatic
normalization is indeed a bad idea)
- ideally, unicode comparisons should follow the rules from
http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
for 1.6, if at all...)
- this would indeed mean that it's possible for u == v even though type(u)
is type(v) and len(u) != len(v). However, I don't see how this would
collapse /F's world, as the two strings are at most semantically
equivalent. Their physical difference is real, and still follows the
a-string-is-a-sequence-of-characters rule (!).
- there may be additional customized language-specific sorting rules. I
currently don't see how to implement that without some global variable.
- the sorting rules are very complicated, and should be implemented by
calculating "sort keys". If I understood it correctly, these can take up to
4 bytes per character in its most compact form. Still, for it to be
somewhat speed-efficient, they need to be cached...
- u.find() may need an alternative API, which returns a (begin, end) tuple,
since the match may not have the same length as the search string... (This
is tricky, since you need the begin and end indices in the non-canonical
form...)

Just




From Fredrik Lundh" <effbot@telia.com  Wed May  3 08:56:25 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:56:25 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>             <390F2B2F.2953C72D@prescod.net>  <200005021958.PAA26760@eric.cnri.reston.va.us>
Message-ID: <013c01bfb4d6$da19fb00$34aab5d4@hagrid>

Guido van Rossum <guido@python.org> wrote:
> > What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> > characters with backslashed numbers?
>
> Hm, good question.  Tcl displays unknown characters as \x or \u
> escapes.  I think this may make more sense than raising an error.

but that's on the display side of things, right?  similar to
repr, in other words.

> But there must be a way to turn on Unicode-awareness on e.g. stdout
> and then printing a Unicode object should not use str() (as it
> currently does).

to throw some extra gasoline on this, how about allowing
str() to return unicode strings?

(extra questions: how about renaming "unicode" to "string",
and getting rid of "unichr"?)

count to ten before replying, please.

</F>



From ping@lfw.org  Wed May  3 09:30:02 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 01:30:02 -0700 (PDT)
Subject: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: <l03102806b535964edb26@[193.78.237.152]>
Message-ID: <Pine.LNX.4.10.10005030116460.522-100000@localhost>

On Wed, 3 May 2000, Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:
> 
> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

I just looked through this document.  Indeed, there's a lot
of work to be done if we want to compare strings this way.

I thought the most striking feature was that this comparison
method does *not* satisfy the common assumption

    a > b  implies  a + c > b + d        (+ is concatenation)

-- in fact, it is specifically designed to allow for cases
where differences in the *later* part of a string can have
greater influence than differences in an earlier part of a
string.  It *does* still guarantee that

    a + b > a

and of course we can still rely on the most basic rules such as

    a > b  and  b > c  implies  a > c

There are sufficiently many significant transformations
described in the UTR 10 document that i'm pretty sure it
is possible for two things to collate equally but not be
equivalent.  (Even after Unicode normalization, there is
still the possibility of rearrangement in step 1.2.)

This would be another motivation for Python to carefully
separate the three types of equality:

    is         identity-equal
    ==         value-equal
    <=>        magnitude-equal

We currently don't distinguish between the last two;
the operator "<=>" is my proposal for how to spell
"magnitude-equal", and in terms of outward behaviour
you can consider (a <=> b) to be (a <= b and a >= b).
I suspect we will find ourselves needing it if we do
rich comparisons anyway.

(I don't know of any other useful kinds of equality,
but if you've run into this before, do pipe up...)


-- ?!ng



From mal@lemburg.com  Wed May  3 09:15:29 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 10:15:29 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <390EFE21.DAD7749B@prescod.net> <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <390FE021.6F15C1C8@lemburg.com>

Just van Rossum wrote:
> 
> [MAL vs. PP]
> >> > FYI: Normalization is needed to make comparing Unicode
> >> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
> >>
> >> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> >> think we need to get the bytes/characters level right and then we can
> >> worry about display-equivalent characters (or leave that to the Python
> >> programmer to figure out...).
> >
> >I just wanted to point out that the argument "slicing doesn't
> >work with UTF-8" is moot.
> 
> And failed...

Huh ? The pure fact that you can have two (or more)
Unicode characters to represent a single character makes
Unicode itself have the same problems as e.g. UTF-8.

> [Refs about collation and decomposition]
>
> It's very deep stuff, which seems more appropriate for an extension than
> for builtin comparisons to me.

That's what I think too; I never argued for making this
builtin and automatic (don't know where people got this idea
from).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From Fredrik Lundh" <effbot@telia.com  Wed May  3 10:02:09 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 11:02:09 +0200
Subject: [Python-Dev] Unicode comparisons & normalization
References: <l03102806b535964edb26@[193.78.237.152]>
Message-ID: <018a01bfb4de$7744cc00$34aab5d4@hagrid>

Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:

here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at
the source, and that it should be sufficient to do binary matching to tell
if two strings are identical.

...

another very interesting thing from that paper is where they identify four
layers of character support:

    Layer 1: Physical representation. This is necessary for
    APIs that expose a physical representation of string data.
    /.../ To avoid problems with duplicates, it is assumed that
    the data is normalized /.../

    Layer 2: Indexing based on abstract codepoints. /.../ This
    is the highest layer of abstraction that ensures interopera-
    bility with very low implementation effort. To avoid problems
    with duplicates, it is assumed that the data is normalized /.../

    Layer 3: Combining sequences, user-relevant. /.../ While we
    think that an exact definition of this layer should be possible,
    such a definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is
    least suited for interoperability, but is necessary for certain
    operations, e.g. sorting.

until now, this discussion has focussed on the boundary between
layer 1 and 2.

that as many python strings as possible should be on the second
layer has always been obvious to me ("a very low implementation
effort" is exactly my style ;-), and leave the rest for the app.

...while Guido and MAL have argued that we should stay on level 1
(apparently because "we've already implemented it" is less effort
than "let's change a little bit")

no wonder they never understand what I'm talking about...

it's also interesting to see that MAL's using layer 3 and 4 issues as an
argument to keep Python's string support at layer 1.  in contrast, the
W3 paper thinks that normalization is a non-issue also on the layer 1
level.  go figure.

...

btw, how about adopting this paper as the "Character Model for Python"?

yes, I'm serious.

</F>

PS. here's my take on Just's normalization points:

> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

note that the W3 paper recommends early normalization, and binary
comparison (assuming the same internal representation of the
unicode character codes, of course).

> - this would indeed mean that it's possible for u == v even though type(u)
> is type(v) and len(u) != len(v). However, I don't see how this would
> collapse /F's world, as the two strings are at most semantically
> equivalent. Their physical difference is real, and still follows the
> a-string-is-a-sequence-of-characters rule (!).

yes, but on layer 3 instead of layer 2.

> - there may be additional customized language-specific sorting rules. I
> currently don't see how to implement that without some global variable.

layer 4.

> - the sorting rules are very complicated, and should be implemented by
> calculating "sort keys". If I understood it correctly, these can take =
up to
> 4 bytes per character in its most compact form. Still, for it to be
> somewhat speed-efficient, they need to be cached...

layer 4.

> - u.find() may need an alternative API, which returns a (begin, end) tuple,
> since the match may not have the same length as the search string... (This
> is tricky, since you need the begin and end indices in the non-canonical
> form...)

layer 3.



From Fredrik Lundh" <effbot@telia.com  Wed May  3 10:11:26 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 11:11:26 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>              <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com>
Message-ID: <01ed01bfb4df$8feddb60$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> Guido van Rossum wrote:
> >
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?
>
> How about using unicode-escape or raw-unicode-escape as
> default encoding ? (They would have to be adapted to disallow
> Latin-1 char input, though.)
>
> The advantage would be that they are compatible with ASCII
> while still providing loss-less conversion and since they
> use escape characters, you can even read them using an
> ASCII based editor.

umm.  if you disallow latin-1 characters, how can you call this
one loss-less?

looks like political correctness taken to an entirely new level...

</F>



From ping@lfw.org  Wed May  3 09:50:30 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 01:50:30 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <013c01bfb4d6$da19fb00$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005030141580.522-100000@localhost>

On Wed, 3 May 2000, Fredrik Lundh wrote:
> Guido van Rossum <guido@python.org> wrote:
> > But there must be a way to turn on Unicode-awareness on e.g. stdout
> > and then printing a Unicode object should not use str() (as it
> > currently does).
> 
> to throw some extra gasoline on this, how about allowing
> str() to return unicode strings?

You still need to *print* them somehow.  One way or another,
stdout is still just a stream with bytes on it, unless we
augment file objects to understand encodings.

stdout sends bytes to something -- and that something will
interpret the stream of bytes in some encoding (could be
Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:

    1.  You explicitly downconvert to bytes, and specify
        the encoding each time you do.  Then write the
        bytes to stdout (or your file object).

    2.  The file object is smart and can be told what
        encoding to use, and Unicode strings written to
        the file are automatically converted to bytes.

Another thread mentioned having separate read/write and
binary_read/binary_write methods on files.  I suggest
doing it the other way, actually: since read/write operate
on byte streams now, *they* are the binary operations;
the new methods should be the ones that do the extra
encoding/decoding work, and could be called uniread/uniwrite,
uread/uwrite, textread/textwrite, etc.

> (extra questions: how about renaming "unicode" to "string",
> and getting rid of "unichr"?)

Would you expect chr(x) to return an 8-bit string when x < 128,
and a Unicode string when x >= 128?


-- ?!ng



From ping@lfw.org  Wed May  3 10:32:31 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:32:31 -0700 (PDT)
Subject: [Python-Dev] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030151150.522-100000@localhost>

On Tue, 2 May 2000, Guido van Rossum wrote:
> > P. P. S.  If always having to specify encodings is really too much,
> > i'd probably be willing to consider a default-encoding state on the
> > Unicode class, but it would have to be a stack of values, not a
> > single value.
> 
> Please elaborate?

On general principle, it seems bad to just have a "set" method
that encourages people to set static state in a way that
irretrievably loses the current state.  For something like this,
you want a "push" method and a "pop" method with which to bracket
a series of operations, so that you can easily write code which
politely leaves other code unaffected.

For example:

    >>> x = unicode("d\351but")        # assume Guido-ASCII wins
    UnicodeError: ASCII encoding error: value out of range
    >>> x = unicode("d\351but", "latin-1")
    >>> x
    u'd\351but'
    >>> print x.encode("latin-1")      # on my xterm with Latin-1 fonts
    début
    >>> x.encode("utf-8")
    'd\303\251but'

Now:

    >>> u"".pushenc("latin-1")         # need a better interface to this?
    >>> x = unicode("d\351but")        # okay now
    >>> x
    u'd\351but'
    >>> u"".pushenc("utf-8")
    >>> x = unicode("d\351but")
    UnicodeError: UTF-8 decoding error: invalid data
    >>> x = unicode("d\303\251but")
    >>> print x.encode("latin-1")
    début
    >>> str(x)
    'd\303\251but'
    >>> u"".popenc()                   # back to the Latin-1 encoding
    >>> str(x)
    'd\351but'
        .
        .
        .
    >>> u"".popenc()                   # back to the ASCII encoding

Similarly, imagine:

    >>> x = u"<Japanese text...>"

    >>> file = open("foo.jis", "w")
    >>> file.pushenc("iso-2022-jp")
    >>> file.uniwrite(x)
        .
        .
        .
    >>> file.popenc()

    >>> import sys
    >>> sys.stdout.write(x)            # bad! x contains chars > 127
    UnicodeError: ASCII decoding error: value out of range

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> sys.stdout.write(x)            # on a kterm with kanji fonts
    <Japanese text...>
        .
        .
        .
    >>> sys.stdout.popenc()

The above examples incorporate the Guido-ASCII proposal, which
makes a fair amount of sense to me now.  How do they look to y'all?



This illustrates the remaining wart:

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> print x                        # still bad! str is still doing ASCII
    UnicodeError: ASCII decoding error: value out of range

    >>> u"".pushenc("iso-2022-jp")
    >>> print x                        # on a kterm with kanji fonts
    <Japanese text...>

Writing to files asks the file object to convert from Unicode to
bytes, then write the bytes.

Printing converts the Unicode to bytes first with str(), then
hands the bytes to the file object to write.

This wart is really a larger printing issue.  If we want to
solve it, files have to know what to do with objects, i.e.

    print x

doesn't mean

    sys.stdout.write(str(x) + "\n")

instead it means

    sys.stdout.printout(x)

Hmm.  I think this might deserve a separate subject line.


-- ?!ng



From ping@lfw.org  Wed May  3 10:41:20 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:41:20 -0700 (PDT)
Subject: [Python-Dev] Printing objects on files
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030232360.522-100000@localhost>

The following is all stolen from E: see http://www.erights.org/.

As i mentioned in the previous message, there are reasons that
we might want to enable files to know what it means to print
things on them.

    print x

would mean

    sys.stdout.printout(x)

where sys.stdout is defined something like

    def __init__(self):
        self.encs = ["ASCII"]

    def pushenc(self, enc):
        self.encs.append(enc)
    
    def popenc(self):
        self.encs.pop()
        if not self.encs: self.encs = ["ASCII"]

    def printout(self, x):
        if type(x) is type(u""):
            self.write(x.encode(self.encs[-1]))
        else:   
            x.__print__(self)
        self.write("\n")

and each object would have a __print__ method; for lists, e.g.:

    def __print__(self, file):
        file.write("[")
        if len(self):
            file.printout(self[0])
        for item in self[1:]:
            file.write(", ")
            file.printout(item)
        file.write("]")

for floats, e.g.:

    def __print__(self, file):
        if hasattr(file, "floatprec"):
            prec = file.floatprec
        else:
            prec = 17
        file.write("%%.%df" % prec % self)

The passing of control between the file and the objects to
be printed enables us to make Tim happy:

    >>> l = [1/2, 1/3, 1/4]            # I can dream, can't i?

    >>> print l
    [0.5, 0.33333333333333331, 0.25]

    >>> sys.stdout.floatprec = 6
    >>> print l
    [0.5, 0.333333, 0.25]

Fantasizing about other useful kinds of state beyond "encs"
and "floatprec" ("listmax"? "ratprec"?) and managing this
namespace is left as an exercise to the reader.


-- ?!ng



From ht@cogsci.ed.ac.uk  Wed May  3 10:59:28 2000
From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Date: 03 May 2000 10:59:28 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Mon, 01 May 2000 20:53:26 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net>
 <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net>
 <200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <f5bog6o54zj.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido@python.org> writes:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

OK, I've never contributed to this discussion, but I have a long
history of shipping widely used Python/Tkinter/XML tools (see my
homepage).  I care _very_ much that heretofore I have been unable to
support full XML because of the lack of Unicode support in Python.
I've already started playing with 1.6a2 for this reason.

I notice one apparent mis-communication between the various
contributors:

Treating narrow-strings as consisting of UNICODE code points <= 255 is 
not necessarily the same thing as making Latin-1 the default encoding.
I don't think on Paul and Fredrik's account encodings are relevant to
narrow-strings at all.

I'd rather go right away to the coherent position of byte-arrays,
narrow-strings and wide-strings.  Encodings are only relevant to
conversion between byte-arrays and strings.  Decoding a byte-array
with a UTF-8 encoding into a narrow string might cause
overflow/truncation, just as decoding a byte-array with a UTF-8
encoding into a wide-string might.  The fact that decoding a
byte-array with a Latin-1 encoding into a narrow-string is a memcopy
is just a side-effect of the courtesy of the UNICODE designers wrt the 
code points between 128 and 255.

This is effectively the way our C-based XML toolset (which we embed in 
Python) works today -- we build an 8-bit version which uses char*
strings, and a 16-bit version which uses unsigned short* strings, and
convert from/to byte-streams in any supported encoding at the margins.
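
In Python terms, that style would look roughly like the sketch below (the
file names and the choice of UTF-8 are just examples, not part of the
toolset described above):

    raw = open("report.xml", "rb").read()        # byte-array at the margin
    text = unicode(raw, "utf-8")                  # decode once, at the edge
    # ... all further processing sees `text' purely as characters ...
    open("report-out.xml", "wb").write(text.encode("utf-8"))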

I'd like to keep byte-arrays at the margins in Python as well, for all 
the reasons advanced by Paul and Fredrik.

I think treating existing strings as a sort of pun between
narrow-strings and byte-arrays is a recipe for ongoing confusion.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/


From ping@lfw.org  Wed May  3 10:51:30 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:51:30 -0700 (PDT)
Subject: [Python-Dev] Re: Printing objects on files
In-Reply-To: <Pine.LNX.4.10.10005030232360.522-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005030242030.522-100000@localhost>

On Wed, 3 May 2000, Ka-Ping Yee wrote:
> 
> Fantasizing about other useful kinds of state beyond "encs"
> and "floatprec" ("listmax"? "ratprec"?) and managing this
> namespace is left as an exercise to the reader.

Okay, i lied.  Shortly after writing this i realized that it
is probably advisable for all such bits of state to be stored
in stacks, so an interface such as this might do:

    def push(self, key, value):
        if not self.state.has_key(key):
            self.state[key] = []
        self.state[key].append(value)

    def pop(self, key):
        if self.state.has_key(key):
            if len(self.state[key]):
                self.state[key].pop()

    def get(self, key):
        if self.state.has_key(key):
            stack = self.state[key]
            if stack:
                return stack[-1]
        return None

Thus:

    >>> print 1/3
    0.33333333333333331

    >>> sys.stdout.push("float.prec", 6)
    >>> print 1/3
    0.333333

    >>> sys.stdout.pop("float.prec")
    >>> print 1/3
    0.33333333333333331

And once we allow arbitrary strings as keys to the bits
of state, the period is a natural separator we can use
for managing the namespace.

Take the special case for Unicode out of the file object:
    
    def printout(self, x):
        x.__print__(self)
        self.write("\n")

and have the Unicode string do the work:

    def __print__(self, file):
        file.write(self.encode(file.get("unicode.enc")))

This behaves just right if an encoding of None means ASCII.

If mucking with encodings is sufficiently common, you could
imagine conveniences on file objects such as

    def __init__(self, filename, mode, encoding=None):
        ...
        if encoding:
            self.push("unicode.enc", encoding)

    def pushenc(self, encoding):
        self.push("unicode.enc", encoding)

    def popenc(self):
        self.pop("unicode.enc")


-- ?!ng



From Fredrik Lundh" <effbot@telia.com  Wed May  3 11:31:34 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 12:31:34 +0200
Subject: [Python-Dev] Unicode debate
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>
Message-ID: <030a01bfb4ea$c2741e40$34aab5d4@hagrid>

Ka-Ping Yee <ping@lfw.org> wrote:
> > to throw some extra gasoline on this, how about allowing
> > str() to return unicode strings?
>
> You still need to *print* them somehow.  One way or another,
> stdout is still just a stream with bytes on it, unless we
> augment file objects to understand encodings.
>
> stdout sends bytes to something -- and that something will
> interpret the stream of bytes in some encoding (could be
> Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:
>
>     1.  You explicitly downconvert to bytes, and specify
>         the encoding each time you do.  Then write the
>         bytes to stdout (or your file object).
>
>     2.  The file object is smart and can be told what
>         encoding to use, and Unicode strings written to
>         the file are automatically converted to bytes.

which one's more convenient?

(no, I won't tell you what I prefer. guido doesn't want
more arguments from the old "characters are characters"
proponents, so I gotta trick someone else to spell them
out ;-)

> > (extra questions: how about renaming "unicode" to "string",
> > and getting rid of "unichr"?)
>
> Would you expect chr(x) to return an 8-bit string when x < 128,
> and a Unicode string when x >= 128?

that will break too much existing code, I think.  but what
about replacing 128 with 256?

</F>



From just@letterror.com  Wed May  3 12:41:27 2000
From: just@letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 12:41:27 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390FE021.6F15C1C8@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <390EFE21.DAD7749B@prescod.net> <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <l03102800b535bef21708@[193.78.237.152]>

At 10:15 AM +0200 03-05-2000, M.-A. Lemburg wrote:
>Huh ? The pure fact that you can have two (or more)
>Unicode characters to represent a single character makes
>Unicode itself have the same problems as e.g. UTF-8.

It's the different level of abstraction that makes it different.

Even if "e`" is _equivalent_ to the combined character, that doesn't mean
that it _is_ the combined character, on the level of abstraction we are
talking about: it's still 2 characters, and those can be sliced apart
without a problem. Slicing utf-8 doesn't work because it yields invalid
strings, slicing "e`" does work since both halves are valid strings. The
fact that "e`" is semantically equivalent to the combined character doesn't
change that.
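
A tiny sketch of that point (the literals are just an example): slicing the
two-character sequence yields valid strings at every position, while slicing
its UTF-8 encoding in the middle of the accent yields invalid data.

    s = u"e\u0301"                 # "e" followed by COMBINING ACUTE ACCENT
    assert s[:1] == u"e"           # both halves are valid strings
    assert s[1:] == u"\u0301"

    data = s.encode("utf-8")       # 'e\xcc\x81' -- the accent takes two bytes
    try:
        unicode(data[:2], "utf-8") # slicing the byte form breaks the data
    except UnicodeError:
        pass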

Just




From guido@python.org  Wed May  3 12:12:44 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 07:12:44 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: Your message of "Wed, 03 May 2000 01:30:02 PDT."
 <Pine.LNX.4.10.10005030116460.522-100000@localhost>
References: <Pine.LNX.4.10.10005030116460.522-100000@localhost>
Message-ID: <200005031112.HAA03138@eric.cnri.reston.va.us>

[Ping]
> This would be another motivation for Python to carefully
> separate the three types of equality:
> 
>     is         identity-equal
>     ==         value-equal
>     <=>        magnitude-equal
> 
> We currently don't distinguish between the last two;
> the operator "<=>" is my proposal for how to spell
> "magnitude-equal", and in terms of outward behaviour
> you can consider (a <=> b) to be (a <= b and a >= b).
> I suspect we will find ourselves needing it if we do
> rich comparisons anyway.

I don't think that this form of equality deserves its own operator.
The Unicode comparison rules are sufficiently hairy that it seems
better to implement them separately, either in a separate module or at
least as a Unicode-object-specific method, and let the == operator do
what it does best: compare the representations.
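
For illustration only, such a separate helper might look like the rough
sketch below -- this is *not* the UTS #10 algorithm, just canonical
decomposition as a crude stand-in, and it assumes a unicodedata module
that offers normalize(), which the current code base may not:

    import unicodedata

    def collate_cmp(a, b):
        # Compare after canonical decomposition, so that u"\u00e9"
        # and u"e\u0301" compare equal; returns -1, 0 or 1 like cmp().
        ka = unicodedata.normalize("NFD", a)
        kb = unicodedata.normalize("NFD", b)
        return (ka > kb) - (ka < kb)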

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido@python.org  Wed May  3 12:14:54 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 07:14:54 -0400
Subject: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: Your message of "Wed, 03 May 2000 11:02:09 +0200."
 <018a01bfb4de$7744cc00$34aab5d4@hagrid>
References: <l03102806b535964edb26@[193.78.237.152]>
 <018a01bfb4de$7744cc00$34aab5d4@hagrid>
Message-ID: <200005031114.HAA03152@eric.cnri.reston.va.us>

> here's another good paper that covers this, the universe, and everything:

There's a lot of useful pointers being flung around.  Could someone
with more spare cycles than I currently have perhaps collect these and
produce a little write up "further reading on Unicode comparison and
normalization" (or perhaps a more comprehensive title if warranted) to
be added to the i18n-sig's home page?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From just@letterror.com  Wed May  3 13:26:50 2000
From: just@letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 13:26:50 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <030a01bfb4ea$c2741e40$34aab5d4@hagrid>
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>
Message-ID: <l03102804b535cb14f243@[193.78.237.149]>

[Ka-Ping Yee]
> Would you expect chr(x) to return an 8-bit string when x < 128,
> and a Unicode string when x >= 128?

[Fredrik Lundh]
> that will break too much existing code, I think.  but what
> about replacing 128 with 256?

Hihi... and *poof* -- we're back to Latin-1 for narrow strings ;-)

Just




From guido@python.org  Wed May  3 13:04:29 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:04:29 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 12:31:34 +0200."
 <030a01bfb4ea$c2741e40$34aab5d4@hagrid>
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>
 <030a01bfb4ea$c2741e40$34aab5d4@hagrid>
Message-ID: <200005031204.IAA03252@eric.cnri.reston.va.us>

[Ping]
> > stdout sends bytes to something -- and that something will
> > interpret the stream of bytes in some encoding (could be
> > Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:
> > 
> >     1.  You explicitly downconvert to bytes, and specify
> >         the encoding each time you do.  Then write the
> >         bytes to stdout (or your file object).
> > 
> >     2.  The file object is smart and can be told what
> >         encoding to use, and Unicode strings written to
> >         the file are automatically converted to bytes.

[Fredrik]
> which one's more convenient?

Marc-Andre's codec module contains file-like objects that support this
(or could easily be made to).

However the problem is that print *always* first converts the object
using str(), and str() enforces that the result is an 8-bit string.
I'm afraid that loosening this will break too much code.  (This all
really happens at the C level.)

I'm also afraid that this means that str(unicode) may have to be
defined to yield UTF-8.  My argument goes as follows:

1. We want to be able to set things up so that print u"..." does the
   right thing.  (What "the right thing" is, is not defined here,
   as long as the user sees the glyphs implied by u"...".)

2. print u is equivalent to sys.stdout.write(str(u)).

3. str() must always return an 8-bit string.

4. So the solution must involve assigning an object to sys.stdout that
   does the right thing given an 8-bit encoding of u.

5. So we need str(u) to produce a lossless 8-bit encoding of Unicode.

6. UTF-8 is the only sensible candidate.

Note that (apart from print) str() is never implicitly invoked -- all
implicit conversions when Unicode and 8-bit strings are combined
go from 8-bit to Unicode.
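
(The losslessness claimed in steps 5 and 6 is easy to check; a tiny sketch
with arbitrary example characters:)

    u = u"d\u00e9but \u65e5\u672c"
    assert unicode(u.encode("utf-8"), "utf-8") == u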

(There might be an alternative, but it would depend on having yet
another hook (similar to Ping's sys.display) that gets invoked when
printing an object (as opposed to displaying it at the interactive
prompt).  I'm not too keen on this because it would break code that
temporarily sets sys.stdout to a file of its own choosing and then
invokes print -- a common idiom to capture printed output in a string,
for example, which could be embedded deep inside a module.  If the
main program were to install a naive print hook that always sent
output to a designated place, this strategy might fail.)

> > > (extra questions: how about renaming "unicode" to "string",
> > > and getting rid of "unichr"?)
> > 
> > Would you expect chr(x) to return an 8-bit string when x < 128,
> > and a Unicode string when x >= 128?
> 
> that will break too much existing code, I think.  but what
> about replacing 128 with 256?

If the 8-bit Unicode proposal were accepted, this would make sense.
In my "only ASCII is implicitly convertible" proposal, this would be a
mistake, because chr(128) == "\x7f" != u"\x7f" == unichr(128).

I agree with everyone that things would be much simpler if we had
separate data types for byte arrays and 8-bit character strings.  But
we don't have this distinction yet, and I don't see a quick way to add
it in 1.6 without seriously upsetting the release schedule.

So all of my proposals are to be considered hacks to maintain as much
b/w compatibility as possible while still supporting some form of
Unicode.  The fact that half the time 8-bit strings are really being
used as byte arrays, while Python can't tell the difference, means (to
me) that the default encoding is an important thing to argue about.

I don't know if I want to push it out all the way to Py3k, but I just
don't see a way to implement "a character is a character" in 1.6 given
all the current constraints.  (BTW I promise that 1.7 will be speedy
once 1.6 is out of the door -- there's a lot else that was put off to
1.7.)

Fredrik, I believe I haven't seen your response to my ASCII proposal.
Is it just as bad as UTF-8 to you, or could you live with it?  On a
scale of 0-9 (0: UTF-8, 9: 8-bit Unicode), where is ASCII for you?

Where's my sre snapshot?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Wed May  3 13:16:56 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:16:56 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "03 May 2000 10:59:28 BST."
 <f5bog6o54zj.fsf@cogsci.ed.ac.uk>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>
 <f5bog6o54zj.fsf@cogsci.ed.ac.uk>
Message-ID: <200005031216.IAA03274@eric.cnri.reston.va.us>

[Henry S. Thompson]
> OK, I've never contributed to this discussion, but I have a long
> history of shipping widely used Python/Tkinter/XML tools (see my
> homepage).  I care _very_ much that heretofore I have been unable to
> support full XML because of the lack of Unicode support in Python.
> I've already started playing with 1.6a2 for this reason.

Thanks for chiming in!

> I notice one apparent mis-communication between the various
> contributors:
> 
> Treating narrow-strings as consisting of UNICODE code points <= 255 is 
> not necessarily the same thing as making Latin-1 the default encoding.
> I don't think on Paul and Fredrik's account encodings are relevant to
> narrow-strings at all.

I agree that's what they are trying to tell me.

> I'd rather go right away to the coherent position of byte-arrays,
> narrow-strings and wide-strings.  Encodings are only relevant to
> conversion between byte-arrays and strings.  Decoding a byte-array
> with a UTF-8 encoding into a narrow string might cause
> overflow/truncation, just as decoding a byte-array with a UTF-8
> encoding into a wide-string might.  The fact that decoding a
> byte-array with a Latin-1 encoding into a narrow-string is a memcopy
> is just a side-effect of the courtesy of the UNICODE designers wrt the 
> code points between 128 and 255.
> 
> This is effectively the way our C-based XML toolset (which we embed in 
> Python) works today -- we build an 8-bit version which uses char*
> strings, and a 16-bit version which uses unsigned short* strings, and
> convert from/to byte-streams in any supported encoding at the margins.
> 
> I'd like to keep byte-arrays at the margins in Python as well, for all 
> the reasons advanced by Paul and Fredrik.
> 
> I think treating existing strings as a sort of pun between
> narrow-strings and byte-arrays is a recipe for ongoing confusion.

Very good analysis.

Unfortunately this is where we're stuck, until we have a chance to
redesign this kind of thing from scratch.  Python 1.5.2 programs use
strings for byte arrays probably as much as they use them for
character strings.  This is because way back in 1990, when I was
designing Python, I wanted to have the smallest set of basic types, but I
also wanted to be able to manipulate byte arrays somewhat.  Influenced
by K&R C, I chose to make strings and string I/O 8-bit clean so that
you could read a binary "string" from a file, manipulate it, and write
it back to a file, regardless of whether it was character or binary
data.

This model has never been challenged until now.  I agree that the Java
model (byte arrays and strings) or perhaps your proposed model (byte
arrays, narrow and wide strings) looks better.  But, although Python
has had rudimentary support for byte arrays for a while (the array
module, introduced in 1993), the majority of Python code manipulating
binary data still uses string objects.

My ASCII proposal is a compromise that tries to be fair to both uses
for strings.  Introducing byte arrays as a more fundamental type has
been on the wish list for a long time -- I see no way to introduce
this into Python 1.6 without totally botching the release schedule
(June 1st is very close already!).  I'd like to be able to move on,
there are other important things still to be added to 1.6 (Vladimir's
malloc patches, Neil's GC, Fredrik's completed sre...).

For 1.7 (which should happen later this year) I promise I'll reopen
the discussion on byte arrays.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Wed May  3 13:18:39 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:18:39 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Python bltinmodule.c,2.154,2.155
In-Reply-To: Your message of "Wed, 03 May 2000 09:58:31 +0200."
 <20000503075832.18574370CF2@snelboot.oratrix.nl>
References: <20000503075832.18574370CF2@snelboot.oratrix.nl>
Message-ID: <200005031218.IAA03288@eric.cnri.reston.va.us>

> > _PyBuiltin_Init_2(): Don't test Py_UseClassExceptionsFlag, just go
> > ahead and initialize the class-based standard exceptions.  If this
> > fails, we throw a Py_FatalError.
> 
> Isn't a Py_FatalError overkill? Or will not having the class-based standard 
> exceptions lead to so much havoc later on that it is better than limping on?

There will be *no* exception objects -- they will all be NULL
pointers.  It's not clear that you will be able to limp very far, and
it's better to have a clear diagnostic at the source of the problem.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Wed May  3 13:22:57 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:22:57 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 01:05:59 EDT."
 <000301bfb4bd$463ec280$622d153f@tim>
References: <000301bfb4bd$463ec280$622d153f@tim>
Message-ID: <200005031222.IAA03300@eric.cnri.reston.va.us>

> [Guido]
> > When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> > bytes in either should make the comparison fail; when ordering is
> > important, we can make an arbitrary choice e.g. "\377" < u"\200".
> 
> [Toby]
> > I assume 'fail' means 'non-equal', rather than 'raises an exception'?
> 
> [Guido]
> > Yes, sorry for the ambiguity.

[Tim]
> Huh!  You sure about that?  If we're setting up a case where meaningful
> comparison is impossible, isn't an exception more appropriate?  The current
> 
> >>> 83479278 < "42"
> 1
> >>>
> 
> probably traps more people than it helps.

Agreed, but that's the rule we all currently live by, and changing it
is something for Python 3000.

I'm not real strong on this though -- I was willing to live with
exceptions from the UTF-8-to-Unicode conversion.  If we all agree that
it's better for u"\377" == "\377" to raise an precedent-setting
exception than to return false, that's fine with me too.  I do want
u"a" == "a" to be true though (and I believe we all already agree on
that one).

Note that it's not the first precedent -- you can already define
classes whose instances can raise exceptions during comparisons.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Wed May  3 09:56:08 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 10:56:08 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>             <390F2B2F.2953C72D@prescod.net>  <200005021958.PAA26760@eric.cnri.reston.va.us> <013c01bfb4d6$da19fb00$34aab5d4@hagrid>
Message-ID: <390FE9A7.DE5545DA@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum <guido@python.org> wrote:
> > > What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> > > characters with backslashed numbers?
> >
> > Hm, good question.  Tcl displays unknown characters as \x or \u
> > escapes.  I think this may make more sense than raising an error.
> 
> but that's on the display side of things, right?  similar to
> repr, in other words.
> 
> > But there must be a way to turn on Unicode-awareness on e.g. stdout
> > and then printing a Unicode object should not use str() (as it
> > currently does).
> 
> to throw some extra gasoline on this, how about allowing
> str() to return unicode strings?
> 
> (extra questions: how about renaming "unicode" to "string",
> and getting rid of "unichr"?)
> 
> count to ten before replying, please.

1 2 3 4 5 6 7 8 9 10 ... ok ;-)

Guido's problem with printing Unicode can easily be solved
using the standard codecs.StreamRecoder class as I've done
in the example I posted some days ago.

Basically, what the stdout wrapper would do is take strings
as input, converting them to Unicode and then writing
them encoded to the original stdout. For Unicode objects
the conversion can be skipped and the encoded output written
directly to stdout.

This can be done for any encoding supported by Python; e.g.
you could do the indirection in site.py and then have
Unicode printed as Latin-1 or UTF-8 or one of the many
code pages supported through the mapping codec.
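
A minimal sketch of such a wrapper, assuming codecs.getwriter() is
available (the encoding name is just an example; a real site.py hook would
pick it from the locale or a configuration setting):

    import sys, codecs

    def wrap_stdout(encoding="latin-1"):
        # Replace sys.stdout with a writer that encodes Unicode on output;
        # works whether the underlying stdout is a byte or a text stream.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
        sys.stdout = codecs.getwriter(encoding)(stream)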

About having str() return Unicode objects: I see str()
as a constructor for string objects and under that assumption
str() will always have to return string objects.
unicode() does the same for Unicode objects, so renaming
it to something else doesn't really help all that much.

BTW, __str__() has to return strings too. Perhaps we
need __unicode__() and a corresponding slot function too ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Wed May  3 14:06:27 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 15:06:27 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>              <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> <01ed01bfb4df$8feddb60$34aab5d4@hagrid>
Message-ID: <39102453.6923B10@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> > >
> > > > > So what do you think of my new proposal of using ASCII as the default
> > > > > "encoding"?
> >
> > How about using unicode-escape or raw-unicode-escape as
> > default encoding ? (They would have to be adapted to disallow
> > Latin-1 char input, though.)
> >
> > The advantage would be that they are compatible with ASCII
> > while still providing loss-less conversion and since they
> > use escape characters, you can even read them using an
> > ASCII based editor.
> 
> umm.  if you disallow latin-1 characters, how can you call this
> one loss-less?

[Guido didn't like this one, so it's probably moot investing
 any more time on this...]

I meant that the unicode-escape codec should only take ASCII
characters as input and disallow non-escaped Latin-1 characters.

Anyway, I'm out of this discussion... 

I'll wait a week or so until things have been sorted out.

Have fun,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From ping@lfw.org  Wed May  3 14:09:59 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 06:09:59 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <200005031204.IAA03252@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030556250.522-100000@localhost>

On Wed, 3 May 2000, Guido van Rossum wrote:
> (There might be an alternative, but it would depend on having yet
> another hook (similar to Ping's sys.display) that gets invoked when
> printing an object (as opposed to displaying it at the interactive
> prompt).  I'm not too keen on this because it would break code that
> temporarily sets sys.stdout to a file of its own choosing and then
> invokes print -- a common idiom to capture printed output in a string,
> for example, which could be embedded deep inside a module.  If the
> main program were to install a naive print hook that always sent
> output to a designated place, this strategy might fail.)

I know this is not a small change, but i'm pretty convinced the
right answer here is that the print hook should call a *method*
on sys.stdout, whatever sys.stdout happens to be.  The details
are described in the other long message i wrote ("Printing objects
on files").

Here is an addendum that might actually make that proposal
feasible enough (compatibility-wise) to fly in the short term:

    print x

does, conceptually:

    try:
        sys.stdout.printout(x)
    except AttributeError:
        sys.stdout.write(str(x))
        sys.stdout.write("\n")

The rest can then be added, and the change in 'print x' will
work nicely for any file objects, but will not break on file-like
substitutes that don't define a 'printout' method.

Any reactions to the other benefit of this proposal -- namely,
the ability to control the printing parameters of object
components as they're being traversed for printing?  That was
actually the original motivation for doing the file.printout
thing: it gives you some of the effect of "passing down str-ness"
that we were discussing so heatedly a little while ago.

The other thing that just might justify this much of a change
is that, as you reasoned clearly in your other message, without
adequate resolution to the printing problem we may have painted
ourselves into a corner with regard to str(u"") conversion, and
i don't like the look of that corner much.  *Even* if we were to
get people to agree that it's okay for str(u"") to produce UTF-8,
it still seems pretty hackish to me that we're forced to choose
this encoding as a way of working around the fact that we can't
simply give the file the thing we want to print.


-- ?!ng



From moshez@math.huji.ac.il  Wed May  3 14:55:37 2000
From: moshez@math.huji.ac.il (Moshe Zadka)
Date: Wed, 3 May 2000 16:55:37 +0300 (IDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <000501bfb4c3$16743480$622d153f@tim>
Message-ID: <Pine.GSO.4.10.10005031649040.4859-100000@sundial>

On Wed, 3 May 2000, Tim Peters wrote:

[Moshe Zadka]
> ...
> I'd much prefer Python to reflect a fundamental truth about Unicode,
> which at least makes sure binary-goop can pass through Unicode and
> remain unharmed, then to reflect a nasty problem with UTF-8 (not
> everything is legal).

[Tim Peters]
> Then you don't want Unicode at all, Moshe.  All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences

Of course I don't, and of course you're right. But what I do want is for
my binary goop to pass unharmed through the evil Unicode forest. Which is
why I don't want it to interpret my goop as a sequence of bytes it tries
to decode, but I want the numeric values of my bytes to pass through to
Unicode unharmed -- that means Latin-1 because of the second design
decision of the horribly western-specific Unicode - the first 256
characters are the same as Latin-1. If it were up to me, I'd use Latin-3,
but it wasn't, so it's not.
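
(A sketch of the round trip being relied on here, in 1.6-style syntax with
an arbitrary chunk of binary data:)

    goop = "".join(map(chr, range(256)))     # arbitrary binary data
    u = unicode(goop, "latin-1")             # each byte becomes U+0000..U+00FF
    assert u.encode("latin-1") == goop       # and comes back out unharmed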

> (for example, 0xffff
> is illegal in UTF-16 (whether BE or LE)

Tim, one of us must have cracked a chip. 0xffff is the same in BE and LE
-- isn't it?

--
Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com



From akuchlin@mems-exchange.org  Wed May  3 15:12:06 2000
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 3 May 2000 10:12:06 -0400 (EDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <200005031216.IAA03274@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net>
 <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net>
 <200005020053.UAA23665@eric.cnri.reston.va.us>
 <f5bog6o54zj.fsf@cogsci.ed.ac.uk>
 <200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <14608.13238.339572.202494@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>been on the wish list for a long time -- I see no way to introduce
>this into Python 1.6 without totally botching the release schedule
>(June 1st is very close already!).  I'd like to be able to move on,

My suggested criterion is that 1.6 not screw things up in a way that
we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
a corner that 

(And can we choose a mailing list for discussing this and stick to it?
 This is being cross-posted to three lists: python-dev, i18-sig, and
 xml-sig!  i18-sig only, maybe?  Or string-sig?)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Chess! I'm tormented by thoughts of strip chess. Pure mind just isn't enough,
Mallah. I long for a body.
  -- The Brain, in DOOM PATROL #34



From akuchlin@mems-exchange.org  Wed May  3 15:15:18 2000
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 3 May 2000 10:15:18 -0400 (EDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <14608.13238.339572.202494@amarok.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net>
 <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net>
 <200005020053.UAA23665@eric.cnri.reston.va.us>
 <f5bog6o54zj.fsf@cogsci.ed.ac.uk>
 <200005031216.IAA03274@eric.cnri.reston.va.us>
 <14608.13238.339572.202494@amarok.cnri.reston.va.us>
Message-ID: <14608.13430.92985.717058@amarok.cnri.reston.va.us>

Andrew M. Kuchling writes:
>Guido van Rossum writes:
>My suggested criterion is that 1.6 not screw things up in a way that
>we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
>a corner that 

Doh!  To complete that paragraph: Magic conversions assuming UTF-8
do back us into a corner that is hard to get out of later.  Magic
conversions assuming Latin1 or ASCII are a bit better, but I'd lean
toward the draconian solution: we don't know what we're doing, so do
nothing and require the user to explicitly convert between Unicode and
8-bit strings in a user-selected encoding.

--amk


From guido@python.org  Wed May  3 16:48:32 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 11:48:32 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 10:15:18 EDT."
 <14608.13430.92985.717058@amarok.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us> <14608.13238.339572.202494@amarok.cnri.reston.va.us>
 <14608.13430.92985.717058@amarok.cnri.reston.va.us>
Message-ID: <200005031548.LAA03595@eric.cnri.reston.va.us>

> >Guido van Rossum writes:
> >My suggested criterion is that 1.6 not screw things up in a way that
> >we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
> >a corner that 

> Andrew M. Kuchling writes:
> Doh!  To complete that paragraph: Magic conversions assuming UTF-8
> does back us into a corner that is hard to get out of later.  Magic
> conversions assuming Latin1 or ASCII are a bit better, but I'd lean
> toward the draconian solution: we don't know what we're doing, so do
> nothing and require the user to explicitly convert between Unicode and
> 8-bit strings in a user-selected encoding.

GvR responds:
That's what Ping suggested.  My reason for proposing default
conversions from ASCII is that there is much code that deals with
character strings in a fairly abstract sense and that would work out
of the box (or after very small changes) with Unicode strings.  This
code often uses some string literals containing ASCII characters.  An
arbitrary example: code to reformat a text paragraph; another: an XML
parser.  These look for certain ASCII characters given as literals in
the code (" ", "<" and so on) but the algorithm is essentially
independent of what encoding is used for non-ASCII characters.  (I
realize that the text reformatting example doesn't work for all
Unicode characters because its assumption that all characters have
equal width is broken -- but at the very least it should work with
Latin-1 or Greek or Cyrillic stored in Unicode strings.)

It's the same as for ints: a function to calculate the GCD works with
ints as well as long ints without change, even though it references
the int constant 0.  In other words, we want string-processing code to
be just as polymorphic as int-processing code.
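
For concreteness, a sketch of the kind of code meant here (a made-up
helper; the literals are plain ASCII, so under the ASCII-default proposal
the same function works on 8-bit and Unicode strings alike):

    def first_tag(text):
        # Return the first "<...>" tag in the text, or None.
        start = text.find("<")
        if start < 0:
            return None
        end = text.find(">", start)
        if end < 0:
            return None
        return text[start:end + 1]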

--Guido van Rossum (home page: http://www.python.org/~guido/)


From just@letterror.com  Wed May  3 20:55:24 2000
From: just@letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 20:55:24 +0100
Subject: [Python-Dev] Unicode strings: an alternative
Message-ID: <l03102800b5362642bae3@[193.78.237.149]>

Today I had a relatively simple idea that unites wide strings and narrow
strings in a way that is more backward compatible at the C level. It's quite
possible this has already been considered and rejected for reasons that are
not yet obvious to me, but I'll give it a shot anyway.

The main concept is not to provide a new string type but to extend the
existing string object like so:
- wide strings are stored as if they were narrow strings, simply using two
bytes for each Unicode character.
- there's a flag that specifies whether the string is narrow or wide.
- the ob_size field is the _physical_ length of the data; if the string is
wide, len(s) will return ob_size/2, all other string operations will have
to do similar things.
- there can possibly be an encoding attribute which may specify the used
encoding, if known.

Admittedly, this is tricky and involves quite a bit of effort to implement,
since all string methods need to have narrow/wide switch. To make it worse,
it hardly offers anything the current solution doesn't. However, it offers
one IMHO _big_ advantage: C code that just passes strings along does not
need to change: wide strings can be seen as narrow strings without any
loss. This allows for __str__() & str() and friends to work with unicode
strings without any change.

Any thoughts?

Just




From tree@basistech.com  Wed May  3 21:19:05 2000
From: tree@basistech.com (Tom Emerson)
Date: Wed, 3 May 2000 16:19:05 -0400 (EDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102800b5362642bae3@[193.78.237.149]>
References: <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <14608.35257.729641.178724@cymru.basistech.com>

Just van Rossum writes:
 > The main concept is not to provide a new string type but to extend the
 > existing string object like so:

This is the most logical thing to do.

 > - wide strings are stored as if they were narrow strings, simply using two
 > bytes for each Unicode character.

I disagree with you here... store them as UTF-8.

 > - there's a flag that specifies whether the string is narrow or wide.

Yup.

 > - the ob_size field is the _physical_ length of the data; if the string is
 > wide, len(s) will return ob_size/2, all other string operations will have
 > to do similar things.

Is it possible to add a logical length field too? I presume it is too
expensive to recalculate the logical (character) length of a string
each time len(s) is called? Doing this is only slightly more time
consuming than a normal strlen: really just O(n) + c, where 'c' is the
constant time needed for table lookup (to get the number of bytes in
the UTF-8 sequence given the start character) and the pointer
manipulation (to add that length to your span pointer).
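
Roughly what that O(n) + c count looks like, sketched in Python over a
1.x-style byte string (it simply skips UTF-8 continuation bytes rather
than using a lookup table):

    def utf8_length(data):
        # Continuation bytes have the form 10xxxxxx; everything else
        # starts a new character.
        count = 0
        for ch in data:
            if (ord(ch) & 0xC0) != 0x80:
                count = count + 1
        return count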

 > - there can possibly be an encoding attribute which may specify the used
 > encoding, if known.

So is this used to handle the case where you have a legacy encoding
(ShiftJIS, say) used in your existing strings, so you flag that 8-bit
("narrow" in a way) string as ShiftJIS?

If wide strings are always Unicode, why do you need the encoding?


 > Admittedly, this is tricky and involves quite a bit of effort to implement,
 > since all string methods need to have narrow/wide switch. To make it worse,
 > it hardly offers anything the current solution doesn't. However, it offers
 > one IMHO _big_ advantage: C code that just passes strings along does not
 > need to change: wide strings can be seen as narrow strings without any
 > loss. This allows for __str__() & str() and friends to work with unicode
 > strings without any change.

If you store wide strings as UCS2 then people using the C interface
lose: strlen() stops working, or will return incorrect
results. Indeed, any of the str*() routines in the C runtime will
break. This is the advantage of using UTF-8 here --- you can still use
strcpy and the like on the C side and have things work.

 > Any thoughts?

I'm doing essentially what you suggest in my Unicode enablement of MySQL.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From skip@mojam.com  Wed May  3 21:51:49 2000
From: skip@mojam.com (Skip Montanaro)
Date: Wed, 3 May 2000 15:51:49 -0500 (CDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14608.35257.729641.178724@cymru.basistech.com>
References: <l03102800b5362642bae3@[193.78.237.149]>
 <14608.35257.729641.178724@cymru.basistech.com>
Message-ID: <14608.37223.787291.236623@beluga.mojam.com>

    Tom> Is it possible to add a logical length field too? I presume it is
    Tom> too expensive to recalculate the logical (character) length of a
    Tom> string each time len(s) is called? Doing this is only slightly more
    Tom> time consuming than a normal strlen: ...

Note that currently the len() method doesn't call strlen() at all.  It just
returns the ob_size field.  Presumably, with Just's proposal len() would
simply return ob_size/width.  If you used a variable width encoding, Just's
plan wouldn't work.  (I don't know anything about string encodings - is
UTF-8 variable width?)



From guido@python.org  Wed May  3 22:22:59 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 17:22:59 -0400
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Your message of "Wed, 03 May 2000 20:55:24 BST."
 <l03102800b5362642bae3@[193.78.237.149]>
References: <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <200005032122.RAA05150@eric.cnri.reston.va.us>

> Today I had a relatively simple idea that unites wide strings and narrow
> strings in a way that is more backward compatible at the C level. It's quite
> possible this has already been considered and rejected for reasons that are
> not yet obvious to me, but I'll give it a shot anyway.
> 
> The main concept is not to provide a new string type but to extend the
> existing string object like so:
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.
> - there's a flag that specifies whether the string is narrow or wide.
> - the ob_size field is the _physical_ length of the data; if the string is
> wide, len(s) will return ob_size/2, all other string operations will have
> to do similar things.
> - there can possibly be an encoding attribute which may specify the used
> encoding, if known.
> 
> Admittedly, this is tricky and involves quite a bit of effort to implement,
> since all string methods need to have narrow/wide switch. To make it worse,
> it hardly offers anything the current solution doesn't. However, it offers
> one IMHO _big_ advantage: C code that just passes strings along does not
> need to change: wide strings can be seen as narrow strings without any
> loss. This allows for __str__() & str() and friends to work with unicode
> strings without any change.

This seems to have some nice properties, but I think it would cause
problems for existing C code that tries to *interpret* the bytes of a
string: it could very well do the wrong thing for wide strings (since
old C code doesn't check for the "wide" flag).  I'm not sure how much
C code there is that merely passes strings along...  Most C code using
strings makes use of the strings (e.g. open() falls in this category
in my eyes).

--Guido van Rossum (home page: http://www.python.org/~guido/)


From tree@basistech.com  Wed May  3 23:05:39 2000
From: tree@basistech.com (Tom Emerson)
Date: Wed, 3 May 2000 18:05:39 -0400 (EDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14608.37223.787291.236623@beluga.mojam.com>
References: <l03102800b5362642bae3@[193.78.237.149]>
 <14608.35257.729641.178724@cymru.basistech.com>
 <14608.37223.787291.236623@beluga.mojam.com>
Message-ID: <14608.41651.781464.747522@cymru.basistech.com>

Skip Montanaro writes:
 > Note that currently the len() method doesn't call strlen() at all.  It just
 > returns the ob_size field.  Presumably, with Just's proposal len() would
 > simply return ob_size/width.  If you used a variable width encoding, Just's
 > plan wouldn't work.  (I don't know anything about string encodings - is
 > UTF-8 variable width?)

Yes, technically from 1 - 6 bytes per character, though in practice
for Unicode it's 1 - 3.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From guido@python.org  Thu May  4 01:52:39 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 20:52:39 -0400
Subject: [Python-Dev] weird bug in test_winreg
Message-ID: <200005040052.UAA07874@eric.cnri.reston.va.us>

I just noticed a weird traceback in test_winreg.  When I import
test.autotest on Windows, I get a "test failed" notice for
test_winreg.  When I run it by itself the test succeeds.  But when I
first import test.autotest and then import test.test_winreg (which
should rerun the latter, since test.regrtest unloads all test modules
after they have run), I get an AttributeError telling me that 'None'
object has no attribute 'get'.  This is in encodings.__init__.py in
the first call to _cache.get() in search_function.  Somehow this is
called by SetValueEx() in WriteTestData() in test/test_winreg.py.  But
inspection of the encodings module shows that _cache is {}, not None,
and the source shows no evidence of how this could have happened.

Any suggestions?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido@python.org  Thu May  4 01:57:50 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 03 May 2000 20:57:50 -0400
Subject: [Python-Dev] weird bug in test_winreg
In-Reply-To: Your message of "Wed, 03 May 2000 20:52:39 EDT."
 <200005040052.UAA07874@eric.cnri.reston.va.us>
References: <200005040052.UAA07874@eric.cnri.reston.va.us>
Message-ID: <200005040057.UAA07966@eric.cnri.reston.va.us>

> I just noticed a weird traceback in test_winreg.  When I import
> test.autotest on Windows, I get a "test failed" notice for
> test_winreg.  When I run it by itself the test succeeds.  But when I
> first import test.autotest and then import test.test_winreg (which
> should rerun the latter, since test.regrtest unloads all test modules
> after they have run), I get an AttributeError telling me that 'None'
> object has no attribute 'get'.  This is in encodings.__init__.py in
> the first call to _cache.get() in search_function.  Somehow this is
> called by SetValueEx() in WriteTestData() in test/test_winreg.py.  But
> inspection of the encodings module shows that _cache is {}, not None,
> and the source shows no evidence of how this could have happened.

I may have sounded confused: the problem is not caused by the
reload().  The test fails the first time around when run by
test.autotest.  My suspicion is that another test somehow overwrites
encodings._cache?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mhammond@skippinet.com.au  Thu May  4 02:20:24 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 4 May 2000 11:20:24 +1000
Subject: [Python-Dev] FW: weird bug in test_winreg
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au>

Oops - I didn't notice the CC - a copy of what I sent to Guido:

-----Original Message-----
From: Mark Hammond [mailto:mhammond@skippinet.com.au]
Sent: Thursday, 4 May 2000 11:13 AM
To: Guido van Rossum
Subject: RE: weird bug in test_winreg


Hah - I was just thinking about this myself.  If I wasn't waiting 24
hours, I would have beaten you to the test_fork1 patch :-)

However, there is something bad going on.  If you remove your test_fork1
patch, and run it from regrtest (_not_ stand alone) you will see the
children threads die with:

  File "L:\src\Python-cvs\Lib\test\test_fork1.py", line 30, in f
    alive[id] = os.getpid()
AttributeError: 'None' object has no attribute 'getpid'

Note the error - os is None!

[The reason it only happens as part of the test suite is that the children are
created before the main thread fails with the attribute error]

Similarly, I get spurious:

Traceback (most recent call last):
  File ".\test_thread.py", line 103, in task2
    mutex.release()
AttributeError: 'None' object has no attribute 'release'

(Only rarely, and never when run stand-alone - the test_fork1 exception
happens 100% of the time from the test suite)

And of course the test_winreg one.

test_winreg, I guessed, may be caused by the import lock (but it's certainly
not obvious how or why!?).  However, that doesn't explain the others.

I also saw these _before_ I applied the threading patches (and after!)

So I think the problem may be a little deeper?

Mark.



From just@letterror.com  Thu May  4 08:42:00 2000
From: just@letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 08:42:00 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <200005032122.RAA05150@eric.cnri.reston.va.us>
References: Your message of "Wed, 03 May 2000 20:55:24 BST."
 <l03102800b5362642bae3@[193.78.237.149]>
 <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <l03102800b536d1d8c0bc@[193.78.237.161]>

(Thanks for all the comments. I'll condense my replies into one post.)

[JvR]
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.

[Tom Emerson wrote]
>I disagree with you here... store them as UTF-8.

Erm, utf-8 in a wide string? This makes no sense...

[Skip Montanaro]
>Presumably, with Just's proposal len() would
>simply return ob_size/width.

Right. And if you allow values for width other than 1 and 2, it opens
the way for UCS-4. Wouldn't that be nice? It's hardly more effort, and
"only" width==1 needs to be special-cased for speed.

>If you used a variable width encoding, Just's plan wouldn't work.

Correct, but nor does the current unicode object. Variable width encodings
are too messy to see as strings at all: they are only useful as byte arrays.

[GvR]
>This seems to have some nice properties, but I think it would cause
>problems for existing C code that tries to *interpret* the bytes of a
>string: it could very well do the wrong thing for wide strings (since
>old C code doesn't check for the "wide" flag).  I'm not sure how much
>C code there is that merely passes strings along...  Most C code using
>strings makes use of the strings (e.g. open() falls in this category
>in my eyes).

There are probably many cases that fall into this category. But then again,
these cases, especially those that can potentially deal with encodings
other than ASCII, are not much helped by a default encoding, as /F
showed.

My idea arose after yesterday's discussions. Some quotes, plus comments:

[GvR]
>However the problem is that print *always* first converts the object
>using str(), and str() enforces that the result is an 8-bit string.
>I'm afraid that loosening this will break too much code.  (This all
>really happens at the C level.)

Guido goes on to explain that this means utf-8 is the only sensible default
in this case. Good reasoning, but I think it's backwards:
- str(unicodestring) should just return unicodestring
- it is important that stdout receives the original unicode object.

[MAL]
>BTW, __str__() has to return strings too. Perhaps we
>need __unicode__() and a corresponding slot function too ?!

This also seems backwards. If it's really too hard to change Python so that
__str__ can return unicode objects, my solution may help.

[Ka-Ping Yee]
>Here is an addendum that might actually make that proposal
>feasible enough (compatibility-wise) to fly in the short term:
>
>    print x
>
>does, conceptually:
>
>    try:
>        sys.stdout.printout(x)
>    except AttributeError:
>        sys.stdout.write(str(x))
>        sys.stdout.write("\n")

The fact that stuff like this is even being *proposed* (not that it's not
smart or anything...) means there's a terrible bottleneck somewhere which
needs fixing. My proposal seems to do that nicely.

Of course, there's no such thing as a free lunch, and I'm sure there are
other corners that'll need fixing, but it appears having to write

    if (!PyString_Check(doc) && !PyUnicode_Check(doc))
        ...

in all places that may accept unicode strings is no fun either.

Yes, some code will break if you throw a wide string at it, but I think
that code is easier repaired with my proposal than with the current
implementation.

It's a big advantage to have only one string type; it makes many problems
we've been discussing easier to talk about.

Just




From effbot@telia.com  Thu May  4 08:46:05 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Thu, 4 May 2000 09:46:05 +0200
Subject: [Python-Dev] Unicode debate
References: <Pine.LNX.4.10.10005030556250.522-100000@localhost>
Message-ID: <002d01bfb59c$cf482280$34aab5d4@hagrid>

Ka-Ping Yee <ping@lfw.org> wrote:
> I know this is not a small change, but i'm pretty convinced the
> right answer here is that the print hook should call a *method*
> on sys.stdout, whatever sys.stdout happens to be.  The details
> are described in the other long message i wrote ("Printing objects
> on files").
>
> Here is an addendum that might actually make that proposal
> feasible enough (compatibility-wise) to fly in the short term:
>
>     print x
>
> does, conceptually:
>
>     try:
>         sys.stdout.printout(x)
>     except AttributeError:
>         sys.stdout.write(str(x))
>         sys.stdout.write("\n")
>
> The rest can then be added, and the change in 'print x' will
> work nicely for any file objects, but will not break on file-like
> substitutes that don't define a 'printout' method.

another approach is (simplified):

    try:
        sys.stdout.write(x.encode(sys.stdout.encoding))
    except AttributeError:
        sys.stdout.write(str(x))

or, if str is changed to return any kind of string:

    x = str(x)
    try:
        x = x.encode(sys.stdout.encoding)
    except AttributeError:
        pass
    sys.stdout.write(x)

</F>
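
For the record, a self-contained rendering of the first variant above; the
stream's "encoding" attribute is the assumption here (Fredrik's
sys.stdout.encoding), and a byte stream is used so it runs under today's
Python:

    import io

    class EncodingAwareStream(io.BytesIO):
        # Hypothetical stream that declares the encoding it wants,
        # standing in for sys.stdout.encoding in the sketch above.
        encoding = "utf-8"

    def print_to(stream, x):
        # Try to encode for the stream; fall back to plain str() for
        # objects or streams lacking the relevant attribute.
        try:
            data = x.encode(stream.encoding)
        except AttributeError:
            data = str(x).encode("ascii")   # stand-in for the old default
        stream.write(data + b"\n")

    s = EncodingAwareStream()
    print_to(s, "h\u00e9llo")
    assert s.getvalue() == "h\u00e9llo\n".encode("utf-8")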



From ht@cogsci.ed.ac.uk  Thu May  4 09:51:39 2000
From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Date: 04 May 2000 09:51:39 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Wed, 03 May 2000 08:16:56 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net>
 <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net>
 <200005020053.UAA23665@eric.cnri.reston.va.us>
 <f5bog6o54zj.fsf@cogsci.ed.ac.uk>
 <200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido@python.org> writes:

<snip/>

> My ASCII proposal is a compromise that tries to be fair to both uses
> for strings.  Introducing byte arrays as a more fundamental type has
> been on the wish list for a long time -- I see no way to introduce
> this into Python 1.6 without totally botching the release schedule
> (June 1st is very close already!).  I'd like to be able to move on,
> there are other important things still to be added to 1.6 (Vladimir's
> malloc patches, Neil's GC, Fredrik's completed sre...).
> 
> For 1.7 (which should happen later this year) I promise I'll reopen
> the discussion on byte arrays.

I think I hear a moderate consensus developing that the 'ASCII
proposal' is a reasonable compromise given the time constraints.  But
let's not fail to come back to this ASAP -- it _really_ narcs me that
every time I load XML into my Python-based editor I'm going to convert
large amounts of wide-string data into UTF-8 just so Tk can convert it
back to wide-strings in order to display it!

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/


From just@letterror.com  Thu May  4 12:27:45 2000
From: just@letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 12:27:45 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102800b536d1d8c0bc@[193.78.237.161]>
References: <200005032122.RAA05150@eric.cnri.reston.va.us> Your message of
 "Wed, 03 May 2000 20:55:24 BST."
 <l03102800b5362642bae3@[193.78.237.149]>
 <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <l03102809b53709fef820@[193.78.237.126]>

I wrote:
>It's a big advantage to have only one string type; it makes many problems
>we've been discussing easier to talk about.

I think I should've been more explicit about what I meant here. I'll try to
phrase it as an addendum to my proposal -- which suddenly is no longer just
a narrow/wide string unification but narrow/wide/ultrawide, to really be
ready for the future...

As someone else suggested in the discussion, I think it's good if we
separate the encoding from the data type. Meaning that wide strings are no
longer tied to Unicode. This allows for double-byte encodings other than
UCS-2 as well as for safe passing-through of binary goop, but that's not
the main point. The main point is that this will make the behavior of
(wide) strings more understandable and consistent.

The extended string type is simply a sequence of code points: 0-0xFF for
narrow strings, 0-0xFFFF for wide strings, and 0-0xFFFFFFFF for ultra-wide
strings. Upcasting is always safe; downcasting may raise OverflowError.
Depending on the encoding used, this comes as close as possible to the
sequence-of-characters model.
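
A sketch of that up/downcast rule, with strings modelled as plain lists of
code points (the names are illustrative, not proposed API):

    NARROW_MAX, WIDE_MAX, ULTRA_MAX = 0xFF, 0xFFFF, 0xFFFFFFFF

    def downcast(codepoints, limit):
        # Upcasting is the identity on code points; downcasting checks
        # that every code point fits in the smaller width.
        for cp in codepoints:
            if cp > limit:
                raise OverflowError("code point %#x does not fit" % cp)
        return list(codepoints)

    downcast([0x41, 0xE9], NARROW_MAX)        # fits in a narrow string
    # downcast([0x41, 0x20AC], NARROW_MAX)    # would raise OverflowError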

The default character set should of course be Unicode -- and it should be
obvious that this implies Latin-1 for narrow strings.

(Additionally: an encoding attribute suddenly makes a whole lot of sense
again.)

Ok, y'all can shoot me now ;-)

Just




From guido@python.org  Thu May  4 13:40:35 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 04 May 2000 08:40:35 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "04 May 2000 09:51:39 BST."
 <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us>
 <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
Message-ID: <200005041240.IAA08277@eric.cnri.reston.va.us>

> I think I hear a moderate consensus developing that the 'ASCII
> proposal' is a reasonable compromise given the time constraints.  But
> let's not fail to come back to this ASAP -- it _really_ narcs me that
> every time I load XML into my Python-based editor I'm going to convert
> large amounts of wide-string data into UTF-8 just so Tk can convert it
> back to wide-strings in order to display it!

Thanks -- but that's really Tcl's fault, since the only way to get
character data *into* Tcl (or out of it) is through the UTF-8
encoding.

And is your XML really stored on disk in its 16-bit format?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From fredrik@pythonware.com  Thu May  4 14:21:25 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Thu, 4 May 2000 15:21:25 +0200
Subject: [Python-Dev] Re: Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us>             <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>  <200005041240.IAA08277@eric.cnri.reston.va.us>
Message-ID: <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com>

Guido van Rossum <guido@python.org> wrote:
> Thanks -- but that's really Tcl's fault, since the only way to get
> character data *into* Tcl (or out of it) is through the UTF-8
> encoding.

from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm

    Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars)

    Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new
    object or modify an existing object to hold a copy of the
    Unicode string given by unicode and numChars.

    (Tcl_UniChar* is currently the same thing as Py_UNICODE*)

</F>



From guido@python.org  Thu May  4 18:03:58 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 04 May 2000 13:03:58 -0400
Subject: [Python-Dev] FW: weird bug in test_winreg
In-Reply-To: Your message of "Thu, 04 May 2000 11:20:24 +1000."
 <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au>
References: <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au>
Message-ID: <200005041703.NAA13471@eric.cnri.reston.va.us>

Mark Hammond:

> However, there is something bad going on.  If you remove your test_fork1
> patch, and run it from regrtest (_not_ stand alone) you will see the
> children threads die with:
> 
>   File "L:\src\Python-cvs\Lib\test\test_fork1.py", line 30, in f
>     alive[id] = os.getpid()
> AttributeError: 'None' object has no attribute 'getpid'
> 
> Note the error - os is None!
> 
> [The reason it only happens as part of the test suite is that the children are
> created before the main thread fails with the attribute error]

I don't get this one -- maybe my machine is too slow.  (130 MHz
Pentium.)

> Similarly, I get spurious:
> 
> Traceback (most recent call last):
>   File ".\test_thread.py", line 103, in task2
>     mutex.release()
> AttributeError: 'None' object has no attribute 'release'
> 
> (Only rarely, and never when run stand-alone - the test_fork1 exception
> happens 100% of the time from the test suite)
> 
> And of course the test_winreg one.
> 
> test_winreg, I guessed, may be caused by the import lock (but it's certainly
> not obvious how or why!?).  However, that doesn't explain the others.
> 
> I also saw these _before_ I applied the threading patches (and after!)
> 
> So I think the problem may be a little deeper?

It's Vladimir's patch which, after each test, unloads all modules
that were loaded by that test.  If I change this to only unload
modules whose name starts with "test.", the test_winreg problem goes
away, and I bet yours go away too.

The real reason must be deeper -- there's also the import lock and the
fact that if a submodule of package "test" tries to import "os", a
search for "test.os" is made and if it doesn't find it it sticks None
in sys.modules['test.os'].

But I don't have time to research this further.

I'm tempted to apply the following change to regrtest.py.  This should
still unload the test modules (so you can rerun an individual test)
but it doesn't touch other modules.  I'll wait 24 hours. :-)

*** regrtest.py	2000/04/21 21:35:06	1.15
--- regrtest.py	2000/05/04 16:56:26
***************
*** 121,127 ****
              skipped.append(test)
          # Unload the newly imported modules (best effort finalization)
          for module in sys.modules.keys():
!             if module not in save_modules:
                  test_support.unload(module)
      if good and not quiet:
          if not bad and not skipped and len(good) > 1:
--- 121,127 ----
              skipped.append(test)
          # Unload the newly imported modules (best effort finalization)
          for module in sys.modules.keys():
!             if module not in save_modules and module.startswith("test."):
                  test_support.unload(module)
      if good and not quiet:
          if not bad and not skipped and len(good) > 1:

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gvwilson@nevex.com  Thu May  4 20:03:54 2000
From: gvwilson@nevex.com (gvwilson@nevex.com)
Date: Thu, 4 May 2000 15:03:54 -0400 (EDT)
Subject: [Python-Dev] Minimal (single-file) Python?
Message-ID: <Pine.LNX.4.10.10005041448010.22917-100000@akbar.nevex.com>

Hi.  Has anyone ever built, or thought about building, a single-file
Python, in which all the "basic" capabilities are included in a single
executable (where "basic" means "can do as much as the Bourne shell")?
Some of the entries in the Software Carpentry competition would like to be
able to bootstrap from as small a starting point as possible.

Thanks,
Greg

p.s. I don't think this is the same problem as moving built-in features of
Python into optionally-loaded libraries, as some of the things in the
'sys', 'string', and 'os' modules would have to move in the other
direction to ensure Bourne shell equivalence.




From just@letterror.com  Thu May  4 22:22:38 2000
From: just@letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 22:22:38 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
Message-ID: <l03102810b5378dda02f5@[193.78.237.126]>

(Boy, is it quiet here all of a sudden ;-)

Sorry for the duplication of stuff, but I'd like to reiterate my points, to
separate them from my implementation proposal, as that's just what it is:
an implementation detail.

These things are important to me:
- get rid of the Unicode-ness of wide strings, in order to
- make narrow and wide strings as similar as possible
- implicit conversion between narrow and wide strings should
  happen purely on the basis of the character codes; no
  assumption at all should be made about the encoding, ie.
  what the character code _means_.
- downcasting from wide to narrow may raise OverflowError if
  there are characters in the wide string that are > 255
- str(s) should always return s if s is a string, whether narrow
  or wide
- file objects need to be responsible for handling wide strings
- the above two points should make it possible for
- if no encoding is known, Unicode is the default, whether
  narrow or wide

The above points seem to have the following consequences:
- the 'u' in \uXXXX notation no longer makes much sense,
  since it is not necessary for the character to be a Unicode
  code point: it's just a 2-byte int. \wXXXX might be an option.
- the u"" notation is no longer neccesary: if a string literal
  contains a character > 255 the string should automatically
  become a wide string.
- narrow strings should also have an encode() method.
- the builtin unicode() function might be redundant if:
  - it is possible to specify a source encoding. I'm not sure if
    this is best done through an extra argument for encode()
    or that it should be a new method, eg. transcode().
  - s.encode() or s.transcode() are allowed to output a wide
    string, as in aNarrowString.encode("UCS-2") and
    s.transcode("Mac-Roman", "UCS-2").

My proposal to extend the "old" string type to be able to contain wide
strings is of course largely unrelated to all this. Yet it may provide some
additional C compatibility (especially now that silent conversion to utf-8
is out) as well as a workaround for the
str()-having-to-return-a-narrow-string bottleneck.
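
The encode()/transcode() idea in the list above boils down to a
decode-then-encode pair; a rough sketch in today's codec terms, with
utf-16-be standing in for UCS-2 since the codec names differ:

    def transcode(data, source_encoding, target_encoding):
        # Hypothetical helper: interpret the bytes in the source encoding,
        # then re-encode them in the target encoding.
        return data.decode(source_encoding).encode(target_encoding)

    # aNarrowString.encode("UCS-2"), roughly:
    assert transcode(b"abc", "latin-1", "utf-16-be") == b"\x00a\x00b\x00c"
    # s.transcode("Mac-Roman", "UCS-2"), roughly (0x8E is e-acute in Mac-Roman):
    transcode(b"caf\x8e", "mac-roman", "utf-16-be")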

Just




From skip@mojam.com  Thu May  4 21:43:42 2000
From: skip@mojam.com (Skip Montanaro)
Date: Thu, 4 May 2000 15:43:42 -0500 (CDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102810b5378dda02f5@[193.78.237.126]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <14609.57598.738381.250872@beluga.mojam.com>

    Just> Sorry for the duplication of stuff, but I'd like to reiterate my
    Just> points, to separate them from my implementation proposal, as
    Just> that's just what it is: an implementation detail.

    Just> These things are important to me:
    ...

For the encoding-challenged like me, does it make sense to explicitly state
that you can't mix character widths within a single string, or is that just
so obvious that I deserve a head slap just for mentioning it?

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From effbot@telia.com  Thu May  4 22:02:35 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Thu, 4 May 2000 23:02:35 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]><l03102800b52d80db1290@[193.78.237.154]><200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><f5bog6o54zj.fsf@cogsci.ed.ac.uk><200005031216.IAA03274@eric.cnri.reston.va.us> <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
Message-ID: <007701bfb60c$1543f060$34aab5d4@hagrid>

Henry S. Thompson <ht@cogsci.ed.ac.uk> wrote:
> I think I hear a moderate consensus developing that the 'ASCII
> proposal' is a reasonable compromise given the time constraints.

agreed.

(but even if we settle for "7-bit unicode" in 1.6, there are still a
few issues left to sort out before 1.6 final.  but it might be best
to get back to that after we've added SRE and GC to 1.6a3. we
might all need a short break...)

> But let's not fail to come back to this ASAP

first week in june, promise ;-)

</F>



From mhammond@skippinet.com.au  Fri May  5 00:55:15 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 5 May 2000 09:55:15 +1000
Subject: [Python-Dev] FW: weird bug in test_winreg
In-Reply-To: <200005041703.NAA13471@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBAEEBCKAA.mhammond@skippinet.com.au>

> It's Vladimir's patch which, after each tests, unloads all modules
> that were loaded by that test.  If I change this to only unload
> modules whose name starts with "test.", the test_winreg problem goes
> away, and I bet yours go away too.

They do indeed!

> The real reason must be deeper -- there's also the import lock and the
> fact that if a submodule of package "test" tries to import "os", a
> search for "test.os" is made and if it doesn't find it it sticks None
> in sys.modules['test.os'].
>
> but I don't have time to research this further.

I started to think about this.  The issue is simply that code which
blithely wipes sys.modules[] may cause unexpected results.  While the end
result is a bug, the symptoms are caused by extreme hackiness.

Seeing as my time is also limited, I say we forget it!

> I'm tempted to apply the following change to regrtest.py.  This should
> still unload the test modules (so you can rerun an individual test)
> but it doesn't touch other modules.  I'll wait 24 hours. :-)

The 24 hour time limit is only supposed to apply to _my_ patches - you can
check yours straight in (and if anyone asks, just tell them I said it was
OK) :-)

Mark.



From ht@cogsci.ed.ac.uk  Fri May  5 09:19:07 2000
From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Date: 05 May 2000 09:19:07 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Thu, 04 May 2000 08:40:35 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net>
 <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net>
 <200005020053.UAA23665@eric.cnri.reston.va.us>
 <f5bog6o54zj.fsf@cogsci.ed.ac.uk>
 <200005031216.IAA03274@eric.cnri.reston.va.us>
 <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
 <200005041240.IAA08277@eric.cnri.reston.va.us>
Message-ID: <f5bya5pxvd0.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido@python.org> writes:

> > I think I hear a moderate consensus developing that the 'ASCII
> > proposal' is a reasonable compromise given the time constraints.  But
> > let's not fail to come back to this ASAP -- it _really_ narcs me that
> > every time I load XML into my Python-based editor I'm going to convert
> > large amounts of wide-string data into UTF-8 just so Tk can convert it
> > back to wide-strings in order to display it!
> 
> Thanks -- but that's really Tcl's fault, since the only way to get
> character data *into* Tcl (or out of it) is through the UTF-8
> encoding.
> 
> And is your XML really stored on disk in its 16-bit format?

No, I have no idea what encoding it's in, my XML parser supports over
a dozen encodings, and quite sensibly always delivers the content, as
per the XML REC, as wide-strings.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/


From ht@cogsci.ed.ac.uk  Fri May  5 09:21:41 2000
From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Date: 05 May 2000 09:21:41 +0100
Subject: [Python-Dev] Re: [XML-SIG] Re: Unicode debate
In-Reply-To: "Fredrik Lundh"'s message of "Thu, 4 May 2000 15:21:25 +0200"
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net>
 <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net>
 <200005020053.UAA23665@eric.cnri.reston.va.us>
 <f5bog6o54zj.fsf@cogsci.ed.ac.uk>
 <200005031216.IAA03274@eric.cnri.reston.va.us>
 <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
 <200005041240.IAA08277@eric.cnri.reston.va.us>
 <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com>
Message-ID: <f5bu2gdxv8q.fsf@cogsci.ed.ac.uk>

"Fredrik Lundh" <fredrik@pythonware.com> writes:

> Guido van Rossum <guido@python.org> wrote:
> > Thanks -- but that's really Tcl's fault, since the only way to get
> > character data *into* Tcl (or out of it) is through the UTF-8
> > encoding.
> 
> from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm
> 
>     Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars)
> 
>     Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new
>     object or modify an existing object to hold a copy of the
>     Unicode string given by unicode and numChars.
> 
>     (Tcl_UniChar* is currently the same thing as Py_UNICODE*)
> 

Any way this can be exploited in Tkinter?

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/


From just@letterror.com  Fri May  5 10:25:37 2000
From: just@letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 10:25:37 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <007701bfb60c$1543f060$34aab5d4@hagrid>
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
 <200004281450.KAA16493@eric.cnri.reston.va.us>
 <390AEF1D.253B93EF@prescod.net>
 <200005011802.OAA21612@eric.cnri.reston.va.us>
 <390DEB45.D8D12337@prescod.net>
 <200005012132.RAA23319@eric.cnri.reston.va.us>
 <390E1F08.EA91599E@prescod.net>
 <200005020053.UAA23665@eric.cnri.reston.va.us>
 <f5bog6o54zj.fsf@cogsci.ed.ac.uk>
 <200005031216.IAA03274@eric.cnri.reston.va.us>
 <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
Message-ID: <l03102802b5383fd7c128@[193.78.237.126]>

At 11:02 PM +0200 04-05-2000, Fredrik Lundh wrote:
>Henry S. Thompson <ht@cogsci.ed.ac.uk> wrote:
>> I think I hear a moderate consensus developing that the 'ASCII
>> proposal' is a reasonable compromise given the time constraints.
>
>agreed.

This makes no sense: implementing the 7-bit proposal takes more or less
the same time as implementing 8-bit downcasting. Or is it just the
bickering that's too time consuming? ;-)

I worry that if the current implementation goes into 1.6 more or less as it
is now there's no way we can ever go back (before P3K). Or will Unicode
support be marked "experimental" in 1.6? This is not so much about the
7-bit/8-bit proposal but about the dubious unicode() and unichr() functions
and the u"" notation:

- unicode() only takes strings, so is effectively a method of the string type.
- if narrow and wide strings are meant to be as similar as possible,
chr(256) should just return a wide char
- similarly, why is the u"" notation needed at all?

The current design is more complex than needed, and still offers plenty of
surprises. Making it simpler (without integrating the two string types) is
not a huge effort. Seeing the wide string type as independent of Unicode
takes no physical effort at all, as it's just in our heads.

Fixing str() so it can return wide strings might be harder, and can wait
until later. Would be too bad, though.

Just




From ping@lfw.org  Fri May  5 10:21:20 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Fri, 5 May 2000 02:21:20 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <002d01bfb59c$cf482280$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org>

On Thu, 4 May 2000, Fredrik Lundh wrote:
> 
> another approach is (simplified):
> 
>     try:
>         sys.stdout.write(x.encode(sys.stdout.encoding))
>     except AttributeError:
>         sys.stdout.write(str(x))

Indeed, that would work to solve just this specific Unicode
issue -- but there is a lot of flexibility and power to be
gained from the general solution of putting a method on the
stream object, as the example with the formatted list items
showed.  I think it is a good idea, for instance, to leave
decisions about how to print Unicode up to the Unicode object,
and not hardcode bits of it into print.

Guido, have you digested my earlier 'printout' suggestions?


-- ?!ng

"Old code doesn't die -- it just smells that way."
    -- Bill Frantz



From tdickenson@geminidataloggers.com  Fri May  5 10:07:46 2000
From: tdickenson@geminidataloggers.com (Toby Dickenson)
Date: Fri, 05 May 2000 10:07:46 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102810b5378dda02f5@[193.78.237.126]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <me25hs0diag8d0b6bu5gqjpchdq5q3aig5@4ax.com>

On Thu, 4 May 2000 22:22:38 +0100, Just van Rossum
<just@letterror.com> wrote:

>(Boy, is it quiet here all of a sudden ;-)
>
>Sorry for the duplication of stuff, but I'd like to reiterate my points, to
>separate them from my implementation proposal, as that's just what it is:
>an implementation detail.
>
>These things are important to me:
>- get rid of the Unicode-ness of wide strings, in order to
>- make narrow and wide strings as similar as possible
>- implicit conversion between narrow and wide strings should
>  happen purely on the basis of the character codes; no
>  assumption at all should be made about the encoding, ie.
>  what the character code _means_.
>- downcasting from wide to narrow may raise OverflowError if
>  there are characters in the wide string that are > 255
>- str(s) should always return s if s is a string, whether narrow
>  or wide
>- file objects need to be responsible for handling wide strings
>- the above two points should make it possible for
>- if no encoding is known, Unicode is the default, whether
>  narrow or wide
>
>The above points seem to have the following consequences:
>- the 'u' in \uXXXX notation no longer makes much sense,
>  since it is not necessary for the character to be a Unicode
>  code point: it's just a 2-byte int. \wXXXX might be an option.
>- the u"" notation is no longer neccesary: if a string literal
>  contains a character > 255 the string should automatically
>  become a wide string.
>- narrow strings should also have an encode() method.
>- the builtin unicode() function might be redundant if:
>  - it is possible to specify a source encoding. I'm not sure if
>    this is best done through an extra argument for encode()
>    or that it should be a new method, eg. transcode().

>  - s.encode() or s.transcode() are allowed to output a wide
>    string, as in aNarrowString.encode("UCS-2") and
>    s.transcode("Mac-Roman", "UCS-2").

One other pleasant consequence:

- String comparisons work character-by character, even if the
  representation of those characters have different widths.

>My proposal to extend the "old" string type to be able to contain wide
>strings is of course largely unrelated to all this. Yet it may provide some
>additional C compatibility (especially now that silent conversion to utf-8
>is out) as well as a workaround for the
>str()-having-to-return-a-narrow-string bottleneck.


Toby Dickenson
tdickenson@geminidataloggers.com


From just@letterror.com  Fri May  5 12:40:49 2000
From: just@letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 12:40:49 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <me25hs0diag8d0b6bu5gqjpchdq5q3aig5@4ax.com>
References: <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <l03102805b5385e3de8e8@[193.78.237.127]>

At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote:
>One other pleasant consequence:
>
>- String comparisons work character-by character, even if the
>  representation of those characters have different widths.

Exactly. By saying "(wide) strings are not tied to Unicode" the question
whether wide strings should or should not be sorted according to the
Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
too hard anyway"...

Just




From tree@basistech.com  Fri May  5 12:46:41 2000
From: tree@basistech.com (Tom Emerson)
Date: Fri, 5 May 2000 07:46:41 -0400 (EDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102805b5385e3de8e8@[193.78.237.127]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102805b5385e3de8e8@[193.78.237.127]>
Message-ID: <14610.46241.129977.642796@cymru.basistech.com>

Just van Rossum writes:
 > At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote:
 > >One other pleasant consequence:
 > >
 > >- String comparisons work character-by character, even if the
 > >  representation of those characters have different widths.
 > 
 > Exactly. By saying "(wide) strings are not tied to Unicode" the question
 > whether wide strings should or should not be sorted according to the
 > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
 > too hard anyway"...

Wait a second.

There is nothing about Unicode that would prevent you from defining
string equality as byte-level equality.

This strikes me as the wrong way to deal with the complex collation
issues of Unicode.

It seems to me that by default wide-strings compare at the byte-level
(i.e., '=' is a byte level comparison). If you want a normalized
comparison, then you make an explicit function call for that.

This is no different from comparing strings in a case sensitive
vs. case insensitive manner.
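
That split is easy to show with today's unicodedata module: '==' stays a
plain per-code-point comparison, and any normalized comparison is an
explicit, separate step:

    import unicodedata

    a = "caf\u00e9"        # e-acute as one code point
    b = "cafe\u0301"       # 'e' followed by a combining acute accent
    assert a != b                               # plain code-point comparison
    assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)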

       -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From just@letterror.com  Fri May  5 14:17:31 2000
From: just@letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 14:17:31 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14610.46241.129977.642796@cymru.basistech.com>
References: <l03102805b5385e3de8e8@[193.78.237.127]>
 <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102805b5385e3de8e8@[193.78.237.127]>
Message-ID: <l03102808b53877a3e392@[193.78.237.127]>

[Me]
> Exactly. By saying "(wide) strings are not tied to Unicode" the question
> whether wide strings should or should not be sorted according to the
> Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
> too hard anyway"...

[Tom Emerson]
>Wait a second.
>
>There is nothing about Unicode that would prevent you from defining
>string equality as byte-level equality.

Agreed.

>This strikes me as the wrong way to deal with the complex collation
>issues of Unicode.

All I was trying to say was that, by looking at it this way, it is even
more obvious that the builtin comparison should not deal with Unicode
sorting & collation issues. It seems you're saying the exact same thing:

>It seems to me that by default wide-strings compare at the byte-level
>(i.e., '=' is a byte level comparison). If you want a normalized
>comparison, then you make an explicit function call for that.

Exactly.

>This is no different from comparing strings in a case sensitive
>vs. case insensitive manner.

Good point. All this taken together still means to me that comparisons
between wide and narrow strings should take place at the character level,
which implies that coercion from narrow to wide is done at the character
level, without looking at the encoding. (Which in my book in turn still
implies that as long as we're talking about Unicode, narrow strings are
effectively Latin-1.)
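
That last observation is easy to check (shown with today's Python): mapping
byte value N to code point N is exactly what the latin-1 codec does.

    data = bytes(range(256))
    assert data.decode("latin-1") == "".join(chr(b) for b in data)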

Just




From tree@basistech.com  Fri May  5 13:34:35 2000
From: tree@basistech.com (Tom Emerson)
Date: Fri, 5 May 2000 08:34:35 -0400 (EDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102808b53877a3e392@[193.78.237.127]>
References: <l03102805b5385e3de8e8@[193.78.237.127]>
 <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102808b53877a3e392@[193.78.237.127]>
Message-ID: <14610.49115.820599.172598@cymru.basistech.com>

Just van Rossum writes:
 > Good point. All this taken together still means to me that comparisons
 > between wide and narrow strings should take place at the character level,
 > which implies that coercion from narrow to wide is done at the character
 > level, without looking at the encoding. (Which in my book in turn still
 > implies that as long as we're talking about Unicode, narrow strings are
 > effectively Latin-1.)

Only true if "wide" strings are encoded in UCS-2 or UCS-4. If "wide
characters" are Unicode, but stored in UTF-8 encoding, then you loose.

Hmmmm... how often do you expect to compare narrow vs. wide strings,
using default comparison (i.e. = or !=)? What if I'm using Latin 3 and
use the byte comparison? I may very well have two strings (one narrow,
one wide) that compare equal, even though they're not. Not exactly
what I would expect.

     -tree

[I'm flying from Seattle to Boston today, so eventually I will
 disappear for a while]

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From pf@artcom-gmbh.de  Fri May  5 14:13:05 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 5 May 2000 15:13:05 +0200 (MEST)
Subject: [Python-Dev] wide strings vs. Unicode point of view (was Re: [I18n-sig] Unicode st.... alternative)
In-Reply-To: <l03102805b5385e3de8e8@[193.78.237.127]> from Just van Rossum at "May 5, 2000 12:40:49 pm"
Message-ID: <m12nhuj-000CnCC@artcom0.artcom-gmbh.de>

Just van Rossum:
> Exactly. By saying "(wide) strings are not tied to Unicode" the question
> whether wide strings should or should not be sorted according to the
> Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
> too hard anyway"...

I personally like the idea of speaking of "wide strings" containing wide
character codes instead of Unicode objects.

Unfortunately there are many methods which need to interpret the
content of strings according to some encoding knowledge: for example
'upper()', 'lower()', 'swapcase()', 'lstrip()' and so on need to know,
to which class certain characters belong.

This problem was already somewhat visible in 1.5.2, since these methods
were available as library functions from the string module and they worked
with global state maintained by the 'setlocale()' C-library function.
Quoting from the C library man pages:

"""    The details of what constitutes an uppercase or  lowercase
       letter  depend  on  the  current locale.  For example, the
       default "C" locale does not know about umlauts, so no con­
       version is done for them.

       In some non - English locales, there are lowercase letters
       with no corresponding  uppercase  equivalent;  the  German
       sharp s is one example.
"""

I guess applying 'upper' to a Chinese char will not make much sense.

Now these former string module functions were moved into the Python
object core.  So the current Python string and Unicode object API is
somewhat "western centric".  ;-) At least Marc's implementation in
'unicodectype.c' contains the hard-coded assumption that wide strings
really contain Unicode characters.
print u"äöü".upper().encode("latin1")
shows "ÄÖÜ" independent of the locale setting.  This makes sense.
The output from  print u"äöü".upper().encode()  however looks ugly
here on my screen... UTF-8 ... blech: Ã„Ã–Ãœ

Regards and have a nice weekend, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From guido@python.org  Fri May  5 15:49:52 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 05 May 2000 10:49:52 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Fri, 05 May 2000 02:21:20 PDT."
 <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org>
References: <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org>
Message-ID: <200005051449.KAA14138@eric.cnri.reston.va.us>

> Guido, have you digested my earlier 'printout' suggestions?

Not quite, except to the point that they require more thought than to
rush them into 1.6.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Fri May  5 15:54:16 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 05 May 2000 10:54:16 -0400
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Your message of "Thu, 04 May 2000 22:22:38 BST."
 <l03102810b5378dda02f5@[193.78.237.126]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <200005051454.KAA14168@eric.cnri.reston.va.us>

> (Boy, is it quiet here all of a sudden ;-)

Maybe because (according to one report on NPR here) 80% of the world's
email systems are victimized by the ILOVEYOU virus?  You & I are not
affected because it's Windows-specific (a Visual Basic script; I got a
copy mailed to me so I could have a good look :-).  Note that there
are already mutations, one of which pretends to be a joke.

> Sorry for the duplication of stuff, but I'd like to reiterate my points, to
> separate them from my implementation proposal, as that's just what it is:
> an implementation detail.
> 
> These things are important to me:
> - get rid of the Unicode-ness of wide strings, in order to
> - make narrow and wide strings as similar as possible
> - implicit conversion between narrow and wide strings should
>   happen purely on the basis of the character codes; no
>   assumption at all should be made about the encoding, ie.
>   what the character code _means_.
> - downcasting from wide to narrow may raise OverflowError if
>   there are characters in the wide string that are > 255
> - str(s) should always return s if s is a string, whether narrow
>   or wide
> - file objects need to be responsible for handling wide strings
> - the above two points should make it possible for
> - if no encoding is known, Unicode is the default, whether
>   narrow or wide
> 
> The above points seem to have the following consequences:
> - the 'u' in \uXXXX notation no longer makes much sense,
>   since it is not necessary for the character to be a Unicode
>   code point: it's just a 2-byte int. \wXXXX might be an option.
> - the u"" notation is no longer neccesary: if a string literal
>   contains a character > 255 the string should automatically
>   become a wide string.
> - narrow strings should also have an encode() method.
> - the builtin unicode() function might be redundant if:
>   - it is possible to specify a source encoding. I'm not sure if
>     this is best done through an extra argument for encode()
>     or that it should be a new method, eg. transcode().
>   - s.encode() or s.transcode() are allowed to output a wide
>     string, as in aNarrowString.encode("UCS-2") and
>     s.transcode("Mac-Roman", "UCS-2").
> 
> My proposal to extend the "old" string type to be able to contain wide
> strings is of course largely unrelated to all this. Yet it may provide some
> additional C compatibility (especially now that silent conversion to utf-8
> is out) as well as a workaround for the
> str()-having-to-return-a-narrow-string bottleneck.

I'm not so sure that this is enough.  You seem to propose wide strings
as vehicles for 16-bit values (and maybe later 32-bit values) apart
from their encoding.  We already have a data type for that (the array
module).  The Unicode type does a lot more than storing 16-bit values:
it knows lots of encodings to and from Unicode, and it knows things
like which characters are upper or lower or title case and how to map
between them, which characters are word characters, and so on.  All
this is highly Unicode specific and is part of what people ask for
when they request Unicode support.  (Example: Unicode has
405 characters classified as numeric, according to the isnumeric()
method.)

And by the way, don't worry about the comparison.  I'm not changing
the default comparison (==, cmp()) for Unicode strings to be anything
other than per 16-bit quantity.  However, a Unicode object might in
addition have a method to do normalization or whatever, as long as it's
language independent and strictly defined by the Unicode standard.
Language-specific operations belong in separate modules.
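
A couple of those Unicode-specific behaviours, shown with today's
interpreter for concreteness (the unicodedata module carries the tables in
question):

    import unicodedata

    assert "\u00df".upper() == "SS"            # case mapping: German sharp s
    assert "\u2155".isnumeric()                # VULGAR FRACTION ONE FIFTH
    assert unicodedata.numeric("\u2155") == 0.2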

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Fri May  5 16:07:48 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 05 May 2000 11:07:48 -0400
Subject: [Python-Dev] Moving Unicode debate to i18n-sig@python.org
Message-ID: <200005051507.LAA14262@eric.cnri.reston.va.us>

I've moved all my responses to the Unicode debate to the i18n-sig
mailing list, where it belongs.  Please don't cross-post any more.

If you're interested in this issue but aren't subscribed to the
i18n-sig list, please subscribe at
http://www.python.org/mailman/listinfo/i18n-sig/.

To view the archives, go to http://www.python.org/pipermail/i18n-sig/.

See you there!

--Guido van Rossum (home page: http://www.python.org/~guido/)


From jim@digicool.com  Fri May  5 18:09:34 2000
From: jim@digicool.com (Jim Fulton)
Date: Fri, 05 May 2000 13:09:34 -0400
Subject: [Python-Dev] Pickle diffs anyone?
Message-ID: <3913004E.6CC69857@digicool.com>

Someone recently made a cool proposal for utilizing
diffs to save space taken by old versions in
the Zope object database:

  http://www.zope.org/Members/jim/ZODB/ReverseDiffVersioning

To make this work, we need a good way of diffing pickles.

I thought maybe someone here would have some good suggestions.
I do think that the topic is sort of interesting (for some
definition of "interesting" ;).

The page above is a Wiki page. (Wiki is awesome. If you haven't
seen it before, check out http://joyful.com/zwiki/ZWiki.)
If you are a member of zope.org, you can edit the page directly,
which would be fine with me. :)

Jim

--
Jim Fulton           mailto:jim@digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.


From fdrake@acm.org  Fri May  5 18:14:16 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri, 5 May 2000 13:14:16 -0400 (EDT)
Subject: [Python-Dev] Pickle diffs anyone?
In-Reply-To: <3913004E.6CC69857@digicool.com>
References: <3913004E.6CC69857@digicool.com>
Message-ID: <14611.360.166536.866583@seahag.cnri.reston.va.us>

Jim Fulton writes:
 > To make this work, we need a good way of diffing pickles.

Jim,
  If the basic requirement is for a binary diff facility, perhaps you
should look into XDelta; I think that's available as a C library as
well as a command line tool, so you should be able to hook it in
fairly easily.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From trentm@activestate.com  Fri May  5 18:25:48 2000
From: trentm@activestate.com (Trent Mick)
Date: Fri, 5 May 2000 10:25:48 -0700
Subject: [Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306)
In-Reply-To: <000001bfb336$d4f512a0$0f2d153f@tim>
References: <NDBBKLNNJCFFMINBECLEOEBKCLAA.trentm@ActiveState.com> <000001bfb336$d4f512a0$0f2d153f@tim>
Message-ID: <20000505102548.B25914@activestate.com>

I posted a couple of patches a couple of days ago to correct the string
methods implementing slice-like optional parameters (count, find, index,
rfind, rindex) to properly clamp slice index values to the proper range (any
PyInt or PyLong value is acceptable now). In fact the slice_index() function
that was being used in ceval.c was reused (renamed to _PyEval_SliceIndex).

As well, the other patch changes PyArg_ParseTuple's 'b', 'h', and 'i'
formatters to raise an OverflowError if they overflow.

Trent

p.s. I thought I would whine here for some more attention. Who needs that
Unicode stuff anyway. ;-)
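
For reference, a minimal sketch of the clamping rule the patch applies to
those slice-like start/end arguments: negative indices get the length
added, then everything is clamped into [0, len].  Illustrative C only --
the actual patch reuses _PyEval_SliceIndex and also accepts PyLong values.

    static long
    clamp_slice_index(long i, long len)
    {
        if (i < 0) {
            i += len;           /* negative indices count from the end */
            if (i < 0)
                i = 0;          /* still out of range: clamp to the start */
        }
        else if (i > len)
            i = len;            /* past the end: clamp to the length */
        return i;
    }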


From fw@deneb.cygnus.argh.org  Fri May  5 17:13:42 2000
From: fw@deneb.cygnus.argh.org (Florian Weimer)
Date: 05 May 2000 18:13:42 +0200
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Just van Rossum's message of "Fri, 5 May 2000 14:17:31 +0100"
References: <l03102805b5385e3de8e8@[193.78.237.127]>
 <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102805b5385e3de8e8@[193.78.237.127]>
 <l03102808b53877a3e392@[193.78.237.127]>
Message-ID: <8766st5615.fsf@deneb.cygnus.argh.org>

Just van Rossum <just@letterror.com> writes:

> Good point. All this taken together still means to me that comparisons
> between wide and narrow strings should take place at the character level,
> which implies that coercion from narrow to wide is done at the character
> level, without looking at the encoding. (Which in my book in turn still
> implies that as long as we're talking about Unicode, narrow strings are
> effectively Latin-1.)

Sorry for jumping in, I've only recently discovered this list. :-/

At the moment, most of the computing world is not Latin-1 but
Windows-12??.  That's why I don't think this is a good idea at all.


From skip@mojam.com (Skip Montanaro)  Fri May  5 20:10:24 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 5 May 2000 14:10:24 -0500 (CDT)
Subject: [Python-Dev] Pickle diffs anyone?
In-Reply-To: <3913004E.6CC69857@digicool.com>
References: <3913004E.6CC69857@digicool.com>
Message-ID: <14611.7328.869011.109768@beluga.mojam.com>

    Jim> Someone recently made a cool proposal for utilizing diffs to save
    Jim> space taken by old versions in the Zope object database:

    Jim>   http://www.zope.org/Members/jim/ZODB/ReverseDiffVersioning

    Jim> To make this work, we need a good way of diffing pickles.

Fred already mentioned a candidate library to do diffs.  If that works, the
only other thing I think you'd need to do is guarantee that dicts are
pickled in a consistent fashion, probably by sorting the keys before
enumerating them.

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From trentm@activestate.com  Fri May  5 22:34:48 2000
From: trentm@activestate.com (Trent Mick)
Date: Fri, 5 May 2000 14:34:48 -0700
Subject: [Python-Dev] should a float overflow or just equal 'inf'
Message-ID: <20000505143448.A10731@activestate.com>

Hi all,

I submitted a patch a couple of days ago to have the 'b', 'i', and 'h'
formatters for PyArg_ParseTuple raise an Overflow exception if they overflow
(currently they just silently overflow). Presuming that this is considered a
good idea, should this be carried to floats?

Floats don't really overflow, they just equal 'inf'. Would it be more
desirable to raise an Overflow exception for this? I am inclined to think
that this would *not* be desirable based on the following quote:

"""
the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
    that-can-be-done-ly y'rs  - tim
"""

In any case, the question stands. I don't really have an idea of the
potential pains that this could cause to (1) efficiency, (2) external code
that expects to deal with 'inf's itself. The reason I ask is because I am
looking at related issues in the Python code these days.


Trent
--
Trent Mick
trentm@activestate.com



From tismer@tismer.com  Sat May  6 15:29:07 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sat, 06 May 2000 16:29:07 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <000001bfb4a6$21da7900$922d153f@tim>
Message-ID: <39142C33.507025B5@tismer.com>


Tim Peters wrote:
> 
> [Trent Mick]
> > >>> i = -2147483648
> > OverflowError: integer literal too large
> > >>> i = -2147483648L
> > >>> int(i)   # it *is* a valid integer literal
> > -2147483648
> 
> Python's grammar is such that negative integer literals don't exist; what
> you actually have there is the unary minus operator applied to positive
> integer literals; indeed,

<disassembly snipped>

Well, knowing that there are more negatives than positives
and then coding it this way appears in fact as a design flaw to me.

A simple solution could be to do the opposite:
Always store a negative number and negate it
for positive numbers. A real negative number
would then end up with two UNARY_NEGATIVE
opcodes in sequence. If we had a simple postprocessor
to remove such sequences at the end, we're done.
As another step, it could also adjust all such consts
and remove those opcodes.

This could be a task for Skip's peephole optimizer.
Why did it never go into the core?
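
A sketch of the kind of post-pass being suggested: walk the emitted opcode
stream and drop adjacent UNARY_NEGATIVE pairs.  This assumes a stream of
argument-less opcodes; a real peephole pass would also have to decode
instructions (some opcodes carry arguments) and fix up jump targets and
line-number tables, so take it as illustration only.

    #define UNARY_NEGATIVE 11       /* value from Include/opcode.h */

    static int
    strip_double_negate(unsigned char *code, int n)
    {
        int src = 0, dst = 0;
        while (src < n) {
            if (src + 1 < n &&
                code[src] == UNARY_NEGATIVE &&
                code[src + 1] == UNARY_NEGATIVE) {
                src += 2;           /* --(--x) == x: drop the pair */
                continue;
            }
            code[dst++] = code[src++];
        }
        return dst;                 /* new length of the opcode stream */
    }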

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From tim_one@email.msn.com  Sat May  6 20:13:46 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Sat, 6 May 2000 15:13:46 -0400
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39142C33.507025B5@tismer.com>
Message-ID: <000301bfb78f$33e33d80$452d153f@tim>

[Tim]
> Python's grammar is such that negative integer literals don't
> exist; what you actually have there is the unary minus operator
> applied to positive integer literals; ...

[Christian Tismer]
> Well, knowing that there are more negatives than positives
> and then coding it this way appears in fact as a design flaw to me.

Don't know what you're saying here.  Python's grammar has nothing to do with
the relative number of positive vs negative entities; indeed, in a
2's-complement machine it's not even true that there are more negatives than
positives.  Python generates the unary minus for "negative literals"
because, again, negative literals *don't exist* in the grammar.

> A simple solution could be to do the opposite:
> Always store a negative number and negate it
> for positive numbers.  ...

So long as negative literals don't exist in the grammar, "-2147483648" makes
no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
problem" here worth fixing, although if there is <wink>, it will get fixed
by magic as soon as Python ints and longs are unified.
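
To make the asymmetry concrete: the tokenizer converts the digits as a
positive long before the unary minus is ever applied, and with 32-bit C
longs 2147483648 already exceeds LONG_MAX, even though -2147483648
(LONG_MIN) is representable.  A tiny C illustration, not Python code:

    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        errno = 0;
        /* the digits alone, as the tokenizer sees them */
        if (strtol("2147483648", NULL, 10) == LONG_MAX && errno == ERANGE)
            printf("positive literal overflows: LONG_MAX = %ld\n", LONG_MAX);
        printf("but LONG_MIN = %ld is representable\n", LONG_MIN);
        return 0;
    }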




From tim_one@email.msn.com  Sat May  6 20:47:25 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Sat, 6 May 2000 15:47:25 -0400
Subject: [Python-Dev] should a float overflow or just equal 'inf'
In-Reply-To: <20000505143448.A10731@activestate.com>
Message-ID: <000801bfb793$e70c9420$452d153f@tim>

[Trent Mick]
> I submitted a patch a coupld of days ago to have the 'b', 'i', and 'h'
> formatter for PyArg_ParseTuple raise an Overflow exception if
> they overflow (currently they just silently overflow). Presuming that
> this is considered a good idea, should this be carried to floats.
>
> Floats don't really overflow, they just equal 'inf'. Would it be more
> desireable to raise an Overflow exception for this? I am inclined to think
> that this would *not* be desireable based on the following quote:
>
> """
> the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
>     that-can-be-done-ly y'rs  - tim
> """
>
> In any case, the question stands. I don't really have an idea of the
> potential pains that this could cause to (1) efficiecy, (2) external code
> that expects to deal with 'inf's itself. The reason I ask is because I am
> looking at related issues in the Python code these days.

Alas, this is the tip of a very large project:  while (I believe) *every*
platform Python runs on now is 754-conformant, Python itself has no idea
what it's doing wrt 754 semantics.  In part this is because ISO/ANSI C has
no idea what it's doing either.  C9X (the next C std) is supposed to supply
portable spellings of ways to get at 754 features, but before then there's
simply nothing portable that can be done.

Guido & I already agreed in principle that Python will eventually follow 754
rules, but with the overflow, divide-by-0, and invalid operation exceptions
*enabled* by default (and the underflow and inexact exceptions disabled by
default).  It does this by accident <0.9 wink> already for, e.g.,

>>> 1. / 0.
Traceback (innermost last):
  File "<pyshell#0>", line 1, in ?
    1. / 0.
ZeroDivisionError: float division
>>>

Under the 754 defaults, that should silently return a NaN instead.  But
neither Guido nor I think the latter is reasonable default behavior, and
having done so before in a previous life I can formally justify changing the
defaults a language exposes.

Anyway, once all that is done, float overflow *will* raise an exception (by
default; there will also be a way to turn that off), unlike what happens
today.

Before then, I guess continuing the current policy of benign neglect (i.e.,
let it overflow silently) is best for consistency.  Without access to all
the 754 features in C, it's not even easy to detect overflow now!  "if (x ==
x * 0.5) overflow();" isn't quite good enough, as it can trigger a spurious
underflow error -- there's really no reasonable way to spell this stuff in
portable C now!
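
For the record, here is what the two detection routes look like: the naive
post-hoc test quoted above (with its spurious-underflow caveat), and the
C9X <fenv.h> spelling that should eventually make this portable.  Both are
sketches, not proposed Python code.

    #include <fenv.h>   /* C9X only; strictly also wants
                           #pragma STDC FENV_ACCESS ON */

    /* Naive test: true only for +/-inf (NaN compares unequal to itself,
       and x != 0.0 excludes zero), but the multiply can itself raise a
       spurious underflow for tiny finite x. */
    static int
    looks_overflowed(double x)
    {
        return x != 0.0 && x == x * 0.5;
    }

    /* C9X route: clear the flag, do the operation, test the flag. */
    static int
    mul_with_check(double a, double b, double *result)
    {
        feclearexcept(FE_OVERFLOW);
        *result = a * b;
        return fetestexcept(FE_OVERFLOW) ? -1 : 0;   /* -1 means overflow */
    }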




From gstein@lyra.org  Sun May  7 11:25:29 2000
From: gstein@lyra.org (Greg Stein)
Date: Sun, 7 May 2000 03:25:29 -0700 (PDT)
Subject: [Python-Dev] buffer object (was: Unicode debate)
In-Reply-To: <390EF3EB.5BCE9EC3@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>

[ damn, I wish people would pay more attention to changing the subject
  line to reflect the contents of the email ... I could not figure out if
  there were any further responses to this without opening most of those
  dang "Unicode debate" emails. sheesh... ]

On Tue, 2 May 2000, M.-A. Lemburg wrote:
> Guido van Rossum wrote:
> > 
> > [MAL]
> > > Let's not do the same mistake again: Unicode objects should *not*
> > > be used to hold binary data. Please use buffers instead.
> > 
> > Easier said than done -- Python doesn't really have a buffer data
> > type.

The buffer object. We *do* have the type.

> > Or do you mean the array module?  It's not trivial to read a
> > file into an array (although it's possible, there are even two ways).
> > Fact is, most of Python's standard library and built-in objects use
> > (8-bit) strings as buffers.

For historical reasons only. It would be very easy to change these to use
buffer objects, except for the simple fact that callers might expect a
*string* rather than something with string-like behavior.

>...
> > > BTW, I think that this behaviour should be changed:
> > >
> > > >>> buffer('binary') + 'data'
> > > 'binarydata'

In several places, bufferobject.c uses PyString_FromStringAndSize(). It
wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then
copy the data in. A new API could also help out here:

  PyBuffer_CopyMemory(void *ptr, int size)


> > > while:
> > >
> > > >>> 'data' + buffer('binary')
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in ?
> > > TypeError: illegal argument type for built-in operation

The string object can't handle the buffer on the right side. Buffer
objects use the buffer interface, so they can deal with strings on the
right. Therefore: asymmetry :-(

> > > IMHO, buffer objects should never coerce to strings, but instead
> > > return a buffer object holding the combined contents. The
> > > same applies to slicing buffer objects:
> > >
> > > >>> buffer('binary')[2:5]
> > > 'nar'
> > >
> > > should preferably be buffer('nar').

Sure. Wouldn't be a problem. The FromStringAndSize() thing.

> > Note that a buffer object doesn't hold data!  It's only a pointer to
> > data.  I can't off-hand explain the asymmetry though.
> 
> Dang, you're right...

Untrue. There is an API call which will construct a buffer object with its
own memory:

  PyObject * PyBuffer_New(int size)

The resulting buffer object will be read/write, and you can stuff values
into it using the slice notation.
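
A sketch of how the PyBuffer_CopyMemory() helper proposed earlier in this
message could be built on top of that: allocate an owning buffer with
PyBuffer_New() and copy the bytes in.  PyBuffer_New() is real;
PyBuffer_CopyMemory() is only the proposal, and PyObject_AsWriteBuffer()
is assumed here as the way to reach the owned memory.

    #include "Python.h"
    #include <string.h>

    static PyObject *
    copy_memory_to_buffer(void *ptr, int size)
    {
        PyObject *buf = PyBuffer_New(size);  /* read/write, owns its memory */
        void *dest;
        int destlen;

        if (buf == NULL)
            return NULL;
        if (PyObject_AsWriteBuffer(buf, &dest, &destlen) < 0) {
            Py_DECREF(buf);
            return NULL;
        }
        memcpy(dest, ptr, size);
        return buf;
    }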


> > > Hmm, perhaps we need something like a data string object
> > > to get this 100% right ?!

Nope. The buffer object is intended to be exactly this.

>...
> > Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> > which no "string literal" notations exist.
> 
> Anyway, one way or another I think we should make it clear
> to users that they should start using some other type for
> storing binary data.

Buffer objects. There are a couple changes to make this a bit easier for
people:

1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to
   create a read/write buffer of a particular size. buffer() should create
   a zero-length read/write buffer.

2) if slice assignment is updated to allow changes to the length (for
   example: buf[1:2] = 'abcdefgh'), then the buffer object definition must
   change. Specifically: when the buffer object owns the memory, it does
   this by appending the memory after the PyObject_HEAD and setting its
   internal pointer to it; when the dealloc() occurs, the target memory
   goes with the object. A flag would need to be added to tell the buffer
   object to do a second free() for the case where a realloc has returned
   a new pointer.
   [ I'm not sure that I would agree with this change, however; but it
     does make them a bit easier to work with; on the other hand, people
     have been working with immutable strings for a long time, so they're
     okay with concatenation, so I'm okay with saying length-altering
     operations must simply be done thru concatenation. ]


IMO, extensions should be using the buffer object for raw bytes. I know
that Mark has been updating some of the Win32 extensions to do this.
Python programs could use the objects if the buffer() builtin is tweaked
to allow a bit more flexibility in the arguments.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Sun May  7 12:09:45 2000
From: gstein@lyra.org (Greg Stein)
Date: Sun, 7 May 2000 04:09:45 -0700 (PDT)
Subject: [Python-Dev] introducing byte arrays in 1.6 (was: Unicode debate)
In-Reply-To: <200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005070406200.7610-100000@nebula.lyra.org>

On Wed, 3 May 2000, Guido van Rossum wrote:
>...
> My ASCII proposal is a compromise that tries to be fair to both uses
> for strings.  Introducing byte arrays as a more fundamental type has
> been on the wish list for a long time -- I see no way to introduce
> this into Python 1.6 without totally botching the release schedule
> (June 1st is very close already!).  I'd like to be able to move on,
> there are other important things still to be added to 1.6 (Vladimir's
> malloc patches, Neil's GC, Fredrik's completed sre...).
> 
> For 1.7 (which should happen later this year) I promise I'll reopen
> the discussion on byte arrays.

See my other note. I think a simple change to the buffer() builtin would
allow read/write byte arrays to be simply constructed.

There are a couple API changes that could be made to bufferobject.[ch]
which could simplify some operations for C code and returning buffer
objects. But changes like that would be preconditioned on accepting the
change in return type from those extensions. For example, the doc may say
something returns a string; while buffer objects are similar to strings in
operation, they are not the *same*. IMO, Python 1.7 would be a good time
to alter return types to buffer objects as appropriate. (but I'm not
averse to doing it today! (to get people used to the difference in
purposes))

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From bckfnn@worldonline.dk  Sun May  7 14:37:21 2000
From: bckfnn@worldonline.dk (Finn Bock)
Date: Sun, 07 May 2000 13:37:21 GMT
Subject: [Python-Dev] buffer object
In-Reply-To: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
Message-ID: <39156208.13412015@smtp.worldonline.dk>

[Greg Stein]

>IMO, extensions should be using the buffer object for raw bytes. I know
>that Mark has been updating some of the Win32 extensions to do this.
>Python programs could use the objects if the buffer() builtin is tweaked
>to allow a bit more flexibility in the arguments.

Forgive me for rewinding this to the very beginning. But what is a
buffer object useful for? I'm trying to think about buffer objects in terms
of jpython, so my primary interest is the user experience of buffer
objects.

Please correct my misunderstandings.

- There is not a buffer protocol exposed to python objects (in the way
  the sequence protocol __getitem__ & friends are exposed).
- A buffer object typically gives access to the raw bytes which
  underlie the backing object, regardless of the structure of the
  bytes.
- It is only intended for objects which have a natural byte storage to
  implement the buffer interface.
- Of the builtin objects only string, unicode and array support the
  buffer interface.
- When slicing a buffer object, the result is always a string regardless
  of the buffer object base.


In jpython, only byte arrays like jarrays.array('b', [0,1,2]) can be
said to have some natural byte storage. The jpython string type doesn't.
It would take some awful bit shifting to present a jpython string as an
array of bytes.

Would it make any sense to have a buffer object which only accepts a byte
array as base? So that jpython would say:

>>> buffer("abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected


Would it make sense to tell python users that they cannot depend on the
portability of using strings (both 8bit and 16bit) as buffer object
base?


Because it is so difficult to look at java storage as a sequence of
bytes, I think I'm all for keeping the buffer() builtin and buffer
object as obscure and unknown as possible <wink>.

regards,
finn


From guido@python.org  Sun May  7 22:29:43 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 07 May 2000 17:29:43 -0400
Subject: [Python-Dev] buffer object
In-Reply-To: Your message of "Sun, 07 May 2000 13:37:21 GMT."
 <39156208.13412015@smtp.worldonline.dk>
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
 <39156208.13412015@smtp.worldonline.dk>
Message-ID: <200005072129.RAA15850@eric.cnri.reston.va.us>

[Finn Bock]

> Forgive me for rewinding this to the very beginning. But what is a
> buffer object useful for? I'm trying to think about buffer objects in terms
> of jpython, so my primary interest is the user experience of buffer
> objects.
> 
> Please correct my misunderstandings.
> 
> - There is not a buffer protocol exposed to python objects (in the way
>   the sequence protocol __getitem__ & friends are exposed).
> - A buffer object typically gives access to the raw bytes which
>   underlie the backing object, regardless of the structure of the
>   bytes.
> - It is only intended for objects which have a natural byte storage to
>   implement the buffer interface.

All true.

> - Of the builtin objects only string, unicode and array support the
>   buffer interface.

And the new mmap module.

> - When slicing a buffer object, the result is always a string regardless
>   of the buffer object base.
> 
> In jpython, only byte arrays like jarrays.array('b', [0,1,2]) can be
> said to have some natural byte storage. The jpython string type doesn't.
> It would take some awful bit shifting to present a jpython string as an
> array of bytes.

I don't recall why JPython has jarray instead of array -- how do they
differ?  I think it's a shame that similar functionality is embodied
in different APIs.

> Would it make any sense to have a buffer object which only accepts a byte
> array as base? So that jpython would say:
> 
> >>> buffer("abc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: buffer object expected
> 
> 
> Would it make sense to tell python users that they cannot depend on the
> portability of using strings (both 8bit and 16bit) as buffer object
> base?

I think that the portability of many string properties is in danger
with the Unicode proposal.  Supporting this in the next version of
JPython will be a bit tricky.

> Because it is so difficult to look at java storage as a sequence of
> bytes, I think I'm all for keeping the buffer() builtin and buffer
> object as obscure and unknown as possible <wink>.

I basically agree, and in a private email to Greg Stein I've told him
this.  I think that the array module should be promoted to a built-in
function/type, and should be the recommended solution for data
storage.  The buffer API should remain a C-level API, and the buffer()
built-in should be labeled with "for experts only".

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Mon May  8 09:33:01 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 08 May 2000 10:33:01 +0200
Subject: [Python-Dev] buffer object (was: Unicode debate)
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
Message-ID: <39167BBD.88EB2C64@lemburg.com>

Greg Stein wrote:
> 
> [ damn, I wish people would pay more attention to changing the subject
>   line to reflect the contents of the email ... I could not figure out if
>   there were any further responses to this without opening most of those
>   dang "Unicode debate" emails. sheesh... ]
> 
> On Tue, 2 May 2000, M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> > >
> > > [MAL]
> > > > Let's not do the same mistake again: Unicode objects should *not*
> > > > be used to hold binary data. Please use buffers instead.
> > >
> > > Easier said than done -- Python doesn't really have a buffer data
> > > type.
> 
> The buffer object. We *do* have the type.
> 
> > > Or do you mean the array module?  It's not trivial to read a
> > > file into an array (although it's possible, there are even two ways).
> > > Fact is, most of Python's standard library and built-in objects use
> > > (8-bit) strings as buffers.
> 
> For historical reasons only. It would be very easy to change these to use
> buffer objects, except for the simple fact that callers might expect a
> *string* rather than something with string-like behavior.

Would this be too drastic a change, then ? I think that we should
at least make use of buffers in the standard lib.

>
> >...
> > > > BTW, I think that this behaviour should be changed:
> > > >
> > > > >>> buffer('binary') + 'data'
> > > > 'binarydata'
> 
> In several places, bufferobject.c uses PyString_FromStringAndSize(). It
> wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then
> copy the data in. A new API could also help out here:
> 
>   PyBuffer_CopyMemory(void *ptr, int size)
> 
> > > > while:
> > > >
> > > > >>> 'data' + buffer('binary')
> > > > Traceback (most recent call last):
> > > >   File "<stdin>", line 1, in ?
> > > > TypeError: illegal argument type for built-in operation
> 
> The string object can't handle the buffer on the right side. Buffer
> objects use the buffer interface, so they can deal with strings on the
> right. Therefore: asymmetry :-(
> 
> > > > IMHO, buffer objects should never coerce to strings, but instead
> > > > return a buffer object holding the combined contents. The
> > > > same applies to slicing buffer objects:
> > > >
> > > > >>> buffer('binary')[2:5]
> > > > 'nar'
> > > >
> > > > should preferably be buffer('nar').
> 
> Sure. Wouldn't be a problem. The FromStringAndSize() thing.

Right.
 
Before digging deeper into this, I think we should hear
Guido's opinion on this again: he said that he wanted to
use Java's binary arrays for binary data... perhaps we
need to tweak the array type and make it more directly
accessible (from C and Python) instead.

> > > Note that a buffer object doesn't hold data!  It's only a pointer to
> > > data.  I can't off-hand explain the asymmetry though.
> >
> > Dang, you're right...
> 
> Untrue. There is an API call which will construct a buffer object with its
> own memory:
> 
>   PyObject * PyBuffer_New(int size)
> 
> The resulting buffer object will be read/write, and you can stuff values
> into it using the slice notation.

Yes, but that API is not reachable from within Python,
AFAIK.
 
> > > > Hmm, perhaps we need something like a data string object
> > > > to get this 100% right ?!
> 
> Nope. The buffer object is intended to be exactly this.
> 
> >...
> > > Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> > > which no "string literal" notations exist.
> >
> > Anyway, one way or another I think we should make it clear
> > to users that they should start using some other type for
> > storing binary data.
> 
> Buffer objects. There are a couple changes to make this a bit easier for
> people:
> 
> 1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to
>    create a read/write buffer of a particular size. buffer() should create
>    a zero-length read/write buffer.

This looks a lot like function overloading... I don't think we
should get into this: how about having the buffer() API take
keywords instead ?!

buffer(size=1024,mode='rw') - 1K of owned read write memory
buffer(obj) - read-only referenced memory from obj
buffer(obj,mode='rw') - read-write referenced memory in obj

etc.

Or we could allow passing None as object to obtain an owned
read-write memory block (much like passing NULL to the
C functions).

> 2) if slice assignment is updated to allow changes to the length (for
>    example: buf[1:2] = 'abcdefgh'), then the buffer object definition must
>    change. Specifically: when the buffer object owns the memory, it does
>    this by appending the memory after the PyObject_HEAD and setting its
>    internal pointer to it; when the dealloc() occurs, the target memory
>    goes with the object. A flag would need to be added to tell the buffer
>    object to do a second free() for the case where a realloc has returned
>    a new pointer.
>    [ I'm not sure that I would agree with this change, however; but it
>      does make them a bit easier to work with; on the other hand, people
>      have been working with immutable strings for a long time, so they're
>      okay with concatenation, so I'm okay with saying length-altering
>      operations must simply be done thru concatenation. ]

I don't think I like this either: what happens when the buffer
doesn't own the memory ?
 
> IMO, extensions should be using the buffer object for raw bytes. I know
> that Mark has been updating some of the Win32 extensions to do this.
> Python programs could use the objects if the buffer() builtin is tweaked
> to allow a bit more flexibility in the arguments.

Right.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From bckfnn@worldonline.dk  Mon May  8 20:44:27 2000
From: bckfnn@worldonline.dk (Finn Bock)
Date: Mon, 08 May 2000 19:44:27 GMT
Subject: [Python-Dev] buffer object
In-Reply-To: <200005072129.RAA15850@eric.cnri.reston.va.us>
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>   <39156208.13412015@smtp.worldonline.dk>  <200005072129.RAA15850@eric.cnri.reston.va.us>
Message-ID: <3917074c.8837607@smtp.worldonline.dk>

[Guido]

>I don't recall why JPython has jarray instead of array -- how do they
>differ?  I think it's a shame that similar functionality is embodied
>in different APIs.

The jarray module is a paper thin factory for the PyArray type which is
primarily (I believe) a wrapper around any existing java array instance.
It exists to make arrays returned from java code useful for jpython.
Since a PyArray must always wrap the original java array, it cannot
resize the array.

In contrast an array instance would own the memory and can resize it as
necessary.

Due to the different purposes I agree with Jim's decision of making the
two modules incompatible. And they are truly incompatible. jarray.array
has reversed the (typecode, seq) arguments.

OTOH creating a mostly compatible array module for jpython should not be
too hard.
 
regards,
finn




From guido@python.org  Mon May  8 20:55:50 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 08 May 2000 15:55:50 -0400
Subject: [Python-Dev] buffer object
In-Reply-To: Your message of "Mon, 08 May 2000 19:44:27 GMT."
 <3917074c.8837607@smtp.worldonline.dk>
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org> <39156208.13412015@smtp.worldonline.dk> <200005072129.RAA15850@eric.cnri.reston.va.us>
 <3917074c.8837607@smtp.worldonline.dk>
Message-ID: <200005081955.PAA21928@eric.cnri.reston.va.us>

> >I don't recall why JPython has jarray instead of array -- how do they
> >differ?  I think it's a shame that similar functionality is embodied
> >in different APIs.
> 
> The jarray module is a paper thin factory for the PyArray type which is
> primarily (I believe) a wrapper around any existing java array instance.
> It exists to make arrays returned from java code useful for jpython.
> Since a PyArray must always wrap the original java array, it cannot
> resize the array.

Understood.  This is a bit like the buffer API in CPython then (except
for Greg's vision where the buffer object manages storage as well :-).

> In contrast an array instance would own the memory and can resize it as
> necessary.

OK, this makes sense.

> Due to the different purposes I agree with Jim's decision of making the
> two modules incompatible. And they are truly incompatible. jarray.array
> has reversed the (typecode, seq) arguments.

This I'm not so sure of.  Why be different just to be different?

> OTOH creating a mostly compatible array module for jpython should not be
> too hard.

OK, when we make array() a built-in, this should be done for Java too.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From trentm@activestate.com  Mon May  8 21:29:21 2000
From: trentm@activestate.com (Trent Mick)
Date: Mon, 8 May 2000 13:29:21 -0700
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <200005081400.KAA19889@eric.cnri.reston.va.us>
References: <20000503161656.A20275@activestate.com> <200005081400.KAA19889@eric.cnri.reston.va.us>
Message-ID: <20000508132921.A31981@activestate.com>

On Mon, May 08, 2000 at 10:00:30AM -0400, Guido van Rossum wrote:
> > Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an
> > Overflow exception if they overflow (previously they just silently
> > overflowed).
> 
> Trent,
> 
> There's one issue with this: I believe the 'b' format is mostly used
> with unsigned character arguments in practice.
> However on systems
> with default signed characters, CHAR_MAX is 127 and values 128-255 are
> rejected.  I'll change the overflow test to:
> 
> 	else if (ival > CHAR_MAX && ival >= 256) {
> 
> if that's okay with you.
> 
Okay, I guess. Two things:

1. In a way this defeats the main purpose of the checks. Now a silent overflow
could happen for a signed byte value over CHAR_MAX. The only way to
automatically do the bounds checking is if the exact type is known, i.e.
different formatters for signed and unsigned integral values. I don't know if
this is desired (is it?). The obvious choice of 'u' prefixes to specify
unsigned is obviously not an option.

Another option might be to document 'b' as for unsigned chars and 'h', 'i',
'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX]
for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
limiting the user to unsigned or signed depending on the formatter. (Which
again, means that it would be nice to have different formatters for signed
and unsigned.) I think that the bounds checking is false security unless
these restrictions are made.


2. The above aside, I would be more inclined to change the line in question to:

   else if (ival > UCHAR_MAX) {

as this is more explicit about what is being done.

> Another issue however is that there are probably cases where an 'i'
> format is used (which can't overflow on 32-bit architectures) but
> where the int value is then copied into a short field without an
> additional check...  I'm not sure how to fix this except by a complete
> inspection of all code...  Not clear if it's worth it.

Yes, a complete code inspection seems to be the only way. That is some of
what I am doing. Again, I have two questions:

1. There are a fairly large number of downcasting cases in the Python code
(not necessarily tied to PyArg_ParseTuple results). I was wondering if you
think a generalized check on each such downcast would be advisable. This
would take the form of some macro that would do a bounds check before doing
the cast. For example (a common one is the cast of strlen's size_t return
value to int, because Python strings use int for their length, this is a
downcast on 64-bit systems):

  size_t len = strlen(s);
  obj = PyString_FromStringAndSize(s, len);

would become
  
  size_t len = strlen(s);
  obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));

CAST_TO_INT would ensure that 'len' did not overflow and would raise an
exception otherwise.

Pros:

- should never have to worry about overflows again
- easy to find (given MSC warnings) and easy to code in (straightforward)

Cons:

- more code, more time to execute
- looks ugly
- have to check PyErr_Occurred every time a cast is done


I would like other people's opinion on this kind of change. There are three
possible answers:

  +1 this is a bad change idea because...<reason>
  -1 this is a good idea, go for it
  +0 (most likely) This is probably a good idea for some case where the
     overflow *could* happen, however the strlen example that you gave is
	 *not* such a situation. As Tim Peters said: 2GB limit on string lengths
	 is a good assumption/limitation.



2. Microsoft's compiler gives good warnings for casts where information loss
is possible. However, I cannot find a way to get similar warnings from gcc.
Does anyone know if that is possible? I.e.

	int i = 123456;
	short s = i;  // should warn about possible loss of information

should give a compiler warning.


Thanks,
Trent

-- 
Trent Mick
trentm@activestate.com


From trentm@activestate.com  Mon May  8 22:26:51 2000
From: trentm@activestate.com (Trent Mick)
Date: Mon, 8 May 2000 14:26:51 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005081416.KAA20158@eric.cnri.reston.va.us>
References: <20000505135817.A9859@activestate.com> <200005081416.KAA20158@eric.cnri.reston.va.us>
Message-ID: <20000508142651.C8000@activestate.com>

On Mon, May 08, 2000 at 10:16:42AM -0400, Guido van Rossum wrote:
> > The patch to config.h looks big but it really is not. These are the effective
> > changes:
> > - MS_WINxx are keyed off _WINxx
> > - SIZEOF_VOID_P is set to 8 for Win64
> > - COMPILER string is changed appropriately for Win64
>
> One thing worries me: if COMPILER is changed, that changes
> sys.platform to "win64", right?  I'm sure that will break plenty of
> code which currently tests for sys.platform=="win32" but really wants
> to test for any form of Windows.  Maybe sys.platform should remain
> win32?
> 

No, but yes. :( Actually I forgot to mention that my config.h patch changes
the PLATFORM #define from win32 to win64. So yes, you are correct. And, yes
(Sigh) you are right that this will break tests for sys.platform == "win32".

So I guess the simplest thing to do is to leave it as win32 following the
same reasoning for defining MS_WIN32 on Win64:

>  The idea is that the common case is
>  that code specific to Win32 will also work on Win64 rather than being
>  specific to Win32 (i.e. there is more the same than different in WIn32 and
>  Win64).
 

What if someone needs to do something in Python code for either Win32 or
Win64 but not both? Or should this never be necessary (not likely). I would
like Mark H's opinion on this stuff.


Trent

-- 
Trent Mick
trentm@activestate.com


From tismer@tismer.com  Mon May  8 22:52:54 2000
From: tismer@tismer.com (Christian Tismer)
Date: Mon, 08 May 2000 23:52:54 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <000301bfb78f$33e33d80$452d153f@tim>
Message-ID: <39173736.2A776348@tismer.com>


Tim Peters wrote:
> 
> [Tim]
> > Python's grammar is such that negative integer literals don't
> > exist; what you actually have there is the unary minus operator
> > applied to positive integer literals; ...
> 
> [Christian Tismer]
> > Well, knowing that there are more negatives than positives
> > and then coding it this way appears in fact as a design flaw to me.
> 
> Don't know what you're saying here. 

On a 2's-complement machine, there are 2**(n-1) negatives, zero, and
2**(n-1)-1 positives. The most negative number cannot be inverted.
Most machines today use the 2's complement.

> Python's grammar has nothing to do with
> the relative number of positive vs negative entities; indeed, in a
> 2's-complement machine it's not even true that there are more negatives than
> positives. 

If I read this as being about a 1's-complement machine then I believe it.
But we don't need to split hairs on known stuff :-)

> Python generates the unary minus for "negative literals"
> because, again, negative literals *don't exist* in the grammar.

Yes. If I know the facts and don't build negative literals into
the grammar, then I call it an oversight. Not too bad but not nice.

> > A simple solution could be to do the opposite:
> > Always store a negative number and negate it
> > for positive numbers.  ...
> 
> So long as negative literals don't exist in the grammar, "-2147483648" makes
> no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> problem" here worth fixing, although if there is <wink>, it will get fixed
> by magic as soon as Python ints and longs are unified.

I'd change the grammar.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From gstein@lyra.org  Mon May  8 22:54:31 2000
From: gstein@lyra.org (Greg Stein)
Date: Mon, 8 May 2000 14:54:31 -0700 (PDT)
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39173736.2A776348@tismer.com>
Message-ID: <Pine.LNX.4.10.10005081452130.18798-100000@nebula.lyra.org>

On Mon, 8 May 2000, Christian Tismer wrote:
>...
> > So long as negative literals don't exist in the grammar, "-2147483648" makes
> > no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> > problem" here worth fixing, although if there is <wink>, it will get fixed
> > by magic as soon as Python ints and longs are unified.
> 
> I'd change the grammar.

That would be very difficult, with very little positive benefit. As Mark
said, use 0x80000000 if you want that number.

Consider that the grammar would probably want to deal with things like
  - 1234
or
  -0xA

Instead, the grammar sees two parts: "-" and "NUMBER" without needing to
complicate the syntax for NUMBER.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From tismer@tismer.com  Mon May  8 23:09:43 2000
From: tismer@tismer.com (Christian Tismer)
Date: Tue, 09 May 2000 00:09:43 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <Pine.LNX.4.10.10005081452130.18798-100000@nebula.lyra.org>
Message-ID: <39173B27.4B3BEB40@tismer.com>


Greg Stein wrote:
> 
> On Mon, 8 May 2000, Christian Tismer wrote:
> >...
> > > So long as negative literals don't exist in the grammar, "-2147483648" makes
> > > no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> > > problem" here worth fixing, although if there is <wink>, it will get fixed
> > > by magic as soon as Python ints and longs are unified.
> >
> > I'd change the grammar.
> 
> That would be very difficult, with very little positive benefit. As Mark
> said, use 0x80000000 if you want that number.
> 
> Consider that the grammar would probably want to deal with things like
>   - 1234
> or
>   -0xA
> 
> Instead, the grammar sees two parts: "-" and "NUMBER" without needing to
> complicate the syntax for NUMBER.

Right. That was the reason for my first, dumb, proposal:
Always interpret a number as negative and negate it once more.
That makes it positive. In a post process, remove double-negates.
This leaves negations always where they are allowed: On negatives.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From gstein@lyra.org  Mon May  8 23:11:00 2000
From: gstein@lyra.org (Greg Stein)
Date: Mon, 8 May 2000 15:11:00 -0700 (PDT)
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39173B27.4B3BEB40@tismer.com>
Message-ID: <Pine.LNX.4.10.10005081508490.18798-100000@nebula.lyra.org>

On Tue, 9 May 2000, Christian Tismer wrote:
>...
> Right. That was the reason for my first, dumb, proposal:
> Always interpret a number as negative and negate it once more.
> That makes it positive. In a post process, remove double-negates.
> This leaves negations always where they are allowed: On negatives.

IMO, that is a non-intuitive hack. It would increase the complexity of
Python's parsing internals. Again, with little measurable benefit.

I do not believe that I've run into a case of needing -2147483648 in the
source of one of my programs. If I had, then I'd simply switch to
0x80000000 and/or assign it to INT_MIN.

-1 on making Python more complex to support this single integer value.
   Users should be pointed to 0x80000000 to represent it. (a FAQ entry
   and/or comment in the language reference would be a Good Thing)


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From mhammond@skippinet.com.au  Mon May  8 23:15:17 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 08:15:17 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <20000508142651.C8000@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au>

[Trent]
> What if someone needs to do something in Python code for either Win32 or
> Win64 but not both? Or should this never be necessary (not
> likely). I would
> like Mark H's opinion on this stuff.

OK :-)

I have always thought that it _would_ move to "win64", and the official way
of checking for "Windows" will be sys.platform[:3]=="win".

In fact, I've noticed Guido use this idiom (both stand-alone, and as: if
sys.platform[:3] in ["win", "mac"])

It will no doubt cause a bit of pain, but IMO it is cleaner...

Mark.



From guido@python.org  Tue May  9 03:14:07 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 08 May 2000 22:14:07 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 08:15:17 +1000."
 <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au>
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au>
Message-ID: <200005090214.WAA22419@eric.cnri.reston.va.us>

> [Trent]
> > What if someone needs to do something in Python code for either Win32 or
> > Win64 but not both? Or should this never be necessary (not
> > likely). I would
> > like Mark H's opinion on this stuff.

[Mark]
> OK :-)
> 
> I have always thought that it _would_ move to "win64", and the official way
> of checking for "Windows" will be sys.platform[:3]=="win".
> 
> In fact, I've noticed Guido use this idiom (both stand-alone, and as: if
> sys.platform[:3] in ["win", "mac"])
> 
> It will no doubt cause a bit of pain, but IMO it is cleaner...

Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
symbol is defined even on Win64 systems -- to test for Win64, you must
test the _WIN64 symbol.  The two variants are more similar than they
are different.

While testing sys.platform isn't quite the same thing, I think that
the same reasoning goes: a win64 system is everything that a win32
system is, and then some.

So I'd vote for leaving sys.platform alone (i.e. "win32" in both
cases), and providing another way to test for win64-ness.

I wish we had had the foresight to set sys.platform to 'windows', but
since we hadn't, I think we'll have to live with the consequences.

The changes that Trent had to make in the standard library are only
the tip of the iceberg...
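
At the C level, the arrangement being argued for here would look roughly
like this (macro names follow the thread; the exact config.h wording is an
assumption):

    /* Both MS_WIN32 and MS_WIN64 get defined on a Win64 build, keyed
       off the compiler's _WIN32/_WIN64, while the PLATFORM string --
       and therefore sys.platform -- stays "win32". */
    #if defined(_WIN32) && !defined(MS_WIN32)
    #  define MS_WIN32
    #endif
    #if defined(_WIN64) && !defined(MS_WIN64)
    #  define MS_WIN64
    #endif

    #ifdef MS_WIN32
    #  define PLATFORM "win32"
    #endif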

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  9 03:24:50 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 08 May 2000 22:24:50 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: Your message of "Mon, 08 May 2000 13:29:21 PDT."
 <20000508132921.A31981@activestate.com>
References: <20000503161656.A20275@activestate.com> <200005081400.KAA19889@eric.cnri.reston.va.us>
 <20000508132921.A31981@activestate.com>
Message-ID: <200005090224.WAA22457@eric.cnri.reston.va.us>

[Trent]
> > > Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an
> > > Overflow exception if they overflow (previously they just silently
> > > overflowed).

[Guido]
> > There's one issue with this: I believe the 'b' format is mostly used
> > with unsigned character arguments in practice.
> > However on systems
> > with default signed characters, CHAR_MAX is 127 and values 128-255 are
> > rejected.  I'll change the overflow test to:
> > 
> > 	else if (ival > CHAR_MAX && ival >= 256) {
> > 
> > if that's okay with you.

[Trent]
> Okay, I guess. Two things:
> 
> 1. In a way this defeats the main purpose of the checks. Now a silent overflow
> could happen for a signed byte value over CHAR_MAX. The only way to
> automatically do the bounds checking is if the exact type is known, i.e.
> different formatters for signed and unsigned integral values. I don't know if
> this is desired (is it?). The obvious choice of 'u' prefixes to specify
> unsigned is obviously not an option.

The struct module uses upper case for unsigned.  I think this is
overkill here, and would add a lot of code (if applied systematically)
that would rarely be used.

> Another option might be to document 'b' as for unsigned chars and 'h', 'i',
> 'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX]
> for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
> limiting the user to unsigned or signed depending on the formatter. (Which
> again, means that it would be nice to have different formatters for signed
> and unsigned.) I think that the bounds checking is false security unless
> these restrictions are made.

I like this: 'b' is unsigned, the others are signed.

> 2. The above aside, I would be more inclined to change the line in question to:
> 
>    else if (ival > UCHAR_MAX) {
> 
> as this is more explicit about what is being done.

Agreed.
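
A sketch of the resulting check for 'b' treated as an unsigned byte: reject
anything outside [0, UCHAR_MAX] before narrowing.  The function and
variable names are illustrative; the real change lives in
PyArg_ParseTuple's converter.

    #include <limits.h>

    static int
    store_unsigned_byte(long ival, char *dest)
    {
        if (ival < 0 || ival > UCHAR_MAX)
            return -1;              /* caller raises OverflowError */
        *dest = (char)ival;
        return 0;
    }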

> > Another issue however is that there are probably cases where an 'i'
> > format is used (which can't overflow on 32-bit architectures) but
> > where the int value is then copied into a short field without an
> > additional check...  I'm not sure how to fix this except by a complete
> > inspection of all code...  Not clear if it's worth it.
> 
> Yes, a complete code inspection seems to be the only way. That is some of
> what I am doing. Again, I have two questions:
> 
> 1. There are a fairly large number of downcasting cases in the Python code
> (not necessarily tied to PyArg_ParseTuple results). I was wondering if you
> think a generalized check on each such downcast would be advisable. This
> would take the form of some macro that would do a bounds check before doing
> the cast. For example (a common one is the cast of strlen's size_t return
> value to int, because Python strings use int for their length, this is a
> downcast on 64-bit systems):
> 
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, len);
> 
> would become
>   
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
> 
> CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> exception otherwise.
> 
> Pros:
> 
> - should never have to worry about overflows again
> - easy to find (given MSC warnings) and easy to code in (straightforward)
> 
> Cons:
> 
> - more code, more time to execute
> - looks ugly
> - have to check PyErr_Occurred every time a cast is done

How would the CAST_TO_INT macro signal an error?  C doesn't have
exceptions.  If we have to add checks, I'd prefer to write

  size_t len = strlen(s);
  if (INT_OVERFLOW(len))
     return NULL; /* Or whatever is appropriate in this context */
  obj = PyString_FromStringAndSize(s, len);
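
A self-contained version of that style of check might look like the
following; the INT_OVERFLOW spelling matches the example above, but the
definition and the helper around it are illustrative, not the actual patch.

    #include <limits.h>
    #include <stddef.h>
    #include <string.h>

    #define INT_OVERFLOW(x)  ((size_t)(x) > (size_t)INT_MAX)

    static int
    checked_strlen(const char *s, int *out)
    {
        size_t len = strlen(s);
        if (INT_OVERFLOW(len))
            return -1;          /* caller raises OverflowError and bails */
        *out = (int)len;
        return 0;
    }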

> I would like other people's opinion on this kind of change. There are three
> possible answers:
> 
>   +1 this is a bad change idea because...<reason>
>   -1 this is a good idea, go for it
>   +0 (most likely) This is probably a good idea for some case where the
>      overflow *could* happen, however the strlen example that you gave is
> 	 *not* such a situation. As Tim Peters said: 2GB limit on string lengths
> 	 is a good assumption/limitation.

-0

> 2. Microsoft's compiler gives good warnings for casts where information loss
> is possible. However, I cannot find a way to get similar warnings from gcc.
> Does anyone know if that is possible? I.e.
> 
> 	int i = 123456;
> 	short s = i;  // should warn about possible loss of information
> 
> should give a compiler warning.

Beats me :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mhammond@skippinet.com.au  Tue May  9 03:29:50 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 12:29:50 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au>

> > It will no doubt cause a bit of pain, but IMO it is cleaner...
>
> Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
> symbol is defined even on Win64 systems -- to test for Win64, you must
> test the _WIN64 symbol.  The two variants are more similar than they
> are different.

Yes, but still, one day, (if MS have their way :-) win32 will be "legacy".

e.g., imagine we were having the same debate about 5 years ago, but there was
a more established Windows 3.1 port available.

If we believed the hype, we probably _would_ have gone with "windows" for
both platforms, in the hope that they are more similar than different
(after all, that _was_ the story back then).

> The changes that Trent had to make in the standard library are only
> the tip of the iceberg...

Yes, but OTOH, the fact we explicitly use "win32" means people shouldn't
really expect code to work on Win64.  If nothing else, it will be a good
opportunity to examine the situation as each occurrence is found.  It will
be quite some time before many people play with the Win64 port seriously
(just like the first NT ports when I first came on the scene :-)

So, I remain a +0 on this - i.e., I don't really care personally, but think
"win64" is the right thing.  In any case, I'm happy to rely on Guido's time
machine...

Mark.



From mhammond@skippinet.com.au  Tue May  9 03:36:59 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 12:36:59 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au>

One more data point:

Windows CE uses "wince", and I certainly dont believe this should be
"win32" (although if you read the CE marketting stuff, they would have you
believe it is close enough that we should :-).

So to be _truly_ "windows portable", you will still need [:3]=="win" anyway
:-)

Mark.



From guido@python.org  Tue May  9 04:16:34 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 08 May 2000 23:16:34 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 12:29:50 +1000."
 <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au>
References: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au>
Message-ID: <200005090316.XAA22614@eric.cnri.reston.va.us>

To help me understand the significance of win64 vs. win32, can you
list the major differences?  I thought that the main thing was that
pointers are 64 bits, and that otherwise the APIs are the same.  In
fact, I don't know if WIN64 refers to Windows running on 64-bit
machines (e.g. Alphas) only, or that it is possible to have win64 on a
32-bit machine (e.g. Pentium).

If it's mostly a matter of pointer size, this is almost completely
hidden at the Python level, and I don't think it's worth changing the
platform name.  All of the changes that Trent found were really tests
for the presence of Windows APIs like the registry...

I could defend calling it Windows in comments but having sys.platform
be "win32".  Like uname on Solaris 2.7 returns SunOS 5.7 -- there's
too much old code that doesn't deserve to be broken.  (And it's not
like we have an excuse that it was always documented this way -- this
wasn't documented very clearly at all...)

It's-spelt-Raymond-Luxury-Yach-t-but-it's-pronounced-Throatwobbler-Mangrove,

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido@python.org  Tue May  9 04:19:19 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 08 May 2000 23:19:19 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 12:36:59 +1000."
 <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au>
References: <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au>
Message-ID: <200005090319.XAA22627@eric.cnri.reston.va.us>

> Windows CE uses "wince", and I certainly dont believe this should be
> "win32" (although if you read the CE marketting stuff, they would have you
> believe it is close enough that we should :-).
> 
> So to be _truly_ "windows portable", you will still need [:3]=="win" anyway
> :-)

That's a feature :-).  Too many things we think we know are true on
Windows don't hold on Win/CE, so it's worth being more precise.

I don't believe this is the case for Win64, but I have to admit I
speak from a position of ignorance -- I am clueless as to what defines
Win64.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From nhodgson@bigpond.net.au  Tue May  9 04:35:16 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Tue, 9 May 2000 13:35:16 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au>  <200005090316.XAA22614@eric.cnri.reston.va.us>
Message-ID: <035e01bfb968$9ad8cca0$e3cb8490@neil>

> To help me understand the significance of win64 vs. win32, can you
> list the major differences?  I thought that the main thing was that
> pointers are 64 bits, and that otherwise the APIs are the same.  In
> fact, I don't know if WIN64 refers to Windows running on 64-bit
> machines (e.g. Alphas) only, or that it is possible to have win64 on a
> 32-bit machine (e.g. Pentium).

   The 64 bit pointer change propagates to related types like size_t and
window procedure parameters. Running the 64 bit checker over Scintilla found
one real problem and a large number of strlen returning 64 bit size_ts where
only ints were expected.

   64 bit machines will continue to run Win32 code but it is unlikely that
32 bit machines will be taught to run Win64 code.

   Mixed operations, calling between 32 bit and 64 bit code and vice-versa
will be fun. Microsoft (unlike IBM with OS/2) never really did the right
thing for the 16->32 bit conversion. Is there any information yet on mixed
size applications?

   Neil




From mhammond@skippinet.com.au  Tue May  9 05:06:25 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 14:06:25 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090316.XAA22614@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>

> To help me understand the significance of win64 vs. win32, can you
> list the major differences?  I thought that the main thing was that

I just saw Neil's, and Trent may have other input.

However, the point I was making is that 5 years ago, MS were telling us
that the Win32 API was almost identical to the Win16 API, except for the
size of pointers, and dropping of the "memory model" abominations.

The Windows CE department is telling us that CE is, or will be, basically
the same as Win32, except it is a Unicode only platform.  Again, with 1.6,
this should be hidden from the Python programmer.

Now all we need is "win64s" - it will respond to Neil's criticism that
mixed mode programs are a pain, and MS will tell us that "win64s" will
solve all our problems, and allow win32 to run 64 bit programs well into
the future.  Until everyone in the world realizes it sucks, and MS promptly
says it was only ever a hack in the first place, and everyone should be on
Win64 by now anyway :-)

Its-times-like-this-we-really-need-that-time-machine-ly,

Mark.



From tim_one@email.msn.com  Tue May  9 07:54:51 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 9 May 2000 02:54:51 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <200005090224.WAA22457@eric.cnri.reston.va.us>
Message-ID: <000101bfb983$7a34d3c0$592d153f@tim>

[Trent]
> 1. There are a fairly large number of downcasting cases in the
> Python code (not necessarily tied to PyArg_ParseTuple results). I
> was wondering if you think a generalized check on each such
> downcast would be advisable. This would take the form of some macro
> that would do a bounds check before doing the cast. For example (a
> common one is the cast of strlen's size_t return value to int,
> because Python strings use int for their length, this is a downcast
> on 64-bit systems):
>
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, len);
>
> would become
>
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
>
> CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> exception otherwise.

[Guido]
> How would the CAST_TO_INT macro signal an error?  C doesn't have
> exceptions.  If we have to add checks, I'd prefer to write
>
>   size_t len = strlen(s);
>   if (INT_OVERFLOW(len))
>      return NULL; /* Or whatever is appropriate in this context */
>   obj = PyString_FromStringAndSize(s, len);

Of course we have to add checks -- strlen doesn't return an int!  It hasn't
since about a year after Python was first written (ANSI C changed the rules,
and Python is long overdue in catching up -- if you want people to stop
passing multiple args to append, set a good example in our use of C <0.5
wink>).

[Trent]
> I would like other people's opinion on this kind of change.
> There are three possible answers:

Please don't change the rating scheme we've been using:  -1 is a veto, +1 is
a hurrah, -0 and +0 are obvious <ahem>.

>   +1 this is a bad change idea because...<reason>
>   -1 this is a good idea, go for it

That one, except spelled +1.

>   +0 (mostly likely) This is probably a good idea for some case
> where the overflow *could* happen, however the strlen example that
> you gave is *not* such a situation. As Tim Peters said: 2GB limit on
> string lengths is a good assumption/limitation.

No, it's a defensible limitation, but it's *never* a valid assumption.  The
check isn't needed anywhere we can prove a priori that it could never fail
(in which case we're not assuming anything), but it's always needed when we
can't so prove (in which case skipping the check would be a bad assumption).
In the absence of any context, your strlen example above definitely needs
the check.
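
For concreteness, a sketch of what such a check could look like (the
INT_OVERFLOW spelling just follows Guido's snippet above; this is
illustrative, not an existing macro):

    #include <limits.h>
    #include <stddef.h>

    /* Illustrative only:  true iff a size_t value cannot be represented
     * as a non-negative C int -- i.e. anything over INT_MAX. */
    #define INT_OVERFLOW(len)  ((size_t)(len) > (size_t)INT_MAX)

With something like that available, the quoted snippet compiles as written.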

An alternative would be to promote the size member from int to size_t;
that's no actual change on the 32-bit machines Guido generally assumes
without realizing it, and removes an arbitrary (albeit defensible)
limitation on some 64-bit machines at the cost of (just possibly, due to
alignment vagaries) boosting var objects' header size on the latter.

correctness-doesn't-happen-by-accident-ly y'rs  - tim




From guido@python.org  Tue May  9 11:48:16 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 09 May 2000 06:48:16 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: Your message of "Tue, 09 May 2000 02:54:51 EDT."
 <000101bfb983$7a34d3c0$592d153f@tim>
References: <000101bfb983$7a34d3c0$592d153f@tim>
Message-ID: <200005091048.GAA22912@eric.cnri.reston.va.us>

> An alternative would be to promote the size member from int to size_t;
> that's no actual change on the 32-bit machines Guido generally assumes
> without realizing it, and removes an arbitrary (albeit defensible)
> limitation on some 64-bit machines at the cost of (just possibly, due to
> alignment vagaries) boosting var objects' header size on the latter.

Then the signatures of many, many functions would have to be changed
to take or return size_t, too -- almost anything in the Python/C API
that *conceptually* is a size_t is declared as int; the ob_size field
is only the tip of the iceberg.

We'd also have to change the size of Python ints (currently long) to
an integral type that can hold a size_t; on Windows (and I believe
*only* on Windows) this is a long long, or however they spell it
(except size_t is typically unsigned).

This all is a major reworking -- not good for 1.6, even though I agree
it needs to be done eventually.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May  9 12:08:25 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 09 May 2000 07:08:25 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 14:06:25 +1000."
 <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>
Message-ID: <200005091108.HAA22983@eric.cnri.reston.va.us>

> > To help me understand the significance of win64 vs. win32, can you
> > list the major differences?  I thought that the main thing was that
> 
> I just saw Neil's, and Trent may have other input.
> 
> However, the point I was making is that 5 years ago, MS were telling us
> that the Win32 API was almost identical to the Win16 API, except for the
> size of pointers, and dropping of the "memory model" abominations.
> 
> The Windows CE department is telling us that CE is, or will be, basically
> the same as Win32, except it is a Unicode only platform.  Again, with 1.6,
> this should be hidden from the Python programmer.
> 
> Now all we need is "win64s" - it will respond to Neil's criticism that
> mixed mode programs are a pain, and MS will tell us that "win64s" will
> solve all our problems, and allow win32 to run 64 bit programs well into
> the future.  Until everyone in the world realizes it sucks, and MS promptly
> says it was only ever a hack in the first place, and everyone should be on
> Win64 by now anyway :-)

OK, I am beginning to get the picture.

The win16-win32-win64 distinction mostly affects the C API.  I agree
that the win16/win32 distinction was huge -- while they provided
backwards compatible APIs, most of these were quickly deprecated.  The
user experience was also completely different.  And huge amounts of
functionality were only available in the win32 version (e.g. the
registry), win32s notwithstanding.

I don't see the same difference for the win32/win64 API.  Yes, all the
APIs have changed -- but only in a way you would *expect* them to
change in a 64-bit world.  From the descriptions of differences, the
user experience and the sets of APIs available are basically the same,
but the APIs are tweaked to allow 64-bit values where this makes
sense.  This is a big deal for MS developers because of MS's
insistence on fixing the sizes of all datatypes -- POSIX developers
are used to typedefs that have platform-dependent widths, but MS in
its wisdom has decided that it should be okay to know that a long is
exactly 32 bits.

Again, the Windows/CE user experience is quite different, so I agree
on making the user-visible platform different there.  But I still
don't see that the user experience for win64 will be any different
from that for win32.

Another view: win32 was my way of saying the union of Windows 95,
Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
platforms.  If Windows 2000 is sufficiently different to the user, it
deserves a different platform id (win2000?).

Is there a connection between Windows 2000 and _WIN64?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal@lemburg.com  Tue May  9 10:09:40 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 09 May 2000 11:09:40 +0200
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <3917D5D3.A8CD1B3E@lemburg.com>

Guido van Rossum wrote:
> 
> > [Trent]
> > > What if someone needs to do something in Python code for either Win32 or
> > > Win64 but not both? Or should this never be necessary (not
> > > likely). I would
> > > like Mark H's opinion on this stuff.
> 
> [Mark]
> > OK :-)
> >
> > I have always thought that it _would_ move to "win64", and the official way
> > of checking for "Windows" will be sys.platform[:3]=="win".
> >
> > In fact, Ive noticed Guido use this idiom (both stand-alone, and as :if
> > sys.platform[:3] in ["win", "mac"])
> >
> > It will no doubt cause a bit of pain, but IMO it is cleaner...
> 
> Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
> symbol is defined even on Win64 systems -- to test for Win64, you must
> test the _WIN64 symbol.  The two variants are more similar than they
> are different.
> 
> While testing sys.platform isn't quite the same thing, I think that
> the same reasoning goes: a win64 system is everything that a win32
> system is, and then some.
> 
> So I'd vote for leaving sys.platform alone (i.e. "win32" in both
> cases), and providing another way to test for win64-ness.

Just curious, what's the output of platform.py on Win64 ?
(You can download platform.py from my Python Pages.)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From fdrake@acm.org  Tue May  9 19:53:37 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 9 May 2000 14:53:37 -0400 (EDT)
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005091108.HAA22983@eric.cnri.reston.va.us>
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>
 <200005091108.HAA22983@eric.cnri.reston.va.us>
Message-ID: <14616.24241.26240.247048@seahag.cnri.reston.va.us>

Guido van Rossum writes:
 > Another view: win32 was my way of saying the union of Windows 95,
 > Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
 > platforms.  If Windows 2000 is sufficiently different to the user, it
 > deserves a different platform id (win2000?).
 > 
 > Is there a connection between Windows 2000 and _WIN64?

  Since no one else has responded, here's some stuff from MS on the
topic of Win64:

http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp

This document talks only of the Itanium (IA64) processor, and doesn't
mention the Alpha at all.  I know the NT shipping on Alpha machines is
Win32, though the actual application code can be 64-bit (think "32-bit
Solaris on an Ultra"); just the system APIs are 32 bits.
  The last link on the page links to some more detailed technical
information on moving application code to Win64.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From guido@python.org  Tue May  9 19:57:21 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 09 May 2000 14:57:21 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 14:53:37 EDT."
 <14616.24241.26240.247048@seahag.cnri.reston.va.us>
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> <200005091108.HAA22983@eric.cnri.reston.va.us>
 <14616.24241.26240.247048@seahag.cnri.reston.va.us>
Message-ID: <200005091857.OAA24731@eric.cnri.reston.va.us>

>   Since no one else has responded, here's some stuff from MS on the
> topic of Win64:
> 
> http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp

Thanks, this makes more sense.  I guess that Trent's interest in Win64
has to do with an early shipment of Itaniums that ActiveState might
have received. :-)

The document confirms my feeling that WIN64 vs WIN32, unlike WIN32 vs
WIN16, is mostly a compiler issue, and not a user experience or OS
functionality issue.  The table lists increased limits, not new
software subsystems.

So I still think that sys.platform should be 'win32', to avoid
breaking existing apps.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein@lyra.org  Tue May  9 19:56:34 2000
From: gstein@lyra.org (Greg Stein)
Date: Tue, 9 May 2000 11:56:34 -0700 (PDT)
Subject: [Python-Dev] win64 (was: [Patches] PC\config.[hc] changes for Win64)
In-Reply-To: <14616.24241.26240.247048@seahag.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Fred L. Drake, Jr. wrote:
> Guido van Rossum writes:
>  > Another view: win32 was my way of saying the union of Windows 95,
>  > Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
>  > platforms.  If Windows 2000 is sufficiently different to the user, it
>  > deserves a different platform id (win2000?).
>  > 
>  > Is there a connection between Windows 2000 and _WIN64?
> 
>   Since no one else has responded, here's some stuff from MS on the
> topic of Win64:
> 
> http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp
> 
> This document talks only of the Itanium (IA64) processor, and doesn't
> mention the Alpha at all.  I know the NT shipping on Alpha machines is
> Win32, though the actual application code can be 64-bit (think "32-bit
> Solaris on an Ultra"); just the system APIs are 32 bits.

Windows is no longer made/sold for the Alpha processor. That was canned in
August of '99, I believe. Possibly August 98.

Basically, Windows is just the x86 family, and Win/CE for various embedded
processors.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From fdrake@acm.org  Tue May  9 20:06:49 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 9 May 2000 15:06:49 -0400 (EDT)
Subject: [Python-Dev] Re: win64 (was: [Patches] PC\config.[hc] changes for Win64)
In-Reply-To: <Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>
References: <14616.24241.26240.247048@seahag.cnri.reston.va.us>
 <Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>
Message-ID: <14616.25033.883165.800216@seahag.cnri.reston.va.us>

Greg Stein writes:
 > Windows is no longer made/sold for the Alpha processor. That was canned in
 > August of '99, I believe. Possibly August 98.

  <sigh/>


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives



From trentm@activestate.com  Tue May  9 20:49:57 2000
From: trentm@activestate.com (Trent Mick)
Date: Tue, 9 May 2000 12:49:57 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005091857.OAA24731@eric.cnri.reston.va.us>
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> <200005091108.HAA22983@eric.cnri.reston.va.us> <14616.24241.26240.247048@seahag.cnri.reston.va.us> <200005091857.OAA24731@eric.cnri.reston.va.us>
Message-ID: <20000509124957.A21838@activestate.com>

> Thanks, this makes more sense.  I guess that Trent's interest in Win64
> has to do with an early shipment of Itaniums that ActiveState might
> have received. :-)

Could be.... Or maybe we don't have any Itanium boxes. :)

Here is a good link on MSDN:

Getting Ready for 64-bit Windows
http://msdn.microsoft.com/library/psdk/buildapp/64bitwin_410z.htm

More specifically this (presuming it is being kept up to date) documents the
changes to the Win32 API for 64-bit Windows:
http://msdn.microsoft.com/library/psdk/buildapp/64bitwin_9xo3.htm
I am not a Windows programmer, but the changes are pretty minimal.

Summary:

Points for sys.platform == "win32" on Win64:
Pros:
- will not break existing sys.platform checks
- it would be nicer for the casual Python programmer to have platform issues
  hidden, therefore one symbol for the common Windows OSes is more of the
  Pythonic ideal than "the first three characters of the platform string are
  'win'".
Cons:
- may need to add some other mechanism to differentiate Win32 and Win64 in
  Python code
- "win32" is a little misleading in that it refers to an API supported on
  Win32 and Win64 ("windows" would be more accurate, but too late for that)
  

Points for sys.platform == "win64" on Win64:
Pros:
- seems logically cleaner, given that the Win64 API may diverge from the
  Win32 API and there is no other current mechanism to differentiate Win32 and
  Win64 in Python code
Cons:
- may break existing sys.platform checks when run on Win64


Opinion:

I see the two choices ("win32" or "win64") as a trade off between:
- Use "win32" because a common user experience should translate to a common
  way to check for that environment, i.e. one value for sys.platform.
  Unfortunately we are stuck with "win32" instead of something like
  "windows".
- Use "win64" because it is not a big deal for the user to check for
  sys.platform[:3]=="win" and this way a mechanism exists to differentiate
  btwn Win32 and Win64 should it be necessary.

I am inclined to pick "win32" because:

1. While it may be confusing to the Python scriptor on Win64 that he has to
   check for win*32*, that is something that he will learn the first time. It
   is better than the alternative of the scriptor happily using "win64" and
   then that code not running on Win32 for no good reason. 
2. The main question is: is Win64 so much more like Win32 than different from
   it that the common-case general Python programmer should not ever have to
   make the differentiation in his Python code? Or, at least, enough so that
   such differentiation by the Python scriptor is rare enough that some other
   provided mechanism is sufficient (even preferable).
3. Guido has expressed that he favours this option. :) 

Then we can change "win32" to "windows" in Py3K.



Trent

-- 
Trent Mick
trentm@activestate.com


From trentm@activestate.com  Tue May  9 21:05:53 2000
From: trentm@activestate.com (Trent Mick)
Date: Tue, 9 May 2000 13:05:53 -0700
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <000101bfb983$7a34d3c0$592d153f@tim>
References: <200005090224.WAA22457@eric.cnri.reston.va.us> <000101bfb983$7a34d3c0$592d153f@tim>
Message-ID: <20000509130553.D21443@activestate.com>

[Trent]
> > Another option might be to document 'b' as for unsigned chars and 'h', 'i',
> > 'l' as signed integral values and then set the bounds checks ([0,
> > UCHAR_MAX]
> > for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
> > limiting the user to unsigned or signed depending on the formatter. (Which
> > again, means that it would be nice to have different formatters for signed
> > and unsigned.) I think that the bounds checking is false security unless
> > these restrictions are made.
[guido]
> 
> I like this: 'b' is unsigned, the others are signed.

Okay, I will submit a patch for this then. The 'b' formatter will limit values to
[0, UCHAR_MAX].
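
For concreteness, a rough sketch of the kind of range check being discussed
(illustrative only -- the name is made up and the real change goes inside
getargs.c):

    #include <limits.h>

    /* Sketch of the check for the 'b' format code:  treat 'b' as an
     * unsigned byte and reject anything outside [0, UCHAR_MAX].
     * Illustrative only, not the actual getargs.c patch. */
    static int
    store_unsigned_byte(long ival, unsigned char *out)
    {
        if (ival < 0 || ival > UCHAR_MAX)
            return -1;               /* caller raises OverflowError */
        *out = (unsigned char)ival;
        return 0;
    }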

> [Trent]
> > 1. There are a fairly large number of downcasting cases in the
> > Python code (not necessarily tied to PyArg_ParseTuple results). I
> > was wondering if you think a generalized check on each such
> > downcast would be advisable. This would take the form of some macro
> > that would do a bounds check before doing the cast. For example (a
> > common one is the cast of strlen's size_t return value to int,
> > because Python strings use int for their length, this is a downcast
> > on 64-bit systems):
> >
> >   size_t len = strlen(s);
> >   obj = PyString_FromStringAndSize(s, len);
> >
> > would become
> >
> >   size_t len = strlen(s);
> >   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
> >
> > CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> > exception otherwise.
> 
> [Guido]
> > How would the CAST_TO_INT macro signal an error?  C doesn't have
> > exceptions.  If we have to add checks, I'd prefer to write
> >
> >   size_t len = strlen(s);
> >   if (INT_OVERFLOW(len))
> >      return NULL; /* Or whatever is appropriate in this context */
> >   obj = PyString_FromStringAndSize(s, len);
> 
[Tim]
> Of course we have to add checks -- strlen doesn't return an int!  It hasn't
> since about a year after Python was first written (ANSI C changed the rules,
> and Python is long overdue in catching up -- if you want people to stop
> passing multiple args to append, set a good example in our use of C <0.5
> wink>).
>
> The
> check isn't needed anywhere we can prove a priori that it could never fail
> (in which case we're not assuming anything), but it's always needed when we
> can't so prove (in which case skipping the check would be a bad
> assumption).
> In the absence of any context, your strlen example above definitely needs
> the check.
>

Okay, I just wanted a go ahead that this kind of thing was desired. I will
try to find the points where these overflows *can* happen and then I'll add
checks in a manner closer to Guido's syntax above.

> 
> [Trent]
> > I would like other people's opinion on this kind of change.
> > There are three possible answers:
> 
> Please don't change the rating scheme we've been using:  -1 is a veto, +1 is
> a hurrah, -0 and +0 are obvious <ahem>.
> 
> >   +1 this is a bad change idea because...<reason>
> >   -1 this is a good idea, go for it
> 
Whoa, sorry Tim. I mixed up the +/- there. I did not intend to change the
voting system.

[Tim]
> An alternative would be to promote the size member from int to size_t;
> that's no actual change on the 32-bit machines Guido generally assumes
> without realizing it, and removes an arbitrary (albeit defensible)
> limitation on some 64-bit machines at the cost of (just possibly, due to
> alignment vagaries) boosting var objects' header size on the latter.
> 
I agree with Guido that this is too big an immediate change. I'll just try to
find and catch the possible overflows.


Thanks,
Trent

-- 
Trent Mick
trentm@activestate.com


From gstein@lyra.org  Tue May  9 21:14:19 2000
From: gstein@lyra.org (Greg Stein)
Date: Tue, 9 May 2000 13:14:19 -0700 (PDT)
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c)
In-Reply-To: <200005091953.PAA28201@seahag.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Fred Drake wrote:
> Update of /projects/cvsroot/python/dist/src/Objects
> In directory seahag.cnri.reston.va.us:/home/fdrake/projects/python/Objects
> 
> Modified Files:
> 	unicodeobject.c 
> Log Message:
> 
> M.-A. Lemburg <mal@lemburg.com>:
> Added support for user settable default encodings. The
> current implementation uses a per-process global which
> defines the value of the encoding parameter in case it
> is set to NULL (meaning: use the default encoding).

Umm... maybe I missed something, but I thought there was pretty broad
feelings *against* having a global like this. This kind of thing is just
nasty.

1) Python modules can't change it, nor can they rely on it being a
   particular value
2) a mutable, global variable is just plain wrong. The InterpreterState
   and ThreadState structures were created *specifically* to avoid adding
   crap variables like this.
3) allowing a default other than utf-8 is sure to cause gotchas and
   surprises. Some code is going to rightly assume that the default is
   just that, but be horribly broken when an application changes it.

Somebody please say this is hugely experimental. And then say why it isn't
just a private patch, rather than sitting in CVS.

:-(

-g

-- 
Greg Stein, http://www.lyra.org/



From guido@python.org  Tue May  9 21:24:05 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 09 May 2000 16:24:05 -0400
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c)
In-Reply-To: Your message of "Tue, 09 May 2000 13:14:19 PDT."
 <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org>
References: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org>
Message-ID: <200005092024.QAA25835@eric.cnri.reston.va.us>

> Umm... maybe I missed something, but I thought there was pretty broad
> feelings *against* having a global like this. This kind of thing is just
> nasty.
> 
> 1) Python modules can't change it, nor can they rely on it being a
>    particular value
> 2) a mutable, global variable is just plain wrong. The InterpreterState
>    and ThreadState structures were created *specifically* to avoid adding
>    crap variables like this.
> 3) allowing a default other than utf-8 is sure to cause gotchas and
>    surprises. Some code is going to rightly assume that the default is
>    just that, but be horribly broken when an application changes it.
> 
> Somebody please say this is hugely experimental. And then say why it isn't
> just a private patch, rather than sitting in CVS.

Watch your language.

Marc did this at my request.  It is my intention that the encoding be
hardcoded at compile time.  But while there's a discussion going about
what the hardcoded encoding should *be*, it would seem handy to have a
quick way to experiment.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gstein@lyra.org  Tue May  9 21:33:40 2000
From: gstein@lyra.org (Greg Stein)
Date: Tue, 9 May 2000 13:33:40 -0700 (PDT)
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ...
 unicodeobject.c)
In-Reply-To: <200005092024.QAA25835@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091331300.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Guido van Rossum wrote:
>...
> Watch your language.

Yes, Dad :-) Sorry...

> Marc did this at my request.  It is my intention that the encoding be
> hardcoded at compile time.  But while there's a discussion going about
> what the hardcoded encoding should *be*, it would seem handy to have a
> quick way to experiment.

Okee dokee... That was one of my questions: is this experimental or not?

It is still a bit frightening, though, if it might get left in there, for
the reasons I listed (to name a few) ... :-(

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From mal@lemburg.com  Tue May  9 22:35:16 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 09 May 2000 23:35:16 +0200
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ...
 unicodeobject.c)
References: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org> <200005092024.QAA25835@eric.cnri.reston.va.us>
Message-ID: <39188494.61424A7@lemburg.com>

Guido van Rossum wrote:
> 
> > Umm... maybe I missed something, but I thought there was pretty broad
> > feelings *against* having a global like this. This kind of thing is just
> > nasty.
> >
> > 1) Python modules can't change it, nor can they rely on it being a
> >    particular value
> > 2) a mutable, global variable is just plain wrong. The InterpreterState
> >    and ThreadState structures were created *specifically* to avoid adding
> >    crap variables like this.
> > 3) allowing a default other than utf-8 is sure to cause gotchas and
> >    surprises. Some code is going to rightly assume that the default is
> >    just that, but be horribly broken when an application changes it.

Hmm, the patch notice says it all I guess:

This patch fixes a few bugglets and adds an experimental
feature which allows setting the string encoding assumed
by the Unicode implementation at run-time.

The current implementation uses a process global for
the string encoding. This should subsequently be changed
to a thread state variable, so that the setting can
be done on a per thread basis.

Note that only the coercions from strings to Unicode
are affected by the encoding parameter. The "s" parser
marker still returns UTF-8. (str(unicode) also returns
the string encoding -- unlike what I wrote in the original
patch notice.)

The main intent of this patch is to provide a test
bed for the ongoing Unicode debate, e.g. to have the
implementation use 'latin-1' as default string encoding,
put

import sys
sys.set_string_encoding('latin-1')

in your site.py file.

> > Somebody please say this is hugely experimental. And then say why it isn't
> > just a private patch, rather than sitting in CVS.
> 
> Watch your language.
> 
> Marc did this at my request.  It is my intention that the encoding be
> hardcoded at compile time.  But while there's a discussion going about
> what the hardcoded encoding should *be*, it would seem handy to have a
> quick way to experiment.

Right, and that's the intent behind adding a global
and some APIs to change it first... there are a few ways this
could one day get finalized:

1. hardcode the encoding (UTF-8 was previously hard-coded)
2. make the encoding a compile time option
3. make the encoding a per-process option
4. make the encoding a per-thread option
5. make the encoding a per-process setting which is deduced
   from env. vars such as LC_ALL, LC_CTYPE, LANG or system
   APIs which can be used to get at the currently
   active local encoding

Note that I have named the APIs sys.get/set_string_encoding()...
I've done that on purpose, because I have a feeling that
changing the conversion from Unicode to strings from UTF-8
to an encoding not capable of representing all Unicode
characters won't get us very far. Also, changing this is
rather tricky due to the way the buffer API works.

The other way around needs some experimenting though and this
is what the patch implements: it allows you to change the
string encoding assumption to test various
possibilities, e.g. ascii, latin-1, unicode-escape,
<your favourite local encoding> etc. without having to
recompile the interpreter every time.

Have fun with it :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond@skippinet.com.au  Tue May  9 23:58:19 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 10 May 2000 08:58:19 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <20000509124957.A21838@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBAEJFCKAA.mhammond@skippinet.com.au>

Geez - Fred is posting links to the MS site, and I'm battling ipchains and
DHCP on my newly installed Debian box - what is this world coming to!?!?!

> I am inclined to pick "win32" because:

OK - I'm sold.

Mark.



From nhodgson@bigpond.net.au  Wed May 10 00:17:27 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Wed, 10 May 2000 09:17:27 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>
Message-ID: <009a01bfba0c$bdf13a20$e3cb8490@neil>

> Now all we need is "win64s" - it will respond to Neil's criticism that
> mixed mode programs are a pain, and MS will tell us that "win64s" will
> solve all our problems, and allow win32 to run 64 bit programs well into
> the future.  Until everyone in the world realizes it sucks, and MS promptly
> says it was only ever a hack in the first place, and everyone should be on
> Win64 by now anyway :-)

   Maybe someone has made noise about this before I joined the discussion,
but I see the absence of a mixed mode being a big problem for users. I don't
think that there will be the "quick clean" migration from 32 to 64 that
there was for 16 to 32. It doesn't offer that much for most applications. So
there will need to be both 32 bit and 64 bit versions of Python present on
machines. With duplicated libraries. Each DLL should be available in both 32
and 64 bit form. The IDEs will have to be available in both forms as they
are loading, running and debugging code of either width. Users will have to
remember to run a different Python if they are using libraries of the
non-default width.

   Neil




From czupancic@beopen.com  Wed May 10 00:44:20 2000
From: czupancic@beopen.com (Christian Zupancic)
Date: Tue, 09 May 2000 16:44:20 -0700
Subject: [Python-Dev] Python Query
Message-ID: <3918A2D4.B0FE7DDF@beopen.com>

======================================================================
Greetings Python Developers,

Please participate in a small survey about Python for BeOpen.com that we
are conducting with the guidance of our advisor, and the creator of
Python, Guido van Rossum. In return for answering just five short
questions, I will mail you up to three (3) BeOpen T-shirts-- highly
esteemed by select trade-show attendees as "really cool". In addition,
three lucky survey participants will receive a Life-Size Inflatable
Penguin (as they say, "very cool").

- Why do you prefer Python over other languages, e.g. Perl?


- What do you consider to be (a) competitor(s) to Python?


- What are Python's strong points and weaknesses?


- What other languages do you program in?


- If you had one wish about Python, what would it be?


- For Monty Python fans only:
What is the average airspeed of a swallow (European, non-migratory)?

 THANKS! That wasn't so bad, was it?  Make sure you've attached a
business card or address of some sort so I know where to send your
prizes.

Best Regards,
Christian Zupancic
Market Analyst, BeOpen.com

--------------AF8E1B3E17F2CC7DF987495F
Content-Type: text/x-vcard; charset=us-ascii;
 name="czupancic.vcf"
Content-Transfer-Encoding: 7bit
Content-Description: Card for Christian Zupancic
Content-Disposition: attachment;
 filename="czupancic.vcf"

begin:vcard 
n:Zupancic;Christian
tel;work:408.985.4775
x-mozilla-html:FALSE
adr:;;;;;;
version:2.1
email;internet:czupancic@beopen.com
end:vcard

--------------AF8E1B3E17F2CC7DF987495F--



From trentm@activestate.com  Wed May 10 00:45:36 2000
From: trentm@activestate.com (Trent Mick)
Date: Tue, 9 May 2000 16:45:36 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <3917D5D3.A8CD1B3E@lemburg.com>
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us> <3917D5D3.A8CD1B3E@lemburg.com>
Message-ID: <20000509164536.A31366@activestate.com>

On Tue, May 09, 2000 at 11:09:40AM +0200, M.-A. Lemburg wrote:
> Just curious, what's the output of platform.py on Win64 ?
> (You can download platform.py from my Python Pages.)

I get the following:

"""
The system cannot find the path specified
win64-32bit
"""

Sorry, I did not hunt down the "path" error message.

Trent

-- 
Trent Mick
trentm@activestate.com


From tim_one@email.msn.com  Wed May 10 05:53:20 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 10 May 2000 00:53:20 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <009a01bfba0c$bdf13a20$e3cb8490@neil>
Message-ID: <000301bfba3b$a9e11300$022d153f@tim>

[Neil Hodgson]
>    Maybe someone has made noise about this before I joined the
> discussion, but I see the absence of a mixed mode being a big
> problem for users. ...

Intel doesn't -- they're not positioning Itanium for the consumer market.
They're going after the high-performance server market with this, and most
signs are that MS is too.

> ...
> It doesn't offer that much for most applications.

Bingo.

plenty-of-time-to-panic-later-if-end-users-ever-care-ly y'rs  - tim




From mal@lemburg.com  Wed May 10 08:47:43 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 May 2000 09:47:43 +0200
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us> <3917D5D3.A8CD1B3E@lemburg.com> <20000509164536.A31366@activestate.com>
Message-ID: <3919141F.89DC215E@lemburg.com>

Trent Mick wrote:
> 
> On Tue, May 09, 2000 at 11:09:40AM +0200, M.-A. Lemburg wrote:
> > Just curious, what's the output of platform.py on Win64 ?
> > (You can download platform.py from my Python Pages.)
> 
> I get the following:
> 
> """
> The system cannot find the path specified

Hmm, this probably originates from platform.py trying
to find the "file" command which is used on Unix.

> win64-32bit

Now this looks interesting ... 32-bit Win64 ;-)

> """
> 
> Sorry, I did not hunt down the "path" error message.
> 
> Trent
> 
> --
> Trent Mick
> trentm@activestate.com
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido@python.org  Wed May 10 17:52:49 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 10 May 2000 12:52:49 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Tools/idle browser.py,NONE,1.1
In-Reply-To: Your message of "Wed, 10 May 2000 12:47:30 EDT."
 <200005101647.MAA30408@seahag.cnri.reston.va.us>
References: <200005101647.MAA30408@seahag.cnri.reston.va.us>
Message-ID: <200005101652.MAA28936@eric.cnri.reston.va.us>

Fred,

"browser" is a particularly non-descriptive name for this module.

Perhaps it's not too late to rename it to e.g. "BrowserControl"?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From trentm@activestate.com  Wed May 10 21:14:46 2000
From: trentm@activestate.com (Trent Mick)
Date: Wed, 10 May 2000 13:14:46 -0700
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <000201bfba3b$a74ad7c0$022d153f@tim>
References: <20000509162504.A31192@activestate.com> <000201bfba3b$a74ad7c0$022d153f@tim>
Message-ID: <20000510131446.A25926@activestate.com>

On Wed, May 10, 2000 at 12:53:16AM -0400, Tim Peters wrote:
> [Trent Mick]
> > Discussion:
> >
> > Okay, it is debatable to call float_hash and complex_hash broken,
> > but their code presumed that sizeof(long) was 32-bits. As a result
> > the hashed values for floats and complex values were not the same
> > on a 64-bit *nix system as on a 32-bit *nix system. With this
> > patch they are.
> 
> The goal is laudable but the analysis seems flawed.  For example, this new
> comment:

Firstly, I should have admitted my ignorance with regards to hash functions.


> Looks to me like the real problem in the original was here:
> 
>     x = hipart + (long)fractpart + (long)intpart + (expo << 15);
>                                    ^^^^^^^^^^^^^
> 
> The difficulty is that intpart may *not* fit in 32 bits, so the cast of
> intpart to long is ill-defined when sizeof(long) == 4.

> 
> That is, the hash function truly is broken for "large" values with a
> fractional part, and I expect your after-patch code suffers the same
> problem: 

Yes it did.


> The
> solution to this is to break intpart in this branch into pieces no larger
> than 32 bits too 

Okay here is another try (only for floatobject.c) for discussion. If it looks
good then I will submit a patch for float and complex objects. So do the same
for 'intpart' as was done for 'fractpart'.


static long
float_hash(v)
    PyFloatObject *v;
{
    double intpart, fractpart;
    long x;

    fractpart = modf(v->ob_fval, &intpart);

    if (fractpart == 0.0) {
		/* ... snip ... */
    }
    else {
        int expo;
        long hipart;

        fractpart = frexp(fractpart, &expo);
        fractpart = fractpart * 2147483648.0; 
        hipart = (long)fractpart; 
        fractpart = (fractpart - (double)hipart) * 2147483648.0;

        x = hipart + (long)fractpart + (expo << 15); /* combine the fract parts */

        intpart = frexp(intpart, &expo);
        intpart = intpart * 2147483648.0;
        hipart = (long)intpart;
        intpart = (intpart - (double)hipart) * 2147483648.0;

        x += hipart + (long)intpart + (expo << 15); /* add in the int parts */
    }
    if (x == -1)
        x = -2;
    return x;
}




> Note this consequence under the Win32 Python:

With this change, on Linux32:

>>> base = 2.**40 + 0.5
>>> base
1099511627776.5
>>> for i in range(32, 45):
...     x = base + 2.**i
...     print x, hash(x)
...
1.10380659507e+12 -2141945856
1.10810156237e+12 -2137751552
1.11669149696e+12 -2129362944
1.13387136614e+12 -2112585728
1.16823110451e+12 -2079031296
1.23695058125e+12 -2011922432
1.37438953472e+12 -1877704704
1.64926744166e+12 -1609269248
2.19902325555e+12 -2146107392
3.29853488333e+12 -1609236480
5.49755813888e+12 -1877639168
9.89560464998e+12 -2011824128
1.86916976722e+13 -2078900224


On Linux64:

>>> base = 2.**40 + 0.5
>>> base
1099511627776.5
>>> for i in range(32, 45):
...     x = base + 2.**i
...     print x, hash(x)
...
1.10380659507e+12 2153021440
1.10810156237e+12 2157215744
1.11669149696e+12 2165604352
1.13387136614e+12 2182381568
1.16823110451e+12 2215936000
1.23695058125e+12 2283044864
1.37438953472e+12 2417262592
1.64926744166e+12 2685698048
2.19902325555e+12 2148859904
3.29853488333e+12 2685730816
5.49755813888e+12 2417328128
9.89560464998e+12 2283143168
1.86916976722e+13 2216067072

>-- and that should also fix your 64-bit woes "by magic".
> 

As you can see it did not, but for another reason. The summation of the parts
overflows 'x'. Is this a problem? I.e., does it matter if a hash function
returns an overflowed integral value (my hash function ignorance is showing)?
And if this does not matter, does it matter that a hash returns different
values on different platforms?


> a hash function should never ignore any bit in its input. 

Which brings up a question regarding instance_hash(), func_hash(),
meth_hash(), HKEY_hash() [or whatever it is called], and others which cast a
pointer to a long (discarding the upper half of the pointer on Win64). Do
these really need to be fixed? Am I nitpicking too much on this whole thing?


Thanks,
Trent

-- 
Trent Mick
trentm@activestate.com


From tim_one@email.msn.com  Thu May 11 05:13:29 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 May 2000 00:13:29 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <20000510131446.A25926@activestate.com>
Message-ID: <000b01bfbaff$43d320c0$2aa0143f@tim>

[Trent Mick]
> ...
> Okay here is another try (only for floatobject.c) for discussion.
> If it looks good then I will submit a patch for float and complex
> objects. So do the same for 'intpart' as was done for 'fractpart'.
>
>
> static long
> float_hash(v)
>     PyFloatObject *v;
> {
>     double intpart, fractpart;
>     long x;
>
>     fractpart = modf(v->ob_fval, &intpart);
>
>     if (fractpart == 0.0) {
> 		/* ... snip ... */
>     }
>     else {
>         int expo;
>         long hipart;
>
>         fractpart = frexp(fractpart, &expo);
>         fractpart = fractpart * 2147483648.0;

It's OK to use "*=" in C <wink>.

Would like a comment that this is 2**31 (which makes the code obvious <wink>
instead of mysterious).  A comment block at the top would help too, like

/* Use frexp to get at the bits in intpart and fractpart.
 * Since the VAX D double format has 56 mantissa bits, which is the
 * most of any double format in use, each of these parts may have as
 * many as (but no more than) 56 significant bits.
 * So, assuming sizeof(long) >= 4, each part can be broken into two longs;
 * frexp and multiplication are used to do that.
 * Also, since the Cray double format has 15 exponent bits, which is the
 * most of any double format in use, shifting the exponent field left by
 * 15 won't overflow a long (again assuming sizeof(long) >= 4).
 */

And this code has gotten messy enough that it's probably better to package it in
a utility function rather than duplicate it.
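
For instance, something along these lines (a sketch of the suggested helper,
not the committed code; the caller would fold in the exponent term once):

    #include <math.h>

    /* Sketch of the suggested utility:  pull the mantissa bits of one
     * double "part" (as produced by modf) into a long, 31 bits at a time,
     * and hand the binary exponent back so the caller can mix it in
     * exactly once.  Illustrative only. */
    static long
    float_part_bits(double part, int *expo)
    {
        long hipart;

        part = frexp(part, expo);       /* mantissa now in [0.5, 1), or 0 */
        part *= 2147483648.0;           /* 2**31:  top 31 mantissa bits */
        hipart = (long)part;
        part = (part - (double)hipart) * 2147483648.0;   /* next 31 bits */
        return hipart + (long)part;
    }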

Another approach would be to play with the bits directly, via casting
tricks.  But then you have to wrestle with platform crap like endianness.

>         hipart = (long)fractpart;
>         fractpart = (fractpart - (double)hipart) * 2147483648.0;
>
>         x = hipart + (long)fractpart + (expo << 15); /* combine
> the fract parts */
>
>         intpart = frexp(intpart, &expo);
>         intpart = intpart * 2147483648.0;
>         hipart = (long)intpart;
>         intpart = (intpart - (double)hipart) * 2147483648.0;
>
>         x += hipart + (long)intpart + (expo << 15); /* add in the
> int parts */

There's no point adding in (expo << 15) a second time.

> With this change, on Linux32:
> ...
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 -2141945856
> 1.10810156237e+12 -2137751552
> 1.11669149696e+12 -2129362944
> 1.13387136614e+12 -2112585728
> 1.16823110451e+12 -2079031296
> 1.23695058125e+12 -2011922432
> 1.37438953472e+12 -1877704704
> 1.64926744166e+12 -1609269248
> 2.19902325555e+12 -2146107392
> 3.29853488333e+12 -1609236480
> 5.49755813888e+12 -1877639168
> 9.89560464998e+12 -2011824128
> 1.86916976722e+13 -2078900224
>
>
> On Linux64:
>
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 2153021440
> 1.10810156237e+12 2157215744
> 1.11669149696e+12 2165604352
> 1.13387136614e+12 2182381568
> 1.16823110451e+12 2215936000
> 1.23695058125e+12 2283044864
> 1.37438953472e+12 2417262592
> 1.64926744166e+12 2685698048
> 2.19902325555e+12 2148859904
> 3.29853488333e+12 2685730816
> 5.49755813888e+12 2417328128
> 9.89560464998e+12 2283143168
> 1.86916976722e+13 2216067072

>>-- and that should also fix your 64-bit woes "by magic".

> As you can see it did not, but for another reason.

I read your original complaint as that hash(double) yielded different
results between two *64* bit platforms (Linux64 vs Win64), but what you
showed above appears to be a comparison between a 64-bit platform and a
32-bit platform, and where presumably sizeof(long) is 8 on the former but 4
on the latter.  If so, of *course* results may be different:  hash returns a
C long, and they're different sizes across these platforms.

In any case, the results above aren't really different!

>>> hex(-2141945856)  # 1st result from Linux32
'0x80548000'
>>> hex(2153021440L)  # 1st result from Linux64
'0x80548000L'
>>>

That is, the bits are the same.  How much more do you want from me <wink>?

> The summation of the parts overflows 'x'. Is this a problem? I.e., does
> it matter if a hash function returns an overflowed integral value (my
> hash function ignorance is showing)?

Overflow generally doesn't matter.  In fact, it's usual <wink>; e.g., the
hash for strings iterates over

    x = (1000003*x) ^ *p++;

and overflows madly.  The saving grace is that C defines integer overflow in
such a way that losing the high bits on every operation yields the same
result as if the entire result were computed to infinite precision and the
high bits tossed only at the end.  So overflow doesn't hurt this from being
as reproducible as possible, given that Python's int size is different.
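
For reference, the shape of that loop (a sketch only; the real code lives in
stringobject.c):

    /* Shape of the string-hash loop referred to above; sketch only.
     * 'x' overflows freely along the way, which is expected and harmless. */
    static long
    string_hash_sketch(const unsigned char *p, long len)
    {
        long x, n;

        if (len <= 0)
            return 0;
        x = *p << 7;
        n = len;
        while (--n >= 0)
            x = (1000003 * x) ^ *p++;
        x ^= len;
        if (x == -1)
            x = -2;                     /* -1 is reserved for "error" */
        return x;
    }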

Overflow can be avoided by using xor instead of addition, but addition is
generally preferred because it helps to "scramble" the bits a little more.

> And if this does not matter, does it matter that a hash returns different
> values on different platforms?

No, and it doesn't always stay the same from release to release on a single
platform.  For example, your patch above will change hash(double) on Win32!

>> a hash function should never ignore any bit in its input.

> Which brings up a question regarding instance_hash(), func_hash(),
> meth_hash(), HKEY_hash() [or whatever it is called], and other
> which cast a pointer to a long (discarding the upperhalf of the
> pointer on Win64). Do these really need to be fixed. Am I nitpicking
> too much on this whole thing?

I have to apologize (although only semi-sincerely) for not being meaner
about this when I did the first 64-bit port.  I did that for my own use, and
avoided the problem areas rather than fix them.  But unless a language dies,
you end up paying for every hole in the end, and the sooner they're plugged
the less it costs.

That is, no, you're not nitpicking too much!  Everyone else probably thinks
you are <wink>, *but*, they're not running on 64-bit platforms yet so these
issues are still invisible to their gut radar.  I'll bet your life that
every hole remaining will trip up an end user eventually -- and they're the
ones least able to deal with the "mysterious problems".





From guido@python.org  Thu May 11 14:01:10 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 11 May 2000 09:01:10 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: Your message of "Thu, 11 May 2000 00:13:29 EDT."
 <000b01bfbaff$43d320c0$2aa0143f@tim>
References: <000b01bfbaff$43d320c0$2aa0143f@tim>
Message-ID: <200005111301.JAA00512@eric.cnri.reston.va.us>

I have to admit I have no clue about the details of this debate any
more, and I'm cowardly awaiting a patch submission that Tim approves
of.  (I'm hoping a day will come when Tim can check it in himself. :-)

In the mean time, I'd like to emphasize the key invariant here: we
must ensure that (a==b) => (hash(a)==hash(b)).  One quick way to deal
with this could be the following pseudo C:

    PyObject *double_hash(double x)
    {
        long l = (long)x;
        if ((double)l == x)
	    return long_hash(l);
	...double-specific code...
    }

This code makes one assumption: that if there exists a long l equal to
a double x, the cast (long)x should yield l...

--Guido van Rossum (home page: http://www.python.org/~guido/)


From trentm@activestate.com  Thu May 11 23:14:45 2000
From: trentm@activestate.com (Trent Mick)
Date: Thu, 11 May 2000 15:14:45 -0700
Subject: [Python-Dev] testing the C API in the test suite (was: bug in PyLong_FromLongLong (PR#324))
In-Reply-To: <200005111323.JAA00637@eric.cnri.reston.va.us>
References: <200005111323.JAA00637@eric.cnri.reston.va.us>
Message-ID: <20000511151445.B15936@activestate.com>

> Date:    Wed, 10 May 2000 15:37:30 -0400
> From:    Thomas.Malik@t-online.de
> To:      python-bugs-list@python.org
> cc:      bugs-py@python.org
> Subject: [Python-bugs-list] bug in PyLong_FromLongLong (PR#324)
> 
> Full_Name: Thomas Malik
> Version: 1.5.2
> OS: all
> Submission from: p3e9ed447.dip.t-dialin.net (62.158.212.71)
> 
> 
> there's a bug in PyLong_FromLongLong, resulting in truncation of negative
> 64 bit integers. PyLong_FromLongLong starts with:
> 	if( ival <= (LONG_LONG)LONG_MAX ) {
> 		return PyLong_FromLong( (long)ival );
> 	}
> 	else if( ival <= (unsigned LONG_LONG)ULONG_MAX ) {
> 		return PyLong_FromUnsignedLong( (unsigned long)ival );
> 	}
> 	else {
>              ....
> 
> Now, if ival is smaller than -LONG_MAX, it falls outside the long integer range
> (being a 64 bit negative integer), but gets handled by the first
> if-then-case in the above code ('cause it is, of course, smaller than
> LONG_MAX). This results in truncation of the 64 bit negative integer to a
> more or less arbitrary 32 bit number. The way to fix it is to compare the
> absolute value of ival against LONG_MAX in the first condition. The second
> condition (ULONG_MAX) must, at least, check whether ival is positive.
> 
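
For illustration, the shape of the fix the report describes (a sketch, not
the submitted patch; from_long_long_digits() is a stand-in for the general
digit-by-digit path that already exists in longobject.c):

    #include <limits.h>
    #include "Python.h"

    /* Sketch of the comparison fix described above -- not the submitted
     * patch.  LONG_LONG is CPython's spelling of the 64-bit type. */
    static PyObject *from_long_long_digits(LONG_LONG ival);  /* hypothetical */

    static PyObject *
    long_from_long_long_sketch(LONG_LONG ival)
    {
        if (ival >= (LONG_LONG)LONG_MIN && ival <= (LONG_LONG)LONG_MAX)
            return PyLong_FromLong((long)ival);
        if (ival > 0 && (unsigned LONG_LONG)ival <= (unsigned LONG_LONG)ULONG_MAX)
            return PyLong_FromUnsignedLong((unsigned long)ival);
        return from_long_long_digits(ival); /* very large, or very negative */
    }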

To test this error I found the easiest way was to make a C extension module
to Python that called the C API functions under test directly. I can't
quickly think of a way I could have shown this error *clearly* at the Python
level without a specialized extension module. This has been true for other
things that I have been testing.

Would it make sense to create a standard extension module (called '__test' or
something like that) in which direct tests on the C API could be made? This
would be hooked into the standard testsuite via a test_capi.py that would:
- import __test
- run every exported function in __test (or every one starting with 'test_',
  or whatever)
- the ImportError could continue to be used to signify skipping, etc
  (although, I think that a new, more explicit TestSuiteError class would be
  more appropriate and clear)
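
A rough sketch of what such a module could look like (hypothetical '__test'
module, illustrative only):

    #include "Python.h"

    /* Sketch of the proposed C-level test module.  Each test_* function
     * exercises a C API call directly and raises __test.error on failure.
     * Hypothetical code, not an existing file. */

    static PyObject *TestError;     /* exception type for C API test failures */

    static PyObject *
    test_longlong_roundtrip(PyObject *self, PyObject *args)
    {
        LONG_LONG v = -42;
        PyObject *o;

        if (!PyArg_ParseTuple(args, ""))
            return NULL;
        o = PyLong_FromLongLong(v);
        if (o == NULL)
            return NULL;
        if (PyLong_AsLongLong(o) != v) {
            Py_DECREF(o);
            PyErr_SetString(TestError,
                            "PyLong_FromLongLong truncated a negative value");
            return NULL;
        }
        Py_DECREF(o);
        Py_INCREF(Py_None);
        return Py_None;
    }

    static PyMethodDef test_methods[] = {
        {"test_longlong_roundtrip", test_longlong_roundtrip, 1}, /* METH_VARARGS */
        {NULL, NULL}
    };

    void
    init__test(void)
    {
        PyObject *m, *d;

        m = Py_InitModule("__test", test_methods);
        d = PyModule_GetDict(m);
        TestError = PyErr_NewException("__test.error", NULL, NULL);
        PyDict_SetItemString(d, "error", TestError);
    }

test_capi.py would then just import __test and call everything named test_*.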

Does something like this already exist that I am missing?

This would make testing some things a lot easier, and clearer. Where
some interface is exposed to the Python programmer it is appropriate to test
it at the Python level. Python also provides a C API and it would be
appropriate to test that at the C level.

I would like to hear some people's thoughts before I go off and put anything
together.

Thanks,
Trent


-- 
Trent Mick
trentm@activestate.com


From DavidA@ActiveState.com  Thu May 11 23:16:43 2000
From: DavidA@ActiveState.com (David Ascher)
Date: Thu, 11 May 2000 15:16:43 -0700
Subject: [Python-Dev] c.l.p.announce
Message-ID: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>

What's the status of comp.lang.python.announce and the 'reviving' thereof?

--david


From tim_one@email.msn.com  Fri May 12 03:58:35 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 May 2000 22:58:35 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <200005111301.JAA00512@eric.cnri.reston.va.us>
Message-ID: <000001bfbbbd$f74572c0$9ca2143f@tim>

[Guido]
> I have to admit I have no clue about the details of this debate any
> more,

Na, there's no debate here.  I believe I confused things by misunderstanding
what Trent's original claim was (sorry, Trent!), but we bumped into real
flaws in the current hash anyway (even on 32-bit machines).  I don't think
there's any actual disagreement about anything here.

> and I'm cowardly awaiting a patch submission that Tim approves
> of.

As am I <wink>.

> (I'm hoping a day will come when Tim can check it in himself. :-)

Well, all you have to do to make that happen is get a real job and then hire
me <wink>.

> In the mean time, I'd like to emphasize the key invariant here: we
> must ensure that (a==b) => (hash(a)==hash(b)).

Absolutely.  That's already true, and is so non-controversial that Trent
elided ("...") the code for that in his last post.

> One quick way to deal with this could be the following pseudo C:
>
>     PyObject *double_hash(double x)
>     {
>         long l = (long)x;
>         if ((double)l == x)
> 	    return long_hash(l);
> 	...double-specific code...
>     }
>
> This code makes one assumption: that if there exists a long l equal to
> a double x, the cast (long)x should yield l...

No, that fails on two counts:

1.  If x is "too big" to fit in a long (and a great many doubles are),
    the cast to long is undefined.  Don't know about all current platforms,
    but on the KSR platform such casts raised a fatal hardware
    exception.  The current code already accomplishes this part in a
    safe way (which Trent's patch improves by using a symbol instead of
    the current hard-coded hex constant).

2.  The key invariant needs to be preserved also when x is an exact
    integral value that happens to be (possibly very!) much bigger than
    a C long; e.g.,

>>> long(1.23e300)  # 1.23e300 is an integer! albeit not the one you think
12299999999999999456195024356787918820614965027709909500456844293279
60298864608335541984218516600989160291306221939122973741400364055485
57167627474369519296563706976894811817595986395177079943535811102573
51951343133141138298152217970719263233891682157645730823560232757272
73837119288529943287157489664L
>>> hash(1.23e300) == hash(_)
1
>>>

The current code already handles that correctly too.  All the problems occur
when the double has a non-zero fractional part, and Trent knows how to fix
that now.  hash(x) may differ across platforms because sizeof(long) differs
across platforms, but that's just as true of strings as floats (i.e., Python
has never computed platform-independent hashes -- if that bothers *you*
(doesn't bother me), that's the part you should chime in on).
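
For the record, here's one safe shape for the "does this double fit in a C
long?" test -- just a sketch, not necessarily what Trent's patch does, and
the helper name is made up:

    #include <limits.h>
    #include <math.h>

    /* Sketch only.  Return 1 and set *out if x is an exact integral
     * value representable as a C long; return 0 otherwise, without ever
     * doing an out-of-range (undefined) cast.  -(double)LONG_MIN is a
     * power of two, hence exactly representable, so it makes a safe
     * *exclusive* upper bound; (double)LONG_MAX may round up. */
    static int
    double_fits_long(double x, long *out)
    {
        double intpart;

        if (modf(x, &intpart) != 0.0)
            return 0;                   /* has a fractional part */
        if (intpart < (double)LONG_MIN || intpart >= -(double)LONG_MIN)
            return 0;                   /* outside the long range */
        *out = (long)intpart;
        return 1;
    }

The only point is that the range check happens in double precision, before
any cast whose behavior would otherwise be undefined.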




From guido@python.org  Fri May 12 13:24:25 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 08:24:25 -0400
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: Your message of "Thu, 11 May 2000 15:16:43 PDT."
 <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>
Message-ID: <200005121224.IAA06063@eric.cnri.reston.va.us>

> What's the status of comp.lang.python.announce and the 'reviving' thereof?

Good question.  Several of us here at CNRI have volunteered to become
moderators.  I think we may have to start faking Approved: headers in
the mean time...

(I wonder if we can make posts to python-announce@python.com be
forwarded to c.l.py.a with such a header automatically tacked on?)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Fri May 12 14:43:37 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 15:43:37 +0200
Subject: [Python-Dev] Unicode and its partners...
Message-ID: <391C0A89.819A33EA@lemburg.com>

It got a little silent around the 7-bit vs. 8-bit vs. UTF-8
discussion. 

Not that I would like it to restart (I think everybody has
made their point), but it kind of surprised me that now, with the
ability to actually set the default string encoding at run-time,
no one seems to have played around with it...

>>> import sys
>>> sys.set_string_encoding('unicode-escape')
>>> "abcäöü" + u"abc"
u'abc\344\366\374abc'
>>> "abcäöü\u1234" + u"abc"
u'abc\344\366\374\u1234abc'
>>> print "abcäöü\u1234" + u"abc"
abc\344\366\374\u1234abc

Any takers ?

BTW, has anyone tried to use the codec design for tasks
other than converting text? It should also be usable for,
e.g., compressing/decompressing or other data-oriented
content.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From effbot@telia.com  Fri May 12 15:25:24 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Fri, 12 May 2000 16:25:24 +0200
Subject: [Python-Dev] Unicode and its partners...
References: <391C0A89.819A33EA@lemburg.com>
Message-ID: <026901bfbc1d$efe06fc0$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> It got a little silent around the 7-bit vs. 8-bit vs. UTF-8
> discussion. 

that's only because I've promised Guido to prepare SRE
for the next alpha, before spending more time trying to
get this one done right ;-)

and as usual, the last 10% takes 90% of the effort :-(

</F>



From akuchlin@mems-exchange.org  Fri May 12 15:27:21 2000
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Fri, 12 May 2000 10:27:21 -0400 (EDT)
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: <200005121224.IAA06063@eric.cnri.reston.va.us>
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>
 <200005121224.IAA06063@eric.cnri.reston.va.us>
Message-ID: <14620.5321.510321.341870@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>(I wonder if we can make posts to python-announce@python.com be
>forwarded to c.l.py.a with such a header automatically tacked on?)

Probably not a good idea; if the e-mail address is on the Web site, it
probably gets a certain amount of spam that would need to be filtered
out.  

--amk


From guido@python.org  Fri May 12 15:31:55 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 10:31:55 -0400
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: Your message of "Fri, 12 May 2000 10:27:21 EDT."
 <14620.5321.510321.341870@amarok.cnri.reston.va.us>
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> <200005121224.IAA06063@eric.cnri.reston.va.us>
 <14620.5321.510321.341870@amarok.cnri.reston.va.us>
Message-ID: <200005121431.KAA06538@eric.cnri.reston.va.us>

> Guido van Rossum writes:
> >(I wonder if we can make posts to python-announce@python.com be
> >forwarded to c.l.py.a with such a header automatically tacked on?)
> 
> Probably not a good idea; if the e-mail address is on the Web site, it
> probably gets a certain amount of spam that would need to be filtered
> out.  

OK, let's make it a moderated mailman mailing list; we can make
everyone on python-dev (who wants to) a moderator.  Barry, is there an
easy way to add additional headers to messages posted by mailman to
the news gateway?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From jcollins@pacificnet.net  Fri May 12 16:39:28 2000
From: jcollins@pacificnet.net (Jeffery D. Collins)
Date: Fri, 12 May 2000 08:39:28 -0700
Subject: [Python-Dev] c.l.p.announce
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> <200005121224.IAA06063@eric.cnri.reston.va.us>
 <14620.5321.510321.341870@amarok.cnri.reston.va.us> <200005121431.KAA06538@eric.cnri.reston.va.us>
Message-ID: <391C25B0.EC327BCF@pacificnet.net>

I volunteer to moderate.

Jeff


Guido van Rossum wrote:

> > Guido van Rossum writes:
> > >(I wonder if we can make posts to python-announce@python.com be
> > >forwarded to c.l.py.a with such a header automatically tacked on?)
> >
> > Probably not a good idea; if the e-mail address is on the Web site, it
> > probably gets a certain amount of spam that would need to be filtered
> > out.
>
> OK, let's make it a moderated mailman mailing list; we can make
> everyone on python-dev (who wants to) a moderator.  Barry, is there an
> easy way to add additional headers to messages posted by mailman to
> the news gateway?
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>



From bwarsaw@python.org  Fri May 12 16:41:01 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Fri, 12 May 2000 11:41:01 -0400 (EDT)
Subject: [Python-Dev] c.l.p.announce
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>
 <200005121224.IAA06063@eric.cnri.reston.va.us>
 <14620.5321.510321.341870@amarok.cnri.reston.va.us>
 <200005121431.KAA06538@eric.cnri.reston.va.us>
Message-ID: <14620.9741.164735.998570@anthem.cnri.reston.va.us>

>>>>> "GvR" == Guido van Rossum <guido@python.org> writes:

    GvR> OK, let's make it a moderated mailman mailing list; we can
    GvR> make everyone on python-dev (who wants to) a moderator.
    GvR> Barry, is there an easy way to add additional headers to
    GvR> messages posted by mailman to the news gateway?

No, but I'll add that.  It might be a little while before I push the
changes out to python.org; I've got a bunch of things I need to test
first.

-Barry


From mal@lemburg.com  Fri May 12 16:47:55 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 17:47:55 +0200
Subject: [Python-Dev] Landmark
Message-ID: <391C27AB.2F5339D6@lemburg.com>

While trying to configure an in-package Python interpreter
I found that the interpreter still uses 'string.py' as
landmark for finding the standard library.

Since string.py is being deprecated, I think we should
consider a new landmark (such as os.py) or maybe even a
whole new strategy for finding the standard lib location.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From guido@python.org  Fri May 12 20:04:50 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 15:04:50 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 17:47:55 +0200."
 <391C27AB.2F5339D6@lemburg.com>
References: <391C27AB.2F5339D6@lemburg.com>
Message-ID: <200005121904.PAA08166@eric.cnri.reston.va.us>

> While trying to configure an in-package Python interpreter
> I found that the interpreter still uses 'string.py' as
> landmark for finding the standard library.

Oops.

> Since string.py is being deprecated, I think we should
> consider a new landmark (such as os.py) or maybe even a
> whole new strategy for finding the standard lib location.

I don't see a need for a new strategy, but I'll gladly accept patches
that look for os.py.  Note that there are several versions of that
code: Modules/getpath.c, PC/getpathp.c, PC/os2vacpp/getpathp.c.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gmcm@hypernet.com  Fri May 12 20:50:56 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 12 May 2000 15:50:56 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: <200005121904.PAA08166@eric.cnri.reston.va.us>
References: Your message of "Fri, 12 May 2000 17:47:55 +0200."             <391C27AB.2F5339D6@lemburg.com>
Message-ID: <1253961418-52039567@hypernet.com>

[MAL]
> > Since string.py is being deprecated, I think we should
> > consider a new landmark (such as os.py) or maybe even a
> > whole new strategy for finding the standard lib location.
[GvR]
> I don't see a need for a new strategy

I'll argue for (a choice of) new strategy. The getpath & friends 
code spends a whole lot of time and energy trying to reverse 
engineer things like developer builds and strange sys-admin 
pranks. I agree that code shouldn't die. But it creates painful 
startup times when Python is being used for something like 
CGI.

How about something on the command line that says (pick 
one or come up with another choice):
 - PYTHONPATH is *it*
 - use PYTHONPATH and .pth files found <here>
 - start in <sys.prefix>/lib/python<sys.version[:3]> and add 
PYTHONPATH
 - there's a .pth file <here> with the whole list
 - pretty much any permutation of the above elements

The idea being to avoid a few hundred system calls when a 
dozen or so will suffice. Default behavior should still be to 
magically get it right.


- Gordon


From guido@python.org  Fri May 12 21:29:05 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 16:29:05 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 15:50:56 EDT."
 <1253961418-52039567@hypernet.com>
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com>
 <1253961418-52039567@hypernet.com>
Message-ID: <200005122029.QAA08252@eric.cnri.reston.va.us>

> [MAL]
> > > Since string.py is being deprecated, I think we should
> > > consider a new landmark (such as os.py) or maybe even a
> > > whole new strategy for finding the standard lib location.
> [GvR]
> > I don't see a need for a new strategy
> 
> I'll argue for (a choice of) new strategy. The getpath & friends 
> code spends a whole lot of time and energy trying to reverse 
> engineer things like developer builds and strange sys-admin 
> pranks. I agree that code shouldn't die. But it creates painful 
> startup times when Python is being used for something like 
> CGI.
> 
> How about something on the command line that says (pick 
> one or come up with another choice):
>  - PYTHONPATH is *it*
>  - use PYTHONPATH and .pth files found <here>
>  - start in <sys.prefix>/lib/python<sys.version[:3]> and add 
> PYTHONPATH
>  - there's a .pth file <here> with the whole list
>  - pretty much any permutation of the above elements
> 
> The idea being to avoid a few hundred system calls when a 
> dozen or so will suffice. Default behavior should still be to 
> magically get it right.

I'm not keen on changing the meaning of PYTHONPATH, but if you're
willing and able to set an environment variable, you can set
PYTHONHOME and it will abandon the search.  If you want a command line
option for CGI, an option to set PYTHONHOME makes sense.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From weeks@golden.dtc.hp.com  Fri May 12 21:29:52 2000
From: weeks@golden.dtc.hp.com (Greg Weeks)
Date: Fri, 12 May 2000 13:29:52 -0700
Subject: [Python-Dev] "is", "==", and sameness
Message-ID: <200005122029.AA126653392@golden.dtc.hp.com>

From the Python Reference Manual [emphasis added]:

    Types affect almost all aspects of object behavior. Even the importance
    of object IDENTITY is affected in some sense: for immutable types,
    operations that compute new values may actually return a reference to
    any existing object with the same type and value, while for mutable
    objects this is not allowed.

This seems to be saying that two immutable objects are (in some sense) the
same iff they have the same type and value, while two mutable objects are
the same iff they have the same id().  I heartily agree, and I think that
this notion of sameness is the single most useful variant of the "equals"
relation.

Indeed, I think it worthwhile to consider modifying the "is" operator to
compute this notion of sameness.  (This would break only exceedingly
strange user code.)  "is" would then be the natural comparator of
dictionary keys, which could then be any object.

The usefulness of this idea is limited by the absence of user-definable
immutable instances.  It might be nice to be able to declare a class -- eg,
Point -- to have immutable instances.  This declaration would promise
that:

1.  When the expression Point(3.0,4.0) is evaluated, its reference count
    will be zero.

2.  After Point(3.0,4.0) is evaluated, its attributes will not be changed.


I sent the above thoughts to Guido, who graciously and politely responded
that they struck him as somewhere between bad and poorly presented.  (Which
surprised me.  I would have guessed that the ideas were already in his
head.)  Nevertheless, he mentioned passing them along to you, so I have.


Regards,
Greg


From gmcm@hypernet.com  Fri May 12 23:05:46 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 12 May 2000 18:05:46 -0400
Subject: [Python-Dev] "is", "==", and sameness
In-Reply-To: <200005122029.AA126653392@golden.dtc.hp.com>
Message-ID: <1253953328-52526193@hypernet.com>

Greg Weeks wrote:

> From the Python Reference Manual [emphasis added]:
> 
>     Types affect almost all aspects of object behavior. Even the
>     importance of object IDENTITY is affected in some sense: for
>     immutable types, operations that compute new values may
>     actually return a reference to any existing object with the
>     same type and value, while for mutable objects this is not
>     allowed.
> 
> This seems to be saying that two immutable objects are (in some
> sense) the same iff they have the same type and value, while two
> mutable objects are the same iff they have the same id().  I
> heartily agree, and I think that this notion of sameness is the
> single most useful variant of the "equals" relation.

Notice the "may" in the reference text.

>>> 88 + 11 is 98 + 1
1
>>> 100 + 3 is 101 + 2
0
>>>

Python goes to the effort of keeping singleton instances of the 
integers less than 100. In certain situations, a similar effort is 
invested in strings. But it is by no means the general case, 
and (unless you've got a solution) it would be expensive to 
make it so.
 
> Indeed, I think it worthwhile to consider modifying the "is"
> operator to compute this notion of sameness.  (This would break
> only exceedingly strange user code.)  "is" would then be the
> natural comparator of dictionary keys, which could then be any
> object.

The implications don't follow. The restriction that dictionary 
keys be immutable is not because of the comparison method. 
It's the principle of "least surprise". Use a mutable object as a 
dict key. Now mutate the object. Now the key / value pair in 
the dictionary is inaccessible. That is, there is some pair (k,v) 
in dict.items() where dict[k] does not yield v.
 
> The usefulness of this idea is limited by the absence of
> user-definable immutable instances.  It might be nice to be able
> to declare a class -- eg, Point -- to have immutable
> instances.  This declaration would promise that:
> 
> 1.  When the expression Point(3.0,4.0) is evaluated, its reference
>     count will be zero.

That's a big change from the way Python works:

>>> sys.getrefcount(None)
167
>>>
 
> 2.  After Point(3.0,4.0) is evaluated, its attributes will not be
> changed.

You can make an instance effectively immutable (by messing 
with __setattr__). You can override __hash__ to return 
something suitable (eg, hash(id(self))), and then use an 
instance as a dict key. You don't even need to do the first to 
do the latter.

- Gordon


From mal@lemburg.com  Fri May 12 22:25:02 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 23:25:02 +0200
Subject: [Python-Dev] Landmark
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com>
 <1253961418-52039567@hypernet.com> <200005122029.QAA08252@eric.cnri.reston.va.us>
Message-ID: <391C76AE.A3118AF1@lemburg.com>

Guido van Rossum wrote:
> [Gordon]
> > [MAL]
> > > > Since string.py is being deprecated, I think we should
> > > > consider a new landmark (such as os.py) or maybe even a
> > > > whole new strategy for finding the standard lib location.
> > [GvR]
> > > I don't see a need for a new strategy
> >
> > I'll argue for (a choice of) new strategy.
> 
> I'm not keen on changing the meaning of PYTHONPATH, but if you're
> willing and able to set an environment variable, you can set
> PYTHONHOME and it will abandon the search.  If you want a command line
> option for CGI, an option to set PYTHONHOME makes sense.

The routines will still look for the landmark though (which
is what surprised me and made me look deeper -- setting
PYTHONHOME didn't work for me because I had only .pyo files
in the lib/python1.5 dir).

Perhaps Python should put more trust into the setting of
PYTHONHOME ?!

[And of course the landmark should change to something like
 os.py -- I'll try to submit a patch for this.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From guido@python.org  Sat May 13 01:53:27 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 20:53:27 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 23:25:02 +0200."
 <391C76AE.A3118AF1@lemburg.com>
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com> <1253961418-52039567@hypernet.com> <200005122029.QAA08252@eric.cnri.reston.va.us>
 <391C76AE.A3118AF1@lemburg.com>
Message-ID: <200005130053.UAA08687@eric.cnri.reston.va.us>

[me]
> > I'm not keen on changing the meaning of PYTHONPATH, but if you're
> > willing and able to set an environment variable, you can set
> > PYTHONHOME and it will abandon the search.  If you want a command line
> > option for CGI, an option to set PYTHONHOME makes sense.

[MAL]
> The routines will still look for the landmark though (which
> is what surprised me and made me look deeper -- setting
> PYTHONHOME didn't work for me because I had only .pyo files
> in the lib/python1.5 dir).
> 
> Perhaps Python should put more trust into the setting of
> PYTHONHOME ?!

Yes!  Note that PC/getpathp.c already trusts PYTHONHOME 100% --
Modules/getpath.c should follow suit.

> [And of course the landmark should change to something like
>  os.py -- I'll try to submit a patch for this.]

Maybe you can combine the two?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From effbot@telia.com  Sat May 13 13:56:41 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 14:56:41 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
Message-ID: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>

in the current 're' engine, a newline is chr(10) and nothing
else.

however, in the new unicode aware engine, I used the new
LINEBREAK predicate instead, but it turned out to break one
of the tests in the current test suite:

    sre.match('a\rb', 'a.b') => None

(unicode adds chr(13), chr(28), chr(29), chr(30), and also
unichr(133), unichr(8232), and unichr(8233) to the list of
line breaking codes)

what's the best way to deal with this?  I see three alter-
natives:

a) stick to the old definition, and use chr(10) also for
   unicode strings

b) use different definitions for 8-bit strings and unicode
   strings; if given an 8-bit string, use chr(10); if given
   a 16-bit string, use the LINEBREAK predicate.

c) use LINEBREAK in either case.

I think (c) is the "right thing", but it's the only one that may
break existing code...

</F>



From bckfnn@worldonline.dk  Sat May 13 14:47:10 2000
From: bckfnn@worldonline.dk (Finn Bock)
Date: Sat, 13 May 2000 13:47:10 GMT
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <391d5b7f.3713359@smtp.worldonline.dk>

On Sat, 13 May 2000 14:56:41 +0200, you wrote:

>in the current 're' engine, a newline is chr(10) and nothing
>else.
>
>however, in the new unicode aware engine, I used the new
>LINEBREAK predicate instead, but it turned out to break one
>of the tests in the current test suite:
>
>    sre.match('a\rb', 'a.b') => None
>
>(unicode adds chr(13), chr(28), chr(29), chr(30), and also
>unichr(133), unichr(8232), and unichr(8233) to the list of
>line breaking codes)
>
>what's the best way to deal with this?  I see three alter-
>natives:
>
>a) stick to the old definition, and use chr(10) also for
>   unicode strings

In the ORO matcher that comes with jpython, the dot matches all but
chr(10). But that is bad IMO. Unicode should use the LINEBREAK
predicate.

regards,
finn


From effbot@telia.com  Sat May 13 15:14:32 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 16:14:32 +0200
Subject: [Python-Dev] for the todo list: cStringIO uses string.joinfields
Message-ID: <00a101bfbce5$91dbd860$34aab5d4@hagrid>

the O_writelines function in Modules/cStringIO contains the
following code:

  if (!string_joinfields) {
    UNLESS(string_module = PyImport_ImportModule("string")) {
      return NULL;
    }

    UNLESS(string_joinfields=
        PyObject_GetAttrString(string_module, "joinfields")) {
      return NULL;
    }

    Py_DECREF(string_module);
  }

I suppose someone should fix this some day...

(btw, the C API reference implies that ImportModule doesn't
use import hooks.  does that mean that cStringIO doesn't work
under e.g. Gordon's installer?)
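
if so, I guess the fix would be to go through the hook-aware entry point
instead.  something like this, perhaps (just a sketch, untested; the helper
name is made up):

    #include "Python.h"

    /* sketch only: PyImport_Import goes through __import__, so import
     * hooks (freeze, installers, etc.) get a chance to find the module */
    static PyObject *
    import_string_module(void)
    {
        PyObject *name, *module;

        name = PyString_FromString("string");
        if (name == NULL)
            return NULL;
        module = PyImport_Import(name);
        Py_DECREF(name);
        return module;
    }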

</F>



From effbot@telia.com  Sat May 13 15:36:30 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 16:36:30 +0200
Subject: [Python-Dev] cvs for dummies
Message-ID: <000d01bfbce8$a3466f40$34aab5d4@hagrid>

what's the best way to make sure that a "cvs update" really brings
everything up to date, even if you've accidentally changed some-
thing in your local workspace?

</F>



From moshez@math.huji.ac.il  Sat May 13 15:58:17 2000
From: moshez@math.huji.ac.il (Moshe Zadka)
Date: Sat, 13 May 2000 17:58:17 +0300 (IDT)
Subject: [Python-Dev] unicode regex quickie: should a newline be the same
 thing as a linebreak?
In-Reply-To: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <Pine.GSO.4.10.10005131755560.14940-100000@sundial>

On Sat, 13 May 2000, Fredrik Lundh wrote:

> what's the best way to deal with this?  I see three alter-
> natives:
> 
> a) stick to the old definition, and use chr(10) also for
>    unicode strings

If we also supply a \something (is \l taken?) for LINEBREAK, people can
then use [^\l] if they need a Unicode line break. Just pointing out a way
to get close to the right thing without breaking code.

--
Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com



From fdrake@acm.org  Sat May 13 16:22:12 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Sat, 13 May 2000 11:22:12 -0400 (EDT)
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
Message-ID: <14621.29476.390092.610442@newcnri.cnri.reston.va.us>

Fredrik Lundh writes:
 > what's the best way to make sure that a "cvs update" really brings
 > everything up to date, even if you've accidentally changed some-
 > thing in your local workspace?

  Delete the file(s) that got changed and cvs update again.


  -Fred

--
Fred L. Drake, Jr.           <fdrake@acm.org>
Corporation for National Research Initiatives



From effbot@telia.com  Sat May 13 16:28:02 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 17:28:02 +0200
Subject: [Python-Dev] cvs for dummies
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid> <14621.29476.390092.610442@newcnri.cnri.reston.va.us>
Message-ID: <001901bfbcef$d4672b80$34aab5d4@hagrid>

Fred L. Drake, Jr. wrote:
> Fredrik Lundh writes:
>  > what's the best way to make sure that a "cvs update" really brings
>  > everything up to date, even if you've accidentally changed some-
>  > thing in your local workspace?
> 
>   Delete the file(s) that got changed and cvs update again.

okay, what's the best way to get a list of locally changed files?

(in this case, one file ended up with neat little <<<<<<< and
>>>>>> marks in it...  several weeks and about a dozen CVS
updates after I'd touched it...)

</F>



From gmcm@hypernet.com  Sat May 13 17:25:42 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 12:25:42 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: <00a101bfbce5$91dbd860$34aab5d4@hagrid>
Message-ID: <1253887332-56495837@hypernet.com>

Fredrik wrote:

> (btw, the C API reference implies that ImportModule doesn't
> use import hooks.  does that mean that cStringIO doesn't work
> under e.g. Gordon's installer?)

You have to fool C code that uses ImportModule by doing an 
import first in your Python code. It's the same for freeze. It's 
tiresome tracking this stuff down. For example, to use shelve:

# this is needed because of the use of __import__ in anydbm 
# (modulefinder does not follow __import__)
import dbhash
# the next 2 are needed because cPickle won't use our import
# hook so we need them already in sys.modules when
# cPickle starts
import string
import copy_reg
# now it will work
import shelve

Imagine the c preprocessor letting you do
#define snarf #include
and then trying to use a dependency tracker.


- Gordon


From effbot@telia.com  Sat May 13 19:09:44 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 20:09:44 +0200
Subject: [Python-Dev] hey, who broke the array module?
Message-ID: <006e01bfbd06$6ba21120$34aab5d4@hagrid>

sigh.  never resync the CVS repository until you've fixed all
bugs in your *own* code ;-)

in 1.5.2:

>>> array.array("h", [65535])
array('h', [-1])

>>> array.array("H", [65535])
array('H', [65535])

in the current CVS version:

>>> array.array("h", [65535])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: signed short integer is greater than maximum

okay, this might break some existing code -- but one
can always argue that such code was already broken.

on the other hand:

>>> array.array("H", [65535])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: signed short integer is greater than maximum

oops.

dunno if the right thing would be to add support for various kinds
of unsigned integers to Python/getargs.c, or to hack around this
in the array module...
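
the array-module hack would presumably look something like this (sketch
only, untested; the function name is made up):

    #include "Python.h"
    #include <limits.h>

    /* sketch only: fetch a Python int as an unsigned short by hand,
     * with an explicit range check, instead of relying on the "h"
     * format code in getargs.c.  returns 1 on success, 0 on error. */
    static int
    getushortvalue(PyObject *v, unsigned short *out)
    {
        long x = PyInt_AsLong(v);

        if (x == -1 && PyErr_Occurred())
            return 0;
        if (x < 0 || x > (long)USHRT_MAX) {
            PyErr_SetString(PyExc_OverflowError,
                            "unsigned short integer is out of range");
            return 0;
        }
        *out = (unsigned short)x;
        return 1;
    }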

</F>



From mhammond@skippinet.com.au  Sat May 13 20:19:44 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sun, 14 May 2000 05:19:44 +1000
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <001901bfbcef$d4672b80$34aab5d4@hagrid>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOENPCKAA.mhammond@skippinet.com.au>

> >   Delete the file(s) that got changed and cvs update again.
>
> okay, what's the best way to get a list of locally changed files?

Diff the directory.  Or better still, use wincvs - nice little red icons
for the changed files.

> (in this case, one file ended up with neat little <<<<<<< and
> >>>>>> marks in it...  several weeks and about a dozen CVS
> updates after I'd touched it...)

This happens when CVS can't manage to perform a successful merge.  Your
original is still there, but with a funky name (in the same directory - it
should be obvious).

WinCVS also makes this a little more obvious - the icon has a special
"conflict" indicator, and the console messages also reflect the conflict in
red.

Mark.



From tismer@tismer.com  Sat May 13 21:32:45 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 22:32:45 +0200
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no>
Message-ID: <391DBBED.B252E597@tismer.com>


Magnus Lie Hetland wrote:
> 
> Aahz Maruch <aahz@netcom.com> wrote in message
> news:8fh9ki$51h$1@slb3.atl.mindspring.net...
> > In article <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>,
> > Ben Wolfson  <rumjuggler@cryptarchy.org> wrote:
> > >
> > >', '.join(['foo', 'bar', 'baz'])
> >
> > This only works in Python 1.6, which is only released as an alpha at
> > this point.  I suggest rather strongly that we avoid 1.6-specific idioms
> > until 1.6 gets released, particularly in relation to FAQ-type questions.
> 
> This is indeed a bit strange IMO... If I were to join the elements of a
> list I would rather ask the list to do it than some string... I.e.
> 
>    ['foo', 'bar', 'baz'].join(', ')
> 
> (...although it is the string that joins the elements in the resulting
> string...)

I believe the notion of "everything is an object, and objects
provide all their functionality" is a bit stressed in Python 1.6.
The above example touches the limits where I'd just say
"OO isn't always the right thing, and always OO is the wrong thing".

A clear advantage of 1.6's string methods is that much code
becomes shorter and easier to read, since the nesting level
of parentheses is reduced quite a bit. The notation also follows
more closely the order in which actions are actually processed.

The split/join issue is really on the edge where I begin to not
like it.
It is clear that the join method *must* be performed as a method
of the joining character, since the method expects a list as its
argument. It doesn't make sense to use a list method, since
lists have nothing to do with strings.
Furthermore, the argument to join can be any sequence. Adding
a join method to every sequence type, just because we want to join
some strings, would be overkill.
So the " ".join(seq) notation is the only possible compromise,
IMHO.
It is actually arguable whether this is still "Pythonic".
What you want is to join a list of strings with some other string.
This is neither a natural method of the list, nor of the joining
string in the first place.

If it came to the point where the string module had some extra
methods which operate on two lists of string perhaps, we would
have been totally lost, and enforcing some OO method to support
it would be completely off the road.

It is already a little strange that most string methods
return new objects all the time, since strings are immutable.

join is a really extreme design; compared with the other
string functions, which became more readable, I think it is
counter-intuitive and not the way people think.
They think "I want to join this list with this string".

Furthermore, you still have to import string, in order to use
its constants.

Instead of using a module with constants and functions, we
now always have to refer to instances and use their methods.
It has some benefits in simple cases.

But if a function handles a number of different objects,
I think forcing it to become a method of one of those
objects is the wrong way: OO overdone.

doing-OO-only-if-it-looks-natural-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From guido@python.org  Sat May 13 21:39:19 2000
From: guido@python.org (Guido van Rossum)
Date: Sat, 13 May 2000 16:39:19 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: Your message of "Sat, 13 May 2000 12:25:42 EDT."
 <1253887332-56495837@hypernet.com>
References: <1253887332-56495837@hypernet.com>
Message-ID: <200005132039.QAA09114@eric.cnri.reston.va.us>

> Fredrik wrote:
> 
> > (btw, the C API reference implies that ImportModule doesn't
> > use import hooks.  does that mean that cStringIO doesn't work
> > under e.g. Gordon's installer?)
> 
> You have to fool C code that uses ImportModule by doing an 
> import first in your Python code. It's the same for freeze. It's 
> tiresome tracking this stuff down. For example, to use shelve:
> 
> # this is needed because of the use of __import__ in anydbm 
> # (modulefinder does not follow __import__)
> import dbhash
> # the next 2 are needed because cPickle won't use our import
> # hook so we need them already in sys.modules when
> # cPickle starts
> import string
> import copy_reg
> # now it will work
> import shelve

Hm, the way I read the code (but I didn't write it!) it calls
PyImport_Import, which is a higher level function that *does* use the
__import__ hook.  Maybe this wasn't always the case?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Sat May 13 21:43:32 2000
From: guido@python.org (Guido van Rossum)
Date: Sat, 13 May 2000 16:43:32 -0400
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: Your message of "Sat, 13 May 2000 13:47:10 GMT."
 <391d5b7f.3713359@smtp.worldonline.dk>
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
 <391d5b7f.3713359@smtp.worldonline.dk>
Message-ID: <200005132043.QAA09151@eric.cnri.reston.va.us>

[Swede]
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a\rb', 'a.b') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
> >
> >what's the best way to deal with this?  I see three alter-
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings

[Finn]
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

There's no need for invention.  We're supposed to be as close to Perl
as reasonable.  What does Perl do?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gmcm@hypernet.com  Sat May 13 21:54:09 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 16:54:09 -0400
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
In-Reply-To: <391DBBED.B252E597@tismer.com>
Message-ID: <1253871224-57464726@hypernet.com>

Christian wrote:

> The split/join issue is really on the edge where I begin to not
> like it. It is clear that the join method *must* be performed as
> a method of the joining character, since the method expects a
> list as its argument.

We've been through this a number of times on c.l.py.

"What is this trash - I want list.join(sep)!"

After some head banging (often quite violent - ie, 4 or 5 
exchanges), they get that list.join(sep) sucks. But they still 
swear they'll never use sep.join(list).

So you end up saying "Well, string.join still works".

We'll need a pre-emptive FAQ entry with the link bound to a 
key stroke. Or a big increase in the PSU budget...

- Gordon


From gmcm@hypernet.com  Sat May 13 21:54:09 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 16:54:09 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: <200005132039.QAA09114@eric.cnri.reston.va.us>
References: Your message of "Sat, 13 May 2000 12:25:42 EDT."             <1253887332-56495837@hypernet.com>
Message-ID: <1253871222-57464840@hypernet.com>

[Fredrik]
> > > (btw, the C API reference implies that ImportModule doesn't
> > > use import hooks.  does that mean that cStringIO doesn't work
> > > under e.g. Gordon's installer?)
[Guido]
> Hm, the way I read the code (but I didn't write it!) it calls
> PyImport_Import, which is a higher level function that *does* use
> the __import__ hook.  Maybe this wasn't always the case?

In stock 1.5.2 it's PyImport_ImportModule. Same in cPickle. 
I'm delighted to see them moving towards PyImport_Import.


- Gordon


From effbot@telia.com  Sat May 13 22:40:01 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 23:40:01 +0200
Subject: [Python-Dev] Re: [Patches] getpath patch
References: <391D2BC6.95E4FD3E@lemburg.com>
Message-ID: <001501bfbd23$cc45e160$34aab5d4@hagrid>

MAL wrote:
> Note: Python will dump core if it cannot find the exceptions
> module. Perhaps we should add a builtin _exceptions module
> (basically a frozen exceptions.py) which is then used as
> fallback solution ?!

or use this one:
http://w1.132.telia.com/~u13208596/exceptions.htm

</F>



From bwarsaw@python.org  Sat May 13 22:40:47 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Sat, 13 May 2000 17:40:47 -0400 (EDT)
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com>
 <8fe76b$684$1@newshost.accu.uu.nl>
 <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
 <8fh9ki$51h$1@slb3.atl.mindspring.net>
 <8fk4mh$i4$1@kopp.stud.ntnu.no>
 <391DBBED.B252E597@tismer.com>
Message-ID: <14621.52191.448037.799287@anthem.cnri.reston.va.us>

>>>>> "CT" == Christian Tismer <tismer@tismer.com> writes:

    CT> If it came to the point where the string module had some extra
    CT> methods which operate on two lists of string perhaps, we would
    CT> have been totally lost, and enforcing some OO method to
    CT> support it would be completely off the road.

The new .join() method reads a bit better if you first name the
glue string:

space = ' '
name = space.join(['Barry', 'Aloisius', 'Warsaw'])

But yes, it does look odd when used like

' '.join(['Christian', 'Aloisius', 'Tismer'])

I still think it's nice not to have to import string "just" to get the
join functionality, but remember of course that string.join() isn't
going away, so you can still use this if you like it better.

Alternatively, there has been talk about moving join() into the
built-ins, but I'm not sure if the semantics of that have been nailed
down.

-Barry


From tismer@tismer.com  Sat May 13 22:48:37 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 23:48:37 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
References: <1253871224-57464726@hypernet.com>
Message-ID: <391DCDB5.4FCAB97F@tismer.com>


Gordon McMillan wrote:
> 
> Christian wrote:
> 
> > The split/join issue is really on the edge where I begin to not
> > like it. It is clear that the join method *must* be performed as
> > a method of the joining character, since the method expects a
> > list as its argument.
> 
> We've been through this a number of times on c.l.py.

I know. It just came up when I really used it, when
I read through this huge patch from Fred Gansevles, and when
I see people wondering about it.
After all, it is no surprise. They are right.
If we have to change their minds before they can understand
a basic operation, then we are wrong, not they.

> "What is this trash - I want list.join(sep)!"
> 
> After some head banging (often quite violent - ie, 4 or 5
> exchanges), they get that list.join(sep) sucks. But they still
> swear they'll never use sep.join(list).
> 
> So you end up saying "Well, string.join still works".

And it is the cleanest possible way to go, IMHO.
Unless we had some compound object methods, like

(somelist, somestring).join()

> We'll need a pre-emptive FAQ entry with the link bound to a
> key stroke. Or a big increase in the PSU budget...

We should reconsider the OO pattern.
The user's complaining is natural. " ".join() is not.
We might have gone too far. 

Python isn't just OO, it is better.

Joining lists of strings is joining lists of strings.
This is not a method of a string in the first place.
And not a method of a sequence in the first place.

Making it a method of the joining string now appears to be
a hack to me. (Sorry, Tim, the idea was great in the first place)

I am now
+1 on leaving join() to the string module
-1 on making some filler.join() to be the preferred joining way.

this-was-my-most-conservative-day-since-years-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From tismer@tismer.com  Sat May 13 22:55:43 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 23:55:43 +0200
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com>
 <8fe76b$684$1@newshost.accu.uu.nl>
 <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
 <8fh9ki$51h$1@slb3.atl.mindspring.net>
 <8fk4mh$i4$1@kopp.stud.ntnu.no>
 <391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us>
Message-ID: <391DCF5F.BA981607@tismer.com>


"Barry A. Warsaw" wrote:
> 
> >>>>> "CT" == Christian Tismer <tismer@tismer.com> writes:
> 
>     CT> If it came to the point where the string module had some extra
>     CT> methods which operate on two lists of string perhaps, we would
>     CT> have been totally lost, and enforcing some OO method to
>     CT> support it would be completely off the road.
> 
> The new .join() method reads a bit better if you first name the
> glue string:
> 
> space = ' '
> name = space.join(['Barry', 'Aloisius', 'Warsaw'])

Agreed.

> But yes, it does look odd when used like
> 
> ' '.join(['Christian', 'Aloisius', 'Tismer'])

I'd love that Aloisius, really. I'll ask my parents for a renaming :-)

> I still think it's nice not to have to import string "just" to get the
> join functionality, but remember of course that string.join() isn't
> going away, so you can still use this if you like it better.

Sure, and I'm glad to be able to use string methods without ugly
imports. It just struck me when my former colleague Axel visited me
recently and I showed him the 1.6 alpha with its string methods
(while looking over Fred's huge patch); he said
"Well, quite nice. So they now go the same wrong way as Java did?
The OO pattern is dead. This example shows why."

> Alternatively, there has been talk about moving join() into the
> built-ins, but I'm not sure if the semantics of tha have been nailed
> down.

Sounds like a good alternative.

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From martin@loewis.home.cs.tu-berlin.de  Sun May 14 22:39:52 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 14 May 2000 23:39:52 +0200
Subject: [Python-Dev] Unicode
Message-ID: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>

> comments?  (for obvious reasons, I'm especially interested in comments
> from people using non-ASCII characters on a daily basis...)

> nobody?

Hi Fredrik,

I think the problem you are trying to point out is not real. My guideline
for using Unicode in Python 1.6 will be that people should be very careful
to *not* mix byte strings and Unicode strings. If you are processing text
data obtained from a narrow-string source, you'll always have to make an
explicit decision about what the encoding is.

If you follow this guideline, I think the Unicode type of Python 1.6
will work just fine.

If you use Unicode text *a lot*, you may find the need to combine it
with plain byte text in a more convenient way. That is the time to look
at the implicit conversion stuff and see which parts of the functionality
are useful. You then don't need to memorize *all* the rules about where
implicit conversion would work - just the cases you care about.

That may all look difficult - it probably is. But then, it is not more
difficult than tuples vs. lists: why does

>>> [a,b,c] = (1,2,3)

work, and

>>> [1,2]+(3,4)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation

does not?

Regards,
Martin


From tim_one@email.msn.com  Mon May 15 00:51:41 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Sun, 14 May 2000 19:51:41 -0400
Subject: [Python-Dev] Memory woes under Windows
Message-ID: <000001bfbdff$5bcdfe40$192d153f@tim>

[Noah, I'm wondering whether this is related to our W98 NatSpeak woes --
Python grows its lists much like a certain product we both work on <ahem>
grows its arrays ...]

Here's a simple test case:

from time import clock

def run():
    n = 1
    while n < 4000000:
        a = []
        push = a.append
        start = clock()
        for i in xrange(n):
            push(1)
        finish = clock()
        print "%10d push  %10.3f" % (n, round(finish - start, 3))
        n = n + n

for i in (1, 2, 3):
    try:
        run()
    except MemoryError:
        print "Got a memory error"

So run() builds a number of power-of-2 sized lists, each by appending one
element at a time.  It prints the list length and elapsed time to build each
one (on Windows, this is basically wall-clock time, and is derived from the
Pentium's high-resolution cycle timer).  The driver simply runs this 3
times, reporting any MemoryError that pops up.

The largest array constructed has 2M elements, so consumes about 8Mb -- no
big deal on most machines these days.

Here's what happens on my new laptop (damn, this thing is fast! -- usually):

Win98 (Second Edition)
600MHz Pentium III
160Mb RAM
Python 1.6a2 from python.org, via the Windows installer

         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.011
      4096 push       0.020
      8192 push       0.053
     16384 push       0.074
     32768 push       0.163
     65536 push       0.262
    131072 push       0.514
    262144 push       0.713
    524288 push       1.440
   1048576 push       2.961
Got a memory error
         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.007
      4096 push       0.014
      8192 push       0.029
     16384 push       0.057
     32768 push       0.116
     65536 push       0.231
    131072 push       0.474
    262144 push       2.361
    524288 push      24.059
   1048576 push      67.492
Got a memory error
         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.007
      4096 push       0.014
      8192 push       0.028
     16384 push       0.057
     32768 push       0.115
     65536 push       0.232
    131072 push       0.462
    262144 push       2.349
    524288 push      23.982
   1048576 push      67.257
Got a memory error

Commentary:  The first time it runs, the timing behavior is
indistinguishable from O(N).  But realloc returns NULL at some point when
growing the 2M array!  There "should be" huge gobs of memory available.

The 2nd and 3rd runs are very similar to each other, both blow up at about
the same time, but both run *very* much slower than the 1st run before that
point as the list size gets non-trivial -- and, while the output doesn't
show this, the disk starts thrashing too.

It's *not* the case that Win98 won't give Python more than 8Mb of memory.
For example,

>>> a = [1]*30000000  # that's 30M
>>>

works fine and fast on this machine, with no visible disk traffic [Noah,
that line sucks up about 120Mb from malloc in one shot].

So, somehow or other, masses of allocations are confusing the system memory
manager nearly to death (implying we should use Vladimir's PyMalloc under
Windows after grabbing every byte the machine has <0.6 wink>).

My belief is that the Windows 1.6a2 from python.org was compiled with VC6,
yes?  Scream if that's wrong.

This particular test case doesn't run any better under my Win95 (original)
P5-166 with 32Mb RAM using Python 1.5.2.  But at work, we've got a
(unfortunately huge, and C++) program that runs much slower on a
large-memory W98 machine than a small-memory W95 one, due to disk thrashing.
It's a mystery!  If anyone has a clue about any of this, spit it out <wink>.

[Noah, I watched the disk cache size while running the above, and it's not
the problem -- while W98 had allocated about 100Mb for disk cache at the
start, it gracefully gave that up as the program's memory demands increased]

just-another-day-with-windows-ly y'rs  - tim




From mhammond@skippinet.com.au  Mon May 15 01:28:05 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Mon, 15 May 2000 10:28:05 +1000
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000001bfbdff$5bcdfe40$192d153f@tim>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEOHCKAA.mhammond@skippinet.com.au>

This is definitely weird!  As you only mentioned Win9x, I thought I would
give it a go on Win2k.

This is from a CVS update of only a few days ago, but it is a non-debug
build.  PII266 with 196MB ram:

         1 push       0.001
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.001
       256 push       0.001
       512 push       0.003
      1024 push       0.006
      2048 push       0.011
      4096 push       0.040
      8192 push       0.043
     16384 push       0.103
     32768 push       0.203
     65536 push       0.583

Things are looking OK to here - the behaviour Tim expected.  But then
things seem to start going a little wrong:

    131072 push       1.456
    262144 push       4.763
    524288 push      16.119
   1048576 push      60.765

All of a sudden we seem to hit N*N behaviour?

I gave up waiting for the next one.  Performance monitor was showing CPU at
100%, but the Python process was only sitting on around 15MB of RAM (and
growing _very_ slowly - at the rate you would expect).  Machine had tons of
ram showing as available, and the disk was not thrashing - ie, Windows
definitely had lots of mem available, and I have no reason to believe that
a malloc() would fail here - but certainly no one would ever want to wait
and see :-)

This was all definitely built with MSVC6, SP3.

no-room-should-ever-have-more-than-one-windows-ly y'rs

Mark.



From gstein@lyra.org  Mon May 15 05:08:33 2000
From: gstein@lyra.org (Greg Stein)
Date: Sun, 14 May 2000 21:08:33 -0700 (PDT)
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <001901bfbcef$d4672b80$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005142108070.28031-100000@nebula.lyra.org>

On Sat, 13 May 2000, Fredrik Lundh wrote:
> Fred L. Drake, Jr. wrote:
> > Fredrik Lundh writes:
> >  > what's the best way to make sure that a "cvs update" really brings
> >  > everything up to date, even if you've accidentally changed some-
> >  > thing in your local workspace?
> > 
> >   Delete the file(s) that got changed and cvs update again.
> 
> okay, what's the best way to get a list of locally changed files?

I use the following:

% cvs stat | fgrep Local


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From tim_one@email.msn.com  Mon May 15 08:34:39 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Mon, 15 May 2000 03:34:39 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBOEOHCKAA.mhammond@skippinet.com.au>
Message-ID: <000001bfbe40$07f14520$b82d153f@tim>

[Mark Hammond]
> This is definitely weird!  As you only mentioned Win9x, I thought I would
> give it a go on Win2k.

Thanks, Mark!  I've only got W9X machines at home.

> This is from a CVS update of only a few days ago, but it is a non-debug
> build.  PII266 with 196MB ram:
>
>          1 push       0.001
>          2 push       0.000
>          4 push       0.000
>          8 push       0.000
>         16 push       0.000
>         32 push       0.000
>         64 push       0.000
>        128 push       0.001
>        256 push       0.001
>        512 push       0.003
>       1024 push       0.006
>       2048 push       0.011
>       4096 push       0.040
>       8192 push       0.043
>      16384 push       0.103
>      32768 push       0.203
>      65536 push       0.583
>
> Things are looking OK to here - the behaviour Tim expected.  But then
> things seem to start going a little wrong:
>
>     131072 push       1.456
>     262144 push       4.763
>     524288 push      16.119
>    1048576 push      60.765

So that acts like my Win95 (which I didn't show), and somewhat like my 2nd &
3rd Win98 runs.

> All of a sudden we seem to hit N*N behaviour?

*That* part really isn't too surprising.  Python "overallocates", but by a
fixed amount independent of the current size.  This leads to quadratic-time
behavior "in theory" once a vector gets large enough.  Guido's cultural myth
for why that theory shouldn't matter is that if you keep appending to the
same vector, the OS will eventually move it to the end of the address space,
whereupon further growth simply boosts the VM high-water mark without
actually moving anything.  I call that "a cultural myth" because some
flavors of Unix really did work that way, and some may still -- I doubt
it's ever been a valid argument under Windows, though. (You, of all people,
know how much Python's internal strategies were informed by machines nobody
uses <wink>).
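
To put a number on that theory, here's a toy model of the two growth
strategies.  This is a sketch only -- not the actual resize code -- and the
pad of 100 and the doubling rule are made up for illustration:

def copies_needed(n, grow):
    # count the elements a realloc has to move while appending n items
    size = capacity = moved = 0
    for i in range(n):
        if size == capacity:
            capacity = grow(capacity)
            moved = moved + size    # the resize copies everything already there
        size = size + 1
    return moved

for n in (1 << 14, 1 << 17, 1 << 20):
    fixed = copies_needed(n, lambda cap: cap + 100)        # fixed pad
    doubling = copies_needed(n, lambda cap: cap * 2 or 8)  # proportional growth
    print("%8d appends: fixed pad moves %d items, doubling moves %d"
          % (n, fixed, doubling))

The fixed-pad total grows like N*N/200; the doubling total stays linear in N.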

So I was more surprised up to this point by the supernatural linearity of my
first W98 run (which is reproducible, btw).  But my 2nd & 3rd W98 runs (also
reproducible), unlike your W2K run, show *worse* than quadratic
behavior.

> I gave up waiting for the next one.

Under both W98 and W95, the next one does eventually hit the MemoryError for
me, but it does take a long time.  If I thought it would help, I'd measure
it.  And *this* one is surprising, because, as you say:

> Performance monitor was showing CPU at 100%, but the Python process
> was only sitting on around 15MB of RAM (and growing _very_ slowly -
> at the rate you would expect).  Machine had tons of ram showing as
> available, and the disk was not thrashing - i.e., Windows definitely
> had lots of mem available, and I have no reason to believe that
> a malloc() would fail here - but certainly no one would ever want to wait
> and see :-)

How long did you wait?  If less than 10 minutes, perhaps not long enough.  I
certainly didn't expect a NULL return either, even on my tiny machine, and
certainly not on the box with 20x more RAM than the list needs.

> This was all definitely built with MSVC6, SP3.

Again good to know.  I'll chew on this, but don't expect a revelation soon.

> no-room-should-ever-have-more-than-one-windows-ly y'rs

Hmm.  I *did* run these in different rooms <wink>.

no-accounting-for-windows-ly y'rs  - tim




From tim_one@email.msn.com  Mon May 15 08:34:51 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Mon, 15 May 2000 03:34:51 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <391DCDB5.4FCAB97F@tismer.com>
Message-ID: <000301bfbe40$0e2a49a0$b82d153f@tim>

[Christian Tismer]
> ...
> After all, it is no surprise. They are right.
> If we have to change their mind in order to understand
> a basic operation, then we are wrong, not they.

Huh!  I would not have guessed that you'd give up on Stackless that easily
<wink>.

> ...
> Making it a method of the joining string now appears to be
> a hack to me. (Sorry, Tim, the idea was great in the first place)

Just the opposite here:  it looked like a hack the first time I thought of
it, but has gotten more charming with each use.  space.join(sequence) is so
pretty it aches.

redefining-truth-all-over-the-place-ly y'rs  - tim




From gward@mems-exchange.org  Mon May 15 14:30:54 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Mon, 15 May 2000 09:30:54 -0400
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <000d01bfbce8$a3466f40$34aab5d4@hagrid>; from effbot@telia.com on Sat, May 13, 2000 at 04:36:30PM +0200
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
Message-ID: <20000515093053.A5765@mems-exchange.org>

--KsGdsel6WgEHnImy
Content-Type: text/plain; charset=us-ascii

On 13 May 2000, Fredrik Lundh said:
> what's the best way to make sure that a "cvs update" really brings
> everything up to date, even if you've accidentally changed some-
> thing in your local workspace?

Try the attached script -- it's basically the same as Greg Stein's "cvs
status | grep Local", but beefed-up and overkilled.

Example:

  $ cvstatus -l
  .cvsignore                     Up-to-date        2000-05-02 14:31:04
  Makefile.in                    Locally Modified  2000-05-12 12:25:39
  README                         Up-to-date        2000-05-12 12:34:42
  acconfig.h                     Up-to-date        2000-05-12 12:25:40
  config.h.in                    Up-to-date        2000-05-12 12:25:40
  configure                      Up-to-date        2000-05-12 12:25:40
  configure.in                   Up-to-date        2000-05-12 12:25:40
  install-sh                     Up-to-date        1998-08-13 12:08:45

...so yeah, it generates a lot of output when run on a large working
tree, eg. Python's.  But not as much as "cvs status" on its own.  ;-)

        Greg

PS. I just noticed it uses the "#!/usr/bin/env" hack with a command-line
option for the interpreter, which doesn't work on Linux.  ;-(  You may
have to hack the shebang line to make it work.

-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367

--KsGdsel6WgEHnImy
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=cvstatus

#!/usr/bin/env perl -w

#
# cvstatus
#
# runs "cvs status" (with optional file arguments), filtering out
# uninteresting stuff and putting in the last-modification time
# of each file.
#
# Usage: cvstatus [files]
#
# GPW 1999/02/17
#
# $Id: cvstatus,v 1.4 2000/04/14 14:56:14 gward Exp $
#

use strict;
use POSIX 'strftime';

my @files = @ARGV;

# Open a pipe to a forked child process
my $pid = open (CVS, "-|");
die "couldn't open pipe: $!\n" unless defined $pid;

# In the child -- run "cvs status" (with optional list of files
# from command line)
unless ($pid)
{
   open (STDERR, ">&STDOUT");           # merge stderr with stdout
   exec 'cvs', 'status', @files;
   die "couldn't exec cvs: $!\n";
}

# In the parent -- read "cvs status" output from the child
else
{
   my $dir = '';
   while (<CVS>)
   {
      my ($filename, $status, $mtime);
      if (/Examining (.*)/)
      {
         $dir = $1;
         if (! -d $dir)
         {
            warn "huh? no directory called $dir!";
            $dir = '';
         }
         elsif ($dir eq '.')
            { $dir = ''; }
         else
            { $dir .= '/' unless $dir =~ m|/$|; }
      }
      elsif (($filename, $status) = /^File: \s* (\S+) \s* Status: \s* (.*)/x)
      {
         $filename = $dir . $filename;
         if ($mtime = (stat $filename)[9])
         {
            $mtime = strftime ("%Y-%m-%d %H:%M:%S", localtime $mtime);
            printf "%-30.30s %-17s %s\n", $filename, $status, $mtime;
         }
         else
         {
            #warn "couldn't stat $filename: $!\n";
            printf "%-30.30s %-17s ???\n", $filename, $status;
         }
      }
   }

   close (CVS);
   warn "cvs failed\n" unless $? == 0;
}

--KsGdsel6WgEHnImy--


From trentm@activestate.com  Mon May 15 22:09:58 2000
From: trentm@activestate.com (Trent Mick)
Date: Mon, 15 May 2000 14:09:58 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <006e01bfbd06$6ba21120$34aab5d4@hagrid>
References: <006e01bfbd06$6ba21120$34aab5d4@hagrid>
Message-ID: <20000515140958.C20418@activestate.com>

I broke it with my patches to test overflow for some of the PyArg_Parse*()
formatting characters. The upshot of testing for overflow is that now those
formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
unsigned-ness as appropriate (you have to know if the value is signed or
unsigned to know what limits to check against for overflow). Two
possibilities presented themselves:

1. Enforce 'b' as unsigned char (the common usage) and the rest as signed
values (short, int, and long). If you want a signed char or an unsigned
short, you have to work around it yourself.

2. Add formatting characters or modifiers for signed and unsigned versions of
all the integral types to PyArg_Parse*() in getargs.c.

Guido preferred the former because (my own interpretation of the reasons) it
covers the common case and keeps the clutter and feature creep down. It is
debatable whether or not we really need signed and unsigned for all of them.
See the following threads on python-dev and patches:
  make 'b' formatter an *unsigned* char
  issues with int/long on 64bit platforms - eg stringobject (PR#306) 
  make 'b','h','i' raise overflow exception
  
Possible code breakage is the drawback.
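
For code that leaned on the old silent truncation, a workaround in the spirit
of option 1 might look like the sketch below.  The helper name is mine, not
part of any patch; the cleaner fix is simply to use the right typecode in the
first place ('H' for unsigned shorts):

import array

def as_signed16(n):
    # fold an arbitrary int into the signed 16-bit range, which is what
    # the old 'h' formatter did silently
    n = n & 0xFFFF
    if n >= 0x8000:
        n = n - 0x10000
    return n

print(array.array("h", [as_signed16(65535)]))   # array('h', [-1]), as in 1.5.2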


[Fredrik Lundh wrote]:
> sigh.  never resync the CVS repository until you've fixed all
> bugs in your *own* code ;-)

Sorry, I guess. The test suite did not catch this, so it was hard for me to
know that the bug had been introduced. My patches add tests for these to the
test suite.

> 
> in 1.5.2:
> 
> >>> array.array("h", [65535])
> array('h', [-1])
> 
> >>> array.array("H", [65535])
> array('H', [65535])
> 
> in the current CVS version:
> 
> >>> array.array("h", [65535])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: signed short integer is greater than maximum
> 
> okay, this might break some existing code -- but one
> can always argue that such code were already broken.

Yes.

> 
> on the other hand:
> 
> >>> array.array("H", [65535])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: signed short integer is greater than maximum
> 
> oops.
> 
oops. See my patch that fixes this for 'H', 'b', 'I', and 'L'.


> dunno if the right thing would be to add support for various kinds
> of unsigned integers to Python/getargs.c, or to hack around this
> in the array module...
> 
My patch does the latter and that would be my suggestion because:
(1) Guido didn't like the idea of adding more formatters to getargs.c (see
above)
(2) Adding support for unsigned and signed versions in getargs.c could be
confusing, since the formatting characters could not be the same as in the
array module: 'L' is already used for LONG_LONG types in
PyArg_Parse*().
(3) KISS and the common case. Keep the number of formatters for
PyArg_Parse*() short and simple. I would presume that the common case user
does not really need the extra support.


Trent


-- 
Trent Mick
trentm@activestate.com


From mhammond@skippinet.com.au  Tue May 16 07:22:53 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 16 May 2000 16:22:53 +1000
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
Message-ID: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au>

For about the 1,000,000th time in my life (no exaggeration :-), I just
typed "python.exe foo" - I forgot the .py.

It would seem a simple and useful change to append a ".py" extension and
try again, instead of dying the first time around - i.e., all we would be
changing is that we continue to run where we previously failed.

Is there a good reason why we don't do this?

Mark.



From mal@lemburg.com  Mon May 15 23:07:53 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 00:07:53 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <391d5b7f.3713359@smtp.worldonline.dk>
Message-ID: <39207539.F1C14A25@lemburg.com>

Finn Bock wrote:
> 
> On Sat, 13 May 2000 14:56:41 +0200, you wrote:
> 
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a\rb', 'a.b') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
>
> >what's the best way to deal with this?  I see three alter-
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings
> 
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

+1 on that one... just like \s should use Py_UNICODE_ISSPACE()
and \d Py_UNICODE_ISDECIMAL().
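
For reference, here is a small sketch (modern spelling, not 1.6 code) of the
two definitions of "newline" being compared; the code points are the ones
Fredrik lists above:

OLD_NEWLINES = set("\n")
UNICODE_LINEBREAKS = set(chr(c) for c in (10, 13, 28, 29, 30, 133, 8232, 8233))

def dot_matches(ch, linebreaks):
    # an unflagged '.' matches any character that is not a line break
    return ch not in linebreaks

print(dot_matches("\r", OLD_NEWLINES))        # True under the old chr(10)-only rule
print(dot_matches("\r", UNICODE_LINEBREAKS))  # False with the LINEBREAK predicate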

BTW, how have you implemented the locale-aware \w and \W
for Unicode ? Unicode doesn't have any locales, but it has quite a
lot more alphanumeric characters (or equivalents), and there
currently is no Py_UNICODE_ISALPHA() in the core.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Mon May 15 22:50:39 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 15 May 2000 23:50:39 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com>
 <8fe76b$684$1@newshost.accu.uu.nl>
 <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
 <8fh9ki$51h$1@slb3.atl.mindspring.net>
 <8fk4mh$i4$1@kopp.stud.ntnu.no>
 <391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us>
Message-ID: <3920712F.1FD0B910@lemburg.com>

"Barry A. Warsaw" wrote:
> 
> >>>>> "CT" == Christian Tismer <tismer@tismer.com> writes:
> 
>     CT> If it came to the point where the string module had some extra
>     CT> methods which operate on two lists of string perhaps, we would
>     CT> have been totally lost, and enforcing some OO method to
>     CT> support it would be completely off the road.
> 
> The new .join() method reads a bit better if you first name the
> glue string:
> 
> space = ' '
> name = space.join(['Barry', 'Aloisius', 'Warsaw'])
> 
> But yes, it does look odd when used like
> 
> ' '.join(['Christian', 'Aloisius', 'Tismer'])
> 
> I still think it's nice not to have to import string "just" to get the
> join functionality, but remember of course that string.join() isn't
> going away, so you can still use this if you like it better.

string.py is deprecated, AFAIK (not that it'll go away anytime
soon, but using string methods directly is really the better,
more readable, and faster approach).
 
> Alternatively, there has been talk about moving join() into the
> built-ins, but I'm not sure if the semantics of that have been nailed
> down.

This is probably the way to go. Semantics should probably
be:

	join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)

and should work with any type providing addition or
concat slot methods.

Patches anyone ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue May 16 09:21:46 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 10:21:46 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <3921051A.56C7B63E@lemburg.com>

"Martin v. Loewis" wrote:
> 
> > comments?  (for obvious reasons, I'm especially interested in comments
> > from people using non-ASCII characters on a daily basis...)
> 
> > nobody?
> 
> Hi Frederik,
> 
> I think the problem you try to see is not real. My guideline for using
> Unicode in Python 1.6 will be that people should be very careful to
> *not* mix byte strings and Unicode strings. If you are processing text
> data, obtained from a narrow-string source, you'll always have to make
> an explicit decision what the encoding is.

Right, that's the way to go :-)
 
> If you follow this guideline, I think the Unicode type of Python 1.6
> will work just fine.
> 
> If you use Unicode text *a lot*, you may find the need to combine them
> with plain byte text in a more convenient way. This is the time you
> should look at the implicit conversion stuff, and see which of the
> functionality is useful. You then don't need to memorize *all* the
> rules where implicit conversion would work - just the cases you care
> about.

One had better not rely on the implicit conversions. These
are really only there to ease porting applications to Unicode
and perhaps make some existing APIs deal with Unicode without
even knowing about it -- of course this will not always work,
and those places will need some extra porting effort to make
them useful with respect to Unicode. open() is one such candidate.
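
In code, the explicit style amounts to something like this sketch (modern
spelling; in 1.6 terms the last step would be roughly unicode(data, 'latin-1')):

data = b"v\xe9rit\xe9"           # bytes from some narrow-string source
text = data.decode("latin-1")    # the explicit decision: these bytes are Latin-1
print(text)                      # prints the decoded text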

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fredrik@pythonware.com  Tue May 16 10:30:54 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 16 May 2000 11:30:54 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>

Martin v. Loewis wrote:
> I think the problem you try to see is not real.

it is real.  I won't repeat the arguments one more time; please read
the W3C character model note and the python-dev archives, and read
up on the unicode support in Tcl and Perl.

> But then, it is not more difficult than tuples vs. lists

your examples always behave the same way, no matter what's in the
containers.  that's not true for MAL's design.

</F>



From guido@python.org  Tue May 16 11:03:07 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 16 May 2000 06:03:07 -0400
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
In-Reply-To: Your message of "Tue, 16 May 2000 16:22:53 +1000."
 <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au>
References: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au>
Message-ID: <200005161003.GAA12247@eric.cnri.reston.va.us>

> For about the 1,000,000th time in my life (no exaggeration :-), I just
> typed "python.exe foo" - I forgot the .py.
> 
> It would seem a simple and useful change to append a ".py" extension and
> try again, instead of dying the first time around - i.e., all we would be
> changing is that we continue to run where we previously failed.
> 
> Is there a good reason why we don't do this?

Just inertia, plus it's "not the Unix way".  I agree it's a good idea.
(I also found in user testing that IDLE definitely has to supply the
".py" when saving a module if the user didn't.)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From skip@mojam.com (Skip Montanaro)  Tue May 16 15:52:59 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Tue, 16 May 2000 09:52:59 -0500 (CDT)
Subject: [Python-Dev] join() et al.
In-Reply-To: <3920712F.1FD0B910@lemburg.com>
References: <391A3FD4.25C87CB4@san.rr.com>
 <8fe76b$684$1@newshost.accu.uu.nl>
 <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
 <8fh9ki$51h$1@slb3.atl.mindspring.net>
 <8fk4mh$i4$1@kopp.stud.ntnu.no>
 <391DBBED.B252E597@tismer.com>
 <14621.52191.448037.799287@anthem.cnri.reston.va.us>
 <3920712F.1FD0B910@lemburg.com>
Message-ID: <14625.24779.329534.364663@beluga.mojam.com>

    >> Alternatively, there has been talk about moving join() into the
    >> built-ins, but I'm not sure if the semantics of that have been nailed
    >> down.

    Marc> This is probably the way to go. Semantics should probably
    Marc> be:

    Marc> 	join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)

    Marc> and should work with any type providing addition or concat slot
    Marc> methods.

Of course, while it will always yield what you ask for, it might not always
yield what you expect:

    >>> seq = [1,2,3]
    >>> sep = 5
    >>> reduce(lambda x,y: x + sep + y, seq)
    16

;-)

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From Fredrik Lundh" <effbot@telia.com  Tue May 16 16:22:06 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 17:22:06 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com><8fe76b$684$1@newshost.accu.uu.nl><rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com><8fh9ki$51h$1@slb3.atl.mindspring.net><8fk4mh$i4$1@kopp.stud.ntnu.no><391DBBED.B252E597@tismer.com><14621.52191.448037.799287@anthem.cnri.reston.va.us><3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com>
Message-ID: <000d01bfbf4a$85321400$34aab5d4@hagrid>

>     Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
>
> Of course, while it will always yield what you ask for, it might not always
> yield what you expect:
> 
>     >>> seq = [1,2,3]
>     >>> sep = 5
>     >>> reduce(lambda x,y: x + sep + y, seq)
>     16

not to mention:

>>> print join([], " ")
TypeError: reduce of empty sequence with no initial value

...

</F>



From mal@lemburg.com  Tue May 16 18:15:05 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 19:15:05 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com><8fe76b$684$1@newshost.accu.uu.nl><rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com><8fh9ki$51h$1@slb3.atl.mindspring.net><8fk4mh$i4$1@kopp.stud.ntnu.no><391DBBED.B252E597@tismer.com><14621.52191.448037.799287@anthem.cnri.reston.va.us><3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com> <000d01bfbf4a$85321400$34aab5d4@hagrid>
Message-ID: <39218219.9E8115E2@lemburg.com>

Fredrik Lundh wrote:
> 
> >     Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
> >
> > Of course, while it will always yield what you ask for, it might not always
> > yield what you expect:
> >
> >     >>> seq = [1,2,3]
> >     >>> sep = 5
> >     >>> reduce(lambda x,y: x + sep + y, seq)
> >     16
> 
> not to mention:
> 
> >>> print join([], " ")
> TypeError: reduce of empty sequence with no initial value

Ok, here's a more readable and semantically useful definition:

def join(sequence,sep=''):

    # Special case: empty sequence
    if len(sequence) == 0:
        try:
            return 0*sep
        except TypeError:
            return sep[0:0]
        
    # Normal case
    x = None
    for y in sequence:
        if x is None:
            x = y
        elif sep:
            x = x + sep + y
        else:
            x = x + y
    return x

Examples:

>>> join((1,2,3))
6

>>> join(((1,2),(3,4)),('x',))
(1, 2, 'x', 3, 4)

>>> join(('a','b','c'), ' ')
'a b c'

>>> join(())
''

>>> join((),())
()

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From paul@prescod.net  Tue May 16 18:58:33 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 12:58:33 -0500
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <39218C49.C66FEEDE@prescod.net>

"Martin v. Loewis" wrote:
> 
> ...
>
> I think the problem you try to see is not real. My guideline for using
> Unicode in Python 1.6 will be that people should be very careful to
> *not* mix byte strings and Unicode strings. 

I think that as soon as we are adding admonitions to documentation that
things "probably don't behave as you expect, so be careful", we have
failed. Sometimes failure is unavoidable (e.g. floats do not act
rationally -- deal with it). But let's not pretend that failure is
success.

> If you are processing text
> data, obtained from a narrow-string source, you'll always have to make
> an explicit decision what the encoding is.

Are Python literals a "narrow string source"? It seems blatantly clear
to me that the "encoding" of Python literals should be determined at
compile time, not runtime. Byte arrays from a file are different. 

> If you use Unicode text *a lot*, you may find the need to combine them
> with plain byte text in a more convenient way. 

Unfortunately there will be many people with no interest in Unicode
who will be dealing with it merely because that is the way APIs are
going: XML APIs, Windows APIs, TK, DCOM, SOAP, WebDAV, even some X/Unix
APIs. Unicode is the new ASCII.

I want to get a (Unicode) string from an XML document or SOAP request,
compare it to a string literal and never think about Unicode.

> ...
> why does
> 
> >>> [a,b,c] = (1,2,3)
> 
> work, and
> 
> >>> [1,2]+(3,4)
> ...
> 
> does not?

I dunno. If there is no good reason then it is a bug that should be
fixed. The __radd__ operator on lists should iterate over its argument
as a sequence.
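
A sketch of the behavior I mean, using a made-up subclass (the change would
really belong in the list type itself):

class PermissiveList(list):
    def __add__(self, other):
        # accept any sequence on the right, not just another list
        return PermissiveList(list(self) + list(other))
    def __radd__(self, other):
        # ... and any sequence on the left
        return PermissiveList(list(other) + list(self))

print(PermissiveList([1, 2]) + (3, 4))   # [1, 2, 3, 4]
print((1, 2) + PermissiveList([3, 4]))   # [1, 2, 3, 4]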

As Fredrik points out, though, this situation is not as dangerous as
auto-conversions because

 a) the restriction could be loosened later without breaking code

 b) the operation always fails. It never does the wrong thing silently,
and it never succeeds for some inputs while failing for others.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)


From skip@mojam.com (Skip Montanaro)  Tue May 16 19:15:40 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Tue, 16 May 2000 13:15:40 -0500 (CDT)
Subject: [Python-Dev] join() et al.
In-Reply-To: <39218219.9E8115E2@lemburg.com>
References: <391A3FD4.25C87CB4@san.rr.com>
 <8fe76b$684$1@newshost.accu.uu.nl>
 <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
 <8fh9ki$51h$1@slb3.atl.mindspring.net>
 <8fk4mh$i4$1@kopp.stud.ntnu.no>
 <391DBBED.B252E597@tismer.com>
 <14621.52191.448037.799287@anthem.cnri.reston.va.us>
 <3920712F.1FD0B910@lemburg.com>
 <14625.24779.329534.364663@beluga.mojam.com>
 <000d01bfbf4a$85321400$34aab5d4@hagrid>
 <39218219.9E8115E2@lemburg.com>
Message-ID: <14625.36940.160373.900909@beluga.mojam.com>

    Marc> Ok, here's a more readable and semantically useful definition:
    ...

    >>> join((1,2,3))
    6

My point was that the verb "join" doesn't connote "sum".  The idea of
"join"ing a sequence suggests (to me) that the individual sequence elements
are still identifiable in the result, so "join((1,2,3))" would look
something like "123" or "1 2 3" or "10203", not "6".

It's not a huge deal to me, but I think it mildly violates the principle of
least surprise when you try to apply it to sequences of non-strings.

To extend this into the absurd, what should the following code display?

    class Spam: pass

    eggs = Spam()
    bacon = Spam()
    toast = Spam()

    print join((eggs,bacon,toast))

If a join builtin is supposed to be applicable to all types, we need to
decide what the semantics are going to be for all types.  Maybe all that
needs to happen is that you stringify any non-string elements before
applying the + operator (just one possibility among many, not necessarily
one I recommend).  If you want to limit join's inputs to (or only make it
semantically meaningful for) sequences of strings, then it should probably
not be a builtin, no matter how visually annoying you find

    " ".join(["a","b","c"])

Skip


From Fredrik Lundh" <effbot@telia.com  Tue May 16 19:26:10 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 20:26:10 +0200
Subject: [Python-Dev] homer-dev, anyone?
Message-ID: <009d01bfbf64$b779a260$34aab5d4@hagrid>

http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40

</F>



From martin@loewis.home.cs.tu-berlin.de  Tue May 16 19:43:34 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 16 May 2000 20:43:34 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
 (fredrik@pythonware.com)
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
Message-ID: <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>

> it is real.  I won't repeat the arguments one more time; please read
> the W3C character model note and the python-dev archives, and read
> up on the unicode support in Tcl and Perl.

I did read all that, so there really is no point in repeating the
arguments - yet I'm still not convinced. One of the causes may be that
all your commentary either

- discusses an alternative solution to the existing one, merely
  pointing out the difference, without any strong selling point
- explains small examples that work counter-intuitively

I'd like to know whether you have an example of a real-world
big-application problem that could not be conveniently implemented
using the new Unicode API. For all the examples I can think of where
Unicode would matter (XML processing, CORBA wstring mapping,
internationalized messages and GUIs), it would work just fine.

So while it may not be perfect, I think it is good enough. Perhaps my
problem is that I'm not a perfectionist :-)

However, one remark from http://www.w3.org/TR/charmod/ reminded me of
an earlier proposal by Bill Janssen. The Character Model says

# Because encoded text cannot be interpreted and processed without
# knowing the encoding, it is vitally important that the character
# encoding is known at all times and places where text is exchanged or
# stored.

While they were considering document encodings, I think this applies
in general. Bill Janssen's proposal was that each (narrow) string
should have an attribute .encoding. If set, you'll know what encoding
a string has. If not set, it is a byte string, subject to the default
encoding. I'd still like to see that as a feature in Python.
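
A rough sketch of how I read that proposal -- the class name, the attribute
spelling, and the byte-literal syntax are mine, not Bill's:

class TaggedString:
    """A byte string that knows what encoding it is in (sketch only)."""
    def __init__(self, data, encoding=None):
        self.data = data            # the raw bytes
        self.encoding = encoding    # None: subject to the default encoding

    def as_unicode(self, default="ascii"):
        return self.data.decode(self.encoding or default)

s = TaggedString(b"Gr\xfc\xdfe", encoding="latin-1")
print(s.as_unicode())               # the decoded text, no guessing involved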

Regards,
Martin


From paul@prescod.net  Tue May 16 19:49:46 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 13:49:46 -0500
Subject: [Python-Dev] homer-dev, anyone?
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>
Message-ID: <3921984A.8CDE8E1D@prescod.net>

I hope that if Python were renamed we would not choose yet another name
which turns up hundreds of false hits in web engines. Perhaps Homr or
Home_r. Or maybe Pythahn.

Fredrik Lundh wrote:
> 
> http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
> 
> </F>
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)


From tismer@tismer.com  Tue May 16 20:01:21 2000
From: tismer@tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:01:21 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
Message-ID: <39219B01.A4EE0920@tismer.com>


Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > After all, it is no surprise. They are right.
> > If we have to change their mind in order to understand
> > a basic operation, then we are wrong, not they.
> 
> Huh!  I would not have guessed that you'd give up on Stackless that easily
> <wink>.

Noh, I didn't give up Stackless, but fishing for soles.
After Just v. R. has become my most ambitious user,
I'm happy enough.

(Again, better not to take me too seriously :)

> > ...
> > Making it a method of the joining string now appears to be
> > a hack to me. (Sorry, Tim, the idea was great in the first place)
> 
> Just the opposite here:  it looked like a hack the first time I thought of
> it, but has gotten more charming with each use.  space.join(sequence) is so
> pretty it aches.

It is absolutely fantastic.
The most uninteresting stuff in the join is the separator,
and it has the power to merge thousands of strings
together, without asking the sequence at all
 - give all power to the oppressed, long live the Python anarchy :-)

We now just have to convince the user no longer to think
of *what* to join in the first place, but how.

> redefining-truth-all-over-the-place-ly y'rs  - tim

" "-is-small-but-sooo-strong---lets-elect-new-users - ly y'rs - chris

p.s.: no this is *no* offense, just kidding.

" ".join(":-)", ":^)", "<wink> ") * 42

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From tismer@tismer.com  Tue May 16 20:10:42 2000
From: tismer@tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:10:42 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim> <39219B01.A4EE0920@tismer.com>
Message-ID: <39219D32.BD82DE83@tismer.com>

Oh, while we are at it...

Christian Tismer wrote:
> " ".join(":-)", ":^)", "<wink> ") * 42

is actually wrong, since it needs a sequence, not just
the arg tuple. Wouldn't it make sense to allow this?
Exactly the opposite of list.append(), since in this
case we are just expecting strings?

While I have to say that

>>> " ".join("123")
'1 2 3'
>>> 

is not a feature to me but just annoying ;-)

ciao again - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From Fredrik Lundh" <effbot@telia.com  Tue May 16 20:30:49 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 21:30:49 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
Message-ID: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>

Martin v. Loewis wrote:
> > it is real.  I won't repeat the arguments one more time; please read
> > the W3C character model note and the python-dev archives, and read
> > up on the unicode support in Tcl and Perl.
> 
> I did read all that, so there really is no point in repeating the
> arguments - yet I'm still not convinced. One of the causes may be that
> all your commentary either
> 
> - discusses an alternative solution to the existing one, merely
>   pointing out the difference, without any strong selling point
> - explains small examples that work counter-intuitively

umm.  I could have sworn that getting rid of counter-intuitive
behaviour was rather important in python.  maybe we're using
the language in radically different ways?

> I'd like to know whether you have an example of a real-world
> big-application problem that could not be conveniently implemented
> using the new Unicode API. For all the examples I can think of where
> Unicode would matter (XML processing, CORBA wstring mapping,
> internationalized messages and GUIs), it would work just fine.

of course I can kludge my way around the flaws in MAL's design,
but why should I have to do that? it's broken. fixing it is easy.

> Perhaps my problem is that I'm not a perfectionist :-)

perfectionist or not, I only want Python's Unicode support to
be as intuitive as anything else in Python.  as it stands right
now, Perl and Tcl's Unicode support is intuitive.  Python's not.

(it also backs us into a corner -- once you mess this one up,
you cannot fix it in Py3K without breaking lots of code.  that's
really bad).

in contrast, Guido's compromise proposal allows us to do this
the right way in 1.7/Py3K (i.e. teach python about source code
encodings, system api encodings, and stream i/o encodings).

btw, I thought we'd all agreed on GvR's solution for 1.6?

what did I miss?

> So while it may not be perfect, I think it is good enough.

so tell me, if "good enough" is what we're aiming at, why isn't
my counter-proposal good enough?

if not else, it's much easier to document...

</F>



From skip@mojam.com (Skip Montanaro)  Tue May 16 20:30:08 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Tue, 16 May 2000 14:30:08 -0500 (CDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
In-Reply-To: <39219D32.BD82DE83@tismer.com>
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
 <39219B01.A4EE0920@tismer.com>
 <39219D32.BD82DE83@tismer.com>
Message-ID: <14625.41408.423282.529732@beluga.mojam.com>

    Christian> While I have to say that

    >>>> " ".join("123")
    Christian> '1 2 3'
    >>>> 

    Christian> is not a feature to me but just annoying ;-)

More annoying than

    >>> import string
    >>> string.join("123")
    '1 2 3'

? ;-)

a-sequence-is-a-sequence-ly y'rs,

Skip


From tismer@tismer.com  Tue May 16 20:43:33 2000
From: tismer@tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:43:33 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
 <39219B01.A4EE0920@tismer.com>
 <39219D32.BD82DE83@tismer.com> <14625.41408.423282.529732@beluga.mojam.com>
Message-ID: <3921A4E5.9BDEBF49@tismer.com>


Skip Montanaro wrote:
> 
>     Christian> While I have to say that
> 
>     >>>> " ".join("123")
>     Christian> '1 2 3'
>     >>>>
> 
>     Christian> is not a feature to me but just annoying ;-)
> 
> More annoying than
> 
>     >>> import string
>     >>> string.join("123")
>     '1 2 3'
> 
> ? ;-)

You are right. Equally bad, just in a different flavor.
*gulp* this is going to be a can of worms since...

> a-sequence-is-a-sequence-ly y'rs,

Then a string should better not be a sequence.

The number of places where I really used the string sequence
protocol to advantage is outnumbered by a factor
of ten by cases where I forgot to tupleise and got a bad
result. A traceback is better than a sequence here.

oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris

p.s.: the Spanish Inquisition can't get me since I'm in Russia
until Sunday - omsk

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From guido@python.org  Tue May 16 20:49:17 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 16 May 2000 15:49:17 -0400
Subject: [Python-Dev] Unicode
In-Reply-To: Your message of "Tue, 16 May 2000 21:30:49 +0200."
 <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
 <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <200005161949.PAA16607@eric.cnri.reston.va.us>

> in contrast, Guido's compromise proposal allows us to do this
> the right way in 1.7/Py3K (i.e. teach python about source code
> encodings, system api encodings, and stream i/o encodings).
> 
> btw, I thought we'd all agreed on GvR's solution for 1.6?
> 
> what did I miss?

Nothing.  We are going to do that (my "ASCII" proposal).  I'm just
waiting for the final SRE code first.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Tue May 16 21:01:46 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 16 May 2000 16:01:46 -0400
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: Your message of "Tue, 16 May 2000 13:49:46 CDT."
 <3921984A.8CDE8E1D@prescod.net>
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>
 <3921984A.8CDE8E1D@prescod.net>
Message-ID: <200005162001.QAA16657@eric.cnri.reston.va.us>

> I hope that if Python were renamed we would not choose yet another name
> which turns up hundreds of false hits in web engines. Perhaps Homr or
> Home_r. Or maybe Pythahn.

Actually, I'd like to call the next version Throatwobbler Mangrove.
But you'd have to pronounce it Raymond Luxury Yach-t.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From akuchlin@mems-exchange.org  Tue May 16 21:10:22 2000
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 May 2000 16:10:22 -0400 (EDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
 <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
 <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
 <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <14625.43822.773966.59550@amarok.cnri.reston.va.us>

Fredrik Lundh writes:
>perfectionist or not, I only want Python's Unicode support to
>be as intuitive as anything else in Python.  as it stands right
>now, Perl and Tcl's Unicode support is intuitive.  Python's not.

I don't know about Tcl, but Perl 5.6's Unicode support is still
considered experimental.  Consider the following excerpts, for
example.  (And Fredrik's right; we shouldn't release a 1.6 with broken
support, or we'll pay for it for *years*...  But if GvR's ASCII
proposal is considered OK, then great!)

========================
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-04/msg00084.html:

>Ah, yes. Unicode. But after two years of work, the one thing that users
>will want to do - open and read Unicode data - is still not there.
>Who cares if stuff's now represented internally in Unicode if they can't
>read the files they need to.

This is a "big" (as in "huge") disappointment for me as well.  I hope
we'll do better next time.

========================
http://www.egroups.com/message/perl5-porters/67906:
But given that interpretation, I'm amazed at how many operators seem
to be broken with UTF8.    It certainly supports Ilya's contention of
"pre-alpha".

Here's another example:
 
  DB<1> x (256.255.254 . 257.258.259) eq (256.255.254.257.258.259)
0  ''
  DB<2>

Rummaging with Devel::Peek shows that in this case, it's the fault of
the . operator.

And eq is broken as well:

  DB<11> x "\x{100}" eq "\xc4\x80"
0  1
  DB<12>

Aaaaargh!

========================
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-03/msg00971.html:

A couple problems here...passage through a hash key removes the UTF8
flag (as might be expected).  Even if keys were to attempt to restore
the UTF8 flag (ala Convert::UTF::decode_utf8) or hash keys were real
SVs, what then do you do with $h{"\304\254"} and the like?

Suggestions:

1. Leave things as they are, but document UTF8 hash keys as experimental
and subject to change.

or 2. When under use bytes, leave things as they are.  Otherwise, have
keys turn on the utf8 flag if appropriate.  Also give a warning when
using a hash key like "\304\254" since keys will in effect return a
different string that just happens to have the same internal encoding.

========================

 


From paul@prescod.net  Tue May 16 21:36:42 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 15:36:42 -0500
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
Message-ID: <3921B15A.73EF6355@prescod.net>

"Martin v. Loewis" wrote:
> 
> ...
> 
> I'd like to know whether you have an example of a real-world
> big-application problem that could not be conveniently implemented
> using the new Unicode API. For all the examples I can think of where
> Unicode would matter (XML processing, CORBA wstring mapping,
> internationalized messages and GUIs), it would work just fine.

Of course an implicit behavior can never get in the way of
big-application building. The question is about the principle of least
surprise, and simplicity of explanation and understanding.

 I'm-told-that-even-Perl-and-C++-can-be-used-for-big-apps -ly yrs

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)


From martin@loewis.home.cs.tu-berlin.de  Tue May 16 23:02:10 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 17 May 2000 00:02:10 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> (effbot@telia.com)
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <200005162202.AAA02125@loewis.home.cs.tu-berlin.de>

> perfectionist or not, I only want Python's Unicode support to
> be as intuitive as anything else in Python.  as it stands right
> now, Perl and Tcl's Unicode support is intuitive.  Python's not.

I haven't much experience with Perl, but I don't think Tcl is
intuitive in this area. I really think that they got it all wrong.
They use the string type for "plain bytes", just as we do, but then
have the notion of "correct" and "incorrect" UTF-8 (i.e. strings with
violations of the encoding rule). For a "plain bytes" string, the
following might happen

- the string is scanned for non-UTF-8 characters
- if any are found, the string is converted into UTF-8, essentially
  treating the original string as Latin-1.
- it then continues to use the UTF-8 "version" of the original string,
  and converts it back on demand.

Maybe I got something wrong, but the Unicode support in Tcl makes me
worry very much.

> btw, I thought we'd all agreed on GvR's solution for 1.6?
> 
> what did I miss?

I like the 'only ASCII is converted' approach very much, so I'm not
objecting to that solution - just as I wasn't objecting to the
previous one.

> so tell me, if "good enough" is what we're aiming at, why isn't
> my counter-proposal good enough?

Do you mean the one in

http://www.python.org/pipermail/python-dev/2000-April/005218.html

which I suppose is the same one as the "java-like approach"? AFAICT,
all it does is to change the default encoding from UTF-8 to Latin-1.
I can't follow why this should be *better*, but it would certainly be
as good... In comparison, restricting the "character" interpretation
of the string type (in terms of your proposal) to 7-bit characters
has the advantage that it is less error-prone, as Guido points out.

Regards,
Martin


From mal@lemburg.com  Tue May 16 23:59:45 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 May 2000 00:59:45 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <3921D2E1.6282AA8F@lemburg.com>

Fredrik Lundh wrote:
> 
> of course I can kludge my way around the flaws in MAL's design,
> but why should I have to do that? it's broken. fixing it is easy.

Look Fredrik, it's not *my* design. All this was discussed in
public and in several rounds late last year. If someone made
a mistake and "broke" anything, then we all did... I still
don't think so, but that's my personal opinion.

--

Now to get back to some non-flammable content: 

Has anyone played around with the latest sys.set_string_encoding()
patches ? I would really like to know what you think.

The idea behind it is that you can define what the Unicode
implementation is to expect as the encoding when it sees an
8-bit string. The encoding is used for coercion, str(unicode)
and printing. It is currently *not* used for the "s"
parser marker and hash values (mainly due to internal issues).

See my patch comments for details.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From tim_one@email.msn.com  Wed May 17 07:45:59 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 May 2000 02:45:59 -0400
Subject: [Python-Dev] join() et al.
In-Reply-To: <14625.36940.160373.900909@beluga.mojam.com>
Message-ID: <000701bfbfcb$8f6cc600$b52d153f@tim>

[Skip Montanaro]
> ...
> It's not a huge deal to me, but I think it mildly violates the
> principle of least surprise when you try to apply it to sequences
> of non-strings.

When sep.join(seq) was first discussed, half the debate was whether str()
should be magically applied to seq's elements.  I still favor doing that, as
I have often explained the TypeError in e.g.

    string.join(some_mixed_list_of_strings_and_numbers)

to people and agree with their next complaint:  their intent was obvious,
since string.join *produces* a string.  I've never seen an instance of this
error that was appreciated (i.e., it never exposed an error in program logic
or concept, it's just an anal gripe about an arbitrary and unnatural
restriction).  Not at all like

    "42" + 42

where the intent is unknowable.

> To extend this into the absurd, what should the following code display?
>
>     class Spam: pass
>
>     eggs = Spam()
>     bacon = Spam()
>     toast = Spam()
>
>     print join((eggs,bacon,toast))

Note that we killed the idea of a new builtin join last time around.  It's
the kind of muddy & gratuitous hypergeneralization Guido will veto if we
don't kill it ourselves.  That said,

    space.join((eggs, bacon, toast))

should <wink> produce

    str(egg) + space + str(bacon) + space + str(toast)

although how Unicode should fit into all this was never clear to me.

> If a join builtin is supposed to be applicable to all types, we need to
> decide what the semantics are going to be for all types.

See above.

> Maybe all that needs to happen is that you stringify any non-string
> elements before applying the + operator (just one possibility among
> many, not necessarily one I recommend).

In my experience, that it *doesn't* do that today is a common source of
surprise & mild irritation.  But I insist that "stringify" return a string
in this context, and that "+" is simply shorthand for "string catenation".
Generalizing this would be counterproductive.

> If you want to limit join's inputs to (or only make it semantically
> meaningful for) sequences of strings, then it should probably
> not be a builtin, no matter how visually annoying you find
>
>     " ".join(["a","b","c"])

This is one of those "doctor, doctor, it hurts when I stick an onion up my
ass!" things <wink>.  space.join(etc) reads beautifully, and anyone who
doesn't spell it that way but hates the above is picking at a scab they
don't *want* to heal <0.3 wink>.

having-said-nothing-new-he-signs-off-ly y'rs  - tim




From tim_one@email.msn.com  Wed May 17 08:12:27 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 May 2000 03:12:27 -0400
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au>
Message-ID: <000801bfbfcf$424029e0$b52d153f@tim>

[Mark Hammond]
> For about the 1,000,000th time in my life (no exaggeration :-), I just
> typed "python.exe foo" - I forgot the .py.

Mark, is this an Australian thing?  That is, you must be the only person on
earth (besides a guy I know from New Zealand -- Australia, New Zealand, same
thing to American eyes <wink>) who puts ".exe" at the end of "python"!  I'm
speculating that you think backwards because you're upside-down down there.

throwing-another-extension-on-the-barbie-mate-ly y'rs  - tim




From Fredrik Lundh" <effbot@telia.com  Wed May 17 08:36:03 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Wed, 17 May 2000 09:36:03 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> <200005162202.AAA02125@loewis.home.cs.tu-berlin.de>
Message-ID: <004f01bfbfd3$0dd17a20$34aab5d4@hagrid>

Martin v. Loewis wrote:
> > perfectionist or not, I only want Python's Unicode support to
> > be as intuitive as anything else in Python.  as it stands right
> > now, Perl and Tcl's Unicode support is intuitive.  Python's not.
> 
> I haven't much experience with Perl, but I don't think Tcl is
> intuitive in this area. I really think that they got it all wrong.

"all wrong"?

Tcl works hard to maintain the "characters are characters" model
(implementation level 2), just like Perl.  the length of a string is
always the number of characters, slicing works as it should, the
internal representation is as efficient as you can make it.
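
concretely (a sketch in modern spelling, not 1.6 code):

s = u"na\u00efve"                  # five characters, however they're encoded
print(len(s))                      # 5
print(len(s.encode("utf-8")))      # 6 bytes in UTF-8
print(s[2])                        # a single character, not half a byte pair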

but yes, they have a somewhat dubious autoconversion mechanism
in there.  if something isn't valid UTF-8, it's assumed to be Latin-1.

scary, huh?  not really, if you step back and look at how UTF-8 was
designed.  quoting from RFC 2279:

    "UTF-8 strings can be fairly reliably recognized as such by a
    simple algorithm, i.e. the probability that a string of characters
    in any other encoding appears as valid UTF-8 is low, diminishing
    with increasing string length."
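
in python terms, the detection step is roughly this (a sketch, obviously
not Tcl's actual code):

def guess_decode(raw):
    # try strict UTF-8 first; if the bytes aren't valid UTF-8, fall back
    # to Latin-1, which accepts any byte sequence
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"

print(guess_decode(b"na\xc3\xafve"))    # valid UTF-8: decoded as UTF-8
print(guess_decode(b"na\xefve"))        # invalid UTF-8: read as Latin-1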

besides, their design is based on the plan 9 rune stuff.  that code
was written by the inventors of UTF-8, who have this to say:

    "There is little a rune-oriented program can do when given bad
    data except exit, which is unreasonable, or carry on. Originally
    the conversion routines, described below, returned errors when
    given invalid UTF, but we found ourselves repeatedly checking
    for errors and ignoring them. We therefore decided to convert
    a bad sequence to a valid rune and continue processing.

    "This technique does have the unfortunate property that con-
    verting invalid UTF byte strings in and out of runes does not
    preserve the input, but this circumstance only occurs when
    non-textual input is given to a textual program."

so let's see: they aimed for a high level of unicode support (layer
2, stream encodings, and system api encodings, etc), they've based
their design on work by the inventors of UTF-8, they have several
years of experience using their implementation in real life, and you
seriously claim that they got it "all wrong"?

that's weird.

> AFAICT, all it does is to change the default encoding from UTF-8
> to Latin-1.

now you're using "all" in that strange way again...  check the archives
for the full story (hint: a conceptual design model isn't the same thing
as a C implementation)

> I can't follow why this should be *better*, but it would be certainly
> as good... In comparison, restricting the "character" interpretation
> of the string type (in terms of your proposal) to 7-bit characters
> has the advantage that it is less error-prone, as Guido points out.

the main reason for that is that Python 1.6 doesn't have any way to
specify source encodings.  add that, so you no longer have to guess
what a string *literal* really is, and that problem goes away.  but
that's something for 1.7.

</F>



From mal@lemburg.com  Wed May 17 09:56:19 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 May 2000 10:56:19 +0200
Subject: [Python-Dev] join() et al.
References: <000701bfbfcb$8f6cc600$b52d153f@tim>
Message-ID: <39225EB3.8D2C9A26@lemburg.com>

Tim Peters wrote:
> 
> [Skip Montanaro]
> > ...
> > It's not a huge deal to me, but I think it mildly violates the
> > principle of least surprise when you try to apply it to sequences
> > of non-strings.
> 
> When sep.join(seq) was first discussed, half the debate was whether str()
> should be magically applied to seq's elements.  I still favor doing that, as
> I have often explained the TypeError in e.g.
> 
>     string.join(some_mixed_list_of_strings_and_numbers)
> 
> to people and agree with their next complaint:  their intent was obvious,
> since string.join *produces* a string.  I've never seen an instance of this
> error that was appreciated (i.e., it never exposed an error in program logic
> or concept, it's just an anal gripe about an arbitrary and unnatural
> restriction).  Not at all like
> 
>     "42" + 42
> 
> where the intent is unknowable.

Uhm, aren't we discussing a generic sequence join API here ?

For strings, I think that " ".join(seq) is just fine... but it
would be nice to have similar functionality for other sequence
items as well, e.g. for sequences of sequences.
 
> > To extend this into the absurd, what should the following code display?
> >
> >     class Spam: pass
> >
> >     eggs = Spam()
> >     bacon = Spam()
> >     toast = Spam()
> >
> >     print join((eggs,bacon,toast))
> 
> Note that we killed the idea of a new builtin join last time around.  It's
> the kind of muddy & gratuitous hypergeneralization Guido will veto if we
> don't kill it ourselves.

We did ? (I must have been too busy hacking Unicode ;-)

Well, in that case I'd still be interested in hearing about
your thoughts so that I can integrate such a beast in mxTools.
The acceptance level needed for doing that is much lower than
for the core builtins ;-)

>  That said,
> 
>     space.join((eggs, bacon, toast))
> 
> should <wink> produce
> 
>     str(egg) + space + str(bacon) + space + str(toast)
> 
> although how Unicode should fit into all this was never clear to me.

But that would mask errors and, even worse, "work around" coercion,
which is not a good idea, IMHO. Note that the need to coerce to
Unicode was the reason why the implicit str() in " ".join() was
removed from Barry's original string methods implementation.

space.join(map(str,seq)) is much clearer in this respect: it
forces the user to think about what the join should do with non-
string types.
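
Concretely (the values are just illustrative):

    seq = ["x", 42, "y"]
    # " ".join(seq)                raises TypeError today
    print " ".join(map(str, seq))  # 'x 42 y': the caller decides how 42
                                   # should become a string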

> > If a join builtin is supposed to be applicable to all types, we need to
> > decide what the semantics are going to be for all types.
> 
> See above.
> 
> > Maybe all that needs to happen is that you stringify any non-string
> > elements before applying the + operator (just one possibility among
> > many, not necessarily one I recommend).
> 
> In my experience, that it *doesn't* do that today is a common source of
> surprise & mild irritation.  But I insist that "stringify" return a string
> in this context, and that "+" is simply shorthand for "string catenation".
> Generalizing this would be counterproductive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake@acm.org  Wed May 17 15:12:01 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Wed, 17 May 2000 07:12:01 -0700 (PDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <004f01bfbfd3$0dd17a20$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>

On Wed, 17 May 2000, Fredrik Lundh wrote:
 > the main reason for that is that Python 1.6 doesn't have any way to
 > specify source encodings.  add that, so you no longer have to guess
 > what a string *literal* really is, and that problem goes away.  but

  You seem to be familiar with the Tcl work, so I'll ask you
this question:  Does Tcl have a way to specify source encoding?
I'm not aware of it, but I've only had time to follow the Tcl
world very lightly these past few years.  ;)


  -Fred

--
Fred L. Drake, Jr.  <fdrake at acm.org>



From effbot@telia.com  Wed May 17 15:29:32 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Wed, 17 May 2000 16:29:32 +0200
Subject: [Python-Dev] Unicode
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
Message-ID: <018101bfc00c$52be3180$34aab5d4@hagrid>

Fred L. Drake wrote:
> On Wed, 17 May 2000, Fredrik Lundh wrote:
>  > the main reason for that is that Python 1.6 doesn't have any way to
>  > specify source encodings.  add that, so you no longer have to guess
>  > what a string *literal* really is, and that problem goes away.  but
> 
>   You seem to be familiar with the Tcl work, so I'll ask you
> this question:  Does Tcl have a way to specify source encoding?

Tcl has a system encoding (which is used when passing strings
through system APIs), and file/channel-specific encodings.

(for info on how they initialize the system encoding, see earlier
posts).

unfortunately, they're using the system encoding also for source
code.  for portable code, they recommend sticking to ASCII or
using "bootstrap scripts", e.g:

    set fd [open "app.tcl" r]
    fconfigure $fd -encoding euc-jp
    set jpscript [read $fd]
    close $fd
    eval $jpscript

we can surely do better in 1.7...
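
a rough Python counterpart of that bootstrap, using the codecs module
(the file name is hypothetical; what to hand to exec/compile is exactly
the open question):

    import codecs
    fd = codecs.open("app.py", "r", "euc-jp")
    script = fd.read()      # the source, as a Unicode string
    fd.close()
    # ...feeding this to exec/compile is the part 1.7 would have to define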

</F>



From jeremy@alum.mit.edu  Wed May 17 23:38:20 2000
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Wed, 17 May 2000 15:38:20 -0700 (PDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <3921D2E1.6282AA8F@lemburg.com>
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
 <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
 <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
 <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
 <3921D2E1.6282AA8F@lemburg.com>
Message-ID: <14627.8028.887219.978041@localhost.localdomain>

>>>>> "MAL" == M -A Lemburg <mal@lemburg.com> writes:

  MAL> Fredrik Lundh wrote:
  >>  of course I can kludge my way around the flaws in MAL's design,
  >> but why should I have to do that? it's broken. fixing it is easy.

  MAL> Look Fredrik, it's not *my* design. All this was discussed in
  MAL> public and in several rounds late last year. If someone made a
  MAL> mistake and "broke" anything, then we all did... I still don't
  MAL> think so, but that's my personal opinion.

I find it's best to avoid referring to a design as "so-and-so's design"
unless you've got something specifically complimentary to say.  Using
the person's name in combination with some criticism of the design
tends to produce a defensive reaction.  Avoiding that might help make
this discussion less contentious.

Jeremy



From martin@loewis.home.cs.tu-berlin.de  Wed May 17 23:55:21 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Thu, 18 May 2000 00:55:21 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
 (fdrake@acm.org)
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
Message-ID: <200005172255.AAA01245@loewis.home.cs.tu-berlin.de>

>   You seem to be familiar with the Tcl work, so I'll ask you
> this question:  Does Tcl have a way to specify source encoding?
> I'm not aware of it, but I've only had time to follow the Tcl
> world very lightly these past few years.  ;)

To my knowledge, no. Tcl (at least 8.3) supports the \u notation for
Unicode escapes, and treats all other source code as
Latin-1. encoding(n) says

# However, because the source command always reads files using the
# ISO8859-1 encoding, Tcl will treat each byte in the file as a
# separate character that maps to the 00 page in Unicode.

Regards
Martin



From tim_one@email.msn.com  Thu May 18 05:34:13 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:13 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <3921A4E5.9BDEBF49@tismer.com>
Message-ID: <000301bfc082$51ce0180$6c2d153f@tim>

[Christian Tismer]
> ...
> Then a string should better not be a sequence.
>
> The number of places where I really used the string sequence
> protocol to take advantage of it is outperformed by a factor
> of ten by cases where I forgot to tupleise and got a bad
> result. A traceback is better than a sequence here.

Alas, I think

    for ch in string:
        muck w/ the character ch

is a common idiom.

> oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris

The "sequenenceness" of strings does get in the way often enough.  Strings
have the amazing property that, since characters are also strings,

    while 1:
        string = string[0]

never terminates with an error.  This often manifests as unbounded recursion
in generic functions that crawl over nested sequences (the first time you
code one of these, you try to stop the recursion on an "is it a sequence?"
test, and then someone passes in something containing a string and it
descends forever).  And we also have that

    format % values

requires "values" to be specifically a tuple rather than any old sequence,
else the current

    "%s" % some_string

could be interpreted the wrong way.

There may be some hope in that the "for/in" protocol is now conflated with
the __getitem__ protocol, so if Python grows a more general iteration
protocol, perhaps we could back away from the sequenceness of strings
without harming "for" iteration over the characters ...




From tim_one@email.msn.com  Thu May 18 05:34:05 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:05 -0400
Subject: [Python-Dev] join() et al.
In-Reply-To: <39225EB3.8D2C9A26@lemburg.com>
Message-ID: <000001bfc082$4d9d5020$6c2d153f@tim>

[M.-A. Lemburg]
> ...
> Uhm, aren't we discussing a generic sequence join API here ?

It depends on whether your "we" includes me <wink>.

> Well, in that case I'd still be interested in hearing about
> your thoughts so that I can integrate such a beast in mxTools.
> The acceptance level needed for doing that is much lower than
> for the core builtins ;-)

Heh heh.  Python already has a generic sequence join API, called "reduce".
What else do you want beyond that?  There's nothing else I want, and I don't
even want reduce <0.9 wink>.  You can mine any modern Lisp, or any ancient
APL, for more of this ilk.  NumPy has some use for stuff like this, but
effective schemes require dealing with multiple dimensions intelligently,
and then you're in the proper domain of matrices rather than sequences.
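
For the record, that looks something like this (sep and seq are
illustrative):

    sep = [0]
    seq = [[1, 2], [3], [4, 5]]
    print reduce(lambda a, b: a + sep + b, seq)   # [1, 2, 0, 3, 0, 4, 5]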

> >  That said,
> >
> >     space.join((eggs, bacon, toast))
> >
> > should <wink> produce
> >
> >     str(egg) + space + str(bacon) + space + str(toast)
> >
> > although how Unicode should fit into all this was never clear to me.

> But that would mask errors and,

As I said elsewhere in the msg, I have never seen this "error" do anything
except irritate a user whose intent was the utterly obvious one (i.e.,
convert the object to a string, then catenate it).

> even worse, "work around" coercion, which is not a good idea, IMHO.
> Note that the need to coerce to Unicode was the reason why the
> implicit str() in " ".join() was removed from Barry's original string
> methods implementation.

I'm hoping that in P3K we have only one string type, and then the ambiguity
goes away.  In the meantime, it's a good reason to drop Unicode support
<snicker>.

> space.join(map(str,seq)) is much clearer in this respect: it
> forces the user to think about what the join should do with non-
> string types.

They're producing a string; they want join to turn the pieces into strings;
it's a no-brainer unless join is hypergeneralized into terminal obscurity
(like, indeed, Python's "reduce").

simple-tools-for-tedious-little-tasks-ly y'rs  - tim




From tim_one@email.msn.com  Thu May 18 05:34:11 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:11 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <39219B01.A4EE0920@tismer.com>
Message-ID: <000201bfc082$50909f80$6c2d153f@tim>


[Christian Tismer]
> ...
> After all, it is no surprize. They are right.
> If we have to change their mind in order to understand
> a basic operation, then we are wrong, not they.

[Tim]
> Huh!  I would not have guessed that you'd give up on Stackless
> that easily <wink>.

[Chris]
> Noh, I didn't give up Stackless, but fishing for soles.
> After Just v. R. has become my most ambitious user,
> I'm happy enough.

I suspect you missed the point:  Stackless is the *ultimate* exercise in
"changing their mind in order to understand a basic operation".  I was
tweaking you, just as you're tweaking me <smile!>.

> It is absolutely phantastic.
> The most uninteresting stuff in the join is the separator,
> and it has the power to merge thousands of strings
> together, without asking the sequence at all
>  - give all power to the suppressed, long live the Python anarchy :-)

Exactly!  Just as love has the power to bind thousands of incompatible
humans without asking them either:  a vote for space.join() is a vote for
peace on earth.

while-a-generic-join-builtin-is-a-vote-for-war<wink>-ly y'rs  - tim




From tim_one@email.msn.com  Thu May 18 05:34:17 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:17 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000001bfbe40$07f14520$b82d153f@tim>
Message-ID: <000401bfc082$54211940$6c2d153f@tim>

Just a brief note on the little list-grower I posted.  Upon more digging
this doesn't appear to have any relation to Dragon's Win98 headaches, so I
haven't looked at it much more.  Two data points:

1. Gordon McM and I both tried it under NT 4 systems (thanks, G!), and
   those are the only Windows platforms under which no MemoryError is
   raised.  But the runtime behavior is very clearly quadratic-time (in
   the ultimate length of the list) under NT.

2. Win98 comes with very few diagnostic tools useful at this level.  The
   Python process does *not* grow to an unreasonable size.  However, using
   a freeware heap walker I quickly determined that Python sprays
   data *all over* its entire 2Gb virtual heap space while running this
   thing, and then the memory error occurs.  The dump file for the system
   heap memory blocks (just listing the start address, length, & status of
   each block) is about 128Kb and I haven't had time to analyze it.  It's
   clearly terribly fragmented, though.  The mystery here is why Win98
   isn't coalescing all the gazillions of free areas to come up with a big-
   enough contiguous chunk to satisfy the request (according to me <wink>,
   the program doesn't create any long-lived data other than the list --
   it appends "1" each time, and uses xrange).

Dragon's Win98 woes appear due to something else:  right after a Win98
system w/ 64Mb RAM is booted, about half the memory is already locked (not
just committed)!  Dragon's product needs more than the remaining 32Mb to
avoid thrashing.  Even stranger, killing every process after booting
releases an insignificant amount of that locked memory.  Strange too, on my
Win98 w/ 160Mb of RAM, upon booting Win98 a massive 50Mb is locked.  This is
insane, and we haven't been able to figure out on whose behalf all this
memory is being allocated.

personally-like-win98-a-lot-but-then-i-bought-a-lot-of-ram-ly y'rs
    - tim




From Moshe Zadka <moshez@math.huji.ac.il>  Thu May 18 06:36:09 2000
From: Moshe Zadka <moshez@math.huji.ac.il> (Moshe Zadka)
Date: Thu, 18 May 2000 08:36:09 +0300 (IDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
In-Reply-To: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <Pine.GSO.4.10.10005180827490.14709-100000@sundial>

[Tim Peters, on sequenceness of strings]
>     for ch in string:
>         muck w/ the character ch
> 
> is a common idiom.

Hmmmm...if you add a new method,

for ch in string.as_sequence():
	muck w/ the character ch

You'd solve this.

But you won't manage to convince me that you haven't used things like

string[3:5]+string[6:] to get all the characters that...

The real problem (as I see it, from my very strange POV) is that Python
uses strings for two distinct purposes:

1 -- Symbols
2 -- Arrays of characters

"Symbols" are ``run-time representation of identifiers''. For example,
getattr's "prototype" "should be"

getattr(object, symbol, object=None)

While re's search method should be

re_object.search(string)

Of course, there are symbol->string and string->symbol functions, just as
there are list->tuple and tuple->list functions. 

BTW, this would also solve problems if you want to go case-insensitive in
Py3K: == is case-sensitive on strings, but case-insensitive on symbols.
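
A rough sketch of what such a symbol type could look like (every name
here is hypothetical):

    class Symbol:
        def __init__(self, name):
            self._name = name
        def __cmp__(self, other):
            # unlike strings, symbols compare case-insensitively
            return cmp(self._name.lower(), other._name.lower())
        def __hash__(self):
            return hash(self._name.lower())

    def symbol(s):                  # string -> symbol
        return Symbol(s)

    def name(sym):                  # symbol -> string
        return sym._name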

i've-got-this-on-my-chest-since-the-python-conference-and-it-was-a-
  good-opportunity-to-get-it-off-ly y'rs, Z.
--
Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com



From ping@lfw.org  Thu May 18 05:37:42 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Wed, 17 May 2000 21:37:42 -0700 (PDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
In-Reply-To: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <Pine.LNX.4.10.10005172133490.775-100000@skuld.lfw.org>

On Thu, 18 May 2000, Tim Peters wrote:
> There may be some hope in that the "for/in" protocol is now conflated with
> the __getitem__ protocol, so if Python grows a more general iteration
> protocol, perhaps we could back away from the sequenceness of strings
> without harming "for" iteration over the characters ...

But there's no way we can back away from

    spam = eggs[hack:chop] + ham[slice:dice]

on strings.  It's just too ideal.

Perhaps eventually the answer will be a character type?

Or perhaps no change at all.  I've not had the pleasure of running
into these problems with characters-being-strings before, even though
your survey of the various gotchas now makes that kind of surprising.


-- ?!ng

"Happiness isn't something you experience; it's something you remember."
    -- Oscar Levant



From mal@lemburg.com  Thu May 18 10:43:57 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 May 2000 11:43:57 +0200
Subject: [Python-Dev] join() et al.
References: <000001bfc082$4d9d5020$6c2d153f@tim>
Message-ID: <3923BB5D.47A28CBE@lemburg.com>

Tim Peters wrote:
> 
> [M.-A. Lemburg]
> > ...
> > Uhm, aren't we discussing a generic sequence join API here ?
> 
> It depends on whether your "we" includes me <wink>.
> 
> > Well, in that case I'd still be interested in hearing about
> > your thoughts so that I can integrate such a beast in mxTools.
> > The acceptance level needed for doing that is much lower than
> > for the core builtins ;-)
> 
> Heh heh.  Python already has a generic sequence join API, called "reduce".
> What else do you want beyond that?  There's nothing else I want, and I don't
> even want reduce <0.9 wink>.  You can mine any modern Lisp, or any ancient
> APL, for more of this ilk.  NumPy has some use for stuff like this, but
> effective schemes require dealing with multiple dimensions intelligently,
> and then you're in the proper domain of matrices rather than sequences.

The idea behind a generic join() API was that it could be
used to make algorithms dealing with sequences polymorphic --
but you're right: this goal is probably too far-fetched.

> > >  That said,
> > >
> > >     space.join((eggs, bacon, toast))
> > >
> > > should <wink> produce
> > >
> > >     str(egg) + space + str(bacon) + space + str(toast)
> > >
> > > although how Unicode should fit into all this was never clear to me.
> 
> > But that would mask errors and,
> 
> As I said elsewhere in the msg, I have never seen this "error" do anything
> except irritate a user whose intent was the utterly obvious one (i.e.,
> convert the object to a string, then catenate it).
> 
> > even worse, "work around" coercion, which is not a good idea, IMHO.
> > Note that the need to coerce to Unicode was the reason why the
> > implicit str() in " ".join() was removed from Barry's original string
> > methods implementation.
> 
> I'm hoping that in P3K we have only one string type, and then the ambiguity
> goes away.  In the meantime, it's a good reason to drop Unicode support
> <snicker>.

I'm hoping for that too... it should be Unicode everywhere if you
ask me.

In the meantime we can test drive this goal using the -U command
line option: it turns "" into u"" without any source code change.
The fun part about this is that running python in -U mode
reveals quite a few places where the standard lib doesn't handle
Unicode properly, so there's a lot of work ahead...

> > space.join(map(str,seq)) is much clearer in this respect: it
> > forces the user to think about what the join should do with non-
> > string types.
> 
> They're producing a string; they want join to turn the pieces into strings;
> it's a no-brainer unless join is hypergeneralized into terminal obscurity
> (like, indeed, Python's "reduce").

Hmm, the Unicode implementation does these implicit
conversions during coercion and you've all seen the success...
are you sure you want more of this ? 

We could have "".join() apply str() for all objects *except* Unicode.
1 + "2" == "12" would also be an option, or maybe 1 + "2" == 3 ? ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From jack@oratrix.nl  Thu May 18 11:01:16 2000
From: jack@oratrix.nl (Jack Jansen)
Date: Thu, 18 May 2000 12:01:16 +0200
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: Message by Trent Mick <trentm@activestate.com> ,
 Mon, 15 May 2000 14:09:58 -0700 , <20000515140958.C20418@activestate.com>
Message-ID: <20000518100116.F06AB370CF2@snelboot.oratrix.nl>

> I broke it with my patches to test overflow for some of the PyArg_Parse*()
> formatting characters. The upshot of testing for overflow is that now those
> formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> unsigned-ness as appropriate (you have to know if the value is signed or
> unsigned to know what limits to check against for overflow). Two
> possibilities presented themselves:

I think this is a _very_ bad idea. I have a few thousand (literally) routines 
calling Macintosh system calls that use "h" for 16-bit flag-word values, 
and the constants are all of the form

kDoSomething = 0x0001
kDoSomethingElse = 0x0002
...
kDoSomethingEvenMoreBrilliant = 0x8000

I'm pretty sure other operating systems have lots of calls with similar 
problems. I would strongly suggest using a new format char if you want 
overflow-tested integers.
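
The symptom at the Python level, using the array module from the subject
line (0x8000 is the flag value above):

    import array
    a = array.array('h')    # signed 16-bit integers
    a.append(0x7fff)        # SHRT_MAX: still fine
    a.append(0x8000)        # 32768 > SHRT_MAX: OverflowError once "h"
                            # enforces signedness, even though it is a
                            # perfectly good 16-bit flag pattern
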
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 




From trentm@activestate.com  Thu May 18 17:56:47 2000
From: trentm@activestate.com (Trent Mick)
Date: Thu, 18 May 2000 09:56:47 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <20000518100116.F06AB370CF2@snelboot.oratrix.nl>
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl>
Message-ID: <20000518095647.D32135@activestate.com>

On Thu, May 18, 2000 at 12:01:16PM +0200, Jack Jansen wrote:
> > I broke it with my patches to test overflow for some of the PyArg_Parse*()
> > formatting characters. The upshot of testing for overflow is that now those
> > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> > unsigned-ness as appropriate (you have to know if the value is signed or
> > unsigned to know what limits to check against for overflow). Two
> > possibilities presented themselves:
> 
> I think this is a _very_ bad idea. I have a few thousand (literally) routines 
> calling Macintosh system calls that use "h" for 16-bit flag-word values, 
> and the constants are all of the form
> 
> kDoSomething = 0x0001
> kDoSomethingElse = 0x0002
> ...
> kDoSomethingEvenMoreBrilliant = 0x8000
> 
> I'm pretty sure other operating systems have lots of calls with similar 
> problems. I would strongly suggest using a new format char if you want 
> overflow-tested integers.

Sigh. What do you think Guido? This is your call.

1. go back to no bounds testing
2. bounds check for [SHRT_MIN, USHRT_MAX] etc (this would allow signed and
unsigned values but is sort of false security for bounds checking)
3. keep it the way it is: 'b' is unsigned and the rest are signed
4. add new format characters or a modifying character for signed and unsigned
versions of these.

Trent

-- 
Trent Mick
trentm@activestate.com


From guido@python.org  Thu May 18 23:05:45 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 18 May 2000 15:05:45 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: Your message of "Thu, 18 May 2000 09:56:47 PDT."
 <20000518095647.D32135@activestate.com>
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl>
 <20000518095647.D32135@activestate.com>
Message-ID: <200005182205.PAA12830@cj20424-a.reston1.va.home.com>

> On Thu, May 18, 2000 at 12:01:16PM +0200, Jack Jansen wrote:
> > > I broke it with my patches to test overflow for some of the PyArg_Parse*()
> > > formatting characters. The upshot of testing for overflow is that now those
> > > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> > > unsigned-ness as appropriate (you have to know if the value is signed or
> > > unsigned to know what limits to check against for overflow). Two
> > > possibilities presented themselves:
> > 
> > I think this is a _very_ bad idea. I have a few thousand (literally) routines 
> > calling Macintosh system calls that use "h" for 16-bit flag-word values, 
> > and the constants are all of the form
> > 
> > kDoSomething = 0x0001
> > kDoSomethingElse = 0x0002
> > ...
> > kDoSomethingEvenMoreBrilliant = 0x8000
> > 
> > I'm pretty sure other operating systems have lots of calls with similar 
> > problems. I would strongly suggest using a new format char if you want 
> > overflow-tested integers.
> 
> Sigh. What do you think Guido? This is your call.
> 
> 1. go back to no bounds testing
> 2. bounds check for [SHRT_MIN, USHRT_MAX] etc (this would allow signed and
> unsigned values but is sort of false security for bounds checking)
> 3. keep it the way it is: 'b' is unsigned and the rest are signed
> 4. add new format characters or a modifying character for signed and unsigned
> versions of these.

Sigh indeed.  Ideally, we'd introduce H for unsigned and then lock
Jack in a room with his Macintosh computer for 48 hours to fix all his
code...

Jack, what do you think?  Is this acceptable?  (I don't know if you're
still into S&M :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From trentm@activestate.com  Thu May 18 21:38:59 2000
From: trentm@activestate.com (Trent Mick)
Date: Thu, 18 May 2000 13:38:59 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <200005182249.PAA13020@cj20424-a.reston1.va.home.com>
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl> <20000518095647.D32135@activestate.com> <200005182205.PAA12830@cj20424-a.reston1.va.home.com> <20000518121723.A3252@activestate.com> <200005182225.PAA12950@cj20424-a.reston1.va.home.com> <20000518123029.A3330@activestate.com> <200005182249.PAA13020@cj20424-a.reston1.va.home.com>
Message-ID: <20000518133859.A3665@activestate.com>

On Thu, May 18, 2000 at 03:49:59PM -0700, Guido van Rossum wrote:
> 
> Maybe we can come up with a modifier for signed or unsigned range
> checking?

Ha! How about 'u'? :) Or 's'? :)

I really can't think of a nice answer for this. Could introduce completely
separate formatter characters that do the range checking and remove range
checking from the current formatters. That is an ugly kludge. Could introduce
a separate PyArg_CheckedParse*() or something like that and slowly migrate to
it. This one could use something other than "L" for LONG_LONG.

I think the long term solution should be:
 - have bounds-checked signed and unsigned version of all the integral types
 - call them i/I, b/B, etc. (a la array module)
 - use something other than "L" for LONG_LONG (as you said, q/Q maybe)

The problem is to find a satisfactory migratory path to that.

Sorry, I don't have an answer. Just more questions.

Trent


p.s. If you were going to check in my associated patch: there is a problem with
the tab usage in test_array.py, so I will resubmit it soon (in a couple of days).

-- 
Trent Mick
trentm@activestate.com


From guido@python.org  Fri May 19 16:06:52 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 19 May 2000 08:06:52 -0700
Subject: [Python-Dev] repr vs. str and locales again
Message-ID: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>

The email below suggests a simple solution to a problem that
e.g. François Pinard brought up long ago; repr() of a string turns
all non-ASCII chars into \oct escapes.  Jyrki's solution: use
isprint(), which makes it locale-dependent.  I can live with this.

It needs a Py_CHARMASK() call but otherwise seems to be fine.

Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
similar patch for unicode strings (once the ASCII proposal is
implemented).

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Fri, 19 May 2000 10:48:29 +0300
From:    Jyrki Kuoppala <jkp@kaapeli.fi>
To:      guido@python.org
Subject: python bug?: python 1.5.2 fails to print printable 8-bit characters in
	   strings

I'm not sure if this exactly is a bug, ie. whether python 1.5.2 is
supposed to support locales and 8-bit characters.  However, on Linux
Debian "unstable" distribution the diff below makes python 1.5.2
handle printable 8-bit characters as one would expect.

Problem description:

python doesn't properly print printable 8-bit characters for the current locale.

Details:

With no locale set, 8-bit characters in quoted strings print as
backslash-escapes, which I guess is OK:

$ unset LC_ALL
$ python
Python 1.5.2 (#0, Apr  3 2000, 14:46:48)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'k\344\344k')
>>>

But with a locale with a printable 'ä' character (octal 344) I get:

$ export LC_ALL=fi_FI
$ python
Python 1.5.2 (#0, Apr  3 2000, 14:46:48)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'k\344\344k')
>>>

I should be getting (output from python patched with the enclosed patch):

$ export LC_ALL=fi_FI
$ python
Python 1.5.2 (#0, May 18 2000, 14:43:46)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'kääk')
>>>                              

This hits for example when Zope with squishdot weblog (squishdot
0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles -
strings with valid Latin1 characters get indexed as backslash-escaped
octal codes, and thus become unsearchable.

I am using debian unstable, kernels 2.2.15pre10 and 2.0.36, libc 2.1.3.

I suggest that the test for printability in python-1.5.2
/Objects/stringobject.c be fixed to use isprint() which takes the
locale into account:

--- python-1.5.2/Objects/stringobject.c.orig	Thu Oct  8 05:17:48 1998
+++ python-1.5.2/Objects/stringobject.c	Thu May 18 14:36:28 2000
@@ -224,7 +224,7 @@
 		c = op->ob_sval[i];
 		if (c == quote || c == '\\')
 			fprintf(fp, "\\%c", c);
-		else if (c < ' ' || c >= 0177)
+		else if (! isprint (c))
 			fprintf(fp, "\\%03o", c & 0377);
 		else
 			fputc(c, fp);
@@ -260,7 +260,7 @@
 			c = op->ob_sval[i];
 			if (c == quote || c == '\\')
 				*p++ = '\\', *p++ = c;
-			else if (c < ' ' || c >= 0177) {
+			else if (! isprint (c)) {
 				sprintf(p, "\\%03o", c & 0377);
 				while (*p != '\0')
 					p++;



//Jyrki

------- End of Forwarded Message



From guido@python.org  Fri May 19 16:13:01 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 19 May 2000 08:13:01 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: Your message of "Fri, 19 May 2000 11:25:43 +0200."
 <39250897.6F42@cnet.francetelecom.fr>
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org>
 <39250897.6F42@cnet.francetelecom.fr>
Message-ID: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>

[Quoting the entire mail because I've added python-dev to the cc:
list]

> Subject: Re: Python multiplexing is too hard (was: Network statistics program)
> From: Alexandre Ferrieux <alexandre.ferrieux@cnet.francetelecom.fr>
> To: Guido van Rossum <guido@python.org>
> Cc: claird@starbase.neosoft.com
> Date: Fri, 19 May 2000 11:25:43 +0200
> Delivery-Date: Fri May 19 05:26:59 2000
> 
> Guido van Rossum wrote:
> > 
> > Cameron Laird wrote:
> > >                    .
> > > Right.  asyncore is nice--but restricted to socket
> > > connections.  For many applications, that's not a
> > > restriction at all.  However, it'd be nice to have
> > > such a handy interface for communication with
> > > same-host processes; that's why I mentioned popen*().
> > > Does no one else perceive a gap there, in convenient
> > > asynchronous piped IPC?  Do folks just fall back on
> > > select() for this case?
> > 
> > Hm, really?  For same-host processes, threads would
> > do the job nicely I'd say.
> 
> Overkill.
> 
> >  Or you could probably
> > use unix domain sockets (popen only really works on
> > Unix, so that's not much of a restriction).
> 
> Overkill.
> 
> > Also note that often this is needed in the context
> > of a GUI app; there something integrated in the GUI
> > main loop is recommended.  (E.g. the file events that
> > Moshe mentioned.)
> 
> Okay so your answer is, The Python Way of doing it is to use Tcl.
> That's pretty disappointing, I'm sorry to say...
> 
> Consider:
> 
> 	- In Tcl, as you said, this is nicely integrated with the GUI's 
> 	  event queue:
> 		- on unix, by a an additional bit on X's fd (socket) in 
> 		  the select()
> 		- on 'doze, everything is brought back to messages 
> 		  anyway.
> 
> 	And, in both cases, it works with pipes, sockets, serial or other
> devices. Uniform, clean.
> 
> 	- In python "popen only really works on Unix": are you satisfied with
> that state of affairs ? I understand (and value) Python's focus on
> algorithms and data structures, and worming around OS misgivings is a
> boring, ancillary task. But what about the potential gain ?
> 
> I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
> just so beautiful inside. But while Tcl is weaker in the algorithms, it
> is stronger in the os-wrapping library, and taught me to love high-level
> abstractions. [fileevent] shines in this respect, and I'll miss it in
> Python.
> 		
> -Alex

Alex, it's disappointing to me too!  There just isn't anything
currently in the library to do this, and I haven't written apps that
need this often enough to have a good feel for what kind of
abstraction is needed.

However perhaps we can come up with a design for something better?  Do
you have a suggestion here?

I agree with your comment that higher-level abstractions around OS
stuff are needed -- I learned system programming long ago, in C, and
I'm "happy enough" with the current state of affairs, but I agree that
for many people this is a problem, and there's no reason why Python
couldn't do better...
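
For reference, the select()-based fallback mentioned above looks roughly
like this on Unix (the command is illustrative):

    import os, select, sys
    # poll a child's output without blocking the rest of the program
    pipe = os.popen("some_long_running_command", "r")
    fd = pipe.fileno()
    while 1:
        ready, _, _ = select.select([fd], [], [], 1.0)   # 1-second timeout
        if ready:
            chunk = os.read(fd, 4096)
            if not chunk:
                break               # child closed its end of the pipe
            sys.stdout.write(chunk)
    pipe.close()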

--Guido van Rossum (home page: http://www.python.org/~guido/)


From fredrik@pythonware.com  Fri May 19 13:44:55 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 19 May 2000 14:44:55 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <002f01bfc190$09870c00$0500a8c0@secret.pythonware.com>

Guido van Rossum wrote:
> Jyrki's solution: use isprint(), which makes it locale-dependent.
> I can live with this.
>
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
>
> Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
> similar patch for unicode strings (once the ASCII proposal is
> implemented).

does ctype-related locale stuff really mix well with unicode?

if yes, -0. if no, +0.

(intuitively, I'd say no -- deprecate in 1.6, remove in 1.7)

(btw, what about "eval(repr(s)) == s" ?)

</F>



From mal@lemburg.com  Fri May 19 13:30:08 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 19 May 2000 14:30:08 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <392533D0.965E47E4@lemburg.com>

Guido van Rossum wrote:
> 
> The email below suggests a simple solution to a problem that
> e.g. François Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.
> 
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
> 
> Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
> similar patch for unicode strings (once the ASCII proposal is
> implemented).

The subject line is a bit misleading: the patch only touches
tp_print, not repr() output. And this is good, IMHO, since
otherwise eval(repr(string)) wouldn't necessarily result
in string.

Unicode objects don't implement a tp_print slot... perhaps
they should ?

--

About the ASCII proposal:

Would you be satisfied with what

import sys
sys.set_string_encoding('ascii')

currently implements ?

There are several places where an encoding comes into play with
the Unicode implementation. The above API currently changes
str(unicode), print unicode and the assumption made by the
implementation during coercion of strings to Unicode.

It does not change the encoding used to implement the "s"
or "t" parser markers and also doesn't change the way the
Unicode hash value is computed (these are currently still
hard-coded as UTF-8).
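
For concreteness, the effect described above looks roughly like this
(behaviour as described; the exact exception raised is an assumption):

    import sys
    sys.set_string_encoding('ascii')
    str(u"abc")     # fine: pure ASCII converts cleanly
    str(u"\xe4")    # would now fail instead of silently producing UTF-8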

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From gward@mems-exchange.org  Fri May 19 13:45:12 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Fri, 19 May 2000 08:45:12 -0400
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>; from guido@python.org on Fri, May 19, 2000 at 08:06:52AM -0700
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <20000519084511.A14717@mems-exchange.org>

On 19 May 2000, Guido van Rossum said:
> The email below suggests a simple solution to a problem that
> e.g. François Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

For "ASCII" strings in this day and age -- which are often not
necessarily plain ol' 7-bit ASCII -- I'd say that "32 <= c <= 127" is
not the right way to determine printability.  'isprint()' seems much
more appropriate to me.

Are there other areas of Python that should be locale-sensitive but
aren't?  A minor objection to this patch is that it's a creeping change
that brings in a little bit of locale-sensitivity without addressing a
(possibly) wider problem.  However, I will immediately shoot down my own
objection on the grounds that if we try to fix everything all at once,
then nothing will ever get fixed.  Locale sensitivity strikes me as the
sort of thing that *can* be a "creeping" change -- just fix the bits
that bug people most, and eventually all the important bits will be
fixed.

I have no expertise and therefore no opinion on such a change for
Unicode strings.

        Greg


From pf@artcom-gmbh.de  Fri May 19 13:44:00 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 19 May 2000 14:44:00 +0200 (MEST)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 19, 2000  8: 6:52 am"
Message-ID: <m12sm8G-000CnCC@artcom0.artcom-gmbh.de>

Guido van Rossum asks:
> The email below suggests a simple solution to a problem that
> e.g. François Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

How portable is the locale awareness property of isprint() among
traditional Unix environments, WinXX and MacOS?  This works fine on
my favorite development platform (Linux), but an accidental use of
this new 'feature' might hurt the portability of my Python apps to
other platforms.  If isprint() honors the locale in a similar way
on other important platforms I would like this.  Otherwise I would
prefer the current behaviour so that I can deal with it during the
early stages of development on my Linux boxes.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From bwarsaw@python.org  Fri May 19 19:51:23 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Fri, 19 May 2000 11:51:23 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
 <20000519084511.A14717@mems-exchange.org>
Message-ID: <14629.36139.735410.272339@localhost.localdomain>

>>>>> "GW" == Greg Ward <gward@mems-exchange.org> writes:

    GW> Locale sensitivity strikes me as the sort of thing that *can*
    GW> be a "creeping" change -- just fix the bits that bug people
    GW> most, and eventually all the important bits will be fixed.

Another decidedly ignorant Anglophone here, but one problem that I see
with localizing stuff is that locale is app- (or at least thread-)
global, isn't it?  That would suck for applications like Mailman which
are (going to be) multilingual in the sense that a single instance of
the application will serve up documents in many languages, as opposed
to serving up documents in just one of a choice of languages.

If it seems I don't know what I'm talking about, you're probably
right.  I just wanted to point out that there are applications that have to
deal with many languages at the same time.
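
For what it's worth, the process-global nature of locale is easy to see
(the locale name is from Jyrki's example and has to be installed for the
call to succeed):

    import locale
    # setlocale() changes state for the whole process -- every thread and
    # every subsequent string operation -- not just for the caller
    locale.setlocale(locale.LC_ALL, "fi_FI")
    print locale.setlocale(locale.LC_ALL)   # query form: report the setting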

-Barry



From effbot@telia.com  Fri May 19 17:46:39 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Fri, 19 May 2000 18:46:39 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com><20000519084511.A14717@mems-exchange.org> <14629.36139.735410.272339@localhost.localdomain>
Message-ID: <00e001bfc1b1$d0c1d7c0$34aab5d4@hagrid>

Barry Warsaw wrote:
> Another decidedly ignorant Anglophone here, but one problem that I see
> with localizing stuff is that locale is app- (or at least thread-)
> global, isn't it?  That would suck for applications like Mailman which
> are (going to be) multilingual in the sense that a single instance of
> the application will serve up documents in many languages, as opposed
> to serving up documents in just one of a choice of languages.
> 
> If it seems I don't know what I'm talking about, you're probably
> right.  I just wanted to point out that there are applications that have to
> deal with many languages at the same time.

Applications may also have to deal with output devices (i.e. GUI
toolkits, printers, communication links) that don't necessarily have
the same restrictions as the "default console".

better do it the right way: deal with encodings at the boundaries,
not inside the application.

</F>



From gward@mems-exchange.org  Fri May 19 18:03:18 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Fri, 19 May 2000 13:03:18 -0400
Subject: [Python-Dev] Dynamic linking problem on Solaris
Message-ID: <20000519130317.A16111@mems-exchange.org>

Hi all --

interesting problem with building Robin Dunn's extension for BSD DB 2.x
as a shared object on Solaris 2.6 for Python 1.5.2 with GCC 2.8.1 and
Sun's linker.  (Yes, all of those things seem to matter.)

DB 2.x (well, at least 2.7.7) contains this line of C code:

    *mbytesp = sb.st_size / MEGABYTE;

where 'sb' is a 'struct stat' -- ie. 'sb.st_size' is a long long, which
I believe is 64 bits on Solaris.  Anyways, GCC compiles this division
into a subroutine call -- I guess the SPARC doesn't have a 64-bit
divide, or if it does then GCC doesn't know about it.

Of course, the subroutine in question -- '__cmpdi2' -- is defined in
libgcc.a.  So if you write a C application that uses BSD DB 2.x, and
compile and link it with GCC, no problem -- everything is controlled by
GCC, so libgcc.a gets linked in at the appropriate time, the linker
finds '__cmpdi2' and includes it in your binary executable, and
everything works.

However, if you're building a Python extension that uses BSD DB 2.x,
there's a problem: the default command for creating a shared extension
on Solaris is "ld -G" -- this is in Python's Makefile, so it affects
extension building with either Makefile.pre.in or the Distutils.

However, since "ld" is Sun's "ld", it doesn't know anything about
libgcc.a.  And, since presumably no 64-bit division is done in Python
itself, '__cmpdi2' isn't already present in the Python binary.  The
result: when you attempt to load the extension, you die:

  $ python -c "import dbc"
  Traceback (innermost last):
    File "<string>", line 1, in ?
  ImportError: ld.so.1: python: fatal: relocation error: file ./dbcmodule.so: symbol __cmpdi2: referenced symbol not found

The workaround turns out to be fairly easy, and there are actually two
of them.  First, add libgcc.a to the link command, ie. instead of

  ld -G  db_wrap.o  -L/usr/local/BerkeleyDB/lib -ldb -o dbcmodule.so

use

  ld -G  db_wrap.o  -L/usr/local/BerkeleyDB/lib -ldb \
    /depot/gnu/plat/lib/gcc-lib/sparc-sun-solaris2.6/2.8.1/libgcc.a \
    -o dbcmodule.so

(where the location of libgcc.a is variable, but invariably hairy).  Or,
it turns out that you can just use "gcc -G" to create the extension:

  gcc -G db_wrap.o -ldb -o dbcmodule.so

Seems to me that the latter is a no-brainer.

So the question arises: why is the default command for building
extensions on Solaris "ld -G" instead of "gcc -G"?  I'm inclined to go
edit my installed Makefile to make this permanent... what will that
break?

        Greg
-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367


From bwarsaw@python.org  Fri May 19 21:09:09 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Fri, 19 May 2000 13:09:09 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
 <20000519084511.A14717@mems-exchange.org>
 <14629.36139.735410.272339@localhost.localdomain>
 <00e001bfc1b1$d0c1d7c0$34aab5d4@hagrid>
Message-ID: <14629.40805.180119.929694@localhost.localdomain>

>>>>> "FL" == Fredrik Lundh <effbot@telia.com> writes:

    FL> better do it the right way: deal with encodings at the
    FL> boundaries, not inside the application.

Sounds good to me. :)



From ping@lfw.org  Fri May 19 18:04:18 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Fri, 19 May 2000 10:04:18 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005190957260.2892-100000@localhost>

On Fri, 19 May 2000, Ka-Ping Yee wrote:
> 
> Changing the behaviour of repr() (a function that internally
> converts data into data)

Clarification: what i meant by the above is, repr() is not
explicitly an input or an output function.  It does "some
internal computation".

Here is one alternative:

    repr(obj, **kw): options specified in kw dict
                     
        push each element in kw dict into sys.repr_options
        now do the normal conversion, referring to whatever
            options are relevant (such as "locale" if doing strings)
        for looking up any option, first check kw dict,
            then look for sys.repr_options[option]
        restore sys.repr_options

This is ugly and i still like printon/printout better, but
at least it's a smaller change and won't prevent the implementation
of printon/printout later.

This suggestion is not thread-safe.
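
Fleshed out just enough to run, the idea looks something like this (all
names are illustrative; the thread-safety caveat above still applies):

    import sys
    sys.repr_options = {}

    def repr_with_options(obj, **kw):
        saved = sys.repr_options
        merged = saved.copy()
        merged.update(kw)
        sys.repr_options = merged
        try:
            return repr(obj)    # conversion code would consult sys.repr_options
        finally:
            sys.repr_options = saved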


-- ?!ng

"Simple, yet complex."
    -- Lenore Snell



From ping@lfw.org  Fri May 19 17:56:50 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Fri, 19 May 2000 09:56:50 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>

On Fri, 19 May 2000, Guido van Rossum wrote:
> The email below suggests a simple solution to a problem that
> e.g. François Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

Changing the behaviour of repr() (a function that internally
converts data into data) based on a fixed global system parameter
makes me uncomfortable.  Wouldn't it make more sense for the
locale business to be a property of the stream that the string
is being printed on?

This was the gist of my proposal for files having a printout
method a while ago.  I understand if that proposal is a bit too
much of a change to swallow at once, but i'd like to ensure the
door stays open to let it be possible in the future.

Surely there are other language systems that deal with the
issue of "nicely" printing their own data structures for human
interpretation... anyone have any experience to share?  The
printout/printon thing originally comes from Smalltalk, i believe.

(...which reminds me -- i played with Squeak the other day and
thought to myself, it would be cool to browse and edit code in
Python with a system browser like that.)


Note, however:

> This hits for example when Zope with squishdot weblog (squishdot
> 0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles -
> strings with valid Latin1 characters get indexed as backslash-escaped
> octal codes, and thus become unsearchable.

The above comment in particular strikes me as very fishy.
How on earth can the escaping behaviour of repr() affect the
indexing of text?  Surely when you do a search, you search for
exactly what you asked for.

And does the above mean that, with Jyrki's proposed fix, the
sorting and searching behaviour of Squishdot will suddenly
change, and magically differ from locale to locale?  Is that
something we want?  (That last is not a rhetorical question --
my gut says no, but i don't actually have enough experience
working with these issues to know the answer.)


-- ?!ng

"Simple, yet complex."
    -- Lenore Snell



From mal@lemburg.com  Fri May 19 20:06:24 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 19 May 2000 21:06:24 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>
Message-ID: <392590B0.5CA4F31D@lemburg.com>

Ka-Ping Yee wrote:
> 
> On Fri, 19 May 2000, Guido van Rossum wrote:
> > The email below suggests a simple solution to a problem that
> > e.g. François Pinard brought up long ago; repr() of a string turns
> > all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> > isprint(), which makes it locale-dependent.  I can live with this.
> 
> Changing the behaviour of repr() (a function that internally
> converts data into data) based on a fixed global system parameter
> makes me uncomfortable.  Wouldn't it make more sense for the
> locale business to be a property of the stream that the string
> is being printed on?

Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
string_print API which is used for the tp_print slot, so the
only effect to be seen is when printing a string to a real
file object (tp_print is only used by PyObject_Print() and that
API is only used for writing to real PyFileObjects -- all
other streams get the output of str() or repr()).

Perhaps we should drop tp_print for strings altogether and
let str() and repr() decide what to do... (this is
what Unicode objects do). The only good reason for implementing
tp_print is to write huge amounts of data to a stream without
creating intermediate objects -- not really needed for strings,
since these *are* the intermediate object usually created for
just this purpose ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From jeremy@alum.mit.edu  Sat May 20 01:46:11 2000
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 19 May 2000 17:46:11 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
Message-ID: <14629.57427.9434.623247@localhost.localdomain>

I applied the recent changes to the CVS httplib to Greg's httplib
(call it httplib11) this afternoon.  The result is included below.  I
think this is quite close to checking in, but it could use a slightly
better test suite.

There are a few outstanding questions.

httplib11 does not implement the debuglevel feature.  I don't think
it's important, but it is currently documented and may be used.
Guido, should we implement it?

httplib w/SSL uses a constructor with this prototype:
    def __init__(self, host='', port=None, **x509):
It looks like the x509 dictionary should contain two variables --
key_file and cert_file.  Since we know what the entries are, why not
make them explicit?
    def __init__(self, host='', port=None, cert_file=None, key_file=None):
(Or reverse the two arguments if that is clearer.)

The FakeSocket class in CVS has a comment after the makefile def line
that says "hopefully, never have to write."  It won't do at all the
right thing when called with a write mode, so it ought to raise an
exception.  Any reason it doesn't?

I'd like to add a couple of test cases that use HTTP/1.1 to get some
pages from python.org, including one that uses the chunked encoding.
Just haven't gotten around to it.  Question on that front: Does it
make sense to incorporate the test function in the module with the std
regression test suite?  In general, I would think so.  In this
particular case, the test could fail because of host networking
problems.  I think that's okay as long as the error message is clear
enough. 

Jeremy

"""HTTP/1.1 client library"""

# Written by Greg Stein.

import socket
import string
import mimetools

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

error = 'httplib.error'

HTTP_PORT = 80
HTTPS_PORT = 443

class HTTPResponse(mimetools.Message):
    __super_init = mimetools.Message.__init__
    
    def __init__(self, fp, version, errcode):
        self.__super_init(fp, 0)

        if version == 'HTTP/1.0':
            self.version = 10
        elif version[:7] == 'HTTP/1.':
            self.version = 11 # use HTTP/1.1 code for HTTP/1.x where x>=1
        else:
            raise error, 'unknown HTTP protocol'

        # are we using the chunked-style of transfer encoding?
        tr_enc = self.getheader('transfer-encoding')
        if tr_enc:
            if string.lower(tr_enc) != 'chunked':
                raise error, 'unknown transfer-encoding'
            self.chunked = 1
            self.chunk_left = None
        else:
            self.chunked = 0

        # will the connection close at the end of the response?
        conn = self.getheader('connection')
        if conn:
            conn = string.lower(conn)
            # a "Connection: close" will always close the
            # connection. if we don't see that and this is not
            # HTTP/1.1, then the connection will close unless we see a
            # Keep-Alive header. 
            self.will_close = string.find(conn, 'close') != -1 or \
                              ( self.version != 11 and \
                                not self.getheader('keep-alive') )
        else:
            # for HTTP/1.1, the connection will always remain open
            # otherwise, it will remain open IFF we see a Keep-Alive header
            self.will_close = self.version != 11 and \
                              not self.getheader('keep-alive')

        # do we have a Content-Length?
        # NOTE: RFC 2616, S4.4, #3 says we ignore this if tr_enc is "chunked"
        length = self.getheader('content-length')
        if length and not self.chunked:
            self.length = int(length)
        else:
            self.length = None

        # does the body have a fixed length? (of zero)
        if (errcode == 204 or               # No Content
            errcode == 304 or               # Not Modified
            100 <= errcode < 200):          # 1xx codes
            self.length = 0

        # if the connection remains open, and we aren't using chunked, and
        # a content-length was not provided, then assume that the connection
        # WILL close.
        if not self.will_close and \
           not self.chunked and \
           self.length is None:
            self.will_close = 1

        # if there is no body, then close NOW. read() may never be
        # called, thus we will never mark self as closed.
        if self.length == 0:
            self.close()

    def close(self):
        if self.fp:
            self.fp.close()
            self.fp = None

    def isclosed(self):
        # NOTE: it is possible that we will not ever call self.close(). This
        #       case occurs when will_close is TRUE, length is None, and we
        #       read up to the last byte, but NOT past it.
        #
        # IMPLIES: if will_close is FALSE, then self.close() will ALWAYS be
        #          called, meaning self.isclosed() is meaningful.
        return self.fp is None

    def read(self, amt=None):
        if self.fp is None:
            return ''

        if self.chunked:
            chunk_left = self.chunk_left
            value = ''
            while 1:
                if chunk_left is None:
                    line = self.fp.readline()
                    i = string.find(line, ';')
                    if i >= 0:
                        line = line[:i]     # strip chunk-extensions
                    chunk_left = string.atoi(line, 16)
                    if chunk_left == 0:
                        break
                if amt is None:
                    value = value + self.fp.read(chunk_left)
                elif amt < chunk_left:
                    value = value + self.fp.read(amt)
                    self.chunk_left = chunk_left - amt
                    return value
                elif amt == chunk_left:
                    value = value + self.fp.read(amt)
                    self.fp.read(2)    # toss the CRLF at the end of the chunk
                    self.chunk_left = None
                    return value
                else:
                    value = value + self.fp.read(chunk_left)
                    amt = amt - chunk_left

                # we read the whole chunk, get another
                self.fp.read(2)        # toss the CRLF at the end of the chunk
                chunk_left = None

            # read and discard trailer up to the CRLF terminator
            ### note: we shouldn't have any trailers!
            while 1:
                line = self.fp.readline()
                if line == '\r\n':
                    break

            # we read everything; close the "file"
            self.close()

            return value

        elif amt is None:
            # unbounded read
            if self.will_close:
                s = self.fp.read()
            else:
                s = self.fp.read(self.length)
            self.close()      # we read everything
            return s

        if self.length is not None:
            if amt > self.length:
                # clip the read to the "end of response"
                amt = self.length
            self.length = self.length - amt

        s = self.fp.read(amt)

        # close our "file" if we know we should
        ### I'm not sure about the len(s) < amt part; we should be
        ### safe because we shouldn't be using non-blocking sockets
        if self.length == 0 or len(s) < amt:
            self.close()

        return s


class HTTPConnection:

    _http_vsn = 11
    _http_vsn_str = 'HTTP/1.1'

    response_class = HTTPResponse
    default_port = HTTP_PORT

    def __init__(self, host, port=None):
        self.sock = None
        self.response = None
        self._set_hostport(host, port)

    def _set_hostport(self, host, port):
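        # accept a "host:port" string; split off the port if one is given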
        if port is None:
            i = string.find(host, ':')
            if i >= 0:
                port = int(host[i+1:])
                host = host[:i]
            else:
                port = self.default_port
        self.host = host
        self.port = port
        self.addr = host, port

    def connect(self):
        """Connect to the host and port specified in __init__."""
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect(self.addr)

    def close(self):
        """Close the connection to the HTTP server."""
        if self.sock:
            self.sock.close() # close it manually... there may be other refs
            self.sock = None
        if self.response:
            self.response.close()
            self.response = None

    def send(self, str):
        """Send `str' to the server."""
        if self.sock is None:
            self.connect()

        # send the data to the server. if we get a broken pipe, then close
        # the socket. we want to reconnect when somebody tries to send again.
        #
        # NOTE: we DO propagate the error, though, because we cannot simply
        #       ignore the error... the caller will know if they can retry.
        try:
            self.sock.send(str)
        except socket.error, v:
            if v[0] == 32:    # Broken pipe
                self.close()
            raise

    def putrequest(self, method, url):
        """Send a request to the server.

        `method' specifies an HTTP request method, e.g. 'GET'.
        `url' specifies the object being requested, e.g.
        '/index.html'.
        """
        if self.response is not None:
            if not self.response.isclosed():
                ### implies half-duplex!
                raise error, 'prior response has not been fully handled'
            self.response = None

        if not url:
            url = '/'
        str = '%s %s %s\r\n' % (method, url, self._http_vsn_str)

        try:
            self.send(str)
        except socket.error, v:
            if v[0] != 32:    # Broken pipe
                raise
            # try one more time (the socket was closed; this will reopen)
            self.send(str)

        self.putheader('Host', self.host)

        if self._http_vsn == 11:
            # Issue some standard headers for better HTTP/1.1 compliance

            # note: we are assuming that clients will not attempt to set these
            #     headers since *this* library must deal with the consequences.
            #     this also means that when the supporting libraries are
            #     updated to recognize other forms, then this code should be
            #     changed (removed or updated).

            # we only want a Content-Encoding of "identity" since we don't
            # support encodings such as x-gzip or x-deflate.
            self.putheader('Accept-Encoding', 'identity')

            # we can accept "chunked" Transfer-Encodings, but no others
            # NOTE: no TE header implies *only* "chunked"
            #self.putheader('TE', 'chunked')

            # if TE is supplied in the header, then it must appear in a
            # Connection header.
            #self.putheader('Connection', 'TE')

        else:
            # For HTTP/1.0, the server will assume "not chunked"
            pass

    def putheader(self, header, value):
        """Send a request header line to the server.

        For example: h.putheader('Accept', 'text/html')
        """
        str = '%s: %s\r\n' % (header, value)
        self.send(str)

    def endheaders(self):
        """Indicate that the last header line has been sent to the server."""

        self.send('\r\n')

    def request(self, method, url, body=None, headers={}):
        """Send a complete request to the server."""

        try:
            self._send_request(method, url, body, headers)
        except socket.error, v:
            if v[0] != 32:    # Broken pipe
                raise
            # try one more time
            self._send_request(method, url, body, headers)

    def _send_request(self, method, url, body, headers):
        self.putrequest(method, url)

        if body:
            self.putheader('Content-Length', str(len(body)))
        for hdr, value in headers.items():
            self.putheader(hdr, value)
        self.endheaders()

        if body:
            self.send(body)

    def getreply(self):
        """Get a reply from the server.

        Returns a tuple consisting of:
        - server response code (e.g. '200' if all goes well)
        - server response string corresponding to response code
        - any RFC822 headers in the response from the server

        """
        file = self.sock.makefile('rb')
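        # read and parse the status line, e.g. 'HTTP/1.1 200 OK'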
        line = file.readline()
        try:
            [ver, code, msg] = string.split(line, None, 2)
        except ValueError:
            try:
                [ver, code] = string.split(line, None, 1)
                msg = ""
            except ValueError:
                self.close()
                return -1, line, file
        if ver[:5] != 'HTTP/':
            self.close()
            return -1, line, file
        errcode = int(code)
        errmsg = string.strip(msg)
        response = self.response_class(file, ver, errcode)
        if response.will_close:
            # this effectively passes the connection to the response
            self.close()
        else:
            # remember this, so we can tell when it is complete
            self.response = response
        return errcode, errmsg, response

class FakeSocket:
    def __init__(self, sock, ssl):
        self.__sock = sock
        self.__ssl = ssl
        return

    def makefile(self, mode):           # hopefully, never have to write
        # XXX add assert about mode != w???
        msgbuf = ""
        while 1:
            try:
                msgbuf = msgbuf + self.__ssl.read()
            except socket.sslerror, msg:
                break
        return StringIO(msgbuf)

    def send(self, stuff, flags = 0):
        return self.__ssl.write(stuff)

    def recv(self, len = 1024, flags = 0):
        return self.__ssl.read(len)

    def __getattr__(self, attr):
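        # delegate everything we don't implement to the underlying socket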
        return getattr(self.__sock, attr)

class HTTPSConnection(HTTPConnection):
    """This class allows communication via SSL."""
    __super_init = HTTPConnection.__init__

    default_port = HTTPS_PORT

    def __init__(self, host, port=None, **x509):
        self.__super_init(host, port)
        self.key_file = x509.get('key_file')
        self.cert_file = x509.get('cert_file')

    def connect(self):
        """Connect to a host onf a given port
        
        Note: This method is automatically invoked by __init__, if a host
        is specified during instantiation.
        """
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect(self.addr)
        ssl = socket.ssl(sock, self.key_file, self.cert_file)
        self.sock = FakeSocket(sock, ssl)

class HTTPMixin:
    """Mixin for compatibility with httplib.py from 1.5.

    Requires that the inheriting class define the following attributes:
    super_init 
    super_connect 
    super_putheader 
    super_getreply
    """

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def connect(self, host=None, port=None):
        "Accept arguments to set the host/port, since the superclass doesn't."
        if host is not None:
            self._set_hostport(host, port)
        self.super_connect()

    def set_debuglevel(self, debuglevel):
        "The class no longer supports the debuglevel."
        pass

    def getfile(self):
        "Provide a getfile, since the superclass' use of HTTP/1.1 prevents it."
        return self.file

    def putheader(self, header, *values):
        "The superclass allows only one value argument."
        self.super_putheader(header, string.joinfields(values,'\r\n\t'))

    def getreply(self):
        "Compensate for an instance attribute shuffling."
        errcode, errmsg, response = self.super_getreply()
        if errcode == -1:
            self.file = response  # response is the "file" when errcode==-1
            self.headers = None
            return -1, errmsg, None

        self.headers = response
        self.file = response.fp
        return errcode, errmsg, response

class HTTP(HTTPMixin, HTTPConnection):
    super_init = HTTPConnection.__init__
    super_connect = HTTPConnection.connect
    super_putheader = HTTPConnection.putheader
    super_getreply = HTTPConnection.getreply

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def __init__(self, host='', port=None):
        "Provide a default host, since the superclass requires one."
        # Note that we may pass an empty string as the host; this will throw
        # an error when we attempt to connect. Presumably, the client code
        # will call connect before then, with a proper host.
        self.super_init(host, port)

class HTTPS(HTTPMixin, HTTPSConnection):
    super_init = HTTPSConnection.__init__
    super_connect = HTTPSConnection.connect
    super_putheader = HTTPSConnection.putheader
    super_getreply = HTTPSConnection.getreply

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def __init__(self, host='', port=None, **x509):
        "Provide a default host, since the superclass requires one."
        # Note that we may pass an empty string as the host; this will throw
        # an error when we attempt to connect. Presumably, the client code
        # will call connect before then, with a proper host.
        self.super_init(host, port, **x509)

def test():
    """Test this module.

    The test consists of retrieving and displaying the Python
    home page, along with the error code and error string returned
    by the www.python.org server.
    """

    import sys
    import getopt
    opts, args = getopt.getopt(sys.argv[1:], 'd')
    dl = 0
    for o, a in opts:
        if o == '-d': dl = dl + 1
    host = 'www.python.org'
    selector = '/'
    if args[0:]: host = args[0]
    if args[1:]: selector = args[1]
    h = HTTP()
    h.set_debuglevel(dl)
    h.connect(host)
    h.putrequest('GET', selector)
    h.endheaders()
    errcode, errmsg, headers = h.getreply()
    print 'errcode =', errcode
    print 'errmsg  =', errmsg
    print
    if headers:
        for header in headers.headers: print string.strip(header)
    print
    print h.getfile().read()

    if hasattr(socket, 'ssl'):
        host = 'www.c2.net'
        hs = HTTPS()
        hs.connect(host)
        hs.putrequest('GET', selector)
        hs.endheaders()
        errcode, errmsg, headers = hs.getreply()
        print 'errcode =', errcode
        print 'errmsg  =', errmsg
        print
        if headers:
            for header in headers.headers: print string.strip(header)
        print
        print hs.getfile().read()

if __name__ == '__main__':
    test()



From claird@starbase.neosoft.com  Fri May 19 23:02:47 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Fri, 19 May 2000 17:02:47 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <200005192202.RAA48753@starbase.neosoft.com>

	From guido@cj20424-a.reston1.va.home.com  Fri May 19 07:26:16 2000
			.
			.
			.
	> Consider:
	> 
	> 	- In Tcl, as you said, this is nicely integrated with the GUI's 
	> 	  event queue:
	> 		- on unix, by a an additional bit on X's fd (socket) in 
	> 		  the select()
	> 		- on 'doze, everything is brought back to messages 
	> 		  anyway.
	> 
	> 	And, in both cases, it works with pipes, sockets, serial or other
	> devices. Uniform, clean.
	> 
	> 	- In python "popen only really works on Unix": are you satisfied with
	> that state of affairs ? I understand (and value) Python's focus on
	> algorithms and data structures, and worming around OS misgivings is a
	> boring, ancillary task. But what about the potential gain ?
	> 
	> I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
	> just so beautiful inside. But while Tcl is weaker in the algorithms, it
	> is stronger in the os-wrapping library, and taught me to love high-level
	> abstractions. [fileevent] shines in this respect, and I'll miss it in
	> Python.
	> 		
	> -Alex

	Alex, it's disappointing to me too!  There just isn't anything
	currently in the library to do this, and I haven't written apps that
	need this often enough to have a good feel for what kind of
	abstraction is needed.

	However perhaps we can come up with a design for something better?  Do
	you have a suggestion here?

	I agree with your comment that higher-level abstractions around OS
	stuff are needed -- I learned system programming long ago, in C, and
	I'm "happy enough" with the current state of affairs, but I agree that
	for many people this is a problem, and there's no reason why Python
	couldn't do better...

	--Guido van Rossum (home page: http://www.python.org/~guido/)
Great questions!  Alex and I are both working
on answers, I think; we're definitely not
ignoring this.  More, in time.

One thing of which I'm certain:  I do NOT like
documentation entries that say things like
"select() doesn't really work except under Unix"
(still true?  Maybe that's been fixed?).  As a
user, I just find that intolerable.  Sufficiently
intolerable that I'll help change the situation?
Well, I'm working on that part now ...


From guido@python.org  Sat May 20 02:19:20 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 19 May 2000 18:19:20 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: Your message of "Fri, 19 May 2000 17:02:47 CDT."
 <200005192202.RAA48753@starbase.neosoft.com>
References: <200005192202.RAA48753@starbase.neosoft.com>
Message-ID: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>

> One thing of which I'm certain:  I do NOT like
> documentation entries that say things like
> "select() doesn't really work except under Unix"
> (still true?  Maybe that's been fixed?).

Hm, that's bogus.  It works well under Windows -- with the restriction
that it only works for sockets, but for sockets it works as well as
on Unix.  It also works well on the Mac.  I wonder where that note
came from (it's probably 6 years old :-).

Fred...?

> As a
> user, I just find that intolerable.  Sufficiently
> intolerable that I'll help change the situation?
> Well, I'm working on that part now ...

--Guido van Rossum (home page: http://www.python.org/~guido/)


From claird@starbase.neosoft.com  Fri May 19 23:37:48 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Fri, 19 May 2000 17:37:48 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>
Message-ID: <200005192237.RAA49766@starbase.neosoft.com>

	From guido@cj20424-a.reston1.va.home.com  Fri May 19 17:32:39 2000
			.
			.
			.
	> One thing of which I'm certain:  I do NOT like
	> documentation entries that say things like
	> "select() doesn't really work except under Unix"
	> (still true?  Maybe that's been fixed?).

	Hm, that's bogus.  It works well under Windows -- with the restriction
	that it only works for sockets, but for sockets it works as well as
	on Unix.  it also works well on the Mac.  I wonder where that note
	came from (it's probably 6 years old :-).

	Fred...?
			.
			.
			.
I sure don't mean to propagate misinformation.
I'll make it more of a habit to forward such
items to Fred as I find them.


From guido@python.org  Sat May 20 02:30:30 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 19 May 2000 18:30:30 -0700
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: Your message of "Fri, 19 May 2000 17:46:11 PDT."
 <14629.57427.9434.623247@localhost.localdomain>
References: <14629.57427.9434.623247@localhost.localdomain>
Message-ID: <200005200130.SAA02265@cj20424-a.reston1.va.home.com>

> I applied the recent changes to the CVS httplib to Greg's httplib
> (call it httplib11) this afternoon.  The result is included below.  I
> think this is quite close to checking in, but it could use a slightly
> better test suite.

Thanks -- but note that I don't have the time to review the code.

> There are a few outstanding questions.
> 
> httplib11 does not implement the debuglevel feature.  I don't think
> it's important, but it is currently documented and may be used.
> Guido, should we implement it?

I think the solution is to provide the API but ignore the call or
argument.

> httplib w/SSL uses a constructor with this prototype:
>     def __init__(self, host='', port=None, **x509):
> It looks like the x509 dictionary should contain two variables --
> key_file and cert_file.  Since we know what the entries are, why not
> make them explicit?
>     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> (Or reverse the two arguments if that is clearer.)

The reason for the **x509 syntax (I think -- I didn't introduce it) is
that it *forces* the user to use keyword args, which is a good thing
for such an advanced feature.  However there should be code that
checks that no other keyword args are present.
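
Such a check could be as small as the following sketch (an illustration
only, using the names from the constructor quoted above):

    def __init__(self, host='', port=None, **x509):
        # complain about anything other than the two documented keyword args
        for key in x509.keys():
            if key not in ('key_file', 'cert_file'):
                raise TypeError, 'unexpected keyword argument: %s' % key
        self.key_file = x509.get('key_file')
        self.cert_file = x509.get('cert_file')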

> The FakeSocket class in CVS has a comment after the makefile def line
> that says "hopefully, never have to write."  It won't do at all the
> right thing when called with a write mode, so it ought to raise an
> exception.  Any reason it doesn't?

Probably laziness of the code.  Thanks for this code review (I guess I
was in a hurry when I checked that code in :-).

> I'd like to add a couple of test cases that use HTTP/1.1 to get some
> pages from python.org, including one that uses the chunked encoding.
> Just haven't gotten around to it.  Question on that front: Does it
> make sense to incorporate the test function in the module with the std
> regression test suite?  In general, I would think so.  In this
> particular case, the test could fail because of host networking
> problems.  I think that's okay as long as the error message is clear
> enough. 

Yes, I agree.  Maybe it should raise ImportError when the network is
unreachable -- this is the one exception that the regrtest module
considers non-fatal.
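
Something along these lines at the top of the test, say (a sketch only;
there is no test_httplib yet, and it assumes the new module lands as
httplib):

    import socket
    from httplib import HTTPConnection

    try:
        conn = HTTPConnection('www.python.org')
        conn.connect()
    except socket.error:
        # regrtest treats ImportError as "test skipped", not as a failure
        raise ImportError, 'network is unreachable'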

--Guido van Rossum (home page: http://www.python.org/~guido/)


From DavidA@ActiveState.com  Fri May 19 23:38:16 2000
From: DavidA@ActiveState.com (David Ascher)
Date: Fri, 19 May 2000 15:38:16 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005192237.RAA49766@starbase.neosoft.com>
Message-ID: <PLEJJNOHDIGGLDPOGPJJGEIPCDAA.DavidA@ActiveState.com>

> 	> One thing of which I'm certain:  I do NOT like
> 	> documentation entries that say things like
> 	> "select() doesn't really work except under Unix"
> 	> (still true?  Maybe that's been fixed?).
>
> 	Hm, that's bogus.  It works well under Windows -- with the
> restriction
> 	that it only works for sockets, but for sockets it works as well as
> 	on Unix.  it also works well on the Mac.  I wonder where that note
> 	came from (it's probably 6 years old :-).

I'm pretty sure I know where it came from -- it came from Sam Rushing's
tutorial on how to use Medusa, which was more or less cut & pasted into the
doc, probably at the time that asyncore and asynchat were added to the
Python core.  IMO, it's not the best part of the Python doc -- it is much
too low-to-the ground, and assumes the reader already understands much about
I/O, sync/async issues, and cares mostly about high performance.  All of
which are true of wonderful Sam, most of which are not true of the average
Python user.

While we're complaining about doc, asynchat is not documented, I believe.
Alas, I'm unable to find the time to write up said documentation.

--david

PS: I'm not sure that multiplexing can be made _easy_.  Issues like
block/nonblocking communications channels, multithreading etc. are hard to
ignore, as much as one might want to.



From gstein@lyra.org  Fri May 19 23:38:59 2000
From: gstein@lyra.org (Greg Stein)
Date: Fri, 19 May 2000 15:38:59 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: <200005200130.SAA02265@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005191535180.6486-100000@nebula.lyra.org>

On Fri, 19 May 2000, Guido van Rossum wrote:
> > I applied the recent changes to the CVS httplib to Greg's httplib
> > (call it httplib11) this afternoon.  The result is included below.  I
> > think this is quite close to checking in,

I'll fold the changes into my copy here (at least), until we're ready to
check into Python itself.

THANK YOU for doing this work. It is the "heavy lifting" part that I just
haven't had a chance to get to myself.

I have a small, local change dealing with the 'Host' header (it shouldn't
be sent automatically for HTTP/1.0; some httplib users already send it
and having *two* in the output headers will make some servers puke).

> > but it could use a slightly
> > better test suite.
> 
> Thanks -- but note that I don't have the time to review the code.

I'm reviewing it, too. Gotta work around the fact that Jeremy re-indented
the code, though... :-)

> > There are a few outstanding questions.
> > 
> > httplib11 does not implement the debuglevel feature.  I don't think
> > it's important, but it is currently documented and may be used.
> > Guido, should we implement it?
> 
> I think the solution is to provide the API ignore the call or
> argument.

Can do: ignore the debuglevel feature.

> > httplib w/SSL uses a constructor with this prototype:
> >     def __init__(self, host='', port=None, **x509):
> > It looks like the x509 dictionary should contain two variables --
> > key_file and cert_file.  Since we know what the entries are, why not
> > make them explicit?
> >     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> > (Or reverse the two arguments if that is clearer.)
> 
> The reason for the **x509 syntax (I think -- I didn't introduce it) is
> that it *forces* the user to use keyword args, which is a good thing
> for such an advanced feature.  However there should be code that
> checks that no other keyword args are present.

Can do: raise an error if other keyword args are present.

> > The FakeSocket class in CVS has a comment after the makefile def line
> > that says "hopefully, never have to write."  It won't do at all the
> > right thing when called with a write mode, so it ought to raise an
> > exception.  Any reason it doesn't?
> 
> Probably laziness of the code.  Thanks for this code review (I guess I
> was in a hurry when I checked that code in :-).

+1 on raising an exception.

> 
> > I'd like to add a couple of test cases that use HTTP/1.1 to get some
> > pages from python.org, including one that uses the chunked encoding.
> > Just haven't gotten around to it.  Question on that front: Does it
> > make sense to incorporate the test function in the module with the std
> > regression test suite?  In general, I would think so.  In this
> > particular case, the test could fail because of host networking
> > problems.  I think that's okay as long as the error message is clear
> > enough. 
> 
> Yes, I agree.  Maybe it should raise ImportError when the network is
> unreachable -- this is the one exception that the regrtest module
> considers non-fatal.

+1 on shifting to the test modules.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From bckfnn@worldonline.dk  Sat May 20 16:19:09 2000
From: bckfnn@worldonline.dk (Finn Bock)
Date: Sat, 20 May 2000 15:19:09 GMT
Subject: [Python-Dev] Heads up: unicode file I/O in JPython.
Message-ID: <392690f3.17235923@smtp.worldonline.dk>

I have recently released errata-07 which improves on JPython's ability
to handle unicode characters as well as binary data read from and
written to python files.

The conversions can be described as

- I/O to a file opened in binary mode will read/write the low 8 bits
  of each char. Writing Unicode chars >0xFF will cause silent
  truncation [*].

- I/O to a file opened in text mode will push the character 
  through the default encoding for the platform (in addition to 
  handling CR/LF issues).

This breaks completely with python1.6a2, but I believe that it is close
to the expectations of java users.  (The current JPython-1.1 behavior is
completely useless for both characters and binary data.  It only barely
manages to handle 7-bit ASCII.)

In JPython (with the errata) we can do:

  f = open("test207.out", "w")
  f.write("\x20ac") # On my w2k platform this writes 0x80 to the file.
  f.close()

  f = open("test207.out", "r")
  print hex(ord(f.read()))
  f.close()

  f = open("test207.out", "wb")
  f.write("\x20ac") # On all platforms this writes 0xAC to the file.
  f.close()

  f = open("test207.out", "rb")
  print hex(ord(f.read()))
  f.close()

With the output of:

  0x20ac
  0xac

I do not expect anything like this in CPython. I just hope that all
unicode advice given on c.l.py comes with the modifier, that JPython
might do it differently.

regards,
finn

    http://sourceforge.net/project/filelist.php?group_id=1842

[*] Silent overflow is bad, but it is at least twice as fast as having
to check each char for overflow.




From esr@netaxs.com  Sat May 20 23:36:56 2000
From: esr@netaxs.com (Eric Raymond)
Date: Sat, 20 May 2000 18:36:56 -0400
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <200005162001.QAA16657@eric.cnri.reston.va.us>; from Guido van Rossum on Tue, May 16, 2000 at 04:01:46PM -0400
References: <009d01bfbf64$b779a260$34aab5d4@hagrid> <3921984A.8CDE8E1D@prescod.net> <200005162001.QAA16657@eric.cnri.reston.va.us>
Message-ID: <20000520183656.F7487@unix3.netaxs.com>

On Tue, May 16, 2000 at 04:01:46PM -0400, Guido van Rossum wrote:
> > I hope that if Python were renamed we would not choose yet another name
> > which turns up hundreds of false hits in web engines. Perhaps Homr or
> > Home_r. Or maybe Pythahn.
> 
> Actually, I'd like to call the next version Throatwobbler Mangrove.
> But you'd have to pronounce it Raymond Luxyry Yach-t.

Great.  I'll take a J-class kitted for open-ocean sailing, please.  Do
I get a side of bikini babes with that?
-- 
	<a href="http://www.tuxedo.org/~esr/home.html">Eric S. Raymond</a>


From ping@lfw.org  Sun May 21 11:30:05 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Sun, 21 May 2000 03:30:05 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <392590B0.5CA4F31D@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005210329160.420-100000@localhost>

On Fri, 19 May 2000, M.-A. Lemburg wrote:
> Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> string_print API which is used for the tp_print slot,

Very sorry!  I didn't actually look to see where the patch
was being applied.

But then how can this have any effect on squishdot's indexing?



-- ?!ng

"All models are wrong; some models are useful."
    -- George Box




From pf@artcom-gmbh.de  Sun May 21 16:54:06 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Sun, 21 May 2000 17:54:06 +0200 (MEST)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <Pine.LNX.4.10.10005210329160.420-100000@localhost> from Ka-Ping Yee at "May 21, 2000  3:30: 5 am"
Message-ID: <m12tY3K-000CnvC@artcom0.artcom-gmbh.de>

Hi!

Ka-Ping Yee:
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> > string_print API which is used for the tp_print slot,
> 
> Very sorry!  I didn't actually look to see where the patch
> was being applied.
> 
> But then how can this have any effect on squishdot's indexing?

Sigh.  Let me explain this in some detail.

What do you see here: äöüÄÖÜß?  If all went well, you should
see some Umlauts which occur quite often in german words, like
"Begrüssung", "ätzend" or "Grützkacke" and so on.

During the late 80s, we here in Germany spent a lot of our free time
patching open source software like 'elm', 'B-News', 'less' and others
to make them "8-Bit clean" -- for example on ancient Unices like SCO
Xenix, where the implementations of C-library functions like
'is_print' and 'is_lower' were out of reach.

After several years everybody seemed to agree on ISO-8859-1 as the new
european standard character set, which was also often loosely called
8-Bit ASCII, because ASCII is a true subset of ISO Latin-1.  At least
the german versions of Windows used ISO-8859-1.

As the WWW began to gain popularity, nobody with a sane mind really
used those splendid ASCII escapes like '&auml;' instead of 'ä'.  The
same holds true for the TeX user community, where everybody was happy
to type real umlauts instead of the ugly backslash escape sequences
used before: \"a\"o\"u ...

To make it short: a lot of effort has been spent to make *ALL* programs
8-Bit clean, that is, to move the bytes through without translating
them from or into a bunch of incompatible multi-byte sequences which
nobody can read or even wants to look at.

Now to get back to your question:  There are several nice HTML indexing
engines out there.  I personally use HTDig.  At least on Linux these
programs deal fine with HTML files containing 8-bit chars.  

But if for some reason Umlauts end up as octal escapes ('\344' instead of 'ä')
due to the use of a Python 'print some_tuple' during the creation of HTML
files, a search engine will be unable to find those words with escaped
umlauts.
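
The effect is easy to reproduce interactively (a sketch, assuming a
Latin-1 terminal):

    >>> s = 'Begrüssung'
    >>> print s          # str(): the umlaut byte passes through untouched
    Begrüssung
    >>> print (s,)       # tuple elements go through repr(): octal escape
    ('Begr\374ssung',)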

Mit freundlichen Grüßen, Peter
P.S.: Hope you didn't find my explanation boring or off-topic.


From Fredrik Lundh" <effbot@telia.com  Sun May 21 17:26:00 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Sun, 21 May 2000 18:26:00 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <m12tY3K-000CnvC@artcom0.artcom-gmbh.de>
Message-ID: <005601bfc341$40eb0d60$34aab5d4@hagrid>

Peter Funk <pf@artcom-gmbh.de> wrote:
> But if for some reason Umlauts end up as octal escapes ('\344' instead of 'ä')
> due to the use of a Python 'print some_tuple' during the creation of HTML
> files, a search engine will be unable to find those words with escaped
> umlauts.

umm.  why would anyone use "print some_tuple" when generating
HTML pages?  what if the tuple contains something that results in
a "<" character?

</F>



From guido@python.org  Sun May 21 22:20:03 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 21 May 2000 14:20:03 -0700
Subject: [Python-Dev] Is the tempfile module really a security risk?
Message-ID: <200005212120.OAA05258@cj20424-a.reston1.va.home.com>

Every few months I receive patches that purport to make the tempfile
module more secure.  I've never felt that it is a problem.  What is
with these people?  My feeling about these suggestions has always been
that they have read about similar insecurities in C code run by the
super-user, and are trying to get the teacher's attention by proposing
something clever.

Or is there really a problem?  Is anyone in this forum aware of
security issues with tempfile?  Should I worry?  Is the
"random-tempfile" patch that the poster below suggested worth
applying?

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Sun, 21 May 2000 19:34:43 +0200
From:    Ragnar Kjørstad <ragnark@vestdata.no>
To:      Guido van Rossum <guido@python.org>
cc:      patches@python.org
Subject: Re: [Patches] Patch to make tempfile return random filenames

On Sun, May 21, 2000 at 12:17:08PM -0700, Guido van Rossum wrote:
> Hm, I don't like this very much.  Random sequences have a small but
> nonzero probability of generating the same number in rapid succession
> -- probably one in a million or so.  It would be very bad if one in a
> million rums of a particular application crashed for this reason.
> 
> A better way to prevent this kind of attack (if you care about it) is
> to use mktemp.TemporaryFile(), which avoids this vulnerability in a
> different way.
> 
> (Also note the test for os.path.exists() so that an attacker would
> have to use very precise timing to make this work.)

1. The path.exists part does not solve the problem.  It creates a race
condition that is not very hard to get around: a program that keeps
creating and deleting the file at maximum speed will have a 50%
chance of breaking your program.

2. O_EXCL does not always work. E.g. it does not work over NFS - there
are probably other broken implementations too.

3. Even if mktemp.TemporaryFile had been sufficient, providing mktemp in
this dangerous way is not good. Many are likely to use it either not
thinking about the problem at all, or assuming it's solved in the
module.

4. The problems you describe can easily be overcome.  I removed the
counter and the file-exists check because I figured they were no longer
needed.  I was wrong.  Either a larger random number should be used,
and/or the counter and/or the file-exists check kept.  Personally I
would want the random part to be large enough that there is no need to
worry about collisions at all -- by chance, after a fork, or by
deliberate attack.


Do you want a new patch that addresses these problems better?


- -- 
Ragnar Kjørstad

_______________________________________________
Patches mailing list
Patches@python.org
http://www.python.org/mailman/listinfo/patches

------- End of Forwarded Message



From guido@python.org  Sun May 21 23:05:58 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 21 May 2000 15:05:58 -0700
Subject: [Python-Dev] ANNOUNCE: Python CVS tree moved to SourceForge
Message-ID: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>

I'm happy to announce that we've moved the Python CVS tree to
SourceForge.  SourceForge (www.sourceforge.net) is a free service to
Open Source developers run by VA Linux.

The change has two advantages for us: (1) we no longer have to deal
with the mirroring of our writable CVS repository to the read-only
mirror at cvs.python.org (which will soon be decommissioned); (2) we
will be able to add new developers with checkin privileges.  In
addition, we benefit from the high visibility and availability of
SourceForge.

Instructions on how to access the Python SourceForge tree are here:

  http://sourceforge.net/cvs/?group_id=5470

If you have an existing working tree that points to the cvs.python.org
repository, you may want to retarget it to the SourceForge tree.  This
can be done painlessly with Greg Ward's cvs_chroot script:

  http://starship.python.net/~gward/python/

The email notification to python-checkins@python.org still works
(although during the transition a few checkin messages may have been
lost).

While I've got your attention, please remember that the proper
procedure for submitting patches is described here:

  http://www.python.org/patches/

We've accumulated quite the backlog of patches to be processed during
the transition; we'll start working on these ASAP.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Sun May 21 21:54:23 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sun, 21 May 2000 22:54:23 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost>
Message-ID: <39284CFF.5D9C9B13@lemburg.com>

Ka-Ping Yee wrote:
> 
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> > string_print API which is used for the tp_print slot,
> 
> Very sorry!  I didn't actually look to see where the patch
> was being applied.
> 
> But then how can this have any effect on squishdot's indexing?

The only possible reason I can see is that this squishdot
application uses 'print' to write the data -- perhaps
it pipes it through some other tool ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From Fredrik Lundh" <effbot@telia.com  Mon May 22 00:24:02 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 01:24:02 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com>
Message-ID: <004f01bfc37b$b1551480$34aab5d4@hagrid>

M.-A. Lemburg <mal@lemburg.com> wrote:
> > But then how can this have any effect on squishdot's indexing?
> 
> The only possible reason I can see is that this squishdot
> application uses 'print' to write the data -- perhaps
> it pipes it through some other tool ?

but doesn't the patch only affect code that manages to call tp_print
without the PRINT_RAW flag?  (that is, in "repr" mode rather than "str"
mode)

or to put it another way, if they manage to call tp_print without the
PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

or am I just totally confused?

</F>



From guido@python.org  Mon May 22 04:47:16 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 21 May 2000 20:47:16 -0700
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: Your message of "Mon, 22 May 2000 01:24:02 +0200."
 <004f01bfc37b$b1551480$34aab5d4@hagrid>
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com>
 <004f01bfc37b$b1551480$34aab5d4@hagrid>
Message-ID: <200005220347.UAA06235@cj20424-a.reston1.va.home.com>

Let's reboot this thread.  Never mind the details of the actual patch,
or why it would affect a particular index.

Obviously if we're going to patch string_print() we're also going to
patch string_repr() (and vice versa) -- the former (without the
Py_PRINT_RAW flag) is supposed to be an optimization of the latter.
(I hadn't even read the patch that far to realize that it only did one
and not the other.)

The point is simply this.

The repr() function for a string turns it into a valid string literal.
There's considerable freedom allowed in this conversion, some of which
is taken (e.g. it prefers single quotes but will use double quotes
when the string contains single quotes).

For safety reasons, control characters are replaced by their octal
escapes.  This is also done for non-ASCII characters.

Lots of people, most of them living in countries where Latin-1 (or
another 8-bit ASCII superset) is in actual use, would prefer that
non-ASCII characters would be left alone rather than changed into
octal escapes.  I think it's not unreasonable to ask that what they
consider printable characters aren't treated as control characters.

I think that using the locale to guide this is reasonable.  If the
locale is set to imply Latin-1, then we can assume that most output
devices are capable of displaying those characters.  What good does
converting those characters to octal escapes do us then?  If the input
string was in fact binary goop, then the output will be unreadable
goop -- but it won't screw up the output device (as control characters
are wont to do, which is the main reason to turn them into octal
escapes).

So I don't see how the patch can do much harm, I don't expect that it
will break much code, and I see a real value for those who use
Latin-1 or other 8-bit supersets of ASCII.

The one objection could be that the locale may be obsolescent -- but
I've only heard /F vent an opinion about that; personally, I doubt
that we will be able to remove the locale any time soon, even if we
invent a better way.  Plus, I think that "better way" should address
this issue anyway.  If the locale eventually disappears, the feature
automatically disappears with it, because you *have* to make a
locale.setlocale() call before the behavior of repr() changes.
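
Concretely, the proposal amounts to something like this (a sketch of the
intended behavior, not of the patch itself):

    import locale
    locale.setlocale(locale.LC_CTYPE, '')  # pick up e.g. a Latin-1 locale
    s = 'Grüße'
    print repr(s)   # today: 'Gr\374\337e'; with the patch: 'Grüße'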

--Guido van Rossum (home page: http://www.python.org/~guido/)


From pf@artcom-gmbh.de  Mon May 22 07:18:22 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 08:18:22 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005220347.UAA06235@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 21, 2000  8:47:16 pm"
Message-ID: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>

Guido van Rossum:
[...]
> The one objection could be that the locale may be obsolescent -- but
> I've only heard /F vent an opinion about that; personally, I doubt
> that we will be able to remove the locale any time soon, even if we
> invent a better way.  

AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)

Although I understand Barry's and Ping's objections against a global state,
it used to work very well:  On a typical single-user Linux system the
user chooses his locale during the first stages of system setup and
never has to think about it again.  On multi-user systems the locale
of individual accounts may be customized using several environment
variables, which can override the default locale of the system.

> Plus, I think that "better way" should address
> this issue anyway.  If the locale eventually disappears, the feature
> automatically disappears with it, because you *have* to make a
> locale.setlocale() call before the behavior of repr() changes.

The last sentence is at least not the whole truth.

On POSIX systems there are several environment variables used to
control the default locale settings for a user's session.  For example,
on my SuSE Linux system, currently running in the german locale, the
environment variable LC_CTYPE=de_DE is automatically set by the file
/etc/profile during login, which automatically causes the C-library
function toupper('ä') to return an 'Ä' ---you should see
a lower-case a-umlaut as argument and an upper-case umlaut as return
value--- without every application having to call 'setlocale' explicitly.

So this simply works as intended, without having to add calls to
'setlocale' to every application program using these C-library functions.

Regards, Peter.


From tim_one@email.msn.com  Mon May 22 07:59:16 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Mon, 22 May 2000 02:59:16 -0400
Subject: [Python-Dev] Is the tempfile module really a security risk?
In-Reply-To: <200005212120.OAA05258@cj20424-a.reston1.va.home.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEECEGBAA.tim_one@email.msn.com>

[Guido]
> Every few months I receive patches that purport to make the tempfile
> module more secure.  I've never felt that it is a problem.  What is
> with these people?

Doing a google search on

    tempfile security

turns up hundreds of rants.  Have fun <wink>.  There does appear to be a
real vulnerability here somewhere (not necessarily Python), but the closest
I found to a clear explanation in 10 minutes was an annoyed paragraph,
saying that if I didn't already understand the problem I should turn in my
Unix Security Expert badge immediately.  Unfortunately, Bill Gates never
issued one of those to me.

> ...
> Is the "random-tempfile" patch that the poster below suggested worth
> applying?

Certainly not the patch he posted!  And for reasons I sketched in my
patches-list commentary, I doubt any hack based on pseudo-random numbers
*can* solve anything.

assuming-there's-indeed-something-in-need-of-solving-ly y'rs  - tim




From Fredrik Lundh" <effbot@telia.com  Mon May 22 08:20:50 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 09:20:50 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>
Message-ID: <008001bfc3be$7e5eae40$34aab5d4@hagrid>

Peter Funk wrote:
> AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)

you're missing the point -- now that we've added unicode support to
Python, the old 8-bit locale *ctype* stuff no longer works.  while some
platforms implement a wctype interface, it's not widely available, and it's
not always unicode.

so in order to provide platform-independent unicode support, Python 1.6
comes with unicode-aware and fully portable replacements for the ctype
functions.

the code is already in there...

> On POSIX systems there are a several environment variables used to
> control the default locale settings for a users session.  For example
> on my SuSE Linux system currently running in the german locale the
> environment variable LC_CTYPE=de_DE is automatically set by a file
> /etc/profile during login, which causes automatically the C-library
> function toupper('ä') to return an 'Ä' ---you should see
> a lower case a-umlaut as argument and an upper case umlaut as return
> value--- without having all applications to call 'setlocale' explicitly.
>
> So this simply works well as intended without having to add calls
> to 'setlocale' to all application program using this C-library functions.

note that this leaves us with four string flavours in 1.6:

- 8-bit binary arrays.  may contain binary goop, or text in some strange
  encoding.  upper, strip, etc should not be used.

- 8-bit text strings using the system encoding.  upper, strip, etc works
  as long as the locale is properly configured.

- 8-bit unicode text strings.  upper, strip, etc may work, as long as the
  system encoding is a subset of unicode -- which means US ASCII or
  ISO Latin 1.

- wide unicode text strings.  upper, strip, etc always works.

is this complexity really worth it?

</F>



From gstein@lyra.org  Mon May 22 08:47:50 2000
From: gstein@lyra.org (Greg Stein)
Date: Mon, 22 May 2000 00:47:50 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: <Pine.LNX.4.10.10005191535180.6486-100000@nebula.lyra.org>
Message-ID: <Pine.LNX.4.10.10005220045170.30706-100000@nebula.lyra.org>

I've integrated all of these changes into the httplib.py posted on my
pages at:
    http://www.lyra.org/greg/python/

The actual changes are visible thru ViewCVS at:
    http://www.lyra.org/cgi-bin/viewcvs.cgi/gjspy/httplib.py/


The test code is still in there, until a test_httplib can be written.

Still missing: doc for the new-style semantics.

Cheers,
-g

On Fri, 19 May 2000, Greg Stein wrote:
> On Fri, 19 May 2000, Guido van Rossum wrote:
> > > I applied the recent changes to the CVS httplib to Greg's httplib
> > > (call it httplib11) this afternoon.  The result is included below.  I
> > > think this is quite close to checking in,
> 
> I'll fold the changes into my copy here (at least), until we're ready to
> check into Python itself.
> 
> THANK YOU for doing this work. It is the "heavy lifting" part that I just
> haven't had a chance to get to myself.
> 
> I have a small, local change dealing with the 'Host' header (it shouldn't
> be sent automatically for HTTP/1.0; some httplib users already send it
> and having *two* in the output headers will make some servers puke).
> 
> > > but it could use a slightly
> > > better test suite.
> > 
> > Thanks -- but note that I don't have the time to review the code.
> 
> I'm reviewing it, too. Gotta work around the fact that Jeremy re-indented
> the code, though... :-)
> 
> > > There are a few outstanding questions.
> > > 
> > > httplib11 does not implement the debuglevel feature.  I don't think
> > > it's important, but it is currently documented and may be used.
> > > Guido, should we implement it?
> > 
> > I think the solution is to provide the API ignore the call or
> > argument.
> 
> Can do: ignore the debuglevel feature.
> 
> > > httplib w/SSL uses a constructor with this prototype:
> > >     def __init__(self, host='', port=None, **x509):
> > > It looks like the x509 dictionary should contain two variables --
> > > key_file and cert_file.  Since we know what the entries are, why not
> > > make them explicit?
> > >     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> > > (Or reverse the two arguments if that is clearer.)
> > 
> > The reason for the **x509 syntax (I think -- I didn't introduce it) is
> > that it *forces* the user to use keyword args, which is a good thing
> > for such an advanced feature.  However there should be code that
> > checks that no other keyword args are present.
> 
> Can do: raise an error if other keyword args are present.
> 
> > > The FakeSocket class in CVS has a comment after the makefile def line
> > > that says "hopefully, never have to write."  It won't do at all the
> > > right thing when called with a write mode, so it ought to raise an
> > > exception.  Any reason it doesn't?
> > 
> > Probably laziness of the code.  Thanks for this code review (I guess I
> > was in a hurry when I checked that code in :-).
> 
> +1 on raising an exception.
> 
> > 
> > > I'd like to add a couple of test cases that use HTTP/1.1 to get some
> > > pages from python.org, including one that uses the chunked encoding.
> > > Just haven't gotten around to it.  Question on that front: Does it
> > > make sense to incorporate the test function in the module with the std
> > > regression test suite?  In general, I would think so.  In this
> > > particular case, the test could fail because of host networking
> > > problems.  I think that's okay as long as the error message is clear
> > > enough. 
> > 
> > Yes, I agree.  Maybe it should raise ImportError when the network is
> > unreachable -- this is the one exception that the regrtest module
> > considers non-fatal.
> 
> +1 on shifting to the test modules.
> 
> Cheers,
> -g
> 
> -- 
> Greg Stein, http://www.lyra.org/
> 
> 

-- 
Greg Stein, http://www.lyra.org/



From alexandre.ferrieux@cnet.francetelecom.fr  Mon May 22 09:25:21 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Mon, 22 May 2000 10:25:21 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org>
 <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <3928EEF1.693F@cnet.francetelecom.fr>

Guido van Rossum wrote:
> 
> > From: Alexandre Ferrieux <alexandre.ferrieux@cnet.francetelecom.fr>
> >
> > I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
> > just so beautiful inside. But while Tcl is weaker in the algorithms, it
> > is stronger in the os-wrapping library, and taught me to love high-level
> > abstractions. [fileevent] shines in this respect, and I'll miss it in
> > Python.
> 
> Alex, it's disappointing to me too!  There just isn't anything
> currently in the library to do this, and I haven't written apps that
> need this often enough to have a good feel for what kind of
> abstraction is needed.

Thanks for the empathy. Apologies for my slight overreaction.

> However perhaps we can come up with a design for something better?  Do
> you have a suggestion here?

Yup. One easy answer is 'just copy from Tcl'...

Seriously, I'm really too new to Python to suggest the details or even
the *style* of this 'level 2 API to multiplexing'. However, I can sketch
the implementation since select() (from C or Tcl) is the one primitive I
most depend on !

Basically, as briefly mentioned before, the key problem is the
heterogeneity of seemingly-selectable things in Windoze.  On Unix, not
only does select() work with all descriptor types on which it makes
sense, but the fd used by Xlib is also accessible; hence clean
multiplexing, even with a GUI package, is trivial.  Now to the real
(rotten) meat, that is M$'s.  Facts:

	1. 'Handle' types are not equal.  Unnamed pipes are (surprise!) not
selectable.  Why?  Ask a relative in Redmond...

	2. 'Handle' types are not equal (bis).  Socket 'handles' are *not* true
handles.  They are selectable, but for example you can't use 'em for
redirections.  Okay, in our case we don't care.  I only mention it 'cause
it's scary and could pop back into your face some time later.

	3. The GUI API doesn't expose a descriptor (handle), but fortunately
(though disgustingly) there is a special syscall to wait on both "the
message queue" and selectable handles: MsgWaitForMultipleObjects. So its
doable, if not beautiful.

The Tcl solution to (1.), which is the only real issue, is to have a
separate thread blockingly read 1 byte from the pipe, and then post a
message back to the main thread to awaken it (yes, ugly code to handle
that extra byte and integrate it with the buffering scheme).

In summary, why not peruse Tcl's hard-won experience on
selecting-on-windoze-pipes ?

Then, for the API exposed to the Python programmer, the Tclly exposed
one is a starter:

	fileevent $channel readable|writable callback
	...
	vwait breaker_variable

Explanation for non-Tclers: fileevent hooks the callback, vwait does a
loop of select(). The callback(s) is(are) called without breaking the
loop, unless $breaker_variable is set, at which time vwait returns.

One note about 'breaker_variable': I'm not sure I like it. I'd prefer
something based on exceptions. I don't quite understand why it's not
already this way in Tcl (which has (kindof) first-class exceptions), but
let's not repeat the mistake: let's suggest that (the equivalent of)
vwait loops forever, only to be broken out by an exception from within
one of the callbacks.
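
To make that concrete, here is a minimal select()-based sketch of what
such an exception-driven fileevent/vwait pair could look like in Python
(my own guess at an API, not an existing module):

    import select

    class BreakLoop(Exception):
        pass

    _readable = {}                      # fd -> callback

    def fileevent(f, callback):
        _readable[f.fileno()] = callback

    def vwait():
        # Loop over select() forever, dispatching callbacks, until one
        # of them raises BreakLoop.
        try:
            while True:
                ready, _, _ = select.select(list(_readable), [], [])
                for fd in ready:
                    _readable[fd](fd)
        except BreakLoop:
            pass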

HTH,

-Alex


From mal@lemburg.com  Mon May 22 09:56:10 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 22 May 2000 10:56:10 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com> <004f01bfc37b$b1551480$34aab5d4@hagrid> <3928F437.D4DB3C25@lemburg.com>
Message-ID: <3928F62A.94980623@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > > But then how can this have any effect on squishdot's indexing?
> >
> > The only possible reason I can see is that this squishdot
> > application uses 'print' to write the data -- perhaps
> > it pipes it through some other tool ?
> 
> but doesn't the patch only affect code that manages to call tp_print
> without the PRINT_RAW flag?  (that is, in "repr" mode rather than "str"
> mode)

Right.
 
> or to put it another way, if they manage to call tp_print without the
> PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

Looking at the code, the 'print' statement doesn't set
PRINT_RAW -- still the output is written literally to
stdout. Don't know where PRINT_RAW gets set... perhaps
they use PyFile_WriteObject() directly ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From python-dev@python.org  Mon May 22 10:44:14 2000
From: python-dev@python.org (Peter Funk)
Date: Mon, 22 May 2000 11:44:14 +0200 (MEST)
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
Message-ID: <m12tokw-000DieC@artcom0.artcom-gmbh.de>

[Guido]
> Every few months I receive patches that purport to make the tempfile
> module more secure.  I've never felt that it is a problem.  What is
> with these people?

[Tim]
> Doing a google search on
> 
>     tempfile security
> 
> turns up hundreds of rants.  Have fun <wink>.  There does appear to be a
> real vulnerability here somewhere (not necessarily Python), but the closest
> I found to a clear explanation in 10 minutes was an annoyed paragraph,
> saying that if I didn't already understand the problem I should turn in my
> Unix Security Expert badge immediately.  Unfortunately, Bill Gates never
> issued one of those to me.

On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a 
working example which exploits this vulnerability in older versions
of GCC.

The basic idea is indeed very simple:  Since the /tmp directory is
writable by any user, the bad guy can create a symbolic link in /tmp
pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
program will then overwrite this arbitrary file (where the programmer
really wanted to write to his tempfile instead).  Since this happens
with the access permissions of the process running the program, it
opens a bunch of vulnerabilities in the many programs that write to
temporary files with predictable names.
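
For illustration, here is a minimal sketch of the usual defence (my own
example in modern Python syntax, not the current tempfile code): open
the file with O_EXCL, so that a pre-planted symlink makes the open fail
instead of being silently followed.

    import errno, os, tempfile

    def open_private_tempfile(prefix="tmp"):
        tmpdir = tempfile.gettempdir()
        for i in range(100):
            path = os.path.join(tmpdir, "%s%d.%d" % (prefix, os.getpid(), i))
            try:
                # O_CREAT|O_EXCL refuses to open an existing path, so an
                # attacker's symlink cannot redirect the write.
                fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
            except OSError as e:
                if e.errno == errno.EEXIST:
                    continue                 # name already taken, try the next
                raise
            return os.fdopen(fd, "w+"), path
        raise RuntimeError("could not create a unique temporary file")

(Newer versions of the tempfile module provide mkstemp(), which takes
the same O_EXCL approach.)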

www.cert.org is another great place to look for security related info.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From claird@starbase.neosoft.com  Mon May 22 12:31:08 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 06:31:08 -0500 (CDT)
Subject: [Python-Dev] Re: Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <200005221131.GAA39671@starbase.neosoft.com>

	From alexandre.ferrieux@cnet.francetelecom.fr  Mon May 22 03:40:13 2000
			.
			.
			.
	> Alex, it's disappointing to me too!  There just isn't anything
	> currently in the library to do this, and I haven't written apps that
	> needs this often enough to have a good feel for what kind of
	> abstraction is needed.

	Thanks for the empathy. Apologies for my slight overreaction.

	> However perhaps we can come up with a design for something better?  Do
	> you have a suggestion here?

	Yup. One easy answer is 'just copy from Tcl'...

	Seriously, I'm really too new to Python to suggest the details or even
	the *style* of this 'level 2 API to multiplexing'. However, I can sketch
	the implementation since select() (from C or Tcl) is the one primitive I
	most depend on !

	Basically, as shortly mentioned before, the key problem is the
	heterogeneity of seemingly-selectable things in Windoze. On unix, not
	only does select() work with
	all descriptor types on which it makes sense, but also the fd used by
	Xlib is accessible; hence clean multiplexing even with a GUI package is
	trivial. Now to the real (rotten) meat, that is M$'s. Facts:

		1. 'Handle' types are not equal. Unnames pipes are (surprise!) not
	selectable. Why ? Ask a relative in Redmond...

		2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
	handles. They are selectable, but for example you can't use'em for
	redirections. Okay in our case we don't care. I only mention it cause
	its scary and could pop back into your face some time later.

		3. The GUI API doesn't expose a descriptor (handle), but fortunately
	(though disgustingly) there is a special syscall to wait on both "the
	message queue" and selectable handles: MsgWaitForMultipleObjects. So its
	doable, if not beautiful.

	The Tcl solution to (1.), which is the only real issue, is to have a
	separate thread blockingly read 1 byte from the pipe, and then post a
	message back to the main thread to awaken it (yes, ugly code to handle
	that extra byte and integrate it with the buffering scheme).

	In summary, why not peruse Tcl's hard-won experience on
	selecting-on-windoze-pipes ?

	Then, for the API exposed to the Python programmer, the Tclly exposed
	one is a starter:

		fileevent $channel readable|writable callback
		...
		vwait breaker_variable

	Explanation for non-Tclers: fileevent hooks the callback, vwait does a
	loop of select(). The callback(s) is(are) called without breaking the
	loop, unless $breaker_variable is set, at which time vwait returns.

	One note about 'breaker_variable': I'm not sure I like it. I'd prefer
	something based on exceptions. I don't quite understand why it's not
	already this way in Tcl (which has (kindof) first-class exceptions), but
	let's not repeat the mistake: let's suggest that (the equivalent of)
	vwait loops forever, only to be broken out by an exception from within
	one of the callbacks.
			.
			.
			.
I've copied everything Alex wrote, because he writes for
me, also.

As much as I welcome it, I can't answer Guido's question,
"What should the API look like?"  I've been mulling this
over, and concluded I don't have sufficiently deep know-
ledge to be trustworthy on this.

Instead, I'll just give a bit of personal testimony.  I
made the rather coy c.l.p posting, in which I sincerely
asked, "How do you expert Pythoneers do it?" (my para-
phrase), without disclosing either that Alex and I have
been discussing this, or that the Tcl interface we both
know is simply a delight to me.

Here's the delight.  Guido asked, approximately, "What's
the point?  Do you need this for more than the keeping-
the-GUI-responsive-for-which-there's-already-a-notifier-
around case?"  The answer is, yes.  It's a good question,
though.  I'll repeat what Alex has said, with my own em-
phasis:  Tcl gives a uniform command API for
* files (including I/O ports, ...)
* subprocesses
* TCP socket connections
and allows the same fcntl()-like configuration of them
all as to encodings, blocking, buffering, and character
translation.  As a programmer, I use this stuff
CONSTANTLY, and very happily.  It's not just for GUIs; 
several of my mission-critical delivered products have
Tcl-coded daemons to monitor hardware, manage customer
transactions, ...  It's simply wonderful to be able to
evolve a protocol from a socket connection to an fopen()
read to ...

Tcl is GREAT at "gluing".  Python can do it, but Tcl has
a couple of years of refinement in regard to portability
issues of managing subprocesses.  I really, *really*
miss this stuff when I work with a language other than
Tcl.

I don't often whine, "Language A isn't language B."  I'm
happy to let individual character come out.  This is,
for me, an exceptional case.  It's not that Python doesn't
do it the Tcl way; it's that the Tcl way is wonderful, and
moreover that Python doesn't feel to me to have much of an
alternative answer.  I conclude that there might be some-
thing for Python to learn here.

A colleague has also written an even higher-level wrapper in
Tcl for asynchronous sockets.  I'll likely explain more
about it <URL:http://www-users.cs.umn.edu/~dejong/tcl/EasySocket.tar.gz>
in a follow-up.

Conclusion for now:  Alex and I like Python so much that we
want you guys to know that better piping-gluing-networking
truly is possible, and even worthwhile.  This is sort of
like the emigrants who've reported, "Yeah, here's the
stuff about CPAN that's cool, and how we can have it, too."
Through it all, we absolutely want Python to continue to be
Python.


From guido@python.org  Mon May 22 16:09:44 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 08:09:44 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 08:18:22 +0200."
 <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>
Message-ID: <200005221509.IAA06955@cj20424-a.reston1.va.home.com>

> From: pf@artcom-gmbh.de (Peter Funk)
> 
> Guido van Rossum:
> [...]
> > The one objection could be that the locale may be obsolescent -- but
> > I've only heard /F vent an opinion about that; personally, I doubt
> > that we will be able to remove the locale any time soon, even if we
> > invent a better way.  
> 
> AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> 
> Although I understand Barrys and Pings objections against a global state,
> it used to work very well:  On a typical single user Linux system the
> user chooses his locale during the first stages of system setup and
> never has to think about it again.  On multi user systems the locale
> of individual accounts may be customized using several environment
> variables, which can overide the default locale of the system.
> 
> > Plus, I think that "better way" should address
> > this issue anyway.  If the locale eventually disappears, the feature
> > automatically disappears with it, because you *have* to make a
> > locale.setlocale() call before the behavior of repr() changes.
> 
> The last sentence is at least not the whole truth.
> 
> On POSIX systems there are a several environment variables used to
> control the default locale settings for a users session.  For example
> on my SuSE Linux system currently running in the german locale the
> environment variable LC_CTYPE=de_DE is automatically set by a file 
> /etc/profile during login, which causes automatically the C-library 
> function toupper('ä') to return an 'Ä' ---you should see
> a lower case a-umlaut as argument and an upper case umlaut as return
> value--- without having all applications to call 'setlocale' explicitly.
> 
> So this simply works well as intended without having to add calls
> to 'setlocale' to all application program using this C-library functions.

I don't believe that.  According to the ANSI standard, a C program
*must* call setlocale(LC_..., "") if it wants the environment
variables to be honored; without this call, the locale is always the
"C" locale, which should *not* honor the environment variables.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From tismer@tismer.com  Mon May 22 13:40:51 2000
From: tismer@tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 14:40:51 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
References: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <39292AD2.F5080E35@tismer.com>

Hi, I'm back from White Russia (yup, a survivor) :-)

Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > Then a string should better not be a sequence.
> >
> > The number of places where I really used the string sequence
> > protocol to take advantage of it is outperformed by a factor
> > of ten by cases where I missed to tupleise and got a bad
> > result. A traceback is better than a sequence here.
> 
> Alas, I think
> 
>     for ch in string:
>         muck w/ the character ch
> 
> is a common idiom.

Sure.
And now for my proposal:

Strings should be strings, but not sequences.
Slicing is ok, and it will always yield strings.
Indexing would either
a - not yield anything but an exception, or
b - yield integers instead of 1-char strings

The above idiom would read like this:

Version a: Access string elements via a coercion like tuple() or list():

    for ch in tuple(string):
        muck w/ the character ch

Version b: Access string elements as integer codes:

    for c in string:
        # either:
        ch = chr(c)
        muck w/ the character ch
        # or:
        muck w/ the character code c

> > oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris
> 
> The "sequenceness" of strings does get in the way often enough.  Strings
> have the amazing property that, since characters are also strings,
> 
>     while 1:
>         string = string[0]
> 
> never terminates with an error.  This often manifests as unbounded recursion
> in generic functions that crawl over nested sequences (the first time you
> code one of these, you try to stop the recursion on a "is it a sequence?"
> test, and then someone passes in something containing a string and it
> descends forever).  And we also have that
> 
>     format % values
> 
> requires "values" to be specifically a tuple rather than any old sequence,
> else the current
> 
>     "%s" % some_string
> 
> could be interpreted the wrong way.
> 
> There may be some hope in that the "for/in" protocol is now conflated with
> the __getitem__ protocol, so if Python grows a more general iteration
> protocol, perhaps we could back away from the sequenceness of strings
> without harming "for" iteration over the characters ...

O-K!
We seem to have a similar conclusion: it would be better if strings
were not sequences, after all. How to achieve this seems to be
kind of a problem, of course.

Oh, there is another idiom possible!
How about this, after we have the new string methods :-)

    for ch in string.split():
        muck w/ the character ch

Ok, in the long term, we need to rethink iteration of course.
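
A concrete illustration of the recursion pitfall Tim mentions above (my
own example, not code from this thread): a naive flatten() has to
special-case strings, precisely because a 1-character string is again a
sequence of strings.

    def flatten(seq):
        result = []
        for item in seq:
            if isinstance(item, str):
                # Without this special case the recursion never ends:
                # item[0] is a string whose [0] is the same string again.
                result.append(item)
                continue
            try:
                iter(item)
            except TypeError:
                result.append(item)           # a scalar
            else:
                result.extend(flatten(item))  # a nested sequence
        return result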

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From tismer@tismer.com  Mon May 22 13:55:21 2000
From: tismer@tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 14:55:21 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
References: <000201bfc082$50909f80$6c2d153f@tim>
Message-ID: <39292E38.A5A89270@tismer.com>


Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > After all, it is no surprize. They are right.
> > If we have to change their mind in order to understand
> > a basic operation, then we are wrong, not they.
> 
> [Tim]
> > Huh!  I would not have guessed that you'd give up on Stackless
> > that easily <wink>.
> 
> [Chris]
> > Noh, I didn't give up Stackless, but fishing for soles.
> > After Just v. R. has become my most ambitious user,
> > I'm happy enough.
> 
> I suspect you missed the point:  Stackless is the *ultimate* exercise in
> "changing their mind in order to understand a basic operation".  I was
> tweaking you, just as you're tweaking me <smile!>.

Squeek! Peace on earth :-)

And you are almost right on Stackless.
Almost, since I know of at least three new Python users who came
to Python *because* it has Stackless + Continuations. This is a very
new aspect to me.
Things are getting interesting now: Today I got a request from CCP
regarding continuations: They will build a massively parallel
multiplayer game with that. http://www.ccp.cc/eve

> > It is absolutely phantastic.
> > The most uninteresting stuff in the join is the separator,
> > and it has the power to merge thousands of strings
> > together, without asking the sequence at all
> >  - give all power to the suppressed, long live the Python anarchy :-)
> 
> Exactly!  Just as love has the power to bind thousands of incompatible
> humans without asking them either:  a vote for space.join() is a vote for
> peace on earth.

hmmm - that's so nice...

So let's drop a generic join, and use string.love() instead.

> while-a-generic-join-builtin-is-a-vote-for-war<wink>-ly y'rs  - tim

join-is-a-peacemaker-like-a-Winchester-Cathedral-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From claird@starbase.neosoft.com  Mon May 22 14:09:03 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 08:09:03 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <200005221309.IAA41866@starbase.neosoft.com>

	From guido@cj20424-a.reston1.va.home.com  Fri May 19 07:26:16 2000
			.
			.
			.
	Alex, it's disappointing to me too!  There just isn't anything
	currently in the library to do this, and I haven't written apps that
	needs this often enough to have a good feel for what kind of
	abstraction is needed.

	However perhaps we can come up with a design for something better?  Do
	you have a suggestion here?
Review:  Alex and I have so far presented
the Tcl way.  We're still a bit off-balance
at the generosity of spirit that's listen-
ing to us so respectfully.  Still ahead is
the hard work of designing an interface or
higher-level abstraction that's right for
Python.

The good thing, of course, is that this is
absolutely not a language issue at all.
Python is more than sufficiently expressive
for this matter.  All we're doing is working
to insert the right thing in the (a) library.

	I agree with your comment that higher-level abstractions around OS
	stuff are needed -- I learned system programming long ago, in C, and
	I'm "happy enough" with the current state of affairs, but I agree that
	for many people this is a problem, and there's no reason why Python
	couldn't do better...
I've got a whole list of "higher-level
abstractions around OS stuff" that I've been
collecting.  Maybe I'll make it fit for
others to see once we're through this affair
...

	--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido@python.org  Mon May 22 17:16:08 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:16:08 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 09:20:50 +0200."
 <008001bfc3be$7e5eae40$34aab5d4@hagrid>
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>
 <008001bfc3be$7e5eae40$34aab5d4@hagrid>
Message-ID: <200005221616.JAA07234@cj20424-a.reston1.va.home.com>

> From: "Fredrik Lundh" <effbot@telia.com>
>
> Peter Funk wrote:
> > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> 
> you're missing the point -- now that we've added unicode support to
> Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> platforms implement a wctype interface, it's not widely available, and it's
> not always unicode.

Huh?  We were talking strictly 8-bit strings here.  The locale support
hasn't changed there.

> so in order to provide platform-independent unicode support, Python 1.6
> comes with unicode-aware and fully portable replacements for the ctype
> functions.

For those who only need Latin-1 or another 8-bit ASCII superset, the
Unicode stuff is overkill.

> the code is already in there...
> 
> > On POSIX systems there are a several environment variables used to
> > control the default locale settings for a users session.  For example
> > on my SuSE Linux system currently running in the german locale the
> > environment variable LC_CTYPE=de_DE is automatically set by a file
> > /etc/profile during login, which causes automatically the C-library
> > function toupper('ä') to return an 'Ä' ---you should see
> > a lower case a-umlaut as argument and an upper case umlaut as return
> > value--- without having all applications to call 'setlocale' explicitly.
> >
> > So this simply works well as intended without having to add calls
> > to 'setlocale' to all application program using this C-library functions.
> 
> note that this leaves us with four string flavours in 1.6:
> 
> - 8-bit binary arrays.  may contain binary goop, or text in some strange
>   encoding.  upper, strip, etc should not be used.

These are not strings.

> - 8-bit text strings using the system encoding.  upper, strip, etc works
>   as long as the locale is properly configured.
> 
> - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
>   system encoding is a subset of unicode -- which means US ASCII or
>   ISO Latin 1.

This is a figment of your imagination.  You can use 8-bit text strings
to contain Latin-1, but you have to set your locale to match.

> - wide unicode text strings.  upper, strip, etc always works.
> 
> is this complexity really worth it?

From a backwards compatibility point of view, yes.  Basically,
programs that don't use Unicode should see no change in semantics.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From pf@artcom-gmbh.de  Mon May 22 14:02:18 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 15:02:18 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005221509.IAA06955@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 22, 2000  8: 9:44 am"
Message-ID: <m12trqc-000DieC@artcom0.artcom-gmbh.de>

Hi!

[...]
[me]:
> > So this simply works well as intended without having to add calls
> > to 'setlocale' to all application program using this C-library functions.

[Guido van Rossum]:
> I don't believe that.  According to the ANSI standard, a C program
> *must* call setlocale(LC_..., "") if it wants the environment
> variables to be honored; without this call, the locale is always the
> "C" locale, which should *not* honor the environment variables.

pf@pefunbk> python 
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string
>>> print string.upper("ä")
Ä
>>> 

This was the vanilla Python 1.5.2 as originally delivered by SuSE Linux.  
But yes, you are right. :-(  My memory was confused by this practical
experience.  Now I like to quote from the man pages here:

man toupper:
[...]
BUGS
       The details of what constitutes an uppercase or  lowercase
       letter  depend  on  the  current locale.  For example, the
       default "C" locale does not know about umlauts, so no con-
       version is done for them.

       In some non - English locales, there are lowercase letters
       with no corresponding  uppercase  equivalent;  the  German
       sharp s is one example.

man setlocale:
[...]
       A  program  may be made portable to all locales by calling
       setlocale(LC_ALL, "" ) after program   initialization,  by
       using  the  values  returned  from a localeconv() call for
       locale - dependent information and by using  strcoll()  or
       strxfrm() to compare strings.
[...]
   CONFORMING TO
       ANSI C, POSIX.1

       Linux  (that  is,  libc) supports the portable locales "C"
       and "POSIX".  In the good old days there used to  be  sup-
       port for the European Latin-1 "ISO-8859-1" locale (e.g. in
       libc-4.5.21 and  libc-4.6.27),  and  the  Russian  "KOI-8"
       (more  precisely,  "koi-8r") locale (e.g. in libc-4.6.27),
       so that having an environment variable LC_CTYPE=ISO-8859-1
       sufficed to make isprint() return the right answer.  These
       days non-English speaking Europeans have  to  work  a  bit
       harder, and must install actual locale files.
[...]

In recent Linux distributions almost every Linux C-program seems to
contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy
to forget about it.  However the core Python interpreter does not.
It seems the Linux C-Library is not fully ANSI compliant in this case.
It seems to honour the setting of $LANG regardless of whether a program
calls 'setlocale' or not.

Regards, Peter


From guido@python.org  Mon May 22 17:31:50 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:31:50 -0700
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: Your message of "Mon, 22 May 2000 10:25:21 +0200."
 <3928EEF1.693F@cnet.francetelecom.fr>
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
 <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>

> Yup. One easy answer is 'just copy from Tcl'...

Tcl seems to be your only frame of reference.  I think it's too early
to say that borrowing Tcl's design is right for Python.  Don't forget
that part of Tcl's design was guided by the desire for backwards
compatibility with Tcl's strong (stronger than Python I find!) Unix
background.

> Seriously, I'm really too new to Python to suggest the details or even
> the *style* of this 'level 2 API to multiplexing'. However, I can sketch
> the implementation since select() (from C or Tcl) is the one primitive I
> most depend on !
> 
> Basically, as shortly mentioned before, the key problem is the
> heterogeneity of seemingly-selectable things in Windoze. On unix, not
> only does select() work with
> all descriptor types on which it makes sense, but also the fd used by
> Xlib is accessible; hence clean multiplexing even with a GUI package is
> trivial. Now to the real (rotten) meat, that is M$'s. Facts:

Note that on Windows, select() is part of SOCKLIB, which explains why
it only understands sockets.  Native Windows code uses the
wait-for-event primitives that you are describing, and these are
powerful enough to wait on named pipes, sockets, and GUI events.
Complaining about the select interface on Windows isn't quite fair.

> 	1. 'Handle' types are not equal. Unnames pipes are (surprise!) not
> selectable. Why ? Ask a relative in Redmond...

Can we cut the name-calling?

> 	2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
> handles. They are selectable, but for example you can't use'em for
> redirections. Okay in our case we don't care. I only mention it cause
> its scary and could pop back into your face some time later.

Handles are a much more low-level concept than file descriptors.  Get
used to it.

> 	3. The GUI API doesn't expose a descriptor (handle), but fortunately
> (though disgustingly) there is a special syscall to wait on both "the
> message queue" and selectable handles: MsgWaitForMultipleObjects. So its
> doable, if not beautiful.
> 
> The Tcl solution to (1.), which is the only real issue,

Why is (1) the only issue?  Maybe in Tcl-land...

> is to have a
> separate thread blockingly read 1 byte from the pipe, and then post a
> message back to the main thread to awaken it (yes, ugly code to handle
> that extra byte and integrate it with the buffering scheme).

Or the exposed API could deal with this in a different way.

> In summary, why not peruse Tcl's hard-won experience on
> selecting-on-windoze-pipes ?

Because it's designed for Tcl.

> Then, for the API exposed to the Python programmer, the Tclly exposed
> one is a starter:
> 
> 	fileevent $channel readable|writable callback
> 	...
> 	vwait breaker_variable
> 
> Explanation for non-Tclers: fileevent hooks the callback, vwait does a
> loop of select(). The callback(s) is(are) called without breaking the
> loop, unless $breaker_variable is set, at which time vwait returns.

Sorry, you've lost me here.  Fortunately there's more info at
http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
very complicated, and I'm not sure why you rejected my earlier
suggestion to use threads outright as "too complicated".  After
reading that man page, threads seem easy compared to the caution one
has to exert when using non-blocking I/O.

> One note about 'breaker_variable': I'm not sure I like it. I'd prefer
> something based on exceptions. I don't quite understand why it's not
> already this way in Tcl (which has (kindof) first-class exceptions), but
> let's not repeat the mistake: let's suggest that (the equivalent of)
> vwait loops forever, only to be broken out by an exception from within
> one of the callbacks.

Vwait seems to be part of the Tcl event model.  Maybe we would need to
think about an event model for Python?  On the other hand, Python is
at the mercy of the event model of whatever GUI package it is using --
which could be Tk, or wxWindows, or Gtk, or native Windows, or native
MacOS, or any of a number of other event models.

Perhaps this is an issue that each GUI package available to Python
will have to deal with separately...

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Mon May 22 17:49:24 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:49:24 -0700
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: Your message of "Mon, 22 May 2000 14:40:51 +0200."
 <39292AD2.F5080E35@tismer.com>
References: <000301bfc082$51ce0180$6c2d153f@tim>
 <39292AD2.F5080E35@tismer.com>
Message-ID: <200005221649.JAA07398@cj20424-a.reston1.va.home.com>

Christian, there was a smiley in your signature, so I can safely
ignore it, right?  It doesn't make sense at all to me to make "abc"[0]
return 97 instead of "a".

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Mon May 22 17:54:35 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:54:35 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 15:02:18 +0200."
 <m12trqc-000DieC@artcom0.artcom-gmbh.de>
References: <m12trqc-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <200005221654.JAA07426@cj20424-a.reston1.va.home.com>

> pf@pefunbk> python 
> Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> import string
> >>> print string.upper("ä")
> Ä
> >>> 

This threw me off too.  However try this:

python -c 'print "ä".upper()'

It will print "ä".  A mystery?  No, the GNU readline library calls
setlocale().  It is wrong, but I can't help it.  But it only affects
interactive use of Python.

> In recent Linux distributions almost every Linux C-program seems to 
> contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy 
> to forget about it.  However the core Python interpreter does not.
> it seems the Linux C-Library is not fully ANSI compliant in this case.
> It seems to honour the setting of $LANG regardless whether a program
> calls 'setlocale' or not.

No, the explanation is in GNU readline.

Compile this little program and see for yourself:

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
	/* No setlocale() call, so the "C" locale applies and toupper()
	   leaves 'ä' alone; add setlocale(LC_ALL, "") before the printf
	   to honour LC_CTYPE from the environment instead. */
	printf("toupper(%c) = %c\n", 'ä', toupper((unsigned char)'ä'));
	return 0;
}

--Guido van Rossum (home page: http://www.python.org/~guido/)


From tismer@tismer.com  Mon May 22 15:11:37 2000
From: tismer@tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 16:11:37 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
References: <000301bfc082$51ce0180$6c2d153f@tim>
 <39292AD2.F5080E35@tismer.com> <200005221649.JAA07398@cj20424-a.reston1.va.home.com>
Message-ID: <39294019.3CB47800@tismer.com>


Guido van Rossum wrote:
> 
> Christian, there was a smiley in your signature, so I can safely
> ignore it, right?  It doesn't make sense at all to me to make "abc"[0]
> return 97 instead of "a".

There was a smiley, but mostly because I cannot decide
what I want. I'm quite convinced that strings had better
not be sequences, at least not sequences of strings.

"abc"[0:1] would be enough, "abc"[0] isn't worth the side effects,
as listed in Tim's posting.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From fdrake@acm.org  Mon May 22 15:12:54 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 07:12:54 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005220710520.13789-100000@mailhost.beopen.com>

On Fri, 19 May 2000, Guido van Rossum wrote:
 > Hm, that's bogus.  It works well under Windows -- with the restriction
 > that it only works for sockets, but for sockets it works as well as
 > on Unix.  it also works well on the Mac.  I wonder where that note
 > came from (it's probably 6 years old :-).

  Is that still in there?  If I could get a pointer from someone I'll be
able to track it down.  I didn't see it in the select or socket module
documents, and a quick grep didn't find 'really work'.
  It's definitely fixable if we can find it.  ;)


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From fdrake@acm.org  Mon May 22 15:21:48 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 07:21:48 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <PLEJJNOHDIGGLDPOGPJJGEIPCDAA.DavidA@ActiveState.com>
Message-ID: <Pine.LNX.4.10.10005220718250.13789-100000@mailhost.beopen.com>

On Fri, 19 May 2000, David Ascher wrote:
 > I'm pretty sure I know where it came from -- it came from Sam Rushing's
 > tutorial on how to use Medusa, which was more or less cut & pasted into the
 > doc, probably at the time that asyncore and asynchat were added to the
 > Python core.  IMO, it's not the best part of the Python doc -- it is much
 > too low-to-the ground, and assumes the reader already understands much about
 > I/O, sync/async issues, and cares mostly about high performance.  All of

  It's a fairly young section, and I haven't had as much time to review
and edit that or some of the other young sections.  I'll try to pay
particular attention to these as I work on the 1.6 release.

 > which are true of wonderful Sam, most of which are not true of the average
 > Python user.
 > 
 > While we're complaining about doc, asynchat is not documented, I believe.
 > Alas, I'm unable to find the time to write up said documentation.

  Should that situation change, I'll gladly accept a section on asynchat!
Or, if anyone else has time to contribute...??


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From skip@mojam.com  Mon May 22 15:25:00 2000
From: skip@mojam.com (Skip Montanaro)
Date: Mon, 22 May 2000 09:25:00 -0500 (CDT)
Subject: [Python-Dev] ANNOUNCE: Python CVS tree moved to SourceForge
In-Reply-To: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>
References: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>
Message-ID: <14633.17212.650090.540777@beluga.mojam.com>

    Guido> If you have an existing working tree that points to the
    Guido> cvs.python.org repository, you may want to retarget it to the
    Guido> SourceForge tree.  This can be done painlessly with Greg Ward's
    Guido> cvs_chroot script:

    Guido>   http://starship.python.net/~gward/python/

I tried this with (so far) no apparent success.  I ran cvs_chroot as

    cvs_chroot :pserver:anonymous@cvs.python.sourceforge.net:/cvsroot/python

It warned me about some directories that didn't match the top level
directory.  "No problem", I thought.  I figured they were for the nondist
portions of the tree.  When I tried a cvs update after logging in to the
SourceForge cvs server I got tons of messages that looked like:

    cvs update: move away dist/src/Tools/scripts/untabify.py; it is in the way
    C dist/src/Tools/scripts/untabify.py

It doesn't look like untabify.py has been hosed, but the warnings worry me.
Anyone else encounter this problem?  If so, what's its meaning?

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From alexandre.ferrieux@cnet.francetelecom.fr  Mon May 22 15:51:56 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Mon, 22 May 2000 16:51:56 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
 <3928EEF1.693F@cnet.francetelecom.fr> <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <3929498C.1941@cnet.francetelecom.fr>

Guido van Rossum wrote:
> 
> > Yup. One easy answer is 'just copy from Tcl'...
> 
> Tcl seems to be your only frame of reference.

Nope, but I'll welcome any proof of existence of similar abstractions
(for multiplexing) elsewhere.

>  I think it's too early
> to say that borrowing Tcl's design is right for Python.  Don't forget
> that part of Tcl's design was guided by the desire for backwards
> compatibility with Tcl's strong (stronger than Python I find!) Unix
> background.

I don't quite get how the 'unix background' comes into play here, since
[fileevent] is now implemented and works correctly on all platforms.

If you are talking about the API as seen from above, I don't understand
why 'hooking a callback' and 'multiplexing event sources' are a unix
specificity, and/or why they should be avoided outside unix.

> > Seriously, I'm really too new to Python to suggest the details or even
> > the *style* of this 'level 2 API to multiplexing'. However, I can sketch
> > the implementation since select() (from C or Tcl) is the one primitive I
> > most depend on !
> >
> > Basically, as shortly mentioned before, the key problem is the
> > heterogeneity of seemingly-selectable things in Windoze. On unix, not
> > only does select() work with
> > all descriptor types on which it makes sense, but also the fd used by
> > Xlib is accessible; hence clean multiplexing even with a GUI package is
> > trivial. Now to the real (rotten) meat, that is M$'s. Facts:
> 
> Note that on Windows, select() is part of SOCKLIB, which explains why
> it only understands sockets.  Native Windows code uses the
> wait-for-event primitives that you are describing, and these are
> powerful enough to wait on named pipes, sockets, and GUI events.
> Complaining about the select interface on Windows isn't quite fair.

Sorry, you missed the point. Here I used the term 'select()' as a 
generic one (I didn't want to pollute a general discussion with
OS-specific names...). On windows it means MsgWaitForMultipleObjects.

Now as you said "these are powerful enough to wait on named pipes,
sockets, and GUI events"; I won't deny the obvious truth. However,
again, they don't work on *unnamed pipes* (which are the only ones in
'95). That's my sole reason for complaining, and I'm afraid it is fair
;-)

> >       1. 'Handle' types are not equal. Unnames pipes are (surprise!) not
> > selectable. Why ? Ask a relative in Redmond...
> 
> Can we cut the name-calling?

Yes we can :^P

> 
> >       2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
> > handles. They are selectable, but for example you can't use'em for
> > redirections. Okay in our case we don't care. I only mention it cause
> > its scary and could pop back into your face some time later.
> 
> Handles are a much more low-level concept than file descriptors.  get
> used to it.

Take it easy, I meant to help. Low-level as they may be, can you explain
why *some* can be passed to CreateProcess as redirections, and *some*
can't?  Obviously there *is* some attempt to unify things in Windows (if
only the single name of 'handle'); and just as clearly it is not
completely successful.

> >       3. The GUI API doesn't expose a descriptor (handle), but fortunately
> > (though disgustingly) there is a special syscall to wait on both "the
> > message queue" and selectable handles: MsgWaitForMultipleObjects. So its
> > doable, if not beautiful.
> >
> > The Tcl solution to (1.), which is the only real issue,
> 
> Why is (1) the only issue?

Because for (2) we don't care (no need for redirections in our case) and
for (3) the judgement is only aesthetic. 

>  Maybe in Tcl-land...

Come on, I'm emigrating from Tcl to Python with open palms, as Cameron
puts it.
I've already mentioned the outstanding beauty of Python's internal
design, and in comparison Tcl is absolutely awful. Even at the (script)
API level, some of the early choices in Tcl are disgusting (and some
recent ones too...). I'm really turning to Python with the greatest
pleasure - please don't interpret my arguments as yet another Lang1 vs.
Lang2 flamewar.

> > is to have a
> > separate thread blockingly read 1 byte from the pipe, and then post a
> > message back to the main thread to awaken it (yes, ugly code to handle
> > that extra byte and integrate it with the buffering scheme).
> 
> Or the exposed API could deal with this in a different way.

Please elaborate ?

> > In summary, why not peruse Tcl's hard-won experience on
> > selecting-on-windoze-pipes ?
> 
> Because it's designed for Tcl.

I said 'why not' as a positive suggestion.
I didn't expect you to actually say why not...

Moreover, I don't understand 'designed for Tcl'. What's specific to Tcl
in unifying descriptor types ?

> > Then, for the API exposed to the Python programmer, the Tclly exposed
> > one is a starter:
> >
> >       fileevent $channel readable|writable callback
> >       ...
> >       vwait breaker_variable
> >
> > Explanation for non-Tclers: fileevent hooks the callback, vwait does a
> > loop of select(). The callback(s) is(are) called without breaking the
> > loop, unless $breaker_variable is set, at which time vwait returns.
> 
> Sorry, you've lost me here.  Fortunately there's more info at
> http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
> very complicated,

Ahem, self-destroying argument: "Fortunately ... very complicated".

While I agree the fileevent manpage is longer than it should be, I fail
to see what's complicated in the model of 'hooking a callback for a
given kind of event'.

> and I'm not sure why you rejected my earlier
> suggestion to use threads outright as "too complicated".

Not on the same level. You're complaining about the script-level API (or
its documentation, more precisely !). I dismissed the thread-based
*implementation* as overkill in terms of resource consumption (thread
context + switching + ITC) on platforms which can use select() (for anon
pipes on Windows, as already explained, the thread is unavoidable).

>  After
> reading that man page, threads seem easy compared to the caution one
> has to exert when using non-blocking I/O.

Oh, I get it. The problem is, *that* manpage unfortunately tries to
explain event-based and non-blocking I/O at the same time (presumably
because the average user will never follow the 'See Also' links). That's
a blatant pedagogic mistake. Let me try:

	fileevent <channel> readable|writable <script>

	Hooks <script> to be called back whenever the given <channel> becomes
readable|writable. 'Whenever' here means from within event processing
primitives (vwait, update).

	Example:

		# whenever a new line comes down the socket, display it.

		set s [socket $host $port]
		fileevent $s readable gotdata
		proc gotdata {} {global s;puts "New data: [gets $s]"}
		vwait forever

To answer a potential question about blockingness, yes, in the example
above the [gets] will block until a complete line is received. But
mentioning this fact in the manpage is uselessly misleading, because the
fileevent mechanism obviously lets you implement any kind of protocol,
line-based or not, terminator- or size-header-based or not. Uses with
blocking and nonblocking [read] and mixes thereof are immediate
consequences of this classification.
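
For comparison, a rough Python rendering of the same example, using only
select() from the standard library (my own sketch; HOST and PORT are
placeholders for $host and $port):

    import select, socket

    HOST, PORT = "localhost", 12345          # placeholders

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((HOST, PORT))
    f = s.makefile("r")

    while True:                              # the "vwait forever" loop
        ready, _, _ = select.select([s], [], [])
        if s in ready:
            line = f.readline()              # like [gets]: blocks for a full line
            if not line:
                break                        # peer closed the connection
            print("New data: %s" % line.rstrip())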

Hope this helps.

> Vwait seems to be part of the Tcl event model.

Hardly. It's just the Tcl name for the primitive that (blockingly) calls
select() (generic term - see above)

>  Maybe we would need to think about an event model for Python?

With pleasure - please define 'model'. Do you mean callbacks vs.
explicit decoding of an event structure ? Do you mean blocking select()
vs. something more asynchronous like threads or signals ?

> On the other hand, Python is
> at the mercy of the event model of whatever GUI package it is using --
> which could be Tk, or wxWindows, or Gtk, or native Windows, or native
> MacOS, or any of a number of other event models.

Why should Python alone be exposed to this diversity ? Don't assume
that Tk is the only option for Tcl. The Tcl/C API even exposes the
proper hooks to integrate any new event source, like a GUI package.

Again, I'm not interested in Tcl vs. Python here (and anyway Python wins
!!!). I just want to extract what's truly orthogonal to specific design
choices. As it turns out, what you call 'the Tcl event model' can
happily be transported to any (imperative) lang.

I can even be more precise: a random GUI package can be used this way
iff the two following conditions hold:

	(a) Its queue can awaken a select()-like primitive.
	(b) Its queue can be Peek'ed (to check for buffered msgs
                                      before blocking again)

> Perhaps this is an issue that each GUI package available to Python
> will have to deal with separately...

The characterization is given just above. To me it looks generic enough
to build an abstraction upon it. It's been done for Tcl, and is utterly
independent from its design peculiarities. Now everything depends on
whether abstraction is sought or not...

-Alex


From pf@artcom-gmbh.de  Mon May 22 16:01:50 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 17:01:50 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005221654.JAA07426@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 22, 2000  9:54:35 am"
Message-ID: <m12ttiI-000DieC@artcom0.artcom-gmbh.de>

Hi, 

Guido van Rossum:
> > pf@pefunbk> python 
> > Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
> > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> > >>> import string
> > >>> print string.upper("ä")
> > Ä
> > >>> 
> 
> This threw me off too.  However try this:
> 
> python -c 'print "ä".upper()'

Yes, you are right.  :-(

Conclusion:  If the 'locale' module should ever become deprecated,
then ...ummm...  we poor mortals will simply have to add a line
'import readline' to our Python programs.  Nifty... ;-)

Regards, Peter


From claird@starbase.neosoft.com  Mon May 22 16:19:21 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 10:19:21 -0500 (CDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <200005221519.KAA45253@starbase.neosoft.com>

	From guido@cj20424-a.reston1.va.home.com  Mon May 22 08:45:58 2000
			.
			.
			.
	Tcl seems to be your only frame of reference.  I think it's too early
	to say that borrowing Tcl's design is right for Python.  Don't forget
	that part of Tcl's design was guided by the desire for backwards
	compatibility with Tcl's strong (stronger than Python I find!) Unix
	background.
Right.  We quite agree.  Both of us came
to this looking to learn in the first place
what *is* right for Python.
			.
		[various points]
			.
			.
	> Then, for the API exposed to the Python programmer, the Tclly exposed
	> one is a starter:
	> 
	> 	fileevent $channel readable|writable callback
	> 	...
	> 	vwait breaker_variable
	> 
	> Explanation for non-Tclers: fileevent hooks the callback, vwait does a
	> loop of select(). The callback(s) is(are) called without breaking the
	> loop, unless $breaker_variable is set, at which time vwait returns.

	Sorry, you've lost me here.  Fortunately there's more info at
	http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
	very complicated, and I'm not sure why you rejected my earlier
	suggestion to use threads outright as "too complicated".  After
	reading that man page, threads seem easy compared to the caution one
	has to exert when using non-blocking I/O.

	> One note about 'breaker_variable': I'm not sure I like it. I'd prefer
	> something based on exceptions. I don't quite understand why it's not
	> already this way in Tcl (which has (kindof) first-class exceptions), but
	> let's not repeat the mistake: let's suggest that (the equivalent of)
	> vwait loops forever, only to be broken out by an exception from within
	> one of the callbacks.

	Vwait seems to be part of the Tcl event model.  Maybe we would need to
	think about an event model for Python?  On the other hand, Python is
	at the mercy of the event model of whatever GUI package it is using --
	which could be Tk, or wxWindows, or Gtk, or native Windows, or native
	MacOS, or any of a number of other event models.

	Perhaps this is an issue that each GUI package available to Python
	will have to deal with separately...

	--Guido van Rossum (home page: http://www.python.org/~guido/)
There are a lot of issues here.  I've got clients
with emergencies that'll keep me busy all week,
and will be able to respond only sporadically.
For now, I want to emphasize that Alex and I both
respect Python as itself; it would simply be alien
to us to do the all-too-common trick of whining,
"Why can't it be like this other language I just
left?"

Tcl's event model has been more successful than
any of you probably realize.  You deserve to know
that.

Should Python have an event model?  I'm not con-
vinced.  I want to work with Python threading a
bit more.  It could be that it answers all the
needs Python has in this regard.  The documentation
Guido found "very complicated" above we think of
as ...--well, I want to conclude by saying I find
this discussion productive, and appreciate your
patience in entertaining it.  Daemon construction
is a lot of what I do, and, more broadly, I like to
think about useful OS service abstractions.  I'll
be back as soon as I have something to contribute.


From effbot@telia.com  Mon May 22 16:13:58 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 17:13:58 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12ttiI-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <00b801bfc400$5c0d8fe0$34aab5d4@hagrid>

Peter Funk wrote:
> Conclusion:  If the 'locale' module should ever become deprecated,
> then ...ummm...  we poor mortals will simply have to add a line
> 'import readline' to our Python programs.  Nifty... ;-)

won't help if python is changed to use the *unicode*
ctype functions...

...but on the other hand, if you use unicode strings for
anything that is not plain ASCII, upper and friends will
do the right thing even if you forget to import readline.
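
for example (a small interactive illustration of my own; whether the
result actually displays depends on the terminal's encoding):

    >>> print(u"\xe4".upper())
    Ä

the case mapping comes from the Unicode database, not from the C
library, so the locale no longer matters.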

</F>



From effbot@telia.com  Mon May 22 16:37:01 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 17:37:01 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>             <008001bfc3be$7e5eae40$34aab5d4@hagrid>  <200005221616.JAA07234@cj20424-a.reston1.va.home.com>
Message-ID: <00e301bfc403$abac3940$34aab5d4@hagrid>

Guido van Rossum <guido@python.org> wrote:
> > Peter Funk wrote:
> > > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> > 
> > you're missing the point -- now that we've added unicode support to
> > Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> > platforms implement a wctype interface, it's not widely available, and it's
> > not always unicode.
> 
> Huh?  We were talking strictly 8-bit strings here.  The locale support
> hasn't changed there.

I meant that the locale support, even though it's part of POSIX, isn't
good enough for unicode support...

> > so in order to provide platform-independent unicode support, Python 1.6
> > comes with unicode-aware and fully portable replacements for the ctype
> > functions.
> 
> For those who only need Latin-1 or another 8-bit ASCII superset, the
> Unicode stuff is overkill.

why?

besides, overkill or not:

> > the code is already in there...

> > note that this leaves us with four string flavours in 1.6:
> > 
> > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> >   encoding.  upper, strip, etc should not be used.
> 
> These are not strings.

depends on who you're asking, of course:

>>> b = fetch_binary_goop()
>>> type(b)
<type 'string'>
>>> dir(b)
['capitalize', 'center', 'count', 'endswith', 'expandtabs', ...

> > - 8-bit text strings using the system encoding.  upper, strip, etc works
> >   as long as the locale is properly configured.
> > 
> > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> >   system encoding is a subset of unicode -- which means US ASCII or
> >   ISO Latin 1.
> 
> This is a figment of your imagination.  You can use 8-bit text strings
> to contain Latin-1, but you have to set your locale to match.

if that's a supported feature (instead of being deprecated in favour
of unicode), maybe we should base the default unicode/string
conversions on the locale too?

background:

until now, I've been convinced that the goal should be to have two
"string-like" types: binary arrays for binary goop (including encoded
text), and a Unicode-based string type for text.  afaik, that's the
solution used in Tcl and Perl, and it's also "conceptually compatible"
with things like Java, Windows NT, and XML (and everything else from
the web universe).

given that, it has been clear to me that anything that is not compatible
with this model should be removed as soon as possible (and deprecated
as soon as we understand why it won't fly under the new scheme).

but if backwards compatibility is more important than a minimalistic
design, maybe we need three different "string-like" types:

-- binary arrays (still implemented by the 8-bit string type in 1.6)

-- 8-bit old-style strings (using the "system encoding", as defined
   by the locale.  if the locale is not set, they're assumed to contain
   ASCII)

-- unicode strings (possibly using a "polymorphic" internal representation)

this also solves the default conversion problem: use the locale environ-
ment variables to determine the default encoding, and call
sys.set_string_encoding from site.py (see my earlier post for details).

what have I missed this time?

</F>

PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

>>> sys
... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...

looks a little strange...



From gmcm@hypernet.com  Mon May 22 17:08:07 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Mon, 22 May 2000 12:08:07 -0400
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <1253110775-103205454@hypernet.com>

Alexandre Ferrieux wrote:

> The Tcl solution to (1.), which is the only real issue, is to
> have a separate thread blockingly read 1 byte from the pipe, and
> then post a message back to the main thread to awaken it (yes,
> ugly code to handle that extra byte and integrate it with the
> buffering scheme).

What's the actual mechanism here? A (dummy) socket so 
"select" works? The WSAEvent... stuff (to associate sockets 
with waitable events) and WaitForMultiple...? The 
WSAAsync... stuff (creates Windows msgs when socket stuff 
happens) with MsgWait...? Some other combination?

Is the mechanism different if it's a console app (vs GUI)?

I'd assume in a GUI, the fileevent-checker gets integrated with 
the message pump. In a console app, how does it get control?

 
> In summary, why not peruse Tcl's hard-won experience on
> selecting-on-windoze-pipes ?
> 
> Then, for the API exposed to the Python programmer, the Tclly
> exposed one is a starter:
> 
>  fileevent $channel readable|writable callback
>  ...
>  vwait breaker_variable
> 
> Explanation for non-Tclers: fileevent hooks the callback, vwait
> does a loop of select(). The callback(s) is(are) called without
> breaking the loop, unless $breaker_variable is set, at which time
> vwait returns.
> 
> One note about 'breaker_variable': I'm not sure I like it. I'd
> prefer something based on exceptions. I don't quite understand
> why it's not already this way in Tcl (which has (kindof)
> first-class exceptions), but let's not repeat the mistake: let's
> suggest that (the equivalent of) vwait loops forever, only to be
> broken out by an exception from within one of the callbacks.
> 
> HTH,
> 
> -Alex
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev



- Gordon


From ping@lfw.org  Mon May 22 17:29:54 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:29:54 -0700 (PDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
 multiplexing is too hard)
In-Reply-To: <200005221519.KAA45253@starbase.neosoft.com>
Message-ID: <Pine.LNX.4.10.10005220923360.461-100000@localhost>

On Mon, 22 May 2000, Cameron Laird wrote:
> 
> Tcl's event model has been more successful than
> any of you probably realize.  You deserve to know
> that.

Events are a very powerful concurrency model (arguably
more reliable because they are easier to understand
than threads).  My friend Mark Miller has designed a
language called E (http://www.erights.org/) that uses
an event model for all object messaging, and i would
be interested in exploring how we can apply those ideas
to improve Python.

> Should Python have an event model?  I'm not con-
> vinced.

Indeed.  This would be a huge core change, way too
large to be feasible.  But i do think it would be
excellent to simply provide more facilities for
helping people use whatever model they want, and,
given the toolkit, let people build great things.

What you described sounded like it could be implemented
fairly easily with some functions like

    register(handle, mode, callback)
        or file.register(mode, callback)

        Put 'callback' in a dictionary of files
        to be watched for mode 'mode'.

    mainloop(timeout)

        Repeat (forever or until 'timeout') a
        'select' on all the files that have been
        registered, and do calls to the callbacks
        that have been registered.

Presumably there would be some exception that a
callback could raise to quietly exit the 'select'
loop.
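
A minimal sketch of that interface (the names register/unregister/mainloop
follow the description above, ExitLoop is made up, and only the "readable"
mode is handled; this is not an existing API):

    import select

    class ExitLoop(Exception):
        """Raised by a callback to leave the main loop quietly."""
        pass

    _watched = {}          # fd -> (file-like object, callback)

    def register(handle, callback):
        # 'handle' is anything with a fileno() method (files, sockets, ...)
        _watched[handle.fileno()] = (handle, callback)

    def unregister(handle):
        del _watched[handle.fileno()]

    def mainloop(timeout=None):
        try:
            while _watched:
                ready = select.select(list(_watched.keys()), [], [], timeout)[0]
                if not ready:
                    break                        # timed out
                for fd in ready:
                    if fd in _watched:           # a callback may unregister others
                        handle, callback = _watched[fd]
                        callback(handle)
        except ExitLoop:
            pass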

    1. How does Tcl handle exiting the loop?
       Is there a way for a callback to break
       out of the vwait?

    2. How do you unregister these callbacks in Tcl?

    


-- ?!ng



From ping@lfw.org  Mon May 22 17:23:23 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:23:23 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <200005221309.IAA41866@starbase.neosoft.com>
Message-ID: <Pine.LNX.4.10.10005220917350.461-100000@localhost>

On Mon, 22 May 2000, Cameron Laird wrote:
> I've got a whole list of "higher-level
> abstractions around OS stuff" that I've been
> collecting.  Maybe I'll make it fit for
> others to see once we're through this affair

Absolutely!  I've thought about this too.  A nice "child process
management" module would be very convenient to have -- i've done
such stuff before -- though i don't know enough about Windows
semantics to make one that works on multiple platforms.  Some
sort of (hypothetical)

    delegate.spawn(function) - return a child object or id
    delegate.kill(id) - kill child

etc. could possibly free us from some of the system dependencies
of fork, signal, etc.

I currently have a module called "delegate" which can run a
function in a child process for you.  It uses pickle() to send
the return value of the function back to the parent (via an
unnamed pipe).  Again, Unix-specific -- but it would be very
cool if we could provide this functionality in a module.  My
module provides just two things, but it's already very useful:

    delegate.timeout(function, timeout) - run the 'function' in
        a child process; if the function doesn't finish in
        'timeout' seconds, kill it and raise an exception;
        otherwise, return the return value of the function

    delegate.parallelize(function, [work, work, work...]) -
        fork off many children (you can specify how many if
        you want) and set each one to work calling the 'function'
        with one of the 'work' items, queueing up work for
        each of the children until all the work gets done.
        Return the results in a dictionary mapping each 'work'
        item to its result.
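
A rough, Unix-only sketch of a 'timeout' helper along those lines (this is
not the actual 'delegate' module -- the fork/pipe/pickle approach simply
follows the description above, and it assumes the pickled result fits in a
single read):

    import os, pickle, select, signal

    class Timeout(Exception):
        pass

    def timeout(function, seconds, *args):
        """Run function(*args) in a child process; kill it after 'seconds'."""
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                            # child: compute, pickle, exit
            os.close(r)
            os.write(w, pickle.dumps(function(*args)))
            os._exit(0)
        os.close(w)                             # parent
        ready = select.select([r], [], [], seconds)[0]
        if not ready:
            os.kill(pid, signal.SIGKILL)        # took too long
            os.waitpid(pid, 0)
            os.close(r)
            raise Timeout("no result after %s seconds" % seconds)
        result = pickle.loads(os.read(r, 1 << 20))
        os.close(r)
        os.waitpid(pid, 0)
        return result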


-- ?!ng



From ping@lfw.org  Mon May 22 17:17:01 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:17:01 -0700 (PDT)
Subject: Some information about locale (was Re: [Python-Dev] repr vs.
 str and locales again)
In-Reply-To: <200005221616.JAA07234@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005220914000.461-100000@localhost>

On Mon, 22 May 2000, Guido van Rossum wrote:
> > note that this leaves us with four string flavours in 1.6:
> > 
> > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> >   encoding.  upper, strip, etc should not be used.
> 
> These are not strings.

Indeed -- but at the moment, we're letting people continue to
use strings this way, since they already do it.

> > - 8-bit text strings using the system encoding.  upper, strip, etc works
> >   as long as the locale is properly configured.
> > 
> > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> >   system encoding is a subset of unicode -- which means US ASCII or
> >   ISO Latin 1.
> 
> This is a figment of your imagination.  You can use 8-bit text strings
> to contain Latin-1, but you have to set your locale to match.

I would like it to be only the latter, as Fred, i, and others
have previously suggested, and as corresponds to your ASCII
proposal for treatment of 8-bit strings.

But doesn't the current locale-dependent behaviour of upper()
etc. mean that strings are getting interpreted in the first way?

> > is this complexity really worth it?
> 
> From a backwards compatibility point of view, yes.  Basically,
> programs that don't use Unicode should see no change in semantics.

I'm afraid i have to agree with this, because i don't see any
other option that lets us escape from any of these four ways
of using strings...


-- ?!ng



From fdrake@acm.org  Mon May 22 18:05:46 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 10:05:46 -0700 (PDT)
Subject: Some information about locale (was Re: [Python-Dev] repr vs.
 str and locales again)
In-Reply-To: <Pine.LNX.4.10.10005220914000.461-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005221004220.14844-100000@mailhost.beopen.com>

On Mon, 22 May 2000, Ka-Ping Yee wrote:
 > I would like it to be only the latter, as Fred, i, and others

  Please refer to Fredrik as Fredrik or /F; I don't think anyone else
refers to him as "Fred", and I got really confused when I saw this!  ;)


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From pf@artcom-gmbh.de  Mon May 22 18:17:40 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 19:17:40 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <00e301bfc403$abac3940$34aab5d4@hagrid> from Fredrik Lundh at "May 22, 2000  5:37: 1 pm"
Message-ID: <m12tvpk-000DieC@artcom0.artcom-gmbh.de>

Hi!

Fredrik Lundh:
[...]
> > > so in order to provide platform-independent unicode support, Python 1.6
> > > comes with unicode-aware and fully portable replacements for the ctype
> > > functions.
> > 
> > For those who only need Latin-1 or another 8-bit ASCII superset, the
> > Unicode stuff is overkill.
> 
> why?

Going from 8 bit strings to 16 bit strings doubles the memory 
requirements, right?

As long as we only deal with English, Spanish, French, Swedish, Italian
and several other languages, 8 bit strings work out pretty well.  
Unicode will be neat if you can afford the additional space.  
People using Python on small computers in western countries
probably don't want to double the size of their data structures
for no reasonable benefit.

> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> if that's a supported feature (instead of being deprecated in favour
> of unicode), maybe we should base the default unicode/string con-
> versions on the locale too?

Many locales effectively use Latin1 but for some other locales there
is a difference:

$ LANG="es_ES" python  # Espanõl uses Latin-1, the same as "de_DE"
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string; print string.upper("äöü")
ÄÖÜ

$ LANG="ru_RU" python  # This uses ISO 8859-5 
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string; print string.upper("äöü")
Ħ¬

I don't know how many people, for example in Russia, already depend
on this behaviour.  I suggest it should stay as is.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From guido@python.org  Mon May 22 21:38:17 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 15:38:17 -0500
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 09:17:01 MST."
 <Pine.LNX.4.10.10005220914000.461-100000@localhost>
References: <Pine.LNX.4.10.10005220914000.461-100000@localhost>
Message-ID: <200005222038.PAA01284@cj20424-a.reston1.va.home.com>

[Fredrik]
> > > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> > >   encoding.  upper, strip, etc should not be used.

[Guido]
> > These are not strings.

[Ping]
> Indeed -- but at the moment, we're letting people continue to
> use strings this way, since they already do it.

Oops, mistake.  I thought that Fredrik (not Fred! that's another
person in this context!) meant the array module, but upon re-reading
he didn't.

> > > - 8-bit text strings using the system encoding.  upper, strip, etc works
> > >   as long as the locale is properly configured.
> > > 
> > > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> > >   system encoding is a subset of unicode -- which means US ASCII or
> > >   ISO Latin 1.
> > 
> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> I would like it to be only the latter, as Fred, i, and others
Fredrik, right?
> have previously suggested, and as corresponds to your ASCII
> proposal for treatment of 8-bit strings.
> 
> But doesn't the current locale-dependent behaviour of upper()
> etc. mean that strings are getting interpreted in the first way?

That's what I meant to say -- 8-bit strings use the system encoding
guided by the locale.

> > > is this complexity really worth it?
> > 
> > From a backwards compatibility point of view, yes.  Basically,
> > programs that don't use Unicode should see no change in semantics.
> 
> I'm afraid i have to agree with this, because i don't see any
> other option that lets us escape from any of these four ways
> of using strings...

Which is why I find Fredrik's attitude unproductive.

And where's the SRE release?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Mon May 22 21:53:55 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 22 May 2000 22:53:55 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and
 locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>             <008001bfc3be$7e5eae40$34aab5d4@hagrid>  <200005221616.JAA07234@cj20424-a.reston1.va.home.com> <00e301bfc403$abac3940$34aab5d4@hagrid>
Message-ID: <39299E63.CD996D7D@lemburg.com>

Fredrik Lundh wrote:
> 
> > > - 8-bit text strings using the system encoding.  upper, strip, etc works
> > >   as long as the locale is properly configured.
> > >
> > > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> > >   system encoding is a subset of unicode -- which means US ASCII or
> > >   ISO Latin 1.
> >
> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> if that's a supported feature (instead of being deprecated in favour
> of unicode), maybe we should base the default unicode/string con-
> versions on the locale too?

This was proposed by Guido some time ago... the discussion
ended with the problem of extracting the encoding definition
from the locale names. There are some ways to solve this
problem (static mappings, fancy LANG variables etc.), but
AFAIK, there is no widely used standard on this yet, so
in the end you're stuck with defining the encoding by hand...
e.g.
	setenv LANG de_DE:latin-1

Perhaps we should help out a little and provide Python with
a parser for the LANG variable with some added magic
to provide useful defaults ?!
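
A very rough sketch of what such a helper could look like (the fallback
table and the name guess_encoding are made up for illustration; real locale
names vary a lot between platforms):

    import os

    # illustrative defaults only -- real systems spell locale names in
    # many different ways
    _locale_encodings = {
        "de_DE": "latin-1",
        "es_ES": "latin-1",
        "ru_RU": "iso-8859-5",
    }

    def guess_encoding(default="ascii"):
        """Guess the encoding implied by the LC_ALL/LC_CTYPE/LANG variables."""
        lang = (os.environ.get("LC_ALL") or os.environ.get("LC_CTYPE")
                or os.environ.get("LANG") or "")
        # forms like "de_DE.ISO8859-1" or "de_DE:latin-1" name the codeset
        for sep in ".:":
            if sep in lang:
                return lang.split(sep, 1)[1]
        return _locale_encodings.get(lang, default)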

> [...]
> 
> this also solves the default conversion problem: use the locale environ-
> ment variables to determine the default encoding, and call
> sys.set_string_encoding from site.py (see my earlier post for details).

Right, that would indeed open up a path for consensus...

> </F>
> 
> PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

Perhaps... these were really only added as an experimental feature
to test the various possibilities (and a possible implementation).

My original intention was to remove them after final consensus
-- perhaps we should keep the functionality (expanded
to a per-thread setting; the global is a temporary hack) ?!
 
> >>> sys
> ... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...
> 
> looks a little strange...

True; see above for the reason why ;-)

PS: What do you think about the current internal design of
sys.set_string_encoding() ? Note that hash() and the "st"
parser markers still use UTF-8.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From tim_one@email.msn.com  Tue May 23 03:21:00 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Mon, 22 May 2000 22:21:00 -0400
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: <m12tokw-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <000501bfc45d$893ad4c0$9ea2143f@tim>

[Peter Funk]
> On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a
> working example which exploits this vulnerability in older versions
> of GCC.
>
> The basic idea is indeed very simple:  Since the /tmp directory is
> writable for any user, the bad guy can create a symbolic link in /tmp
> pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
> program will then overwrite this arbitrary file (where the programmer
> really wanted to write something to his tempfile instead).  Since this
> will happen with the access permissions of the process running this
> program, this opens a bunch of vulnerabilities in many programs
> writing something into temporary files with predictable file names.

I can understand all that, but does it have anything to do with Python's
tempfile module?  gcc wasn't fixed by changing glibc, right?  Playing games
with the file *names* doesn't appear to me to solve anything; the few posts
I bumped into where that was somehow viewed as a Good Thing were about
Solaris systems, where Sun kept the source for generating the "new,
improved, messy" names secret.  In Python, any attacker can read the code
for anything we do, which makes it much clearer that a name-game approach
is half-assed.

and-people-whine-about-worming-around-bad-decisions-in-
    windows<wink>-ly y'rs  - tim




From tim_one@email.msn.com  Tue May 23 06:15:46 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 23 May 2000 01:15:46 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <39294019.3CB47800@tismer.com>
Message-ID: <000401bfc475$f3a110a0$612d153f@tim>

[Christian Tismer]
> There was a smiley, but for the most since I cannot
> decide what I want. I'm quite convinced that strings should
> better not be sequences, at least not sequences of strings.
>
> "abc"[0:1] would be enough, "abc"[0] isn't worth the side effects,
> as listed in Tim's posting.

Oh, it's worth a lot more than those!  As Ping testified, the gotchas I
listed really don't catch many people, while string[index] is about as
common as integer+1.

The need for tuples specifically in "format % values" can be wormed around
by special-casing the snot out of a string in the "values" position.

The non-termination of repeated "string = string[0]" *could* be stopped by
introducing a distinct character type.  Trying to formalize the current type
of a string is messy ("string = sequence of string" is a bit paradoxical
<wink>).  The notion that a string is a sequence of characters instead is
vanilla and wholly natural.  OTOH, drawing that distinction at the type
level may well be more trouble in practice than it buys in theory!

So I don't know what I want either -- but I don't want *much* <wink>.

first-do-no-harm-ly y'rs  - tim




From Moshe Zadka <moshez@math.huji.ac.il>  Tue May 23 06:27:12 2000
From: Moshe Zadka <moshez@math.huji.ac.il> (Moshe Zadka)
Date: Tue, 23 May 2000 08:27:12 +0300 (IDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
 multiplexing is too hard)
In-Reply-To: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.GSO.4.10.10005230824130.12103-100000@sundial>

On Mon, 22 May 2000, Guido van Rossum wrote:

> Can we cut the name-calling?

Hey, what's life without a MS bashing now and then <wink>?

> Vwait seems to be part of the Tcl event model.  Maybe we would need to
> think about an event model for Python?  On the other hand, Python is
> at the mercy of the event model of whatever GUI package it is using --
> which could be Tk, or wxWindows, or Gtk, or native Windows, or native
> MacOS, or any of a number of other event models.

But that's sort of the point: Python needs a non-GUI event model, to 
use with daemons which need to handle many files. Every GUI package
would have its own event model, and Python will have one event model
that's not tied to a GUI package. 

that-only-proves-we-have-a-problem-ly y'rs, Z.
--
Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com



From alexandre.ferrieux@cnet.francetelecom.fr  Tue May 23 08:16:50 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:16:50 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com>
Message-ID: <392A3062.E1@cnet.francetelecom.fr>

Gordon McMillan wrote:
> 
> Alexandre Ferrieux wrote:
> 
> > The Tcl solution to (1.), which is the only real issue, is to
> > have a separate thread blockingly read 1 byte from the pipe, and
> > then post a message back to the main thread to awaken it (yes,
> > ugly code to handle that extra byte and integrate it with the
> > buffering scheme).
> 
> What's the actual mechanism here? A (dummy) socket so
> "select" works? The WSAEvent... stuff (to associate sockets
> with waitable events) and WaitForMultiple...? The
> WSAAsync... stuff (creates Windows msgs when socket stuff
> happens) with MsgWait...? Some other combination?

Other. Forget about sockets here, we're talking about true anonymous
pipes, under 95 and NT. Since they are not waitable nor peekable,
the only remaining option is to read in blocking mode from a dedicated
thread. Then of course, this thread reports back to the main
MsgWaitForMultiple with PostThreadMessage.
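
The same trick can be sketched in a platform-neutral way, with a helper
thread and a Queue standing in for the Windows message queue and
PostThreadMessage (purely illustrative; the real Tcl code does this in C):

    import threading, queue    # the module is named 'Queue' in older Pythons

    events = queue.Queue()     # stands in for the main thread's message queue

    def pipe_watcher(pipe):
        """Blockingly read one byte at a time and notify the main loop."""
        while 1:
            byte = pipe.read(1)
            if not byte:                       # writer closed the pipe
                events.put(("closed", pipe, ""))
                break
            # the ugly part: this byte must be re-integrated with whatever
            # buffering the main loop does on 'pipe'
            events.put(("readable", pipe, byte))

    def start_watching(pipe):
        t = threading.Thread(target=pipe_watcher, args=(pipe,))
        t.daemon = True                        # don't keep the process alive
        t.start()

    # main loop side: block on events.get() instead of MsgWaitForMultipleObjects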

> Is the mechanism different if it's a console app (vs GUI)?

No. Why should it ?

> I'd assume in a GUI, the fileevent-checker gets integrated with
> the message pump.

The converse: MsgWaitForMultiple integrates the thread's message queue
which is a superset of the GUI's event stream.

-Alex


From alexandre.ferrieux@cnet.francetelecom.fr  Tue May 23 08:36:35 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:36:35 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python  multiplexing is too hard)
References: <Pine.LNX.4.10.10005220923360.461-100000@localhost>
Message-ID: <392A3503.7C72@cnet.francetelecom.fr>

Ka-Ping Yee wrote:
> 
> > Should Python have an event model?  I'm not con-
> > vinced.
> 
> Indeed.  This would be a huge core change, way too
> large to be feasible. 

Warning here. What would indeed need a huge core change
is a pervasive use of events like in E. 'Having an event model'
is often interpreted in a less extreme way, simply meaning
'having the proper set of primitives at hand'.
Our discussion (and your comments below too, agreed !) was focussed
on the latter, so we're only talking about a pure library issue.
Asking any change in the Python lang itself for such a peripheral need
never even remotely crossed my mind !

> But i do think it would be
> excellent to simply provide more facilities for
> helping people use whatever model they want, and,
> given the toolkit, let people build great things.

Right.

> What you described sounded like it could be implemented
> fairly easily with some functions like
> 
>     register(handle, mode, callback)
>         or file.register(mode, callback)
> 
>         Put 'callback' in a dictionary of files
>         to be watched for mode 'mode'.
> 
>     mainloop(timeout)
> 
>         Repeat (forever or until 'timeout') a
>         'select' on all the files that have been
>         registered, and do calls to the callbacks
>         that have been registered.
> 
> Presumably there would be some exception that a
> callback could raise to quietly exit the 'select'
> loop.

Great !!! That's exactly the kind of Pythonic translation I was
expecting. Thanks !

>     1. How does Tcl handle exiting the loop?
>        Is there a way for a callback to break
>        out of the vwait?

Yes, as explained before, in Tcl the loop-breaker is a write sentinel
on a variable. When a callback wants to break out, it simply sets the
var. But as also mentioned before, I'd prefer an exception-based
mechanism as you summarized.

>     2. How do you unregister these callbacks in Tcl?

We just register an empty string as the callback name (script).
But this is just a random API choice. Anything more Pythonic is welcome
(an explicit unregister function is okay for me).

-Alex


From nhodgson@bigpond.net.au  Tue May 23 08:47:14 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Tue, 23 May 2000 17:47:14 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <047c01bfc48b$1d2addb0$e3cb8490@neil>

> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,
> the only remaining option is to read in blocking mode from a dedicated
> thread. ...

   Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.

   Neil
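
For reference, a rough sketch of that check via the win32 extensions (the
exact return tuple of win32pipe.PeekNamedPipe is assumed here and should be
double-checked against the win32all docs; untested):

    import win32pipe

    def pipe_has_data(handle):
        """Non-blocking check whether an anonymous pipe has bytes waiting."""
        # size 0: don't consume anything, just ask how much is buffered
        # (assumed return value: (data, total bytes available, bytes left))
        data, bytes_avail, bytes_left = win32pipe.PeekNamedPipe(handle, 0)
        return bytes_avail > 0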




From Fredrik Lundh" <effbot@telia.com  Tue May 23 08:50:56 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 09:50:56 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <00d901bfc48b$a40432a0$f2a6b5d4@hagrid>

Alexandre Ferrieux wrote:
> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,

I thought PeekNamedPipe worked just fine on anonymous pipes.

or are "true anonymous pipes" not the same thing as anonymous
pipes created by CreatePipe?

</F>



From alexandre.ferrieux@cnet.francetelecom.fr  Tue May 23 08:51:07 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:51:07 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr> <047c01bfc48b$1d2addb0$e3cb8490@neil>
Message-ID: <392A386B.4112@cnet.francetelecom.fr>

Neil Hodgson wrote:
> 
> > Other. Forget about sockets here, we're talking about true anonymous
> > pipes, under 95 and NT. Since they are not waitable nor peekable,
> > the only remaining option is to read in blocking mode from a dedicated
> > thread. ...
> 
>    Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.

Hmmm... You're right, it's documented as such. But I seem to recall we
encountered a problem when actually using it. I'll check with Gordon
Chaffee (Cc of this msg).

-Alex


From mhammond@skippinet.com.au  Tue May 23 08:57:18 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 17:57:18 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBMEFOCLAA.mhammond@skippinet.com.au>

> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,
> the only remaining option is to read in blocking mode from a dedicated
> thread. Then of course, this thread reports back to the main
> MsgWaitForMultiple with PostThreadMessage.

Or maybe just with SetEvent(), as the main thread may just be using
WaitForMultipleObjects() - it really depends on whether the app has a
message loop or not.

>
> > Is the mechanism different if it's a console app (vs GUI)?
>
> No. Why should it ?

Because it generally won't have a message loop.  This is also commonly true
for NT services - they only wait on settable objects, and if they don't
create a window they generally don't need a message loop.  However, it is
precisely these apps that the proposal offers the most benefits to.

> > I'd assume in a GUI, the fileevent-checker gets integrated with
> > the message pump.
>
> The converse: MsgWaitForMultiple integrates the thread's message queue
> which is a superset of the GUI's event stream.

But what happens when we don't own the message loop?  Eg, IDLE is based on
Tk, Pythonwin on MFC, wxPython on wxWindows, and so on.  Generally, the
primary message loops are coded in C/C++, and won't provide this level of
customization.

Ironically, Tk seems to be one of the worst for this.  For example, Guido
and I recently(ish) both added threading support to our respective IDEs.
MFC was quite simple to do, as it used a "standard" windows message loop.
From all accounts, Guido had quite a difficult time due to some of the
assumptions made in the message loop.  The other anecdote I have relates to
debugging.  The Pythonwin debugger is able to live happily under most other
GUI applications - eg, those written in VB, Delphi, etc.  Pythonwin creates
a new "standard" message loop under these apps, and generally things work
well.  However, Tkinter based apps remain un-debuggable using Pythonwin due
to the assumptions made by the message loop.  This is probably my most
oft-requested feature addition!!

So, IMO, I can't really see the point in defining a whole set of these
asynch primitives when none of the GUI frameworks in common use are likely
to be able to take advantage of them...

Mark.



From nhodgson@bigpond.net.au  Tue May 23 09:04:03 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Tue, 23 May 2000 18:04:03 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr> <047c01bfc48b$1d2addb0$e3cb8490@neil> <392A386B.4112@cnet.francetelecom.fr>
Message-ID: <04b001bfc48d$76dbeaa0$e3cb8490@neil>

> >    Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.
>
> Hmmm... You're right, it's documented as such. But I seem to recall we
> encountered a problem when actually using it. I'll check with Gordon
> Chaffee (Cc of this msg).

   I can vouch that this does work on 95, NT and W2K as I have been using it
in my SciTE editor for the past year as the means for gathering output from
running tool programs. There was a fiddle required to ensure all output was
retrieved on 95 but it works well with that implemented.

   Neil




From alexandre.ferrieux@cnet.francetelecom.fr  Tue May 23 09:11:53 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 10:11:53 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBMEFOCLAA.mhammond@skippinet.com.au>
Message-ID: <392A3D49.5129@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> > Other. Forget about sockets here, we're talking about true anonymous
> > pipes, under 95 and NT. Since they are not waitable nor peekable,
> > the only remaining option is to read in blocking mode from a dedicated
> > thread. Then of course, this thread reports back to the main
> > MsgWaitForMultiple with PostThreadMessage.
> 
> Or maybe just with SetEvent(), as the main thread may just be using
> WaitForMultipleObjects() - it really depends on whether the app has a
> message loop or not.

Yes but why emphasize the differences when you can instead wipe them out
by using MsgWaitForMultiple which integrates all sources ? Even if there's
no message stream, it's fine !

> > > Is the mechanism different if it's a console app (vs GUI)?
> >
> > No. Why should it ?
> 
> Because it generally won't have a message loop.  This is also commonly true
> for NT services - they only wait on settable objects, and if they don't
> create a window they generally don't need a message loop.  However, it is
> precisely these apps that the proposal offers the most benefits to.

Yes, but see above: how would it hurt them to call MsgWait* instead of
Wait* ?

> > > I'd assume in a GUI, the fileevent-checker gets integrated with
> > > the message pump.
> >
> > The converse: MsgWaitForMultiple integrates the thread's message queue
> > which is a superset of the GUI's event stream.
> 
> But what happens when we don't own the message loop?  Eg, IDLE is based on
> Tk, Pythonwin on MFC, wxPython on wxWindows, and so on.  Generally, the
> primary message loops are coded in C/C++, and won't provide this level of
> customization.

Can you be more precise ? Which one(s) do(es)/n't fulfill the two
conditions mentioned earlier ? I do agree with the fact that the primary
msg loop of a random GUI package is a black box, however it must use one
of the IPC mechanisms provided by the OS. Unifying them is not uniformly
trivial (that's the point of this discussion), but since even on Windows
it is doable (MsgWait*), I fail to see by what magic a GUI package could
bypass its supervision.

> Ironically, Tk seems to be one of the worst for this.

Possibly. Personally I don't like Tk very much, at least from an
implementation standpoint. But precisely, the fact that the model
described so far can accommodate *even* Tk is a proof of generality !

> and I recently(ish) both added threading support to our respective IDEs.
> MFC was quite simple to do, as it used a "standard" windows message loop.
> From all accounts, Guido had quite a difficult time due to some of the
> assumptions made in the message loop.  The other anecdote I have relates to
> debugging.  The Pythonwin debugger is able to live happily under most other
> GUI applications - eg, those written in VB, Delphi, etc.  Pythonwin creates
> a new "standard" message loop under these apps, and generally things work
> well.  However, Tkinter based apps remain un-debuggable using Pythonwin due
> to the assumptions made by the message loop.  This is probably my most
> oft-requested feature addition!!

As you said, all this is due to the assumptions made in Tk. Clearly a
mistake not to repeat, and also orthogonal to the issue of unifying IPC
mechanisms and the API to their multiplexing.

-Alex


From mhammond@skippinet.com.au  Tue May 23 09:24:39 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 18:24:39 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A3D49.5129@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEGACLAA.mhammond@skippinet.com.au>

> Yes but why emphasize the differences when you can instead wipe them out
> by using MsgWaitForMultiple which integrates all sources ? Even if there's
> no message stream, it's fine !

Agreed - as I said, it is with these apps that I think it has the most
chance of success.


> Can you be more precise ? Which one(s) do(es)/n't fulfill the two
> conditions mentioned earlier ? I do agree with the fact that the primary
> msg loop of a random GUI package is a black box, however it must use one
> of the IPC mechanisms provided by the OS. Unifying them is not uniformly
> trivial (that's the point of this discussion), but since even on Windows
> it is doable (MsgWait*), I fail to see by what magic a GUI package could
> bypass its supervision.

The only way I could see this working would be to use real, actual Windows
messages on Windows.  Python would need to nominate a special message that
it knows will not conflict with any GUI environments Python may need to run
in.

Each GUI package maintainer would then need to add some special logic in
their message hooking code.  When their black-box message loop delivers
this special message, the framework would need to enter the Python
"event-loop", where it does its stuff - until a new message arrives. It
would need to return, unwind back to the original message pump where it
will be processed as normal, and the entire process repeats.  The process
of waking other objects needn't be GUI-toolkit dependent - as you said, it
need only place the well-known message in the thread's message queue using
PostThreadMessage().

Unless I'm missing something?

Mark.



From alexandre.ferrieux@cnet.francetelecom.fr  Tue May 23 09:38:06 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 10:38:06 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBGEGACLAA.mhammond@skippinet.com.au>
Message-ID: <392A436E.4027@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> > Can you be more precise ? Which one(s) do(es)/n't fulfill the two
> > conditions mentioned earlier ? I do agree with the fact that the primary
> > msg loop of a random GUI package is a black box, however it must use one
> > of the IPC mechanisms provided by the OS. Unifying them is not uniformly
> > trivial (that's the point of this discussion), but since even on Windows
> > it is doable (MsgWait*), I fail to see by what magic a GUI package could
> > bypass its supervision.
> 
> The only way I could see this working would be to use real, actual Windows
> messages on Windows.  Python would need to nominate a special message that
> it knows will not conflict with any GUI environments Python may need to run
> in.

Why use a special message ? MsgWait* does multiplex true Windows messages
*and* other IPC mechanisms. So if a package uses messages, it will
awaken MsgWait* by its 'message queue' side, while if the package uses a
socket or a pipe, it will awaken it by its 'waitable handle' side
(provided, of course, that you can get your hands on that handle and
pass it in the list of objects to wait for...).
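
A rough sketch of that combined wait, written against the win32 extensions
(the handle list, the wake mask and the return-value handling are
illustrative only, not code from this thread):

    import win32con, win32event

    def wait_once(handles, timeout_ms):
        """Wait for a signalled handle *or* a posted Windows message."""
        rc = win32event.MsgWaitForMultipleObjects(
            handles, 0, timeout_ms, win32con.QS_ALLINPUT)
        if rc == win32event.WAIT_TIMEOUT:
            return "timeout"
        if rc == win32event.WAIT_OBJECT_0 + len(handles):
            return "message"        # something arrived in the message queue
        return handles[rc - win32event.WAIT_OBJECT_0]   # this handle fired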

> Each GUI package maintainer would then need to add some special logic in
> their message hooking code.  When their black-box message loop delivers
> this special message, the framework would need to enter the Python
> "event-loop", where it does its stuff - until a new message arrives.

The key is that there wouldn't be two separate Python/GUI evloops.
That's the reason for the (a) condition: be able to awaken a
multiplexing syscall.

> Unless I'm missing something?

I believe the next thing to do is to enumerate which GUI packages
fulfill the following conditions ((a) updated to (a') to reflect the
first paragraph of this msg):

	(a') Its internal event source is either the vanilla Windows Message
queue, or an IPC channel which can be exposed to the outer framework
(for enlisting in a select()-like call), like the socket of an X
connection.

	(b) Its queue can be Peek'ed (to check for buffered msgs before
blocking again)

HTH,

-Alex


From pf@artcom-gmbh.de  Tue May 23 09:39:11 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 10:39:11 +0200 (MEST)
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: <000501bfc45d$893ad4c0$9ea2143f@tim> from Tim Peters at "May 22, 2000 10:21: 0 pm"
Message-ID: <m12uADY-000DieC@artcom0.artcom-gmbh.de>

> [Peter Funk(me)]
> > On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a
> > working example which exploits this vulnerability in older versions
> > of GCC.
> >
> > The basic idea is indeed very simple:  Since the /tmp directory is
> > writable for any user, the bad guy can create a symbolic link in /tmp
> > pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
> > program will then overwrite this arbitrary file (where the programmer
> > really wanted to write something to his tempfile instead).  Since this
> > will happen with the access permissions of the process running this
> > program, this opens a bunch of vulnerabilities in many programs
> > writing something into temporary files with predictable file names.
 
[Tim Peters]:
> I can understand all that, but does it have anything to do with Python's
> tempfile module?  gcc wasn't fixed by changing glibc, right?  

Okay.  But people seem to have the opinion that "application programmers"
are dumb and "system programmers" are clever and smart. ;-)  So they seem
to think that the library should solve possible security issues.
I don't share this opinion, but if some problem can be solved once
and for all in a library, this is better than having to solve it over and
over again in each application.

Concerning 'tempfile' this would either involve changing (or extending) 
the interface (IMO a better approach to this class of problems) or, if the
goal is to solve this for existing applications already using 'tempfile', 
playing games with the filenames returned from 'mktemp()'.  This would require
making them truly random... which AFAIK can't be achieved with 
traditional coding techniques and would require access to a secure white
noise generator.  But maybe I'm wrong.

> Playing games
> with the file *names* doesn't appear to me to solve anything; the few posts
> I bumped into where that was somehow viewed as a Good Thing were about
> Solaris systems, where Sun kept the source for generating the "new,
> improved, messy" names secret.  In Python, any attacker can read the code
> for anything we do, which makes it much clearer that a name-game approach
> is half-assed.

I agree.  But I think we should at least extend the documentation
of 'tempfile' (Fred?) to guide people not to write Python code like
	mytemp = open(tempfile.mktemp(), "w")
in programs that are intended to be used on Unix systems by arbitrary
users (possibly 'root').  Even better:  Someone with enough spare time 
should add a new function 'mktempfile()', which creates a temporary 
file and takes care of the security issue and then returns the file 
handle.  This implementation must take care of race conditions using
'os.open' with the following flags:

       O_CREAT If the file does not exist it will be created.
       O_EXCL  When used with O_CREAT, if the file already exists
               it is an error and the open will fail.
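
A minimal sketch of such a 'mktempfile()' (the retry loop, the 0600 mode and
the reuse of tempfile.mktemp() for candidate names are illustrative choices,
not an existing interface):

    import errno, os, tempfile

    def mktempfile(suffix=""):
        """Create a temporary file safely and return an open file object."""
        flags = os.O_CREAT | os.O_EXCL | os.O_RDWR
        while True:
            name = tempfile.mktemp(suffix)          # candidate name only
            try:
                # O_EXCL makes the open fail instead of following a
                # pre-planted symlink or reusing an existing file
                fd = os.open(name, flags, 0o600)
            except OSError as e:
                if e.errno == errno.EEXIST:
                    continue                        # raced -- try another name
                raise
            return os.fdopen(fd, "w+")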

> and-people-whine-about-worming-around-bad-decisions-in-
>     windows<wink>-ly y'rs  - tim

I don't whine.  But currently I've more problems with my GUI app using
Tkinter&Pmw on the Mac <wink>.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60
Wer sich zu wichtig für kleine Arbeiten hält,
ist meist zu klein für wichtige Arbeiten.     --      Jacques Tati


From mhammond@skippinet.com.au  Tue May 23 10:00:54 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 19:00:54 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A436E.4027@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEGBCLAA.mhammond@skippinet.com.au>

> Why use a special message ? MsgWait* does multiplex true Windows Message
> *and* other IPC mechanisms.

But the point was that Python programs need to live inside other GUI
environments, and that these GUI environments provide their own message
loops that we must consider a black box.

So, we cannot change the existing message loop to use MsgWait*().  We cannot
replace their message loop with one of our own that _does_ do this, as
their message loop is likely to have its own special requirements (eg,
MFC's has idle-time processing, etc).

So I can't see a way out of this bind, other than to come up with a way to
live _in_ a 3rd party, immutable message loop.  My message tried to outline
what would be required, for example, to make Pythonwin use such a Python
driven event loop while still using the MFC message loop.

> The key is that there wouldn't be two separate Python/GUI evloops.
> That's the reason for the (a) condition: be able to awaken a
> multiplexing syscall.

I'm not sure that is feasible.  With what I know about MFC, I almost
certainly would not attempt to integrate such a scheme with Pythonwin.  I
obviously cannot speak for the other GUI toolkit maintainers.

> I believe the next thing to do is to enumerate which GUI packages
> fullfill the following conditions ((a) updated to (a') to reflect the
> first paragraph of this msg):

That would certainly help.  I believe it is safe to say there are 3 major
GUI environments for Python currently released: Tkinter, wxPython and
Pythonwin.  I know Pythonwin does not qualify.  We both know Tkinter does
not qualify.  I don't know enough about wxPython, but even if it _does_
qualify, the simple fact that Tkinter doesn't would appear to be the
show-stopper...

Don't get me wrong - it's a noble goal that I _have_ pondered myself in the
past - but I can't see a good solution.

Mark.



From ping@lfw.org  Tue May 23 10:41:01 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 23 May 2000 02:41:01 -0700 (PDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
 multiplexing is too hard)
In-Reply-To: <392A3503.7C72@cnet.francetelecom.fr>
Message-ID: <Pine.LNX.4.10.10005230230130.461-200000@localhost>


On Tue, 23 May 2000, Alexandre Ferrieux wrote:
> 
> Great !!! That's exactly the kind of Pythonic translation I was
> expecting. Thanks !

Here's a straw man.  Try the attached module.  To test it, run:

    python ./watcher.py 10203

then telnet to port 10203 on the local machine.  You can open
several telnet connections to port 10203 at once.

In one session:

    skuld[1041]% telnet localhost 10203
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    >>> 1 + 2
    3
    >>> spam = 3

In another session:

    skuld[1008]% telnet localhost 10203
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    >>> spam
    3

> We just register an empty string as the callback name (script).
> But this is just a random API choice. Anything more Pythonic is welcome
> (an explicit unregister function is okay for me).

So is there no way to register more than one callback on a
particular file?  Do you ever find yourself wanting to do that?


-- ?!ng

[Attachment: watcher.py -- base64-encoded Python source of the Watcher module]


From alexandre.ferrieux@cnet.francetelecom.fr  Tue May 23 10:54:31 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 11:54:31 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python   multiplexing is too hard)
References: <Pine.LNX.4.10.10005230230130.461-200000@localhost>
Message-ID: <392A5557.72F7@cnet.francetelecom.fr>

Ka-Ping Yee wrote:
> 
> On Tue, 23 May 2000, Alexandre Ferrieux wrote:
> >
> > Great !!! That's exactly the kind of Pythonic translation I was
> > expecting. Thanks !
> 
> Here's a straw man.  <watcher.py>

Nice. Now what's left to do is make select.select() truly
crossplatform...

> So is there no way to register more than one callback on a
> particular file?

Nope - it's considered the responsibility of higher layers.

> Do you ever find yourself wanting to do that?

Seldom, but it happened to me once, and I did exactly that: a layer
above.

-Alex


From mal@lemburg.com  Tue May 23 11:10:20 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 12:10:20 +0200
Subject: [Python-Dev] String encoding
Message-ID: <392A590C.E41239D3@lemburg.com>

The recent discussion about repr() et al. brought up the idea
of a locale based string encoding again.

A support module for querying the encoding used in the current
locale together with the experimental hook to set the string
encoding could yield a compromise which satisfies ASCII, Latin-1
and UTF-8 proponents.

The idea is to use the site.py module to customize the interpreter
from within Python (rather than making the encoding a compile
time option). This is easily doable using the (yet to be written)
support module and the sys.setstringencoding() hook.

The default encoding would be 'ascii' and could then be changed
to whatever the user or administrator wants it to be on a per
site basis. Furthermore, the encoding should be settable on
a per thread basis inside the interpreter (Python threads
do not seem to inherit any per-thread globals, so the
encoding would have to be set for all new threads).

E.g. a site.py module could look like this:

"""
import locale,sys

# Get encoding, defaulting to 'ascii' in case it cannot be
# determined
defenc = locale.get_encoding('ascii')

# Set main thread's string encoding
sys.setstringencoding(defenc)

This would result in the Unicode implementation to assume
defenc as encoding of strings.
"""

Minor nit: due to the implementation, the C parser markers
"s" and "t" and the hash() value calculation will still need
to work with a fixed encoding which still is UTF-8. C APIs
which want to support Unicode should be fixed to use "es"
or query the object directly and then apply proper, possibly
OS dependent conversion.

Before starting off into implementing the above, I'd like to
hear some comments...

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From alexandre.ferrieux@cnet.francetelecom.fr  Tue May 23 11:16:42 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 12:16:42 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBOEGBCLAA.mhammond@skippinet.com.au>
Message-ID: <392A5A8A.1534@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> >
> > I'd really like to challenge that 'almost'...
> 
> Sure - but the problem is simply that MFC has a non-standard message - so
> it _must_ be like I described in my message - and you will agree that
> sounds messy.
> 
> If such a set of multiplexing primitives took off, and people found them
> useful, and started complaining they dont work in Pythonwin, then I will
> surely look at it again.  Its too much work and too intrusive for a
> proof-of-concept effort.

Okay, fine.

> > I understand. Maybe I underestimated some of the difficulties. However,
> > I'd still like to separate what can be separated. The unfriendliness to
> > Python debuggers is sad news to me, but is not strictly related to the
> > problem of heterogeneous multiplexing: if I were to design a debugger
> > from scratch for a random language, I believe I'd arrange for the IPC
> > channel used to be more transparent. IOW, the very fact of using the
> > message queue for the debugging IPC *is* the culprit ! In unix, the
> > ptrace() or /proc interfaces have never walked on the toes of any
> > package, GUI or not...
> 
> The unfriendliness is purely related to Pythonwin, and not the general
> Python debugger.  I agree 100% that an RPC type mechanism is far better for
> a debugger.  It was just an anecdote to show how fickle these message loops
> can be (and therefore the complex requirements they have).

Okay, so maybe it's time to summarize what we agreed on:

	(1) 'tearing open' the main loop of a GUI package is tricky in the
general case.
	(2) perusing undefined WM_* messages requires care...

	(3) on the other hand, all other IPC channels are multiplexable. Even
for the worst case (pipes on Windows) at least 1 (1.5?) method has been
identified.

The temporary conclusion, as far as I understand, is that nobody in the
Python community has the spare time and energy to tackle (1), that (2)
is tricky due to an unfortunate choice in the implementation of some
debuggers, and that the seemingly appealing unification outlined in (3)
is not enough of a motivation...

Under these conditions, clearly the only option is to put the blackbox
GUI loop inside a separate thread and arrange for it to use a
well-chosen IPC channel to awaken (something like) the Watcher.go()
proposed by Ka-Ping Yee.

Now there's still the issue of actually making select.select()
cross-platform.
Any takers?

-Alex


From gstein@lyra.org  Tue May 23 11:57:21 2000
From: gstein@lyra.org (Greg Stein)
Date: Tue, 23 May 2000 03:57:21 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <392A590C.E41239D3@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005230356230.25623-100000@nebula.lyra.org>

I still think that having any kind of global setting is going to be
troublesome. Whether it is per-thread or not, it still means that Module
Foo cannot alter the value without interfering with Module Bar.

Cheers,
-g

On Tue, 23 May 2000, M.-A. Lemburg wrote:

> The recent discussion about repr() et al. brought up the idea
> of a locale based string encoding again.
> 
> A support module for querying the encoding used in the current
> locale together with the experimental hook to set the string
> encoding could yield a compromise which satisfies ASCII, Latin-1
> and UTF-8 proponents.
> 
> The idea is to use the site.py module to customize the interpreter
> from within Python (rather than making the encoding a compile
> time option). This is easily doable using the (yet to be written)
> support module and the sys.setstringencoding() hook.
> 
> The default encoding would be 'ascii' and could then be changed
> to whatever the user or administrator wants it to be on a per
> site basis. Furthermore, the encoding should be settable on
> a per thread basis inside the interpreter (Python threads
> do not seem to inherit any per-thread globals, so the
> encoding would have to be set for all new threads).
> 
> E.g. a site.py module could look like this:
> 
> """
> import locale,sys
> 
> # Get encoding, defaulting to 'ascii' in case it cannot be
> # determined
> defenc = locale.get_encoding('ascii')
> 
> # Set main thread's string encoding
> sys.setstringencoding(defenc)
> 
> This would result in the Unicode implementation to assume
> defenc as encoding of strings.
> """
> 
> Minor nit: due to the implementation, the C parser markers
> "s" and "t" and the hash() value calculation will still need
> to work with a fixed encoding which still is UTF-8. C APIs
> which want to support Unicode should be fixed to use "es"
> or query the object directly and then apply proper, possibly
> OS dependent conversion.
> 
> Before starting off into implementing the above, I'd like to
> hear some comments...
> 
> Thanks,
> -- 
> Marc-Andre Lemburg
> ______________________________________________________________________
> Business:                                      http://www.lemburg.com/
> Python Pages:                           http://www.lemburg.com/python/
> 
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/



From fredrik@pythonware.com  Tue May 23 12:38:41 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 23 May 2000 13:38:41 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com>
Message-ID: <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
> The recent discussion about repr() et al. brought up the idea
> of a locale based string encoding again.

before proceeding down this (not very slippery but slightly
unfortunate, imho) slope, I think we should decide whether

    assert eval(repr(s)) == s

should be true for strings.

if this isn't important, nothing stops you from changing 'repr'
to use isprint, without having to make sure that you can still
parse the resulting string.

but if it is important, you cannot really change 'repr' without
addressing the big issue.

so assuming that the assertion must hold, and that changing
'repr' to be locale-dependent is a good idea, let's move on:

> A support module for querying the encoding used in the current
> locale together with the experimental hook to set the string
> encoding could yield a compromise which satisfies ASCII, Latin-1
> and UTF-8 proponents.

agreed.

> The idea is to use the site.py module to customize the interpreter
> from within Python (rather than making the encoding a compile
> time option). This is easily doable using the (yet to be written)
> support module and the sys.setstringencoding() hook.

agreed.

note that parsing LANG (etc) variables on a POSIX platform is
easy enough to do in Python (either in site.py or in locale.py).
no need for external support modules for Unix, in other words.
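
for instance, something along these lines handles the common LANG/LC_*
formats (the helper name is made up; real-world values like
"de_DE.ISO8859-1@euro" need a little extra trimming):

    import os, string

    def guess_posix_encoding(default="ascii"):
        # look at the usual POSIX locale variables, most specific first
        for name in ("LC_ALL", "LC_CTYPE", "LANG"):
            try:
                value = os.environ[name]
            except KeyError:
                continue
            # e.g. "de_DE.ISO8859-1" -> "iso8859-1"
            if "." in value:
                return string.lower(string.split(value, ".")[1])
        return default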

for windows, I suggest adding GetACP() to the _locale module,
and let the glue layer (site.py or locale.py) do:

    if sys.platform == "win32":
        sys.setstringencoding("cp%d" % GetACP())

on mac, I think you can determine the encoding by inspecting the
system font, and fall back to "macroman" if that doesn't work out.
but figuring out the right way to do that is best left to anyone who
actually has access to a Mac.  in the meantime, just make it:

    elif sys.platform == "mac":
        sys.setstringencoding("macroman")

> The default encoding would be 'ascii' and could then be changed
> to whatever the user or administrator wants it to be on a per
> site basis.

Tcl defaults to "iso-8859-1" on all platforms except the Mac.  assuming
that the vast majority of non-Mac platforms are either modern Unixes
or Windows boxes, that makes a lot more sense than US ASCII...

in other words:

    else:
        # try to determine encoding from POSIX locale environment
        # variables
        ...

    else:
        sys.setstringencoding("iso-latin-1")

> Furthermore, the encoding should be settable on a per thread basis
> inside the interpreter (Python threads do not seem to inherit any
> per-thread globals, so the encoding would have to be set for all
> new threads).

is the C/POSIX locale setting thread specific?

if not, I think the default encoding should be a global setting, just
like the system locale itself.  otherwise, you'll just be addressing a
real problem (thread/module/function/class/object specific locale
handling), but not really solving it...

better use unicode strings and explicit encodings in that case.
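
that is, spell the encoding out at the points where it actually matters,
instead of leaning on a process-wide default -- roughly, in the 1.6 API:

    raw = "H\344llo"                     # some byte string, known to be Latin-1
    text = unicode(raw, "iso-8859-1")    # decode with an explicit encoding
    out = text.encode("utf-8")           # and encode explicitly on the way out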

> Minor nit: due to the implementation, the C parser markers
> "s" and "t" and the hash() value calculation will still need
> to work with a fixed encoding which still is UTF-8.

can this be fixed?  or rather, what changes to the buffer api
are required if we want to work around this problem?

> C APIs which want to support Unicode should be fixed to use
> "es" or query the object directly and then apply proper, possibly
> OS dependent conversion.

for convenience, it might be a good idea to have a "wide system
encoding" too, and special parser markers for that purpose.

or can we assume that all wide system API's use unicode all the
time?

unproductive-ly yrs /F



From pf@artcom-gmbh.de  Tue May 23 13:02:17 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 14:02:17 +0200 (MEST)
Subject: [Python-Dev] String encoding
In-Reply-To: <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> from Fredrik Lundh at "May 23, 2000  1:38:41 pm"
Message-ID: <m12uDO5-000DieC@artcom0.artcom-gmbh.de>

Hi Fredrik!

you wrote:
> before proceeding down this (not very slippery but slightly
> unfortunate, imho) slope, I think we should decide whether
> 
>     assert eval(repr(s)) == s
> 
> should be true for strings.
[...]

What's the problem with this one?  I've played around with several
locale settings here and I observed no problems, while doing:

>>> import string
>>> s = string.join(map(chr, range(128,256)),"")
>>> assert eval('"'+s+'"') == s

What do you fear here if 'repr' outputs characters from the
upper half of the charset without quoting them as octal sequences?
I don't understand.

Regards, Peter


From fredrik@pythonware.com  Tue May 23 14:09:11 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 23 May 2000 15:09:11 +0200
Subject: [Python-Dev] String encoding
References: <m12uDO5-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <01fa01bfc4b8$16e64700$0500a8c0@secret.pythonware.com>

Peter wrote:
>
> >     assert eval(repr(s)) == s
>
> What's the problem with this one?  I've played around with several
> locale settings here and I observed no problems, while doing:

what if the default encoding for source code is different
from the locale?  (think UTF-8 source code)

(no, that's not supported by 1.6.  but if we don't consider that
case now, we won't be able to support source encodings in the
future -- unless the above assertion isn't important, of course).
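
to make that concrete, here's a sketch (the unescaping repr() below is
hypothetical -- today's repr() would emit an octal escape instead):

    s = "\344"                  # the Latin-1 byte for a-umlaut
    r = "'" + s + "'"           # what a locale-friendly repr(s) might emit

    assert eval(r) == s         # fine while writer and reader share the locale

    # but a reader that assumes UTF-8 source cannot even decode the
    # literal, since a lone 0xE4 byte is not a valid UTF-8 sequence:
    try:
        unicode(r, "utf-8")
    except UnicodeError:
        print "locale-encoded repr() output is not valid UTF-8 source"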

</F>



From mal@lemburg.com  Tue May 23 12:14:46 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 13:14:46 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005230356230.25623-100000@nebula.lyra.org>
Message-ID: <392A6826.CCDD2246@lemburg.com>

Greg Stein wrote:
> 
> I still think that having any kind of global setting is going to be
> troublesome. Whether it is per-thread or not, it still means that Module
> Foo cannot alter the value without interfering with Module Bar.

True. 

The only reasonable place to alter the setting is in
site.py for the main thread. I think the setting should be
inherited by child threads, but I'm not sure whether this is
possible or not.
 
Modules that would need to change the setting are better
(re)designed in a way that doesn't rely on it at all, e.g. by
working on Unicode exclusively, which avoids the need
in the first place.

And then, no one is forced to alter the ASCII default to begin
with :-) The good thing about exposing this mechanism in Python
is that it gets user attention...

> Cheers,
> -g
> 
> On Tue, 23 May 2000, M.-A. Lemburg wrote:
> 
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> >
> > A support module for querying the encoding used in the current
> > locale together with the experimental hook to set the string
> > encoding could yield a compromise which satisfies ASCII, Latin-1
> > and UTF-8 proponents.
> >
> > The idea is to use the site.py module to customize the interpreter
> > from within Python (rather than making the encoding a compile
> > time option). This is easily doable using the (yet to be written)
> > support module and the sys.setstringencoding() hook.
> >
> > The default encoding would be 'ascii' and could then be changed
> > to whatever the user or administrator wants it to be on a per
> > site basis. Furthermore, the encoding should be settable on
> > a per thread basis inside the interpreter (Python threads
> > do not seem to inherit any per-thread globals, so the
> > encoding would have to be set for all new threads).
> >
> > E.g. a site.py module could look like this:
> >
> > """
> > import locale,sys
> >
> > # Get encoding, defaulting to 'ascii' in case it cannot be
> > # determined
> > defenc = locale.get_encoding('ascii')
> >
> > # Set main thread's string encoding
> > sys.setstringencoding(defenc)
> >
> > This would result in the Unicode implementation to assume
> > defenc as encoding of strings.
> > """
> >
> > Minor nit: due to the implementation, the C parser markers
> > "s" and "t" and the hash() value calculation will still need
> > to work with a fixed encoding which still is UTF-8. C APIs
> > which want to support Unicode should be fixed to use "es"
> > or query the object directly and then apply proper, possibly
> > OS dependent conversion.
> >
> > Before starting off into implementing the above, I'd like to
> > hear some comments...
> >
> > Thanks,
> > --
> > Marc-Andre Lemburg
> > ______________________________________________________________________
> > Business:                                      http://www.lemburg.com/
> > Python Pages:                           http://www.lemburg.com/python/
> >
> >
> > _______________________________________________
> > Python-Dev mailing list
> > Python-Dev@python.org
> > http://www.python.org/mailman/listinfo/python-dev
> >
> 
> --
> Greg Stein, http://www.lyra.org/
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From pf@artcom-gmbh.de  Tue May 23 15:29:58 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 16:29:58 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
Message-ID: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>

Python 1.6 reports a bad magic error when someone tries to import a .pyc
file compiled by Python 1.5.2.  AFAIK only new features have been
added.  So why isn't it possible to use these old files in Python 1.6?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From dan@cgsoftware.com  Tue May 23 15:43:44 2000
From: dan@cgsoftware.com (Daniel Berlin)
Date: Tue, 23 May 2000 10:43:44 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <BAEMJNPFHMFPEFAGBKKAMEJBCCAA.dan@cgsoftware.com>

Because of the unicode changes, AFAIK.
Or was it the multi-arg vs. single-arg append and friends?
Anyway, the point is that there were incompatible changes made, and thus
the magic was changed.
--Dan
>
>
> Python 1.6 reports a bad magic error, when someone tries to import a .pyc
> file compiled by Python 1.5.2.  AFAIK only new features have been
> added.  So why it isn't possible to use these old files in Python 1.6?
>
> Regards, Peter
> --
> Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany,
> Fax:+49 4222950260
> office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From fdrake@acm.org  Tue May 23 15:47:52 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 07:47:52 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6.  Why not?
In-Reply-To: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, Peter Funk wrote:
 > Python 1.6 reports a bad magic error, when someone tries to import a .pyc
 > file compiled by Python 1.5.2.  AFAIK only new features have been
 > added.  So why it isn't possible to use these old files in Python 1.6?

Peter,
  In theory, perhaps it could; I don't know if the extra work is worth it,
however.
  What's happening is that the .pyc magic number changed because the
marshal format has been extended to support Unicode string objects.  The
old format should still be readable, but there's nothing in the .pyc
loader that supports the acceptance of multiple versions of the marshal
format.
  Is there reason to think that's a substantial problem for users, given
the automatic recompilation of bytecode from source?  The only serious
problems I can see are when multiple versions of the interpreter are being
used on the same collection of source files (because the re-compilation
occurs more often and affects performance), and when *only* .pyc/.pyo
files are available.
  Do you have reason to suspect that either case is sufficiently common to
complicate the .pyc loader, or is there another reason that I've missed
(very possible, I admit)?


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From mal@lemburg.com  Tue May 23 15:20:19 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 16:20:19 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>
Message-ID: <392A93A3.91188372@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> 
> before proceeding down this (not very slippery but slightly
> unfortunate, imho) slope, I think we should decide whether
> 
>     assert eval(repr(s)) == s
> 
> should be true for strings.
> 
> if this isn't important, nothing stops you from changing 'repr'
> to use isprint, without having to make sure that you can still
> parse the resulting string.
> 
> but if it is important, you cannot really change 'repr' without
> addressing the big issue.

This is a different discussion which I don't really want to
get into... I don't have any need for repr() being locale
dependent, since I only use it for debugging purposes and
never to rebuild objects (marshal and pickle are much better
at that).

BTW, repr(unicode) is not affected by the string encoding:
it always returns unicode-escape.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue May 23 15:47:40 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 16:47:40 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>
Message-ID: <392A9A0C.2E297072@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> > [...]
> >
> > A support module for querying the encoding used in the current
> > locale together with the experimental hook to set the string
> > encoding could yield a compromise which satisfies ASCII, Latin-1
> > and UTF-8 proponents.
> 
> agreed.
> 
> > The idea is to use the site.py module to customize the interpreter
> > from within Python (rather than making the encoding a compile
> > time option). This is easily doable using the (yet to be written)
> > support module and the sys.setstringencoding() hook.
> 
> agreed.
> 
> note that parsing LANG (etc) variables on a POSIX platform is
> easy enough to do in Python (either in site.py or in locale.py).
> no need for external support modules for Unix, in other words.

Agreed... locale.py (and the _locale builtin module) is probably
the right place to put such a parser.
 
> for windows, I suggest adding GetACP() to the _locale module,
> and let the glue layer (site.py 0or locale.py) do:
> 
>     if sys.platform == "win32":
>         sys.setstringencoding("cp%d" % GetACP())
> 
> on mac, I think you can determine the encoding by inspecting the
> system font, and fall back to "macroman" if that doesn't work out.
> but figuring out the right way to do that is best left to anyone who
> actually has access to a Mac.  in the meantime, just make it:
> 
>     elif sys.platform == "mac":
>         sys.setstringencoding("macroman")
> 
> > The default encoding would be 'ascii' and could then be changed
> > to whatever the user or administrator wants it to be on a per
> > site basis.
> 
> Tcl defaults to "iso-8859-1" on all platforms except the Mac.  assuming
> that the vast majority of non-Mac platforms are either modern Unixes
> or Windows boxes, that makes a lot more sense than US ASCII...
> 
> in other words:
> 
>     else:
>         # try to determine encoding from POSIX locale environment
>         # variables
>         ...
> 
>     else:
>         sys.setstringencoding("iso-latin-1")

That's a different topic which I don't want to revive ;-)

With the above tools you can easily code the latin-1 default
into your site.py.

> > Furthermore, the encoding should be settable on a per thread basis
> > inside the interpreter (Python threads do not seem to inherit any
> > per-thread globals, so the encoding would have to be set for all
> > new threads).
> 
> is the C/POSIX locale setting thread specific?

Good question -- I don't know.

> if not, I think the default encoding should be a global setting, just
> like the system locale itself.  otherwise, you'll just be addressing a
> real problem (thread/module/function/class/object specific locale
> handling), but not really solving it...
>
> better use unicode strings and explicit encodings in that case.

Agreed.
 
> > Minor nit: due to the implementation, the C parser markers
> > "s" and "t" and the hash() value calculation will still need
> > to work with a fixed encoding which still is UTF-8.
> 
> can this be fixed?  or rather, what changes to the buffer api
> are required if we want to work around this problem?

The problem is that "s" and "t" return C pointers to some
internal data structure of the object. It has to be assured
that this data remains intact at least as long as the object
itself exists.

AFAIK, this cannot be fixed without creating a memory leak.
 
The "es" parser marker uses a different strategy, BTW: the
data is copied into a buffer, thus detaching the object
from the data.

> > C APIs which want to support Unicode should be fixed to use
> > "es" or query the object directly and then apply proper, possibly
> > OS dependent conversion.
> 
> for convenience, it might be a good idea to have a "wide system
> encoding" too, and special parser markers for that purpose.
> 
> or can we assume that all wide system API's use unicode all the
> time?

At least in all references I've seen (e.g. ODBC, wchar_t
implementations, etc.) "wide" refers to Unicode.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake@acm.org  Tue May 23 16:13:59 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 08:13:59 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <392A9A0C.2E297072@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005230805200.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, M.-A. Lemburg wrote:
 > The problem is that "s" and "t" return C pointers to some
 > internal data structure of the object. It has to be assured
 > that this data remains intact at least as long as the object
 > itself exists.
 > 
 > AFAIK, this cannot be fixed without creating a memory leak.
 >  
 > The "es" parser marker uses a different strategy, BTW: the
 > data is copied into a buffer, thus detaching the object
 > from the data.
 > 
 > > > C APIs which want to support Unicode should be fixed to use
 > > > "es" or query the object directly and then apply proper, possibly
 > > > OS dependent conversion.
 > > 
 > > for convenience, it might be a good idea to have a "wide system
 > > encoding" too, and special parser markers for that purpose.
 > > 
 > > or can we assume that all wide system API's use unicode all the
 > > time?
 > 
 > At least in all references I've seen (e.g. ODBC, wchar_t
 > implementations, etc.) "wide" refers to Unicode.

  On Linux, wchar_t is 4 bytes; that's not just Unicode.  Doesn't ISO
10646 require a 32-bit space?
  I recall a fair bit of discussion about wchar_t when it was introduced
to ANSI C, and the character set and encoding were specifically not made
part of the specification.  Making a requirement that wchar_t be Unicode
doesn't make a lot of sense, and opens up potential portability issues.

-1 on any assumption that wchar_t is usefully portable.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From Fredrik Lundh" <effbot@telia.com  Tue May 23 16:16:42 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 17:16:42 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A93A3.91188372@lemburg.com>
Message-ID: <023d01bfc4cb$3b0ee3e0$f2a6b5d4@hagrid>

M.-A. Lemburg <mal@lemburg.com> wrote:
> > before proceeding down this (not very slippery but slightly
> > unfortunate, imho) slope, I think we should decide whether
> > 
> >     assert eval(repr(s)) == s
> > 
> > should be true for strings.

footnote: as far as I can tell, the language reference says it should:
http://www.python.org/doc/current/ref/string-conversions.html

> This is a different discussion which I don't really want to
> get into... I don't have any need for repr() being locale
> dependent, since I only use it for debugging purposes and
> never to rebuild objects (marshal and pickle are much better
> at that).

in other words, you leave it to 'pickle' to call 'repr' for you ;-)

</F>



From pf@artcom-gmbh.de  Tue May 23 16:23:48 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 17:23:48 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com> from "Fred L. Drake" at "May 23, 2000  7:47:52 am"
Message-ID: <m12uGX6-000DieC@artcom0.artcom-gmbh.de>

Fred, 
Thank you for your quick response.

Fred L. Drake:
> Peter,
>   In theory, perhaps it could; I don't know if the extra work is worth it,
> however.
[...]
>   Do you have reason to suspect that either case is sufficiently common to
> complicate the .pyc loader, or is there another reason that I've missed
> (very possible, I admit)?

Well, currently we (our company) deliver no source code to our
customers.  I don't want to discuss this policy and the reasoning
behind it here.  But this situation may also apply to other commercial
software vendors using Python.

During late 2000 there may be several customers out there running
Python 1.6 and others still running Python 1.5.2.  So we will have
several choices to deal with this situation:
   1. Supply two different binary distribution packages: 
      one containing 1.5.2 .pyc files and one containing 1.6 .pyc files.
      This will introduce some new logistic problems.
   2. Upgrade to Python 1.6 at each customer site at once. 
      This will be difficult.
   3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files
      and supply our own patched Python distribution.
      (and this would also be "carrying owls to Athen" for Linux systems)
    [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ careless translated 
      german idiom makes any sense in english ;-) ]
      I personally don't like this.
   4. Change our policy and distribute the .py sources as well.  Besides the
      difficulty of convincing the management about this one, this also
      introduces new technical "challenges".  The Unix text files have to be
      converted from LF line ends to CR line ends or MacPython wouldn't be
      able to parse the files.  So the Mac source distributions
      must be built from a different directory tree.

No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or 
some such somewhere in the 1.6 unmarshaller should do the trick.
But I don't want to submit a patch, if God^H^HGuido thinks, this isn't
worth the effort. <wink>

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From esr@thyrsus.com  Tue May 23 16:40:50 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Tue, 23 May 2000 11:40:50 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <m12uGX6-000DieC@artcom0.artcom-gmbh.de>; from pf@artcom-gmbh.de on Tue, May 23, 2000 at 05:23:48PM +0200
References: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com> <m12uGX6-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000523114050.A4781@thyrsus.com>

Peter Funk <pf@artcom-gmbh.de>:
>       (and this would also be "carrying owls to Athen" for Linux systems)
>     [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ careless translated 
>       german idiom makes any sense in english ;-) ]

There is a precise equivalent: "carrying coals to Newcastle".
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

"Are we to understand," asked the judge, "that you hold your own interests
above the interests of the public?"

"I hold that such a question can never arise except in a society of cannibals."
	-- Ayn Rand


From Fredrik Lundh" <effbot@telia.com  Tue May 23 16:41:46 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 17:41:46 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A9A0C.2E297072@lemburg.com>
Message-ID: <024001bfc4cd$68210f00$f2a6b5d4@hagrid>

M.-A. Lemburg wrote:
> That's a different topic which I don't want to revive ;-)

in a way, you've already done that -- if you're setting the system encoding
in the site.py module, lots of people will end up with the encoding set to ISO
Latin 1 or its Windows superset.

one might of course the system encoding if the user actually calls setlocale,
but there's no way for python to trap calls to that function from a submodule
(e.g. readline), so it's easy to get out of sync.  hmm.

(on the other hand, I'd say it's far more likely that Americans are among the
few who don't know how to set the locale, so defaulting to US ASCII might be
best after all -- even if their computers really use iso-latin-1, we don't have
to cause unnecessary confusion...)

...

but I guess you're right: let's be politically correct and pretend that this really
is a completely different issue ;-)

</F>



From Fredrik Lundh" <effbot@telia.com  Tue May 23 17:04:38 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 18:04:38 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <Pine.LNX.4.10.10005220914000.461-100000@localhost>  <200005222038.PAA01284@cj20424-a.reston1.va.home.com>
Message-ID: <027f01bfc4d0$99c48ca0$f2a6b5d4@hagrid>

> Which is why I find Fredrik's attitude unproductive.

given that locale support isn't included if you make a default build,
I don't think deprecating it would hurt that many people...

but that's me; when designing libraries, I've always strived to find
the *minimal* set of functions (and code) that makes it possible for
a programmer to do her job well.  I'm especially wary of blind alleys
(sure, you can use locale, but that'll only take you this far, and you
have to start all over if you want to do it right).

btw, talking about productivity, go check out the case sensitivity
threads on comp.lang.python.  imagine if all those people hammered
away on the 1.6 alpha instead...

> And where's the SRE release?

at the usual place:

    http://w1.132.telia.com/~u13208596/sre/index.htm

still one showstopper left, which is why I haven't made the long-
awaited public "now it's finished, dammit" announcement yet.  but
it shouldn't be that far away.

</F>



From fdrake@acm.org  Tue May 23 17:11:14 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 09:11:14 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6.  Why not?
In-Reply-To: <20000523114050.A4781@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005230904570.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, Eric S. Raymond wrote:
 > Peter Funk <pf@artcom-gmbh.de>:
 > >       (and this would also be "carrying owls to Athen" for Linux systems)
 > >     [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ careless translated 
 > >       german idiom makes any sense in english ;-) ]
 > 
 > There is a precise equivalent: "carrying coals to Newcastle".

  That's interesting... I've never heard either one, but I think I can guess
the meaning now.
  I agree; it looks like there's some work to do in getting the .pyc
loader to be a little more concerned about importing compatible marshal
formats.  I have an idea about how I'd like to see it done, which may be a
little less magical.  I'll work up a patch later this week.
  I won't check in any changes for this until we've heard from Guido on
the matter, and he'll probably be unavailable for the next couple of days.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From Fredrik Lundh" <effbot@telia.com  Tue May 23 17:26:43 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 18:26:43 +0200
Subject: [Python-Dev] Unicode
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com> <200005172255.AAA01245@loewis.home.cs.tu-berlin.de>
Message-ID: <031101bfc4d3$afac2020$f2a6b5d4@hagrid>

Martin v. Loewis wrote:
> To my knowledge, no. Tcl (at least 8.3) supports the \u notation for
> Unicode escapes, and treats all other source code as
> Latin-1. encoding(n) says
> 
> # However, because the source command always reads files using the
> # ISO8859-1 encoding, Tcl will treat each byte in the file as a
> # separate character that maps to the 00 page in Unicode.

as far as I can tell from digging through the sources, the "source"
command uses the system encoding.  and from the look of it, it's
not always iso-latin-1...

</F>



From mal@lemburg.com  Tue May 23 17:48:08 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 18:48:08 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005230805200.22456-100000@mailhost.beopen.com>
Message-ID: <392AB648.368663A8@lemburg.com>

"Fred L. Drake" wrote:
> 
> On Tue, 23 May 2000, M.-A. Lemburg wrote:
>  > The problem is that "s" and "t" return C pointers to some
>  > internal data structure of the object. It has to be assured
>  > that this data remains intact at least as long as the object
>  > itself exists.
>  >
>  > AFAIK, this cannot be fixed without creating a memory leak.
>  >
>  > The "es" parser marker uses a different strategy, BTW: the
>  > data is copied into a buffer, thus detaching the object
>  > from the data.
>  >
>  > > > C APIs which want to support Unicode should be fixed to use
>  > > > "es" or query the object directly and then apply proper, possibly
>  > > > OS dependent conversion.
>  > >
>  > > for convenience, it might be a good idea to have a "wide system
>  > > encoding" too, and special parser markers for that purpose.
>  > >
>  > > or can we assume that all wide system API's use unicode all the
>  > > time?
>  >
>  > At least in all references I've seen (e.g. ODBC, wchar_t
>  > implementations, etc.) "wide" refers to Unicode.
> 
>   On Linux, wchar_t is 4 bytes; that's not just Unicode.  Doesn't ISO
> 10646 require a 32-bit space?

It does; Unicode is definitely moving in the 32-bit direction.

>   I recall a fair bit of discussion about wchar_t when it was introduced
> to ANSI C, and the character set and encoding were specifically not made
> part of the specification.  Making a requirement that wchar_t be Unicode
> doesn't make a lot of sense, and opens up potential portability issues.
> 
> -1 on any assumption that wchar_t is usefully portable.

Ok... so it could be that Fredrik has a point there, but I'm
not deep enough into this to be able to comment.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue May 23 18:15:17 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 19:15:17 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A93A3.91188372@lemburg.com> <023d01bfc4cb$3b0ee3e0$f2a6b5d4@hagrid>
Message-ID: <392ABCA5.EC84824F@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > > before proceeding down this (not very slippery but slightly
> > > unfortunate, imho) slope, I think we should decide whether
> > >
> > >     assert eval(repr(s)) == s
> > >
> > > should be true for strings.
> 
> footnote: as far as I can tell, the language reference says it should:
> http://www.python.org/doc/current/ref/string-conversions.html
> 
> > This is a different discussion which I don't really want to
> > get into... I don't have any need for repr() being locale
> > dependent, since I only use it for debugging purposes and
> > never to rebuild objects (marshal and pickle are much better
> > at that).
> 
> in other words, you leave it to 'pickle' to call 'repr' for you ;-)

Ooops... now this gives a totally new ring to changing
repr(). Hehe, perhaps we need a string.encode() method
too ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From martin@loewis.home.cs.tu-berlin.de  Tue May 23 22:44:11 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 23 May 2000 23:44:11 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <031101bfc4d3$afac2020$f2a6b5d4@hagrid> (effbot@telia.com)
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com> <200005172255.AAA01245@loewis.home.cs.tu-berlin.de> <031101bfc4d3$afac2020$f2a6b5d4@hagrid>
Message-ID: <200005232144.XAA01129@loewis.home.cs.tu-berlin.de>

> > # However, because the source command always reads files using the
> > # ISO8859-1 encoding, Tcl will treat each byte in the file as a
> > # separate character that maps to the 00 page in Unicode.
> 
> as far as I can tell from digging through the sources, the "source"
> command uses the system encoding.  and from the look of it, it's
> not always iso-latin-1...

Indeed, this appears to be an error in the documentation. sourcing

encoding convertto utf-8 ä

has an outcome depending on the system encoding; just try koi8-r to
see the difference.

Regards,
Martin



From Fredrik Lundh" <effbot@telia.com  Tue May 23 22:57:57 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 23:57:57 +0200
Subject: [Python-Dev] homer-dev, anyone?
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>
Message-ID: <008a01bfc502$17765260$f2a6b5d4@hagrid>

    http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
    "May 11: In a press conference held early this morning, Guido van Rossum
    ... announced that his most famous project will be undergoing a name
    change ..."

    http://www.scriptics.com/company/news/press_release_ajuba.html
    "May 22: Scriptics Corporation ... today announced that it has changed its
    name ..."

...



From akuchlin@mems-exchange.org  Wed May 24 00:33:28 2000
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 23 May 2000 19:33:28 -0400 (EDT)
Subject: [Python-Dev] Updated curses module in CVS
Message-ID: <200005232333.TAA16068@amarok.cnri.reston.va.us>

Today I checked in a new version of the curses module that will only
work with ncurses and/or SYSV curses.  I've tried compiling it on
Linux with ncurses 5.0, and on Solaris; there are also #ifdef's to
make it work with some version of SGI's curses.

I'd appreciate it if people could try the module with the curses
implementations on other platforms: Tru64, AIX, *BSDs (though they use
ncurses, maybe they're some versions behind), etc.  Please let me know
of your results through e-mail.

And if you have code that used the old curses module, and breaks with
the new module, please let me know; the goal is to have 100%
backward-compatibility.

Also, here's a list of ncurses functions that aren't yet supported;
should I make adding them a priority?  (Most of them seem to be pretty
marginal, except for the mouse-related functions, which I want to add
next.)

addchnstr addchstr chgat color_set copywin define_key del_curterm
delscreen dupwin getmouse inchnstr inchstr innstr keyok mcprint
mouseinterval mousemask mvaddchnstr mvaddchstr mvchgat mvcur
mvinchnstr mvinchstr mvinnstr mmvwaddchnstr mvwaddchstr mvwchgat
mvwgetnstr mvwinchnstr mvwinchstr mvwinnstr napms newterm overlay
overwrite resetty resizeterm restartterm ripoffline savetty scr_dump
scr_init scr_restore scr_set scrl set_curterm set_term setterm
setupterm slk_attr slk_attr_off slk_attr_on slk_attr_set slk_attroff
slk_attron slk_attrset slk_clear slk_color slk_init slk_label
slk_noutrefresh slk_refresh slk_restore slk_set slk_touch tgetent
tgetflag tgetnum tgetstr tgoto tigetflag tigetnum tigetstr timeout
tparm tputs tputs typeahead ungetmouse use_default_colors vidattr
vidputs waddchnstr waddchstr wchgat wcolor_set wcursyncup wenclose
winchnstr winchstr winnstr wmouse_trafo wredrawln wscrl wtimeout

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
..signature has giant ASCII graphic: Forced to read "War And Peace" at 110
baud on a Braille terminal after having fingers rubbed with sandpaper.
  -- Kibo, in the Happynet Manifesto


From gstein@lyra.org  Wed May 24 01:00:55 2000
From: gstein@lyra.org (Greg Stein)
Date: Tue, 23 May 2000 17:00:55 -0700 (PDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <008a01bfc502$17765260$f2a6b5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>

what a dumb name...


On Tue, 23 May 2000, Fredrik Lundh wrote:

> 
>     http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
>     "May 11: In a press conference held early this morning, Guido van Rossum
>     ... announced that his most famous project will be undergoing a name
>     change ..."
> 
>     http://www.scriptics.com/company/news/press_release_ajuba.html
>     "May 22: Scriptics Corporation ... today announced that it has changed its
>     name ..."
> 
> ...
> 
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/



From klm@digicool.com  Wed May 24 01:33:57 2000
From: klm@digicool.com (Ken Manheimer)
Date: Tue, 23 May 2000 20:33:57 -0400 (EDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
Message-ID: <Pine.LNX.4.21.0005232030340.31343-100000@korak.digicool.com>

On Tue, 23 May 2000, Greg Stein wrote:

> what a dumb name...
> On Tue, 23 May 2000, Fredrik Lundh wrote:
> 
> >     http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
> >     "May 11: In a press conference held early this morning, Guido van Rossum
> >     ... announced that his most famous project will be undergoing a name
> >     change ..."

Huh.  I dunno what's so dumb about it.  But I definitely was tickled by:

  !STOP PRESS! Microsoft Corporation announced this afternoon that it had
  aquired rights to use South Park characters in its software. The first
  such product, formerly known as Visual J++, will now be known as Kenny.
  !STOP PRESS!

:->

Ken
klm@digicool.com

(No relation.)



From esr@thyrsus.com  Wed May 24 01:47:50 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Tue, 23 May 2000 20:47:50 -0400
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <200005232333.TAA16068@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Tue, May 23, 2000 at 07:33:28PM -0400
References: <200005232333.TAA16068@amarok.cnri.reston.va.us>
Message-ID: <20000523204750.A6107@thyrsus.com>

Andrew M. Kuchling <akuchlin@mems-exchange.org>:
> Also, here's a list of ncurses functions that aren't yet supported;
> should I make adding them a priority.  (Most of them seem to be pretty
> marginal, except for the mouse-related functions which I want to add
> next.)
> 
> addchnstr addchstr chgat color_set copywin define_key del_curterm
> delscreen dupwin getmouse inchnstr inchstr innstr keyok mcprint
> mouseinterval mousemask mvaddchnstr mvaddchstr mvchgat mvcur
> mvinchnstr mvinchstr mvinnstr mmvwaddchnstr mvwaddchstr mvwchgat
> mvwgetnstr mvwinchnstr mvwinchstr mvwinnstr napms newterm overlay
> overwrite resetty resizeterm restartterm ripoffline savetty scr_dump
> scr_init scr_restore scr_set scrl set_curterm set_term setterm
> setupterm slk_attr slk_attr_off slk_attr_on slk_attr_set slk_attroff
> slk_attron slk_attrset slk_clear slk_color slk_init slk_label
> slk_noutrefresh slk_refresh slk_restore slk_set slk_touch tgetent
> tgetflag tgetnum tgetstr tgoto tigetflag tigetnum tigetstr timeout
> tparm tputs tputs typeahead ungetmouse use_default_colors vidattr
> vidputs waddchnstr waddchstr wchgat wcolor_set wcursyncup wenclose
> winchnstr winchstr winnstr wmouse_trafo wredrawln wscrl wtimeout

I think you're right to put the mouse support at highest priority.

I'd say napms() and the overlay/overwrite/copywin group are moderately
important.  So are the functions in the curs_inopts(3x) group -- when
you need those, nothing else will do.  

You can certainly pretty much forget the slk_* group; I only
implemented those for the sake of excruciating completeness.
Likewise for the mv* variants.  

Here's a function that ought to be in the Python wrapper associated with
the module:

import curses, traceback

def traceback_wrapper(func, *rest):
    "Call a hook function, guaranteeing curses cleanup on error or exit."
    # Initialize curses
    stdscr = curses.initscr()
    try:
        # Turn off echoing of keys, and enter cbreak mode,
        # where no buffering is performed on keyboard input
        curses.noecho() ; curses.cbreak()

        # In keypad mode, escape sequences for special keys
        # (like the cursor keys) will be interpreted and
        # a special value like curses.KEY_LEFT will be returned
        stdscr.keypad(1)

        # Run the hook.  Supply the screen window object as first argument
        apply(func, (stdscr,) + rest)

        # Set everything back to normal
        stdscr.keypad(0)
        curses.echo() ; curses.nocbreak()
        curses.endwin()          # Terminate curses
    except:
        # In the event of an error, restore the terminal
        # to a sane state before reporting the exception
        stdscr.keypad(0)
        curses.echo() ; curses.nocbreak()
        curses.endwin()
        traceback.print_exc()    # Print the exception
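
For completeness, a caller would look something like this (the hello_hook
function is made up for illustration):

    def hello_hook(stdscr, message="hello from curses"):
        stdscr.addstr(0, 0, message)
        stdscr.refresh()
        stdscr.getch()          # wait for a keypress before cleaning up

    traceback_wrapper(hello_hook)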

(Does this case mean, perhaps, that the Python interpreter ought to allow
setting a stack of hooks to be executed just before traceback-emission time?)

I'd also be willing to write a Python function that implements Emacs-style
keybindings for field editing, if that's interesting.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

Don't think of it as `gun control', think of it as `victim
disarmament'. If we make enough laws, we can all be criminals.


From skip@mojam.com (Skip Montanaro)  Wed May 24 02:40:02 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Tue, 23 May 2000 20:40:02 -0500 (CDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
References: <008a01bfc502$17765260$f2a6b5d4@hagrid>
 <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
Message-ID: <14635.13042.415077.857803@beluga.mojam.com>

Regarding "Ajuba", Greg wrote:

    what a dumb name...

The top 10 reasons why "Ajuba" is a great name for the former Scriptics:

   10. An accounting error left waaay too much money in the marketing
       budget.  They felt they had to spend it or risk a budget cut next
       year.

    9. It would make a cool name for a dance.  They will now be able to do
       the "Ajuba" at the company's Friday afternoon beer busts.

    8. It's almost palindromic, giving the company's art department all
       sorts of cool nearly symmetric logo possibilities.

    7. It has 7 +/- 2 letters, so when purchasing managers from other
       companies see it flash by in the background of a baseball or
       basketball game on TV they'll be able to remember it.

    6. No programming languages already exist with that name.

    5. It doesn't mean anything bad in any known Indo-European, Asian or
       African language so they won't risk losing market share (what market
       share?) in some obscure third-world country because it means "take a
       flying leap".

    4. It's not already registered in .com, .net, .edu or .org.

    3. No prospective employee will associate the new company name with the
       old, so they'll be able to pull in lots of resumes from people who
       would never have stooped to programming in Tcl for a living.

    2. It's more pronounceable than "Tcl" or "Tcl/Tk" by just about anybody
       who has ever seen English in print.

    1. It doesn't suggest anything, so the company is free to redirect its
       focus any way it wants, including replacing Tcl with Python in future
       versions of its products.

;-)

Skip


From gward@mems-exchange.org  Wed May 24 03:43:53 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Tue, 23 May 2000 22:43:53 -0400
Subject: [Python-Dev] Supporting non-Microsoft compilers
Message-ID: <20000523224352.A997@mems-exchange.org>

A couple of people are working on support in the Distutils for building
extensions on Windows with non-Microsoft compilers.  I think this is
crucial; I hate the idea of requiring people to either download a binary
or shell out megabucks (and support Chairman Bill's monopoly) just to
use some handy Python extension.  (OK, OK, more likely they'll go
without the extension, or go without Python.  But still...)

However, it seems like it would be nice if people could build Python
itself with (eg.) cygwin's gcc or Borland's compiler.  (It might be
essential to properly support building extensions with gcc.)  Has anyone
one anything towards that goal?  It appears that there is at least one
patch floating around that advises people to hack up their installed
config.h, and drop a libpython.a somewhere in the installation, in order
to compile extensions with cygwin gcc and/or mingw32.  This strikes me
as sub-optimal: can at least the required changes to config.h be made to
allow building Python with one of the Windows gcc ports?

I would be willing to hold my nose and struggle with cygwin for a little
while in Windows in dull moments at work -- had to reboot my Linux box
into Windows today in order to try building CXX (since my VMware
trial license expired), so I might as well leave it there until it
crashes and play with cygwin.

        Greg
-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367


From gward@mems-exchange.org  Wed May 24 03:49:23 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Tue, 23 May 2000 22:49:23 -0400
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
Message-ID: <20000523224923.A1008@mems-exchange.org>

My post on this from last week was met with a deafening silence, so I
will try to be short and to-the-point this time:

   Why are shared extensions on Solaris linked with "ld -G" instead of
   "gcc -G" when gcc is the compiler used to compile Python and
   extensions?

Is it historical?  Ie. did some versions of Solaris and/or gcc not do
the right thing here?  Could we detect that bogosity in "configure", and
only use "ld -G" if it's necessary, and use "gcc -G" by default?

The reason that using "ld -G" is the wrong thing is that libgcc.a is not
referenced when creating the .so file.  If the object code happens to
reference functions in libgcc.a that are not referenced anywhere in the
Python core, then importing the .so fails.  This happens if there is a
64-bit divide in the object code.  See my post of May 19 for details.

        Greg
-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367


From fredrik@pythonware.com  Wed May 24 10:42:57 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 24 May 2000 11:42:57 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A9A0C.2E297072@lemburg.com> <024001bfc4cd$68210f00$f2a6b5d4@hagrid>
Message-ID: <009b01bfc564$71606f10$0500a8c0@secret.pythonware.com>

> one might of course the system encoding if the user actually calls
> setlocale,

I think that was supposed to be:

  one might of course SET the system encoding ONLY if the user actually
  calls setlocale,

or something...

</F>



From gmcm@hypernet.com  Wed May 24 13:24:20 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Wed, 24 May 2000 08:24:20 -0400
Subject: [Python-Dev] Supporting non-Microsoft compilers
In-Reply-To: <20000523224352.A997@mems-exchange.org>
Message-ID: <1252951401-112791664@hypernet.com>

Greg Ward wrote:

> However, it seems like it would be nice if people could build
> Python itself with (eg.) cygwin's gcc or Borland's compiler.  (It
> might be essential to properly support building extensions with
> gcc.)  Has anyone one anything towards that goal?  

Robert Kern (mingw32) and Gordon Williams (Borland).

> It appears
> that there is at least one patch floating around that advises
> people to hack up their installed config.h, and drop a
> libpython.a somewhere in the installation, in order to compile
> extensions with cygwin gcc and/or mingw32.  This strikes me as
> sub-optimal: can at least the required changes to config.h be
> made to allow building Python with one of the Windows gcc ports?

Robert's starship pages (kernr/mingw32) have a config.h 
patched for mingw32.

I believe someone else built Python using cygwin without 
much trouble. But mingw32 is the preferred target - cygwin is 
slow, doesn't thread, has a viral GPL license and only gets 
along with binaries built with cygwin.
 
Robert's web pages talk about a patched mingw32. I don't 
*think* that's true anymore (at least I found no problems in
my limited testing of an unpatched mingw32). The difference 
between mingw32 and cygwin is just what runtime they're built 
for.


- Gordon


From guido@python.org  Wed May 24 15:17:29 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 24 May 2000 09:17:29 -0500
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: Your message of "Tue, 23 May 2000 10:39:11 +0200."
 <m12uADY-000DieC@artcom0.artcom-gmbh.de>
References: <m12uADY-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <200005241417.JAA07367@cj20424-a.reston1.va.home.com>

> I agree.  But I think we should at least extend the documentation
> of 'tempfile' (Fred?) to guide people not to write Pythoncode like
> 	mytemp = open(tempfile.mktemp(), "w")
> in programs that are intended to be used on Unix systems by arbitrary
> users (possibly 'root').  Even better:  Someone with enough spare time 
> should add a new function 'mktempfile()', which creates a temporary 
> file and takes care of the security issue and then returns the file 
> handle.  This implementation must take care of race conditions using
> 'os.open' with the following flags:
> 
>        O_CREAT If the file does not exist it will be created.
>        O_EXCL  When used with O_CREAT, if the file already exists
> 	       it is an error and the open will fail. 

Have you read a recent (CVS) version of tempfile.py?  It has all this
in the class TemporaryFile()!
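
(For the record, the race-free idiom Peter describes boils down to
something like the sketch below -- purely illustrative, not the actual
tempfile.py code; the filename is just a placeholder supplied by the
caller:)

    import os

    def open_private_file(name):
        # O_EXCL together with O_CREAT makes os.open() fail if the file
        # already exists, so a file or symlink planted by an attacker
        # between "pick a name" and "open it" is detected instead of
        # being silently followed.
        fd = os.open(name, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0600)
        return os.fdopen(fd, "w+b")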

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Wed May 24 16:11:12 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 24 May 2000 10:11:12 -0500
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: Your message of "Tue, 23 May 2000 17:23:48 +0200."
 <m12uGX6-000DieC@artcom0.artcom-gmbh.de>
References: <m12uGX6-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <200005241511.KAA07512@cj20424-a.reston1.va.home.com>

>    3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files

I agree that this is the correct solution.

> No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or 
> some such somewhere in the 1.6 unmarshaller should do the trick.
> But I don't want to submit a patch, if God^H^HGuido thinks, this isn't
> worth the effort. <wink>

That's BDFL for you, thank you. ;-)

Before accepting the trivial patch, I would like to see some analysis
that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.
This means you have to prove that (a) the 1.5.2 marshal format is a
subset of the 1.6 marshal format (easy enough probably) and (b) the
1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
one seems a little trickier; I don't remember if we moved opcodes or 
changed existing opcodes' semantics.  You may be lucky, but it will
cause an extra constraint on the evolution of the bytecode, so I'm
somewhat reluctant.
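
(One crude way to check (b) mechanically -- just a sketch, and it only
looks at the top-level code object; nested code objects in co_consts
would need the same treatment: unmarshal each 1.5.2 .pyc under the
patched interpreter and list the opcodes it actually uses.)

    import dis, marshal

    def opcodes_used(path):
        # Skip the 8-byte header (magic + mtime) and unmarshal the
        # top-level code object.
        f = open(path, "rb")
        f.read(8)
        code = marshal.load(f)
        f.close()
        seen = {}
        bytecode = code.co_code
        i = 0
        while i < len(bytecode):
            op = ord(bytecode[i])
            seen[dis.opname[op]] = 1
            if op >= dis.HAVE_ARGUMENT:
                i = i + 3       # opcode plus two argument bytes
            else:
                i = i + 1
        return seen.keys()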

--Guido van Rossum (home page: http://www.python.org/~guido/)


From ping@lfw.org  Wed May 24 16:56:49 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Wed, 24 May 2000 08:56:49 -0700 (PDT)
Subject: [Python-Dev] 1.6 release date
Message-ID: <Pine.LNX.4.10.10005240855340.465-100000@localhost>

Sorry if i missed an earlier announcement on this topic.

The web page about 1.6 currently says that Python 1.6 will
be released on June 1.  Is that still the target date?


-- ?!ng



From tismer@tismer.com  Wed May 24 19:37:05 2000
From: tismer@tismer.com (Christian Tismer)
Date: Wed, 24 May 2000 20:37:05 +0200
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6. Why not?
References: <m12uGX6-000DieC@artcom0.artcom-gmbh.de> <200005241511.KAA07512@cj20424-a.reston1.va.home.com>
Message-ID: <392C2151.93A0DF24@tismer.com>


Guido van Rossum wrote:
> 
> >    3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files
> 
> I agree that this is the correct solution.
> 
> > No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or
> > some such somewhere in the 1.6 unmarshaller should do the trick.
> > But I don't want to submit a patch, if God^H^HGuido thinks, this isn't
> > worth the effort. <wink>
> 
> That's BDFL for you, thank you. ;-)
> 
> Before accepting the trivial patch, I would like to see some analysis
> that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.
> This means you have to prove that (a) the 1.5.2 marshal format is a
> subset of the 1.6 marshal format (easy enough probably) and (b) the
> 1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
> one seems a little trickier; I don't remember if we moved opcodes or
> changed existing opcodes' semantics.  You may be lucky, but it will
> cause an extra constraint on the evolution of the bytecode, so I'm
> somewhat reluctant.

Be assured, I know the opcodes by heart.
We only appended to the end of opcode space, there are no changes.
But I can't tell about marshal.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From gstein@lyra.org  Wed May 24 21:15:24 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 24 May 2000 13:15:24 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <009b01bfc564$71606f10$0500a8c0@secret.pythonware.com>
Message-ID: <Pine.LNX.4.10.10005241313300.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Fredrik Lundh wrote:
> > one might of course the system encoding if the user actually calls setlocale,
> 
> I think that was supposed to be:
> 
>   one might of course SET the system encoding ONLY if the user actually calls setlocale,
> 
> or something...

Bleh. Global switches are bogus. Since you can't depend on the setting,
and you can't change it (for fear of busting something else), then you
have to be explicit about your encoding all the time. Since you're never
going to rely on a global encoding, then why keep it?
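
(Concretely, the explicit style looks like this -- a minimal sketch,
with 'latin-1' and 'utf-8' only as example codec names:)

    data = "M\xfcller"                  # some 8-bit string from outside
    text = unicode(data, "latin-1")     # bytes -> Unicode, codec named explicitly
    out  = text.encode("utf-8")         # Unicode -> bytes, codec named explicitly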

This global encoding (per thread or not) just reminds me of the single
hook for import, all over again.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From pf@artcom-gmbh.de  Wed May 24 22:34:19 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Wed, 24 May 2000 23:34:19 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: <200005241511.KAA07512@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 24, 2000 10:11:12 am"
Message-ID: <m12uinD-000DieC@artcom0.artcom-gmbh.de>

[...about accepting 1.5.2 generated .pyc files...]

Guido van Rossum:
> Before accepting the trivial patch, I would like to see some analysis
> that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.

Would it be sufficient if a Python 1.6a2 interpreter executable containing
such a trivial patch is able to process the test suite in a 1.5.2 tree with
all the .py files removed?  Some list.append calls with multiple args
might cause errors, though.

> This means you have to prove that (a) the 1.5.2 marshal format is a
> subset of the 1.6 marshal format (easy enough probably) and (b) the
> 1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
> one seems a little trickier; I don't remember if we moved opcodes or 
> changed existing opcodes' semantics.  You may be lucky, but it will
> cause an extra constraint on the evolution of the bytecode, so I'm
> somewhat reluctant.

I feel the byte code format is rather mature and future evolution
is unlikely to remove or move opcodes to new values or change the 
semantics of existing opcodes in an incompatible way.  As has been
shown, it is even possible to solve the 1/2 == 0.5 issue with an
upward-compatible extension of the format.

But I feel unable to provide a formal proof other than comparing
1.5.2/Include/opcode.h, 1.5.2/Python/marshal.c and import.c
with the 1.6 ones.

There are certainly others here on python-dev who can do better.
Christian?

BTW: import.c contains the  following comment:
/* XXX Perhaps the magic number should be frozen and a version field
   added to the .pyc file header? */

Judging from my decade long experience with exotic image and CAD data 
formats I think this is always the way to go for binary data files.  
Using this method newer versions of a program can always recognize
the file format version and convert files generated by older versions
in an appropriate way.

Regards, Peter


From esr@thyrsus.com  Wed May 24 23:02:15 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Wed, 24 May 2000 18:02:15 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: <m12uinD-000DieC@artcom0.artcom-gmbh.de>; from pf@artcom-gmbh.de on Wed, May 24, 2000 at 11:34:19PM +0200
References: <200005241511.KAA07512@cj20424-a.reston1.va.home.com> <m12uinD-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000524180215.A10281@thyrsus.com>

Peter Funk <pf@artcom-gmbh.de>:
> BTW: import.c contains the  following comment:
> /* XXX Perhaps the magic number should be frozen and a version field
>    added to the .pyc file header? */
> 
> Judging from my decade long experience with exotic image and CAD data 
> formats I think this is always the way to go for binary data files.  
> Using this method newer versions of a program can always recognize
> the file format version and convert files generated by older versions
> in an appropriate way.

I have similar experience, notably with hacking graphics file formats.
I concur with this recommendation.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The end move in politics is always to pick up a gun.
	-- R. Buckminster Fuller


From gstein@lyra.org  Wed May 24 22:58:48 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 24 May 2000 14:58:48 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6. Why not?
In-Reply-To: <20000524180215.A10281@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005241457000.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Eric S. Raymond wrote:
> Peter Funk <pf@artcom-gmbh.de>:
> > BTW: import.c contains the  following comment:
> > /* XXX Perhaps the magic number should be frozen and a version field
> >    added to the .pyc file header? */
> > 
> > Judging from my decade long experience with exotic image and CAD data 
> > formats I think this is always the way to go for binary data files.  
> > Using this method newer versions of a program can always recognize
> > the file format version and convert files generated by older versions
> > in an appropriate way.
> 
> I have similar experience, notably with hacking graphics file formats.
> I concur with this recommendation.

One more +1 here.

In another thread (right now, actually), I'm discussing how you can hook
up Linux to recognize .pyc files and directly execute them with the Python
interpreter (e.g. no need for #!/usr/bin/env python at the head of the
file). But if that magic number keeps changing, then it makes it a bit
harder to set this up.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From akuchlin@mems-exchange.org  Wed May 24 23:22:46 2000
From: akuchlin@mems-exchange.org (Andrew Kuchling)
Date: Wed, 24 May 2000 18:22:46 -0400 (EDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <20000523204750.A6107@thyrsus.com>
References: <200005232333.TAA16068@amarok.cnri.reston.va.us>
 <20000523204750.A6107@thyrsus.com>
Message-ID: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>

Eric S. Raymond writes:
>Here's a function that ought to be in the Python wrapper associated with
>the module:

There currently is no such wrapper, but there probably should be.
Guess I'll rename the module to _curses, and add a curses.py file.  Or
should there be a curses package, instead?  That would leave room for
more future expansion.  Guido, any opinion?

--amk


From gstein@lyra.org  Wed May 24 23:38:07 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 24 May 2000 15:38:07 -0700 (PDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Andrew Kuchling wrote:
> Eric S. Raymond writes:
> >Here's a function that ought to be in the Python wrapper associated with
> >the module:

Dang. Deleted Eric's note accidentally. Note that the proposed wrapper can
be simplified by using try/finally.
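
(Something along these lines -- a sketch of the usual initscr/endwin
bracket, not Eric's original function:)

    import curses

    def wrapper(func, *args):
        # Initialize curses, call func(stdscr, ...), and restore the
        # terminal even if func raises an exception.
        stdscr = curses.initscr()
        try:
            curses.noecho()
            curses.cbreak()
            stdscr.keypad(1)
            return apply(func, (stdscr,) + args)
        finally:
            stdscr.keypad(0)
            curses.echo()
            curses.nocbreak()
            curses.endwin()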

> There currently is no such wrapper, but there probably should be.
> Guess I'll rename the module to _curses, and add a curses.py file.  Or
> should there be a curses package, instead?  That would leave room for
> more future expansion.  Guido, any opinion?

Just a file. IMO, a package would be overkill.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From esr@thyrsus.com  Thu May 25 01:26:49 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Wed, 24 May 2000 20:26:49 -0400
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Wed, May 24, 2000 at 06:22:46PM -0400
References: <200005232333.TAA16068@amarok.cnri.reston.va.us> <20000523204750.A6107@thyrsus.com> <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <20000524202649.B10384@thyrsus.com>

Andrew Kuchling <akuchlin@mems-exchange.org>:
> Eric S. Raymond writes:
> >Here's a function that ought to be in the Python wrapper associated with
> >the module:
> 
> There currently is no such wrapper, but there probably should be.
> Guess I'll rename the module to _curses, and add a curses.py file.  Or
> should there be a curses package, instead?  That would leave room for
> more future expansion.  Guido, any opinion?

I'll supply a field-editor function with Emacs-like bindings, too.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

Never trust a man who praises compassion while pointing a gun at you.


From fdrake@acm.org  Thu May 25 03:36:59 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Wed, 24 May 2000 19:36:59 -0700 (PDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005241933380.624-100000@mailhost.beopen.com>

On Wed, 24 May 2000, Andrew Kuchling wrote:
 > There currently is no such wrapper, but there probably should be.
 > Guess I'll rename the module to _curses, and add a curses.py file.  Or
 > should there be a curses package, instead?  That would leave room for
 > more future expansion.  Guido, any opinion?

  I think a package makes sense; some of the libraries that provide widget
sets on top of ncurses would be prime candidates for inclusion.
  The structure should probably be something like:

	curses/
	    __init__.py		# from _curses import *, docstring
	    _curses.so		# current curses module
	    ...


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From gstein@lyra.org  Thu May 25 11:58:27 2000
From: gstein@lyra.org (Greg Stein)
Date: Thu, 25 May 2000 03:58:27 -0700 (PDT)
Subject: [Python-Dev] Larry's need for metacharacters...
Message-ID: <Pine.LNX.4.10.10005250355450.13822-100000@nebula.lyra.org>

[ paraphrased from a LWN letter to the editor ]

Regarding the note posted here last week about Perl development stopping
cuz Larry can't figure out any more characters to place after the '$'
character (to create "special" things) ...

Note that Larry became interested in Unicode a few years ago...

Note that Perl now supports Unicode throughout... *including* variable
names...

Coincidence? I think not!

$\uAB56 = 1;


:-)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From mal@lemburg.com  Thu May 25 13:22:09 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 25 May 2000 14:22:09 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005241313300.7932-100000@nebula.lyra.org>
Message-ID: <392D1AF1.5AA15F2F@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 24 May 2000, Fredrik Lundh wrote:
> > > one might of course the system encoding if the user actually calls setlocale,
> >
> > I think that was supposed to be:
> >
> >   one might of course SET the system encoding ONLY if the user actually calls setlocale,
> >
> > or something...
> 
> Bleh. Global switches are bogus. Since you can't depend on the setting,
> and you can't change it (for fear of busting something else),

Sure you can: in site.py before any other code using Unicode
gets executed.

> then you
> have to be explicit about your encoding all the time. Since you're never
> going to rely on a global encoding, then why keep it?

For the same reason you use setlocale() in C (and Python): to
make programs portable to other locales without too much
fuss.

> This global encoding (per thread or not) just reminds me of the single
> hook for import, all over again.

Think of it as a configuration switch which is made settable
via a Python interface -- much like the optimize switch or
the debug switch (which are settable via Python APIs in mxTools).
The per-thread implementation is mainly a design question: I
think globals should always be implemented on a per-thread basis.

Hmm, I wish Guido would comment on the idea of keeping the
runtime settable encoding...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From guido@python.org  Thu May 25 16:30:26 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:30:26 -0500
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: Your message of "Tue, 23 May 2000 22:49:23 -0400."
 <20000523224923.A1008@mems-exchange.org>
References: <20000523224923.A1008@mems-exchange.org>
Message-ID: <200005251530.KAA11785@cj20424-a.reston1.va.home.com>

[Greg Ward]
> My post on this from last week was met with a deafening silence, so I
> will try to be short and to-the-point this time:
> 
>    Why are shared extensions on Solaris linked with "ld -G" instead of
>    "gcc -G" when gcc is the compiler used to compile Python and
>    extensions?
> 
> Is it historical?  Ie. did some versions of Solaris and/or gcc not do
> the right thing here?  Could we detect that bogosity in "configure", and
> only use "ld -G" if it's necessary, and use "gcc -G" by default?
> 
> The reason that using "ld -G" is the wrong thing is that libgcc.a is not
> referenced when creating the .so file.  If the object code happens to
> reference functions in libgcc.a that are not referenced anywhere in the
> Python core, then importing the .so fails.  This happens if there is a
> 64-bit divide in the object code.  See my post of May 19 for details.

Two excuses: (1) long ago, you really needed to use ld instead of cc
to create a shared library, because cc didn't recognize the flags or
did other things that shouldn't be done to shared libraries; (2) I
didn't know there was a problem with using ld.

Since you have now provided a patch which seems to work, why don't you
check it in...?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Thu May 25 16:35:10 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:35:10 -0500
Subject: [Python-Dev] 1.6 release date
In-Reply-To: Your message of "Wed, 24 May 2000 08:56:49 MST."
 <Pine.LNX.4.10.10005240855340.465-100000@localhost>
References: <Pine.LNX.4.10.10005240855340.465-100000@localhost>
Message-ID: <200005251535.KAA11834@cj20424-a.reston1.va.home.com>

[Ping]
> The web page about 1.6 currently says that Python 1.6 will
> be released on June 1.  Is that still the target date?

Obviously I won't make that date...  I'm holding back an official
announcement of the delay until next week so I can combine it with
some good news. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Thu May 25 16:51:44 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:51:44 -0500
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: Your message of "Wed, 24 May 2000 23:34:19 +0200."
 <m12uinD-000DieC@artcom0.artcom-gmbh.de>
References: <m12uinD-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <200005251551.KAA11897@cj20424-a.reston1.va.home.com>

Given Christian Tismer's testimonial and inspection of marshal.c, I
think Peter's small patch is acceptable.

A bigger question is whether we should freeze the magic number and add
a version number.  In theory I'm all for that, but it means more
changes; there are several tools (e.g. Lib/py_compile.py,
Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
intimate knowledge of the .pyc file format that would have to be
modified to match.

The current format of a .pyc file is as follows:

bytes 0-3   magic number
bytes 4-7   timestamp (mtime of .py file)
bytes 8-*   marshalled code object

The magic number itself is used to convey various bits of information,
all implicit:

- the Python version
- whether \r and \n are swapped (some old Mac compilers did this)
- whether all string literals are Unicode (experimental -U flag)

The current (1.6) value of the magic number (as a string -- the .pyc
file format is byte order independent) is '\374\304\015\012' on most
platforms; it's '\374\304\012\015' for the old Mac compilers
mentioned; and it's '\375\304\015\012' with -U.
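
(For concreteness, a tool reading this format looks roughly like the
sketch below; it assumes the 4-byte little-endian mtime that marshal
writes and only accepts the running interpreter's own magic:)

    import imp, marshal, struct

    def read_pyc(path):
        f = open(path, "rb")
        magic = f.read(4)                          # bytes 0-3: magic number
        if magic != imp.get_magic():
            raise ImportError("bad magic number in " + path)
        mtime = struct.unpack("<l", f.read(4))[0]  # bytes 4-7: .py mtime
        code = marshal.load(f)                     # bytes 8-:  code object
        f.close()
        return mtime, code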

Can anyone come up with a proposal?  I'm swamped!

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Thu May 25 16:52:54 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:52:54 -0500
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: Your message of "Wed, 24 May 2000 15:38:07 MST."
 <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org>
References: <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org>
Message-ID: <200005251552.KAA11922@cj20424-a.reston1.va.home.com>

> > There currently is no such wrapper, but there probably should be.
> > Guess I'll rename the module to _curses, and add a curses.py file.  Or
> > should there be a curses package, instead?  That would leave room for
> > more future expansion.  Guido, any opinion?

Whatever -- either way is fine with me!

--Guido van Rossum (home page: http://www.python.org/~guido/)


From DavidA@ActiveState.com  Thu May 25 22:42:51 2000
From: DavidA@ActiveState.com (David Ascher)
Date: Thu, 25 May 2000 14:42:51 -0700
Subject: [Python-Dev] ActiveState news
Message-ID: <PLEJJNOHDIGGLDPOGPJJMELECEAA.DavidA@ActiveState.com>

While not a technical point, I thought I'd mention to this group that
ActiveState just announced several things, including some Python-related
projects.  See www.ActiveState.com for details.

--david

PS: In case anyone's still under the delusion that cool Python jobs are hard
to find, let me know. =)



From bwarsaw@python.org  Fri May 26 23:42:10 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Fri, 26 May 2000 18:42:10 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
Message-ID: <14638.64962.118047.467438@localhost.localdomain>

Hi all,

I've taken /F's C implementation of the standard class-based
exceptions, implemented the stuff he left out, proofread for reference
counting issues, hacked a bit more, and integrated it with the 1.6
interpreter.  Everything seems to work well; i.e. the regression test
suite passes and I don't get any core dumps ;).

I don't have the ability right now to Purify things[1], but I've tried
to be very careful in handling reference counting.  Since I've been
hacking on this all day, it could definitely use another set of eyes.
I think rather than email a huge patch kit, I'll just go ahead and
check the changes in.  Please take a look and give it a hard twist.

Thanks to /F for the excellent head start!
-Barry

[1] Purify was one of the coolest products on Solaris, but alas it
doesn't seem like they'll ever support Linux.  What do you all use to
do similar memory verification tests on Linux?  Or do you just not?


From bwarsaw@python.org  Sat May 27 00:24:48 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Fri, 26 May 2000 19:24:48 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
Message-ID: <14639.1984.920885.635040@localhost.localdomain>

I'm all done checking this stuff in.
-Barry


From gstein@lyra.org  Fri May 26 00:29:19 2000
From: gstein@lyra.org (Greg Stein)
Date: Thu, 25 May 2000 16:29:19 -0700 (PDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Modules _exceptions.c,NONE,1.1
In-Reply-To: <200005252318.QAA25455@slayer.i.sourceforge.net>
Message-ID: <Pine.LNX.4.10.10005251627061.16846-100000@nebula.lyra.org>

On Thu, 25 May 2000, Barry Warsaw wrote:
> Update of /cvsroot/python/python/dist/src/Modules
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv25441
> 
> Added Files:
> 	_exceptions.c 
> Log Message:
> Built-in class-based standard exceptions.  Written by Fredrik Lundh.
> Modified, proofread, and integrated for Python 1.6 by Barry Warsaw.

Since the added files are not emailed, you can easily see this file at:

http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/python/dist/src/Modules/_exceptions.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=python


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From gward@python.net  Thu May 25 23:33:54 2000
From: gward@python.net (Greg Ward)
Date: Thu, 25 May 2000 18:33:54 -0400
Subject: [Python-Dev] Terminology question
Message-ID: <20000525183354.A422@beelzebub>

A question of terminology: frequently in the Distutils docs I need to
refer to the package-that-is-not-a-package, ie. the "root" or "empty"
package.  I can't decide if I prefer "root package", "empty package" or
what.  ("Empty" just means the *name* is empty, so it's probably not a
very good thing to say "empty package" -- but "package with no name" or
"unnamed package" aren't much better.)

Is there some accepted convention that I have missed?

Here's the definition I've just written for the "Distribution Python
Modules" manual:

\item[root package] the ``package'' that modules not in a package live
  in.  The vast majority of the standard library is in the root package,
  as are many small, standalone third-party modules that don't belong to
  a larger module collection.  (The root package isn't really a package,
  since it doesn't have an \file{\_\_init\_\_.py} file.  But we have to
  call it something.)

Confusing enough?  I thought so...

        Greg
-- 
Greg Ward - Unix nerd                                   gward@python.net
http://starship.python.net/~gward/
Beware of altruism.  It is based on self-deception, the root of all evil.


From guido@python.org  Fri May 26 02:50:24 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 25 May 2000 20:50:24 -0500
Subject: [Python-Dev] Terminology question
In-Reply-To: Your message of "Thu, 25 May 2000 18:33:54 -0400."
 <20000525183354.A422@beelzebub>
References: <20000525183354.A422@beelzebub>
Message-ID: <200005260150.UAA10169@cj20424-a.reston1.va.home.com>

Greg,

If you have to refer to it as a package (which I don't doubt), the
correct name is definitely the "root package".

A possible clarification of your glossary entry:

\item[root package] the root of the hierarchy of packages.  (This
isn't really a package, since it doesn't have an
\file{\_\_init\_\_.py} file.  But we have to call it something.)  The
vast majority of the standard library is in the root package, as are
many small, standalone third-party modules that don't belong to a
larger module collection.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gward@python.net  Fri May 26 03:22:03 2000
From: gward@python.net (Greg Ward)
Date: Thu, 25 May 2000 22:22:03 -0400
Subject: [Python-Dev] Where to install non-code files
Message-ID: <20000525222203.A1114@beelzebub>

Another one for the combined distutils/python-dev braintrust; apologies
to those of you on both lists, but this is yet another distutils issue
that treads on python-dev territory.

The problem is this: some module distributions need to install files
other than code (modules, extensions, and scripts).  One example close
to home is the Distutils; it has a "system config file" and will soon
have a stub executable for creating Windows installers.

On Windows and Mac OS, clearly these should go somewhere under
sys.prefix: this is the directory for all things Python, including
third-party module distributions.  If Brian Hooper distributes a module
"foo" that requires a data file containing character encoding data (yes,
this is based on a true story), then the module belongs in (eg.)
C:\Python and the data file in (?) C:\Python\Data.  (Maybe
C:\Python\Data\foo, but that's a minor wrinkle.)

Any disagreement so far?

Anyways, what's bugging me is where to put these files on Unix.
<prefix>/lib/python1.x is *almost* the home for all things Python, but
not quite.  (Let's ignore platform-specific files for now: they don't
count as "miscellaneous data files", which is what I'm mainly concerned
with.)

Currently, misc. data files are put in <prefix>/share, and the
Distutils' config file is searched for in the directory of the distutils
package -- ie. site-packages/distutils under 1.5.2 (or
~/lib/python/distutils if that's where you installed it, or ./distutils
if you're running from the source directory, etc.).  I'm not thrilled
with either of these.

My inclination is to nominate a directory under <prefix>/lib/python1.x
for these sort of files: not sure if I want to call it "etc" or "share"
or "data" or what, but it would be treading in Python-space.  It would
break the ability to have a standard library package called "etc" or
"share" or "data" or whatever, but dammit it's convenient.

Better ideas?

        Greg
-- 
Greg Ward - "always the quiet one"                      gward@python.net
http://starship.python.net/~gward/
I have many CHARTS and DIAGRAMS..


From mhammond@skippinet.com.au  Fri May 26 03:35:47 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 26 May 2000 12:35:47 +1000
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000525222203.A1114@beelzebub>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEKACLAA.mhammond@skippinet.com.au>

> On Windows and Mac OS, clearly these should go somewhere under
> sys.prefix: this is the directory for all things Python, including
> third-party module distributions.  If Brian Hooper distributes a module
> "foo" that requires a data file containing character encoding data (yes,
> this is based on a true story), then the module belongs in (eg.)
> C:\Python and the data file in (?) C:\Python\Data.  (Maybe
> C:\Python\Data\foo, but that's a minor wrinkle.)
>
> Any disagreement so far?

A little.  I don't think we need a new dump for arbitrary files that no one
can associate with their application.

Why not put the data with the code?  It is quite trivial for a Python
package or module to find its own location, and this way we are not
dependent on anything.
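
(E.g., just a sketch -- "data.txt" is a made-up file name:)

    import os

    # A module can locate files shipped alongside it via its own __file__:
    _here = os.path.dirname(os.path.abspath(__file__))
    datafile = os.path.join(_here, "data.txt")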

Why assume packages are installed _under_ Python?  Why not just assume the
package is _reachable_ by Python?  Once our package/module is being
executed by Python, we know exactly where we are.

On my machine, there is no "data" equivalent; the closest would be
"python-cvs\pcbuild\data", and that certainly doesn't make sense.  Why can't
I just place it where I put all my other Python extensions, ensure it is on
the PythonPath, and have it "just work"?

It sounds a little complicated - do we provide an API for this magic
location, or does everybody cut-and-paste a reference implementation for
locating it?  Either way sounds pretty bad - the API shouldnt be distutils
dependent (I may not have installed this package via distutils), and really
Python itself shouldnt care about this...

So all in all, I don't think it is a problem we need to push up to this
level - let each package author do whatever makes sense, and point out how
trivial it would be if you assumed code and data in the same place/tree.

[If the data is considered read/write, then you need a better answer
anyway, as you can't assume "c:\python\data" is writable (when actually
running the code) anymore than "c:\python\my_package" is]

Mark.



From fdrake@acm.org  Fri May 26 04:05:40 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Thu, 25 May 2000 20:05:40 -0700 (PDT)
Subject: [Python-Dev] Re: [Distutils] Terminology question
In-Reply-To: <20000525183354.A422@beelzebub>
Message-ID: <Pine.LNX.4.10.10005252003180.7550-100000@mailhost.beopen.com>

On Thu, 25 May 2000, Greg Ward wrote:
 > A question of terminology: frequently in the Distutils docs I need to
 > refer to the package-that-is-not-a-package, ie. the "root" or "empty"
 > package.  I can't decide if I prefer "root package", "empty package" or
 > what.  ("Empty" just means the *name* is empty, so it's probably not a
 > very good thing to say "empty package" -- but "package with no name" or
 > "unnamed package" aren't much better.)

  Well, it's not a package -- it's similar to Java's unnamed package, but
the idea that it's a package has never been advanced.  Why not just call
it the global module space (or namespace)?  That's the only way I've heard
it described, and it's more clear than "empty package" or "unnamed
package".


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From fdrake@acm.org  Fri May 26 05:47:10 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Thu, 25 May 2000 21:47:10 -0700 (PDT)
Subject: [Python-Dev] C implementation of exceptions module
In-Reply-To: <14638.64962.118047.467438@localhost.localdomain>
Message-ID: <Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>

On Fri, 26 May 2000, Barry A. Warsaw wrote:
 > [1] Purify was one of the coolest products on Solaris, but alas it
 > doesn't seem like they'll ever support Linux.  What do you all use to
 > do similar memory verification tests on Linux?  Or do you just not?

  I'm not aware of anything as good, but there's "memprof" (check for it
with "rpm -q"), and I think a few others.  Checker is a malloc() & friends
implementation that can be used to detect memory errors:

	http://www.gnu.org/software/checker/checker.html

and there's ElectricFence from Bruce Perens:

	http://www.perens.com/FreeSoftware/

(There's a MailMan-related link there as well that you might be interested
in!)
  There may be others, and I can't speak to the quality of these as I've
not used any of them (yet).  memprof and ElectricFence were installed on
my Mandrake box without my doing anything about it; I don't know if RedHat
installs them on a stock development box.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From tim_one@email.msn.com  Fri May 26 06:27:13 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Fri, 26 May 2000 01:27:13 -0400
Subject: [Python-Dev] ActiveState news
In-Reply-To: <PLEJJNOHDIGGLDPOGPJJMELECEAA.DavidA@ActiveState.com>
Message-ID: <000501bfc6d3$0c3a2700$c52d153f@tim>

[David Ascher]
> While not a technical point, I thought I'd mention to this group that
> ActiveState just announced several things, including some Python-related
> projects.  See www.ActiveState.com for details.

Thanks for pointing that out!  I just took a natural opportunity to plug the
Visual Studio integration on c.l.py:  it's very important that we do
everything we can to support and promote commercial Python endeavors at
every conceivable opportunity <wink>.

> PS: In case anyone's still under the delusion that cool Python
> jobs are hard to find, let me know. =)

Ditto cool speech recognition jobs in small companies about to be devoured
by Belgian conquerors.  And if anyone is under the illusion that golden
handcuffs don't bind, I can set 'em  straight on that one too.

hiring-is-darned-hard-everywhere-ly y'rs  - tim




From gstein@lyra.org  Fri May 26 08:48:12 2000
From: gstein@lyra.org (Greg Stein)
Date: Fri, 26 May 2000 00:48:12 -0700 (PDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise
 on case-sensitivity)
In-Reply-To: <000401bfc6d3$0afb3e60$c52d153f@tim>
Message-ID: <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>

On Fri, 26 May 2000, Tim Peters wrote:
>...
> PS:  Barry's exception patch appears to have broken the CVS Windows build
> (nothing links anymore; all the PyExc_xxx symbols aren't found; no time to
> dig more now).

The .dsp file(s) need to be updated to include the new _exceptions.c file
in their build and link step. (the symbols moved there)

IMO, it seems it would be Better(tm) to put _exceptions.c into the Python/
directory. Dependencies from the core out to Modules/ seems a bit weird.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From pf@artcom-gmbh.de  Fri May 26 09:23:18 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 10:23:18 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <200005251551.KAA11897@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 25, 2000 10:51:44 am"
Message-ID: <m12vFOo-000DieC@artcom0.artcom-gmbh.de>

[Guido van Rossum]:
> Given Christian Tismer's testimonial and inspection of marshal.c, I
> think Peter's small patch is acceptable.
> 
> A bigger question is whether we should freeze the magic number and add
> a version number.  In theory I'm all for that, but it means more
> changes; there are several tools (e.c. Lib/py_compile.py,
> Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> intimate knowledge of the .pyc file format that would have to be
> modified to match.
> 
> The current format of a .pyc file is as follows:
> 
> bytes 0-3   magic number
> bytes 4-7   timestamp (mtime of .py file)
> bytes 8-*   marshalled code object

Proposal:
The future format (Python 1.6 and newer) of a .pyc file should be as follows:

bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
bytes 4-7   a version number (which should be == 1 in Python 1.6)
bytes 8-11  timestamp (mtime of .py file) (same as earlier)
bytes 12-*  marshalled code object (same as earlier)

> The magic number itself is used to convey various bits of information,
> all implicit:
[...]
This mechanism to construct the magic number should not be changed.

But now once again a new value must be chosen to prevent havoc
with .pyc files floating around from people who already played with the
Python 1.6 alpha releases.  This change should definitely be the
last one ever made during the future lifetime of Python.

The unmarshaller should do the following with the magic number it reads:
If it is the old magic number from 1.5.2, skip reading a version number
and assume version 0.

If it is the new value instead, also read the version number and raise a
new 'ByteCodeTooNew' exception if the version read is greater than a
#defined version number of this Python interpreter.

If future incompatible extensions to the byte code format will happen, 
then this number should be incremented to 2, 3 and so on.

For safety, 'imp.get_magic()' should return the old 1.5.2 magic
number and only 'imp.get_magic(imp.PYC_FINAL)' should return the new 
final magic number.  A new function 'imp.get_version()' should be 
introduced, which will return the current compiled in version number
of this Python interpreter.

Of course all Python modules reading .pyc files must be changed
accordingly, so that they are able to deal with the new .pyc files.
This shouldn't be too hard; a rough sketch of the check follows.
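
(Names such as MAGIC_152, MAGIC_NEW and ByteCodeTooNew are made up for
the sketch; the two magic strings are placeholders, not the values that
would actually be chosen:)

    import marshal, struct

    MAGIC_152 = '\x99N\r\n'     # placeholder: the old 1.5.2 magic
    MAGIC_NEW = '\xff\x00\r\n'  # placeholder: the new, frozen-forever magic
    THIS_VERSION = 1            # bumped only for incompatible changes

    class ByteCodeTooNew(ImportError):
        pass

    def read_header(f):
        magic = f.read(4)
        if magic == MAGIC_152:          # old format: no version field
            version = 0
        elif magic == MAGIC_NEW:        # new format: 4-byte version follows
            version = struct.unpack("<l", f.read(4))[0]
            if version > THIS_VERSION:
                raise ByteCodeTooNew("pyc version %d is too new" % version)
        else:
            raise ImportError("unrecognized magic number")
        mtime = struct.unpack("<l", f.read(4))[0]
        code = marshal.load(f)
        return version, mtime, code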

This proposed change of the .pyc file format must be described in the final
Python 1.6 announcement, in case there are people out there who borrowed
code from 'Tools/scripts/checkpyc.py' or some such.

Regards, Peter


From mal@lemburg.com  Fri May 26 09:37:53 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 10:37:53 +0200
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
Message-ID: <392E37E1.75AC4D0E@lemburg.com>

> Update of /cvsroot/python/python/dist/src/Lib
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv25262
> 
> Modified Files:
>         exceptions.py 
> Log Message:
> For backwards compatibility, simply import everything from the
> _exceptions module, including __doc__.

Hmm, wasn't _exceptions supposed to be a *fall back* solution for
the case where the exceptions.py module is not found ? It now
looks like _exceptions replaces exceptions.py...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri May 26 11:48:05 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 12:48:05 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vFOo-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <392E5665.8CB1C260@lemburg.com>

Peter Funk wrote:
> 
> [Guido van Rossum]:
> > Given Christian Tismer's testimonial and inspection of marshal.c, I
> > think Peter's small patch is acceptable.
> >
> > A bigger question is whether we should freeze the magic number and add
> > a version number.  In theory I'm all for that, but it means more
> > changes; there are several tools (e.c. Lib/py_compile.py,
> > Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> > intimate knowledge of the .pyc file format that would have to be
> > modified to match.
> >
> > The current format of a .pyc file is as follows:
> >
> > bytes 0-3   magic number
> > bytes 4-7   timestamp (mtime of .py file)
> > bytes 8-*   marshalled code object
> 
> Proposal:
> The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> 
> bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> bytes 4-7   a version number (which should be == 1 in Python 1.6)
> bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> bytes 12-*  marshalled code object (same as earlier)

This will break all tools relying on having the code object available
in bytes[8:] and believe me: there are lots of those around ;-)

You cannot really change the file header, only add things to the end
of the PYC file...

Hmm, or perhaps we should move the version number to the code object
itself... after all, the changes we want to refer to
using the version number are located in the code object and not the
PYC file layout. Unmarshalling it would then raise the error.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From gmcm@hypernet.com  Fri May 26 12:53:14 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 26 May 2000 07:53:14 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000525222203.A1114@beelzebub>
Message-ID: <1252780469-123073242@hypernet.com>

Greg Ward wrote:

[installing data files]

> On Windows and Mac OS, clearly these should go somewhere under
> sys.prefix: this is the directory for all things Python,
> including third-party module distributions.  If Brian Hooper
> distributes a module "foo" that requires a data file containing
> character encoding data (yes, this is based on a true story),
> then the module belongs in (eg.) C:\Python and the data file in
> (?) C:\Python\Data.  (Maybe C:\Python\Data\foo, but that's a
> minor wrinkle.)
> 
> Any disagreement so far?

Yeah. I tend to install stuff outside the sys.prefix tree and then 
use .pth files. I realize I'm, um, unique in this regard but I lost 
everything in some upgrade gone bad. (When a Windows de-
install goes wrong, your only option is to do some manual 
directory and registry pruning.)

I often do much the same on my Linux box, but I don't worry 
about it as much - upgrading is not "click and pray" there. 
(Hmm, I guess it is if you use rpms.)
 
So for Windows, I agree with Mark - put the data with the 
module. On a real OS, I guess I'd be inclined to put global 
data with the module, but user data in ~/.<something>.

> Greg Ward - "always the quiet one"                     
<snort>


- Gordon


From pf@artcom-gmbh.de  Fri May 26 12:50:02 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 13:50:02 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <392E5665.8CB1C260@lemburg.com> from "M.-A. Lemburg" at "May 26, 2000 12:48: 5 pm"
Message-ID: <m12vIcs-000DieC@artcom0.artcom-gmbh.de>

[M.-A. Lemburg]:
> > Proposal:
> > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > 
> > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > bytes 12-*  marshalled code object (same as earlier)
> 
> This will break all tools relying on having the code object available
> in bytes[8:] and believe me: there are lots of those around ;-)

In some way, this is intentional:  If these tools (are there really
that many out there that munge .pyc byte code files?) simply use
'imp.get_magic()' and then silently assume a specific content of the
marshalled code object, they probably need changes anyway, since the
code needed to deal with the new unicode object is missing from them.

> You cannot really change the file header, only add things to the end
> of the PYC file...

Why?  Will this idea really cause such earth-quaking grumbling?
Please review this in the context of my proposal to change 'imp.get_magic()'
to return the old 1.5.2 MAGIC when called without a parameter.

> Hmm, or perhaps we should move the version number to the code object
> itself... after all, the changes we want to refer to
> using the version number are located in the code object and not the
> PYC file layout. Unmarshalling it would then raise the error.

Since the file layout is a very thin layer around the marshalled
code object, this makes really no big difference to me.  But it
will be harder to come up with reasonable entries for /etc/magic [1]
and similar mechanisms.  

Putting the version number at the end of the file is possible.
But such a solution is somewhat "dirty" and only gives the false
impression that the general file layout (pyc[8:] instead of pyc[12:]) 
is something you can rely on until the end of time.  Hardcoding the
size of an unpadded header (something like using buffer[8:]) is IMO 
bad style anyway.

Regards, Peter
[1]: /etc/magic on Unices is a small textual data base used by the 'file' 
     command to identify the type of a file by looking at the first
     few bytes.  Unix file managers may either use /etc/magic directly
     or a similar scheme to associate files with mimetypes and/or default
     applications.


From guido@python.org  Fri May 26 14:10:30 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:10:30 -0500
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
In-Reply-To: Your message of "Fri, 26 May 2000 00:48:12 MST."
 <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>
References: <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>
Message-ID: <200005261310.IAA11256@cj20424-a.reston1.va.home.com>

> The .dsp file(s) need to be updated to include the new _exceptions.c file
> in their build and link step. (the symbols moved there)

I'll take care of this.

> IMO, it seems it would be Better(tm) to put _exceptions.c into the Python/
> directory. Dependencies from the core out to Modules/ seems a bit weird.

Good catch!  Since Barry's contemplating renaming it to exceptions.c
anyway that would be a good time to move it.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Fri May 26 14:13:06 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:13:06 -0500
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: Your message of "Fri, 26 May 2000 07:53:14 -0400."
 <1252780469-123073242@hypernet.com>
References: <1252780469-123073242@hypernet.com>
Message-ID: <200005261313.IAA11285@cj20424-a.reston1.va.home.com>

> So for Windows, I agree with Mark - put the data with the 
> module. On a real OS, I guess I'd be inclined to put global 
> data with the module, but user data in ~/.<something>.

Aha!  Good distinction.

Modifiable data needs to go in a per-user directory, even on Windows,
outside the Python tree.

But static data needs to go in the same directory as the module that
uses it.  (We use this in the standard test package, for example.)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gward@mems-exchange.org  Fri May 26 13:24:23 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 08:24:23 -0400
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: <200005251530.KAA11785@cj20424-a.reston1.va.home.com>; from guido@python.org on Thu, May 25, 2000 at 10:30:26AM -0500
References: <20000523224923.A1008@mems-exchange.org> <200005251530.KAA11785@cj20424-a.reston1.va.home.com>
Message-ID: <20000526082423.A12100@mems-exchange.org>

On 25 May 2000, Guido van Rossum said:
> Two excuses: (1) long ago, you really needed to use ld instead of cc
> to create a shared library, because cc didn't recognize the flags or
> did other things that shouldn't be done to shared libraries; (2) I
> didn't know there was a problem with using ld.
> 
> Since you have now provided a patch which seems to work, why don't you
> check it in...?

Done.  I presume checking in configure.in and configure at the same time
is the right thing to do?  (I checked, and running "autoconf" on the
original configure.in regenerated exactly what's in CVS.)

        Greg


From thomas.heller@ion-tof.com  Fri May 26 13:28:49 2000
From: thomas.heller@ion-tof.com (Thomas Heller)
Date: Fri, 26 May 2000 14:28:49 +0200
Subject: [Distutils] Re: [Python-Dev] Where to install non-code files
References: <1252780469-123073242@hypernet.com>  <200005261313.IAA11285@cj20424-a.reston1.va.home.com>
Message-ID: <01ee01bfc70d$f1f17a20$4500a8c0@thomasnb>

[Guido writes]
> Modifiable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.
> 
This seems to be the value of the key "AppData" stored under
  HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders

Right?

Thomas



From guido@python.org  Fri May 26 14:35:40 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:35:40 -0500
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: Your message of "Fri, 26 May 2000 08:24:23 -0400."
 <20000526082423.A12100@mems-exchange.org>
References: <20000523224923.A1008@mems-exchange.org> <200005251530.KAA11785@cj20424-a.reston1.va.home.com>
 <20000526082423.A12100@mems-exchange.org>
Message-ID: <200005261335.IAA11410@cj20424-a.reston1.va.home.com>

> Done.  I presume checking in configure.in and configure at the same time
> is the right thing to do?  (I checked, and running "autoconf" on the
> original configure.in regenerated exactly what's in CVS.)

Yes.  What I usually do is manually bump the version number in
configure before checking it in (it references the configure.in
version) but that's a minor nit...

--Guido van Rossum (home page: http://www.python.org/~guido/)


From pf@artcom-gmbh.de  Fri May 26 13:36:36 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 14:36:36 +0200 (MEST)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <200005261313.IAA11285@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 26, 2000  8:13: 6 am"
Message-ID: <m12vJLw-000DieC@artcom0.artcom-gmbh.de>

[Guido van Rossum]
[...]
> Modifiable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.

Is there a reliable algorithm to find a "per-user" directory on any
Win95/98/NT/2000 system?  On MacOS?  

Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
module would provide 'os.environ["HOME"]' similar to the posix
version?  This would certainly simplify the task of application
programmers intending to write portable applications.
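
(Until something like that exists, the usual workaround looks roughly
like the sketch below -- the Windows fallbacks are guesses and the Mac
case is simply left to the caller:)

    import os

    def get_home():
        # Prefer $HOME; fall back to the NT-style variables if unset.
        if os.environ.has_key("HOME"):
            return os.environ["HOME"]
        if os.environ.has_key("USERPROFILE"):
            return os.environ["USERPROFILE"]
        if os.environ.has_key("HOMEDRIVE") and os.environ.has_key("HOMEPATH"):
            return os.environ["HOMEDRIVE"] + os.environ["HOMEPATH"]
        return os.curdir                # last resort: current directory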

Regards, Peter


From bwarsaw@python.org  Sat May 27 13:46:44 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Sat, 27 May 2000 08:46:44 -0400 (EDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise
 on case-sensitivity)
References: <000401bfc6d3$0afb3e60$c52d153f@tim>
 <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>
Message-ID: <14639.50100.383806.969434@localhost.localdomain>

>>>>> "GS" == Greg Stein <gstein@lyra.org> writes:

    GS> On Fri, 26 May 2000, Tim Peters wrote:
    >> ...  PS: Barry's exception patch appears to have broken the CVS
    >> Windows build (nothing links anymore; all the PyExc_xxx symbols
    >> aren't found; no time to dig more now).

    GS> The .dsp file(s) need to be updated to include the new
    GS> _exceptions.c file in their build and link step. (the symbols
    GS> moved there)

    GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
    GS> the Python/ directory. Dependencies from the core out to
    GS> Modules/ seems a bit weird.

Guido made the suggestion to move _exceptions.c to exceptions.c
anyway.  Should we move the file to the other directory too?  Get out
your plusses and minuses.

-Barry


From bwarsaw@python.org  Sat May 27 13:49:01 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Sat, 27 May 2000 08:49:01 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
 <392E37E1.75AC4D0E@lemburg.com>
Message-ID: <14639.50237.999048.146898@localhost.localdomain>

>>>>> "M" == M  <mal@lemburg.com> writes:

    M> Hmm, wasn't _exceptions supposed to be a *fall back* solution
    M> for the case where the exceptions.py module is not found ? It
    M> now looks like _exceptions replaces exceptions.py...

I see no reason to keep both of them around.  Too much of a
synchronization headache.

-Barry


From mhammond@skippinet.com.au  Fri May 26 14:12:49 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 26 May 2000 23:12:49 +1000
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <m12vJLw-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEKICLAA.mhammond@skippinet.com.au>

> Is there a reliable algorithm to find a "per-user" directory on any
> Win95/98/NT/2000 system?

Ahhh - where to start.  SHGetFolderLocation offers the following
alternatives:

CSIDL_APPDATA
Version 4.71. File system directory that serves as a common repository for
application-specific data. A typical path is C:\Documents and
Settings\username\Application Data

CSIDL_COMMON_APPDATA
Version 5.0. Application data for all users. A typical path is C:\Documents
and Settings\All Users\Application Data.

CSIDL_LOCAL_APPDATA
Version 5.0. File system directory that serves as a data repository for
local (non-roaming) applications. A typical path is C:\Documents and
Settings\username\Local Settings\Application Data.

CSIDL_PERSONAL
File system directory that serves as a common repository for documents. A
typical path is C:\Documents and Settings\username\My Documents.

Plus a few I didn't bother listing...

<sigh>

Mark.



From jlj@cfdrc.com  Fri May 26 14:20:34 2000
From: jlj@cfdrc.com (Lyle Johnson)
Date: Fri, 26 May 2000 08:20:34 -0500
Subject: [Python-Dev] RE: [Distutils] Terminology question
In-Reply-To: <20000525183354.A422@beelzebub>
Message-ID: <003c01bfc715$2c8fde90$4e574dc0@cfdrc.com>

How about "PWAN", the "package without a name"? ;)

> -----Original Message-----
> From: distutils-sig-admin@python.org
> [mailto:distutils-sig-admin@python.org]On Behalf Of Greg Ward
> Sent: Thursday, May 25, 2000 5:34 PM
> To: distutils-sig@python.org; python-dev@python.org
> Subject: [Distutils] Terminology question
> 
> 
> A question of terminology: frequently in the Distutils docs I need to
> refer to the package-that-is-not-a-package, ie. the "root" or "empty"
> package.  I can't decide if I prefer "root package", "empty package" or
> what.  ("Empty" just means the *name* is empty, so it's probably not a
> very good thing to say "empty package" -- but "package with no name" or
> "unnamed package" aren't much better.)
> 
> Is there some accepted convention that I have missed?
> 
> Here's the definition I've just written for the "Distribution Python
> Modules" manual:
> 
> \item[root package] the ``package'' that modules not in a package live
>   in.  The vast majority of the standard library is in the root package,
>   as are many small, standalone third-party modules that don't belong to
>   a larger module collection.  (The root package isn't really a package,
>   since it doesn't have an \file{\_\_init\_\_.py} file.  But we have to
>   call it something.)
> 
> Confusing enough?  I thought so...
> 
>         Greg
> -- 
> Greg Ward - Unix nerd                                   gward@python.net
> http://starship.python.net/~gward/
> Beware of altruism.  It is based on self-deception, the root of all evil.
> 
> _______________________________________________
> Distutils-SIG maillist  -  Distutils-SIG@python.org
> http://www.python.org/mailman/listinfo/distutils-sig
> 


From skip@mojam.com (Skip Montanaro)  Fri May 26 09:25:49 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 26 May 2000 03:25:49 -0500 (CDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
In-Reply-To: <14639.50237.999048.146898@localhost.localdomain>
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
 <392E37E1.75AC4D0E@lemburg.com>
 <14639.50237.999048.146898@localhost.localdomain>
Message-ID: <14638.13581.195350.511944@beluga.mojam.com>

    M> Hmm, wasn't _exceptions supposed to be a *fall back* solution for the
    M> case where the exceptions.py module is not found ? It now looks like
    M> _exceptions replaces exceptions.py...

    BAW> I see no reason to keep both of them around.  Too much of a
    BAW> synchronization headache.

Well, wait a minute.  Is Nick's third revision of his
AttributeError/NameError enhancement still on the table?  If so,
exceptions.py is the right place to put it.  In that case, I would recommend
that exceptions.py still be the file that is loaded.  It would take care of
importing _exceptions.
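
Roughly, a sketch of that layout (hypothetical, just to illustrate the
layering):

    # exceptions.py -- remains the module that actually gets loaded
    from _exceptions import *

    # pure-Python additions (e.g. Nick's AttributeError/NameError work)
    # would live here, where they're easier to maintain than in C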

Oh, BTW.. +1 on Nick's latest version.

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From gward@mems-exchange.org  Fri May 26 14:27:16 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 09:27:16 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <1252780469-123073242@hypernet.com>; from gmcm@hypernet.com on Fri, May 26, 2000 at 07:53:14AM -0400
References: <20000525222203.A1114@beelzebub> <1252780469-123073242@hypernet.com>
Message-ID: <20000526092716.B12100@mems-exchange.org>

On 26 May 2000, Gordon McMillan said:
> Yeah. I tend to install stuff outside the sys.prefix tree and then 
> use .pth files. I realize I'm, um, unique in this regard but I lost 
> everything in some upgrade gone bad. (When a Windows de-
> install goes wrong, your only option is to do some manual 
> directory and registry pruning.)

I think that's appropriate for Python "applications" -- in fact, now
that Distutils can install scripts and miscellaneous data, about the
only thing needed to properly support "applications" is an easy way for
developers to say, "Please give me my own directory and create a .pth
file".  (Actually, the .pth file should only be one way to install an
application: you might not want your app's Python library to muck up
everybody else's Python path.  An idea AMK and I cooked up yesterday
would be an addition to the Distutils "build_scripts" command: along
with frobbing the #! line to point to the right Python interpreter, add
a second line:
  import sys ; sys.path.append(path-to-this-app's-python-lib)

Or maybe "sys.path.insert(0, ...)".

Anyways, that's neither here nor there.  Except that applications that
get their own directory should be free to put their (static) data files
wherever they please, rather than having to put them in the app's Python
library.

I'm more concerned with what the Distutils works best with now,
though: module distributions.  I think you guys have convinced me;
static data should normally sit with the code.  I think I'll make that
the default (instead of prefix + "share"), but give developers a way to
override it.  So eg.:

   data_files = ["this.dat", "that.cfg"]

will put the files in the same place as the code (which could be a bit
tricky to figure out, what with the vagaries of package-ization and
"extra" install dirs);

   data_files = [("share", ["this.dat"]), ("etc", ["that.cfg"])]

would put the data file in (eg.) /usr/local/share and the config file in
/usr/local/etc.  This obviously makes the module writer's job harder: he
has to grovel from sys.prefix looking for the files that he expects to
have been installed with his modules.  But if someone really wants to do
this, they should be allowed to.

Finally, you could also put absolute directories in 'data_files',
although this would not be recommended.
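
For concreteness, a setup script under this scheme might look something
like the following sketch; the distribution name and file names are
invented, the exact keyword spellings are assumptions, and the data_files
behaviour is only the proposal described above:

    from distutils.core import setup

    setup(name = "frobnitz",                  # made-up distribution name
          version = "1.0",
          py_modules = ["frobnitz"],
          # default: data lands alongside the installed code
          data_files = ["this.dat", "that.cfg"],
          )

    # or, to scatter files relative to the installation prefix:
    # data_files = [("share", ["this.dat"]), ("etc", ["that.cfg"])]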

> (Hmm, I guess it is if you use rpms.)

All the smart Unix installers (RPM, Debian, FreeBSD, ...?) I know of
have some sort of dependency mechanism, which works to varying degrees
of "work".  I'm only familiar with RPM, and my usual response to a
dependency warning is "dammit, I know what I'm doing", and then I rerun
"rpm --nodeps" to ignore the dependency checking.  (This usually arises
because I build my own Perl and Python, and don't use Red Hat's -- I
just make /usr/bin/{perl,python} symlinks to /usr/local/bin, which RPM
tends to whine about.)  But it's nice to know that someone is watching.
;-)

        Greg
-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367


From gward@mems-exchange.org  Fri May 26 14:30:29 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 09:30:29 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <200005261313.IAA11285@cj20424-a.reston1.va.home.com>; from guido@python.org on Fri, May 26, 2000 at 08:13:06AM -0500
References: <1252780469-123073242@hypernet.com> <200005261313.IAA11285@cj20424-a.reston1.va.home.com>
Message-ID: <20000526093028.C12100@mems-exchange.org>

On 26 May 2000, Guido van Rossum said:
> Modifyable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.
> 
> But static data needs to go in the same directory as the module that
> uses it.  (We use this in the standard test package, for example.)

What about the Distutils system config file (pydistutils.cfg)?  This is
something that should only be modified by the sysadmin, and sets the
site-wide policy for building and installing Python modules.  Does this
belong in the code directory?  (I hope so, because that's where it goes
now...)

(Under Unix, users can have a personal Distutils config file that
overrides the system config (~/.pydistutils.cfg), and every module
distribution can have a setup.cfg that overrides both of them.  On
Windows and Mac OS, there are only two config files: system and
per-distribution.)

        Greg
-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367


From gward@mems-exchange.org  Fri May 26 15:30:15 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 10:30:15 -0400
Subject: [Python-Dev] py_compile and CR in source files
Message-ID: <20000526103014.A18937@mems-exchange.org>

Just made an unpleasant discovery: if a Python source file has CR-LF
line-endings, you can import it just fine under Unix.  But attempting to
'py_compile.compile()' it fails with a SyntaxError at the first
line-ending.

Arrghh!  This means that Distutils will either have to check/convert
line-endings at build-time (hey, finally, a good excuse for the
"build_py" command), or implicitly compile modules by importing them
(instead of using 'py_compile.compile()').

Perhaps I should "build" modules by line-at-a-time copying -- currently
it copies them in 16k chunks, which would make it hard to fix line
endings.  Hmmm.
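
Something along these lines would do the trick -- a rough sketch only,
with a made-up helper name, not the actual build_py code:

    def copy_file_fixing_newlines(src, dst):
        # Read in binary mode so stray CRs survive long enough to be
        # seen, then write each line back out with a plain LF.
        fin = open(src, 'rb')
        fout = open(dst, 'wb')
        for line in fin.readlines():
            while line[-1:] in ('\r', '\n'):
                line = line[:-1]
            fout.write(line + '\n')
        fin.close()
        fout.close()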

        Greg


From skip@mojam.com (Skip Montanaro)  Fri May 26 10:39:39 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 26 May 2000 04:39:39 -0500 (CDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <20000526103014.A18937@mems-exchange.org>
References: <20000526103014.A18937@mems-exchange.org>
Message-ID: <14638.18011.331703.867404@beluga.mojam.com>

    Greg> Arrghh!  This means that Distutils will either have to
    Greg> check/convert line-endings at build-time (hey, finally, a good
    Greg> excuse for the "build_py" command), or implicitly compile modules
    Greg> by importing them (instead of using 'py_compile.compile()').

I don't think you can safely compile modules by importing them.  You have no
idea what the side effects of the import might be.

How about fixing py_compile.compile() instead?

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From mal@lemburg.com  Fri May 26 15:27:03 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 16:27:03 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vIcs-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <392E89B7.D6BC572D@lemburg.com>

Peter Funk wrote:
> 
> [M.-A. Lemburg]:
> > > Proposal:
> > > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > >
> > > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > > bytes 12-*  marshalled code object (same as earlier)
> >
> > This will break all tools relying on having the code object available
> > in bytes[8:] and believe me: there are lots of those around ;-)
> 
> In some way, this is intentional:  If these tools (are there really
> that many out there, that munge with .pyc byte code files?) simply use
> 'imp.get_magic()' and then silently assume a specific content of the
> marshalled code object, they probably need changes anyway, since the
> code needed to deal with the new unicode object is missing from them.

That's why I proposed to change the marshalled code object
and not the PYC file: the problem is not only related to 
PYC files, it touches all areas where marshal is used. If 
you try to load a code object using Unicode in Python 1.5
you'll get all sorts of errors, e.g. EOFError, SystemError.
 
Since marshal uses a specific format, that format should
receive the version number.

Ideally that version would be prepended to the format (not sure
whether this is possible), so that the PYC file layout
would then look like this:

word 0: magic
word 1: timestamp
word 2: version in the marshalled code object
word 3-*: rest of the marshalled code object

Please make sure that options such as the -U option are
also respected...

--

A different approach to all this would be fixing only the
first two bytes of the magic word, e.g.

byte 0: 'P'
byte 1: 'Y'
byte 2: version number (counting from 1)
byte 3: option byte (8 bits: one for each option;
                     bit 0: -U cmd switch)

This would be b/w compatible and still provide file(1)
with enough information to be able to tell the file type.
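
A rough sketch of how a tool could sniff that second layout (purely
illustrative -- the layout is only a proposal, the helper name is made up,
and the timestamp is assumed to stay in bytes 4-7 as today):

    import struct

    def sniff_pyc_header(path):
        f = open(path, 'rb')
        magic = f.read(4)
        mtime = struct.unpack('<l', f.read(4))[0]
        f.close()
        if magic[:2] == 'PY':
            version = ord(magic[2])
            options = ord(magic[3])
            unicode_mode = options & 1      # bit 0: the -U switch
            return version, unicode_mode, mtime
        # pre-1.6 file: magic is the old opaque 4-byte number
        return None, None, mtime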

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri May 26 15:49:23 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 16:49:23 +0200
Subject: [Python-Dev] Extending locale.py
Message-ID: <392E8EF3.CDA61525@lemburg.com>

This is a multi-part message in MIME format.
--------------FDA69223F7CCDED3D7828E2B
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

To make it a little easier to move in the direction of making the
string encoding depend on the locale settings, I've started
to hack away at an extension of the locale.py module.

The module provides enough information to be able to set the string
encoding in site.py at startup. 

Additional code for _localemodule.c would be nice for platforms
which use other APIs to get at the active code page, e.g. on
Windows and Macs.

Please try it on your platform and tell me what you think
of the APIs.

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
--------------FDA69223F7CCDED3D7828E2B
Content-Type: text/python; charset=us-ascii;
 name="localex.py"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="localex.py"

""" Extented locale.py

    This version of locale.py contains a locale aliasing engine and
    knows the default encoding of many common locales.

    (c) Marc-Andre Lemburg, mal@lemburg.com

"""
from locale import *
import string

__version__ = '0.1'

### APIs

def normalize(localename):

    """ Returns a normalized locale code for the given locale
        name.

        The locale code is usable for setting the locale using
        setlocale().

        If normalization fails, the original name is returned
        unchanged.

    """
    lname = string.lower(localename)
    if ':' in lname:
        # ':' is sometimes used as encoding delimiter.
        lname = string.replace(lname, ':', '.')
    code = locale_alias.get(lname, localename)
    if code is localename:
        # Could be that the encoding is not known; in that case
        # we default to the default encoding for that locale code
        # just like setlocale() does.
        if '.' in lname:
            lname, encoding = string.split(lname, '.')
        code = locale_alias.get(lname, localename)
    return code

def _parse_localename(localename):

    """ Parses the locale code for localename and returns the
        result as tuple (language code, encoding).

        The language code corresponds to RFC 1766.  code and encoding
        can be None in case the values cannot be determined.

    """
    code = normalize(localename)
    l = string.split(code, '.')
    if len(l) != 2:
        if l[0] == 'C':
            return None,None
        else:
            raise SystemError,'error in locale.locale_alias: missing encoding'
    return l
    
def get_default():

    """ Tries to determine the default locale settings and returns
        them as tuple (language code, encoding).

        According to POSIX, a program which has not called
        setlocale(LC_ALL,"") runs using the portable 'C' locale.
        Calling setlocale(LC_ALL,"") lets it use the default locale as
        defined by the LANG variable. Since we don't want to interfere
        with the current locale setting we thus emulate the behaviour
        in the way described above.

        Except for the code 'C', the language code corresponds to RFC
        1766.  code and encoding can be None in case the values cannot
        be determined.

    """
    # XXX On some systems the environment variables LC_ALL, LC_CTYPE, etc.
    #     complement the setting of LANG... perhaps we should try those
    #     variables too ?
    import os
    localename = os.environ.get('LANG','C')
    return _parse_localename(localename)

def get_locale(category=LC_CTYPE):

    """ Returns the current setting for the given locale category as
        tuple (language code, encoding).

        category may be one of the LC_* values except LC_ALL. It
        defaults to LC_CTYPE.

        Except for the code 'C', the language code corresponds to RFC
        1766.  code and encoding can be None in case the values cannot
        be determined.

    """
    return _parse_localename(setlocale(category))

def set_locale(localename, category=LC_ALL):

    """ Set the locale to localename.

        localename can be given as a common locale name. It will be
        filtered through a list of common aliases for the C locale
        codes.

        category may be given as one of the LC_* values. It defaults
        to LC_ALL.

    """
    setlocale(category, normalize(localename))

### Data
#
# The following table was extracted from the locale.alias file which
# comes with X11. It is usually available as
# /usr/lib/X11/locale/locale.alias.
#
# The table maps lowercase alias names to C locale names. Encodings
# are always separated from the locale name using a dot ('.').
#
locale_alias = {
        'american.iso88591':             'en_US.ISO8859-1',
        'american.iso885915':            'en_US.ISO8859-15',
        'ar':                            'ar_AA.ISO8859-6',
        'ar_aa':                         'ar_AA.ISO8859-6',
        'ar_sa.iso88596':                'ar_SA.ISO8859-6',
        'arabic.iso88596':               'ar_AA.ISO8859-6',
        'bg':                            'bg_BG.ISO8859-5',
        'bg_bg':                         'bg_BG.ISO8859-5',
        'bg_bg.iso88595':                'bg_BG.ISO8859-5',
        'bulgarian':                     'bg_BG.ISO8859-5',
        'c-french.iso88591':             'fr_CA.ISO8859-1',
        'c-french.iso885915':            'fr_CA.ISO8859-15',
        'c.en':                          'C',
        'c.iso88591':                    'en_US.ISO8859-1',
        'c.iso885915':                   'en_US.ISO8859-15',
        'c_c.c':                         'C',
        'cextend':                       'en_US.ISO8859-1',
        'cextend.en':                    'en_US.ISO8859-1',
        'chinese-s':                     'zh_CN.eucCN',
        'chinese-t':                     'zh_TW.eucTW',
        'croatian':                      'hr_HR.ISO8859-2',
        'cs':                            'cs_CZ.ISO8859-2',
        'cs_cs':                         'cs_CZ.ISO8859-2',
        'cs_cs.iso8859-2':               'cs_CZ.ISO8859-2',
        'cs_cz':                         'cs_CZ.ISO8859-2',
        'cs_cz.iso88592':                'cs_CZ.ISO8859-2',
        'cz':                            'cz_CZ.ISO8859-2',
        'cz_cz':                         'cz_CZ.ISO8859-2',
        'czech':                         'cs_CS.ISO8859-2',
        'da':                            'da_DK.ISO8859-1',
        'da_dk':                         'da_DK.ISO8859-1',
        'da_dk.88591':                   'da_DK.ISO8859-1',
        'da_dk.88591.en':                'da_DK.ISO8859-1',
        'da_dk.885915':                  'da_DK.ISO8859-15',
        'da_dk.885915.en':               'da_DK.ISO8859-15',
        'da_dk.iso88591':                'da_DK.ISO8859-1',
        'da_dk.iso885915':               'da_DK.ISO8859-15',
        'da_dk.iso_8859-1':              'da_DK.ISO8859-1',
        'da_dk.iso_8859-15':             'da_DK.ISO8859-15',
        'danish.iso88591':               'da_DK.ISO8859-1',
        'danish.iso885915':              'da_DK.ISO8859-15',
        'de':                            'de_DE.ISO8859-1',
        'de_at':                         'de_AT.ISO8859-1',
        'de_at.iso_8859-1':              'de_AT.ISO8859-1',
        'de_at.iso_8859-15':             'de_AT.ISO8859-15',
        'de_ch':                         'de_CH.ISO8859-1',
        'de_ch.iso_8859-1':              'de_CH.ISO8859-1',
        'de_ch.iso_8859-15':             'de_CH.ISO8859-15',
        'de_de':                         'de_DE.ISO8859-1',
        'de_de.88591':                   'de_DE.ISO8859-1',
        'de_de.88591.en':                'de_DE.ISO8859-1',
        'de_de.885915':                  'de_DE.ISO8859-15',
        'de_de.885915.en':               'de_DE.ISO8859-15',
        'de_de.iso88591':                'de_DE.ISO8859-1',
        'de_de.iso885915':               'de_DE.ISO8859-15',
        'de_de.iso_8859-1':              'de_DE.ISO8859-1',
        'de_de.iso_8859-15':             'de_DE.ISO8859-15',
        'dutch.iso88591':                'nl_BE.ISO8859-1',
        'dutch.iso885915':               'nl_BE.ISO8859-15',
        'ee':                            'ee_EE.ISO8859-4',
        'el':                            'el_GR.ISO8859-7',
        'el_gr':                         'el_GR.ISO8859-7',
        'el_gr.iso88597':                'el_GR.ISO8859-7',
        'en':                            'en_US.ISO8859-1',
        'en_au':                         'en_AU.ISO8859-1',
        'en_au.iso_8859-1':              'en_AU.ISO8859-1',
        'en_au.iso_8859-15':             'en_AU.ISO8859-15',
        'en_ca':                         'en_CA.ISO8859-1',
        'en_ca.iso_8859-1':              'en_CA.ISO8859-1',
        'en_ca.iso_8859-15':             'en_CA.ISO8859-15',
        'en_gb':                         'en_GB.ISO8859-1',
        'en_gb.88591':                   'en_GB.ISO8859-1',
        'en_gb.88591.en':                'en_GB.ISO8859-1',
        'en_gb.885915':                  'en_GB.ISO8859-15',
        'en_gb.885915.en':               'en_GB.ISO8859-15',
        'en_gb.iso88591':                'en_GB.ISO8859-1',
        'en_gb.iso885915':               'en_GB.ISO8859-15',
        'en_gb.iso_8859-1':              'en_GB.ISO8859-1',
        'en_gb.iso_8859-15':             'en_GB.ISO8859-15',
        'en_ie':                         'en_IE.ISO8859-1',
        'en_nz':                         'en_NZ.ISO8859-1',
        'en_uk':                         'en_GB.ISO8859-1',
        'en_us':                         'en_US.ISO8859-1',
        'en_us.88591':                   'en_US.ISO8859-1',
        'en_us.88591.en':                'en_US.ISO8859-1',
        'en_us.885915':                  'en_US.ISO8859-15',
        'en_us.885915.en':               'en_US.ISO8859-15',
        'en_us.iso88591':                'en_US.ISO8859-1',
        'en_us.iso885915':               'en_US.ISO8859-15',
        'en_us.iso_8859-1':              'en_US.ISO8859-1',
        'en_us.iso_8859-15':             'en_US.ISO8859-15',
        'en_us.utf-8':                   'en_US.utf',
        'eng_gb.8859':                   'en_GB.ISO8859-1',
        'eng_gb.8859.in':                'en_GB.ISO8859-1',
        'english.iso88591':              'en_EN.ISO8859-1',
        'english.iso885915':             'en_EN.ISO8859-15',
        'english_uk.8859':               'en_GB.ISO8859-1',
        'english_united-states.437':     'C',
        'english_us.8859':               'en_US.ISO8859-1',
        'english_us.ascii':              'en_US.ISO8859-1',
        'es':                            'es_ES.ISO8859-1',
        'es_ar':                         'es_AR.ISO8859-1',
        'es_bo':                         'es_BO.ISO8859-1',
        'es_cl':                         'es_CL.ISO8859-1',
        'es_co':                         'es_CO.ISO8859-1',
        'es_cr':                         'es_CR.ISO8859-1',
        'es_ec':                         'es_EC.ISO8859-1',
        'es_es':                         'es_ES.ISO8859-1',
        'es_es.88591':                   'es_ES.ISO8859-1',
        'es_es.88591.en':                'es_ES.ISO8859-1',
        'es_es.885915':                  'es_ES.ISO8859-15',
        'es_es.885915.en':               'es_ES.ISO8859-15',
        'es_es.iso88591':                'es_ES.ISO8859-1',
        'es_es.iso885915':               'es_ES.ISO8859-15',
        'es_es.iso_8859-1':              'es_ES.ISO8859-1',
        'es_es.iso_8859-15':             'es_ES.ISO8859-15',
        'es_gt':                         'es_GT.ISO8859-1',
        'es_mx':                         'es_MX.ISO8859-1',
        'es_ni':                         'es_NI.ISO8859-1',
        'es_pa':                         'es_PA.ISO8859-1',
        'es_pe':                         'es_PE.ISO8859-1',
        'es_py':                         'es_PY.ISO8859-1',
        'es_sv':                         'es_SV.ISO8859-1',
        'es_uy':                         'es_UY.ISO8859-1',
        'es_ve':                         'es_VE.ISO8859-1',
        'et':                            'et_EE.ISO8859-4',
        'et_ee':                         'et_EE.ISO8859-4',
        'fi':                            'fi_FI.ISO8859-1',
        'fi_fi':                         'fi_FI.ISO8859-1',
        'fi_fi.88591':                   'fi_FI.ISO8859-1',
        'fi_fi.88591.en':                'fi_FI.ISO8859-1',
        'fi_fi.885915':                  'fi_FI.ISO8859-15',
        'fi_fi.885915.en':               'fi_FI.ISO8859-15',
        'fi_fi.iso88591':                'fi_FI.ISO8859-1',
        'fi_fi.iso885915':               'fi_FI.ISO8859-15',
        'fi_fi.iso_8859-1':              'fi_FI.ISO8859-1',
        'fi_fi.iso_8859-15':             'fi_FI.ISO8859-15',
        'finnish.iso88591':              'fi_FI.ISO8859-1',
        'finnish.iso885915':             'fi_FI.ISO8859-15',
        'fr':                            'fr_FR.ISO8859-1',
        'fr_be':                         'fr_BE.ISO8859-1',
        'fr_be.88591':                   'fr_BE.ISO8859-1',
        'fr_be.88591.en':                'fr_BE.ISO8859-1',
        'fr_be.885915':                  'fr_BE.ISO8859-15',
        'fr_be.885915.en':               'fr_BE.ISO8859-15',
        'fr_be.iso_8859-1':              'fr_BE.ISO8859-1',
        'fr_be.iso_8859-15':             'fr_BE.ISO8859-15',
        'fr_ca':                         'fr_CA.ISO8859-1',
        'fr_ca.88591':                   'fr_CA.ISO8859-1',
        'fr_ca.88591.en':                'fr_CA.ISO8859-1',
        'fr_ca.885915':                  'fr_CA.ISO8859-15',
        'fr_ca.885915.en':               'fr_CA.ISO8859-15',
        'fr_ca.iso88591':                'fr_CA.ISO8859-1',
        'fr_ca.iso885915':               'fr_CA.ISO8859-15',
        'fr_ca.iso_8859-1':              'fr_CA.ISO8859-1',
        'fr_ca.iso_8859-15':             'fr_CA.ISO8859-15',
        'fr_ch':                         'fr_CH.ISO8859-1',
        'fr_ch.88591':                   'fr_CH.ISO8859-1',
        'fr_ch.88591.en':                'fr_CH.ISO8859-1',
        'fr_ch.885915':                  'fr_CH.ISO8859-15',
        'fr_ch.885915.en':               'fr_CH.ISO8859-15',
        'fr_ch.iso_8859-1':              'fr_CH.ISO8859-1',
        'fr_ch.iso_8859-15':             'fr_CH.ISO8859-15',
        'fr_fr':                         'fr_FR.ISO8859-1',
        'fr_fr.88591':                   'fr_FR.ISO8859-1',
        'fr_fr.88591.en':                'fr_FR.ISO8859-1',
        'fr_fr.885915':                  'fr_FR.ISO8859-15',
        'fr_fr.885915.en':               'fr_FR.ISO8859-15',
        'fr_fr.iso88591':                'fr_FR.ISO8859-1',
        'fr_fr.iso885915':               'fr_FR.ISO8859-15',
        'fr_fr.iso_8859-1':              'fr_FR.ISO8859-1',
        'fr_fr.iso_8859-15':             'fr_FR.ISO8859-15',
        'fre_fr.8859':                   'fr_FR.ISO8859-1',
        'fre_fr.8859.in':                'fr_FR.ISO8859-1',
        'french.iso88591':               'fr_CH.ISO8859-1',
        'french.iso885915':              'fr_CH.ISO8859-15',
        'french_france.8859':            'fr_FR.ISO8859-1',
        'ger_de.8859':                   'de_DE.ISO8859-1',
        'ger_de.8859.in':                'de_DE.ISO8859-1',
        'german.iso88591':               'de_CH.ISO8859-1',
        'german.iso885915':              'de_CH.ISO8859-15',
        'german_germany.8859':           'de_DE.ISO8859-1',
        'greek.iso88597':                'el_GR.ISO8859-7',
        'hebrew.iso88598':               'iw_IL.ISO8859-8',
        'hr':                            'hr_HR.ISO8859-2',
        'hr_hr':                         'hr_HR.ISO8859-2',
        'hr_hr.iso88592':                'hr_HR.ISO8859-2',
        'hr_hr.iso_8859-2':              'hr_HR.ISO8859-2',
        'hu':                            'hu_HU.ISO8859-2',
        'hu_hu':                         'hu_HU.ISO8859-2',
        'hu_hu.iso88592':                'hu_HU.ISO8859-2',
        'hungarian':                     'hu_HU.ISO8859-2',
        'icelandic.iso88591':            'is_IS.ISO8859-1',
        'icelandic.iso885915':           'is_IS.ISO8859-15',
        'id':                            'id_ID.ISO8859-1',
        'id_id':                         'id_ID.ISO8859-1',
        'id_id.iso88591':                'id_ID.ISO8859-1',
        'is':                            'is_IS.ISO8859-1',
        'is_is':                         'is_IS.ISO8859-1',
        'is_is.iso88591':                'is_IS.ISO8859-1',
        'is_is.iso885915':               'is_IS.ISO8859-15',
        'is_is.iso_8859-1':              'is_IS.ISO8859-1',
        'is_is.iso_8859-15':             'is_IS.ISO8859-15',
        'iso-8859-1':                    'en_US.ISO8859-1',
        'iso-8859-15':                   'en_US.ISO8859-15',
        'iso8859-1':                     'en_US.ISO8859-1',
        'iso8859-15':                    'en_US.ISO8859-15',
        'iso_8859_1':                    'en_US.ISO8859-1',
        'iso_8859_15':                   'en_US.ISO8859-15',
        'it':                            'it_IT.ISO8859-1',
        'it_ch':                         'it_CH.ISO8859-1',
        'it_ch.iso_8859-1':              'it_CH.ISO8859-1',
        'it_ch.iso_8859-15':             'it_CH.ISO8859-15',
        'it_it':                         'it_IT.ISO8859-1',
        'it_it.88591':                   'it_IT.ISO8859-1',
        'it_it.88591.en':                'it_IT.ISO8859-1',
        'it_it.885915':                  'it_IT.ISO8859-15',
        'it_it.885915.en':               'it_IT.ISO8859-15',
        'it_it.iso88591':                'it_IT.ISO8859-1',
        'it_it.iso885915':               'it_IT.ISO8859-15',
        'it_it.iso_8859-1':              'it_IT.ISO8859-1',
        'it_it.iso_8859-15':             'it_IT.ISO8859-15',
        'italian.iso88591':              'it_IT.ISO8859-1',
        'italian.iso885915':             'it_IT.ISO8859-15',
        'iw':                            'iw_IL.ISO8859-8',
        'iw_il':                         'iw_IL.ISO8859-8',
        'iw_il.iso88598':                'iw_IL.ISO8859-8',
        'ja':                            'ja_JP.eucJP',
        'ja.jis':                        'ja_JP.JIS7',
        'ja.sjis':                       'ja_JP.SJIS',
        'ja_jp':                         'ja_JP.eucJP',
        'ja_jp.ajec':                    'ja_JP.eucJP',
        'ja_jp.euc':                     'ja_JP.eucJP',
        'ja_jp.eucjp':                   'ja_JP.eucJP',
        'ja_jp.iso-2022-jp':             'ja_JP.JIS7',
        'ja_jp.jis':                     'ja_JP.JIS7',
        'ja_jp.jis7':                    'ja_JP.JIS7',
        'ja_jp.mscode':                  'ja_JP.SJIS',
        'ja_jp.sjis':                    'ja_JP.SJIS',
        'ja_jp.ujis':                    'ja_JP.eucJP',
        'japan':                         'ja_JP.eucJP',
        'japanese':                      'ja_JP.SJIS',
        'japanese-euc':                  'ja_JP.eucJP',
        'japanese.euc':                  'ja_JP.eucJP',
        'jp_jp':                         'ja_JP.eucJP',
        'ko':                            'ko_KR.eucKR',
        'ko_kr':                         'ko_KR.eucKR',
        'ko_kr.euc':                     'ko_KR.eucKR',
        'korean':                        'ko_KR.eucKR',
        'lt':                            'lt_LT.ISO8859-4',
        'lv':                            'lv_LV.ISO8859-4',
        'mk':                            'mk_MK.ISO8859-5',
        'mk_mk':                         'mk_MK.ISO8859-5',
        'nl':                            'nl_NL.ISO8859-1',
        'nl_be':                         'nl_BE.ISO8859-1',
        'nl_be.88591':                   'nl_BE.ISO8859-1',
        'nl_be.88591.en':                'nl_BE.ISO8859-1',
        'nl_be.885915':                  'nl_BE.ISO8859-15',
        'nl_be.885915.en':               'nl_BE.ISO8859-15',
        'nl_be.iso_8859-1':              'nl_BE.ISO8859-1',
        'nl_be.iso_8859-15':             'nl_BE.ISO8859-15',
        'nl_nl':                         'nl_NL.ISO8859-1',
        'nl_nl.88591':                   'nl_NL.ISO8859-1',
        'nl_nl.88591.en':                'nl_NL.ISO8859-1',
        'nl_nl.885915':                  'nl_NL.ISO8859-15',
        'nl_nl.885915.en':               'nl_NL.ISO8859-15',
        'nl_nl.iso88591':                'nl_NL.ISO8859-1',
        'nl_nl.iso885915':               'nl_NL.ISO8859-15',
        'nl_nl.iso_8859-1':              'nl_NL.ISO8859-1',
        'nl_nl.iso_8859-15':             'nl_NL.ISO8859-15',
        'no':                            'no_NO.ISO8859-1',
        'no_no':                         'no_NO.ISO8859-1',
        'no_no.88591':                   'no_NO.ISO8859-1',
        'no_no.88591.en':                'no_NO.ISO8859-1',
        'no_no.885915':                  'no_NO.ISO8859-15',
        'no_no.885915.en':               'no_NO.ISO8859-15',
        'no_no.iso88591':                'no_NO.ISO8859-1',
        'no_no.iso885915':               'no_NO.ISO8859-15',
        'no_no.iso_8859-1':              'no_NO.ISO8859-1',
        'no_no.iso_8859-15':             'no_NO.ISO8859-15',
        'norwegian.iso88591':            'no_NO.ISO8859-1',
        'norwegian.iso885915':           'no_NO.ISO8859-15',
        'pl':                            'pl_PL.ISO8859-2',
        'pl_pl':                         'pl_PL.ISO8859-2',
        'pl_pl.iso88592':                'pl_PL.ISO8859-2',
        'polish':                        'pl_PL.ISO8859-2',
        'portuguese.iso88591':           'pt_PT.ISO8859-1',
        'portuguese.iso885915':          'pt_PT.ISO8859-15',
        'portuguese_brazil.8859':        'pt_BR.ISO8859-1',
        'posix':                         'C',
        'posix-utf2':                    'C',
        'pt':                            'pt_PT.ISO8859-1',
        'pt_br':                         'pt_BR.ISO8859-1',
        'pt_pt':                         'pt_PT.ISO8859-1',
        'pt_pt.88591':                   'pt_PT.ISO8859-1',
        'pt_pt.88591.en':                'pt_PT.ISO8859-1',
        'pt_pt.885915':                  'pt_PT.ISO8859-15',
        'pt_pt.885915.en':               'pt_PT.ISO8859-15',
        'pt_pt.iso88591':                'pt_PT.ISO8859-1',
        'pt_pt.iso885915':               'pt_PT.ISO8859-15',
        'pt_pt.iso_8859-1':              'pt_PT.ISO8859-1',
        'pt_pt.iso_8859-15':             'pt_PT.ISO8859-15',
        'ro':                            'ro_RO.ISO8859-2',
        'ro_ro':                         'ro_RO.ISO8859-2',
        'ro_ro.iso88592':                'ro_RO.ISO8859-2',
        'ru':                            'ru_RU.ISO8859-5',
        'ru_ru':                         'ru_RU.ISO8859-5',
        'ru_ru.iso88595':                'ru_RU.ISO8859-5',
        'rumanian':                      'ro_RO.ISO8859-2',
        'russian':                       'ru_RU.ISO8859-5',
        'serbocroatian':                 'sh_YU.ISO8859-2',
        'sh':                            'sh_YU.ISO8859-2',
        'sh_hr.iso88592':                'sh_HR.ISO8859-2',
        'sh_sp':                         'sh_YU.ISO8859-2',
        'sh_yu':                         'sh_YU.ISO8859-2',
        'sk':                            'sk_SK.ISO8859-2',
        'sk_sk':                         'sk_SK.ISO8859-2',
        'sk_sk.iso88592':                'sk_SK.ISO8859-2',
        'sl':                            'sl_CS.ISO8859-2',
        'sl_cs':                         'sl_CS.ISO8859-2',
        'sl_si':                         'sl_SI.ISO8859-2',
        'sl_si.iso88592':                'sl_SI.ISO8859-2',
        'slovak':                        'sk_SK.ISO8859-2',
        'slovene':                       'sl_CS.ISO8859-2',
        'sp':                            'sp_YU.ISO8859-5',
        'sp_yu':                         'sp_YU.ISO8859-5',
        'spanish.iso88591':              'es_ES.ISO8859-1',
        'spanish.iso885915':             'es_ES.ISO8859-15',
        'spanish_spain.8859':            'es_ES.ISO8859-1',
        'sr_sp':                         'sr_SP.ISO8859-2',
        'sv':                            'sv_SE.ISO8859-1',
        'sv_se':                         'sv_SE.ISO8859-1',
        'sv_se.88591':                   'sv_SE.ISO8859-1',
        'sv_se.88591.en':                'sv_SE.ISO8859-1',
        'sv_se.885915':                  'sv_SE.ISO8859-15',
        'sv_se.885915.en':               'sv_SE.ISO8859-15',
        'sv_se.iso88591':                'sv_SE.ISO8859-1',
        'sv_se.iso885915':               'sv_SE.ISO8859-15',
        'sv_se.iso_8859-1':              'sv_SE.ISO8859-1',
        'sv_se.iso_8859-15':             'sv_SE.ISO8859-15',
        'swedish.iso88591':              'sv_SE.ISO8859-1',
        'swedish.iso885915':             'sv_SE.ISO8859-15',
        'th_th':                         'th_TH.TACTIS',
        'th_th.tis620':                  'th_TH.TACTIS',
        'tr':                            'tr_TR.ISO8859-9',
        'tr_tr':                         'tr_TR.ISO8859-9',
        'tr_tr.iso88599':                'tr_TR.ISO8859-9',
        'turkish.iso88599':              'tr_TR.ISO8859-9',
        'univ.utf8':                     'en_US.utf',
        'universal.utf8@ucs4':           'en_US.utf',
        'zh':                            'zh_CN.eucCN',
        'zh_cn':                         'zh_CN.eucCN',
        'zh_cn.big5':                    'zh_TW.eucTW',
        'zh_cn.euc':                     'zh_CN.eucCN',
        'zh_tw':                         'zh_TW.eucTW',
        'zh_tw.euc':                     'zh_TW.eucTW',
}

if __name__ == '__main__':

    categories = {}
    def _init_categories():
        for k,v in globals().items():
            if k[:3] == 'LC_':
                categories[k] = v
    _init_categories()

    print 'Locale defaults:'
    print '-'*72
    lang, enc = get_default()
    print 'Language: ', lang or '(undefined)'
    print 'Encoding: ', enc or '(undefined)'
    print

    print 'Locale settings on startup:'
    print '-'*72
    for name,category in categories.items():
        print name,'...'
        lang, enc = get_locale(category)
        print '   Language: ', lang or '(undefined)'
        print '   Encoding: ', enc or '(undefined)'
        print

    setlocale(LC_ALL,"")
    print
    print 'Locale settings after calling setlocale(LC_ALL,""):'
    print '-'*72
    for name,category in categories.items():
        print name,'...'
        lang, enc = get_locale(category)
        print '   Language: ', lang or '(undefined)'
        print '   Encoding: ', enc or '(undefined)'
        print
    

--------------FDA69223F7CCDED3D7828E2B--




From gmcm@hypernet.com  Fri May 26 15:56:27 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 26 May 2000 10:56:27 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000526092716.B12100@mems-exchange.org>
References: <1252780469-123073242@hypernet.com>; from gmcm@hypernet.com on Fri, May 26, 2000 at 07:53:14AM -0400
Message-ID: <1252769476-123734481@hypernet.com>

Greg Ward wrote:

> On 26 May 2000, Gordon McMillan said:
> > Yeah. I tend to install stuff outside the sys.prefix tree and
> > then use .pth files. I realize I'm, um, unique in this regard
> > but I lost everything in some upgrade gone bad. (When a Windows
> > de- install goes wrong, your only option is to do some manual
> > directory and registry pruning.)
> 
> I think that's appropriate for Python "applications" -- in fact,
> now that Distutils can install scripts and miscellaneous data,
> about the only thing needed to properly support "applications" is
> an easy way for developers to say, "Please give me my own
> directory and create a .pth file". 

Hmm. I see an application as a module distribution that 
happens to have a script. (Or maybe I see a module 
distribution as a scriptless app ;-)).

At any rate, I don't see the need to dignify <prefix>/share and 
friends with an official position.

> (Actually, the .pth file
> should only be one way to install an application: you might not
> want your app's Python library to muck up everybody else's Python
> path.  An idea AMK and I cooked up yesterday would be an addition
> to the Distutils "build_scripts" command: along with frobbing the
> #! line to point to the right Python interpreter, add a second
> line:
>   import sys ; sys.path.append(path-to-this-app's-python-lib)
> 
> Or maybe "sys.path.insert(0, ...)".

$PYTHONSTARTUP ??

Never really had to deal with this. On my RH box, 
/usr/bin/python is my build. At a client site which had 1.4 
installed, I built 1.5 into $HOME/bin with a hacked getpath.c.

> I'm more concerned with what the Distutils works best with
> now, though: module distributions.  I think you guys have
> convinced me; static data should normally sit with the code.  I
> think I'll make that the default (instead of prefix + "share"),
> but give developers a way to override it.  So eg.:
> 
>    data_files = ["this.dat", "that.cfg"]
> 
> will put the files in the same place as the code (which could be
> a bit tricky to figure out, what with the vagaries of
> package-ization and "extra" install dirs);

That's an artifact of your code ;-). If you figured it out once, 
you stand at least a 50% chance of getting the same answer 
a second time <.5 wink>.
 


- Gordon


From gward@mems-exchange.org  Fri May 26 16:06:09 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 11:06:09 -0400
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.18011.331703.867404@beluga.mojam.com>; from skip@mojam.com on Fri, May 26, 2000 at 04:39:39AM -0500
References: <20000526103014.A18937@mems-exchange.org> <14638.18011.331703.867404@beluga.mojam.com>
Message-ID: <20000526110608.F9083@mems-exchange.org>

On 26 May 2000, Skip Montanaro said:
> I don't think you can safely compile modules by importing them.  You have no
> idea what the side effects of the import might be.

Yeah, that's my concern.

> How about fixing py_compile.compile() instead?

Would be a good thing to do this for Python 1.6, but I can't go back and
fix all the Python 1.5.2 installations out there.

Does anyone know of any good reasons why 'import' and
'py_compile.compile()' are different?  Or is it something easily
fixable?

        Greg


From tim_one@email.msn.com  Fri May 26 16:41:57 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Fri, 26 May 2000 11:41:57 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000401bfc082$54211940$6c2d153f@tim>
Message-ID: <LNBBLJKPBEHFEDALKOLCKELFGBAA.tim_one@email.msn.com>

Just polishing part of this off, for the curious:

> ...
> Dragon's Win98 woes appear due to something else:  right after a Win98
> system w/ 64Mb RAM is booted, about half the memory is already locked (not
> just committed)!  Dragon's product needs more than the remaining 32Mb to
> avoid thrashing.  Even stranger, killing every process after booting
> releases an insignificant amount of that locked memory. ...

That turned out to be (mostly) irrelevant, and even if it were relevant it
turns out you can reduce the locked memory (to what appears to be an
undocumented minimum) and the file-cache size (to what is a documented
minimum) just by malloc'ing, zero'ing and free'ing a few giant arrays
(Windows malloc()-- unlike Linux's --returns a pointer to committed memory;
Windows has other calls if you really want memory you can't trust <0.5
wink>).
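
In Python terms the trick amounts to something like this rough sketch
(the sizes and repetition count are arbitrary, and the original poking
around was done in C):

    # allocate, touch and free a few big blocks to coax Win9x into
    # shrinking its file cache and releasing "locked" memory
    for i in range(4):
        big = '\0' * (32 * 1024 * 1024)   # ~32Mb, committed and written
        del big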

The next red herring was much funnier:  we couldn't reproduce the problem
when running the recognizer by hand (from a DOS box cmdline)!  But, run it
as Research did, system()'ed from a small Perl script, and it magically ran
3x slower, with monstrous disk thrashing.  So I had a great time besmirching
Perl's reputation <wink>.

Alas, it turned out the *real* trigger was something else entirely, that
we've known about for years but have never understood:  from inside the Perl
script, people used UNC paths to various network locations.  Like

    \\earwig\research2\data5\natspeak\testk\big55.voc

Exactly the same locations were referenced when people ran it "by hand", but
when people do it by hand, they naturally map a drive letter first, in order to
reduce typing.  Like

    net use N: \\earwig\research2\data5\natspeak

once and then

    N:\testk\big55.voc

in their command lines.

This difference alone can make a *huge* timing difference!  Like I said,
we've never understood why.  Could simply be a bug in Dragon's
out-of-control network setup, or a bug in MS's networking code, or a bug in
Novell's server code -- I don't think we'll ever know.  The number of
IQ-hours that have gone into *trying* to figure this out over the years
could probably have carried several startups to successful IPOs <0.9 wink>.

One last useless clue:  do all this on a Win98 with 128Mb RAM, and the
timing difference goes away.  Ditto Win95, but much less RAM is needed.  It
sometimes acts like a UNC path consumes 32Mb of dedicated RAM!

Apart from this UNC-vs-mapped-drive issue, over many hours of dead-end
scenarios I was pleased to see that Win98 appears to do a good job of
reallocating physical RAM in response to changing demands, & in particular
better than Win95.  There's no problem here at all!

The original test case I posted-- showing massive heap fragmentation under
Win95, Win98, and W2K (but not NT), when growing a large Python list one
element at a time --remains an as-yet unstudied mystery.  I can easily make
*that* problem go away by, e.g., doing

    a = [1]*3000000
    del a

from time to time, apparently just to convince the Windows malloc that it
would be a wise idea to allocate a lot more than it thinks it needs from
time to time.  This suggests (untested) that it *could* be a huge win for
huge lists under Windows to overallocate huge lists by more than Python does
today.  I'll look into that "someday".




From gstein@lyra.org  Fri May 26 16:46:09 2000
From: gstein@lyra.org (Greg Stein)
Date: Fri, 26 May 2000 08:46:09 -0700 (PDT)
Subject: [Python-Dev] exceptions.c location (was: Win32 build)
In-Reply-To: <14639.50100.383806.969434@localhost.localdomain>
Message-ID: <Pine.LNX.4.10.10005260845130.23146-100000@nebula.lyra.org>

On Sat, 27 May 2000, Barry A. Warsaw wrote:
> >>>>> "GS" == Greg Stein <gstein@lyra.org> writes:
> 
>     GS> On Fri, 26 May 2000, Tim Peters wrote:
>     >> ...  PS: Barry's exception patch appears to have broken the CVS
>     >> Windows build (nothing links anymore; all the PyExc_xxx symbols
>     >> aren't found; no time to dig more now).
> 
>     GS> The .dsp file(s) need to be updated to include the new
>     GS> _exceptions.c file in their build and link step. (the symbols
>     GS> moved there)
> 
>     GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
>     GS> the Python/ directory. Dependencies from the core out to
>     GS> Modules/ seems a bit weird.
> 
> Guido made the suggestion to move _exceptions.c to exceptions.c any
> way.  Should we move the file to the other directory too?  Get out
> your plusses and minuses.

+1 for moving it to Python/ (where bltinmodule.c and sysmodule.c exist)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Fri May 26 17:18:14 2000
From: gstein@lyra.org (Greg Stein)
Date: Fri, 26 May 2000 09:18:14 -0700 (PDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <20000526110608.F9083@mems-exchange.org>
Message-ID: <Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>

On Fri, 26 May 2000, Greg Ward wrote:
> On 26 May 2000, Skip Montanaro said:
> > I don't think you can safely compile modules by importing them.  You have no
> > idea what the side effects of the import might be.
> 
> Yeah, that's my concern.

I agree. You can't just import them.

> > How about fixing py_compile.compile() instead?
> 
> Would be a good thing to do this for Python 1.6, but I can't go back and
> fix all the Python 1.5.2 installations out there.

You and your 1.5 compatibility... :-)

> Does anyone know if any good reasons why 'import' and
> 'py_compile.compile()' are different?  Or is it something easily
> fixable?

I seem to recall needing to put an extra carriage return on the file, but
that the Python parser was fine with the different newline concepts. Guido
explained the difference once to me, but I don't recall offhand -- I'd
have to crawl back thru the email. Just yell over the cube at him to find
out.

*ponder*

Well, assuming that it is NOT okay with \r\n in there, then read the whole
blob in and use string.replace() on it.
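
I.e. something like this bare sketch (made-up helper name; whether the
replacement is actually safe is the question taken up below):

    import string

    def read_source_normalized(path):
        f = open(path, 'rb')
        data = f.read()
        f.close()
        return string.replace(data, '\r\n', '\n')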

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From skip@mojam.com (Skip Montanaro)  Fri May 26 17:30:08 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 26 May 2000 11:30:08 -0500 (CDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>
References: <20000526110608.F9083@mems-exchange.org>
 <Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>
Message-ID: <14638.42640.835838.859270@beluga.mojam.com>

    Greg> Well, assuming that it is NOT okay with \r\n in there, then read
    Greg> the whole blob in and use string.replace() on it.

I thought of that too, but quickly dismissed it.  You may have a CRLF pair
embedded in a triple-quoted string.  Those should be left untouched.

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From fdrake@acm.org  Fri May 26 18:18:00 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Fri, 26 May 2000 10:18:00 -0700 (PDT)
Subject: [Distutils] Re: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.42640.835838.859270@beluga.mojam.com>
Message-ID: <Pine.LNX.4.10.10005261014420.12340-100000@mailhost.beopen.com>

On Fri, 26 May 2000, Skip Montanaro wrote:
 > I thought of that too, but quickly dismissed it.  You may have a CRLF pair
 > embedded in a triple-quoted string.  Those should be left untouched.

  No, it would be OK to do the replacement; source files are supposed to
be treated as text, meaning that line ends should be represented as \n.
We're not changing the values of the strings: a line end inside a
triple-quoted string is still treated as \n, and that's what ends up in
the value of the string.  This has no impact on the explicit inclusion
of \r or \r\n escapes in strings.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From bwarsaw@python.org  Fri May 26 18:32:02 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Fri, 26 May 2000 13:32:02 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
 <Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
Message-ID: <14638.46354.960974.536560@localhost.localdomain>

>>>>> "Fred" == Fred L Drake <fdrake@acm.org> writes:

    Fred> and there's ElectricFence from Bruce Perens:

    Fred> 	http://www.perens.com/FreeSoftware/

Yup, this comes with RH6.2 and is fairly easy to hook up; just link
with -lefence and go.  Running an efenced python over the whole test
suite fails miserably, but running it over just
Lib/test/test_exceptions.py has already (quickly) revealed one
refcounting bug, which I will check in to fix later today (as I move
Modules/_exceptions.c to Python/exceptions.c).

    Fred> (There's a MailMan related link there are well you might be
    Fred> interested in!)

Indeed!  I've seen Bruce contribute on the various Mailman mailing
lists.

-Barry


From skip@mojam.com (Skip Montanaro)  Fri May 26 18:48:46 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 26 May 2000 12:48:46 -0500 (CDT)
Subject: [Python-Dev] C implementation of exceptions module
In-Reply-To: <14638.46354.960974.536560@localhost.localdomain>
References: <14638.64962.118047.467438@localhost.localdomain>
 <Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
 <14638.46354.960974.536560@localhost.localdomain>
Message-ID: <14638.47358.724731.392760@beluga.mojam.com>

    BAW> Yup, this comes with RH6.2 and is fairly easy to hook up; just link
    BAW> with -lefence and go.

Hmmm...  Sounds like an extra configure flag waiting to be added...

Skip


From bwarsaw@python.org  Fri May 26 19:38:19 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Fri, 26 May 2000 14:38:19 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
 <Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
 <14638.46354.960974.536560@localhost.localdomain>
 <14638.47358.724731.392760@beluga.mojam.com>
Message-ID: <14638.50331.542338.196305@localhost.localdomain>

>>>>> "SM" == Skip Montanaro <skip@mojam.com> writes:

    BAW> Yup, this comes with RH6.2 and is fairly easy to hook up;
    BAW> just link with -lefence and go.

    SM> Hmmm...  Sounds like an extra configure flag waiting to be
    SM> added...

I dunno.  I just did a "make -k OPT=-g LIBC=-lefence".

-Barry


From trentm@activestate.com  Fri May 26 19:55:55 2000
From: trentm@activestate.com (Trent Mick)
Date: Fri, 26 May 2000 11:55:55 -0700
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
In-Reply-To: <14639.50100.383806.969434@localhost.localdomain>
References: <000401bfc6d3$0afb3e60$c52d153f@tim> <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org> <14639.50100.383806.969434@localhost.localdomain>
Message-ID: <20000526115555.C32427@activestate.com>

On Sat, May 27, 2000 at 08:46:44AM -0400, Barry A. Warsaw wrote:
> 
> >>>>> "GS" == Greg Stein <gstein@lyra.org> writes:
> 
>     GS> On Fri, 26 May 2000, Tim Peters wrote:
>     >> ...  PS: Barry's exception patch appears to have broken the CVS
>     >> Windows build (nothing links anymore; all the PyExc_xxx symbols
>     >> aren't found; no time to dig more now).
> 
>     GS> The .dsp file(s) need to be updated to include the new
>     GS> _exceptions.c file in their build and link step. (the symbols
>     GS> moved there)
> 
>     GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
>     GS> the Python/ directory. Dependencies from the core out to
>     GS> Modules/ seems a bit weird.
> 
> Guido made the suggestion to move _exceptions.c to exceptions.c any
> way.  Should we move the file to the other directory too?  Get out
> your plusses and minuses.
> 
+1 moving exceptions.c to Python/


Trent

-- 
Trent Mick
trentm@activestate.com


From trentm@activestate.com  Fri May 26 19:39:40 2000
From: trentm@activestate.com (Trent Mick)
Date: Fri, 26 May 2000 11:39:40 -0700
Subject: Re: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <m12vFOo-000DieC@artcom0.artcom-gmbh.de>
References: <200005251551.KAA11897@cj20424-a.reston1.va.home.com> <m12vFOo-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000526113940.B32427@activestate.com>

On Fri, May 26, 2000 at 10:23:18AM +0200, Peter Funk wrote:
> [Guido van Rossum]:
> > Given Christian Tismer's testimonial and inspection of marshal.c, I
> > think Peter's small patch is acceptable.
> > 
> > A bigger question is whether we should freeze the magic number and add
> > a version number.  In theory I'm all for that, but it means more
> > changes; there are several tools (e.c. Lib/py_compile.py,
> > Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> > intimate knowledge of the .pyc file format that would have to be
> > modified to match.
> > 
> > The current format of a .pyc file is as follows:
> > 
> > bytes 0-3   magic number
> > bytes 4-7   timestamp (mtime of .py file)
> > bytes 8-*   marshalled code object
> 
> Proposal:
> The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> 
> bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> bytes 4-7   a version number (which should be == 1 in Python 1.6)
> bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> bytes 12-*  marshalled code object (same as earlier)
> 

This may be important: timestamps (as represented by the time_t type) are 8
bytes wide on 64-bit Linux and Win64. However, it will be a while (another 38
years) before time_t starts overflowing past 31 bits (it is a signed value).

The use of a 4 byte timestamp in .pyc files assumes that the modification
time will always fit in 4 bytes. The best portable way of handling this
issue (I think) is to add an overflow check in import.c where
PyOS_GetLastModificationTime (which now properly returns a time_t) is
called, and raise an exception if the time_t value overflows 4 bytes.

I have been going through the Python code looking for possible overflow
cases for Win64 and Linux64 of late, so I will submit these patches (Real
Soon Now (tm)).

Cheers,
Trent

-- 
Trent Mick
trentm@activestate.com


From bwarsaw@python.org  Fri May 26 20:11:40 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Fri, 26 May 2000 15:11:40 -0400 (EDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
References: <000401bfc6d3$0afb3e60$c52d153f@tim>
 <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>
 <14639.50100.383806.969434@localhost.localdomain>
 <20000526115555.C32427@activestate.com>
Message-ID: <14638.52332.741025.292435@localhost.localdomain>

>>>>> "TM" == Trent Mick <trentm@activestate.com> writes:

    TM> +1 moving exceptions.c to Python/

Done.  And it looks like someone with a more accessible Windows setup
is going to have to modify the .dsp files.

-Barry


From jeremy@alum.mit.edu  Fri May 26 22:40:53 2000
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 26 May 2000 17:40:53 -0400 (EDT)
Subject: [Python-Dev] Guido is offline
Message-ID: <14638.61285.606894.914184@localhost.localdomain>

FYI: Guido's cable modem service is giving him trouble and he's unable
to read email at the moment.  He wanted me to let you know that lack
of response isn't for lack of interest.  I imagine he won't be fully
responsive until after the holiday weekend :-).

Jeremy



From tim_one@email.msn.com  Sat May 27 05:53:14 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Sat, 27 May 2000 00:53:14 -0400
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.42640.835838.859270@beluga.mojam.com>
Message-ID: <000001bfc797$781e3d20$cd2d153f@tim>

[GregS]
> Well, assuming that it is NOT okay with \r\n in there, then read
> the whole blob in and use string.replace() on it.
>
[Skip Montanaro]
> I thought of that too, but quickly dismissed it.  You may have a CRLF pair
> embedded in a triple-quoted string.  Those should be left untouched.

Why?  When Python compiles a module "normally", line-ends get normalized,
and the CRLF pairs on Windows vanish anyway.  For example, here's cr.py:

def f():
    s = """a
b
c
d
"""
    for ch in s:
        print ord(ch),
    print

f()
import dis
dis.dis(f)

I'm running on Win98 as I type, and the source file has CRLF line ends.

C:\Python16>python misc/cr.py
97 10 98 10 99 10 100 10

That line shows that only the LFs survived.  The rest shows why:

          0 SET_LINENO               1

          3 SET_LINENO               2
          6 LOAD_CONST               1 ('a\012b\012c\012d\012')
          9 STORE_FAST               0 (s)
          etc

That is, as far as the generated code is concerned, the CRs never existed.

60-years-of-computers-and-we-still-can't-agree-on-how-to-end-a-line-ly
    y'rs  - tim




From martin@loewis.home.cs.tu-berlin.de  Sun May 28 07:28:55 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 28 May 2000 08:28:55 +0200
Subject: [Python-Dev] String encoding
Message-ID: <200005280628.IAA01239@loewis.home.cs.tu-berlin.de>

Fred L. Drake wrote

> I recall a fair bit of discussion about wchar_t when it was
> introduced to ANSI C, and the character set and encoding were
> specifically not made part of the specification.  Making a
> requirement that wchar_t be Unicode doesn't make a lot of sense, and
> opens up potential portability issues.

In ISO (!) C99, an implementation may define __STDC_ISO_10646__ to
indicate that wchar_t is Unicode. The exact wording is

# A decimal constant of the form yyyymmL (for example, 199712L),
# intended to indicate that values of type wchar_t are the coded
# representations of the characters defined by ISO/IEC 10646, along
# with all amendments and technical corrigenda as of the specified
# year and month.

Of course, at the moment, there are few, if any, implementations that
define this macro.

Regards,
Martin


From martin@loewis.home.cs.tu-berlin.de  Sun May 28 11:34:01 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 28 May 2000 12:34:01 +0200
Subject: [Python-Dev] Patch: AttributeError and NameError: second attempt.
Message-ID: <200005281034.MAA04765@loewis.home.cs.tu-berlin.de>

[thread moved, since I can't put in proper References headers, anyway,
 just by looking at the archive]
> 1) I rewrite the stuff that went into exceptions.py in C, and stick it
>   in the _exceptions module.  I don't much like this idea, since it 
>   kills the advantage noted above.

>2) I leave the stuff that's in C already in C.  I add C __str__ methods 
>   to AttributeError and NameError, which dispatch to helper functions
>   in the python 'exceptions' module, if that module is available.

>Which is better, or is there a third choice available?

There is a third choice: Patch AttributeError afterwards. I.e. in
site.py, say

def _AttributeError_str(self):
    code

AttributeError.__str__ = _AttributeError_str

Guido said
> This kind of user-friendliness should really be in the tools, not in
> the core language implementation!

And I think Nick's patch exactly follows this guideline. Currently,
the C code raising AttributeError tries to be friendly, formatting a
string, and passing it to the AttributeError.__init__. With his patch,
the AttributeError just gets enough information so that tools later
can be friendly - actually printing anything is done in Python code.

Fred said
>   I see no problem with the functionality from Nick's patch; this is
> exactly the sort of thing that's needed, including at the basic
> interactive prompt.

I agree. Much of the strength of this approach is lost if it only
works inside tools. When I get an AttributeError, I'd like to see
right away what the problem is. If I had to fire up IDLE and re-run it
first, I'd rather stare at my code long enough to see the problem.

Regards,
Martin



From tismer@tismer.com  Sun May 28 16:02:41 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sun, 28 May 2000 17:02:41 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vIcs-000DieC@artcom0.artcom-gmbh.de> <392E89B7.D6BC572D@lemburg.com>
Message-ID: <39313511.4A312B4A@tismer.com>


"M.-A. Lemburg" wrote:
> 
> Peter Funk wrote:
> >
> > [M.-A. Lemburg]:
> > > > Proposal:
> > > > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > > >
> > > > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > > > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > > > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > > > bytes 12-*  marshalled code object (same as earlier)

<snip/>

> A different approach to all this would be fixing only the
> first two bytes of the magic word, e.g.
> 
> byte 0: 'P'
> byte 1: 'Y'
> byte 2: version number (counting from 1)
> byte 3: option byte (8 bits: one for each option;
>                      bit 0: -U cmd switch)
> 
> This would be b/w compatible and still provide file(1)
> with enough information to be able to tell the file type.

I think this approach is simple and powerful enough
to survive Py3000.
Peter's approach is of course nicer and cleaner from
a "redo from scratch" point of view. But then, I'd even
vote for a better format that includes another field
which names the header size explicitly.

For simplicity, compatibility and ease of change,
I vote with +1 for adopting the solution of

byte 0: 'P'
byte 1: 'Y'
byte 2: version number (counting from 1)
byte 3: option byte (8 bits: one for each option;
                     bit 0: -U cmd switch)

If that turns out to be insufficient in some future,
do a complete redesign.
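
Just to show how little code a reader would need under that layout
(a sketch only, nothing of this exists yet; it assumes plain 8-bit
strings):

def parse_pyc_magic(first4):
    # Sketch of a reader for the proposed magic: 'P', 'Y', version, options.
    if len(first4) != 4 or first4[:2] != 'PY':
        raise ValueError('not a new-style .pyc magic')
    version = ord(first4[2])    # counting from 1
    options = ord(first4[3])    # bit 0: the -U cmd switch
    return version, options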

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From pf@artcom-gmbh.de  Sun May 28 17:23:52 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Sun, 28 May 2000 18:23:52 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <39313511.4A312B4A@tismer.com> from Christian Tismer at "May 28, 2000  5: 2:41 pm"
Message-ID: <m12w5qy-000DieC@artcom0.artcom-gmbh.de>

[...]
> For simplicity, compatibility and ease of change,
> I vote with +1 for adopting the solution of
> 
> byte 0: 'P'
> byte 1: 'Y'
> byte 2: version number (counting from 1)
> byte 3: option byte (8 bits: one for each option;
>                      bit 0: -U cmd switch)
> 
> If that turns out to be insufficient in some future,
> do a complete redesign.

What about the CR/LF issue with some Mac Compilers (see
Guido's mail for details)?  Can we simply drop this?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From tismer@tismer.com  Sun May 28 17:51:20 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sun, 28 May 2000 18:51:20 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12w5qy-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <39314E88.6AD944CE@tismer.com>


Peter Funk wrote:
> 
> [...]
> > For simplicity, compatibility and ease of change,
> > I vote with +1 for adopting the solution of
> >
> > byte 0: 'P'
> > byte 1: 'Y'
> > byte 2: version number (counting from 1)
> > byte 3: option byte (8 bits: one for each option;
> >                      bit 0: -U cmd switch)
> >
> > If that turns out to be insufficient in some future,
> > do a complete redesign.
> 
> What about the CR/LF issue with some Mac Compilers (see
> Guido's mail for details)?  Can we simply drop this?

Well, forgot about that.
How about swapping bytes 0 and 1?

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From guido@python.org  Mon May 29 00:54:11 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 28 May 2000 18:54:11 -0500
Subject: [Python-Dev] Guido is offline
In-Reply-To: Your message of "Fri, 26 May 2000 17:40:53 -0400."
 <14638.61285.606894.914184@localhost.localdomain>
References: <14638.61285.606894.914184@localhost.localdomain>
Message-ID: <200005282354.SAA02034@cj20424-a.reston1.va.home.com>

> FYI: Guido's cable modem service is giving him trouble and he's unable
> to read email at the moment.  He wanted me to let you know that lack
> of response isn't for lack of interest.  I imagine he won't be fully
> responsive until after the holiday weekend :-).

I'm finally back online now, but can't really enjoy it, because my
in-laws are here... So I have 300 unread emails that will remain
unread until Tuesday. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Mon May 29 01:00:39 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 28 May 2000 19:00:39 -0500
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: Your message of "Fri, 26 May 2000 14:36:36 +0200."
 <m12vJLw-000DieC@artcom0.artcom-gmbh.de>
References: <m12vJLw-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>

> > Modifyable data needs to go in a per-user directory, even on Windows,
> > outside the Python tree.
> 
> Is there a reliable algorithm to find a "per-user" directory on any
> Win95/98/NT/2000 system?  On MacOS?  

I don't know -- often $HOME is set on Windows.  E.g. IDLE uses $HOME
if set and otherwise the current directory.

The Mac doesn't have an environment at all.

> Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
> module would provide 'os.environ["HOME"]' similar to the posix
> version?  This would certainly simplify the task of application
> programmers intending to write portable applications.

This sounds like a nice idea...

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gstein@lyra.org  Mon May 29 20:58:41 2000
From: gstein@lyra.org (Greg Stein)
Date: Mon, 29 May 2000 12:58:41 -0700 (PDT)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <39314E88.6AD944CE@tismer.com>
Message-ID: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org>

I don't think we should have a two-byte magic value. Especially where
those two bytes are printable, 7-bit ASCII.

"But it is four bytes," you say. Nope. It is two plus a couple parameters
that can now change over time.

To ensure uniqueness, I think a four-byte magic should stay.

I would recommend the approach of adding opcodes into the marshal format.
Specifically, 'V' followed by a single byte. That can only occur at the
beginning. If it is not present, then you know that you have an old
marshal value.
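
On the reading side that would look roughly like this (a sketch -- the
'V' opcode doesn't exist in marshal today, and the helper name is made
up):

def read_marshal_version(f):
    # Peek at the first byte of the marshalled data.  A leading 'V' is
    # followed by a one-byte version number; anything else is an
    # old-style stream carrying no version information.
    first = f.read(1)
    if first == 'V':
        return ord(f.read(1))    # new-style: explicit version byte
    if first:
        f.seek(-1, 1)            # old-style: put the byte back
    return None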

Cheers,
-g

On Sun, 28 May 2000, Christian Tismer wrote:
> Peter Funk wrote:
> > 
> > [...]
> > > For simplicity, compatibility and ease of change,
> > > I vote with +1 for adopting the solution of
> > >
> > > byte 0: 'P'
> > > byte 1: 'Y'
> > > byte 2: version number (counting from 1)
> > > byte 3: option byte (8 bits: one for each option;
> > >                      bit 0: -U cmd switch)
> > >
> > > If that turns out to be insufficient in some future,
> > > do a complete redesign.
> > 
> > What about the CR/LF issue with some Mac Compilers (see
> > Guido's mail for details)?  Can we simply drop this?
> 
> Well, forgot about that.
> How about swapping bytes 0 and 1?
> 
> -- 
> Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
> Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
> Kaunstr. 26                  :    *Starship* http://starship.python.net
> 14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
> PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
>      where do you want to jump today?   http://www.stackless.com
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/



From pf@artcom-gmbh.de  Tue May 30 08:08:15 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 30 May 2000 09:08:15 +0200 (MEST)
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
In-Reply-To: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> from Greg Stein at "May 29, 2000 12:58:41 pm"
Message-ID: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>

Greg Stein:
> I don't think we should have a two-byte magic value. Especially where
> those two bytes are printable, 7-bit ASCII.
[...]
> To ensure uniqueness, I think a four-byte magic should stay.

Looking at /etc/magic I see many 16-bit magic numbers kept around
from the good old days.  But you are right: Choosing a four-byte magic
value would make the chance of a clash with some other file format
much less likely.

> I would recommend the approach of adding opcodes into the marshal format.
> Specifically, 'V' followed by a single byte. That can only occur at the
> beginning. If it is not present, then you know that you have an old
> marshal value.

But this would not solve the problem with 8 byte versus 4 byte timestamps
in the header on 64-bit OSes.  Trent Mick pointed this out.

I think the situation we have now is very unsatisfactory:  I don't
see a reasonable solution which allows us to keep the length of the
header before the marshal block at a fixed 8 bytes together
with a frozen 4 byte magic number.

Moving the version number into the marshal doesn't help to resolve
this conflict.  So either you have to accept a new magic on 64 bit
systems or you have to enlarge the header.

To come up with a new proposal, the following questions should be answered:
  1. Is there really too much code out there, which depends on 
     the hardcoded assumption, that the marshal part of a .pyc file 
     starts at byte 8?  I see no further evidence for or against this.
     MAL pointed this out in 
     <http://www.python.org/pipermail/python-dev/2000-May/005756.html>
  2. If we decide to enlarge the header, do we really need a new
     header field defining the length of the header ? 
     This was proposed by Christian Tismer in 
     <http://www.python.org/pipermail/python-dev/2000-May/005792.html>
  3. The 'imp' module exposes somewhat the structure of an .pyc file
     through the function 'get_magic()'.  I proposed changing the signature of
     'imp.get_magic()' in an upward compatible way.  I also proposed 
     adding a new function 'imp.get_version()'.  What do you think about 
     this idea?
  4. Greg proposed prepending the version number to the marshal
     format.  If we do this, we definitely need a frozen way to find
     out where the marshalled code object actually starts.  This also
     has the disadvantage of making it slightly harder to come up with
     a /etc/magic definition which displays the version number of a
     .pyc file.

If we decide to move the version number into the marshal, we can
also move the .py-timestamp there.  This way the timestamp will be handled
in the same way as large integer literals.  Quoting from the docs:

"""Caveat: On machines where C's long int type has more than 32 bits
   (such as the DEC Alpha), it is possible to create plain Python
   integers that are longer than 32 bits. Since the current marshal
   module uses 32 bits to transfer plain Python integers, such values
   are silently truncated. This particularly affects the use of very
   long integer literals in Python modules -- these will be accepted
   by the parser on such machines, but will be silently be truncated
   when the module is read from the .pyc instead.
   [...]
   A solution would be to refuse such literals in the parser, since
   they are inherently non-portable. Another solution would be to let
   the marshal module raise an exception when an integer value would
   be truncated. At least one of these solutions will be implemented
   in a future version."""

Should this be 1.6?  Changing the format of .pyc files over and over
again in the 1.x series doesn't look very attractive.

Regards, Peter


From trentm@activestate.com  Tue May 30 08:46:09 2000
From: trentm@activestate.com (Trent Mick)
Date: Tue, 30 May 2000 00:46:09 -0700
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
In-Reply-To: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000530004609.A16383@activestate.com>

On Tue, May 30, 2000 at 09:08:15AM +0200, Peter Funk wrote:
> > I would recommend the approach of adding opcodes into the marshal format.
> > Specifically, 'V' followed by a single byte. That can only occur at the
> > beginning. If it is not present, then you know that you have an old
> > marshal value.
> 
> But this would not solve the problem with 8 byte versus 4 byte timestamps
> in the header on 64-bit OSes.  Trent Mick pointed this out.
> 

I kind of intimated but did not make it clear: I wouldn't worry about the
limitations of a 4 byte timestamp too much. That value is not going to
overflow for another 38 years. Presumably the .pyc header (if such a thing
even still exists then) will change by then.


[peter summarizes .pyc header format options]

> 
> If we decide to move the version number into the marshal, we can
> also move the .py-timestamp there.  This way the timestamp will be handled
> in the same way as large integer literals.  Quoting from the docs:
> 
> """Caveat: On machines where C's long int type has more than 32 bits
>    (such as the DEC Alpha), it is possible to create plain Python
>    integers that are longer than 32 bits. Since the current marshal
>    module uses 32 bits to transfer plain Python integers, such values
>    are silently truncated. This particularly affects the use of very
>    long integer literals in Python modules -- these will be accepted
>    by the parser on such machines, but will be silently be truncated
>    when the module is read from the .pyc instead.
>    [...]
>    A solution would be to refuse such literals in the parser, since
>    they are inherently non-portable. Another solution would be to let
>    the marshal module raise an exception when an integer value would
>    be truncated. At least one of these solutions will be implemented
>    in a future version."""
> 
> Should this be 1.6?  Changing the format of .pyc files over and over
> again in the 1.x series doesn't look very attractive.
> 
I *hope* it gets into 1.6, because I have implemented the latter suggestion
in the docs that you quoted (raise an exception if truncation of a PyInt to
32 bits would cause data loss) and will be submitting a patch for it on
Wed or Thurs.
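
The check itself is tiny -- roughly this (a sketch only; the actual
patch is C against marshal.c, and the function name is made up):

def check_fits_32_bits(value):
    # Raise instead of letting marshal silently truncate a plain
    # integer to 32 bits, per the second option in the docs quoted above.
    if value < -(2**31) or value >= 2**31:
        raise OverflowError('value does not fit in 32 bits: %r' % value)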

Ciao,
Trent

-- 
Trent Mick
trentm@activestate.com


From effbot@telia.com  Tue May 30 09:21:10 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 10:21:10 +0200
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> <m12wg8N-000DieC@artcom0.artcom-gmbh.de> <20000530004609.A16383@activestate.com>
Message-ID: <009901bfca10$040531c0$f2a6b5d4@hagrid>

Trent Mick wrote:
> > But this would not solve the problem with 8 byte versus 4 byte timestamps
> > in the header on 64-bit OSes.  Trent Mick pointed this out.
>
> I kind of intimated but did not make it clear: I wouldn't worry about the
> limitations of a 4 byte timestamp too much. That value is not going to
> overflow for another 38 years. Presumably the .pyc header (if such a thing
> even still exists then) will change by then.

note that py_compile (which is used to create PYC files after installation,
among other things) treats the time as an unsigned integer.

so in other words, if we fix the built-in "PYC compiler" so it does the same
thing before 2038, we can spend another 68 years on coming up with a
really future proof design... ;-)

I really hope Py3K will be out before 2106.

as for the other changes: *please* don't break the header layout in the
1.X series.  and *please* don't break the "if the magic is the same, I can
unmarshal and run this code blob without crashing the interpreter" rule
(raising an exception would be okay, though).

</F>



From mal@lemburg.com  Tue May 30 09:10:25 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 10:10:25 +0200
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal:
 .pyc file format change)
References: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <39337771.D78BAAF5@lemburg.com>

Peter Funk wrote:
> 
> Greg Stein:
> > I don't think we should have a two-byte magic value. Especially where
> > those two bytes are printable, 7-bit ASCII.
> [...]
> > To ensure uniqueness, I think a four-byte magic should stay.
> 
> Looking at /etc/magic I see many 16-bit magic numbers kept around
> from the good old days.  But you are right: Choosing a four-byte magic
> value would make the chance of a clash with some other file format
> much less likely.

Just for quotes: the current /etc/magic I have on my Linux
machine doesn't know anything about PYC or PYO files, so I
don't really see much of a problem here -- no one seems to
be interested in finding out the file type for these
files anyway ;-)

Also, I don't really get the 16-bit magic argument: we still
have a 32-bit magic number -- one with a 16-bit fixed value and
predefined ranges for the remaining 16 bits. This already is
much better than what we have now w/r to making file(1) work
on PYC files.
 
> > I would recommend the approach of adding opcodes into the marshal format.
> > Specifically, 'V' followed by a single byte. That can only occur at the
> > beginning. If it is not present, then you know that you have an old
> > marshal value.
> 
> But this would not solve the problem with 8 byte versus 4 byte timestamps
> in the header on 64-bit OSes.  Trent Mick pointed this out.

The switch to 8 byte timestamps is only needed when the current
4 bytes can no longer hold the timestamp value. That will happen
in 2038...

Note that import.c writes the timestamp in 4 bytes until it
reaches an overflow situation.

> I think, the situation we have now, is very unsatisfactory:  I don't
> see a reasonable solution, which allows us to keep the length of the
> header before the marshal-block at a fixed length of 8 bytes together
> with a frozen 4 byte magic number.

Adding a version to the marshal format is a Good Thing --
independent of this discussion.
 
> Moving the version number into the marshal doesn't help to resolve
> this conflict.  So either you have to accept a new magic on 64 bit
> systems or you have to enlarge the header.

No you don't... please read the code: marshal only writes
8 bytes in case 4 bytes aren't enough to hold the value.
 
> To come up with a new proposal, the following questions should be answered:
>   1. Is there really too much code out there, which depends on
>      the hardcoded assumption, that the marshal part of a .pyc file
>      starts at byte 8?  I see no further evidence for or against this.
>      MAL pointed this out in
>      <http://www.python.org/pipermail/python-dev/2000-May/005756.html>

I have several references in my tool collection, the import
stuff uses it, old import hooks (remember ihooks ?) also do, etc.

>   2. If we decide to enlarge the header, do we really need a new
>      header field defining the length of the header ?
>      This was proposed by Christian Tismer in
>      <http://www.python.org/pipermail/python-dev/2000-May/005792.html>

In Py3K we can do this right (breaking things is allowed)...
and I agree with Christian that a proper file format needs
a header length field too. Basically, these values have to
be present, IMHO:

1. Magic
2. Version
3. Length of Header
4. (Header Attribute)*n
-- Start of Data ---

Header Attribute can be pretty much anything -- timestamps,
names of files or other entities, bit sizes, architecture
flags, optimization settings, etc.
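
A reader for such a layout could be as simple as this (a sketch with
made-up field widths, just to illustrate the idea):

import struct

def read_pyc_header(f):
    # Hypothetical layout: magic (4 bytes), version (4 bytes),
    # header length (4 bytes), then an opaque attribute block.
    magic, version, hdrlen = struct.unpack('<4sll', f.read(12))
    attributes = f.read(hdrlen - 12)    # timestamps, flags, etc.
    return magic, version, attributes   # marshalled data starts here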

>   3. The 'imp' module exposes somewhat the structure of an .pyc file
>      through the function 'get_magic()'.  I proposed changing the signature of
>      'imp.get_magic()' in an upward compatible way.  I also proposed
>      adding a new function 'imp.get_version()'.  What do you think about
>      this idea?

imp.get_magic() would have to return the proposed 32-bit value
('PY' + version byte + option byte).

I'd suggest adding additional functions which can read and write the
header, given a PYCHeader object which would hold the version and
option values.

>   4. Greg proposed prepending the version number to the marshal
>      format.  If we do this, we definitely need a frozen way to find
>      out where the marshalled code object actually starts.  This also
>      has the disadvantage of making it slightly harder to come up with
>      a /etc/magic definition which displays the version number of a
>      .pyc file.
> 
> If we decide to move the version number into the marshal, we can
> also move the .py-timestamp there.  This way the timestamp will be handled
> in the same way as large integer literals.  Quoting from the docs:
> 
> """Caveat: On machines where C's long int type has more than 32 bits
>    (such as the DEC Alpha), it is possible to create plain Python
>    integers that are longer than 32 bits. Since the current marshal
>    module uses 32 bits to transfer plain Python integers, such values
>    are silently truncated. This particularly affects the use of very
>    long integer literals in Python modules -- these will be accepted
>    by the parser on such machines, but will be silently be truncated
>    when the module is read from the .pyc instead.
>    [...]
>    A solution would be to refuse such literals in the parser, since
>    they are inherently non-portable. Another solution would be to let
>    the marshal module raise an exception when an integer value would
>    be truncated. At least one of these solutions will be implemented
>    in a future version."""
> 
> Should this be 1.6?  Changing the format of .pyc files over and over
> again in the 1.x series doesn't look very attractive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From ping@lfw.org  Tue May 30 10:48:50 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 30 May 2000 02:48:50 -0700 (PDT)
Subject: [Python-Dev] inspect.py
Message-ID: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>

I just posted the HTML document generator script i promised
to do at IPC8.  It's at http://www.lfw.org/python/ (see the
bottom of the page).

The reason i'm mentioning this here is that, in the course of
doing that, i put all the introspection work in a separate
module called "inspect.py".  It's at

    http://www.lfw.org/python/inspect.py

It tries to encapsulate the interface provided by func_*, co_*,
et al. with something a little richer.  It can handle anonymous
(tuple) arguments for you, for example.  It can also get the
source code of any function, method, or class for you, as long
as the original .py file is still available.  And more stuff
like that.

I think most of this stuff is quite generally useful, and it
seems good to wrap this up in a module.  I'd like your thoughts
on whether this is worth including in the standard library.



-- ?!ng

"To be human is to continually change.  Your desire to remain as you are
is what ultimately limits you."
    -- The Puppet Master, Ghost in the Shell



From effbot@telia.com  Tue May 30 11:26:29 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 12:26:29 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>

I wrote:

> what's the best way to deal with this?  I see three alter-
> natives:
>
> a) stick to the old definition, and use chr(10) also for
>    unicode strings
>
> b) use different definitions for 8-bit strings and unicode
>    strings; if given an 8-bit string, use chr(10); if given
>    a 16-bit string, use the LINEBREAK predicate.
>
> c) use LINEBREAK in either case.
>
> I think (c) is the "right thing", but it's the only one that may
> break existing code...

I'm probably getting old, but I don't remember if anyone followed
up on this, and I don't have time to check the archives right now.

so for the upcoming "feature complete" release, I've decided to
stick to (a).

...

for the next release, I suggest implementing a fourth alternative:

d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
   use chr(10).

background: in the current implementation, this decision has to
be made at compile time, and a compiled expression can be used
with either 8-bit strings or 16-bit strings.
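
to make (d) concrete, the decision at match time would be something
like this (a sketch; the characters listed are only an illustrative
subset -- the real LINEBREAK predicate comes from the unicode database):

LINEBREAK_CHARS = u'\n\x0b\x0c\r\x1c\x1d\x1e\x85\u2028\u2029'   # illustrative

def matches_linebreak(ch, unicode_flag):
    if unicode_flag:
        return ch in LINEBREAK_CHARS    # alternative (d): flag set
    return ch == chr(10)                # default: plain newline only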

a fifth alternative would be to use the locale flag to tell the
difference between unicode and 8-bit characters:

e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).

comments?

</F>

<project name="sre" phase=" complete="97.1%" />



From tismer@tismer.com  Tue May 30 12:24:55 2000
From: tismer@tismer.com (Christian Tismer)
Date: Tue, 30 May 2000 13:24:55 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org>
Message-ID: <3933A507.9FA6ABD6@tismer.com>


Greg Stein wrote:
> 
> I don't think we should have a two-byte magic value. Especially where
> those two bytes are printable, 7-bit ASCII.
> 
> "But it is four bytes," you say. Nope. It is two plus a couple parameters
> that can now change over time.
> 
> To ensure uniqueness, I think a four-byte magic should stay.
> 
> I would recommend the approach of adding opcodes into the marshal format.
> Specifically, 'V' followed by a single byte. That can only occur at the
> beginning. If it is not present, then you know that you have an old
> marshal value.

Fine with me, too!
Everything that keeps the current 8 byte header intact
and doesn't break much code is fine with me. Moving
additional info into the marshalled objects themselves
gives even more flexibility than any header extension.
Yes I'm all for it.

ciao - chris++

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com


From mal@lemburg.com  Tue May 30 12:36:00 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 13:36:00 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com>
Message-ID: <3933A7A0.5FAAC5FD@lemburg.com>


Here is my second version of the module. It is somewhat more
flexible and also smaller in size.

BTW, I haven't found any mention of what language and encoding
the locale 'C' assumes or defines. Currently, the module
reports these as None, meaning undefined. Are language and
encoding defined for 'C' ?

(Sorry for posting the whole module -- starship seems to be
down again...)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/

""" Extended locale.py

    This version of locale.py contains a locale aliasing engine and
    knows about the default encoding of many common locales.

    (c) Marc-Andre Lemburg, mal@lemburg.com

"""
from locale import *
import string

__version__ = '0.2'

### APIs

def normalize(localename):

    """ Returns a normalized locale code for the given locale
        name.

        The returned locale code is formatted for use with
        setlocale().

        If normalization fails, the original name is returned
        unchanged.

        If the given encoding is not known, the function defaults to
        the default encoding for the locale code just like setlocale()
        does.

    """
    # Normalize the locale name and extract the encoding
    fullname = string.lower(localename)
    if ':' in fullname:
        # ':' is sometimes used as encoding delimiter.
        fullname = string.replace(fullname, ':', '.')
    if '.' in fullname:
        langname, encoding = string.split(fullname, '.')[:2]
        fullname = langname + '.' + encoding
    else:
        langname = fullname
        encoding = ''

    # First lookup: fullname (possibly with encoding)
    code = locale_alias.get(fullname, None)
    if code is not None:
        return code

    # Second try: langname (without encoding)
    code = locale_alias.get(langname, None)
    if code is not None:
        if '.' in code:
            langname, defenc = string.split(code, '.')
        else:
            langname = code
            defenc = ''
        if encoding:
            encoding = encoding_alias.get(encoding, encoding)
        else:
            encoding = defenc
        if encoding:
            return langname + '.' + encoding
        else:
            return langname

    else:
        return localename

def _parse_localename(localename):

    """ Parses the locale code for localename and returns the
        result as tuple (language code, encoding).

        The localename is normalized and passed through the locale
        alias engine. A ValueError is raised in case the locale name
        cannot be parsed.

        The language code corresponds to RFC 1766.  code and encoding
        can be None in case the values cannot be determined or are
        unknown to this implementation.

    """
    code = normalize(localename)
    if '.' in code:
        return string.split(code, '.')[:2]
    elif code == 'C':
        return None, None
    else:
        raise ValueError,'unknown locale: %s' % localename

def _build_localename(localetuple):

    """ Builds a locale code from the given tuple (language code,
        encoding).

        No aliasing or normalizing takes place.

    """
    language, encoding = localetuple
    if language is None:
        language = 'C'
    if encoding is None:
        return language
    else:
        return language + '.' + encoding
    
def get_default(envvars=('LANGUAGE', 'LC_ALL', 'LC_CTYPE', 'LANG')):

    """ Tries to determine the default locale settings and returns
        them as tuple (language code, encoding).

        According to POSIX, a program which has not called
        setlocale(LC_ALL,"") runs using the portable 'C' locale.
        Calling setlocale(LC_ALL,"") lets it use the default locale as
        defined by the LANG variable. Since we don't want to interfere
        with the current locale setting we thus emulate the behaviour
        in the way described above.

        To maintain compatibility with other platforms, not only the
        LANG variable is tested, but a list of variables given as
        envvars parameter. The first found to be defined will be
        used. envvars defaults to the search path used in GNU gettext;
        it must always contain the variable name 'LANG'.

        Except for the code 'C', the language code corresponds to RFC
        1766.  code and encoding can be None in case the values cannot
        be determined.

    """
    import os
    lookup = os.environ.get
    for variable in envvars:
        localename = lookup(variable,None)
        if localename is not None:
            break
    else:
        localename = 'C'
    return _parse_localename(localename)

def get_locale(category=LC_CTYPE):

    """ Returns the current setting for the given locale category as
        tuple (language code, encoding).

        category may be one of the LC_* value except LC_ALL. It
        defaults to LC_CTYPE.

        Except for the code 'C', the language code corresponds to RFC
        1766.  code and encoding can be None in case the values cannot
        be determined.

    """
    localename = setlocale(category)
    if category == LC_ALL and ';' in localename:
        raise TypeError,'category LC_ALL is not supported'
    return _parse_localename(localename)

def set_locale(localetuple, category=LC_ALL):

    """ Set the locale according to the localetuple (language code,
        encoding) as returned by get_locale() and get_default().

        The given codes are passed through the locale aliasing engine
        before being given to setlocale() for processing.

        category may be given as one of the LC_* values. It defaults
        to LC_ALL.

    """
    setlocale(category, normalize(_build_localename(localetuple)))

def set_to_default(category=LC_ALL):

    """ Sets the locale for category to the default setting.

        The default setting is determined by calling
        get_default(). category defaults to LC_ALL.
        
    """
    setlocale(category, _build_localename(get_default()))

### Data
#
# The following data was extracted from the locale.alias file which
# comes with X11 and then hand edited removing the explicit encoding
# definitions and adding some more aliases. The file is usually
# available as /usr/lib/X11/locale/locale.alias.
#    

#
# The encoding_alias table maps lowercase encoding alias names to C
# locale encoding names (case-sensitive).
#
encoding_alias = {
        '437': 				'C',
        'c': 				'C',
        'iso8859': 			'ISO8859-1',
        '8859': 			'ISO8859-1',
        '88591': 			'ISO8859-1',
        'ascii': 			'ISO8859-1',
        'en': 				'ISO8859-1',
        'iso88591': 			'ISO8859-1',
        'iso_8859-1': 			'ISO8859-1',
        '885915': 			'ISO8859-15',
        'iso885915': 			'ISO8859-15',
        'iso_8859-15': 			'ISO8859-15',
        'iso8859-2': 			'ISO8859-2',
        'iso88592': 			'ISO8859-2',
        'iso_8859-2': 			'ISO8859-2',
        'iso88595': 			'ISO8859-5',
        'iso88596': 			'ISO8859-6',
        'iso88597': 			'ISO8859-7',
        'iso88598': 			'ISO8859-8',
        'iso88599': 			'ISO8859-9',
        'iso-2022-jp': 			'JIS7',
        'jis': 				'JIS7',
        'jis7': 			'JIS7',
        'sjis': 			'SJIS',
        'tis620': 			'TACTIS',
        'ajec': 			'eucJP',
        'eucjp': 			'eucJP',
        'ujis': 			'eucJP',
        'utf-8': 			'utf',
        'utf8': 			'utf',
        'utf8@ucs4': 			'utf',
}

#    
# The locale_alias table maps lowercase alias names to C locale names
# (case-sensitive). Encodings are always separated from the locale
# name using a dot ('.'); they should only be given in case the
# language name is needed to interpret the given encoding alias
# correctly (CJK codes often have this need).
#
locale_alias = {
        'american':                      'en_US.ISO8859-1',
        'ar':                            'ar_AA.ISO8859-6',
        'ar_aa':                         'ar_AA.ISO8859-6',
        'ar_sa':                         'ar_SA.ISO8859-6',
        'arabic':                        'ar_AA.ISO8859-6',
        'bg':                            'bg_BG.ISO8859-5',
        'bg_bg':                         'bg_BG.ISO8859-5',
        'bulgarian':                     'bg_BG.ISO8859-5',
        'c-french':                      'fr_CA.ISO8859-1',
        'c':                             'C',
        'c_c':                           'C',
        'cextend':                       'en_US.ISO8859-1',
        'chinese-s':                     'zh_CN.eucCN',
        'chinese-t':                     'zh_TW.eucTW',
        'croatian':                      'hr_HR.ISO8859-2',
        'cs':                            'cs_CZ.ISO8859-2',
        'cs_cs':                         'cs_CZ.ISO8859-2',
        'cs_cz':                         'cs_CZ.ISO8859-2',
        'cz':                            'cz_CZ.ISO8859-2',
        'cz_cz':                         'cz_CZ.ISO8859-2',
        'czech':                         'cs_CS.ISO8859-2',
        'da':                            'da_DK.ISO8859-1',
        'da_dk':                         'da_DK.ISO8859-1',
        'danish':                        'da_DK.ISO8859-1',
        'de':                            'de_DE.ISO8859-1',
        'de_at':                         'de_AT.ISO8859-1',
        'de_ch':                         'de_CH.ISO8859-1',
        'de_de':                         'de_DE.ISO8859-1',
        'dutch':                         'nl_BE.ISO8859-1',
        'ee':                            'ee_EE.ISO8859-4',
        'el':                            'el_GR.ISO8859-7',
        'el_gr':                         'el_GR.ISO8859-7',
        'en':                            'en_US.ISO8859-1',
        'en_au':                         'en_AU.ISO8859-1',
        'en_ca':                         'en_CA.ISO8859-1',
        'en_gb':                         'en_GB.ISO8859-1',
        'en_ie':                         'en_IE.ISO8859-1',
        'en_nz':                         'en_NZ.ISO8859-1',
        'en_uk':                         'en_GB.ISO8859-1',
        'en_us':                         'en_US.ISO8859-1',
        'eng_gb':                        'en_GB.ISO8859-1',
        'english':                       'en_EN.ISO8859-1',
        'english_uk':                    'en_GB.ISO8859-1',
        'english_united-states':         'en_US.ISO8859-1',
        'english_us':                    'en_US.ISO8859-1',
        'es':                            'es_ES.ISO8859-1',
        'es_ar':                         'es_AR.ISO8859-1',
        'es_bo':                         'es_BO.ISO8859-1',
        'es_cl':                         'es_CL.ISO8859-1',
        'es_co':                         'es_CO.ISO8859-1',
        'es_cr':                         'es_CR.ISO8859-1',
        'es_ec':                         'es_EC.ISO8859-1',
        'es_es':                         'es_ES.ISO8859-1',
        'es_gt':                         'es_GT.ISO8859-1',
        'es_mx':                         'es_MX.ISO8859-1',
        'es_ni':                         'es_NI.ISO8859-1',
        'es_pa':                         'es_PA.ISO8859-1',
        'es_pe':                         'es_PE.ISO8859-1',
        'es_py':                         'es_PY.ISO8859-1',
        'es_sv':                         'es_SV.ISO8859-1',
        'es_uy':                         'es_UY.ISO8859-1',
        'es_ve':                         'es_VE.ISO8859-1',
        'et':                            'et_EE.ISO8859-4',
        'et_ee':                         'et_EE.ISO8859-4',
        'fi':                            'fi_FI.ISO8859-1',
        'fi_fi':                         'fi_FI.ISO8859-1',
        'finnish':                       'fi_FI.ISO8859-1',
        'fr':                            'fr_FR.ISO8859-1',
        'fr_be':                         'fr_BE.ISO8859-1',
        'fr_ca':                         'fr_CA.ISO8859-1',
        'fr_ch':                         'fr_CH.ISO8859-1',
        'fr_fr':                         'fr_FR.ISO8859-1',
        'fre_fr':                        'fr_FR.ISO8859-1',
        'french':                        'fr_FR.ISO8859-1',
        'french_france':                 'fr_FR.ISO8859-1',
        'ger_de':                        'de_DE.ISO8859-1',
        'german':                        'de_DE.ISO8859-1',
        'german_germany':                'de_DE.ISO8859-1',
        'greek':                         'el_GR.ISO8859-7',
        'hebrew':                        'iw_IL.ISO8859-8',
        'hr':                            'hr_HR.ISO8859-2',
        'hr_hr':                         'hr_HR.ISO8859-2',
        'hu':                            'hu_HU.ISO8859-2',
        'hu_hu':                         'hu_HU.ISO8859-2',
        'hungarian':                     'hu_HU.ISO8859-2',
        'icelandic':                     'is_IS.ISO8859-1',
        'id':                            'id_ID.ISO8859-1',
        'id_id':                         'id_ID.ISO8859-1',
        'is':                            'is_IS.ISO8859-1',
        'is_is':                         'is_IS.ISO8859-1',
        'iso-8859-1':                    'en_US.ISO8859-1',
        'iso-8859-15':                   'en_US.ISO8859-15',
        'iso8859-1':                     'en_US.ISO8859-1',
        'iso8859-15':                    'en_US.ISO8859-15',
        'iso_8859_1':                    'en_US.ISO8859-1',
        'iso_8859_15':                   'en_US.ISO8859-15',
        'it':                            'it_IT.ISO8859-1',
        'it_ch':                         'it_CH.ISO8859-1',
        'it_it':                         'it_IT.ISO8859-1',
        'italian':                       'it_IT.ISO8859-1',
        'iw':                            'iw_IL.ISO8859-8',
        'iw_il':                         'iw_IL.ISO8859-8',
        'ja':                            'ja_JP.eucJP',
        'ja.jis':                        'ja_JP.JIS7',
        'ja.sjis':                       'ja_JP.SJIS',
        'ja_jp':                         'ja_JP.eucJP',
        'ja_jp.ajec':                    'ja_JP.eucJP',
        'ja_jp.euc':                     'ja_JP.eucJP',
        'ja_jp.eucjp':                   'ja_JP.eucJP',
        'ja_jp.iso-2022-jp':             'ja_JP.JIS7',
        'ja_jp.jis':                     'ja_JP.JIS7',
        'ja_jp.jis7':                    'ja_JP.JIS7',
        'ja_jp.mscode':                  'ja_JP.SJIS',
        'ja_jp.sjis':                    'ja_JP.SJIS',
        'ja_jp.ujis':                    'ja_JP.eucJP',
        'japan':                         'ja_JP.eucJP',
        'japanese':                      'ja_JP.SJIS',
        'japanese-euc':                  'ja_JP.eucJP',
        'japanese.euc':                  'ja_JP.eucJP',
        'jp_jp':                         'ja_JP.eucJP',
        'ko':                            'ko_KR.eucKR',
        'ko_kr':                         'ko_KR.eucKR',
        'ko_kr.euc':                     'ko_KR.eucKR',
        'korean':                        'ko_KR.eucKR',
        'lt':                            'lt_LT.ISO8859-4',
        'lv':                            'lv_LV.ISO8859-4',
        'mk':                            'mk_MK.ISO8859-5',
        'mk_mk':                         'mk_MK.ISO8859-5',
        'nl':                            'nl_NL.ISO8859-1',
        'nl_be':                         'nl_BE.ISO8859-1',
        'nl_nl':                         'nl_NL.ISO8859-1',
        'no':                            'no_NO.ISO8859-1',
        'no_no':                         'no_NO.ISO8859-1',
        'norwegian':                     'no_NO.ISO8859-1',
        'pl':                            'pl_PL.ISO8859-2',
        'pl_pl':                         'pl_PL.ISO8859-2',
        'polish':                        'pl_PL.ISO8859-2',
        'portuguese':                    'pt_PT.ISO8859-1',
        'portuguese_brazil':             'pt_BR.ISO8859-1',
        'posix':                         'C',
        'posix-utf2':                    'C',
        'pt':                            'pt_PT.ISO8859-1',
        'pt_br':                         'pt_BR.ISO8859-1',
        'pt_pt':                         'pt_PT.ISO8859-1',
        'ro':                            'ro_RO.ISO8859-2',
        'ro_ro':                         'ro_RO.ISO8859-2',
        'ru':                            'ru_RU.ISO8859-5',
        'ru_ru':                         'ru_RU.ISO8859-5',
        'rumanian':                      'ro_RO.ISO8859-2',
        'russian':                       'ru_RU.ISO8859-5',
        'serbocroatian':                 'sh_YU.ISO8859-2',
        'sh':                            'sh_YU.ISO8859-2',
        'sh_hr':                         'sh_HR.ISO8859-2',
        'sh_sp':                         'sh_YU.ISO8859-2',
        'sh_yu':                         'sh_YU.ISO8859-2',
        'sk':                            'sk_SK.ISO8859-2',
        'sk_sk':                         'sk_SK.ISO8859-2',
        'sl':                            'sl_CS.ISO8859-2',
        'sl_cs':                         'sl_CS.ISO8859-2',
        'sl_si':                         'sl_SI.ISO8859-2',
        'slovak':                        'sk_SK.ISO8859-2',
        'slovene':                       'sl_CS.ISO8859-2',
        'sp':                            'sp_YU.ISO8859-5',
        'sp_yu':                         'sp_YU.ISO8859-5',
        'spanish':                       'es_ES.ISO8859-1',
        'spanish_spain':                 'es_ES.ISO8859-1',
        'sr_sp':                         'sr_SP.ISO8859-2',
        'sv':                            'sv_SE.ISO8859-1',
        'sv_se':                         'sv_SE.ISO8859-1',
        'swedish':                       'sv_SE.ISO8859-1',
        'th_th':                         'th_TH.TACTIS',
        'tr':                            'tr_TR.ISO8859-9',
        'tr_tr':                         'tr_TR.ISO8859-9',
        'turkish':                       'tr_TR.ISO8859-9',
        'univ':                          'en_US.utf',
        'universal':                     'en_US.utf',
        'zh':                            'zh_CN.eucCN',
        'zh_cn':                         'zh_CN.eucCN',
        'zh_cn.big5':                    'zh_TW.eucTW',
        'zh_cn.euc':                     'zh_CN.eucCN',
        'zh_tw':                         'zh_TW.eucTW',
        'zh_tw.euc':                     'zh_TW.eucTW',
}

if __name__ == '__main__':

    categories = {}
    def _init_categories():
        for k,v in globals().items():
            if k[:3] == 'LC_':
                categories[k] = v
    _init_categories()

    print 'Locale defaults as determined by get_default():'
    print '-'*72
    lang, enc = get_default()
    print 'Language: ', lang or '(undefined)'
    print 'Encoding: ', enc or '(undefined)'
    print

    print 'Locale settings on startup:'
    print '-'*72
    for name,category in categories.items():
        print name,'...'
        lang, enc = get_locale(category)
        print '   Language: ', lang or '(undefined)'
        print '   Encoding: ', enc or '(undefined)'
        print

    set_to_default()
    print
    print 'Locale settings after calling set_to_default():'
    print '-'*72
    for name,category in categories.items():
        print name,'...'
        lang, enc = get_locale(category)
        print '   Language: ', lang or '(undefined)'
        print '   Encoding: ', enc or '(undefined)'
        print
    
    try:
        setlocale(LC_ALL,"")
    except:
        print 'NOTE:'
        print 'setlocale(LC_ALL,"") does not support the default locale'
        print 'given in the OS environment variables.'
    else:
        print
        print 'Locale settings after calling setlocale(LC_ALL,""):'
        print '-'*72
        for name,category in categories.items():
            print name,'...'
            lang, enc = get_locale(category)
            print '   Language: ', lang or '(undefined)'
            print '   Encoding: ', enc or '(undefined)'
            print
    





From guido@python.org  Tue May 30 14:59:37 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 30 May 2000 08:59:37 -0500
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: Your message of "Tue, 30 May 2000 12:26:29 +0200."
 <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
 <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>
Message-ID: <200005301359.IAA05484@cj20424-a.reston1.va.home.com>

> From: "Fredrik Lundh" <effbot@telia.com>
> 
> I wrote:
> 
> > what's the best way to deal with this?  I see three alter-
> > natives:
> > 
> > a) stick to the old definition, and use chr(10) also for
> >    unicode strings
> > 
> > b) use different definitions for 8-bit strings and unicode
> >    strings; if given an 8-bit string, use chr(10); if given
> >    a 16-bit string, use the LINEBREAK predicate.
> > 
> > c) use LINEBREAK in either case.
> > 
> > I think (c) is the "right thing", but it's the only one that may
> > break existing code...
> 
> I'm probably getting old, but I don't remember if anyone followed
> up on this, and I don't have time to check the archives right now.
> 
> so for the upcoming "feature complete" release, I've decided to
> stick to (a).
> 
> ...
> 
> for the next release, I suggest implementing a fourth alternative:
> 
> d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
>    use chr(10).
> 
> background: in the current implementation, this decision has to
> be made at compile time, and a compiled expression can be used
> with either 8-bit strings or 16-bit strings.
> 
> a fifth alternative would be to use the locale flag to tell the
> difference between unicode and 8-bit characters:
> 
> e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).
> 
> comments?

I proposed before to see what Perl does -- since we're supposedly
following Perl's RE syntax anyway.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Tue May 30 13:03:17 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 14:03:17 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>
Message-ID: <3933AE05.4640A75D@lemburg.com>

Fredrik Lundh wrote:
> 
> I wrote:
> 
> > what's the best way to deal with this?  I see three alter-
> > natives:
> >
> > a) stick to the old definition, and use chr(10) also for
> >    unicode strings
> >
> > b) use different definitions for 8-bit strings and unicode
> >    strings; if given an 8-bit string, use chr(10); if given
> >    a 16-bit string, use the LINEBREAK predicate.
> >
> > c) use LINEBREAK in either case.
> >
> > I think (c) is the "right thing", but it's the only one that may
> > break existing code...
> 
> I'm probably getting old, but I don't remember if anyone followed
> up on this, and I don't have time to check the archives right now.
> 
> so for the upcoming "feature complete" release, I've decided to
> stick to (a).
> 
> ...
> 
> for the next release, I suggest implementing a fourth alternative:
> 
> d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
>    use chr(10).
> 
> background: in the current implementation, this decision has to
> be made at compile time, and a compiled expression can be used
> with either 8-bit strings or 16-bit strings.
> 
> a fifth alternative would be to use the locale flag to tell the
> difference between unicode and 8-bit characters:
> 
> e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).
> 
> comments?

For Unicode objects you should really default to using the 
Py_UNICODE_ISLINEBREAK() macro which defines all line break
characters (note that CRLF should be interpreted as a
single line break; see PyUnicode_Splitlines()). The reason
here is that Unicode defines how to handle line breaks
and we should try to stick to the standard as closely as possible.
All other possibilities could still be made available via new
flags.

For 8-bit strings I'd suggest sticking to the re definition.
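
(Roughly, the character set such a predicate covers -- this list is only
an approximation for illustration; the authoritative definition is the
one PyUnicode_Splitlines() uses:)

    # approximate set of Unicode line-break characters
    _LINEBREAKS = u"\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"

    def islinebreak(ch):
        # true if the character ends a line per the Unicode standard
        return ch in _LINEBREAKS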

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake@acm.org  Tue May 30 14:40:53 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Tue, 30 May 2000 06:40:53 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <200005280628.IAA01239@loewis.home.cs.tu-berlin.de>
Message-ID: <Pine.LNX.4.10.10005300638110.21070-100000@mailhost.beopen.com>

On Sun, 28 May 2000, Martin v. Loewis wrote:
 > In ISO (!) C99, an implementation may define __STDC_ISO_10646__ to
 > indicate that wchar_t is Unicode. The exact wording is

  This is a real improvement!  I've seen brief summaries of the changes
in C99, but I should take a little time to become more familiar with them.
It looked like a real improvement.

 > Of course, at the moment, there are few, if any, implementations that
 > define this macro.

  I think the gcc people are still working on it, but that's to be
expected; there's a lot of things they're still working on.  ;)


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From fredrik@pythonware.com  Tue May 30 15:23:46 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:23:46 +0200
Subject: [Python-Dev] Q: join vs. __join__ ?
Message-ID: <001101bfca42$ae9f9710$0500a8c0@secret.pythonware.com>

(re: yet another endless thread on comp.lang.python)

how about renaming the "join" method to "__join__", so we can
argue that it doesn't really exist.

</F>

<project name="sre" complete="97.1%" />



From fdrake@acm.org  Tue May 30 15:22:42 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Tue, 30 May 2000 07:22:42 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where
 to install non-code files)
In-Reply-To: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com>

On Sun, 28 May 2000, Guido van Rossum wrote:
 > > Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
 > > module would provide 'os.environ["HOME"]' similar to the posix
 > > version?  This would certainly simplify the task of application
 > > programmers intending to write portable applications.
 > 
 > This sounds like a nice idea...

  Now that this idea has fermented for a few days, I'm inclined to not
like it.  It smells of making a Unix-centric interface to something that
isn't terribly portable as a concept.
  Perhaps there should be a function that does the "right thing",
extracting os.environ["HOME"] if defined, and taking an alternate approach
(os.getcwd() or whatever) otherwise.  I don't think setting
os.environ["HOME"] in the library is a good idea because that changes the
environment that gets published to child processes beyond what the
application does.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From jeremy@alum.mit.edu  Tue May 30 15:33:02 2000
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 30 May 2000 10:33:02 -0400 (EDT)
Subject: [Python-Dev] SRE snapshot broken
Message-ID: <14643.53534.143126.349006@localhost.localdomain>

I believe I'm looking at the current version.  (It's a file called
snapshot.zip with no version-specific identifying info that I can
find.)

The sre module changed one line in _fixflags from the CVS version.

def _fixflags(flags):
    # convert flag bitmask to sequence
    assert flags == 0
    return ()

The assert flags == 0 is apparently wrong, because it gets called with
an empty tuple if you use sre.search or sre.match.

Also, assuming that simply reverting to the previous test "assert not
flags" fix this bug, is there a test suite that I can run?  Guido
asked me to check in the current snapshot, but it's hard to tell how
to do that correctly.  It's not clear which files belong in the Python
CVS tree, nor is it clear how to test that the build worked.
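
(In other words, presumably the fix is just the one-line revert mentioned
above:)

def _fixflags(flags):
    # convert flag bitmask to sequence
    assert not flags    # accepts 0 and () alike, unlike "flags == 0"
    return ()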

Jeremy



From guido@python.org  Tue May 30 16:34:04 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 30 May 2000 10:34:04 -0500
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: Your message of "Tue, 30 May 2000 07:22:42 MST."
 <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com>
References: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com>
Message-ID: <200005301534.KAA06322@cj20424-a.reston1.va.home.com>

[Fred]
>   Now that this idea has fermented for a few days, I'm inclined to not
> like it.  It smells of making Unix-centric interface to something that
> isn't terribly portable as a concept.
>   Perhaps there should be a function that does the "right thing",
> extracting os.environ["HOME"] if defined, and taking an alternate approach
> (os.getcwd() or whatever) otherwise.  I don't think setting
> os.environ["HOME"] in the library is a good idea because that changes the
> environment that gets published to child processes beyond what the
> application does.

The passing on to child processes doesn't sound like a big deal to me.
Either these are Python programs, in which case they might appreciate
that the work has already been done, or they aren't, in which case
they probably don't look at $HOME at all (since apparently they worked
before).

I could see defining a new API, e.g. os.gethomedir(), but that doesn't
help all the programs that currently use $HOME...  Perhaps we could do
both?  (I.e. add os.gethomedir() *and* set $HOME.)
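
(A rough sketch of what os.gethomedir() could look like -- the Windows
branch below is just a guess, not a worked-out proposal:)

import os

def gethomedir():
    # Unix: trust $HOME if it's set.
    home = os.environ.get("HOME")
    if home:
        return home
    if os.name == "nt":
        # NT usually sets HOMEDRIVE/HOMEPATH even when HOME is unset.
        drive = os.environ.get("HOMEDRIVE", "")
        path = os.environ.get("HOMEPATH", "")
        if drive or path:
            return drive + path
    # last resort: the current directory
    return os.getcwd()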

--Guido van Rossum (home page: http://www.python.org/~guido/)


From fredrik@pythonware.com  Tue May 30 15:24:59 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:24:59 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>             <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>  <200005301359.IAA05484@cj20424-a.reston1.va.home.com>
Message-ID: <002801bfca44$b533c900$0500a8c0@secret.pythonware.com>

Guido van Rossum wrote:
> I proposed before to see what Perl does -- since we're supposedly
> following Perl's RE syntax anyway.

anyone happen to have 5.6 on their box?

</F>

<project name="sre" complete="97.1%" />



From fredrik@pythonware.com  Tue May 30 15:38:29 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:38:29 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com>
Message-ID: <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
...
> > background: in the current implementation, this decision has to
> > be made at compile time, and a compiled expression can be used
> > with either 8-bit strings or 16-bit strings.
...
> For Unicode objects you should really default to using the
> Py_UNICODE_ISLINEBREAK() macro which defines all line break
> characters (note that CRLF should be interpreted as a
> single line break; see PyUnicode_Splitlines()). The reason
> here is that Unicode defines how to handle line breaks
> and we should try to stick to the standard as closely as possible.
> All other possibilities could still be made available via new
> flags.
>
> For 8-bit strings I'd suggest sticking to the re definition.

guess my background description wasn't clear:

Once a pattern has been compiled, it will always handle line
endings in the same way. The parser doesn't really care if the
pattern is a unicode string or an 8-bit string (unicode strings
can contain "wide" characters, but that's the only difference).

At the other end, the same compiled pattern can be applied
to either 8-bit or unicode strings.  It's all just characters to
the engine...

Now, I can of course change the engine so that it always uses
chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but the
result is that

    pattern.match(widestring)

won't necessarily match the same thing as

    pattern.match(str(widestring))

even if the wide string only contains plain ASCII.

(another alternative is to recompile the pattern for each target
string type, but that will hurt performance...)

</F>

<project name="sre" complete="97.1%" />



From mal@lemburg.com  Tue May 30 15:57:57 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 16:57:57 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com>
Message-ID: <3933D6F5.F6BDA39@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> ...
> > > background: in the current implementation, this decision has to
> > > be made at compile time, and a compiled expression can be used
> > > with either 8-bit strings or 16-bit strings.
> ...
> > For Unicode objects you should really default to using the
> > Py_UNICODE_ISLINEBREAK() macro which defines all line break
> > characters (note that CRLF should be interpreted as a
> > single line break; see PyUnicode_Splitlines()). The reason
> > here is that Unicode defines how to handle line breaks
> > and we should try to stick to the standard as close as possible.
> > All other possibilities could still be made available via new
> > flags.
> >
> > For 8-bit strings I'd suggest sticking to the re definition.
> 
> guess my background description wasn't clear:
> 
> Once a pattern has been compiled, it will always handle line
> endings in the same way. The parser doesn't really care if the
> pattern is a unicode string or an 8-bit string (unicode strings
> can contain "wide" characters, but that's the only difference).

Ok.

> At the other end, the same compiled pattern can be applied
> to either 8-bit or unicode strings.  It's all just characters to
> the engine...

Doesn't the engine remember whether the pattern was a string
or Unicode?
 
> Now, I can of course change the engine so that it always uses
> chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but the
> result is that
> 
>     pattern.match(widestring)
> 
> won't necessarily match the same thing as
> 
>     pattern.match(str(widestring))
> 
> even if the wide string only contains plain ASCII.

Hmm, I wouldn't mind, as long as the engine does the right
thing for Unicode which is to respect the line break
standard defined in Unicode TR13.

Thinking about this some more: I wouldn't even mind if
the engine would use LINEBREAK for all strings :-). It would
certainly make life easier whenever you have to deal with
file input from different platforms, e.g. Mac, Unix and
Windows.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From fredrik@pythonware.com  Tue May 30 16:14:00 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 17:14:00 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com>
Message-ID: <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>

M.-A. Lemburg <mal@lemburg.com> wrote:
> BTW, I haven't found any mention of what language and encoding
> the locale 'C' assumes or defines. Currently, the module
> reports these as None, meaning undefined. Are language and
> encoding defined for 'C' ?

IIRC, the C locale (and the POSIX character set) is defined in terms
of a "portable character set".  This set contains all ASCII characters,
but doesn't specify what code points to use.

But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
using Python on a non-ASCII platform?  does it even build and run
on such a beast?)

</F>

<project name="sre" complete="97.1%" />



From fredrik@pythonware.com  Tue May 30 16:19:48 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 17:19:48 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com> <3933D6F5.F6BDA39@lemburg.com>
Message-ID: <00b601bfca4a$7f0aad20$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
> > At the other end, the same compiled pattern can be applied
> > to either 8-bit or unicode strings.  It's all just characters to
> > the engine...
>
> Doesn't the engine remember whether the pattern was a string
> or Unicode?

The pattern object contains a reference to the original pattern
string, so I guess the answer is "yes, but indirectly".  But the core
engine doesn't really care -- it just follows the instructions in the
compiled pattern.

> Thinking about this some more: I wouldn't even mind if
> the engine would use LINEBREAK for all strings :-). It would
> certainly make life easier whenever you have to deal with
> file input from different platforms, e.g. Mac, Unix and
> Windows.

That's what I originally proposed (and implemented).  But this may
(in theory, at least) break existing code.  If not else, it broke the
test suite ;-)

</F>

<project name="sre" complete="97.1%" />



From akuchlin@cnri.reston.va.us  Tue May 30 16:16:14 2000
From: akuchlin@cnri.reston.va.us (Andrew M. Kuchling)
Date: Tue, 30 May 2000 11:16:14 -0400
Subject: [Python-Dev] Re: Extending locale.py
In-Reply-To: <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>; from fredrik@pythonware.com on Tue, May 30, 2000 at 05:14:00PM +0200
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com> <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>
Message-ID: <20000530111614.B7942@amarok.cnri.reston.va.us>

On Tue, May 30, 2000 at 05:14:00PM +0200, Fredrik Lundh wrote:
>But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
>using Python on a non-ASCII platform?  does it even build and run
>on such a beast?)

The OS/390 port of 1.4? (http://www.s390.ibm.com/products/oe/python.html)
But it doesn't look like they ported the regex module at all.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Better get going; your parents still think me imaginary, and I'd hate to
shatter an illusion like that before dinner.
  -- The monster, in STANLEY AND HIS MONSTER #1




From gmcm@hypernet.com  Tue May 30 16:29:39 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Tue, 30 May 2000 11:29:39 -0400
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
Message-ID: <1252421881-3397332@hypernet.com>

Fred L. Drake wrote:

>   Now that this idea has fermented for a few days, I'm inclined
>   to not
> like it.  It smells of making Unix-centric interface to something
> that isn't terribly portable as a concept.

I've refrained from jumping in here (as, it seems, have all the 
Windows users) because this is a god-awful friggin' mess on 
Windows.

From the 10**3 foot view, yes, they have the concept. From 
any closer it falls apart miserably.

In practice, I think you can safely regard a Win9x box as 
single user. I do sometimes run across NT boxes that multiple 
people use, and have separate configurations. It sort of works, 
sometimes.

But there's no $HOME as such.

There's 
HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders 
with around 16 subkeys, including AppData 
(which on my system has one entry installed by a program I've 
never used and didn't know I had). But MSOffice uses the 
Personal subkey. Others seem to use the Desktop subkey.

Still other programs will remember the per-user directories 
under HKLM\....\<program specific> with a subkey == userid.

That said, the above referenced AppData is probably the 
closest thing to a $HOME directory, despite the fact that it 
smells, tastes, acts and looks nothing like the *nix 
counterpart.

(A cmd.exe "cd" w/o arg acts like "pwd". I notice that the 
bash shell requires you to set $HOME, and won't make any 
guesses.)



- Gordon


From fdrake@acm.org  Tue May 30 17:10:29 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 09:10:29 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <1252421881-3397332@hypernet.com>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
 <1252421881-3397332@hypernet.com>
Message-ID: <14643.59381.73286.292195@mailhost.beopen.com>

Gordon McMillan writes:
 > From the 10**3 foot view, yes, they have the concept. From 
 > any closer it falls apart miserably.

  So they have the concept, just no implementation.  ;)  Sounds like
leaving it up to the application to interpret their requirements is
the right thing.  Or the right thing is to provide a function to ask
where configuration information should be stored for the
user/application; this would be $HOME under Unix and <whatever> on
Windows.  The only other reason I can think of that $HOME is needed is
for navigation purposes (as in a filesystem browser), and for that the
application needs to deal with the lack of the concept in the
operating system as appropriate.

 > (An cmd.exe "cd" w/o arg acts like "pwd". I notice that the 
 > bash shell requires you to set $HOME, and won't make any 
 > guesses.)

  This very definitely sounds like overloading $HOME is the wrong
thing.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From pf@artcom-gmbh.de  Tue May 30 17:37:41 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 30 May 2000 18:37:41 +0200 (MEST)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <200005301534.KAA06322@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 30, 2000 10:34: 4 am"
Message-ID: <m12wp1R-000DifC@artcom0.artcom-gmbh.de>

> [Fred]
> >   Now that this idea has fermented for a few days, I'm inclined to not
> > like it.  It smells of making Unix-centric interface to something that
> > isn't terribly portable as a concept.

Yes.  After thinking more carefully and after a closer look to what 
Jack Jansen finally figured out for MacOS (see 
	<http://www.python.org/pipermail/pythonmac-sig/2000-May/003667.html>
) I agree with Fred.  My initial idea to put something into
'os.environ["HOME"]' on those platforms was too simple minded.

> >   Perhaps there should be a function that does the "right thing",
> > extracting os.environ["HOME"] if defined, and taking an alternate approach
> > (os.getcwd() or whatever) otherwise.  
[...]

Every serious (non-trivial) application usually contains something like 
"user preferences" or other state information, which should --if possible-- 
survive the following kinds of events:
  1. An upgrade of the application to a newer version.  This is
     often accomplished by removing the directory tree in which the
     application lives and replacing it by unpacking or installing
     an archive containing the new version of the application.
  2. Another colleague uses the application on the same computer and
     modifies settings to fit his personal taste.

On several versions of WinXX and on MacOS prior to release 9.X (and due
to stability problems with the multiuser capabilities even in MacOS 9)
the second kind of event seems to be rather unimportant to the users
of these platforms, since the OSes are considered as "single user"
systems anyway.  Or in other words:  the users are already used to
this situation.

Only the first kind of event should be solved for all platforms:  
<FANTASY>
    Imagine you are using grail version 4.61 on a daily basis for WWW 
    browsing and one day you decide to install the nifty upgrade 
    grail 4.73 on your computer running WinXX or MacOS X.Y 
    and after doing so you just discover that all your carefully
    sorted bookmarks are gone!  That wouldn't be nice, would it?
</FANTASY>

I see some similarities here to the standard library module 'tempfile',
which supplies (or at least tries to ;-) ) a cross-platform portable
strategy for all applications which have to store temporary data.

My intention was to have a simple cross-platform portable API to store
and retrieve such user-specific state information (examples: the bookmarks
of a Web browser, themes, color settings, fonts... other GUI settings, 
and so on... you get the picture).  On Unices applications usually use the
idiom 
	os.path.join(os.environ.get("HOME", "."), ".dotfoobar")
or something similar.

Do people remember 'grail'?  I've just stolen the following code snippets
from 'grail0.6/grailbase/utils.py' to demonstrate that this is still 
a very common programming problem:
---------------- snip ---------------------
# XXX Unix specific stuff
# XXX (Actually it limps along just fine for Macintosh, too)
 
def getgraildir():
    return getenv("GRAILDIR") or os.path.join(gethome(), ".grail")    
----- snip ------
def gethome():
    try:
        home = getenv("HOME")
        if not home:
            import pwd
            user = getenv("USER") or getenv("LOGNAME")
            if not user:
                pwent = pwd.getpwuid(os.getuid())
            else:
                pwent = pwd.getpwnam(user)
            home = pwent[6]
        return home
    except (KeyError, ImportError):
        return os.curdir
---------------- snap ---------------------
[...]

[Guido van Rossum]:
> I could see defining a new API, e.g. os.gethomedir(), but that doesn't
> help all the programs that currently use $HOME...  Perhaps we could do
> both?  (I.e. add os.gethomedir() *and* set $HOME.)

I'm not sure whether this is really generic enough for the OS module.

Maybe we should introduce a new small standard library module called 
'userprefs' or some such?  A programmer with a MacOS or WinXX background 
will probably not know what to do with 'os.gethomedir()'.  

However, for the time being this module would only contain one simple 
function returning a directory pathname, which is guaranteed to exist 
and to survive a deinstallation of an application.  Maybe introducing
a new module is overkill?  What do you think?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From fdrake@acm.org  Tue May 30 18:17:56 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 10:17:56 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <m12wp1R-000DifC@artcom0.artcom-gmbh.de>
References: <200005301534.KAA06322@cj20424-a.reston1.va.home.com>
 <m12wp1R-000DifC@artcom0.artcom-gmbh.de>
Message-ID: <14643.63428.387306.455383@mailhost.beopen.com>

Peter Funk writes:
 > <FANTASY>
 >     Imagine you are using grail version 4.61 on a daily basis for WWW 
 >     browsing and one day you decide to install the nifty upgrade 
 >     grail 4.73 on your computer running WinXX or MacOS X.Y 

  Good thing you marked that as fantasy -- I would have asked for the
download URL!  ;)

 > Do people remember 'grail'?  I've just stolen the following code snippets

  Not on good days.  ;)

 > I'm not sure whether this is really generic enough for the OS module.

  The location selected is constrained by the OS, but this isn't an
exposure of operating system functionality, so there should probably
be something else.

 > May be we should introduce a new small standard library module called 
 > 'userprefs' or some such?  A programmer with a MacOS or WinXX  background 
 > will probably not know what to do with 'os.gethomedir()'.  
 > 
 > However for the time being this module would only contain one simple 
 > function returning a directory pathname, which is guaranteed to exist 
 > and to survive a deinstallation of an application.  May be introducing

  Look at your $HOME on a Unix box; most of the dotfiles are *files*, not
directories, and that's all most applications need.  Web browsers are a
special case in this way; there aren't that many things that require a
directory.  Those things which do are often programs that form an
essential part of a user's environment -- Web browsers and email
clients are two good examples I've seen that really seem to have a lot
of things.
  I think what's needed is a function to return the location where the
application can make one directory entry.  The caller is still
responsible for creating a directory to store a larger set of files if
needed.  Something like grailbase.utils.establish_dir() might be a
nice convenience function.
  An additional convenience may be to offer a function which takes the
application name and a dotfile name, and returns the one to use; the
Windows and MacOS (and BeOS?) worlds seem more comfortable with the
longer, mixed-case, more readable names, while the Unix world enjoys
cryptic little names with a dot at the front.
  Ok, so now that I've rambled, the "userprefs" module looks like it
contains:

        get_appdata_root() -- $HOME, or other based on platform
        get_appdata_name() -- "MyApplication Preferences" or ".myapp"
        establish_dir() -- create dir if it doesn't exist

  Maybe this really is a separate module.  ;)
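
  (A minimal sketch of that interface, with guessed signatures -- the
non-Unix branches are placeholders rather than a real proposal:)

    import os

    def get_appdata_root():
        # $HOME if set; otherwise punt to the current directory.
        # (A platform-specific lookup would go here for Windows/Mac.)
        return os.environ.get("HOME") or os.getcwd()

    def get_appdata_name(appname):
        # ".myapp" on Unix, "MyApplication Preferences" elsewhere.
        if os.name == "posix":
            return "." + appname.lower()
        return appname + " Preferences"

    def establish_dir(path):
        # Create the directory if it doesn't exist yet; return it.
        if not os.path.isdir(path):
            os.mkdir(path)
        return path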


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From mal@lemburg.com  Tue May 30 18:54:32 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 19:54:32 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com> <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>
Message-ID: <39340058.CA3FC798@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > BTW, I haven't found any mention of what language and encoding
> > the locale 'C' assumes or defines. Currently, the module
> > reports these as None, meaning undefined. Are language and
> > encoding defined for 'C' ?
> 
> IIRC, the C locale (and the POSIX character set) is defined in terms
> of a "portable character set".  This set contains all ASCII characters,
> but doesn't specify what code points to use.
> 
> But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
> using Python on a non-ASCII platform?  does it even build and run
> on such a beast?)

Hmm, that would mean having an encoding, but no language
definition available -- setlocale() doesn't work without a
language code... I guess it's better to leave things
undefined in that case.

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue May 30 18:57:41 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 19:57:41 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com> <3933D6F5.F6BDA39@lemburg.com> <00b601bfca4a$7f0aad20$0500a8c0@secret.pythonware.com>
Message-ID: <39340115.7E05DA6C@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > > At the other end, the same compiled pattern can be applied
> > > to either 8-bit or unicode strings.  It's all just characters to
> > > the engine...
> >
> > Doesn't the engine remember wether the pattern was a string
> > or Unicode ?
> 
> The pattern object contains a reference to the original pattern
> string, so I guess the answer is "yes, but indirectly".  But the core
> engine doesn't really care -- it just follows the instructions in the
> compiled pattern.
> 
> > Thinking about this some more: I wouldn't even mind if
> > the engine would use LINEBREAK for all strings :-). It would
> > certainly make life easier whenever you have to deal with
> > file input from different platforms, e.g. Mac, Unix and
> > Windows.
> 
> That's what I originally proposed (and implemented).  But this may
> (in theory, at least) break existing code.  If not else, it broke the
> test suite ;-)

SRE is new, so what could it break?

Anyway, perhaps we should wait for some Perl 5.6 wizard to
speak up ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From guido@python.org  Tue May 30 20:16:13 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 30 May 2000 14:16:13 -0500
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
Message-ID: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>

FYI, here's an important announcement that I just sent to c.l.py.  I'm
very excited that we can finally announce this!

I'll be checking mail sporadically until Thursday morning.  Back on
June 19.

--Guido van Rossum (home page: http://www.python.org/~guido/)

To all Python users and developers:

Python is growing rapidly.  In order to take it to the next level,
I've moved with my core development group to a new employer,
BeOpen.com.  BeOpen.com is a startup company with a focus on open
source communities, and an interest in facilitating next generation
application development.  It is a natural fit for Python.

At BeOpen.com I am the director of a new development team named
PythonLabs.  The team includes three of my former colleagues at CNRI:
Fred Drake, Jeremy Hylton, and Barry Warsaw.  Another familiar face
will join us shortly: Tim Peters.  We have our own website
(www.pythonlabs.com) where you can read more about us, our plans and
our activities.  We've also posted a FAQ there specifically about
PythonLabs, our transition to BeOpen.com, and what it means for the
Python community.

What will change, and what will stay the same? First of all, Python
will remain Open Source.  In fact, everything we produce at PythonLabs
will be released with an Open Source license.  Also, www.python.org
will remain the number one website for the Python community.  CNRI
will continue to host it, and we'll maintain it as a community
project.

What changes is how much time we have for Python.  Previously, Python
was a hobby or side project, which had to compete with our day jobs;
at BeOpen.com we will be focused full time on Python development! This
means that we'll be able to spend much more time on exciting new
projects like Python 3000.  We'll also get support for website
management from BeOpen.com's professional web developers, and we'll
work with their marketing department.

Marketing for Python, you ask? Sure, why not! We want to grow the size
of the Python user and developer community at an even faster pace than
today.  This should benefit everyone: the larger the community, the
more resources will be available to all, and the easier it will be to
find Python expertise when you need it.  We're also planning to make
commercial offerings (within the Open Source guidelines!) to help
Python find its way into the hands of more programmers, especially in
large enterprises where adoption is still lagging.

There's one piece of bad news: Python 1.6 won't be released by June
1st.  There's simply too much left to be done.  We promise that we'll
get it out of the door as soon as possible.  By the way, Python 1.6
will be the last release from CNRI; after that, we'll issue Python
releases from BeOpen.com.

Oh, and to top it all off, I'm going on vacation.  I'm getting married
and will be relaxing on my honeymoon.  For all questions about
PythonLabs, write to pythonlabs-info@beopen.com.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From esr@thyrsus.com  Tue May 30 19:27:18 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Tue, 30 May 2000 14:27:18 -0400
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>; from guido@python.org on Tue, May 30, 2000 at 02:16:13PM -0500
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <20000530142718.A24289@thyrsus.com>

Guido van Rossum <guido@python.org>:
> Oh, and to top it all off, I'm going on vacation.  I'm getting married
> and will be relaxing on my honeymoon.

Mazel tov, Guido!

BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
to include it in 1.6?
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The Constitution is not neutral. It was designed to take the
government off the backs of the people.
	-- Justice William O. Douglas 


From fdrake@acm.org  Tue May 30 19:23:25 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 11:23:25 -0700 (PDT)
Subject: [Python-Dev] ascii.py + documentation
In-Reply-To: <20000530142718.A24289@thyrsus.com>
References: <20000530142718.A24289@thyrsus.com>
Message-ID: <14644.1821.67068.165890@mailhost.beopen.com>

Eric S. Raymond writes:
 > BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
 > to include it in 1.6?

Eric,
  Apparently the rest of us haven't heard of it.  Since Guido's a
little distracted right now, perhaps you should send the files to
python-dev for discussion?
  Thanks!


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From gward@mems-exchange.org  Tue May 30 19:25:42 2000
From: gward@mems-exchange.org (Greg Ward)
Date: Tue, 30 May 2000 14:25:42 -0400
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>; from guido@python.org on Tue, May 30, 2000 at 02:16:13PM -0500
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <20000530142541.D20088@mems-exchange.org>

On 30 May 2000, Guido van Rossum said:
> At BeOpen.com I am the director of a new development team named
> PythonLabs.  The team includes three of my former colleagues at CNRI:
> Fred Drake, Jeremy Hylton, and Barry Warsaw.

Ahh, no wonder it's been so quiet around here.  I was wondering where
you guys had gone.  Mystery solved!

(It's a *joke!*  We already *knew* they were leaving...)

        Greg
-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367


From trentm@activestate.com  Tue May 30 19:26:38 2000
From: trentm@activestate.com (Trent Mick)
Date: Tue, 30 May 2000 11:26:38 -0700
Subject: [Python-Dev] inspect.py
In-Reply-To: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
References: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
Message-ID: <20000530112638.E18024@activestate.com>

Looks cool, Ping.

Trent


-- 
Trent Mick
trentm@activestate.com


From guido@python.org  Tue May 30 20:34:38 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 30 May 2000 14:34:38 -0500
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: Your message of "Tue, 30 May 2000 14:27:18 -0400."
 <20000530142718.A24289@thyrsus.com>
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
 <20000530142718.A24289@thyrsus.com>
Message-ID: <200005301934.OAA07671@cj20424-a.reston1.va.home.com>

> Mazel tov, Guido!

Thanks!

> BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
> to include it in 1.6?

Yes, and probably.  As Fred suggested, could you resend to the patches
list?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From effbot@telia.com  Tue May 30 19:40:13 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 20:40:13 +0200
Subject: [Python-Dev] inspect.py
References: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
Message-ID: <012e01bfca66$7ed61ee0$f2a6b5d4@hagrid>

ping wrote:
> The reason i'm mentioning this here is that, in the course of
> doing that, i put all the introspection work in a separate
> module called "inspect.py".  It's at
>
>     http://www.lfw.org/python/inspect.py
>
...
>
> I think most of this stuff is quite generally useful, and it
> seems good to wrap this up in a module.  I'd like your thoughts
> on whether this is worth including in the standard library.

haven't looked at the code (yet), but +1 on concept.

(if this goes into 1.6, I no longer have to keep reposting
pointers to my "describe" module...)

</F>



From skip@mojam.com  Tue May 30 19:43:36 2000
From: skip@mojam.com (Skip Montanaro)
Date: Tue, 30 May 2000 13:43:36 -0500 (CDT)
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <14644.3032.900770.450584@beluga.mojam.com>

    Guido> Python is growing rapidly.  In order to take it to the next
    Guido> level, I've moved with my core development group to a new
    Guido> employer, BeOpen.com.

Great news!

    Guido> Oh, and to top it all off, I'm going on vacation.  I'm getting
    Guido> married and will be relaxing on my honeymoon.  For all questions
    Guido> about PythonLabs, write to pythonlabs-info@beopen.com.

Nice to see you are trying to maintain some consistency in the face of huge
professional and personal changes.  I would have worried if you weren't
going to go on vacation!  Congratulations on both moves...

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould


From esr@thyrsus.com  Tue May 30 19:58:38 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Tue, 30 May 2000 14:58:38 -0400
Subject: [Python-Dev] ascii.py + documentation
In-Reply-To: <14644.1821.67068.165890@mailhost.beopen.com>; from fdrake@acm.org on Tue, May 30, 2000 at 11:23:25AM -0700
References: <20000530142718.A24289@thyrsus.com> <14644.1821.67068.165890@mailhost.beopen.com>
Message-ID: <20000530145838.A24339@thyrsus.com>


Fred L. Drake, Jr. <fdrake@acm.org>:
>   Appearantly the rest of us haven't heard of it.  Since Guido's a
> little distracted right now, perhaps you should send the files to
> python-dev for discussion?

Righty-O.  Here they are enclosed.  I wrote this for use with the
curses module; one reason it's useful is because the curses
getch function returns ordinal values rather than characters.  It should
be more generally useful for any Python program with a raw character-by-
character command interface.

The TeX may need trivial markup fixes.  You might want to add a "See also"
to curses.

I'm using this code heavily in my CML2 project, so it has been tested.
For those of you who haven't heard about CML2, I've written a replacement
for the Linux kernel configuration system in Python.  You can find out more
at:

	http://www.tuxedo.org/~esr/kbuild/

The code has some interesting properties, including the ability to
probe its environment and come up in a Tk-based, curses-based, or
line-oriented mode depending on what it sees.

ascii.py will probably not be the last library code this project spawns.
I have another package called menubrowser that is a framework for writing
menu systems. And I have some Python wrapper enhancements for curses in
the works.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The two pillars of `political correctness' are, 
  a) willful ignorance, and
  b) a steadfast refusal to face the truth
	-- George MacDonald Fraser

[attachment: ascii.py]

#
# ascii.py -- constants and membership tests for ASCII characters
#

NUL	= 0x00	# ^@
SOH	= 0x01	# ^A
STX	= 0x02	# ^B
ETX	= 0x03	# ^C
EOT	= 0x04	# ^D
ENQ	= 0x05	# ^E
ACK	= 0x06	# ^F
BEL	= 0x07	# ^G
BS	= 0x08	# ^H
TAB	= 0x09	# ^I
HT	= 0x09	# ^I
LF	= 0x0a	# ^J
NL	= 0x0a	# ^J
VT	= 0x0b	# ^K
FF	= 0x0c	# ^L
CR	= 0x0d	# ^M
SO	= 0x0e	# ^N
SI	= 0x0f	# ^O
DLE	= 0x10	# ^P
DC1	= 0x11	# ^Q
DC2	= 0x12	# ^R
DC3	= 0x13	# ^S
DC4	= 0x14	# ^T
NAK	= 0x15	# ^U
SYN	= 0x16	# ^V
ETB	= 0x17	# ^W
CAN	= 0x18	# ^X
EM	= 0x19	# ^Y
SUB	= 0x1a	# ^Z
ESC	= 0x1b	# ^[
FS	= 0x1c	# ^\
GS	= 0x1d	# ^]
RS	= 0x1e	# ^^
US	= 0x1f	# ^_
SP	= 0x20	# space
DEL	= 0x7f	# delete

def _ctoi(c):
    if type(c) == type(""):
        return ord(c)
    else:
        return c

def isalnum(c): return isalpha(c) or isdigit(c)
def isalpha(c): return isupper(c) or islower(c)
def isascii(c): return _ctoi(c) <= 127		# ?
def isblank(c): return _ctoi(c) in (9, 32)	# tab or space
def iscntrl(c): return _ctoi(c) <= 31
def isdigit(c): return _ctoi(c) >= 48 and _ctoi(c) <= 57
def isgraph(c): return _ctoi(c) >= 33 and _ctoi(c) <= 126
def islower(c): return _ctoi(c) >= 97 and _ctoi(c) <= 122
def isprint(c): return _ctoi(c) >= 32 and _ctoi(c) <= 126
def ispunct(c): return _ctoi(c) != 32 and not isalnum(c)
def isspace(c): return _ctoi(c) in (9, 10, 11, 12, 13, 32)	# incl. space
def isupper(c): return _ctoi(c) >= 65 and _ctoi(c) <= 90
def isxdigit(c): return isdigit(c) or \
    (_ctoi(c) >= 65 and _ctoi(c) <= 70) or (_ctoi(c) >= 97 and _ctoi(c) <= 102)

def ctrl(c):
    if type(c) == type(""):
        return chr(_ctoi(c) & 0x1f)
    else:
        return _ctoi(c) & 0x1f

def alt(c):
    if type(c) == type(""):
        return chr(_ctoi(c) | 0x80)
    else:
        return _ctoi(c) | 0x80





[attachment: ascii.tex]

\section{\module{ascii} ---
         Constants and set-membership functions for ASCII characters.}

\declaremodule{standard}{ascii}
\modulesynopsis{Constants and set-membership functions for ASCII characters.}
\moduleauthor{Eric S. Raymond}{esr@thyrsus.com}
\sectionauthor{Eric S. Raymond}{esr@thyrsus.com}

\versionadded{1.6}

The \module{ascii} module supplies name constants for ASCII characters
and functions to test membership in various ASCII character classes.  
The constants supplied are names for control characters as follows:

NUL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS, TAB, HT, LF, NL, VT, FF, CR,
SO, SI, DLE, DC1, DC2, DC3, DC4, NAK, SYN, ETB, CAN, EM, SUB, ESC, FS, 
GS, RS, US, SP, DEL.

NL and LF are synonyms.  The module also supplies the following
functions, patterned on those in the standard C library:

\begin{funcdesc}{isalnum}{c}
Checks for an ASCII alphanumeric character; it is equivalent to
isalpha(c) or isdigit(c).
\end{funcdesc}

\begin{funcdesc}{isalpha}{c}
Checks for an ASCII alphabetic character; it is equivalent to
isupper(c) or islower(c).
\end{funcdesc}

\begin{funcdesc}{isascii}{c}
Checks for a character value that fits in the 7-bit ASCII set.
\end{funcdesc}

\begin{funcdesc}{isblank}{c}
Checks for a space or tab character.
\end{funcdesc}

\begin{funcdesc}{iscntrl}{c}
Checks for an ASCII control character (range 0x00 to 0x1f).
\end{funcdesc}

\begin{funcdesc}{isdigit}{c}
Checks for an ASCII decimal digit, 0 through 9.
\end{funcdesc}

\begin{funcdesc}{isgraph}{c}
Checks for any ASCII printable character except space.
\end{funcdesc}

\begin{funcdesc}{islower}{c}
Checks for an ASCII lower-case character.
\end{funcdesc}

\begin{funcdesc}{isprint}{c}
Checks for any ASCII printable character including space.
\end{funcdesc}

\begin{funcdesc}{ispunct}{c}
Checks for any printable ASCII character which is not a space or an
alphanumeric character.
\end{funcdesc}

\begin{funcdesc}{isspace}{c}
Checks for ASCII white-space characters: space, tab, line feed,
carriage return, form feed, vertical tab.
\end{funcdesc}

\begin{funcdesc}{isupper}{c}
Checks for an ASCII uppercase letter.
\end{funcdesc}

\begin{funcdesc}{isxdigit}{c}
Checks for an ASCII hexadecimal digit, i.e. one of 0123456789abcdefABCDEF.
\end{funcdesc}

These functions accept either integers or strings; when the argument
is a string, it is first converted using the built-in function ord().
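
For example (an illustrative session using the functions defined above):

\begin{verbatim}
>>> from ascii import isdigit, ctrl
>>> isdigit('7') and isdigit(0x37)
1
>>> ctrl('c') == chr(3)    # ^C
1
\end{verbatim}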

Note that all these functions check ordinal bit values derived from the 
first character of the string you pass in; they do not actually know
anything about the host machine's character encoding.  For functions 
that know about the character encoding (and handle
internationalization properly) see the string module.

The following two functions take either a single-character string or
integer byte value; they return a value of the same type.

\begin{funcdesc}{ctrl}{c}
Return the control character corresponding to the given character
(the character bit value is logical-anded with 0x1f).
\end{funcdesc}

\begin{funcdesc}{alt}{c}
Return the 8-bit character corresponding to the given ASCII character
(the character bit value is logical-ored with 0x80).
\end{funcdesc}







From jeremy@alum.mit.edu  Tue May 30 22:09:13 2000
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 30 May 2000 17:09:13 -0400 (EDT)
Subject: [Python-Dev] Python 3000 is going to be *really* different
Message-ID: <14644.11769.197518.938252@localhost.localdomain>

http://www.autopreservers.com/autope07.html

Jeremy



From paul@prescod.net  Wed May 31 06:53:47 2000
From: paul@prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 00:53:47 -0500
Subject: [Python-Dev] SIG: python-lang
Message-ID: <3934A8EB.6608B0E1@prescod.net>

I think that we need a forum somewhere between comp.lang.python and
pythondev. Let's call it python-lang.

By virtue of being buried on the "sigs" page, python-lang would mostly
be accessible only to those who have more than a cursory interest in
Python. Furthermore, you would have to go through a simple
administration procedure to join, as you do with any mailman list.

Appropriate topics of python-lang would be new ideas about language
features. Participants would be expected and encouraged to use archives
and FAQs to avoid repetitive topics. Particularly verboten would be
"ritual topics": indentation, case sensitivity, integer division,
language comparisons, etc. These discussions would be redirected loudly
and firmly to comp.lang.python.

Python-dev would remain invitation only but it would focus on the day to
day mechanics of getting new versions of Python out the door.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)


From nhodgson@bigpond.net.au  Wed May 31 07:39:34 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Wed, 31 May 2000 16:39:34 +1000
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com> <1252421881-3397332@hypernet.com>
Message-ID: <019b01bfcaca$fd72ffc0$e3cb8490@neil>

Gordon writes,

> But there's no $HOME as such.
>
> There's
> HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\S
> hell Folders with around 16 subkeys, including AppData
> (which on my system has one entry installed by a program I've
> never used and didn't know I had). But MSOffice uses the
> Personal subkey. Others seem to use the Desktop subkey.

   SHGetSpecialFolderPath(,,CSIDL_APPDATA,) would be the current 'MS
preferred' method for this as it allows roaming (not that I've ever seen
roaming work). If Unix code expects $HOME to be per machine (and so used to
store, for example, window locations which are dependent on screen
resolution) then CSIDL_LOCAL_APPDATA would be a better choice.

   To make these work on 9x and NT 4 Microsoft provides a redistributable
Shfolder.dll.

   Fred writes,

>  Look at your $HOME on Unix box; most of the dotfiles are *files*, not
> directories, and that's all most applications need;

   This may have been the case in the past and for people who understand
Unix well enough to maintain it, but for us just-want-it-to-run folks, it's
no longer true. I formatted my Linux partition this week and installed Red
Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
outnumber the dot files 18 to 16.

   Neil



From pf@artcom-gmbh.de  Wed May 31 08:34:34 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Wed, 31 May 2000 09:34:34 +0200 (MEST)
Subject: [Python-Dev] 'userprefs.py': Looking for help for WinXX (was Re: user dirs on Non-Unix platforms...)
In-Reply-To: <019b01bfcaca$fd72ffc0$e3cb8490@neil> from Neil Hodgson at "May 31, 2000  4:39:34 pm"
Message-ID: <m12x31O-000DifC@artcom0.artcom-gmbh.de>

> Gordon writes,
> 
> > But there's no $HOME as such.
> >
> > There's
> > HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\S
> > hell Folders with around 16 subkeys, including AppData
> > (which on my system has one entry installed by a program I've
> > never used and didn't know I had). But MSOffice uses the
> > Personal subkey. Others seem to use the Desktop subkey.

Neil responds:
>    SHGetSpecialFolderPath(,,CSIDL_APPDATA,) would be the current 'MS
> preferred' method for this as it allows roaming (not that I've ever seen
> roaming work). If Unix code expects $HOME to be per machine (and so used to
> store, for example, window locations which are dependent on screen
> resolution) then CSIDL_LOCAL_APPDATA would be a better choice.
> 
>    To make these work on 9x and NT 4 Microsoft provides a redistributable
> Shfolder.dll.

Using a place on the local machine of the user makes more sense to me.

But excuse my ignorance: I've just 'grep'ed through the Python 1.6a2
sources and also through Mark Hammond's Win32 Python extension C++
sources (here on my notebook running Linux) and found nothing called
'SHGetSpecialFolderPath'.  So I believe, this API is currently not
exposed to the Python level.  Right?

So it would be very nice if you WinXX gurus more familiar with the WinXX
platform would come up with a Python code snippet, which I could try
to include in an upcoming standard library module 'userprefs.py' I plan to
write.  Something like:
    if os.name == 'nt':
        try:
            import win32XYZ
            if hasattr(win32XYZ, 'SHGetSpecialFolderPath'):
                userplace = win32XYZ.SHGetSpecialFolderPath(.....) 
        except ImportError:
            .....
would be very fine.

>    Fred writes,
> 
> >  Look at your $HOME on Unix box; most of the dotfiles are *files*, not
> > directories, and that's all most applications need;
> 
>    This may have been the case in the past and for people who understand
> Unix well enough to maintain it, but for us just-want-it-to-run folks, its
> no longer true. I formatted my Linux partition this week and installed Red
> Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
> outnumber the dot files 18 to 16.

Fred proposed an API which leaves the decision whether to use a
single file or several files in a special directory up to the
application developer.  

I agree with Fred.  

Simple applications will use only a simple config file, whereas bigger
applications will need a directory to store several files.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)


From nhodgson@bigpond.net.au  Wed May 31 09:18:20 2000
From: nhodgson@bigpond.net.au (Neil Hodgson)
Date: Wed, 31 May 2000 18:18:20 +1000
Subject: [Python-Dev] 'userprefs.py': Looking for help for WinXX (was Re: user dirs on Non-Unix platforms...)
References: <m12x31O-000DifC@artcom0.artcom-gmbh.de>
Message-ID: <006b01bfcad8$c899c2d0$e3cb8490@neil>

> Using a place on the local machine of the user makes more sense to me.
>
> But excuse my ingorance: I've just 'grep'ed through the Python 1.6a2
> sources and also through Mark Hammonds Win32 Python extension c++
> sources (here on my Notebook running Linux) and found nothing called
> 'SHGetSpecialFolderPath'.  So I believe, this API is currently not
> exposed to the Python level.  Right?

   Only through the Win32 Python extensions, I think:

>>> from win32com.shell import shell
>>> from win32com.shell import shellcon
>>> shell.SHGetSpecialFolderPath(0, shellcon.CSIDL_APPDATA)
u'G:\\Documents and Settings\\Neil1\\Application Data'
>>> shell.SHGetSpecialFolderPath(0, shellcon.CSIDL_LOCAL_APPDATA)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: CSIDL_LOCAL_APPDATA
>>> shell.SHGetSpecialFolderPath(0, 0x1c)
u'G:\\Documents and Settings\\Neil1\\Local Settings\\Application Data'

   Looks like CSIDL_LOCAL_APPDATA isn't included yet, but its value is 0x1c.
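
   (So a 'userprefs'-style helper might look roughly like this -- the
CSIDL_APPDATA value 0x1a comes from the platform headers, and the
fallback branch is only a guess:)

    import os

    CSIDL_APPDATA = 0x1a          # roaming "Application Data"
    CSIDL_LOCAL_APPDATA = 0x1c    # per-machine "Application Data"

    def get_appdata_root():
        # Prefer the Windows shell folder; fall back to $HOME, then cwd.
        try:
            from win32com.shell import shell
            return shell.SHGetSpecialFolderPath(0, CSIDL_APPDATA)
        except ImportError:
            return os.environ.get("HOME") or os.getcwd()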

   Neil



From effbot@telia.com  Wed May 31 15:05:41 2000
From: effbot@telia.com (Fredrik Lundh)
Date: Wed, 31 May 2000 16:05:41 +0200
Subject: [Python-Dev] Q: maybe rlcompleter shouldn't expose __builtins__?
Message-ID: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>

from comp.lang.python:

> Thanks for the info.  This choice of name is very confusing, to say the least.
> I used commandline completion with __buil TAB, and got __builtins__.

a simple way to avoid this problem is to change global_matches
in rlcompleter.py so that it doesn't return this name.  I suggest
changing:

                if word[:n] == text:

to

                if word[:n] == text and word != "__builtins__":
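
for context, here's a rough sketch of how that filter would sit inside
the completion loop (the surrounding names -- words, matches, n, text --
are assumed here, not quoted from rlcompleter):

    # sketch only: skip the implicit __builtins__ binding so that
    # tab completion no longer offers it.
    for word in words:
        if word[:n] == text and word != "__builtins__":
            matches.append(word)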

Comments?

(should we do a series of double-blind tests first? ;-)

</F>

    "People Propose, Science Studies, Technology Conforms"
    -- Don Norman




From fdrake@acm.org  Wed May 31 15:32:27 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 10:32:27 -0400 (EDT)
Subject: [Python-Dev] Q: maybe rlcompleter shouldn't expose __builtins__?
In-Reply-To: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>
References: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>
Message-ID: <14645.8827.104869.733028@cj42289-a.reston1.va.home.com>

Fredrik Lundh writes:
 > a simple way to avoid this problem is to change global_matches
 > in rlcompleter.py so that it doesn't return this name.  I suggest
 > changing:

  I've made the change in both global_matches() and attr_matches(); we
don't want to see it as a module attribute any more than as a global.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>



From gstein@lyra.org  Wed May 31 16:04:20 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 31 May 2000 08:04:20 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <3934A8EB.6608B0E1@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>

[ The correct forum is probably meta-sig. ]

IMO, I don't see a need for yet another forum. The dividing lines become a
bit too blurry, and it will result in questions like "where do I post
this?" Or "what is the difference between python-lang@python.org and
python-list@python.org?"

Cheers,
-g

On Wed, 31 May 2000, Paul Prescod wrote:
> I think that we need a forum somewhere between comp.lang.python and
> pythondev. Let's call it python-lang.
> 
> By virtue of being buried on the "sigs" page, python-lang would be
> mostly only accessible to those who have more than a cursory interest in
> Python. Furthermore, you would have to go through a simple
> administration procedure to join, as you do with any mailman list.
> 
> Appropriate topics of python-lang would be new ideas about language
> features. Participants would be expected and encouraged to use archives
> and FAQs to avoid repetitive topics. Particularly verboten would be
> "ritual topics": indentation, case sensitivity, integer division,
> language comparisons, etc. These discussions would be redirected loudly
> and firmly to comp.lang.python.
> 
> Python-dev would remain invitation only but it would focus on the day to
> day mechanics of getting new versions of Python out the door.
> 
> -- 
>  Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
> "Hardly anything more unwelcome can befall a scientific writer than 
> having the foundations of his edifice shaken after the work is 
> finished.  I have been placed in this position by a letter from 
> Mr. Bertrand Russell..." 
>  - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/



From fdrake@acm.org  Wed May 31 16:09:03 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 11:09:03 -0400 (EDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <019b01bfcaca$fd72ffc0$e3cb8490@neil>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
 <1252421881-3397332@hypernet.com>
 <019b01bfcaca$fd72ffc0$e3cb8490@neil>
Message-ID: <14645.11023.118707.176016@cj42289-a.reston1.va.home.com>

Neil Hodgson writes:
 > roaming work). If Unix code expects $HOME to be per machine (and so used to
 > store, for example, window locations which are dependent on screen
 > resolution) then CSIDL_LOCAL_APPDATA would be a better choice.

  This makes me think that there's a need for both per-host and
per-user directories, but I don't know of a good strategy for dealing
with this in general.  Many applications have both kinds of data, but
clump it all together.  What "the norm" is on Unix, I don't really
know, but what I've seen is typically that /home/ is often mounted
over NFS, and so shared for many hosts.  I've seen it always be local
as well, which I find really annoying, but it is easier to support
host-local information.  The catch is that very little information is
*really* host-local, especially using X11 (where window
configurations are display-local at most, and the user may prefer them
to be display-size-local ;).
  What it boils down to is that doing too much before the separations
can be easily maintained is premature; a lot of that separation needs
to be handled inside the application, which knows what information is
user-specific and what *might* be host- or display-specific.  Trying
to provide these abstractions in the standard library is likely to
yield something hard to use if sufficient generality is also provided.

I wrote:
 >  Look at your $HOME on Unix box; most of the dotfiles are *files*, not
 > directories, and that's all most applications need;

And Neil commented:
 >    This may have been the case in the past and for people who understand
 > Unix well enough to maintain it, but for us just-want-it-to-run folks, it's
 > no longer true. I formatted my Linux partition this week and installed Red
 > Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
 > outnumber the dot files 18 to 16.

  Interesting!  But I suspect this is still very dependent on what
software you actually use as well; just because something is placed
there in your "standard" install doesn't mean it's useful.  It might
be more interesting to check after you've used that installation for a
year!  Lots of programs add dotfiles on an as-needed basis, and others
never create them, but require the user to create them using a text
editor (though the latter seems to be falling out of favor in these
days of GUI applications!).


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com



From mal@lemburg.com  Wed May 31 17:18:49 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 31 May 2000 18:18:49 +0200
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
Message-ID: <39353B69.D6E74E2C@lemburg.com>

Would there be interest in adding the python-ldap module
(http://sourceforge.net/project/?group_id=2072) to the
core distribution ?

If yes, I think we should approach David Leonard and
ask him if he is willing to donate the lib (which is
in the public domain) to the core.

FYI, LDAP is a well accepted standard network protocol for
querying address and user information.

An older web page with more background is available at: 

   http://www.it.uq.edu.au/~leonard/dc-prj/ldapmodule/

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From paul@prescod.net  Wed May 31 17:24:45 2000
From: paul@prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 11:24:45 -0500
Subject: [Python-Dev] What's that sound?
Message-ID: <39353CCD.1F3E9A0B@prescod.net>

ActiveState announces four new Python-related projects (PythonDirect,
Komodo, Visual Python, ActivePython).

PythonLabs announces four planet-sized-brains are going to be working on
the Python implementation full time.

PythonWare announces PythonWorks.

Is that the sound of pieces falling into place or of a rumbling
avalanche "warming up" before obliterating everything in its path?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International


From gstein@lyra.org  Wed May 31 17:30:57 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 31 May 2000 09:30:57 -0700 (PDT)
Subject: [Python-Dev] What's that sound?
In-Reply-To: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310928270.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> ActiveState announces four new Python-related projects (PythonDirect,
> Komodo, Visual Python, ActivePython).
> 
> PythonLabs announces four planet-sized-brains are going to be working on
> the Python implementation full time.

Five.

> PythonWare announces PythonWorks.
> 
> Is that the sound of pieces falling into place or of a rumbling
> avalanche "warming up" before obliterating everything in its path?

Full-on, robot chubby earthquake.

:-)

I agree with the basic premise: Python *is* going to get a lot more
visibility than it has enjoyed in the past. You might even add that the
latest GNOME release (1.2) has excellent Python support.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From paul@prescod.net  Wed May 31 17:35:23 2000
From: paul@prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 11:35:23 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
Message-ID: <39353F4B.3E78E22E@prescod.net>

Greg Stein wrote:
> 
> [ The correct forum is probably meta-sig. ]
> 
> IMO, I don't see a need for yet another forum. The dividing lines become a
> bit too blurry, and it will result in questions like "where do I post
> this?" Or "what is the difference between python-lang@python.org and
> python-list@python.org?"

Well, you admit that you don't read python-list, right? Most of us
don't, most of the time. Instead we have important discussions about the
language's future on python-dev, where most of the Python community
cannot participate. I'll say it flat out: I'm uncomfortable with that. I
did not include meta-sig (or python-list) because my issue is really
with the accidental elitism of the python-dev setup. If python-dev
participants do not agree to have important linguistic discussions in an
open forum, then setting up the forum is a waste of time. That's why I'm
feeling people here out first.
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International


From gmcm@hypernet.com  Wed May 31 17:54:22 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Wed, 31 May 2000 12:54:22 -0400
Subject: [Python-Dev] What's that sound?
In-Reply-To: <Pine.LNX.4.10.10005310928270.30220-100000@nebula.lyra.org>
References: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <1252330400-4656058@hypernet.com>

[Paul Prescod]
> > PythonLabs announces four planet-sized-brains are going to be
> > working on the Python implementation full time.
[Greg] 
> Five.

No, he said "planet-sized-brains", not "planet-sized-egos".

Just notice how long it takes Barry to figure out who I meant....

- Gordon


From bwarsaw@python.org  Wed May 31 17:56:06 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 12:56:06 -0400 (EDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
References: <39353B69.D6E74E2C@lemburg.com>
Message-ID: <14645.17446.749848.895965@anthem.python.org>

>>>>> "M" == M  <mal@lemburg.com> writes:

    M> Would there be interest in adding the python-ldap module
    M> (http://sourceforge.net/project/?group_id=2072) to the
    M> core distribution ?

I haven't looked at this stuff, but yes, I think a standard LDAP
module would be quite useful.  It's a well enough established
protocol, and it would be good to be able to count on it "being
there".

-Barry


From bwarsaw@python.org  Wed May 31 17:58:51 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 12:58:51 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <14645.17611.318538.986772@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul@prescod.net> writes:

    PP> Is that the sound of pieces falling into place or of a
    PP> rumbling avalanche "warming up" before obliterating everything
    PP> in its path?

Or a big foot hurtling its way earthward?  The question is, what's
that thing under the shadow of the big toe?  I can only vaguely make
out the first of four letters, and I think it's a `P'.

:)

-Barry


From gstein@lyra.org  Wed May 31 17:59:10 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 31 May 2000 09:59:10 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39353F4B.3E78E22E@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310945570.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> Greg Stein wrote:
> > [ The correct forum is probably meta-sig. ]
> > 
> > IMO, I don't see a need for yet another forum. The dividing lines become a
> > bit too blurry, and it will result in questions like "where do I post
> > this?" Or "what is the difference between python-lang@python.org and
> > python-list@python.org?"
> 
> Well, you admit that you don't read python-list, right?

Hehe... you make it sound like I'm a criminal on trial :-)

"And do you admit that you don't read that newsgroup? And do you admit
that you harbor irregular thoughts towards c.l.py posters? And do you
admit to obscene thoughts about Salma Hayek?"

Well, yes, no, and damn straight. :-)

> Most of us
> don't, most of the time. Instead we have important discussions about the
> language's future on python-dev, where most of the Python community
> cannot participate. I'll say it flat out: I'm uncomfortable with that. I

I share that concern, and raised it during the formation of python-dev. It
appears that the pipermail archive is truncated (nothing before April last
year). Honestly, though, I would have to say that I am/was more concerned
with the *perception* rather than actual result.

> did not include meta-sig (or python-list) because my issue is
> really with the accidental elitism of the python-dev setup. If

I disagree with the term "accidental elitism." I would call it "purposeful
meritocracy." The people on python-dev have shown over the span of *years*
that they are capable developers, designers, and have a genuine interest
and care about Python's development. Based on each person's merits, Guido
invited them to participate in this forum.

Perhaps "guido-advisors" would be more appropriately named, but I don't
think Guido likes to display his BDFL status more than necessary :-)

> python-dev participants do not agree to have important linguistic
> discussions in an open forum then setting up the forum is a waste of
> time. That's why I'm feeling people here out first.

Personally, I like the python-dev setting. The noise here is zero. There
are some things that I'm not particularly interested in, thus I pay much
less attention to them, but those items are never noise. I *really* like
that aspect, and would not care to start arguing about language
development in a larger forum where noise, spam, uninformed opinions, and
subjective discussions take place.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From fdrake@acm.org  Wed May 31 18:04:13 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 13:04:13 -0400 (EDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
In-Reply-To: <39353B69.D6E74E2C@lemburg.com>
References: <39353B69.D6E74E2C@lemburg.com>
Message-ID: <14645.17933.810181.300650@cj42289-a.reston1.va.home.com>

M.-A. Lemburg writes:
 > Would there be interest in adding the python-ldap module
 > (http://sourceforge.net/project/?group_id=2072) to the
 > core distribution ?

  Probably!  ACAP (Application Configuration Access Protocol) would be
nice as well -- anybody working on that?

 > FYI, LDAP is a well accepted standard network protocol for
 > querying address and user information.

  And lots of other stuff as well.  Jeremy and I contributed to a
project where it was used to store network latency information.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com



From paul@prescod.net  Wed May 31 18:10:58 2000
From: paul@prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:10:58 -0500
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net> <14645.17611.318538.986772@anthem.python.org>
Message-ID: <393547A2.30CB7113@prescod.net>

"Barry A. Warsaw" wrote:
> 
> Or a big foot hurtling its way earthward?  The question is, what's
> that thing under the shadow of the big toe?  I can only vaguely make
> out the first of four letters, and I think it's a `P'.

Look closer, big-egoed-four-stringed-guitar-playing-one. It could just
as easily be a J.

And you know what you get when you put P and J together?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International


From paul@prescod.net  Wed May 31 18:21:56 2000
From: paul@prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:21:56 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310945570.30220-100000@nebula.lyra.org>
Message-ID: <39354A34.88B8B6ED@prescod.net>

Greg Stein wrote:
> 
> Hehe... you make it sound like I'm a criminal on trial :-)

Sorry about that. But I'll bet you didn't expect this inquisition did
you?

> I share that concern, and raised it during the formation of python-dev. It
> appears that the pipermail archive is truncated (nothing before April last
> year). Honestly, though, I would have to say that I am/was more concerned
> with the *perception* rather than actual result.

Right, that perception is making people in comp.lang.python get a little
frustrated, paranoid, alienated and nasty. And relaying conversations
from here to there and back puts Fredrik in a bad mood which isn't good
for anyone.

> > did not include meta-sig (or python-list) because my issue is
> > really with the accidental elitism of the python-dev setup. If
> 
> I disagree with the term "accidental elitism." I would call it "purposeful
> meritocracy." 

The reason I think that it is accidental is because I don't think that
anyone expected so many of us to abandon comp.lang.python and thus our
direct connection to Python's user base. It just happened that way due
to human nature. That forum is full of stuff that you or I don't care
about -- compiling on AIX, ADO programming on Windows, Perl idioms, LDAP
(oops, that's here!) etc, and this one is noise-free. I'm saying that we
could have a middle ground where we trade a little noise for a little
democracy -- if only in perception.

I think that perl-porters and linux-kernel are open lists? The dictators
and demigods just had to learn to filter a little. By keeping
"python-dev" for immediately important things and implementation
details, we will actually make it easier to get the day to day pumpkin
passing done.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International


From bwarsaw@python.org  Wed May 31 18:28:04 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 13:28:04 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
 <1252330400-4656058@hypernet.com>
Message-ID: <14645.19364.259837.684595@anthem.python.org>

>>>>> "Gordo" == Gordon McMillan <gmcm@hypernet.com> writes:

    Gordo> No, he said "planet-sized-brains", not "planet-sized-egos".

    Gordo> Just notice how long it takes Barry to figure out who I
    Gordo> meant....

Waaaaiitt a second....

I /do/ have a very large brain.  I keep it in a jar on the headboard
of my bed, surrounded by a candlelit homage to Geddy Lee.  How else do
you think I got so studly playing bass?


From bwarsaw@python.org  Wed May 31 18:35:36 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 13:35:36 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
 <14645.17611.318538.986772@anthem.python.org>
 <393547A2.30CB7113@prescod.net>
Message-ID: <14645.19816.256896.367440@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul@prescod.net> writes:

    PP> Look closer, big-egoed-four-stringed-guitar-playing-one. It
    PP> could just as easily be a J.

<squint> Could be!  The absolute value of my diopter is about as big
as my ego.

    PP> And you know what you get when you put P and J together?

A very tasty sammich!

-Barry


From paul@prescod.net  Wed May 31 18:45:30 2000
From: paul@prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:45:30 -0500
Subject: [Python-Dev] SIG: python-lang
References: <3934A8EB.6608B0E1@prescod.net>
 <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org> <14645.16679.139843.148933@anthem.python.org>
Message-ID: <39354FBA.E1DEFEFA@prescod.net>

"Barry A. Warsaw" wrote:
> 
> ...
>
> I agree.  I think anybody who'd be interested in python-lang is
> already going to be a member of python-dev 

Huh? What about Greg Ewing, Amit Patel, Martijn Faassen, William
Tanksley, Mike Fletcher, Neel Krishnaswami, the various stackless
groupies and a million others. This is just a short list of people who
have made reasonable language suggestions recently. Those suggestions
are going into the bit-bucket unless one of us happens to notice and
champion it here. But we're too busy thinking about 1.6 to think about
long-term ideas anyhow.

Plus, we hand down decisions (about e.g. string.join) and they have the
exact, parallel discussion over there. All the while, anyone from
PythonDev is telling them: "We've already been through this stuff. We've
already discussed this" -- which only (understandably) annoys them more.

> and any discussion will
> probably be crossposted to the point where it makes no difference.

I think that python-dev's role should change. I think that it would
handle day to day implementation stuff -- nothing long term. I mean if
the noise level on python-lang was too high then we could retreat to
python-dev again but I'd like to think we wouldn't have to. A couple of
sharp words from Guido or Tim could end a flamewar pretty quickly.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International


From esr@thyrsus.com  Wed May 31 18:53:10 2000
From: esr@thyrsus.com (Eric S. Raymond)
Date: Wed, 31 May 2000 13:53:10 -0400
Subject: [Python-Dev] What's that sound?
In-Reply-To: <14645.19364.259837.684595@anthem.python.org>; from bwarsaw@python.org on Wed, May 31, 2000 at 01:28:04PM -0400
References: <39353CCD.1F3E9A0B@prescod.net> <1252330400-4656058@hypernet.com> <14645.19364.259837.684595@anthem.python.org>
Message-ID: <20000531135310.B29319@thyrsus.com>

Barry A. Warsaw <bwarsaw@python.org>:
> Waaaaiitt a second....
> 
> I /do/ have a very large brain.  I keep it in a jar on the headboard
> of my bed, surrounded by a candlelit homage to Geddy Lee.  How else do
> you think I got so studly playing bass?

Ah, yes.  We take you back now to that splendid year of 1978.  Cue a
certain high-voiced Canadian singing

	The trouble with the Perl guys
	is they're quite convinced they're right...

Duuude....
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The world is filled with violence. Because criminals carry guns, we
decent law-abiding citizens should also have guns. Otherwise they will
win and the decent people will lose.
        -- James Earl Jones


From esr@snark.thyrsus.com  Wed May 31 19:05:33 2000
From: esr@snark.thyrsus.com (Eric S. Raymond)
Date: Wed, 31 May 2000 14:05:33 -0400
Subject: [Python-Dev] Constants
Message-ID: <200005311805.OAA29447@snark.thyrsus.com>

I just looked at Jeremy Hylton's warts posting
at <http://starship.python.net/crew/amk/python/writing/warts.html>

It reminded me that one feature I really, really want in Python 3000
is the ability to declare constants.  Assigning to a constant should 
raise an error.

Is this on the to-do list?
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

What, then is law [government]? It is the collective organization of
the individual right to lawful defense."
	-- Frederic Bastiat, "The Law"


From petrilli@amber.org  Wed May 31 19:17:57 2000
From: petrilli@amber.org (Christopher Petrilli)
Date: Wed, 31 May 2000 14:17:57 -0400
Subject: [Python-Dev] Constants
In-Reply-To: <200005311805.OAA29447@snark.thyrsus.com>; from esr@snark.thyrsus.com on Wed, May 31, 2000 at 02:05:33PM -0400
References: <200005311805.OAA29447@snark.thyrsus.com>
Message-ID: <20000531141757.E5766@trump.amber.org>

Eric S. Raymond [esr@snark.thyrsus.com] wrote:
> I just looked at Jeremy Hylton's warts posting
> at <http://starship.python.net/crew/amk/python/writing/warts.html>
> 
> It reminded me that one feature I really, really want in Python 3000
> is the ability to declare constants.  Assigning to a constant should 
> raise an error.
> 
> Is this on the to-do list?

I know this isn't "perfect", but what I often do is have a
Constants.py file that has all my constants in a class which has
__setattr__ overridden to raise an exception.  This has two benefits:

    1. It makes the attributes difficult to modify, at least accidentally.
    2. It keeps the namespace less polluted by thousands of constants.

Just an idea; I then do this:

     constants = Constants()
     x = constants.foo

Seems clean (reasonably) to me.

I think I stole this from the timbot.
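
A minimal sketch of that idiom (the class and attribute names here are
made up, not taken from any actual Constants.py):

    class Constants:
        # Define the constants as class attributes.
        SPAM = 1
        EGGS = 2

        def __setattr__(self, name, value):
            # Block (accidental) rebinding through an instance.
            raise TypeError("constants are read-only: " + name)

    constants = Constants()
    x = constants.SPAM        # fine
    constants.SPAM = 42       # raises TypeError

Rebinding the class attribute directly (Constants.SPAM = ...) is still
possible, which matches the "difficult, at least accidentally" framing
above.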

Chris
-- 
| Christopher Petrilli
| petrilli@amber.org


From jeremy@beopen.com  Wed May 31 19:07:18 2000
From: jeremy@beopen.com (Jeremy Hylton)
Date: Wed, 31 May 2000 14:07:18 -0400 (EDT)
Subject: [Python-Dev] Constants
In-Reply-To: <200005311805.OAA29447@snark.thyrsus.com>
References: <200005311805.OAA29447@snark.thyrsus.com>
Message-ID: <14645.21718.365823.507322@localhost.localdomain>

Correction: It's Andrew Kuchling's list of language warts.  I
mentioned it in a post on Slashdot, where I ventured a guess that the
most substantial change most new users will see with Python 3000 will
be the removal of these warts.

Jeremy


From akuchlin@mems-exchange.org  Wed May 31 19:21:04 2000
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 31 May 2000 14:21:04 -0400
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <14645.20637.864287.86178@localhost.localdomain>; from jeremy@beopen.com on Wed, May 31, 2000 at 01:49:17PM -0400
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org> <39353F4B.3E78E22E@prescod.net> <14645.20637.864287.86178@localhost.localdomain>
Message-ID: <20000531142104.A8989@amarok.cnri.reston.va.us>

On Wed, May 31, 2000 at 01:49:17PM -0400, Jeremy Hylton wrote:
>I'm actually more worried about the second.  It's been a while since I
>read c.l.py and I'm occasionally disappointed to miss out on
>seemingly interesting threads.  On the other hand, there is no way I
>could manage to read or even filter the volume on that list.

Really?  I read it through Usenet with GNUS, and it takes about a half
hour to go through everything. Skipping threads by subject usually
makes it easy to avoid uninteresting stuff.  

I'd rather see python-dev limited to very narrow, CVS-tree-related
material, such as: should we add this module?  is this change OK?
&c...  The long-winded language speculation threads are better left to
c.l.python, where more people offer opinions, it's more public, and
newsreaders are more suited to coping with the volume.  (Incidentally,
has any progress been made on reviving c.l.py.announce?)

OTOH, newbies have reported fear of posting in c.l.py, because they
feel the group is too advanced, what with everyone sitting around
talking about coroutines and SNOBOL string parsing.  But I think it's
a good thing if newbies see the high-flown chatter and get their minds
stretched. :)

--amk


From gstein@lyra.org  Wed May 31 19:37:32 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 31 May 2000 11:37:32 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39354A34.88B8B6ED@prescod.net>
Message-ID: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> Greg Stein wrote:
> > 
> > Hehe... you make it sound like I'm a criminal on trial :-)
> 
> Sorry about that. But I'll bet you didn't expect this inquisition did
> you?

Well, of course not. Nobody expects the Spanish Inquisition!

Hmm. But you're not Spanish. Dang...

> > I share that concern, and raised it during the formation of python-dev. It
> > appears that the pipermail archive is truncated (nothing before April last
> > year). Honestly, though, I would have to say that I am/was more concerned
> > with the *perception* rather than actual result.
> 
> Right, that perception is making people in comp-lang-python get a little
> frustrated, paranoid, alienated and nasty. And relaying conversations
> from here to there and back puts Fredrik in a bad mood which isn't good
> for anyone.

Understood. I don't have a particular solution to the problem, but I also
believe that python-lang is not going to be a benefit/solution.

Hmm. How about this: you stated the premise is to generate proposals for
language features, extensions, additions, whatever. If that is the only
goal, then consider a web-based system: anybody can post a "feature" with
a description/spec/code/whatever; each feature has threaded comments
attached to it; the kicker: each feature has votes (+1/+0/-0/-1).

When you have a feature with a total vote of +73, then you know that it
needs to be looked at in more detail. All votes are open (not anonymous).
Features can be revised, in an effort to remedy issues raised by -1
voters (and thus turn them into +1 votes).

People can review features and votes in a quick pass. If they prefer to
take more time, then they can also review comments.

Of course, this is only a suggestion. I've got so many other projects
that I'd like to code up right now that I would not want to sign up for
something like this :-)

> > > did not include meta-sig (or python-list) because my issue is
> > > really with the accidental elitism of the python-dev setup. If
> > 
> > I disagree with the term "accidental elitism." I would call it "purposeful
> > meritocracy." 
> 
> The reason I think that it is accidental is because I don't think that
> anyone expected so many of us to abandon comp.lang.python and thus our
> direct connection to Python's user base.

Good point.

I would still disagree with your "elitism" term, but the side-effect is
definitely accidental and unfortunate. It may even be arguable whether
python-dev *is* responsible for that. The SIGs had much more traffic
before python-dev, too. I might suggest that the SIGs were the previous
"low-noise" forum (in favor of c.l.py). python-dev yanked focus from the
SIGs, and only a little from c.l.py (I think c.l.py's burgeoning traffic
reduced readership on its own).

> It just happened that way due
> to human nature. That forum is full of stuff that you or I don't care
> about -- compiling on AIX, ADO programming on Windows, Perl idioms, LDAP
> (oops, that's here!) etc, and this one is noise-free. I'm saying that we
> could have a middle ground where we trade a little noise for a little
> democracy -- if only in perception.

Admirable, but I think it would be ineffectual. People would be confused
about where to post. Too many forums, with arbitrary/unclear lines about
which to use.

How do you like your new job at DataChannel? Rate it on 1-100. "83" you
say? Well, why not 82? What is the difference between 82 and 83?

"Why does this post belong on c.l.py, and not on python-lang?"

The result will be cross-posting because people will want to ensure they
reach the right people/forum.

Of course, people will also post to the "wrong" forum. Confusion, lack of
care, whatever.

> I think that perl-porters and linux-kernel are open lists? The dictators
> and demigods just had to learn to filter a little. By keeping
> "python-dev" for immediately important things and implementation
> details, we will actually make it easier to get the day to day pumpkin
> passing done.

Yes, they are. And Dick Hardt has expressed the opinion that perl-porters
is practically useless. He was literally dumbfounded when I told him that
python-dev is (near) zero-noise.

The Linux guys filter very well. I don't know enough of, say, Alan's or
Linus' other mailing subscriptions to know whether that is the only thing
they subscribe to, or just one of many. I could easily see keeping up with
linux-kernel if that was your only mailing list. I also suspect there is
plenty of out-of-band mail going on between Linus and his "lieutenants"
when they forward patches to him (and his inevitable replies, rejections,
etc).

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From bwarsaw@python.org  Wed May 31 19:39:46 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 14:39:46 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <3934A8EB.6608B0E1@prescod.net>
 <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
 <14645.16679.139843.148933@anthem.python.org>
 <39354FBA.E1DEFEFA@prescod.net>
Message-ID: <14645.23666.161619.557413@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul@prescod.net> writes:

    PP> Plus, we hand down decisions (about e.g. string.join) and they
    PP> have the exact, parallel discussion over there. All the while,
    PP> anyone from PythonDev is telling them: "We've already been
    PP> through this stuff. We've already discussed this." which only
    PP> (understandably) annoys them more.

Good point.

    >> and any discussion will probably be crossposted to the point
    >> where it makes no difference.

    PP> I think that python-dev's role should change. I think that it
    PP> would handle day to day implementation stuff -- nothing long
    PP> term. I mean if the noise level on python-lang was too high
    PP> then we could retreat to python-dev again but I'd like to
    PP> think we wouldn't have to. A couple of sharp words from Guido
    PP> or Tim could end a flamewar pretty quickly.

Then I suggest moderating python-lang.  Would you (and/or others) be
willing to serve as moderators?  I'd support an open subscription
policy in that case.

-Barry


From pingster@ilm.com  Wed May 31 19:41:13 2000
From: pingster@ilm.com (Ka-Ping Yee)
Date: Wed, 31 May 2000 11:41:13 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
Message-ID: <Pine.SGI.3.96.1000531113831.1049307r-100000@happy>

On Wed, 31 May 2000, Greg Stein wrote:
> Hmm. How about this: you stated the premise is to generate proposals for
> language features, extensions, additions, whatever. If that is the only
> goal, then consider a web-based system: anybody can post a "feature" with
> a description/spec/code/whatever; each feature has threaded comments
> attached to it; the kicker: each feature has votes (+1/+0/-0/-1).

Gee, this sounds familiar.  (Hint: starts with an R and has seven
letters.)  Why are we using Jitterbug again?  Does anybody even submit
things there, and still check the Jitterbug indexes regularly?

Okay, Roundup doesn't have voting, but it does already have priorities
and colour-coded statuses, and voting would be trivial to add.


-- ?!ng



From gstein@lyra.org  Wed May 31 20:04:34 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 31 May 2000 12:04:34 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.SGI.3.96.1000531113831.1049307r-100000@happy>
Message-ID: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Ka-Ping Yee wrote:
> On Wed, 31 May 2000, Greg Stein wrote:
> > Hmm. How about this: you stated the premise is to generate proposals for
> > language features, extensions, additions, whatever. If that is the only
> > goal, then consider a web-based system: anybody can post a "feature" with
> > a description/spec/code/whatever; each feature has threaded comments
> > attached to it; the kicker: each feature has votes (+1/+0/-0/-1).
> 
> Gee, this sounds familiar.  (Hint: starts with an R and has seven
> letters.)  Why are we using Jitterbug again?  Does anybody even submit
> things there, and still check the Jitterbug indexes regularly?
> 
> Okay, Roundup doesn't have voting, but it does already have priorities
> and colour-coded statuses, and voting would be trivial to add.

Does Roundup have a web-based interface, where I can see all of the
features, their comments, and their votes? Can the person who posted the
original feature/spec update it? (or must they followup with a
modified proposal instead)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From bwarsaw@python.org  Wed May 31 20:12:23 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 15:12:23 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
 <39353F4B.3E78E22E@prescod.net>
 <14645.20637.864287.86178@localhost.localdomain>
 <20000531142104.A8989@amarok.cnri.reston.va.us>
Message-ID: <14645.25623.615735.836896@anthem.python.org>

>>>>> "AMK" == Andrew M Kuchling <akuchlin@cnri.reston.va.us> writes:

    AMK> more suited to coping with the volume.  (Incidentally, has
    AMK> any progress been made on reviving c.l.py.announce?)

Not that I'm aware of, sadly.

-Barry


From bwarsaw@python.org  Wed May 31 20:18:09 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 15:18:09 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
 <Pine.SGI.3.96.1000531113831.1049307r-100000@happy>
Message-ID: <14645.25969.657083.55499@anthem.python.org>

>>>>> "KY" == Ka-Ping Yee <pingster@ilm.com> writes:

    KY> Gee, this sounds familiar.  (Hint: starts with an R and has
    KY> seven letters.)  Why are we using Jitterbug again?  Does
    KY> anybody even submit things there, and still check the
    KY> Jitterbug indexes regularly?

Jitterbug blows.

    KY> Okay, Roundup doesn't have voting, but it does already have
    KY> priorities and colour-coded statuses, and voting would be
    KY> trivial to add.

Roundup sounded just so cool when ?!ng described it at the
conference.  I gotta find some time to look at it! :)

-Barry


From pingster@ilm.com  Wed May 31 20:24:07 2000
From: pingster@ilm.com (Ka-Ping Yee)
Date: Wed, 31 May 2000 12:24:07 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>
Message-ID: <Pine.SGI.3.96.1000531121936.1049307v-100000@happy>

On Wed, 31 May 2000, Greg Stein wrote:
> 
> Does Roundup have a web-based interface,

Yes.

> where I can see all of the
> features, their comments, and their votes?

At the moment, you see date of last activity, description,
priority, status, and fixer (i.e. person who has taken
responsibility for the item).  No votes, but as i said,
that would be really easy.

> Can the person who posted the original feature/spec update it?

Each item has a bunch of mail messages attached to it.
Anyone can edit the description, but that's a short one-line
summary; the only way to propose another design right now
is to send in another message.

Hey, i admit it's a bit primitive, but it seems significantly
better than nothing.  The software people at ILM have coped
with it fairly well for a year, and for the most part we like it.

Go play:  http://www.lfw.org/ping/roundup/roundup.cgi

Username: test  Password: test
Username: spam  Password: spam
Username: eggs  Password: eggs


-- ?!ng



From fdrake@acm.org  Wed May 31 20:58:13 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 15:58:13 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.SGI.3.96.1000531121936.1049307v-100000@happy>
References: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>
 <Pine.SGI.3.96.1000531121936.1049307v-100000@happy>
Message-ID: <14645.28373.733094.942361@cj42289-a.reston1.va.home.com>

Ka-Ping Yee writes:
 > On Wed, 31 May 2000, Greg Stein wrote:
 > > Can the person who posted the original feature/spec update it?
 > 
 > Each item has a bunch of mail messages attached to it.
 > Anyone can edit the description, but that's a short one-line
 > summary; the only way to propose another design right now
 > is to send in another message.

  I thought the roundup interface was quite nice, esp. with the nosy
lists and such.  I'm sure there are a number of small issues, but
nothing Ping can't deal with in a matter of minutes.  ;)
  One thing that might need further consideration is that a feature
proposal may need a slightly different sort of support; it makes more
sense to include more than the one-liner summary, and that should be
modifiable as discussions show adjustments may be needed.  That might
be doable by adding a URL to an external document rather than
including the summary in the issues database.
  I'd love to get rid of the Jitterbug thing!


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com



From paul@prescod.net  Wed May 31 21:52:38 2000
From: paul@prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 15:52:38 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
Message-ID: <39357B96.E819537F@prescod.net>

Greg Stein wrote:
> 
> ...
>
> People can review features and votes in a quick pass. If they prefer to
> take more time, then they can also review comments.

I like this idea for its persistence but I'm not convinced that it
serves the same purpose as the give and take of a mailing list with many
subscribers.
 
> Admirable, but I think it would be ineffectual. People would be confused
> about where to post. Too many forums, with arbitrary/unclear lines about
> which to use.

To me, they are clear:

 * anything Python related can go to comp.lang.python, but many people
will not read it.

 * anything that belongs to a particular SIG goes to that sig.

 * any feature suggestions/debates that do not go in a particular SIG
(especially things related to the core language) go to python-lang

 * python-dev is for any message that has the words "CVS", "patch",
"memory leak", "reference count" etc. in it. It is for implementing the
design that Guido refines out of the rough and tumble of python-lang.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
At the same moment that the Justice Department and the Federal Trade 
Commission are trying to restrict the negative consequences of 
monopoly, the Commerce Department and the Congress are helping to 
define new intellectual property rights, rights that have a 
significant potential to create new monopolies. This is the policy 
equivalent of arm-wrestling with yourself.
	- http://www.salon.com/tech/feature/2000/04/07/greenspan/index.html


From gstein@lyra.org  Wed May 31 22:53:13 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 31 May 2000 14:53:13 -0700 (PDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
In-Reply-To: <14645.17446.749848.895965@anthem.python.org>
Message-ID: <Pine.LNX.4.10.10005311452150.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Barry A. Warsaw wrote:
> >>>>> "M" == M  <mal@lemburg.com> writes:
> 
>     M> Would there be interest in adding the python-ldap module
>     M> (http://sourceforge.net/project/?group_id=2072) to the
>     M> core distribution ?
> 
> I haven't looked at this stuff, but yes, I think a standard LDAP
> module would be quite useful.  It's a well enough established
> protocol, and it would be good to be able to count on it "being
> there".

My WebDAV module implements an established protocol (an RFC tends to do
that :-), but the API within the module is still in flux (IMO).

Is the LDAP module's API pretty solid? Is it changing?

And is this module a C extension, or a pure Python implementation?

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From gmcm@hypernet.com  Wed May 31 22:58:04 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Wed, 31 May 2000 17:58:04 -0400
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39357B96.E819537F@prescod.net>
Message-ID: <1252312176-5752122@hypernet.com>

Paul Prescod  wrote:

> Greg Stein wrote:
> > Admirable, but I think it would be ineffectual. People would be
> > confused about where to post. Too many forums, with
> > arbitrary/unclear lines about which to use.
> 
> To me, they are clear:

Of course they are ;-). While something doesn't seem right 
about the current set up, and c.l.py is still remarkably 
civilized, the fact is that the hotheads who say "I'll never use 
Python again if you do something as brain-dead as [ case-
insensitivity | require (host, addr) tuples | ''.join(list) | ... ]" will 
post their sentiments to every available outlet.
 
I agree the shift of some of these syntax issues from python-
dev to c.l.py was ugly, but the truth is that:
 - no new arguments came from c.l.py
 - the c.l.py discussion was much more emotional
 - you can't keep out the riff-raff without inviting reasonable 
accusations of elitism
 - the vast majority of, erm, "grass-roots" syntax proposals are 
absolutely horrid.

(As you surely know, Paul, from your types-SIG tenure; 
proposing syntax changes without the slightest intention of 
putting any effort into them is a favorite activity of posters.)



- Gordon


From klm@digicool.com  Wed May 31 23:03:31 2000
From: klm@digicool.com (Ken Manheimer)
Date: Wed, 31 May 2000 18:03:31 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <643145F79272D211914B0020AFF640190BAA8F@gandalf.digicool.com>
Message-ID: <Pine.LNX.4.21.0005311755420.9023-100000@korak.digicool.com>

On Wed, 31 May 2000, Ka-Ping Yee wrote:

> Hey, i admit it's a bit primitive, but it seems significantly
> better than nothing.  The software people at ILM have coped
> with it fairly well for a year, and for the most part we like it.

I'm not sure about the requirements - particularly, submissions and
correspondence about bugs via email, which my zope "tracker" doesn't do -
but the tracker may be worth looking at, also:

  http://www.zope.org/Members/klm/SoftwareCarpentry

(See the prototype tracker, mentioned there, or my "tracker tracker" at
http://www.zope.org/Members/klm/Tracker , for flagrant, embarrassing
exposure of the outstanding tracker complaints...)

(I haven't had the time to take care of the tracker as i would like, or to
look at how tracker and roundup could inform each other - but i haven't
even gotten as far as examining that.  I get the feeling they take fairly
different approaches - which could mean neat synergy, or total
disconnection.  Ping, any thoughts?)

Ken
klm@digicool.com



From fdrake@acm.org  Wed May 31 23:17:02 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 18:17:02 -0400 (EDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
In-Reply-To: <Pine.LNX.4.10.10005311452150.30220-100000@nebula.lyra.org>
References: <14645.17446.749848.895965@anthem.python.org>
 <Pine.LNX.4.10.10005311452150.30220-100000@nebula.lyra.org>
Message-ID: <14645.36702.769794.807329@cj42289-a.reston1.va.home.com>

Greg Stein writes:
 > My WebDAV module implements an established protocol (an RFC tends to do
 > that :-), but the API within the module is still in flux (IMO).

  I'd love to see this sort of thing added to the standard library,
esp. once packages are used there.  Especially if the implementation
is pure Python (which I think your WebDAV stuff is, right?)

 > Is the LDAP module's API pretty solid? Is it changing?

  This I don't know.

 > And is this module a C extension, or a pure Python implementation?

  Mixed, I think.  There is definitely a C component.  I'd rather it
be pure Python, but I think it's a SWIGged wrapper around a C client
library.
  Is anyone talking to the developer about this yet?


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com



From gstein@lyra.org  Wed May 31 23:27:04 2000
From: gstein@lyra.org (Greg Stein)
Date: Wed, 31 May 2000 15:27:04 -0700 (PDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
In-Reply-To: <14645.36702.769794.807329@cj42289-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005311522480.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Fred L. Drake, Jr. wrote:
> Greg Stein writes:
>  > My WebDAV module implements an established protocol (an RFC tends to do
>  > that :-), but the API within the module is still in flux (IMO).
> 
>   I'd love to see this sort of thing added to the standard library,
> esp. once packages are used there.  Especially if the implementation
> is pure Python (which I think your WebDAV stuff is, right?)

davlib.py is pure Python, building upon my upgraded httplib.py and
xml.utils.qp_xml (and pyexpat)

[ and recall my email last week that I've updated httplib.py and posted it
  to my web pages; it is awaiting review for integration into the Python
  core; it still needs docs and more testing scenarios, tho

  http://www.python.org/pipermail/python-dev/2000-May/005643.html
]

davlib will probably be a 1.7 item. It still needs some heavy work to
easily deal with authentication (which is usually going to be required for
DAV operations).

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From mhammond@skippinet.com.au  Wed May 31 23:53:15 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 1 Jun 2000 08:53:15 +1000
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.LNX.4.21.0005311755420.9023-100000@korak.digicool.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBAEAKCMAA.mhammond@skippinet.com.au>

[Ken writes]
> I'm not sure about the requirements - particularly, submissions and
> correspondence about bugs via email, which my zope "tracker" doesn't do -
> but the tracker may be worth looking at, also:
>
>   http://www.zope.org/Members/klm/SoftwareCarpentry
>

Another alternative could be Bugzilla:

http://bugzilla.mozilla.org/

Sources at:

http://www.mozilla.org/bugs/source.html

It has many of the features people seem to want, and obviously supports
large projects - which may also be its biggest problem: it may be too
"heavy" for our requirements...

Mark.



From tim_one at email.msn.com  Mon May  1 08:31:05 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 1 May 2000 02:31:05 -0400
Subject: [Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306)
In-Reply-To: <NDBBKLNNJCFFMINBECLEOEBKCLAA.trentm@ActiveState.com>
Message-ID: <000001bfb336$d4f512a0$0f2d153f@tim>

[Guido]
> The email below is a serious bug report.  A quick analysis
> shows that UserString.count() calls the count() method on a string
> object, which calls PyArg_ParseTuple() with the format string "O|ii".
> The 'i' format code truncates integers.

For people unfamiliar w/ the details, let's be explicit:  the "i" code
implicitly converts a Python int (== a C long) to a C int (which has no
visible Python counterpart).  Overflow is not detected, so this is broken on
the face of it.

> It probably should raise an overflow exception instead.

Definitely.

> But that would still cause the test to fail -- just in a different
> way (more explicit).  Then the string methods should be fixed to
> use long ints instead -- and then something else would probably break...

Yup.  Seems inevitable.

[MAL]
> Since strings and Unicode objects use integers to describe the
> length of the object (as well as most if not all other
> builtin sequence types), the correct default value should
> thus be something like sys.maxlen which then gets set to
> INT_MAX.
>
> I'd suggest adding sys.maxlen and the modifying UserString.py,
> re.py and sre_parse.py accordingly.

I understand this, but hate it.  I'd rather get rid of the user-visible
distinction between the two int types already there, not add yet a third
artificial int limit.

[Guido]
> Hm, I'm not so sure.  It would be much better if passing sys.maxint
> would just WORK...  Since that's what people have been doing so far.

[Trent Mick]
> Possible solutions (I give 4 of them):
>
> 1. The 'i' format code could raise an overflow exception and the
> PyArg_ParseTuple() call in string_count() could catch it and truncate to
> INT_MAX (reasoning that any overflow of the end position of a
> string can be bound to INT_MAX because that is the limit for any string
> in Python).

There's stronger reason than that:  string_count's "start" and "end"
arguments are documented as "interpreted as in slice notation", and slice
notation with out-of-range indices is well defined in all cases:

    The semantics for a simple slicing are as follows. The primary
    must evaluate to a sequence object. The lower and upper bound
    expressions, if present, must evaluate to plain integers; defaults
    are zero and the sequence's length, respectively. If either bound
    is negative, the sequence's length is added to it. The slicing now
    selects all items with index k such that i <= k < j where i and j
    are the specified lower and upper bounds. This may be an empty
    sequence. It is not an error if i or j lie outside the range of
    valid indexes (such items don't exist so they aren't selected).

(From the Ref Man's section "Slicings")  That is, what string_count should
do is perfectly clear already (or will be, when you read that two more times
<wink>).  Note that you need to distinguish between positive and negative
overflow, though!

> Pros:
> - This "would just WORK" for usage of sys.maxint.
>
> Cons:
> -  This overflow exception catching should then reasonably be
> propagated to other similar functions (like string.endswith(), etc).

Absolutely, but they *all* follow from what "sequence slicing* is *always*
supposed to do in case of out-of-bounds indices.

> - We have to assume that the exception raised in the
> PyArg_ParseTuple(args, "O|ii:count", &subobj, &i, &last) call is for
> the second integer (i.e. 'last'). This is subtle and ugly.

Luckily <wink>, it's not that simple:  exactly the same treatment needs to
be given to both the optional "start" and "end" arguments, and in every
function that accepts optional slice indices.  So you write one utility
function to deal with all that, called in case PyArg_ParseTuple raises an
overflow error.
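
In Python terms, the clamping such a helper has to perform for each
optional index looks roughly like this (a sketch of the slice semantics
quoted above, not the actual C utility):

    def clamp_slice_index(i, length):
        # Mirror out-of-range slice handling: a negative index has the
        # length added to it, then everything is clamped into [0, length].
        if i < 0:
            i = i + length
            if i < 0:
                i = 0
        elif i > length:
            i = length
        return i

    clamp_slice_index(10**9, 6)      # -> 6   (positive overflow)
    clamp_slice_index(-10**9, 6)     # -> 0   (negative overflow)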

> Pro or Con:
> - Do we want to start raising overflow exceptions for other conversion
> formats (i.e. 'b' and 'h' and 'l', the latter *can* overflow on
> Win64 where sizeof(long) < size(void*))? I think this is a good idea
> in principle but may break code (even if it *does* identify bugs in that
> code).

The code this would break is already broken <0.1 wink>.

> 2. Just change the definitions of the UserString methods to pass
> a variable length argument list instead of default value parameters.
> For example change UserString.count() from:
>
>     def count(self, sub, start=0, end=sys.maxint):
>         return self.data.count(sub, start, end)
>
> to:
>
>     def count(self, *args):
>         return self.data.count(*args)
>
> The result is that the default value for 'end' is now set by
> string_count() rather than by the UserString implementation:
> ...

This doesn't solve anything -- users can (& do) pass sys.maxint explicitly.
That's part of what Guido means by "since that's what people have been doing
so far".

> ...
> Cons:
> - Does not fix the general problem of the (common?) usage of sys.maxint to
> mean INT_MAX rather than the actual LONG_MAX (this matters on 64-bit
> Unices).

Anyone using sys.maxint to mean INT_MAX is fatally confused; passing
sys.maxint as a slice index is not an instance of that confusion, it's just
relying on the documented behavior of out-of-bounds slice indices.

> 3. As MAL suggested: add something like sys.maxlen (set to INT_MAX) with
> breaks the logical difference with sys.maxint (set to LONG_MAX):
> ...

I hate this (see above).

> ...
> 4. Add something like sys.maxlen, but set it to SIZET_MAX (c.f.
> ANSI size_t type). It is probably not a biggie, but Python currently
> makes the assumption that strings never exceed INT_MAX in length.

It's not an assumption, it's an explicit part of the design:
PyObject_VAR_HEAD declares ob_size to be an int.  This leads to strain for
sure, partly because the *natural* limit on sizes is derived from malloc
(which indeed takes neither int nor long, but size_t), and partly because
Python exposes no corresponding integer type.  I vaguely recall that this
was deliberate, with the *intent* being to save space in object headers on
the upcoming 128-bit KSR machines <wink>.

> While this assumption is not likely to be proven false it technically
> could be on 64-bit systems.

Well, Guido once said he would take away Python's recursion overflow checks
just as soon as machines came with infinite memory <wink> -- 2Gb is a
reasonable limit for string length, and especially if it's a tradeoff
against increasing the header size for all string objects (it's almost
certainly more important to cater to oodles of small strings on smaller
machines than to one or two gigantic strings on huge machines).

> As well, when you start compiling on Win64 (where sizeof(int) ==
> sizeof(long) < sizeof(size_t)) then you are going to be annoyed
> by hundreds of warnings about implicit casts from size_t (64-bits) to
> int (32-bits) for every strlen, str*, fwrite, and sizeof call that
> you make.

Every place the code implicitly downcasts from size_t to int is plainly
broken today, so we *should* get warnings.  Python has been sloppy about
this!  In large part it's because Python was written before ANSI C, and
size_t simply wasn't supported at the time.  But as with all software,
people rarely go back to clean up; it's overdue (just be thankful you're not
working on the Perl source <0.9 wink>).

> Pros:
> - IMHO logically more correct.
> - Might clean up some subtle bugs.
> - Cleans up annoying and disconcerting warnings.
> - Will probably mean less pain down the road as 64-bit systems
> (esp. Win64) become more prevalent.
>
> Cons:
> - Lot of coding changes.
> - As Guido said: "and then something else would probably break".
> (Though, on current 32-bit systems, there should be no effective
> change).  Only 64-bit systems should be affected and, I would hope,
> the effect would be a clean up.

I support this as a long-term solution, perhaps for P3K.  Note that
ob_refcnt should also be declared size_t (no overflow-checking is done on
refcounts today; the theory is that a refcount can't possibly get bigger
than the total # of pointers in the system, and so if you declare ob_refcnt
to be large enough to hold that, refcount overflow is impossible; but, in
theory, this argument has been broken on every machine where sizeof(int) <
sizeof(void*)).

> I apologize for not being succinct.

Humbug -- it was a wonderfully concise posting, Trent!  The issues are
simply messy.

> Note that I am volunteering here.  Opinions and guidance please.

Alas, the first four letters in "guidance" spell out four-fifths of the only
one able to give you that.

opinions-are-fun-but-don't-count<wink>-ly y'rs  - tim





From mal at lemburg.com  Mon May  1 12:55:52 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 01 May 2000 12:55:52 +0200
Subject: [Python-Dev] issues with int/long on 64bit platforms - eg 
 stringobject (PR#306)
References: <000001bfb336$d4f512a0$0f2d153f@tim>
Message-ID: <390D62B8.15331407@lemburg.com>

I've just posted a simple patch to the patches list which
implements the idea I posted earlier:

Silent truncation still takes place, but in a somewhat more
natural way ;-) ...

                       /* Silently truncate to INT_MAX/INT_MIN to
                          make passing sys.maxint to 'i' parser
                          markers on 64-bit platforms work just
                          like on 32-bit platforms. Overflow errors
                          are not raised. */
                       else if (ival > INT_MAX)
                               ival = INT_MAX;
                       else if (ival < INT_MIN)
                               ival = INT_MIN;
                       *p = ival;
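
A quick illustration of the intended effect (not part of the patch): on a
64-bit platform, where sys.maxint > INT_MAX, this now behaves the same as
on 32-bit platforms:

    import sys

    s = "a" * 10
    # an explicit sys.maxint end index gets truncated to INT_MAX by the
    # code above instead of raising or silently misbehaving
    print s.count("a", 0, sys.maxint)     # prints 10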

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Mon May  1 16:04:08 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 10:04:08 -0400 (EDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <Pine.GSO.4.10.10004292105210.28387-100000@sundial>
References: <Pine.GSO.4.10.10004292105210.28387-100000@sundial>
Message-ID: <14605.36568.455646.598506@seahag.cnri.reston.va.us>

Moshe Zadka writes:
 > 1. I'm not sure what to call this function. Currently, I call it
 > __print_expr__, but I'm not sure it's a good name

  It's not.  ;)  How about printresult?
  Another thing to think about is interface; formatting a result and
"printing" it may be different, and you may want to overload them
separately in an environment like IDLE.  Some people may want to just
say:

	import sys
	sys.formatresult = str

  I'm inclined to think that level of control may be better left to
the application; if one hook is provided as you've described, the
application can build different layers as appropriate.

 > 2. I haven't yet supplied a default in __builtin__, so the user *must*
 > override this. This is unacceptable, of course.

  You're right!  But a default is easy enough to add.  I'd put it in
sys instead of __builtin__ though.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From moshez at math.huji.ac.il  Mon May  1 16:19:46 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Mon, 1 May 2000 17:19:46 +0300 (IDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <14605.36568.455646.598506@seahag.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005011712410.25942-100000@sundial>

On Mon, 1 May 2000, Fred L. Drake, Jr. wrote:

>   It's not.  ;)  How about printresult?

Hmmmm...better than mine at least.

> 	import sys
> 	sys.formatresult = str

And where does the "don't print if it's None" enter? I doubt if there is a
really good way to divide functionality. Of course, specific IDEs may
provide their own hooks.

>   You're right!  But a default is easy enough to add.

I agree. It was more to spur discussion -- with the advantage that there
is already a way to include Python sessions.

> I'd put it in
> sys instead of __builtin__ though.

Hmmm.. that's a Guido Issue(TM). Guido?
--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From fdrake at acm.org  Mon May  1 17:19:10 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 11:19:10 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
Message-ID: <14605.41070.290137.787832@seahag.cnri.reston.va.us>

  The "winreg" module needs some documentation; is anyone here up to
the task?  I don't think I know enough about the registry to write
something reasonable.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From fdrake at acm.org  Mon May  1 17:23:06 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 11:23:06 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
Message-ID: <14605.41306.146320.597637@seahag.cnri.reston.va.us>

I wrote:
 >   The "winreg" module needs some documentation; is anyone here up to
 > the task?  I don't think I know enough about the registry to write
 > something reasonable.

  Of course, as soon as I sent this message I remembered that there's
also the linuxaudiodev module; that needs documentation as well!  (I
guess I'll need to add a Linux-specific chapter; ugh.)  If anyone
wants to document audiodev, perhaps I could avoid the Linux chapter
(with one module) by adding documentation for the portable interface.
  There's also the pyexpat module; Andrew/Paul, did one of you want to
contribute something for that?


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From guido at python.org  Mon May  1 17:26:44 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 11:26:44 -0400
Subject: [Python-Dev] documentation for new modules
In-Reply-To: Your message of "Mon, 01 May 2000 11:23:06 EDT."
             <14605.41306.146320.597637@seahag.cnri.reston.va.us> 
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>  
            <14605.41306.146320.597637@seahag.cnri.reston.va.us> 
Message-ID: <200005011526.LAA20332@eric.cnri.reston.va.us>

>  >   The "winreg" module needs some documentation; is anyone here up to
>  > the task?  I don't think I know enough about the registry to write
>  > something reasonable.

Maybe you could adapt the documentation for the registry functions in
Mark Hammond's win32all?  Not all the APIs are the same but they should
mostly do the same thing...

>   Of course, as soon as I sent this message I remembered that there's
> also the linuxaudiodev module; that needs documentation as well!  (I
> guess I'll need to add a Linux-specific chapter; ugh.)  If anyone
> wants to document audiodev, perhaps I could avoid the Linux chapter
> (with one module) by adding documentation for the portable interface.

There's also sunaudiodev.  Is it documented?  linuxaudiodev should be
mostly the same.

>   There's also the pyexpat module; Andrew/Paul, did one of you want to
> contribute something for that?

I would hope so!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Mon May  1 18:17:06 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 12:17:06 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <200005011526.LAA20332@eric.cnri.reston.va.us>
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
	<14605.41306.146320.597637@seahag.cnri.reston.va.us>
	<200005011526.LAA20332@eric.cnri.reston.va.us>
Message-ID: <14605.44546.568978.296426@seahag.cnri.reston.va.us>

Guido van Rossum writes:
 > Maybe you could adapt the documentation for the registry functions in
 > Mark Hammond's win32all?  Not all the APIs are the same but they should
 > mostly do the same thing...

  I'll take a look at it when I have time, unless anyone beats me to
it.

 > There's also sunaudiodev.  Is it documented?  linuxaudiodev should be
 > mostly the same.

  It's been documented for a long time.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From guido at python.org  Mon May  1 20:02:32 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 14:02:32 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Sat, 29 Apr 2000 09:18:05 CDT."
             <390AEF1D.253B93EF@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us>  
            <390AEF1D.253B93EF@prescod.net> 
Message-ID: <200005011802.OAA21612@eric.cnri.reston.va.us>

[Guido]
> > And this is exactly why encodings will remain important: entities
> > encoded in ISO-2022-JP have no compelling reason to be recoded
> > permanently into ISO10646, and there are lots of forces that make it
> > convenient to keep it encoded in ISO-2022-JP (like existing tools).

[Paul]
> You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is
> a character *set* and not an encoding. ISO-2022-JP says how you should
> represent characters in terms of bits and bytes. ISO10646 defines a
> mapping from integers to characters.

OK.  I really meant recoding in UTF-8 -- I maintain that there are
lots of forces that prevent recoding most ISO-2022-JP documents in
UTF-8.

> They are both important, but separate. I think that this automagical
> re-encoding conflates them.

Who is proposing any automagical re-encoding?

Are you sure you understand what we are arguing about?

*I* am not even sure what we are arguing about.

I am simply saying that 8-bit strings (literals or otherwise) in
Python have always been able to contain encoded strings.

Earlier, you quoted some reference documentation that defines 8-bit
strings as containing characters.  That's taken out of context -- this
was written in a time when there was (for most people anyway) no
difference between characters and bytes, and I really meant bytes.
There's plenty of use of 8-bit Python strings for non-character uses
so your "proof" that 8-bit strings should contain "characters"
according to your definition is invalid.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tree at cymru.basistech.com  Mon May  1 20:05:33 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:05:33 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005011802.OAA21612@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
Message-ID: <14605.51053.369016.283239@cymru.basistech.com>

Guido van Rossum writes:
 > OK.  I really meant recoding in UTF-8 -- I maintain that there are
 > lots of forces that prevent recoding most ISO-2022-JP documents in
 > UTF-8.

Such as?

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From effbot at telia.com  Mon May  1 20:39:52 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 1 May 2000 20:39:52 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]><l03102800b52d80db1290@[193.78.237.154]><200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com>
Message-ID: <009f01bfb39c$a603cc00$34aab5d4@hagrid>

Tom Emerson wrote:
> Guido van Rossum writes:
>  > OK.  I really meant recoding in UTF-8 -- I maintain that there are
>  > lots of forces that prevent recoding most ISO-2022-JP documents in
>  > UTF-8.
> 
> Such as?

ISO-2022-JP includes language/locale information, UTF-8 doesn't.  if
you just recode the character codes, you'll lose important information.

</F>




From tree at cymru.basistech.com  Mon May  1 20:42:40 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:42:40 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <009f01bfb39c$a603cc00$34aab5d4@hagrid>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<14605.51053.369016.283239@cymru.basistech.com>
	<009f01bfb39c$a603cc00$34aab5d4@hagrid>
Message-ID: <14605.53280.55595.335112@cymru.basistech.com>

Fredrik Lundh writes:
 > ISO-2022-JP includes language/locale information, UTF-8 doesn't.  if
 > you just recode the character codes, you'll lose important information.

So encode them using the Plane 14 language tags.

I won't start with whether language/locale should be encoded in a
character encoding... 8-)

          -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From guido at python.org  Mon May  1 20:52:04 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 14:52:04 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 14:05:33 EDT."
             <14605.51053.369016.283239@cymru.basistech.com> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>  
            <14605.51053.369016.283239@cymru.basistech.com> 
Message-ID: <200005011852.OAA21973@eric.cnri.reston.va.us>

> Guido van Rossum writes:
>  > OK.  I really meant recoding in UTF-8 -- I maintain that there are
>  > lots of forces that prevent recoding most ISO-2022-JP documents in
>  > UTF-8.

[Tom Emerson]
> Such as?

The standard forces that work against all change -- existing tools,
user habits, compatibility, etc.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tree at cymru.basistech.com  Mon May  1 20:46:04 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:46:04 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005011852.OAA21973@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<14605.51053.369016.283239@cymru.basistech.com>
	<200005011852.OAA21973@eric.cnri.reston.va.us>
Message-ID: <14605.53484.225980.235301@cymru.basistech.com>

Guido van Rossum writes:
 > The standard forces that work against all change -- existing tools,
 > user habits, compatibility, etc.

Ah... I misread your original statement, which I took to be a
technical reason why one couldn't convert ISO-2022-JP to UTF-8. Of
course one cannot expect everyone to switch en masse to a new
encoding, pulling their existing documents with them. I'm in full
agreement there.

          -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From paul at prescod.net  Mon May  1 22:38:29 2000
From: paul at prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 15:38:29 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us>  
	            <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>
Message-ID: <390DEB45.D8D12337@prescod.net>

Uche asked for a summary so I cc:ed the xml-sig.

Guido van Rossum wrote:
> 
> ...
>
> OK.  I really meant recoding in UTF-8 -- I maintain that there are
> lots of forces that prevent recoding most ISO-2022-JP documents in
> UTF-8.

Absolutely agree.
 
> Are you sure you understand what we are arguing about?

Here's what I thought we were arguing about:

If you put a bunch of "funny characters" into a Python string literal,
and then compare that string literal against a Unicode object, should
those funny characters be treated as logical units of text (characters)
or as bytes? And if bytes, should some transformation be automatically
performed to have those bytes be reinterpreted as characters according
to some particular encoding scheme (probably UTF-8).

I claim that we should *as far as possible* treat strings as character
lists and not add any new functionality that depends on them being byte
list. Ideally, we could add a byte array type and start deprecating the
use of strings in that manner. Yes, it will take a long time to fix this
bug but that's what happens when good software lives a long time and the
world changes around it.

> Earlier, you quoted some reference documentation that defines 8-bit
> strings as containing characters.  That's taken out of context -- this
> was written in a time when there was (for most people anyway) no
> difference between characters and bytes, and I really meant bytes.

Actually, I think that that was Fredrik. 

Anyhow, you wrote the documentation that way because it was the most
intuitive way of thinking about strings. It remains the most intuitive
way. I think that that was the point Fredrik was trying to make.

We can't make "byte-list" strings go away soon but we can start moving
people towards the "character-list" model. In concrete terms I would
suggest that old fashioned lists be automatically coerced to Unicode by
interpreting each byte as a Unicode character. Trying to go the other
way could cause the moral equivalent of an OverflowError but that's not
a problem. 

>>> a=1000000000000000000000000000000000000L
>>> int(a)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OverflowError: long int too long to convert

And just as with ints and longs, we would expect to eventually unify
strings and unicode strings (but not byte arrays).

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From guido at python.org  Mon May  1 23:32:38 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 17:32:38 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 15:38:29 CDT."
             <390DEB45.D8D12337@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>  
            <390DEB45.D8D12337@prescod.net> 
Message-ID: <200005012132.RAA23319@eric.cnri.reston.va.us>

> > Are you sure you understand what we are arguing about?
> 
> Here's what I thought we were arguing about:
> 
> If you put a bunch of "funny characters" into a Python string literal,
> and then compare that string literal against a Unicode object, should
> those funny characters be treated as logical units of text (characters)
> or as bytes? And if bytes, should some transformation be automatically
> performed to have those bytes be reinterpreted as characters according
> to some particular encoding scheme (probably UTF-8).
> 
> I claim that we should *as far as possible* treat strings as character
> lists and not add any new functionality that depends on them being byte
> list. Ideally, we could add a byte array type and start deprecating the
> use of strings in that manner. Yes, it will take a long time to fix this
> bug but that's what happens when good software lives a long time and the
> world changes around it.
> 
> > Earlier, you quoted some reference documentation that defines 8-bit
> > strings as containing characters.  That's taken out of context -- this
> > was written in a time when there was (for most people anyway) no
> > difference between characters and bytes, and I really meant bytes.
> 
> Actually, I think that that was Fredrik. 

Yes, I came across the post again later.  Sorry.

> Anyhow, you wrote the documentation that way because it was the most
> intuitive way of thinking about strings. It remains the most intuitive
> way. I think that that was the point Fredrik was trying to make.

I just wish he made the point more eloquently.  The eff-bot seems to
be in a crunchy mood lately...

> We can't make "byte-list" strings go away soon but we can start moving
> people towards the "character-list" model. In concrete terms I would
> suggest that old fashioned lists be automatically coerced to Unicode by
> interpreting each byte as a Unicode character. Trying to go the other
> way could cause the moral equivalent of an OverflowError but that's not
> a problem. 
> 
> >>> a=1000000000000000000000000000000000000L
> >>> int(a)
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too long to convert
> 
> And just as with ints and longs, we would expect to eventually unify
> strings and unicode strings (but not byte arrays).

OK, you've made your claim -- like Fredrik, you want to interpret
8-bit strings as Latin-1 when converting (not just comparing!) them to
Unicode.

I don't think I've heard a good *argument* for this rule though.  "A
character is a character is a character" sounds like an axiom to me --
something you can't prove or disprove rationally.

I have a bunch of good reasons (I think) for liking UTF-8: it allows
you to convert between Unicode and 8-bit strings without losses, Tcl
uses it (so displaying Unicode in Tkinter *just* *works*...), it is
not Western-language-centric.

Another reason: while you may claim that your (and /F's, and Just's)
preferred solution doesn't enter into the encodings issue, I claim it
does: Latin-1 is just as much an encoding as any other one.

I claim that as long as we're using an encoding we might as well use
the most accepted 8-bit encoding of Unicode as the default encoding.

I also think that the issue is blown out of proportions: this ONLY
happens when you use Unicode objects, and it ONLY matters when some
other part of the program uses 8-bit string objects containing
non-ASCII characters.  Given the long tradition of using different
encodings in 8-bit strings, at that point it is anybody's guess what
encoding is used, and UTF-8 is a better guess than Latin-1.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 00:17:17 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 18:17:17 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Sat, 29 Apr 2000 21:09:40 +0300."
             <Pine.GSO.4.10.10004292105210.28387-100000@sundial> 
References: <Pine.GSO.4.10.10004292105210.28387-100000@sundial> 
Message-ID: <200005012217.SAA23503@eric.cnri.reston.va.us>

> Continuing the recent debate about what is appropriate to the interactive
> prompt printing, and the wide agreement that whatever we decide, users
> might think otherwise, I've written up a patch to have the user control 
> via a function in __builtin__ the way things are printed at the prompt.
> This is not patches at python level stuff for two reasons:
> 
> 1. I'm not sure what to call this function. Currently, I call it
> __print_expr__, but I'm not sure it's a good name
> 
> 2. I haven't yet supplied a default in __builtin__, so the user *must*
> override this. This is unacceptable, of course.
> 
> I'd just like people to tell me if they think this is worth while, and if
> there is anything I missed.

Thanks for bringing this up again.  I think it should be called
sys.displayhook.  The default could be something like

import __builtin__
def displayhook(obj):
    if obj is None:
        return
    __builtin__._ = obj
    sys.stdout.write("%s\n" % repr(obj))

to be nearly 100% compatible with current practice; or use str(obj) to
do what most people would probably prefer.

(Note that you couldn't do "%s\n" % obj because obj might be a tuple.)
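
A sketch of the str() variant and of a user override, assuming the hook
ends up being a plain attribute of sys as proposed (the function name here
is made up):

    import sys, __builtin__

    def my_displayhook(obj):
        # same skeleton as above, but using str() instead of repr()
        if obj is None:
            return
        __builtin__._ = obj
        sys.stdout.write(str(obj) + "\n")

    sys.displayhook = my_displayhook   # hypothetical -- nothing consults this yet

Writing str(obj) + "\n" also sidesteps the tuple problem noted above.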

--Guido van Rossum (home page: http://www.python.org/~guido/)



From effbot at telia.com  Tue May  2 00:29:41 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 00:29:41 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>             <390DEB45.D8D12337@prescod.net>  <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <017d01bfb3bc$c3734c00$34aab5d4@hagrid>

Guido van Rossum <guido at python.org> wrote:
> I just wish he made the point more eloquently.  The eff-bot seems to
> be in a crunchy mood lately...

I've posted a few thousand messages on this topic, most of which
seem to have been ignored.  if you'd read all my messages, and seen
all the replies, you'd be cranky too...

> I don't think I've heard a good *argument* for this rule though.  "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

maybe, but it's a darn good axiom, and it's used by everyone else.
Perl uses it, Tcl uses it, XML uses it, etc.  see:

http://www.python.org/pipermail/python-dev/2000-April/005218.html

> I have a bunch of good reasons (I think) for liking UTF-8: it allows
> you to convert between Unicode and 8-bit strings without losses, Tcl
> uses it (so displaying Unicode in Tkinter *just* *works*...), it is
> not Western-language-centric.

the "Tcl uses it" is a red herring -- their internal implementation
uses 16-bit integers, and the external interface works very hard
to keep the "strings are character sequences" illusion.

in other words, the length of a string is *always* the number of
characters, the character at index i is *always* the i'th character
in the string, etc.

that's not true in Python 1.6a2.

(as for Tkinter, you only have to add 2-3 lines of code to make it
use 16-bit strings instead...)

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

this is another red herring: my argument is that 8-bit strings should
contain unicode characters, using unicode character codes.  there
should be only one character repertoire, and that repertoire is uni-
code.  for a definition of these terms, see:

http://www.python.org/pipermail/python-dev/2000-April/005225.html

obviously, you can only store 256 different values in a single 8-bit
character (just like you can only store 4294967296 different values
in a single 32-bit int).

to store larger values, use unicode strings (or long integers).

conversion from a small type to a large type always work, conversion
from a large type to a small one may result in an OverflowError.

it has nothing to do with encodings.

> I claim that as long as we're using an encoding we might as well use
> the most accepted 8-bit encoding of Unicode as the default encoding.

yeah, and I claim that it won't fly, as long as it breaks the "strings
are character sequences" rule used by all other contemporary (and
competing) systems.

(if you like, I can post more "fun with unicode" messages ;-)

and as I've mentioned before, there are (at least) two ways to solve
this:

1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and
   Perl).  make sure len(s) returns the number of characters in the
   string, make sure s[i] returns the i'th character (not necessarily
   starting at the i'th byte, and not necessarily one byte), etc.  to
   make this run reasonable fast, use as many implementation tricks
   as you can come up with (I've described three ways to implement
   this in an earlier post).

2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i])
   is a unicode character code, whether s is an 8-bit string or a unicode
   string.

for alternative 1 to work, you need to add some way to explicitly work
with binary strings (like it's done in Perl and Tcl).

alternative 2 doesn't need that; 8-bit strings can still be used to hold
any kind of binary data, as in 1.5.2.  just keep in mind you cannot use
all methods on such an object...
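
for concreteness, here's what the bookkeeping for alternative 1 is up
against with today's semantics (len() counts bytes, not characters):

    s = "\xc2\xa4"              # the UTF-8 encoding of U+00A4: two bytes
    u = unicode(s, "utf-8")     # one character
    print len(s), len(u)        # prints: 2 1
    # alternative 1 would have to make len(s) report 1; alternative 2
    # leaves 8-bit strings alone and keeps them byte-oriented.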

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.  Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

I still think it's very unfortunate that you think that unicode strings
are a special kind of strings.  Perl and Tcl don't, so why should we?

</F>




From gward at mems-exchange.org  Tue May  2 00:40:18 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Mon, 1 May 2000 18:40:18 -0400
Subject: [Python-Dev] Comparison inconsistency with ExtensionClass
Message-ID: <20000501184017.A1171@mems-exchange.org>

Hi all --

I seem to have discovered an inconsistency in the semantics of object
comparison between plain old Python instances and ExtensionClass
instances.  (I've cc'd python-dev because it looks as though one *could*
blame Python for the inconsistency, but I don't really understand the
guts of either Python or ExtensionClass enough to know.)

Here's a simple script that shows the difference:

    class Simple:
        def __init__ (self, data):
            self.data = data

        def __repr__ (self):
            return "<%s at %x: %s>" % (self.__class__.__name__,
                                       id(self),
                                       `self.data`)

        def __cmp__ (self, other):
            print "Simple.__cmp__: self=%s, other=%s" % (`self`, `other`)
            return cmp (self.data, other)


    if __name__ == "__main__":
        v1 = 36
        v2 = Simple (36)
        print "v1 == v2?", (v1 == v2 and "yes" or "no")
        print "v2 == v1?", (v2 == v1 and "yes" or "no")
        print "v1 == v2.data?", (v1 == v2.data and "yes" or "no")
        print "v2.data == v1?", (v2.data == v1 and "yes" or "no")

If I run this under Python 1.5.2, then all the comparisons come out true
and my '__cmp__()' method is called twice:

    v1 == v2? Simple.__cmp__: self=<Simple at 1b5148: 36>, other=36
    yes
    v2 == v1? Simple.__cmp__: self=<Simple at 1b5148: 36>, other=36
    yes
    v1 == v2.data? yes
    v2.data == v1? yes


The first one and the last two are obvious, but the second one only
works thanks to a trick in PyObject_Compare():

    if (PyInstance_Check(v) || PyInstance_Check(w)) {
        ...
        if (!PyInstance_Check(v))
	    return -PyObject_Compare(w, v);
        ...
    }

However, if I make Simple an ExtensionClass:

    from ExtensionClass import Base

    class Simple (Base):

Then the "swap v and w and use w's comparison method" no longer works.
Here's the output of the script with Simple as an ExtensionClass:

    v1 == v2? no
    v2 == v1? Simple.__cmp__: self=<Simple at 1b51c0: 36>, other=36
    yes
    v1 == v2.data? yes
    v2.data == v1? yes

It looks as though ExtensionClass would have to duplicate the trick in
PyObject_Compare() that I quoted, since Python has no idea that
ExtensionClass instances really should act like instances.  This smells
to me like a bug in ExtensionClass.  Comments?

BTW, I'm using the ExtensionClass provided with Zope 2.1.4.  Mostly
tested with Python 1.5.2, but also under the latest CVS Python and we
observed the same behaviour.

        Greg



From mhammond at skippinet.com.au  Tue May  2 01:45:02 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 09:45:02 +1000
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <14605.44546.568978.296426@seahag.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>

> Guido van Rossum writes:
>  > Maybe you could adapt the documentation for the
> registry functions in
>  > Mark Hammond's win32all?  Not all the APIs are the
> same but they should
>  > mostly do the same thing...
>
>   I'll take a look at it when I have time, unless anyone
> beats me to
> it.

I wonder if that anyone could be me? :-)

Note that all the win32api docs for the registry made it into
docstrings - so winreg has OK documentation as it is...

But I will try and put something together.  It will need to be plain
text or HTML, but I assume that is better than nothing!

Give me a few days...

Mark.




From paul at prescod.net  Tue May  2 02:19:20 2000
From: paul at prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 19:19:20 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>  
	            <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <390E1F08.EA91599E@prescod.net>

Sorry for the long message. Of course you need only respond to that
which is interesting to you. I don't think that most of it is redundant.

Guido van Rossum wrote:
> 
> ...
> 
> OK, you've made your claim -- like Fredrik, you want to interpret
> 8-bit strings as Latin-1 when converting (not just comparing!) them to
> Unicode.

If the user provides an explicit conversion function (e.g. UTF-8-decode)
then of course we should use that function. Under my character is a
character is a character model, this "conversion" is morally equivalent
to ROT-13, strupr or some other text->text translation. So you could
apply UTF-8-decode even to a Unicode string as long as each character in
the string has ord()<256 (so that it could be interpreted as a character
representation for a byte).

> I don't think I've heard a good *argument* for this rule though.  "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

I don't see it as an axiom, but rather as a design decision you make to
keep your language simple. Along the lines of "all values are objects"
and (now) all integer values are representable with a single type. Are
you happy with this?

a="\244"
b=u"\244"
assert len(a)==len(b)
assert ord(a[0])==ord(b[0])

# same thing, right?
print b==a
# Traceback (most recent call last):
#  File "<stdin>", line 1, in ?
# UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a
UTF-8 escape sequence. "\244" is a string with one character. It has no
encoding. It is not latin-1. It is not UTF-8. It is a string with one
character and should compare as equal with another string with the same
character.

I would laugh my ass off if I was using Perl and it did something weird
like this to me (as long as it didn't take a month to track down the
bug!). Now it isn't so funny.

> I have a bunch of good reasons (I think) for liking UTF-8: 

I'm not against UTF-8. It could be an internal representation for some
Unicode objects.

> it allows
> you to convert between Unicode and 8-bit strings without losses, 

Here's the heart of our disagreement:

******
I don't want, in Py3K, to think about "converting between Unicode and
8-bit strings." I want strings and I want byte-arrays and I want to
worry about converting between *them*. There should be only one string
type, its characters should all live in the Unicode character repertoire
and the character numbers should all come from Unicode. "Special"
characters can be assigned to the Unicode Private User Area. Byte arrays
would be entirely separate and would be converted to Unicode strings
with explicit conversion functions.
*****

In the meantime I'm just trying to get other people thinking in this
mode so that the transition is easier. If I see people embedding UTF-8
escape sequences in literal strings today, I'm going to hit them.

I recognize that we can't design the universe right now but we could
agree on this direction and use it to guide our decision-making.

By the way, if we DID think of 8-bit strings as essentially "byte
arrays" then let's use that terminology and imagine some future
documentation:

"Python's string type is equivalent to a list of bytes. For clarity, we
will call this type a byte list from now on. In contexts where a Unicode
character-string is desired, Python automatically converts byte lists to
character strings by doing a UTF-8 decode on them." 

What would you think if Java had a default (I say "magical") conversion
from byte arrays to character strings.

The only reason we are discussing this is because Python strings have a
dual personality which was useful in the past but will (IMHO, of course)
become increasingly confusing in the future. We want the best of both
worlds without confusing anybody and I don't think that we can have it.

If you want 8-bit strings to be really byte arrays in perpetuity then
let's be consistent in that view. We can compare them to Unicode as we
would two completely separate types. "U" comes after "S" so unicode
strings always compare greater than 8-bit strings. The use of the word
"string" for both objects can be considered just a historical accident.

> Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), 

Don't follow this entirely. Shouldn't the next version of TKinter accept
and return Unicode strings? It would be rather ugly for two
Unicode-aware systems (Python and TK) to talk to each other in 8-bit
strings. I mean I don't care what you do at the C level but at the
Python level arguments should be "just strings."

Consider that len() on the TKinter side would return a different value
than on the Python side. 

What about integral indexes into buffers? I'm totally ignorant about
TKinter but let me ask wouldn't Tkinter say (e.g.) that the cursor is
between the 5th and 6th character when in an 8-bit string the equivalent
index might be the 11th or 12th byte?

> it is not Western-language-centric.

If you look at encoding efficiency it is.

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

The fact that my proposal has the same effect as making Latin-1 the
"default encoding" is a near-term side effect of the definition of
Unicode. My long term proposal is to do away with the concept of 8-bit
strings (and thus, conversions from 8-bit to Unicode) altogether. One
string to rule them all!

Is Unicode going to be the canonical Py3K character set or will we have
different objects for different character sets/encodings with different
default (I say "magical") conversions between them. Such a design would
not be entirely insane though it would be a PITA to implement and
maintain. If we aren't ready to establish Unicode as the one true
character set then we should probably make no special concessions for
Unicode at all. Let a thousand string objects bloom!

Even if we agreed to allow many string objects, byte==character should
not be the default string object. Unicode should be the default.

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.  

Won't this be totally common? Most people are going to use 8-bit
literals in their program text but work with Unicode data from XML
parsers, COM, WebDAV, Tkinter, etc?

> Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

If we are guessing then we are doing something wrong. My answer to the
question of "default encoding" falls out naturally from a certain way of
looking at text, popularized in various other languages and increasingly
"the norm" on the Web. If you accept the model (a character is a
character is a character), the right behavior is obvious. 

"\244"==u"\244"

Nobody is ever going to have trouble understanding how this works.
Choose simplicity!

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From mhammond at skippinet.com.au  Tue May  2 02:34:16 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 10:34:16 +1000
Subject: [Python-Dev] Neil Hodgson on python-dev?
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au>

I'd like to propose that we invite Neil Hodgson to join the
python-dev family.

Neil is the author of the Scintilla editor control, now used by
wxPython and Pythonwin...  Smart guy, and very experienced with
Python (scintilla was originally written because he had trouble
converting Pythonwin to be a color'd editor :-)

But most relevant at the moment is his Unicode experience.  He
worked for a long time with Fujitsu, working with Japanese and all
the encoding issues there.  I have heard him echo the exact
sentiments of Andy.  He is also in the process of polishing the
recent Unicode support in Scintilla.

As this Unicode debate seems to be going nowhere fast, and appears
to simply need more people with _experience_, I think he would be
valuable.  Further, he is a pretty quiet guy - you won't find him
offering his opinion on every post that moves through here :-)

Thoughts?

Mark.




From guido at python.org  Tue May  2 02:41:43 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 20:41:43 -0400
Subject: [Python-Dev] Neil Hodgson on python-dev?
In-Reply-To: Your message of "Tue, 02 May 2000 10:34:16 +1000."
             <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au> 
Message-ID: <200005020041.UAA23648@eric.cnri.reston.va.us>

> I'd like to propose that we invite Neil Hodgson to join the
> python-dev family.

Excellent!

> As this Unicode debate seems to be going nowhere fast, and appears
> to simply need more people with _experience_, I think he would be
> valuable.  Further, he is a pretty quiet guy - you wont find him
> offering his opinion on every post that moves through here :-)

As long as he isn't too quiet on the Unicode thing ;-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 02:53:26 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 20:53:26 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 19:19:20 CDT."
             <390E1F08.EA91599E@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>  
            <390E1F08.EA91599E@prescod.net> 
Message-ID: <200005020053.UAA23665@eric.cnri.reston.va.us>

Paul, we're both just saying the same thing over and over without
convincing each other.  I'll wait till someone who wasn't in this
debate before chimes in.

Have you tried using this?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From effbot at telia.com  Tue May  2 03:26:06 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 03:26:06 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>              <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net>
Message-ID: <002301bfb3d5$8fd57440$34aab5d4@hagrid>

Paul Prescod <paul at prescod.net> wrote:
> I would laugh my ass off if I was using Perl and it did something weird
> like this to me.

you don't have to -- in Perl 5.6, a character is a character...

does anyone on this list follow the perl-porters list?  was this as
controversial over in Perl land as it appears to be over here?

</F>




From tpassin at home.com  Tue May  2 03:55:25 2000
From: tpassin at home.com (tpassin at home.com)
Date: Mon, 1 May 2000 21:55:25 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
Message-ID: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>

Guido van  Rossum wrote, about how to represent strings:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

I'm with Paul and Fredrik on this one - at least about characters being the
atoms of a string.  We **have** to be able to refer to **characters** in a
string, and without guessing.  Otherwise, how could you ever construct a
test, like theString[3]==[a particular japanese ideograph]?  If we do it by
having a "string" datatype, which is really a byte list, and a
"unicodeString" datatype which is a list of abstract characters, I'd say
everyone could get used to working with them.  We'd have to supply
conversion functions, of course.

This route might be the easiest to understand for users.  We'd have to be
very clear about what file.read() would return, for example, and all those
similar read and write functions.  And we'd have to work out how real 8-bit
calls (like writing to a socket?) would play with the new types.

For extra clarity, we could leave string the way it is, introduce stringU
(unicode string) **and** string8 (Latin-1 or byte list, whichever seems to
be the best equivalent to the current string).  Then we would deprecate
string in favor of string8.  Then if tcl and perl go to unicode strings we
pass them a stringU, and if they go some other way, we pass them something
else.  Come to think of it, we need some data type that will continue
to work with c and c++.  Would that be string8 or would we keep string for
that purpose?

Clarity and ease of use for the user should be primary, fast implementations
next.  If we didn't care about ease of use and clarity, we could all use
Scheme or C; don't lose sight of it.

I'd suggest we could create some use cases or scenarios for this area -
needs input from those who know encodings and low level Python stuff better
than I.  Then we could examine more systematically how well various
approaches would work out.

Regards,
Tom Passin





From mhammond at skippinet.com.au  Tue May  2 04:17:09 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 12:17:09 +1000
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEPJCJAA.mhammond@skippinet.com.au>

> Guido van  Rossum wrote, about how to represent strings:
>
> > Paul, we're both just saying the same thing over and
> over without
> > convincing each other.  I'll wait till someone who
> wasn't in this
> > debate before chimes in.

I've chimed in a little, but I'll chime in again :-)

> I'm with Paul and Federick on this one - at least about
> characters being the
> atoms of a string.  We **have** to be able to refer to
> **characters** in a
> string, and without guessing.  Otherwise, how could you

I see the point, and agree 100% with the intent.  However, reality
does bite.

As far as I can see, the following are immutable:
* There will be 2 types - a string type and a Unicode type.
* History dictates that the string type may hold binary data.

Thus, it is clear that Python simply can not treat characters as the
smallest atoms of strings.  If I understand things correctly, this
is key to Guido's point, and a bit of a communication block.

The issue, to my mind, is how we handle these facts to produce "the
principle of least surprise".  We simply need to accept that Python
1.x will never be able to treat string objects as sequences of
"characters" - only bytes.

However, with my limited understanding of the full issues, it does
appear that the proposal championed by Fredrik, Just and Paul is the
best solution - not because it magically causes Python to treat
strings as characters in all cases, but because it offers the
principle of least surprise.

As I said, I don't really have a deep enough understanding of the
issues, so this is probably (hopefully!?) my last word on the
matter - but that doesn't mean I don't share the concerns raised
here...

Mark.




From guido at python.org  Tue May  2 05:31:54 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 23:31:54 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 21:55:25 EDT."
             <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> 
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> 
Message-ID: <200005020331.XAA23818@eric.cnri.reston.va.us>

Tom Passin:
> I'm with Paul and Fredrik on this one - at least about characters being the
> atoms of a string.  We **have** to be able to refer to **characters** in a
> string, and without guessing.  Otherwise, how could you ever construct a
> test, like theString[3]==[a particular japanese ideograph]?  If we do it by
> having a "string" datatype, which is really a byte list, and a
> "unicodeString" datatype which is a list of abstract characters, I'd say
> everyone could get used to working with them.  We'd have to supply
> conversion functions, of course.

You seem unfamiliar with the details of the implementation we're
proposing?  We already have two datatypes, 8-bit string (call it byte
array) and Unicode string.  There are conversions between them:
explicit conversions such as u.encode("utf-8") or unicode(s,
"latin-1") and implicit conversions used in situations like u+s or
u==s.  The whole discussion is *only* about what the default
conversion in the latter cases should be -- the rest of the
implementation is rock solid and works well.
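
A minimal illustration of the two kinds of conversion (ASCII-only data, so
the choice of default encoding doesn't matter for this snippet):

    u = u"abc"
    s = "abc"
    print u.encode("utf-8")       # explicit: Unicode -> 8-bit string
    print unicode(s, "latin-1")   # explicit: 8-bit string -> Unicode
    print u + s                   # implicit: s is decoded using the default encoding
    print u == s                  # implicit comparison uses the same default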

Users can accomplish what you are proposing by simply ensuring that
theString is a Unicode string.

> This route might be the easiest to understand for users.  We'd have to be
> very clear about what file.read() would return, for example, and all those
> similar read and write functions.  And we'd have to work out how real 8-bit
> calls (like writing to a socket?) would play with the new types.

These are all well defined -- they all deal in 8-bit strings
internally, and all use the default conversions when given Unicode
strings.  Programs that only deal in 8-bit strings don't need to
change.  Programs that want to deal with Unicode and sockets, for
example, must know what encoding to use on the socket, and if it's not
the default encoding, must use explicit conversions.

> For extra clarity, we could leave string the way it is, introduce stringU
> (unicode string) **and** string8 (Latin-1 or byte list, whichever seems to
> be the best equivalent to the current string).  Then we would deprecate
> string in favor of string8.  Then if tcl and perl go to unicode strings we
> pass them a stringU, and if they go some other way, we pass them something
> else.  COme to think of it, we need some some data type that will continue
> to work with c and c++.  Would that be string8 or would we keep string for
> that purpose?

What would be the difference between string and string8?

> Clarity and ease of use for the user should be primary, fast implementations
> next.  If we didn't care about ease of use and clarity, we could all use
> Scheme or c, don't use sight of it.
> 
> I'd suggest we could create some use cases or scenarios for this area -
> needs input from those who know encodings and low level Python stuff better
> than I.  Then we could examine more systematically how well various
> approaches would work out.

Very good.

Here's one usage scenario.

A Japanese user is reading lines from a file encoded in ISO-2022-JP.
The readline() method returns 8-bit strings in that encoding (the file
object doesn't do any decoding).  She realizes that she wants to do
some character-level processing on the file so she decides to convert
the strings to Unicode.

I believe that whether the default encoding is UTF-8 or Latin-1
doesn't matter for her -- both are wrong, she needs to write explicit
unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
"better", because interpreting ISO-2022-JP data as UTF-8 will most
likely give an exception (when a \300 range byte isn't followed by a
\200 range byte) -- while interpreting it as Latin-1 will silently do
the wrong thing.  (An explicit error is always better than silent
failure.)
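
In code, the explicit step she writes is something like this (the filename
is made up, and this assumes an ISO-2022-JP codec is installed and
registered under that name):

    f = open("mail.txt")                    # hypothetical input file
    for line in f.readlines():
        u = unicode(line, "iso-2022-jp")    # explicit decode; the default never enters
        # ... character-level processing on u ...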

I'd love to discuss other scenarios.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From moshez at math.huji.ac.il  Tue May  2 06:39:12 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Tue, 2 May 2000 07:39:12 +0300 (IDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <200005012217.SAA23503@eric.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005020732200.8759-100000@sundial>

> Thanks for bringing this up again.  I think it should be called
> sys.displayhook.

That should be the easy part -- I'll do it as soon as I'm home.

> The default could be something like
> 
> import __builtin__
import sys # Sorry, I couldn't resist
> def displayhook(obj):
>     if obj is None:
>         return
>     __builtin__._ = obj
>     sys.stdout.write("%s\n" % repr(obj))

This brings up a painful point -- the reason I haven't written the default
is because it was much easier to write it in Python. Of course, I
shouldn't be preaching Python-is-easier-to-write-than-C here, but it
pains me that Python cannot be written with more Python and less C.

A while ago we started talking about the mini-interpreter idea, which
would then freeze Python code into itself, and then it sort of died out.
What has become of it?

--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From just at letterror.com  Tue May  2 07:47:35 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 06:47:35 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005020331.XAA23818@eric.cnri.reston.va.us>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID: <l03102802b534149a9639@[193.78.237.164]>

At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
>Here's one usage scenario.
>
>A Japanese user is reading lines from a file encoded in ISO-2022-JP.
>The readline() method returns 8-bit strings in that encoding (the file
>object doesn't do any decoding).  She realizes that she wants to do
>some character-level processing on the file so she decides to convert
>the strings to Unicode.
>
>I believe that whether the default encoding is UTF-8 or Latin-1
>doesn't matter for here -- both are wrong, she needs to write explicit
>unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
>"better", because interpreting ISO-2022-JP data as UTF-8 will most
>likely give an exception (when a \300 range byte isn't followed by a
>\200 range byte) -- while interpreting it as Latin-1 will silently do
>the wrong thing.  (An explicit error is always better than silent
>failure.)

But then it's even better to *always* raise an exception, since it's
entirely possible a string contains valid utf-8 while not *being* utf-8. I
really think the exception argument is moot, since there can *always* be
situations that will pass silently. Encoding issues are silent by nature --
eg. there's no way any system can tell that interpreting MacRoman data as
Latin-1 is wrong, maybe even fatal -- the user will just have to deal with
it. You can argue what you want, but *any* multi-byte encoding stored in an
8-bit string is a buffer, not a string, for all the reasons Fredrik and
Paul have thrown at you, and right they are. Choosing such an encoding as a
default conversion to Unicode makes no sense at all. Recap of the main
arguments:

pro UTF-8:
always reversible when going from Unicode to 8-bit

con UTF-8:
not a string: confusing semantics

pro Latin-1:
simpler semantics

con Latin-1:
non-reversible, western-centric

Given the fact that very often *both* will be wrong, I'd go for the simpler
semantics.

Just





From guido at python.org  Tue May  2 06:51:45 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 00:51:45 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Tue, 02 May 2000 07:39:12 +0300."
             <Pine.GSO.4.10.10005020732200.8759-100000@sundial> 
References: <Pine.GSO.4.10.10005020732200.8759-100000@sundial> 
Message-ID: <200005020451.AAA23940@eric.cnri.reston.va.us>

> > import __builtin__
> import sys # Sorry, I couldn't resist
> > def displayhook(obj):
> >     if obj is None:
> >         return
> >     __builtin__._ = obj
> >     sys.stdout.write("%s\n" % repr(obj))
> 
> This brings up a painful point -- the reason I haven't written the default
> is because it was much easier to write it in Python. Of course, I
> shouldn't be preaching Python-is-easier-to-write-than-C here, but it
> pains me that Python cannot be written with more Python and less C.
> 

But the C code on how to do it was present in the code you deleted
from ceval.c!

> A while ago we started talking about the mini-interpreter idea,
> which would then freeze Python code into itself, and then it sort of
> died out.  What has become of it?

Nobody sent me a patch :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)



From nhodgson at bigpond.net.au  Tue May  2 07:04:12 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 15:04:12 +1000
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <035501bfb3f3$db87fb10$e3cb8490@neil>

   I'm dropping in a bit late in this thread but can the current problem be
summarised in an example as "how is 'literal' interpreted here"?

s = aUnicodeStringFromSomewhere
DoSomething(s + "<literal>")

   The two options being that literal is either assumed to be encoded in
Latin-1 or UTF-8. I can see some arguments for both sides.

Latin-1: more of the current code was written in a European locale with an implicit
assumption that all string handling was Latin-1. Current editors are more
likely to be displaying literal as it is meant to be interpreted.

UTF-8: all languages can be written in UTF-8 and more recent editors can
display this correctly. Thus people using non-Roman alphabets can write code
which is interpreted as it is seen, with no need to remember to call conversion
functions.

   Neil




From tpassin at home.com  Tue May  2 07:07:07 2000
From: tpassin at home.com (tpassin at home.com)
Date: Tue, 2 May 2000 01:07:07 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>  <200005020331.XAA23818@eric.cnri.reston.va.us>
Message-ID: <006101bfb3f4$454f99e0$7cac1218@reston1.va.home.com>

Guido van Rossum said
<snip/>
> What would be the difference between string and string8?

Probably none, except to alert people that string8 might have different
behavior than the present-day string, perhaps when interacting with
unicode - probably its behavior would be specified more tightly (i.e., is it
strictly a list of bytes or does it have some assumption about encoding?) or
changed in some way from what we have now.  Or if it turned out that a lot
of programmers in other languages (perl, tcl, perhaps?) expected "string" to
behave in particular ways, the use of a term like "string8" might reduce
confusion.   Possibly none of these apply - no need for "string8" then.

>
> > Clarity and ease of use for the user should be primary, fast
> > implementations next.  If we didn't care about ease of use and clarity,
> > we could all use Scheme or C, don't lose sight of it.
> >
> > I'd suggest we could create some use cases or scenarios for this area -
> > needs input from those who know encodings and low level Python stuff
> > better than I.  Then we could examine more systematically how well
> > various approaches would work out.
>
> Very good.
>
<snip/>

Tom Passin




From effbot at telia.com  Tue May  2 08:59:03 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 08:59:03 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil>
Message-ID: <003b01bfb404$03cd0560$34aab5d4@hagrid>

Neil Hodgson <nhodgson at bigpond.net.au> wrote:
>    I'm dropping in a bit late in this thread but can the current problem be
> summarised in an example as "how is 'literal' interpreted here"?
> 
> s = aUnicodeStringFromSomewhere
> DoSomething(s + "<literal>")

nope.  the whole discussion centers around what happens
if you type:

    # example 1

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    DoSomething(s + u)

and

    # example 2

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    if len(u) + len(s) == len(u + s):
        print "true"
    else:
        print "not true"

in Guido's design, the first example may or may not result in
an "UTF-8 decoding error: unexpected code byte" exception.  the
second example may result in a
similar error, print "true", or print "not true", depending on the
contents of the 8-bit string.

(under the counter proposal, the first example will never
raise an exception, and the second will always print "true")

...

the string literal issue is a slightly different problem.

> The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8. I can see some arguments for both sides.

better make that "two options", not "the two options" ;-)

a more flexible scheme would be to borrow the design from XML
(see http://www.w3.org/TR/1998/REC-xml-19980210). for those
who haven't looked closer at XML, it basically treats the source
file as an encoded unicode character stream, and does all
processing on the decoded side.

replace "entity" with "script file" in the following excerpts, and you
get close:

section 2.2:

    A parsed entity contains text, a sequence of characters,
    which may represent markup or character data.

    A character is an atomic unit of text as specified by
    ISO/IEC 10646.

section 4.3.3:

    Each external parsed entity in an XML document may
    use a different encoding for its characters. All XML
    processors must be able to read entities in either
    UTF-8 or UTF-16. 

    Entities encoded in UTF-16 must begin with the Byte
    Order Mark /.../ XML processors must be able to use
    this character to differentiate between UTF-8 and
    UTF-16 encoded documents.

    Parsed entities which are stored in an encoding other
    than UTF-8 or UTF-16 must begin with a text declaration
    containing an encoding declaration.

(also see appendix F: Autodetection of Character Encodings)

I propose that we adopt a similar scheme for Python -- but not
in 1.6.  the current "dunno, so we just copy the characters" is
good enough for now...

</F>




From tim_one at email.msn.com  Tue May  2 09:20:52 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 2 May 2000 03:20:52 -0400
Subject: [Python-Dev] fun with unicode, part 1
In-Reply-To: <200004271523.LAA13614@eric.cnri.reston.va.us>
Message-ID: <000201bfb406$f2f35520$df2d153f@tim>

[Guido asks good questions about how Windows deals w/ Unicode filenames,
 last Thursday, but gets no answers]

> ...
> I'd like to solve this problem, but I have some questions: what *IS*
> the encoding used for filenames on Windows?  This may differ per
> Windows version; perhaps it can differ drive letter?  Or per
> application or per thread?  On Windows NT, filenames are supposed to
> be Unicode.  (I suppose also on Windows 2000?)  How do I open a file
> with a given Unicode string for its name, in a C program?  I suppose
> there's a Win32 API call for that which has a Unicode variant.
>
> On Windows 95/98, the Unicode variants of the Win32 API calls don't
> exist.  So what is the poor Python runtime to do there?
>
> Can Japanese people use Japanese characters in filenames on Windows
> 95/98?  Let's assume they can.  Since the filesystem isn't Unicode
> aware, the filenames must be encoded.  Which encoding is used?  Let's
> assume they use Microsoft's multibyte encoding.  If they put such a
> file on a floppy and ship it to Linköping, what will Fredrik see as
> the filename?  (I.e., is the encoding fixed by the disk volume, or by
> the operating system?)
>
> Once we have a few answers here, we can solve the problem.  Note that
> sometimes we'll have to refuse a Unicode filename because there's no
> mapping for some of the characters it contains in the filename
> encoding used.

I just thought I'd repeat the questions <wink>.  However, I don't think
you'll really want the answers -- Windows is a legacy-encrusted mess, and
there are always many ways to get a thing done in the end.  For example ...

> Question: how does Fredrik create a file with a Euro
> character (u'\u20ac') in its name?

This particular one is shallower than you were hoping:  in many of the
TrueType fonts (e.g., Courier New but not Courier), Windows extended its
Latin-1 encoding by mapping the Euro symbol to the "control character" 0x80.
So I can get a Euro symbol into a file name just by typing Alt+0+1+2+8.
This is true even on US Win98 (which has no visible Unicode support) -- but
was not supported in US Win95.
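
Assuming the cp1252 codec that ships with the new Unicode code follows
Windows here (0x80 <-> U+20AC), that's easy to check interactively:

    >>> u"\u20ac".encode("cp1252")
    '\x80'
    >>> unicode("\x80", "cp1252")
    u'\u20ac'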

i've-been-tracking-down-what-appears-to-be-a-hw-bug-on-a-japanese-laptop-
    at-work-so-can-verify-ms-sure-got-japanese-characters-into-the-
    filenames-somehow-but-doubt-it's-via-unicode-ly y'rs  - tim





From effbot at telia.com  Tue May  2 09:55:49 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 09:55:49 +0200
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim>
Message-ID: <007d01bfb40b$d7693720$34aab5d4@hagrid>

Tim Peters wrote:
> [Guido asks good questions about how Windows deals w/ Unicode filenames,
>  last Thursday, but gets no answers]

you missed Finn Bock's post on how Java does it.

here's another data point:

Tcl uses a system encoding to convert from unicode to a suitable
system API encoding, and uses the following approach to figure out
what that one is:

    windows NT/2000:
        unicode (use wide api)

    windows 95/98:
        "cp%d" % GetACP()
        (note that this is "cp1252" in us and western europe,
        not "iso-8859-1")
  
    macintosh:
        determine encoding for fontId 0 based on (script,
        smScriptLanguage) tuple. if that fails, assume
        "macroman"

    unix:
        figure out the locale from LC_ALL, LC_CTYPE, or LANG.
        use heuristics to map from the locale to an encoding
        (see unix/tclUnixInit). if that fails, assume "iso-8859-1"

I propose adding a similar mechanism to Python, along these lines:

    sys.getdefaultencoding() returns the right thing for windows
    and macintosh, "iso-8859-1" for other platforms.

    sys.setencoding(codec) changes the system encoding.  it's
    used from site.py to set things up properly on unix and other
    non-unicode platforms.
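
for the unix case, the site.py side might look roughly like the sketch
below (sys.setencoding() is the *proposed* call -- it doesn't exist yet --
and the locale-to-codec heuristic is only illustrative):

    import os, sys

    def guess_encoding():
        # heuristics in the spirit of Tcl's unix/tclUnixInit
        for name in ("LC_ALL", "LC_CTYPE", "LANG"):
            locale = os.environ.get(name)
            if locale:
                break
        else:
            return "iso-8859-1"
        if "." in locale:
            return locale.split(".")[-1].lower()   # e.g. "ja_JP.eucJP" -> "eucjp"
        return "iso-8859-1"

    if hasattr(sys, "setencoding"):            # proposed API, not in 1.6
        sys.setencoding(guess_encoding())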

</F>




From nhodgson at bigpond.net.au  Tue May  2 10:22:36 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 18:22:36 +1000
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim>
Message-ID: <004501bfb40f$92ff0980$e3cb8490@neil>

> > I'd like to solve this problem, but I have some questions: what *IS*
> > the encoding used for filenames on Windows?  This may differ per
> > Windows version; perhaps it can differ drive letter?  Or per
> > application or per thread?  On Windows NT, filenames are supposed to
> > be Unicode.  (I suppose also on Windows 2000?)  How do I open a file
> > with a given Unicode string for its name, in a C program?  I suppose
> > there's a Win32 API call for that which has a Unicode variant.

   It's decided by each file system.

   For FAT file systems, the OEM code page is used. The OEM code page
generally used in the United States is code page 437, which is different from
the code page Windows uses for display. I had to deal with this in a system
where people used fractions (1/4, 1/2 and 3/4) as part of names which had to
be converted into valid file names. For example 1/4 is 0xBC for display but
0xAC when used in a file name.
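
Assuming the cp437 and cp1252 codecs from the new Unicode machinery are
available, the two byte values for the same character are easy to see:

    >>> u"\u00bc".encode("cp1252")    # 1/4 in the display (ANSI) code page
    '\xbc'
    >>> u"\u00bc".encode("cp437")     # 1/4 in the OEM code page used for FAT names
    '\xac'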

   In Japan, I think different manufacturers used different encodings with
NEC trying to maintain market control with their own encoding.

   VFAT stores both Unicode long file names and shortened aliases. However
the Unicode variant is hard to get to from Windows 95/98.

   NTFS stores Unicode.

> > On Windows 95/98, the Unicode variants of the Win32 API calls don't
> > exist.  So what is the poor Python runtime to do there?

   Fail the call. All existing files can be opened because they have short
non-Unicode aliases. If a file with a Unicode name can not be created
because the OS doesn't support it then you should give up. Just as you
should give up if you try to save a file with a name that includes a
character not allowed by the file system.

> > Can Japanese people use Japanese characters in filenames on Windows
> > 95/98?

   Yes.

> > Let's assume they can.  Since the filesystem isn't Unicode
> > aware, the filenames must be encoded.  Which encoding is used?  Let's
> > assume they use Microsoft's multibyte encoding.  If they put such a
> > file on a floppy and ship it to Linköping, what will Fredrik see as
> > the filename?  (I.e., is the encoding fixed by the disk volume, or by
> > the operating system?)

   If Fredrik is running a non-Japanese version of Windows 9x, he will see
some 'random' western characters replacing the Japanese.

   Neil




From effbot at telia.com  Tue May  2 10:36:40 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 10:36:40 +0200
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil>
Message-ID: <008501bfb411$8e0502c0$34aab5d4@hagrid>

Neil Hodgson wrote:
>    It's decided by each file system.

...but the system API translates from the active code page to the
encoding used by the file system, right?

on my w95 box, GetACP() returns 1252, and GetOEMCP() returns
850.  

if I create a file with a name containing latin-1 characters, on a
FAT drive, it shows up correctly in the file browser (cp1252), and
also shows up correctly in the MS-DOS window (under cp850).

if I print the same filename to stdout in the same DOS window, I
get gibberish.

> > > On Windows 95/98, the Unicode variants of the Win32 API calls don't
> > > exist.  So what is the poor Python runtime to do there?
> 
>    Fail the call.

...if you fail to convert from unicode to the local code page.

</F>




From mal at lemburg.com  Tue May  2 10:36:43 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 10:36:43 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <390E939B.11B99B71@lemburg.com>

Just a small note on the subject of a character being atomic
which seems to have been forgotten by the discussing parties:

Unicode itself can be understood as multi-word character
encoding, just like UTF-8. The reason is that Unicode entities
can be combined to produce single display characters (e.g.
u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
Slicing such a combined Unicode string will have the same
effect as slicing UTF-8 data.

It seems that most Latin-1 proponents seem to have single
display characters in mind. While the same is true for
many Unicode entities, there are quite a few cases of
combining characters in Unicode 3.0 and the Unicode
normalization algorithm uses these as basis for its
work.

So in the end the "UTF-8 doesn't slice" argument holds for
Unicode itself too, just as it also does for many Asian
multi-byte variable length character encodings,
image formats, audio formats, database formats, etc.

You can't really expect slicing to always "just work"
without some knowledge about the data you are slicing.
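
A small interactive illustration of the point (slicing splits the
combining pair silently; no error is raised):

    >>> u = u"e" + u"\u0301"    # e + COMBINING ACUTE ACCENT: one display character
    >>> len(u)
    2
    >>> u[:1]                   # the accent is silently sliced off
    u'e'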

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From ping at lfw.org  Tue May  2 10:42:51 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 01:42:51 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <Pine.LNX.4.10.10005020114250.522-100000@localhost>

I'll warn you that i'm not much experienced or well-informed, but
i suppose i might as well toss in my naive opinion.

At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
> 
> I believe that whether the default encoding is UTF-8 or Latin-1
> doesn't matter for here -- both are wrong, she needs to write explicit
> unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
> "better", because [this] will most likely give an exception...

On Tue, 2 May 2000, Just van Rossum wrote:
> But then it's even better to *always* raise an exception, since it's
> entirely possible a string contains valid utf-8 while not *being* utf-8.

I believe it is time for me to make a truly radical proposal:

    No automatic conversions between 8-bit "strings" and Unicode strings.

If you want to turn UTF-8 into a Unicode string, say so.
If you want to turn Latin-1 into a Unicode string, say so.
If you want to turn ISO-2022-JP into a Unicode string, say so.
Adding a Unicode string and an 8-bit "string" gives an exception.

I know this sounds tedious, but at least it stands the least possible
chance of confusing anyone -- and given all i've seen here and in
other i18n and l10n discussions, there's plenty enough confusion to
go around already.


If it turns out automatic conversions *are* absolutely necessary,
then i vote in favour of the simple, direct method promoted by Paul
and Fredrik: just copy the numerical values of the bytes.  The fact
that this happens to correspond to Latin-1 is not really the point;
the main reason is that it satisfies the Principle of Least Surprise.


Okay.  Feel free to yell at me now.


-- ?!ng

P. S.  The scare-quotes when i talk about 8-bit "strings" expose my
sense of them as byte-buffers -- since that *is* all you get when you
read in some bytes from a file.  If you manipulate an 8-bit "string"
as a character string, you are implicitly making the assumption that
the byte values correspond to the character encoding of the character
repertoire you want to work with, and that's your responsibility.

P. P. S.  If always having to specify encodings is really too much,
i'd probably be willing to consider a default-encoding state on the
Unicode class, but it would have to be a stack of values, not a
single value.




From effbot at telia.com  Tue May  2 11:00:07 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 11:00:07 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."             <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
Message-ID: <009701bfb414$d35d0ea0$34aab5d4@hagrid>

M.-A. Lemburg <mal at lemburg.com> wrote:
> Just a small note on the subject of a character being atomic
> which seems to have been forgotten by the discussing parties:
> 
> Unicode itself can be understood as multi-word character
> encoding, just like UTF-8. The reason is that Unicode entities
> can be combined to produce single display characters (e.g.
> u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> Slicing such a combined Unicode string will have the same
> effect as slicing UTF-8 data.

really?  does it result in a decoder error?  or does it just result
in a rendering error, just as if you slice off any trailing character
without looking...

> It seems that most Latin-1 proponents seem to have single
> display characters in mind. While the same is true for
> many Unicode entities, there are quite a few cases of
> combining characters in Unicode 3.0 and the Unicode
> normalization algorithm uses these as basis for its
> work.

do we support automatic normalization in 1.6?

</F>




From ping at lfw.org  Tue May  2 11:46:40 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:46:40 -0700 (PDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <Pine.GSO.4.10.10005020732200.8759-100000@sundial>
Message-ID: <Pine.LNX.4.10.10005020242270.522-100000@localhost>

On Tue, 2 May 2000, Moshe Zadka wrote:
> 
> > Thanks for bringing this up again.  I think it should be called
> > sys.displayhook.

I apologize profusely for dropping the ball on this.  I
was going to do it; i have been having a tough time lately
figuring out a Big Life Decision.  (Hate those BLDs.)

I was partway through hacking the patch and didn't get back
to it, but i wanted to at least air the plan i had in mind.
I hope you'll allow me this indulgence.

I was planning to submit a patch that adds the built-in routines

    sys.display
    sys.displaytb

    sys.__display__
    sys.__displaytb__

sys.display(obj) would be implemented as 'print repr(obj)'
and sys.displaytb(tb, exc) would call the same built-in
traceback printer we all know and love.

I assumed that sys.__stdin__ was added to make it easier to
restore sys.stdin to its original value.  In the same vein,
sys.__display__ and sys.__displaytb__ would be saved references
to the original sys.display and sys.displaytb.

I hate to contradict Guido, but i'll gently suggest why i
like "display" better than "displayhook": "display" is a verb,
and i prefer function names to be verbs rather than nouns
describing what the functions are (e.g. "read" rather than
"reader", etc.)


-- ?!ng




From ping at lfw.org  Tue May  2 11:47:34 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:47:34 -0700 (PDT)
Subject: [Python-Dev] Traceback style
Message-ID: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>

This was also going to go out after i posted the
display/displaytb patch.  But anyway, let's see what
you all think.

I propose the following stylistic changes to traceback
printing:

    1.  If there is no function name for a given level
        in the traceback, just omit the ", in ?" at the
        end of the line.

    2.  If a given level of the traceback is in a method,
        instead of just printing the method name, print
        the class and the method name.

    3.  Instead of beginning each line with:
        
            File "foo.py", line 5

        print the line first and drop the quotes:

            Line 5 of foo.py

        In the common interactive case that the file
        is a typed-in string, the current printout is
        
            File "<stdin>", line 1
        
        and the following is easier to read in my opinion:

            Line 1 of <stdin>

Here is an example:

    >>> class Spam:
    ...     def eggs(self):
    ...         return self.ham
    ... 
    >>> s = Spam()
    >>> s.eggs()
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 3, in eggs
    AttributeError: ham

With the suggested changes, this would print as

    Traceback (innermost last):
      Line 1 of <stdin>
      Line 3 of <stdin>, in Spam.eggs
    AttributeError: ham



-- ?!ng

"In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand."
    -- Gerald Holton




From ping at lfw.org  Tue May  2 11:53:01 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:53:01 -0700 (PDT)
Subject: [Python-Dev] Traceback behaviour in exceptional cases
Message-ID: <Pine.LNX.4.10.10004170045510.1157-100000@localhost>

Here is how i was planning to take care of exceptions in
sys.displaytb...


    1.  When the 'sys' module does not contain a 'stderr'
        attribute, Python currently prints 'lost sys.stderr'
        to the original stderr instead of printing the traceback.
        I propose that it proceed to try to print the traceback
        to the real stderr in this case.

    2.  If 'sys.stderr' is buffered, the traceback does not
        appear in the file.  I propose that Python flush
        'sys.stderr' immediately after printing a traceback.

    3.  Tracebacks get printed to whatever object happens to
        be in 'sys.stderr'.  If the object is not a file (or
        other problems occur during printing), nothing gets
        printed anywhere.  I propose that Python warn about
        this on stderr, then try to print the traceback to
        the real stderr as above.

    4.  Similarly, 'sys.displaytb' may cause an exception.
        I propose that when this happens, Python invoke its
        default traceback printer to print the exception from
        'sys.displaytb' as well as the original exception.

#4 may seem a little convoluted, so here is the exact logic
i suggest (described here in Python but to be implemented in C),
where 'handle_exception()' is the routine the interpreter uses
to handle an exception, 'print_exception' is the built-in
exception printer currently implemented in PyErr_PrintEx and
PyTraceBack_Print, and 'err' is the actual, original stderr.

    def print_double_exception(tb, exc, disptb, dispexc, file):
        file.write("Exception occurred during traceback display:\n")
        print_exception(disptb, dispexc, file)
        file.write("\n")
        file.write("Original exception passed to display routine:\n")
        print_exception(tb, exc, file)

    def handle_double_exception(tb, exc, disptb, dispexc):
        if not hasattr(sys, 'stderr'):
            err.write("Missing sys.stderr; printing exception to stderr.\n")
            print_double_exception(tb, exc, disptb, dispexc, err)
            return
        try:
            print_double_exception(tb, exc, disptb, dispexc, sys.stderr)
        except:
            err.write("Error on sys.stderr; printing exception to stderr.\n")
            print_double_exception(tb, exc, disptb, dispexc, err)

    def handle_exception():
        tb, exc = sys.exc_traceback, sys.exc_value
        try:
            sys.displaytb(tb, exc)
        except:
            disptb, dispexc = sys.exc_traceback, sys.exc_value
            try:
                handle_double_exception(tb, exc, disptb, dispexc)
            except: pass

    def default_displaytb(tb, exc):
        if hasattr(sys, 'stderr'):
            print_exception(tb, exc, sys.stderr)
        else:
            err.write("Missing sys.stderr; printing exception to stderr.\n")
            print_exception(tb, exc, err)

    sys.displaytb = sys.__displaytb__ = default_displaytb



-- ?!ng

"In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand."
    -- Gerald Holton




From mal at lemburg.com  Tue May  2 11:56:21 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 11:56:21 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."             <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>
Message-ID: <390EA645.89E3B22A@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal at lemburg.com> wrote:
> > Just a small note on the subject of a character being atomic
> > which seems to have been forgotten by the discussing parties:
> >
> > Unicode itself can be understood as multi-word character
> > encoding, just like UTF-8. The reason is that Unicode entities
> > can be combined to produce single display characters (e.g.
> > u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> > Slicing such a combined Unicode string will have the same
> > effect as slicing UTF-8 data.
> 
> really?  does it result in a decoder error?  or does it just result
> in a rendering error, just as if you slice off any trailing character
> without looking...

In the example, if you cut off the u"\u0301", the "e" would
appear without the acute accent, cutting off the u"e" would
probably result in a rendering error or worse put the accent
over the next character to the left.

UTF-8 is better in this respect: it warns you about
the error by raising an exception when being converted to
Unicode.
 
> > It seems that most Latin-1 proponents seem to have single
> > display characters in mind. While the same is true for
> > many Unicode entities, there are quite a few cases of
> > combining characters in Unicode 3.0 and the Unicode
> > normalization algorithm uses these as basis for its
> > work.
> 
> do we support automatic normalization in 1.6?

No, but it is likely to appear in 1.7... not sure about
the "automatic" though.

FYI: Normalization is needed to make comparing Unicode
strings robust, e.g. u"é" should compare equal to u"e\u0301".
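
For the record, here is roughly what such a comparison would look like
once a normalization API exists -- unicodedata.normalize() is *not*
available in 1.6, so this is only a sketch of the intended semantics:

    import unicodedata               # normalize() is hypothetical for 1.6

    u1 = u"\u00e9"                   # precomposed e-acute
    u2 = u"e\u0301"                  # e + combining acute accent
    print u1 == u2                                   # false: different code points
    print unicodedata.normalize("NFC", u2) == u1     # true: equal after normalization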

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From esr at thyrsus.com  Tue May  2 12:16:55 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 2 May 2000 06:16:55 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>; from ping@lfw.org on Tue, May 02, 2000 at 02:47:34AM -0700
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <20000502061655.A16999@thyrsus.com>

Ka-Ping Yee <ping at lfw.org>:
> I propose the following stylistic changes to traceback
> printing:
> 
>     1.  If there is no function name for a given level
>         in the traceback, just omit the ", in ?" at the
>         end of the line.
> 
>     2.  If a given level of the traceback is in a method,
>         instead of just printing the method name, print
>         the class and the method name.
> 
>     3.  Instead of beginning each line with:
>         
>             File "foo.py", line 5
> 
>         print the line first and drop the quotes:
> 
>             Line 5 of foo.py
> 
>         In the common interactive case that the file
>         is a typed-in string, the current printout is
>         
>             File "<stdin>", line 1
>         
>         and the following is easier to read in my opinion:
> 
>             Line 1 of <stdin>
> 
> Here is an example:
> 
>     >>> class Spam:
>     ...     def eggs(self):
>     ...         return self.ham
>     ... 
>     >>> s = Spam()
>     >>> s.eggs()
>     Traceback (innermost last):
>       File "<stdin>", line 1, in ?
>       File "<stdin>", line 3, in eggs
>     AttributeError: ham
> 
> With the suggested changes, this would print as
> 
>     Traceback (innermost last):
>       Line 1 of <stdin>
>       Line 3 of <stdin>, in Spam.eggs
>     AttributeError: ham

IMHO, this is not a good idea.  Emacs users like me want traceback
labels to be *more* like C compiler error messages, not less.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The United States is in no way founded upon the Christian religion
	-- George Washington & John Adams, in a diplomatic message to Malta.



From moshez at math.huji.ac.il  Tue May  2 12:12:14 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Tue, 2 May 2000 13:12:14 +0300 (IDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>

On Mon, 1 May 2000, Guido van Rossum wrote:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

Well, I'm guessing you had someone specific in mind (Neil?), but I want to
say something too, as the only one here (I think) using ISO-8859-8
natively. I much prefer the Fredrik-Paul position, known also as the
character is a character position, to the UTF-8 as default encoding.
Unicode is western-centered -- the first 256 characters are Latin 1. UTF-8
is even more horribly western-centered (or I should say USA centered) --
ASCII documents are the same. I'd much prefer Python to reflect a
fundamental truth about Unicode, which at least makes sure binary-goop can
pass through Unicode and remain unharmed, than to reflect a nasty problem
with UTF-8 (not everything is legal). 

If I'm using Hebrew characters in my source (which I won't for a long
while), I'll use them in  Unicode strings only, and make sure I use
Unicode. If I'm reading Hebrew from an ISO-8859-8 file, I'll set a
conversion to Unicode on the fly anyway, since most bidi libraries work on
Unicode. So having UTF-8 conversions magically happen won't help me at
all, and will only cause problem when I use "sort-for-uniqueness" on a
list with mixed binary-goop and Unicode strings. In short, this sounds
like a recipe for disaster.

internationally y'rs, Z.

--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From pf at artcom-gmbh.de  Tue May  2 12:12:26 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 2 May 2000 12:12:26 +0200 (MEST)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us> from "Barry A. Warsaw" at "May 1, 2000 12:18:25 pm"
Message-ID: <m12mZfG-000CnCC@artcom0.artcom-gmbh.de>

Barry A. Warsaw:
> Update of /projects/cvsroot/python/dist/src/Doc/lib
[...]
> 	libos.tex 
[...]
>   Availability: Macintosh, \UNIX{}, Windows.
>   \end{funcdesc}
> --- 703,712 ----
>   \end{funcdesc}
>   
> ! \begin{funcdesc}{utime}{path, times}
> ! Set the access and modified times of the file specified by \var{path}.
> ! If \var{times} is \code{None}, then the file's access and modified
> ! times are set to the current time.  Otherwise, \var{times} must be a
> ! 2-tuple of numbers, of the form \var{(atime, mtime)} which is used to
> ! set the access and modified times, respectively.
>   Availability: Macintosh, \UNIX{}, Windows.
>   \end{funcdesc}

I may have missed something, but I haven't seen a patch to the WinXX
and MacOS implementation of the 'utime' function.  So either the
documentation should explicitly point out that the new additional
signature is only available on Unices, or, even better, it should be
implemented on all platforms so that programmers intending to write
portable Python do not have to worry about this.

I suggest an additional note saying that this signature has been
added in Python 1.6.  There used to be several such notes all over
the documentation saying for example: "New in version 1.5.2." which
I found very useful in the past!

Regards, Peter



From nhodgson at bigpond.net.au  Tue May  2 12:22:00 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 20:22:00 +1000
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil> <008501bfb411$8e0502c0$34aab5d4@hagrid>
Message-ID: <00d101bfb420$4197e510$e3cb8490@neil>

> ...but the system API translates from the active code page to the
> encoding used by the file system, right?

   Yes, although I think that wasn't the case with Win16 and there are still
some situations in which you have to deal with the differences. Copying a
file from the console on Windows 95 to a FAT volume appears to allow use of
the OEM character set with no conversion.

> if I create a file with a name containing latin-1 characters, on a
> FAT drive, it shows up correctly in the file browser (cp1252), and
> also shows up correctly in the MS-DOS window (under cp850).

   Do you have a FAT drive or a VFAT drive? If you format as FAT on 9x or NT
you will get a VFAT volume.

   Neil




From ping at lfw.org  Tue May  2 12:23:26 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 03:23:26 -0700 (PDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <20000502061655.A16999@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005020317030.522-100000@localhost>

On Tue, 2 May 2000, Eric S. Raymond wrote:
>
> Ka-Ping Yee <ping at lfw.org>:
> > 
> > With the suggested changes, this would print as
> > 
> >     Traceback (innermost last):
> >       Line 1 of <stdin>
> >       Line 3 of <stdin>, in Spam.eggs
> >     AttributeError: ham
> 
> IMHO, this is not a good idea.  Emacs users like me want traceback
> labels to be *more* like C compiler error messages, not less.

I suppose Python could go all the way and say things like

    Traceback (innermost last):
      <stdin>:3
      foo.py:25: in Spam.eggs
    AttributeError: ham

but that might be more intimidating for a beginner.

Besides, you Emacs guys have plenty of programmability anyway :)
You would have to do a little parsing to get the file name and
line number from the current format; it's no more work to get
it from the suggested format.

(What i would really like, by the way, is to see the values of
the function arguments on the stack -- but that's a lot of work
to do in C, so implementing this with the help of repr.repr
will probably be the first thing i do with sys.displaytb.)
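
A rough sketch of the kind of thing that becomes easy once the hook
exists (names and layout are illustrative only, not the eventual patch):

    import repr

    def show_innermost_args(tb):
        # walk to the innermost frame and show its argument values
        while tb.tb_next is not None:
            tb = tb.tb_next
        frame = tb.tb_frame
        code = frame.f_code
        for name in code.co_varnames[:code.co_argcount]:
            value = frame.f_locals.get(name, "<unbound>")
            print "    %s = %s" % (name, repr.repr(value))

    # e.g. called from inside a displaytb hook:  show_innermost_args(tb)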


-- ?!ng




From mal at lemburg.com  Tue May  2 12:46:06 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 12:46:06 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>
Message-ID: <390EB1EE.EA557CA9@lemburg.com>

Moshe Zadka wrote:
> 
> I'd much prefer Python to reflect a
> fundamental truth about Unicode, which at least makes sure binary-goop can
> pass through Unicode and remain unharmed, than to reflect a nasty problem
> with UTF-8 (not everything is legal).

Let's not make the same mistake again: Unicode objects should *not*
be used to hold binary data. Please use buffers instead.

BTW, I think that this behaviour should be changed:

>>> buffer('binary') + 'data'
'binarydata'

while:

>>> 'data' + buffer('binary')         
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation

IMHO, buffer objects should never coerce to strings, but instead
return a buffer object holding the combined contents. The
same applies to slicing buffer objects:

>>> buffer('binary')[2:5]
'nar'

should preferably be buffer('nar').

--

Hmm, perhaps we need something like a data string object
to get this 100% right ?!

>>> d = data("...data...")
or
>>> d = d"...data..."
>>> print type(d)
<type 'data'>

>>> 'string' + d
d"string...data..."
>>> u'string' + d
d"s\000t\000r\000i\000n\000g\000...data..."

>>> d[:5]
d"...da"

etc.

Ideally, string and Unicode objects would then be subclasses
of this type in Py3K.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From pf at artcom-gmbh.de  Tue May  2 12:59:55 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 2 May 2000 12:59:55 +0200 (MEST)
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10005020317030.522-100000@localhost> from Ka-Ping Yee at "May 2, 2000  3:23:26 am"
Message-ID: <m12maPD-000CnCC@artcom0.artcom-gmbh.de>

> > Ka-Ping Yee <ping at lfw.org>:
> > > 
> > > With the suggested changes, this would print as
> > > 
> > >     Traceback (innermost last):
> > >       Line 1 of <stdin>
> > >       Line 3 of <stdin>, in Spam.eggs
> > >     AttributeError: ham

> On Tue, 2 May 2000, Eric S. Raymond wrote:
> > IMHO, this is not a good idea.  Emacs users like me want traceback
> > labels to be *more* like C compiler error messages, not less.
> 
Ka-Ping Yee :
[...]
> Besides, you Emacs guys have plenty of programmability anyway :)
> You would have to do a little parsing to get the file name and
> line number from the current format; it's no more work to get
> it from the suggested format.

I like Ping's proposed traceback output.

But besides existing Elisp code there might be other software relying
on a particular format.  As a long-time vim user I have absolutely
no idea about other IDEs.  So before changing the default format this
should be carefully checked.

> (What i would really like, by the way, is to see the values of
> the function arguments on the stack -- but that's a lot of work
> to do in C, so implementing this with the help of repr.repr
> will probably be the first thing i do with sys.displaytb.)

I'm eagerly waiting to see this. ;-)

Regards, Peter



From just at letterror.com  Tue May  2 14:34:57 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 13:34:57 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390E939B.11B99B71@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            	
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>	
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <l03102804b534772fc25b@[193.78.237.142]>

At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
>Just a small note on the subject of a character being atomic
>which seems to have been forgotten by the discussing parties:
>
>Unicode itself can be understood as multi-word character
>encoding, just like UTF-8. The reason is that Unicode entities
>can be combined to produce single display characters (e.g.
>u"e"+u"\u0301" will print "é" in a Unicode aware renderer).

Erm, are you sure Unicode prescribes this behavior, for this
example? I know similar behaviors are specified for certain
languages/scripts, but I didn't know it did that for Latin.

>Slicing such a combined Unicode string will have the same
>effect as slicing UTF-8 data.

Not true. As Fredrik noted: no exception will be raised.

[ Speaking of exceptions,

after I sent off my previous post I realized Guido's
non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
argument can easily be turned around, backfiring at utf-8:

    Defaulting to utf-8 when going from Unicode to 8-bit and
    back only gives the *illusion* things "just work", since it
    will *silently* "work", even if utf-8 is *not* the desired
    8-bit encoding -- as shown by Fredrik's excellent "fun with
    Unicode, part 1" example. Defaulting to Latin-1 will
    warn the user *much* earlier, since it'll barf when
    converting a Unicode string that contains any character
    code > 255. So there.
]
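
Concretely -- assuming the usual latin-1 and utf-8 codecs -- the
asymmetry looks like this:

    u = u"some text plus one non-Latin-1 character: \u0905"

    s8 = u.encode("utf-8")          # always succeeds, silently, wanted or not
    try:
        s1 = u.encode("latin-1")    # any code point > 255 fails loudly, right away
    except UnicodeError:
        print "latin-1 refused it"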

>It seems that most Latin-1 proponents seem to have single
>display characters in mind. While the same is true for
>many Unicode entities, there are quite a few cases of
>combining characters in Unicode 3.0 and the Unicode
>normalization algorithm uses these as basis for its
>work.

Still, two combining characters are still two input characters for
the renderer! They may result in one *glyph*, but trust me,
that's an entirely different can of worms.

However, if you'd be talking about Unicode surrogates,
you'd definitely have a point. How do Java/Perl/Tcl deal with
surrogates?

Just





From nhodgson at bigpond.net.au  Tue May  2 13:40:44 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 21:40:44 +1000
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil> <003b01bfb404$03cd0560$34aab5d4@hagrid>
Message-ID: <013e01bfb42b$41a3f200$e3cb8490@neil>

>    u = aUnicodeStringFromSomewhere
>    s = an8bitStringFromSomewhere
>
>    DoSomething(s + u)

> in Guido's design, the first example may or may not result in
> an "UTF-8 decoding error: unexpected code byte" exception.

   I would say it is less surprising for most people for this to follow the
silent-widening of each byte - the Fredrik-Paul position. With the current
scarcity of UTF-8 code, very few people will expect an automatic UTF-8 to
UTF-16 conversion. While complete prohibition of automatic conversion has
some appeal, it will just be more noise to many.

>    u = aUnicodeStringFromSomewhere
>    s = an8bitStringFromSomewhere
>
>    if len(u) + len(s) == len(u + s):
>        print "true"
>    else:
>        print "not true"

> the second example may result in a
> similar error, print "true", or print "not true", depending on the
> contents of the 8-bit string.

   I don't see this as important, as it's taking the idea that Unicode strings
are equivalent to 8-bit strings too far. How much further before you have to
break? I always thought of len measuring the number of bytes rather than
characters when applied to strings. The same as strlen in C when you have a
DBCS string.

   I should correct some of the stuff Mark wrote about me. At Fujitsu we did
a lot more DBCS work than Unicode because that's what Japanese code uses.
Even with Java most storage is still DBCS. I was more involved with Unicode
architecture at Reuters 6 or so years ago.

   Neil




From guido at python.org  Tue May  2 13:53:10 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 07:53:10 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Tue, 02 May 2000 02:46:40 PDT."
             <Pine.LNX.4.10.10005020242270.522-100000@localhost> 
References: <Pine.LNX.4.10.10005020242270.522-100000@localhost> 
Message-ID: <200005021153.HAA24134@eric.cnri.reston.va.us>

> I was planning to submit a patch that adds the built-in routines
> 
>     sys.display
>     sys.displaytb
> 
>     sys.__display__
>     sys.__displaytb__
> 
> sys.display(obj) would be implemented as 'print repr(obj)'
> and sys.displaytb(tb, exc) would call the same built-in
> traceback printer we all know and love.

Sure.  Though I would recommend separating the patch into two parts,
because their implementations are totally unrelated.

> I assumed that sys.__stdin__ was added to make it easier to
> restore sys.stdin to its original value.  In the same vein,
> sys.__display__ and sys.__displaytb__ would be saved references
> to the original sys.display and sys.displaytb.

Good idea.

> I hate to contradict Guido, but i'll gently suggest why i
> like "display" better than "displayhook": "display" is a verb,
> and i prefer function names to be verbs rather than nouns
> describing what the functions are (e.g. "read" rather than
> "reader", etc.)

Good idea.  But I hate the "displaytb" name (when I read your message
I had no idea what the "tb" stood for until you explained it).

Hm, perhaps we could do showvalue and showtraceback?
("displaytraceback" is a bit long.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:15:28 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:15:28 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: Your message of "Tue, 02 May 2000 12:12:26 +0200."
             <m12mZfG-000CnCC@artcom0.artcom-gmbh.de> 
References: <m12mZfG-000CnCC@artcom0.artcom-gmbh.de> 
Message-ID: <200005021215.IAA24169@eric.cnri.reston.va.us>

> > ! \begin{funcdesc}{utime}{path, times}
> > ! Set the access and modified times of the file specified by \var{path}.
> > ! If \var{times} is \code{None}, then the file's access and modified
> > ! times are set to the current time.  Otherwise, \var{times} must be a
> > ! 2-tuple of numbers, of the form \var{(atime, mtime)} which is used to
> > ! set the access and modified times, respectively.
> >   Availability: Macintosh, \UNIX{}, Windows.
> >   \end{funcdesc}
> 
> I may have missed something, but I haven't seen a patch to the WinXX
> and MacOS implementation of the 'utime' function.  So either the
> documentation should explicitly point out, that the new additional
> signature is only available on Unices or even better it should be
> implemented on all platforms so that programmers intending to write
> portable Python have not to worry about this.

Actually, it works on WinXX (tested on 98).  The utime()
implementation there is the same file as on Unix, so the patch fixed
both platforms.  The MS C library only seems to set the mtime, but
that's okay.

On Mac, I hope that the utime() function in GUSI 2 does this, in which
case Jack Jansen needs to copy Barry's patch.

> I suggest an additional note saying that this signature has been
> added in Python 1.6.  There used to be several such notes all over
> the documentation saying for example: "New in version 1.5.2." which
> I found very useful in the past!

Thanks, you're right!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:19:38 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:19:38 -0400
Subject: [Python-Dev] fun with unicode, part 1
In-Reply-To: Your message of "Tue, 02 May 2000 20:22:00 +1000."
             <00d101bfb420$4197e510$e3cb8490@neil> 
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil> <008501bfb411$8e0502c0$34aab5d4@hagrid>  
            <00d101bfb420$4197e510$e3cb8490@neil> 
Message-ID: <200005021219.IAA24181@eric.cnri.reston.va.us>

>    Yes, although I think that wasn't the case with Win16 and there are still
> some situations in which you have to deal with the differences. Copying a
> file from the console on Windows 95 to a FAT volume appears to allow use of
> the OEM character set with no conversion.

BTW, MS's use of code pages is full of shit.  Yesterday I was
spell-checking a document that had the name Andre in it (the accent
was missing).  The popup menu suggested Andr* where the * was an upper
case slashed O.  I first thought this was because the menu character
set might be using a different code page, but no -- it must have been
bad in the database, because selecting that entry from the menu
actually inserted the slashed O character.  So they must have been
maintaining their database with a different code page.

Just to indicate that when we sort out the rest of the Unicode debate
(which I'm sure we will :-) there will still be surprises on
Windows...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:22:24 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:22:24 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 03:23:26 PDT."
             <Pine.LNX.4.10.10005020317030.522-100000@localhost> 
References: <Pine.LNX.4.10.10005020317030.522-100000@localhost> 
Message-ID: <200005021222.IAA24192@eric.cnri.reston.va.us>

> > Ka-Ping Yee <ping at lfw.org>:
> > > With the suggested changes, this would print as
> > > 
> > >     Traceback (innermost last):
> > >       Line 1 of <stdin>
> > >       Line 3 of <stdin>, in Spam.eggs
> > >     AttributeError: ham

ESR:
> > IMHO, this is not a good idea.  Emacs users like me want traceback
> > labels to be *more* like C compiler error messages, not less.

Ping:
> I suppose Python could go all the way and say things like
> 
>     Traceback (innermost last):
>       <stdin>:3
>       foo.py:25: in Spam.eggs
>     AttributeError: ham
> 
> but that might be more intimidating for a beginner.
> 
> Besides, you Emacs guys have plenty of programmability anyway :)
> You would have to do a little parsing to get the file name and
> line number from the current format; it's no more work to get
> it from the suggested format.

Not sure -- I think I carefully designed the old format to be one of
the formats that Emacs parses *by default*: File "...", line ...  Your
change breaks this.

> (What i would really like, by the way, is to see the values of
> the function arguments on the stack -- but that's a lot of work
> to do in C, so implementing this with the help of repr.repr
> will probably be the first thing i do with sys.displaytb.)

Yes, this is much easier in Python.  Watch out for values that are
uncomfortably big or recursive or that cause additional exceptions on
displaying.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:26:50 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:26:50 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 12:46:06 +0200."
             <390EB1EE.EA557CA9@lemburg.com> 
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>  
            <390EB1EE.EA557CA9@lemburg.com> 
Message-ID: <200005021226.IAA24203@eric.cnri.reston.va.us>

[MAL]
> Let's not do the same mistake again: Unicode objects should *not*
> be used to hold binary data. Please use buffers instead.

Easier said than done -- Python doesn't really have a buffer data
type.  Or do you mean the array module?  It's not trivial to read a
file into an array (although it's possible, there are even two ways).
Fact is, most of Python's standard library and built-in objects use
(8-bit) strings as buffers.
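
For illustration, here is a sketch of two such ways (not necessarily the
two meant above; the file name is made up):

    import array, os

    f = open("data.bin", "rb")
    n = os.path.getsize("data.bin")
    a = array.array('B')               # array of unsigned bytes
    a.fromfile(f, n)                   # way 1: read items straight from the file
    f.seek(0)
    b = array.array('B', f.read())     # way 2: build the array from a string
    f.close()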

I agree there's no reason to extend this to Unicode strings.

> BTW, I think that this behaviour should be changed:
> 
> >>> buffer('binary') + 'data'
> 'binarydata'
> 
> while:
> 
> >>> 'data' + buffer('binary')         
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: illegal argument type for built-in operation
> 
> IMHO, buffer objects should never coerce to strings, but instead
> return a buffer object holding the combined contents. The
> same applies to slicing buffer objects:
> 
> >>> buffer('binary')[2:5]
> 'nar'
> 
> should preferably be buffer('nar').

Note that a buffer object doesn't hold data!  It's only a pointer to
data.  I can't off-hand explain the asymmetry though.

> --
> 
> Hmm, perhaps we need something like a data string object
> to get this 100% right ?!
> 
> >>> d = data("...data...")
> or
> >>> d = d"...data..."
> >>> print type(d)
> <type 'data'>
> 
> >>> 'string' + d
> d"string...data..."
> >>> u'string' + d
> d"s\000t\000r\000i\000n\000g\000...data..."
> 
> >>> d[:5]
> d"...da"
> 
> etc.
> 
> Ideally, string and Unicode objects would then be subclasses
> of this type in Py3K.

Not clear.  I'd rather do the equivalent of byte arrays in Java, for
which no "string literal" notations exist.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gward at mems-exchange.org  Tue May  2 14:27:51 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Tue, 2 May 2000 08:27:51 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>; from ping@lfw.org on Tue, May 02, 2000 at 02:47:34AM -0700
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <20000502082751.A1504@mems-exchange.org>

On 02 May 2000, Ka-Ping Yee said:
> I propose the following stylistic changes to traceback
> printing:
> 
>     1.  If there is no function name for a given level
>         in the traceback, just omit the ", in ?" at the
>         end of the line.

+0 on this: it doesn't really add anything, but it does neaten things
up.

>     2.  If a given level of the traceback is in a method,
>         instead of just printing the method name, print
>         the class and the method name.

+1 here too: this definitely adds utility.

>     3.  Instead of beginning each line with:
>         
>             File "foo.py", line 5
> 
>         print the line first and drop the quotes:
> 
>             Line 5 of foo.py

-0: adds nothing, cleans nothing up, and just generally breaks things
for no good reason.

>         In the common interactive case that the file
>         is a typed-in string, the current printout is
>         
>             File "<stdin>", line 1
>         
>         and the following is easier to read in my opinion:
> 
>             Line 1 of <stdin>

OK, that's a good reason.  Maybe you could special-case the "<stdin>"
case?  How about

   <stdin>, line 1

?

        Greg



From guido at python.org  Tue May  2 14:30:02 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:30:02 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 11:56:21 +0200."
             <390EA645.89E3B22A@lemburg.com> 
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>  
            <390EA645.89E3B22A@lemburg.com> 
Message-ID: <200005021230.IAA24232@eric.cnri.reston.va.us>

[MAL]
> > > Unicode itself can be understood as multi-word character
> > > encoding, just like UTF-8. The reason is that Unicode entities
> > > can be combined to produce single display characters (e.g.
> > > u"e"+u"\u0301" will print "?" in a Unicode aware renderer).
> > > Slicing such a combined Unicode string will have the same
> > > effect as slicing UTF-8 data.
[/F]
> > really?  does it result in a decoder error?  or does it just result
> > in a rendering error, just as if you slice off any trailing character
> > without looking...
[MAL]
> In the example, if you cut off the u"\u0301", the "e" would
> appear without the acute accent, cutting off the u"e" would
> probably result in a rendering error or worse put the accent
> over the next character to the left.
> 
> UTF-8 is better in this respect: it warns you about
> the error by raising an exception when being converted to
> Unicode.

I think /F's point was that the Unicode standard prescribes different
behavior here: for UTF-8, a missing or lone continuation byte is an
error; for Unicode, accents are separate characters that may be
inserted and deleted in a string but whose display is undefined under
certain conditions.
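
An illustration of the difference (sketch of a session; the error text is
elided):

    >>> u"e\u0301"[:1]                  # slicing off the combining accent: legal
    u'e'
    >>> s = u"caf\u00e9".encode("utf-8")
    >>> unicode(s[:-1], "utf-8")        # chopping the UTF-8 sequence: an error
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: UTF-8 decoding error: ...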

(I just noticed that this doesn't work in Tkinter but it does work in
wish.  Strange.)

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"?" should compare equal to u"e\u0301".

Aha, then we'll see u == v even though type(u) is type(v) and len(u)
!= len(v).  /F's world will collapse. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:31:55 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:31:55 -0400
Subject: [Python-Dev] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 01:42:51 PDT."
             <Pine.LNX.4.10.10005020114250.522-100000@localhost> 
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> 
Message-ID: <200005021231.IAA24249@eric.cnri.reston.va.us>

>     No automatic conversions between 8-bit "strings" and Unicode strings.
> 
> If you want to turn UTF-8 into a Unicode string, say so.
> If you want to turn Latin-1 into a Unicode string, say so.
> If you want to turn ISO-2022-JP into a Unicode string, say so.
> Adding a Unicode string and an 8-bit "string" gives an exception.

I'd accept this, with one change: mixing Unicode and 8-bit strings is
okay when the 8-bit strings contain only ASCII (byte values 0 through
127).  That does the right thing when the program is combining
ASCII data (e.g. literals or data files) with Unicode and warns you
when you are using characters for which the encoding matters.  I
believe that this is important because much existing code dealing with
strings can in fact deal with Unicode just fine under these
assumptions.  (E.g. I needed only 4 changes to htmllib/sgmllib to make
it deal with Unicode strings -- those changes were all getattr() and
setattr() calls.)

When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
bytes in either should make the comparison fail; when ordering is
important, we can make an arbitrary choice e.g. "\377" < u"\200".
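
To spell out the proposed rules with an example (a hypothetical session,
not current behavior; the exception type and message are made up):

    >>> u"spam" + "eggs"                # pure ASCII on the 8-bit side: fine
    u'spameggs'
    >>> u"spam" + "egg\347"             # non-ASCII byte: refuse to guess
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII decoding error: ordinal not in range(128)
    >>> "\377" == u"\377"               # comparing never raises; just unequal
    0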

Why not Latin-1?  Because it gives us Western-alphabet users a false
sense that our code works, where in fact it is broken as soon as you
change the encoding.

> P. S.  The scare-quotes when i talk about 8-bit "strings" expose my
> sense of them as byte-buffers -- since that *is* all you get when you
> read in some bytes from a file.  If you manipulate an 8-bit "string"
> as a character string, you are implicitly making the assumption that
> the byte values correspond to the character encoding of the character
> repertoire you want to work with, and that's your responsibility.

This is how I think of them too.

> P. P. S.  If always having to specify encodings is really too much,
> i'd probably be willing to consider a default-encoding state on the
> Unicode class, but it would have to be a stack of values, not a
> single value.

Please elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From just at letterror.com  Tue May  2 15:44:30 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:44:30 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005021230.IAA24232@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 11:56:21 +0200."            
 <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000
 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <009701bfb414$d35d0ea0$34aab5d4@hagrid>             
 <390EA645.89E3B22A@lemburg.com>
Message-ID: <l03102807b5348b0e6e0b@[193.78.237.142]>

At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
>I think /F's point was that the Unicode standard prescribes different
>behavior here: for UTF-8, a missing or lone continuation byte is an
>error; for Unicode, accents are separate characters that may be
>inserted and deleted in a string but whose display is undefined under
>certain conditions.
>
>(I just noticed that this doesn't work in Tkinter but it does work in
>wish.  Strange.)
>
>> FYI: Normalization is needed to make comparing Unicode
>> strings robust, e.g. u"?" should compare equal to u"e\u0301".
>
>Aha, then we'll see u == v even though type(u) is type(v) and len(u)
>!= len(v).  /F's world will collapse. :-)

Does the Unicode spec *really* specify that u should compare equal to v? This
behavior would be the responsibility of a layout engine, a role which is
way beyond the scope of Unicode support in Python, as it is language- and
script-dependent.

Just





From just at letterror.com  Tue May  2 15:39:24 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:39:24 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode debate
In-Reply-To: <Pine.LNX.4.10.10005020114250.522-100000@localhost>
References: <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <l03102806b534883ec4cf@[193.78.237.142]>

At 1:42 AM -0700 02-05-2000, Ka-Ping Yee wrote:
>If it turns out automatic conversions *are* absolutely necessary,
>then i vote in favour of the simple, direct method promoted by Paul
>and Fredrik: just copy the numerical values of the bytes.  The fact
>that this happens to correspond to Latin-1 is not really the point;
>the main reason is that it satisfies the Principle of Least Surprise.

Exactly.

I'm not sure if automatic conversions are absolutely necessary, but seeing
8-bit strings as Latin-1 encoded Unicode strings seems most natural to me.
Heck, even 8-bit strings should have an s.encode() method that would
behave *just* like u.encode(), and unicode(blah) could even *return* an
8-bit string if it turns out the string has no character codes > 255!
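
(A sketch of what that could look like in a session -- purely hypothetical,
none of this exists today:)

    >>> "andr\351".encode("latin-1")    # 8-bit strings grow .encode() too
    'andr\351'
    >>> unicode("andr\351")             # all character codes < 256, so an
    'andr\351'                          # 8-bit string comes back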

Conceptually, this gets *very* close to the ideal of "there is only one
string type", and at the same time leaves room for 8-bit strings doubling
as byte arrays for backward compatibility reasons.

(Unicode strings and 8-bit strings could even be the same type, which only
uses wide chars when necessary!)

Just





From just at letterror.com  Tue May  2 15:55:31 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:55:31 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 01:42:51 PDT."            
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
Message-ID: <l03102808b5348d1eea20@[193.78.237.142]>

At 8:31 AM -0400 02-05-2000, Guido van Rossum wrote:
>When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>bytes in either should make the comparison fail; when ordering is
>important, we can make an arbitrary choice e.g. "\377" < u"\200".

Blech. Just document 8-bit strings *are* Latin-1 unless converted
explicitly, and you're done. It's really much simpler this way. For you as
well as the users.

>Why not Latin-1?  Because it gives us Western-alphabet users a false
>sense that our code works, where in fact it is broken as soon as you
>change the encoding.

Yeah, and? At least it'll *show* it's broken instead of *silently* doing
the wrong thing with utf-8.

It's like using Python ints all over the place, and suddenly a user of the
application enters data that causes an integer overflow. Boom. Program
needs to be fixed. What's the big deal?

Just





From effbot at telia.com  Tue May  2 15:05:42 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 15:05:42 +0200
Subject: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>             <390EA645.89E3B22A@lemburg.com>  <200005021230.IAA24232@eric.cnri.reston.va.us>
Message-ID: <00f301bfb437$227bc180$34aab5d4@hagrid>

Guido van Rossum <guido at python.org> wrote:
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"?" should compare equal to u"e\u0301".
> 
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v).  /F's world will collapse. :-)

you're gonna do automatic normalization?  that's interesting.
will this make Python the first language to define strings as
a "sequence of graphemes"?

or was this just the cheap shot it appeared to be?

</F>




From skip at mojam.com  Tue May  2 15:10:22 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 2 May 2000 08:10:22 -0500 (CDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <14606.54206.559407.213584@beluga.mojam.com>

[... completely eliding Ping's note and stealing his subject ...]

On a not-quite unrelated tack, I wonder if traceback printing can be
enhanced in the case where Python code calls a function or method written in
C (possibly calling multiple C functions), which in turn calls a Python
function that raises an exception.  Currently, the Python functions on
either side of the C functions are printed, but no hint of the C function's
existence is displayed.  Any way to get some indication there's another
function in the middle?

Thanks,

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From mbel44 at dial.pipex.net  Tue May  2 15:46:44 2000
From: mbel44 at dial.pipex.net (Toby Dickenson)
Date: Tue, 02 May 2000 14:46:44 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>

On Tue, 02 May 2000 08:31:55 -0400, Guido van Rossum
<guido at python.org> wrote:

>>     No automatic conversions between 8-bit "strings" and Unicode strings.
>> 
>> If you want to turn UTF-8 into a Unicode string, say so.
>> If you want to turn Latin-1 into a Unicode string, say so.
>> If you want to turn ISO-2022-JP into a Unicode string, say so.
>> Adding a Unicode string and an 8-bit "string" gives an exception.
>
>I'd accept this, with one change: mixing Unicode and 8-bit strings is
>okay when the 8-bit strings contain only ASCII (byte values 0 through
>127).  That does the right thing when the program is combining
>ASCII data (e.g. literals or data files) with Unicode and warns you
>when you are using characters for which the encoding matters.  I
>believe that this is important because much existing code dealing with
>strings can in fact deal with Unicode just fine under these
>assumptions.  (E.g. I needed only 4 changes to htmllib/sgmllib to make
>it deal with Unicode strings -- those changes were all getattr() and
>setattr() calls.)
>
>When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>bytes in either should make the comparison fail; when ordering is
>important, we can make an arbitrary choice e.g. "\377" < u"\200".

I assume 'fail' means 'non-equal', rather than 'raises an exception'?


Toby Dickenson
tdickenson at geminidataloggers.com



From guido at python.org  Tue May  2 15:58:51 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 09:58:51 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 08:10:22 CDT."
             <14606.54206.559407.213584@beluga.mojam.com> 
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>  
            <14606.54206.559407.213584@beluga.mojam.com> 
Message-ID: <200005021358.JAA24443@eric.cnri.reston.va.us>

[Skip]
> On a not-quite unrelated tack, I wonder if traceback printing can be
> enhanced in the case where Python code calls a function or method written in
> C (possibly calling multiple C functions), which in turn calls a Python
> function that raises an exception.  Currently, the Python functions on
> either side of the C functions are printed, but no hint of the C function's
> existence is displayed.  Any way to get some indication there's another
> function in the middle?

In some cases, that's a good thing -- in others, it's not.  There
should probably be an API that a C function can call to add an entry
onto the stack.

It's not going to be a trivial fix though -- you'd have to manufacture
a frame object.

I can see two options: you can do this "on the way out" when you catch
an exception, or you can do this "on the way in" when you are called.
The latter would require you to explicitly get rid of the frame too --
probably both on normal returns and on exception returns.  That seems
hairier than only having to make a call on exception returns; but it
means that the C function is invisible to the Python debugger unless
it fails.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 16:00:14 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 10:00:14 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 14:46:44 BST."
             <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> 
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>  
            <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> 
Message-ID: <200005021400.KAA24464@eric.cnri.reston.va.us>

[me]
> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> >bytes in either should make the comparison fail; when ordering is
> >important, we can make an arbitrary choice e.g. "\377" < u"\200".

[Toby] 
> I assume 'fail' means 'non-equal', rather than 'raises an exception'?

Yes, sorry for the ambiguity.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Tue May  2 16:04:17 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 2 May 2000 10:04:17 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>
References: <14605.44546.568978.296426@seahag.cnri.reston.va.us>
	<ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>
Message-ID: <14606.57441.97184.499435@seahag.cnri.reston.va.us>

Mark Hammond writes:
 > I wonder if that anyone could be me? :-)

  I certainly wouldn't object!  ;)

 > But I will try and put something together.  It will need to be plain
 > text or HTML, but I assume that is better than nothing!

  Plain text would be better than HTML.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From just at letterror.com  Tue May  2 17:11:39 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:11:39 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 14:46:44 BST."            
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>             
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
Message-ID: <l0310280fb5349fd24fc5@[193.78.237.142]>

At 10:00 AM -0400 02-05-2000, Guido van Rossum wrote:
>[me]
>> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>> >bytes in either should make the comparison fail; when ordering is
>> >important, we can make an arbitrary choice e.g. "\377" < u"\200".
>
>[Toby]
>> I assume 'fail' means 'non-equal', rather than 'raises an exception'?
>
>Yes, sorry for the ambiguity.

You're going to have a hard time explaining that "\377" != u"\377".

Again, if you define that "all strings are unicode" and that 8-bit strings
contain Unicode characters up to 255, you're all set. Clear semantics, few
surprises, simple implementation, etc. etc.

Just





From guido at python.org  Tue May  2 16:21:28 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 10:21:28 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 16:11:39 BST."
             <l0310280fb5349fd24fc5@[193.78.237.142]> 
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>  
            <l0310280fb5349fd24fc5@[193.78.237.142]> 
Message-ID: <200005021421.KAA24526@eric.cnri.reston.va.us>

[Just]
> You're going to have a hard time explaining that "\377" != u"\377".

I agree.  You are an example of how hard it is to explain: you still
don't understand that for a person using CJK encodings this is in fact
the truth.

> Again, if you define that "all strings are unicode" and that 8-bit strings
> contain Unicode characters up to 255, you're all set. Clear semantics, few
> surprises, simple implementation, etc. etc.

But not all 8-bit strings occurring in programs are Unicode.  Ask
Moshe.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From just at letterror.com  Tue May  2 17:42:24 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:42:24 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021421.KAA24526@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 16:11:39 BST."            
 <l0310280fb5349fd24fc5@[193.78.237.142]> Your message of "Tue, 02 May
 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>             
 <l0310280fb5349fd24fc5@[193.78.237.142]>
Message-ID: <l03102812b534a7430fb6@[193.78.237.142]>

>[Just]
>> You're going to have a hard time explaining that "\377" != u"\377".
>
[GvR]
>I agree.  You are an example of how hard it is to explain: you still
>don't understand that for a person using CJK encodings this is in fact
>the truth.

That depends on the definition of truth: if you document that 8-bit strings
are Latin-1, the above is the truth. Conceptually classifying all other 8-bit
encodings as binary goop makes the semantics crystal clear.

>> Again, if you define that "all strings are unicode" and that 8-bit strings
>> contain Unicode characters up to 255, you're all set. Clear semantics, few
>> surprises, simple implementation, etc. etc.
>
>But not all 8-bit strings occurring in programs are Unicode.  Ask
>Moshe.

I know. They can be anything, even binary goop. But that's *only* an
artifact of the fact that 8-bit strings need to double as buffer objects.

Just





From just at letterror.com  Tue May  2 17:45:01 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:45:01 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <l03102812b534a7430fb6@[193.78.237.142]>
References: <200005021421.KAA24526@eric.cnri.reston.va.us> Your message of
 "Tue, 02 May 2000 16:11:39 BST."            
 <l0310280fb5349fd24fc5@[193.78.237.142]> Your message of "Tue, 02 May
 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>             
 <l0310280fb5349fd24fc5@[193.78.237.142]>
Message-ID: <l03102813b534a8484cf9@[193.78.237.142]>

I wrote:
>That depends on the definition of truth: if you document that 8-bit strings
>are Latin-1, the above is the truth.

Oops, I meant of course that "\377" == u"\377" is then the truth...

Sorry,

Just





From mal at lemburg.com  Tue May  2 17:18:21 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:18:21 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Tue, 02 May 2000 11:56:21 +0200."            
	 <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000
	 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
	 <009701bfb414$d35d0ea0$34aab5d4@hagrid>             
	 <390EA645.89E3B22A@lemburg.com> <l03102807b5348b0e6e0b@[193.78.237.142]>
Message-ID: <390EF1BD.E6C7AF74@lemburg.com>

Just van Rossum wrote:
> 
> At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
> >I think /F's point was that the Unicode standard prescribes different
> >behavior here: for UTF-8, a missing or lone continuation byte is an
> >error; for Unicode, accents are separate characters that may be
> >inserted and deleted in a string but whose display is undefined under
> >certain conditions.
> >
> >(I just noticed that this doesn't work in Tkinter but it does work in
> >wish.  Strange.)
> >
> >> FYI: Normalization is needed to make comparing Unicode
> >> strings robust, e.g. u"?" should compare equal to u"e\u0301".

                            ^
                            |

Here's a good example of what encoding errors can do: the
above character was an "e" with acute accent (u"?"). Looks like
some mailer converted this to some other code page and yet
another converted it back to Latin-1, even though the
Content-Type message header clearly states that the
document uses ISO-8859-1.

> >
> >Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> >!= len(v).  /F's world will collapse. :-)
> 
> Does the Unicode spec *really* specify that u should compare equal to v?

The behaviour is needed in order to implement sorting Unicode.
See the www.unicode.org site for more information and the
tech reports describing this.

Note that I haven't mentioned anything about "automatic"
normalization. This should be a method on Unicode strings
and could then be used in sorting compare callbacks.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May  2 17:55:40 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:55:40 +0200
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <390EFA7B.F6B622F0@lemburg.com>

[Guido going ASCII]

Do you mean going ASCII all the way (using it for all
aspects where Unicode gets converted to a string and cases
where strings get converted to Unicode), or just 
for some aspect of conversion, e.g. just for the silent
conversions from strings to Unicode ?

[BTW, I'm pretty sure that the Latin-1 folks won't like
ASCII for the same reason they don't like UTF-8: it's
simply an inconvenient way to write strings in their favorite
encoding directly in Python source code. My feeling in this
whole discussion is that it's more about convenience than
anything else. Still, it's very amusing ;-) ]

FYI, here's the conversion table of (potentially) all
conversions done by the implementation:

Python:
-------
string + unicode:       unicode(string,'utf-8') + unicode
string.method(unicode): unicode(string,'utf-8').method(unicode)
print unicode:          print unicode.encode('utf-8'); with stdout
                        redirection this can be changed to any
                        other encoding
str(unicode):           unicode.encode('utf-8')
repr(unicode):          repr(unicode.encode('unicode-escape'))


C (PyArg_ParseTuple):
----------------------
"s" + unicode:          same as "s" + unicode.encode('utf-8')
"s#" + unicode:         same as "s#" + unicode.encode('unicode-internal')
"t" + unicode:          same as "t" + unicode.encode('utf-8')
"t#" + unicode:         same as "t#" + unicode.encode('utf-8')

This affects all C modules and builtins. In case a C module
wants to receive a certain predefined encoding, it can
use the new "es" and "es#" parser markers.


Ways to enter Unicode:
----------------------
u'' + string            same as unicode(string,'utf-8')
unicode(string,encname) any supported encoding
u'...unicode-escape...' unicode-escape currently accepts
                        Latin-1 chars as single-char input; using
                        escape sequences any Unicode char can be
                        entered (*)
codecs.open(filename,mode,encname)
                        opens an encoded file for
                        reading and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
                        returns UTF-8 strings based on the input
                        encoding

IO:
---
open(file,'w').write(unicode)
        same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
        same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
        same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
        same as unicode(open(file,'rb').read(),encname)
stdin + stdout
        can be redirected using StreamRecoders to handle any
        of the supported encodings
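
As a concrete example, the codecs.open() lines boil down to something
like this (a small sketch; the file name is made up):

    import codecs

    u = u'Andr\351'                     # Unicode string with one non-ASCII char
    f = codecs.open('demo.txt', 'wb', 'latin-1')
    f.write(u)                          # encoded to Latin-1 on the way out
    f.close()

    f = codecs.open('demo.txt', 'rb', 'latin-1')
    v = f.read()                        # decoded back to a Unicode object
    f.close()
    assert u == v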

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May  2 17:27:39 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:27:39 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>  
	            <390EB1EE.EA557CA9@lemburg.com> <200005021226.IAA24203@eric.cnri.reston.va.us>
Message-ID: <390EF3EB.5BCE9EC3@lemburg.com>

Guido van Rossum wrote:
> 
> [MAL]
> > Let's not do the same mistake again: Unicode objects should *not*
> > be used to hold binary data. Please use buffers instead.
> 
> Easier said than done -- Python doesn't really have a buffer data
> type.  Or do you mean the array module?  It's not trivial to read a
> file into an array (although it's possible, there are even two ways).
> Fact is, most of Python's standard library and built-in objects use
> (8-bit) strings as buffers.
> 
> I agree there's no reason to extend this to Unicode strings.
> 
> > BTW, I think that this behaviour should be changed:
> >
> > >>> buffer('binary') + 'data'
> > 'binarydata'
> >
> > while:
> >
> > >>> 'data' + buffer('binary')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > TypeError: illegal argument type for built-in operation
> >
> > IMHO, buffer objects should never coerce to strings, but instead
> > return a buffer object holding the combined contents. The
> > same applies to slicing buffer objects:
> >
> > >>> buffer('binary')[2:5]
> > 'nar'
> >
> > should preferably be buffer('nar').
> 
> Note that a buffer object doesn't hold data!  It's only a pointer to
> data.  I can't off-hand explain the asymmetry though.

Dang, you're right...
 
> > --
> >
> > Hmm, perhaps we need something like a data string object
> > to get this 100% right ?!
> >
> > >>> d = data("...data...")
> > or
> > >>> d = d"...data..."
> > >>> print type(d)
> > <type 'data'>
> >
> > >>> 'string' + d
> > d"string...data..."
> > >>> u'string' + d
> > d"s\000t\000r\000i\000n\000g\000...data..."
> >
> > >>> d[:5]
> > d"...da"
> >
> > etc.
> >
> > Ideally, string and Unicode objects would then be subclasses
> > of this type in Py3K.
> 
> Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> which no "string literal" notations exist.

Anyway, one way or another I think we should make it clear
to users that they should start using some other type for
storing binary data.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May  2 17:24:24 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:24:24 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            	
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>	
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <l03102802b534149a9639@[193.78.237.164]> <l03102804b534772fc25b@[193.78.237.142]>
Message-ID: <390EF327.86D8C3D8@lemburg.com>

Just van Rossum wrote:
> 
> At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
> >Just a small note on the subject of a character being atomic
> >which seems to have been forgotten by the discussing parties:
> >
> >Unicode itself can be understood as multi-word character
> >encoding, just like UTF-8. The reason is that Unicode entities
> >can be combined to produce single display characters (e.g.
> >u"e"+u"\u0301" will print "?" in a Unicode aware renderer).
> 
> Erm, are you sure Unicode prescribes this behavior, for this
> example? I know similar behaviors are specified for certain
> languages/scripts, but I didn't know it did that for latin.

The details are on the www.unicode.org web-site, buried
in some of the tech reports on normalization and
collation.
 
> >Slicing such a combined Unicode string will have the same
> >effect as slicing UTF-8 data.
> 
> Not true. As Fredrik noted: no exception will be raised.

Huh ? You will always get an exception when you convert
a broken UTF-8 sequence to Unicode. This is per design
of UTF-8 itself, which uses the top bit to identify
multi-byte character encodings.
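
For example (sketch of a session; the error text is elided):

    >>> unicode("abc\344", "utf-8")     # 0xE4 is a lone UTF-8 lead byte
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: UTF-8 decoding error: ...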

Or can you give an example (perhaps you've found a bug 
that needs fixing) ?

> [ Speaking of exceptions,
> 
> after I sent off my previous post I realized Guido's
> non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
> argument can easily be turned around, backfiring at utf-8:
> 
>     Defaulting to utf-8 when going from Unicode to 8-bit and
>     back only gives the *illusion* things "just work", since it
>     will *silently* "work", even if utf-8 is *not* the desired
>     8-bit encoding -- as shown by Fredrik's excellent "fun with
>     Unicode, part 1" example. Defaulting to Latin-1 will
>     warn the user *much* earlier, since it'll barf when
>     converting a Unicode string that contains any character
>     code > 255. So there.
> ]
> 
> >It seems that most Latin-1 proponents seem to have single
> >display characters in mind. While the same is true for
> >many Unicode entities, there are quite a few cases of
> >combining characters in Unicode 3.0 and the Unicode
> >normalization algorithm uses these as the basis for its
> >work.
> 
> Still, two combining characters are still two input characters for
> the renderer! They may result in one *glyph*, but trust me,
> that's an entirely different can of worms.

No. Please see my other post on the subject...
 
> However, if you'd be talking about Unicode surrogates,
> you'd definitely have a point. How do Java/Perl/Tcl deal with
> surrogates?

Good question... anybody know the answers ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From paul at prescod.net  Tue May  2 18:05:20 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:05:20 -0500
Subject: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil>
Message-ID: <390EFCC0.240BC56B@prescod.net>

Neil, I sincerely appreciate your informed input. I want to emphasize
one ideological difference though. :)

Neil Hodgson wrote:
> 
> ...
>
>    The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8. 

I reject that characterization.

I claim that both strings contain Unicode characters, but one can contain
Unicode characters with higher code points. UTF-8 versus latin-1 does not
enter into it. Python strings should not be documented in terms of
encodings any more than Python ints are documented in terms of their
two's complement representation. Then we could describe the default
conversion from integers to floats in terms of their bit-representation.
Ugh!

I accept that the effect is similar to calling Latin-1 the "default", but
that's a side effect of the simple logical model that we are proposing.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From just at letterror.com  Tue May  2 19:33:56 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 18:33:56 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <390EFA7B.F6B622F0@lemburg.com>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <l03102815b534c1763aa8@[193.78.237.142]>

At 5:55 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>[BTW, I'm pretty sure that the Latin-1 folks won't like
>ASCII for the same reason they don't like UTF-8: it's
>simply an inconvenient way to write strings in their favorite
>encoding directly in Python source code. My feeling in this
>whole discussion is that it's more about convenience than
>anything else. Still, it's very amusing ;-) ]

For the record, I don't want Latin-1 because it's my favorite encoding. It
isn't. Guido's right: I can't even *use* it directly on my platform. I want
it *only* because it's the most logical 8-bit subset of Unicode -- as we
have stated over and over and over and over again. What's so hard to
understand about this?

Just





From paul at prescod.net  Tue May  2 18:11:13 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:11:13 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            
		 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
		 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
Message-ID: <390EFE21.DAD7749B@prescod.net>

Combining characters are a whole 'nother level of complexity. Character
sets are hard. I don't accept the argument that "Unicode itself has
complexities so that gives us license to introduce even more
complexities at the character representation level."

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"?" should compare equal to u"e\u0301".

That's a whole 'nother debate at a whole 'nother level of abstraction. I
think we need to get the bytes/characters level right and then we can
worry about display-equivalent characters (or leave that to the Python
programmer to figure out...).
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From paul at prescod.net  Tue May  2 18:13:00 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:13:00 -0500
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>  
	            <l0310280fb5349fd24fc5@[193.78.237.142]> <200005021421.KAA24526@eric.cnri.reston.va.us>
Message-ID: <390EFE8C.4C10473C@prescod.net>

Guido van Rossum wrote:
> 
> ...
>
> But not all 8-bit strings occurring in programs are Unicode.  Ask
> Moshe.

Where are we going? What's our long-range vision?

Three years from now where will we be? 

1. How will we handle characters? 
2. How will we handle bytes?
3. What will unadorned literal strings "do"?
4. Will literal strings be the same type as byte arrays?

I don't see how we can make decisions today without a vision for the
future. I think that this is the central point in our disagreement. Some
of us are aiming for as much compatibility with where we think we should
be going and others are aiming for as much compatibility as possible
with where we came from.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From just at letterror.com  Tue May  2 19:37:09 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 18:37:09 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390EF327.86D8C3D8@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            		
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>		
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>	
 <l03102802b534149a9639@[193.78.237.164]>
 <l03102804b534772fc25b@[193.78.237.142]>
Message-ID: <l03102816b534c2476bce@[193.78.237.142]>

At 5:24 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>> Still, two combining characters are still two input characters for
>> the renderer! They may result in one *glyph*, but trust me,
>> that's an entirely different can of worms.
>
>No. Please see my other post on the subject...

It would help if you'd post some actual doco.

Just





From paul at prescod.net  Tue May  2 18:25:33 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:25:33 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>  
	            <390EA645.89E3B22A@lemburg.com> <200005021230.IAA24232@eric.cnri.reston.va.us>
Message-ID: <390F017C.91C7A8A0@prescod.net>

Guido van Rossum wrote:
> 
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v).  /F's world will collapse. :-)

There are many levels of equality that are interesting. I don't think we
would move to grapheme equivalence until "the rest of the world" (XML,
Java, W3C, SQL) did. 

If we were going to move to grapheme equivalence (some day), the right
way would be to normalize characters in the construction of the Unicode
string. This is known as "Early normalization":

http://www.w3.org/TR/charmod/#NormalizationApplication

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From ping at lfw.org  Tue May  2 18:43:25 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 09:43:25 -0700 (PDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <20000502082751.A1504@mems-exchange.org>
Message-ID: <Pine.LNX.4.10.10005020939050.522-100000@localhost>

On Tue, 2 May 2000, Greg Ward wrote:
> >         In the common interactive case that the file
> >         is a typed-in string, the current printout is
> >         
> >             File "<stdin>", line 1
> >         
> >         and the following is easier to read in my opinion:
> > 
> >             Line 1 of <stdin>
> 
> OK, that's a good reason.  Maybe you could special-case the "<stdin>"
> case?

...and "<string>", and "<console>", and perhaps others... ?

    File "<string>", line 3

just looks downright clumsy the first time you see it.
(Well, it still looks kinda clumsy to me or i wouldn't be
proposing the change.)

Can someone verify the already-parseable-by-Emacs claim, and
describe how you get Emacs to do something useful with bits
of traceback?  (Alas, i'm not an Emacs user, so understanding
just how the current format is useful would help.)


-- ?!ng




From bwarsaw at python.org  Tue May  2 19:13:03 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Tue, 2 May 2000 13:13:03 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
	<m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
Message-ID: <14607.3231.115841.262068@anthem.cnri.reston.va.us>

>>>>> "PF" == Peter Funk <pf at artcom-gmbh.de> writes:

    PF> I suggest an additional note saying that this signature has
    PF> been added in Python 1.6.  There used to be several such notes
    PF> all over the documentation saying for example: "New in version
    PF> 1.5.2." which I found very useful in the past!

Good point.  Fred, what is the Right Way to do this?

-Barry



From bwarsaw at python.org  Tue May  2 19:16:22 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Tue, 2 May 2000 13:16:22 -0400 (EDT)
Subject: [Python-Dev] Traceback style
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
	<20000502082751.A1504@mems-exchange.org>
Message-ID: <14607.3430.941026.496225@anthem.cnri.reston.va.us>

I concur with Greg's scores.



From guido at python.org  Tue May  2 19:22:02 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 13:22:02 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 08:27:51 EDT."
             <20000502082751.A1504@mems-exchange.org> 
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>  
            <20000502082751.A1504@mems-exchange.org> 
Message-ID: <200005021722.NAA25854@eric.cnri.reston.va.us>

> On 02 May 2000, Ka-Ping Yee said:
> > I propose the following stylistic changes to traceback
> > printing:
> > 
> >     1.  If there is no function name for a given level
> >         in the traceback, just omit the ", in ?" at the
> >         end of the line.

Greg Ward expresses my sentiments:

> +0 on this: it doesn't really add anything, but it does neaten things
> up.
> 
> >     2.  If a given level of the traceback is in a method,
> >         instead of just printing the method name, print
> >         the class and the method name.
> 
> +1 here too: this definitely adds utility.
> 
> >     3.  Instead of beginning each line with:
> >         
> >             File "foo.py", line 5
> > 
> >         print the line first and drop the quotes:
> > 
> >             Line 5 of foo.py
> 
> -0: adds nothing, cleans nothing up, and just generally breaks things
> for no good reason.
> 
> >         In the common interactive case that the file
> >         is a typed-in string, the current printout is
> >         
> >             File "<stdin>", line 1
> >         
> >         and the following is easier to read in my opinion:
> > 
> >             Line 1 of <stdin>
> 
> OK, that's a good reason.  Maybe you could special-case the "<stdin>"
> case?  How about
> 
>    <stdin>, line 1
> 
> ?

I'd special-case any filename that starts with < and ends with > --
those are all made-up names like <string> or <stdin>.  You can display
them however you like, perhaps

  In "<string>", line 3

For regular files I'd leave the formatting alone -- there are tools
out there that parse these.  (E.g. Emacs' Python mode jumps to the
line with the error if you run a file and it begets an exception.)
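
A sketch of that rule (hypothetical helper, not what traceback.py
currently does):

    def format_location(filename, lineno):
        # Made-up names start with '<' and end with '>' (e.g. <stdin>,
        # <string>); display those differently, and keep the
        # Emacs-parseable form for real files.
        if filename[:1] == '<' and filename[-1:] == '>':
            return '  In "%s", line %d' % (filename, lineno)
        else:
            return '  File "%s", line %d' % (filename, lineno)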

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tree at cymru.basistech.com  Tue May  2 19:14:24 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Tue, 2 May 2000 13:14:24 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390EF327.86D8C3D8@lemburg.com>
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	<l03102802b534149a9639@[193.78.237.164]>
	<l03102804b534772fc25b@[193.78.237.142]>
	<390EF327.86D8C3D8@lemburg.com>
Message-ID: <14607.3312.660077.42872@cymru.basistech.com>

M.-A. Lemburg writes:
 > The details are on the www.unicode.org web-site burried
 > in some of the tech reports on normalization and
 > collation.

This is described in the Unicode standard itself, and in UTR #15 and
UTR #10. Normalization is an issue with wider implications than just
handling glyph variants: indeed, it's irrelevant.

The question is this: should

U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

compare equal to

U+0055 LATIN CAPITAL LETTER U
U+0308 COMBINING DIAERESIS

or not? It depends on the application. Certainly in a database system
I would want these to compare equal.

Perhaps normalization form needs to be an option of the string comparator?
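
Something like this, perhaps (a sketch; it assumes a normalize() function
along the lines of UTR #15 -- the unicodedata.normalize() spelling below is
how a later Python might provide it, not something that exists today):

    import unicodedata

    def uni_equal(u, v, form='NFC'):
        # Normalize both operands to the same form before comparing, so
        # precomposed and decomposed spellings compare equal.
        return unicodedata.normalize(form, u) == unicodedata.normalize(form, v)

    print uni_equal(u'\u00dc', u'U\u0308')  # equal after normalization
    print u'\u00dc' == u'U\u0308'           # unequal as raw code points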

        -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From bwarsaw at python.org  Tue May  2 19:51:17 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Tue, 2 May 2000 13:51:17 -0400 (EDT)
Subject: [Python-Dev] Traceback style
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
	<20000502082751.A1504@mems-exchange.org>
	<200005021722.NAA25854@eric.cnri.reston.va.us>
Message-ID: <14607.5525.160379.760452@anthem.cnri.reston.va.us>

>>>>> "GvR" == Guido van Rossum <guido at python.org> writes:

    GvR> For regular files I'd leave the formatting alone -- there are
    GvR> tools out there that parse these.  (E.g. Emacs' Python mode
    GvR> jumps to the line with the error if you run a file and it
    GvR> begets an exception.)

py-traceback-line-re is what matches those lines.  Its current
definition is

(defconst py-traceback-line-re
  "[ \t]+File \"\\([^\"]+\\)\", line \\([0-9]+\\)"
  "Regular expression that describes tracebacks.")

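(For reference, roughly the same pattern written as a Python regexp, just
to illustrate what those traceback lines look like:)

    import re

    traceback_line = re.compile(r'[ \t]+File "([^"]+)", line ([0-9]+)')
    m = traceback_line.match('  File "foo.py", line 12, in spam')
    print m.group(1), m.group(2)        # -> foo.py 12
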
There are probably also gud.el (and maybe compile.el) regexps that
need to be changed too.  I'd rather see something that outputs the
same regardless of whether it's a real file, or something "fake".
Something like

Line 1 of <stdin>
Line 12 of foo.py

should be fine.  I'm not crazy about something like

File "foo.py", line 12
In <stdin>, line 1

-Barry



From fdrake at acm.org  Tue May  2 19:59:43 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 2 May 2000 13:59:43 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: <14607.3231.115841.262068@anthem.cnri.reston.va.us>
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
	<m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
	<14607.3231.115841.262068@anthem.cnri.reston.va.us>
Message-ID: <14607.6031.770981.424012@seahag.cnri.reston.va.us>

bwarsaw at python.org writes:
 > Good point.  Fred, what is the Right Way to do this?

  Pester me night and day until it gets done (email only!).
  Unless of course you've already seen the check-in messages.  ;)


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From bwarsaw at python.org  Tue May  2 20:05:00 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Tue, 2 May 2000 14:05:00 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
	<m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
	<14607.3231.115841.262068@anthem.cnri.reston.va.us>
	<14607.6031.770981.424012@seahag.cnri.reston.va.us>
Message-ID: <14607.6348.453682.219847@anthem.cnri.reston.va.us>

>>>>> "Fred" == Fred L Drake, Jr <fdrake at acm.org> writes:

    Fred>   Pester me night and day until it gets done (email only!).

Okay, I'll cancel the daily delivery of angry rabid velcro monkeys.

    Fred> Unless of course you've already seen the check-in messages.
    Fred> ;)

Saw 'em.  Thanks.
-Barry


From gvwilson at nevex.com  Tue May  2 20:04:56 2000
From: gvwilson at nevex.com (gvwilson at nevex.com)
Date: Tue, 2 May 2000 14:04:56 -0400 (EDT)
Subject: [Python-Dev] Software Carpentry Design Competition Finalists
Message-ID: <Pine.LNX.4.10.10005021403560.30804-100000 at akbar.nevex.com>

		Software Carpentry Design Competition

			 First-Round Results

		  http://www.software-carpentry.com

			     May 2, 2000

The Software Carpentry Project is pleased to announce the selection of
finalists in its first Open Source Design Competition.  There were
many strong entries, and we would like to thank everyone who took the
time to participate.

We would also like to invite everyone who has been involved to contact
the teams listed below, and see if there is any way to collaborate in
the second round.  Many of you had excellent ideas that deserve to be
in the final tools, and the more involved you are in discussions over
the next two months, the easier it will be for you to take part in the
ensuing implementation effort.

The 12 entries that are going forward in the "Configuration", "Build",
and "Track" categories are listed below (in alphabetical order).  The
four prize-winning entries in the "Test" category are also listed, but
as is explained there, we are putting this section of the competition
on hold for a couple of months while we try to refine the requirements.
You can inspect these entries on-line at:

         http://www.software-carpentry.com/first-round.html

And so, without further ado...


== Configuration

The final four entries in the "Configuration" category are:

* BuildConf     Vassilis Virvilis

* ConfBase      Stefan Knappmann

* SapCat        Lindsay Todd

* Tan           David Ascher


== Build

The finalists in the "Build" category are:

* Black         David Ascher and Trent Mick

* PyMake        Rich Miller

* ScCons        Steven Knight

* Tromey        Tom Tromey

Honorable mentions in this category go to:

* Forge         Bill Bitner, Justin Patterson, and Gilbert Ramirez

* Quilt         David Lamb


== Track

The four entries to go forward in the "Track" category are:

* Egad          John Martin

* K2            David Belfer-Shevett

* Roundup       Ka-Ping Yee

* Tracker       Ken Manheimer

There is also an honorable mention for:

* TotalTrack    Alex Samuel, Mark Mitchell


== Test

This category was the most difficult one for the judges. First-round
prizes are being awarded to

* AppTest         Linda Timberlake

* TestTalk        Chang Liu

* Thomas          Patrick Campbell-Preston

* TotalQuality    Alex Samuel, Mark Mitchell

However, the judges did not feel that any of these tools would have an
impact on Open Source software development in general, or scientific
and numerical programming in particular.  This is due in large part to
the vagueness of the posted requirements, for which the project
coordinator (Greg Wilson) accepts full responsibility.

We will therefore not be going forward with this category at the
present time.  Instead, the judges and others will develop narrower,
more specific requirements, guidelines, and expectations.  The
category will be re-opened in July 2000.


== Contact

The aim of the Software Carpentry project is to create a new generation of
easy-to-use software engineering tools, and to document both those tools
and the working practices they are meant to support.  The Advanced
Computing Laboratory at Los Alamos National Laboratory is providing
$860,000 of funding for Software Carpentry, which is being administered by
Code Sourcery, LLC.  For more information, contact the project
coordinator, Dr. Gregory V. Wilson, at 'gvwilson at software-carpentry.com',
or on +1 (416) 504 2325 ext. 229.


== Footnote: Entries from CodeSourcery, LLC

Two entries (TotalTrack and TotalQuality) were received from employees
of CodeSourcery, LLC, the company which is hosting the Software
Carpentry web site.  We discussed this matter with Dr. Rod Oldehoeft,
Deputy Director of the Advanced Computing Laboratory at Los Alamos
National Laboratory.  His response was:

    John Reynders [Director of the ACL] and I have discussed this
    matter.  We agree that since the judges who make decisions
    are not affiliated with Code Sourcery, there is no conflict of
    interest. Code Sourcery gains no advantage by hosting the
    Software Carpentry web pages.  Please continue evaluating all
    the entries on their merits, and choose the best for further
    eligibility.

Note that the project coordinator, Greg Wilson, is neither employed by
CodeSourcery, nor a judge in the competition.




From paul at prescod.net  Tue May  2 20:23:24 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 13:23:24 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>  
	            <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <390F1D1C.6EAF7EAD@prescod.net>

Guido van Rossum wrote:
> 
> ....
> 
> Have you tried using this?

Yes. I haven't had large problems with it.

As long as you know what is going on, it doesn't usually hurt anything
because you can just explicitly set up the decoding you want. It's like
the int division problem. You get bitten a few times and then get
careful.

It's the naive user who will be surprised by these random UTF-8 decoding
errors. 

That's why this is NOT a convenience issue (are you listening MAL???).
It's a short and long term simplicity issue. There are lots of languages
where it is de rigueur to discover and work around inconvenient and
confusing default behaviors. I just don't think that we should be ADDING
such behaviors.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From guido at python.org  Tue May  2 20:56:34 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 14:56:34 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 13:23:24 CDT."
             <390F1D1C.6EAF7EAD@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>  
            <390F1D1C.6EAF7EAD@prescod.net> 
Message-ID: <200005021856.OAA26104@eric.cnri.reston.va.us>

> It's the naive user who will be surprised by these random UTF-8 decoding
> errors. 
> 
> That's why this is NOT a convenience issue (are you listening MAL???).
> It's a short and long term simplicity issue. There are lots of languages
> where it is de rigueur to discover and work around inconvenient and
> confusing default behaviors. I just don't think that we should be ADDING
> such behaviors.

So what do you think of my new proposal of using ASCII as the default
"encoding"?  It takes care of "a character is a character" but also
(almost) guarantees an error message when mixing encoded 8-bit strings
with Unicode strings without specifying an explicit conversion --
*any* 8-bit byte with the top bit set is rejected by the default
conversion to Unicode.

I think this is less confusing than Latin-1: when an unsuspecting user
is reading encoded text from a file into 8-bit strings and attempts to
use it in a Unicode context, an error is raised instead of producing
garbage Unicode characters.
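
For concreteness, the intended behavior is roughly this (the exact
error message is of course not settled):

    >>> u"spam" + "eggs"            # pure ASCII on the 8-bit side: fine
    u'spameggs'
    >>> u"spam" + "d\351j\340 vu"   # a byte with the top bit set: rejected
    UnicodeError: ASCII decoding error: ordinal not in range(128)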

It encourages the use of Unicode strings for everything beyond ASCII
-- there's no way around ASCII since that's the source encoding etc.,
but Latin-1 is an inconvenient default in most parts of the world.
ASCII is accepted everywhere as the base character set (e.g. for
email and for text-based protocols like FTP and HTTP), just like
English is the one natural language that we can all sue to communicate
(to some extent).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From dieter at handshake.de  Tue May  2 20:44:41 2000
From: dieter at handshake.de (Dieter Maurer)
Date: Tue,  2 May 2000 20:44:41 +0200 (CEST)
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390E1F08.EA91599E@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<390E1F08.EA91599E@prescod.net>
Message-ID: <14607.7798.510723.419556@lindm.dm>

Paul Prescod writes:
 > The fact that my proposal has the same effect as making Latin-1 the
 > "default encoding" is a near-term side effect of the definition of
 > Unicode. My long term proposal is to do away with the concept of 8-bit
 > strings (and thus, conversions from 8-bit to Unicode) altogether. One
 > string to rule them all!
Why must this be a long term proposal?

I would find it quite attractive if
 * the old string type became an immutable list of bytes
 * automatic conversions between byte lists and unicode strings
   were performed via user-customizable conversion functions
   (a la __import__).

Dieter



From paul at prescod.net  Tue May  2 21:01:32 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 14:01:32 -0500
Subject: [Python-Dev] Unicode compromise?
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <390F260C.2314F97E@prescod.net>

Guido van Rossum wrote:
> 
> >     No automatic conversions between 8-bit "strings" and Unicode strings.
> >
> > If you want to turn UTF-8 into a Unicode string, say so.
> > If you want to turn Latin-1 into a Unicode string, say so.
> > If you want to turn ISO-2022-JP into a Unicode string, say so.
> > Adding a Unicode string and an 8-bit "string" gives an exception.
> 
> I'd accept this, with one change: mixing Unicode and 8-bit strings is
> okay when the 8-bit strings contain only ASCII (byte values 0 through
> 127).  

I could live with this compromise as long as we document that a future
version may use the "character is a character" model. I just don't want
people to start depending on a catchable exception being thrown because
that would stop us from ever unifying unmarked literal strings and
Unicode strings.

--

Are there any steps we could take to make a future divorce of strings
and byte arrays easier? What if we added a 

binary_read()

function that returns some form of byte array. The byte array type could
be just like today's string type except that its type object would be
distinct, it wouldn't have as many string-ish methods and it wouldn't
have any auto-conversion to Unicode at all.

People could start to transition code that reads non-ASCII data to the
new function. We could put big warning labels on read() to state that it
might not always be able to read data that is not in some small set of
recognized encodings (probably UTF-8 and UTF-16).

Or perhaps binary_open(). Or perhaps both.

I do not suggest just using the text/binary flag on the existing open
function because we cannot immediately change its behavior without
breaking code.
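
A very rough sketch of the kind of thing I mean (the function name and
the use of the array module are just placeholders):

    import array

    def binary_read(f, n):
        # hypothetical: hand back the next n bytes of f as a byte array
        # ('B' = unsigned bytes) rather than an 8-bit string, so there is
        # no string-ish behavior and no auto-conversion to Unicode
        return array.array('B', f.read(n))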

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From jkraai at murlmail.com  Tue May  2 21:46:49 2000
From: jkraai at murlmail.com (jkraai at murlmail.com)
Date: Tue, 2 May 2000 14:46:49 -0500
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
Message-ID: <200005021946.OAA03609@www.polytopic.com>

The ever quotable Guido:
> English is the one natural language that we can all sue to communicate






From paul at prescod.net  Tue May  2 21:23:27 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 14:23:27 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>  
	            <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>
Message-ID: <390F2B2F.2953C72D@prescod.net>

Guido van Rossum wrote:
> 
> ...
> 
> So what do you think of my new proposal of using ASCII as the default
> "encoding"?  

I can live with it. I am mildly uncomfortable with the idea that I could
write a whole bunch of software that works great until some European
inserts one of their name characters. Nevertheless, being hard-assed is
better than being permissive because we can loosen up later.

What do we do about str( my_unicode_string )? Perhaps escape the Unicode
characters with backslashed numbers?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From guido at python.org  Tue May  2 21:58:20 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 15:58:20 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 14:23:27 CDT."
             <390F2B2F.2953C72D@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>  
            <390F2B2F.2953C72D@prescod.net> 
Message-ID: <200005021958.PAA26760@eric.cnri.reston.va.us>

[me]
> > So what do you think of my new proposal of using ASCII as the default
> > "encoding"?  

[Paul]
> I can live with it. I am mildly uncomfortable with the idea that I could
> write a whole bunch of software that works great until some European
> inserts one of their name characters.

Better that than when some Japanese insert *their* name characters and
it produces gibberish instead.

> Nevertheless, being hard-assed is
> better than being permissive because we can loosen up later.

Exactly -- just as nobody should *count* on 10**10 raising
OverflowError, nobody (except maybe parts of the standard library :-)
should *count* on unicode("\347") raising ValueError.  I think that's
fine.

> What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> characters with backslashed numbers?

Hm, good question.  Tcl displays unknown characters as \x or \u
escapes.  I think this may make more sense than raising an error.

But there must be a way to turn on Unicode-awareness on e.g. stdout
and then printing a Unicode object should not use str() (as it
currently does).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Tue May  2 22:47:17 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 2 May 2000 13:47:17 -0700
Subject: [Python-Dev] Cannot declare the largest integer literal.
Message-ID: <20000502134717.A16825@activestate.com>

>>> i = -2147483648
OverflowError: integer literal too large
>>> i = -2147483648L
>>> int(i)   # it *is* a valid integer literal
-2147483648


As far as I traced back:

Python/compile.c::com_atom() calls
Python/compile.c::parsenumber(s = "2147483648") calls
Python/mystrtoul.c::PyOS_strtol() which

returns the ERANGE errno because it is given 2147483648 (which *is* out of
range) rather than -2147483648.


My question: Why is the minus sign not considered part of the "atom", i.e.
the integer literal? Should it be? PyOS_strtol() can properly parse this
integer literal if it is given the whole number with the minus sign.
Otherwise the special case largest negative number will always erroneously be
considered out of range.

I don't know how the tokenizer works in Python. Was there a design decision
to separate the integer literal and the leading sign? And was the effect on
functions like PyOS_strtol() down the pipe missed?


Trent

--
Trent Mick
trentm at activestate.com









From guido at python.org  Tue May  2 22:47:30 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 16:47:30 -0400
Subject: [Python-Dev] Unicode compromise?
In-Reply-To: Your message of "Tue, 02 May 2000 14:01:32 CDT."
             <390F260C.2314F97E@prescod.net> 
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>  
            <390F260C.2314F97E@prescod.net> 
Message-ID: <200005022047.QAA26828@eric.cnri.reston.va.us>

> I could live with this compromise as long as we document that a future
> version may use the "character is a character" model. I just don't want
> people to start depending on a catchable exception being thrown because
> that would stop us from ever unifying unmarked literal strings and
> Unicode strings.

Agreed (as I've said before).

> --
> 
> Are there any steps we could take to make a future divorce of strings
> and byte arrays easier? What if we added a 
> 
> binary_read()
> 
> function that returns some form of byte array. The byte array type could
> be just like today's string type except that its type object would be
> distinct, it wouldn't have as many string-ish methods and it wouldn't
> have any auto-conversion to Unicode at all.

You can do this now with the array module, although clumsily:

  >>> import array
  >>> f = open("/core", "rb")
  >>> a = array.array('B', [0]) * 1000
  >>> f.readinto(a)
  1000
  >>>

Or if you wanted to read raw Unicode (UTF-16):

  >>> a = array.array('H', [0]) * 1000
  >>> f.readinto(a)
  2000
  >>> u = unicode(a, "utf-16")
  >>> 

There are some performance issues, e.g. you have to initialize the
buffer somehow and that seems a bit wasteful.

> People could start to transition code that reads non-ASCII data to the
> new function. We could put big warning labels on read() to state that it
> might not always be able to read data that is not in some small set of
> recognized encodings (probably UTF-8 and UTF-16).
> 
> Or perhaps binary_open(). Or perhaps both.
> 
> I do not suggest just using the text/binary flag on the existing open
> function because we cannot immediately change its behavior without
> breaking code.

A new method makes most sense -- there are definitely situations where
you want to read in text mode for a while and then switch to binary
mode (e.g. HTTP).

I'd like to put this off until after Python 1.6 -- but it deserves
attention.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Wed May  3 01:03:22 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 2 May 2000 16:03:22 -0700
Subject: [Python-Dev] PROPOSAL: exposure of values in limits.h and float.h
Message-ID: <20000502160322.A19101@activestate.com>

I apologize if I am hitting covered ground. What about a module (called
limits or something like that) that would expose some appropriate #define's
in limits.h and float.h?

For example:

limits.FLT_EPSILON could expose the C DBL_EPSILON
limits.FLT_MAX could expose the C DBL_MAX
limits.INT_MAX could expose the C LONG_MAX (although that particular name
would cause confusion with the actual C INT_MAX)
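
Concretely, such a module might be little more than this (the values
shown assume IEEE-754 doubles and 32-bit C longs):

    # hypothetical limits.py
    FLT_EPSILON = 2.2204460492503131e-16    # C DBL_EPSILON
    FLT_MAX     = 1.7976931348623157e+308   # C DBL_MAX
    INT_MAX     = 2147483647                # C LONG_MAX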


- Does this kind of thing already exist somewhere? Maybe in NumPy.

- If we ever (perhaps in Py3K) turn the basic types into classes then these
  could turn into constant attributes of those classes, i.e.:
  f = 3.14159
  f.EPSILON = <as set by C's DBL_EPSILON>

- I thought of these values being useful when I thought of comparing two
  floats for equality. Doing a straight comparison of floats is
  dangerous/wrong but is it not okay to consider two floats reasonably equal
  iff:
  	-EPSILON < float2 - float1 < EPSILON
  Or maybe that should be two or three EPSILONs. It has been a while since
  I've done any numerical analysis stuff.

  I suppose the answer to my question is: "It depends on the situation."
  Could this algorithm for float comparison be a better default than the
  status quo? I know that Mark H. and others have suggested that Python
  should maybe not provide a float comparison operator at all to beginners.



Trent

--
Trent Mick
trentm at activestate.com




From mal at lemburg.com  Wed May  3 01:11:37 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 01:11:37 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>  
	            <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us>
Message-ID: <390F60A9.A3AA53A9@lemburg.com>

Guido van Rossum wrote:
> 
> > > So what do you think of my new proposal of using ASCII as the default
> > > "encoding"?

How about using unicode-escape or raw-unicode-escape as
default encoding ? (They would have to be adapted to disallow
Latin-1 char input, though.)

The advantage would be that they are compatible with ASCII
while still providing loss-less conversion and since they
use escape characters, you can even read them using an
ASCII based editor.
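
For example, with the unicode-escape codec a character like U+1234
comes out as a plain ASCII escape, so the result stays readable in an
ASCII editor:

    >>> u"ab\u1234".encode("unicode-escape")
    'ab\\u1234'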

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond at skippinet.com.au  Wed May  3 01:12:18 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 3 May 2000 09:12:18 +1000
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <20000502134717.A16825@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBCEBGCKAA.mhammond@skippinet.com.au>

> >>> i = -2147483648
> OverflowError: integer literal too large
> >>> i = -2147483648L
> >>> int(i)   # it *is* a valid integer literal
> -2147483648

I struck this years ago!  At the time, the answer was "yes, it's an
implementation flaw that's not worth fixing".

Interestingly, it _does_ work as a hex literal:

>>> 0x80000000
-2147483648
>>> -2147483648
OverflowError: integer literal too large
>>>

Mark.




From mal at lemburg.com  Wed May  3 01:05:28 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 01:05:28 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            
			 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
			 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net>
Message-ID: <390F5F38.DD76CAF4@lemburg.com>

Paul Prescod wrote:
> 
> Combining characters are a whole 'nother level of complexity. Character
> sets are hard. I don't accept the argument that "Unicode itself has
> complexities so that gives us license to introduce even more
> complexities at the character representation level."
> 
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
> 
> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> think we need to get the bytes/characters level right and then we can
> worry about display-equivalent characters (or leave that to the Python
> programmer to figure out...).

I just wanted to point out that the argument "slicing doesn't
work with UTF-8" is moot.

I do see a point against UTF-8 auto-conversion given the example
that Guido mailed me:

"""
s = 'ab\341\210\264def'        # == str(u"ab\u1234def")
s.find(u"def")

This prints 3 -- the wrong result since "def" is found at s[5:8], not
at s[3:6].
"""

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From tim_one at email.msn.com  Wed May  3 04:20:20 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 2 May 2000 22:20:20 -0400
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <20000502134717.A16825@activestate.com>
Message-ID: <000001bfb4a6$21da7900$922d153f@tim>

[Trent Mick]
> >>> i = -2147483648
> OverflowError: integer literal too large
> >>> i = -2147483648L
> >>> int(i)   # it *is* a valid integer literal
> -2147483648

Python's grammar is such that negative integer literals don't exist; what
you actually have there is the unary minus operator applied to positive
integer literals; indeed,

>>> def f():
	return -42

>>> import dis
>>> dis.dis(f)
          0 SET_LINENO               1

          3 SET_LINENO               2
          6 LOAD_CONST               1 (42)
          9 UNARY_NEGATIVE
         10 RETURN_VALUE
         11 LOAD_CONST               0 (None)
         14 RETURN_VALUE
>>>

Note that, at runtime, the example loads +42, then negates it:  this wart
has deep roots!

> ...
> And was the effect on functions like PyOS_strtol() down the pipe
> missed?

More that it was considered an inconsequential endcase.  It's sure not worth
changing the grammar for <wink>.  I'd rather see Python erase the visible
distinction between ints and longs.





From guido at python.org  Wed May  3 04:31:21 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 22:31:21 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 01:11:37 +0200."
             <390F60A9.A3AA53A9@lemburg.com> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us>  
            <390F60A9.A3AA53A9@lemburg.com> 
Message-ID: <200005030231.WAA02678@eric.cnri.reston.va.us>

> Guido van Rossum wrote:
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?

[MAL]
> How about using unicode-escape or raw-unicode-escape as
> default encoding ? (They would have to be adapted to disallow
> Latin-1 char input, though.)
> 
> The advantage would be that they are compatible with ASCII
> while still providing loss-less conversion and since they
> use escape characters, you can even read them using an
> ASCII based editor.

No, the backslash should mean itself when encoding from ASCII to
Unicode.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From esr at thyrsus.com  Wed May  3 05:22:20 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 2 May 2000 23:22:20 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <390EFE8C.4C10473C@prescod.net>; from paul@prescod.net on Tue, May 02, 2000 at 11:13:00AM -0500
References: <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <l0310280fb5349fd24fc5@[193.78.237.142]> <200005021421.KAA24526@eric.cnri.reston.va.us> <390EFE8C.4C10473C@prescod.net>
Message-ID: <20000502232220.B18638@thyrsus.com>

Paul Prescod <paul at prescod.net>:
> Where are we going? What's our long-range vision?
> 
> Three years from now where will we be? 
> 
> 1. How will we handle characters? 
> 2. How will we handle bytes?
> 3. What will unadorned literal strings "do"?
> 4. Will literal strings be the same type as byte arrays?
> 
> I don't see how we can make decisions today without a vision for the
> future. I think that this is the central point in our disagreement. Some
> of us are aiming for as much compatibility with where we think we should
> be going and others are aiming for as much compatibility as possible
> with where we came from.

And *that* is the most insightful statement I have seen in this entire 
foofaraw (which I have carefully been staying right the hell out of). 

Everybody meditate on the above, please.  Then declare your objectives *at
this level* so our Fearless Leader can make an informed decision *at this
level*.  Only then will it make sense to argue encoding theology...
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

"Extremism in the defense of liberty is no vice; moderation in the
pursuit of justice is no virtue."
	-- Barry Goldwater (actually written by Karl Hess)



From tim_one at email.msn.com  Wed May  3 07:05:59 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:05:59 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us>
Message-ID: <000301bfb4bd$463ec280$622d153f@tim>

[Guido]
> When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> bytes in either should make the comparison fail; when ordering is
> important, we can make an arbitrary choice e.g. "\377" < u"\200".

[Toby]
> I assume 'fail' means 'non-equal', rather than 'raises an exception'?

[Guido]
> Yes, sorry for the ambiguity.

Huh!  You sure about that?  If we're setting up a case where meaningful
comparison is impossible, isn't an exception more appropriate?  The current

>>> 83479278 < "42"
1
>>>

probably traps more people than it helps.





From tim_one at email.msn.com  Wed May  3 07:19:28 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:19:28 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <017d01bfb3bc$c3734c00$34aab5d4@hagrid>
Message-ID: <000401bfb4bf$27ec1600$622d153f@tim>

[Fredrik Lundh]
> ...
> (if you like, I can post more "fun with unicode" messages ;-)

By all means!  Exposing a gotcha to ridicule does more good than a dozen
abstract arguments.  But next time stoop to explaining what it is that's
surprising <wink>.





From just at letterror.com  Wed May  3 08:47:07 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 07:47:07 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390F5F38.DD76CAF4@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            		
  <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>			
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <390EFE21.DAD7749B@prescod.net>
Message-ID: <l03102800b53572ee87ad@[193.78.237.142]>

[MAL vs. PP]
>> > FYI: Normalization is needed to make comparing Unicode
>> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
>>
>> That's a whole 'nother debate at a whole 'nother level of abstraction. I
>> think we need to get the bytes/characters level right and then we can
>> worry about display-equivalent characters (or leave that to the Python
>> programmer to figure out...).
>
>I just wanted to point out that the argument "slicing doesn't
>work with UTF-8" is moot.

And failed...

I asked two Unicode gurus I happen to know about the normalization issue
(which is indeed not relevant to the current discussion, but it's
fascinating nevertheless!).

(Sorry about the possibly wrong email encoding... "è" is u"\350", "ö" is
u"\366")

John Jenkins replied:
"""
Well, I'm not sure you want to hear the answer -- but it really depends on
what the language is attempting to do.

By and large, Unicode takes the position that "e`" should always be treated
the same as "è". This is a *semantic* equivalence -- that is, they *mean*
the same thing -- and doesn't depend on the display engine to be true.
Unicode also provides a default collation algorithm
(http://www.unicode.org/unicode/reports/tr10/).

At the same time, the standard acknowledges that in real life, string
comparison and collation are complicated, language-specific problems
requiring a lot of work and interaction with the user to do right.

From the perspective of a programming language, it would best be served IMHO
by implementing the contents of TR10 for string comparison and collation.
That would make "e`" and "è" come out as equivalent.
"""


Dave Opstad replied:
"""
Unicode talks about "canonical decomposition" in order to make it easier
to answer questions like yours. Specifically, in the Unicode 3.0
standard, rule D24 in section 3.6 (page 44) states that:

"Two character sequences are said to be canonical equivalents if their
full canonical decompositions are identical. For example, the sequences
<o, combining-diaeresis> and <ö> are canonical equivalents. Canonical
equivalence is a Unicode property. It should not be confused with
language-specific collation or matching, which may add additional
equivalencies."

So they still have language-specific differences, even if Unicode sees
them as canonically equivalent.

You might want to check this out:

http://www.unicode.org/unicode/reports/tr15/tr15-18.html

It's the latest technical report on these issues, which may help clarify
things further.
"""


It's very deep stuff, which seems more appropriate for an extension than
for builtin comparisons to me.
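
To make it concrete: with a plain codepoint-by-codepoint comparison --
which is presumably what the builtin would keep doing -- the two
spellings don't compare equal:

    >>> u"\u00e9" == u"e\u0301"    # precomposed vs. e + combining acute
    0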

Just





From tim_one at email.msn.com  Wed May  3 07:47:37 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:47:37 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>
Message-ID: <000501bfb4c3$16743480$622d153f@tim>

[Moshe Zadka]
> ...
> I'd much prefer Python to reflect a fundamental truth about Unicode,
> which at least makes sure binary-goop can pass through Unicode and
> remain unharmed, than to reflect a nasty problem with UTF-8 (not
> everything is legal).

Then you don't want Unicode at all, Moshe.  All the official encoding
schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff
is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of
Unicode not yet having assigned a character to this position, it's that the
standard explicitly makes this sequence illegal and guarantees it will
always be illegal!  the other place this comes up is with surrogates, where
what's legal depends on both parts of a character pair; and, again, the
illegalities here are guaranteed illegal for all time).  UCS-4 is the
closest thing to binary-transparent Unicode encodings get, but even there
the length of a thing is contrained to be a multiple of 4 bytes.  Unicode
and binary goop will never coexist peacefully.





From ping at lfw.org  Wed May  3 07:56:12 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 22:56:12 -0700 (PDT)
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <000301bfb4bd$463ec280$622d153f@tim>
Message-ID: <Pine.LNX.4.10.10005022249330.522-100000@localhost>

On Wed, 3 May 2000, Tim Peters wrote:
> [Toby]
> > I assume 'fail' means 'non-equal', rather than 'raises an exception'?
> 
> [Guido]
> > Yes, sorry for the ambiguity.
> 
> Huh!  You sure about that?  If we're setting up a case where meaningful
> comparison is impossible, isn't an exception more appropriate?  The current
> 
> >>> 83479278 < "42"
> 1
> 
> probably traps more people than it helps.

Yeah, when i said

    No automatic conversions between Unicode strings and 8-bit "strings".

i was about to say

    Raise an exception on any operation attempting to combine or
    compare Unicode strings and 8-bit "strings".

...and then i thought, oh crap, but everything in Python is supposed
to be comparable.

What happens when you have some lists with arbitrary objects in them
and you want to sort them for printing, or to canonicalize them so
you can compare?  It might be too troublesome for list.sort() to
throw an exception because e.g. strings and ints were incomparable,
or 8-bit "strings" and Unicode strings were incomparable...

So -- what's the philosophy, Guido?  Are we committed to "everything
is comparable" (well, "all built-in types are comparable") or not?


-- ?!ng




From tim_one at email.msn.com  Wed May  3 08:40:54 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 02:40:54 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <000701bfb4ca$87b765c0$622d153f@tim>

[MAL]
> I just wanted to point out that the argument "slicing doesn't
> work with UTF-8" is moot.

[Just]
> And failed...

He succeeded for me.  Blind slicing doesn't always "work right" no matter
what encoding you use, because "work right" depends on semantics beyond the
level of encoding.  UTF-8 is no worse than anything else in this respect.





From just at letterror.com  Wed May  3 09:50:11 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 08:50:11 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <000701bfb4ca$87b765c0$622d153f@tim>
References: <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <l03102804b5358971d413@[193.78.237.152]>

[MAL]
> I just wanted to point out that the argument "slicing doesn't
> work with UTF-8" is moot.

[Just]
> And failed...

[Tim]
>He succeeded for me.  Blind slicing doesn't always "work right" no matter
>what encoding you use, because "work right" depends on semantics beyond the
>level of encoding.  UTF-8 is no worse than anything else in this respect.

But the discussion *was* at the level of encoding! Still it is worse, since
an arbitrary utf-8 slice may result in two illegal strings -- slicing "e`"
results in two perfectly legal strings, at the encoding level. Had he used
surrogates as an example, he would've been right... (But even that is an
encoding issue.)

Just





From tim_one at email.msn.com  Wed May  3 09:11:12 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 03:11:12 -0400
Subject: [Python-Dev] PROPOSAL: exposure of values in limits.h and float.h
In-Reply-To: <20000502160322.A19101@activestate.com>
Message-ID: <000801bfb4ce$c361ea60$622d153f@tim>

[Trent Mick]
> I apologize if I am hitting covered ground. What about a module (called
> limits or something like that) that would expose some appropriate
> #define's
> in limits.h and float.h.

I personally have little use for these.

> For example:
>
> limits.FLT_EPSILON could expose the C DBL_EPSILON
> limits.FLT_MAX could expose the C DBL_MAX

Hmm -- all evidence suggests that your "O" and "A" keys work fine, so where
did the absurdly abbreviated FLT come from <wink>?

> limits.INT_MAX could expose the C LONG_MAX (although that particular name
> would cause confusion with the actual C INT_MAX)

That one is available as sys.maxint.

> - Does this kind of thing already exist somewhere? Maybe in NumPy.

Dunno.  I compute the floating-point limits when needed with Python code,
and observing what the hardware actually does is a heck of a lot more
trustworthy than platform C header files (and especially when
cross-compiling).
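
(For instance, a loop like this -- just an illustration -- recovers the
machine epsilon at run time:

    eps = 1.0
    while 1.0 + eps / 2.0 != 1.0:
        eps = eps / 2.0
    # eps is now the gap between 1.0 and the next larger float, i.e.
    # what C calls DBL_EPSILON for IEEE-754 doubles

and no header file gets a chance to lie about it.)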

> - If we ever (perhaps in Py3K) turn the basic types into classes
> then these could turn into constant attributes of those classes, i.e.:
>   f = 3.14159
>   f.EPSILON = <as set by C's DBL_EPSILON>

That sounds better.

> - I thought of these values being useful when I thought of comparing
>   two floats for equality. Doing a straight comparison of floats is
>   dangerous/wrong

This is a myth whose only claim to veracity is the frequency and intensity
with which it's mechanically repeated <0.6 wink>.  It's no more dangerous
than adding two floats:  you're potentially screwed if you don't know what
you're doing in either case, but you're in no trouble at all if you do.

> but is it not okay to consider two floats reasonably equal iff:
>   	-EPSILON < float2 - float1 < EPSILON

Knuth (Vol 2) gives a reasonable defn of approximate float equality.  Yours
is measuring absolute error, which is almost never reasonable; relative
error is the measure of interest, but then 0.0 is an especially irksome
comparand.
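
A minimal sketch of a relative-error test along those lines (just an
illustration, not a recommended default):

    def approx_equal(x, y, rel_eps=1e-9, abs_eps=1e-12):
        # relative error where possible; the abs_eps term keeps a
        # comparison against 0.0 from being unsatisfiable
        return abs(x - y) <= max(rel_eps * max(abs(x), abs(y)), abs_eps)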

> ...
>   I suppose the answer to my question is: "It depends on the situation."

Yes.

>   Could this algorithm for float comparison be a better default than the
>   status quo?

No.

> I know that Mark H. and others have suggested that Python should maybe
> not provide a float comparison operator at all to beginners.

There's a good case to be made for not exposing *anything* about fp to
beginners, but comparisons aren't especially surprising.  This usually gets
suggested when a newbie is surprised that e.g. 1./49*49 != 1.  Telling them
they *are* equal is simply a lie, and they'll pay for that false comfort
twice over a little bit later down the fp road.  For example, int(1./49*49)
is 0 on IEEE-754 platforms, which is awfully surprising for an expression
that "equals" 1(!).  The next suggestion is then to fudge int() too, and so
on and so on.  It's like the arcade Whack-A-Mole game:  each mole you knock
into its hole pops up two more where you weren't looking.  Before you know
it, not even a bona fide expert can guess what code will actually do
anymore.

the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
    that-can-be-done-ly y'rs  - tim





From effbot at telia.com  Wed May  3 09:34:51 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:34:51 +0200
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: <Pine.LNX.4.10.10005022249330.522-100000@localhost>
Message-ID: <00b201bfb4d3$07a95420$34aab5d4@hagrid>

Ka-Ping Yee <ping at lfw.org> wrote:
> So -- what's the philosophy, Guido?  Are we committed to "everything
> is comparable" (well, "all built-in types are comparable") or not?

in 1.6a2, obviously not:

>>> aUnicodeString < an8bitString
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

in 1.6a3, maybe.

</F>




From effbot at telia.com  Wed May  3 09:48:56 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:48:56 +0200
Subject: [Python-Dev] Unicode debate
References: <000501bfb4c3$16743480$622d153f@tim>
Message-ID: <00ce01bfb4d4$0a7d1820$34aab5d4@hagrid>

Tim Peters <tim_one at email.msn.com> wrote:
> [Moshe Zadka]
> > ...
> > I'd much prefer Python to reflect a fundamental truth about Unicode,
> > which at least makes sure binary-goop can pass through Unicode and
> > remain unharmed, than to reflect a nasty problem with UTF-8 (not
> > everything is legal).
> 
> Then you don't want Unicode at all, Moshe.  All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff
> is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of
> Unicode not yet having assigned a character to this position, it's that the
> standard explicitly makes this sequence illegal and guarantees it will
> always be illegal!

in context, I think what Moshe meant was that with a straight
character code mapping, any 8-bit string can always be mapped
to a unicode string and back again.

given a byte array "b":

    u = unicode(b, "default")
    assert map(ord, u) == map(ord, b)

again, this is no different from casting an integer to a long integer
and back again.  (imagine having to do that on the bits and bytes
level!).

and again, the internal unicode encoding used by the unicode string
type itself, or when serializing that string type, has nothing to do
with that.

</F>




From jack at oratrix.nl  Wed May  3 09:58:31 2000
From: jack at oratrix.nl (Jack Jansen)
Date: Wed, 03 May 2000 09:58:31 +0200
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Python
 bltinmodule.c,2.154,2.155
In-Reply-To: Message by bwarsaw@cnri.reston.va.us (Barry A. Warsaw) ,
	     Tue, 2 May 2000 15:24:09 -0400 (EDT) , <20000502192409.8C44E6636B@anthem.cnri.reston.va.us> 
Message-ID: <20000503075832.18574370CF2@snelboot.oratrix.nl>

> _PyBuiltin_Init_2(): Don't test Py_UseClassExceptionsFlag, just go
> ahead and initialize the class-based standard exceptions.  If this
> fails, we throw a Py_FatalError.

Isn't a Py_FatalError overkill? Or will not having the class-based standard 
exceptions lead to so much havoc later on that it is better than limping on?
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From just at letterror.com  Wed May  3 11:03:16 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 10:03:16 +0100
Subject: [Python-Dev] Unicode comparisons & normalization
Message-ID: <l03102806b535964edb26@[193.78.237.152]>

After quickly browsing through the unicode.org URLs I posted earlier, I
reach the following (possibly wrong) conclusions:

- there is a script and language independent canonical form (but automatic
normalization is indeed a bad idea)
- ideally, unicode comparisons should follow the rules from
http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
for 1.6, if at all...)
- this would indeed mean that it's possible for u == v even though type(u)
is type(v) and len(u) != len(v). However, I don't see how this would
collapse /F's world, as the two strings are at most semantically
equivalent. Their physical difference is real, and still follows the
a-string-is-a-sequence-of-characters rule (!).
- there may be additional customized language-specific sorting rules. I
currently don't see how to implement that without some global variable.
- the sorting rules are very complicated, and should be implemented by
calculating "sort keys". If I understood it correctly, these can take up to
4 bytes per character in its most compact form. Still, for it to be
somewhat speed-efficient, they need to be cached...
- u.find() may need an alternative API, which returns a (begin, end) tuple,
since the match may not have the same length as the search string... (This
is tricky, since you need the begin and end indices in the non-canonical
form...)

Just





From effbot at telia.com  Wed May  3 09:56:25 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:56:25 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>             <390F2B2F.2953C72D@prescod.net>  <200005021958.PAA26760@eric.cnri.reston.va.us>
Message-ID: <013c01bfb4d6$da19fb00$34aab5d4@hagrid>

Guido van Rossum <guido at python.org> wrote:
> > What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> > characters with backslashed numbers?
> 
> Hm, good question.  Tcl displays unknown characters as \x or \u
> escapes.  I think this may make more sense than raising an error.

but that's on the display side of things, right?  similar to
repr, in other words.

> But there must be a way to turn on Unicode-awareness on e.g. stdout
> and then printing a Unicode object should not use str() (as it
> currently does).

to throw some extra gasoline on this, how about allowing
str() to return unicode strings?

(extra questions: how about renaming "unicode" to "string",
and getting rid of "unichr"?)

count to ten before replying, please.

</F>




From ping at lfw.org  Wed May  3 10:30:02 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 01:30:02 -0700 (PDT)
Subject: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: <l03102806b535964edb26@[193.78.237.152]>
Message-ID: <Pine.LNX.4.10.10005030116460.522-100000@localhost>

On Wed, 3 May 2000, Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:
> 
> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

I just looked through this document.  Indeed, there's a lot
of work to be done if we want to compare strings this way.

I thought the most striking feature was that this comparison
method does *not* satisfy the common assumption

    a > b  implies  a + c > b + d        (+ is concatenation)

-- in fact, it is specifically designed to allow for cases
where differences in the *later* part of a string can have
greater influence than differences in an earlier part of a
string.  It *does* still guarantee that

    a + b > a

and of course we can still rely on the most basic rules such as

    a > b  and  b > c  implies  a > c

There are sufficiently many significant transformations
described in the UTR 10 document that i'm pretty sure it
is possible for two things to collate equally but not be
equivalent.  (Even after Unicode normalization, there is
still the possibility of rearrangement in step 1.2.)

This would be another motivation for Python to carefully
separate the three types of equality:

    is         identity-equal
    ==         value-equal
    <=>        magnitude-equal

We currently don't distinguish between the last two;
the operator "<=>" is my proposal for how to spell
"magnitude-equal", and in terms of outward behaviour
you can consider (a <=> b) to be (a <= b and a >= b).
I suspect we will find ourselves needing it if we do
rich comparisons anyway.

(I don't know of any other useful kinds of equality,
but if you've run into this before, do pipe up...)


-- ?!ng




From mal at lemburg.com  Wed May  3 10:15:29 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 10:15:29 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            		
	  <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>			
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
	 <390EFE21.DAD7749B@prescod.net> <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <390FE021.6F15C1C8@lemburg.com>

Just van Rossum wrote:
> 
> [MAL vs. PP]
> >> > FYI: Normalization is needed to make comparing Unicode
> >> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
> >>
> >> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> >> think we need to get the bytes/characters level right and then we can
> >> worry about display-equivalent characters (or leave that to the Python
> >> programmer to figure out...).
> >
> >I just wanted to point out that the argument "slicing doesn't
> >work with UTF-8" is moot.
> 
> And failed...

Huh ? The pure fact that you can have two (or more)
Unicode characters to represent a single character makes
Unicode itself have the same problems as e.g. UTF-8.

> [Refs about collation and decomposition]
>
> It's very deep stuff, which seems more appropriate for an extension than
> for builtin comparisons to me.

That's what I think too; I never argued for making this
builtin and automatic (don't know where people got this idea
from).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From effbot at telia.com  Wed May  3 11:02:09 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 11:02:09 +0200
Subject: [Python-Dev] Unicode comparisons & normalization
References: <l03102806b535964edb26@[193.78.237.152]>
Message-ID: <018a01bfb4de$7744cc00$34aab5d4@hagrid>

Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:

here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web 
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at
the source, and that it should be sufficient to do binary matching to tell
if two strings are identical.

...

another very interesting thing from that paper is where they identify four
layers of character support:

    Layer 1: Physical representation. This is necessary for
    APIs that expose a physical representation of string data.
    /.../ To avoid problems with duplicates, it is assumed that
    the data is normalized /.../ 

    Layer 2: Indexing based on abstract codepoints. /.../ This
    is the highest layer of abstraction that ensures interoperability
    with very low implementation effort. To avoid problems
    with duplicates, it is assumed that the data is normalized /.../
 
    Layer 3: Combining sequences, user-relevant. /.../ While we
    think that an exact definition of this layer should be possible,
    such a definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is
    least suited for interoperability, but is necessary for certain
    operations, e.g. sorting. 

until now, this discussion has focussed on the boundary between
layer 1 and 2.

that as many python strings as possible should be on the second
layer has always been obvious to me ("a very low implementation
effort" is exactly my style ;-), and leave the rest for the app.

...while Guido and MAL have argued that we should stay on level 1
(apparently because "we've already implemented it" is less effort
than "let's change a little bit")

no wonder they never understand what I'm talking about...

it's also interesting to see that MAL's using layer 3 and 4 issues as an
argument to keep Python's string support at layer 1.  in contrast, the
W3 paper thinks that normalization is a non-issue also on the layer 1
level.  go figure.

...

btw, how about adopting this paper as the "Character Model for Python"?

yes, I'm serious.

</F>

PS. here's my take on Just's normalization points:

> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

note that the W3 paper recommends early normalization and binary
comparison (assuming the same internal representation of the
unicode character codes, of course).

> - this would indeed mean that it's possible for u == v even though type(u)
> is type(v) and len(u) != len(v). However, I don't see how this would
> collapse /F's world, as the two strings are at most semantically
> equivalent. Their physical difference is real, and still follows the
> a-string-is-a-sequence-of-characters rule (!).

yes, but on layer 3 instead of layer 2.

> - there may be additional customized language-specific sorting rules. I
> currently don't see how to implement that without some global variable.

layer 4.

> - the sorting rules are very complicated, and should be implemented by
> calculating "sort keys". If I understood it correctly, these can take up to
> 4 bytes per character in its most compact form. Still, for it to be
> somewhat speed-efficient, they need to be cached...

layer 4.

> - u.find() may need an alternative API, which returns a (begin, end) tuple,
> since the match may not have the same length as the search string... (This
> is tricky, since you need the begin and end indices in the non-canonical
> form...)

layer 3.




From effbot at telia.com  Wed May  3 11:11:26 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 11:11:26 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>              <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com>
Message-ID: <01ed01bfb4df$8feddb60$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> Guido van Rossum wrote:
> > 
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?
> 
> How about using unicode-escape or raw-unicode-escape as
> default encoding ? (They would have to be adapted to disallow
> Latin-1 char input, though.)
> 
> The advantage would be that they are compatible with ASCII
> while still providing loss-less conversion and since they
> use escape characters, you can even read them using an
> ASCII based editor.

umm.  if you disallow latin-1 characters, how can you call this
one loss-less?

looks like political correctness taken to an entirely new level...

</F>




From ping at lfw.org  Wed May  3 10:50:30 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 01:50:30 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <013c01bfb4d6$da19fb00$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005030141580.522-100000@localhost>

On Wed, 3 May 2000, Fredrik Lundh wrote:
> Guido van Rossum <guido at python.org> wrote:
> > But there must be a way to turn on Unicode-awareness on e.g. stdout
> > and then printing a Unicode object should not use str() (as it
> > currently does).
> 
> to throw some extra gasoline on this, how about allowing
> str() to return unicode strings?

You still need to *print* them somehow.  One way or another,
stdout is still just a stream with bytes on it, unless we
augment file objects to understand encodings.

stdout sends bytes to something -- and that something will
interpret the stream of bytes in some encoding (could be
Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:

    1.  You explicitly downconvert to bytes, and specify
        the encoding each time you do.  Then write the
        bytes to stdout (or your file object).

    2.  The file object is smart and can be told what
        encoding to use, and Unicode strings written to
        the file are automatically converted to bytes.

Another thread mentioned having separate read/write and
binary_read/binary_write methods on files.  I suggest
doing it the other way, actually: since read/write operate
on byte streams now, *they* are the binary operations;
the new methods should be the ones that do the extra
encoding/decoding work, and could be called uniread/uniwrite,
uread/uwrite, textread/textwrite, etc.
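
just to make option 2 concrete, here's a rough sketch (class and method
names are made up for illustration, not an existing API):

    class UniFile:
        # wraps an ordinary byte-stream file object
        def __init__(self, fileobj, encoding="ascii"):
            self.fileobj = fileobj
            self.encoding = encoding
        def write(self, data):
            # the existing method stays binary: bytes in, bytes out
            self.fileobj.write(data)
        def uniwrite(self, ustr):
            # the new method does the encoding work
            self.fileobj.write(ustr.encode(self.encoding))

so UniFile(open("out.txt", "w"), "utf-8").uniwrite(u"...") would do the
downconversion in exactly one place.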

> (extra questions: how about renaming "unicode" to "string",
> and getting rid of "unichr"?)

Would you expect chr(x) to return an 8-bit string when x < 128,
and a Unicode string when x >= 128?


-- ?!ng




From ping at lfw.org  Wed May  3 11:32:31 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:32:31 -0700 (PDT)
Subject: [Python-Dev] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030151150.522-100000@localhost>

On Tue, 2 May 2000, Guido van Rossum wrote:
> > P. P. S.  If always having to specify encodings is really too much,
> > i'd probably be willing to consider a default-encoding state on the
> > Unicode class, but it would have to be a stack of values, not a
> > single value.
> 
> Please elaborate?

On general principle, it seems bad to just have a "set" method
that encourages people to set static state in a way that
irretrievably loses the current state.  For something like this,
you want a "push" method and a "pop" method with which to bracket
a series of operations, so that you can easily write code which
politely leaves other code unaffected.

For example:

    >>> x = unicode("d\351but")        # assume Guido-ASCII wins
    UnicodeError: ASCII encoding error: value out of range
    >>> x = unicode("d\351but", "latin-1")
    >>> x
    u'd\351but'
    >>> print x.encode("latin-1")      # on my xterm with Latin-1 fonts
    d?but
    >>> x.encode("utf-8")
    'd\303\251but'

Now:

    >>> u"".pushenc("latin-1")         # need a better interface to this?
    >>> x = unicode("d\351but")        # okay now
    >>> x
    u'd\351but'
    >>> u"".pushenc("utf-8")
    >>> x = unicode("d\351but")
    UnicodeError: UTF-8 decoding error: invalid data
    >>> x = unicode("d\303\251but")
    >>> print x.encode("latin-1")
    d?but
    >>> str(x)
    'd\303\251\but'
    >>> u"".popenc()                   # back to the Latin-1 encoding
    >>> str(x)
    'd\351but'
        .
        .
        .
    >>> u"".popenc()                   # back to the ASCII encoding

Similarly, imagine:

    >>> x = u"<Japanese text...>"

    >>> file = open("foo.jis", "w")
    >>> file.pushenc("iso-2022-jp")
    >>> file.uniwrite(x)
        .
        .
        .
    >>> file.popenc()

    >>> import sys
    >>> sys.stdout.write(x)            # bad! x contains chars > 127
    UnicodeError: ASCII decoding error: value out of range

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> sys.stdout.write(x)            # on a kterm with kanji fonts
    <Japanese text...>
        .
        .
        .
    >>> sys.stdout.popenc()

The above examples incorporate the Guido-ASCII proposal, which
makes a fair amount of sense to me now.  How do they look to y'all?



This illustrates the remaining wart:

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> print x                        # still bad! str is still doing ASCII
    UnicodeError: ASCII decoding error: value out of range

    >>> u"".pushenc("iso-2022-jp")
    >>> print x                        # on a kterm with kanji fonts
    <Japanese text...>

Writing to files asks the file object to convert from Unicode to
bytes, then write the bytes.

Printing converts the Unicode to bytes first with str(), then
hands the bytes to the file object to write.

This wart is really a larger printing issue.  If we want to
solve it, files have to know what to do with objects, i.e.

    print x

doesn't mean

    sys.stdout.write(str(x) + "\n")

instead it means

    sys.stdout.printout(x)

Hmm.  I think this might deserve a separate subject line.


-- ?!ng




From ping at lfw.org  Wed May  3 11:41:20 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:41:20 -0700 (PDT)
Subject: [Python-Dev] Printing objects on files
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030232360.522-100000@localhost>

The following is all stolen from E: see http://www.erights.org/.

As i mentioned in the previous message, there are reasons that
we might want to enable files to know what it means to print
things on them.

    print x

would mean

    sys.stdout.printout(x)

where sys.stdout is defined something like

    def __init__(self):
        self.encs = ["ASCII"]

    def pushenc(self, enc):
        self.encs.append(enc)
    
    def popenc(self):
        self.encs.pop()
        if not self.encs: self.encs = ["ASCII"]

    def printout(self, x):
        if type(x) is type(u""):
            self.write(x.encode(self.encs[-1]))
        else:   
            x.__print__(self)
        self.write("\n")

and each object would have a __print__ method; for lists, e.g.:

    def __print__(self, file):
        file.write("[")
        if len(self):
            file.printout(self[0])
        for item in self[1:]:
            file.write(", ")
            file.printout(item)
        file.write("]")

for floats, e.g.:

    def __print__(self, file):
        if hasattr(file, "floatprec"):
            prec = file.floatprec
        else:
            prec = 17
        file.write("%%.%df" % prec % self)

The passing of control between the file and the objects to
be printed enables us to make Tim happy:

    >>> l = [1/2, 1/3, 1/4]            # I can dream, can't i?

    >>> print l
    [0.3, 0.33333333333333331, 0.25]

    >>> sys.stdout.floatprec = 6
    >>> print l
    [0.5, 0.333333, 0.25]

Fantasizing about other useful kinds of state beyond "encs"
and "floatprec" ("listmax"? "ratprec"?) and managing this
namespace is left as an exercise to the reader.


-- ?!ng




From ht at cogsci.ed.ac.uk  Wed May  3 11:59:28 2000
From: ht at cogsci.ed.ac.uk (Henry S. Thompson)
Date: 03 May 2000 10:59:28 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Mon, 01 May 2000 20:53:26 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <f5bog6o54zj.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido at python.org> writes:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

OK, I've never contributed to this discussion, but I have a long
history of shipping widely used Python/Tkinter/XML tools (see my
homepage).  I care _very_ much that heretofore I have been unable to
support full XML because of the lack of Unicode support in Python.
I've already started playing with 1.6a2 for this reason.

I notice one apparent mis-communication between the various
contributors:

Treating narrow-strings as consisting of UNICODE code points <= 255 is 
not necessarily the same thing as making Latin-1 the default encoding.
I don't think on Paul and Fredrik's account encodings are relevant to
narrow-strings at all.

I'd rather go right away to the coherent position of byte-arrays,
narrow-strings and wide-strings.  Encodings are only relevant to
conversion between byte-arrays and strings.  Decoding a byte-array
with a UTF-8 encoding into a narrow string might cause
overflow/truncation, just as decoding a byte-array with a UTF-8
encoding into a wide-string might.  The fact that decoding a
byte-array with a Latin-1 encoding into a narrow-string is a memcopy
is just a side-effect of the courtesy of the UNICODE designers wrt the 
code points between 128 and 255.

This is effectively the way our C-based XML toolset (which we embed in 
Python) works today -- we build an 8-bit version which uses char*
strings, and a 16-bit version which uses unsigned short* strings, and
convert from/to byte-streams in any supported encoding at the margins.

I'd like to keep byte-arrays at the margins in Python as well, for all 
the reasons advanced by Paul and Fredrik.

I think treating existing strings as a sort of pun between
narrow-strings and byte-arrays is a recipe for ongoing confusion.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht at cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/



From ping at lfw.org  Wed May  3 11:51:30 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:51:30 -0700 (PDT)
Subject: [Python-Dev] Re: Printing objects on files
In-Reply-To: <Pine.LNX.4.10.10005030232360.522-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005030242030.522-100000@localhost>

On Wed, 3 May 2000, Ka-Ping Yee wrote:
> 
> Fantasizing about other useful kinds of state beyond "encs"
> and "floatprec" ("listmax"? "ratprec"?) and managing this
> namespace is left as an exercise to the reader.

Okay, i lied.  Shortly after writing this i realized that it
is probably advisable for all such bits of state to be stored
in stacks, so an interface such as this might do:

    def push(self, key, value):
        if not self.state.has_key(key):
            self.state[key] = []
        self.state[key].append(value)

    def pop(self, key):
        if self.state.has_key(key):
            if len(self.state[key]):
                self.state[key].pop()

    def get(self, key):
        if self.state.has_key(key):
            stack = self.state[key]
            if stack:
                return stack[-1]
        return None

Thus:

    >>> print 1/3
    0.33333333333333331

    >>> sys.stdout.push("float.prec", 6)
    >>> print 1/3
    0.333333

    >>> sys.stdout.pop("float.prec")
    >>> print 1/3
    0.33333333333333331

And once we allow arbitrary strings as keys to the bits
of state, the period is a natural separator we can use
for managing the namespace.

Take the special case for Unicode out of the file object:
    
    def printout(self, x):
        x.__print__(self)
        self.write("\n")

and have the Unicode string do the work:

    def __print__(self, file):
        file.write(self.encode(file.get("unicode.enc")))

This behaves just right if an encoding of None means ASCII.
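
If encode(None) isn't given that meaning, the fallback is easy to
spell explicitly (an illustrative variant of the method above):

    def __print__(self, file):
        # fall back to ASCII when no "unicode.enc" has been pushed
        file.write(self.encode(file.get("unicode.enc") or "ascii"))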

If mucking with encodings is sufficiently common, you could
imagine conveniences on file objects such as

    def __init__(self, filename, mode, encoding=None):
        ...
        if encoding:
            self.push("unicode.enc", encoding)

    def pushenc(self, encoding):
        self.push("unicode.enc", encoding)

    def popenc(self):
        self.pop("unicode.enc")


-- ?!ng




From effbot at telia.com  Wed May  3 12:31:34 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 12:31:34 +0200
Subject: [Python-Dev] Unicode debate
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>
Message-ID: <030a01bfb4ea$c2741e40$34aab5d4@hagrid>

Ka-Ping Yee <ping at lfw.org> wrote:
> > to throw some extra gasoline on this, how about allowing
> > str() to return unicode strings?
> 
> You still need to *print* them somehow.  One way or another,
> stdout is still just a stream with bytes on it, unless we
> augment file objects to understand encodings.
> 
> stdout sends bytes to something -- and that something will
> interpret the stream of bytes in some encoding (could be
> Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:
> 
>     1.  You explicitly downconvert to bytes, and specify
>         the encoding each time you do.  Then write the
>         bytes to stdout (or your file object).
> 
>     2.  The file object is smart and can be told what
>         encoding to use, and Unicode strings written to
>         the file are automatically converted to bytes.

which one's more convenient?

(no, I won't tell you what I prefer. guido doesn't want
more arguments from the old "characters are characters"
proponents, so I gotta trick someone else into spelling them
out ;-)

> > (extra questions: how about renaming "unicode" to "string",
> > and getting rid of "unichr"?)
> 
> Would you expect chr(x) to return an 8-bit string when x < 128,
> and a Unicode string when x >= 128?

that will break too much existing code, I think.  but what
about replacing 128 with 256?

</F>




From just at letterror.com  Wed May  3 13:41:27 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 12:41:27 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390FE021.6F15C1C8@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            		
   <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>				
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>	
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>	
 <390EFE21.DAD7749B@prescod.net> <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <l03102800b535bef21708@[193.78.237.152]>

At 10:15 AM +0200 03-05-2000, M.-A. Lemburg wrote:
>Huh ? The pure fact that you can have two (or more)
>Unicode characters to represent a single character makes
>Unicode itself have the same problems as e.g. UTF-8.

It's the different level of abstraction that makes it different.

Even if "e`" is _equivalent_ to the combined character, that doesn't mean
that it _is_ the combined character, on the level of abstraction we are
talking about: it's still 2 characters, and those can be sliced apart
without a problem. Slicing utf-8 doesn't work because it yields invalid
strings, slicing "e`" does work since both halves are valid strings. The
fact that "e`" is semantically equivalent to the combined character doesn't
change that.

Just





From guido at python.org  Wed May  3 13:12:44 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 07:12:44 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: Your message of "Wed, 03 May 2000 01:30:02 PDT."
             <Pine.LNX.4.10.10005030116460.522-100000@localhost> 
References: <Pine.LNX.4.10.10005030116460.522-100000@localhost> 
Message-ID: <200005031112.HAA03138@eric.cnri.reston.va.us>

[Ping]
> This would be another motivation for Python to carefully
> separate the three types of equality:
> 
>     is         identity-equal
>     ==         value-equal
>     <=>        magnitude-equal
> 
> We currently don't distinguish between the last two;
> the operator "<=>" is my proposal for how to spell
> "magnitude-equal", and in terms of outward behaviour
> you can consider (a <=> b) to be (a <= b and a >= b).
> I suspect we will find ourselves needing it if we do
> rich comparisons anyway.

I don't think that this form of equality deserves its own operator.
The Unicode comparison rules are sufficiently hairy that it seems
better to implement them separately, either in a separate module or at
least as a Unicode-object-specific method, and let the == operator do
what it does best: compare the representations.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From guido at python.org  Wed May  3 13:14:54 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 07:14:54 -0400
Subject: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: Your message of "Wed, 03 May 2000 11:02:09 +0200."
             <018a01bfb4de$7744cc00$34aab5d4@hagrid> 
References: <l03102806b535964edb26@[193.78.237.152]>  
            <018a01bfb4de$7744cc00$34aab5d4@hagrid> 
Message-ID: <200005031114.HAA03152@eric.cnri.reston.va.us>

> here's another good paper that covers this, the universe, and everything:

There's a lot of useful pointers being flung around.  Could someone
with more spare cycles than I currently have perhaps collect these and
produce a little write-up, "further reading on Unicode comparison and
normalization" (or perhaps a more comprehensive title if warranted) to
be added to the i18n-sig's home page?

--Guido van Rossum (home page: http://www.python.org/~guido/)




From just at letterror.com  Wed May  3 14:26:50 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 13:26:50 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <030a01bfb4ea$c2741e40$34aab5d4@hagrid>
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>
Message-ID: <l03102804b535cb14f243@[193.78.237.149]>

[Ka-Ping Yee]
> Would you expect chr(x) to return an 8-bit string when x < 128,
> and a Unicode string when x >= 128?

[Fredrik Lundh]
> that will break too much existing code, I think.  but what
> about replacing 128 with 256?

Hihi... and *poof* -- we're back to Latin-1 for narrow strings ;-)

Just





From guido at python.org  Wed May  3 14:04:29 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:04:29 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 12:31:34 +0200."
             <030a01bfb4ea$c2741e40$34aab5d4@hagrid> 
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>  
            <030a01bfb4ea$c2741e40$34aab5d4@hagrid> 
Message-ID: <200005031204.IAA03252@eric.cnri.reston.va.us>

[Ping]
> > stdout sends bytes to something -- and that something will
> > interpret the stream of bytes in some encoding (could be
> > Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:
> > 
> >     1.  You explicitly downconvert to bytes, and specify
> >         the encoding each time you do.  Then write the
> >         bytes to stdout (or your file object).
> > 
> >     2.  The file object is smart and can be told what
> >         encoding to use, and Unicode strings written to
> >         the file are automatically converted to bytes.

[Fredrik]
> which one's more convenient?

Marc-Andre's codec module contains file-like objects that support this
(or could easily be made to).

However the problem is that print *always* first converts the object
using str(), and str() enforces that the result is an 8-bit string.
I'm afraid that loosening this will break too much code.  (This all
really happens at the C level.)

I'm also afraid that this means that str(unicode) may have to be
defined to yield UTF-8.  My argument goes as follows:

1. We want to be able to set things up so that print u"..." does the
   right thing.  (What "the right thing" is, is not defined here,
   as long as the user sees the glyphs implied by u"...".)

2. print u is equivalent to sys.stdout.write(str(u)).

3. str() must always return an 8-bit string.

4. So the solution must involve assigning an object to sys.stdout that
   does the right thing given an 8-bit encoding of u.

5. So we need str(u) to produce a lossless 8-bit encoding of Unicode.

6. UTF-8 is the only sensible candidate.

Note that (apart from print) str() is never implicitly invoked -- all
implicit conversions when Unicode and 8-bit strings are combined
go from 8-bit to Unicode.
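
To make step 4 concrete, the sys.stdout object could look roughly like
this (a sketch only; the class name and the terminal encoding are made
up, and it assumes str(u) yields UTF-8):

    import sys

    class RecodingStdout:
        def __init__(self, stream, terminal_encoding="latin-1"):
            self.stream = stream
            self.encoding = terminal_encoding
        def write(self, data):
            # undo the str() step (plain ASCII passes through unchanged,
            # since ASCII is a subset of UTF-8), then re-encode for the
            # terminal
            u = unicode(data, "utf-8")
            self.stream.write(u.encode(self.encoding))

    sys.stdout = RecodingStdout(sys.stdout)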

(There might be an alternative, but it would depend on having yet
another hook (similar to Ping's sys.display) that gets invoked when
printing an object (as opposed to displaying it at the interactive
prompt).  I'm not too keen on this because it would break code that
temporarily sets sys.stdout to a file of its own choosing and then
invokes print -- a common idiom to capture printed output in a string,
for example, which could be embedded deep inside a module.  If the
main program were to install a naive print hook that always sent
output to a designated place, this strategy might fail.)

> > > (extra questions: how about renaming "unicode" to "string",
> > > and getting rid of "unichr"?)
> > 
> > Would you expect chr(x) to return an 8-bit string when x < 128,
> > and a Unicode string when x >= 128?
> 
> that will break too much existing code, I think.  but what
> about replacing 128 with 256?

If the 8-bit Unicode proposal were accepted, this would make sense.
In my "only ASCII is implicitly convertible" proposal, this would be a
mistake, because chr(128) == "\x80" != u"\x80" == unichr(128).

I agree with everyone that things would be much simpler if we had
separate data types for byte arrays and 8-bit character strings.  But
we don't have this distinction yet, and I don't see a quick way to add
it in 1.6 without seriously upsetting the release schedule.

So all of my proposals are to be considered hacks to maintain as much
b/w compatibility as possible while still supporting some form of
Unicode.  The fact that half the time 8-bit strings are really being
used as byte arrays, while Python can't tell the difference, means (to
me) that the default encoding is an important thing to argue about.

I don't know if I want to push it out all the way to Py3k, but I just
don't see a way to implement "a character is a character" in 1.6 given
all the current constraints.  (BTW I promise that 1.7 will be speedy
once 1.6 is out of the door -- there's a lot else that was put off to
1.7.)

Fredrik, I believe I haven't seen your response to my ASCII proposal.
Is it just as bad as UTF-8 to you, or could you live with it?  On a
scale of 0-9 (0: UTF-8, 9: 8-bit Unicode), where is ASCII for you?

Where's my sre snapshot?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Wed May  3 14:16:56 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:16:56 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "03 May 2000 10:59:28 BST."
             <f5bog6o54zj.fsf@cogsci.ed.ac.uk> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>  
            <f5bog6o54zj.fsf@cogsci.ed.ac.uk> 
Message-ID: <200005031216.IAA03274@eric.cnri.reston.va.us>

[Henry S. Thompson]
> OK, I've never contributed to this discussion, but I have a long
> history of shipping widely used Python/Tkinter/XML tools (see my
> homepage).  I care _very_ much that heretofore I have been unable to
> support full XML because of the lack of Unicode support in Python.
> I've already started playing with 1.6a2 for this reason.

Thanks for chiming in!

> I notice one apparent mis-communication between the various
> contributors:
> 
> Treating narrow-strings as consisting of UNICODE code points <= 255 is 
> not necessarily the same thing as making Latin-1 the default encoding.
> I don't think on Paul and Fredrik's account encoding are relevant to
> narrow-strings at all.

I agree that's what they are trying to tell me.

> I'd rather go right away to the coherent position of byte-arrays,
> narrow-strings and wide-strings.  Encodings are only relevant to
> conversion between byte-arrays and strings.  Decoding a byte-array
> with a UTF-8 encoding into a narrow string might cause
> overflow/truncation, just as decoding a byte-array with a UTF-8
> encoding into a wide-string might.  The fact that decoding a
> byte-array with a Latin-1 encoding into a narrow-string is a memcopy
> is just a side-effect of the courtesy of the UNICODE designers wrt the 
> code points between 128 and 255.
> 
> This is effectively the way our C-based XML toolset (which we embed in 
> Python) works today -- we build an 8-bit version which uses char*
> strings, and a 16-bit version which uses unsigned short* strings, and
> convert from/to byte-streams in any supported encoding at the margins.
> 
> I'd like to keep byte-arrays at the margins in Python as well, for all 
> the reasons advanced by Paul and Fredrik.
> 
> I think treating existing strings as a sort of pun between
> narrow-strings and byte-arrays is a recipe for ongoing confusion.

Very good analysis.

Unfortunately this is where we're stuck, until we have a chance to
redesign this kind of thing from scratch.  Python 1.5.2 programs use
strings for byte arrays probably as much as they use them for
character strings.  This is because way back in 1990, when I was
designing Python, I wanted to have the smallest set of basic types, but I
also wanted to be able to manipulate byte arrays somewhat.  Influenced
by K&R C, I chose to make strings and string I/O 8-bit clean so that
you could read a binary "string" from a file, manipulate it, and write
it back to a file, regardless of whether it was character or binary
data.

This model has never been challenged until now.  I agree that the Java
model (byte arrays and strings) or perhaps your proposed model (byte
arrays, narrow and wide strings) looks better.  But, although Python
has had rudimentary support for byte arrays for a while (the array
module, introduced in 1993), the majority of Python code manipulating
binary data still uses string objects.

My ASCII proposal is a compromise that tries to be fair to both uses
for strings.  Introducing byte arrays as a more fundamental type has
been on the wish list for a long time -- I see no way to introduce
this into Python 1.6 without totally botching the release schedule
(June 1st is very close already!).  I'd like to be able to move on,
there are other important things still to be added to 1.6 (Vladimir's
malloc patches, Neil's GC, Fredrik's completed sre...).

For 1.7 (which should happen later this year) I promise I'll reopen
the discussion on byte arrays.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Wed May  3 14:18:39 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:18:39 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Python bltinmodule.c,2.154,2.155
In-Reply-To: Your message of "Wed, 03 May 2000 09:58:31 +0200."
             <20000503075832.18574370CF2@snelboot.oratrix.nl> 
References: <20000503075832.18574370CF2@snelboot.oratrix.nl> 
Message-ID: <200005031218.IAA03288@eric.cnri.reston.va.us>

> > _PyBuiltin_Init_2(): Don't test Py_UseClassExceptionsFlag, just go
> > ahead and initialize the class-based standard exceptions.  If this
> > fails, we throw a Py_FatalError.
> 
> Isn't a Py_FatalError overkill? Or will not having the class-based standard 
> exceptions lead to so much havoc later on that it is better than limping on?

There will be *no* exception objects -- they will all be NULL
pointers.  It's not clear that you will be able to limp very far, and
it's better to have a clear diagnostic at the source of the problem.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Wed May  3 14:22:57 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:22:57 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 01:05:59 EDT."
             <000301bfb4bd$463ec280$622d153f@tim> 
References: <000301bfb4bd$463ec280$622d153f@tim> 
Message-ID: <200005031222.IAA03300@eric.cnri.reston.va.us>

> [Guido]
> > When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> > bytes in either should make the comparison fail; when ordering is
> > important, we can make an arbitrary choice e.g. "\377" < u"\200".
> 
> [Toby]
> > I assume 'fail' means 'non-equal', rather than 'raises an exception'?
> 
> [Guido]
> > Yes, sorry for the ambiguity.

[Tim]
> Huh!  You sure about that?  If we're setting up a case where meaningful
> comparison is impossible, isn't an exception more appropriate?  The current
> 
> >>> 83479278 < "42"
> 1
> >>>
> 
> probably traps more people than it helps.

Agreed, but that's the rule we all currently live by, and changing it
is something for Python 3000.

I'm not real strong on this though -- I was willing to live with
exceptions from the UTF-8-to-Unicode conversion.  If we all agree that
it's better for u"\377" == "\377" to raise an precedent-setting
exception than to return false, that's fine with me too.  I do want
u"a" == "a" to be true though (and I believe we all already agree on
that one).

Note that it's not the first precedent -- you can already define
classes whose instances can raise exceptions during comparisons.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Wed May  3 10:56:08 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 10:56:08 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>             <390F2B2F.2953C72D@prescod.net>  <200005021958.PAA26760@eric.cnri.reston.va.us> <013c01bfb4d6$da19fb00$34aab5d4@hagrid>
Message-ID: <390FE9A7.DE5545DA@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum <guido at python.org> wrote:
> > > What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> > > characters with backslashed numbers?
> >
> > Hm, good question.  Tcl displays unknown characters as \x or \u
> > escapes.  I think this may make more sense than raising an error.
> 
> but that's on the display side of things, right?  similar to
> repr, in other words.
> 
> > But there must be a way to turn on Unicode-awareness on e.g. stdout
> > and then printing a Unicode object should not use str() (as it
> > currently does).
> 
> to throw some extra gasoline on this, how about allowing
> str() to return unicode strings?
> 
> (extra questions: how about renaming "unicode" to "string",
> and getting rid of "unichr"?)
> 
> count to ten before replying, please.

1 2 3 4 5 6 7 8 9 10 ... ok ;-)

Guido's problem with printing Unicode can easily be solved
using the standard codecs.StreamRecoder class as I've done
in the example I posted some days ago.

Basically, what the stdout wrapper would do is take strings
as input, convert them to Unicode, and then write
them encoded to the original stdout. For Unicode objects
the conversion can be skipped and the encoded output written
directly to stdout.

This can be done for any encoding supported by Python; e.g.
you could do the indirection in site.py and then have
Unicode printed as Latin-1 or UTF-8 or one of the many
code pages supported through the mapping codec.

About having str() return Unicode objects: I see str()
as constructor for string objects and under that assumption
str() will always have to return string objects.
unicode() does the same for Unicode objects, so renaming
it to something else doesn't really help all that much.

BTW, __str__() has to return strings too. Perhaps we
need __unicode__() and a corresponding slot function too ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed May  3 15:06:27 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 15:06:27 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>              <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> <01ed01bfb4df$8feddb60$34aab5d4@hagrid>
Message-ID: <39102453.6923B10@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> > >
> > > > > So what do you think of my new proposal of using ASCII as the default
> > > > > "encoding"?
> >
> > How about using unicode-escape or raw-unicode-escape as
> > default encoding ? (They would have to be adapted to disallow
> > Latin-1 char input, though.)
> >
> > The advantage would be that they are compatible with ASCII
> > while still providing loss-less conversion and since they
> > use escape characters, you can even read them using an
> > ASCII based editor.
> 
> umm.  if you disallow latin-1 characters, how can you call this
> one loss-less?

[Guido didn't like this one, so it's probably moot to invest
 any more time in this...]

I meant that the unicode-escape codec should only take ASCII
characters as input and disallow non-escaped Latin-1 characters.

Anyway, I'm out of this discussion... 

I'll wait a week or so until things have been sorted out.

Have fun,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From ping at lfw.org  Wed May  3 15:09:59 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 06:09:59 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <200005031204.IAA03252@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030556250.522-100000@localhost>

On Wed, 3 May 2000, Guido van Rossum wrote:
> (There might be an alternative, but it would depend on having yet
> another hook (similar to Ping's sys.display) that gets invoked when
> printing an object (as opposed to displaying it at the interactive
> prompt).  I'm not too keen on this because it would break code that
> temporarily sets sys.stdout to a file of its own choosing and then
> invokes print -- a common idiom to capture printed output in a string,
> for example, which could be embedded deep inside a module.  If the
> main program were to install a naive print hook that always sent
> output to a designated place, this strategy might fail.)

I know this is not a small change, but i'm pretty convinced the
right answer here is that the print hook should call a *method*
on sys.stdout, whatever sys.stdout happens to be.  The details
are described in the other long message i wrote ("Printing objects
on files").

Here is an addendum that might actually make that proposal
feasible enough (compatibility-wise) to fly in the short term:

    print x

does, conceptually:

    try:
        sys.stdout.printout(x)
    except AttributeError:
        sys.stdout.write(str(x))
        sys.stdout.write("\n")

The rest can then be added, and the change in 'print x' will
work nicely for any file objects, but will not break on file-like
substitutes that don't define a 'printout' method.

Any reactions to the other benefit of this proposal -- namely,
the ability to control the printing parameters of object
components as they're being traversed for printing?  That was
actually the original motivation for doing the file.printout
thing: it gives you some of the effect of "passing down str-ness"
that we were discussing so heatedly a little while ago.

The other thing that just might justify this much of a change
is that, as you reasoned clearly in your other message, without
adequate resolution to the printing problem we may have painted
ourselves into a corner with regard to str(u"") conversion, and
i don't like the look of that corner much.  *Even* if we were to
get people to agree that it's okay for str(u"") to produce UTF-8,
it still seems pretty hackish to me that we're forced to choose
this encoding as a way of working around the fact that we can't
simply give the file the thing we want to print.


-- ?!ng




From moshez at math.huji.ac.il  Wed May  3 15:55:37 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Wed, 3 May 2000 16:55:37 +0300 (IDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <000501bfb4c3$16743480$622d153f@tim>
Message-ID: <Pine.GSO.4.10.10005031649040.4859-100000@sundial>

On Wed, 3 May 2000, Tim Peters wrote:

[Moshe Zadka]
> ...
> I'd much prefer Python to reflect a fundamental truth about Unicode,
> which at least makes sure binary-goop can pass through Unicode and
> remain unharmed, then to reflect a nasty problem with UTF-8 (not
> everything is legal).

[Tim Peters]
> Then you don't want Unicode at all, Moshe.  All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences

Of course I don't, and of course you're right. But what I do want is for
my binary goop to pass unharmed through the evil Unicode forest. Which is
why I don't want it to interpret my goop as a sequence of bytes it tries
to decode, but I want the numeric values of my bytes to pass through to
Unicode unharmed -- that means Latin-1 because of the second design
decision of the horribly western-specific Unicode - the first 256
characters are the same as Latin-1. If it were up to me, I'd use Latin-3,
but it wasn't, so it's not.

> (for example, 0xffff
> is illegal in UTF-16 (whether BE or LE)

Tim, one of us must have cracked a chip. 0xffff is the same in BE and LE
-- isn't it?

--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From akuchlin at mems-exchange.org  Wed May  3 16:12:06 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 3 May 2000 10:12:06 -0400 (EDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <200005031216.IAA03274@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <14608.13238.339572.202494@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>been on the wish list for a long time -- I see no way to introduce
>this into Python 1.6 without totally botching the release schedule
>(June 1st is very close already!).  I'd like to be able to move on,

My suggested criterion is that 1.6 not screw things up in a way that
we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
a corner that 

(And can we choose a mailing list for discussing this and stick to it?
 This is being cross-posted to three lists: python-dev, i18n-sig, and
 xml-sig!  i18n-sig only, maybe?  Or string-sig?)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Chess! I'm tormented by thoughts of strip chess. Pure mind just isn't enough,
Mallah. I long for a body.
  -- The Brain, in DOOM PATROL #34




From akuchlin at mems-exchange.org  Wed May  3 16:15:18 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 3 May 2000 10:15:18 -0400 (EDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <14608.13238.339572.202494@amarok.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
	<14608.13238.339572.202494@amarok.cnri.reston.va.us>
Message-ID: <14608.13430.92985.717058@amarok.cnri.reston.va.us>

Andrew M. Kuchling writes:
>Guido van Rossum writes:
>My suggested criterion is that 1.6 not screw things up in a way that
>we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
>a corner that 

Doh!  To complete that paragraph: Magic conversions assuming UTF-8
do back us into a corner that is hard to get out of later.  Magic
conversions assuming Latin1 or ASCII are a bit better, but I'd lean
toward the draconian solution: we don't know what we're doing, so do
nothing and require the user to explicitly convert between Unicode and
8-bit strings in a user-selected encoding.

--amk



From guido at python.org  Wed May  3 17:48:32 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 11:48:32 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 10:15:18 EDT."
             <14608.13430.92985.717058@amarok.cnri.reston.va.us> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us> <14608.13238.339572.202494@amarok.cnri.reston.va.us>  
            <14608.13430.92985.717058@amarok.cnri.reston.va.us> 
Message-ID: <200005031548.LAA03595@eric.cnri.reston.va.us>

> >Guido van Rossum writes:
> >My suggested criterion is that 1.6 not screw things up in a way that
> >we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
> >a corner that 

> Andrew M. Kuchling writes:
> Doh!  To complete that paragraph: Magic conversions assuming UTF-8
> does back us into a corner that is hard to get out of later.  Magic
> conversions assuming Latin1 or ASCII are a bit better, but I'd lean
> toward the draconian solution: we don't know what we're doing, so do
> nothing and require the user to explicitly convert between Unicode and
> 8-bit strings in a user-selected encoding.

GvR responds:
That's what Ping suggested.  My reason for proposing default
conversions from ASCII is that there is much code that deals with
character strings in a fairly abstract sense and that would work out
of the box (or after very small changes) with Unicode strings.  This
code often uses some string literals containing ASCII characters.  An
arbitrary example: code to reformat a text paragraph; another: an XML
parser.  These look for certain ASCII characters given as literals in
the code (" ", "<" and so on) but the algorithm is essentially
independent of what encoding is used for non-ASCII characters.  (I
realize that the text reformatting example doesn't work for all
Unicode characters because its assumption that all characters have
equal width is broken -- but at the very least it should work with
Latin-1 or Greek or Cyrillic stored in Unicode strings.)

It's the same as for ints: a function to calculate the GCD works with
ints as well as long ints without change, even though it references
the int constant 0.  In other words, we want string-processing code to
be just as polymorphic as int-processing code.
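
A trivial illustration (a made-up helper, not from any library): the
only literals below are ASCII, so under the ASCII-default rule the same
code handles both 8-bit and Unicode text, just like the GCD example:

    def indent(text, prefix="    "):
        out = []
        for line in text.split("\n"):
            out.append(prefix + line)
        return "\n".join(out)

    indent("plain text")        # 8-bit in, 8-bit out
    indent(u"caf\xe9\nbar")     # Unicode in, Unicode out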

--Guido van Rossum (home page: http://www.python.org/~guido/)



From just at letterror.com  Wed May  3 21:55:24 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 20:55:24 +0100
Subject: [Python-Dev] Unicode strings: an alternative
Message-ID: <l03102800b5362642bae3@[193.78.237.149]>

Today I had a relatively simple idea that unites wide strings and narrow
strings in a way that is more backward comatible at the C level. It's quite
possible this has already been considered and rejected for reasons that are
not yet obvious to me, but I'll give it a shot anyway.

The main concept is not to provide a new string type but to extend the
existing string object like so:
- wide strings are stored as if they were narrow strings, simply using two
bytes for each Unicode character.
- there's a flag that specifies whether the string is narrow or wide.
- the ob_size field is the _physical_ length of the data; if the string is
wide, len(s) will return ob_size/2, and all other string operations will have
to do similar things.
- there can possibly be an encoding attribute which may specify the used
encoding, if known.

Admittedly, this is tricky and involves quite a bit of effort to implement,
since all string methods need to have a narrow/wide switch. To make it worse,
it hardly offers anything the current solution doesn't. However, it offers
one IMHO _big_ advantage: C code that just passes strings along does not
need to change: wide strings can be seen as narrow strings without any
loss. This allows for __str__() & str() and friends to work with unicode
strings without any change.
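
For illustration only, here's the idea modelled as a toy Python class
(the real proposal is about the C string object, not a Python type;
names and details are invented):

    class FlexString:
        def __init__(self, data, wide=0):
            self.data = data      # raw bytes; 2 bytes per char when wide
            self.wide = wide      # the narrow/wide flag
        def __len__(self):
            # the data length is the physical size; logical length halves it
            if self.wide:
                return len(self.data) / 2
            return len(self.data)
        def __getitem__(self, i):
            if self.wide:
                return self.data[2*i:2*i+2]
            return self.data[i]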

Any thoughts?

Just





From tree at cymru.basistech.com  Wed May  3 22:19:05 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Wed, 3 May 2000 16:19:05 -0400 (EDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102800b5362642bae3@[193.78.237.149]>
References: <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <14608.35257.729641.178724@cymru.basistech.com>

Just van Rossum writes:
 > The main concept is not to provide a new string type but to extend the
 > existing string object like so:

This is the most logical thing to do.

 > - wide strings are stored as if they were narrow strings, simply using two
 > bytes for each Unicode character.

I disagree with you here... store them as UTF-8.

 > - there's a flag that specifies whether the string is narrow or wide.

Yup.

 > - the ob_size field is the _physical_ length of the data; if the string is
 > wide, len(s) will return ob_size/2, all other string operations will have
 > to do similar things.

Is it possible to add a logical length field too? I presume it is too
expensive to recalculate the logical (character) length of a string
each time len(s) is called? Doing this is only slightly more time
consuming than a normal strlen: really just O(n) + c, where 'c' is the
constant time needed for table lookup (to get the number of bytes in
the UTF-8 sequence given the start character) and the pointer
manipulation (to add that length to your span pointer).
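
Something along these lines, sketched in Python for clarity (the real
thing would be C with a 256-entry lead-byte table):

    def utf8_length(s):
        # one pass: count every byte that is not a continuation byte
        # (continuation bytes look like 10xxxxxx)
        n = 0
        for ch in s:
            if (ord(ch) & 0xC0) != 0x80:
                n = n + 1
        return n

    utf8_length("abc")              # -> 3
    utf8_length("d\xc3\xa9but")     # -> 5, the UTF-8 form of u"d\xe9but"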

 > - there can possibly be an encoding attribute which may specify the used
 > encoding, if known.

So is this used to handle the case where you have a legacy encoding
(ShiftJIS, say) used in your existing strings, so you flag that 8-bit
("narrow" in a way) string as ShiftJIS?

If wide strings are always Unicode, why do you need the encoding?


 > Admittedly, this is tricky and involves quite a bit of effort to implement,
 > since all string methods need to have narrow/wide switch. To make it worse,
 > it hardly offers anything the current solution doesn't. However, it offers
 > one IMHO _big_ advantage: C code that just passes strings along does not
 > need to change: wide strings can be seen as narrow strings without any
 > loss. This allows for __str__() & str() and friends to work with unicode
 > strings without any change.

If you store wide strings as UCS2 then people using the C interface
lose: strlen() stops working, or will return incorrect
results. Indeed, any of the str*() routines in the C runtime will
break. This is the advantage of using UTF-8 here --- you can still use
strcpy and the like on the C side and have things work.

 > Any thoughts?

I'm doing essentially what you suggest in my Unicode enablement of MySQL.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From skip at mojam.com  Wed May  3 22:51:49 2000
From: skip at mojam.com (Skip Montanaro)
Date: Wed, 3 May 2000 15:51:49 -0500 (CDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14608.35257.729641.178724@cymru.basistech.com>
References: <l03102800b5362642bae3@[193.78.237.149]>
	<14608.35257.729641.178724@cymru.basistech.com>
Message-ID: <14608.37223.787291.236623@beluga.mojam.com>

    Tom> Is it possible to add a logical length field too? I presume it is
    Tom> too expensive to recalculate the logical (character) length of a
    Tom> string each time len(s) is called? Doing this is only slightly more
    Tom> time consuming than a normal strlen: ...

Note that currently the len() method doesn't call strlen() at all.  It just
returns the ob_size field.  Presumably, with Just's proposal len() would
simply return ob_size/width.  If you used a variable width encoding, Just's
plan wouldn't work.  (I don't know anything about string encodings - is
UTF-8 variable width?)




From guido at python.org  Wed May  3 23:22:59 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 17:22:59 -0400
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Your message of "Wed, 03 May 2000 20:55:24 BST."
             <l03102800b5362642bae3@[193.78.237.149]> 
References: <l03102800b5362642bae3@[193.78.237.149]> 
Message-ID: <200005032122.RAA05150@eric.cnri.reston.va.us>

> Today I had a relatively simple idea that unites wide strings and narrow
> strings in a way that is more backward comatible at the C level. It's quite
> possible this has already been considered and rejected for reasons that are
> not yet obvious to me, but I'll give it a shot anyway.
> 
> The main concept is not to provide a new string type but to extend the
> existing string object like so:
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.
> - there's a flag that specifies whether the string is narrow or wide.
> - the ob_size field is the _physical_ length of the data; if the string is
> wide, len(s) will return ob_size/2, all other string operations will have
> to do similar things.
> - there can possibly be an encoding attribute which may specify the used
> encoding, if known.
> 
> Admittedly, this is tricky and involves quite a bit of effort to implement,
> since all string methods need to have narrow/wide switch. To make it worse,
> it hardly offers anything the current solution doesn't. However, it offers
> one IMHO _big_ advantage: C code that just passes strings along does not
> need to change: wide strings can be seen as narrow strings without any
> loss. This allows for __str__() & str() and friends to work with unicode
> strings without any change.

This seems to have some nice properties, but I think it would cause
problems for existing C code that tries to *interpret* the bytes of a
string: it could very well do the wrong thing for wide strings (since
old C code doesn't check for the "wide" flag).  I'm not sure how much
C code there is that merely passes strings along...  Most C code using
strings makes use of the strings (e.g. open() falls in this category
in my eyes).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tree at cymru.basistech.com  Thu May  4 00:05:39 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Wed, 3 May 2000 18:05:39 -0400 (EDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14608.37223.787291.236623@beluga.mojam.com>
References: <l03102800b5362642bae3@[193.78.237.149]>
	<14608.35257.729641.178724@cymru.basistech.com>
	<14608.37223.787291.236623@beluga.mojam.com>
Message-ID: <14608.41651.781464.747522@cymru.basistech.com>

Skip Montanaro writes:
 > Note that currently the len() method doesn't call strlen() at all.  It just
 > returns the ob_size field.  Presumably, with Just's proposal len() would
 > simply return ob_size/width.  If you used a variable width encoding, Just's
 > plan wouldn't work.  (I don't know anything about string encodings - is
 > UTF-8 variable width?)

Yes, technically from 1 - 6 bytes per character, though in practice
for Unicode it's 1 - 3.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From guido at python.org  Thu May  4 02:52:39 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 20:52:39 -0400
Subject: [Python-Dev] weird bug in test_winreg
Message-ID: <200005040052.UAA07874@eric.cnri.reston.va.us>

I just noticed a weird traceback in test_winreg.  When I import
test.autotest on Windows, I get a "test failed" notice for
test_winreg.  When I run it by itself the test succeeds.  But when I
first import test.autotest and then import test.test_winreg (which
should rerun the latter, since test.regrtest unloads all test modules
after they have run), I get an AttributeError telling me that 'None'
object has no attribute 'get'.  This is in encodings.__init__.py in
the first call to _cache.get() in search_function.  Somehow this is
called by SetValueEx() in WriteTestData() in test/test_winreg.py.  But
inspection of the encodings module shows that _cache is {}, not None,
and the source shows no evidence of how this could have happened.

Any suggestions?

--Guido van Rossum (home page: http://www.python.org/~guido/)




From guido at python.org  Thu May  4 02:57:50 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 20:57:50 -0400
Subject: [Python-Dev] weird bug in test_winreg
In-Reply-To: Your message of "Wed, 03 May 2000 20:52:39 EDT."
             <200005040052.UAA07874@eric.cnri.reston.va.us> 
References: <200005040052.UAA07874@eric.cnri.reston.va.us> 
Message-ID: <200005040057.UAA07966@eric.cnri.reston.va.us>

> I just noticed a weird traceback in test_winreg.  When I import
> test.autotest on Windows, I get a "test failed" notice for
> test_winreg.  When I run it by itself the test succeeds.  But when I
> first import test.autotest and then import test.test_winreg (which
> should rerun the latter, since test.regrtest unloads all test modules
> after they have run), I get an AttributeError telling me that 'None'
> object has no attribute 'get'.  This is in encodings.__init__.py in
> the first call to _cache.get() in search_function.  Somehow this is
> called by SetValueEx() in WriteTestData() in test/test_winreg.py.  But
> inspection of the encodings module shows that _cache is {}, not None,
> and the source shows no evidence of how this could have happened.

I may have sounded confused: the problem is not caused by the
reload().  The test fails the first time around when run by
test.autotest.  My suspicion is that another test somehow overwrites
encodings._cache?

--Guido van Rossum (home page: http://www.python.org/~guido/)




From mhammond at skippinet.com.au  Thu May  4 03:20:24 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 4 May 2000 11:20:24 +1000
Subject: [Python-Dev] FW: weird bug in test_winreg
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au>

Oops - I didn't notice the CC - a copy of what I sent to Guido:

-----Original Message-----
From: Mark Hammond [mailto:mhammond at skippinet.com.au]
Sent: Thursday, 4 May 2000 11:13 AM
To: Guido van Rossum
Subject: RE: weird bug in test_winreg


Hah - I was just thinking about this myself.  If I wasn't waiting 24
hours, I would have beaten you to the test_fork1 patch :-)

However, there is something bad going on.  If you remove your test_fork1
patch, and run it from regrtest (_not_ stand alone) you will see the
children threads die with:

  File "L:\src\Python-cvs\Lib\test\test_fork1.py", line 30, in f
    alive[id] = os.getpid()
AttributeError: 'None' object has no attribute 'getpid'

Note the error - os is None!

[The reason it only happens as part of the test is that the children are
created before the main thread fails with the attribute error]

Similarly, I get spurious:

Traceback (most recent call last):
  File ".\test_thread.py", line 103, in task2
    mutex.release()
AttributeError: 'None' object has no attribute 'release'

(Only rarely, and never when run stand-alone - the test_fork1 exception
happens 100% of the time from the test suite)

And of course the test_winreg one.

test_winreg, I guessed, may be caused by the import lock (but it's certainly
not obvious how or why!?).  However, that doesn't explain the others.

I also saw these _before_ I applied the threading patches (and after!)

So I think the problem may be a little deeper?

Mark.




From just at letterror.com  Thu May  4 09:42:00 2000
From: just at letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 08:42:00 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <200005032122.RAA05150@eric.cnri.reston.va.us>
References: Your message of "Wed, 03 May 2000 20:55:24 BST."            
 <l03102800b5362642bae3@[193.78.237.149]>
 <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <l03102800b536d1d8c0bc@[193.78.237.161]>

(Thanks for all the comments. I'll condense my replies into one post.)

[JvR]
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.

[Tom Emerson wrote]
>I disagree with you here... store them as UTF-8.

Erm, utf-8 in a wide string? This makes no sense...

[Skip Montanaro]
>Presumably, with Just's proposal len() would
>simply return ob_size/width.

Right. And if you allowed values for width other than 1 and 2, it would open
the way for UCS-4. Wouldn't that be nice? It's hardly more effort, and
"only" width==1 needs to be special-cased for speed.

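A toy sketch of that ob_size/width bookkeeping (names and layout are mine,
not anything from an actual patch):

    class WideStr:
        def __init__(self, data, width):
            # raw byte buffer plus an element width of 1, 2 or 4
            assert len(data) % width == 0
            self.data = data
            self.width = width
        def __len__(self):
            return len(self.data) / self.width   # i.e. ob_size / width
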
>If you used a variable width encoding, Just's plan wouldn't work.

Correct, but neither does the current unicode object. Variable width encodings
are too messy to see as strings at all: they are only useful as byte arrays.

[GvR]
>This seems to have some nice properties, but I think it would cause
>problems for existing C code that tries to *interpret* the bytes of a
>string: it could very well do the wrong thing for wide strings (since
>old C code doesn't check for the "wide" flag).  I'm not sure how much
>C code there is that merely passes strings along...  Most C code using
>strings makes use of the strings (e.g. open() falls in this category
>in my eyes).

There are probably many cases that fall into this category. But then again,
these cases, especially those that can potentially deal with encodings
other than ascii, are not much helped by a default encoding, as /F
showed.

My idea arose after yesterday's discussions. Some quotes, plus comments:

[GvR]
>However the problem is that print *always* first converts the object
>using str(), and str() enforces that the result is an 8-bit string.
>I'm afraid that loosening this will break too much code.  (This all
>really happens at the C level.)

Guido goes on to explain that this means utf-8 is the only sensible default
in this case. Good reasoning, but I think it's backwards:
- str(unicodestring) should just return unicodestring
- it is important that stdout receives the original unicode object.

[MAL]
>BTW, __str__() has to return strings too. Perhaps we
>need __unicode__() and a corresponding slot function too ?!

This also seems backwards. If it's really too hard to change Python so that
__str__ can return unicode objects, my solution may help.

[Ka-Ping Yee]
>Here is an addendum that might actually make that proposal
>feasible enough (compatibility-wise) to fly in the short term:
>
>    print x
>
>does, conceptually:
>
>    try:
>        sys.stdout.printout(x)
>    except AttributeError:
>        sys.stdout.write(str(x))
>        sys.stdout.write("\n")

That stuff like this is even being *proposed* (not that it's not smart or
anything...) means there's a terrible bottleneck somewhere which needs
fixing. My proposal seems to do that nicely.

Of course, there's no such thing as a free lunch, and I'm sure there are
other corners that'll need fixing, but it appears having to write

    if (!PyString_Check(doc) && !PyUnicode_Check(doc))
        ...

in all places that may accept unicode strings is no fun either.

Yes, some code will break if you throw a wide string at it, but I think
that code is easier repaired with my proposal than with the current
implementation.

It's a big advantage to have only one string type; it makes many problems
we've been discussing easier to talk about.

Just





From effbot at telia.com  Thu May  4 09:46:05 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Thu, 4 May 2000 09:46:05 +0200
Subject: [Python-Dev] Unicode debate
References: <Pine.LNX.4.10.10005030556250.522-100000@localhost>
Message-ID: <002d01bfb59c$cf482280$34aab5d4@hagrid>

Ka-Ping Yee <ping at lfw.org> wrote:
> I know this is not a small change, but i'm pretty convinced the
> right answer here is that the print hook should call a *method*
> on sys.stdout, whatever sys.stdout happens to be.  The details
> are described in the other long message i wrote ("Printing objects
> on files").
> 
> Here is an addendum that might actually make that proposal
> feasible enough (compatibility-wise) to fly in the short term:
> 
>     print x
> 
> does, conceptually:
> 
>     try:
>         sys.stdout.printout(x)
>     except AttributeError:
>         sys.stdout.write(str(x))
>         sys.stdout.write("\n")
> 
> The rest can then be added, and the change in 'print x' will
> work nicely for any file objects, but will not break on file-like
> substitutes that don't define a 'printout' method.

another approach is (simplified):

    try:
        sys.stdout.write(x.encode(sys.stdout.encoding))
    except AttributeError:
        sys.stdout.write(str(x))

or, if str is changed to return any kind of string:

    x = str(x)
    try:
        x = x.encode(sys.stdout.encoding)
    except AttributeError:
        pass
    sys.stdout.write(x)

</F>




From ht at cogsci.ed.ac.uk  Thu May  4 10:51:39 2000
From: ht at cogsci.ed.ac.uk (Henry S. Thompson)
Date: 04 May 2000 09:51:39 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Wed, 03 May 2000 08:16:56 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido at python.org> writes:

<snip/>

> My ASCII proposal is a compromise that tries to be fair to both uses
> for strings.  Introducing byte arrays as a more fundamental type has
> been on the wish list for a long time -- I see no way to introduce
> this into Python 1.6 without totally botching the release schedule
> (June 1st is very close already!).  I'd like to be able to move on,
> there are other important things still to be added to 1.6 (Vladimir's
> malloc patches, Neil's GC, Fredrik's completed sre...).
> 
> For 1.7 (which should happen later this year) I promise I'll reopen
> the discussion on byte arrays.

I think I hear a moderate consensus developing that the 'ASCII
proposal' is a reasonable compromise given the time constraints.  But
let's not fail to come back to this ASAP -- it _really_ narcs me that
every time I load XML into my Python-based editor I'm going to convert
large amounts of wide-string data into UTF-8 just so Tk can convert it
back to wide-strings in order to display it!

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht at cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/



From just at letterror.com  Thu May  4 13:27:45 2000
From: just at letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 12:27:45 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102800b536d1d8c0bc@[193.78.237.161]>
References: <200005032122.RAA05150@eric.cnri.reston.va.us> Your message of
 "Wed, 03 May 2000 20:55:24 BST."            
 <l03102800b5362642bae3@[193.78.237.149]>
 <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <l03102809b53709fef820@[193.78.237.126]>

I wrote:
>It's a big advantage to have only one string type; it makes many problems
>we've been discussing easier to talk about.

I think I should've been more explicit about what I meant here. I'll try to
phrase it as an addendum to my proposal -- which suddenly is no longer just
a narrow/wide string unification but narrow/wide/ultrawide, to really be
ready for the future...

As someone else suggested in the discussion, I think it's good if we
separate the encoding from the data type. Meaning that wide strings are no
longer tied to Unicode. This allows for double-byte encodings other than
UCS-2 as well as for safe passing-through of binary goop, but that's not
the main point. The main point is that this will make the behavior of
(wide) strings more understandable and consistent.

The extended string type is simply a sequence of code points, allowing for
0-0xFF for narrow strings, 0-0xFFFF for wide strings, and 0-0xFFFFFFFF for
ultra-wide strings. Upcasting is always safe, downcasting may raise
OverflowError. Depending on the encoding used, this comes as close as
possible to the sequence-of-characters model.
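
A sketch of that casting rule (purely illustrative; none of these names come
from real code, and only the check is shown, not the representation change):

    WIDTH_LIMIT = {1: 0xFF, 2: 0xFFFF, 4: 0xFFFFFFFFL}

    def check_downcast(codepoints, width):
        # Widening always succeeds; narrowing must check every code point.
        limit = WIDTH_LIMIT[width]
        for cp in codepoints:
            if cp > limit:
                raise OverflowError("code point 0x%x does not fit in width %d"
                                    % (cp, width))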

The default character set should of course be Unicode -- and it should be
obvious that this implies Latin-1 for narrow strings.

(Additionally: an encoding attribute suddenly makes a whole lot of sense
again.)

Ok, y'all can shoot me now ;-)

Just





From guido at python.org  Thu May  4 14:40:35 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 04 May 2000 08:40:35 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "04 May 2000 09:51:39 BST."
             <f5br9bi1yw4.fsf@cogsci.ed.ac.uk> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us>  
            <f5br9bi1yw4.fsf@cogsci.ed.ac.uk> 
Message-ID: <200005041240.IAA08277@eric.cnri.reston.va.us>

> I think I hear a moderate consensus developing that the 'ASCII
> proposal' is a reasonable compromise given the time constraints.  But
> let's not fail to come back to this ASAP -- it _really_ narcs me that
> every time I load XML into my Python-based editor I'm going to convert
> large amounts of wide-string data into UTF-8 just so Tk can convert it
> back to wide-strings in order to display it!

Thanks -- but that's really Tcl's fault, since the only way to get
character data *into* Tcl (or out of it) is through the UTF-8
encoding.

And is your XML really stored on disk in its 16-bit format?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fredrik at pythonware.com  Thu May  4 15:21:25 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 4 May 2000 15:21:25 +0200
Subject: [Python-Dev] Re: Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us>             <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>  <200005041240.IAA08277@eric.cnri.reston.va.us>
Message-ID: <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com>

Guido van Rossum <guido at python.org> wrote:
> Thanks -- but that's really Tcl's fault, since the only way to get
> character data *into* Tcl (or out of it) is through the UTF-8
> encoding.

from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm

    Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars)

    Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new
    object or modify an existing object to hold a copy of the
    Unicode string given by unicode and numChars.

    (Tcl_UniChar* is currently the same thing as Py_UNICODE*)

</F>




From guido at python.org  Thu May  4 19:03:58 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 04 May 2000 13:03:58 -0400
Subject: [Python-Dev] FW: weird bug in test_winreg
In-Reply-To: Your message of "Thu, 04 May 2000 11:20:24 +1000."
             <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au> 
Message-ID: <200005041703.NAA13471@eric.cnri.reston.va.us>

Mark Hammond:

> However, there is something bad going on.  If you remove your test_fork1
> patch, and run it from regrtest (_not_ stand alone) you will see the
> children threads die with:
> 
>   File "L:\src\Python-cvs\Lib\test\test_fork1.py", line 30, in f
>     alive[id] = os.getpid()
> AttributeError: 'None' object has no attribute 'getpid'
> 
> Note the error - os is None!
> 
> [The reason is only happens as part of the test is because the children are
> created before the main thread fails with the attribute error]

I don't get this one -- maybe my machine is too slow.  (130 MHz
Pentium.)

> Similarly, I get spurious:
> 
> Traceback (most recent call last):
>   File ".\test_thread.py", line 103, in task2
>     mutex.release()
> AttributeError: 'None' object has no attribute 'release'
> 
> (Only rarely, and never when run stand-alone - the test_fork1 exception
> happens 100% of the time from the test suite)
> 
> And of course the test_winreg one.
> 
> test_winreg, I guessed, may be caused by the import lock (but its certainly
> not obvious how or why!?).  However, that doesnt explain the others.
> 
> I also saw these _before_ I applied the threading patches (and after!)
> 
> So I think the problem may be a little deeper?

It's Vladimir's patch which, after each test, unloads all modules
that were loaded by that test.  If I change this to only unload
modules whose name starts with "test.", the test_winreg problem goes
away, and I bet yours go away too.

The real reason must be deeper -- there's also the import lock and the
fact that if a submodule of package "test" tries to import "os", a
search for "test.os" is made and, if it doesn't find it, it sticks None
in sys.modules['test.os'].

But I don't have time to research this further.

I'm tempted to apply the following change to regrtest.py.  This should
still unload the test modules (so you can rerun an individual test)
but it doesn't touch other modules.  I'll wait 24 hours. :-)

*** regrtest.py	2000/04/21 21:35:06	1.15
--- regrtest.py	2000/05/04 16:56:26
***************
*** 121,127 ****
              skipped.append(test)
          # Unload the newly imported modules (best effort finalization)
          for module in sys.modules.keys():
!             if module not in save_modules:
                  test_support.unload(module)
      if good and not quiet:
          if not bad and not skipped and len(good) > 1:
--- 121,127 ----
              skipped.append(test)
          # Unload the newly imported modules (best effort finalization)
          for module in sys.modules.keys():
!             if module not in save_modules and module.startswith("test."):
                  test_support.unload(module)
      if good and not quiet:
          if not bad and not skipped and len(good) > 1:

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gvwilson at nevex.com  Thu May  4 21:03:54 2000
From: gvwilson at nevex.com (gvwilson at nevex.com)
Date: Thu, 4 May 2000 15:03:54 -0400 (EDT)
Subject: [Python-Dev] Minimal (single-file) Python?
Message-ID: <Pine.LNX.4.10.10005041448010.22917-100000@akbar.nevex.com>

Hi.  Has anyone ever built, or thought about building, a single-file
Python, in which all the "basic" capabilities are included in a single
executable (where "basic" means "can do as much as the Bourne shell")?
Some of the entries in the Software Carpentry competition would like to be
able to bootstrap from as small a starting point as possible.

Thanks,
Greg

p.s. I don't think this is the same problem as moving built-in features of
Python into optionally-loaded libraries, as some of the things in the
'sys', 'string', and 'os' modules would have to move in the other
direction to ensure Bourne shell equivalence.





From just at letterror.com  Thu May  4 23:22:38 2000
From: just at letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 22:22:38 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
Message-ID: <l03102810b5378dda02f5@[193.78.237.126]>

(Boy, is it quiet here all of a sudden ;-)

Sorry for the duplication of stuff, but I'd like to reiterate my points, to
separate them from my implementation proposal, as that's just what it is:
an implementation detail.

These things are important to me:
- get rid of the Unicode-ness of wide strings, in order to
- make narrow and wide strings as similar as possible
- implicit conversion between narrow and wide strings should
  happen purely on the basis of the character codes; no
  assumption at all should be made about the encoding, ie.
  what the character code _means_.
- downcasting from wide to narrow may raise OverflowError if
  there are characters in the wide string that are > 255
- str(s) should always return s if s is a string, whether narrow
  or wide
- file objects need to be responsible for handling wide strings
- the above two points should make it possible for
- if no encoding is known, Unicode is the default, whether
  narrow or wide

The above points seem to have the following consequences:
- the 'u' in \uXXXX notation no longer makes much sense,
  since it is not necessary for the character to be a Unicode
  code point: it's just a 2-byte int. \wXXXX might be an option.
- the u"" notation is no longer necessary: if a string literal
  contains a character > 255 the string should automatically
  become a wide string.
- narrow strings should also have an encode() method.
- the builtin unicode() function might be redundant if:
  - it is possible to specify a source encoding. I'm not sure if
    this is best done through an extra argument for encode()
    or that it should be a new method, eg. transcode().
  - s.encode() or s.transcode() are allowed to output a wide
    string, as in aNarrowString.encode("UCS-2") and
    s.transcode("Mac-Roman", "UCS-2").

My proposal to extend the "old" string type to be able to contain wide
strings is of course largely unrelated to all this. Yet it may provide some
additional C compatibility (especially now that silent conversion to utf-8
is out) as well as a workaround for the
str()-having-to-return-a-narrow-string bottleneck.

Just





From skip at mojam.com  Thu May  4 22:43:42 2000
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 4 May 2000 15:43:42 -0500 (CDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102810b5378dda02f5@[193.78.237.126]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <14609.57598.738381.250872@beluga.mojam.com>

    Just> Sorry for the duplication of stuff, but I'd like to reiterate my
    Just> points, to separate them from my implementation proposal, as
    Just> that's just what it is: an implementation detail.

    Just> These things are important to me:
    ...

For the encoding-challenged like me, does it make sense to explicitly state
that you can't mix character widths within a single string, or is that just
so obvious that I deserve a head slap just for mentioning it?

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From effbot at telia.com  Thu May  4 23:02:35 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Thu, 4 May 2000 23:02:35 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]><l03102800b52d80db1290@[193.78.237.154]><200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><f5bog6o54zj.fsf@cogsci.ed.ac.uk><200005031216.IAA03274@eric.cnri.reston.va.us> <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
Message-ID: <007701bfb60c$1543f060$34aab5d4@hagrid>

Henry S. Thompson <ht at cogsci.ed.ac.uk> wrote:
> I think I hear a moderate consensus developing that the 'ASCII
> proposal' is a reasonable compromise given the time constraints.

agreed.

(but even if we settle for "7-bit unicode" in 1.6, there are still a
few issues left to sort out before 1.6 final.  but it might be best
to get back to that after we've added SRE and GC to 1.6a3. we
might all need a short break...)

> But let's not fail to come back to this ASAP

first week in june, promise ;-)

</F>




From mhammond at skippinet.com.au  Fri May  5 01:55:15 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 5 May 2000 09:55:15 +1000
Subject: [Python-Dev] FW: weird bug in test_winreg
In-Reply-To: <200005041703.NAA13471@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBAEEBCKAA.mhammond@skippinet.com.au>

> It's Vladimir's patch which, after each tests, unloads all modules
> that were loaded by that test.  If I change this to only unload
> modules whose name starts with "test.", the test_winreg problem goes
> away, and I bet yours go away too.

They do indeed!

> The real reason must be deeper -- there's also the import lock and the
> fact that if a submodule of package "test" tries to import "os", a
> search for "test.os" is made and if it doesn't find it it sticks None
> in sys.modules['test.os'].
>
> but I don't have time to research this further.

I started to think about this.  The issue is simply that code which
blithely wipes sys.modules[] may cause unexpected results.  While the end
result is a bug, the symptoms are caused by extreme hackiness.
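
For what it's worth, the failure mode is easy to reproduce in isolation
(sketch only -- the module name is made up, and the exact message depends
on the interpreter version):

    import sys, new

    mod = new.module("victim")
    exec "import os\ndef f():\n    return os.getpid()\n" in mod.__dict__
    sys.modules["victim"] = mod

    f = mod.f                    # keeps the module's globals dict alive
    del sys.modules["victim"]    # "unload" it, regrtest-style
    del mod                      # last reference gone -> namespace is cleared

    f()   # likely: AttributeError: 'None' object has no attribute 'getpid'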

Seeing as my time is also limited, I say we forget it!

> I'm tempted to apply the following change to regrtest.py.  This should
> still unload the test modules (so you can rerun an individual test)
> but it doesn't touch other modules.  I'll wait 24 hours. :-)

The 24 hour time limit is only supposed to apply to _my_ patches - you can
check yours straight in (and if anyone asks, just tell them I said it was
OK) :-)

Mark.




From ht at cogsci.ed.ac.uk  Fri May  5 10:19:07 2000
From: ht at cogsci.ed.ac.uk (Henry S. Thompson)
Date: 05 May 2000 09:19:07 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Thu, 04 May 2000 08:40:35 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
	<f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
	<200005041240.IAA08277@eric.cnri.reston.va.us>
Message-ID: <f5bya5pxvd0.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido at python.org> writes:

> > I think I hear a moderate consensus developing that the 'ASCII
> > proposal' is a reasonable compromise given the time constraints.  But
> > let's not fail to come back to this ASAP -- it _really_ narcs me that
> > every time I load XML into my Python-based editor I'm going to convert
> > large amounts of wide-string data into UTF-8 just so Tk can convert it
> > back to wide-strings in order to display it!
> 
> Thanks -- but that's really Tcl's fault, since the only way to get
> character data *into* Tcl (or out of it) is through the UTF-8
> encoding.
> 
> And is your XML really stored on disk in its 16-bit format?

No, I have no idea what encoding it's in; my XML parser supports over
a dozen encodings, and quite sensibly always delivers the content, as
per the XML REC, as wide-strings.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht at cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/



From ht at cogsci.ed.ac.uk  Fri May  5 10:21:41 2000
From: ht at cogsci.ed.ac.uk (Henry S. Thompson)
Date: 05 May 2000 09:21:41 +0100
Subject: [Python-Dev] Re: [XML-SIG] Re: Unicode debate
In-Reply-To: "Fredrik Lundh"'s message of "Thu, 4 May 2000 15:21:25 +0200"
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
	<f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
	<200005041240.IAA08277@eric.cnri.reston.va.us>
	<00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com>
Message-ID: <f5bu2gdxv8q.fsf@cogsci.ed.ac.uk>

"Fredrik Lundh" <fredrik at pythonware.com> writes:

> Guido van Rossum <guido at python.org> wrote:
> > Thanks -- but that's really Tcl's fault, since the only way to get
> > character data *into* Tcl (or out of it) is through the UTF-8
> > encoding.
> 
> from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm
> 
>     Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars)
> 
>     Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new
>     object or modify an existing object to hold a copy of the
>     Unicode string given by unicode and numChars.
> 
>     (Tcl_UniChar* is currently the same thing as Py_UNICODE*)
> 

Any way this can be exploited in Tkinter?

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht at cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/



From just at letterror.com  Fri May  5 11:25:37 2000
From: just at letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 10:25:37 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <007701bfb60c$1543f060$34aab5d4@hagrid>
References:  <l03102805b52ca7830b18@[193.78.237.154]><l03102800b52d80db1290@[193.78.237
 .154]><200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@pres
 cod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@p
 rescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D1233
 7@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91
 599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><f5bog6o54z
 j.fsf@cogsci.ed.ac.uk><200005031216.IAA03274@eric.cnri.reston.va.us>
 <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
Message-ID: <l03102802b5383fd7c128@[193.78.237.126]>

At 11:02 PM +0200 04-05-2000, Fredrik Lundh wrote:
>Henry S. Thompson <ht at cogsci.ed.ac.uk> wrote:
>> I think I hear a moderate consensus developing that the 'ASCII
>> proposal' is a reasonable compromise given the time constraints.
>
>agreed.

This makes no sense: implementing the 7-bit proposal takes more or less
the same time as implementing 8-bit downcasting. Or is it just the
bickering that's too time consuming? ;-)

I worry that if the current implementation goes into 1.6 more or less as it
is now there's no way we can ever go back (before P3K). Or will Unicode
support be marked "experimental" in 1.6? This is not so much about the
7-bit/8-bit proposal but about the dubious unicode() and unichr() functions
and the u"" notation:

- unicode() only takes strings, so is effectively a method of the string type.
- if narrow and wide strings are meant to be as similar as possible,
chr(256) should just return a wide char
- similarly, why is the u"" notation at all needed?

The current design is more complex than needed, and still offers plenty of
surprises. Making it simpler (without integrating the two string types) is
not a huge effort. Seeing the wide string type as independent of Unicode
takes no physical effort at all, as it's just in our heads.

Fixing str() so it can return wide strings might be harder, and can wait
until later. Would be too bad, though.

Just





From ping at lfw.org  Fri May  5 11:21:20 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Fri, 5 May 2000 02:21:20 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <002d01bfb59c$cf482280$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org>

On Thu, 4 May 2000, Fredrik Lundh wrote:
> 
> another approach is (simplified):
> 
>     try:
>         sys.stdout.write(x.encode(sys.stdout.encoding))
>     except AttributeError:
>         sys.stdout.write(str(x))

Indeed, that would work to solve just this specific Unicode
issue -- but there is a lot of flexibility and power to be
gained from the general solution of putting a method on the
stream object, as the example with the formatted list items
showed.  I think it is a good idea, for instance, to leave
decisions about how to print Unicode up to the Unicode object,
and not hardcode bits of it into print.

Guido, have you digested my earlier 'printout' suggestions?


-- ?!ng

"Old code doesn't die -- it just smells that way."
    -- Bill Frantz




From mbel44 at dial.pipex.net  Fri May  5 11:07:46 2000
From: mbel44 at dial.pipex.net (Toby Dickenson)
Date: Fri, 05 May 2000 10:07:46 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102810b5378dda02f5@[193.78.237.126]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <me25hs0diag8d0b6bu5gqjpchdq5q3aig5@4ax.com>

On Thu, 4 May 2000 22:22:38 +0100, Just van Rossum
<just at letterror.com> wrote:

>(Boy, is it quiet here all of a sudden ;-)
>
>Sorry for the duplication of stuff, but I'd like to reiterate my points, to
>separate them from my implementation proposal, as that's just what it is:
>an implementation detail.
>
>These things are important to me:
>- get rid of the Unicode-ness of wide strings, in order to
>- make narrow and wide strings as similar as possible
>- implicit conversion between narrow and wide strings should
>  happen purely on the basis of the character codes; no
>  assumption at all should be made about the encoding, ie.
>  what the character code _means_.
>- downcasting from wide to narrow may raise OverflowError if
>  there are characters in the wide string that are > 255
>- str(s) should always return s if s is a string, whether narrow
>  or wide
>- file objects need to be responsible for handling wide strings
>- the above two points should make it possible for
>- if no encoding is known, Unicode is the default, whether
>  narrow or wide
>
>The above points seem to have the following consequences:
>- the 'u' in \uXXXX notation no longer makes much sense,
>  since it is not necessary for the character to be a Unicode
>  code point: it's just a 2-byte int. \wXXXX might be an option.
>- the u"" notation is no longer necessary: if a string literal
>  contains a character > 255 the string should automatically
>  become a wide string.
>- narrow strings should also have an encode() method.
>- the builtin unicode() function might be redundant if:
>  - it is possible to specify a source encoding. I'm not sure if
>    this is best done through an extra argument for encode()
>    or that it should be a new method, eg. transcode().

>  - s.encode() or s.transcode() are allowed to output a wide
>    string, as in aNarrowString.encode("UCS-2") and
>    s.transcode("Mac-Roman", "UCS-2").

One other pleasant consequence:

- String comparisons work character by character, even if the
  representations of those characters have different widths.
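
  For instance (illustrative only, and assuming the coercion rule sketched
  earlier in the thread):

      print "abc" == u"abc"    # same character codes, different widths: 1
      print "abc" == u"abd"    # 0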

>My proposal to extend the "old" string type to be able to contain wide
>strings is of course largely unrelated to all this. Yet it may provide some
>additional C compatibility (especially now that silent conversion to utf-8
>is out) as well as a workaround for the
>str()-having-to-return-a-narrow-string bottleneck.


Toby Dickenson
tdickenson at geminidataloggers.com



From just at letterror.com  Fri May  5 13:40:49 2000
From: just at letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 12:40:49 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <me25hs0diag8d0b6bu5gqjpchdq5q3aig5@4ax.com>
References: <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <l03102805b5385e3de8e8@[193.78.237.127]>

At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote:
>One other pleasant consequence:
>
>- String comparisons work character-by character, even if the
>  representation of those characters have different widths.

Exactly. By saying "(wide) strings are not tied to Unicode" the question
whether wide strings should or should not be sorted according to the
Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
too hard anyway"...

Just





From tree at cymru.basistech.com  Fri May  5 13:46:41 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Fri, 5 May 2000 07:46:41 -0400 (EDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102805b5385e3de8e8@[193.78.237.127]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
	<l03102805b5385e3de8e8@[193.78.237.127]>
Message-ID: <14610.46241.129977.642796@cymru.basistech.com>

Just van Rossum writes:
 > At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote:
 > >One other pleasant consequence:
 > >
 > >- String comparisons work character-by character, even if the
 > >  representation of those characters have different widths.
 > 
 > Exactly. By saying "(wide) strings are not tied to Unicode" the question
 > whether wide strings should or should not be sorted according to the
 > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
 > too hard anyway"...

Wait a second.

There is nothing about Unicode that would prevent you from defining
string equality as byte-level equality.

This strikes me as the wrong way to deal with the complex collation
issues of Unicode.

It seems to me that by default wide-strings compare at the byte-level
(i.e., '=' is a byte level comparison). If you want a normalized
comparison, then you make an explicit function call for that.

This is no different from comparing strings in a case sensitive
vs. case insensitive manner.
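
To spell out that analogy with a throwaway example:

    print "Guido" == "guido"                   # default comparison: 0
    print "Guido".lower() == "guido".lower()   # explicit, looser match: 1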

       -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From just at letterror.com  Fri May  5 15:17:31 2000
From: just at letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 14:17:31 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14610.46241.129977.642796@cymru.basistech.com>
References: <l03102805b5385e3de8e8@[193.78.237.127]>
 <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102805b5385e3de8e8@[193.78.237.127]>
Message-ID: <l03102808b53877a3e392@[193.78.237.127]>

[Me]
> Exactly. By saying "(wide) strings are not tied to Unicode" the question
> whether wide strings should or should not be sorted according to the
> Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
> too hard anyway"...

[Tom Emerson]
>Wait a second.
>
>There is nothing about Unicode that would prevent you from defining
>string equality as byte-level equality.

Agreed.

>This strikes me as the wrong way to deal with the complex collation
>issues of Unicode.

All I was trying to say, was that by looking at it this way, it is even
more obvious that the builtin comparison should not deal with Unicode
sorting & collation issues. It seems you're saying the exact same thing:

>It seems to me that by default wide-strings compare at the byte-level
>(i.e., '=' is a byte level comparison). If you want a normalized
>comparison, then you make an explicit function call for that.

Exactly.

>This is no different from comparing strings in a case sensitive
>vs. case insensitive manner.

Good point. All this taken together still means to me that comparisons
between wide and narrow strings should take place at the character level,
which implies that coercion from narrow to wide is done at the character
level, without looking at the encoding. (Which in my book in turn still
implies that as long as we're talking about Unicode, narrow strings are
effectively Latin-1.)

Just





From tree at cymru.basistech.com  Fri May  5 14:34:35 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Fri, 5 May 2000 08:34:35 -0400 (EDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102808b53877a3e392@[193.78.237.127]>
References: <l03102805b5385e3de8e8@[193.78.237.127]>
	<l03102810b5378dda02f5@[193.78.237.126]>
	<l03102808b53877a3e392@[193.78.237.127]>
Message-ID: <14610.49115.820599.172598@cymru.basistech.com>

Just van Rossum writes:
 > Good point. All this taken together still means to me that comparisons
 > between wide and narrow strings should take place at the character level,
 > which implies that coercion from narrow to wide is done at the character
 > level, without looking at the encoding. (Which in my book in turn still
 > implies that as long as we're talking about Unicode, narrow strings are
 > effectively Latin-1.)

Only true if "wide" strings are encoded in UCS-2 or UCS-4. If "wide
characters" are Unicode, but stored in UTF-8 encoding, then you lose.

Hmmmm... how often do you expect to compare narrow vs. wide strings,
using default comparison (i.e. = or !=)? What if I'm using Latin 3 and
use the byte comparison? I may very well have two strings (one narrow,
one wide) that compare equal, even though they're not. Not exactly
what I would expect.

     -tree

[I'm flying from Seattle to Boston today, so eventually I will
 disappear for a while]

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From pf at artcom-gmbh.de  Fri May  5 15:13:05 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 5 May 2000 15:13:05 +0200 (MEST)
Subject: [Python-Dev] wide strings vs. Unicode point of view (was Re: [I18n-sig] Unicode st.... alternative)
In-Reply-To: <l03102805b5385e3de8e8@[193.78.237.127]> from Just van Rossum at "May 5, 2000 12:40:49 pm"
Message-ID: <m12nhuj-000CnCC@artcom0.artcom-gmbh.de>

Just van Rossum:
> Exactly. By saying "(wide) strings are not tied to Unicode" the question
> whether wide strings should or should not be sorted according to the
> Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
> too hard anyway"...

I personally like the idea of speaking of "wide strings" containing wide
character codes instead of Unicode objects.

Unfortunately there are many methods which need to interpret the
content of strings according to some encoding knowledge: for example
'upper()', 'lower()', 'swapcase()', 'lstrip()' and so on need to know
to which class certain characters belong.

This problem was already somewhat visible in 1.5.2, since these methods
were available as library functions from the string module and they did
work with a global state maintained by the 'setlocale()' C-library function.
Quoting from the C library man pages:

"""    The details of what constitutes an uppercase or  lowercase
       letter  depend  on  the  current locale.  For example, the
       default "C" locale does not know about umlauts, so no
       conversion is done for them.

       In some non - English locales, there are lowercase letters
       with no corresponding  uppercase  equivalent;  the  German
       sharp s is one example.
"""

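For example (a sketch; the locale name is an assumption and may not be
installed on a given system, and the exact behaviour depends on the C
library):

    import string, locale
    s = "\xe4"                          # Latin-1 a-umlaut as an 8-bit string
    print string.upper(s) == s          # "C" locale: no conversion, expect 1
    locale.setlocale(locale.LC_CTYPE, "de_DE.ISO8859-1")
    print string.upper(s) == "\xc4"     # German locale: now uppercased, expect 1
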
I guess applying 'upper' to a Chinese char will not make much sense.

Now these former string module functions were moved into the Python
object core.  So the current Python string and Unicode object API is
somewhat "western centric".  ;-) At least Marc's implementation in
'unicodectype.c' contains the hard-coded assumption that wide strings
really contain Unicode characters.
print u"???".upper().encode("latin1")
shows "???" independently of the locale setting.  This makes sense.
The output from  print u"???".upper().encode()  however looks ugly
here on my screen... UTF-8 ... blech:? ??

Regards and have a nice weekend, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From guido at python.org  Fri May  5 16:49:52 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 05 May 2000 10:49:52 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Fri, 05 May 2000 02:21:20 PDT."
             <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org> 
References: <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org> 
Message-ID: <200005051449.KAA14138@eric.cnri.reston.va.us>

> Guido, have you digested my earlier 'printout' suggestions?

Not quite, except to the point that they require more thought than to
rush them into 1.6.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Fri May  5 16:54:16 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 05 May 2000 10:54:16 -0400
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Your message of "Thu, 04 May 2000 22:22:38 BST."
             <l03102810b5378dda02f5@[193.78.237.126]> 
References: <l03102810b5378dda02f5@[193.78.237.126]> 
Message-ID: <200005051454.KAA14168@eric.cnri.reston.va.us>

> (Boy, is it quiet here all of a sudden ;-)

Maybe because (according to one report on NPR here) 80% of the world's
email systems are victimized by the ILOVEYOU virus?  You & I are not
affected because it's Windows specific (a visual basic script, I got a
copy mailed to me so I could have a good look :-).  Note that there
are already mutations, one of which pretends to be a joke.

> Sorry for the duplication of stuff, but I'd like to reiterate my points, to
> separate them from my implementation proposal, as that's just what it is:
> an implementation detail.
> 
> These things are important to me:
> - get rid of the Unicode-ness of wide strings, in order to
> - make narrow and wide strings as similar as possible
> - implicit conversion between narrow and wide strings should
>   happen purely on the basis of the character codes; no
>   assumption at all should be made about the encoding, ie.
>   what the character code _means_.
> - downcasting from wide to narrow may raise OverflowError if
>   there are characters in the wide string that are > 255
> - str(s) should always return s if s is a string, whether narrow
>   or wide
> - file objects need to be responsible for handling wide strings
> - the above two points should make it possible for
> - if no encoding is known, Unicode is the default, whether
>   narrow or wide
> 
> The above points seem to have the following consequences:
> - the 'u' in \uXXXX notation no longer makes much sense,
>   since it is not necessary for the character to be a Unicode
>   code point: it's just a 2-byte int. \wXXXX might be an option.
> - the u"" notation is no longer necessary: if a string literal
>   contains a character > 255 the string should automatically
>   become a wide string.
> - narrow strings should also have an encode() method.
> - the builtin unicode() function might be redundant if:
>   - it is possible to specify a source encoding. I'm not sure if
>     this is best done through an extra argument for encode()
>     or that it should be a new method, eg. transcode().
>   - s.encode() or s.transcode() are allowed to output a wide
>     string, as in aNarrowString.encode("UCS-2") and
>     s.transcode("Mac-Roman", "UCS-2").
> 
> My proposal to extend the "old" string type to be able to contain wide
> strings is of course largely unrelated to all this. Yet it may provide some
> additional C compatibility (especially now that silent conversion to utf-8
> is out) as well as a workaround for the
> str()-having-to-return-a-narrow-string bottleneck.

I'm not so sure that this is enough.  You seem to propose wide strings
as vehicles for 16-bit values (and maybe later 32-bit values) apart
from their encoding.  We already have a data type for that (the array
module).  The Unicode type does a lot more than storing 16-bit values:
it knows lots of encodings to and from Unicode, and it knows things
like which characters are upper or lower or title case and how to map
between them, which characters are word characters, and so on.  All
this is highly Unicode specific and is part of what people ask for
when they request Unicode support.  (Example: Unicode has
405 characters classified as numeric, according to the isnumeric()
method.)
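
That figure is easy enough to check; a rough sketch in Python 2-era spelling
(the exact count depends on the Unicode database version):

    count = 0
    for i in range(0x10000):            # BMP only
        if unichr(i).isnumeric():
            count = count + 1
    print count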

And by the way, don't worry about the comparison.  I'm not changing
the default comparison (==, cmp()) for Unicode strings to be anything
other than per 16-bit quantity.  However, a Unicode object might in
addition have a method to do normalization or whatever, as long as it's
language independent and strictly defined by the Unicode standard.
Language-specific operations belong in separate modules.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Fri May  5 17:07:48 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 05 May 2000 11:07:48 -0400
Subject: [Python-Dev] Moving Unicode debate to i18n-sig@python.org
Message-ID: <200005051507.LAA14262@eric.cnri.reston.va.us>

I've moved all my responses to the Unicode debate to the i18n-sig
mailing list, where it belongs.  Please don't cross-post any more.

If you're interested in this issue but aren't subscribed to the
i18n-sig list, please subscribe at
http://www.python.org/mailman/listinfo/i18n-sig/.

To view the archives, go to http://www.python.org/pipermail/i18n-sig/.

See you there!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From jim at digicool.com  Fri May  5 19:09:34 2000
From: jim at digicool.com (Jim Fulton)
Date: Fri, 05 May 2000 13:09:34 -0400
Subject: [Python-Dev] Pickle diffs anyone?
Message-ID: <3913004E.6CC69857@digicool.com>

Someone recently made a cool proposal for utilizing
diffs to save space taken by old versions in
the Zope object database:

  http://www.zope.org/Members/jim/ZODB/ReverseDiffVersioning

To make this work, we need a good way of diffing pickles.

I thought maybe someone here would have some good suggestions.
I do think that the topic is sort of interesting (for some
definition of "interesting" ;).

The page above is a Wiki page. (Wiki is awesome. If you haven't
seen it before, check out http://joyful.com/zwiki/ZWiki.)
If you are a member of zope.org, you can edit the page directly,
which would be fine with me. :)

Jim

--
Jim Fulton           mailto:jim at digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.



From fdrake at acm.org  Fri May  5 19:14:16 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 5 May 2000 13:14:16 -0400 (EDT)
Subject: [Python-Dev] Pickle diffs anyone?
In-Reply-To: <3913004E.6CC69857@digicool.com>
References: <3913004E.6CC69857@digicool.com>
Message-ID: <14611.360.166536.866583@seahag.cnri.reston.va.us>

Jim Fulton writes:
 > To make this work, we need a good way of diffing pickles.

Jim,
  If the basic requirement is for a binary diff facility, perhaps you
should look into XDelta; I think that's available as a C library as
well as a command line tool, so you should be able to hook it in
fairly easily.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From trentm at activestate.com  Fri May  5 19:25:48 2000
From: trentm at activestate.com (Trent Mick)
Date: Fri, 5 May 2000 10:25:48 -0700
Subject: [Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306)
In-Reply-To: <000001bfb336$d4f512a0$0f2d153f@tim>
References: <NDBBKLNNJCFFMINBECLEOEBKCLAA.trentm@ActiveState.com> <000001bfb336$d4f512a0$0f2d153f@tim>
Message-ID: <20000505102548.B25914@activestate.com>

I posted a couple of patches a couple of days ago to correct the string
methods implementing slice-like optional parameters (count, find, index,
rfind, rindex) so that they clamp slice index values to the proper range (any
PyInt or PyLong value is acceptable now). In fact, the slice_index() function
that was being used in ceval.c was reused (renamed to _PyEval_SliceIndex).
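
In other words (an illustrative session, assuming the patch is applied):

    import sys
    s = "hello"
    print s[0:sys.maxint]               # slicing already clamps: hello
    print s.count("l", 0, sys.maxint)   # now clamps too, even where
                                        # sys.maxint > INT_MAX: 2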

As well, the other patch changes PyArg_ParseTuple's 'b', 'h', and 'i'
formatters to raise an OverflowError if they overflow.

Trent

p.s. I thought I would whine here for some more attention. Who needs that
Unicode stuff anyway. ;-)



From fw at deneb.cygnus.argh.org  Fri May  5 18:13:42 2000
From: fw at deneb.cygnus.argh.org (Florian Weimer)
Date: 05 May 2000 18:13:42 +0200
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Just van Rossum's message of "Fri, 5 May 2000 14:17:31 +0100"
References: <l03102805b5385e3de8e8@[193.78.237.127]>
	<l03102810b5378dda02f5@[193.78.237.126]>
	<l03102805b5385e3de8e8@[193.78.237.127]>
	<l03102808b53877a3e392@[193.78.237.127]>
Message-ID: <8766st5615.fsf@deneb.cygnus.argh.org>

Just van Rossum <just at letterror.com> writes:

> Good point. All this taken together still means to me that comparisons
> between wide and narrow strings should take place at the character level,
> which implies that coercion from narrow to wide is done at the character
> level, without looking at the encoding. (Which in my book in turn still
> implies that as long as we're talking about Unicode, narrow strings are
> effectively Latin-1.)

Sorry for jumping in, I've only recently discovered this list. :-/

At the moment, most of the computing world is not Latin-1 but
Windows-12??.  That's why I don't think this is a good idea at all.



From skip at mojam.com  Fri May  5 21:10:24 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 5 May 2000 14:10:24 -0500 (CDT)
Subject: [Python-Dev] Pickle diffs anyone?
In-Reply-To: <3913004E.6CC69857@digicool.com>
References: <3913004E.6CC69857@digicool.com>
Message-ID: <14611.7328.869011.109768@beluga.mojam.com>

    Jim> Someone recently made a cool proposal for utilizing diffs to save
    Jim> space taken by old versions in the Zope object database:

    Jim>   http://www.zope.org/Members/jim/ZODB/ReverseDiffVersioning

    Jim> To make this work, we need a good way of diffing pickles.

Fred already mentioned a candidate library to do diffs.  If that works, the
only other thing I think you'd need to do is guarantee that dicts are
pickled in a consistent fashion, probably by sorting the keys before
enumerating them.
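
Something along these lines, perhaps (just a sketch; a real fix would
probably hook the Pickler itself rather than change the data):

    import pickle

    def stable_dumps(d):
        # Pickle a dict as a sorted list of (key, value) pairs, so that two
        # pickles of equal dicts compare byte-for-byte equal.
        items = d.items()
        items.sort()
        return pickle.dumps(items)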

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From trentm at activestate.com  Fri May  5 23:34:48 2000
From: trentm at activestate.com (Trent Mick)
Date: Fri, 5 May 2000 14:34:48 -0700
Subject: [Python-Dev] should a float overflow or just equal 'inf'
Message-ID: <20000505143448.A10731@activestate.com>

Hi all,

I submitted a patch a couple of days ago to have the 'b', 'i', and 'h'
formatters for PyArg_ParseTuple raise an Overflow exception if they overflow
(currently they just silently overflow). Presuming that this is considered a
good idea, should this be carried over to floats?

Floats don't really overflow, they just equal 'inf'. Would it be more
desirable to raise an Overflow exception for this? I am inclined to think
that this would *not* be desirable based on the following quote:

"""
the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
    that-can-be-done-ly y'rs  - tim
"""

In any case, the question stands. I don't really have an idea of the
potential pains that this could cause to (1) efficiency, (2) external code
that expects to deal with 'inf's itself. The reason I ask is that I am
looking at related issues in the Python code these days.
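
For example, on an IEEE-754 platform (behaviour varies across platforms and
operations; this is only a sketch of the status quo):

    big = 1e308
    print big * 10        # typically prints something like 'inf' or '1.#INF'
                          # instead of raising
    print big * 10 > big  # and the infinity still compares as the largest value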


Trent
--
Trent Mick
trentm at activestate.com




From tismer at tismer.com  Sat May  6 16:29:07 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sat, 06 May 2000 16:29:07 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <000001bfb4a6$21da7900$922d153f@tim>
Message-ID: <39142C33.507025B5@tismer.com>


Tim Peters wrote:
> 
> [Trent Mick]
> > >>> i = -2147483648
> > OverflowError: integer literal too large
> > >>> i = -2147483648L
> > >>> int(i)   # it *is* a valid integer literal
> > -2147483648
> 
> Python's grammar is such that negative integer literals don't exist; what
> you actually have there is the unary minus operator applied to positive
> integer literals; indeed,

<disassembly snipped>

Well, knowing that there are more negatives than positives
and then coding it this way appears in fact as a design flaw to me.

A simple solution could be to do the opposite:
Always store a negative number and negate it
for positive numbers. A real negative number
would then end up with two UNARY_NEGATIVE
opcodes in sequence. If we had a simple postprocessor
to remove such sequences at the end, we're done.
As another step, it could also adjust all such consts
and remove those opcodes.

This could be a task for Skip's peephole optimizer.
Why did it never go into the core?

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From tim_one at email.msn.com  Sat May  6 21:13:46 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Sat, 6 May 2000 15:13:46 -0400
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39142C33.507025B5@tismer.com>
Message-ID: <000301bfb78f$33e33d80$452d153f@tim>

[Tim]
> Python's grammar is such that negative integer literals don't
> exist; what you actually have there is the unary minus operator
> applied to positive integer literals; ...

[Christian Tismer]
> Well, knowing that there are more negatives than positives
> and then coding it this way appears in fact as a design flaw to me.

Don't know what you're saying here.  Python's grammar has nothing to do with
the relative number of positive vs negative entities; indeed, in a
2's-complement machine it's not even true that there are more negatives than
positives.  Python generates the unary minus for "negative literals"
because, again, negative literals *don't exist* in the grammar.

> A simple solution could be to do the opposite:
> Always store a negative number and negate it
> for positive numbers.  ...

So long as negative literals don't exist in the grammar, "-2147483648" makes
no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
problem" here worth fixing, although if there is <wink>, it will get fixed
by magic as soon as Python ints and longs are unified.





From tim_one at email.msn.com  Sat May  6 21:47:25 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Sat, 6 May 2000 15:47:25 -0400
Subject: [Python-Dev] should a float overflow or just equal 'inf'
In-Reply-To: <20000505143448.A10731@activestate.com>
Message-ID: <000801bfb793$e70c9420$452d153f@tim>

[Trent Mick]
> I submitted a patch a couple of days ago to have the 'b', 'i', and 'h'
> formatters for PyArg_ParseTuple raise an Overflow exception if
> they overflow (currently they just silently overflow). Presuming that
> this is considered a good idea, should this be carried over to floats?
>
> Floats don't really overflow, they just equal 'inf'. Would it be more
> desirable to raise an Overflow exception for this? I am inclined to think
> that this would *not* be desirable based on the following quote:
>
> """
> the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
>     that-can-be-done-ly y'rs  - tim
> """
>
> In any case, the question stands. I don't really have an idea of the
> potential pains that this could cause to (1) efficiency, (2) external code
> that expects to deal with 'inf's itself. The reason I ask is because I am
> looking at related issues in the Python code these days.

Alas, this is the tip of a very large project:  while (I believe) *every*
platform Python runs on now is 754-conformant, Python itself has no idea
what it's doing wrt 754 semantics.  In part this is because ISO/ANSI C has
no idea what it's doing either.  C9X (the next C std) is supposed to supply
portable spellings of ways to get at 754 features, but before then there's
simply nothing portable that can be done.
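
(For reference, the kind of thing C9X is expected to supply looks roughly
like the sketch below; the <fenv.h> names are from the draft standard, so
treat this as illustration rather than something Python can use today.)

  #include <fenv.h>   /* C9X draft header; not portable C89 */

  /* Sketch only: multiply two doubles and report whether the
     operation overflowed, using the draft exception flags.
     (#pragma STDC FENV_ACCESS ON is formally required as well.) */
  static int
  mul_overflows(double a, double b, double *result)
  {
      feclearexcept(FE_OVERFLOW);
      *result = a * b;
      return fetestexcept(FE_OVERFLOW) != 0;
  }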

Guido & I already agreed in principle that Python will eventually follow 754
rules, but with the overflow, divide-by-0, and invalid operation exceptions
*enabled* by default (and the underflow and inexact exceptions disabled by
default).  It does this by accident <0.9 wink> already for, e.g.,

>>> 1. / 0.
Traceback (innermost last):
  File "<pyshell#0>", line 1, in ?
    1. / 0.
ZeroDivisionError: float division
>>>

Under the 754 defaults, that should silently return an infinity instead.  But
neither Guido nor I think the latter is reasonable default behavior, and
having done so before in a previous life I can formally justify changing the
defaults a language exposes.

Anyway, once all that is done, float overflow *will* raise an exception (by
default; there will also be a way to turn that off), unlike what happens
today.

Before then, I guess continuing the current policy of benign neglect (i.e.,
let it overflow silently) is best for consistency.  Without access to all
the 754 features in C, it's not even easy to detect overflow now!  "if (x ==
x * 0.5) overflow();" isn't quite good enough, as it can trigger a spurious
underflow error -- there's really no reasonable way to spell this stuff in
portable C now!
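
(For flavor: about the best that can be done without those features is an
after-the-fact test along the lines of the sketch below.  It assumes
non-stop 754 arithmetic, and it cannot tell a fresh overflow from an
operand that was already infinite -- which is part of the problem.)

  #include <float.h>

  /* Illustration only -- not what Python does.  With non-stop 754
     arithmetic an overflowed result has already become an infinity,
     and an infinity is the only double outside [-DBL_MAX, DBL_MAX].
     No multiply is involved, so no spurious underflow either. */
  static int
  double_overflowed(double x)
  {
      return x > DBL_MAX || x < -DBL_MAX;
  }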





From gstein at lyra.org  Sun May  7 12:25:29 2000
From: gstein at lyra.org (Greg Stein)
Date: Sun, 7 May 2000 03:25:29 -0700 (PDT)
Subject: [Python-Dev] buffer object (was: Unicode debate)
In-Reply-To: <390EF3EB.5BCE9EC3@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>

[ damn, I wish people would pay more attention to changing the subject
  line to reflect the contents of the email ... I could not figure out if
  there were any further responses to this without opening most of those
  dang "Unicode debate" emails. sheesh... ]

On Tue, 2 May 2000, M.-A. Lemburg wrote:
> Guido van Rossum wrote:
> > 
> > [MAL]
> > > Let's not make the same mistake again: Unicode objects should *not*
> > > be used to hold binary data. Please use buffers instead.
> > 
> > Easier said than done -- Python doesn't really have a buffer data
> > type.

The buffer object. We *do* have the type.

> > Or do you mean the array module?  It's not trivial to read a
> > file into an array (although it's possible, there are even two ways).
> > Fact is, most of Python's standard library and built-in objects use
> > (8-bit) strings as buffers.

For historical reasons only. It would be very easy to change these to use
buffer objects, except for the simple fact that callers might expect a
*string* rather than something with string-like behavior.

>...
> > > BTW, I think that this behaviour should be changed:
> > >
> > > >>> buffer('binary') + 'data'
> > > 'binarydata'

In several places, bufferobject.c uses PyString_FromStringAndSize(). It
wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then
copy the data in. A new API could also help out here:

  PyBuffer_CopyMemory(void *ptr, int size)
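
(A rough sketch of how such an API could be built on what is already there
-- hypothetical code, simplified error handling, going through the existing
C-level buffer interface to reach the writable pointer:)

  #include "Python.h"
  #include <string.h>

  /* Hypothetical -- this function does not exist today. */
  PyObject *
  PyBuffer_CopyMemory(void *ptr, int size)
  {
      PyObject *buf = PyBuffer_New(size);   /* buffer owns its memory */
      void *dest;

      if (buf == NULL)
          return NULL;
      /* fetch the writable pointer through the buffer interface */
      if (buf->ob_type->tp_as_buffer->bf_getwritebuffer(buf, 0, &dest) < 0) {
          Py_DECREF(buf);
          return NULL;
      }
      memcpy(dest, ptr, size);
      return buf;
  }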


> > > while:
> > >
> > > >>> 'data' + buffer('binary')
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in ?
> > > TypeError: illegal argument type for built-in operation

The string object can't handle the buffer on the right side. Buffer
objects use the buffer interface, so they can deal with strings on the
right. Therefore: asymmetry :-(

> > > IMHO, buffer objects should never coerce to strings, but instead
> > > return a buffer object holding the combined contents. The
> > > same applies to slicing buffer objects:
> > >
> > > >>> buffer('binary')[2:5]
> > > 'nar'
> > >
> > > should preferably be buffer('nar').

Sure. Wouldn't be a problem. The FromStringAndSize() thing.

> > Note that a buffer object doesn't hold data!  It's only a pointer to
> > data.  I can't off-hand explain the asymmetry though.
> 
> Dang, you're right...

Untrue. There is an API call which will construct a buffer object with its
own memory:

  PyObject * PyBuffer_New(int size)

The resulting buffer object will be read/write, and you can stuff values
into it using the slice notation.


> > > Hmm, perhaps we need something like a data string object
> > > to get this 100% right ?!

Nope. The buffer object is intended to be exactly this.

>...
> > Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> > which no "string literal" notations exist.
> 
> Anyway, one way or another I think we should make it clear
> to users that they should start using some other type for
> storing binary data.

Buffer objects. There are a couple changes to make this a bit easier for
people:

1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to
   create a read/write buffer of a particular size. buffer() should create
   a zero-length read/write buffer.

2) if slice assignment is updated to allow changes to the length (for
   example: buf[1:2] = 'abcdefgh'), then the buffer object definition must
   change. Specifically: when the buffer object owns the memory, it does
   this by appending the memory after the PyObject_HEAD and setting its
   internal pointer to it; when the dealloc() occurs, the target memory
   goes with the object. A flag would need to be added to tell the buffer
   object to do a second free() for the case where a realloc has returned
   a new pointer.
   [ I'm not sure that I would agree with this change, however; but it
     does make them a bit easier to work with; on the other hand, people
     have been working with immutable strings for a long time, so they're
     okay with concatenation, so I'm okay with saying length-altering
     operations must simply be done thru concatenation. ]


IMO, extensions should be using the buffer object for raw bytes. I know
that Mark has been updating some of the Win32 extensions to do this.
Python programs could use the objects if the buffer() builtin is tweaked
to allow a bit more flexibility in the arguments.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sun May  7 13:09:45 2000
From: gstein at lyra.org (Greg Stein)
Date: Sun, 7 May 2000 04:09:45 -0700 (PDT)
Subject: [Python-Dev] introducing byte arrays in 1.6 (was: Unicode debate)
In-Reply-To: <200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005070406200.7610-100000@nebula.lyra.org>

On Wed, 3 May 2000, Guido van Rossum wrote:
>...
> My ASCII proposal is a compromise that tries to be fair to both uses
> for strings.  Introducing byte arrays as a more fundamental type has
> been on the wish list for a long time -- I see no way to introduce
> this into Python 1.6 without totally botching the release schedule
> (June 1st is very close already!).  I'd like to be able to move on,
> there are other important things still to be added to 1.6 (Vladimir's
> malloc patches, Neil's GC, Fredrik's completed sre...).
> 
> For 1.7 (which should happen later this year) I promise I'll reopen
> the discussion on byte arrays.

See my other note. I think a simple change to the buffer() builtin would
allow read/write byte arrays to be simply constructed.

There are a couple API changes that could be made to bufferobject.[ch]
which could simplify some operations for C code and returning buffer
objects. But changes like that would be preconditioned on accepting the
change in return type from those extensions. For example, the doc may say
something returns a string; while buffer objects are similar to strings in
operation, they are not the *same*. IMO, Python 1.7 would be a good time
to alter return types to buffer objects as appropriate. (but I'm not
averse to doing it today! (to get people used to the difference in
purposes))

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bckfnn at worldonline.dk  Sun May  7 15:37:21 2000
From: bckfnn at worldonline.dk (Finn Bock)
Date: Sun, 07 May 2000 13:37:21 GMT
Subject: [Python-Dev] buffer object
In-Reply-To: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
Message-ID: <39156208.13412015@smtp.worldonline.dk>

[Greg Stein]

>IMO, extensions should be using the buffer object for raw bytes. I know
>that Mark has been updating some of the Win32 extensions to do this.
>Python programs could use the objects if the buffer() builtin is tweaked
>to allow a bit more flexibility in the arguments.

Forgive me for rewinding this to the very beginning. But what is a
buffer object useful for? I'm trying to think about buffer objects in terms
of jpython, so my primary interest is the user experience of buffer
objects.

Please correct my misunderstandings.

- There is not a buffer protocol exposed to python objects (in the way
  the sequence protocol __getitem__ & friends are exposed).
- A buffer object typically gives access to the raw bytes which
  underlie the backing object, regardless of the structure of the
  bytes.
- It is only intended for objects which have natural byte storage to
  implement the buffer interface.
- Of the builtin objects, only string, unicode and array support the
  buffer interface.
- When slicing a buffer object, the result is always a string regardless
  of the buffer object base.


In jpython, only byte arrays like jarrays.array('b', [0,1,2]) can be
said to have some natural byte storage. The jpython string type doesn't.
It would take some awful bit shifting to present a jpython string as an
array of bytes.

Would it make any sense to have a buffer object which only accepts a byte
array as its base? So that jpython would say:

>>> buffer("abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected


Would it make sense to tell python users that they cannot depend on the
portability of using strings (both 8bit and 16bit) as buffer object
base?


Because it is so difficult to look at java storage as a sequence of
bytes, I think I'm all for keeping the buffer() builtin and buffer
object as obscure and unknown as possible <wink>.

regards,
finn



From guido at python.org  Sun May  7 23:29:43 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 07 May 2000 17:29:43 -0400
Subject: [Python-Dev] buffer object
In-Reply-To: Your message of "Sun, 07 May 2000 13:37:21 GMT."
             <39156208.13412015@smtp.worldonline.dk> 
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>  
            <39156208.13412015@smtp.worldonline.dk> 
Message-ID: <200005072129.RAA15850@eric.cnri.reston.va.us>

[Finn Bock]

> Forgive me for rewinding this to the very beginning. But what is a
> buffer object useful for? I'm trying to think about buffer objects in terms
> of jpython, so my primary interest is the user experience of buffer
> objects.
> 
> Please correct my misunderstandings.
> 
> - There is not a buffer protocol exposed to python objects (in the way
>   the sequence protocol __getitem__ & friends are exposed).
> - A buffer object typically gives access to the raw bytes which
>   underlie the backing object, regardless of the structure of the
>   bytes.
> - It is only intended for objects which have natural byte storage to
>   implement the buffer interface.

All true.

> - Of the builtin objects, only string, unicode and array support the
>   buffer interface.

And the new mmap module.

> - When slicing a buffer object, the result is always a string regardless
>   of the buffer object base.
> 
> In jpython, only byte arrays like jarrays.array('b', [0,1,2]) can be
> said to have some natural byte storage. The jpython string type doesn't.
> It would take some awful bit shifting to present a jpython string as an
> array of bytes.

I don't recall why JPython has jarray instead of array -- how do they
differ?  I think it's a shame that similar functionality is embodied
in different APIs.

> Would it make any sense to have a buffer object which only accepts a byte
> array as its base? So that jpython would say:
> 
> >>> buffer("abc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: buffer object expected
> 
> 
> Would it make sense to tell python users that they cannot depend on the
> portability of using strings (both 8bit and 16bit) as buffer object
> base?

I think that the portability of many string properties is in danger
with the Unicode proposal.  Supporting this in the next version of
JPython will be a bit tricky.

> Because it is so difficult to look at java storage as a sequence of
> bytes, I think I'm all for keeping the buffer() builtin and buffer
> object as obscure and unknown as possible <wink>.

I basically agree, and in a private email to Greg Stein I've told him
this.  I think that the array module should be promoted to a built-in
function/type, and should be the recommended solution for data
storage.  The buffer API should remain a C-level API, and the buffer()
built-in should be labeled with "for experts only".

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Mon May  8 10:33:01 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 08 May 2000 10:33:01 +0200
Subject: [Python-Dev] buffer object (was: Unicode debate)
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
Message-ID: <39167BBD.88EB2C64@lemburg.com>

Greg Stein wrote:
> 
> [ damn, I wish people would pay more attention to changing the subject
>   line to reflect the contents of the email ... I could not figure out if
>   there were any further responses to this without opening most of those
>   dang "Unicode debate" emails. sheesh... ]
> 
> On Tue, 2 May 2000, M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> > >
> > > [MAL]
> > > > Let's not make the same mistake again: Unicode objects should *not*
> > > > be used to hold binary data. Please use buffers instead.
> > >
> > > Easier said than done -- Python doesn't really have a buffer data
> > > type.
> 
> The buffer object. We *do* have the type.
> 
> > > Or do you mean the array module?  It's not trivial to read a
> > > file into an array (although it's possible, there are even two ways).
> > > Fact is, most of Python's standard library and built-in objects use
> > > (8-bit) strings as buffers.
> 
> For historical reasons only. It would be very easy to change these to use
> buffer objects, except for the simple fact that callers might expect a
> *string* rather than something with string-like behavior.

Would this be too drastic a change, then? I think that we should
at least make use of buffers in the standard lib.

>
> >...
> > > > BTW, I think that this behaviour should be changed:
> > > >
> > > > >>> buffer('binary') + 'data'
> > > > 'binarydata'
> 
> In several places, bufferobject.c uses PyString_FromStringAndSize(). It
> wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then
> copy the data in. A new API could also help out here:
> 
>   PyBuffer_CopyMemory(void *ptr, int size)
> 
> > > > while:
> > > >
> > > > >>> 'data' + buffer('binary')
> > > > Traceback (most recent call last):
> > > >   File "<stdin>", line 1, in ?
> > > > TypeError: illegal argument type for built-in operation
> 
> The string object can't handle the buffer on the right side. Buffer
> objects use the buffer interface, so they can deal with strings on the
> right. Therefore: asymmetry :-(
> 
> > > > IMHO, buffer objects should never coerce to strings, but instead
> > > > return a buffer object holding the combined contents. The
> > > > same applies to slicing buffer objects:
> > > >
> > > > >>> buffer('binary')[2:5]
> > > > 'nar'
> > > >
> > > > should preferably be buffer('nar').
> 
> Sure. Wouldn't be a problem. The FromStringAndSize() thing.

Right.
 
Before digging deeper into this, I think we should hear
Guido's opinion on this again: he said that he wanted to
use Java's binary arrays for binary data... perhaps we
need to tweak the array type and make it more directly
accessible (from C and Python) instead.

> > > Note that a buffer object doesn't hold data!  It's only a pointer to
> > > data.  I can't off-hand explain the asymmetry though.
> >
> > Dang, you're right...
> 
> Untrue. There is an API call which will construct a buffer object with its
> own memory:
> 
>   PyObject * PyBuffer_New(int size)
> 
> The resulting buffer object will be read/write, and you can stuff values
> into it using the slice notation.

Yes, but that API is not reachable from within Python,
AFAIK.
 
> > > > Hmm, perhaps we need something like a data string object
> > > > to get this 100% right ?!
> 
> Nope. The buffer object is intended to be exactly this.
> 
> >...
> > > Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> > > which no "string literal" notations exist.
> >
> > Anyway, one way or another I think we should make it clear
> > to users that they should start using some other type for
> > storing binary data.
> 
> Buffer objects. There are a couple changes to make this a bit easier for
> people:
> 
> 1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to
>    create a read/write buffer of a particular size. buffer() should create
>    a zero-length read/write buffer.

This looks a lot like function overloading... I don't think we
should get into this: how about having the buffer() API take
keywords instead?!

buffer(size=1024,mode='rw') - 1K of owned read write memory
buffer(obj) - read-only referenced memory from obj
buffer(obj,mode='rw') - read-write referenced memory in obj

etc.

Or we could allow passing None as object to obtain an owned
read-write memory block (much like passing NULL to the
C functions).

> 2) if slice assignment is updated to allow changes to the length (for
>    example: buf[1:2] = 'abcdefgh'), then the buffer object definition must
>    change. Specifically: when the buffer object owns the memory, it does
>    this by appending the memory after the PyObject_HEAD and setting its
>    internal pointer to it; when the dealloc() occurs, the target memory
>    goes with the object. A flag would need to be added to tell the buffer
>    object to do a second free() for the case where a realloc has returned
>    a new pointer.
>    [ I'm not sure that I would agree with this change, however; but it
>      does make them a bit easier to work with; on the other hand, people
>      have been working with immutable strings for a long time, so they're
>      okay with concatenation, so I'm okay with saying length-altering
>      operations must simply be done thru concatenation. ]

I don't think I like this either: what happens when the buffer
doesn't own the memory?
 
> IMO, extensions should be using the buffer object for raw bytes. I know
> that Mark has been updating some of the Win32 extensions to do this.
> Python programs could use the objects if the buffer() builtin is tweaked
> to allow a bit more flexibility in the arguments.

Right.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From bckfnn at worldonline.dk  Mon May  8 21:44:27 2000
From: bckfnn at worldonline.dk (Finn Bock)
Date: Mon, 08 May 2000 19:44:27 GMT
Subject: [Python-Dev] buffer object
In-Reply-To: <200005072129.RAA15850@eric.cnri.reston.va.us>
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>   <39156208.13412015@smtp.worldonline.dk>  <200005072129.RAA15850@eric.cnri.reston.va.us>
Message-ID: <3917074c.8837607@smtp.worldonline.dk>

[Guido]

>I don't recall why JPython has jarray instead of array -- how do they
>differ?  I think it's a shame that similar functionality is embodied
>in different APIs.

The jarray module is a paper thin factory for the PyArray type which is
primarily (I believe) a wrapper around any existing java array instance.
It exists to make arrays returned from java code useful for jpython.
Since a PyArray must always wrap the original java array, it cannot
resize the array.

In contrast an array instance would own the memory and can resize it as
necessary.

Due to the different purposes I agree with Jim's decision of making the
two modules incompatible. And they are truly incompatible. jarray.array
has reversed the (typecode, seq) arguments.

OTOH creating a mostly compatible array module for jpython should not be
too hard.
 
regards,
finn





From guido at python.org  Mon May  8 21:55:50 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 15:55:50 -0400
Subject: [Python-Dev] buffer object
In-Reply-To: Your message of "Mon, 08 May 2000 19:44:27 GMT."
             <3917074c.8837607@smtp.worldonline.dk> 
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org> <39156208.13412015@smtp.worldonline.dk> <200005072129.RAA15850@eric.cnri.reston.va.us>  
            <3917074c.8837607@smtp.worldonline.dk> 
Message-ID: <200005081955.PAA21928@eric.cnri.reston.va.us>

> >I don't recall why JPython has jarray instead of array -- how do they
> >differ?  I think it's a shame that similar functionality is embodied
> >in different APIs.
> 
> The jarray module is a paper thin factory for the PyArray type which is
> primarily (I believe) a wrapper around any existing java array instance.
> It exists to make arrays returned from java code useful for jpython.
> Since a PyArray must always wrap the original java array, it cannot
> resize the array.

Understood.  This is a bit like the buffer API in CPython then (except
for Greg's vision where the buffer object manages storage as well :-).

> In contrast an array instance would own the memory and can resize it as
> necessary.

OK, this makes sense.

> Due to the different purposes I agree with Jim's decision of making the
> two modules incompatible. And they are truly incompatible. jarray.array
> has reversed the (typecode, seq) arguments.

This I'm not so sure of.  Why be different just to be different?

> OTOH creating a mostly compatible array module for jpython should not be
> too hard.

OK, when we make array() a built-in, this should be done for Java too.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Mon May  8 22:29:21 2000
From: trentm at activestate.com (Trent Mick)
Date: Mon, 8 May 2000 13:29:21 -0700
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <200005081400.KAA19889@eric.cnri.reston.va.us>
References: <20000503161656.A20275@activestate.com> <200005081400.KAA19889@eric.cnri.reston.va.us>
Message-ID: <20000508132921.A31981@activestate.com>

On Mon, May 08, 2000 at 10:00:30AM -0400, Guido van Rossum wrote:
> > Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an
> > Overflow exception if they overflow (previously they just silently
> > overflowed).
> 
> Trent,
> 
> There's one issue with this: I believe the 'b' format is mostly used
> with unsigned character arguments in practice.
> However on systems
> with default signed characters, CHAR_MAX is 127 and values 128-255 are
> rejected.  I'll change the overflow test to:
> 
> 	else if (ival > CHAR_MAX && ival >= 256) {
> 
> if that's okay with you.
> 
Okay, I guess. Two things:

1. In a way this defeats the main purpose of the checks. Now a silent overflow
could happen for a signed byte value over CHAR_MAX. The only way to
automatically do the bounds checking is if the exact type is known, i.e.
different formatters for signed and unsigned integral values. I don't know if
this is desired (is it?). The obvious choice of 'u' prefixes to specify
unsigned is obviously not an option.

Another option might be to document 'b' as for unsigned chars and 'h', 'i',
'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX]
for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
limiting the user to unsigned or signed depending on the formatter. (Which
again, means that it would be nice to have different formatters for signed
and unsigned.) I think that the bounds checking is false security unless
these restrictions are made.


2. The above aside, I would be more inclined to change the line in question to:

   else if (ival > UCHAR_MAX) {

as this is more explicit about what is being done.

> Another issue however is that there are probably cases where an 'i'
> format is used (which can't overflow on 32-bit architectures) but
> where the int value is then copied into a short field without an
> additional check...  I'm not sure how to fix this except by a complete
> inspection of all code...  Not clear if it's worth it.

Yes, a complete code inspection seems to be the only way. That is some of
what I am doing. Again, I have two questions:

1. There are a fairly large number of downcasting cases in the Python code
(not necessarily tied to PyArg_ParseTuple results). I was wondering if you
think a generalized check on each such downcast would be advisable. This
would take the form of some macro that would do a bounds check before doing
the cast. For example (a common one is the cast of strlen's size_t return
value to int, because Python strings use int for their length, this is a
downcast on 64-bit systems):

  size_t len = strlen(s);
  obj = PyString_FromStringAndSize(s, len);

would become
  
  size_t len = strlen(s);
  obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));

CAST_TO_INT would ensure that 'len' did not overflow and would raise an
exception otherwise.

Pros:

- should never have to worry about overflows again
- easy to find (given MSC warnings) and easy to code in (straightforward)

Cons:

- more code, more time to execute
- looks ugly
- have to check PyErr_Occurred every time a cast is done


I would like other people's opinion on this kind of change. There are three
possible answers:

  +1 this is a bad change idea because...<reason>
  -1 this is a good idea, go for it
  +0 (most likely) This is probably a good idea for some case where the
     overflow *could* happen, however the strlen example that you gave is
	 *not* such a situation. As Tim Peters said: 2GB limit on string lengths
	 is a good assumption/limitation.



2. Microsoft's compiler gives good warnings for casts where information loss
is possible. However, I cannot find a way to get similar warnings from gcc.
Does anyone know if that is possible? I.e.

	int i = 123456;
	short s = i;  // should warn about possible loss of information

should give a compiler warning.


Thanks,
Trent

-- 
Trent Mick
trentm at activestate.com



From trentm at activestate.com  Mon May  8 23:26:51 2000
From: trentm at activestate.com (Trent Mick)
Date: Mon, 8 May 2000 14:26:51 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005081416.KAA20158@eric.cnri.reston.va.us>
References: <20000505135817.A9859@activestate.com> <200005081416.KAA20158@eric.cnri.reston.va.us>
Message-ID: <20000508142651.C8000@activestate.com>

On Mon, May 08, 2000 at 10:16:42AM -0400, Guido van Rossum wrote:
> > The patch to config.h looks big but it really is not. These are the effective
> > changes:
> > - MS_WINxx are keyed off _WINxx
> > - SIZEOF_VOID_P is set to 8 for Win64
> > - COMPILER string is changed appropriately for Win64
>
> One thing worries me: if COMPILER is changed, that changes
> sys.platform to "win64", right?  I'm sure that will break plenty of
> code which currently tests for sys.platform=="win32" but really wants
> to test for any form of Windows.  Maybe sys.platform should remain
> win32?
> 

No, but yes. :( Actually I forgot to mention that my config.h patch changes
the PLATFORM #define from win32 to win64. So yes, you are correct. And, yes
(Sigh) you are right that this will break tests for sys.platform == "win32".

So I guess the simplest thing to do is to leave it as win32 following the
same reasoning for defining MS_WIN32 on Win64:

>  The idea is that the common case is
>  that code specific to Win32 will also work on Win64 rather than being
>  specific to Win32 (i.e. there is more the same than different in WIn32 and
>  Win64).
 

What if someone needs to do something in Python code for either Win32 or
Win64 but not both? Or should this never be necessary (not likely). I would
like Mark H's opinion on this stuff.


Trent

-- 
Trent Mick
trentm at activestate.com



From tismer at tismer.com  Mon May  8 23:52:54 2000
From: tismer at tismer.com (Christian Tismer)
Date: Mon, 08 May 2000 23:52:54 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <000301bfb78f$33e33d80$452d153f@tim>
Message-ID: <39173736.2A776348@tismer.com>


Tim Peters wrote:
> 
> [Tim]
> > Python's grammar is such that negative integer literals don't
> > exist; what you actually have there is the unary minus operator
> > applied to positive integer literals; ...
> 
> [Christian Tismer]
> > Well, knowing that there are more negatives than positives
> > and then coding it this way appears in fact as a design flaw to me.
> 
> Don't know what you're saying here. 

On a 2's-complement machine, there are 2**(n-1) negatives, zero, and
2**(n-1)-1 positives. The most negative number cannot be negated.
Most machines today use 2's complement.

> Python's grammar has nothing to do with
> the relative number of positive vs negative entities; indeed, in a
> 2's-complement machine it's not even true that there are more negatives than
> positives. 

If I read this as referring to a 1's-complement machine, then I believe it.
But we don't need to split hairs on known stuff :-)

> Python generates the unary minus for "negative literals"
> because, again, negative literals *don't exist* in the grammar.

Yes. If I know the facts and don't build negative literals into
the grammar, then I call it an oversight. Not too bad but not nice.

> > A simple solution could be to do the opposite:
> > Always store a negative number and negate it
> > for positive numbers.  ...
> 
> So long as negative literals don't exist in the grammar, "-2147483648" makes
> no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> problem" here worth fixing, although if there is <wink>, it will get fixed
> by magic as soon as Python ints and longs are unified.

I'd change the grammar.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From gstein at lyra.org  Mon May  8 23:54:31 2000
From: gstein at lyra.org (Greg Stein)
Date: Mon, 8 May 2000 14:54:31 -0700 (PDT)
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39173736.2A776348@tismer.com>
Message-ID: <Pine.LNX.4.10.10005081452130.18798-100000@nebula.lyra.org>

On Mon, 8 May 2000, Christian Tismer wrote:
>...
> > So long as negative literals don't exist in the grammar, "-2147483648" makes
> > no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> > problem" here worth fixing, although if there is <wink>, it will get fixed
> > by magic as soon as Python ints and longs are unified.
> 
> I'd change the grammar.

That would be very difficult, with very little positive benefit. As Mark
said, use 0x80000000 if you want that number.

Consider that the grammar would probably want to deal with things like
  - 1234
or
  -0xA

Instead, the grammar sees two parts: "-" and "NUMBER" without needing to
complicate the syntax for NUMBER.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From tismer at tismer.com  Tue May  9 00:09:43 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 09 May 2000 00:09:43 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <Pine.LNX.4.10.10005081452130.18798-100000@nebula.lyra.org>
Message-ID: <39173B27.4B3BEB40@tismer.com>


Greg Stein wrote:
> 
> On Mon, 8 May 2000, Christian Tismer wrote:
> >...
> > > So long as negative literals don't exist in the grammar, "-2147483648" makes
> > > no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> > > problem" here worth fixing, although if there is <wink>, it will get fixed
> > > by magic as soon as Python ints and longs are unified.
> >
> > I'd change the grammar.
> 
> That would be very difficult, with very little positive benefit. As Mark
> said, use 0x80000000 if you want that number.
> 
> Consider that the grammar would probably want to deal with things like
>   - 1234
> or
>   -0xA
> 
> Instead, the grammar sees two parts: "-" and "NUMBER" without needing to
> complicate the syntax for NUMBER.

Right. That was the reason for my first, dumb, proposal:
Always interpret a number as negative and negate it once more.
That makes it positive. In a post process, remove double-negates.
This leaves negations always where they are allowed: On negatives.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From gstein at lyra.org  Tue May  9 00:11:00 2000
From: gstein at lyra.org (Greg Stein)
Date: Mon, 8 May 2000 15:11:00 -0700 (PDT)
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39173B27.4B3BEB40@tismer.com>
Message-ID: <Pine.LNX.4.10.10005081508490.18798-100000@nebula.lyra.org>

On Tue, 9 May 2000, Christian Tismer wrote:
>...
> Right. That was the reason for my first, dumb, proposal:
> Always interpret a number as negative and negate it once more.
> That makes it positive. In a post process, remove double-negates.
> This leaves negations always where they are allowed: On negatives.

IMO, that is a non-intuitive hack. It would increase the complexity of
Python's parsing internals. Again, with little measurable benefit.

I do not believe that I've run into a case of needing -2147483648 in the
source of one of my programs. If I had, then I'd simply switch to
0x80000000 and/or assign it to INT_MIN.

-1 on making Python more complex to support this single integer value.
   Users should be pointed to 0x80000000 to represent it. (a FAQ entry
   and/or comment in the language reference would be a Good Thing)


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From mhammond at skippinet.com.au  Tue May  9 00:15:17 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 08:15:17 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <20000508142651.C8000@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au>

[Trent]
> What if someone needs to do something in Python code for either Win32 or
> Win64 but not both? Or should this never be necessary (not
> likely). I would
> like Mark H's opinion on this stuff.

OK :-)

I have always thought that it _would_ move to "win64", and the official way
of checking for "Windows" will be sys.platform[:3]=="win".

In fact, I've noticed Guido use this idiom (both stand-alone, and as: if
sys.platform[:3] in ["win", "mac"])

It will no doubt cause a bit of pain, but IMO it is cleaner...

Mark.




From guido at python.org  Tue May  9 04:14:07 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 22:14:07 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 08:15:17 +1000."
             <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> 
Message-ID: <200005090214.WAA22419@eric.cnri.reston.va.us>

> [Trent]
> > What if someone needs to do something in Python code for either Win32 or
> > Win64 but not both? Or should this never be necessary (not
> > likely). I would
> > like Mark H's opinion on this stuff.

[Mark]
> OK :-)
> 
> I have always thought that it _would_ move to "win64", and the official way
> of checking for "Windows" will be sys.platform[:3]=="win".
> 
> In fact, I've noticed Guido use this idiom (both stand-alone, and as: if
> sys.platform[:3] in ["win", "mac"])
> 
> It will no doubt cause a bit of pain, but IMO it is cleaner...

Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
symbol is defined even on Win64 systems -- to test for Win64, you must
test the _WIN64 symbol.  The two variants are more similar than they
are different.

While testing sys.platform isn't quite the same thing, I think that
the same reasoning goes: a win64 system is everything that a win32
system is, and then some.

So I'd vote for leaving sys.platform alone (i.e. "win32" in both
cases), and providing another way to test for win64-ness.

I wish we had had the foresight to set sys.platform to 'windows', but
since we hadn't, I think we'll have to live with the consequences.

The changes that Trent had to make in the standard library are only
the tip of the iceberg...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  9 04:24:50 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 22:24:50 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: Your message of "Mon, 08 May 2000 13:29:21 PDT."
             <20000508132921.A31981@activestate.com> 
References: <20000503161656.A20275@activestate.com> <200005081400.KAA19889@eric.cnri.reston.va.us>  
            <20000508132921.A31981@activestate.com> 
Message-ID: <200005090224.WAA22457@eric.cnri.reston.va.us>

[Trent]
> > > Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an
> > > Overflow exception if they overflow (previously they just silently
> > > overflowed).

[Guido]
> > There's one issue with this: I believe the 'b' format is mostly used
> > with unsigned character arguments in practice.
> > However on systems
> > with default signed characters, CHAR_MAX is 127 and values 128-255 are
> > rejected.  I'll change the overflow test to:
> > 
> > 	else if (ival > CHAR_MAX && ival >= 256) {
> > 
> > if that's okay with you.

[Trent]
> Okay, I guess. Two things:
> 
> 1. In a way this defeats the main purpose of the checks. Now a silent overflow
> could happen for a signed byte value over CHAR_MAX. The only way to
> automatically do the bounds checking is if the exact type is known, i.e.
> different formatters for signed and unsigned integral values. I don't know if
> this is desired (is it?). The obvious choice of 'u' prefixes to specify
> unsigned is obviously not an option.

The struct module uses upper case for unsigned.  I think this is
overkill here, and would add a lot of code (if applied systematically)
that would rarely be used.

> Another option might be to document 'b' as for unsigned chars and 'h', 'i',
> 'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX]
> for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
> limiting the user to unsigned or signed depending on the formatter. (Which
> again, means that it would be nice to have different formatters for signed
> and unsigned.) I think that the bounds checking is false security unless
> these restrictions are made.

I like this: 'b' is unsigned, the others are signed.

> 2. The above aside, I would be more inclined to change the line in question to:
> 
>    else if (ival > UCHAR_MAX) {
> 
> as this is more explicit about what is being done.

Agreed.
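
(Concretely, the agreed-on behaviour amounts to something like the helper
sketched below -- hypothetical code, not the literal getargs.c converter,
with the 'b' range clamped to [0, UCHAR_MAX]:)

  #include "Python.h"
  #include <limits.h>

  /* Sketch: convert a Python int to an unsigned byte, raising
     OverflowError outside [0, UCHAR_MAX].  Returns 0 on success,
     -1 with an exception set on failure. */
  static int
  convert_ubyte(PyObject *arg, unsigned char *p)
  {
      long ival = PyInt_AsLong(arg);
      if (ival == -1 && PyErr_Occurred())
          return -1;
      if (ival < 0 || ival > UCHAR_MAX) {
          PyErr_SetString(PyExc_OverflowError,
                          "unsigned byte integer is out of range");
          return -1;
      }
      *p = (unsigned char) ival;
      return 0;
  }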

> > Another issue however is that there are probably cases where an 'i'
> > format is used (which can't overflow on 32-bit architectures) but
> > where the int value is then copied into a short field without an
> > additional check...  I'm not sure how to fix this except by a complete
> > inspection of all code...  Not clear if it's worth it.
> 
> Yes, a complete code inspection seems to be the only way. That is some of
> what I am doing. Again, I have two questions:
> 
> 1. There are a fairly large number of downcasting cases in the Python code
> (not necessarily tied to PyArg_ParseTuple results). I was wondering if you
> think a generalized check on each such downcast would be advisable. This
> would take the form of some macro that would do a bounds check before doing
> the cast. For example (a common one is the cast of strlen's size_t return
> value to int, because Python strings use int for their length, this is a
> downcast on 64-bit systems):
> 
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, len);
> 
> would become
>   
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
> 
> CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> exception otherwise.
> 
> Pros:
> 
> - should never have to worry about overflows again
> - easy to find (given MSC warnings) and easy to code in (straightforward)
> 
> Cons:
> 
> - more code, more time to execute
> - looks ugly
> - have to check PyErr_Occurred every time a cast is done

How would the CAST_TO_INT macro signal an error?  C doesn't have
exceptions.  If we have to add checks, I'd prefer to write

  size_t len = strlen(s);
  if (INT_OVERFLOW(len))
     return NULL; /* Or whatever is appropriate in this context */
  obj = PyString_FromStringAndSize(s, len);
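
(INT_OVERFLOW here is hypothetical -- no such macro exists in Python's
headers.  One possible definition, assuming an unsigned argument such as
size_t and setting the exception as a side effect so the caller can just
bail out:)

  #include "Python.h"
  #include <limits.h>

  /* hypothetical; the argument is assumed to be unsigned (e.g. size_t) */
  #define INT_OVERFLOW(x) \
          ((x) > (size_t)INT_MAX \
           ? (PyErr_SetString(PyExc_OverflowError, \
                              "value too large to fit in an int"), 1) \
           : 0)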

> I would like other people's opinion on this kind of change. There are three
> possible answers:
> 
>   +1 this is a bad change idea because...<reason>
>   -1 this is a good idea, go for it
>   +0 (most likely) This is probably a good idea for some case where the
>      overflow *could* happen, however the strlen example that you gave is
> 	 *not* such a situation. As Tim Peters said: 2GB limit on string lengths
> 	 is a good assumption/limitation.

-0

> 2. Microsoft's compiler gives good warnings for casts where information loss
> is possible. However, I cannot find a way to get similar warnings from gcc.
> Does anyone know if that is possible? I.e.
> 
> 	int i = 123456;
> 	short s = i;  // should warn about possible loss of information
> 
> should give a compiler warning.

Beats me :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mhammond at skippinet.com.au  Tue May  9 04:29:50 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 12:29:50 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au>

> > It will no doubt cause a bit of pain, but IMO it is cleaner...
>
> Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
> symbol is defined even on Win64 systems -- to test for Win64, you must
> test the _WIN64 symbol.  The two variants are more similar than they
> are different.

Yes, but still, one day, (if MS have their way :-) win32 will be "legacy".

eg, imagine we were having the same debate about 5 years ago, but there was
a more established Windows 3.1 port available.

If we believed the hype, we probably _would_ have gone with "windows" for
both platforms, in the hope that they are more similar than different
(after all, that _was_ the story back then).

> The changes that Trent had to make in the standard library are only
> the tip of the iceberg...

Yes, but OTOH, the fact we explicitly use "win32" means people shouldn't
really expect code to work on Win64.  If nothing else, it will be a good
opportunity to examine the situation as each occurrence is found.  It will
be quite some time before many people play with the Win64 port seriously
(just like the first NT ports when I first came on the scene :-)

So, I remain a +0 on this - i.e., I don't really care personally, but think
"win64" is the right thing.  In any case, I'm happy to rely on Guido's time
machine...

Mark.




From mhammond at skippinet.com.au  Tue May  9 04:36:59 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 12:36:59 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au>

One more data point:

Windows CE uses "wince", and I certainly don't believe this should be
"win32" (although if you read the CE marketing stuff, they would have you
believe it is close enough that we should :-).

So to be _truly_ "windows portable", you will still need [:3]=="win" anyway
:-)

Mark.




From guido at python.org  Tue May  9 05:16:34 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 23:16:34 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 12:29:50 +1000."
             <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au> 
Message-ID: <200005090316.XAA22614@eric.cnri.reston.va.us>

To help me understand the significance of win64 vs. win32, can you
list the major differences?  I thought that the main thing was that
pointers are 64 bits, and that otherwise the APIs are the same.  In
fact, I don't know if WIN64 refers to Windows running on 64-bit
machines (e.g. Alphas) only, or that it is possible to have win64 on a
32-bit machine (e.g. Pentium).

If it's mostly a matter of pointer size, this is almost completely
hidden at the Python level, and I don't think it's worth changing the
platform name.  All of the changes that Trent found were really tests
for the presence of Windows APIs like the registry...

I could defend calling it Windows in comments but having sys.platform
be "win32".  Like uname on Solaris 2.7 returns SunOS 5.7 -- there's
too much old code that doesn't deserve to be broken.  (And it's not
like we have an excuse that it was always documented this way -- this
wasn't documented very clearly at all...)

It's-spelt-Raymond-Luxury-Yach-t-but-it's-pronounced-Throatwobbler-Mangrove,

--Guido van Rossum (home page: http://www.python.org/~guido/)




From guido at python.org  Tue May  9 05:19:19 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 23:19:19 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 12:36:59 +1000."
             <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au> 
Message-ID: <200005090319.XAA22627@eric.cnri.reston.va.us>

> Windows CE uses "wince", and I certainly don't believe this should be
> "win32" (although if you read the CE marketing stuff, they would have you
> believe it is close enough that we should :-).
> 
> So to be _truly_ "windows portable", you will still need [:3]=="win" anyway
> :-)

That's a feature :-).  Too many things we think we know are true on
Windows don't hold on Win/CE, so it's worth being more precise.

I don't believe this is the case for Win64, but I have to admit I
speak from a position of ignorance -- I am clueless as to what defines
Win64.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From nhodgson at bigpond.net.au  Tue May  9 05:35:16 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 9 May 2000 13:35:16 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au>  <200005090316.XAA22614@eric.cnri.reston.va.us>
Message-ID: <035e01bfb968$9ad8cca0$e3cb8490@neil>

> To help me understand the significance of win64 vs. win32, can you
> list the major differences?  I thought that the main thing was that
> pointers are 64 bits, and that otherwise the APIs are the same.  In
> fact, I don't know if WIN64 refers to Windows running on 64-bit
> machines (e.g. Alphas) only, or that it is possible to have win64 on a
> 32-bit machine (e.g. Pentium).

   The 64 bit pointer change propagates to related types like size_t and
window procedure parameters. Running the 64 bit checker over Scintilla found
one real problem and a large number of strlen calls returning 64 bit size_ts where
only ints were expected.

   64 bit machines will continue to run Win32 code but it is unlikely that
32 bit machines will be taught to run Win64 code.

   Mixed operations, calling between 32 bit and 64 bit code and vice-versa
will be fun. Microsoft (unlike IBM with OS/2) never really did the right
thing for the 16->32 bit conversion. Is there any information yet on mixed
size applications?

   Neil





From mhammond at skippinet.com.au  Tue May  9 06:06:25 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 14:06:25 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090316.XAA22614@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>

> To help me understand the significance of win64 vs. win32, can you
> list the major differences?  I thought that the main thing was that

I just saw Neils, and Trent may have other input.

However, the point I was making is that 5 years ago, MS were telling us
that the Win32 API was almost identical to the Win16 API, except for the
size of pointers, and dropping of the "memory model" abominations.

The Windows CE department is telling us that CE is, or will be, basically
the same as Win32, except it is a Unicode only platform.  Again, with 1.6,
this should be hidden from the Python programmer.

Now all we need is "win64s" - it will respond to Neil's criticism that
mixed mode programs are a pain, and MS will tell us that "win64s" will
solve all our problems, and allow win32 to run 64 bit programs well into
the future.  Until everyone in the world realizes it sucks, and MS promptly
says it was only ever a hack in the first place, and everyone should be on
Win64 by now anyway :-)

Its-times-like-this-we-really-need-that-time-machine-ly,

Mark.




From tim_one at email.msn.com  Tue May  9 08:54:51 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 9 May 2000 02:54:51 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <200005090224.WAA22457@eric.cnri.reston.va.us>
Message-ID: <000101bfb983$7a34d3c0$592d153f@tim>

[Trent]
> 1. There are a fairly large number of downcasting cases in the
> Python code (not necessarily tied to PyArg_ParseTuple results). I
> was wondering if you think a generalized check on each such
> downcast would be advisable. This would take the form of some macro
> that would do a bounds check before doing the cast. For example (a
> common one is the cast of strlen's size_t return value to int,
> because Python strings use int for their length, this is a downcast
> on 64-bit systems):
>
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, len);
>
> would become
>
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
>
> CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> exception otherwise.

[Guido]
> How would the CAST_TO_INT macro signal an error?  C doesn't have
> exceptions.  If we have to add checks, I'd prefer to write
>
>   size_t len = strlen(s);
>   if (INT_OVERFLOW(len))
>      return NULL; /* Or whatever is appropriate in this context */
>   obj = PyString_FromStringAndSize(s, len);

Of course we have to add checks -- strlen doesn't return an int!  It hasn't
since about a year after Python was first written (ANSI C changed the rules,
and Python is long overdue in catching up -- if you want people to stop
passing multiple args to append, set a good example in our use of C <0.5
wink>).

[Trent]
> I would like other people's opinion on this kind of change.
> There are three possible answers:

Please don't change the rating scheme we've been using:  -1 is a veto, +1 is
a hurrah, -0 and +0 are obvious <ahem>.

>   +1 this is a bad change idea because...<reason>
>   -1 this is a good idea, go for it

That one, except spelled +1.

>   +0 (most likely) This is probably a good idea for some case
> where the overflow *could* happen, however the strlen example that
> you gave is *not* such a situation. As Tim Peters said: 2GB limit on
> string lengths is a good assumption/limitation.

No, it's a defensible limitation, but it's *never* a valid assumption.  The
check isn't needed anywhere we can prove a priori that it could never fail
(in which case we're not assuming anything), but it's always needed when we
can't so prove (in which case skipping the check would be a bad assumption).
In the absence of any context, your strlen example above definitely needs
the check.
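
For concreteness, here is a minimal sketch of that kind of check, spelled
the way Guido suggests above.  Both the INT_OVERFLOW definition and the
wrapper function are made up for illustration -- this is a sketch of the
pattern, not the actual patch:

    #include <limits.h>
    #include <string.h>
    #include "Python.h"

    /* true iff a size_t value would not survive a downcast to int */
    #define INT_OVERFLOW(v)  ((v) > (size_t)INT_MAX)

    static PyObject *
    string_from_c_string(const char *s)
    {
        size_t len = strlen(s);
        if (INT_OVERFLOW(len)) {
            PyErr_SetString(PyExc_OverflowError,
                            "C string longer than INT_MAX");
            return NULL;
        }
        return PyString_FromStringAndSize(s, (int)len);
    }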

An alternative would be to promote the size member from int to size_t;
that's no actual change on the 32-bit machines Guido generally assumes
without realizing it, and removes an arbitrary (albeit defensible)
limitation on some 64-bit machines at the cost of (just possibly, due to
alignment vagaries) boosting var objects' header size on the latter.

correctness-doesn't-happen-by-accident-ly y'rs  - tim





From guido at python.org  Tue May  9 12:48:16 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 09 May 2000 06:48:16 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: Your message of "Tue, 09 May 2000 02:54:51 EDT."
             <000101bfb983$7a34d3c0$592d153f@tim> 
References: <000101bfb983$7a34d3c0$592d153f@tim> 
Message-ID: <200005091048.GAA22912@eric.cnri.reston.va.us>

> An alternative would be to promote the size member from int to size_t;
> that's no actual change on the 32-bit machines Guido generally assumes
> without realizing it, and removes an arbitrary (albeit defensible)
> limitation on some 64-bit machines at the cost of (just possibly, due to
> alignment vagaries) boosting var objects' header size on the latter.

Then the signatures of many, many functions would have to be changed
to take or return size_t, too -- almost anything in the Python/C API
that *conceptually* is a size_t is declared as int; the ob_size field
is only the tip of the iceberg.

We'd also have to change the size of Python ints (currently long) to
an integral type that can hold a size_t; on Windows (and I believe
*only* on Windows) this is a long long, or however they spell it
(except size_t is typically unsigned).

This all is a major reworking -- not good for 1.6, even though I agree
it needs to be done eventually.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  9 13:08:25 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 09 May 2000 07:08:25 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 14:06:25 +1000."
             <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> 
Message-ID: <200005091108.HAA22983@eric.cnri.reston.va.us>

> > To help me understand the significance of win64 vs. win32, can you
> > list the major differences?  I thought that the main thing was that
> 
> I just saw Neils, and Trent may have other input.
> 
> However, the point I was making is that 5 years ago, MS were telling us
> that the Win32 API was almost identical to the Win16 API, except for the
> size of pointers, and dropping of the "memory model" abominations.
> 
> The Windows CE department is telling us that CE is, or will be, basically
> the same as Win32, except it is a Unicode only platform.  Again, with 1.6,
> this should be hidden from the Python programmer.
> 
> Now all we need is "win64s" - it will respond to Neil's criticism that
> mixed mode programs are a pain, and MS will tell us that "win64s" will
> solve all our problems, and allow win32 to run 64 bit programs well into
> the future.  Until everyone in the world realizes it sucks, and MS promptly
> says it was only ever a hack in the first place, and everyone should be on
> Win64 by now anyway :-)

OK, I am beginning to get the picture.

The win16-win32-win64 distinction mostly affects the C API.  I agree
that the win16/win32 distinction was huge -- while they provided
backwards compatible APIs, most of these were quickly deprecated.  The
user experience was also completely different.  And huge amounts of
functionality were only available in the win32 version (e.g. the
registry), win32s notwithstanding.

I don't see the same difference for the win32/win64 API.  Yes, all the
APIs have changed -- but only in a way you would *expect* them to
change in a 64-bit world.  From the descriptions of differences, the
user experience and the sets of APIs available are basically the same,
but the APIs are tweaked to allow 64-bit values where this makes
sense.  This is a big deal for MS developers because of MS's
insistence on fixing the sizes of all datatypes -- POSIX developers
are used to typedefs that have platform-dependent widths, but MS in
its wisdom has decided that it should be okay to know that a long is
exactly 32 bits.

Again, the Windows/CE user experience is quite different, so I agree
that the user-visible platform should be different there.  But I still
don't see that the user experience for win64 will be any different
from win32.

Another view: win32 was my way of saying the union of Windows 95,
Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
platforms.  If Windows 2000 is sufficiently different to the user, it
deserves a different platform id (win2000?).

Is there a connection between Windows 2000 and _WIN64?

--Guido van Rossum (home page: http://www.python.org/~guido/)




From mal at lemburg.com  Tue May  9 11:09:40 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 09 May 2000 11:09:40 +0200
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <3917D5D3.A8CD1B3E@lemburg.com>

Guido van Rossum wrote:
> 
> > [Trent]
> > > What if someone needs to do something in Python code for either Win32 or
> > > Win64 but not both? Or should this never be necessary (not
> > > likely). I would
> > > like Mark H's opinion on this stuff.
> 
> [Mark]
> > OK :-)
> >
> > I have always thought that it _would_ move to "win64", and the official way
> > of checking for "Windows" will be sys.platform[:3]=="win".
> >
> > In fact, I've noticed Guido use this idiom (both stand-alone, and as: if
> > sys.platform[:3] in ["win", "mac"])
> >
> > It will no doubt cause a bit of pain, but IMO it is cleaner...
> 
> Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
> symbol is defined even on Win64 systems -- to test for Win64, you must
> test the _WIN64 symbol.  The two variants are more similar than they
> are different.
> 
> While testing sys.platform isn't quite the same thing, I think that
> the same reasoning goes: a win64 system is everything that a win32
> system is, and then some.
> 
> So I'd vote for leaving sys.platform alone (i.e. "win32" in both
> cases), and providing another way to test for win64-ness.

Just curious, what's the output of platform.py on Win64 ?
(You can download platform.py from my Python Pages.)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake at acm.org  Tue May  9 20:53:37 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 9 May 2000 14:53:37 -0400 (EDT)
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005091108.HAA22983@eric.cnri.reston.va.us>
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>
	<200005091108.HAA22983@eric.cnri.reston.va.us>
Message-ID: <14616.24241.26240.247048@seahag.cnri.reston.va.us>

Guido van Rossum writes:
 > Another view: win32 was my way of saying the union of Windows 95,
 > Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
 > platforms.  If Windows 2000 is sufficiently different to the user, it
 > deserves a different platform id (win2000?).
 > 
 > Is there a connection between Windows 2000 and _WIN64?

  Since no one else has responded, here's some stuff from MS on the
topic of Win64:

http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp

This document talks only of the Itanium (IA64) processor, and doesn't
mention the Alpha at all.  I know the NT shipping on Alpha machines is
Win32, though the actual application code can be 64-bit (think "32-bit
Solaris on an Ultra"); just the system APIs are 32 bits.
  The last link on the page links to some more detailed technical
information on moving application code to Win64.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From guido at python.org  Tue May  9 20:57:21 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 09 May 2000 14:57:21 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 14:53:37 EDT."
             <14616.24241.26240.247048@seahag.cnri.reston.va.us> 
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> <200005091108.HAA22983@eric.cnri.reston.va.us>  
            <14616.24241.26240.247048@seahag.cnri.reston.va.us> 
Message-ID: <200005091857.OAA24731@eric.cnri.reston.va.us>

>   Since no one else has responded, here's some stuff from MS on the
> topic of Win64:
> 
> http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp

Thanks, this makes more sense.  I guess that Trent's interest in Win64
has to do with an early shipment of Itaniums that ActiveState might
have received. :-)

The document confirms my feeling that WIN64 vs WIN32, unlike WIN32 vs
WIN16, is mostly a compiler issue, and not a user experience or OS
functionality issue.  The table lists increased limits, not new
software subsystems.

So I still think that sys.platform should be 'win32', to avoid
breaking existing apps.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From gstein at lyra.org  Tue May  9 20:56:34 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 9 May 2000 11:56:34 -0700 (PDT)
Subject: [Python-Dev] win64 (was: [Patches] PC\config.[hc] changes for Win64)
In-Reply-To: <14616.24241.26240.247048@seahag.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Fred L. Drake, Jr. wrote:
> Guido van Rossum writes:
>  > Another view: win32 was my way of saying the union of Windows 95,
>  > Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
>  > platforms.  If Windows 2000 is sufficiently different to the user, it
>  > deserves a different platform id (win2000?).
>  > 
>  > Is there a connection between Windows 2000 and _WIN64?
> 
>   Since no one else has responded, here's some stuff from MS on the
> topic of Win64:
> 
> http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp
> 
> This document talks only of the Itanium (IA64) processor, and doesn't
> mention the Alpha at all.  I know the NT shipping on Alpha machines is
> Win32, though the actual application code can be 64-bit (think "32-bit
> Solaris on an Ultra"); just the system APIs are 32 bits.

Windows is no longer made/sold for the Alpha processor. That was canned in
August of '99, I believe. Possibly August 98.

Basically, Windows is just the x86 family, and Win/CE for various embedded
processors.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From fdrake at acm.org  Tue May  9 21:06:49 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 9 May 2000 15:06:49 -0400 (EDT)
Subject: [Python-Dev] Re: win64 (was: [Patches] PC\config.[hc] changes for Win64)
In-Reply-To: <Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>
References: <14616.24241.26240.247048@seahag.cnri.reston.va.us>
	<Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>
Message-ID: <14616.25033.883165.800216@seahag.cnri.reston.va.us>

Greg Stein writes:
 > Windows is no longer made/sold for the Alpha processor. That was canned in
 > August of '99, I believe. Possibly August 98.

  <sigh/>


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From trentm at activestate.com  Tue May  9 21:49:57 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 9 May 2000 12:49:57 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005091857.OAA24731@eric.cnri.reston.va.us>
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> <200005091108.HAA22983@eric.cnri.reston.va.us> <14616.24241.26240.247048@seahag.cnri.reston.va.us> <200005091857.OAA24731@eric.cnri.reston.va.us>
Message-ID: <20000509124957.A21838@activestate.com>

> Thanks, this makes more sense.  I guess that Trent's interest in Win64
> has to do with an early shipment of Itaniums that ActiveState might
> have received. :-)

Could be.... Or maybe we don't have any Itanium boxes. :)

Here is a good link on MSDN:

Getting Ready for 64-bit Windows
http://msdn.microsoft.com/library/psdk/buildapp/64bitwin_410z.htm

More specifically this (presuming it is being kept up to date) documents the
changes to the Win32 API for 64-bit Windows:
http://msdn.microsoft.com/library/psdk/buildapp/64bitwin_9xo3.htm
I am not a Windows programmer, but the changes are pretty minimal.

Summary:

Points for sys.platform == "win32" on Win64:
Pros:
- will not break existing sys.platform checks
- it would be nicer for the casual Python programmer to have platform issues
  hidden, therefore one symbol for the common Windows OSes is more of the
  Pythonic ideal than "the first three characters of the platform string are
  'win'".
Cons:
- may need to add some other mechanism to differentiate Win32 and Win64 in
  Python code
- "win32" is a little misleading in that it refers to an API supported on
  Win32 and Win64 ("windows" would be more accurate, but too late for that)
  

Points for sys.platform == "win64" on Win64:
Pros:
- seems logically cleaner, given that the Win64 API may diverge from the
  Win32 API and there is no other current mechanism to differentiate Win32 and
  Win64 in Python code
Cons:
- may break existing sys.platform checks when run on Win64


Opinion:

I see the two choices ("win32" or "win64") as a trade off between:
- Use "win32" because a common user experience should translate to a common
  way to check for that environment, i.e. one value for sys.platform.
  Unfortunately we are stuck with "win32" instead of something like
  "windows".
- Use "win64" because it is not a big deal for the user to check for
  sys.platform[:3]=="win" and this way a mechanism exists to differentiate
  btwn Win32 and Win64 should it be necessary.

I am inclined to pick "win32" because:

1. While it may be confusing to the Python scriptor on Win64 that he has to
   check for win*32*, that is something that he will learn the first time. It
   is better than the alternative of the scriptor happily using "win64" and
   then that code not running on Win32 for no good reason. 
2. The main question is: is Win64 so much more like Win32 than different from
   it that the common-case general Python programmer should not ever have to
   make the differentiation in his Python code? Or, at least, enough so that
   such differentiation by the Python scriptor is rare enough that some other
   provided mechanism is sufficient (even preferable).
3. Guido has expressed that he favours this option. :) 

Then change "win32" to "windows" in Py3K.



Trent

-- 
Trent Mick
trentm at activestate.com



From trentm at activestate.com  Tue May  9 22:05:53 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 9 May 2000 13:05:53 -0700
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <000101bfb983$7a34d3c0$592d153f@tim>
References: <200005090224.WAA22457@eric.cnri.reston.va.us> <000101bfb983$7a34d3c0$592d153f@tim>
Message-ID: <20000509130553.D21443@activestate.com>

[Trent]
> > Another option might be to document 'b' as for unsigned chars and 'h', 'i',
> > 'l' as signed integral values and then set the bounds checks ([0,
> > UCHAR_MAX]
> > for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
> > limiting the user to unsigned or signed depending on the formatter. (Which
> > again, means that it would be nice to have different formatters for signed
> > and unsigned.) I think that the bounds checking is false security unless
> > these restrictions are made.
[guido]
> 
> I like this: 'b' is unsigned, the others are signed.

Okay, I will submit a patch for this then. The 'b' formatter will limit
values to [0, UCHAR_MAX].

> [Trent]
> > 1. There are a fairly large number of downcasting cases in the
> > Python code (not necessarily tied to PyArg_ParseTuple results). I
> > was wondering if you think a generalized check on each such
> > downcast would be advisable. This would take the form of some macro
> > that would do a bounds check before doing the cast. For example (a
> > common one is the cast of strlen's size_t return value to int,
> > because Python strings use int for their length, this is a downcast
> > on 64-bit systems):
> >
> >   size_t len = strlen(s);
> >   obj = PyString_FromStringAndSize(s, len);
> >
> > would become
> >
> >   size_t len = strlen(s);
> >   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
> >
> > CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> > exception otherwise.
> 
> [Guido]
> > How would the CAST_TO_INT macro signal an error?  C doesn't have
> > exceptions.  If we have to add checks, I'd prefer to write
> >
> >   size_t len = strlen(s);
> >   if (INT_OVERFLOW(len))
> >      return NULL; /* Or whatever is appropriate in this context */
> >   obj = PyString_FromStringAndSize(s, len);
> 
[Tim]
> Of course we have to add checks -- strlen doesn't return an int!  It hasn't
> since about a year after Python was first written (ANSI C changed the rules,
> and Python is long overdue in catching up -- if you want people to stop
> passing multiple args to append, set a good example in our use of C <0.5
> wink>).
>
> The
> check isn't needed anywhere we can prove a priori that it could never fail
> (in which case we're not assuming anything), but it's always needed when we
> can't so prove (in which case skipping the check would be a bad
> assumption).
> In the absence of any context, your strlen example above definitely needs
> the check.
>

Okay, I just wanted a go-ahead that this kind of thing was desired. I will
try to find the points where these overflows *can* happen and then I'll add
checks in a manner closer to Guido's syntax above.

> 
> [Trent]
> > I would like other people's opinion on this kind of change.
> > There are three possible answers:
> 
> Please don't change the rating scheme we've been using:  -1 is a veto, +1 is
> a hurrah, -0 and +0 are obvious <ahem>.
> 
> >   +1 this is a bad change idea because...<reason>
> >   -1 this is a good idea, go for it
> 
Whoa, sorry Tim. I mixed up the +/- there. I did not intend to change the
voting system.

[Tim]
> An alternative would be to promote the size member from int to size_t;
> that's no actual change on the 32-bit machines Guido generally assumes
> without realizing it, and removes an arbitrary (albeit defensible)
> limitation on some 64-bit machines at the cost of (just possibly, due to
> alignment vagaries) boosting var objects' header size on the latter.
> 
I agree with Guido that this is too big an immediate change. I'll just try to
find and catch the possible overflows.


Thanks,
Trent

-- 
Trent Mick
trentm at activestate.com



From gstein at lyra.org  Tue May  9 22:14:19 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 9 May 2000 13:14:19 -0700 (PDT)
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c)
In-Reply-To: <200005091953.PAA28201@seahag.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Fred Drake wrote:
> Update of /projects/cvsroot/python/dist/src/Objects
> In directory seahag.cnri.reston.va.us:/home/fdrake/projects/python/Objects
> 
> Modified Files:
> 	unicodeobject.c 
> Log Message:
> 
> M.-A. Lemburg <mal at lemburg.com>:
> Added support for user settable default encodings. The
> current implementation uses a per-process global which
> defines the value of the encoding parameter in case it
> is set to NULL (meaning: use the default encoding).

Umm... maybe I missed something, but I thought there were pretty broad
feelings *against* having a global like this. This kind of thing is just
nasty.

1) Python modules can't change it, nor can they rely on it being a
   particular value
2) a mutable, global variable is just plain wrong. The InterpreterState
   and ThreadState structures were created *specifically* to avoid adding
   crap variables like this.
3) allowing a default other than utf-8 is sure to cause gotchas and
   surprises. Some code is going to rightly assume that the default is
   just that, but be horribly broken when an application changes it.

Somebody please say this is hugely experimental. And then say why it isn't
just a private patch, rather than sitting in CVS.

:-(

-g

-- 
Greg Stein, http://www.lyra.org/




From guido at python.org  Tue May  9 22:24:05 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 09 May 2000 16:24:05 -0400
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c)
In-Reply-To: Your message of "Tue, 09 May 2000 13:14:19 PDT."
             <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org> 
References: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org> 
Message-ID: <200005092024.QAA25835@eric.cnri.reston.va.us>

> Umm... maybe I missed something, but I thought there were pretty broad
> feelings *against* having a global like this. This kind of thing is just
> nasty.
> 
> 1) Python modules can't change it, nor can they rely on it being a
>    particular value
> 2) a mutable, global variable is just plain wrong. The InterpreterState
>    and ThreadState structures were created *specifically* to avoid adding
>    crap variables like this.
> 3) allowing a default other than utf-8 is sure to cause gotchas and
>    surprises. Some code is going to rightly assume that the default is
>    just that, but be horribly broken when an application changes it.
> 
> Somebody please say this is hugely experimental. And then say why it isn't
> just a private patch, rather than sitting in CVS.

Watch your language.

Marc did this at my request.  It is my intention that the encoding be
hardcoded at compile time.  But while there's a discussion going about
what the hardcoded encoding should *be*, it would seem handy to have a
quick way to experiment.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Tue May  9 22:33:40 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 9 May 2000 13:33:40 -0700 (PDT)
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ...
 unicodeobject.c)
In-Reply-To: <200005092024.QAA25835@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091331300.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Guido van Rossum wrote:
>...
> Watch your language.

Yes, Dad :-) Sorry...

> Marc did this at my request.  It is my intention that the encoding be
> hardcoded at compile time.  But while there's a discussion going about
> what the hardcoded encoding should *be*, it would seem handy to have a
> quick way to experiment.

Okee dokee... That was one of my questions: is this experimental or not?

It is still a bit frightening, though, if it might get left in there, for
the reasons I listed (to name a few) ... :-(

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Tue May  9 23:35:16 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 09 May 2000 23:35:16 +0200
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... 
 unicodeobject.c)
References: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org> <200005092024.QAA25835@eric.cnri.reston.va.us>
Message-ID: <39188494.61424A7@lemburg.com>

Guido van Rossum wrote:
> 
> > Umm... maybe I missed something, but I thought there were pretty broad
> > feelings *against* having a global like this. This kind of thing is just
> > nasty.
> >
> > 1) Python modules can't change it, nor can they rely on it being a
> >    particular value
> > 2) a mutable, global variable is just plain wrong. The InterpreterState
> >    and ThreadState structures were created *specifically* to avoid adding
> >    crap variables like this.
> > 3) allowing a default other than utf-8 is sure to cause gotchas and
> >    surprises. Some code is going to rightly assume that the default is
> >    just that, but be horribly broken when an application changes it.

Hmm, the patch notice says it all I guess:

This patch fixes a few bugglets and adds an experimental
feature which allows setting the string encoding assumed
by the Unicode implementation at run-time.

The current implementation uses a process global for
the string encoding. This should subsequently be changed
to a thread state variable, so that the setting can
be done on a per thread basis.

Note that only the coercions from strings to Unicode
are affected by the encoding parameter. The "s" parser
marker still returns UTF-8. (str(unicode) also returns
the string encoding -- unlike what I wrote in the original
patch notice.)

The main intent of this patch is to provide a test
bed for the ongoing Unicode debate, e.g. to have the
implementation use 'latin-1' as default string encoding,
put

import sys
sys.set_string_encoding('latin-1')

in your site.py file.

> > Somebody please say this is hugely experimental. And then say why it isn't
> > just a private patch, rather than sitting in CVS.
> 
> Watch your language.
> 
> Marc did this at my request.  It is my intention that the encoding be
> hardcoded at compile time.  But while there's a discussion going about
> what the hardcoded encoding should *be*, it would seem handy to have a
> quick way to experiment.

Right and that's what the intent was behind adding a global
and some APIs to change it first... there are a few ways this
could one day get finalized:

1. hardcode the encoding (UTF-8 was previously hard-coded)
2. make the encoding a compile time option
3. make the encoding a per-process option
4. make the encoding a per-thread option
5. make the encoding a per-process setting which is deduced
   from env. vars such as LC_ALL, LC_CTYPE, LANG or system
   APIs which can be used to get at the currently
   active local encoding

Note that I have named the APIs sys.get/set_string_encoding()...
I've done that on purpose, because I have a feeling that
changing the conversion from Unicode to strings from UTF-8
to an encoding not capable of representing all Unicode
characters won't get us very far. Also, changing this is
rather tricky due to the way the buffer API works.

The other way around needs some experimenting though and this
is what the patch implements: it allows you to change the
string encoding assumption to test various
possibilities, e.g. ascii, latin-1, unicode-escape,
<your favourite local encoding> etc. without having to
recompile the interpreter every time.

Have fun with it :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mhammond at skippinet.com.au  Wed May 10 00:58:19 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 10 May 2000 08:58:19 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <20000509124957.A21838@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBAEJFCKAA.mhammond@skippinet.com.au>

Geez - Fred is posting links to the MS site, and I'm battling ipchains and
DHCP on my newly installed Debian box - what is this world coming to!?!?!

> I am inclined to pick "win32" because:

OK - I'm sold.

Mark.




From nhodgson at bigpond.net.au  Wed May 10 01:17:27 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Wed, 10 May 2000 09:17:27 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>
Message-ID: <009a01bfba0c$bdf13a20$e3cb8490@neil>

> Now all we need is "win64s" - it will respond to Neil's criticism that
> mixed mode programs are a pain, and MS will tell us that "win64s" will
> solve all our problems, and allow win32 to run 64 bit programs well into
> the future.  Until everyone in the world realizes it sucks, and MS
promptly
> says it was only ever a hack in the first place, and everyone should be on
> Win64 by now anyway :-)

   Maybe someone has made noise about this before I joined the discussion,
but I see the absence of a mixed mode as a big problem for users. I don't
think that there will be the "quick, clean" migration from 32 to 64 that
there was for 16 to 32. It doesn't offer that much for most applications. So
there will need to be both 32 bit and 64 bit versions of Python present on
machines. With duplicated libraries. Each DLL should be available in both 32
and 64 bit form. The IDEs will have to be available in both forms as they
are loading, running and debugging code of either width. Users will have to
remember to run a different Python if they are using libraries of the
non-default width.

   Neil





From czupancic at beopen.com  Wed May 10 01:44:20 2000
From: czupancic at beopen.com (Christian Zupancic)
Date: Tue, 09 May 2000 16:44:20 -0700
Subject: [Python-Dev] Python Query
Message-ID: <3918A2D4.B0FE7DDF@beopen.com>

======================================================================
Greetings Python Developers,

Please participate in a small survey about Python for BeOpen.com that we
are conducting with the guidance of our advisor, and the creator of
Python, Guido van Rossum. In return for answering just five short
questions, I will mail you up to three (3) BeOpen T-shirts-- highly
esteemed by select trade-show attendees as "really cool". In addition,
three lucky survey participants will receive a Life-Size Inflatable
Penguin (as they say, "very cool").

- Why do you prefer Python over other languages, e.g. Perl?


- What do you consider to be (a) competitor(s) to Python?


- What are Python's strong points and weaknesses?


- What other languages do you program in?


- If you had one wish about Python, what would it be?


- For Monty Python fans only:
What is the average airspeed of a swallow (European, non-migratory)?

 THANKS! That wasn't so bad, was it?  Make sure you've attached a
business card or address of some sort so I know where to send your
prizes.

Best Regards,
Christian Zupancic
Market Analyst, BeOpen.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: czupancic.vcf
Type: text/x-vcard
Size: 146 bytes
Desc: Card for Christian Zupancic
URL: <http://mail.python.org/pipermail/python-dev/attachments/20000509/44f93500/attachment.vcf>

From trentm at activestate.com  Wed May 10 01:45:36 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 9 May 2000 16:45:36 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <3917D5D3.A8CD1B3E@lemburg.com>
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us> <3917D5D3.A8CD1B3E@lemburg.com>
Message-ID: <20000509164536.A31366@activestate.com>

On Tue, May 09, 2000 at 11:09:40AM +0200, M.-A. Lemburg wrote:
> Just curious, what's the output of platform.py on Win64 ?
> (You can download platform.py from my Python Pages.)

I get the following:

"""
The system cannot find the path specified
win64-32bit
"""

Sorry, I did not hunt down the "path" error message.

Trent

-- 
Trent Mick
trentm at activestate.com



From tim_one at email.msn.com  Wed May 10 06:53:20 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 10 May 2000 00:53:20 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <009a01bfba0c$bdf13a20$e3cb8490@neil>
Message-ID: <000301bfba3b$a9e11300$022d153f@tim>

[Neil Hodgson]
>    Maybe someone has made noise about this before I joined the
> discussion, but I see the absence of a mixed mode being a big
> problem for users. ...

Intel doesn't -- they're not positioning Itanium for the consumer market.
They're going after the high-performance server market with this, and most
signs are that MS is too.

> ...
> It doesn't offer that much for most applications.

Bingo.

plenty-of-time-to-panic-later-if-end-users-ever-care-ly y'rs  - tim





From mal at lemburg.com  Wed May 10 09:47:43 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 May 2000 09:47:43 +0200
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us> <3917D5D3.A8CD1B3E@lemburg.com> <20000509164536.A31366@activestate.com>
Message-ID: <3919141F.89DC215E@lemburg.com>

Trent Mick wrote:
> 
> On Tue, May 09, 2000 at 11:09:40AM +0200, M.-A. Lemburg wrote:
> > Just curious, what's the output of platform.py on Win64 ?
> > (You can download platform.py from my Python Pages.)
> 
> I get the following:
> 
> """
> The system cannot find the path specified

Hmm, this probably originates from platform.py trying
to find the "file" command which is used on Unix.

> win64-32bit

Now this looks interesting ... 32-bit Win64 ;-)

> """
> 
> Sorry, I did not hunt down the "path" error message.
> 
> Trent
> 
> --
> Trent Mick
> trentm at activestate.com
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From guido at python.org  Wed May 10 18:52:49 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 10 May 2000 12:52:49 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Tools/idle browser.py,NONE,1.1
In-Reply-To: Your message of "Wed, 10 May 2000 12:47:30 EDT."
             <200005101647.MAA30408@seahag.cnri.reston.va.us> 
References: <200005101647.MAA30408@seahag.cnri.reston.va.us> 
Message-ID: <200005101652.MAA28936@eric.cnri.reston.va.us>

Fred,

"browser" is a particularly non-descriptive name for this module.

Perhaps it's not too late to rename it to e.g. "BrowserControl"?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Wed May 10 22:14:46 2000
From: trentm at activestate.com (Trent Mick)
Date: Wed, 10 May 2000 13:14:46 -0700
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <000201bfba3b$a74ad7c0$022d153f@tim>
References: <20000509162504.A31192@activestate.com> <000201bfba3b$a74ad7c0$022d153f@tim>
Message-ID: <20000510131446.A25926@activestate.com>

On Wed, May 10, 2000 at 12:53:16AM -0400, Tim Peters wrote:
> [Trent Mick]
> > Discussion:
> >
> > Okay, it is debatable to call float_hash and complex_hash broken,
> > but their code presumed that sizeof(long) was 32-bits. As a result
> > the hashed values for floats and complex values were not the same
> > on a 64-bit *nix system as on a 32-bit *nix system. With this
> > patch they are.
> 
> The goal is laudable but the analysis seems flawed.  For example, this new
> comment:

Firstly, I should have admitted my ignorance with regard to hash functions.


> Looks to me like the real problem in the original was here:
> 
>     x = hipart + (long)fractpart + (long)intpart + (expo << 15);
>                                    ^^^^^^^^^^^^^
> 
> The difficulty is that intpart may *not* fit in 32 bits, so the cast of
> intpart to long is ill-defined when sizeof(long) == 4.

> 
> That is, the hash function truly is broken for "large" values with a
> fractional part, and I expect your after-patch code suffers the same
> problem: 

Yes it did.


> The
> solution to this is to break intpart in this branch into pieces no larger
> than 32 bits too 

Okay, here is another try (only for floatobject.c) for discussion. If it looks
good then I will submit a patch for both the float and complex objects. The
idea is to do the same for 'intpart' as was done for 'fractpart'.


static long
float_hash(v)
    PyFloatObject *v;
{
    double intpart, fractpart;
    long x;

    fractpart = modf(v->ob_fval, &intpart);

    if (fractpart == 0.0) {
		// ... snip ...
    }
    else {
        int expo;
        long hipart;

        fractpart = frexp(fractpart, &expo);
        fractpart = fractpart * 2147483648.0; 
        hipart = (long)fractpart; 
        fractpart = (fractpart - (double)hipart) * 2147483648.0;

        x = hipart + (long)fractpart + (expo << 15); /* combine the fract parts */

        intpart = frexp(intpart, &expo);
        intpart = intpart * 2147483648.0;
        hipart = (long)intpart;
        intpart = (intpart - (double)hipart) * 2147483648.0;

        x += hipart + (long)intpart + (expo << 15); /* add in the int parts */
    }
    if (x == -1)
        x = -2;
    return x;
}




> Note this consequence under the Win32 Python:

With this change, on Linux32:

>>> base = 2.**40 + 0.5
>>> base
1099511627776.5
>>> for i in range(32, 45):
...     x = base + 2.**i
...     print x, hash(x)
...
1.10380659507e+12 -2141945856
1.10810156237e+12 -2137751552
1.11669149696e+12 -2129362944
1.13387136614e+12 -2112585728
1.16823110451e+12 -2079031296
1.23695058125e+12 -2011922432
1.37438953472e+12 -1877704704
1.64926744166e+12 -1609269248
2.19902325555e+12 -2146107392
3.29853488333e+12 -1609236480
5.49755813888e+12 -1877639168
9.89560464998e+12 -2011824128
1.86916976722e+13 -2078900224


On Linux64:

>>> base = 2.**40 + 0.5
>>> base
1099511627776.5
>>> for i in range(32, 45):
...     x = base + 2.**i
...     print x, hash(x)
...
1.10380659507e+12 2153021440
1.10810156237e+12 2157215744
1.11669149696e+12 2165604352
1.13387136614e+12 2182381568
1.16823110451e+12 2215936000
1.23695058125e+12 2283044864
1.37438953472e+12 2417262592
1.64926744166e+12 2685698048
2.19902325555e+12 2148859904
3.29853488333e+12 2685730816
5.49755813888e+12 2417328128
9.89560464998e+12 2283143168
1.86916976722e+13 2216067072

> -- and that should also fix your 64-bit woes "by magic".
> 

As you can see it did not, but for another reason. The summation of the parts
overflows 'x'. Is this a problem? I.e., does it matter if a hash function
returns an overflowed integral value (my hash function ignorance is showing)?
And if this does not matter, does it matter that a hash returns different
values on different platforms?


> a hash function should never ignore any bit in its input. 

Which brings up a question regarding instance_hash(), func_hash(),
meth_hash(), HKEY_hash() [or whatever it is called], and others which cast a
pointer to a long (discarding the upper half of the pointer on Win64). Do
these really need to be fixed? Am I nitpicking too much on this whole thing?


Thanks,
Trent

-- 
Trent Mick
trentm at activestate.com



From tim_one at email.msn.com  Thu May 11 06:13:29 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 May 2000 00:13:29 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <20000510131446.A25926@activestate.com>
Message-ID: <000b01bfbaff$43d320c0$2aa0143f@tim>

[Trent Mick]
> ...
> Okay here is another try (only for floatobject.c) for discussion.
> If it looks good then I will submit a patch for float and complex
> objects. So do the same for 'intpart' as was done for 'fractpart'.
>
>
> static long
> float_hash(v)
>     PyFloatObject *v;
> {
>     double intpart, fractpart;
>     long x;
>
>     fractpart = modf(v->ob_fval, &intpart);
>
>     if (fractpart == 0.0) {
> 		// ... snip ...
>     }
>     else {
>         int expo;
>         long hipart;
>
>         fractpart = frexp(fractpart, &expo);
>         fractpart = fractpart * 2147483648.0;

It's OK to use "*=" in C <wink>.

Would like a comment that this is 2**31 (which makes the code obvious <wink>
instead of mysterious).  A comment block at the top would help too, like

/* Use frexp to get at the bits in intpart and fractpart.
 * Since the VAX D double format has 56 mantissa bits, which is the
 * most of any double format in use, each of these parts may have as
 * many as (but no more than) 56 significant bits.
 * So, assuming sizeof(long) >= 4, each part can be broken into two longs;
 * frexp and multiplication are used to do that.
 * Also, since the Cray double format has 15 exponent bits, which is the
 * most of any double format in use, shifting the exponent field left by
 * 15 won't overflow a long (again assuming sizeof(long) >= 4).
 */

And this code has gotten messy enough that it's probably better to pkg it in
a utility function rather than duplicate it.
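
Purely as illustration, one possible shape for such a utility function --
the name is invented, and it assumes sizeof(long) >= 4 as in the comment
block above.  It folds the mantissa bits of one double part into a long and
hands the exponent back, leaving it to the caller to mix in (expo << 15):

    #include <math.h>

    static long
    hash_mantissa(double part, int *expo)
    {
        long hipart;

        part = frexp(part, expo);        /* |part| in [0.5, 1), or 0 */
        part *= 2147483648.0;            /* 2**31: peel off 31 bits */
        hipart = (long)part;
        part = (part - (double)hipart) * 2147483648.0;  /* next 31 bits */
        return hipart + (long)part;
    }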

Another approach would be to play with the bits directly, via casting
tricks.  But then you have to wrestle with platform crap like endianness.

>         hipart = (long)fractpart;
>         fractpart = (fractpart - (double)hipart) * 2147483648.0;
>
>         x = hipart + (long)fractpart + (expo << 15); /* combine
> the fract parts */
>
>         intpart = frexp(intpart, &expo);
>         intpart = intpart * 2147483648.0;
>         hipart = (long)intpart;
>         intpart = (intpart - (double)hipart) * 2147483648.0;
>
>         x += hipart + (long)intpart + (expo << 15); /* add in the
> int parts */

There's no point adding in (expo << 15) a second time.

> With this change, on Linux32:
> ...
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 -2141945856
> 1.10810156237e+12 -2137751552
> 1.11669149696e+12 -2129362944
> 1.13387136614e+12 -2112585728
> 1.16823110451e+12 -2079031296
> 1.23695058125e+12 -2011922432
> 1.37438953472e+12 -1877704704
> 1.64926744166e+12 -1609269248
> 2.19902325555e+12 -2146107392
> 3.29853488333e+12 -1609236480
> 5.49755813888e+12 -1877639168
> 9.89560464998e+12 -2011824128
> 1.86916976722e+13 -2078900224
>
>
> On Linux64:
>
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 2153021440
> 1.10810156237e+12 2157215744
> 1.11669149696e+12 2165604352
> 1.13387136614e+12 2182381568
> 1.16823110451e+12 2215936000
> 1.23695058125e+12 2283044864
> 1.37438953472e+12 2417262592
> 1.64926744166e+12 2685698048
> 2.19902325555e+12 2148859904
> 3.29853488333e+12 2685730816
> 5.49755813888e+12 2417328128
> 9.89560464998e+12 2283143168
> 1.86916976722e+13 2216067072

>> -- and that should also fix your 64-bit woes "by magic".

> As you can see it did not, but for another reason.

I read your original complaint as that hash(double) yielded different
results between two *64* bit platforms (Linux64 vs Win64), but what you
showed above appears to be a comparison between a 64-bit platform and a
32-bit platform, and where presumably sizeof(long) is 8 on the former but 4
on the latter.  If so, of *course* results may be different:  hash returns a
C long, and they're different sizes across these platforms.

In any case, the results above aren't really different!

>>> hex(-2141945856)  # 1st result from Linux32
'0x80548000'
>>> hex(2153021440L)  # 1st result from Linux64
'0x80548000L'
>>>

That is, the bits are the same.  How much more do you want from me <wink>?

> The summation of the parts overflows 'x'. Is this a problem? I.e., does
> it matter if a hash function returns an overflowed integral value (my
> hash function ignorance is showing)?

Overflow generally doesn't matter.  In fact, it's usual <wink>; e.g., the
hash for strings iterates over

    x = (1000003*x) ^ *p++;

and overflows madly.  The saving grace is that C defines integer overflow in
such a way that losing the high bits on every operation yields the same
result as if the entire result were computed to infinite precision and the
high bits tossed only at the end.  So overflow doesn't hurt this from being
as reproducible as possible, given that Python's int size is different.

Overflow can be avoided by using xor instead of addition, but addition is
generally preferred because it helps to "scramble" the bits a little more.

> And if this does not matter, does it matter that a hash returns different
> values on different platforms?

No, and it doesn't always stay the same from release to release on a single
platform.  For example, your patch above will change hash(double) on Win32!

>> a hash function should never ignore any bit in its input.

> Which brings up a question regarding instance_hash(), func_hash(),
> meth_hash(), HKEY_hash() [or whatever it is called], and other
> which cast a pointer to a long (discarding the upperhalf of the
> pointer on Win64). Do these really need to be fixed. Am I nitpicking
> too much on this whole thing?

I have to apologize (although only semi-sincerely) for not being meaner
about this when I did the first 64-bit port.  I did that for my own use, and
avoided the problem areas rather than fix them.  But unless a language dies,
you end up paying for every hole in the end, and the sooner they're plugged
the less it costs.

That is, no, you're not nitpicking too much!  Everyone else probably thinks
you are <wink>, *but*, they're not running on 64-bit platforms yet so these
issues are still invisible to their gut radar.  I'll bet your life that
every hole remaining will trip up an end user eventually -- and they're the
ones least able to deal with the "mysterious problems".






From guido at python.org  Thu May 11 15:01:10 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 11 May 2000 09:01:10 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: Your message of "Thu, 11 May 2000 00:13:29 EDT."
             <000b01bfbaff$43d320c0$2aa0143f@tim> 
References: <000b01bfbaff$43d320c0$2aa0143f@tim> 
Message-ID: <200005111301.JAA00512@eric.cnri.reston.va.us>

I have to admit I have no clue about the details of this debate any
more, and I'm cowardly awaiting a patch submission that Tim approves
of.  (I'm hoping a day will come when Tim can check it in himself. :-)

In the mean time, I'd like to emphasize the key invariant here: we
must ensure that (a==b) => (hash(a)==hash(b)).  One quick way to deal
with this could be the following pseudo C:

    PyObject *double_hash(double x)
    {
        long l = (long)x;
        if ((double)l == x)
	    return long_hash(l);
	...double-specific code...
    }

This code makes one assumption: that if there exists a long l equal to
a double x, the cast (long)x should yield l...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Fri May 12 00:14:45 2000
From: trentm at activestate.com (Trent Mick)
Date: Thu, 11 May 2000 15:14:45 -0700
Subject: [Python-Dev] testing the C API in the test suite (was: bug in PyLong_FromLongLong (PR#324))
In-Reply-To: <200005111323.JAA00637@eric.cnri.reston.va.us>
References: <200005111323.JAA00637@eric.cnri.reston.va.us>
Message-ID: <20000511151445.B15936@activestate.com>

> Date:    Wed, 10 May 2000 15:37:30 -0400
> From:    Thomas.Malik at t-online.de
> To:      python-bugs-list at python.org
> cc:      bugs-py at python.org
> Subject: [Python-bugs-list] bug in PyLong_FromLongLong (PR#324)
> 
> Full_Name: Thomas Malik
> Version: 1.5.2
> OS: all
> Submission from: p3e9ed447.dip.t-dialin.net (62.158.212.71)
> 
> 
> there's a bug in PyLong_FromLongLong, resulting in truncation of negative
> 64 bit integers. PyLong_FromLongLong starts with:
> 	if( ival <= (LONG_LONG)LONG_MAX ) {
> 		return PyLong_FromLong( (long)ival );
> 	}
> 	else if( ival <= (unsigned LONG_LONG)ULONG_MAX ) {
> 		return PyLong_FromUnsignedLong( (unsigned long)ival );
> 	}
> 	else {
>              ....
> 
> Now, if ival is smaller than -LONG_MAX, it falls outside the long integer
> range (being a 64 bit negative integer), but gets handled by the first
> if-then-case in above code ('cause it is, of course, smaller than LONG_MAX).
> This results in
> truncation of the 64 bit negative integer to a more or less arbitrary 32 bit
> number. The way to fix it is to compare the absolute value of ival against
> LONG_MAX in the first condition. The second condition (ULONG_MAX) must, at
> least, check whether ival is positive.
> 
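
For concreteness, a sketch of one way to spell the fix being described
above -- untested and only illustrative; it uses a range test rather than
an absolute value, and the elided branch is unchanged:

    if( ival >= (LONG_LONG)LONG_MIN && ival <= (LONG_LONG)LONG_MAX ) {
        /* fits in a signed long, negative values included */
        return PyLong_FromLong( (long)ival );
    }
    else if( ival >= 0 &&
             (unsigned LONG_LONG)ival <= (unsigned LONG_LONG)ULONG_MAX ) {
        /* non-negative and fits in an unsigned long */
        return PyLong_FromUnsignedLong( (unsigned long)ival );
    }
    else {
        ....
    }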

To test this error I found the easiest way was to make a C extension module
to Python that called the C API functions under test directly. I can't
quickly think of a way I could have shown this error *clearly* at the Python
level without a specialized extension module. This has been true for other
things that I have been testing.

Would it make sense to create a standard extension module (called '__test' or
something like that) in which direct tests on the C API could be made? This
would be hooked into the standard testsuite via a test_capi.py that would:
- import __test
- run every exported function in __test (or every one starting with 'test_',
  or whatever)
- the ImportError could continue to be used to signify skipping, etc
  (although, I think that a new, more explicit TestSuiteError class would be
  more appropriate and clear)

Does something like this already exist that I am missing?

This would make testing some things a lot easier, and clearer. Where
some interface is exposed to the Python programmer it is appropriate to test
it at the Python level. Python also provides a C API and it would be
appropriate to test that at the C level.
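
As a purely hypothetical sketch of what such a module could look like (the
module name, function name and the particular check are all made up for
illustration -- each exported function exercises a C API call directly and
raises an exception on failure, so test_capi.py only has to call it):

    #include "Python.h"

    static PyObject *
    test_longlong_roundtrip(PyObject *self, PyObject *args)
    {
        PyObject *v;

        if (!PyArg_ParseTuple(args, ":test_longlong_roundtrip"))
            return NULL;
        /* exercise the C API directly */
        v = PyLong_FromLongLong(-1);
        if (v == NULL)
            return NULL;
        if (PyLong_AsLongLong(v) != -1) {
            Py_DECREF(v);
            PyErr_SetString(PyExc_AssertionError,
                            "PyLong_FromLongLong(-1) did not round-trip");
            return NULL;
        }
        Py_DECREF(v);
        Py_INCREF(Py_None);
        return Py_None;
    }

    static PyMethodDef test_methods[] = {
        {"test_longlong_roundtrip", test_longlong_roundtrip, METH_VARARGS},
        {NULL, NULL}
    };

    void
    init__test(void)
    {
        Py_InitModule("__test", test_methods);
    }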

I would like to hear some people's thoughts before I go off and put anything
together.

Thanks,
Trent


-- 
Trent Mick
trentm at activestate.com



From DavidA at ActiveState.com  Fri May 12 00:16:43 2000
From: DavidA at ActiveState.com (David Ascher)
Date: Thu, 11 May 2000 15:16:43 -0700
Subject: [Python-Dev] c.l.p.announce
Message-ID: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>

What's the status of comp.lang.python.announce and the 'reviving' thereof?

--david



From tim_one at email.msn.com  Fri May 12 04:58:35 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 May 2000 22:58:35 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <200005111301.JAA00512@eric.cnri.reston.va.us>
Message-ID: <000001bfbbbd$f74572c0$9ca2143f@tim>

[Guido]
> I have to admit I have no clue about the details of this debate any
> more,

Na, there's no debate here.  I believe I confused things by misunderstanding
what Trent's original claim was (sorry, Trent!), but we bumped into real
flaws in the current hash anyway (even on 32-bit machines).  I don't think
there's any actual disagreement about anything here.

> and I'm cowardly awaiting a patch submission that Tim approves
> of.

As am I <wink>.

> (I'm hoping a day will come when Tim can check it in himself. :-)

Well, all you have to do to make that happen is get a real job and then hire
me <wink>.

> In the mean time, I'd like to emphasize the key invariant here: we
> must ensure that (a==b) => (hash(a)==hash(b)).

Absolutely.  That's already true, and is so non-controversial that Trent
elided ("...") the code for that in his last post.

> One quick way to deal with this could be the following pseudo C:
>
>     PyObject *double_hash(double x)
>     {
>         long l = (long)x;
>         if ((double)l == x)
> 	    return long_hash(l);
> 	...double-specific code...
>     }
>
> This code makes one assumption: that if there exists a long l equal to
> a double x, the cast (long)x should yield l...

No, that fails on two counts:

1.  If x is "too big" to fit in a long (and a great many doubles are),
    the cast to long is undefined.  Don't know about all current platforms,
    but on the KSR platform such casts raised a fatal hardware
    exception.  The current code already accomplishes this part in a
    safe way (which Trent's patch improves by using a symbol instead of
    the current hard-coded hex constant).

2.  The key invariant needs to be preserved also when x is an exact
    integral value that happens to be (possibly very!) much bigger than
    a C long; e.g.,

>>> long(1.23e300)  # 1.23e300 is an integer! albeit not the one you think
12299999999999999456195024356787918820614965027709909500456844293279
60298864608335541984218516600989160291306221939122973741400364055485
57167627474369519296563706976894811817595986395177079943535811102573
51951343133141138298152217970719263233891682157645730823560232757272
73837119288529943287157489664L
>>> hash(1.23e300) == hash(_)
1
>>>

The current code already handles that correctly too.  All the problems occur
when the double has a non-zero fractional part, and Trent knows how to fix
that now.  hash(x) may differ across platforms because sizeof(long) differs
across platforms, but that's just as true of strings as floats (i.e., Python
has never computed platform-independent hashes -- if that bothers *you*
(doesn't bother me), that's the part you should chime in on).
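
(As an aside, a sketch of the kind of "fits in a C long?" test point 1 is
getting at -- assuming two's-complement longs, so that -(double)LONG_MIN is
an exact power of two and the comparisons themselves can't overflow:

    #include <limits.h>

    /* true iff (long)x is safe: no overflow, no undefined behaviour */
    static int
    double_fits_long(double x)
    {
        return x >= (double)LONG_MIN && x < -(double)LONG_MIN;
    }

The current code achieves the same effect with an explicit constant, as
noted in point 1.)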





From guido at python.org  Fri May 12 14:24:25 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 08:24:25 -0400
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: Your message of "Thu, 11 May 2000 15:16:43 PDT."
             <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> 
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> 
Message-ID: <200005121224.IAA06063@eric.cnri.reston.va.us>

> What's the status of comp.lang.python.announce and the 'reviving' thereof?

Good question.  Several of us here at CNRI have volunteered to become
moderators.  I think we may have to start faking Approved: headers in
the mean time...

(I wonder if we can make posts to python-announce at python.com be
forwarded to c.l.py.a with such a header automatically tacked on?)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Fri May 12 15:43:37 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 15:43:37 +0200
Subject: [Python-Dev] Unicode and its partners...
Message-ID: <391C0A89.819A33EA@lemburg.com>

It got a little silent around the 7-bit vs. 8-bit vs. UTF-8
discussion. 

Not that I would like it to restart (I think everybody has
made their point), but it kind of surprised me that now with the
ability to actually set the default string encoding at run-time,
no one seems to have played around with it...

>>> import sys
>>> sys.set_string_encoding('unicode-escape')
>>> "abc???" + u"abc"
u'abc\344\366\374abc'
>>> "abc???\u1234" + u"abc"
u'abc\344\366\374\u1234abc'
>>> print "abc???\u1234" + u"abc"
abc\344\366\374\u1234abc

Any takers ?

BTW, has anyone tried to use the codec design for other
tasks than converting text ? It should also be usable for
e.g. compressing/decompressing or other data oriented
content.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From effbot at telia.com  Fri May 12 16:25:24 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Fri, 12 May 2000 16:25:24 +0200
Subject: [Python-Dev] Unicode and its partners...
References: <391C0A89.819A33EA@lemburg.com>
Message-ID: <026901bfbc1d$efe06fc0$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> It got a little silent around the 7-bit vs. 8-bit vs. UTF-8
> discussion. 

that's only because I've promised Guido to prepare SRE
for the next alpha, before spending more time trying to
get this one done right ;-)

and as usual, the last 10% takes 90% of the effort :-(

</F>




From akuchlin at mems-exchange.org  Fri May 12 16:27:21 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Fri, 12 May 2000 10:27:21 -0400 (EDT)
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: <200005121224.IAA06063@eric.cnri.reston.va.us>
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>
	<200005121224.IAA06063@eric.cnri.reston.va.us>
Message-ID: <14620.5321.510321.341870@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>(I wonder if we can make posts to python-announce at python.com be
>forwarded to c.l.py.a with such a header automatically tacked on?)

Probably not a good idea; if the e-mail address is on the Web site, it
probably gets a certain amount of spam that would need to be filtered
out.  

--amk



From guido at python.org  Fri May 12 16:31:55 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 10:31:55 -0400
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: Your message of "Fri, 12 May 2000 10:27:21 EDT."
             <14620.5321.510321.341870@amarok.cnri.reston.va.us> 
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> <200005121224.IAA06063@eric.cnri.reston.va.us>  
            <14620.5321.510321.341870@amarok.cnri.reston.va.us> 
Message-ID: <200005121431.KAA06538@eric.cnri.reston.va.us>

> Guido van Rossum writes:
> >(I wonder if we can make posts to python-announce at python.com be
> >forwarded to c.l.py.a with such a header automatically tacked on?)
> 
> Probably not a good idea; if the e-mail address is on the Web site, it
> probably gets a certain amount of spam that would need to be filtered
> out.  

OK, let's make it a moderated mailman mailing list; we can make
everyone on python-dev (who wants to) a moderator.  Barry, is there an
easy way to add additional headers to messages posted by mailman to
the news gateway?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From jcollins at pacificnet.net  Fri May 12 17:39:28 2000
From: jcollins at pacificnet.net (Jeffery D. Collins)
Date: Fri, 12 May 2000 08:39:28 -0700
Subject: [Python-Dev] c.l.p.announce
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> <200005121224.IAA06063@eric.cnri.reston.va.us>  
	            <14620.5321.510321.341870@amarok.cnri.reston.va.us> <200005121431.KAA06538@eric.cnri.reston.va.us>
Message-ID: <391C25B0.EC327BCF@pacificnet.net>

I volunteer to moderate.

Jeff


Guido van Rossum wrote:

> > Guido van Rossum writes:
> > >(I wonder if we can make posts to python-announce at python.com be
> > >forwarded to c.l.py.a with such a header automatically tacked on?)
> >
> > Probably not a good idea; if the e-mail address is on the Web site, it
> > probably gets a certain amount of spam that would need to be filtered
> > out.
>
> OK, let's make it a moderated mailman mailing list; we can make
> everyone on python-dev (who wants to) a moderator.  Barry, is there an
> easy way to add additional headers to messages posted by mailman to
> the news gateway?
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev




From bwarsaw at python.org  Fri May 12 17:41:01 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Fri, 12 May 2000 11:41:01 -0400 (EDT)
Subject: [Python-Dev] c.l.p.announce
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>
	<200005121224.IAA06063@eric.cnri.reston.va.us>
	<14620.5321.510321.341870@amarok.cnri.reston.va.us>
	<200005121431.KAA06538@eric.cnri.reston.va.us>
Message-ID: <14620.9741.164735.998570@anthem.cnri.reston.va.us>

>>>>> "GvR" == Guido van Rossum <guido at python.org> writes:

    GvR> OK, let's make it a moderated mailman mailing list; we can
    GvR> make everyone on python-dev (who wants to) a moderator.
    GvR> Barry, is there an easy way to add additional headers to
    GvR> messages posted by mailman to the news gateway?

No, but I'll add that.  It might be a little while before I push the
changes out to python.org; I've got a bunch of things I need to test
first.

-Barry



From mal at lemburg.com  Fri May 12 17:47:55 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 17:47:55 +0200
Subject: [Python-Dev] Landmark
Message-ID: <391C27AB.2F5339D6@lemburg.com>

While trying to configure an in-package Python interpreter
I found that the interpreter still uses 'string.py' as
landmark for finding the standard library.

Since string.py is being deprecated, I think we should
consider a new landmark (such as os.py) or maybe even a
whole new strategy for finding the standard lib location.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at python.org  Fri May 12 21:04:50 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 15:04:50 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 17:47:55 +0200."
             <391C27AB.2F5339D6@lemburg.com> 
References: <391C27AB.2F5339D6@lemburg.com> 
Message-ID: <200005121904.PAA08166@eric.cnri.reston.va.us>

> While trying to configure an in-package Python interpreter
> I found that the interpreter still uses 'string.py' as
> landmark for finding the standard library.

Oops.

> Since string.py is being deprecated, I think we should
> consider a new landmark (such as os.py) or maybe even a
> whole new strategy for finding the standard lib location.

I don't see a need for a new strategy, but I'll gladly accept patches
that look for os.py.  Note that there are several versions of that
code: Modules/getpath.c, PC/getpathp.c, PC/os2vacpp/getpathp.c.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From gmcm at hypernet.com  Fri May 12 21:50:56 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 12 May 2000 15:50:56 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: <200005121904.PAA08166@eric.cnri.reston.va.us>
References: Your message of "Fri, 12 May 2000 17:47:55 +0200."             <391C27AB.2F5339D6@lemburg.com> 
Message-ID: <1253961418-52039567@hypernet.com>

[MAL]
> > Since string.py is being deprecated, I think we should
> > consider a new landmark (such as os.py) or maybe even a
> > whole new strategy for finding the standard lib location.
[GvR]
> I don't see a need for a new strategy

I'll argue for (a choice of) new strategy. The getpath & friends 
code spends a whole lot of time and energy trying to reverse 
engineer things like developer builds and strange sys-admin 
pranks. I agree that code shouldn't die. But it creates painful 
startup times when Python is being used for something like 
CGI.

How about something on the command line that says (pick 
one or come up with another choice):
 - PYTHONPATH is *it*
 - use PYTHONPATH and .pth files found <here>
 - start in <sys.prefix>/lib/python<sys.version[:3]> and add 
PYTHONPATH
 - there's a .pth file <here> with the whole list
 - pretty much any permutation of the above elements

The idea being to avoid a few hundred system calls when a 
dozen or so will suffice. Default behavior should still be to 
magically get it right.


- Gordon



From guido at python.org  Fri May 12 22:29:05 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 16:29:05 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 15:50:56 EDT."
             <1253961418-52039567@hypernet.com> 
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com>  
            <1253961418-52039567@hypernet.com> 
Message-ID: <200005122029.QAA08252@eric.cnri.reston.va.us>

> [MAL]
> > > Since string.py is being deprecated, I think we should
> > > consider a new landmark (such as os.py) or maybe even a
> > > whole new strategy for finding the standard lib location.
> [GvR]
> > I don't see a need for a new strategy
> 
> I'll argue for (a choice of) new strategy. The getpath & friends 
> code spends a whole lot of time and energy trying to reverse 
> engineer things like developer builds and strange sys-admin 
> pranks. I agree that code shouldn't die. But it creates painful 
> startup times when Python is being used for something like 
> CGI.
> 
> How about something on the command line that says (pick 
> one or come up with another choice):
>  - PYTHONPATH is *it*
>  - use PYTHONPATH and .pth files found <here>
>  - start in <sys.prefix>/lib/python<sys.version[:3]> and add 
> PYTHONPATH
>  - there's a .pth file <here> with the whole list
>  - pretty much any permutation of the above elements
> 
> The idea being to avoid a few hundred system calls when a 
> dozen or so will suffice. Default behavior should still be to 
> magically get it right.

I'm not keen on changing the meaning of PYTHONPATH, but if you're
willing and able to set an environment variable, you can set
PYTHONHOME and it will abandon the search.  If you want a command line
option for CGI, an option to set PYTHONHOME makes sense.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From weeks at golden.dtc.hp.com  Fri May 12 22:29:52 2000
From: weeks at golden.dtc.hp.com ( (Greg Weeks))
Date: Fri, 12 May 2000 13:29:52 -0700
Subject: [Python-Dev] "is", "==", and sameness
Message-ID: <200005122029.AA126653392@golden.dtc.hp.com>

From the Python Reference Manual [emphasis added]:

    Types affect almost all aspects of object behavior. Even the importance
    of object IDENTITY is affected in some sense: for immutable types,
    operations that compute new values may actually return a reference to
    any existing object with the same type and value, while for mutable
    objects this is not allowed.

This seems to be saying that two immutable objects are (in some sense) the
same iff they have the same type and value, while two mutable objects are
the same iff they have the same id().  I heartily agree, and I think that
this notion of sameness is the single most useful variant of the "equals"
relation.

Indeed, I think it worthwhile to consider modifying the "is" operator to
compute this notion of sameness.  (This would break only exceedingly
strange user code.)  "is" would then be the natural comparator of
dictionary keys, which could then be any object.

The usefulness of this idea is limited by the absence of user-definable
immutable instances.  It might be nice to be able to declare a class -- eg,
Point -- to have immutable instances.  This declaration would promise
that:

1.  When the expression Point(3.0,4.0) is evaluated, its reference count
    will be zero.

2.  After Point(3.0,4.0) is evaluated, its attributes will not be changed.


I sent the above thoughts to Guido, who graciously and politely responded
that they struck him as somewhere between bad and poorly presented.  (Which
surprised me.  I would have guessed that the ideas were already in his
head.)  Nevertheless, he mentioned passing them along to you, so I have.


Regards,
Greg



From gmcm at hypernet.com  Sat May 13 00:05:46 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 12 May 2000 18:05:46 -0400
Subject: [Python-Dev] "is", "==", and sameness
In-Reply-To: <200005122029.AA126653392@golden.dtc.hp.com>
Message-ID: <1253953328-52526193@hypernet.com>

Greg Weeks wrote:

> From the Python Reference Manual [emphasis added]:
> 
>     Types affect almost all aspects of object behavior. Even the
>     importance of object IDENTITY is affected in some sense: for
>     immutable types, operations that compute new values may
>     actually return a reference to any existing object with the
>     same type and value, while for mutable objects this is not
>     allowed.
> 
> This seems to be saying that two immutable objects are (in some
> sense) the same iff they have the same type and value, while two
> mutable objects are the same iff they have the same id().  I
> heartily agree, and I think that this notion of sameness is the
> single most useful variant of the "equals" relation.

Notice the "may" in the reference text.

>>> 88 + 11 is 98 + 1
1
>>> 100 + 3 is 101 + 2
0
>>>

Python goes to the effort of keeping singleton instances of the 
integers less than 100. In certain situations, a similar effort is 
invested in strings. But it is by no means the general case, 
and (unless you've got a solution) it would be expensive to 
make it so.
 
> Indeed, I think it worthwhile to consider modifying the "is"
> operator to compute this notion of sameness.  (This would break
> only exceedingly strange user code.)  "is" would then be the
> natural comparator of dictionary keys, which could then be any
> object.

The implications don't follow. The restriction that dictionary 
keys be immutable is not because of the comparison method. 
It's the principle of "least surprise". Use a mutable object as a 
dict key. Now mutate the object. Now the key / value pair in 
the dictionary is inaccessible. That is, there is some pair (k,v) 
in dict.items() where dict[k] does not yield v.
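A rough sketch of that surprise, using a made-up class whose hash follows
its mutable state (purely illustrative):

    class Key:
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return hash(self.value)      # hash depends on mutable state
        def __cmp__(self, other):
            return cmp(self.value, other.value)

    k = Key(1)
    d = {k: 'spam'}
    k.value = 2          # mutate the key in place
    # d.items() still contains (k, 'spam'), but d[k] now raises KeyError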
 
> The usefulness of this idea is limited by the absence of
> user-definable immutable instances.  It might be nice to be able
> to declare a class -- eg, Point -- to have immutable
> instances.  This declaration would promise that:
> 
> 1.  When the expression Point(3.0,4.0) is evaluated, its
> reference count
>     will be zero.

That's a big change from the way Python works:

>>> sys.getrefcount(None)
167
>>>
 
> 2.  After Point(3.0,4.0) is evaluated, its attributes will not be
> changed.

You can make an instance effectively immutable (by messing 
with __setattr__). You can override __hash__ to return 
something suitable (eg, hash(id(self))), and then use an 
instance as a dict key. You don't even need to do the first to 
do the latter.
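A minimal sketch of that recipe (the class and attribute names are just
examples, assuming Python 1.5/1.6 semantics):

    class Point:
        def __init__(self, x, y):
            # assign via __dict__ so __setattr__ doesn't block construction
            self.__dict__['x'] = x
            self.__dict__['y'] = y
        def __setattr__(self, name, value):
            raise TypeError, "Point instances are (effectively) immutable"
        def __hash__(self):
            return hash(id(self))

    p = Point(3.0, 4.0)
    d = {p: 'usable as a dictionary key'}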

- Gordon



From mal at lemburg.com  Fri May 12 23:25:02 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 23:25:02 +0200
Subject: [Python-Dev] Landmark
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com>  
	            <1253961418-52039567@hypernet.com> <200005122029.QAA08252@eric.cnri.reston.va.us>
Message-ID: <391C76AE.A3118AF1@lemburg.com>

Guido van Rossum wrote:
> [Gordon]
> > [MAL]
> > > > Since string.py is being deprecated, I think we should
> > > > consider a new landmark (such as os.py) or maybe even a
> > > > whole new strategy for finding the standard lib location.
> > [GvR]
> > > I don't see a need for a new strategy
> >
> > I'll argue for (a choice of) new strategy.
> 
> I'm not keen on changing the meaning of PYTHONPATH, but if you're
> willing and able to set an environment variable, you can set
> PYTHONHOME and it will abandon the search.  If you want a command line
> option for CGI, an option to set PYTHONHOME makes sense.

The routines will still look for the landmark though (which
is what surprised me and made me look deeper -- setting
PYTHONHOME didn't work for me because I had only .pyo files
in the lib/python1.5 dir).

Perhaps Python should put more trust in the setting of
PYTHONHOME ?!

[And of course the landmark should change to something like
 os.py -- I'll try to submit a patch for this.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at python.org  Sat May 13 02:53:27 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 20:53:27 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 23:25:02 +0200."
             <391C76AE.A3118AF1@lemburg.com> 
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com> <1253961418-52039567@hypernet.com> <200005122029.QAA08252@eric.cnri.reston.va.us>  
            <391C76AE.A3118AF1@lemburg.com> 
Message-ID: <200005130053.UAA08687@eric.cnri.reston.va.us>

[me]
> > I'm not keen on changing the meaning of PYTHONPATH, but if you're
> > willing and able to set an environment variable, you can set
> > PYTHONHOME and it will abandon the search.  If you want a command line
> > option for CGI, an option to set PYTHONHOME makes sense.

[MAL]
> The routines will still look for the landmark though (which
> is what surprised me and made me look deeper -- setting
> PYTHONHOME didn't work for me because I had only .pyo files
> in the lib/python1.5 dir).
> 
> Perhaps Python should put more trust into the setting of
> PYTHONHOME ?!

Yes!  Note that PC/getpathp.c already trusts PYTHONHOME 100% --
Modules/getpath.c should follow suit.

> [An of course the landmark should change to something like
>  os.py -- I'll try to submit a patch for this.]

Maybe you can combine the two?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From effbot at telia.com  Sat May 13 14:56:41 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 14:56:41 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
Message-ID: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>

in the current 're' engine, a newline is chr(10) and nothing
else.

however, in the new unicode aware engine, I used the new
LINEBREAK predicate instead, but it turned out to break one
of the tests in the current test suite:

    sre.match('a.b', 'a\rb') => None

(unicode adds chr(13), chr(28), chr(29), chr(30), and also
unichr(133), unichr(8232), and unichr(8233) to the list of
line breaking codes)
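(for reference, the set mentioned above spelled out as Python literals --
LINEBREAK here is just a label, not the engine's actual table:

    LINEBREAK = (u'\012', u'\015', u'\034', u'\035', u'\036',
                 u'\205', u'\u2028', u'\u2029')
)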

what's the best way to deal with this?  I see three alter-
natives:

a) stick to the old definition, and use chr(10) also for
   unicode strings

b) use different definitions for 8-bit strings and unicode
   strings; if given an 8-bit string, use chr(10); if given
   a 16-bit string, use the LINEBREAK predicate.

c) use LINEBREAK in either case.

I think (c) is the "right thing", but it's the only one that may
break existing code...

</F>




From bckfnn at worldonline.dk  Sat May 13 15:47:10 2000
From: bckfnn at worldonline.dk (Finn Bock)
Date: Sat, 13 May 2000 13:47:10 GMT
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <391d5b7f.3713359@smtp.worldonline.dk>

On Sat, 13 May 2000 14:56:41 +0200, you wrote:

>in the current 're' engine, a newline is chr(10) and nothing
>else.
>
>however, in the new unicode aware engine, I used the new
>LINEBREAK predicate instead, but it turned out to break one
>of the tests in the current test suite:
>
>    sre.match('a.b', 'a\rb') => None
>
>(unicode adds chr(13), chr(28), chr(29), chr(30), and also
>unichr(133), unichr(8232), and unichr(8233) to the list of
>line breaking codes)
>
>what's the best way to deal with this?  I see three alter-
>natives:
>
>a) stick to the old definition, and use chr(10) also for
>   unicode strings

In the ORO matcher that comes with jpython, the dot matches all but
chr(10). But that is bad IMO. Unicode should use the LINEBREAK
predicate.

regards,
finn



From effbot at telia.com  Sat May 13 16:14:32 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 16:14:32 +0200
Subject: [Python-Dev] for the todo list: cStringIO uses string.joinfields
Message-ID: <00a101bfbce5$91dbd860$34aab5d4@hagrid>

the O_writelines function in Modules/cStringIO contains the
following code:

  if (!string_joinfields) {
    UNLESS(string_module = PyImport_ImportModule("string")) {
      return NULL;
    }

    UNLESS(string_joinfields=
        PyObject_GetAttrString(string_module, "joinfields")) {
      return NULL;
    }

    Py_DECREF(string_module);
  }

I suppose someone should fix this some day...

(btw, the C API reference implies that ImportModule doesn't
use import hooks.  does that mean that cStringIO doesn't work
under e.g. Gordon's installer?)

</F>




From effbot at telia.com  Sat May 13 16:36:30 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 16:36:30 +0200
Subject: [Python-Dev] cvs for dummies
Message-ID: <000d01bfbce8$a3466f40$34aab5d4@hagrid>

what's the best way to make sure that a "cvs update" really brings
everything up to date, even if you've accidentally changed some-
thing in your local workspace?

</F>




From moshez at math.huji.ac.il  Sat May 13 16:58:17 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Sat, 13 May 2000 17:58:17 +0300 (IDT)
Subject: [Python-Dev] unicode regex quickie: should a newline be the same
 thing as a linebreak?
In-Reply-To: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <Pine.GSO.4.10.10005131755560.14940-100000@sundial>

On Sat, 13 May 2000, Fredrik Lundh wrote:

> what's the best way to deal with this?  I see three alter-
> natives:
> 
> a) stick to the old definition, and use chr(10) also for
>    unicode strings

If we also supply a \something (is \l taken?) for LINEBREAK, people can
then use [^\l] if they need a Unicode line break. Just a point for a way
to do a thing close to rightness and still not break code.

--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From fdrake at acm.org  Sat May 13 17:22:12 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Sat, 13 May 2000 11:22:12 -0400 (EDT)
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
Message-ID: <14621.29476.390092.610442@newcnri.cnri.reston.va.us>

Fredrik Lundh writes:
 > what's the best way to make sure that a "cvs update" really brings
 > everything up to date, even if you've accidentally changed some-
 > thing in your local workspace?

  Delete the file(s) that got changed and cvs update again.


  -Fred

--
Fred L. Drake, Jr.           <fdrake at acm.org>
Corporation for National Research Initiatives




From effbot at telia.com  Sat May 13 17:28:02 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 17:28:02 +0200
Subject: [Python-Dev] cvs for dummies
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid> <14621.29476.390092.610442@newcnri.cnri.reston.va.us>
Message-ID: <001901bfbcef$d4672b80$34aab5d4@hagrid>

Fred L. Drake, Jr. wrote:
> Fredrik Lundh writes:
>  > what's the best way to make sure that a "cvs update" really brings
>  > everything up to date, even if you've accidentally changed some-
>  > thing in your local workspace?
> 
>   Delete the file(s) that got changed and cvs update again.

okay, what's the best way to get a list of locally changed files?

(in this case, one file ended up with neat little <<<<<<< and
>>>>>> marks in it...  several weeks and about a dozen CVS
updates after I'd touched it...)

</F>




From gmcm at hypernet.com  Sat May 13 18:25:42 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 12:25:42 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: <00a101bfbce5$91dbd860$34aab5d4@hagrid>
Message-ID: <1253887332-56495837@hypernet.com>

Fredrik wrote:

> (btw, the C API reference implies that ImportModule doesn't
> use import hooks.  does that mean that cStringIO doesn't work
> under e.g. Gordon's installer?)

You have to fool C code that uses ImportModule by doing an 
import first in your Python code. It's the same for freeze. It's 
tiresome tracking this stuff down. For example, to use shelve:

# this is needed because of the use of __import__ in anydbm 
# (modulefinder does not follow __import__)
import dbhash
# the next 2 are needed because cPickle won't use our import
# hook so we need them already in sys.modules when
# cPickle starts
import string
import copy_reg
# now it will work
import shelve

Imagine the C preprocessor letting you do
#define snarf #include
and then trying to use a dependency tracker.


- Gordon



From effbot at telia.com  Sat May 13 20:09:44 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 20:09:44 +0200
Subject: [Python-Dev] hey, who broke the array module?
Message-ID: <006e01bfbd06$6ba21120$34aab5d4@hagrid>

sigh.  never resync the CVS repository until you've fixed all
bugs in your *own* code ;-)

in 1.5.2:

>>> array.array("h", [65535])
array('h', [-1])

>>> array.array("H", [65535])
array('H', [65535])

in the current CVS version:

>>> array.array("h", [65535])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: signed short integer is greater than maximum

okay, this might break some existing code -- but one
can always argue that such code was already broken.

on the other hand:

>>> array.array("H", [65535])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: signed short integer is greater than maximum

oops.

dunno if the right thing would be to add support for various kinds
of unsigned integers to Python/getargs.c, or to hack around this
in the array module...

</F>




From mhammond at skippinet.com.au  Sat May 13 21:19:44 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun, 14 May 2000 05:19:44 +1000
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <001901bfbcef$d4672b80$34aab5d4@hagrid>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOENPCKAA.mhammond@skippinet.com.au>

> >   Delete the file(s) that got changed and cvs update again.
>
> okay, what's the best way to get a list of locally changed files?

Diff the directory.  Or better still, use WinCVS - nice little red icons
for the changed files.

> (in this case, one file ended up with neat little <<<<<<< and
> >>>>>> marks in it...  several weeks and about a dozen CVS
> updates after I'd touched it...)

This happens when CVS can't manage to perform a successful merge.  Your
original is still there, but with a funky name (in the same directory - it
should be obvious).

WinCVS also makes this a little more obvious - the icon has a special
"conflict" indicator, and the console messages also reflect the conflict in
red.

Mark.




From tismer at tismer.com  Sat May 13 22:32:45 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 22:32:45 +0200
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no>
Message-ID: <391DBBED.B252E597@tismer.com>


Magnus Lie Hetland wrote:
> 
> Aahz Maruch <aahz at netcom.com> wrote in message
> news:8fh9ki$51h$1 at slb3.atl.mindspring.net...
> > In article <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj at 4ax.com>,
> > Ben Wolfson  <rumjuggler at cryptarchy.org> wrote:
> > >
> > >', '.join(['foo', 'bar', 'baz'])
> >
> > This only works in Python 1.6, which is only released as an alpha at
> > this point.  I suggest rather strongly that we avoid 1.6-specific idioms
> > until 1.6 gets released, particularly in relation to FAQ-type questions.
> 
> This is indeed a bit strange IMO... If I were to join the elements of a
> list I would rather ask the list to do it than some string... I.e.
> 
>    ['foo', 'bar', 'baz'].join(', ')
> 
> (...although it is the string that joins the elements in the resulting
> string...)

I believe the notion of "everything is an object, and objects
provide all their functionality" is stretched a bit in Python 1.6.
The above example touches the limits where I'd just say
"OO isn't always the right thing, and always OO is the wrong thing".

A clear advantage of 1.6's string methods is that much code
becomes shorter and easier to read, since the nesting level
of parentheses is reduced quite a bit. The notation also follows
more closely the order in which actions are actually performed.

The split/join issue is really on the edge where I begin to not
like it.
It is clear that the join method *must* be performed as a method
of the joining character, since the method expects a list as its
argument. It doesn't make sense to use a list method, since
lists have nothing to do with strings.
Furthermore, the argument to join can be any sequence. Adding
a join method to every sequence type, just because we want to
join some strings, would be overkill.
So the " ".join(seq) notation is the only possible compromise,
IMHO.
It is actually arguable whether this is still "Pythonic".
What you want is to join a list of strings with some other string.
This is neither a natural method of the list, nor of the joining
string in the first place.

If it came to the point where the string module had some extra
methods which operate on two lists of string perhaps, we would
have been totally lost, and enforcing some OO method to support
it would be completely off the road.

Already a little strange is that most string methods
return new objects all the time, since strings are immutable.

join is a really extreme design, and compared with the other
string functions, which became more readable, I think it is
counter-intuitive and not the way people are thinking.
They think "I want to join this list with this string".

Furthermore, you still have to import string in order to use
its constants.

Instead of using a module with constants and functions, we
now always have to refer to instances and use their methods.
It has some benefits in simple cases.

But if there are a number of different objects handled
by a function, I think forcing it to be a method of
one of the objects is the wrong way -- OO overdone.

doing-OO-only-if-it-looks-natural-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From guido at python.org  Sat May 13 22:39:19 2000
From: guido at python.org (Guido van Rossum)
Date: Sat, 13 May 2000 16:39:19 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: Your message of "Sat, 13 May 2000 12:25:42 EDT."
             <1253887332-56495837@hypernet.com> 
References: <1253887332-56495837@hypernet.com> 
Message-ID: <200005132039.QAA09114@eric.cnri.reston.va.us>

> Fredrik wrote:
> 
> > (btw, the C API reference implies that ImportModule doesn't
> > use import hooks.  does that mean that cStringIO doesn't work
> > under e.g. Gordon's installer?)
> 
> You have to fool C code that uses ImportModule by doing an 
> import first in your Python code. It's the same for freeze. It's 
> tiresome tracking this stuff down. For example, to use shelve:
> 
> # this is needed because of the use of __import__ in anydbm 
> # (modulefinder does not follow __import__)
> import dbhash
> # the next 2 are needed because cPickle won't use our import
> # hook so we need them already in sys.modules when
> # cPickle starts
> import string
> import copy_reg
> # now it will work
> import shelve

Hm, the way I read the code (but I didn't write it!) it calls
PyImport_Import, which is a higher level function that *does* use the
__import__ hook.  Maybe this wasn't always the case?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Sat May 13 22:43:32 2000
From: guido at python.org (Guido van Rossum)
Date: Sat, 13 May 2000 16:43:32 -0400
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: Your message of "Sat, 13 May 2000 13:47:10 GMT."
             <391d5b7f.3713359@smtp.worldonline.dk> 
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>  
            <391d5b7f.3713359@smtp.worldonline.dk> 
Message-ID: <200005132043.QAA09151@eric.cnri.reston.va.us>

[Swede]
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a.b', 'a\rb') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
> >
> >what's the best way to deal with this?  I see three alter-
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings

[Finn]
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

There's no need for invention.  We're supposed to be as close to Perl
as reasonable.  What does Perl do?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gmcm at hypernet.com  Sat May 13 22:54:09 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 16:54:09 -0400
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
In-Reply-To: <391DBBED.B252E597@tismer.com>
Message-ID: <1253871224-57464726@hypernet.com>

Christian wrote:

> The split/join issue is really on the edge where I begin to not
> like it. It is clear that the join method *must* be performed as
> a method of the joining character, since the method expects a
> list as its argument.

We've been through this a number of times on c.l.py.

"What is this trash - I want list.join(sep)!"

After some head banging (often quite violent - ie, 4 or 5 
exchanges), they get that list.join(sep) sucks. But they still 
swear they'll never use sep.join(list).

So you end up saying "Well, string.join still works".

We'll need a pre-emptive FAQ entry with the link bound to a 
key stroke. Or a big increase in the PSU budget...

- Gordon



From gmcm at hypernet.com  Sat May 13 22:54:09 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 16:54:09 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: <200005132039.QAA09114@eric.cnri.reston.va.us>
References: Your message of "Sat, 13 May 2000 12:25:42 EDT."             <1253887332-56495837@hypernet.com> 
Message-ID: <1253871222-57464840@hypernet.com>

[Fredrik]
> > > (btw, the C API reference implies that ImportModule doesn't
> > > use import hooks.  does that mean that cStringIO doesn't work
> > > under e.g. Gordon's installer?)
[Guido]
> Hm, the way I read the code (but I didn't write it!) it calls
> PyImport_Import, which is a higher level function that *does* use
> the __import__ hook.  Maybe this wasn't always the case?

In stock 1.5.2 it's PyImport_ImportModule. Same in cPickle. 
I'm delighted to see them moving towards PyImport_Import.


- Gordon



From effbot at telia.com  Sat May 13 23:40:01 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 23:40:01 +0200
Subject: [Python-Dev] Re: [Patches] getpath patch
References: <391D2BC6.95E4FD3E@lemburg.com>
Message-ID: <001501bfbd23$cc45e160$34aab5d4@hagrid>

MAL wrote:
> Note: Python will dump core if it cannot find the exceptions
> module. Perhaps we should add a builtin _exceptions module
> (basically a frozen exceptions.py) which is then used as
> fallback solution ?!

or use this one:
http://w1.132.telia.com/~u13208596/exceptions.htm

</F>




From bwarsaw at python.org  Sat May 13 23:40:47 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Sat, 13 May 2000 17:40:47 -0400 (EDT)
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com>
	<8fe76b$684$1@newshost.accu.uu.nl>
	<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
	<8fh9ki$51h$1@slb3.atl.mindspring.net>
	<8fk4mh$i4$1@kopp.stud.ntnu.no>
	<391DBBED.B252E597@tismer.com>
Message-ID: <14621.52191.448037.799287@anthem.cnri.reston.va.us>

>>>>> "CT" == Christian Tismer <tismer at tismer.com> writes:

    CT> If it came to the point where the string module had some extra
    CT> methods which operate on two lists of string perhaps, we would
    CT> have been totally lost, and enforcing some OO method to
    CT> support it would be completely off the road.

The new .join() method reads a bit better if you first name the
glue string:

space = ' '
name = space.join(['Barry', 'Aloisius', 'Warsaw'])

But yes, it does look odd when used like

' '.join(['Christian', 'Aloisius', 'Tismer'])

I still think it's nice not to have to import string "just" to get the
join functionality, but remember of course that string.join() isn't
going away, so you can still use this if you like it better.
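That is, both spellings keep working:

    import string
    string.join(['Barry', 'Aloisius', 'Warsaw'], ' ')   # the old spelling
    ' '.join(['Barry', 'Aloisius', 'Warsaw'])           # the 1.6 spelling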

Alternatively, there has been talk about moving join() into the
built-ins, but I'm not sure if the semantics of that have been nailed
down.

-Barry



From tismer at tismer.com  Sat May 13 23:48:37 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 23:48:37 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
References: <1253871224-57464726@hypernet.com>
Message-ID: <391DCDB5.4FCAB97F@tismer.com>


Gordon McMillan wrote:
> 
> Christian wrote:
> 
> > The split/join issue is really on the edge where I begin to not
> > like it. It is clear that the join method *must* be performed as
> > a method of the joining character, since the method expects a
> > list as its argument.
> 
> We've been through this a number of times on c.l.py.

I know. It just came up when I really used it, when
I read through this huge patch from Fred Gansevles, and when
I saw people wondering about it.
After all, it is no surprise. They are right.
If we have to change their minds in order for them to understand
a basic operation, then we are wrong, not they.

> "What is this trash - I want list.join(sep)!"
> 
> After some head banging (often quite violent - ie, 4 or 5
> exchanges), they get that list.join(sep) sucks. But they still
> swear they'll never use sep.join(list).
> 
> So you end up saying "Well, string.join still works".

And it is the cleanest possible way to go, IMHO.
Unless we had some compound object methods, like

(somelist, somestring).join()

> We'll need a pre-emptive FAQ entry with the link bound to a
> key stroke. Or a big increase in the PSU budget...

We should reconsider the OO pattern.
The users' complaints are natural. " ".join() is not.
We might have gone too far.

Python isn't just OO, it is better.

Joining lists of strings is joining lists of strings.
This is not a method of a string in the first place.
And not a method of a sequence in the first place.

Making it a method of the joining string now appears to be
a hack to me. (Sorry, Tim, the idea was great in the first place)

I am now
+1 on leaving join() in the string module
-1 on making some filler.join() the preferred way to join.

this-was-my-most-conservative-day-since-years-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From tismer at tismer.com  Sat May 13 23:55:43 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 23:55:43 +0200
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com>
		<8fe76b$684$1@newshost.accu.uu.nl>
		<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
		<8fh9ki$51h$1@slb3.atl.mindspring.net>
		<8fk4mh$i4$1@kopp.stud.ntnu.no>
		<391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us>
Message-ID: <391DCF5F.BA981607@tismer.com>


"Barry A. Warsaw" wrote:
> 
> >>>>> "CT" == Christian Tismer <tismer at tismer.com> writes:
> 
>     CT> If it came to the point where the string module had some extra
>     CT> methods which operate on two lists of string perhaps, we would
>     CT> have been totally lost, and enforcing some OO method to
>     CT> support it would be completely off the road.
> 
> The new .join() method reads a bit better if you first name the
> glue string:
> 
> space = ' '
> name = space.join(['Barry', 'Aloisius', 'Warsaw'])

Agreed.

> But yes, it does look odd when used like
> 
> ' '.join(['Christian', 'Aloisius', 'Tismer'])

I'd love that Aloisius, really. I'll ask my parents for a renaming :-)

> I still think it's nice not to have to import string "just" to get the
> join functionality, but remember of course that string.join() isn't
> going away, so you can still use this if you like it better.

Sure, and I'm glad to be able to use string methods without ugly
imports. It just struck me when my former colleague Axel visited
me recently, and I showed him the 1.6 alpha with its string methods
(while looking over Fred's huge patch), that he said
"Well, quite nice. So they now go the same wrong way as Java did?
The OO pattern is dead. This example shows why."

> Alternatively, there has been talk about moving join() into the
> built-ins, but I'm not sure if the semantics of tha have been nailed
> down.

Sounds like a good alternative.

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From martin at loewis.home.cs.tu-berlin.de  Sun May 14 23:39:52 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 14 May 2000 23:39:52 +0200
Subject: [Python-Dev] Unicode
Message-ID: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>

> comments?  (for obvious reasons, I'm especially interested in comments
> from people using non-ASCII characters on a daily basis...)

> nobody?

Hi Fredrik,

I think the problem you see is not real. My guideline for using
Unicode in Python 1.6 will be that people should be very careful
*not* to mix byte strings and Unicode strings. If you are processing
text data obtained from a narrow-string source, you'll always have to
make an explicit decision about what the encoding is.
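For example (just a sketch; the encoding name is only an example of such a
decision):

    data = "abc\344\366\374"           # bytes from some narrow-string source
    text = unicode(data, "latin-1")    # explicit decision about the encoding
    result = text + u"abc"             # now the mix is well-defined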

If you follow this guideline, I think the Unicode type of Python 1.6
will work just fine.

If you use Unicode text *a lot*, you may find the need to combine it
with plain byte text in a more convenient way. That is the time to
look at the implicit conversion stuff and see which parts of the
functionality are useful. You then don't need to memorize *all* the
rules for where implicit conversion works - just the cases you care
about.

That may all look difficult - it probably is. But then, it is not more
difficult than tuples vs. lists: why does

>>> [a,b,c] = (1,2,3)

work, and

>>> [1,2]+(3,4)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation

does not?

Regards,
Martin



From tim_one at email.msn.com  Mon May 15 01:51:41 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Sun, 14 May 2000 19:51:41 -0400
Subject: [Python-Dev] Memory woes under Windows
Message-ID: <000001bfbdff$5bcdfe40$192d153f@tim>

[Noah, I'm wondering whether this is related to our W98 NatSpeak woes --
Python grows its lists much like a certain product we both work on <ahem>
grows its arrays ...]

Here's a simple test case:

from time import clock

def run():
    n = 1
    while n < 4000000:
        a = []
        push = a.append
        start = clock()
        for i in xrange(n):
            push(1)
        finish = clock()
        print "%10d push  %10.3f" % (n, round(finish - start, 3))
        n = n + n

for i in (1, 2, 3):
    try:
        run()
    except MemoryError:
        print "Got a memory error"

So run() builds a number of power-of-2 sized lists, each by appending one
element at a time.  It prints the list length and elapsed time to build each
one (on Windows, this is basically wall-clock time, and is derived from the
Pentium's high-resolution cycle timer).  The driver simply runs this 3
times, reporting any MemoryError that pops up.

The largest array constructed has 2M elements, so consumes about 8Mb -- no
big deal on most machines these days.

Here's what happens on my new laptop (damn, this thing is fast! -- usually):

Win98 (Second Edition)
600MHz Pentium III
160Mb RAM
Python 1.6a2 from python.org, via the Windows installer

         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.011
      4096 push       0.020
      8192 push       0.053
     16384 push       0.074
     32768 push       0.163
     65536 push       0.262
    131072 push       0.514
    262144 push       0.713
    524288 push       1.440
   1048576 push       2.961
Got a memory error
         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.007
      4096 push       0.014
      8192 push       0.029
     16384 push       0.057
     32768 push       0.116
     65536 push       0.231
    131072 push       0.474
    262144 push       2.361
    524288 push      24.059
   1048576 push      67.492
Got a memory error
         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.007
      4096 push       0.014
      8192 push       0.028
     16384 push       0.057
     32768 push       0.115
     65536 push       0.232
    131072 push       0.462
    262144 push       2.349
    524288 push      23.982
   1048576 push      67.257
Got a memory error

Commentary:  The first time it runs, the timing behavior is
indistinguishable from O(N).  But realloc returns NULL at some point when
growing the 2M array!  There "should be" huge gobs of memory available.

The 2nd and 3rd runs are very similar to each other, both blow up at about
the same time, but both run *very* much slower than the 1st run before that
point as the list size gets non-trivial -- and, while the output doesn't
show this, the disk starts thrashing too.

It's *not* the case that Win98 won't give Python more than 8Mb of memory.
For example,

>>> a = [1]*30000000  # that's 30M
>>>

works fine and fast on this machine, with no visible disk traffic [Noah,
that line sucks up about 120Mb from malloc in one shot].

So, somehow or other, masses of allocations are confusing the system memory
manager nearly to death (implying we should use Vladimir's PyMalloc under
Windows after grabbing every byte the machine has <0.6 wink>).

My belief is that the Windows 1.6a2 from python.org was compiled with VC6,
yes?  Scream if that's wrong.

This particular test case doesn't run any better under my Win95 (original)
P5-166 with 32Mb RAM using Python 1.5.2.  But at work, we've got a
(unfortunately huge, and C++) program that runs much slower on a
large-memory W98 machine than a small-memory W95 one, due to disk thrashing.
It's a mystery!  If anyone has a clue about any of this, spit it out <wink>.

[Noah, I watched the disk cache size while running the above, and it's not
the problem -- while W98 had allocated about 100Mb for disk cache at the
start, it gracefully gave that up as the program's memory demands increased]

just-another-day-with-windows-ly y'rs  - tim





From mhammond at skippinet.com.au  Mon May 15 02:28:05 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon, 15 May 2000 10:28:05 +1000
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000001bfbdff$5bcdfe40$192d153f@tim>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEOHCKAA.mhammond@skippinet.com.au>

This is definitely weird!  As you only mentioned Win9x, I thought I would
give it a go on Win2k.

This is from a CVS update of only a few days ago, but it is a non-debug
build.  PII266 with 196MB ram:

         1 push       0.001
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.001
       256 push       0.001
       512 push       0.003
      1024 push       0.006
      2048 push       0.011
      4096 push       0.040
      8192 push       0.043
     16384 push       0.103
     32768 push       0.203
     65536 push       0.583

Things are looking OK up to here - the behaviour Tim expected.  But then
things seem to start going a little wrong:

    131072 push       1.456
    262144 push       4.763
    524288 push      16.119
   1048576 push      60.765

All of a sudden we seem to hit N*N behaviour?

I gave up waiting for the next one.  Performance monitor was showing CPU at
100%, but the Python process was only sitting on around 15MB of RAM (and
growing _very_ slowly - at the rate you would expect).  Machine had tons of
ram showing as available, and the disk was not thrashing - ie, Windows
definitely had lots of mem available, and I have no reason to believe that
a malloc() would fail here - but certainly no one would ever want to wait
and see :-)

This was all definitely built with MSVC6, SP3.

no-room-should-ever-have-more-than-one-windows-ly y'rs

Mark.




From gstein at lyra.org  Mon May 15 06:08:33 2000
From: gstein at lyra.org (Greg Stein)
Date: Sun, 14 May 2000 21:08:33 -0700 (PDT)
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <001901bfbcef$d4672b80$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005142108070.28031-100000@nebula.lyra.org>

On Sat, 13 May 2000, Fredrik Lundh wrote:
> Fred L. Drake, Jr. wrote:
> > Fredrik Lundh writes:
> >  > what's the best way to make sure that a "cvs update" really brings
> >  > everything up to date, even if you've accidentally changed some-
> >  > thing in your local workspace?
> > 
> >   Delete the file(s) that got changed and cvs update again.
> 
> okay, what's the best way to get a list of locally changed files?

I use the following:

% cvs stat | fgrep Local


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From tim_one at email.msn.com  Mon May 15 09:34:39 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 15 May 2000 03:34:39 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBOEOHCKAA.mhammond@skippinet.com.au>
Message-ID: <000001bfbe40$07f14520$b82d153f@tim>

[Mark Hammond]
> This is definitely weird!  As you only mentioned Win9x, I thought I would
> give it a go on Win2k.

Thanks, Mark!  I've only got W9X machines at home.

> This is from a CVS update of only a few days ago, but it is a non-debug
> build.  PII266 with 196MB ram:
>
>          1 push       0.001
>          2 push       0.000
>          4 push       0.000
>          8 push       0.000
>         16 push       0.000
>         32 push       0.000
>         64 push       0.000
>        128 push       0.001
>        256 push       0.001
>        512 push       0.003
>       1024 push       0.006
>       2048 push       0.011
>       4096 push       0.040
>       8192 push       0.043
>      16384 push       0.103
>      32768 push       0.203
>      65536 push       0.583
>
> Things are looking OK to here - the behaviour Tim expected.  But then
> things seem to start going a little wrong:
>
>     131072 push       1.456
>     262144 push       4.763
>     524288 push      16.119
>    1048576 push      60.765

So that acts like my Win95 (which I didn't show), and somewhat like my 2nd &
3rd Win98 runs.

> All of a sudden we seem to hit N*N behaviour?

*That* part really isn't too surprising.  Python "overallocates", but by a
fixed amount independent of the current size.  This leads to quadratic-time
behavior "in theory" once a vector gets large enough.  Guido's cultural myth
for why that theory shouldn't matter is that if you keep appending to the
same vector, the OS will eventually move it to the end of the address space,
whereupon further growth simply boosts the VM high-water mark without
actually moving anything.  I call that "a cultural myth" because some
flavors of Unix really did work that way, and some may still -- I doubt
it's ever been a valid argument under Windows, though. (you, of all people,
know how much Python's internal strategies were informed by machines nobody
uses <wink>).
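
To make the arithmetic concrete, here's a toy model in pure Python (not the
actual listobject.c logic) of why fixed-increment growth goes quadratic
while proportional growth stays linear:

    # Toy model only: count roughly how many element copies growth causes
    # over n appends, assuming every realloc moves the whole vector.
    def copies_fixed(n, chunk=100):
        copied = allocated = 0
        for size in xrange(1, n + 1):
            if size > allocated:
                allocated = size + chunk    # grow by a fixed number of slots
                copied = copied + size
        return copied                       # ~ n*n / (2*chunk): quadratic

    def copies_doubling(n):
        copied = allocated = 0
        for size in xrange(1, n + 1):
            if size > allocated:
                allocated = max(1, allocated * 2)   # grow proportionally
                copied = copied + size
        return copied                       # ~ 2*n: linear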

So I was more surprised up to this point by the supernatural linearity of my
first W98 run (which is reproducible, btw).  But my 2nd & 3rd W98 runs (also
reproducible), unlike your W2K run, show *worse* than quadratic
behavior.

> I gave up waiting for the next one.

Under both W98 and W95, the next one does eventually hit the MemoryError for
me, but it does take a long time.  If I thought it would help, I'd measure
it.  And *this* one is surprising, because, as you say:

> Performance monitor was showing CPU at 100%, but the Python process
> was only sitting on around 15MB of RAM (and growing _very_ slowly -
> at the rate you would expect).  Machine had tons of ram showing as
> available, and the disk was not thrashing - ie, Windows definately
> had lots of mem available, and I have no reason to believe that
> a malloc() would fail here - but certainly no one would ever want to wait
> and see :-)

How long did you wait?  If less than 10 minutes, perhaps not long enough.  I
certainly didn't expect a NULL return either, even on my tiny machine, and
certainly not on the box with 20x more RAM than the list needs.

> This was all definately built with MSVC6, SP3.

Again good to know.  I'll chew on this, but don't expect a revelation soon.

> no-room-should-ever-have-more-than-one-windows-ly y'rs

Hmm.  I *did* run these in different rooms <wink>.

no-accounting-for-windows-ly y'rs  - tim





From tim_one at email.msn.com  Mon May 15 09:34:51 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 15 May 2000 03:34:51 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <391DCDB5.4FCAB97F@tismer.com>
Message-ID: <000301bfbe40$0e2a49a0$b82d153f@tim>

[Christian Tismer]
> ...
> After all, it is no surprize. They are right.
> If we have to change their mind in order to understand
> a basic operation, then we are wrong, not they.

Huh!  I would not have guessed that you'd give up on Stackless that easily
<wink>.

> ...
> Making it a method of the joining string now appears to be
> a hack to me. (Sorry, Tim, the idea was great in the first place)

Just the opposite here:  it looked like a hack the first time I thought of
it, but has gotten more charming with each use.  space.join(sequence) is so
pretty it aches.

redefining-truth-all-over-the-place-ly y'rs  - tim





From gward at mems-exchange.org  Mon May 15 15:30:54 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Mon, 15 May 2000 09:30:54 -0400
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <000d01bfbce8$a3466f40$34aab5d4@hagrid>; from effbot@telia.com on Sat, May 13, 2000 at 04:36:30PM +0200
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
Message-ID: <20000515093053.A5765@mems-exchange.org>

On 13 May 2000, Fredrik Lundh said:
> what's the best way to make sure that a "cvs update" really brings
> everything up to date, even if you've accidentally changed some-
> thing in your local workspace?

Try the attached script -- it's basically the same as Greg Stein's "cvs
status | grep Local", but beefed-up and overkilled.

Example:

  $ cvstatus -l
  .cvsignore                     Up-to-date        2000-05-02 14:31:04
  Makefile.in                    Locally Modified  2000-05-12 12:25:39
  README                         Up-to-date        2000-05-12 12:34:42
  acconfig.h                     Up-to-date        2000-05-12 12:25:40
  config.h.in                    Up-to-date        2000-05-12 12:25:40
  configure                      Up-to-date        2000-05-12 12:25:40
  configure.in                   Up-to-date        2000-05-12 12:25:40
  install-sh                     Up-to-date        1998-08-13 12:08:45

...so yeah, it generates a lot of output when run on a large working
tree, eg. Python's.  But not as much as "cvs status" on its own.  ;-)

        Greg

PS. I just noticed it uses the "#!/usr/bin/env" hack with a command-line
option for the interpreter, which doesn't work on Linux.  ;-(  You may
have to hack the shebang line to make it work.

-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367
-------------- next part --------------
#!/usr/bin/env perl -w

#
# cvstatus
#
# runs "cvs status" (with optional file arguments), filtering out
# uninteresting stuff and putting in the last-modification time
# of each file.
#
# Usage: cvstatus [files]
#
# GPW 1999/02/17
#
# $Id: cvstatus,v 1.4 2000/04/14 14:56:14 gward Exp $
#

use strict;
use POSIX 'strftime';

my @files = @ARGV;

# Open a pipe to a forked child process
my $pid = open (CVS, "-|");
die "couldn't open pipe: $!\n" unless defined $pid;

# In the child -- run "cvs status" (with optional list of files
# from command line)
unless ($pid)
{
   open (STDERR, ">&STDOUT");           # merge stderr with stdout
   exec 'cvs', 'status', @files;
   die "couldn't exec cvs: $!\n";
}

# In the parent -- read "cvs status" output from the child
else
{
   my $dir = '';
   while (<CVS>)
   {
      my ($filename, $status, $mtime);
      if (/Examining (.*)/)
      {
         $dir = $1;
         if (! -d $dir)
         {
            warn "huh? no directory called $dir!";
            $dir = '';
         }
         elsif ($dir eq '.')
            { $dir = ''; }
         else
            { $dir .= '/' unless $dir =~ m|/$|; }
      }
      elsif (($filename, $status) = /^File: \s* (\S+) \s* Status: \s* (.*)/x)
      {
         $filename = $dir . $filename;
         if ($mtime = (stat $filename)[9])
         {
            $mtime = strftime ("%Y-%m-%d %H:%M:%S", localtime $mtime);
            printf "%-30.30s %-17s %s\n", $filename, $status, $mtime;
         }
         else
         {
            #warn "couldn't stat $filename: $!\n";
            printf "%-30.30s %-17s ???\n", $filename, $status;
         }
      }
   }

   close (CVS);
   warn "cvs failed\n" unless $? == 0;
}

From trentm at activestate.com  Mon May 15 23:09:58 2000
From: trentm at activestate.com (Trent Mick)
Date: Mon, 15 May 2000 14:09:58 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <006e01bfbd06$6ba21120$34aab5d4@hagrid>
References: <006e01bfbd06$6ba21120$34aab5d4@hagrid>
Message-ID: <20000515140958.C20418@activestate.com>

I broke it with my patches to test overflow for some of the PyArg_Parse*()
formatting characters. The upshot of testing for overflow is that now those
formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
unsigned-ness as appropriate (you have to know if the value is signed or
unsigned to know what limits to check against for overflow). Two
possibilities presented themselves:

1. Enforce 'b' as unsigned char (the common usage) and the rest as signed
values (short, int, and long).  If you want a signed char or an unsigned
short, you have to work around it yourself.

2. Add formatting characters or modifiers for signed and unsigned versions of
all the integral types to PyArg_Parse*() in getargs.c.

Guido preferred the former because (my own interpretation of his reasons) it
covers the common case and keeps the clutter and feature creep down. It is
debatable whether or not we really need signed and unsigned for all of them.
See the following threads on python-dev and patches:
  make 'b' formatter an *unsigned* char
  issues with int/long on 64bit platforms - eg stringobject (PR#306) 
  make 'b','h','i' raise overflow exception
  
Possible code breakage is the drawback.
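
To make the limit checking concrete, here is a rough Python sketch (purely
illustrative -- the real checks are C code in getargs.c, and the function
name here is made up) of the kind of range test the 'h' code now performs,
assuming a 16-bit C short:

    SHRT_MIN, SHRT_MAX = -32768, 32767   # assumes a 16-bit C short

    def check_signed_short(x):
        # mimic the post-patch 'h' format code: reject out-of-range values
        if x > SHRT_MAX:
            raise OverflowError("signed short integer is greater than maximum")
        if x < SHRT_MIN:
            raise OverflowError("signed short integer is less than minimum")
        return x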


[Fredrik Lundh wrote]:
> sigh.  never resync the CVS repository until you've fixed all
> bugs in your *own* code ;-)

Sorry, I guess.  The test suite did not catch this, so it was hard for me to
know that the bug had been introduced.  My patches add tests for these to the
test suite.

> 
> in 1.5.2:
> 
> >>> array.array("h", [65535])
> array('h', [-1])
> 
> >>> array.array("H", [65535])
> array('H', [65535])
> 
> in the current CVS version:
> 
> >>> array.array("h", [65535])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: signed short integer is greater than maximum
> 
> okay, this might break some existing code -- but one
> can always argue that such code were already broken.

Yes.

> 
> on the other hand:
> 
> >>> array.array("H", [65535])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: signed short integer is greater than maximum
> 
> oops.
> 
oops. See my patch that fixes this for 'H', and 'b', and 'I', and 'L'.


> dunno if the right thing would be to add support for various kinds
> of unsigned integers to Python/getargs.c, or to hack around this
> in the array module...
> 
My patch does the latter, and that would be my suggestion, because:
(1) Guido didn't like the idea of adding more formatters to getargs.c (see
above).
(2) Adding support for unsigned and signed versions in getargs.c could be
confusing: the formatting characters could not be the same as in the array
module, since 'L' is already used for LONG_LONG types in PyArg_Parse*().
(3) KISS and the common case.  Keep the number of formatters for
PyArg_Parse*() short and simple.  I would presume that the common-case user
does not really need the extra support.


Trent


-- 
Trent Mick
trentm at activestate.com



From mhammond at skippinet.com.au  Tue May 16 08:22:53 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 16 May 2000 16:22:53 +1000
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
Message-ID: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au>

For about the 1,000,000th time in my life (no exaggeration :-), I just
typed "python.exe foo" - I forgot the .py.

It would seem a simple and useful change to append a ".py" extension and
try again, instead of dying the first time around - ie, all we would be
changing is that we continue to run where we previously failed.

Is there a good reason why we don't do this?

Mark.




From mal at lemburg.com  Tue May 16 00:07:53 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 00:07:53 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same 
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <391d5b7f.3713359@smtp.worldonline.dk>
Message-ID: <39207539.F1C14A25@lemburg.com>

Finn Bock wrote:
> 
> On Sat, 13 May 2000 14:56:41 +0200, you wrote:
> 
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a\rb', 'a.b') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
>
> >what's the best way to deal with this?  I see three alter-
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings
> 
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

+1 on that one... just like \s should use Py_UNICODE_ISSPACE()
and \d Py_UNICODE_ISDECIMAL().

BTW, how have you implemented the locale aware \w and \W
for Unicode ? Unicode doesn't have any locales, but quite a
lot more alphanumeric characters (or equivalents) and there
currently is no Py_UNICODE_ISALPHA() in the core.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Mon May 15 23:50:39 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 May 2000 23:50:39 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com>
		<8fe76b$684$1@newshost.accu.uu.nl>
		<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
		<8fh9ki$51h$1@slb3.atl.mindspring.net>
		<8fk4mh$i4$1@kopp.stud.ntnu.no>
		<391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us>
Message-ID: <3920712F.1FD0B910@lemburg.com>

"Barry A. Warsaw" wrote:
> 
> >>>>> "CT" == Christian Tismer <tismer at tismer.com> writes:
> 
>     CT> If it came to the point where the string module had some extra
>     CT> methods which operate on two lists of string perhaps, we would
>     CT> have been totally lost, and enforcing some OO method to
>     CT> support it would be completely off the road.
> 
> The new .join() method reads a bit better if you first name the
> glue string:
> 
> space = ' '
> name = space.join(['Barry', 'Aloisius', 'Warsaw'])
> 
> But yes, it does look odd when used like
> 
> ' '.join(['Christian', 'Aloisius', 'Tismer'])
> 
> I still think it's nice not to have to import string "just" to get the
> join functionality, but remember of course that string.join() isn't
> going away, so you can still use this if you like it better.

string.py is deprecated, AFAIK (not that it'll go away anytime
soon, but using string methods directly is really the better,
more readable and faster approach).
 
> Alternatively, there has been talk about moving join() into the
> built-ins, but I'm not sure if the semantics of tha have been nailed
> down.

This is probably the way to go. Semantics should probably
be:

	join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)

and should work with any type providing addition or
concat slot methods.

Patches anyone ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May 16 10:21:46 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 10:21:46 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <3921051A.56C7B63E@lemburg.com>

"Martin v. Loewis" wrote:
> 
> > comments?  (for obvious reasons, I'm especially interested in comments
> > from people using non-ASCII characters on a daily basis...)
> 
> > nobody?
> 
> Hi Frederik,
> 
> I think the problem you try to see is not real. My guideline for using
> Unicode in Python 1.6 will be that people should be very careful to
> *not* mix byte strings and Unicode strings. If you are processing text
> data, obtained from a narrow-string source, you'll always have to make
> an explicit decision what the encoding is.

Right, that's the way to go :-)
 
> If you follow this guideline, I think the Unicode type of Python 1.6
> will work just fine.
> 
> If you use Unicode text *a lot*, you may find the need to combine them
> with plain byte text in a more convenient way. This is the time you
> should look at the implicit conversion stuff, and see which of the
> functionality is useful. You then don't need to memorize *all* the
> rules where implicit conversion would work - just the cases you care
> about.

It's better not to rely on the implicit conversions.  These
are really only there to ease porting applications to Unicode
and perhaps make some existing APIs deal with Unicode without
even knowing about it -- of course this will not always work
and those places will need some extra porting effort to make
them useful w/r to Unicode. open() is one such candidate.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fredrik at pythonware.com  Tue May 16 11:30:54 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 May 2000 11:30:54 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>

Martin v. Loewis wrote:
> I think the problem you try to see is not real.

it is real.  I won't repeat the arguments one more time; please read
the W3C character model note and the python-dev archives, and read
up on the unicode support in Tcl and Perl.

> But then, it is not more difficult than tuples vs. lists

your examples always behave the same way, no matter what's in the
containers.  that's not true for MAL's design.

</F>




From guido at python.org  Tue May 16 12:03:07 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 16 May 2000 06:03:07 -0400
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
In-Reply-To: Your message of "Tue, 16 May 2000 16:22:53 +1000."
             <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au> 
Message-ID: <200005161003.GAA12247@eric.cnri.reston.va.us>

> For about the 1,000,000th time in my life (no exaggeration :-), I just
> typed "python.exe foo" - I forgot the .py.
> 
> It would seem a simple and useful change to append a ".py" extension and
> try again, instead of dying the first time around - ie, all we would be
> changing is that we continue to run where we previously failed.
> 
> Is there a good reason why we don't do this?

Just inertia, plus it's "not the Unix way".  I agree it's a good idea.
(I also found in user testing that IDLE definitely has to supply the
".py" when saving a module if the user didn't.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From skip at mojam.com  Tue May 16 16:52:59 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 16 May 2000 09:52:59 -0500 (CDT)
Subject: [Python-Dev] join() et al.
In-Reply-To: <3920712F.1FD0B910@lemburg.com>
References: <391A3FD4.25C87CB4@san.rr.com>
	<8fe76b$684$1@newshost.accu.uu.nl>
	<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
	<8fh9ki$51h$1@slb3.atl.mindspring.net>
	<8fk4mh$i4$1@kopp.stud.ntnu.no>
	<391DBBED.B252E597@tismer.com>
	<14621.52191.448037.799287@anthem.cnri.reston.va.us>
	<3920712F.1FD0B910@lemburg.com>
Message-ID: <14625.24779.329534.364663@beluga.mojam.com>

    >> Alternatively, there has been talk about moving join() into the
    >> built-ins, but I'm not sure if the semantics of tha have been nailed
    >> down.

    Marc> This is probably the way to go. Semantics should probably
    Marc> be:

    Marc> 	join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)

    Marc> and should work with any type providing addition or concat slot
    Marc> methods.

Of course, while it will always yield what you ask for, it might not always
yield what you expect:

    >>> seq = [1,2,3]
    >>> sep = 5
    >>> reduce(lambda x,y: x + sep + y, seq)
    16

;-)

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From effbot at telia.com  Tue May 16 17:22:06 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 17:22:06 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com><8fe76b$684$1@newshost.accu.uu.nl><rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com><8fh9ki$51h$1@slb3.atl.mindspring.net><8fk4mh$i4$1@kopp.stud.ntnu.no><391DBBED.B252E597@tismer.com><14621.52191.448037.799287@anthem.cnri.reston.va.us><3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com>
Message-ID: <000d01bfbf4a$85321400$34aab5d4@hagrid>

>     Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
>
> Of course, while it will always yield what you ask for, it might not always
> yield what you expect:
> 
>     >>> seq = [1,2,3]
>     >>> sep = 5
>     >>> reduce(lambda x,y: x + sep + y, seq)
>     16

not to mention:

>>> print join([], " ")
TypeError: reduce of empty sequence with no initial value

...

</F>




From mal at lemburg.com  Tue May 16 19:15:05 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 19:15:05 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com><8fe76b$684$1@newshost.accu.uu.nl><rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com><8fh9ki$51h$1@slb3.atl.mindspring.net><8fk4mh$i4$1@kopp.stud.ntnu.no><391DBBED.B252E597@tismer.com><14621.52191.448037.799287@anthem.cnri.reston.va.us><3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com> <000d01bfbf4a$85321400$34aab5d4@hagrid>
Message-ID: <39218219.9E8115E2@lemburg.com>

Fredrik Lundh wrote:
> 
> >     Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
> >
> > Of course, while it will always yield what you ask for, it might not always
> > yield what you expect:
> >
> >     >>> seq = [1,2,3]
> >     >>> sep = 5
> >     >>> reduce(lambda x,y: x + sep + y, seq)
> >     16
> 
> not to mention:
> 
> >>> print join([], " ")
> TypeError: reduce of empty sequence with no initial value

Ok, here's a more readable and semantically useful definition:

def join(sequence,sep=''):

    # Special case: empty sequence
    if len(sequence) == 0:
        try:
            return 0*sep
        except TypeError:
            return sep[0:0]
        
    # Normal case
    x = None
    for y in sequence:
        if x is None:
            x = y
        elif sep:
            x = x + sep + y
        else:
            x = x + y
    return x

Examples:

>>> join((1,2,3))
6

>>> join(((1,2),(3,4)),('x',))
(1, 2, 'x', 3, 4)

>>> join(('a','b','c'), ' ')
'a b c'

>>> join(())
''

>>> join((),())
()

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From paul at prescod.net  Tue May 16 19:58:33 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 12:58:33 -0500
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <39218C49.C66FEEDE@prescod.net>

"Martin v. Loewis" wrote:
> 
> ...
>
> I think the problem you try to see is not real. My guideline for using
> Unicode in Python 1.6 will be that people should be very careful to
> *not* mix byte strings and Unicode strings. 

I think that as soon as we are adding admonitions to documentation that
things "probably don't behave as you expect, so be careful", we have
failed. Sometimes failure is unavoidable (e.g. floats do not act
rationally -- deal with it). But let's not pretend that failure is
success.

> If you are processing text
> data, obtained from a narrow-string source, you'll always have to make
> an explicit decision what the encoding is.

Are Python literals a "narrow string source"? It seems blatantly clear
to me that the "encoding" of Python literals should be determined at
compile time, not runtime. Byte arrays from a file are different. 

> If you use Unicode text *a lot*, you may find the need to combine them
> with plain byte text in a more convenient way. 

Unfortunately there will be many people with no interest in Unicode
who will be dealing with it merely because that is the way APIs are
going: XML APIs, Windows APIs, TK, DCOM, SOAP, WebDAV even some X/Unix
APIs. Unicode is the new ASCII.

I want to get a (Unicode) string from an XML document or SOAP request,
compare it to a string literal and never think about Unicode.

> ...
> why does
> 
> >>> [a,b,c] = (1,2,3)
> 
> work, and
> 
> >>> [1,2]+(3,4)
> ...
> 
> does not?

I dunno. If there is no good reason then it is a bug that should be
fixed. The __radd__ operator on lists should iterate over its argument
as a sequence.

As Fredrik points out, though, this situation is not as dangerous as
auto-conversions because

 a) the restriction could be loosened later without breaking code

 b) the operation always fails.  It never does the wrong thing silently,
and it never succeeds for some inputs while failing for others.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)



From skip at mojam.com  Tue May 16 20:15:40 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 16 May 2000 13:15:40 -0500 (CDT)
Subject: [Python-Dev] join() et al.
In-Reply-To: <39218219.9E8115E2@lemburg.com>
References: <391A3FD4.25C87CB4@san.rr.com>
	<8fe76b$684$1@newshost.accu.uu.nl>
	<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
	<8fh9ki$51h$1@slb3.atl.mindspring.net>
	<8fk4mh$i4$1@kopp.stud.ntnu.no>
	<391DBBED.B252E597@tismer.com>
	<14621.52191.448037.799287@anthem.cnri.reston.va.us>
	<3920712F.1FD0B910@lemburg.com>
	<14625.24779.329534.364663@beluga.mojam.com>
	<000d01bfbf4a$85321400$34aab5d4@hagrid>
	<39218219.9E8115E2@lemburg.com>
Message-ID: <14625.36940.160373.900909@beluga.mojam.com>

    Marc> Ok, here's a more readable and semantically useful definition:
    ...

    >>> join((1,2,3))
    6

My point was that the verb "join" doesn't connote "sum".  The idea of
"join"ing a sequence suggests (to me) that the individual sequence elements
are still identifiable in the result, so "join((1,2,3))" would look
something like "123" or "1 2 3" or "10203", not "6".

It's not a huge deal to me, but I think it mildly violates the principle of
least surprise when you try to apply it to sequences of non-strings.

To extend this into the absurd, what should the following code display?

    class Spam: pass

    eggs = Spam()
    bacon = Spam()
    toast = Spam()

    print join((eggs,bacon,toast))

If a join builtin is supposed to be applicable to all types, we need to
decide what the semantics are going to be for all types.  Maybe all that
needs to happen is that you stringify any non-string elements before
applying the + operator (just one possibility among many, not necessarily
one I recommend).  If you want to limit join's inputs to (or only make it
semantically meaningful for) sequences of strings, then it should probably
not be a builtin, no matter how visually annoying you find

    " ".join(["a","b","c"])

Skip



From effbot at telia.com  Tue May 16 20:26:10 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 20:26:10 +0200
Subject: [Python-Dev] homer-dev, anyone?
Message-ID: <009d01bfbf64$b779a260$34aab5d4@hagrid>

http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40

</F>




From martin at loewis.home.cs.tu-berlin.de  Tue May 16 20:43:34 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 16 May 2000 20:43:34 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
	(fredrik@pythonware.com)
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
Message-ID: <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>

> it is real.  I won't repeat the arguments one more time; please read
> the W3C character model note and the python-dev archives, and read
> up on the unicode support in Tcl and Perl.

I did read all that, so there really is no point in repeating the
arguments - yet I'm still not convinced. One of the causes may be that
all your commentary either

- discusses an alternative solution to the existing one, merely
  pointing out the difference, without any strong selling point
- explains small examples that work counter-intuitively

I'd like to know whether you have an example of a real-world
big-application problem that could not be conveniently implemented
using the new Unicode API. For all the examples I can think where
Unicode would matter (XML processing, CORBA wstring mapping,
internationalized messages and GUIs), it would work just fine.

So while it may not be perfect, I think it is good enough. Perhaps my
problem is that I'm not a perfectionist :-)

However, one remark from http://www.w3.org/TR/charmod/ reminded me of
an earlier proposal by Bill Janssen. The Character Model says

# Because encoded text cannot be interpreted and processed without
# knowing the encoding, it is vitally important that the character
# encoding is known at all times and places where text is exchanged or
# stored.

While they were considering document encodings, I think this applies
in general. Bill Janssen's proposal was that each (narrow) string
should have an attribute .encoding. If set, you'll know what encoding
a string has. If not set, it is a byte string, subject to the default
encoding. I'd still like to see that as a feature in Python.
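
As a rough sketch of the idea (purely hypothetical -- the class and method
names below are made up, this is not an existing implementation), it would
amount to a byte string that carries its own encoding and can convert
itself to Unicode on request:

    class EncodedString:
        # hypothetical: a byte string plus a note about its encoding
        def __init__(self, data, encoding=None):
            self.data = data            # the raw bytes
            self.encoding = encoding    # None means "subject to the default"
        def as_unicode(self, default="ascii"):
            return unicode(self.data, self.encoding or default)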

Regards,
Martin



From paul at prescod.net  Tue May 16 20:49:46 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 13:49:46 -0500
Subject: [Python-Dev] homer-dev, anyone?
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>
Message-ID: <3921984A.8CDE8E1D@prescod.net>

I hope that if Python were renamed we would not choose yet another name
which turns up hundreds of false hits in web engines. Perhaps Homr or
Home_r. Or maybe Pythahn.

Fredrik Lundh wrote:
> 
> http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
> 
> </F>

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)



From tismer at tismer.com  Tue May 16 21:01:21 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:01:21 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
Message-ID: <39219B01.A4EE0920@tismer.com>


Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > After all, it is no surprise. They are right.
> > If we have to change their mind in order to understand
> > a basic operation, then we are wrong, not they.
> 
> Huh!  I would not have guessed that you'd give up on Stackless that easily
> <wink>.

Noh, I didn't give up Stackless, but fishing for soles.
After Just v. R. has become my most ambitious user,
I'm happy enough.

(Again, better don't take me too serious :)

> > ...
> > Making it a method of the joining string now appears to be
> > a hack to me. (Sorry, Tim, the idea was great in the first place)
> 
> Just the opposite here:  it looked like a hack the first time I thought of
> it, but has gotten more charming with each use.  space.join(sequence) is so
> pretty it aches.

It is absolutely phantastic.
The most uninteresting stuff in the join is the separator,
and it has the power to merge thousands of strings
together, without asking the sequence at all
 - give all power to the suppressed, long live the Python anarchy :-)

We now just have to convince the user no longer to think
of *what* to join in the first place, but how.

> redefining-truth-all-over-the-place-ly y'rs  - tim

" "-is-small-but-sooo-strong---lets-elect-new-users - ly y'rs - chris

p.s.: no this is *no* offense, just kidding.

" ".join(":-)", ":^)", "<wink> ") * 42

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From tismer at tismer.com  Tue May 16 21:10:42 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:10:42 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim> <39219B01.A4EE0920@tismer.com>
Message-ID: <39219D32.BD82DE83@tismer.com>

Oh, while we are at it...

Christian Tismer wrote:
> " ".join(":-)", ":^)", "<wink> ") * 42

is actually wrong, since it needs a sequence, not just
the arg tuple.  Wouldn't it make sense to allow this?
Exactly the opposite of list.append(), since in this
case we are just expecting strings?

While I have to say that

>>> " ".join("123")
'1 2 3'
>>> 

is not a feature to me but just annoying ;-)

ciao again - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From effbot at telia.com  Tue May 16 21:30:49 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 21:30:49 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
Message-ID: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>

Martin v. Loewis wrote:
> > it is real.  I won't repeat the arguments one more time; please read
> > the W3C character model note and the python-dev archives, and read
> > up on the unicode support in Tcl and Perl.
> 
> I did read all that, so there really is no point in repeating the
> arguments - yet I'm still not convinced. One of the causes may be that
> all your commentary either
> 
> - discusses an alternative solution to the existing one, merely
>   pointing out the difference, without any strong selling point
> - explains small examples that work counter-intuitively

umm.  I could have sworn that getting rid of counter-intuitive
behaviour was rather important in python.  maybe we're using
the language in radically different ways?

> I'd like to know whether you have an example of a real-world
> big-application problem that could not be conveniently implemented
> using the new Unicode API. For all the examples I can think where
> Unicode would matter (XML processing, CORBA wstring mapping,
> internationalized messages and GUIs), it would work just fine.

of course I can kludge my way around the flaws in MAL's design,
but why should I have to do that? it's broken. fixing it is easy.

> Perhaps my problem is that I'm not a perfectionist :-)

perfectionist or not, I only want Python's Unicode support to
be as intuitive as anything else in Python.  as it stands right
now, Perl and Tcl's Unicode support is intuitive.  Python's not.

(it also backs us into a corner -- once you mess this one up,
you cannot fix it in Py3K without breaking lots of code.  that's
really bad).

in contrast, Guido's compromise proposal allows us to do this
the right way in 1.7/Py3K (i.e. teach python about source code
encodings, system api encodings, and stream i/o encodings).

btw, I thought we'd all agreed on GvR's solution for 1.6?

what did I miss?

> So while it may not be perfect, I think it is good enough.

so tell me, if "good enough" is what we're aiming at, why isn't
my counter-proposal good enough?

if not else, it's much easier to document...

</F>




From skip at mojam.com  Tue May 16 21:30:08 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 16 May 2000 14:30:08 -0500 (CDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
In-Reply-To: <39219D32.BD82DE83@tismer.com>
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
	<39219B01.A4EE0920@tismer.com>
	<39219D32.BD82DE83@tismer.com>
Message-ID: <14625.41408.423282.529732@beluga.mojam.com>

    Christian> While I have to say that

    >>>> " ".join("123")
    Christian> '1 2 3'
    >>>> 

    Christian> is not a feature to me but just annoying ;-)

More annoying than

    >>> import string
    >>> string.join("123")
    '1 2 3'

? ;-)

a-sequence-is-a-sequence-ly y'rs,

Skip



From tismer at tismer.com  Tue May 16 21:43:33 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:43:33 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
		<39219B01.A4EE0920@tismer.com>
		<39219D32.BD82DE83@tismer.com> <14625.41408.423282.529732@beluga.mojam.com>
Message-ID: <3921A4E5.9BDEBF49@tismer.com>


Skip Montanaro wrote:
> 
>     Christian> While I have to say that
> 
>     >>>> " ".join("123")
>     Christian> '1 2 3'
>     >>>>
> 
>     Christian> is not a feature to me but just annoying ;-)
> 
> More annoying than
> 
>     >>> import string
>     >>> string.join("123")
>     '1 2 3'
> 
> ? ;-)

You are right. Equally bad, just in different flavor.
*gulp* this is going to be a can of worms since...

> a-sequence-is-a-sequence-ly y'rs,

Then a string should better not be a sequence.

The number of places where I really used the string sequence
protocol to take advantage of it is outnumbered by a factor
of ten by cases where I forgot to tupleise and got a bad
result. A traceback is better than a sequence here.

oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris

p.s.: the Spanish Inquisition can't get me since I'm in Russia
until Sunday - omsk

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From guido at python.org  Tue May 16 21:49:17 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 16 May 2000 15:49:17 -0400
Subject: [Python-Dev] Unicode
In-Reply-To: Your message of "Tue, 16 May 2000 21:30:49 +0200."
             <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> 
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>  
            <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> 
Message-ID: <200005161949.PAA16607@eric.cnri.reston.va.us>

> in contrast, Guido's compromise proposal allows us to do this
> the right way in 1.7/Py3K (i.e. teach python about source code
> encodings, system api encodings, and stream i/o encodings).
> 
> btw, I thought we'd all agreed on GvR's solution for 1.6?
> 
> what did I miss?

Nothing.  We are going to do that (my "ASCII" proposal).  I'm just
waiting for the final SRE code first.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May 16 22:01:46 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 16 May 2000 16:01:46 -0400
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: Your message of "Tue, 16 May 2000 13:49:46 CDT."
             <3921984A.8CDE8E1D@prescod.net> 
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>  
            <3921984A.8CDE8E1D@prescod.net> 
Message-ID: <200005162001.QAA16657@eric.cnri.reston.va.us>

> I hope that if Python were renamed we would not choose yet another name
> which turns up hundreds of false hits in web engines. Perhaps Homr or
> Home_r. Or maybe Pythahn.

Actually, I'd like to call the next version Throatwobbler Mangrove.
But you'd have to pronounce it Raymond Luxury Yach-t.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From akuchlin at mems-exchange.org  Tue May 16 22:10:22 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 May 2000 16:10:22 -0400 (EDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
	<005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
	<200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
	<00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <14625.43822.773966.59550@amarok.cnri.reston.va.us>

Fredrik Lundh writes:
>perfectionist or not, I only want Python's Unicode support to
>be as intuitive as anything else in Python.  as it stands right
>now, Perl and Tcl's Unicode support is intuitive.  Python's not.

I don't know about Tcl, but Perl 5.6's Unicode support is still
considered experimental.  Consider the following excerpts, for
example.  (And Fredrik's right; we shouldn't release a 1.6 with broken
support, or we'll pay for it for *years*...  But if GvR's ASCII
proposal is considered OK, then great!)

========================
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-04/msg00084.html:

>Ah, yes. Unicode. But after two years of work, the one thing that users
>will want to do - open and read Unicode data - is still not there.
>Who cares if stuff's now represented internally in Unicode if they can't
>read the files they need to.

This is a "big" (as in "huge") disappointment for me as well.  I hope
we'll do better next time.

========================
http://www.egroups.com/message/perl5-porters/67906:
But given that interpretation, I'm amazed at how many operators seem
to be broken with UTF8.    It certainly supports Ilya's contention of
"pre-alpha".

Here's another example:
 
  DB<1> x (256.255.254 . 257.258.259) eq (256.255.254.257.258.259)
0  ''
  DB<2>

Rummaging with Devel::Peek shows that in this case, it's the fault of
the . operator.

And eq is broken as well:

  DB<11> x "\x{100}" eq "\xc4\x80"
0  1
  DB<12>

Aaaaargh!

========================
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-03/msg00971.html:

A couple problems here...passage through a hash key removes the UTF8
flag (as might be expected).  Even if keys were to attempt to restore
the UTF8 flag (ala Convert::UTF::decode_utf8) or hash keys were real
SVs, what then do you do with $h{"\304\254"} and the like?

Suggestions:

1. Leave things as they are, but document UTF8 hash keys as experimental
and subject to change.

or 2. When under use bytes, leave things as they are.  Otherwise, have
keys turn on the utf8 flag if appropriate.  Also give a warning when
using a hash key like "\304\254" since keys will in effect return a
different string that just happens to have the same interal encoding.

========================

 



From paul at prescod.net  Tue May 16 22:36:42 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 15:36:42 -0500
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
Message-ID: <3921B15A.73EF6355@prescod.net>

"Martin v. Loewis" wrote:
> 
> ...
> 
> I'd like to know whether you have an example of a real-world
> big-application problem that could not be conveniently implemented
> using the new Unicode API. For all the examples I can think where
> Unicode would matter (XML processing, CORBA wstring mapping,
> internationalized messages and GUIs), it would work just fine.

Of course an implicit behavior can never get in the way of
big-application building. The question is about principle of least
surprise, and simplicity of explanation and understanding.

 I'm-told-that-even-Perl-and-C++-can-be-used-for-big-apps -ly yrs

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)



From martin at loewis.home.cs.tu-berlin.de  Wed May 17 00:02:10 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 17 May 2000 00:02:10 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> (effbot@telia.com)
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <200005162202.AAA02125@loewis.home.cs.tu-berlin.de>

> perfectionist or not, I only want Python's Unicode support to
> be as intuitive as anything else in Python.  as it stands right
> now, Perl and Tcl's Unicode support is intuitive.  Python's not.

I haven't much experience with Perl, but I don't think Tcl is
intuitive in this area. I really think that they got it all wrong.
They use the string type for "plain bytes", just as we do, but then
have the notion of "correct" and "incorrect" UTF-8 (i.e. strings with
violations of the encoding rule). For a "plain bytes" string, the
following might happen

- the string is scanned for non-UTF-8 characters
- if any are found, the string is converted into UTF-8, essentially
  treating the original string as Latin-1.
- it then continues to use the UTF-8 "version" of the original string,
  and converts it back on demand.

Maybe I got something wrong, but the Unicode support in Tcl makes me
worry very much.

> btw, I thought we'd all agreed on GvR's solution for 1.6?
> 
> what did I miss?

I like the 'only ASCII is converted' approach very much, so I'm not
objecting to that solution - just as I wasn't objecting to the
previous one.

> so tell me, if "good enough" is what we're aiming at, why isn't
> my counter-proposal good enough?

Do you mean the one in

http://www.python.org/pipermail/python-dev/2000-April/005218.html

which I suppose is the same one as the "java-like approach"? AFAICT,
all it does is to change the default encoding from UTF-8 to Latin-1.
I can't follow why this should be *better*, but it would be certainly
as good... In comparison, restricting the "character" interpretation
of the string type (in terms of your proposal) to 7-bit characters
has the advantage that it is less error-prone, as Guido points out.

Regards,
Martin



From mal at lemburg.com  Wed May 17 00:59:45 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 May 2000 00:59:45 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <3921D2E1.6282AA8F@lemburg.com>

Fredrik Lundh wrote:
> 
> of course I can kludge my way around the flaws in MAL's design,
> but why should I have to do that? it's broken. fixing it is easy.

Look Fredrik, it's not *my* design. All this was discussed in
public and in several rounds late last year. If someone made
a mistake and "broke" anything, then we all did... I still
don't think so, but that's my personal opinion.

--

Now to get back to some non-flammable content: 

Has anyone played around with the latest sys.set_string_encoding()
patches ? I would really like to know what you think.

The idea behind it is that you can define what the Unicode
implementaion is to expect as encoding when it sees an
8-bit string. The encoding is used for coercion, str(unicode)
and printing. It is currently *not* used for the "s"
parser marker and hash values (mainly due to internal issues).

See my patch comments for details.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From tim_one at email.msn.com  Wed May 17 08:45:59 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 May 2000 02:45:59 -0400
Subject: [Python-Dev] join() et al.
In-Reply-To: <14625.36940.160373.900909@beluga.mojam.com>
Message-ID: <000701bfbfcb$8f6cc600$b52d153f@tim>

[Skip Montanaro]
> ...
> It's not a huge deal to me, but I think it mildly violates the
> principle of least surprise when you try to apply it to sequences
> of non-strings.

When sep.join(seq) was first discussed, half the debate was whether str()
should be magically applied to seq's elements.  I still favor doing that, as
I have often explained the TypeError in e.g.

    string.join(some_mixed_list_of_strings_and_numbers)

to people and agree with their next complaint:  their intent was obvious,
since string.join *produces* a string.  I've never seen an instance of this
error that was appreciated (i.e., it never exposed an error in program logic
or concept, it's just an anal gripe about an arbitrary and unnatural
restriction).  Not at all like

    "42" + 42

where the intent is unknowable.

> To extend this into the absurd, what should the following code display?
>
>     class Spam: pass
>
>     eggs = Spam()
>     bacon = Spam()
>     toast = Spam()
>
>     print join((eggs,bacon,toast))

Note that we killed the idea of a new builtin join last time around.  It's
the kind of muddy & gratuitous hypergeneralization Guido will veto if we
don't kill it ourselves.  That said,

    space.join((eggs, bacon, toast))

should <wink> produce

    str(eggs) + space + str(bacon) + space + str(toast)

although how Unicode should fit into all this was never clear to me.
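
Concretely (just a throwaway sketch, not anything that exists in the core;
the helper name is made up), the behavior described above amounts to:

    def autostr_join(sep, seq):
        # stringify each element first, then catenate with sep
        return sep.join(map(str, seq))

    # autostr_join(space, (eggs, bacon, toast))
    #   == str(eggs) + space + str(bacon) + space + str(toast)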

> If a join builtin is supposed to be applicable to all types, we need to
> decide what the semantics are going to be for all types.

See above.

> Maybe all that needs to happen is that you stringify any non-string
> elements before applying the + operator (just one possibility among
> many, not necessarily one I recommend).

In my experience, that it *doesn't* do that today is a common source of
surprise & mild irritation.  But I insist that "stringify" return a string
in this context, and that "+" is simply shorthand for "string catenation".
Generalizing this would be counterproductive.

> If you want to limit join's inputs to (or only make it semantically
> meaningful for) sequences of strings, then it should probably
> not be a builtin, no matter how visually annoying you find
>
>     " ".join(["a","b","c"])

This is one of those "doctor, doctor, it hurts when I stick an onion up my
ass!" things <wink>.  space.join(etc) reads beautifully, and anyone who
doesn't spell it that way but hates the above is picking at a scab they
don't *want* to heal <0.3 wink>.

having-said-nothing-new-he-signs-off-ly y'rs  - tim





From tim_one at email.msn.com  Wed May 17 09:12:27 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 May 2000 03:12:27 -0400
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au>
Message-ID: <000801bfbfcf$424029e0$b52d153f@tim>

[Mark Hammond]
> For about the 1,000,000th time in my life (no exaggeration :-), I just
> typed "python.exe foo" - I forgot the .py.

Mark, is this an Australian thing?  That is, you must be the only person on
earth (besides a guy I know from New Zealand -- Australia, New Zealand, same
thing to American eyes <wink>) who puts ".exe" at the end of "python"!  I'm
speculating that you think backwards because you're upside-down down there.

throwing-another-extension-on-the-barbie-mate-ly y'rs  - tim





From effbot at telia.com  Wed May 17 09:36:03 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 17 May 2000 09:36:03 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> <200005162202.AAA02125@loewis.home.cs.tu-berlin.de>
Message-ID: <004f01bfbfd3$0dd17a20$34aab5d4@hagrid>

Martin v. Loewis wrote:
> > perfectionist or not, I only want Python's Unicode support to
> > be as intuitive as anything else in Python.  as it stands right
> > now, Perl and Tcl's Unicode support is intuitive.  Python's not.
> 
> I haven't much experience with Perl, but I don't think Tcl is
> intuitive in this area. I really think that they got it all wrong.

"all wrong"?

Tcl works hard to maintain the characters are characters model
(implementation level 2), just like Perl.  the length of a string is
always the number of characters, slicing works as it should, the
internal representation is as efficient as you can make it.

but yes, they have a somewhat dubious autoconversion mechanism
in there.  if something isn't valid UTF-8, it's assumed to be Latin-1.

scary, huh?  not really, if you step back and look at how UTF-8 was
designed.  quoting from RFC 2279:

    "UTF-8 strings can be fairly reliably recognized as such by a
    simple algorithm, i.e. the probability that a string of characters
    in any other encoding appears as valid UTF-8 is low, diminishing
    with increasing string length."
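
in python terms, that guess amounts to something like this (just a sketch,
assuming the 1.6 unicode() builtin; the function name is made up):

    def guess_decode(s):
        # treat the bytes as UTF-8 if they decode cleanly,
        # otherwise fall back on Latin-1 (which always succeeds)
        try:
            return unicode(s, "utf-8")
        except (UnicodeError, ValueError):
            return unicode(s, "iso-8859-1")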

besides, their design is based on the plan 9 rune stuff.  that code
was written by the inventors of UTF-8, who have this to say:

    "There is little a rune-oriented program can do when given bad
    data except exit, which is unreasonable, or carry on. Originally
    the conversion routines, described below, returned errors when
    given invalid UTF, but we found ourselves repeatedly checking
    for errors and ignoring them. We therefore decided to convert
    a bad sequence to a valid rune and continue processing.

    "This technique does have the unfortunate property that con-
    verting invalid UTF byte strings in and out of runes does not
    preserve the input, but this circumstance only occurs when
    non-textual input is given to a textual program."

so let's see: they aimed for a high level of unicode support (layer
2, stream encodings, and system api encodings, etc), they've based
their design on work by the inventors of UTF-8, they have several
years of experience using their implementation in real life, and you
seriously claim that they got it "all wrong"?

that's weird.

> AFAICT, all it does is to change the default encoding from UTF-8
> to Latin-1.

now you're using "all" in that strange way again...  check the archives
for the full story (hint: a conceptual design model isn't the same thing
as a C implementation)

> I can't follow why this should be *better*, but it would be certainly
> as good... In comparison, restricting the "character" interpretation
> of the string type (in terms of your proposal) to 7-bit characters
> has the advantage that it is less error-prone, as Guido points out.

the main reason for that is that Python 1.6 doesn't have any way to
specify source encodings.  add that, so you no longer have to guess
what a string *literal* really is, and that problem goes away.  but
that's something for 1.7.

</F>




From mal at lemburg.com  Wed May 17 10:56:19 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 May 2000 10:56:19 +0200
Subject: [Python-Dev] join() et al.
References: <000701bfbfcb$8f6cc600$b52d153f@tim>
Message-ID: <39225EB3.8D2C9A26@lemburg.com>

Tim Peters wrote:
> 
> [Skip Montanaro]
> > ...
> > It's not a huge deal to me, but I think it mildly violates the
> > principle of least surprise when you try to apply it to sequences
> > of non-strings.
> 
> When sep.join(seq) was first discussed, half the debate was whether str()
> should be magically applied to seq's elements.  I still favor doing that, as
> I have often explained the TypeError in e.g.
> 
>     string.join(some_mixed_list_of_strings_and_numbers)
> 
> to people and agree with their next complaint:  their intent was obvious,
> since string.join *produces* a string.  I've never seen an instance of this
> error that was appreciated (i.e., it never exposed an error in program logic
> or concept, it's just an anal gripe about an arbitrary and unnatural
> restriction).  Not at all like
> 
>     "42" + 42
> 
> where the intent is unknowable.

Uhm, aren't we discussing a generic sequence join API here ?

For strings, I think that " ".join(seq) is just fine... but it
would be nice to have similar functionality for other sequence
items as well, e.g. for sequences of sequences.
 
> > To extend this into the absurd, what should the following code display?
> >
> >     class Spam: pass
> >
> >     eggs = Spam()
> >     bacon = Spam()
> >     toast = Spam()
> >
> >     print join((eggs,bacon,toast))
> 
> Note that we killed the idea of a new builtin join last time around.  It's
> the kind of muddy & gratuitous hypergeneralization Guido will veto if we
> don't kill it ourselves.

We did ? (I must have been too busy hacking Unicode ;-)

Well, in that case I'd still be interested in hearing about
your thoughts so that I can integrate such a beast in mxTools.
The acceptance level needed for doing that is much lower than
for the core builtins ;-)

>  That said,
> 
>     space.join((eggs, bacon, toast))
> 
> should <wink> produce
> 
>     str(egg) + space + str(bacon) + space + str(toast)
> 
> although how Unicode should fit into all this was never clear to me.

But that would mask errors and, even worse, "work around" coercion,
which is not a good idea, IMHO. Note that the need to coerce to
Unicode was the reason why the implicit str() in " ".join() was
removed from Barry's original string methods implementation.

space.join(map(str,seq)) is much clearer in this respect: it
forces the user to think about what the join should do with non-
string types.

> > If a join builtin is supposed to be applicable to all types, we need to
> > decide what the semantics are going to be for all types.
> 
> See above.
> 
> > Maybe all that needs to happen is that you stringify any non-string
> > elements before applying the + operator (just one possibility among
> > many, not necessarily one I recommend).
> 
> In my experience, that it *doesn't* do that today is a common source of
> surprise & mild irritation.  But I insist that "stringify" return a string
> in this context, and that "+" is simply shorthand for "string catenation".
> Generalizing this would be counterproductive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Wed May 17 16:12:01 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Wed, 17 May 2000 07:12:01 -0700 (PDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <004f01bfbfd3$0dd17a20$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>

On Wed, 17 May 2000, Fredrik Lundh wrote:
 > the main reason for that is that Python 1.6 doesn't have any way to
 > specify source encodings.  add that, so you no longer have to guess
 > what a string *literal* really is, and that problem goes away.  but

  You seem to be familiar with the Tcl work, so I'll ask you
this question:  Does Tcl have a way to specify source encoding?
I'm not aware of it, but I've only had time to follow the Tcl
world very lightly these past few years.  ;)


  -Fred

--
Fred L. Drake, Jr.  <fdrake at acm.org>




From effbot at telia.com  Wed May 17 16:29:32 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 17 May 2000 16:29:32 +0200
Subject: [Python-Dev] Unicode
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
Message-ID: <018101bfc00c$52be3180$34aab5d4@hagrid>

Fred L. Drake wrote:
> On Wed, 17 May 2000, Fredrik Lundh wrote:
>  > the main reason for that is that Python 1.6 doesn't have any way to
>  > specify source encodings.  add that, so you no longer have to guess
>  > what a string *literal* really is, and that problem goes away.  but
> 
>   You seem to be familiar with the Tcl work, so I'll ask you
> this question:  Does Tcl have a way to specify source encoding?

Tcl has a system encoding (which is used when passing strings
through system APIs), and file/channel-specific encodings.

(for info on how they initialize the system encoding, see earlier
posts).

unfortunately, they also use the system encoding for source
code.  for portable code, they recommend sticking to ASCII or
using "bootstrap scripts", e.g.:

    set fd [open "app.tcl" r]
    fconfigure $fd -encoding euc-jp
    set jpscript [read $fd]
    close $fd
    eval $jpscript

we can surely do better in 1.7...
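
a rough sketch of what "better" might look like on the Python side
(hypothetical file name; this only reads and decodes, since executing
the decoded source is exactly the part 1.6 has no clean answer for):

    import codecs

    fd = codecs.open("app.py", "r", "euc-jp")
    source = fd.read()    # now a Unicode string
    fd.close()
    # compiling/executing 'source' would still need an encoding-aware compiler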

</F>




From jeremy at alum.mit.edu  Thu May 18 00:38:20 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Wed, 17 May 2000 15:38:20 -0700 (PDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <3921D2E1.6282AA8F@lemburg.com>
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
	<005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
	<200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
	<00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
	<3921D2E1.6282AA8F@lemburg.com>
Message-ID: <14627.8028.887219.978041@localhost.localdomain>

>>>>> "MAL" == M -A Lemburg <mal at lemburg.com> writes:

  MAL> Fredrik Lundh wrote:
  >>  of course I can kludge my way around the flaws in MAL's design,
  >> but why should I have to do that? it's broken. fixing it is easy.

  MAL> Look Fredrik, it's not *my* design. All this was discussed in
  MAL> public and in several rounds late last year. If someone made a
  MAL> mistake and "broke" anything, then we all did... I still don't
  MAL> think so, but that's my personal opinion.

I find it's best to avoid referring to a design as "so-and-so's design"
unless you've got something specifically complimentary to say.  Using
the person's name in combination with some criticism of the design
tends to produce a defensive reaction.  Perhaps avoiding that would
help make this discussion less contentious.

Jeremy




From martin at loewis.home.cs.tu-berlin.de  Thu May 18 00:55:21 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Thu, 18 May 2000 00:55:21 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
	(fdrake@acm.org)
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
Message-ID: <200005172255.AAA01245@loewis.home.cs.tu-berlin.de>

>   You seem to be familiar with the Tcl work, so I'll ask you
> this question:  Does Tcl have a way to specify source encoding?
> I'm not aware of it, but I've only had time to follow the Tcl
> world very lightly these past few years.  ;)

To my knowledge, no. Tcl (at least 8.3) supports the \u notation for
Unicode escapes, and treats all other source code as Latin-1.
The encoding(n) man page says

# However, because the source command always reads files using the
# ISO8859-1 encoding, Tcl will treat each byte in the file as a
# separate character that maps to the 00 page in Unicode.

Regards
Martin




From tim_one at email.msn.com  Thu May 18 06:34:13 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:13 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <3921A4E5.9BDEBF49@tismer.com>
Message-ID: <000301bfc082$51ce0180$6c2d153f@tim>

[Christian Tismer]
> ...
> Then a string should better not be a sequence.
>
> The number of places where I really used the string sequence
> protocol to take advantage of it is outnumbered by a factor
> of ten by cases where I forgot to tuple-ize and got a bad
> result. A traceback is better than a sequence here.

Alas, I think

    for ch in string:
        muck w/ the character ch

is a common idiom.

> oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris

The "sequenenceness" of strings does get in the way often enough.  Strings
have the amazing property that, since characters are also strings,

    while 1:
        string = string[0]

never terminates with an error.  This often manifests as unbounded recursion
in generic functions that crawl over nested sequences (the first time you
code one of these, you try to stop the recursion on an "is it a sequence?"
test, and then someone passes in something containing a string and it
descends forever).  And we also have that

    format % values

requires "values" to be specifically a tuple rather than any old sequence,
else the current

    "%s" % some_string

could be interpreted the wrong way.

There may be some hope in that the "for/in" protocol is now conflated with
the __getitem__ protocol, so if Python grows a more general iteration
protocol, perhaps we could back away from the sequenceness of strings
without harming "for" iteration over the characters ...
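
A minimal illustration of that recursion gotcha (a hypothetical flatten
helper, not anything in the std lib):

    def flatten(seq, result=None):
        if result is None:
            result = []
        for item in seq:
            try:
                item[0]                  # "is it a sequence?"
            except (TypeError, IndexError):
                result.append(item)      # scalar (or empty) -- keep it
            else:
                flatten(item, result)    # a string recurses here forever
        return result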





From tim_one at email.msn.com  Thu May 18 06:34:05 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:05 -0400
Subject: [Python-Dev] join() et al.
In-Reply-To: <39225EB3.8D2C9A26@lemburg.com>
Message-ID: <000001bfc082$4d9d5020$6c2d153f@tim>

[M.-A. Lemburg]
> ...
> Uhm, aren't we discussing a generic sequence join API here ?

It depends on whether your "we" includes me <wink>.

> Well, in that case I'd still be interested in hearing about
> your thoughts so that I can integrate such a beast in mxTools.
> The acceptance level needed for doing that is much lower than
> for the core builtins ;-)

Heh heh.  Python already has a generic sequence join API, called "reduce".
What else do you want beyond that?  There's nothing else I want, and I don't
even want reduce <0.9 wink>.  You can mine any modern Lisp, or any ancient
APL, for more of this ilk.  NumPy has some use for stuff like this, but
effective schemes require dealing with multiple dimensions intelligently,
and then you're in the proper domain of matrices rather than sequences.
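
For concreteness, the reduce spelling might look like this (a sketch;
sep and seq are placeholders, and for strings it's quadratic, which is
one more reason it's no substitute for sep.join()):

    def generic_join(seq, sep):
        return reduce(lambda a, b: a + sep + b, seq)

    generic_join(["eggs", "bacon", "toast"], " ")   # -> 'eggs bacon toast'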

> >  That said,
> >
> >     space.join((eggs, bacon, toast))
> >
> > should <wink> produce
> >
> >     str(egg) + space + str(bacon) + space + str(toast)
> >
> > although how Unicode should fit into all this was never clear to me.

> But that would mask errors and,

As I said elsewhere in the msg, I have never seen this "error" do anything
except irritate a user whose intent was the utterly obvious one (i.e.,
convert the object to a string, then catenate it).

> even worse, "work around" coercion, which is not a good idea, IMHO.
> Note that the need to coerce to Unicode was the reason why the
> implicit str() in " ".join() was removed from Barry's original string
> methods implementation.

I'm hoping that in P3K we have only one string type, and then the ambiguity
goes away.  In the meantime, it's a good reason to drop Unicode support
<snicker>.

> space.join(map(str,seq)) is much clearer in this respect: it
> forces the user to think about what the join should do with non-
> string types.

They're producing a string; they want join to turn the pieces into strings;
it's a no-brainer unless join is hypergeneralized into terminal obscurity
(like, indeed, Python's "reduce").

simple-tools-for-tedious-little-tasks-ly y'rs  - tim





From tim_one at email.msn.com  Thu May 18 06:34:11 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:11 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <39219B01.A4EE0920@tismer.com>
Message-ID: <000201bfc082$50909f80$6c2d153f@tim>


[Christian Tismer]
> ...
> After all, it is no surprise. They are right.
> If we have to change their mind in order to understand
> a basic operation, then we are wrong, not they.

[Tim]
> Huh!  I would not have guessed that you'd give up on Stackless
> that easily <wink>.

[Chris]
> Noh, I didn't give up Stackless, but fishing for soles.
> After Just v. R. has become my most ambitious user,
> I'm happy enough.

I suspect you missed the point:  Stackless is the *ultimate* exercise in
"changing their mind in order to understand a basic operation".  I was
tweaking you, just as you're tweaking me <smile!>.

> It is absolutely fantastic.
> The most uninteresting stuff in the join is the separator,
> and it has the power to merge thousands of strings
> together, without asking the sequence at all
>  - give all power to the suppressed, long live the Python anarchy :-)

Exactly!  Just as love has the power to bind thousands of incompatible
humans without asking them either:  a vote for space.join() is a vote for
peace on earth.

while-a-generic-join-builtin-is-a-vote-for-war<wink>-ly y'rs  - tim





From tim_one at email.msn.com  Thu May 18 06:34:17 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:17 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000001bfbe40$07f14520$b82d153f@tim>
Message-ID: <000401bfc082$54211940$6c2d153f@tim>

Just a brief note on the little list-grower I posted.  Upon more digging
this doesn't appear to have any relation to Dragon's Win98 headaches, so I
haven't looked at it much more.  Two data points:

1. Gordon McM and I both tried it under NT 4 systems (thanks, G!), and
   those are the only Windows platforms under which no MemoryError is
   raised.  But the runtime behavior is very clearly quadratic-time (in
   the ultimate length of the list) under NT.

2. Win98 comes with very few diagnostic tools useful at this level.  The
   Python process does *not* grow to an unreasonable size.  However, using
   a freeware heap walker I quickly determined that Python quickly sprays
   data *all over* its entire 2Gb virtual heap space while running this
   thing, and then the memory error occurs.  The dump file for the system
   heap memory blocks (just listing the start address, length, & status of
   each block) is about 128Kb and I haven't had time to analyze it.  It's
   clearly terribly fragmented, though.  The mystery here is why Win98
   isn't coalescing all the gazillions of free areas to come up with a big-
   enough contiguous chunk to satisfy the request (according to me <wink>,
   the program doesn't create any long-lived data other than the list --
   it appends "1" each time, and uses xrange).

Dragon's Win98 woes appear due to something else:  right after a Win98
system w/ 64Mb RAM is booted, about half the memory is already locked (not
just committed)!  Dragon's product needs more than the remaining 32Mb to
avoid thrashing.  Even stranger, killing every process after booting
releases an insignificant amount of that locked memory.  Strange too, on my
Win98 w/ 160Mb of RAM, upon booting Win98 a massive 50Mb is locked.  This is
insane, and we haven't been able to figure out on whose behalf all this
memory is being allocated.

personally-like-win98-a-lot-but-then-i-bought-a-lot-of-ram-ly y'rs
    - tim





From moshez at math.huji.ac.il  Thu May 18 07:36:09 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Thu, 18 May 2000 08:36:09 +0300 (IDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
In-Reply-To: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <Pine.GSO.4.10.10005180827490.14709-100000@sundial>

[Tim Peters, on sequenceness of strings]
>     for ch in string:
>         muck w/ the character ch
> 
> is a common idiom.

Hmmmm...if you add a new method,

for ch in string.as_sequence():
	muck w/ the character ch

You'd solve this.

But you won't manage to convince me that you haven't used things like

string[3:5]+string[6:] to get all the characters that...

The real problem (as I see it, from my very strange POV) is that Python
uses strings for two distinct purposes:

1 -- Symbols
2 -- Arrays of characters

"Symbols" are ``run-time representation of identifiers''. For example,
getattr's "prototype" "should be"

getattr(object, symbol, object=None)

While re's search method should be

re_object.search(string)

Of course, there are symbol->string and string->symbol functions, just as
there are list->tuple and tuple->list functions. 

BTW, this would also solve problems if you want to go case-insensitive in
Py3K: == is case-sensitive on strings, but case-insensitive on symbols.
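
A rough sketch of what such a symbol type could look like (hypothetical,
and assuming the case-insensitive comparison suggested above):

    class Symbol:
        def __init__(self, name):
            self.name = name
        def __cmp__(self, other):            # other is assumed to be a Symbol
            return cmp(self.name.lower(), other.name.lower())
        def __hash__(self):
            return hash(self.name.lower())
        def __repr__(self):
            return 'Symbol(%s)' % repr(self.name)

    def symbol_to_string(sym):
        return sym.name

    def string_to_symbol(s):
        return Symbol(s)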

i've-got-this-on-my-chest-since-the-python-conference-and-it-was-a-
  good-opportunity-to-get-it-off-ly y'rs, Z.
--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From ping at lfw.org  Thu May 18 06:37:42 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 17 May 2000 21:37:42 -0700 (PDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
In-Reply-To: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <Pine.LNX.4.10.10005172133490.775-100000@skuld.lfw.org>

On Thu, 18 May 2000, Tim Peters wrote:
> There may be some hope in that the "for/in" protocol is now conflated with
> the __getitem__ protocol, so if Python grows a more general iteration
> protocol, perhaps we could back away from the sequenceness of strings
> without harming "for" iteration over the characters ...

But there's no way we can back away from

    spam = eggs[hack:chop] + ham[slice:dice]

on strings.  It's just too ideal.

Perhaps eventually the answer will be a character type?

Or perhaps no change at all.  I've not had the pleasure of running
into these problems with characters-being-strings before, even though
your survey of the various gotchas now makes that kind of surprising.


-- ?!ng

"Happiness isn't something you experience; it's something you remember."
    -- Oscar Levant




From mal at lemburg.com  Thu May 18 11:43:57 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 May 2000 11:43:57 +0200
Subject: [Python-Dev] join() et al.
References: <000001bfc082$4d9d5020$6c2d153f@tim>
Message-ID: <3923BB5D.47A28CBE@lemburg.com>

Tim Peters wrote:
> 
> [M.-A. Lemburg]
> > ...
> > Uhm, aren't we discussing a generic sequence join API here ?
> 
> It depends on whether your "we" includes me <wink>.
> 
> > Well, in that case I'd still be interested in hearing about
> > your thoughts so that I can integrate such a beast in mxTools.
> > The acceptance level needed for doing that is much lower than
> > for the core builtins ;-)
> 
> Heh heh.  Python already has a generic sequence join API, called "reduce".
> What else do you want beyond that?  There's nothing else I want, and I don't
> even want reduce <0.9 wink>.  You can mine any modern Lisp, or any ancient
> APL, for more of this ilk.  NumPy has some use for stuff like this, but
> effective schemes require dealing with multiple dimensions intelligently,
> and then you're in the proper domain of matrices rather than sequences.

The idea behind a generic join() API was that it could be
used to make algorithms dealing with sequences polymorphic --
but you're right: this goal is probably too far-fetched.

> > >  That said,
> > >
> > >     space.join((eggs, bacon, toast))
> > >
> > > should <wink> produce
> > >
> > >     str(egg) + space + str(bacon) + space + str(toast)
> > >
> > > although how Unicode should fit into all this was never clear to me.
> 
> > But that would mask errors and,
> 
> As I said elsewhere in the msg, I have never seen this "error" do anything
> except irritate a user whose intent was the utterly obvious one (i.e.,
> convert the object to a string, then catenate it).
> 
> > even worse, "work around" coercion, which is not a good idea, IMHO.
> > Note that the need to coerce to Unicode was the reason why the
> > implicit str() in " ".join() was removed from Barry's original string
> > methods implementation.
> 
> I'm hoping that in P3K we have only one string type, and then the ambiguity
> goes away.  In the meantime, it's a good reason to drop Unicode support
> <snicker>.

I'm hoping for that too... it should be Unicode everywhere if you
ask me.

In the meantime we can test drive this goal using the -U command
line option: it turns "" into u"" without any source code change.
The fun part about this is that running python in -U mode
reveals quite a few places where the standard lib doesn't handle
Unicode properly, so there's a lot of work ahead...

> > space.join(map(str,seq)) is much clearer in this respect: it
> > forces the user to think about what the join should do with non-
> > string types.
> 
> They're producing a string; they want join to turn the pieces into strings;
> it's a no-brainer unless join is hypergeneralized into terminal obscurity
> (like, indeed, Python's "reduce").

Hmm, the Unicode implementation does these implicit
conversions during coercion and you've all seen the success...
are you sure you want more of this ? 

We could have "".join() apply str() for all objects *except* Unicode.
1 + "2" == "12" would also be an option, or maybe 1 + "2" == 3 ? ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jack at oratrix.nl  Thu May 18 12:01:16 2000
From: jack at oratrix.nl (Jack Jansen)
Date: Thu, 18 May 2000 12:01:16 +0200
Subject: [Python-Dev] hey, who broke the array module? 
In-Reply-To: Message by Trent Mick <trentm@activestate.com> ,
	     Mon, 15 May 2000 14:09:58 -0700 , <20000515140958.C20418@activestate.com>
Message-ID: <20000518100116.F06AB370CF2@snelboot.oratrix.nl>

> I broke it with my patches to test overflow for some of the PyArg_Parse*()
> formatting characters. The upshot of testing for overflow is that now those
> formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> unsigned-ness as appropriate (you have to know if the value is signed or
> unsigned to know what limits to check against for overflow). Two
> possibilities presented themselves:

I think this is a _very_ bad idea. I have a few thousand (literally) routines
calling Macintosh system calls that use "h" for 16-bit flag-word values,
and the constants are all of the form

kDoSomething = 0x0001
kDoSomethingElse = 0x0002
...
kDoSomethingEvenMoreBrilliant = 0x8000

I'm pretty sure other operating systems have lots of calls with similar 
problems. I would strongly suggest using a new format char if you want 
overflow-tested integers.
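
For concreteness, the breakage surfaces at the Python level through e.g.
the array module, whose 'h' items go through the same PyArg_Parse code
(a sketch, assuming the signed-range check described above is in place):

    import array

    flags = array.array('h')
    flags.append(0x0001)    # fine
    flags.append(0x8000)    # 32768 > SHRT_MAX, so this now raises OverflowError
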
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From trentm at activestate.com  Thu May 18 18:56:47 2000
From: trentm at activestate.com (Trent Mick)
Date: Thu, 18 May 2000 09:56:47 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <20000518100116.F06AB370CF2@snelboot.oratrix.nl>
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl>
Message-ID: <20000518095647.D32135@activestate.com>

On Thu, May 18, 2000 at 12:01:16PM +0200, Jack Jansen wrote:
> > I broke it with my patches to test overflow for some of the PyArg_Parse*()
> > formatting characters. The upshot of testing for overflow is that now those
> > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> > unsigned-ness as appropriate (you have to know if the value is signed or
> > unsigned to know what limits to check against for overflow). Two
> > possibilities presented themselves:
> 
> I think this is a _very_ bad idea. I have a few thousand (literally) routines
> calling Macintosh system calls that use "h" for 16-bit flag-word values,
> and the constants are all of the form
> 
> kDoSomething = 0x0001
> kDoSomethingElse = 0x0002
> ...
> kDoSomethingEvenMoreBrilliant = 0x8000
> 
> I'm pretty sure other operating systems have lots of calls with similar 
> problems. I would strongly suggest using a new format char if you want 
> overflow-tested integers.

Sigh. What do you think Guido? This is your call.

1. go back to no bounds testing
2. bounds check for [SHRT_MIN, USHRT_MAX] etc (this would allow signed and
unsigned values but is sort of false security for bounds checking)
3. keep it the way it is: 'b' is unsigned and the rest are signed
4. add new format characters or a modifying character for signed and unsigned
versions of these.

Trent

-- 
Trent Mick
trentm at activestate.com



From guido at python.org  Fri May 19 00:05:45 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 18 May 2000 15:05:45 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: Your message of "Thu, 18 May 2000 09:56:47 PDT."
             <20000518095647.D32135@activestate.com> 
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl>  
            <20000518095647.D32135@activestate.com> 
Message-ID: <200005182205.PAA12830@cj20424-a.reston1.va.home.com>

> On Thu, May 18, 2000 at 12:01:16PM +0200, Jack Jansen wrote:
> > > I broke it with my patches to test overflow for some of the PyArg_Parse*()
> > > formatting characters. The upshot of testing for overflow is that now those
> > > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> > > unsigned-ness as appropriate (you have to know if the value is signed or
> > > unsigned to know what limits to check against for overflow). Two
> > > possibilities presented themselves:
> > 
> > I think this is a _very_ bad idea. I have a few thousand (literally) routines
> > calling Macintosh system calls that use "h" for 16-bit flag-word values,
> > and the constants are all of the form
> > 
> > kDoSomething = 0x0001
> > kDoSomethingElse = 0x0002
> > ...
> > kDoSomethingEvenMoreBrilliant = 0x8000
> > 
> > I'm pretty sure other operating systems have lots of calls with similar 
> > problems. I would strongly suggest using a new format char if you want 
> > overflow-tested integers.
> 
> Sigh. What do you think Guido? This is your call.
> 
> 1. go back to no bounds testing
> 2. bounds check for [SHRT_MIN, USHRT_MAX] etc (this would allow signed and
> unsigned values but is sort of false security for bounds checking)
> 3. keep it the way it is: 'b' is unsigned and the rest are signed
> 4. add new format characters or a modifying character for signed and unsigned
> versions of these.

Sigh indeed.  Ideally, we'd introduce H for unsigned and then lock
Jack in a room with his Macintosh computer for 48 hours to fix all his
code...

Jack, what do you think?  Is this acceptable?  (I don't know if you're
still into S&M :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Thu May 18 22:38:59 2000
From: trentm at activestate.com (Trent Mick)
Date: Thu, 18 May 2000 13:38:59 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <200005182249.PAA13020@cj20424-a.reston1.va.home.com>
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl> <20000518095647.D32135@activestate.com> <200005182205.PAA12830@cj20424-a.reston1.va.home.com> <20000518121723.A3252@activestate.com> <200005182225.PAA12950@cj20424-a.reston1.va.home.com> <20000518123029.A3330@activestate.com> <200005182249.PAA13020@cj20424-a.reston1.va.home.com>
Message-ID: <20000518133859.A3665@activestate.com>

On Thu, May 18, 2000 at 03:49:59PM -0700, Guido van Rossum wrote:
> 
> Maybe we can come up with a modifier for signed or unsigned range
> checking?

Ha! How about 'u'? :) Or 's'? :)

I really can't think of a nice answer for this. Could introduce completely
separate formatter characters that do the range checking and remove range
checking from the current formatters. That is an ugly kludge. Could introduce
a separate PyArg_CheckedParse*() or something like that and slowly migrate to
it. This one could use something other than "L" for LONG_LONG.

I think the long term solution should be:
 - have bounds-checked signed and unsigned version of all the integral types
 - call them i/I, b/B, etc. (a la array module)
 - use something other than "L" for LONG_LONG (as you said, q/Q maybe)

The problem is to find a satisfactory migratory path to that.

Sorry, I don't have an answer. Just more questions.

Trent


p.s. If you were going to check in my associated patch: I have a problem with
the tab usage in test_array.py, which I will resubmit soon (in a couple of days).

-- 
Trent Mick
trentm at activestate.com



From guido at python.org  Fri May 19 17:06:52 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 19 May 2000 08:06:52 -0700
Subject: [Python-Dev] repr vs. str and locales again
Message-ID: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>

The email below suggests a simple solution to a problem that
e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns
all non-ASCII chars into \oct escapes.  Jyrki's solution: use
isprint(), which makes it locale-dependent.  I can live with this.

It needs a Py_CHARMASK() call but otherwise seems to be fine.

Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
similar patch for unicode strings (once the ASCII proposal is
implemented).

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Fri, 19 May 2000 10:48:29 +0300
From:    Jyrki Kuoppala <jkp at kaapeli.fi>
To:      guido at python.org
Subject: python bug?: python 1.5.2 fails to print printable 8-bit characters in
	   strings

I'm not sure if this is exactly a bug, i.e. whether python 1.5.2 is
supposed to support locales and 8-bit characters.  However, on the Linux
Debian "unstable" distribution the diff below makes python 1.5.2
handle printable 8-bit characters as one would expect.

Problem description:

python doesn't properly print printable 8-bit characters for the
current locale.

Details:

With no locale set, 8-bit characters in quoted strings print as
backslash-escapes, which I guess is OK:

$ unset LC_ALL
$ python
Python 1.5.2 (#0, Apr  3 2000, 14:46:48)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'k\344\344k')
>>>

But with a locale with a printable 'ä' character (octal 344) I get:

$ export LC_ALL=fi_FI
$ python
Python 1.5.2 (#0, Apr  3 2000, 14:46:48)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'k\344\344k')
>>>

I should be getting (output from python patched with the enclosed patch):

$ export LC_ALL=fi_FI
$ python
Python 1.5.2 (#0, May 18 2000, 14:43:46)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'kääk')
>>>                              

This hits, for example, when Zope with the squishdot weblog (squishdot
0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles --
strings with valid Latin-1 characters get indexed as backslash-escaped
octal codes, and thus become unsearchable.

I am using debian unstable, kernels 2.2.15pre10 and 2.0.36, libc 2.1.3.

I suggest that the test for printability in python-1.5.2
/Objects/stringobject.c be fixed to use isprint() which takes the
locale into account:

--- python-1.5.2/Objects/stringobject.c.orig	Thu Oct  8 05:17:48 1998
+++ python-1.5.2/Objects/stringobject.c	Thu May 18 14:36:28 2000
@@ -224,7 +224,7 @@
 		c = op->ob_sval[i];
 		if (c == quote || c == '\\')
 			fprintf(fp, "\\%c", c);
-		else if (c < ' ' || c >= 0177)
+		else if (! isprint (c))
 			fprintf(fp, "\\%03o", c & 0377);
 		else
 			fputc(c, fp);
@@ -260,7 +260,7 @@
 			c = op->ob_sval[i];
 			if (c == quote || c == '\\')
 				*p++ = '\\', *p++ = c;
-			else if (c < ' ' || c >= 0177) {
+			else if (! isprint (c)) {
 				sprintf(p, "\\%03o", c & 0377);
 				while (*p != '\0')
 					p++;



//Jyrki

------- End of Forwarded Message




From guido at python.org  Fri May 19 17:13:01 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 19 May 2000 08:13:01 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: Your message of "Fri, 19 May 2000 11:25:43 +0200."
             <39250897.6F42@cnet.francetelecom.fr> 
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org>  
            <39250897.6F42@cnet.francetelecom.fr> 
Message-ID: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>

[Quoting the entire mail because I've added python-dev to the cc:
list]

> Subject: Re: Python multiplexing is too hard (was: Network statistics program)
> From: Alexandre Ferrieux <alexandre.ferrieux at cnet.francetelecom.fr>
> To: Guido van Rossum <guido at python.org>
> Cc: claird at starbase.neosoft.com
> Date: Fri, 19 May 2000 11:25:43 +0200
> Delivery-Date: Fri May 19 05:26:59 2000
> 
> Guido van Rossum wrote:
> > 
> > Cameron Laird wrote:
> > >                    .
> > > Right.  asyncore is nice--but restricted to socket
> > > connections.  For many applications, that's not a
> > > restriction at all.  However, it'd be nice to have
> > > such a handy interface for communication with
> > > same-host processes; that's why I mentioned popen*().
> > > Does no one else perceive a gap there, in convenient
> > > asynchronous piped IPC?  Do folks just fall back on
> > > select() for this case?
> > 
> > Hm, really?  For same-host processes, threads would
> > do the job nicely I'd say.
> 
> Overkill.
> 
> >  Or you could probably
> > use unix domain sockets (popen only really works on
> > Unix, so that's not much of a restriction).
> 
> Overkill.
> 
> > Also note that often this is needed in the context
> > of a GUI app; there something integrated in the GUI
> > main loop is recommended.  (E.g. the file events that
> > Moshe mentioned.)
> 
> Okay so your answer is, The Python Way of doing it is to use Tcl.
> That's pretty disappointing, I'm sorry to say...
> 
> Consider:
> 
> 	- In Tcl, as you said, this is nicely integrated with the GUI's 
> 	  event queue:
> 		- on unix, by an additional bit on X's fd (socket) in 
> 		  the select()
> 		- on 'doze, everything is brought back to messages 
> 		  anyway.
> 
> 	And, in both cases, it works with pipes, sockets, serial or other
> devices. Uniform, clean.
> 
> 	- In python "popen only really works on Unix": are you satisfied with
> that state of affairs ? I understand (and value) Python's focus on
> algorithms and data structures, and working around OS shortcomings is a
> boring, ancillary task. But what about the potential gain ?
> 
> I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
> just so beautiful inside. But while Tcl is weaker in the algorithms, it
> is stronger in the os-wrapping library, and taught me to love high-level
> abstractions. [fileevent] shines in this respect, and I'll miss it in
> Python.
> 		
> -Alex

Alex, it's disappointing to me too!  There just isn't anything
currently in the library to do this, and I haven't written apps that
need this often enough to have a good feel for what kind of
abstraction is needed.

However perhaps we can come up with a design for something better?  Do
you have a suggestion here?

I agree with your comment that higher-level abstractions around OS
stuff are needed -- I learned system programming long ago, in C, and
I'm "happy enough" with the current state of affairs, but I agree that
for many people this is a problem, and there's no reason why Python
couldn't do better...
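
In the meantime, the bare-bones fallback really is select() on the pipe
(a Unix-only sketch; "some_command" is a placeholder):

    import os, select

    pipe = os.popen("some_command", "r")
    while 1:
        ready, _, _ = select.select([pipe], [], [], 1.0)  # 1-second timeout
        if pipe in ready:
            line = pipe.readline()
            if not line:
                break            # child closed its end
            # handle 'line' here, or dispatch back into a GUI event loop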

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fredrik at pythonware.com  Fri May 19 14:44:55 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 19 May 2000 14:44:55 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <002f01bfc190$09870c00$0500a8c0@secret.pythonware.com>

Guido van Rossum wrote:
> Jyrki's solution: use isprint(), which makes it locale-dependent.
> I can live with this.
> 
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
> 
> Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
> similar patch for unicode strings (once the ASCII proposal is
> implemented).

does ctype-related locale stuff really mix well with unicode?

if yes, -0. if no, +0.

(intuitively, I'd say no -- deprecate in 1.6, remove in 1.7)

(btw, what about "eval(repr(s)) == s" ?)

</F>




From mal at lemburg.com  Fri May 19 14:30:08 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 May 2000 14:30:08 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <392533D0.965E47E4@lemburg.com>

Guido van Rossum wrote:
> 
> The email below suggests a simple solution to a problem that
> e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.
> 
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
> 
> Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
> similar patch for unicode strings (once the ASCII proposal is
> implemented).

The subject line is a bit misleading: the patch only touches
tp_print, not repr() output. And this is good, IMHO, since
otherwise eval(repr(string)) wouldn't necessarily result
in string.

Unicode objects don't implement a tp_print slot... perhaps
they should ?

--

About the ASCII proposal:

Would you be satisfied with what

import sys
sys.set_string_encoding('ascii')

currently implements ?

There are several places where an encoding comes into play with
the Unicode implementation. The above API currently changes
str(unicode), print unicode and the assumption made by the
implementation during coercion of strings to Unicode.

It does not change the encoding used to implement the "s"
or "t" parser markers and also doesn't change the way the
Unicode hash value is computed (these are currently still
hard-coded as UTF-8).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gward at mems-exchange.org  Fri May 19 14:45:12 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 19 May 2000 08:45:12 -0400
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>; from guido@python.org on Fri, May 19, 2000 at 08:06:52AM -0700
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <20000519084511.A14717@mems-exchange.org>

On 19 May 2000, Guido van Rossum said:
> The email below suggests a simple solution to a problem that
> e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

For "ASCII" strings in this day and age -- which are often not
necessarily plain ol' 7-bit ASCII -- I'd say that "32 <= c <= 127" is
not the right way to determine printability.  'isprint()' seems much
more appropriate to me.

Are there other areas of Python that should be locale-sensitive but
aren't?  A minor objection to this patch is that it's a creeping change
that brings in a little bit of locale-sensitivity without addressing a
(possibly) wider problem.  However, I will immediately shoot down my own
objection on the grounds that if we try to fix everything all at once,
then nothing will ever get fixed.  Locale sensitivity strikes me as the
sort of thing that *can* be a "creeping" change -- just fix the bits
that bug people most, and eventually all the important bits will be
fixed.

I have no expertise and therefore no opinion on such a change for
Unicode strings.

        Greg



From pf at artcom-gmbh.de  Fri May 19 14:44:00 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 19 May 2000 14:44:00 +0200 (MEST)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 19, 2000  8: 6:52 am"
Message-ID: <m12sm8G-000CnCC@artcom0.artcom-gmbh.de>

Guido van Rossum asks:
> The email below suggests a simple solution to a problem that
> e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

How portable is the locale awareness of 'isprint()' among
traditional Unix environments, WinXX and MacOS?  This works fine on
my favorite development platform (Linux), but an accidental use of
this new 'feature' might hurt the portability of my Python apps to
other platforms.  If 'isprint()' honors the locale in a similar way
on other important platforms, I would like this.  Otherwise I would
prefer the current behaviour so that I can deal with it during the
early stages of development on my Linux boxes.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From bwarsaw at python.org  Fri May 19 20:51:23 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Fri, 19 May 2000 11:51:23 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
	<20000519084511.A14717@mems-exchange.org>
Message-ID: <14629.36139.735410.272339@localhost.localdomain>

>>>>> "GW" == Greg Ward <gward at mems-exchange.org> writes:

    GW> Locale sensitivity strikes me as the sort of thing that *can*
    GW> be a "creeping" change -- just fix the bits that bug people
    GW> most, and eventually all the important bits will be fixed.

Another decidedly ignorant Anglophone here, but one problem that I see
with localizing stuff is that locale is app- (or at least thread-)
global, isn't it?  That would suck for applications like Mailman which
are (going to be) multilingual in the sense that a single instance of
the application will serve up documents in many languages, as opposed
to serving up documents in just one of a choice of languages.

If it seems I don't know what I'm talking about, you're probably
right.  I just wanted to point out that there are applications that have
to deal with many languages at the same time.

-Barry




From effbot at telia.com  Fri May 19 18:46:39 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Fri, 19 May 2000 18:46:39 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com><20000519084511.A14717@mems-exchange.org> <14629.36139.735410.272339@localhost.localdomain>
Message-ID: <00e001bfc1b1$d0c1d7c0$34aab5d4@hagrid>

Barry Warsaw wrote:
> Another decidedly ignorant Anglophone here, but one problem that I see
> with localizing stuff is that locale is app- (or at least thread-)
> global, isn't it?  That would suck for applications like Mailman which
> are (going to be) multilingual in the sense that a single instance of
> the application will serve up documents in many languages, as opposed
> to serving up documents in just one of a choice of languages.
> 
> If it seems I don't know what I'm talking about, you're probably
> right.  I just wanted to point out that there are applications have to
> deal with many languages at the same time.

Applications may also have to deal with output devices (i.e. GUI
toolkits, printers, communication links) that don't necessarily have
the same restrictions as the "default console".

better do it the right way: deal with encodings at the boundaries,
not inside the application.

</F>




From gward at mems-exchange.org  Fri May 19 19:03:18 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 19 May 2000 13:03:18 -0400
Subject: [Python-Dev] Dynamic linking problem on Solaris
Message-ID: <20000519130317.A16111@mems-exchange.org>

Hi all --

interesting problem with building Robin Dunn's extension for BSD DB 2.x
as a shared object on Solaris 2.6 for Python 1.5.2 with GCC 2.8.1 and
Sun's linker.  (Yes, all of those things seem to matter.)

DB 2.x (well, at least 2.7.7) contains this line of C code:

    *mbytesp = sb.st_size / MEGABYTE;

where 'sb' is a 'struct stat' -- i.e. 'sb.st_size' is a long long, which
I believe is 64 bits on Solaris.  Anyway, GCC compiles this division
into a subroutine call -- I guess the SPARC doesn't have a 64-bit
divide, or if it does then GCC doesn't know about it.

Of course, the subroutine in question -- '__cmpdi2' -- is defined in
libgcc.a.  So if you write a C application that uses BSD DB 2.x, and
compile and link it with GCC, no problem -- everything is controlled by
GCC, so libgcc.a gets linked in at the appropriate time, the linker
finds '__cmpdi2' and includes it in your binary executable, and
everything works.

However, if you're building a Python extension that uses BSD DB 2.x,
there's a problem: the default command for creating a shared extension
on Solaris is "ld -G" -- this is in Python's Makefile, so it affects
extension building with either Makefile.pre.in or the Distutils.

However, since "ld" is Sun's "ld", it doesn't know anything about
libgcc.a.  And, since presumably no 64-bit division is done in Python
itself, '__cmpdi2' isn't already present in the Python binary.  The
result: when you attempt to load the extension, you die:

  $ python -c "import dbc"
  Traceback (innermost last):
    File "<string>", line 1, in ?
  ImportError: ld.so.1: python: fatal: relocation error: file ./dbcmodule.so: symbol __cmpdi2: referenced symbol not found

The workaround turns out to be fairly easy -- in fact, there are two of
them.  First, add libgcc.a to the link command, i.e. instead of

  ld -G  db_wrap.o  -L/usr/local/BerkeleyDB/lib -ldb -o dbcmodule.so

use

  ld -G  db_wrap.o  -L/usr/local/BerkeleyDB/lib -ldb \
    /depot/gnu/plat/lib/gcc-lib/sparc-sun-solaris2.6/2.8.1/libgcc.a \
    -o dbcmodule.so

(where the location of libgcc.a is variable, but invariably hairy).  Or,
it turns out that you can just use "gcc -G" to create the extension:

  gcc -G db_wrap.o -ldb -o dbcmodule.so

Seems to me that the latter is a no-brainer.

So the question arises: why is the default command for building
extensions on Solaris "ld -G" instead of "gcc -G"?  I'm inclined to go
edit my installed Makefile to make this permanent... what will that
break?

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From bwarsaw at python.org  Fri May 19 22:09:09 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Fri, 19 May 2000 13:09:09 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
	<20000519084511.A14717@mems-exchange.org>
	<14629.36139.735410.272339@localhost.localdomain>
	<00e001bfc1b1$d0c1d7c0$34aab5d4@hagrid>
Message-ID: <14629.40805.180119.929694@localhost.localdomain>

>>>>> "FL" == Fredrik Lundh <effbot at telia.com> writes:

    FL> better do it the right way: deal with encodings at the
    FL> boundaries, not inside the application.

Sounds good to me. :)




From ping at lfw.org  Fri May 19 19:04:18 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Fri, 19 May 2000 10:04:18 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005190957260.2892-100000@localhost>

On Fri, 19 May 2000, Ka-Ping Yee wrote:
> 
> Changing the behaviour of repr() (a function that internally
> converts data into data)

Clarification: what i meant by the above is, repr() is not
explicitly an input or an output function.  It does "some
internal computation".

Here is one alternative:

    repr(obj, **kw): options specified in kw dict
                     
        push each element in kw dict into sys.repr_options
        now do the normal conversion, referring to whatever
            options are relevant (such as "locale" if doing strings)
        for looking up any option, first check kw dict,
            then look for sys.repr_options[option]
        restore sys.repr_options

This is ugly and i still like printon/printout better, but
at least it's a smaller change and won't prevent the implementation
of printon/printout later.

This suggestion is not thread-safe.
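
A rough sketch of the mechanism (hypothetical names -- sys.repr_options
does not exist today, and the built-in conversion code would have to be
taught to consult it):

    import sys

    def repr_with_options(obj, **kw):
        try:
            old = sys.repr_options
        except AttributeError:
            old = {}
        merged = old.copy()
        merged.update(kw)
        sys.repr_options = merged
        try:
            return repr(obj)     # conversion would look up sys.repr_options
        finally:
            sys.repr_options = old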


-- ?!ng

"Simple, yet complex."
    -- Lenore Snell




From ping at lfw.org  Fri May 19 18:56:50 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Fri, 19 May 2000 09:56:50 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>

On Fri, 19 May 2000, Guido van Rossum wrote:
> The email below suggests a simple solution to a problem that
> e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

Changing the behaviour of repr() (a function that internally
converts data into data) based on a fixed global system parameter
makes me uncomfortable.  Wouldn't it make more sense for the
locale business to be a property of the stream that the string
is being printed on?

This was the gist of my proposal for files having a printout
method a while ago.  I understand if that proposal is a bit too
much of a change to swallow at once, but i'd like to ensure the
door stays open to let it be possible in the future.

Surely there are other language systems that deal with the
issue of "nicely" printing their own data structures for human
interpretation... anyone have any experience to share?  The
printout/printon thing originally comes from Smalltalk, i believe.

(...which reminds me -- i played with Squeak the other day and
thought to myself, it would be cool to browse and edit code in
Python with a system browser like that.)


Note, however:

> This hits, for example, when Zope with the squishdot weblog (squishdot
> 0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles --
> strings with valid Latin-1 characters get indexed as backslash-escaped
> octal codes, and thus become unsearchable.

The above comment in particular strikes me as very fishy.
How on earth can the escaping behaviour of repr() affect the
indexing of text?  Surely when you do a search, you search for
exactly what you asked for.

And does the above mean that, with Jyrki's proposed fix, the
sorting and searching behaviour of Squishdot will suddenly
change, and magically differ from locale to locale?  Is that
something we want?  (That last is not a rhetorical question --
my gut says no, but i don't actually have enough experience
working with these issues to know the answer.)


-- ?!ng

"Simple, yet complex."
    -- Lenore Snell




From mal at lemburg.com  Fri May 19 21:06:24 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 May 2000 21:06:24 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>
Message-ID: <392590B0.5CA4F31D@lemburg.com>

Ka-Ping Yee wrote:
> 
> On Fri, 19 May 2000, Guido van Rossum wrote:
> > The email below suggests a simple solution to a problem that
> > e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns
> > all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> > isprint(), which makes it locale-dependent.  I can live with this.
> 
> Changing the behaviour of repr() (a function that internally
> converts data into data) based on a fixed global system parameter
> makes me uncomfortable.  Wouldn't it make more sense for the
> locale business to be a property of the stream that the string
> is being printed on?

Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
string_print API which is used for the tp_print slot, so the
only effect to be seen is when printing a string to a real
file object (tp_print is only used by PyObject_Print() and that
API is only used for writing to real PyFileObjects -- all
other streams get the output of str() or repr()).

Perhaps we should drop tp_print for strings altogether and
let str() and repr() decide what to do... (this is
what Unicode objects do). The only good reason for implementing
tp_print is to write huge amounts of data to a stream without
creating intermediate objects -- not really needed for strings,
since these *are* the intermediate object usually created for
just this purpose ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jeremy at alum.mit.edu  Sat May 20 02:46:11 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Fri, 19 May 2000 17:46:11 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
Message-ID: <14629.57427.9434.623247@localhost.localdomain>

I applied the recent changes to the CVS httplib to Greg's httplib
(call it httplib11) this afternoon.  The result is included below.  I
think this is quite close to checking in, but it could use a slightly
better test suite.

There are a few outstanding questions.

httplib11 does not implement the debuglevel feature.  I don't think
it's important, but it is currently documented and may be used.
Guido, should we implement it?

httplib w/SSL uses a constructor with this prototype:
    def __init__(self, host='', port=None, **x509):
It looks like the x509 dictionary should contain two variables --
key_file and cert_file.  Since we know what the entries are, why not
make them explicit?
    def __init__(self, host='', port=None, cert_file=None, key_file=None):
(Or reverse the two arguments if that is clearer.)

The FakeSocket class in CVS has a comment after the makefile() def line
that says "hopefully, never have to write."  It won't do at all the
right thing when called with a write mode, so it ought to raise an
exception.  Any reason it doesn't?

I'd like to add a couple of test cases that use HTTP/1.1 to get some
pages from python.org, including one that uses the chunked encoding.
Just haven't gotten around to it.  Question on that front: Does it
make sense to incorporate the test function in the module with the std
regression test suite?  In general, I would think so.  In this
particular case, the test could fail because of host networking
problems.  I think that's okay as long as the error message is clear
enough. 

Jeremy

"""HTTP/1.1 client library"""

# Written by Greg Stein.

import socket
import string
import mimetools

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

error = 'httplib.error'

HTTP_PORT = 80
HTTPS_PORT = 443

class HTTPResponse(mimetools.Message):
    __super_init = mimetools.Message.__init__
    
    def __init__(self, fp, version, errcode):
        self.__super_init(fp, 0)

        if version == 'HTTP/1.0':
            self.version = 10
        elif version[:7] == 'HTTP/1.':
            self.version = 11 # use HTTP/1.1 code for HTTP/1.x where x>=1
        else:
            raise error, 'unknown HTTP protocol'

        # are we using the chunked-style of transfer encoding?
        tr_enc = self.getheader('transfer-encoding')
        if tr_enc:
            if string.lower(tr_enc) != 'chunked':
                raise error, 'unknown transfer-encoding'
            self.chunked = 1
            self.chunk_left = None
        else:
            self.chunked = 0

        # will the connection close at the end of the response?
        conn = self.getheader('connection')
        if conn:
            conn = string.lower(conn)
            # a "Connection: close" will always close the
            # connection. if we don't see that and this is not
            # HTTP/1.1, then the connection will close unless we see a
            # Keep-Alive header. 
            self.will_close = string.find(conn, 'close') != -1 or \
                              ( self.version != 11 and \
                                not self.getheader('keep-alive') )
        else:
            # for HTTP/1.1, the connection will always remain open
            # otherwise, it will remain open IFF we see a Keep-Alive header
            self.will_close = self.version != 11 and \
                              not self.getheader('keep-alive')

        # do we have a Content-Length?
        # NOTE: RFC 2616, S4.4, #3 says we ignore this if tr_enc is "chunked"
        length = self.getheader('content-length')
        if length and not self.chunked:
            self.length = int(length)
        else:
            self.length = None

        # does the body have a fixed length? (of zero)
        if (errcode == 204 or               # No Content
            errcode == 304 or               # Not Modified
            100 <= errcode < 200):          # 1xx codes
            self.length = 0

        # if the connection remains open, and we aren't using chunked, and
        # a content-length was not provided, then assume that the connection
        # WILL close.
        if not self.will_close and \
           not self.chunked and \
           self.length is None:
            self.will_close = 1

        # if there is no body, then close NOW. read() may never be
        # called, thus we will never mark self as closed.
        if self.length == 0:
            self.close()

    def close(self):
        if self.fp:
            self.fp.close()
            self.fp = None

    def isclosed(self):
        # NOTE: it is possible that we will not ever call self.close(). This
        #       case occurs when will_close is TRUE, length is None, and we
        #       read up to the last byte, but NOT past it.
        #
        # IMPLIES: if will_close is FALSE, then self.close() will ALWAYS be
        #          called, meaning self.isclosed() is meaningful.
        return self.fp is None

    def read(self, amt=None):
        if self.fp is None:
            return ''

        if self.chunked:
            chunk_left = self.chunk_left
            value = ''
            while 1:
                if chunk_left is None:
                    line = self.fp.readline()
                    i = string.find(line, ';')
                    if i >= 0:
                        line = line[:i]     # strip chunk-extensions
                    chunk_left = string.atoi(line, 16)
                    if chunk_left == 0:
                        break
                if amt is None:
                    value = value + self.fp.read(chunk_left)
                elif amt < chunk_left:
                    value = value + self.fp.read(amt)
                    self.chunk_left = chunk_left - amt
                    return value
                elif amt == chunk_left:
                    value = value + self.fp.read(amt)
                    self.fp.read(2)    # toss the CRLF at the end of the chunk
                    self.chunk_left = None
                    return value
                else:
                    value = value + self.fp.read(chunk_left)
                    amt = amt - chunk_left

                # we read the whole chunk, get another
                self.fp.read(2)        # toss the CRLF at the end of the chunk
                chunk_left = None

            # read and discard trailer up to the CRLF terminator
            ### note: we shouldn't have any trailers!
            while 1:
                line = self.fp.readline()
                if line == '\r\n':
                    break

            # we read everything; close the "file"
            self.close()

            return value

        elif amt is None:
            # unbounded read
            if self.will_close:
                s = self.fp.read()
            else:
                s = self.fp.read(self.length)
            self.close()      # we read everything
            return s

        if self.length is not None:
            if amt > self.length:
                # clip the read to the "end of response"
                amt = self.length
            self.length = self.length - amt

        s = self.fp.read(amt)

        # close our "file" if we know we should
        ### I'm not sure about the len(s) < amt part; we should be
        ### safe because we shouldn't be using non-blocking sockets
        if self.length == 0 or len(s) < amt:
            self.close()

        return s


class HTTPConnection:

    _http_vsn = 11
    _http_vsn_str = 'HTTP/1.1'

    response_class = HTTPResponse
    default_port = HTTP_PORT

    def __init__(self, host, port=None):
        self.sock = None
        self.response = None
        self._set_hostport(host, port)

    def _set_hostport(self, host, port):
        if port is None:
            i = string.find(host, ':')
            if i >= 0:
                port = int(host[i+1:])
                host = host[:i]
            else:
                port = self.default_port
        self.host = host
        self.port = port
        self.addr = host, port

    def connect(self):
        """Connect to the host and port specified in __init__."""
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect(self.addr)

    def close(self):
        """Close the connection to the HTTP server."""
        if self.sock:
            self.sock.close() # close it manually... there may be other refs
            self.sock = None
        if self.response:
            self.response.close()
            self.response = None

    def send(self, str):
        """Send `str' to the server."""
        if self.sock is None:
            self.connect()

        # send the data to the server. if we get a broken pipe, then close
        # the socket. we want to reconnect when somebody tries to send again.
        #
        # NOTE: we DO propagate the error, though, because we cannot simply
        #       ignore the error... the caller will know if they can retry.
        try:
            self.sock.send(str)
        except socket.error, v:
            if v[0] == 32:    # Broken pipe
                self.close()
            raise

    def putrequest(self, method, url):
        """Send a request to the server.

        `method' specifies an HTTP request method, e.g. 'GET'.
        `url' specifies the object being requested, e.g.
        '/index.html'.
        """
        if self.response is not None:
            if not self.response.isclosed():
                ### implies half-duplex!
                raise error, 'prior response has not been fully handled'
            self.response = None

        if not url:
            url = '/'
        str = '%s %s %s\r\n' % (method, url, self._http_vsn_str)

        try:
            self.send(str)
        except socket.error, v:
            if v[0] != 32:    # Broken pipe
                raise
            # try one more time (the socket was closed; this will reopen)
            self.send(str)

        self.putheader('Host', self.host)

        if self._http_vsn == 11:
            # Issue some standard headers for better HTTP/1.1 compliance

            # note: we are assuming that clients will not attempt to set these
            #     headers since *this* library must deal with the consequences.
            #     this also means that when the supporting libraries are
            #     updated to recognize other forms, then this code should be
            #     changed (removed or updated).

            # we only want a Content-Encoding of "identity" since we don't
            # support encodings such as x-gzip or x-deflate.
            self.putheader('Accept-Encoding', 'identity')

            # we can accept "chunked" Transfer-Encodings, but no others
            # NOTE: no TE header implies *only* "chunked"
            #self.putheader('TE', 'chunked')

            # if TE is supplied in the header, then it must appear in a
            # Connection header.
            #self.putheader('Connection', 'TE')

        else:
            # For HTTP/1.0, the server will assume "not chunked"
            pass

    def putheader(self, header, value):
        """Send a request header line to the server.

        For example: h.putheader('Accept', 'text/html')
        """
        str = '%s: %s\r\n' % (header, value)
        self.send(str)

    def endheaders(self):
        """Indicate that the last header line has been sent to the server."""

        self.send('\r\n')

    def request(self, method, url, body=None, headers={}):
        """Send a complete request to the server."""

        try:
            self._send_request(method, url, body, headers)
        except socket.error, v:
            if v[0] != 32:    # Broken pipe
                raise
            # try one more time
            self._send_request(method, url, body, headers)

    def _send_request(self, method, url, body, headers):
        self.putrequest(method, url)

        if body:
            self.putheader('Content-Length', str(len(body)))
        for hdr, value in headers.items():
            self.putheader(hdr, value)
        self.endheaders()

        if body:
            self.send(body)

    def getreply(self):
        """Get a reply from the server.

        Returns a tuple consisting of:
        - server response code (e.g. '200' if all goes well)
        - server response string corresponding to response code
        - any RFC822 headers in the response from the server

        """
        file = self.sock.makefile('rb')
        line = file.readline()
        try:
            [ver, code, msg] = string.split(line, None, 2)
        except ValueError:
            try:
                [ver, code] = string.split(line, None, 1)
                msg = ""
            except ValueError:
                self.close()
                return -1, line, file
        if ver[:5] != 'HTTP/':
            self.close()
            return -1, line, file
        errcode = int(code)
        errmsg = string.strip(msg)
        response = self.response_class(file, ver, errcode)
        if response.will_close:
            # this effectively passes the connection to the response
            self.close()
        else:
            # remember this, so we can tell when it is complete
            self.response = response
        return errcode, errmsg, response

class FakeSocket:
    def __init__(self, sock, ssl):
        self.__sock = sock
        self.__ssl = ssl
        return

    def makefile(self, mode):           # hopefully, never have to write
        # XXX add assert about mode != w???
        msgbuf = ""
        while 1:
            try:
                msgbuf = msgbuf + self.__ssl.read()
            except socket.sslerror, msg:
                break
        return StringIO(msgbuf)

    def send(self, stuff, flags = 0):
        return self.__ssl.write(stuff)

    def recv(self, len = 1024, flags = 0):
        return self.__ssl.read(len)

    def __getattr__(self, attr):
        return getattr(self.__sock, attr)

class HTTPSConnection(HTTPConnection):
    """This class allows communication via SSL."""
    __super_init = HTTPConnection.__init__

    default_port = HTTPS_PORT

    def __init__(self, host, port=None, **x509):
        self.__super_init(host, port)
        self.key_file = x509.get('key_file')
        self.cert_file = x509.get('cert_file')

    def connect(self):
        """Connect to a host onf a given port
        
        Note: This method is automatically invoked by __init__, if a host
        is specified during instantiation.
        """
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect(self.addr)
        ssl = socket.ssl(sock, self.key_file, self.cert_file)
        self.sock = FakeSocket(sock, ssl)

class HTTPMixin:
    """Mixin for compatibility with httplib.py from 1.5.

    Requires that the inheriting class define the following attributes:
    super_init
    super_connect
    super_putheader
    super_getreply
    """

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def connect(self, host=None, port=None):
        "Accept arguments to set the host/port, since the superclass doesn't."
        if host is not None:
            self._set_hostport(host, port)
        self.super_connect()

    def set_debuglevel(self, debuglevel):
        "The class no longer supports the debuglevel."
        pass

    def getfile(self):
        "Provide a getfile, since the superclass' use of HTTP/1.1 prevents it."
        return self.file

    def putheader(self, header, *values):
        "The superclass allows only one value argument."
        self.super_putheader(header, string.joinfields(values,'\r\n\t'))

    def getreply(self):
        "Compensate for an instance attribute shuffling."
        errcode, errmsg, response = self.super_getreply()
        if errcode == -1:
            self.file = response  # response is the "file" when errcode==-1
            self.headers = None
            return -1, errmsg, None

        self.headers = response
        self.file = response.fp
        return errcode, errmsg, response

class HTTP(HTTPMixin, HTTPConnection):
    super_init = HTTPConnection.__init__
    super_connect = HTTPConnection.connect
    super_putheader = HTTPConnection.putheader
    super_getreply = HTTPConnection.getreply

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def __init__(self, host='', port=None):
        "Provide a default host, since the superclass requires one."
        # Note that we may pass an empty string as the host; this will throw
        # an error when we attempt to connect. Presumably, the client code
        # will call connect before then, with a proper host.
        self.super_init(host, port)

class HTTPS(HTTPMixin, HTTPSConnection):
    super_init = HTTPSConnection.__init__
    super_connect = HTTPSConnection.connect
    super_putheader = HTTPSConnection.putheader
    super_getreply = HTTPSConnection.getreply

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def __init__(self, host='', port=None, **x509):
        "Provide a default host, since the superclass requires one."
        # Note that we may pass an empty string as the host; this will throw
        # an error when we attempt to connect. Presumably, the client code
        # will call connect before then, with a proper host.
        self.super_init(host, port, **x509)

def test():
    """Test this module.

    The test consists of retrieving and displaying the Python
    home page, along with the error code and error string returned
    by the www.python.org server.
    """

    import sys
    import getopt
    opts, args = getopt.getopt(sys.argv[1:], 'd')
    dl = 0
    for o, a in opts:
        if o == '-d': dl = dl + 1
    host = 'www.python.org'
    selector = '/'
    if args[0:]: host = args[0]
    if args[1:]: selector = args[1]
    h = HTTP()
    h.set_debuglevel(dl)
    h.connect(host)
    h.putrequest('GET', selector)
    h.endheaders()
    errcode, errmsg, headers = h.getreply()
    print 'errcode =', errcode
    print 'errmsg  =', errmsg
    print
    if headers:
        for header in headers.headers: print string.strip(header)
    print
    print h.getfile().read()

    if hasattr(socket, 'ssl'):
        host = 'www.c2.net'
        hs = HTTPS()
        hs.connect(host)
        hs.putrequest('GET', selector)
        hs.endheaders()
        errcode, errmsg, headers = hs.getreply()
        print 'errcode =', errcode
        print 'errmsg  =', errmsg
        print
        if headers:
            for header in headers.headers: print string.strip(header)
        print
        print hs.getfile().read()

if __name__ == '__main__':
    test()




From claird at starbase.neosoft.com  Sat May 20 00:02:47 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Fri, 19 May 2000 17:02:47 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <200005192202.RAA48753@starbase.neosoft.com>

	From guido at cj20424-a.reston1.va.home.com  Fri May 19 07:26:16 2000
			.
			.
			.
	> Consider:
	> 
	> 	- In Tcl, as you said, this is nicely integrated with the GUI's 
	> 	  event queue:
	> 		- on unix, by an additional bit on X's fd (socket) in
	> 		  the select()
	> 		- on 'doze, everything is brought back to messages 
	> 		  anyway.
	> 
	> 	And, in both cases, it works with pipes, sockets, serial or other
	> devices. Uniform, clean.
	> 
	> 	- In python "popen only really works on Unix": are you satisfied with
	> that state of affairs ? I understand (and value) Python's focus on
	> algorithms and data structures, and worming around OS misgivings is a
	> boring, ancillary task. But what about the potential gain ?
	> 
	> I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
	> just so beautiful inside. But while Tcl is weaker in the algorithms, it
	> is stronger in the os-wrapping library, and taught me to love high-level
	> abstractions. [fileevent] shines in this respect, and I'll miss it in
	> Python.
	> 		
	> -Alex

	Alex, it's disappointing to me too!  There just isn't anything
	currently in the library to do this, and I haven't written apps that
	needs this often enough to have a good feel for what kind of
	abstraction is needed.

	However perhaps we can come up with a design for something better?  Do
	you have a suggestion here?

	I agree with your comment that higher-level abstractions around OS
	stuff are needed -- I learned system programming long ago, in C, and
	I'm "happy enough" with the current state of affairs, but I agree that
	for many people this is a problem, and there's no reason why Python
	couldn't do better...

	--Guido van Rossum (home page: http://www.python.org/~guido/)
Great questions!  Alex and I are both working
on answers, I think; we're definitely not
ignoring this.  More, in time.

One thing of which I'm certain:  I do NOT like
documentation entries that say things like
"select() doesn't really work except under Unix"
(still true?  Maybe that's been fixed?).  As a
user, I just find that intolerable.  Sufficiently
intolerable that I'll help change the situation?
Well, I'm working on that part now ...



From guido at python.org  Sat May 20 03:19:20 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 19 May 2000 18:19:20 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: Your message of "Fri, 19 May 2000 17:02:47 CDT."
             <200005192202.RAA48753@starbase.neosoft.com> 
References: <200005192202.RAA48753@starbase.neosoft.com> 
Message-ID: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>

> One thing of which I'm certain:  I do NOT like
> documentation entries that say things like
> "select() doesn't really work except under Unix"
> (still true?  Maybe that's been fixed?).

Hm, that's bogus.  It works well under Windows -- with the restriction
that it only works for sockets, but for sockets it works as well as
on Unix.  it also works well on the Mac.  I wonder where that note
came from (it's probably 6 years old :-).
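
For what it's worth, a minimal illustration of the portable case
(sockets only; the host and timeout here are arbitrary):

    import select, socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('www.python.org', 80))
    s.send('GET / HTTP/1.0\r\n\r\n')
    # select() on sockets is portable; on Windows it is *only* sockets
    r, w, e = select.select([s], [], [], 5.0)
    if r:
        print s.recv(100)
    s.close()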

Fred...?

> As a
> user, I just find that intolerable.  Sufficiently
> intolerable that I'll help change the situation?
> Well, I'm working on that part now ...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From claird at starbase.neosoft.com  Sat May 20 00:37:48 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Fri, 19 May 2000 17:37:48 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>
Message-ID: <200005192237.RAA49766@starbase.neosoft.com>

	From guido at cj20424-a.reston1.va.home.com  Fri May 19 17:32:39 2000
			.
			.
			.
	> One thing of which I'm certain:  I do NOT like
	> documentation entries that say things like
	> "select() doesn't really work except under Unix"
	> (still true?  Maybe that's been fixed?).

	Hm, that's bogus.  It works well under Windows -- with the restriction
	that it only works for sockets, but for sockets it works as well as
	on Unix.  it also works well on the Mac.  I wonder where that note
	came from (it's probably 6 years old :-).

	Fred...?
			.
			.
			.
I sure don't mean to propagate misinformation.
I'll make it more of a habit to forward such
items to Fred as I find them.



From guido at python.org  Sat May 20 03:30:30 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 19 May 2000 18:30:30 -0700
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: Your message of "Fri, 19 May 2000 17:46:11 PDT."
             <14629.57427.9434.623247@localhost.localdomain> 
References: <14629.57427.9434.623247@localhost.localdomain> 
Message-ID: <200005200130.SAA02265@cj20424-a.reston1.va.home.com>

> I applied the recent changes to the CVS httplib to Greg's httplib
> (call it httplib11) this afternoon.  The result is included below.  I
> think this is quite close to checking in, but it could use a slightly
> better test suite.

Thanks -- but note that I don't have the time to review the code.

> There are a few outstanding questions.
> 
> httplib11 does not implement the debuglevel feature.  I don't think
> it's important, but it is currently documented and may be used.
> Guido, should we implement it?

I think the solution is to provide the API but ignore the call or
argument.

> httplib w/SSL uses a constructor with this prototype:
>     def __init__(self, host='', port=None, **x509):
> It looks like the x509 dictionary should contain two variables --
> key_file and cert_file.  Since we know what the entries are, why not
> make them explicit?
>     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> (Or reverse the two arguments if that is clearer.)

The reason for the **x509 syntax (I think -- I didn't introduce it) is
that it *forces* the user to use keyword args, which is a good thing
for such an advanced feature.  However there should be code that
checks that no other keyword args are present.
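
For concreteness, a minimal sketch of such a check (not part of the
posted module), applied to the HTTPSConnection constructor shown
earlier:

    def __init__(self, host, port=None, **x509):
        self.__super_init(host, port)
        # accept only the two documented keyword arguments;
        # TypeError is just one choice of exception here
        for key in x509.keys():
            if key not in ('key_file', 'cert_file'):
                raise TypeError, 'unexpected keyword argument: %s' % key
        self.key_file = x509.get('key_file')
        self.cert_file = x509.get('cert_file')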

> The FakeSocket class in CVS has a comment after the makefile def line
> that says "hopefully, never have to write."  It won't do at all the
> right thing when called with a write mode, so it ought to raise an
> exception.  Any reason it doesn't?

Probably laziness of the code.  Thanks for this code review (I guess I
was in a hurry when I checked that code in :-).
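
One possible guard (a sketch only, not taken from any actual patch),
added at the top of FakeSocket.makefile:

    def makefile(self, mode):
        # refuse write modes instead of silently returning a read-only
        # buffer; the rest of the method stays as in the posted listing
        if mode != 'r' and mode != 'rb':
            raise ValueError, 'FakeSocket.makefile does not support mode %s' % mode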

> I'd like to add a couple of test cases that use HTTP/1.1 to get some
> pages from python.org, including one that uses the chunked encoding.
> Just haven't gotten around to it.  Question on that front: Does it
> make sense to incorporate the test function in the module with the std
> regression test suite?  In general, I would think so.  In this
> particular case, the test could fail because of host networking
> problems.  I think that's okay as long as the error message is clear
> enough. 

Yes, I agree.  Maybe it should raise ImportError when the network is
unreachable -- this is the one exception that the regrtest module
considers non-fatal.
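
A rough sketch of what that could look like in a test_httplib module
(module and host names are just placeholders):

    import socket
    import httplib

    try:
        h = httplib.HTTPConnection('www.python.org')
        h.putrequest('GET', '/')
        h.endheaders()
        errcode, errmsg, response = h.getreply()
    except socket.error:
        # regrtest treats ImportError as "skipped", not as a failure
        raise ImportError, 'network is unreachable -- skipping test_httplib'
    print 'GET / gave', errcode, errmsg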

--Guido van Rossum (home page: http://www.python.org/~guido/)



From DavidA at ActiveState.com  Sat May 20 00:38:16 2000
From: DavidA at ActiveState.com (David Ascher)
Date: Fri, 19 May 2000 15:38:16 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005192237.RAA49766@starbase.neosoft.com>
Message-ID: <PLEJJNOHDIGGLDPOGPJJGEIPCDAA.DavidA@ActiveState.com>

> 	> One thing of which I'm certain:  I do NOT like
> 	> documentation entries that say things like
> 	> "select() doesn't really work except under Unix"
> 	> (still true?  Maybe that's been fixed?).
>
> 	Hm, that's bogus.  It works well under Windows -- with the
> restriction
> 	that it only works for sockets, but for sockets it works as well as
> 	on Unix.  it also works well on the Mac.  I wonder where that note
> 	came from (it's probably 6 years old :-).

I'm pretty sure I know where it came from -- it came from Sam Rushing's
tutorial on how to use Medusa, which was more or less cut & pasted into the
doc, probably at the time that asyncore and asynchat were added to the
Python core.  IMO, it's not the best part of the Python doc -- it is much
too low-to-the ground, and assumes the reader already understands much about
I/O, sync/async issues, and cares mostly about high performance.  All of
which are true of wonderful Sam, most of which are not true of the average
Python user.

While we're complaining about doc, asynchat is not documented, I believe.
Alas, I'm unable to find the time to write up said documentation.

--david

PS: I'm not sure that multiplexing can be made _easy_.  Issues like
block/nonblocking communications channels, multithreading etc. are hard to
ignore, as much as one might want to.




From gstein at lyra.org  Sat May 20 00:38:59 2000
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 May 2000 15:38:59 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: <200005200130.SAA02265@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005191535180.6486-100000@nebula.lyra.org>

On Fri, 19 May 2000, Guido van Rossum wrote:
> > I applied the recent changes to the CVS httplib to Greg's httplib
> > (call it httplib11) this afternoon.  The result is included below.  I
> > think this is quite close to checking in,

I'll fold the changes into my copy here (at least), until we're ready to
check into Python itself.

THANK YOU for doing this work. It is the "heavy lifting" part that I just
haven't had a chance to get to myself.

I have a small, local change dealing with the 'Host' header (it shouldn't
be sent automatically for HTTP/1.0; some httplib users already send it
and having *two* in the output headers will make some servers puke).
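
Something along these lines -- a sketch of the idea, not the actual
change -- inside putrequest():

    # only send Host automatically when speaking HTTP/1.1, so HTTP/1.0
    # callers who already send it don't end up with two Host headers
    if self._http_vsn == 11:
        self.putheader('Host', self.host)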

> > but it could use a slightly
> > better test suite.
> 
> Thanks -- but note that I don't have the time to review the code.

I'm reviewing it, too. Gotta work around the fact that Jeremy re-indented
the code, though... :-)

> > There are a few outstanding questions.
> > 
> > httplib11 does not implement the debuglevel feature.  I don't think
> > it's important, but it is currently documented and may be used.
> > Guido, should we implement it?
> 
> I think the solution is to provide the API but ignore the call or
> argument.

Can do: ignore the debuglevel feature.

> > httplib w/SSL uses a constructor with this prototype:
> >     def __init__(self, host='', port=None, **x509):
> > It looks like the x509 dictionary should contain two variables --
> > key_file and cert_file.  Since we know what the entries are, why not
> > make them explicit?
> >     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> > (Or reverse the two arguments if that is clearer.)
> 
> The reason for the **x509 syntax (I think -- I didn't introduce it) is
> that it *forces* the user to use keyword args, which is a good thing
> for such an advanced feature.  However there should be code that
> checks that no other keyword args are present.

Can do: raise an error if other keyword args are present.

> > The FakeSocket class in CVS has a comment after the makefile def line
> > that says "hopefully, never have to write."  It won't do at all the
> > right thing when called with a write mode, so it ought to raise an
> > exception.  Any reason it doesn't?
> 
> Probably laziness of the code.  Thanks for this code review (I guess I
> was in a hurry when I checked that code in :-).

+1 on raising an exception.

> 
> > I'd like to add a couple of test cases that use HTTP/1.1 to get some
> > pages from python.org, including one that uses the chunked encoding.
> > Just haven't gotten around to it.  Question on that front: Does it
> > make sense to incorporate the test function in the module with the std
> > regression test suite?  In general, I would think so.  In this
> > particular case, the test could fail because of host networking
> > problems.  I think that's okay as long as the error message is clear
> > enough. 
> 
> Yes, I agree.  Maybe it should raise ImportError when the network is
> unreachable -- this is the one exception that the regrtest module
> considers non-fatal.

+1 on shifting to the test modules.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bckfnn at worldonline.dk  Sat May 20 17:19:09 2000
From: bckfnn at worldonline.dk (Finn Bock)
Date: Sat, 20 May 2000 15:19:09 GMT
Subject: [Python-Dev] Heads up: unicode file I/O in JPython.
Message-ID: <392690f3.17235923@smtp.worldonline.dk>

I have recently released errata-07 which improves on JPython's ability
to handle unicode characters as well as binary data read from and
written to python files.

The conversions can be described as

- I/O to a file opened in binary mode will read/write the low 8-bit 
  of each char. Writing Unicode chars >0xFF will cause silent
  truncation [*].

- I/O to a file opened in text mode will push the character 
  through the default encoding for the platform (in addition to 
  handling CR/LF issues).

This breaks completely with python1.6a2, but I believe that it is close
to the expectations of Java users. (The current JPython-1.1 behavior is
completely useless for both characters and binary data. It only barely
manages to handle 7-bit ASCII.)

In JPython (with the errata) we can do:

  f = open("test207.out", "w")
  f.write("\x20ac") # On my w2k platform this writes 0x80 to the file.
  f.close()

  f = open("test207.out", "r")
  print hex(ord(f.read()))
  f.close()

  f = open("test207.out", "wb")
  f.write("\x20ac") # On all platforms this writes 0xAC to the file.
  f.close()

  f = open("test207.out", "rb")
  print hex(ord(f.read()))
  f.close()

With the output of:

  0x20ac
  0xac

I do not expect anything like this in CPython. I just hope that all
unicode advice given on c.l.py comes with the caveat that JPython
might do it differently.

regards,
finn

    http://sourceforge.net/project/filelist.php?group_id=1842

[*] Silent overflow is bad, but it is at least twice as fast as having
to check each char for overflow.





From esr at netaxs.com  Sun May 21 00:36:56 2000
From: esr at netaxs.com (Eric Raymond)
Date: Sat, 20 May 2000 18:36:56 -0400
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <200005162001.QAA16657@eric.cnri.reston.va.us>; from Guido van Rossum on Tue, May 16, 2000 at 04:01:46PM -0400
References: <009d01bfbf64$b779a260$34aab5d4@hagrid> <3921984A.8CDE8E1D@prescod.net> <200005162001.QAA16657@eric.cnri.reston.va.us>
Message-ID: <20000520183656.F7487@unix3.netaxs.com>

On Tue, May 16, 2000 at 04:01:46PM -0400, Guido van Rossum wrote:
> > I hope that if Python were renamed we would not choose yet another name
> > which turns up hundreds of false hits in web engines. Perhaps Homr or
> > Home_r. Or maybe Pythahn.
> 
> Actually, I'd like to call the next version Throatwobbler Mangrove.
> But you'd have to pronounce it Raymond Luxyry Yach-t.

Great.  I'll take a J-class kitted for open-ocean sailing, please.  Do
I get a side of bikini babes with that?
-- 
	<a href="http://www.tuxedo.org/~esr/home.html">Eric S. Raymond</a>



From ping at lfw.org  Sun May 21 12:30:05 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Sun, 21 May 2000 03:30:05 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <392590B0.5CA4F31D@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005210329160.420-100000@localhost>

On Fri, 19 May 2000, M.-A. Lemburg wrote:
> Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> string_print API which is used for the tp_print slot,

Very sorry!  I didn't actually look to see where the patch
was being applied.

But then how can this have any effect on squishdot's indexing?



-- ?!ng

"All models are wrong; some models are useful."
    -- George Box





From pf at artcom-gmbh.de  Sun May 21 17:54:06 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Sun, 21 May 2000 17:54:06 +0200 (MEST)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <Pine.LNX.4.10.10005210329160.420-100000@localhost> from Ka-Ping Yee at "May 21, 2000  3:30: 5 am"
Message-ID: <m12tY3K-000CnvC@artcom0.artcom-gmbh.de>

Hi!

Ka-Ping Yee:
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> > string_print API which is used for the tp_print slot,
> 
> Very sorry!  I didn't actually look to see where the patch
> was being applied.
> 
> But then how can this have any effect on squishdot's indexing?

Sigh.  Let me explain this in some detail.

What do you see here: ????????  If all went well, you should
see some umlauts, which occur quite often in German words like
"Begrüssung", "ätzend" or "Grützkacke" and so on.

During the late 80s we here in Germany spent a lot of our free time
patching open source tools like 'elm', 'B-News', 'less' and others
to make them "8-Bit clean", for example on ancient Unices like
SCO Xenix, where the implementations of C-library functions like
'is_print' and 'is_lower' were out of reach.

After several years everybody seemed to agree on ISO-8859-1 as the new
European standard character set, which was also often loosely called
8-Bit ASCII, because ASCII is a true subset of ISO Latin-1.  At least
the German versions of Windows used ISO-8859-1.

As the WWW began to gain popularity, nobody with a sane mind really
used those splendid ASCII escapes like for example '&auml;' instead
of 'ä'.  The same holds true for the TeX user community, where everybody
was happy to type real umlauts instead of the ugly backslash escape
sequences used before: \"a\"o\"u ...

To make it short: a lot of effort has been spent making *ALL* programs
8-Bit clean, that is, to move the bytes through without translating
them from or into a bunch of incompatible multi-byte sequences
which nobody can read or even wants to look at.

Now to get back to your question:  There are several nice HTML indexing
engines out there.  I personally use HTDig.  At least on Linux these
programs deal fine with HTML files containing 8-bit chars.  

But if for some reason Umlauts end up as octal escapes ('\344' instead of 'ä')
due to the use of a Python 'print some_tuple' during the creation of HTML
files, a search engine will be unable to find those words with escaped
umlauts.

Mit freundlichen Grüßen, Peter
P.S.: Hope you didn't find my explanation boring or off-topic.



From effbot at telia.com  Sun May 21 18:26:00 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sun, 21 May 2000 18:26:00 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <m12tY3K-000CnvC@artcom0.artcom-gmbh.de>
Message-ID: <005601bfc341$40eb0d60$34aab5d4@hagrid>

Peter Funk <pf at artcom-gmbh.de> wrote:
> But if for some reason Umlauts end up as octal escapes ('\344' instead of 'ä')
> due to the use of a Python 'print some_tuple' during the creation of HTML
> files, a search engine will be unable to find those words with escaped
> umlauts.

umm.  why would anyone use "print some_tuple" when generating
HTML pages?  what if the tuple contains something that results in
a "<" character?

</F>




From guido at python.org  Sun May 21 23:20:03 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 21 May 2000 14:20:03 -0700
Subject: [Python-Dev] Is the tempfile module really a security risk?
Message-ID: <200005212120.OAA05258@cj20424-a.reston1.va.home.com>

Every few months I receive patches that purport to make the tempfile
module more secure.  I've never felt that it is a problem.  What is
with these people?  My feeling about these suggestions has always been
that they have read about similar insecurities in C code run by the
super-user, and are trying to get the teacher's attention by proposing
something clever.

Or is there really a problem?  Is anyone in this forum aware of
security issues with tempfile?  Should I worry?  Is the
"random-tempfile" patch that the poster below suggested worth
applying?

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Sun, 21 May 2000 19:34:43 +0200
From:    =?iso-8859-1?Q?Ragnar_Kj=F8rstad?= <ragnark at vestdata.no>
To:      Guido van Rossum <guido at python.org>
cc:      patches at python.org
Subject: Re: [Patches] Patch to make tempfile return random filenames

On Sun, May 21, 2000 at 12:17:08PM -0700, Guido van Rossum wrote:
> Hm, I don't like this very much.  Random sequences have a small but
> nonzero probability of generating the same number in rapid succession
> -- probably one in a million or so.  It would be very bad if one in a
> million runs of a particular application crashed for this reason.
> 
> A better way to prevent this kind of attack (if you care about it) is
> to use mktemp.TemporaryFile(), which avoids this vulnerability in a
> different way.
> 
> (Also note the test for os.path.exists() so that an attacker would
> have to use very precise timing to make this work.)

1. The path.exist part does not solve the problem. It causes a race
condition that is not very hard to get around, by having a program
create and delete the file at maximum speed. It will have a 50%
chance of breaking your program.

2. O_EXCL does not always work. E.g. it does not work over NFS - there
are probably other broken implementations too.

3. Even if mktemp.TemporaryFile had been sufficient, providing mktemp in
this dangerous way is not good. Many are likely to use it either not
thinking about the problem at all, or assuming it's solved in the
module.

4. The problems you describe can easily be overcome. I removed the
counter and the file-exists check because I figured they were no longer
needed. I was wrong. Either a larger number should be used and/or a
counter and/or a file-exists check. Personally I would want the random
part to be large enough not to have to worry about collisions either by
chance, after a fork, or by deliberate attack.


Do you want a new patch that addresses these problems better?


- -- 
Ragnar Kjørstad

_______________________________________________
Patches mailing list
Patches at python.org
http://www.python.org/mailman/listinfo/patches

------- End of Forwarded Message




From guido at python.org  Mon May 22 00:05:58 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 21 May 2000 15:05:58 -0700
Subject: [Python-Dev] ANNOUNCE: Python CVS tree moved to SourceForge
Message-ID: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>

I'm happy to announce that we've moved the Python CVS tree to
SourceForge.  SourceForge (www.sourceforge.net) is a free service to
Open Source developers run by VA Linux.

The change has two advantages for us: (1) we no longer have to deal
with the mirroring of our writable CVS repository to the read-only
mirror at cvs.python.org (which will soon be decommissioned); (2) we
will be able to add new developers with checkin privileges.  In
addition, we benefit from the high visibility and availability of
SourceForge.

Instructions on how to access the Python SourceForge tree are here:

  http://sourceforge.net/cvs/?group_id=5470

If you have an existing working tree that points to the cvs.python.org
repository, you may want to retarget it to the SourceForge tree.  This
can be done painlessly with Greg Ward's cvs_chroot script:

  http://starship.python.net/~gward/python/

The email notification to python-checkins at python.org still works
(although during the transition a few checkin messages may have been
lost).

While I've got your attention, please remember that the proper
procedure for submitting patches is described here:

  http://www.python.org/patches/

We've accumulated quite the backlog of patches to be processed during
the transition; we'll start working on these ASAP.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Sun May 21 22:54:23 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sun, 21 May 2000 22:54:23 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost>
Message-ID: <39284CFF.5D9C9B13@lemburg.com>

Ka-Ping Yee wrote:
> 
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> > string_print API which is used for the tp_print slot,
> 
> Very sorry!  I didn't actually look to see where the patch
> was being applied.
> 
> But then how can this have any effect on squishdot's indexing?

The only possible reason I can see is that this squishdot
application uses 'print' to write the data -- perhaps
it pipes it through some other tool ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From effbot at telia.com  Mon May 22 01:24:02 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 01:24:02 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com>
Message-ID: <004f01bfc37b$b1551480$34aab5d4@hagrid>

M.-A. Lemburg <mal at lemburg.com> wrote:
> > But then how can this have any effect on squishdot's indexing?
> 
> The only possible reason I can see is that this squishdot
> application uses 'print' to write the data -- perhaps
> it pipes it through some other tool ?

but doesn't the patch only affects code that manages to call tp_print
without the PRINT_RAW flag?  (that is, in "repr" mode rather than "str"
mode)

or to put it another way, if they manage to call tp_print without the
PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

or am I just totally confused?

</F>




From guido at python.org  Mon May 22 05:47:16 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 21 May 2000 20:47:16 -0700
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: Your message of "Mon, 22 May 2000 01:24:02 +0200."
             <004f01bfc37b$b1551480$34aab5d4@hagrid> 
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com>  
            <004f01bfc37b$b1551480$34aab5d4@hagrid> 
Message-ID: <200005220347.UAA06235@cj20424-a.reston1.va.home.com>

Let's reboot this thread.  Never mind the details of the actual patch,
or why it would affect a particular index.

Obviously if we're going to patch string_print() we're also going to
patch string_repr() (and vice versa) -- the former (without the
Py_PRINT_RAW flag) is supposed to be an optimization of the latter.
(I hadn't even read the patch that far to realize that it only did one
and not the other.)

The point is simply this.

The repr() function for a string turns it into a valid string literal.
There's considerable freedom allowed in this conversion, some of which
is taken (e.g. it prefers single quotes but will use double quotes
when the string contains single quotes).

For safety reasons, control characters are replaced by their octal
escapes.  This is also done for non-ASCII characters.
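
A small illustration of what that means in practice (the Latin-1 bytes
are spelled as octal escapes in the source so they survive mail
transport):

    s = 'Begr\374\337ung'   # "Begrüßung" as Latin-1 bytes
    print repr(s)           # prints 'Begr\374\337ung' under the "C" locale
    print s                 # raw bytes; a Latin-1 terminal shows the umlauts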

Lots of people, most of them living in countries where Latin-1 (or
another 8-bit ASCII superset) is in actual use, would prefer that
non-ASCII characters would be left alone rather than changed into
octal escapes.  I think it's not unreasonable to ask that what they
consider printable characters aren't treated as control characters.

I think that using the locale to guide this is reasonable.  If the
locale is set to imply Latin-1, then we can assume that most output
devices are capable of displaying those characters.  What good does
converting those characters to octal escapes do us then?  If the input
string was in fact binary goop, then the output will be unreadable
goop -- but it won't screw up the output device (as control characters
are wont to do, which is the main reason to turn them into octal
escapes).

So I don't see how the patch can do much harm, I don't expect that it
will break much code, and I see a real value for those who use
Latin-1 or other 8-bit supersets of ASCII.

The one objection could be that the locale may be obsolescent -- but
I've only heard /F vent an opinion about that; personally, I doubt
that we will be able to remove the locale any time soon, even if we
invent a better way.  Plus, I think that "better way" should address
this issue anyway.  If the locale eventually disappears, the feature
automatically disappears with it, because you *have* to make a
locale.setlocale() call before the behavior of repr() changes.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From pf at artcom-gmbh.de  Mon May 22 08:18:22 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 08:18:22 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005220347.UAA06235@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 21, 2000  8:47:16 pm"
Message-ID: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>

Guido van Rossum:
[...]
> The one objection could be that the locale may be obsolescent -- but
> I've only heard /F vent an opinion about that; personally, I doubt
> that we will be able to remove the locale any time soon, even if we
> invent a better way.  

AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)

Although I understand Barry's and Ping's objections against a global state,
it used to work very well:  On a typical single-user Linux system the
user chooses his locale during the first stages of system setup and
never has to think about it again.  On multi-user systems the locale
of individual accounts may be customized using several environment
variables, which can override the default locale of the system.

> Plus, I think that "better way" should address
> this issue anyway.  If the locale eventually disappears, the feature
> automatically disappears with it, because you *have* to make a
> locale.setlocale() call before the behavior of repr() changes.

The last sentence is at least not the whole truth.

On POSIX systems there are several environment variables used to
control the default locale settings for a user's session.  For example,
on my SuSE Linux system currently running in the German locale, the
environment variable LC_CTYPE=de_DE is automatically set by the file
/etc/profile during login, which automatically causes the C-library
function toupper('ä') to return 'Ä' ---you should see
a lower case a-umlaut as argument and an upper case umlaut as return
value--- without requiring all applications to call 'setlocale' explicitly.

So this simply works as intended without having to add calls to
'setlocale' to every application program using these C-library functions.
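
For comparison, the Python-level equivalent is only a few lines,
though, as Guido notes above, Python itself only picks this up after
an explicit setlocale call:

    import locale, string
    locale.setlocale(locale.LC_ALL, "")  # adopt LC_CTYPE etc. from the environment
    print string.upper('\344')           # the byte '\304' (Ä) under a Latin-1 German locale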

Regards, Peter.



From tim_one at email.msn.com  Mon May 22 08:59:16 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 22 May 2000 02:59:16 -0400
Subject: [Python-Dev] Is the tempfile module really a security risk?
In-Reply-To: <200005212120.OAA05258@cj20424-a.reston1.va.home.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEECEGBAA.tim_one@email.msn.com>

[Guido]
> Every few months I receive patches that purport to make the tempfile
> module more secure.  I've never felt that it is a problem.  What is
> with these people?

Doing a google search on

    tempfile security

turns up hundreds of rants.  Have fun <wink>.  There does appear to be a
real vulnerability here somewhere (not necessarily Python), but the closest
I found to a clear explanation in 10 minutes was an annoyed paragraph,
saying that if I didn't already understand the problem I should turn in my
Unix Security Expert badge immediately.  Unfortunately, Bill Gates never
issued one of those to me.

> ...
> Is the "random-tempfile" patch that the poster below suggested worth
> applying?

Certainly not the patch he posted!  And for reasons I sketched in my
patches-list commentary, I doubt any hack based on pseudo-random numbers
*can* solve anything.

assuming-there's-indeed-something-in-need-of-solving-ly y'rs  - tim





From effbot at telia.com  Mon May 22 09:20:50 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 09:20:50 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>
Message-ID: <008001bfc3be$7e5eae40$34aab5d4@hagrid>

Peter Funk wrote:
> AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)

you're missing the point -- now that we've added unicode support to
Python, the old 8-bit locale *ctype* stuff no longer works.  while some
platforms implement a wctype interface, it's not widely available, and it's
not always unicode.

so in order to provide platform-independent unicode support, Python 1.6
comes with unicode-aware and fully portable replacements for the ctype
functions.

the code is already in there...

> On POSIX systems there are several environment variables used to
> control the default locale settings for a user's session.  For example,
> on my SuSE Linux system currently running in the German locale, the
> environment variable LC_CTYPE=de_DE is automatically set by the file
> /etc/profile during login, which automatically causes the C-library
> function toupper('ä') to return 'Ä' ---you should see
> a lower case a-umlaut as argument and an upper case umlaut as return
> value--- without requiring all applications to call 'setlocale' explicitly.
>
> So this simply works as intended without having to add calls to
> 'setlocale' to every application program using these C-library functions.

note that this leaves us with four string flavours in 1.6:

- 8-bit binary arrays.  may contain binary goop, or text in some strange
  encoding.  upper, strip, etc should not be used.

- 8-bit text strings using the system encoding.  upper, strip, etc works
  as long as the locale is properly configured.

- 8-bit unicode text strings.  upper, strip, etc may work, as long as the
  system encoding is a subset of unicode -- which means US ASCII or
  ISO Latin 1.

- wide unicode text strings.  upper, strip, etc always works.

is this complexity really worth it?
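
for illustration, flavours 2 and 4 side by side (a sketch against the
1.6 code base):

    import string
    s = '\344'      # 8-bit text: upper() goes through the C library, i.e. the locale
    u = u'\344'     # wide unicode text: upper() uses the unicode database
    print string.upper(s) == '\304'   # true only if the locale says Latin-1 (e.g. de_DE)
    print u.upper() == u'\304'        # true regardless of locale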

</F>




From gstein at lyra.org  Mon May 22 09:47:50 2000
From: gstein at lyra.org (Greg Stein)
Date: Mon, 22 May 2000 00:47:50 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: <Pine.LNX.4.10.10005191535180.6486-100000@nebula.lyra.org>
Message-ID: <Pine.LNX.4.10.10005220045170.30706-100000@nebula.lyra.org>

I've integrated all of these changes into the httplib.py posted on my
pages at:
    http://www.lyra.org/greg/python/

The actual changes are visible thru ViewCVS at:
    http://www.lyra.org/cgi-bin/viewcvs.cgi/gjspy/httplib.py/


The test code is still in there, until a test_httplib can be written.

Still missing: doc for the new-style semantics.

Cheers,
-g

On Fri, 19 May 2000, Greg Stein wrote:
> On Fri, 19 May 2000, Guido van Rossum wrote:
> > > I applied the recent changes to the CVS httplib to Greg's httplib
> > > (call it httplib11) this afternoon.  The result is included below.  I
> > > think this is quite close to checking in,
> 
> I'll fold the changes into my copy here (at least), until we're ready to
> check into Python itself.
> 
> THANK YOU for doing this work. It is the "heavy lifting" part that I just
> haven't had a chance to get to myself.
> 
> I have a small, local change dealing with the 'Host' header (it shouldn't
> be sent automatically for HTTP/1.0; some httplib users already send it
> and having *two* in the output headers will make some servers puke).
> 
> > > but it could use a slightly
> > > better test suite.
> > 
> > Thanks -- but note that I don't have the time to review the code.
> 
> I'm reviewing it, too. Gotta work around the fact that Jeremy re-indented
> the code, though... :-)
> 
> > > There are a few outstanding questions.
> > > 
> > > httplib11 does not implement the debuglevel feature.  I don't think
> > > it's important, but it is currently documented and may be used.
> > > Guido, should we implement it?
> > 
> > I think the solution is to provide the API but ignore the call or
> > argument.
> 
> Can do: ignore the debuglevel feature.
> 
> > > httplib w/SSL uses a constructor with this prototype:
> > >     def __init__(self, host='', port=None, **x509):
> > > It looks like the x509 dictionary should contain two variables --
> > > key_file and cert_file.  Since we know what the entries are, why not
> > > make them explicit?
> > >     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> > > (Or reverse the two arguments if that is clearer.)
> > 
> > The reason for the **x509 syntax (I think -- I didn't introduce it) is
> > that it *forces* the user to use keyword args, which is a good thing
> > for such an advanced feature.  However there should be code that
> > checks that no other keyword args are present.
> 
> Can do: raise an error if other keyword args are present.
> 
> > > The FakeSocket class in CVS has a comment after the makefile def line
> > > that says "hopefully, never have to write."  It won't do at all the
> > > right thing when called with a write mode, so it ought to raise an
> > > exception.  Any reason it doesn't?
> > 
> > Probably laziness of the code.  Thanks for this code review (I guess I
> > was in a hurry when I checked that code in :-).
> 
> +1 on raising an exception.
> 
> > 
> > > I'd like to add a couple of test cases that use HTTP/1.1 to get some
> > > pages from python.org, including one that uses the chunked encoding.
> > > Just haven't gotten around to it.  Question on that front: Does it
> > > make sense to incorporate the test function in the module with the std
> > > regression test suite?  In general, I would think so.  In this
> > > particular case, the test could fail because of host networking
> > > problems.  I think that's okay as long as the error message is clear
> > > enough. 
> > 
> > Yes, I agree.  Maybe it should raise ImportError when the network is
> > unreachable -- this is the one exception that the regrtest module
> > considers non-fatal.
> 
> +1 on shifting to the test modules.
> 
> Cheers,
> -g
> 
> -- 
> Greg Stein, http://www.lyra.org/
> 
> 

-- 
Greg Stein, http://www.lyra.org/




From alexandre.ferrieux at cnet.francetelecom.fr  Mon May 22 10:25:21 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Mon, 22 May 2000 10:25:21 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org>  
	            <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <3928EEF1.693F@cnet.francetelecom.fr>

Guido van Rossum wrote:
> 
> > From: Alexandre Ferrieux <alexandre.ferrieux at cnet.francetelecom.fr>
> >
> > I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
> > just so beautiful inside. But while Tcl is weaker in the algorithms, it
> > is stronger in the os-wrapping library, and taught me to love high-level
> > abstractions. [fileevent] shines in this respect, and I'll miss it in
> > Python.
> 
> Alex, it's disappointing to me too!  There just isn't anything
> currently in the library to do this, and I haven't written apps that
> needs this often enough to have a good feel for what kind of
> abstraction is needed.

Thanks for the empathy. Apologies for my slight overreaction.

> However perhaps we can come up with a design for something better?  Do
> you have a suggestion here?

Yup. One easy answer is 'just copy from Tcl'...

Seriously, I'm really too new to Python to suggest the details or even
the *style* of this 'level 2 API to multiplexing'. However, I can sketch
the implementation since select() (from C or Tcl) is the one primitive I
most depend on !

Basically, as shortly mentioned before, the key problem is the
heterogeneity of seemingly-selectable things in Windoze. On unix, not
only does select() work with
all descriptor types on which it makes sense, but also the fd used by
Xlib is accessible; hence clean multiplexing even with a GUI package is
trivial. Now to the real (rotten) meat, that is M$'s. Facts:

	1. 'Handle' types are not equal. Unnamed pipes are (surprise!) not
selectable. Why ? Ask a relative in Redmond...

	2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
handles. They are selectable, but for example you can't use'em for
redirections. Okay in our case we don't care. I only mention it because
it's scary and could pop back into your face some time later.

	3. The GUI API doesn't expose a descriptor (handle), but fortunately
(though disgustingly) there is a special syscall to wait on both "the
message queue" and selectable handles: MsgWaitForMultipleObjects. So it's
doable, if not beautiful.

The Tcl solution to (1.), which is the only real issue, is to have a
separate thread blockingly read 1 byte from the pipe, and then post a
message back to the main thread to awaken it (yes, ugly code to handle
that extra byte and integrate it with the buffering scheme).

In summary, why not peruse Tcl's hard-won experience on
selecting-on-windoze-pipes ?

Then, for the API exposed to the Python programmer, the Tclly exposed
one is a starter:

	fileevent $channel readable|writable callback
	...
	vwait breaker_variable

Explanation for non-Tclers: fileevent hooks the callback, vwait does a
loop of select(). The callback(s) is(are) called without breaking the
loop, unless $breaker_variable is set, at which time vwait returns.
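
A rough Python transliteration of that pattern, built on nothing but
select() (all names here are made up; this is only a sketch):

    import select

    class EventLoop:
        def __init__(self):
            self.readable = {}          # channel -> callback
            self.running = 0
        def fileevent(self, channel, callback):
            self.readable[channel] = callback
        def vwait(self):
            # loop over select(), dispatching callbacks, until stop()
            # is called -- the moral equivalent of the breaker variable
            self.running = 1
            while self.running:
                r, w, e = select.select(self.readable.keys(), [], [])
                for channel in r:
                    self.readable[channel](channel)
        def stop(self):
            self.running = 0

An exception raised inside a callback simply propagates out of vwait(),
which is close to the exception-based break discussed next.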

One note about 'breaker_variable': I'm not sure I like it. I'd prefer
something based on exceptions. I don't quite understand why it's not
already this way in Tcl (which has (kindof) first-class exceptions), but
let's not repeat the mistake: let's suggest that (the equivalent of)
vwait loops forever, only to be broken out by an exception from within
one of the callbacks.

HTH,

-Alex



From mal at lemburg.com  Mon May 22 10:56:10 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 22 May 2000 10:56:10 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com> <004f01bfc37b$b1551480$34aab5d4@hagrid> <3928F437.D4DB3C25@lemburg.com>
Message-ID: <3928F62A.94980623@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal at lemburg.com> wrote:
> > > But then how can this have any effect on squishdot's indexing?
> >
> > The only possible reason I can see is that this squishdot
> > application uses 'print' to write the data -- perhaps
> > it pipes it through some other tool ?
> 
> but doesn't the patch only affects code that manages to call tp_print
> without the PRINT_RAW flag?  (that is, in "repr" mode rather than "str"
> mode)

Right.
 
> or to put it another way, if they manage to call tp_print without the
> PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

Looking at the code, the 'print' statement doesn't set
PRINT_RAW -- still the output is written literally to
stdout. Don't know where PRINT_RAW gets set... perhaps
they use PyFile_WriteObject() directly ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From pf at artcom-gmbh.de  Mon May 22 11:44:14 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 11:44:14 +0200 (MEST)
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
Message-ID: <m12tokw-000DieC@artcom0.artcom-gmbh.de>

[Guido]
> Every few months I receive patches that purport to make the tempfile
> module more secure.  I've never felt that it is a problem.  What is
> with these people?

[Tim]
> Doing a google search on
> 
>     tempfile security
> 
> turns up hundreds of rants.  Have fun <wink>.  There does appear to be a
> real vulnerability here somewhere (not necessarily Python), but the closest
> I found to a clear explanation in 10 minutes was an annoyed paragraph,
> saying that if I didn't already understand the problem I should turn in my
> Unix Security Expert badge immediately.  Unfortunately, Bill Gates never
> issued one of those to me.

On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a 
working example which exploits this vulnerability in older versions
of GCC.

The basic idea is indeed very simple:  Since the /tmp directory is
writable for any user, the bad guy can create a symbolic link in /tmp
pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
program will then overwrite this arbitrary file (where the programmer
really wanted to write something to his tempfile instead).  Since this
will happen with the access permissions of the process running this
program, this opens a bunch of vulnerabilities in many programs
writing something into temporary files with predictable file names.
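
The usual defence, sketched in Python (though, as was pointed out
earlier in this thread, O_EXCL is not reliable over NFS):

    import os

    def create_private_tempfile(name):
        # O_EXCL makes the open fail if `name' already exists, so an
        # attacker's pre-planted symlink is rejected instead of followed
        fd = os.open(name, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0600)
        return os.fdopen(fd, 'w+b')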

www.cert.org is another great place to look for security related info.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From claird at starbase.neosoft.com  Mon May 22 13:31:08 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 06:31:08 -0500 (CDT)
Subject: [Python-Dev] Re: Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <200005221131.GAA39671@starbase.neosoft.com>

	From alexandre.ferrieux at cnet.francetelecom.fr  Mon May 22 03:40:13 2000
			.
			.
			.
	> Alex, it's disappointing to me too!  There just isn't anything
	> currently in the library to do this, and I haven't written apps that
	> needs this often enough to have a good feel for what kind of
	> abstraction is needed.

	Thanks for the empathy. Apologies for my slight overreaction.

	> However perhaps we can come up with a design for something better?  Do
	> you have a suggestion here?

	Yup. One easy answer is 'just copy from Tcl'...

	Seriously, I'm really too new to Python to suggest the details or even
	the *style* of this 'level 2 API to multiplexing'. However, I can sketch
	the implementation since select() (from C or Tcl) is the one primitive I
	most depend on !

	Basically, as shortly mentioned before, the key problem is the
	heterogeneity of seemingly-selectable things in Windoze. On unix, not
	only does select() work with
	all descriptor types on which it makes sense, but also the fd used by
	Xlib is accessible; hence clean multiplexing even with a GUI package is
	trivial. Now to the real (rotten) meat, that is M$'s. Facts:

		1. 'Handle' types are not equal. Unnamed pipes are (surprise!) not
	selectable. Why ? Ask a relative in Redmond...

		2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
	handles. They are selectable, but for example you can't use'em for
	redirections. Okay in our case we don't care. I only mention it cause
	its scary and could pop back into your face some time later.

		3. The GUI API doesn't expose a descriptor (handle), but fortunately
	(though disgustingly) there is a special syscall to wait on both "the
	message queue" and selectable handles: MsgWaitForMultipleObjects. So its
	doable, if not beautiful.

	The Tcl solution to (1.), which is the only real issue, is to have a
	separate thread blockingly read 1 byte from the pipe, and then post a
	message back to the main thread to awaken it (yes, ugly code to handle
	that extra byte and integrate it with the buffering scheme).

	In summary, why not peruse Tcl's hard-won experience on
	selecting-on-windoze-pipes ?

	Then, for the API exposed to the Python programmer, the Tclly exposed
	one is a starter:

		fileevent $channel readable|writable callback
		...
		vwait breaker_variable

	Explanation for non-Tclers: fileevent hooks the callback, vwait does a
	loop of select(). The callback(s) is(are) called without breaking the
	loop, unless $breaker_variable is set, at which time vwait returns.

	One note about 'breaker_variable': I'm not sure I like it. I'd prefer
	something based on exceptions. I don't quite understand why it's not
	already this way in Tcl (which has (kindof) first-class exceptions), but
	let's not repeat the mistake: let's suggest that (the equivalent of)
	vwait loops forever, only to be broken out by an exception from within
	one of the callbacks.
			.
			.
			.
I've copied everything Alex wrote, because he writes for
me, also.

As much as I welcome it, I can't answer Guido's question,
"What should the API look like?"  I've been mulling this
over, and concluded I don't have sufficiently deep know-
ledge to be trustworthy on this.

Instead, I'll just give a bit of personal testimony.  I
made the rather coy c.l.p posting, in which I sincerely
asked, "How do you expert Pythoneers do it?" (my para-
phrase), without disclosing either that Alex and I have
been discussing this, or that the Tcl interface we both
know is simply a delight to me.

Here's the delight.  Guido asked, approximately, "What's
the point?  Do you need this for more than the keeping-
the-GUI-responsive-for-which-there's-already-a-notifier-
around case?"  The answer is, yes.  It's a good question,
though.  I'll repeat what Alex has said, with my own em-
phasis:  Tcl gives a uniform command API for
* files (including I/O ports, ...)
* subprocesses
* TCP socket connections
and allows the same fcntl()-like configuration of them
all as to encodings, blocking, buffering, and character
translation.  As a programmer, I use this stuff
CONSTANTLY, and very happily.  It's not just for GUIs; 
several of my mission-critical delivered products have
Tcl-coded daemons to monitor hardware, manage customer
transactions, ...  It's simply wonderful to be able to
evolve a protocol from a socket connection to an fopen()
read to ...

Tcl is GREAT at "gluing".  Python can do it, but Tcl has
a couple of years of refinement in regard to portability
issues of managing subprocesses.  I really, *really*
miss this stuff when I work with a language other than
Tcl.

I don't often whine, "Language A isn't language B."  I'm
happy to let individual character come out.  This is,
for me, an exceptional case.  It's not that Python doesn't
do it the Tcl way; it's that the Tcl way is wonderful, and
moreover that Python doesn't feel to me to have much of an
alternative answer.  I conclude that there might be some-
thing for Python to learn here.

A colleague has also written an even higher-level wrapper in
Tcl for asynchronous sockets.  I'll likely explain more
about it <URL:http://www-users.cs.umn.edu/~dejong/tcl/EasySocket.tar.gz>
in a follow-up.

Conclusion for now:  Alex and I like Python so much that we
want you guys to know that better piping-gluing-networking
truly is possible, and even worthwhile.  This is sort of
like the emigrants who've reported, "Yeah, here's the
stuff about CPAN that's cool, and how we can have it, too."
Through it all, we absolutely want Python to continue to be
Python.



From guido at python.org  Mon May 22 17:09:44 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 08:09:44 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 08:18:22 +0200."
             <m12tlXi-000CnvC@artcom0.artcom-gmbh.de> 
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de> 
Message-ID: <200005221509.IAA06955@cj20424-a.reston1.va.home.com>

> From: pf at artcom-gmbh.de (Peter Funk)
> 
> Guido van Rossum:
> [...]
> > The one objection could be that the locale may be obsolescent -- but
> > I've only heard /F vent an opinion about that; personally, I doubt
> > that we will be able to remove the locale any time soon, even if we
> > invent a better way.  
> 
> AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> 
> Although I understand Barry's and Ping's objections against a global state,
> it used to work very well:  On a typical single user Linux system the
> user chooses his locale during the first stages of system setup and
> never has to think about it again.  On multi user systems the locale
> of individual accounts may be customized using several environment
> variables, which can override the default locale of the system.
> 
> > Plus, I think that "better way" should address
> > this issue anyway.  If the locale eventually disappears, the feature
> > automatically disappears with it, because you *have* to make a
> > locale.setlocale() call before the behavior of repr() changes.
> 
> The last sentence is at least not the whole truth.
> 
> On POSIX systems there are a several environment variables used to
> control the default locale settings for a users session.  For example
> on my SuSE Linux system currently running in the german locale the
> environment variable LC_CTYPE=de_DE is automatically set by a file 
> /etc/profile during login, which causes automatically the C-library 
> function toupper('ä') to return an 'Ä' ---you should see
> a lower case a-umlaut as argument and an upper case umlaut as return
> value--- without having all applications to call 'setlocale' explicitly.
> 
> So this simply works well as intended without having to add calls
> to 'setlocale' to all application program using this C-library functions.

I don't believe that.  According to the ANSI standard, a C program
*must* call setlocale(LC_..., "") if it wants the environment
variables to be honored; without this call, the locale is always the
"C" locale, which should *not* honor the environment variables.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tismer at tismer.com  Mon May 22 14:40:51 2000
From: tismer at tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 14:40:51 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <39292AD2.F5080E35@tismer.com>

Hi, I'm back from White Russia (yup, a survivor) :-)

Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > Then a string should better not be a sequence.
> >
> > The number of places where I really used the string sequence
> > protocol to take advantage of it is outperfomed by a factor
> > of ten by cases where I missed to tupleise and got a bad
> > result. A traceback is better than a sequence here.
> 
> Alas, I think
> 
>     for ch in string:
>         muck w/ the character ch
> 
> is a common idiom.

Sure.
And now for my proposal:

Strings should be strings, but not sequences.
Slicing is ok, and it will always yield strings.
Indexing would either
a - not yield anything but an exception, or
b - yield integers instead of 1-char strings

The above idiom would read like this:

Version a: Access string elements via a coercion like tuple() or list():

    for ch in tuple(string):
        muck w/ the character ch

Version b: Access string elements as integer codes:

    for c in string:
        # either:
        ch = chr(c)
        muck w/ the character ch
        # or:
        muck w/ the character code c

> > oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris
> 
> The "sequenenceness" of strings does get in the way often enough.  Strings
> have the amazing property that, since characters are also strings,
> 
>     while 1:
>         string = string[0]
> 
> never terminates with an error.  This often manifests as unbounded recursion
> in generic functions that crawl over nested sequences (the first time you
> code one of these, you try to stop the recursion on a "is it a sequence?"
> test, and then someone passes in something containing a string and it
> descends forever).  And we also have that
> 
>     format % values
> 
> requires "values" to be specifically a tuple rather than any old sequence,
> else the current
> 
>     "%s" % some_string
> 
> could be interpreted the wrong way.
> 
> There may be some hope in that the "for/in" protocol is now conflated with
> the __getitem__ protocol, so if Python grows a more general iteration
> protocol, perhaps we could back away from the sequenceness of strings
> without harming "for" iteration over the characters ...

O-K!
We seem to have a similar conclusion: It would be better if strings
were not sequences, after all. How to achieve this seems to be
kind of a problem, of course.

Oh, there is another idiom possible!
How about this, after we have the new string methods :-)

    for ch in string.split():
        muck w/ the character ch

Ok, in the long term, we need to rethink iteration of course.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From tismer at tismer.com  Mon May 22 14:55:21 2000
From: tismer at tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 14:55:21 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000201bfc082$50909f80$6c2d153f@tim>
Message-ID: <39292E38.A5A89270@tismer.com>


Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > After all, it is no surprize. They are right.
> > If we have to change their mind in order to understand
> > a basic operation, then we are wrong, not they.
> 
> [Tim]
> > Huh!  I would not have guessed that you'd give up on Stackless
> > that easily <wink>.
> 
> [Chris]
> > Noh, I didn't give up Stackless, but fishing for soles.
> > After Just v. R. has become my most ambitious user,
> > I'm happy enough.
> 
> I suspect you missed the point:  Stackless is the *ultimate* exercise in
> "changing their mind in order to understand a basic operation".  I was
> tweaking you, just as you're tweaking me <smile!>.

Squeek! Peace on earth :-)

And you are almost right on Stackless.
Almost, since I know of at least three new Python users who came
to Python *because* it has Stackless + Continuations. This is a very
new aspect to me.
Things are getting interesting now: Today I got a request from CCP
regarding continuations: They will build a massively parallel
multiplayer game with that. http://www.ccp.cc/eve

> > It is absolutely phantastic.
> > The most uninteresting stuff in the join is the separator,
> > and it has the power to merge thousands of strings
> > together, without asking the sequence at all
> >  - give all power to the suppressed, long live the Python anarchy :-)
> 
> Exactly!  Just as love has the power to bind thousands of incompatible
> humans without asking them either:  a vote for space.join() is a vote for
> peace on earth.

hmmm - that's so nice...

So let's drop a generic join, and use string.love() instead.

> while-a-generic-join-builtin-is-a-vote-for-war<wink>-ly y'rs  - tim

join-is-a-peacemaker-like-a-Winchester-Cathedral-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From claird at starbase.neosoft.com  Mon May 22 15:09:03 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 08:09:03 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <200005221309.IAA41866@starbase.neosoft.com>

	From guido at cj20424-a.reston1.va.home.com  Fri May 19 07:26:16 2000
			.
			.
			.
	Alex, it's disappointing to me too!  There just isn't anything
	currently in the library to do this, and I haven't written apps that
	needs this often enough to have a good feel for what kind of
	abstraction is needed.

	However perhaps we can come up with a design for something better?  Do
	you have a suggestion here?
Review:  Alex and I have so far presented
the Tcl way.  We're still a bit off-balance
at the generosity of spirit that's listen-
ing to us so respectfully.  Still ahead is
the hard work of designing an interface or
higher-level abstraction that's right for
Python.

The good thing, of course, is that this is
absolutely not a language issue at all.
Python is more than sufficiently expressive
for this matter.  All we're doing is working
to insert the right thing in the (a) library.

	I agree with your comment that higher-level abstractions around OS
	stuff are needed -- I learned system programming long ago, in C, and
	I'm "happy enough" with the current state of affairs, but I agree that
	for many people this is a problem, and there's no reason why Python
	couldn't do better...
I've got a whole list of "higher-level
abstractions around OS stuff" that I've been
collecting.  Maybe I'll make it fit for
others to see once we're through this affair
...

	--Guido van Rossum (home page: http://www.python.org/~guido/)




From guido at python.org  Mon May 22 18:16:08 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:16:08 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 09:20:50 +0200."
             <008001bfc3be$7e5eae40$34aab5d4@hagrid> 
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>  
            <008001bfc3be$7e5eae40$34aab5d4@hagrid> 
Message-ID: <200005221616.JAA07234@cj20424-a.reston1.va.home.com>

> From: "Fredrik Lundh" <effbot at telia.com>
>
> Peter Funk wrote:
> > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> 
> you're missing the point -- now that we've added unicode support to
> Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> platforms implement a wctype interface, it's not widely available, and it's
> not always unicode.

Huh?  We were talking strictly 8-bit strings here.  The locale support
hasn't changed there.

> so in order to provide platform-independent unicode support, Python 1.6
> comes with unicode-aware and fully portable replacements for the ctype
> functions.

For those who only need Latin-1 or another 8-bit ASCII superset, the
Unicode stuff is overkill.

> the code is already in there...
> 
> > On POSIX systems there are a several environment variables used to
> > control the default locale settings for a users session.  For example
> > on my SuSE Linux system currently running in the german locale the
> > environment variable LC_CTYPE=de_DE is automatically set by a file
> > /etc/profile during login, which causes automatically the C-library
> > function toupper('ä') to return an 'Ä' ---you should see
> > a lower case a-umlaut as argument and an upper case umlaut as return
> > value--- without having all applications to call 'setlocale' explicitly.
> >
> > So this simply works well as intended without having to add calls
> > to 'setlocale' to all application program using this C-library functions.
> 
> note that this leaves us with four string flavours in 1.6:
> 
> - 8-bit binary arrays.  may contain binary goop, or text in some strange
>   encoding.  upper, strip, etc should not be used.

These are not strings.

> - 8-bit text strings using the system encoding.  upper, strip, etc works
>   as long as the locale is properly configured.
> 
> - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
>   system encoding is a subset of unicode -- which means US ASCII or
>   ISO Latin 1.

This is a figment of your imagination.  You can use 8-bit text strings
to contain Latin-1, but you have to set your locale to match.

> - wide unicode text strings.  upper, strip, etc always works.
> 
> is this complexity really worth it?


From pf at artcom-gmbh.de  Mon May 22 15:02:18 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 15:02:18 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005221509.IAA06955@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 22, 2000  8: 9:44 am"
Message-ID: <m12trqc-000DieC@artcom0.artcom-gmbh.de>

Hi!

[...]
[me]:
> > So this simply works well as intended without having to add calls
> > to 'setlocale' to all application program using this C-library functions.

[Guido van Rossum]:
> I don't believe that.  According to the ANSI standard, a C program
> *must* call setlocale(LC_..., "") if it wants the environment
> variables to be honored; without this call, the locale is always the
> "C" locale, which should *not* honor the environment variables.

pf at pefunbk> python 
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string
>>> print string.upper("ä")
Ä
>>> 

This was the vanilla Python 1.5.2 as originally delivered by SuSE Linux.  
But yes, you are right. :-(  My memory was confused by this practical
experience.  Now I'd like to quote from the man pages here:

man toupper:
[...]
BUGS
       The details of what constitutes an uppercase or  lowercase
       letter  depend  on  the  current locale.  For example, the
       default "C" locale does not know about umlauts, so no con?
       version is done for them.

       In some non - English locales, there are lowercase letters
       with no corresponding  uppercase  equivalent;  the  German
       sharp s is one example.

man setlocale:
[...]
       A  program  may be made portable to all locales by calling
       setlocale(LC_ALL, "" ) after program   initialization,  by
       using  the  values  returned  from a localeconv() call for
       locale - dependent information and by using  strcoll()  or
       strxfrm() to compare strings.
[...]
   CONFORMING TO
       ANSI C, POSIX.1

       Linux  (that  is,  libc) supports the portable locales "C"
       and "POSIX".  In the good old days there used to  be  sup?
       port for the European Latin-1 "ISO-8859-1" locale (e.g. in
       libc-4.5.21 and  libc-4.6.27),  and  the  Russian  "KOI-8"
       (more  precisely,  "koi-8r") locale (e.g. in libc-4.6.27),
       so that having an environment variable LC_CTYPE=ISO-8859-1
       sufficed to make isprint() return the right answer.  These
       days non-English speaking Europeans have  to  work  a  bit
       harder, and must install actual locale files.
[...]

In recent Linux distributions almost every Linux C-program seems to 
contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy 
to forget about it.  However the core Python interpreter does not.
It seems the Linux C library is not fully ANSI compliant in this case:
it seems to honour the setting of $LANG regardless of whether a program
calls 'setlocale' or not.

Regards, Peter



From guido at python.org  Mon May 22 18:31:50 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:31:50 -0700
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: Your message of "Mon, 22 May 2000 10:25:21 +0200."
             <3928EEF1.693F@cnet.francetelecom.fr> 
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>  
            <3928EEF1.693F@cnet.francetelecom.fr> 
Message-ID: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>

> Yup. One easy answer is 'just copy from Tcl'...

Tcl seems to be your only frame of reference.  I think it's too early
to say that borrowing Tcl's design is right for Python.  Don't forget
that part of Tcl's design was guided by the desire for backwards
compatibility with Tcl's strong (stronger than Python I find!) Unix
background.

> Seriously, I'm really too new to Python to suggest the details or even
> the *style* of this 'level 2 API to multiplexing'. However, I can sketch
> the implementation since select() (from C or Tcl) is the one primitive I
> most depend on !
> 
> Basically, as shortly mentioned before, the key problem is the
> heterogeneity of seemingly-selectable things in Windoze. On unix, not
> only does select() work with
> all descriptor types on which it makes sense, but also the fd used by
> Xlib is accessible; hence clean multiplexing even with a GUI package is
> trivial. Now to the real (rotten) meat, that is M$'s. Facts:

Note that on Windows, select() is part of SOCKLIB, which explains why
it only understands sockets.  Native Windows code uses the
wait-for-event primitives that you are describing, and these are
powerful enough to wait on named pipes, sockets, and GUI events.
Complaining about the select interface on Windows isn't quite fair.

> 	1. 'Handle' types are not equal. Unnames pipes are (surprise!) not
> selectable. Why ? Ask a relative in Redmond...

Can we cut the name-calling?

> 	2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
> handles. They are selectable, but for example you can't use'em for
> redirections. Okay in our case we don't care. I only mention it cause
> its scary and could pop back into your face some time later.

Handles are a much more low-level concept than file descriptors.  Get
used to it.

> 	3. The GUI API doesn't expose a descriptor (handle), but fortunately
> (though disgustingly) there is a special syscall to wait on both "the
> message queue" and selectable handles: MsgWaitForMultipleObjects. So its
> doable, if not beautiful.
> 
> The Tcl solution to (1.), which is the only real issue,

Why is (1) the only issue?  Maybe in Tcl-land...

> is to have a
> separate thread blockingly read 1 byte from the pipe, and then post a
> message back to the main thread to awaken it (yes, ugly code to handle
> that extra byte and integrate it with the buffering scheme).

Or the exposed API could deal with this in a different way.

> In summary, why not peruse Tcl's hard-won experience on
> selecting-on-windoze-pipes ?

Because it's designed for Tcl.

> Then, for the API exposed to the Python programmer, the Tclly exposed
> one is a starter:
> 
> 	fileevent $channel readable|writable callback
> 	...
> 	vwait breaker_variable
> 
> Explanation for non-Tclers: fileevent hooks the callback, vwait does a
> loop of select(). The callback(s) is(are) called without breaking the
> loop, unless $breaker_variable is set, at which time vwait returns.

Sorry, you've lost me here.  Fortunately there's more info at
http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
very complicated, and I'm not sure why you rejected my earlier
suggestion to use threads outright as "too complicated".  After
reading that man page, threads seem easy compared to the caution one
has to exert when using non-blocking I/O.

> One note about 'breaker_variable': I'm not sure I like it. I'd prefer
> something based on exceptions. I don't quite understand why it's not
> already this way in Tcl (which has (kindof) first-class exceptions), but
> let's not repeat the mistake: let's suggest that (the equivalent of)
> vwait loops forever, only to be broken out by an exception from within
> one of the callbacks.

Vwait seems to be part of the Tcl event model.  Maybe we would need to
think about an event model for Python?  On the other hand, Python is
at the mercy of the event model of whatever GUI package it is using --
which could be Tk, or wxWindows, or Gtk, or native Windows, or native
MacOS, or any of a number of other event models.

Perhaps this is an issue that each GUI package available to Python
will have to deal with separately...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Mon May 22 18:49:24 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:49:24 -0700
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: Your message of "Mon, 22 May 2000 14:40:51 +0200."
             <39292AD2.F5080E35@tismer.com> 
References: <000301bfc082$51ce0180$6c2d153f@tim>  
            <39292AD2.F5080E35@tismer.com> 
Message-ID: <200005221649.JAA07398@cj20424-a.reston1.va.home.com>

Christian, there was a smiley in your signature, so I can safely
ignore it, right?  It doesn't make sense at all to me to make "abc"[0]
return 97 instead of "a".

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Mon May 22 18:54:35 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:54:35 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 15:02:18 +0200."
             <m12trqc-000DieC@artcom0.artcom-gmbh.de> 
References: <m12trqc-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005221654.JAA07426@cj20424-a.reston1.va.home.com>

> pf at pefunbk> python 
> Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> import string
> >>> print string.upper("ä")
> Ä
> >>> 

This threw me off too.  However try this:

python -c 'print "ä".upper()'

It will print "?".  A mystery?  No, the GNU readline library calls
setlocale().  It is wrong, but I can't help it.  But it only affects
interactive use of Python.

> In recent Linux distributions almost every Linux C-program seems to 
> contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy 
> to forget about it.  However the core Python interpreter does not.
> it seems the Linux C-Library is not fully ANSI compliant in this case.
> It seems to honour the setting of $LANG regardless whether a program
> calls 'setlocale' or not.

No, the explanation is in GNU readline.

Compile this little program and see for yourself:

#include <ctype.h>
#include <stdio.h>

main()
{
	printf("toupper(%c) = %c\n", '?', toupper('?'));
}

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tismer at tismer.com  Mon May 22 16:11:37 2000
From: tismer at tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 16:11:37 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfc082$51ce0180$6c2d153f@tim>  
	            <39292AD2.F5080E35@tismer.com> <200005221649.JAA07398@cj20424-a.reston1.va.home.com>
Message-ID: <39294019.3CB47800@tismer.com>


Guido van Rossum wrote:
> 
> Christian, there was a smiley in your signature, so I can safely
> ignore it, right?  It doesn't make sense at all to me to make "abc"[0]
> return 97 instead of "a".

There was a smiley, but mostly because I cannot decide
what I want. I'm quite convinced that strings had better
not be sequences, at least not sequences of strings.

"abc"[0:1] would be enough, "abc"[0] isn't worth the side effects,
as listed in Tim's posting.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From fdrake at acm.org  Mon May 22 16:12:54 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 07:12:54 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005220710520.13789-100000@mailhost.beopen.com>

On Fri, 19 May 2000, Guido van Rossum wrote:
 > Hm, that's bogus.  It works well under Windows -- with the restriction
 > that it only works for sockets, but for sockets it works as well as
 > on Unix.  it also works well on the Mac.  I wonder where that note
 > came from (it's probably 6 years old :-).

  Is that still in there?  If I could get a pointer from someone I'll be
able to track it down.  I didn't see it in the select or socket module
documents, and a quick grep didn't find 'really work'.
  It's definitely fixable if we can find it.  ;)


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From fdrake at acm.org  Mon May 22 16:21:48 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 07:21:48 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <PLEJJNOHDIGGLDPOGPJJGEIPCDAA.DavidA@ActiveState.com>
Message-ID: <Pine.LNX.4.10.10005220718250.13789-100000@mailhost.beopen.com>

On Fri, 19 May 2000, David Ascher wrote:
 > I'm pretty sure I know where it came from -- it came from Sam Rushing's
 > tutorial on how to use Medusa, which was more or less cut & pasted into the
 > doc, probably at the time that asyncore and asynchat were added to the
 > Python core.  IMO, it's not the best part of the Python doc -- it is much
 > too low-to-the ground, and assumes the reader already understands much about
 > I/O, sync/async issues, and cares mostly about high performance.  All of

  It's a fairly young section, and I haven't had as much time to review
and edit that or some of the other young sections.  I'll try to pay
particular attention to these as I work on the 1.6 release.

 > which are true of wonderful Sam, most of which are not true of the average
 > Python user.
 > 
 > While we're complaining about doc, asynchat is not documented, I believe.
 > Alas, I'm unable to find the time to write up said documentation.

  Should that situation change, I'll gladly accept a section on asynchat!
Or, if anyone else has time to contribute...??


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From skip at mojam.com  Mon May 22 16:25:00 2000
From: skip at mojam.com (Skip Montanaro)
Date: Mon, 22 May 2000 09:25:00 -0500 (CDT)
Subject: [Python-Dev] ANNOUNCE: Python CVS tree moved to SourceForge
In-Reply-To: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>
References: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>
Message-ID: <14633.17212.650090.540777@beluga.mojam.com>

    Guido> If you have an existing working tree that points to the
    Guido> cvs.python.org repository, you may want to retarget it to the
    Guido> SourceForge tree.  This can be done painlessly with Greg Ward's
    Guido> cvs_chroot script:

    Guido>   http://starship.python.net/~gward/python/

I tried this with (so far) no apparent success.  I ran cvs_chroot as

    cvs_chroot :pserver:anonymous at cvs.python.sourceforge.net:/cvsroot/python

It warned me about some directories that didn't match the top level
directory.  "No problem", I thought.  I figured they were for the nondist
portions of the tree.  When I tried a cvs update after logging in to the
SourceForge cvs server I got tons of messages that looked like:

    cvs update: move away dist/src/Tools/scripts/untabify.py; it is in the way
    C dist/src/Tools/scripts/untabify.py

It doesn't look like untabify.py has been hosed, but the warnings worry me.
Anyone else encounter this problem?  If so, what's its meaning?

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From alexandre.ferrieux at cnet.francetelecom.fr  Mon May 22 16:51:56 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Mon, 22 May 2000 16:51:56 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>  
	            <3928EEF1.693F@cnet.francetelecom.fr> <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <3929498C.1941@cnet.francetelecom.fr>

Guido van Rossum wrote:
> 
> > Yup. One easy answer is 'just copy from Tcl'...
> 
> Tcl seems to be your only frame of reference.

Nope, but I'll welcome any proof of existence of similar abstractions
(for multiplexing) elsewhere.

>  I think it's too early
> to say that borrowing Tcl's design is right for Python.  Don't forget
> that part of Tcl's design was guided by the desire for backwards
> compatibility with Tcl's strong (stronger than Python I find!) Unix
> background.

I don't quite get how the 'unix background' comes into play here, since
[fileevent] is now implemented and works correctly on all platforms.

If you are talking about the API as seen from above, I don't understand
why 'hooking a callback' and 'multiplexing event sources' are a unix
specificity, and/or why it should be avoided outside unix.

> > Seriously, I'm really too new to Python to suggest the details or even
> > the *style* of this 'level 2 API to multiplexing'. However, I can sketch
> > the implementation since select() (from C or Tcl) is the one primitive I
> > most depend on !
> >
> > Basically, as shortly mentioned before, the key problem is the
> > heterogeneity of seemingly-selectable things in Windoze. On unix, not
> > only does select() work with
> > all descriptor types on which it makes sense, but also the fd used by
> > Xlib is accessible; hence clean multiplexing even with a GUI package is
> > trivial. Now to the real (rotten) meat, that is M$'s. Facts:
> 
> Note that on Windows, select() is part of SOCKLIB, which explains why
> it only understands sockets.  Native Windows code uses the
> wait-for-event primitives that you are describing, and these are
> powerful enough to wait on named pipes, sockets, and GUI events.
> Complaining about the select interface on Windows isn't quite fair.

Sorry, you missed the point. Here I used the term 'select()' as a 
generic one (I didn't want to pollute a general discussion with
OS-specific names...). On windows it means MsgWaitForMultipleObjects.

Now as you said "these are powerful enough to wait on named pipes,
sockets, and GUI events"; I won't deny the obvious truth. However,
again, they don't work on *unnamed pipes* (which are the only ones in
'95). That's my sole reason for complaining, and I'm afraid it is fair
;-)

> >       1. 'Handle' types are not equal. Unnames pipes are (surprise!) not
> > selectable. Why ? Ask a relative in Redmond...
> 
> Can we cut the name-calling?

Yes we can :^P

> 
> >       2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
> > handles. They are selectable, but for example you can't use'em for
> > redirections. Okay in our case we don't care. I only mention it cause
> > its scary and could pop back into your face some time later.
> 
> Handles are a much more low-level concept than file descriptors.  get
> used to it.

Take it easy, I meant to help. Low-level as they may be, can you explain
why *some* can be passed to CreateProcess as redirections, and *some*
can't?  Obviously there *is* some attempt to unify things in Windows (if
only the single name of 'handle'); and just as clearly it is not
completely successful.

> >       3. The GUI API doesn't expose a descriptor (handle), but fortunately
> > (though disgustingly) there is a special syscall to wait on both "the
> > message queue" and selectable handles: MsgWaitForMultipleObjects. So its
> > doable, if not beautiful.
> >
> > The Tcl solution to (1.), which is the only real issue,
> 
> Why is (1) the only issue?

Because for (2) we don't care (no need for redirections in our case) and
for (3) the judgement is only aesthetic. 

>  Maybe in Tcl-land...

Come on, I'm emigrating from Tcl to Python with open palms, as Cameron
puts it.
I've already mentioned the outstanding beauty of Python's internal
design, and in comparison Tcl is absolutely awful. Even at the (script)
API level, Some of the early choices in Tcl are disgusting (and some
recent ones too...). I'm really turning to Python with the greatest
pleasure - please don't interpret my arguments as yet another Lang1 vs.
Lang2 flamewar.

> > is to have a
> > separate thread blockingly read 1 byte from the pipe, and then post a
> > message back to the main thread to awaken it (yes, ugly code to handle
> > that extra byte and integrate it with the buffering scheme).
> 
> Or the exposed API could deal with this in a different way.

Please elaborate ?

> > In summary, why not peruse Tcl's hard-won experience on
> > selecting-on-windoze-pipes ?
> 
> Because it's designed for Tcl.

I said 'why not' as a positive suggestion.
I didn't expect you to actually say why not...

Moreover, I don't understand 'designed for Tcl'. What's specific to Tcl
in unifying descriptor types?

> > Then, for the API exposed to the Python programmer, the Tclly exposed
> > one is a starter:
> >
> >       fileevent $channel readable|writable callback
> >       ...
> >       vwait breaker_variable
> >
> > Explanation for non-Tclers: fileevent hooks the callback, vwait does a
> > loop of select(). The callback(s) is(are) called without breaking the
> > loop, unless $breaker_variable is set, at which time vwait returns.
> 
> Sorry, you've lost me here.  Fortunately there's more info at
> http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
> very complicated,

Ahem, self-destroying argument: "Fortunately ... very complicated".

While I agree the fileevent manpage is longer than it should be, I fail
to see
what's complicated in the model of 'hooking a callback for a given kind
of events'.

> and I'm not sure why you rejected my earlier
> suggestion to use threads outright as "too complicated".

Not on the same level. You're complaining about the script-level API (or
its documentation, more precisely!). I dismissed the thread-based
*implementation* as overkill in terms of resource consumption (thread
context + switching + ITC) on platforms which can use select() (for anon
pipes on Windows, as already explained, the thread is unavoidable).

>  After
> reading that man page, threads seem easy compared to the caution one
> has to exert when using non-blocking I/O.

Oh, I get it. The problem is, *that* manpage unfortunately tries to
explain event-based and non-blocking I/O at the same time (presumably
because the average user will never follow the 'See Also' links). That's
a blatant pedagogic mistake. Let me try:

	fileevent <channel> readable|writable <script>

	Hooks <script> to be called back whenever the given <channel> becomes
readable|writable. 'Whenever' here means from within event processing
primitives (vwait, update).

	Example:

		# whenever a new line comes down the socket, display it.

		set s [socket $host $port]
		fileevent $s readable gotdata
		proc gotdata {} {global s;puts "New data: [gets $s]"}
		vwait forever

To answer a potential question about blockingness: yes, in the example
above the [gets] will block until a complete line is received. But
mentioning this fact in the manpage is needlessly misleading, because the
fileevent mechanism obviously allows you to implement any kind of protocol,
line-based or not, terminator- or size-header-based or not. Uses with
blocking and nonblocking [read], and mixes thereof, are immediate
consequences of this classification.

Hope this helps.

> Vwait seems to be part of the Tcl event model.

Hardly. It's just the Tcl name for the primitive that (blockingly) calls
select() (generic term - see above)

>  Maybe we would need to think about an event model for Python?

With pleasure - please define 'model'. Do you mean callbacks vs.
explicit decoding of an event structure?  Do you mean blocking select()
vs. something more asynchronous like threads or signals ?

> On the other hand, Python is
> at the mercy of the event model of whatever GUI package it is using --
> which could be Tk, or wxWindows, or Gtk, or native Windows, or native
> MacOS, or any of a number of other event models.

Why should Python be the only one exposed to this diversity?  Don't
assume that Tk is the only option for Tcl.  The Tcl/C API even exposes
the proper hooks to integrate any new event source, like a GUI package.

Again, I'm not interested in Tcl vs. Python here (and anyway Python wins
!!!). I just want to extract what's truly orthogonal to specific design
choices. As it turns out, what you call 'the Tcl event model' can
happily be transported to any (imperative) lang.

I can even be more precise: a random GUI package can be used this way
iff the two following conditions hold:

	(a) Its queue can awaken a select()-like primitive.
	(b) Its queue can be Peek'ed (to check for buffered msgs
                                      before blocking again)

> Perhaps this is an issue that each GUI package available to Python
> will have to deal with separately...

The characterization is given just above. To me it looks generic enough
to build an abstraction upon it. It's been done for Tcl, and is utterly
independent from its design peculiarities. Now everything depends on
whether abstraction is sought or not...

-Alex



From pf at artcom-gmbh.de  Mon May 22 17:01:50 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 17:01:50 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005221654.JAA07426@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 22, 2000  9:54:35 am"
Message-ID: <m12ttiI-000DieC@artcom0.artcom-gmbh.de>

Hi, 

Guido van Rossum:
> > pf at pefunbk> python 
> > Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
> > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> > >>> import string
> > >>> print string.upper("ä")
> > Ä
> > >>> 
> 
> This threw me off too.  However try this:
> 
> python -c 'print "ä".upper()'

Yes, you are right.  :-(

Conclusion:  If the 'locale' module would ever become deprecated
then ...ummm...  we poor mortals will simply have to add a line
'import readline' to our Python programs.  Nifty... ;-)

Regards, Peter



From claird at starbase.neosoft.com  Mon May 22 17:19:21 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 10:19:21 -0500 (CDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <200005221519.KAA45253@starbase.neosoft.com>

	From guido at cj20424-a.reston1.va.home.com  Mon May 22 08:45:58 2000
			.
			.
			.
	Tcl seems to be your only frame of reference.  I think it's too early
	to say that borrowing Tcl's design is right for Python.  Don't forget
	that part of Tcl's design was guided by the desire for backwards
	compatibility with Tcl's strong (stronger than Python I find!) Unix
	background.
Right.  We quite agree.  Both of us came
to this looking to learn in the first place
what *is* right for Python.
			.
		[various points]
			.
			.
	> Then, for the API exposed to the Python programmer, the Tclly exposed
	> one is a starter:
	> 
	> 	fileevent $channel readable|writable callback
	> 	...
	> 	vwait breaker_variable
	> 
	> Explanation for non-Tclers: fileevent hooks the callback, vwait does a
	> loop of select(). The callback(s) is(are) called without breaking the
	> loop, unless $breaker_variable is set, at which time vwait returns.

	Sorry, you've lost me here.  Fortunately there's more info at
	http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
	very complicated, and I'm not sure why you rejected my earlier
	suggestion to use threads outright as "too complicated".  After
	reading that man page, threads seem easy compared to the caution one
	has to exert when using non-blocking I/O.

	> One note about 'breaker_variable': I'm not sure I like it. I'd prefer
	> something based on exceptions. I don't quite understand why it's not
	> already this way in Tcl (which has (kindof) first-class exceptions), but
	> let's not repeat the mistake: let's suggest that (the equivalent of)
	> vwait loops forever, only to be broken out by an exception from within
	> one of the callbacks.

	Vwait seems to be part of the Tcl event model.  Maybe we would need to
	think about an event model for Python?  On the other hand, Python is
	at the mercy of the event model of whatever GUI package it is using --
	which could be Tk, or wxWindows, or Gtk, or native Windows, or native
	MacOS, or any of a number of other event models.

	Perhaps this is an issue that each GUI package available to Python
	will have to deal with separately...

	--Guido van Rossum (home page: http://www.python.org/~guido/)
There are a lot of issues here.  I've got clients
with emergencies that'll keep me busy all week,
and will be able to respond only sporadically.
For now, I want to emphasize that Alex and I both
respect Python as itself; it would simply be alien
to us to do the all-too-common trick of whining,
"Why can't it be like this other language I just
left?"

Tcl's event model has been more successful than
any of you probably realize.  You deserve to know
that.

Should Python have an event model?  I'm not con-
vinced.  I want to work with Python threading a
bit more.  It could be that it answers all the
needs Python has in this regard.  The documentation
Guido found "very complicated" above we think of
as ...--well, I want to conclude by saying I find
this discussion productive, and appreciate your
patience in entertaining it.  Daemon construction
is a lot of what I do, and, more broadly, I like to
think about useful OS service abstractions.  I'll
be back as soon as I have something to contribute.



From effbot at telia.com  Mon May 22 17:13:58 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 17:13:58 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12ttiI-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <00b801bfc400$5c0d8fe0$34aab5d4@hagrid>

Peter Funk wrote:
> Conclusion:  If the 'locale' module would ever become deprecated
> then ...ummm...  we poor mortals will simply have to add a line
> 'import readline' to our Python programs.  Nifty... ;-)

won't help if python is changed to use the *unicode*
ctype functions...

...but on the other hand, if you use unicode strings for
anything that is not plain ASCII, upper and friends will
do the right thing even if you forget to import readline.
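
for example (nothing assumed here beyond the unicode type itself):

    # u"\xe4" is a-umlaut; unicode .upper() consults the Unicode character
    # database, not the C library's locale tables, so no setlocale() needed.
    print(u"\xe4".upper() == u"\xc4")    # true, whatever LANG/LC_CTYPE say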

</F>




From effbot at telia.com  Mon May 22 17:37:01 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 17:37:01 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>             <008001bfc3be$7e5eae40$34aab5d4@hagrid>  <200005221616.JAA07234@cj20424-a.reston1.va.home.com>
Message-ID: <00e301bfc403$abac3940$34aab5d4@hagrid>

Guido van Rossum <guido at python.org> wrote:
> > Peter Funk wrote:
> > > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> > 
> > you're missing the point -- now that we've added unicode support to
> > Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> > platforms implement a wctype interface, it's not widely available, and it's
> > not always unicode.
> 
> Huh?  We were talking strictly 8-bit strings here.  The locale support
> hasn't changed there.

I meant that the locale support, even though it's part of POSIX, isn't
good enough for unicode support...

> > so in order to provide platform-independent unicode support, Python 1.6
> > comes with unicode-aware and fully portable replacements for the ctype
> > functions.
> 
> For those who only need Latin-1 or another 8-bit ASCII superset, the
> Unicode stuff is overkill.

why?

besides, overkill or not:

> > the code is already in there...

> > note that this leaves us with four string flavours in 1.6:
> > 
> > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> >   encoding.  upper, strip, etc should not be used.
> 
> These are not strings.

depends on who you're asking, of course:

>>> b = fetch_binary_goop()
>>> type(b)
<type 'string'>
>>> dir(b)
['capitalize', 'center', 'count', 'endswith', 'expandtabs', ...

> > - 8-bit text strings using the system encoding.  upper, strip, etc works
> >   as long as the locale is properly configured.
> > 
> > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> >   system encoding is a subset of unicode -- which means US ASCII or
> >   ISO Latin 1.
> 
> This is a figment of your imagination.  You can use 8-bit text strings
> to contain Latin-1, but you have to set your locale to match.

if that's a supported feature (instead of being deprecated in favour
of unicode), maybe we should base the default unicode/string con-
versions on the locale too?

background:

until now, I've been convinced that the goal should be to have two
"string-like" types: binary arrays for binary goop (including encoded
text), and a Unicode-based string type for text.  afaik, that's the
solution used in Tcl and Perl, and it's also "conceptually compatible"
with things like Java, Windows NT, and XML (and everything else from
the web universe).

given that, it has been clear to me that anything that is not compatible
with this model should be removed as soon as possible (and deprecated
as soon as we understand why it won't fly under the new scheme).

but if backwards compatibility is more important than a minimalistic
design, maybe we need three different "string-like" types:

-- binary arrays (still implemented by the 8-bit string type in 1.6)

-- 8-bit old-style strings (using the "system encoding", as defined
   by the locale.  if the locale is not set, they're assumed to contain
   ASCII)

-- unicode strings (possibly using a "polymorphic" internal representation)

this also solves the default conversion problem: use the locale environ-
ment variables to determine the default encoding, and call
sys.set_string_encoding from site.py (see my earlier post for details).
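
a rough sketch of what site.py could do under that scheme (nl_langinfo
is Unix-only, and the set_string_encoding call is the hook mentioned
above, left commented out here):

    import locale, sys

    locale.setlocale(locale.LC_ALL, "")               # pick up LANG/LC_*
    encoding = locale.nl_langinfo(locale.CODESET)     # e.g. "ISO-8859-1"

    # sys.set_string_encoding(encoding)   # proposed default-conversion hook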

what have I missed this time?

</F>

PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

>>> sys
... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...

looks a little strange...




From gmcm at hypernet.com  Mon May 22 18:08:07 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Mon, 22 May 2000 12:08:07 -0400
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <1253110775-103205454@hypernet.com>

Alexandre Ferrieux wrote:

> The Tcl solution to (1.), which is the only real issue, is to
> have a separate thread blockingly read 1 byte from the pipe, and
> then post a message back to the main thread to awaken it (yes,
> ugly code to handle that extra byte and integrate it with the
> buffering scheme).

What's the actual mechanism here? A (dummy) socket so 
"select" works? The WSAEvent... stuff (to associate sockets 
with waitable events) and WaitForMultiple...? The 
WSAAsync... stuff (creates Windows msgs when socket stuff 
happens) with MsgWait...? Some other combination?

Is the mechanism different if it's a console app (vs GUI)?

I'd assume in a GUI, the fileevent-checker gets integrated with 
the message pump. In a console app, how does it get control?

 
> In summary, why not peruse Tcl's hard-won experience on
> selecting-on-windoze-pipes ?
> 
> Then, for the API exposed to the Python programmer, the Tclly
> exposed one is a starter:
> 
>  fileevent $channel readable|writable callback
>  ...
>  vwait breaker_variable
> 
> Explanation for non-Tclers: fileevent hooks the callback, vwait
> does a loop of select(). The callback(s) is(are) called without
> breaking the loop, unless $breaker_variable is set, at which time
> vwait returns.
> 
> One note about 'breaker_variable': I'm not sure I like it. I'd
> prefer something based on exceptions. I don't quite understand
> why it's not already this way in Tcl (which has (kindof)
> first-class exceptions), but let's not repeat the mistake: let's
> suggest that (the equivalent of) vwait loops forever, only to be
> broken out by an exception from within one of the callbacks.
> 
> HTH,
> 
> -Alex
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev



- Gordon



From ping at lfw.org  Mon May 22 18:29:54 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:29:54 -0700 (PDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
 multiplexing is too hard)
In-Reply-To: <200005221519.KAA45253@starbase.neosoft.com>
Message-ID: <Pine.LNX.4.10.10005220923360.461-100000@localhost>

On Mon, 22 May 2000, Cameron Laird wrote:
> 
> Tcl's event model has been more successful than
> any of you probably realize.  You deserve to know
> that.

Events are a very powerful concurrency model (arguably
more reliable because they are easier to understand
than threads).  My friend Mark Miller has designed a
language called E (http://www.erights.org/) that uses
an event model for all object messaging, and i would
be interested in exploring how we can apply those ideas
to improve Python.

> Should Python have an event model?  I'm not con-
> vinced.

Indeed.  This would be a huge core change, way too
large to be feasible.  But i do think it would be
excellent to simply provide more facilities for
helping people use whatever model they want, and
given the toolkit we let people build great things.

What you described sounded like it could be implemented
fairly easily with some functions like

    register(handle, mode, callback)
        or file.register(mode, callback)

        Put 'callback' in a dictionary of files
        to be watched for mode 'mode'.

    mainloop(timeout)

        Repeat (forever or until 'timeout') a
        'select' on all the files that have been
        registered, and do calls to the callbacks
        that have been registered.

Presumably there would be some exception that a
callback could raise to quietly exit the 'select'
loop.
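
To make the idea concrete, here is a rough sketch of how such a
hypothetical register()/mainloop() pair might be used -- none of these
names exist in any released module, and the callback signature is only
a guess; they just mirror the description above:

    import sys

    class ExitLoop(Exception):
        """Hypothetical exception a callback raises to leave mainloop()."""
        pass

    def echo_line(handle, mode):
        line = handle.readline()
        if not line:            # EOF -- time to leave the loop
            raise ExitLoop
        sys.stdout.write(line)

    # register(sys.stdin, "r", echo_line)   # watch stdin for readability
    # mainloop()                            # run until ExitLoop is raised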

    1. How does Tcl handle exiting the loop?
       Is there a way for a callback to break
       out of the vwait?

    2. How do you unregister these callbacks in Tcl?

    


-- ?!ng




From ping at lfw.org  Mon May 22 18:23:23 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:23:23 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <200005221309.IAA41866@starbase.neosoft.com>
Message-ID: <Pine.LNX.4.10.10005220917350.461-100000@localhost>

On Mon, 22 May 2000, Cameron Laird wrote:
> I've got a whole list of "higher-level
> abstractions around OS stuff" that I've been
> collecting.  Maybe I'll make it fit for
> others to see once we're through this affair

Absolutely!  I've thought about this too.  A nice "child process
management" module would be very convenient to have -- i've done
such stuff before -- though i don't know enough about Windows
semantics to make one that works on multiple platforms.  Some
sort of (hypothetical)

    delegate.spawn(function) - return a child object or id
    delegate.kill(id) - kill child

etc. could possibly free us from some of the system dependencies
of fork, signal, etc.

I currently have a module called "delegate" which can run a
function in a child process for you.  It uses pickle() to send
the return value of the function back to the parent (via an
unnamed pipe).  Again, Unix-specific -- but it would be very
cool if we could provide this functionality in a module.  My
module provides just two things, but it's already very useful:

    delegate.timeout(function, timeout) - run the 'function' in
        a child process; if the function doesn't finish in
        'timeout' seconds, kill it and raise an exception;
        otherwise, return the return value of the function

    delegate.parallelize(function, [work, work, work...]) -
        fork off many children (you can specify how many if
        you want) and set each one to work calling the 'function'
        with one of the 'work' items, queueing up work for
        each of the children until all the work gets done.
        Return the results in a dictionary mapping each 'work'
        item to its result.
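
For illustration, a minimal Unix-only sketch of the timeout idea (this is
*not* the actual 'delegate' module, which isn't shown here; the function
name and the pipe-plus-select details are assumptions based on the
description above):

    import os, pickle, select, signal

    class Timeout(Exception):
        """Raised when the child does not finish in time."""
        pass

    def run_with_timeout(function, timeout, args=()):
        """Run apply(function, args) in a forked child; give up after
        'timeout' seconds."""
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                          # child: compute, pickle, exit
            os.close(r)
            os.write(w, pickle.dumps(apply(function, args)))
            os._exit(0)
        os.close(w)
        try:
            ready = select.select([r], [], [], timeout)[0]
            if not ready:                     # child took too long -- kill it
                os.kill(pid, signal.SIGKILL)
                raise Timeout("no result after %s seconds" % timeout)
            data = os.read(r, 1 << 20)        # sketch: assumes one read suffices
        finally:
            os.close(r)
            os.waitpid(pid, 0)                # reap the child either way
        return pickle.loads(data)

Something along the same lines, with one pipe per child and a select()
loop handing out work items, would presumably cover the parallelize()
case as well.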


-- ?!ng




From ping at lfw.org  Mon May 22 18:17:01 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:17:01 -0700 (PDT)
Subject: Some information about locale (was Re: [Python-Dev] repr vs.
 str and locales again)
In-Reply-To: <200005221616.JAA07234@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005220914000.461-100000@localhost>

On Mon, 22 May 2000, Guido van Rossum wrote:
> > note that this leaves us with four string flavours in 1.6:
> > 
> > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> >   encoding.  upper, strip, etc should not be used.
> 
> These are not strings.

Indeed -- but at the moment, we're letting people continue to
use strings this way, since they already do it.

> > - 8-bit text strings using the system encoding.  upper, strip, etc works
> >   as long as the locale is properly configured.
> > 
> > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> >   system encoding is a subset of unicode -- which means US ASCII or
> >   ISO Latin 1.
> 
> This is a figment of your imagination.  You can use 8-bit text strings
> to contain Latin-1, but you have to set your locale to match.

I would like it to be only the latter, as Fred, i, and others
have previously suggested, and as corresponds to your ASCII
proposal for treatment of 8-bit strings.

But doesn't the current locale-dependent behaviour of upper()
etc. mean that strings are getting interpreted in the first way?

> > is this complexity really worth it?
> 
> From a backwards compatibility point of view, yes.  Basically,
> programs that don't use Unicode should see no change in semantics.

I'm afraid i have to agree with this, because i don't see any
other option that lets us escape from any of these four ways
of using strings...


-- ?!ng




From fdrake at acm.org  Mon May 22 19:05:46 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 10:05:46 -0700 (PDT)
Subject: Some information about locale (was Re: [Python-Dev] repr vs.
 str and locales again)
In-Reply-To: <Pine.LNX.4.10.10005220914000.461-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005221004220.14844-100000@mailhost.beopen.com>

On Mon, 22 May 2000, Ka-Ping Yee wrote:
 > I would like it to be only the latter, as Fred, i, and others

  Please refer to Fredrik as Fredrik or /F; I don't think anyone else
refers to him as "Fred", and I got really confused when I saw this!  ;)


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From pf at artcom-gmbh.de  Mon May 22 19:17:40 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 19:17:40 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <00e301bfc403$abac3940$34aab5d4@hagrid> from Fredrik Lundh at "May 22, 2000  5:37: 1 pm"
Message-ID: <m12tvpk-000DieC@artcom0.artcom-gmbh.de>

Hi!

Fredrik Lundh:
[...]
> > > so in order to provide platform-independent unicode support, Python 1.6
> > > comes with unicode-aware and fully portable replacements for the ctype
> > > functions.
> > 
> > For those who only need Latin-1 or another 8-bit ASCII superset, the
> > Unicode stuff is overkill.
> 
> why?

Going from 8 bit strings to 16 bit strings doubles the memory 
requirements, right?

As long as we only deal with English, Spanish, French, Swedish, Italian
and several other languages, 8 bit strings work out pretty well.  
Unicode will be neat if you can afford the additional space.
People using Python on small computers in western countries
probably don't want to double the size of their data structures
for no reasonable benefit.

> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> if that's a supported feature (instead of being deprecated in favour
> for unicode), maybe we should base the default unicode/string con-
> versions on the locale too?

Many locales effectively use Latin1 but for some other locales there
is a difference:

$ LANG="es_ES" python  # Espan?l uses Latin-1, the same as "de_DE"
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string; print string.upper("???")
???

$ LANG="ru_RU" python  # This uses ISO 8859-5 
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string; print string.upper("???")
???

I don't know how many people, for example in Russia, already depend
on this behaviour.  I suggest it should stay as is.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From guido at python.org  Mon May 22 22:38:17 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 15:38:17 -0500
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 09:17:01 MST."
             <Pine.LNX.4.10.10005220914000.461-100000@localhost> 
References: <Pine.LNX.4.10.10005220914000.461-100000@localhost> 
Message-ID: <200005222038.PAA01284@cj20424-a.reston1.va.home.com>

[Fredrik]
> > > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> > >   encoding.  upper, strip, etc should not be used.

[Guido]
> > These are not strings.

[Ping]
> Indeed -- but at the moment, we're letting people continue to
> use strings this way, since they already do it.

Oops, mistake.  I thought that Fredrik (not Fred! that's another
person in this context!) meant the array module, but upon re-reading
he didn't.

> > > - 8-bit text strings using the system encoding.  upper, strip, etc works
> > >   as long as the locale is properly configured.
> > > 
> > > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> > >   system encoding is a subset of unicode -- which means US ASCII or
> > >   ISO Latin 1.
> > 
> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> I would like it to be only the latter, as Fred, i, and others
Fredrik, right?
> have previously suggested, and as corresponds to your ASCII
> proposal for treatment of 8-bit strings.
> 
> But doesn't the current locale-dependent behaviour of upper()
> etc. mean that strings are getting interpreted in the first way?

That's what I meant to say -- 8-bit strings use the system encoding
guided by the locale.

> > > is this complexity really worth it?
> > 
> > From a backwards compatibility point of view, yes.  Basically,
> > programs that don't use Unicode should see no change in semantics.
> 
> I'm afraid i have to agree with this, because i don't see any
> other option that lets us escape from any of these four ways
> of using strings...

Which is why I find Fredrik's attitude unproductive.

And where's the SRE release?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Mon May 22 22:53:55 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 22 May 2000 22:53:55 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and 
 locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>             <008001bfc3be$7e5eae40$34aab5d4@hagrid>  <200005221616.JAA07234@cj20424-a.reston1.va.home.com> <00e301bfc403$abac3940$34aab5d4@hagrid>
Message-ID: <39299E63.CD996D7D@lemburg.com>

Fredrik Lundh wrote:
> 
> > > - 8-bit text strings using the system encoding.  upper, strip, etc works
> > >   as long as the locale is properly configured.
> > >
> > > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> > >   system encoding is a subset of unicode -- which means US ASCII or
> > >   ISO Latin 1.
> >
> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> if that's a supported feature (instead of being deprecated in favour
> for unicode), maybe we should base the default unicode/string con-
> versions on the locale too?

This was proposed by Guido some time ago... the discussion
ended with the problem of extracting the encoding definition
from the locale names. There are some ways to solve this
problem (static mappings, fancy LANG variables etc.), but
AFAIK, there is no widely used standard on this yet, so
in the end you're stuck with defining the encoding by hand...
e.g.
	setenv LANG de_DE:latin-1

Perhaps we should help out a little and provide Python with
a parser for the LANG variable with some added magic
to provide useful defaults ?!

> [...]
> 
> this also solves the default conversion problem: use the locale environ-
> ment variables to determine the default encoding, and call
> sys.set_string_encoding from site.py (see my earlier post for details).

Right, that would indeed open up a path for consent...

> </F>
> 
> PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

Perhaps... these were really only added as experimental feature
to test the various possibilities (and a possible implementation).

My original intention was removing these after final consent
-- perhaps we should keep the functionality (expanded
to a per thread setting; the global is a temporary hack) ?!
 
> >>> sys
> ... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...
> 
> looks a little strange...

True; see above for the reason why ;-)

PS: What do you think about the current internal design of
sys.set_string_encoding() ? Note that hash() and the "st"
parser markers still use UTF-8.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From tim_one at email.msn.com  Tue May 23 04:21:00 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 22 May 2000 22:21:00 -0400
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: <m12tokw-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <000501bfc45d$893ad4c0$9ea2143f@tim>

[Peter Funk]
> On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a
> working example which exploits this vulnerability in older versions
> of GCC.
>
> The basic idea is indeed very simple:  Since the /tmp directory is
> writable for any user, the bad guy can create a symbolic link in /tmp
> pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
> program will then overwrite this arbitrary file (where the programmer
> really wanted to write something to his tempfile instead).  Since this
> will happen with the access permissions of the process running this
> program, this opens a bunch of vulnerabilities in many programs
> writing something into temporary files with predictable file names.

I can understand all that, but does it have anything to do with Python's
tempfile module?  gcc wasn't fixed by changing glibc, right?  Playing games
with the file *names* doesn't appear to me to solve anything; the few posts
I bumped into where that was somehow viewed as a Good Thing were about
Solaris systems, where Sun kept the source for generating the "new,
improved, messy" names secret.  In Python, any attacker can read the code
for anything we do, which makes it much clearer that a name-game approach
is half-assed.

and-people-whine-about-worming-around-bad-decisions-in-
    windows<wink>-ly y'rs  - tim





From tim_one at email.msn.com  Tue May 23 07:15:46 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 23 May 2000 01:15:46 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <39294019.3CB47800@tismer.com>
Message-ID: <000401bfc475$f3a110a0$612d153f@tim>

[Christian Tismer]
> There was a smiley, but for the most since I cannot
> decide what I want. I'm quite convinced that strings should
> better not be sequences, at least not sequences of strings.
>
> "abc"[0:1] would be enough, "abc"[0] isn't worth the side effects,
> as listed in Tim's posting.

Oh, it's worth a lot more than those!  As Ping testified, the gotchas I
listed really don't catch many people, while string[index] is about as
common as integer+1.

The need for tuples specifically in "format % values" can be wormed around
by special-casing the snot out of a string in the "values" position.

The non-termination of repeated "string = string[0]" *could* be stopped by
introducing a distinct character type.  Trying to formalize the current type
of a string is messy ("string = sequence of string" is a bit paradoxical
<wink>).  The notion that a string is a sequence of characters instead is
vanilla and wholly natural.  OTOH, drawing that distinction at the type
level may well be more trouble in practice than it buys in theory!

So I don't know what I want either -- but I don't want *much* <wink>.

first-do-no-harm-ly y'rs  - tim





From moshez at math.huji.ac.il  Tue May 23 07:27:12 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Tue, 23 May 2000 08:27:12 +0300 (IDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
 multiplexing is too hard)
In-Reply-To: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.GSO.4.10.10005230824130.12103-100000@sundial>

On Mon, 22 May 2000, Guido van Rossum wrote:

> Can we cut the name-calling?

Hey, what's life without a MS bashing now and then <wink>?

> Vwait seems to be part of the Tcl event model.  Maybe we would need to
> think about an event model for Python?  On the other hand, Python is
> at the mercy of the event model of whatever GUI package it is using --
> which could be Tk, or wxWindows, or Gtk, or native Windows, or native
> MacOS, or any of a number of other event models.

But that's sort of the point: Python needs a non-GUI event model, to 
use with daemons which need to handle many files. Every GUI package
would have its own event model, and Python will have one event model
that's not tied to a GUI package. 

that-only-proves-we-have-a-problem-ly y'rs, Z.
--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 09:16:50 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:16:50 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com>
Message-ID: <392A3062.E1@cnet.francetelecom.fr>

Gordon McMillan wrote:
> 
> Alexandre Ferrieux wrote:
> 
> > The Tcl solution to (1.), which is the only real issue, is to
> > have a separate thread blockingly read 1 byte from the pipe, and
> > then post a message back to the main thread to awaken it (yes,
> > ugly code to handle that extra byte and integrate it with the
> > buffering scheme).
> 
> What's the actual mechanism here? A (dummy) socket so
> "select" works? The WSAEvent... stuff (to associate sockets
> with waitable events) and WaitForMultiple...? The
> WSAAsync... stuff (creates Windows msgs when socket stuff
> happens) with MsgWait...? Some other combination?

Other. Forget about sockets here, we're talking about true anonymous
pipes, under 95 and NT. Since they are not waitable nor peekable,
the only remaining option is to read in blocking mode from a dedicated
thread. Then of course, this thread reports back to the main
MsgWaitForMultiple with PostThreadMessage.
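
For the curious, a bare-bones sketch of that dedicated-reader-thread
trick, assuming the Python win32 extensions (win32file/win32event); for
brevity it signals the main thread with an auto-reset event rather than
PostThreadMessage, and every name here is illustrative, not Tcl's actual
code:

    import threading
    import pywintypes, win32con, win32event, win32file

    def pipe_reader(pipe_handle, event, chunks):
        """Blockingly read an anonymous pipe; signal the main thread per chunk."""
        try:
            while 1:
                rc, data = win32file.ReadFile(pipe_handle, 1)  # blocks for 1 byte
                chunks.append(data)
                win32event.SetEvent(event)     # wake up the main thread's wait
        except pywintypes.error:
            pass                               # pipe closed; let the thread die

    # Main thread (sketch): wait on window messages *and* the pipe at once.
    #   event = win32event.CreateEvent(None, 0, 0, None)
    #   chunks = []
    #   threading.Thread(target=pipe_reader,
    #                    args=(pipe_handle, event, chunks)).start()
    #   win32event.MsgWaitForMultipleObjects([event], 0, win32event.INFINITE,
    #                                        win32con.QS_ALLINPUT)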

> Is the mechanism different if it's a console app (vs GUI)?

No. Why should it ?

> I'd assume in a GUI, the fileevent-checker gets integrated with
> the message pump.

The converse: MsgWaitForMultiple integrates the thread's message queue
which is a superset of the GUI's event stream.

-Alex



From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 09:36:35 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:36:35 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python  multiplexing is too hard)
References: <Pine.LNX.4.10.10005220923360.461-100000@localhost>
Message-ID: <392A3503.7C72@cnet.francetelecom.fr>

Ka-Ping Yee wrote:
> 
> > Should Python have an event model?  I'm not con-
> > vinced.
> 
> Indeed.  This would be a huge core change, way too
> large to be feasible. 

Warning here. What would indeed need a huge core change
is a pervasive use of events like in E. 'Having an event model'
is often interpreted in a less extreme way, simply meaning
'having the proper set of primitives at hand'.
Our discussion (and your comments below too, agreed !) was focussed
on the latter, so we're only talking about a pure library issue.
Asking any change in the Python lang itself for such a peripheral need
never even remotely crossed my mind !

> But i do think it would be
> excellent to simply provide more facilities for
> helping people use whatever model they want, and
> given the toolkit we let people build great things.

Right.

> What you described sounded like it could be implemented
> fairly easily with some functions like
> 
>     register(handle, mode, callback)
>         or file.register(mode, callback)
> 
>         Put 'callback' in a dictionary of files
>         to be watched for mode 'mode'.
> 
>     mainloop(timeout)
> 
>         Repeat (forever or until 'timeout') a
>         'select' on all the files that have been
>         registered, and do calls to the callbacks
>         that have been registered.
> 
> Presumably there would be some exception that a
> callback could raise to quietly exit the 'select'
> loop.

Great !!! That's exactly the kind of Pythonic translation I was
expecting. Thanks !

>     1. How does Tcl handle exiting the loop?
>        Is there a way for a callback to break
>        out of the vwait?

Yes, as explained before, in Tcl the loop-breaker is a write sentinel
on a variable. When a callback wants to break out, it simply sets the
var. But as also mentioned before, I'd prefer an exception-based
mechanism as you summarized.

>     2. How do you unregister these callbacks in Tcl?

We just register an empty string as the callback name (script).
But this is just a random API choice. Anything more Pythonic is welcome
(an explicit unregister function is okay for me).

-Alex



From nhodgson at bigpond.net.au  Tue May 23 09:47:14 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 23 May 2000 17:47:14 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <047c01bfc48b$1d2addb0$e3cb8490@neil>

> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,
> the only remaining option is to read in blocking mode from a dedicated
> thread. ...

   Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.

   Neil





From effbot at telia.com  Tue May 23 09:50:56 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 09:50:56 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <00d901bfc48b$a40432a0$f2a6b5d4@hagrid>

Alexandre Ferrieux wrote:
> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,

I thought PeekNamedPipe worked just fine on anonymous pipes.

or are "true anonymous pipes" not the same thing as anonymous
pipes created by CreatePipe?

</F>




From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 09:51:07 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:51:07 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr> <047c01bfc48b$1d2addb0$e3cb8490@neil>
Message-ID: <392A386B.4112@cnet.francetelecom.fr>

Neil Hodgson wrote:
> 
> > Other. Forget about sockets here, we're talking about true anonymous
> > pipes, under 95 and NT. Since they are not waitable nor peekable,
> > the only remaining option is to read in blocking mode from a dedicated
> > thread. ...
> 
>    Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.

Hmmm... You're right, it's documented as such. But I seem to recall we
encountered a problem when actually using it. I'll check with Gordon
Chaffee (Cc of this msg).

-Alex



From mhammond at skippinet.com.au  Tue May 23 09:57:18 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 17:57:18 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBMEFOCLAA.mhammond@skippinet.com.au>

> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,
> the only remaining option is to read in blocking mode from a dedicated
> thread. Then of course, this thread reports back to the main
> MsgWaitForMultiple with PostThreadMessage.

Or maybe just with SetEvent(), as the main thread may just be using
WaitForMultipleObjects() - it really depends on whether the app has a
message loop or not.

>
> > Is the mechanism different if it's a console app (vs GUI)?
>
> No. Why should it ?

Because it generally won't have a message loop.  This is also commonly true
for NT services - they only wait on settable objects and if they don't
create a window they generally don't need a message loop.  However, it is
precisely these apps that the proposal offers the most benefits to.

> > I'd assume in a GUI, the fileevent-checker gets integrated with
> > the message pump.
>
> The converse: MsgWaitForMultiple integrates the thread's message queue
> which is a superset of the GUI's event stream.

But what happens when we don't own the message loop?  Eg, IDLE is based on
Tk, Pythonwin on MFC, wxPython on wxWindows, and so on.  Generally, the
primary message loops are coded in C/C++, and won't provide this level of
customization.

Ironically, Tk seems to be one of the worst for this.  For example, Guido
and I recently(ish) both added threading support to our respective IDEs.
MFC was quite simple to do, as it used a "standard" windows message loop.

From nhodgson at bigpond.net.au  Tue May 23 10:04:03 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 23 May 2000 18:04:03 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr> <047c01bfc48b$1d2addb0$e3cb8490@neil> <392A386B.4112@cnet.francetelecom.fr>
Message-ID: <04b001bfc48d$76dbeaa0$e3cb8490@neil>

> >    Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.
>
> Hmmm... You're right, it's documented as such. But I seem to recall we
> encountered a problem when actually using it. I'll check with Gordon
> Chaffee (Cc of this msg).

   I can vouch that this does work on 95, NT and W2K as I have been using it
in my SciTE editor for the past year as the means for gathering output from
running tool programs. There was a fiddle required to ensure all output was
retrieved on 95 but it works well with that implemented.

   Neil





From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 10:11:53 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 10:11:53 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBMEFOCLAA.mhammond@skippinet.com.au>
Message-ID: <392A3D49.5129@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> > Other. Forget about sockets here, we're talking about true anonymous
> > pipes, under 95 and NT. Since they are not waitable nor peekable,
> > the only remaining option is to read in blocking mode from a dedicated
> > thread. Then of course, this thread reports back to the main
> > MsgWaitForMultiple with PostThreadMessage.
> 
> Or maybe just with SetEvent(), as the main thread may just be using
> WaitForMultipleObjects() - it really depends on whether the app has a
> message loop or not.

Yes but why emphasize the differences when you can instead wipe them out
by using MsgWaitForMultiple which integrates all sources ? Even if there's
no message stream, it's fine !

> > > Is the mechanism different if it's a console app (vs GUI)?
> >
> > No. Why should it ?
> 
> Because it generally wont have a message loop.  This is also commonly true
> for NT services - they only wait on settable objects and if they dont
> create a window generally dont need a message loop.   However, it is
> precisely these apps that the proposal offers the most benefits to.

Yes, but see above: how would it hurt them to call MsgWait* instead of
Wait* ?

> > > I'd assume in a GUI, the fileevent-checker gets integrated with
> > > the message pump.
> >
> > The converse: MsgWaitForMultiple integrates the thread's message queue
> > which is a superset of the GUI's event stream.
> 
> But what happens when we dont own the message loop?  Eg, IDLE is based on
> Tk, Pythonwin on MFC, wxPython on wxWindows, and so on.  Generally, the
> primary message loops are coded in C/C++, and wont provide this level of
> customization.

Can you be more precise ? Which one(s) do(es)/n't fulfill the two
conditions mentioned earlier ? I do agree with the fact that the primary
msg loop of a random GUI package is a black box, however it must use one
of the IPC mechanisms provided by the OS. Unifying them is not uniformly
trivial (that's the point of this discussion), but since even on Windows
it is doable (MsgWait*), I fail to see by what magic a GUI package could
bypass its supervision.

> Ironically, Tk seems to be one of the worst for this.

Possibly. Personally I don't like Tk very much, at least from an
implementation standpoint. But precisely, the fact that the model
described so far can accommodate *even* Tk is a proof of generality !

> and I recently(ish) both added threading support to our respective IDEs.
> MFC was quite simple to do, as it used a "standard" windows message loop.
> From all accounts, Guido had quite a difficult time due to some of the
> assumptions made in the message loop.  The other anecdote I have relates to
> debugging.  The Pythonwin debugger is able to live happily under most other
> GUI applications - eg, those written in VB, Delphi, etc.  Pythonwin creates
> a new "standard" message loop under these apps, and generally things work
> well.  However, Tkinter based apps remain un-debuggable using Pythonwin due
> to the assumptions made by the message loop.  This is probably my most
> oft-requested feature addition!!

As you said, all this is due to the assumptions made in Tk. Clearly a
mistake not to repeat, and also orthogonal to the issue of unifying IPC
mechanisms and the API to their multiplexing.

-Alex



From mhammond at skippinet.com.au  Tue May 23 10:24:39 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 18:24:39 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A3D49.5129@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEGACLAA.mhammond@skippinet.com.au>

> Yes but why emphasize the differences when you can instead wipe them out
> by using MsgWaitForMultiple which integrates all sources ? Even if
> there's
> no message stream, it's fine !

Agreed - as I said, it is with these apps that I think it has the most
chance of success.


> Can you be more precise ? Which one(s) do(es)/n't fulfill the two
> conditions mentioned earlier ? I do agree with the fact that the primary
> msg loop of a random GUI package is a black box, however it must use one
> of the IPC mechanisms provided by the OS. Unifying them is not uniformly
> trivial (that's the point of this discussion), but since even on Windows
> it is doable (MsgWait*), I fail to see by what magic a GUI package could
> bypass its supervision.

The only way I could see this working would be to use real, actual Windows
messages on Windows.  Python would need to nominate a special message that
it knows will not conflict with any GUI environments Python may need to run
in.

Each GUI package maintainer would then need to add some special logic in
their message hooking code.  When their black-box message loop delivers
this special message, the framework would need to enter the Python
"event-loop", where it does its stuff - until a new message arrives. It
would need to return, unwind back to the original message pump where it
will be processed as normal, and the entire process repeats.  The process
of waking other objects needn't be GUI toolkit dependent - as you said, it
need only place the well-known message in the thread's message queue using
PostThreadMessage().

Unless I'm missing something?

Mark.




From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 10:38:06 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 10:38:06 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBGEGACLAA.mhammond@skippinet.com.au>
Message-ID: <392A436E.4027@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> > Can you be more precise ? Which one(s) do(es)/n't fulfill the two
> > conditions mentioned earlier ? I do agree with the fact that the primary
> > msg loop of a random GUI package is a black box, however it must use one
> > of the IPC mechanisms provided by the OS. Unifying them is not uniformly
> > trivial (that's the point of this discussion), but since even on Windows
> > it is doable (MsgWait*), I fail to see by what magic a GUI package could
> > bypass its supervision.
> 
> The only way I could see this working would be to use real, actual Windows
> messages on Windows.  Python would need to nominate a special message that
> it knows will not conflict with any GUI environments Python may need to run
> in.

Why use a special message ? MsgWait* does multiplex true Windows messages
*and* other IPC mechanisms. So if a package uses messages, it will
awaken MsgWait* by its 'message queue' side, while if the package uses a
socket or a pipe, it will awaken it by its 'waitable handle' side
(provided, of course, that you can get your hands on that handle and
pass it in the list of objects to wait for...).

> Each GUI package maintainer would then need to add some special logic in
> their message hooking code.  When their black-box message loop delivers
> this special message, the framework would need to enter the Python
> "event-loop", where it does its stuff - until a new message arrives.

The key is that there wouldn't be two separate Python/GUI event loops.
That's the reason for the (a) condition: be able to awaken a
multiplexing syscall.

> Unless Im missing something?

I believe the next thing to do is to enumerate which GUI packages
fulfill the following conditions ((a) updated to (a') to reflect the
first paragraph of this msg):

	(a') Its internal event source is either the vanilla Windows Message
queue, or an IPC channel which can be exposed to the outer framework
(for enlisting in a select()-like call), like the socket of an X
connection.

	(b) Its queue can be Peek'ed (to check for buffered msgs before
blocking again)

HTH,

-Alex



From pf at artcom-gmbh.de  Tue May 23 10:39:11 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 10:39:11 +0200 (MEST)
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: <000501bfc45d$893ad4c0$9ea2143f@tim> from Tim Peters at "May 22, 2000 10:21: 0 pm"
Message-ID: <m12uADY-000DieC@artcom0.artcom-gmbh.de>

> [Peter Funk(me)]
> > On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a
> > working example which exploits this vulnerability in older versions
> > of GCC.
> >
> > The basic idea is indeed very simple:  Since the /tmp directory is
> > writable for any user, the bad guy can create a symbolic link in /tmp
> > pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
> > program will then overwrite this arbitrary file (where the programmer
> > really wanted to write something to his tempfile instead).  Since this
> > will happen with the access permissions of the process running this
> > program, this opens a bunch of vulnerabilities in many programs
> > writing something into temporary files with predictable file names.
 
[Tim Peters]:
> I can understand all that, but does it have anything to do with Python's
> tempfile module?  gcc wasn't fixed by changing glibc, right?  

Okay.  But people seem to have the opinion that "application programmers"
are dumb and "system programmers" are clever and smart. ;-)  So they seem
to think that the library should solve possible security issues.
I don't share this opinion, but if a problem can be solved once
and for all in a library, that is better than having to solve it over and
over again in each application.

Concerning 'tempfile', this would either involve changing (or extending)
the interface (IMO a better approach to this class of problems) or, if the
goal is to solve this for existing applications already using 'tempfile', to
play games with the filenames returned from 'mktemp()'.  This would require
making them truly random... which AFAIK can't be achieved with
traditional coding techniques and would require access to a secure white
noise generator.  But maybe I'm wrong.

> Playing games
> with the file *names* doesn't appear to me to solve anything; the few posts
> I bumped into where that was somehow viewed as a Good Thing were about
> Solaris systems, where Sun kept the source for generating the "new,
> improved, messy" names secret.  In Python, any attacker can read the code
> for anything we do, which it makes it much clearer that a name-game approach
> is half-assed.

I agree.  But I think we should at least extend the documentation
of 'tempfile' (Fred?) to guide people not to write Python code like
	mytemp = open(tempfile.mktemp(), "w")
in programs that are intended to be used on Unix systems by arbitrary
users (possibly 'root').  Even better:  Someone with enough spare time 
should add a new function 'mktempfile()', which creates a temporary 
file and takes care of the security issue and then returns the file 
handle.  This implementation must take care of race conditions using
'os.open' with the following flags:

       O_CREAT If the file does not exist it will be created.
       O_EXCL  When used with O_CREAT, if the file already  exist
	       it is  an error and the open will fail. 
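
A minimal sketch of such a function (the name 'mktempfile' and the naive
name-generation scheme are just placeholders; the point is the
O_CREAT|O_EXCL open):

    import os

    def mktempfile(dir="/tmp", prefix="tmp", tries=100):
        """Create and open a new temporary file safely; return (file, path).

        O_CREAT|O_EXCL guarantees we never follow an attacker's symlink or
        reuse an existing file; the name scheme here is deliberately dumb.
        """
        for i in range(tries):
            path = os.path.join(dir, "%s%d-%d" % (prefix, os.getpid(), i))
            try:
                fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0600)
            except os.error:
                continue                  # name already taken; try the next
            return os.fdopen(fd, "w"), path
        raise IOError("could not create a temporary file in " + dir)

Better randomness in the name would still be nice, but with O_EXCL the
worst an attacker can force is an open failure, not an overwrite.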

> and-people-whine-about-worming-around-bad-decisions-in-
>     windows<wink>-ly y'rs  - tim

I don't whine.  But currently I've more problems with my GUI app using
Tkinter&Pmw on the Mac <wink>.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60
Whoever thinks himself too important for small tasks
is usually too small for important tasks.     --      Jacques Tati



From mhammond at skippinet.com.au  Tue May 23 11:00:54 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 19:00:54 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A436E.4027@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEGBCLAA.mhammond@skippinet.com.au>

> Why use a special message ? MsgWait* does multiplex true Windows Message
> *and* other IPC mechanisms.

But the point was that Python programs need to live inside other GUI
environments, and that these GUI environments provide their own message
loops that we must consider a black box.

So, we cannot change the existing message loop to use MsgWait*().  We cannot
replace their message loop with one of our own that _does_ do this, as
their message loop is likely to have its own special requirements (eg,
MFC's has idle-time processing, etc).

So I can't see a way out of this bind, other than to come up with a way to
live _in_ a 3rd party, immutable message loop.  My message tried to outline
what would be required, for example, to make Pythonwin use such a Python
driven event loop while still using the MFC message loop.

> The key is that there wouldn't be two separate Python/GUI evloops.
> That's the reason for the (a) condition: be able to awaken a
> multiplexing syscall.

I'm not sure that is feasible.  With what I know about MFC, I almost
certainly would not attempt to integrate such a scheme with Pythonwin.  I
obviously cannot speak for the other GUI toolkit maintainers.

> I believe the next thing to do is to enumerate which GUI packages
> fullfill the following conditions ((a) updated to (a') to reflect the
> first paragraph of this msg):

That would certainly help.  I believe it is safe to say there are 3 major
GUI environments for Python currently released: Tkinter, wxPython and
Pythonwin.  I know Pythonwin does not qualify.  We both know Tkinter does
not qualify.  I don't know enough about wxPython, but even if it _does_
qualify, the simple fact that Tkinter doesn't would appear to be the
show-stopper...

Don't get me wrong - it's a noble goal that I _have_ pondered myself in the
past - but I can't see a good solution.

Mark.




From ping at lfw.org  Tue May 23 11:41:01 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 23 May 2000 02:41:01 -0700 (PDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
  multiplexing is too hard)
In-Reply-To: <392A3503.7C72@cnet.francetelecom.fr>
Message-ID: <Pine.LNX.4.10.10005230230130.461-200000@localhost>

On Tue, 23 May 2000, Alexandre Ferrieux wrote:
> 
> Great !!! That's exactly the kind of Pythonic translation I was
> expecting. Thanks !

Here's a straw man.  Try the attached module.  To test it, run:

    python ./watcher.py 10203

then telnet to port 10203 on the local machine.  You can open
several telnet connections to port 10203 at once.

In one session:

    skuld[1041]% telnet localhost 10203
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    >>> 1 + 2
    3
    >>> spam = 3

In another session:

    skuld[1008]% telnet localhost 10203
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    >>> spam
    3

> We just register an empty string as the callback name (script).
> But this is just a random API choice. Anything more Pythonic is welcome
> (an explicit unregister function is okay for me).

So is there no way to register more than one callback on a
particular file?  Do you ever find yourself wanting to do that?


-- ?!ng
-------------- next part --------------
"""Watcher module, by Ka-Ping Yee (22 May 2000).

This module implements event handling on files.  To use it, create a
Watcher object, and register callbacks on the Watcher with the watch()
method.  When ready, call the go() method to start the main loop."""

import select

class StopWatching:
    """Callbacks may raise this exception to exit the main loop."""
    pass

class Watcher:
    """This class provides the ability to register callbacks on file events.
    Each instance represents one mapping from file events to callbacks."""

    def __init__(self):
        self.readers = {}
        self.writers = {}
        self.errhandlers = {}
        self.dicts = [("r", self.readers), ("w", self.writers),
                      ("e", self.errhandlers)]

    def watch(self, handle, callback, modes="r"):
        """Register a callback on a file handle for specified events.
        The 'handle' argument may be a file object or any object providing
        a faithful 'fileno()' method (this includes sockets).  The 'modes'
        argument is a string containing any of the chars "r", "w", or "e"
        to specify that the callback should be triggered when the file
        becomes readable, writable, or encounters an error, respectively.
        The 'callback' should be a function that expects to be called with
        the three arguments (watcher, handle, mode)."""
        fd = handle.fileno()
        for mode, dict in self.dicts:
            if mode in modes: dict[fd] = (handle, callback)

    def unwatch(self, handle, modes="r"):
        """Unregister any callbacks on a file for the specified events.
        The 'handle' argument should be a file object and the 'modes'
        argument should contain one or more of the chars "r", "w", or "e"."""
        fd = handle.fileno()
        for mode, dict in self.dicts:
            if mode in modes and dict.has_key(fd): del dict[fd]
            
    def go(self, timeout=None):
        """Loop forever, watching for file events and triggering callbacks,
        until somebody raises an exception.  The StopWatching exception
        provides a quiet way to exit the event loop.  If a timeout is 
        specified, the loop will exit after that many seconds pass by with
        no events occurring."""
        try:
            while self.readers or self.writers or self.errhandlers:
                rd, wr, ex = select.select(self.readers.keys(),
                                           self.writers.keys(),
                                           self.errhandlers.keys(), timeout)
                if not (rd + wr + ex): break
                for fds, (mode, dict) in map(None, [rd, wr, ex], self.dicts):
                    for fd in fds:
                        handle, callback = dict[fd]
                        callback(self, handle, mode)
        except StopWatching: pass

if __name__ == "__main__":
    import sys, socket, code
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind("localhost", 10203) # Five is RIGHT OUT.
    s.listen(1)
    consoles = {}
    locals = {} # Share locals, just for fun.

    class Redirector:
        def __init__(self, write):
            self.write = write

    def getline(handle):
        line = ""
        while 1:
            ch = handle.recv(1)
            line = line + ch
            if not ch or ch == "\n": return line

    def read(watcher, handle, mode):
        line = getline(handle)
        if line:
            if line[-2:] == "\r\n": line = line[:-2]
            if line[-1:] == "\n": line = line[:-1]
            out, err = sys.stdout, sys.stderr
            sys.stdout = sys.stderr = Redirector(handle.send)
            more = consoles[handle].push(line)
            handle.send(more and "... " or ">>> ")
            sys.stdout, sys.stderr = out, err
        else:
            watcher.unwatch(handle)
            handle.close()

    def connect(watcher, handle, mode):
        ns, addr = handle.accept()
        consoles[ns] = code.InteractiveConsole(locals, "<%s:%d>" % addr)
        watcher.watch(ns, read)
        ns.send(">>> ")

    w = Watcher()
    w.watch(s, connect)
    w.go()

From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 11:54:31 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 11:54:31 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python   multiplexing is too hard)
References: <Pine.LNX.4.10.10005230230130.461-200000@localhost>
Message-ID: <392A5557.72F7@cnet.francetelecom.fr>

Ka-Ping Yee wrote:
> 
> On Tue, 23 May 2000, Alexandre Ferrieux wrote:
> >
> > Great !!! That's exactly the kind of Pythonic translation I was
> > expecting. Thanks !
> 
> Here's a straw man.  <watcher.py>

Nice. Now what's left to do is make select.select() truly
crossplatform...

> So is there no way to register more than one callback on a
> particular file?

Nope - it's considered the responsibility of higher layers.

> Do you ever find yourself wanting to do that?

Seldom, but it happened to me once, and I did exactly that: a layer
above.

-Alex



From mal at lemburg.com  Tue May 23 12:10:20 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 12:10:20 +0200
Subject: [Python-Dev] String encoding
Message-ID: <392A590C.E41239D3@lemburg.com>

The recent discussion about repr() et al. brought up the idea
of a locale based string encoding again.

A support module for querying the encoding used in the current
locale together with the experimental hook to set the string
encoding could yield a compromise which satisfies ASCII, Latin-1
and UTF-8 proponents.

The idea is to use the site.py module to customize the interpreter
from within Python (rather than making the encoding a compile
time option). This is easily doable using the (yet to be written)
support module and the sys.setstringencoding() hook.

The default encoding would be 'ascii' and could then be changed
to whatever the user or administrator wants it to be on a per
site basis. Furthermore, the encoding should be settable on
a per thread basis inside the interpreter (Python threads
do not seem to inherit any per-thread globals, so the
encoding would have to be set for all new threads).

E.g. a site.py module could look like this:

"""
import locale,sys

# Get encoding, defaulting to 'ascii' in case it cannot be
# determined
defenc = locale.get_encoding('ascii')

# Set main thread's string encoding
sys.setstringencoding(defenc)

This would result in the Unicode implementation assuming
defenc as the encoding of strings.
"""

Minor nit: due to the implementation, the C parser markers
"s" and "t" and the hash() value calculation will still need
to work with a fixed encoding which still is UTF-8. C APIs
which want to support Unicode should be fixed to use "es"
or query the object directly and then apply proper, possibly
OS dependent conversion.

Before starting off into implementing the above, I'd like to
hear some comments...

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 12:16:42 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 12:16:42 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBOEGBCLAA.mhammond@skippinet.com.au>
Message-ID: <392A5A8A.1534@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> >
> > I'd really like to challenge that 'almost'...
> 
> Sure - but the problem is simply that MFC has a non-standard message - so
> it _must_ be like I described in my message - and you will agree that
> sounds messy.
> 
> If such a set of multiplexing primitives took off, and people found them
> useful, and started complaining they dont work in Pythonwin, then I will
> surely look at it again.  Its too much work and too intrusive for a
> proof-of-concept effort.

Okay, fine.

> > I understand. Maybe I underestimated some of the difficulties. However,
> > I'd still like to separate what can be separated. The unfriendliness to
> > Python debuggers is sad news to me, but is not strictly related to the
> > problem of heterogeneous multiplexing: if I were to design a debugger
> > from scratch for a random language, I believe I'd arrange for the IPC
> > channel used to be more transparent. IOW, the very fact of using the
> > message queue for the debugging IPC *is* the culprit ! In unix, the
> > ptrace() or /proc interfaces have never walked on the toes of any
> > package, GUI or not...
> 
> The unfriendliness is purely related to Pythonwin, and not the general
> Python debugger.  I agree 100% that an RPC type mechanism is far better for
> a debugger.  It was just an anecdote to show how fickle these message loops
> can be (and therefore the complex requirements they have).

Okay, so maybe it's time to summarize what we agreed on:

	(1) 'tearing open' the main loop of a GUI package is tricky in the
general case.
	(2) perusing undefined WM_* messages requires care...

	(3) on the other hand, all other IPC channels are multiplexable. Even
for the worst case (pipes on Windows) at least 1 (1.5?) method has been
identified.

The temporary conclusion as far as I understand, is that nobody in the
Python community has the spare time and energy to tackle (1), that (2)
is tricky due to an unfortunate choice in the implementation of some
debuggers, and that the seemingly appealing unification outlined by (3)
is not enough of a motivation...

Under these conditions, clearly the only option is to put the blackbox
GUI loop inside a separate thread and arrange for it to use a
well-chosen IPC channel to awaken (something like) the Watcher.go()
proposed by Ka-Ping Yee.

Now there's still the issue of actually making select.select()
crossplatform.
Any takers ?

-Alex



From gstein at lyra.org  Tue May 23 12:57:21 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 May 2000 03:57:21 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <392A590C.E41239D3@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005230356230.25623-100000@nebula.lyra.org>

I still think that having any kind of global setting is going to be
troublesome. Whether it is per-thread or not, it still means that Module
Foo cannot alter the value without interfering with Module Bar.

Cheers,
-g

On Tue, 23 May 2000, M.-A. Lemburg wrote:

> The recent discussion about repr() et al. brought up the idea
> of a locale based string encoding again.
> 
> A support module for querying the encoding used in the current
> locale together with the experimental hook to set the string
> encoding could yield a compromise which satisfies ASCII, Latin-1
> and UTF-8 proponents.
> 
> The idea is to use the site.py module to customize the interpreter
> from within Python (rather than making the encoding a compile
> time option). This is easily doable using the (yet to be written)
> support module and the sys.setstringencoding() hook.
> 
> The default encoding would be 'ascii' and could then be changed
> to whatever the user or administrator wants it to be on a per
> site basis. Furthermore, the encoding should be settable on
> a per thread basis inside the interpreter (Python threads
> do not seem to inherit any per-thread globals, so the
> encoding would have to be set for all new threads).
> 
> E.g. a site.py module could look like this:
> 
> """
> import locale,sys
> 
> # Get encoding, defaulting to 'ascii' in case it cannot be
> # determined
> defenc = locale.get_encoding('ascii')
> 
> # Set main thread's string encoding
> sys.setstringencoding(defenc)
> 
> This would result in the Unicode implementation assuming
> defenc as the encoding of strings.
> """
> 
> Minor nit: due to the implementation, the C parser markers
> "s" and "t" and the hash() value calculation will still need
> to work with a fixed encoding which still is UTF-8. C APIs
> which want to support Unicode should be fixed to use "es"
> or query the object directly and then apply proper, possibly
> OS dependent conversion.
> 
> Before starting off into implementing the above, I'd like to
> hear some comments...
> 
> Thanks,
> -- 
> Marc-Andre Lemburg
> ______________________________________________________________________
> Business:                                      http://www.lemburg.com/
> Python Pages:                           http://www.lemburg.com/python/
> 
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Tue May 23 13:38:41 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 23 May 2000 13:38:41 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com>
Message-ID: <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
> The recent discussion about repr() et al. brought up the idea
> of a locale based string encoding again.

before proceeding down this (not very slippery but slightly
unfortunate, imho) slope, I think we should decide whether

    assert eval(repr(s)) == s

should be true for strings.

if this isn't important, nothing stops you from changing 'repr'
to use isprint, without having to make sure that you can still
parse the resulting string.

but if it is important, you cannot really change 'repr' without
addressing the big issue.

so assuming that the assertion must hold, and that changing
'repr' to be locale-dependent is a good idea, let's move on:

> A support module for querying the encoding used in the current
> locale together with the experimental hook to set the string
> encoding could yield a compromise which satisfies ASCII, Latin-1
> and UTF-8 proponents.

agreed.

> The idea is to use the site.py module to customize the interpreter
> from within Python (rather than making the encoding a compile
> time option). This is easily doable using the (yet to be written)
> support module and the sys.setstringencoding() hook.

agreed.

note that parsing LANG (etc) variables on a POSIX platform is
easy enough to do in Python (either in site.py or in locale.py).
no need for external support modules for Unix, in other words.
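something along these lines, say (the helper name and the exact
parsing rules below are just a rough sketch, not an existing API):

    import os

    def guess_locale_encoding(default="ascii"):
        # look at the usual POSIX locale variables and pull out the
        # codeset part, e.g. "de_DE.ISO8859-1" -> "iso8859-1"
        for name in ("LC_ALL", "LC_CTYPE", "LANG"):
            value = os.environ.get(name)
            if value and "." in value:
                codeset = value.split(".", 1)[1]
                return codeset.split("@")[0].lower()
        return default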

for windows, I suggest adding GetACP() to the _locale module,
and let the glue layer (site.py or locale.py) do:

    if sys.platform == "win32":
        sys.setstringencoding("cp%d" % GetACP())

on mac, I think you can determine the encoding by inspecting the
system font, and fall back to "macroman" if that doesn't work out.
but figuring out the right way to do that is best left to anyone who
actually has access to a Mac.  in the meantime, just make it:

    elif sys.platform == "mac":
        sys.setstringencoding("macroman")

> The default encoding would be 'ascii' and could then be changed
> to whatever the user or administrator wants it to be on a per
> site basis. 

Tcl defaults to "iso-8859-1" on all platforms except the Mac.  assuming
that the vast majority of non-Mac platforms are either modern Unixes
or Windows boxes, that makes a lot more sense than US ASCII...

in other words:

    else:
        # try to determine the encoding from the POSIX locale
        # environment variables
        ...
        # and if that cannot be done, follow Tcl's lead:
        sys.setstringencoding("iso-latin-1")

> Furthermore, the encoding should be settable on a per thread basis
> inside the interpreter (Python threads do not seem to inherit any
> per-thread globals, so the encoding would have to be set for all
> new threads).

is the C/POSIX locale setting thread specific?

if not, I think the default encoding should be a global setting, just
like the system locale itself.  otherwise, you'll just be addressing a
real problem (thread/module/function/class/object specific locale
handling), but not really solving it...

better use unicode strings and explicit encodings in that case.

> Minor nit: due to the implementation, the C parser markers
> "s" and "t" and the hash() value calculation will still need
> to work with a fixed encoding which still is UTF-8.

can this be fixed?  or rather, what changes to the buffer api
are required if we want to work around this problem?

> C APIs which want to support Unicode should be fixed to use
> "es" or query the object directly and then apply proper, possibly
> OS dependent conversion.

for convenience, it might be a good idea to have a "wide system
encoding" too, and special parser markers for that purpose.

or can we assume that all wide system API's use unicode all the
time?

unproductive-ly yrs /F




From pf at artcom-gmbh.de  Tue May 23 14:02:17 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 14:02:17 +0200 (MEST)
Subject: [Python-Dev] String encoding
In-Reply-To: <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> from Fredrik Lundh at "May 23, 2000  1:38:41 pm"
Message-ID: <m12uDO5-000DieC@artcom0.artcom-gmbh.de>

Hi Fredrik!

you wrote:
> before proceeding down this (not very slippery but slightly
> unfortunate, imho) slope, I think we should decide whether
> 
>     assert eval(repr(s)) == s
> 
> should be true for strings.
[...]

What's the problem with this one?  I've played around with several
locale settings here and observed no problems while doing:

>>> import string
>>> s = string.join(map(chr, range(128,256)),"")
>>> assert eval('"'+s+'"') == s

What do you fear here if 'repr' were to output characters from the
upper half of the charset without quoting them as octal sequences?
I don't understand.

Regards, Peter



From fredrik at pythonware.com  Tue May 23 15:09:11 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 23 May 2000 15:09:11 +0200
Subject: [Python-Dev] String encoding
References: <m12uDO5-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <01fa01bfc4b8$16e64700$0500a8c0@secret.pythonware.com>

Peter wrote:
>
> >     assert eval(repr(s)) == s
>
> What's the problem with this one?  I've played around with several
> locale settings here and I observed no problems, while doing:

what if the default encoding for source code is different
from the locale?  (think UTF-8 source code)

(no, that's not supported by 1.6.  but if we don't consider that
case now, we won't be able to support source encodings in the
future -- unless the above assertion isn't important, of course).

</F>




From mal at lemburg.com  Tue May 23 13:14:46 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 13:14:46 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005230356230.25623-100000@nebula.lyra.org>
Message-ID: <392A6826.CCDD2246@lemburg.com>

Greg Stein wrote:
> 
> I still think that having any kind of global setting is going to be
> troublesome. Whether it is per-thread or not, it still means that Module
> Foo cannot alter the value without interfering with Module Bar.

True. 

The only reasonable place to alter the setting is in
site.py for the main thread. I think the setting should be
inherited by child threads, but I'm not sure whether this is
possible or not.
 
Modules that would need to change the setting are better
(re)designed in a way that doesn't rely on it at all, e.g. by
working on Unicode exclusively, which avoids the need in the
first place.

And then, no one is forced to alter the ASCII default to begin
with :-) The good thing about exposing this mechanism in Python
is that it gets user attention...

> Cheers,
> -g
> 
> On Tue, 23 May 2000, M.-A. Lemburg wrote:
> 
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> >
> > A support module for querying the encoding used in the current
> > locale together with the experimental hook to set the string
> > encoding could yield a compromise which satisfies ASCII, Latin-1
> > and UTF-8 proponents.
> >
> > The idea is to use the site.py module to customize the interpreter
> > from within Python (rather than making the encoding a compile
> > time option). This is easily doable using the (yet to be written)
> > support module and the sys.setstringencoding() hook.
> >
> > The default encoding would be 'ascii' and could then be changed
> > to whatever the user or administrator wants it to be on a per
> > site basis. Furthermore, the encoding should be settable on
> > a per thread basis inside the interpreter (Python threads
> > do not seem to inherit any per-thread globals, so the
> > encoding would have to be set for all new threads).
> >
> > E.g. a site.py module could look like this:
> >
> > """
> > import locale,sys
> >
> > # Get encoding, defaulting to 'ascii' in case it cannot be
> > # determined
> > defenc = locale.get_encoding('ascii')
> >
> > # Set main thread's string encoding
> > sys.setstringencoding(defenc)
> >
> > This would result in the Unicode implementation to assume
> > defenc as encoding of strings.
> > """
> >
> > Minor nit: due to the implementation, the C parser markers
> > "s" and "t" and the hash() value calculation will still need
> > to work with a fixed encoding which still is UTF-8. C APIs
> > which want to support Unicode should be fixed to use "es"
> > or query the object directly and then apply proper, possibly
> > OS dependent conversion.
> >
> > Before starting off into implementing the above, I'd like to
> > hear some comments...
> >
> > Thanks,
> > --
> > Marc-Andre Lemburg
> > ______________________________________________________________________
> > Business:                                      http://www.lemburg.com/
> > Python Pages:                           http://www.lemburg.com/python/
> >
> >
> > _______________________________________________
> > Python-Dev mailing list
> > Python-Dev at python.org
> > http://www.python.org/mailman/listinfo/python-dev
> >
> 
> --
> Greg Stein, http://www.lyra.org/
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From pf at artcom-gmbh.de  Tue May 23 16:29:58 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 16:29:58 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
Message-ID: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>

Python 1.6 reports a bad magic error when someone tries to import a .pyc
file compiled by Python 1.5.2.  AFAIK only new features have been
added.  So why isn't it possible to use these old files in Python 1.6?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From dan at cgsoftware.com  Tue May 23 16:43:44 2000
From: dan at cgsoftware.com (Daniel Berlin)
Date: Tue, 23 May 2000 10:43:44 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <BAEMJNPFHMFPEFAGBKKAMEJBCCAA.dan@cgsoftware.com>

Because of the unicode changes, AFAIK.
Or was it the multi-arg vs. single-arg append and friends?
Anyway, the point is that there were incompatible changes made, and thus
the magic was changed.
--Dan
>
>
> Python 1.6 reports a bad magic error, when someone tries to import a .pyc
> file compiled by Python 1.5.2.  AFAIK only new features have been
> added.  So why it isn't possible to use these old files in Python 1.6?
>
> Regards, Peter
> --
> Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany,
> Fax:+49 4222950260
> office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)




From fdrake at acm.org  Tue May 23 16:47:52 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 07:47:52 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6.  Why not?
In-Reply-To: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, Peter Funk wrote:
 > Python 1.6 reports a bad magic error, when someone tries to import a .pyc
 > file compiled by Python 1.5.2.  AFAIK only new features have been
 > added.  So why it isn't possible to use these old files in Python 1.6?

Peter,
  In theory, perhaps it could; I don't know if the extra work is worth it,
however.
  What's happening is that the .pyc magic number changed because the
marshal format has been extended to support Unicode string objects.  The
old format should still be readable, but there's nothing in the .pyc
loader that supports the acceptance of multiple versions of the marshal
format.
  Is there reason to think that's a substantial problem for users, given
the automatic recompilation of bytecode from source?  The only serious
problems I can see are when multiple versions of the interpreter are being
used on the same collection of source files (because the re-compilation
occurs more often and affects performance), and when *only* .pyc/.pyo
files are available.
  Do you have reason to suspect that either case is sufficiently common to
complicate the .pyc loader, or is there another reason that I've missed
(very possible, I admit)?


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>





From mal at lemburg.com  Tue May 23 16:20:19 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 16:20:19 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>
Message-ID: <392A93A3.91188372@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> 
> before proceeding down this (not very slippery but slightly
> unfortunate, imho) slope, I think we should decide whether
> 
>     assert eval(repr(s)) == s
> 
> should be true for strings.
> 
> if this isn't important, nothing stops you from changing 'repr'
> to use isprint, without having to make sure that you can still
> parse the resulting string.
> 
> but if it is important, you cannot really change 'repr' without
> addressing the big issue.

This is a different discussion which I don't really want to
get into... I don't have any need for repr() being locale
dependent, since I only use it for debugging purposes and
never to rebuild objects (marshal and pickle are much better
at that).

BTW, repr(unicode) is not affected by the string encoding:
it always returns unicode-escape.
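
E.g. (from memory, so take the exact output with a grain of salt):

    >>> repr(u'abc\xe4')
    "u'abc\\xe4'"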

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May 23 16:47:40 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 16:47:40 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>
Message-ID: <392A9A0C.2E297072@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> > [...]
> >
> > A support module for querying the encoding used in the current
> > locale together with the experimental hook to set the string
> > encoding could yield a compromise which satisfies ASCII, Latin-1
> > and UTF-8 proponents.
> 
> agreed.
> 
> > The idea is to use the site.py module to customize the interpreter
> > from within Python (rather than making the encoding a compile
> > time option). This is easily doable using the (yet to be written)
> > support module and the sys.setstringencoding() hook.
> 
> agreed.
> 
> note that parsing LANG (etc) variables on a POSIX platform is
> easy enough to do in Python (either in site.py or in locale.py).
> no need for external support modules for Unix, in other words.

Agreed... the locale.py (and _locale builtin module) are probably
the right place to put such a parser.
 
> for windows, I suggest adding GetACP() to the _locale module,
> and let the glue layer (site.py or locale.py) do:
> 
>     if sys.platform == "win32":
>         sys.setstringencoding("cp%d" % GetACP())
> 
> on mac, I think you can determine the encoding by inspecting the
> system font, and fall back to "macroman" if that doesn't work out.
> but figuring out the right way to do that is best left to anyone who
> actually has access to a Mac.  in the meantime, just make it:
> 
>     elif sys.platform == "mac":
>         sys.setstringencoding("macroman")
> 
> > The default encoding would be 'ascii' and could then be changed
> > to whatever the user or administrator wants it to be on a per
> > site basis.
> 
> Tcl defaults to "iso-8859-1" on all platforms except the Mac.  assuming
> that the vast majority of non-Mac platforms are either modern Unixes
> or Windows boxes, that makes a lot more sense than US ASCII...
> 
> in other words:
> 
>     else:
>         # try to determine encoding from POSIX locale environment
>         # variables
>         ...
> 
>     else:
>         sys.setstringencoding("iso-latin-1")

That's a different topic which I don't want to revive ;-)

With the above tools you can easily code the latin-1 default
into your site.py.
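
E.g. something like this at the end of site.py (using the proposed
hook and guarding against interpreters that don't have it):

    import sys
    try:
        sys.setstringencoding('latin-1')
    except AttributeError:
        # experimental hook not available in this build
        pass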

> > Furthermore, the encoding should be settable on a per thread basis
> > inside the interpreter (Python threads do not seem to inherit any
> > per-thread globals, so the encoding would have to be set for all
> > new threads).
> 
> is the C/POSIX locale setting thread specific?

Good question -- I don't know.

> if not, I think the default encoding should be a global setting, just
> like the system locale itself.  otherwise, you'll just be addressing a
> real problem (thread/module/function/class/object specific locale
> handling), but not really solving it...
>
> better use unicode strings and explicit encodings in that case.

Agreed.
 
> > Minor nit: due to the implementation, the C parser markers
> > "s" and "t" and the hash() value calculation will still need
> > to work with a fixed encoding which still is UTF-8.
> 
> can this be fixed?  or rather, what changes to the buffer api
> are required if we want to work around this problem?

The problem is that "s" and "t" return C pointers to some
internal data structure of the object. It has to be assured
that this data remains intact at least as long as the object
itself exists.

AFAIK, this cannot be fixed without creating a memory leak.
 
The "es" parser marker uses a different strategy, BTW: the
data is copied into a buffer, thus detaching the object
from the data.

> > C APIs which want to support Unicode should be fixed to use
> > "es" or query the object directly and then apply proper, possibly
> > OS dependent conversion.
> 
> for convenience, it might be a good idea to have a "wide system
> encoding" too, and special parser markers for that purpose.
> 
> or can we assume that all wide system API's use unicode all the
> time?

At least in all references I've seen (e.g. ODBC, wchar_t
implementations, etc.) "wide" refers to Unicode.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Tue May 23 17:13:59 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 08:13:59 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <392A9A0C.2E297072@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005230805200.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, M.-A. Lemburg wrote:
 > The problem is that "s" and "t" return C pointers to some
 > internal data structure of the object. It has to be assured
 > that this data remains intact at least as long as the object
 > itself exists.
 > 
 > AFAIK, this cannot be fixed without creating a memory leak.
 >  
 > The "es" parser marker uses a different strategy, BTW: the
 > data is copied into a buffer, thus detaching the object
 > from the data.
 > 
 > > > C APIs which want to support Unicode should be fixed to use
 > > > "es" or query the object directly and then apply proper, possibly
 > > > OS dependent conversion.
 > > 
 > > for convenience, it might be a good idea to have a "wide system
 > > encoding" too, and special parser markers for that purpose.
 > > 
 > > or can we assume that all wide system API's use unicode all the
 > > time?
 > 
 > At least in all references I've seen (e.g. ODBC, wchar_t
 > implementations, etc.) "wide" refers to Unicode.

  On Linux, wchar_t is 4 bytes; that's not just Unicode.  Doesn't ISO
10646 require a 32-bit space?
  I recall a fair bit of discussion about wchar_t when it was introduced
to ANSI C, and the character set and encoding were specifically not made
part of the specification.  Making a requirement that wchar_t be Unicode
doesn't make a lot of sense, and opens up potential portability issues.

-1 on any assumption that wchar_t is usefully portable.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From effbot at telia.com  Tue May 23 17:16:42 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 17:16:42 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A93A3.91188372@lemburg.com>
Message-ID: <023d01bfc4cb$3b0ee3e0$f2a6b5d4@hagrid>

M.-A. Lemburg <mal at lemburg.com> wrote:
> > before proceeding down this (not very slippery but slightly
> > unfortunate, imho) slope, I think we should decide whether
> > 
> >     assert eval(repr(s)) == s
> > 
> > should be true for strings.

footnote: as far as I can tell, the language reference says it should:
http://www.python.org/doc/current/ref/string-conversions.html

> This is a different discussion which I don't really want to
> get into... I don't have any need for repr() being locale
> dependent, since I only use it for debugging purposes and
> never to rebuild objects (marshal and pickle are much better
> at that).

in other words, you leave it to 'pickle' to call 'repr' for you ;-)

</F>




From pf at artcom-gmbh.de  Tue May 23 17:23:48 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 17:23:48 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com> from "Fred L. Drake" at "May 23, 2000  7:47:52 am"
Message-ID: <m12uGX6-000DieC@artcom0.artcom-gmbh.de>

Fred, 
Thank you for your quick response.

Fred L. Drake:
> Peter,
>   In theory, perhaps it could; I don't know if the extra work is worth it,
> however.
[...]
>   Do you have reason to suspect that either case is sufficiently common to
> complicate the .pyc loader, or is there another reason that I've missed
> (very possible, I admit)?

Well, currently we (our company) deliver no source code to our
customers.  I don't want to discuss this policy and the reasoning
behind it here.  But this situation may also apply to other commercial
software vendors using Python.

During late 2000 there may be several customers out there running
Python 1.6 and others still running Python 1.5.2.  So we will have
several choices to deal with this situation:
   1. Supply two different binary distribution packages: 
      one containing 1.5.2 .pyc files and one containing 1.6 .pyc files.
      This will introduce some new logistic problems.
   2. Upgrade to Python 1.6 at each customer site at once. 
      This will be difficult.
   3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files
      and supply our own patched Python distribution.
      (and this would also be "carrying owls to Athen" for Linux systems)
    [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ carelessly translated
      German idiom makes any sense in English ;-) ]
      I personally don't like this.
   4. Change our policy and also distribute the .py sources.  Besides the
      difficulty of convincing the management of this one, it also
      introduces new technical "challenges": the Unix text files have to be
      converted from LF line ends to CR line ends, or MacPython wouldn't be
      able to parse them.  So the Mac source distributions
      must be built from a different directory tree.

No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or 
some such somewhere in the 1.6 unmarshaller should do the trick.
But I don't want to submit a patch if God^H^HGuido thinks this isn't
worth the effort. <wink>
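
For illustration, this is roughly what such a check would look at from
the Python side (MAGIC_152 and MAGIC_16 below are just placeholders,
not real constants):

    def pyc_header(path):
        # The first four bytes of a .pyc file are the version specific
        # magic word followed by "\r\n".
        f = open(path, "rb")
        header = f.read(4)
        f.close()
        return header

    # A more tolerant loader would then accept both headers, e.g.
    #     if magic in (MAGIC_16, MAGIC_152): ...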

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From esr at thyrsus.com  Tue May 23 17:40:50 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 23 May 2000 11:40:50 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <m12uGX6-000DieC@artcom0.artcom-gmbh.de>; from pf@artcom-gmbh.de on Tue, May 23, 2000 at 05:23:48PM +0200
References: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com> <m12uGX6-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000523114050.A4781@thyrsus.com>

Peter Funk <pf at artcom-gmbh.de>:
>       (and this would also be "carrying owls to Athen" for Linux systems)
>     [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ careless translated 
>       german idiom makes any sense in english ;-) ]

There is a precise equivalent: "carrying coals to Newcastle".
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

"Are we to understand," asked the judge, "that you hold your own interests
above the interests of the public?"

"I hold that such a question can never arise except in a society of cannibals."
	-- Ayn Rand



From effbot at telia.com  Tue May 23 17:41:46 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 17:41:46 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A9A0C.2E297072@lemburg.com>
Message-ID: <024001bfc4cd$68210f00$f2a6b5d4@hagrid>

M.-A. Lemburg wrote:
> That's a different topic which I don't want to revive ;-)

in a way, you've already done that -- if you're setting the system encoding
in the site.py module, lots of people will end up with the encoding set to ISO
Latin 1 or its windows superset.

one might of course the system encoding if the user actually calls setlocale,
but there's no way for python to trap calls to that function from a submodule
(e.g. readline), so it's easy to get out of sync.  hmm.

(on the other hand, I'd say it's far more likely that americans are among the
few who don't know how to set the locale, so defaulting to us ascii might be
best after all -- even if their computers really use iso-latin-1, we don't have
to cause unnecessary confusion...)

...

but I guess you're right: let's be politically correct and pretend that this really
is a completely different issue ;-)

</F>




From effbot at telia.com  Tue May 23 18:04:38 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 18:04:38 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <Pine.LNX.4.10.10005220914000.461-100000@localhost>  <200005222038.PAA01284@cj20424-a.reston1.va.home.com>
Message-ID: <027f01bfc4d0$99c48ca0$f2a6b5d4@hagrid>

> Which is why I find Fredrik's attitude unproductive.

given that locale support isn't included if you make a default build,
I don't think deprecating it would hurt that many people...

but that's me; when designing libraries, I've always strived to find
the *minimal* set of functions (and code) that makes it possible for
a programmer to do her job well.  I'm especially wary of blind alleys
(sure, you can use locale, but that'll only take you this far, and you
have to start all over if you want to do it right).

btw, talking about productivity, go check out the case sensitivity
threads on comp.lang.python.  imagine if all those people hammered
away on the 1.6 alpha instead...

> And where's the SRE release?

at the usual place:

    http://w1.132.telia.com/~u13208596/sre/index.htm

still one showstopper left, which is why I haven't made the long-
awaited public "now it's finished, dammit" announcement yet.  but
it shouldn't be that far away.

</F>




From fdrake at acm.org  Tue May 23 18:11:14 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 09:11:14 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6.  Why not?
In-Reply-To: <20000523114050.A4781@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005230904570.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, Eric S. Raymond wrote:
 > Peter Funk <pf at artcom-gmbh.de>:
 > >       (and this would also be "carrying owls to Athen" for Linux systems)
 > >     [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ careless translated 
 > >       german idiom makes any sense in english ;-) ]
 > 
 > There is a precise equivalent: "carrying coals to Newcastle".

  That's interesting... I've never heard either, but I think I can guess
the meaning now.
  I agree; it looks like there's some work to do in getting the .pyc
loader to be a little more concerned about importing compatible marshal
formats.  I have an idea about how I'd like to see it done which may be a
little less magical.  I'll work up a patch later this week.
  I won't check in any changes for this until we've heard from Guido on
the matter, and he'll probably be unavailable for the next couple of days.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From effbot at telia.com  Tue May 23 18:26:43 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 18:26:43 +0200
Subject: [Python-Dev] Unicode
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com> <200005172255.AAA01245@loewis.home.cs.tu-berlin.de>
Message-ID: <031101bfc4d3$afac2020$f2a6b5d4@hagrid>

Martin v. Loewis wrote:
> To my knowledge, no. Tcl (at least 8.3) supports the \u notation for
> Unicode escapes, and treats all other source code as
> Latin-1. encoding(n) says
> 
> # However, because the source command always reads files using the
> # ISO8859-1 encoding, Tcl will treat each byte in the file as a
> # separate character that maps to the 00 page in Unicode.

as far as I can tell from digging through the sources, the "source"
command uses the system encoding.  and from the look of it, it's
not always iso-latin-1...

</F>




From mal at lemburg.com  Tue May 23 18:48:08 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 18:48:08 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005230805200.22456-100000@mailhost.beopen.com>
Message-ID: <392AB648.368663A8@lemburg.com>

"Fred L. Drake" wrote:
> 
> On Tue, 23 May 2000, M.-A. Lemburg wrote:
>  > The problem is that "s" and "t" return C pointers to some
>  > internal data structure of the object. It has to be assured
>  > that this data remains intact at least as long as the object
>  > itself exists.
>  >
>  > AFAIK, this cannot be fixed without creating a memory leak.
>  >
>  > The "es" parser marker uses a different strategy, BTW: the
>  > data is copied into a buffer, thus detaching the object
>  > from the data.
>  >
>  > > > C APIs which want to support Unicode should be fixed to use
>  > > > "es" or query the object directly and then apply proper, possibly
>  > > > OS dependent conversion.
>  > >
>  > > for convenience, it might be a good idea to have a "wide system
>  > > encoding" too, and special parser markers for that purpose.
>  > >
>  > > or can we assume that all wide system API's use unicode all the
>  > > time?
>  >
>  > At least in all references I've seen (e.g. ODBC, wchar_t
>  > implementations, etc.) "wide" refers to Unicode.
> 
>   On Linux, wchar_t is 4 bytes; that's not just Unicode.  Doesn't ISO
> 10646 require a 32-bit space?

It is; Unicode is definitely moving in the 32-bit direction.

>   I recall a fair bit of discussion about wchar_t when it was introduced
> to ANSI C, and the character set and encoding were specifically not made
> part of the specification.  Making a requirement that wchar_t be Unicode
> doesn't make a lot of sense, and opens up potential portability issues.
> 
> -1 on any assumption that wchar_t is usefully portable.

Ok... so it could be that Fredrik has a point there, but I'm
not deep enough into this to be able to comment.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May 23 19:15:17 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 19:15:17 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A93A3.91188372@lemburg.com> <023d01bfc4cb$3b0ee3e0$f2a6b5d4@hagrid>
Message-ID: <392ABCA5.EC84824F@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal at lemburg.com> wrote:
> > > before proceeding down this (not very slippery but slightly
> > > unfortunate, imho) slope, I think we should decide whether
> > >
> > >     assert eval(repr(s)) == s
> > >
> > > should be true for strings.
> 
> footnote: as far as I can tell, the language reference says it should:
> http://www.python.org/doc/current/ref/string-conversions.html
> 
> > This is a different discussion which I don't really want to
> > get into... I don't have any need for repr() being locale
> > dependent, since I only use it for debugging purposes and
> > never to rebuild objects (marshal and pickle are much better
> > at that).
> 
> in other words, you leave it to 'pickle' to call 'repr' for you ;-)

Ooops... now this gives a totally new ring to changing
repr(). Hehe, perhaps we need a string.encode() method
too ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From martin at loewis.home.cs.tu-berlin.de  Tue May 23 23:44:11 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 23 May 2000 23:44:11 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <031101bfc4d3$afac2020$f2a6b5d4@hagrid> (effbot@telia.com)
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com> <200005172255.AAA01245@loewis.home.cs.tu-berlin.de> <031101bfc4d3$afac2020$f2a6b5d4@hagrid>
Message-ID: <200005232144.XAA01129@loewis.home.cs.tu-berlin.de>

> > # However, because the source command always reads files using the
> > # ISO8859-1 encoding, Tcl will treat each byte in the file as a
> > # separate character that maps to the 00 page in Unicode.
> 
> as far as I can tell from digging through the sources, the "source"
> command uses the system encoding.  and from the look of it, it's
> not always iso-latin-1...

Indeed, this appears to be an error in the documentation. sourcing

encoding convertto utf-8 ?

has an outcome depending on the system encoding; just try koi8-r to
see the difference.

Regards,
Martin




From effbot at telia.com  Tue May 23 23:57:57 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 23:57:57 +0200
Subject: [Python-Dev] homer-dev, anyone?
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>
Message-ID: <008a01bfc502$17765260$f2a6b5d4@hagrid>

    http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
    "May 11: In a press conference held early this morning, Guido van Rossum
    ... announced that his most famous project will be undergoing a name
    change ..."

    http://www.scriptics.com/company/news/press_release_ajuba.html
    "May 22: Scriptics Corporation ... today announced that it has changed its
    name ..."

...




From akuchlin at mems-exchange.org  Wed May 24 01:33:28 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 23 May 2000 19:33:28 -0400 (EDT)
Subject: [Python-Dev] Updated curses module in CVS
Message-ID: <200005232333.TAA16068@amarok.cnri.reston.va.us>

Today I checked in a new version of the curses module that will only
work with ncurses and/or SYSV curses.  I've tried compiling it on
Linux with ncurses 5.0, and on Solaris; there are also #ifdef's to
make it work with some version of SGI's curses.

I'd appreciate it if people could try the module with the curses
implementations on other platforms: Tru64, AIX, *BSDs (though they use
ncurses, maybe they're some versions behind), etc.  Please let me know
of your results through e-mail.

And if you have code that used the old curses module, and breaks with
the new module, please let me know; the goal is to have 100%
backward-compatibility.

Also, here's a list of ncurses functions that aren't yet supported;
should I make adding them a priority?  (Most of them seem to be pretty
marginal, except for the mouse-related functions which I want to add
next.)

addchnstr addchstr chgat color_set copywin define_key del_curterm
delscreen dupwin getmouse inchnstr inchstr innstr keyok mcprint
mouseinterval mousemask mvaddchnstr mvaddchstr mvchgat mvcur
mvinchnstr mvinchstr mvinnstr mmvwaddchnstr mvwaddchstr mvwchgat
mvwgetnstr mvwinchnstr mvwinchstr mvwinnstr napms newterm overlay
overwrite resetty resizeterm restartterm ripoffline savetty scr_dump
scr_init scr_restore scr_set scrl set_curterm set_term setterm
setupterm slk_attr slk_attr_off slk_attr_on slk_attr_set slk_attroff
slk_attron slk_attrset slk_clear slk_color slk_init slk_label
slk_noutrefresh slk_refresh slk_restore slk_set slk_touch tgetent
tgetflag tgetnum tgetstr tgoto tigetflag tigetnum tigetstr timeout
tparm tputs tputs typeahead ungetmouse use_default_colors vidattr
vidputs waddchnstr waddchstr wchgat wcolor_set wcursyncup wenclose
winchnstr winchstr winnstr wmouse_trafo wredrawln wscrl wtimeout

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
..signature has giant ASCII graphic: Forced to read "War And Peace" at 110
baud on a Braille terminal after having fingers rubbed with sandpaper.
  -- Kibo, in the Happynet Manifesto



From gstein at lyra.org  Wed May 24 02:00:55 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 May 2000 17:00:55 -0700 (PDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <008a01bfc502$17765260$f2a6b5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>

what a dumb name...


On Tue, 23 May 2000, Fredrik Lundh wrote:

> 
>     http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
>     "May 11: In a press conference held early this morning, Guido van Rossum
>     ... announced that his most famous project will be undergoing a name
>     change ..."
> 
>     http://www.scriptics.com/company/news/press_release_ajuba.html
>     "May 22: Scriptics Corporation ... today announced that it has changed its
>     name ..."
> 
> ...
> 
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/




From klm at digicool.com  Wed May 24 02:33:57 2000
From: klm at digicool.com (Ken Manheimer)
Date: Tue, 23 May 2000 20:33:57 -0400 (EDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
Message-ID: <Pine.LNX.4.21.0005232030340.31343-100000@korak.digicool.com>

On Tue, 23 May 2000, Greg Stein wrote:

> what a dumb name...
> On Tue, 23 May 2000, Fredrik Lundh wrote:
> 
> >     http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
> >     "May 11: In a press conference held early this morning, Guido van Rossum
> >     ... announced that his most famous project will be undergoing a name
> >     change ..."

Huh.  I dunno what's so dumb about it.  But i definitely was tickled by:

  !STOP PRESS! Microsoft Corporation announced this afternoon that it had
  aquired rights to use South Park characters in its software. The first
  such product, formerly known as Visual J++, will now be known as Kenny.
  !STOP PRESS!

:->

Ken
klm at digicool.com

(No relation.)




From esr at thyrsus.com  Wed May 24 02:47:50 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 23 May 2000 20:47:50 -0400
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <200005232333.TAA16068@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Tue, May 23, 2000 at 07:33:28PM -0400
References: <200005232333.TAA16068@amarok.cnri.reston.va.us>
Message-ID: <20000523204750.A6107@thyrsus.com>

Andrew M. Kuchling <akuchlin at mems-exchange.org>:
> Also, here's a list of ncurses functions that aren't yet supported;
> should I make adding them a priority.  (Most of them seem to be pretty
> marginal, except for the mouse-related functions which I want to add
> next.)
> 
> addchnstr addchstr chgat color_set copywin define_key del_curterm
> delscreen dupwin getmouse inchnstr inchstr innstr keyok mcprint
> mouseinterval mousemask mvaddchnstr mvaddchstr mvchgat mvcur
> mvinchnstr mvinchstr mvinnstr mmvwaddchnstr mvwaddchstr mvwchgat
> mvwgetnstr mvwinchnstr mvwinchstr mvwinnstr napms newterm overlay
> overwrite resetty resizeterm restartterm ripoffline savetty scr_dump
> scr_init scr_restore scr_set scrl set_curterm set_term setterm
> setupterm slk_attr slk_attr_off slk_attr_on slk_attr_set slk_attroff
> slk_attron slk_attrset slk_clear slk_color slk_init slk_label
> slk_noutrefresh slk_refresh slk_restore slk_set slk_touch tgetent
> tgetflag tgetnum tgetstr tgoto tigetflag tigetnum tigetstr timeout
> tparm tputs tputs typeahead ungetmouse use_default_colors vidattr
> vidputs waddchnstr waddchstr wchgat wcolor_set wcursyncup wenclose
> winchnstr winchstr winnstr wmouse_trafo wredrawln wscrl wtimeout

I think you're right to put the mouse support at highest priority.

I'd say napms() and the overlay/overwrite/copywin group are moderately
important.  So are the functions in the curs_inopts(3x) group -- when
you need those, nothing else will do.  

You can certainly pretty much forget the slk_* group; I only
implemented those for the sake of excruciating completeness.
Likewise for the mv* variants.  

Here's a function that ought to be in the Python wrapper associated with
the module:

import curses, traceback

def traceback_wrapper(func, *rest):
    "Call a hook function, guaranteeing curses cleanup on error or exit."
    try:
        # Initialize curses
        stdscr = curses.initscr()
        # Turn off echoing of keys, and enter cbreak mode,
        # where no buffering is performed on keyboard input
        curses.noecho() ; curses.cbreak()

        # In keypad mode, escape sequences for special keys
        # (like the cursor keys) will be interpreted and
        # a special value like curses.KEY_LEFT will be returned
        stdscr.keypad(1)

        # Run the hook.  Supply the screen window object as first argument
        apply(func, (stdscr,) + rest)

        # Set everything back to normal
        stdscr.keypad(0)
        curses.echo() ; curses.nocbreak()
        curses.endwin()          # Terminate curses
    except:
        # In the event of an error, restore the terminal
        # to a sane state, then report the problem.
        stdscr.keypad(0)
        curses.echo() ; curses.nocbreak()
        curses.endwin()
        traceback.print_exc()    # Print the exception
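
For example, a caller would just pass its top-level function in
(made-up names, of course):

    def main(stdscr, greeting):
        stdscr.addstr(0, 0, greeting)
        stdscr.getch()

    traceback_wrapper(main, "hello from curses")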

(Does this case mean, perhaps, that the Python interpreter ought to allow
setting a stack of hooks to be executed just before traceback-emission time?)

I'd also be willing to write a Python function that implements Emacs-style
keybindings for field editing, if that's interesting.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

Don't think of it as `gun control', think of it as `victim
disarmament'. If we make enough laws, we can all be criminals.



From skip at mojam.com  Wed May 24 03:40:02 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 23 May 2000 20:40:02 -0500 (CDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
References: <008a01bfc502$17765260$f2a6b5d4@hagrid>
	<Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
Message-ID: <14635.13042.415077.857803@beluga.mojam.com>

Regarding "Ajuba", Greg wrote:

    what a dumb name...

The top 10 reasons why "Ajuba" is a great name for the former Scriptics:

   10. An accounting error left waaay too much money in the marketing
       budget.  They felt they had to spend it or risk a budget cut next
       year.

    9. It would make a cool name for a dance.  They will now be able to do
       the "Ajuba" at the company's Friday afternoon beer busts.

    8. It's almost palindromic, giving the company's art department all
       sorts of cool nearly symmetric logo possibilities.

    7. It has 7 +/- 2 letters, so when purchasing managers from other
       companies see it flash by in the background of a baseball or
       basketball game on TV they'll be able to remember it.

    6. No programming languages already exist with that name.

    5. It doesn't mean anything bad in any known Indo-European, Asian or
       African language so they won't risk losing market share (what market
       share?) in some obscure third-world country because it means "take a
       flying leap".

    4. It's not already registered in .com, .net, .edu or .org.

    3. No prospective employee will associate the new company name with the
       old, so they'll be able to pull in lots of resumes from people who
       would never have stooped to programming in Tcl for a living.

    2. It's more prounounceable than "Tcl" or "Tcl/Tk" by just about anybody
       who has ever seen English in print.

    1. It doesn't suggest anything, so the company is free to redirect its
       focus any way it wants, including replacing Tcl with Python in future
       versions of its products.

;-)

Skip



From gward at mems-exchange.org  Wed May 24 04:43:53 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Tue, 23 May 2000 22:43:53 -0400
Subject: [Python-Dev] Supporting non-Microsoft compilers
Message-ID: <20000523224352.A997@mems-exchange.org>

A couple of people are working on support in the Distutils for building
extensions on Windows with non-Microsoft compilers.  I think this is
crucial; I hate the idea of requiring people to either download a binary
or shell out megabucks (and support Chairman Bill's monopoly) just to
use some handy Python extension.  (OK, OK, more likely they'll go
without the extension, or go without Python.  But still...)

However, it seems like it would be nice if people could build Python
itself with (eg.) cygwin's gcc or Borland's compiler.  (It might be
essential to properly support building extensions with gcc.)  Has anyone
done anything towards that goal?  It appears that there is at least one
patch floating around that advises people to hack up their installed
config.h, and drop a libpython.a somewhere in the installation, in order
to compile extensions with cygwin gcc and/or mingw32.  This strikes me
as sub-optimal: can at least the required changes to config.h be made to
allow building Python with one of the Windows gcc ports?

I would be willing to hold my nose and struggle with cygwin for a little
while in Windows in dull moments at work -- had to reboot my Linux box
into Windows today in order to test try building CXX (since my VMware
trial license expired), so I might as well leave it there until it
crashes and play with cygwin.

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From gward at mems-exchange.org  Wed May 24 04:49:23 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Tue, 23 May 2000 22:49:23 -0400
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
Message-ID: <20000523224923.A1008@mems-exchange.org>

My post on this from last week was met with a deafening silence, so I
will try to be short and to-the-point this time:

   Why are shared extensions on Solaris linked with "ld -G" instead of
   "gcc -G" when gcc is the compiler used to compile Python and
   extensions?

Is it historical?  Ie. did some versions of Solaris and/or gcc not do
the right thing here?  Could we detect that bogosity in "configure", and
only use "ld -G" if it's necessary, and use "gcc -G" by default?

The reason that using "ld -G" is the wrong thing is that libgcc.a is not
referenced when creating the .so file.  If the object code happens to
reference functions in libgcc.a that are not referenced anywhere in the
Python core, then importing the .so fails.  This happens if there is a
64-bit divide in the object code.  See my post of May 19 for details.

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From fredrik at pythonware.com  Wed May 24 11:42:57 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 24 May 2000 11:42:57 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A9A0C.2E297072@lemburg.com> <024001bfc4cd$68210f00$f2a6b5d4@hagrid>
Message-ID: <009b01bfc564$71606f10$0500a8c0@secret.pythonware.com>

> one might of course the system encoding if the user actually calls setlocale,

I think that was supposed to be:

  one might of course SET the system encoding ONLY if the user actually calls setlocale,

or something...

</F>




From gmcm at hypernet.com  Wed May 24 14:24:20 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Wed, 24 May 2000 08:24:20 -0400
Subject: [Python-Dev] Supporting non-Microsoft compilers
In-Reply-To: <20000523224352.A997@mems-exchange.org>
Message-ID: <1252951401-112791664@hypernet.com>

Greg Ward wrote:

> However, it seems like it would be nice if people could build
> Python itself with (eg.) cygwin's gcc or Borland's compiler.  (It
> might be essential to properly support building extensions with
> gcc.)  Has anyone one anything towards that goal?  

Robert Kern (mingw32) and Gordon Williams (Borland).

> It appears
> that there is at least one patch floating around that advises
> people to hack up their installed config.h, and drop a
> libpython.a somewhere in the installation, in order to compile
> extensions with cygwin gcc and/or mingw32.  This strikes me as
> sub-optimal: can at least the required changes to config.h be
> made to allow building Python with one of the Windows gcc ports?

Robert's starship pages (kernr/mingw32) have a config.h
patched for mingw32.

I believe someone else built Python using cygwin without 
much trouble. But mingw32 is the preferred target - cygwin is 
slow, doesn't thread, has a viral GPL license and only gets 
along with binaries built with cygwin.
 
Robert's web pages talk about a patched mingw32. I don't 
*think* that's true anymore (at least I found no problems in
my limited testing of an unpatched mingw32). The difference 
between mingw32 and cygwin is just what runtime they're built 
for.


- Gordon



From guido at python.org  Wed May 24 16:17:29 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 24 May 2000 09:17:29 -0500
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: Your message of "Tue, 23 May 2000 10:39:11 +0200."
             <m12uADY-000DieC@artcom0.artcom-gmbh.de> 
References: <m12uADY-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005241417.JAA07367@cj20424-a.reston1.va.home.com>

> I agree.  But I think we should at least extend the documentation
> of 'tempfile' (Fred?) to guide people not to write Pythoncode like
> 	mytemp = open(tempfile.mktemp(), "w")
> in programs that are intended to be used on Unix systems by arbitrary
> users (possibly 'root').  Even better:  Someone with enough spare time 
> should add a new function 'mktempfile()', which creates a temporary 
> file and takes care of the security issue and than returns the file 
> handle.  This implementation must take care of race conditions using
> 'os.open' with the following flags:
> 
>        O_CREAT If the file does not exist it will be created.
>        O_EXCL  When used with O_CREAT, if the file already  exist
> 	       it is  an error and the open will fail. 

Have you read a recent (CVS) version of tempfile.py?  It has all this
in the class TemporaryFile()!
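
For example, a minimal usage sketch:

    import tempfile

    f = tempfile.TemporaryFile(mode="w+")
    f.write("scratch data")
    f.seek(0)
    data = f.read()
    f.close()    # the underlying file is removed when f is closed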

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Wed May 24 17:11:12 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 24 May 2000 10:11:12 -0500
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: Your message of "Tue, 23 May 2000 17:23:48 +0200."
             <m12uGX6-000DieC@artcom0.artcom-gmbh.de> 
References: <m12uGX6-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005241511.KAA07512@cj20424-a.reston1.va.home.com>

>    3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files

I agree that this is the correct solution.

> No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or 
> some such somewhere in the 1.6 unmarshaller should do the trick.
> But I don't want to submit a patch, if God^H^HGuido thinks, this isn't
> worth the effort. <wink>

That's BDFL for you, thank you. ;-)

Before accepting the trivial patch, I would like to see some analysis
that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.
This means you have to prove that (a) the 1.5.2 marshal format is a
subset of the 1.6 marshal format (easy enough probably) and (b) the
1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
one seems a little trickier; I don't remember if we moved opcodes or 
changed existing opcodes' semantics.  You may be lucky, but it will
cause an extra constraint on the evolution of the bytecode, so I'm
somewhat reluctant.
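
(A crude way to start on (b), for what it's worth: dump the opcode
table under each interpreter and diff the two listings.)

    import dis

    # Run this under each interpreter version and diff the output.
    for code in range(len(dis.opname)):
        name = dis.opname[code]
        if name[:1] != "<":       # skip unassigned opcode slots
            print("%3d %s" % (code, name))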

--Guido van Rossum (home page: http://www.python.org/~guido/)



From ping at lfw.org  Wed May 24 17:56:49 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 24 May 2000 08:56:49 -0700 (PDT)
Subject: [Python-Dev] 1.6 release date
Message-ID: <Pine.LNX.4.10.10005240855340.465-100000@localhost>

Sorry if i missed an earlier announcement on this topic.

The web page about 1.6 currently says that Python 1.6 will
be released on June 1.  Is that still the target date?


-- ?!ng




From tismer at tismer.com  Wed May 24 20:37:05 2000
From: tismer at tismer.com (Christian Tismer)
Date: Wed, 24 May 2000 20:37:05 +0200
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in 
 Python 1.6. Why not?
References: <m12uGX6-000DieC@artcom0.artcom-gmbh.de> <200005241511.KAA07512@cj20424-a.reston1.va.home.com>
Message-ID: <392C2151.93A0DF24@tismer.com>


Guido van Rossum wrote:
> 
> >    3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files
> 
> I agree that this is the correct solution.
> 
> > No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or
> > some such somewhere in the 1.6 unmarshaller should do the trick.
> > But I don't want to submit a patch, if God^H^HGuido thinks, this isn't
> > worth the effort. <wink>
> 
> That's BDFL for you, thank you. ;-)
> 
> Before accepting the trivial patch, I would like to see some analysis
> that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.
> This means you have to prove that (a) the 1.5.2 marshal format is a
> subset of the 1.6 marshal format (easy enough probably) and (b) the
> 1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
> one seems a little trickier; I don't remember if we moved opcodes or
> changed existing opcodes' semantics.  You may be lucky, but it will
> cause an extra constraint on the evolution of the bytecode, so I'm
> somewhat reluctant.

Be assured, I know the opcodes by heart.
We only appended to the end of the opcode space; there are no changes.
But I can't tell about marshal.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From gstein at lyra.org  Wed May 24 22:15:24 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 24 May 2000 13:15:24 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <009b01bfc564$71606f10$0500a8c0@secret.pythonware.com>
Message-ID: <Pine.LNX.4.10.10005241313300.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Fredrik Lundh wrote:
> > one might of course the system encoding if the user actually calls setlocale,
> 
> I think that was supposed to be:
> 
>   one might of course SET the system encoding ONLY if the user actually calls setlocale,
> 
> or something...

Bleh. Global switches are bogus. Since you can't depend on the setting,
and you can't change it (for fear of busting something else), you
have to be explicit about your encoding all the time. And since you're
never going to rely on a global encoding, why keep it?

This global encoding (per thread or not) just reminds me of the single
hook for import, all over again.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/





From pf at artcom-gmbh.de  Wed May 24 23:34:19 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Wed, 24 May 2000 23:34:19 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: <200005241511.KAA07512@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 24, 2000 10:11:12 am"
Message-ID: <m12uinD-000DieC@artcom0.artcom-gmbh.de>

[...about accepting 1.5.2 generated .pyc files...]

Guido van Rossum:
> Before accepting the trivial patch, I would like to see some analysis
> that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.

Would it be sufficient if a Python 1.6a2 interpreter executable containing
such a trivial patch were able to process the test suite in a 1.5.2 tree with
all the .py files removed?  (Some list.append calls with multiple args
might cause errors, though.)

> This means you have to prove that (a) the 1.5.2 marshal format is a
> subset of the 1.6 marshal format (easy enough probably) and (b) the
> 1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
> one seems a little trickier; I don't remember if we moved opcodes or 
> changed existing opcodes' semantics.  You may be lucky, but it will
> cause an extra constraint on the evolution of the bytecode, so I'm
> somewhat reluctant.

I feel the byte code format is rather mature, and future evolution
is unlikely to remove opcodes, move them to new values, or change the 
semantics of existing opcodes in an incompatible way.  As has been
shown, it is even possible to solve the 1/2 == 0.5 issue with an
upward-compatible extension of the format.

But I feel unable to provide a formal proof other than comparing
1.5.2/Include/opcode.h, 1.5.2/Python/marshal.c and import.c
with the 1.6 ones.

There are certainly others here on python-dev who can do better.
Christian?

BTW: import.c contains the  following comment:
/* XXX Perhaps the magic number should be frozen and a version field
   added to the .pyc file header? */

Judging from my decade long experience with exotic image and CAD data 
formats I think this is always the way to go for binary data files.  
Using this method newer versions of a program can always recognize
the file format version and convert files generated by older versions
in an appropriate way.

Regards, Peter



From esr at thyrsus.com  Thu May 25 00:02:15 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Wed, 24 May 2000 18:02:15 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: <m12uinD-000DieC@artcom0.artcom-gmbh.de>; from pf@artcom-gmbh.de on Wed, May 24, 2000 at 11:34:19PM +0200
References: <200005241511.KAA07512@cj20424-a.reston1.va.home.com> <m12uinD-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000524180215.A10281@thyrsus.com>

Peter Funk <pf at artcom-gmbh.de>:
> BTW: import.c contains the  following comment:
> /* XXX Perhaps the magic number should be frozen and a version field
>    added to the .pyc file header? */
> 
> Judging from my decade long experience with exotic image and CAD data 
> formats I think this is always the way to go for binary data files.  
> Using this method newer versions of a program can always recognize
> the file format version and convert files generated by older versions
> in an appropriate way.

I have similar experience, notably with hacking graphics file formats.
I concur with this recommendation.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The end move in politics is always to pick up a gun.
	-- R. Buckminster Fuller



From gstein at lyra.org  Wed May 24 23:58:48 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 24 May 2000 14:58:48 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6. Why not?
In-Reply-To: <20000524180215.A10281@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005241457000.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Eric S. Raymond wrote:
> Peter Funk <pf at artcom-gmbh.de>:
> > BTW: import.c contains the  following comment:
> > /* XXX Perhaps the magic number should be frozen and a version field
> >    added to the .pyc file header? */
> > 
> > Judging from my decade long experience with exotic image and CAD data 
> > formats I think this is always the way to go for binary data files.  
> > Using this method newer versions of a program can always recognize
> > the file format version and convert files generated by older versions
> > in an appropriate way.
> 
> I have similar experience, notably with hacking graphics file formats.
> I concur with this recommendation.

One more +1 here.

In another thread (right now, actually), I'm discussing how you can hook
up Linux to recognize .pyc files and directly execute them with the Python
interpreter (e.g. no need for #!/usr/bin/env python at the head of the
file). But if that magic number keeps changing, then it makes it a bit
harder to set this up.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From akuchlin at mems-exchange.org  Thu May 25 00:22:46 2000
From: akuchlin at mems-exchange.org (Andrew Kuchling)
Date: Wed, 24 May 2000 18:22:46 -0400 (EDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <20000523204750.A6107@thyrsus.com>
References: <200005232333.TAA16068@amarok.cnri.reston.va.us>
	<20000523204750.A6107@thyrsus.com>
Message-ID: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>

Eric S. Raymond writes:
>Here's a function that ought to be in the Python wrapper associated with
>the module:

There currently is no such wrapper, but there probably should be.
Guess I'll rename the module to _curses, and add a curses.py file.  Or
should there be a curses package, instead?  That would leave room for
more future expansion.  Guido, any opinion?

--amk



From gstein at lyra.org  Thu May 25 00:38:07 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 24 May 2000 15:38:07 -0700 (PDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Andrew Kuchling wrote:
> Eric S. Raymond writes:
> >Here's a function that ought to be in the Python wrapper associated with
> >the module:

Dang. Deleted Eric's note accidentally. Note that the proposed wrapper can
be simplified by using try/finally.
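
Something along these lines, presumably (a sketch only; the name and the
exact terminal-setup calls are guesses, since Eric's original is gone):

    import curses

    def wrapper(func, *args):
        stdscr = curses.initscr()
        try:
            curses.noecho()     # don't echo typed keys
            curses.cbreak()     # react to keys without waiting for Enter
            return func(stdscr, *args)
        finally:
            # restore the terminal even if func() raises
            curses.nocbreak()
            curses.echo()
            curses.endwin()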

> There currently is no such wrapper, but there probably should be.
> Guess I'll rename the module to _curses, and add a curses.py file.  Or
> should there be a curses package, instead?  That would leave room for
> more future expansion.  Guido, any opinion?

Just a file. IMO, a package would be overkill.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From esr at thyrsus.com  Thu May 25 02:26:49 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Wed, 24 May 2000 20:26:49 -0400
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Wed, May 24, 2000 at 06:22:46PM -0400
References: <200005232333.TAA16068@amarok.cnri.reston.va.us> <20000523204750.A6107@thyrsus.com> <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <20000524202649.B10384@thyrsus.com>

Andrew Kuchling <akuchlin at mems-exchange.org>:
> Eric S. Raymond writes:
> >Here's a function that ought to be in the Python wrapper associated with
> >the module:
> 
> There currently is no such wrapper, but there probably should be.
> Guess I'll rename the module to _curses, and add a curses.py file.  Or
> should there be a curses package, instead?  That would leave room for
> more future expansion.  Guido, any opinion?

I'll supply a field-editor function with Emacs-like bindings, too.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

Never trust a man who praises compassion while pointing a gun at you.



From fdrake at acm.org  Thu May 25 04:36:59 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Wed, 24 May 2000 19:36:59 -0700 (PDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005241933380.624-100000@mailhost.beopen.com>

On Wed, 24 May 2000, Andrew Kuchling wrote:
 > There currently is no such wrapper, but there probably should be.
 > Guess I'll rename the module to _curses, and add a curses.py file.  Or
 > should there be a curses package, instead?  That would leave room for
 > more future expansion.  Guido, any opinion?

  I think a package makes sense; some of the libraries that provide widget
sets on top of ncurses would be prime candidates for inclusion.
  The structure should probably be something like:

	curses/
	    __init__.py		# from _curses import *, docstring
	    _curses.so		# current curses module
	    ...


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From gstein at lyra.org  Thu May 25 12:58:27 2000
From: gstein at lyra.org (Greg Stein)
Date: Thu, 25 May 2000 03:58:27 -0700 (PDT)
Subject: [Python-Dev] Larry's need for metacharacters...
Message-ID: <Pine.LNX.4.10.10005250355450.13822-100000@nebula.lyra.org>

[ paraphrased from a LWN letter to the editor ]

Regarding the note posted here last week about Perl development stopping
cuz Larry can't figure out any more characters to place after the '$'
character (to create "special" things) ...

Note that Larry became interested in Unicode a few years ago...

Note that Perl now supports Unicode throughout... *including* variable
names...

Coincidence? I think not!

$\uAB56 = 1;


:-)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Thu May 25 14:22:09 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 25 May 2000 14:22:09 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005241313300.7932-100000@nebula.lyra.org>
Message-ID: <392D1AF1.5AA15F2F@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 24 May 2000, Fredrik Lundh wrote:
> > > one might of course the system encoding if the user actually calls setlocale,
> >
> > I think that was supposed to be:
> >
> >   one might of course SET the system encoding ONLY if the user actually calls setlocale,
> >
> > or something...
> 
> Bleh. Global switches are bogus. Since you can't depend on the setting,
> and you can't change it (for fear of busting something else),

Sure you can: in site.py before any other code using Unicode
gets executed.

> then you
> have to be explicit about your encoding all the time. Since you're never
> going to rely on a global encoding, then why keep it?

For the same reason you use setlocale() in C (and Python): to
make programs portable to other locales without too much
fuss.

> This global encoding (per thread or not) just reminds me of the single
> hook for import, all over again.

Think of it as a configuration switch which is made settable
via a Python interface -- much like the optimize switch or
the debug switch (which are settable via Python APIs in mxTools).
The per-thread implementation is mainly a design question: I
think globals should always be implemented on a per-thread basis.

Hmm, I wish Guido would comment on the idea of keeping the
runtime settable encoding...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at python.org  Thu May 25 17:30:26 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:30:26 -0500
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: Your message of "Tue, 23 May 2000 22:49:23 -0400."
             <20000523224923.A1008@mems-exchange.org> 
References: <20000523224923.A1008@mems-exchange.org> 
Message-ID: <200005251530.KAA11785@cj20424-a.reston1.va.home.com>

[Greg Ward]
> My post on this from last week was met with a deafening silence, so I
> will try to be short and to-the-point this time:
> 
>    Why are shared extensions on Solaris linked with "ld -G" instead of
>    "gcc -G" when gcc is the compiler used to compile Python and
>    extensions?
> 
> Is it historical?  Ie. did some versions of Solaris and/or gcc not do
> the right thing here?  Could we detect that bogosity in "configure", and
> only use "ld -G" if it's necessary, and use "gcc -G" by default?
> 
> The reason that using "ld -G" is the wrong thing is that libgcc.a is not
> referenced when creating the .so file.  If the object code happens to
> reference functions in libgcc.a that are not referenced anywhere in the
> Python core, then importing the .so fails.  This happens if there is a
> 64-bit divide in the object code.  See my post of May 19 for details.

Two excuses: (1) long ago, you really needed to use ld instead of cc
to create a shared library, because cc didn't recognize the flags or
did other things that shouldn't be done to shared libraries; (2) I
didn't know there was a problem with using ld.

Since you have now provided a patch which seems to work, why don't you
check it in...?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Thu May 25 17:35:10 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:35:10 -0500
Subject: [Python-Dev] 1.6 release date
In-Reply-To: Your message of "Wed, 24 May 2000 08:56:49 MST."
             <Pine.LNX.4.10.10005240855340.465-100000@localhost> 
References: <Pine.LNX.4.10.10005240855340.465-100000@localhost> 
Message-ID: <200005251535.KAA11834@cj20424-a.reston1.va.home.com>

[Ping]
> The web page about 1.6 currently says that Python 1.6 will
> be released on June 1.  Is that still the target date?

Obviously I won't make that date...  I'm holding back an official
announcement of the delay until next week so I can combine it with
some good news. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Thu May 25 17:51:44 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:51:44 -0500
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: Your message of "Wed, 24 May 2000 23:34:19 +0200."
             <m12uinD-000DieC@artcom0.artcom-gmbh.de> 
References: <m12uinD-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005251551.KAA11897@cj20424-a.reston1.va.home.com>

Given Christian Tismer's testimonial and inspection of marshal.c, I
think Peter's small patch is acceptable.

A bigger question is whether we should freeze the magic number and add
a version number.  In theory I'm all for that, but it means more
changes; there are several tools (e.g. Lib/py_compile.py,
Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
intimate knowledge of the .pyc file format that would have to be
modified to match.

The current format of a .pyc file is as follows:

bytes 0-3   magic number
bytes 4-7   timestamp (mtime of .py file)
bytes 8-*   marshalled code object

The magic number itself is used to convey various bits of information,
all implicit:

- the Python version
- whether \r and \n are swapped (some old Mac compilers did this)
- whether all string literals are Unicode (experimental -U flag)

The current (1.6) value of the magic number (as a string -- the .pyc
file format is byte order independent) is '\374\304\015\012' on most
platforms; it's '\374\304\012\015' for the old Mac compilers
mentioned; and it's '\375\304\015\012' with -U.
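
In code, reading that layout back looks roughly like this (a sketch using
the imp, marshal and struct modules; it is not taken from any of the tools
mentioned above):

    import imp, marshal, struct

    def read_pyc(path):
        f = open(path, 'rb')
        magic = f.read(4)                          # bytes 0-3: magic number
        if magic != imp.get_magic():
            raise ValueError('bad or stale magic number in %s' % path)
        mtime = struct.unpack('<l', f.read(4))[0]  # bytes 4-7: .py mtime
        code = marshal.load(f)                     # bytes 8-*: code object
        f.close()
        return mtime, code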

Can anyone come up with a proposal?  I'm swamped!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Thu May 25 17:52:54 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:52:54 -0500
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: Your message of "Wed, 24 May 2000 15:38:07 MST."
             <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org> 
References: <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org> 
Message-ID: <200005251552.KAA11922@cj20424-a.reston1.va.home.com>

> > There currently is no such wrapper, but there probably should be.
> > Guess I'll rename the module to _curses, and add a curses.py file.  Or
> > should there be a curses package, instead?  That would leave room for
> > more future expansion.  Guido, any opinion?

Whatever -- either way is fine with me!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From DavidA at ActiveState.com  Thu May 25 23:42:51 2000
From: DavidA at ActiveState.com (David Ascher)
Date: Thu, 25 May 2000 14:42:51 -0700
Subject: [Python-Dev] ActiveState news
Message-ID: <PLEJJNOHDIGGLDPOGPJJMELECEAA.DavidA@ActiveState.com>

While not a technical point, I thought I'd mention to this group that
ActiveState just announced several things, including some Python-related
projects.  See www.ActiveState.com for details.

--david

PS: In case anyone's still under the delusion that cool Python jobs are hard
to find, let me know. =)




From bwarsaw at python.org  Sat May 27 00:42:10 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Fri, 26 May 2000 18:42:10 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
Message-ID: <14638.64962.118047.467438@localhost.localdomain>

Hi all,

I've taken /F's C implementation of the standard class-based
exceptions, implemented the stuff he left out, proofread for reference
counting issues, hacked a bit more, and integrated it with the 1.6
interpreter.  Everything seems to work well; i.e. the regression test
suite passes and I don't get any core dumps ;).

I don't have the ability right now to Purify things[1], but I've tried
to be very careful in handling reference counting.  Since I've been
hacking on this all day, it could definitely use another set of eyes.
I think rather than email a huge patch kit, I'll just go ahead and
check the changes in.  Please take a look and give it a hard twist.

Thanks to /F for the excellent head start!
-Barry

[1] Purify was one of the coolest products on Solaris, but alas it
doesn't seem like they'll ever support Linux.  What do you all use to
do similar memory verification tests on Linux?  Or do you just not?



From bwarsaw at python.org  Sat May 27 01:24:48 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Fri, 26 May 2000 19:24:48 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
Message-ID: <14639.1984.920885.635040@localhost.localdomain>

I'm all done checking this stuff in.
-Barry



From gstein at lyra.org  Fri May 26 01:29:19 2000
From: gstein at lyra.org (Greg Stein)
Date: Thu, 25 May 2000 16:29:19 -0700 (PDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Modules _exceptions.c,NONE,1.1
In-Reply-To: <200005252318.QAA25455@slayer.i.sourceforge.net>
Message-ID: <Pine.LNX.4.10.10005251627061.16846-100000@nebula.lyra.org>

On Thu, 25 May 2000, Barry Warsaw wrote:
> Update of /cvsroot/python/python/dist/src/Modules
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv25441
> 
> Added Files:
> 	_exceptions.c 
> Log Message:
> Built-in class-based standard exceptions.  Written by Fredrik Lundh.
> Modified, proofread, and integrated for Python 1.6 by Barry Warsaw.

Since the added files are not emailed, you can easily see this file at:

http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/python/dist/src/Modules/_exceptions.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=python


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gward at python.net  Fri May 26 00:33:54 2000
From: gward at python.net (Greg Ward)
Date: Thu, 25 May 2000 18:33:54 -0400
Subject: [Python-Dev] Terminology question
Message-ID: <20000525183354.A422@beelzebub>

A question of terminology: frequently in the Distutils docs I need to
refer to the package-that-is-not-a-package, ie. the "root" or "empty"
package.  I can't decide if I prefer "root package", "empty package" or
what.  ("Empty" just means the *name* is empty, so it's probably not a
very good thing to say "empty package" -- but "package with no name" or
"unnamed package" aren't much better.)

Is there some accepted convention that I have missed?

Here's the definition I've just written for the "Distribution Python
Modules" manual:

\item[root package] the ``package'' that modules not in a package live
  in.  The vast majority of the standard library is in the root package,
  as are many small, standalone third-party modules that don't belong to
  a larger module collection.  (The root package isn't really a package,
  since it doesn't have an \file{\_\_init\_\_.py} file.  But we have to
  call it something.)

Confusing enough?  I thought so...

        Greg
-- 
Greg Ward - Unix nerd                                   gward at python.net
http://starship.python.net/~gward/
Beware of altruism.  It is based on self-deception, the root of all evil.



From guido at python.org  Fri May 26 03:50:24 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 20:50:24 -0500
Subject: [Python-Dev] Terminology question
In-Reply-To: Your message of "Thu, 25 May 2000 18:33:54 -0400."
             <20000525183354.A422@beelzebub> 
References: <20000525183354.A422@beelzebub> 
Message-ID: <200005260150.UAA10169@cj20424-a.reston1.va.home.com>

Greg,

If you have to refer to it as a package (which I don't doubt), the
correct name is definitely the "root package".

A possible clarification of your glossary entry:

\item[root package] the root of the hierarchy of packages.  (This
isn't really a package, since it doesn't have an
\file{\_\_init\_\_.py} file.  But we have to call it something.)  The
vast majority of the standard library is in the root package, as are
many small, standalone third-party modules that don't belong to a
larger module collection.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gward at python.net  Fri May 26 04:22:03 2000
From: gward at python.net (Greg Ward)
Date: Thu, 25 May 2000 22:22:03 -0400
Subject: [Python-Dev] Where to install non-code files
Message-ID: <20000525222203.A1114@beelzebub>

Another one for the combined distutils/python-dev braintrust; apologies
to those of you on both lists, but this is yet another distutils issue
that treads on python-dev territory.

The problem is this: some module distributions need to install files
other than code (modules, extensions, and scripts).  One example close
to home is the Distutils; it has a "system config file" and will soon
have a stub executable for creating Windows installers.

On Windows and Mac OS, clearly these should go somewhere under
sys.prefix: this is the directory for all things Python, including
third-party module distributions.  If Brian Hooper distributes a module
"foo" that requires a data file containing character encoding data (yes,
this is based on a true story), then the module belongs in (eg.)
C:\Python and the data file in (?) C:\Python\Data.  (Maybe
C:\Python\Data\foo, but that's a minor wrinkle.)

Any disagreement so far?

Anyways, what's bugging me is where to put these files on Unix.
<prefix>/lib/python1.x is *almost* the home for all things Python, but
not quite.  (Let's ignore platform-specific files for now: they don't
count as "miscellaneous data files", which is what I'm mainly concerned
with.)

Currently, misc. data files are put in <prefix>/share, and the
Distutil's config file is searched for in the directory of the distutils
package -- ie. site-packages/distutils under 1.5.2 (or
~/lib/python/distutils if that's where you installed it, or ./distutils
if you're running from the source directory, etc.).  I'm not thrilled
with either of these.

My inclination is to nominate a directory under <prefix>/lib/python1.x
for this sort of file: I'm not sure if I want to call it "etc" or "share"
or "data" or what, but it would be treading in Python-space.  It would
break the ability to have a standard library package called "etc" or
"share" or "data" or whatever, but dammit it's convenient.

Better ideas?

        Greg
-- 
Greg Ward - "always the quiet one"                      gward at python.net
http://starship.python.net/~gward/
I have many CHARTS and DIAGRAMS..



From mhammond at skippinet.com.au  Fri May 26 04:35:47 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 26 May 2000 12:35:47 +1000
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000525222203.A1114@beelzebub>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEKACLAA.mhammond@skippinet.com.au>

> On Windows and Mac OS, clearly these should go somewhere under
> sys.prefix: this is the directory for all things Python, including
> third-party module distributions.  If Brian Hooper distributes a module
> "foo" that requires a data file containing character encoding data (yes,
> this is based on a true story), then the module belongs in (eg.)
> C:\Python and the data file in (?) C:\Python\Data.  (Maybe
> C:\Python\Data\foo, but that's a minor wrinkle.)
>
> Any disagreement so far?

A little.  I don't think we need a new dump for arbitrary files that no one
can associate with their application.

Why not put the data with the code?  It is quite trivial for a Python
package or module to find its own location, and this way we are not
dependent on anything.

Why assume packages are installed _under_ Python?  Why not just assume the
package is _reachable_ by Python?  Once our package/module is being
executed by Python, we know exactly where we are.
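
For example (a sketch; the helper name and the data file name are made up):

    import os

    # directory holding this module, wherever the package was installed
    _here = os.path.dirname(os.path.abspath(__file__))

    def open_data(name):
        # e.g. open_data('encodings.dat') for a character-encoding table
        return open(os.path.join(_here, name), 'rb')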

On my machine, there is no "data" equivalent; the closest would be
"python-cvs\pcbuild\data", and that certainly doesn't make sense.  Why can't
I just place it where I put all my other Python extensions, ensure it is on
the PythonPath, and have it "just work"?

It sounds a little complicated - do we provide an API for this magic
location, or does everybody cut-and-paste a reference implementation for
locating it?  Either way sounds pretty bad - the API shouldn't be
distutils-dependent (I may not have installed this package via distutils),
and really Python itself shouldn't care about this...

So all in all, I don't think it is a problem we need to push up to this
level - let each package author do whatever makes sense, and point out how
trivial it would be if you assumed code and data in the same place/tree.

[If the data is considered read/write, then you need a better answer
anyway, as you can't assume "c:\python\data" is writable (when actually
running the code) any more than "c:\python\my_package" is]

Mark.




From fdrake at acm.org  Fri May 26 05:05:40 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Thu, 25 May 2000 20:05:40 -0700 (PDT)
Subject: [Python-Dev] Re: [Distutils] Terminology question
In-Reply-To: <20000525183354.A422@beelzebub>
Message-ID: <Pine.LNX.4.10.10005252003180.7550-100000@mailhost.beopen.com>

On Thu, 25 May 2000, Greg Ward wrote:
 > A question of terminology: frequently in the Distutils docs I need to
 > refer to the package-that-is-not-a-package, ie. the "root" or "empty"
 > package.  I can't decide if I prefer "root package", "empty package" or
 > what.  ("Empty" just means the *name* is empty, so it's probably not a
 > very good thing to say "empty package" -- but "package with no name" or
 > "unnamed package" aren't much better.)

  Well, it's not a package -- it's similar to Java's unnamed package, but
the idea that it's a package has never been advanced.  Why not just call
it the global module space (or namespace)?  That's the only way I've heard
it described, and it's more clear than "empty package" or "unnamed
package".


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From fdrake at acm.org  Fri May 26 06:47:10 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Thu, 25 May 2000 21:47:10 -0700 (PDT)
Subject: [Python-Dev] C implementation of exceptions module
In-Reply-To: <14638.64962.118047.467438@localhost.localdomain>
Message-ID: <Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>

On Fri, 26 May 2000, Barry A. Warsaw wrote:
 > [1] Purify was one of the coolest products on Solaris, but alas it
 > doesn't seem like they'll ever support Linux.  What do you all use to
 > do similar memory verification tests on Linux?  Or do you just not?

  I'm not aware of anything as good, but there's "memprof" (check for it
with "rpm -q"), and I think a few others.  Checker is a malloc() & friends
implementation that can be used to detect memory errors:

	http://www.gnu.org/software/checker/checker.html

and there's ElectricFence from Bruce Perens:

	http://www.perens.com/FreeSoftware/

(There's a Mailman-related link there as well that you might be interested
in!)
  There may be others, and I can't speak to the quality of these as I've
not used any of them (yet).  memprof and ElectricFence were installed on
my Mandrake box without my doing anything about it; I don't know if Red Hat
installs them on a stock development box.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From tim_one at email.msn.com  Fri May 26 07:27:13 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 26 May 2000 01:27:13 -0400
Subject: [Python-Dev] ActiveState news
In-Reply-To: <PLEJJNOHDIGGLDPOGPJJMELECEAA.DavidA@ActiveState.com>
Message-ID: <000501bfc6d3$0c3a2700$c52d153f@tim>

[David Ascher]
> While not a technical point, I thought I'd mention to this group that
> ActiveState just announced several things, including some Python-related
> projects.  See www.ActiveState.com for details.

Thanks for pointing that out!  I just took a natural opportunity to plug the
Visual Studio integration on c.l.py:  it's very important that we do
everything we can to support and promote commercial Python endeavors at
every conceivable opportunity <wink>.

> PS: In case anyone's still under the delusion that cool Python
> jobs are hard to find, let me know. =)

Ditto cool speech recognition jobs in small companies about to be devoured
by Belgian conquerors.  And if anyone is under the illusion that golden
handcuffs don't bind, I can set 'em  straight on that one too.

hiring-is-darned-hard-everywhere-ly y'rs  - tim





From gstein at lyra.org  Fri May 26 09:48:12 2000
From: gstein at lyra.org (Greg Stein)
Date: Fri, 26 May 2000 00:48:12 -0700 (PDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise
 on case-sensitivity)
In-Reply-To: <000401bfc6d3$0afb3e60$c52d153f@tim>
Message-ID: <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>

On Fri, 26 May 2000, Tim Peters wrote:
>...
> PS:  Barry's exception patch appears to have broken the CVS Windows build
> (nothing links anymore; all the PyExc_xxx symbols aren't found; no time to
> dig more now).

The .dsp file(s) need to be updated to include the new _exceptions.c file
in their build and link step. (the symbols moved there)

IMO, it seems it would be Better(tm) to put _exceptions.c into the Python/
directory. Dependencies from the core out to Modules/ seems a bit weird.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From pf at artcom-gmbh.de  Fri May 26 10:23:18 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 10:23:18 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <200005251551.KAA11897@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 25, 2000 10:51:44 am"
Message-ID: <m12vFOo-000DieC@artcom0.artcom-gmbh.de>

[Guido van Rossum]:
> Given Christian Tismer's testimonial and inspection of marshal.c, I
> think Peter's small patch is acceptable.
> 
> A bigger question is whether we should freeze the magic number and add
> a version number.  In theory I'm all for that, but it means more
> changes; there are several tools (e.c. Lib/py_compile.py,
> Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> intimate knowledge of the .pyc file format that would have to be
> modified to match.
> 
> The current format of a .pyc file is as follows:
> 
> bytes 0-3   magic number
> bytes 4-7   timestamp (mtime of .py file)
> bytes 8-*   marshalled code object

Proposal:
The future format (Python 1.6 and newer) of a .pyc file should be as follows:

bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
bytes 4-7   a version number (which should be == 1 in Python 1.6)
bytes 8-11  timestamp (mtime of .py file) (same as earlier)
bytes 12-*  marshalled code object (same as earlier)

> The magic number itself is used to convey various bits of information,
> all implicit:
[...]
This mechanism to construct the magic number should not be changed.

But now, once again, a new value must be chosen to prevent havoc
with .pyc files already floating around from people who have played with
the Python 1.6 alpha releases.  This change, however, should definitely be
the last one ever to happen during the future lifetime of Python.

The unmarshaller should do the following with the magic number it reads:
if the magic read is the old 1.5.2 magic number, skip reading a
version number and assume 0 as the version number.

If the magic read is this new value instead, it should also read the
version number and raise a new 'ByteCodeTooNew' exception if the
version number read is greater than a #define'd version number of this
Python interpreter.

If future incompatible extensions to the byte code format happen, 
this number should be incremented to 2, 3, and so on.

For safety, 'imp.get_magic()' should return the old 1.5.2 magic
number and only 'imp.get_magic(imp.PYC_FINAL)' should return the new 
final magic number.  A new function 'imp.get_version()' should be 
introduced, which will return the current compiled in version number
of this Python interpreter.

Of course all Python modules reading .pyc files must be changed 
accordingly, so that they are able to deal with the new .pyc files.  
This shouldn't be too hard.
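
Roughly, the loading logic would then look like this (a sketch; the magic
constants are placeholders, not the real byte values):

    import marshal, struct

    OLD_MAGIC = b'OLD?'     # placeholder: the frozen 1.5.2 magic bytes
    NEW_MAGIC = b'NEW?'     # placeholder: the new, final magic bytes
    THIS_VERSION = 1        # the .pyc version this interpreter understands

    class ByteCodeTooNew(Exception):
        pass

    def read_pyc(f):
        magic = f.read(4)
        if magic == OLD_MAGIC:
            version = 0                   # 1.5.2 file: no version field
        elif magic == NEW_MAGIC:
            version = struct.unpack('<l', f.read(4))[0]
            if version > THIS_VERSION:
                raise ByteCodeTooNew('version %d not supported' % version)
        else:
            raise ValueError('bad magic number in .pyc file')
        mtime = struct.unpack('<l', f.read(4))[0]
        return version, mtime, marshal.load(f)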

This proposed change of the .pyc file format must be described in the final
Python 1.6 announcement, since there are people out there who may have
borrowed code from 'Tools/scripts/checkpyc.py' or some such.

Regards, Peter



From mal at lemburg.com  Fri May 26 10:37:53 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 10:37:53 +0200
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
Message-ID: <392E37E1.75AC4D0E@lemburg.com>

> Update of /cvsroot/python/python/dist/src/Lib
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv25262
> 
> Modified Files:
>         exceptions.py 
> Log Message:
> For backwards compatibility, simply import everything from the
> _exceptions module, including __doc__.

Hmm, wasn't _exceptions supposed to be a *fall back* solution for
the case where the exceptions.py module is not found ? It now
looks like _exceptions replaces exceptions.py...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri May 26 12:48:05 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 12:48:05 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vFOo-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <392E5665.8CB1C260@lemburg.com>

Peter Funk wrote:
> 
> [Guido van Rossum]:
> > Given Christian Tismer's testimonial and inspection of marshal.c, I
> > think Peter's small patch is acceptable.
> >
> > A bigger question is whether we should freeze the magic number and add
> > a version number.  In theory I'm all for that, but it means more
> > changes; there are several tools (e.c. Lib/py_compile.py,
> > Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> > intimate knowledge of the .pyc file format that would have to be
> > modified to match.
> >
> > The current format of a .pyc file is as follows:
> >
> > bytes 0-3   magic number
> > bytes 4-7   timestamp (mtime of .py file)
> > bytes 8-*   marshalled code object
> 
> Proposal:
> The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> 
> bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> bytes 4-7   a version number (which should be == 1 in Python 1.6)
> bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> bytes 12-*  marshalled code object (same as earlier)

This will break all tools relying on having the code object available
in bytes[8:] and believe me: there are lots of those around ;-)

You cannot really change the file header, only add things to the end
of the PYC file...

Hmm, or perhaps we should move the version number to the code object
itself... after all, the changes we want to refer to
using the version number are located in the code object and not the
PYC file layout. Unmarshalling it would then raise the error.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gmcm at hypernet.com  Fri May 26 13:53:14 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 26 May 2000 07:53:14 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000525222203.A1114@beelzebub>
Message-ID: <1252780469-123073242@hypernet.com>

Greg Ward wrote:

[installing data files]

> On Windows and Mac OS, clearly these should go somewhere under
> sys.prefix: this is the directory for all things Python,
> including third-party module distributions.  If Brian Hooper
> distributes a module "foo" that requires a data file containing
> character encoding data (yes, this is based on a true story),
> then the module belongs in (eg.) C:\Python and the data file in
> (?) C:\Python\Data.  (Maybe C:\Python\Data\foo, but that's a
> minor wrinkle.)
> 
> Any disagreement so far?

Yeah. I tend to install stuff outside the sys.prefix tree and then 
use .pth files. I realize I'm, um, unique in this regard but I lost 
everything in some upgrade gone bad. (When a Windows de-
install goes wrong, your only option is to do some manual 
directory and registry pruning.)

I often do much the same on my Linux box, but I don't worry 
about it as much - upgrading is not "click and pray" there. 
(Hmm, I guess it is if you use rpms.)
 
So for Windows, I agree with Mark - put the data with the 
module. On a real OS, I guess I'd be inclined to put global 
data with the module, but user data in ~/.<something>.

> Greg Ward - "always the quiet one"                     
<snort>


- Gordon



From pf at artcom-gmbh.de  Fri May 26 13:50:02 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 13:50:02 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <392E5665.8CB1C260@lemburg.com> from "M.-A. Lemburg" at "May 26, 2000 12:48: 5 pm"
Message-ID: <m12vIcs-000DieC@artcom0.artcom-gmbh.de>

[M.-A. Lemburg]:
> > Proposal:
> > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > 
> > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > bytes 12-*  marshalled code object (same as earlier)
> 
> This will break all tools relying on having the code object available
> in bytes[8:] and believe me: there are lots of those around ;-)

In some way, this is intentional:  if these tools (are there really
that many out there that munge .pyc byte code files?) simply use
'imp.get_magic()' and then silently assume a specific content of the
marshalled code object, they probably need changes anyway, since the
code needed to deal with the new unicode object is missing from them.

> You cannot really change the file header, only add things to the end
> of the PYC file...

Why?  Will this idea really cause such earth-quaking grumbling?
Please review this in the context of my proposal to change 'imp.get_magic()'
to return the old 1.5.2 MAGIC when called without a parameter.

> Hmm, or perhaps we should move the version number to the code object
> itself... after all, the changes we want to refer to
> using the version number are located in the code object and not the
> PYC file layout. Unmarshalling it would then raise the error.

Since the file layout is a very thin layer around the marshalled
code object, this really makes no big difference to me.  But it
will be harder to come up with reasonable entries for /etc/magic [1]
and similar mechanisms.

Putting the version number at the end of the file is possible.
But such a solution is somewhat "dirty" and only gives the false
impression that the general file layout (pyc[8:] instead of pyc[12:])
is something you can rely on until the end of time.  Hardcoding the
size of an unpadded header (something like using buffer[8:]) is IMO
bad style anyway.

Regards, Peter
[1]: /etc/magic on Unices is a small textual data base used by the 'file' 
     command to identify the type of a file by looking at the first
     few bytes.  Unix file managers may either use /etc/magic directly
     or a similar scheme to associate files with mimetypes and/or default
     applications.



From guido at python.org  Fri May 26 15:10:30 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:10:30 -0500
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
In-Reply-To: Your message of "Fri, 26 May 2000 00:48:12 MST."
             <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org> 
References: <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org> 
Message-ID: <200005261310.IAA11256@cj20424-a.reston1.va.home.com>

> The .dsp file(s) need to be updated to include the new _exceptions.c file
> in their build and link step. (the symbols moved there)

I'll take care of this.

> IMO, it seems it would be Better(tm) to put _exceptions.c into the Python/
> directory. Dependencies from the core out to Modules/ seems a bit weird.

Good catch!  Since Barry's contemplating renaming it to exceptions.c
anyway that would be a good time to move it.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Fri May 26 15:13:06 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:13:06 -0500
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: Your message of "Fri, 26 May 2000 07:53:14 -0400."
             <1252780469-123073242@hypernet.com> 
References: <1252780469-123073242@hypernet.com> 
Message-ID: <200005261313.IAA11285@cj20424-a.reston1.va.home.com>

> So for Windows, I agree with Mark - put the data with the 
> module. On a real OS, I guess I'd be inclined to put global 
> data with the module, but user data in ~/.<something>.

Aha!  Good distinction.

Modifiable data needs to go in a per-user directory, even on Windows,
outside the Python tree.

But static data needs to go in the same directory as the module that
uses it.  (We use this in the standard test package, for example.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gward at mems-exchange.org  Fri May 26 14:24:23 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 08:24:23 -0400
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: <200005251530.KAA11785@cj20424-a.reston1.va.home.com>; from guido@python.org on Thu, May 25, 2000 at 10:30:26AM -0500
References: <20000523224923.A1008@mems-exchange.org> <200005251530.KAA11785@cj20424-a.reston1.va.home.com>
Message-ID: <20000526082423.A12100@mems-exchange.org>

On 25 May 2000, Guido van Rossum said:
> Two excuses: (1) long ago, you really needed to use ld instead of cc
> to create a shared library, because cc didn't recognize the flags or
> did other things that shouldn't be done to shared libraries; (2) I
> didn't know there was a problem with using ld.
> 
> Since you have now provided a patch which seems to work, why don't you
> check it in...?

Done.  I presume checking in configure.in and configure at the same time
is the right thing to do?  (I checked, and running "autoconf" on the
original configure.in regenerated exactly what's in CVS.)

        Greg



From thomas.heller at ion-tof.com  Fri May 26 14:28:49 2000
From: thomas.heller at ion-tof.com (Thomas Heller)
Date: Fri, 26 May 2000 14:28:49 +0200
Subject: [Distutils] Re: [Python-Dev] Where to install non-code files
References: <1252780469-123073242@hypernet.com>  <200005261313.IAA11285@cj20424-a.reston1.va.home.com>
Message-ID: <01ee01bfc70d$f1f17a20$4500a8c0@thomasnb>

[Guido writes]
> Modifyable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.
> 
This seems to be the value of the "AppData" key stored under
  HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders

Right?
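
Presumably something like this (a sketch; it assumes the _winreg module or
equivalent registry access is available):

    import _winreg

    key = _winreg.OpenKey(
        _winreg.HKEY_CURRENT_USER,
        r"Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders")
    # e.g. C:\Documents and Settings\<user>\Application Data
    appdata, valuetype = _winreg.QueryValueEx(key, "AppData")
    _winreg.CloseKey(key)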

Thomas




From guido at python.org  Fri May 26 15:35:40 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:35:40 -0500
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: Your message of "Fri, 26 May 2000 08:24:23 -0400."
             <20000526082423.A12100@mems-exchange.org> 
References: <20000523224923.A1008@mems-exchange.org> <200005251530.KAA11785@cj20424-a.reston1.va.home.com>  
            <20000526082423.A12100@mems-exchange.org> 
Message-ID: <200005261335.IAA11410@cj20424-a.reston1.va.home.com>

> Done.  I presume checking in configure.in and configure at the same time
> is the right thing to do?  (I checked, and running "autoconf" on the
> original configure.in regenerated exactly what's in CVS.)

Yes.  What I usually do is manually bump the version number in
configure before checking it in (it references the configure.in
version) but that's a minor nit...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From pf at artcom-gmbh.de  Fri May 26 14:36:36 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 14:36:36 +0200 (MEST)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <200005261313.IAA11285@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 26, 2000  8:13: 6 am"
Message-ID: <m12vJLw-000DieC@artcom0.artcom-gmbh.de>

[Guido van Rossum]
[...]
> Modifyable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.

Is there a reliable algorithm to find a "per-user" directory on any
Win95/98/NT/2000 system?  On MacOS?  

Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
module provided 'os.environ["HOME"]', similar to the posix
version?  This would certainly simplify the task of application
programmers intending to write portable applications.
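
Until then, portable application code tends to end up with a fallback
chain like this (a sketch; the environment variable names are just the
usual candidates, not an official API):

    import os

    def get_home():
        for var in ('HOME', 'USERPROFILE', 'APPDATA'):
            value = os.environ.get(var)
            if value:
                return value
        return os.curdir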

Regards, Peter



From bwarsaw at python.org  Sat May 27 14:46:44 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Sat, 27 May 2000 08:46:44 -0400 (EDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise
 on case-sensitivity)
References: <000401bfc6d3$0afb3e60$c52d153f@tim>
	<Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>
Message-ID: <14639.50100.383806.969434@localhost.localdomain>

>>>>> "GS" == Greg Stein <gstein at lyra.org> writes:

    GS> On Fri, 26 May 2000, Tim Peters wrote:
    >> ...  PS: Barry's exception patch appears to have broken the CVS
    >> Windows build (nothing links anymore; all the PyExc_xxx symbols
    >> aren't found; no time to dig more now).

    GS> The .dsp file(s) need to be updated to include the new
    GS> _exceptions.c file in their build and link step. (the symbols
    GS> moved there)

    GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
    GS> the Python/ directory. Dependencies from the core out to
    GS> Modules/ seems a bit weird.

Guido made the suggestion to move _exceptions.c to exceptions.c
anyway.  Should we move the file to the other directory too?  Get out
your plusses and minuses.

-Barry



From bwarsaw at python.org  Sat May 27 14:49:01 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Sat, 27 May 2000 08:49:01 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
	<392E37E1.75AC4D0E@lemburg.com>
Message-ID: <14639.50237.999048.146898@localhost.localdomain>

>>>>> "M" == M  <mal at lemburg.com> writes:

    M> Hmm, wasn't _exceptions supposed to be a *fall back* solution
    M> for the case where the exceptions.py module is not found ? It
    M> now looks like _exceptions replaces exceptions.py...

I see no reason to keep both of them around.  Too much of a
synchronization headache.

-Barry



From mhammond at skippinet.com.au  Fri May 26 15:12:49 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 26 May 2000 23:12:49 +1000
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <m12vJLw-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEKICLAA.mhammond@skippinet.com.au>

> Is there a reliable algorithm to find a "per-user" directory on any
> Win95/98/NT/2000 system?

Ahhh - where to start.  SHGetFolderLocation offers the following
alternatives:

CSIDL_APPDATA
Version 4.71. File system directory that serves as a common repository for
application-specific data. A typical path is C:\Documents and
Settings\username\Application Data

CSIDL_COMMON_APPDATA
Version 5.0. Application data for all users. A typical path is C:\Documents
and Settings\All Users\Application Data.

CSIDL_LOCAL_APPDATA
Version 5.0. File system directory that serves as a data repository for
local (non-roaming) applications. A typical path is C:\Documents and
Settings\username\Local Settings\Application Data.

CSIDL_PERSONAL
File system directory that serves as a common repository for documents. A
typical path is C:\Documents and Settings\username\My Documents.

Plus a few I didn't bother listing...
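
Fetching one of these from Python would look roughly like this (a sketch;
it assumes a win32all build that exposes SHGetFolderPath and the CSIDL
constants):

    from win32com.shell import shell, shellcon

    # per-user application data directory, e.g.
    # C:\Documents and Settings\<user>\Application Data
    appdata = shell.SHGetFolderPath(0, shellcon.CSIDL_APPDATA, None, 0)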

<sigh>

Mark.




From jlj at cfdrc.com  Fri May 26 15:20:34 2000
From: jlj at cfdrc.com (Lyle Johnson)
Date: Fri, 26 May 2000 08:20:34 -0500
Subject: [Python-Dev] RE: [Distutils] Terminology question
In-Reply-To: <20000525183354.A422@beelzebub>
Message-ID: <003c01bfc715$2c8fde90$4e574dc0@cfdrc.com>

How about "PWAN", the "package without a name"? ;)

> -----Original Message-----
> From: distutils-sig-admin at python.org
> [mailto:distutils-sig-admin at python.org]On Behalf Of Greg Ward
> Sent: Thursday, May 25, 2000 5:34 PM
> To: distutils-sig at python.org; python-dev at python.org
> Subject: [Distutils] Terminology question
> 
> 
> A question of terminology: frequently in the Distutils docs I need to
> refer to the package-that-is-not-a-package, ie. the "root" or "empty"
> package.  I can't decide if I prefer "root package", "empty package" or
> what.  ("Empty" just means the *name* is empty, so it's probably not a
> very good thing to say "empty package" -- but "package with no name" or
> "unnamed package" aren't much better.)
> 
> Is there some accepted convention that I have missed?
> 
> Here's the definition I've just written for the "Distribution Python
> Modules" manual:
> 
> \item[root package] the ``package'' that modules not in a package live
>   in.  The vast majority of the standard library is in the root package,
>   as are many small, standalone third-party modules that don't belong to
>   a larger module collection.  (The root package isn't really a package,
>   since it doesn't have an \file{\_\_init\_\_.py} file.  But we have to
>   call it something.)
> 
> Confusing enough?  I thought so...
> 
>         Greg
> -- 
> Greg Ward - Unix nerd                                   gward at python.net
> http://starship.python.net/~gward/
> Beware of altruism.  It is based on self-deception, the root of all evil.
> 
> _______________________________________________
> Distutils-SIG maillist  -  Distutils-SIG at python.org
> http://www.python.org/mailman/listinfo/distutils-sig
> 



From skip at mojam.com  Fri May 26 10:25:49 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 26 May 2000 03:25:49 -0500 (CDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
In-Reply-To: <14639.50237.999048.146898@localhost.localdomain>
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
	<392E37E1.75AC4D0E@lemburg.com>
	<14639.50237.999048.146898@localhost.localdomain>
Message-ID: <14638.13581.195350.511944@beluga.mojam.com>

    M> Hmm, wasn't _exceptions supposed to be a *fall back* solution for the
    M> case where the exceptions.py module is not found ? It now looks like
    M> _exceptions replaces exceptions.py...

    BAW> I see no reason to keep both of them around.  Too much of a
    BAW> synchronization headache.

Well, wait a minute.  Is Nick's third revision of his
AttributeError/NameError enhancement still on the table?  If so,
exceptions.py is the right place to put it.  In that case, I would recommend
that exceptions.py still be the file that is loaded.  It would take care of
importing _exceptions.

Oh, BTW.. +1 on Nick's latest version.

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From gward at mems-exchange.org  Fri May 26 15:27:16 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 09:27:16 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <1252780469-123073242@hypernet.com>; from gmcm@hypernet.com on Fri, May 26, 2000 at 07:53:14AM -0400
References: <20000525222203.A1114@beelzebub> <1252780469-123073242@hypernet.com>
Message-ID: <20000526092716.B12100@mems-exchange.org>

On 26 May 2000, Gordon McMillan said:
> Yeah. I tend to install stuff outside the sys.prefix tree and then 
> use .pth files. I realize I'm, um, unique in this regard but I lost 
> everything in some upgrade gone bad. (When a Windows de-
> install goes wrong, your only option is to do some manual 
> directory and registry pruning.)

I think that's appropriate for Python "applications" -- in fact, now
that Distutils can install scripts and miscellaneous data, about the
only thing needed to properly support "applications" is an easy way for
developers to say, "Please give me my own directory and create a .pth
file".  (Actually, the .pth file should only be one way to install an
application: you might not want your app's Python library to muck up
everybody else's Python path.  An idea AMK and I cooked up yesterday
would be an addition to the Distutils "build_scripts" command: along
with frobbing the #! line to point to the right Python interpreter, add
a second line:
  import sys ; sys.path.append(path-to-this-app's-python-lib)

Or maybe "sys.path.insert(0, ...)".

Anyways, that's neither here nor there.  Except that applications that
get their own directory should be free to put their (static) data files
wherever they please, rather than having to put them in the app's Python
library.

I'm more concerned with what the Distutils works best with now,
though: module distributions.  I think you guys have convinced me;
static data should normally sit with the code.  I think I'll make that
the default (instead of prefix + "share"), but give developers a way to
override it.  So eg.:

   data_files = ["this.dat", "that.cfg"]

will put the files in the same place as the code (which could be a bit
tricky to figure out, what with the vagaries of package-ization and
"extra" install dirs);

   data_files = [("share", ["this.dat"]), ("etc", ["that.cfg"])]

would put the data file in (eg.) /usr/local/share and the config file in
/usr/local/etc.  This obviously makes the module writer's job harder: he
has to grovel from sys.prefix looking for the files that he expects to
have been installed with his modules.  But if someone really wants to do
this, they should be allowed to.

Finally, you could also put absolute directories in 'data_files',
although this would not be recommended.

> (Hmm, I guess it is if you use rpms.)

All the smart Unix installers (RPM, Debian, FreeBSD, ...?) I know of
have some sort of dependency mechanism, which works to varying degrees
of "work".  I'm only familiar with RPM, and my usual response to a
dependency warning is "dammit, I know what I'm doing", and then I rerun
"rpm --nodeps" to ignore the dependency checking.  (This usually arises
because I build my own Perl and Python, and don't use Red Hat's -- I
just make /usr/bin/{perl,python} symlinks to /usr/local/bin, which RPM
tends to whine about.)  But it's nice to know that someone is watching.
;-)

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From gward at mems-exchange.org  Fri May 26 15:30:29 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 09:30:29 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <200005261313.IAA11285@cj20424-a.reston1.va.home.com>; from guido@python.org on Fri, May 26, 2000 at 08:13:06AM -0500
References: <1252780469-123073242@hypernet.com> <200005261313.IAA11285@cj20424-a.reston1.va.home.com>
Message-ID: <20000526093028.C12100@mems-exchange.org>

On 26 May 2000, Guido van Rossum said:
> Modifyable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.
> 
> But static data needs to go in the same directory as the module that
> uses it.  (We use this in the standard test package, for example.)

What about the Distutils system config file (pydistutils.cfg)?  This is
something that should only be modified by the sysadmin, and sets the
site-wide policy for building and installing Python modules.  Does this
belong in the code directory?  (I hope so, because that's where it goes
now...)

(Under Unix, users can have a personal Distutils config file that
overrides the system config (~/.pydistutils.cfg), and every module
distribution can have a setup.cfg that overrides both of them.  On
Windows and Mac OS, there are only two config files: system and
per-distribution.)
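
For illustration, here's the sort of thing such a config file might
contain -- the section names follow the Distutils command names, and the
particular option values here are just plausible examples, not a
recommendation:

    # pydistutils.cfg, ~/.pydistutils.cfg or setup.cfg (same format)
    [build]
    build-base = build

    [install]
    prefix = /usr/local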

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From gward at mems-exchange.org  Fri May 26 16:30:15 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 10:30:15 -0400
Subject: [Python-Dev] py_compile and CR in source files
Message-ID: <20000526103014.A18937@mems-exchange.org>

Just made an unpleasant discovery: if a Python source file has CR-LF
line-endings, you can import it just fine under Unix.  But attempting to
'py_compile.compile()' it fails with a SyntaxError at the first
line-ending.

Arrghh!  This means that Distutils will either have to check/convert
line-endings at build-time (hey, finally, a good excuse for the
"build_py" command), or implicitly compile modules by importing them
(instead of using 'py_compile.compile()').

Perhaps I should "build" modules by line-at-a-time copying -- currently
it copies them in 16k chunks, which would make it hard to fix line
endings.  Hmmm.
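
Something along these lines, perhaps (untested sketch, not actual
Distutils code -- the helper name is invented):

    import string

    def copy_fixing_newlines(src, dst):
        # copy a .py file line by line, turning CRLF (and lone CR)
        # into plain LF
        infile = open(src, "rb")
        outfile = open(dst, "wb")
        for line in infile.readlines():
            line = string.replace(line, "\r\n", "\n")
            line = string.replace(line, "\r", "\n")
            outfile.write(line)
        infile.close()
        outfile.close()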

        Greg



From skip at mojam.com  Fri May 26 11:39:39 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 26 May 2000 04:39:39 -0500 (CDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <20000526103014.A18937@mems-exchange.org>
References: <20000526103014.A18937@mems-exchange.org>
Message-ID: <14638.18011.331703.867404@beluga.mojam.com>

    Greg> Arrghh!  This means that Distutils will either have to
    Greg> check/convert line-endings at build-time (hey, finally, a good
    Greg> excuse for the "build_py" command), or implicitly compile modules
    Greg> by importing them (instead of using 'py_compile.compile()').

I don't think you can safely compile modules by importing them.  You have no
idea what the side effects of the import might be.

How about fixing py_compile.compile() instead?

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From mal at lemburg.com  Fri May 26 16:27:03 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 16:27:03 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vIcs-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <392E89B7.D6BC572D@lemburg.com>

Peter Funk wrote:
> 
> [M.-A. Lemburg]:
> > > Proposal:
> > > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > >
> > > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > > bytes 12-*  marshalled code object (same as earlier)
> >
> > This will break all tools relying on having the code object available
> > in bytes[8:] and believe me: there are lots of those around ;-)
> 
> In some way, this is intentional:  If these tools (are there really
> that many out there that munge with .pyc byte code files?) simply use
> 'imp.get_magic()' and then silently assume a specific content of the
> marshalled code object, they probably need changes anyway, since the
> code needed to deal with the new unicode object is missing from them.

That's why I proposed to change the marshalled code object
and not the PYC file: the problem is not only related to 
PYC files, it touches all areas where marshal is used. If 
you try to load a code object using Unicode in Python 1.5
you'll get all sorts of errors, e.g. EOFError, SystemError.
 
Since marshal uses a specific format, that format should
receive the version number.

Ideally that version would be prepended to the format (not sure
whether this is possible), so that the PYC file layout
would then look like this:

word 0: magic
word 1: timestamp
word 2: version in the marshalled code object
word 3-*: rest of the marshalled code object

Please make sure that options such as the -U option are
also respected...

--

A different approach to all this would be fixing only the
first two bytes of the magic word, e.g.

byte 0: 'P'
byte 1: 'Y'
byte 2: version number (counting from 1)
byte 3: option byte (8 bits: one for each option;
                     bit 0: -U cmd switch)

This would be b/w compatible and still provide file(1)
with enough information to be able to tell the file type.
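
For illustration, a tool (or a file(1)-style checker) could then read the
header roughly like this -- a sketch only, since the layout above is just
a proposal:

    def read_pyc_header(filename):
        # proposed layout: 'P', 'Y', version byte, option byte
        f = open(filename, "rb")
        header = f.read(4)
        f.close()
        if header[:2] != "PY":
            raise ValueError, "not a new-style .pyc file"
        return ord(header[2]), ord(header[3])   # (version, options)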

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri May 26 16:49:23 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 16:49:23 +0200
Subject: [Python-Dev] Extending locale.py
Message-ID: <392E8EF3.CDA61525@lemburg.com>

To make moving into the direction of making the string encoding
depend on the locale settings a little easier, I've started
to hack away at an extension of the locale.py module.

The module provides enough information to be able to set the string
encoding in site.py at startup. 
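
The kind of startup code I have in mind looks roughly like this (a sketch
only -- treat both helper names as assumptions about the interface, not
as final APIs):

    # possible site.py fragment
    import locale, sys

    # assumed to return a (language, encoding) pair, with None for
    # anything that cannot be determined
    lang, encoding = locale.getdefaultlocale()
    if encoding:
        # assumed hook for setting the default string encoding
        sys.setdefaultencoding(encoding)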

Additional code for _localemodule.c would be nice for platforms
which use other APIs to get at the active code page, e.g. on
Windows and Macs.

Please try it on your platform and tell me what you think
of the APIs.

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: localex.py
Type: text/python
Size: 26105 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20000526/1e6fcf39/attachment.bin>

From gmcm at hypernet.com  Fri May 26 16:56:27 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 26 May 2000 10:56:27 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000526092716.B12100@mems-exchange.org>
References: <1252780469-123073242@hypernet.com>; from gmcm@hypernet.com on Fri, May 26, 2000 at 07:53:14AM -0400
Message-ID: <1252769476-123734481@hypernet.com>

Greg Ward wrote:

> On 26 May 2000, Gordon McMillan said:
> > Yeah. I tend to install stuff outside the sys.prefix tree and
> > then use .pth files. I realize I'm, um, unique in this regard
> > but I lost everything in some upgrade gone bad. (When a Windows
> > de- install goes wrong, your only option is to do some manual
> > directory and registry pruning.)
> 
> I think that's appropriate for Python "applications" -- in fact,
> now that Distutils can install scripts and miscellaneous data,
> about the only thing needed to properly support "applications" is
> an easy way for developers to say, "Please give me my own
> directory and create a .pth file". 

Hmm. I see an application as a module distribution that 
happens to have a script. (Or maybe I see a module 
distribution as a scriptless app ;-)).

At any rate, I don't see the need to dignify <prefix>/share and 
friends with an official position.

> (Actually, the .pth file
> should only be one way to install an application: you might not
> want your app's Python library to muck up everybody else's Python
> path.  An idea AMK and I cooked up yesterday would be an addition
> to the Distutils "build_scripts" command: along with frobbing the
> #! line to point to the right Python interpreter, add a second
> line:
>   import sys ; sys.path.append(path-to-this-app's-python-lib)
> 
> Or maybe "sys.path.insert(0, ...)".)

$PYTHONSTARTUP ??

Never really had to deal with this. On my RH box, 
/usr/bin/python is my build. At a client site which had 1.4 
installed, I built 1.5 into $HOME/bin with a hacked getpath.c.

> I'm more concerned with what the Distutils works best with
> now, though: module distributions.  I think you guys have
> convinced me; static data should normally sit with the code.  I
> think I'll make that the default (instead of prefix + "share"),
> but give developers a way to override it.  So eg.:
> 
>    data_files = ["this.dat", "that.cfg"]
> 
> will put the files in the same place as the code (which could be
> a bit tricky to figure out, what with the vagaries of
> package-ization and "extra" install dirs);

That's an artifact of your code ;-). If you figured it out once, 
you stand at least a 50% chance of getting the same answer 
a second time <.5 wink>.
 


- Gordon



From gward at mems-exchange.org  Fri May 26 17:06:09 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 11:06:09 -0400
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.18011.331703.867404@beluga.mojam.com>; from skip@mojam.com on Fri, May 26, 2000 at 04:39:39AM -0500
References: <20000526103014.A18937@mems-exchange.org> <14638.18011.331703.867404@beluga.mojam.com>
Message-ID: <20000526110608.F9083@mems-exchange.org>

On 26 May 2000, Skip Montanaro said:
> I don't think you can safely compile modules by importing them.  You have no
> idea what the side effects of the import might be.

Yeah, that's my concern.

> How about fixing py_compile.compile() instead?

It would be a good thing to do for Python 1.6, but I can't go back and
fix all the Python 1.5.2 installations out there.

Does anyone know of any good reasons why 'import' and
'py_compile.compile()' are different?  Or is it something easily
fixable?

        Greg



From tim_one at email.msn.com  Fri May 26 17:41:57 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 26 May 2000 11:41:57 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000401bfc082$54211940$6c2d153f@tim>
Message-ID: <LNBBLJKPBEHFEDALKOLCKELFGBAA.tim_one@email.msn.com>

Just polishing part of this off, for the curious:

> ...
> Dragon's Win98 woes appear due to something else:  right after a Win98
> system w/ 64Mb RAM is booted, about half the memory is already locked (not
> just committed)!  Dragon's product needs more than the remaining 32Mb to
> avoid thrashing.  Even stranger, killing every process after booting
> releases an insignificant amount of that locked memory. ...

That turned out to be (mostly) irrelevant, and even if it were relevant it
turns out you can reduce the locked memory (to what appears to be an
undocumented minimum) and the file-cache size (to what is a documented
minimum) just by malloc'ing, zero'ing and free'ing a few giant arrays
(Windows malloc()-- unlike Linux's --returns a pointer to committed memory;
Windows has other calls if you really want memory you can't trust <0.5
wink>).

The next red herring was much funnier:  we couldn't reproduce the problem
when running the recognizer by hand (from a DOS box cmdline)!  But, run it
as Research did, system()'ed from a small Perl script, and it magically ran
3x slower, with monstrous disk thrashing.  So I had a great time besmirching
Perl's reputation <wink>.

Alas, it turned out the *real* trigger was something else entirely, that
we've known about for years but have never understood:  from inside the Perl
script, people used UNC paths to various network locations.  Like

    \\earwig\research2\data5\natspeak\testk\big55.voc

Exactly the same locations were referenced when people ran it "by hand", but
when people do it by hand, they naturally map a drive letter first, in order
to reduce typing.  Like

    net use N: \\earwig\research2\data5\natspeak

once and then

    N:\testk\big55.voc

in their command lines.

This difference alone can make a *huge* timing difference!  Like I said,
we've never understood why.  Could simply be a bug in Dragon's
out-of-control network setup, or a bug in MS's networking code, or a bug in
Novell's server code -- I don't think we'll ever know.  The number of
IQ-hours that have gone into *trying* to figure this out over the years
could probably have carried several startups to successful IPOs <0.9 wink>.

One last useless clue:  do all this on a Win98 with 128Mb RAM, and the
timing difference goes away.  Ditto Win95, but much less RAM is needed.  It
sometimes acts like a UNC path consumes 32Mb of dedicated RAM!

Apart from this UNC-vs-mapped-drive issue, over many hours of dead-end
scenarios I was pleased to see that Win98 appears to do a good job of
reallocating physical RAM in response to changing demands, & in particular
better than Win95.  There's no problem here at all!

The original test case I posted-- showing massive heap fragmentation under
Win95, Win98, and W2K (but not NT), when growing a large Python list one
element at a time --remains an as-yet unstudied mystery.  I can easily make
*that* problem go away by, e.g., doing

    a = [1]*3000000
    del a

from time to time, apparently just to convince the Windows malloc that it
would be a wise idea to allocate a lot more than it thinks it needs from
time to time.  This suggests (untested) that it *could* be a huge win for
huge lists under Windows to overallocate huge lists by more than Python does
today.  I'll look into that "someday".





From gstein at lyra.org  Fri May 26 17:46:09 2000
From: gstein at lyra.org (Greg Stein)
Date: Fri, 26 May 2000 08:46:09 -0700 (PDT)
Subject: [Python-Dev] exceptions.c location (was: Win32 build)
In-Reply-To: <14639.50100.383806.969434@localhost.localdomain>
Message-ID: <Pine.LNX.4.10.10005260845130.23146-100000@nebula.lyra.org>

On Sat, 27 May 2000, Barry A. Warsaw wrote:
> >>>>> "GS" == Greg Stein <gstein at lyra.org> writes:
> 
>     GS> On Fri, 26 May 2000, Tim Peters wrote:
>     >> ...  PS: Barry's exception patch appears to have broken the CVS
>     >> Windows build (nothing links anymore; all the PyExc_xxx symbols
>     >> aren't found; no time to dig more now).
> 
>     GS> The .dsp file(s) need to be updated to include the new
>     GS> _exceptions.c file in their build and link step. (the symbols
>     GS> moved there)
> 
>     GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
>     GS> the Python/ directory. Dependencies from the core out to
>     GS> Modules/ seems a bit weird.
> 
> Guido made the suggestion to move _exceptions.c to exceptions.c any
> way.  Should we move the file to the other directory too?  Get out
> your plusses and minuses.

+1 for moving it to Python/ (where bltinmodule.c and sysmodule.c exist)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Fri May 26 18:18:14 2000
From: gstein at lyra.org (Greg Stein)
Date: Fri, 26 May 2000 09:18:14 -0700 (PDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <20000526110608.F9083@mems-exchange.org>
Message-ID: <Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>

On Fri, 26 May 2000, Greg Ward wrote:
> On 26 May 2000, Skip Montanaro said:
> > I don't think you can safely compile modules by importing them.  You have no
> > idea what the side effects of the import might be.
> 
> Yeah, that's my concern.

I agree. You can't just import them.

> > How about fixing py_compile.compile() instead?
> 
> It would be a good thing to do for Python 1.6, but I can't go back and
> fix all the Python 1.5.2 installations out there.

You and your 1.5 compatibility... :-)

> Does anyone know of any good reasons why 'import' and
> 'py_compile.compile()' are different?  Or is it something easily
> fixable?

I seem to recall needing to put an extra carriage return on the file, but
that the Python parser was fine with the different newline concepts. Guido
explained the difference once to me, but I don't recall offhand -- I'd
have to crawl back thru the email. Just yell over the cube at him to find
out.

*ponder*

Well, assuming that it is NOT okay with \r\n in there, then read the whole
blob in and use string.replace() on it.
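
Something like this, say (untested sketch; py_compile would then marshal
the resulting code object just as it does now):

    import string

    def compile_clean(file):
        # read the whole blob, zap the CRs, then compile as usual
        f = open(file, "rb")
        source = f.read()
        f.close()
        source = string.replace(source, "\r\n", "\n")
        return compile(source + "\n", file, "exec")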

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/





From skip at mojam.com  Fri May 26 18:30:08 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 26 May 2000 11:30:08 -0500 (CDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>
References: <20000526110608.F9083@mems-exchange.org>
	<Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>
Message-ID: <14638.42640.835838.859270@beluga.mojam.com>

    Greg> Well, assuming that it is NOT okay with \r\n in there, then read
    Greg> the whole blob in and use string.replace() on it.

I thought of that too, but quickly dismissed it.  You may have a CRLF pair
embedded in a triple-quoted string.  Those should be left untouched.

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From fdrake at acm.org  Fri May 26 19:18:00 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Fri, 26 May 2000 10:18:00 -0700 (PDT)
Subject: [Distutils] Re: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.42640.835838.859270@beluga.mojam.com>
Message-ID: <Pine.LNX.4.10.10005261014420.12340-100000@mailhost.beopen.com>

On Fri, 26 May 2000, Skip Montanaro wrote:
 > I thought of that too, but quickly dismissed it.  You may have a CRLF pair
 > embedded in a triple-quoted string.  Those should be left untouched.

  No, it would be OK to do the replacement; source files are supposed to
be treated as text, meaning that line ends should be represented as \n.
We're not talking about changing the values of the strings: line ends
inside string literals will still end up as \n, and that's what will be
incorporated in the value of the string.  This has no impact on the
explicit use of \r or \r\n escapes in strings.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From bwarsaw at python.org  Fri May 26 19:32:02 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Fri, 26 May 2000 13:32:02 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
	<Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
Message-ID: <14638.46354.960974.536560@localhost.localdomain>

>>>>> "Fred" == Fred L Drake <fdrake at acm.org> writes:

    Fred> and there's ElectricFence from Bruce Perens:

    Fred> 	http://www.perens.com/FreeSoftware/

Yup, this comes with RH6.2 and is fairly easy to hook up; just link
with -lefence and go.  Running an efenced python over the whole test
suite fails miserably, but running it over just
Lib/test/test_exceptions.py has already (quickly) revealed one
refcounting bug, which I will check in to fix later today (as I move
Modules/_exceptions.c to Python/exceptions.c).

    Fred> (There's a MailMan related link there are well you might be
    Fred> interested in!)

Indeed!  I've seen Bruce contribute on the various Mailman mailing
lists.

-Barry



From skip at mojam.com  Fri May 26 19:48:46 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 26 May 2000 12:48:46 -0500 (CDT)
Subject: [Python-Dev] C implementation of exceptions module
In-Reply-To: <14638.46354.960974.536560@localhost.localdomain>
References: <14638.64962.118047.467438@localhost.localdomain>
	<Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
	<14638.46354.960974.536560@localhost.localdomain>
Message-ID: <14638.47358.724731.392760@beluga.mojam.com>

    BAW> Yup, this comes with RH6.2 and is fairly easy to hook up; just link
    BAW> with -lefence and go.

Hmmm...  Sounds like an extra configure flag waiting to be added...

Skip



From bwarsaw at python.org  Fri May 26 20:38:19 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Fri, 26 May 2000 14:38:19 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
	<Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
	<14638.46354.960974.536560@localhost.localdomain>
	<14638.47358.724731.392760@beluga.mojam.com>
Message-ID: <14638.50331.542338.196305@localhost.localdomain>

>>>>> "SM" == Skip Montanaro <skip at mojam.com> writes:

    BAW> Yup, this comes with RH6.2 and is fairly easy to hook up;
    BAW> just link with -lefence and go.

    SM> Hmmm...  Sounds like an extra configure flag waiting to be
    SM> added...

I dunno.  I just did a "make -k OPT=-g LIBC=-lefence".

-Barry



From trentm at activestate.com  Fri May 26 20:55:55 2000
From: trentm at activestate.com (Trent Mick)
Date: Fri, 26 May 2000 11:55:55 -0700
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
In-Reply-To: <14639.50100.383806.969434@localhost.localdomain>
References: <000401bfc6d3$0afb3e60$c52d153f@tim> <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org> <14639.50100.383806.969434@localhost.localdomain>
Message-ID: <20000526115555.C32427@activestate.com>

On Sat, May 27, 2000 at 08:46:44AM -0400, Barry A. Warsaw wrote:
> 
> >>>>> "GS" == Greg Stein <gstein at lyra.org> writes:
> 
>     GS> On Fri, 26 May 2000, Tim Peters wrote:
>     >> ...  PS: Barry's exception patch appears to have broken the CVS
>     >> Windows build (nothing links anymore; all the PyExc_xxx symbols
>     >> aren't found; no time to dig more now).
> 
>     GS> The .dsp file(s) need to be updated to include the new
>     GS> _exceptions.c file in their build and link step. (the symbols
>     GS> moved there)
> 
>     GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
>     GS> the Python/ directory. Dependencies from the core out to
>     GS> Modules/ seems a bit weird.
> 
> Guido made the suggestion to move _exceptions.c to exceptions.c any
> way.  Should we move the file to the other directory too?  Get out
> your plusses and minuses.
> 
+1 moving exceptions.c to Python/


Trent

-- 
Trent Mick
trentm at activestate.com


From trentm at activestate.com  Fri May 26 20:39:40 2000
From: trentm at activestate.com (Trent Mick)
Date: Fri, 26 May 2000 11:39:40 -0700
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <m12vFOo-000DieC at artcom0.artcom-gmbh.de>
References: <200005251551.KAA11897 at cj20424-a.reston1.va.home.com> <m12vFOo-000DieC at artcom0.artcom-gmbh.de>
Message-ID: <20000526113940.B32427 at activestate.com>

On Fri, May 26, 2000 at 10:23:18AM +0200, Peter Funk wrote:
> [Guido van Rossum]:
> > Given Christian Tismer's testimonial and inspection of marshal.c, I
> > think Peter's small patch is acceptable.
> > 
> > A bigger question is whether we should freeze the magic number and add
> > a version number.  In theory I'm all for that, but it means more
> > changes; there are several tools (e.c. Lib/py_compile.py,
> > Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> > intimate knowledge of the .pyc file format that would have to be
> > modified to match.
> > 
> > The current format of a .pyc file is as follows:
> > 
> > bytes 0-3   magic number
> > bytes 4-7   timestamp (mtime of .py file)
> > bytes 8-*   marshalled code object
> 
> Proposal:
> The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> 
> bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> bytes 4-7   a version number (which should be == 1 in Python 1.6)
> bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> bytes 12-*  marshalled code object (same as earlier)
> 

This may be important: timestamps (as represented by the time_t type) are 8
bytes wide on 64-bit Linux and Win64. However, it will be a while (another 38
years) before time_t starts overflowing past 31 bits (it is a signed value).

The use of a 4 byte timestamp in the .pyc files constitutes an assumption
that the value will fit in 4 bytes.  The best portable way of handling this
issue (I think) is to just add an overflow check in import.c, where
PyOS_GetLastModificationTime (which now properly returns time_t) is called,
and raise an exception if the time_t return value overflows 4 bytes.
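
In Python terms, the check amounts to something like this (sketch only --
the real check would live in C in import.c):

    def check_mtime_fits(mtime):
        # the .pyc header stores the timestamp in a 4-byte (signed)
        # field, so refuse values that will not round-trip through it
        if mtime > 0x7FFFFFFFL or mtime < -0x80000000L:
            raise OverflowError, "module mtime does not fit in 4 bytes"
        return mtime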

I have been going through the Python code looking for possible overflow cases
for Win64 and Linux64 of late so I will submit these patches (Real Soon Now
(tm)).

Cheers,
Trent

-- 
Trent Mick
trentm at activestate.com



From bwarsaw at python.org  Fri May 26 21:11:40 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Fri, 26 May 2000 15:11:40 -0400 (EDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
References: <000401bfc6d3$0afb3e60$c52d153f@tim>
	<Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>
	<14639.50100.383806.969434@localhost.localdomain>
	<20000526115555.C32427@activestate.com>
Message-ID: <14638.52332.741025.292435@localhost.localdomain>

>>>>> "TM" == Trent Mick <trentm at activestate.com> writes:

    TM> +1 moving exceptions.c to Python/

Done.  And it looks like someone with a more accessible Windows setup
is going to have to modify the .dsp files.

-Barry



From jeremy at alum.mit.edu  Fri May 26 23:40:53 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Fri, 26 May 2000 17:40:53 -0400 (EDT)
Subject: [Python-Dev] Guido is offline
Message-ID: <14638.61285.606894.914184@localhost.localdomain>

FYI: Guido's cable modem service is giving him trouble and he's unable
to read email at the moment.  He wanted me to let you know that lack
of response isn't for lack of interest.  I imagine he won't be fully
responsive until after the holiday weekend :-).

Jeremy




From tim_one at email.msn.com  Sat May 27 06:53:14 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Sat, 27 May 2000 00:53:14 -0400
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.42640.835838.859270@beluga.mojam.com>
Message-ID: <000001bfc797$781e3d20$cd2d153f@tim>

[GregS]
> Well, assuming that it is NOT okay with \r\n in there, then read
> the whole blob in and use string.replace() on it.
>
[Skip Montanaro]
> I thought of that too, but quickly dismissed it.  You may have a CRLF pair
> embedded in a triple-quoted string.  Those should be left untouched.

Why?  When Python compiles a module "normally", line-ends get normalized,
and the CRLF pairs on Windows vanish anyway.  For example, here's cr.py:

def f():
    s = """a
b
c
d
"""
    for ch in s:
        print ord(ch),
    print

f()
import dis
dis.dis(f)

I'm running on Win98 as I type, and the source file has CRLF line ends.

C:\Python16>python misc/cr.py
97 10 98 10 99 10 100 10

That line shows that only the LFs survived.  The rest shows why:

          0 SET_LINENO               1

          3 SET_LINENO               2
          6 LOAD_CONST               1 ('a\012b\012c\012d\012')
          9 STORE_FAST               0 (s)
          etc

That is, as far as the generated code is concerned, the CRs never existed.

60-years-of-computers-and-we-still-can't-agree-on-how-to-end-a-line-ly
    y'rs  - tim





From martin at loewis.home.cs.tu-berlin.de  Sun May 28 08:28:55 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 28 May 2000 08:28:55 +0200
Subject: [Python-Dev] String encoding
Message-ID: <200005280628.IAA01239@loewis.home.cs.tu-berlin.de>

Fred L. Drake wrote

> I recall a fair bit of discussion about wchar_t when it was
> introduced to ANSI C, and the character set and encoding were
> specifically not made part of the specification.  Making a
> requirement that wchar_t be Unicode doesn't make a lot of sense, and
> opens up potential portability issues.

In ISO (!) C99, an implementation may define __STDC_ISO_10646__ to
indicate that wchar_t is Unicode. The exact wording is

# A decimal constant of the form yyyymmL (for example, 199712L),
# intended to indicate that values of type wchar_t are the coded
# representations of the characters defined by ISO/IEC 10646, along
# with all amendments and technical corrigenda as of the specified
# year and month.

Of course, at the moment, there are few, if any, implementations that
define this macro.

Regards,
Martin



From martin at loewis.home.cs.tu-berlin.de  Sun May 28 12:34:01 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 28 May 2000 12:34:01 +0200
Subject: [Python-Dev] Patch: AttributeError and NameError: second attempt.
Message-ID: <200005281034.MAA04765@loewis.home.cs.tu-berlin.de>

[thread moved, since I can't put in proper References headers, anyway,
 just by looking at the archive]
> 1) I rewrite the stuff that went into exceptions.py in C, and stick it
>   in the _exceptions module.  I don't much like this idea, since it 
>   kills the advantage noted above.

>2) I leave the stuff that's in C already in C.  I add C __str__ methods 
>   to AttributeError and NameError, which dispatch to helper functions
>   in the python 'exceptions' module, if that module is available.

>Which is better, or is there a third choice available?

There is a third choice: Patch AttributeError afterwards. I.e. in
site.py, say

def _AttributeError_str(self):
  code

AttributeError.__str__ = _AttributeError_str

Guido said
> This kind of user-friendliness should really be in the tools, not in
> the core language implementation!

And I think Nick's patch exactly follows this guideline. Currently,
the C code raising AttributeError tries to be friendly, formatting a
string, and passing it to the AttributeError.__init__. With his patch,
the AttributeError just gets enough information so that tools later
can be friendly - actually printing anything is done in Python code.

Fred said
>   I see no problem with the functionality from Nick's patch; this is
> exactly the sort of thing that's needed, including at the basic
> interactive prompt.

I agree. Much of the strength of this approach is lost if it only
works inside tools. When I get an AttributeError, I'd like to see
right away what the problem is. If I had to fire up IDLE and re-run it
first, I'd rather stare at my code long enough to see the problem.

Regards,
Martin




From tismer at tismer.com  Sun May 28 17:02:41 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sun, 28 May 2000 17:02:41 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vIcs-000DieC@artcom0.artcom-gmbh.de> <392E89B7.D6BC572D@lemburg.com>
Message-ID: <39313511.4A312B4A@tismer.com>


"M.-A. Lemburg" wrote:
> 
> Peter Funk wrote:
> >
> > [M.-A. Lemburg]:
> > > > Proposal:
> > > > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > > >
> > > > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > > > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > > > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > > > bytes 12-*  marshalled code object (same as earlier)

<snip/>

> A different approach to all this would be fixing only the
> first two bytes of the magic word, e.g.
> 
> byte 0: 'P'
> byte 1: 'Y'
> byte 2: version number (counting from 1)
> byte 3: option byte (8 bits: one for each option;
>                      bit 0: -U cmd switch)
> 
> This would be b/w compatible and still provide file(1)
> with enough information to be able to tell the file type.

I think this approach is simple and powerful enough
to survive Py3000.
Peter's approach is of course nicer and cleaner from
a "redo from scratch" point of view. But then, I'd even
vote for a better format that includes another field
which names the header size explicitly.

For simplicity, compatibility and ease of change,
I vote with +1 for adopting the solution of

byte 0: 'P'
byte 1: 'Y'
byte 2: version number (counting from 1)
byte 3: option byte (8 bits: one for each option;
                     bit 0: -U cmd switch)

If that turns out to be insufficient in some future,
do a complete redesign.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From pf at artcom-gmbh.de  Sun May 28 18:23:52 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Sun, 28 May 2000 18:23:52 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <39313511.4A312B4A@tismer.com> from Christian Tismer at "May 28, 2000  5: 2:41 pm"
Message-ID: <m12w5qy-000DieC@artcom0.artcom-gmbh.de>

[...]
> > For simplicity, compatibility and ease of change,
> I vote with +1 for adopting the solution of
> 
> byte 0: 'P'
> byte 1: 'Y'
> byte 2: version number (counting from 1)
> byte 3: option byte (8 bits: one for each option;
>                      bit 0: -U cmd switch)
> 
> If that turns out to be insufficient in some future,
> do a complete redesign.

What about the CR/LF issue with some Mac Compilers (see
Guido's mail for details)?  Can we simply drop this?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From tismer at tismer.com  Sun May 28 18:51:20 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sun, 28 May 2000 18:51:20 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12w5qy-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <39314E88.6AD944CE@tismer.com>


Peter Funk wrote:
> 
> [...]
> > For simplicity, compatibility and ease of change,
> > I vote with +1 for adopting the solution of
> >
> > byte 0: 'P'
> > byte 1: 'Y'
> > byte 2: version number (counting from 1)
> > byte 3: option byte (8 bits: one for each option;
> >                      bit 0: -U cmd switch)
> >
> > If that turns out to be insufficient in some future,
> > do a complete redesign.
> 
> What about the CR/LF issue with some Mac Compilers (see
> Guido's mail for details)?  Can we simply drop this?

Well, forgot about that.
How about swapping bytes 0 and 1?

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From guido at python.org  Mon May 29 01:54:11 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 28 May 2000 18:54:11 -0500
Subject: [Python-Dev] Guido is offline
In-Reply-To: Your message of "Fri, 26 May 2000 17:40:53 -0400."
             <14638.61285.606894.914184@localhost.localdomain> 
References: <14638.61285.606894.914184@localhost.localdomain> 
Message-ID: <200005282354.SAA02034@cj20424-a.reston1.va.home.com>

> FYI: Guido's cable modem service is giving him trouble and he's unable
> to read email at the moment.  He wanted me to let you know that lack
> of response isn't for lack of interest.  I imagine he won't be fully
> responsive until after the holiday weekend :-).

I'm finally back online now, but can't really enjoy it, because my
in-laws are here... So I have 300 unread emails that will remain
unread until Tuesday. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Mon May 29 02:00:39 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 28 May 2000 19:00:39 -0500
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: Your message of "Fri, 26 May 2000 14:36:36 +0200."
             <m12vJLw-000DieC@artcom0.artcom-gmbh.de> 
References: <m12vJLw-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>

> > Modifyable data needs to go in a per-user directory, even on Windows,
> > outside the Python tree.
> 
> Is there a reliable algorithm to find a "per-user" directory on any
> Win95/98/NT/2000 system?  On MacOS?  

I don't know -- often $HOME is set on Windows.  E.g. IDLE uses $HOME
if set and otherwise the current directory.

The Mac doesn't have an environment at all.

> Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
> module would provide 'os.environ["HOME"]' similar to the posix
> version?  This would certainly simplify the task of application
> programmers intending to write portable applications.

This sounds like a nice idea...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Mon May 29 21:58:41 2000
From: gstein at lyra.org (Greg Stein)
Date: Mon, 29 May 2000 12:58:41 -0700 (PDT)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <39314E88.6AD944CE@tismer.com>
Message-ID: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org>

I don't think we should have a two-byte magic value. Especially where
those two bytes are printable, 7-bit ASCII.

"But it is four bytes," you say. Nope. It is two plus a couple parameters
that can now change over time.

To ensure uniqueness, I think a four-byte magic should stay.

I would recommend the approach of adding opcodes into the marshal format.
Specifically, 'V' followed by a single byte. That can only occur at the
beginning. If it is not present, then you know that you have an old
marshal value.
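
To make that concrete, reading such a stream might start out like this
(sketch only -- the 'V' opcode is the proposal above, not something
marshal actually understands today):

    import marshal

    def load_versioned(f):
        first = f.read(1)
        if first == 'V':
            version = ord(f.read(1))
        else:
            version = 0          # old-style stream: no version opcode
            f.seek(-1, 1)        # push the byte back for marshal
        return version, marshal.load(f)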

Cheers,
-g

On Sun, 28 May 2000, Christian Tismer wrote:
> Peter Funk wrote:
> > 
> > [...]
> > > For simplicity, compatibility and ease of change,
> > > I vote with +1 for adopting the solution of
> > >
> > > byte 0: 'P'
> > > byte 1: 'Y'
> > > byte 2: version number (counting from 1)
> > > byte 3: option byte (8 bits: one for each option;
> > >                      bit 0: -U cmd switch)
> > >
> > > If that turns out to be insufficient in some future,
> > > do a complete redesign.
> > 
> > What about the CR/LF issue with some Mac Compilers (see
> > Guido's mail for details)?  Can we simply drop this?
> 
> Well, forgot about that.
> How about swapping bytes 0 and 1?
> 
> -- 
> Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
> Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
> Kaunstr. 26                  :    *Starship* http://starship.python.net
> 14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
> PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
>      where do you want to jump today?   http://www.stackless.com
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/




From pf at artcom-gmbh.de  Tue May 30 09:08:15 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 30 May 2000 09:08:15 +0200 (MEST)
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
In-Reply-To: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> from Greg Stein at "May 29, 2000 12:58:41 pm"
Message-ID: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>

Greg Stein:
> I don't think we should have a two-byte magic value. Especially where
> those two bytes are printable, 7-bit ASCII.
[...]
> To ensure uniqueness, I think a four-byte magic should stay.

Looking at /etc/magic I see many 16-bit magic numbers kept around
from the good old days.  But you are right: Choosing a four-byte magic
value would make the chance of a clash with some other file format
much less likely.

> I would recommend the approach of adding opcodes into the marshal format.
> Specifically, 'V' followed by a single byte. That can only occur at the
> beginning. If it is not present, then you know that you have an old
> marshal value.

But this would not solve the problem with 8 byte versus 4 byte timestamps
in the header on 64-bit OSes.  Trent Mick pointed this out.

I think the situation we have now is very unsatisfactory:  I don't
see a reasonable solution that allows us to keep the length of the
header before the marshal block at a fixed 8 bytes together
with a frozen 4 byte magic number.

Moving the version number into the marshal doesn't help to resolve
this conflict.  So either you have to accept a new magic on 64 bit
systems or you have to enlarge the header.

To come up with a new proposal, the following questions should be answered:
  1. Is there really too much code out there, which depends on 
     the hardcoded assumption, that the marshal part of a .pyc file 
     starts at byte 8?  I see no further evidence for or against this.
     MAL pointed this out in 
     <http://www.python.org/pipermail/python-dev/2000-May/005756.html>
  2. If we decide to enlarge the header, do we really need a new
     header field defining the length of the header ? 
     This was proposed by Christian Tismer in 
     <http://www.python.org/pipermail/python-dev/2000-May/005792.html>
  3. The 'imp' module exposes somewhat the structure of an .pyc file
     through the function 'get_magic()'.  I proposed changing the signature of
     'imp.get_magic()' in an upward compatible way.  I also proposed 
     adding a new function 'imp.get_version()'.  What do you think about 
     this idea?
  4. Greg proposed prepending the version number to the marshal
     format.  If we do this, we definitely need a frozen way to find
     out where the marshalled code object actually starts.  This also
     has the disadvantage of making it slightly harder to come up with
     an /etc/magic definition which displays the version number of a
     .pyc file.

If we decide to move the version number into the marshal, we can
also move the .py-timestamp there.  This way the timestamp will be handled
in the same way as large integer literals.  Quoting from the docs:

"""Caveat: On machines where C's long int type has more than 32 bits
   (such as the DEC Alpha), it is possible to create plain Python
   integers that are longer than 32 bits. Since the current marshal
   module uses 32 bits to transfer plain Python integers, such values
   are silently truncated. This particularly affects the use of very
   long integer literals in Python modules -- these will be accepted
   by the parser on such machines, but will be silently be truncated
   when the module is read from the .pyc instead.
   [...]
   A solution would be to refuse such literals in the parser, since
   they are inherently non-portable. Another solution would be to let
   the marshal module raise an exception when an integer value would
   be truncated. At least one of these solutions will be implemented
   in a future version."""

Should this be 1.6?  Changing the format of .pyc files over and over
again in the 1.x series doesn't look very attractive.

Regards, Peter



From trentm at activestate.com  Tue May 30 09:46:09 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 30 May 2000 00:46:09 -0700
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
In-Reply-To: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000530004609.A16383@activestate.com>

On Tue, May 30, 2000 at 09:08:15AM +0200, Peter Funk wrote:
> > I would recommend the approach of adding opcodes into the marshal format.
> > Specifically, 'V' followed by a single byte. That can only occur at the
> > beginning. If it is not present, then you know that you have an old
> > marshal value.
> 
> But this would not solve the problem with 8 byte versus 4 byte timestamps
> in the header on 64-bit OSes.  Trent Mick pointed this out.
> 

I kind of intimated but did not make it clear: I wouldn't worry about the
limitations of a 4 byte timestamp too much. That value is not going to
overflow for another 38 years. Presumably the .pyc header (if such a thing
even still exists then) will change by then.


[peter summarizes .pyc header format options]

> 
> If we decide to move the version number into the marshal, if we can
> also move the .py-timestamp there.  This way the timestamp will be handled
> in the same way as large integer literals.  Quoting from the docs:
> 
> """Caveat: On machines where C's long int type has more than 32 bits
>    (such as the DEC Alpha), it is possible to create plain Python
>    integers that are longer than 32 bits. Since the current marshal
>    module uses 32 bits to transfer plain Python integers, such values
>    are silently truncated. This particularly affects the use of very
>    long integer literals in Python modules -- these will be accepted
>    by the parser on such machines, but will be silently be truncated
>    when the module is read from the .pyc instead.
>    [...]
>    A solution would be to refuse such literals in the parser, since
>    they are inherently non-portable. Another solution would be to let
>    the marshal module raise an exception when an integer value would
>    be truncated. At least one of these solutions will be implemented
>    in a future version."""
> 
> Should this be 1.6?  Changing the format of .pyc files over and over
> again in the 1.x series doesn't look very attractive.
> 
I *hope* it gets into 1.6, because I have implemented the latter suggestion
from the docs that you quoted (raise an exception if truncating a PyInt to
32 bits would cause data loss) and will be submitting a patch for it on
Wed or Thurs.

Ciao,
Trent

-- 
Trent Mick
trentm at activestate.com



From effbot at telia.com  Tue May 30 10:21:10 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 10:21:10 +0200
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> <m12wg8N-000DieC@artcom0.artcom-gmbh.de> <20000530004609.A16383@activestate.com>
Message-ID: <009901bfca10$040531c0$f2a6b5d4@hagrid>

Trent Mick wrote:
> > But this would not solve the problem with 8 byte versus 4 byte timestamps
> > in the header on 64-bit OSes.  Trent Mick pointed this out.
> 
> I kind of intimated but did not make it clear: I wouldn't worry about the
> limitations of a 4 byte timestamp too much. That value is not going to
> overflow for another 38 years. Presumably the .pyc header (if such a thing
> even still exists then) will change by then.

note that py_compile (which is used to create PYC files after installation,
among other things) treats the time as an unsigned integer.

so in other words, if we fix the built-in "PYC compiler" so it does the same
thing before 2038, we can spend another 68 years on coming up with a
really future proof design... ;-)

I really hope Py3K will be out before 2106.

as for the other changes: *please* don't break the header layout in the
1.X series.  and *please* don't break the "if the magic is the same, I can
unmarshal and run this code blob without crashing the interpreter" rule
(raising an exception would be okay, though).

</F>




From mal at lemburg.com  Tue May 30 10:10:25 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 10:10:25 +0200
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: 
 .pyc file format change)
References: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <39337771.D78BAAF5@lemburg.com>

Peter Funk wrote:
> 
> Greg Stein:
> > I don't think we should have a two-byte magic value. Especially where
> > those two bytes are printable, 7-bit ASCII.
> [...]
> > To ensure uniqueness, I think a four-byte magic should stay.
> 
> Looking at /etc/magic I see many 16-bit magic numbers kept around
> from the good old days.  But you are right: Choosing a four-byte magic
> value would make the chance of a clash with some other file format
> much less likely.

Just for quotes: the current /etc/magic I have on my Linux
machine doesn't know anything about PYC or PYO files, so I
don't really see much of a problem here -- no one seems to
be interested in finding out the file type for these
files anyway ;-)

Also, I don't really get the 16-bit magic argument: we still
have a 32-bit magic number -- one with a 16-bit fixed value and
predefined ranges for the remaining 16 bits. This already is
much better than what we have now w/r to making file(1) work
on PYC files.
 
> > I would recommend the approach of adding opcodes into the marshal format.
> > Specifically, 'V' followed by a single byte. That can only occur at the
> > beginning. If it is not present, then you know that you have an old
> > marshal value.
> 
> But this would not solve the problem with 8 byte versus 4 byte timestamps
> in the header on 64-bit OSes.  Trent Mick pointed this out.

The switch to 8 byte timestamps is only needed when the current
4 bytes can no longer hold the timestamp value. That will happen
in 2038...

Note that import.c writes the timestamp in 4 bytes until it
reaches an overflow situation.

> I think, the situation we have now, is very unsatisfactory:  I don't
> see a reasonable solution, which allows us to keep the length of the
> header before the marshal-block at a fixed length of 8 bytes together
> with a frozen 4 byte magic number.

Adding a version to the marshal format is a Good Thing --
independent of this discussion.
 
> Moving the version number into the marshal doesn't help to resolve
> this conflict.  So either you have to accept a new magic on 64 bit
> systems or you have to enlarge the header.

No you don't... please read the code: marshal only writes
8 bytes in case 4 bytes aren't enough to hold the value.
 
> To come up with a new proposal, the following questions should be answered:
>   1. Is there really too much code out there, which depends on
>      the hardcoded assumption, that the marshal part of a .pyc file
>      starts at byte 8?  I see no further evidence for or against this.
>      MAL pointed this out in
>      <http://www.python.org/pipermail/python-dev/2000-May/005756.html>

I have several references in my tool collection, the import
stuff uses it, old import hooks (remember ihooks ?) also do, etc.

>   2. If we decide to enlarge the header, do we really need a new
>      header field defining the length of the header ?
>      This was proposed by Christian Tismer in
>      <http://www.python.org/pipermail/python-dev/2000-May/005792.html>

In Py3K we can do this right (breaking things is allowed)...
and I agree with Christian that a proper file format needs
a header length field too. Basically, these values have to
be present, IMHO:

1. Magic
2. Version
3. Length of Header
4. (Header Attribute)*n
-- Start of Data ---

Header Attribute can be pretty much anything -- timestamps,
names of files or other entities, bit sizes, architecture
flags, optimization settings, etc.
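
Reading such a header could then look roughly like this (hypothetical
layout with three 4-byte little-endian fields for the fixed part -- not
existing code):

    import struct

    def read_header(f):
        # magic, version, total header length, then (hdrlen - 12)
        # bytes of attribute data; the marshalled data follows
        magic, version, hdrlen = struct.unpack("<iii", f.read(12))
        attributes = f.read(hdrlen - 12)
        return magic, version, attributes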

>   3. The 'imp' module exposes somewhat the structure of an .pyc file
>      through the function 'get_magic()'.  I proposed changing the signature of
>      'imp.get_magic()' in an upward compatible way.  I also proposed
>      adding a new function 'imp.get_version()'.  What do you think about
>      this idea?

imp.get_magic() would have to return the proposed 32-bit value
('PY' + version byte + option byte).

I'd suggest adding additional functions which can read and write the
header given a PYCHeader object which would hold the 
values version and options.

>   4. Greg proposed prepending the version number to the marshal
>      format.  If we do this, we definitely need a frozen way to find
>      out, where the marshalled code object actually starts.  This has
>      also the disadvantage of making the task to come up with a /etc/magic
>      definition whichs displays the version number of a .pyc file slightly
>      harder.
> 
> If we decide to move the version number into the marshal, if we can
> also move the .py-timestamp there.  This way the timestamp will be handled
> in the same way as large integer literals.  Quoting from the docs:
> 
> """Caveat: On machines where C's long int type has more than 32 bits
>    (such as the DEC Alpha), it is possible to create plain Python
>    integers that are longer than 32 bits. Since the current marshal
>    module uses 32 bits to transfer plain Python integers, such values
>    are silently truncated. This particularly affects the use of very
>    long integer literals in Python modules -- these will be accepted
>    by the parser on such machines, but will be silently be truncated
>    when the module is read from the .pyc instead.
>    [...]
>    A solution would be to refuse such literals in the parser, since
>    they are inherently non-portable. Another solution would be to let
>    the marshal module raise an exception when an integer value would
>    be truncated. At least one of these solutions will be implemented
>    in a future version."""
> 
> Should this be 1.6?  Changing the format of .pyc files over and over
> again in the 1.x series doesn't look very attractive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From ping at lfw.org  Tue May 30 11:48:50 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 30 May 2000 02:48:50 -0700 (PDT)
Subject: [Python-Dev] inspect.py
Message-ID: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>

I just posted the HTML document generator script i promised
to do at IPC8.  It's at http://www.lfw.org/python/ (see the
bottom of the page).

The reason i'm mentioning this here is that, in the course of
doing that, i put all the introspection work in a separate
module called "inspect.py".  It's at

    http://www.lfw.org/python/inspect.py

It tries to encapsulate the interface provided by func_*, co_*,
et al. with something a little richer.  It can handle anonymous
(tuple) arguments for you, for example.  It can also get the
source code of any function, method, or class for you, as long
as the original .py file is still available.  And more stuff
like that.
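
For example, typical usage i have in mind looks roughly like this (treat
the exact function names as provisional):

    import inspect

    def show(func):
        # print where a function lives and its source, if available
        print inspect.getfile(func)
        print inspect.getsource(func)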

I think most of this stuff is quite generally useful, and it
seems good to wrap this up in a module.  I'd like your thoughts
on whether this is worth including in the standard library.



-- ?!ng

"To be human is to continually change.  Your desire to remain as you are
is what ultimately limits you."
    -- The Puppet Master, Ghost in the Shell




From effbot at telia.com  Tue May 30 12:26:29 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 12:26:29 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>

I wrote:

> what's the best way to deal with this?  I see three alter-
> natives:
> 
> a) stick to the old definition, and use chr(10) also for
>    unicode strings
> 
> b) use different definitions for 8-bit strings and unicode
>    strings; if given an 8-bit string, use chr(10); if given
>    a 16-bit string, use the LINEBREAK predicate.
> 
> c) use LINEBREAK in either case.
> 
> I think (c) is the "right thing", but it's the only one that may
> break existing code...

I'm probably getting old, but I don't remember if anyone followed
up on this, and I don't have time to check the archives right now.

so for the upcoming "feature complete" release, I've decided to
stick to (a).

...

for the next release, I suggest implementing a fourth alternative:

d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
   use chr(10).

background: in the current implementation, this decision has to
be made at compile time, and a compiled expression can be used
with either 8-bit strings or 16-bit strings.
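
(purely hypothetical usage sketch for (d) -- the "U" flag name and the
compile signature are placeholders, not an existing API:)

    # hypothetical: an sre.U flag would select LINEBREAK at compile time
    p_narrow = sre.compile("spam$")         # "$" tied to chr(10), as today
    p_wide   = sre.compile("spam$", sre.U)  # "$" tied to the LINEBREAK predicate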

a fifth alternative would be to use the locale flag to tell the
difference between unicode and 8-bit characters:

e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).

comments?

</F>

<project name="sre" phase=" complete="97.1%" />




From tismer at tismer.com  Tue May 30 13:24:55 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 30 May 2000 13:24:55 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org>
Message-ID: <3933A507.9FA6ABD6@tismer.com>


Greg Stein wrote:
> 
> I don't think we should have a two-byte magic value. Especially where
> those two bytes are printable, 7-bit ASCII.
> 
> "But it is four bytes," you say. Nope. It is two plus a couple parameters
> that can now change over time.
> 
> To ensure uniqueness, I think a four-byte magic should stay.
> 
> I would recommend the approach of adding opcodes into the marshal format.
> Specifically, 'V' followed by a single byte. That can only occur at the
> beginning. If it is not present, then you know that you have an old
> marshal value.

Fine with me, too!
Everything that keeps the current 8-byte header intact
and doesn't break much code is fine with me.  Moving
additional info into the marshalled objects themselves
gives even more flexibility than any header extension.
Yes, I'm all for it.
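
(A hypothetical sketch of what a reader of such a stream could do; the
'V' code follows Greg's suggestion above, everything else is just a
placeholder:)

    import marshal

    def load_versioned(data):
        # 'V' + version byte at the very start marks a new-style stream;
        # anything else is an old-style marshal value
        if data[:1] == 'V':
            return ord(data[1]), marshal.loads(data[2:])
        return None, marshal.loads(data)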

ciao - chris++

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From mal at lemburg.com  Tue May 30 13:36:00 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 13:36:00 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com>
Message-ID: <3933A7A0.5FAAC5FD@lemburg.com>

Here is my second version of the module. It is somewhat more
flexible and also smaller in size.

BTW, I haven't found any mention of what language and encoding
the locale 'C' assumes or defines. Currently, the module
reports these as None, meaning undefined. Are language and
encoding defined for 'C' ?

(Sorry for posting the whole module -- starship seems to be
down again...)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: localex.py
Type: text/python
Size: 19642 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20000530/e3318342/attachment.bin>

From guido at python.org  Tue May 30 15:59:37 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 May 2000 08:59:37 -0500
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: Your message of "Tue, 30 May 2000 12:26:29 +0200."
             <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> 
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>  
            <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> 
Message-ID: <200005301359.IAA05484@cj20424-a.reston1.va.home.com>

> From: "Fredrik Lundh" <effbot at telia.com>
> 
> I wrote:
> 
> > what's the best way to deal with this?  I see three alter-
> > natives:
> > 
> > a) stick to the old definition, and use chr(10) also for
> >    unicode strings
> > 
> > b) use different definitions for 8-bit strings and unicode
> >    strings; if given an 8-bit string, use chr(10); if given
> >    a 16-bit string, use the LINEBREAK predicate.
> > 
> > c) use LINEBREAK in either case.
> > 
> > I think (c) is the "right thing", but it's the only one that may
> > break existing code...
> 
> I'm probably getting old, but I don't remember if anyone followed
> up on this, and I don't have time to check the archives right now.
> 
> so for the upcoming "feature complete" release, I've decided to
> stick to (a).
> 
> ...
> 
> for the next release, I suggest implementing a fourth alternative:
> 
> d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
>    use chr(10).
> 
> background: in the current implementation, this decision has to
> be made at compile time, and a compiled expression can be used
> with either 8-bit strings or 16-bit strings.
> 
> a fifth alternative would be to use the locale flag to tell the
> difference between unicode and 8-bit characters:
> 
> e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).
> 
> comments?

I proposed before to see what Perl does -- since we're supposedly
following Perl's RE syntax anyway.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Tue May 30 14:03:17 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 14:03:17 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same 
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>
Message-ID: <3933AE05.4640A75D@lemburg.com>

Fredrik Lundh wrote:
> 
> I wrote:
> 
> > what's the best way to deal with this?  I see three alter-
> > natives:
> >
> > a) stick to the old definition, and use chr(10) also for
> >    unicode strings
> >
> > b) use different definitions for 8-bit strings and unicode
> >    strings; if given an 8-bit string, use chr(10); if given
> >    a 16-bit string, use the LINEBREAK predicate.
> >
> > c) use LINEBREAK in either case.
> >
> > I think (c) is the "right thing", but it's the only one that may
> > break existing code...
> 
> I'm probably getting old, but I don't remember if anyone followed
> up on this, and I don't have time to check the archives right now.
> 
> so for the upcoming "feature complete" release, I've decided to
> stick to (a).
> 
> ...
> 
> for the next release, I suggest implementing a fourth alternative:
> 
> d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
>    use chr(10).
> 
> background: in the current implementation, this decision has to
> be made at compile time, and a compiled expression can be used
> with either 8-bit strings or 16-bit strings.
> 
> a fifth alternative would be to use the locale flag to tell the
> difference between unicode and 8-bit characters:
> 
> e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).
> 
> comments?

For Unicode objects you should really default to using the 
Py_UNICODE_ISLINEBREAK() macro which defines all line break
characters (note that CRLF should be interpreted as a
single line break; see PyUnicode_Splitlines()). The reason
here is that Unicode defines how to handle line breaks
and we should try to stick to the standard as close as possible.
All other possibilities could still be made available via new
flags.

For 8-bit strings I'd suggest sticking to the re definition.
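
(For illustration, the Python-level counterpart of PyUnicode_Splitlines()
shows the intended behaviour; roughly something like this, with \u2028
being LINE SEPARATOR and CRLF counting as a single break:)

    >>> u"one\r\ntwo\u2028three".splitlines()
    [u'one', u'two', u'three']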

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Tue May 30 15:40:53 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 30 May 2000 06:40:53 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <200005280628.IAA01239@loewis.home.cs.tu-berlin.de>
Message-ID: <Pine.LNX.4.10.10005300638110.21070-100000@mailhost.beopen.com>

On Sun, 28 May 2000, Martin v. Loewis wrote:
 > In ISO (!) C99, an implementation may define __STDC_ISO_10646__ to
 > indicate that wchar_t is Unicode. The exact wording is

  This is a real improvement!  I've seen brief summaries of the changes
in C99, but I should take a little time to become more familiar with them.
It looked like a real improvement.

 > Of course, at the moment, there are few, if any, implementations that
 > define this macro.

  I think the gcc people are still working on it, but that's to be
expected; there's a lot of things they're still working on.  ;)


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From fredrik at pythonware.com  Tue May 30 16:23:46 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:23:46 +0200
Subject: [Python-Dev] Q: join vs. __join__ ?
Message-ID: <001101bfca42$ae9f9710$0500a8c0@secret.pythonware.com>

(re: yet another endless thread on comp.lang.python)

how about renaming the "join" method to "__join__", so we can
argue that it doesn't really exist.

</F>

<project name="sre" complete="97.1%" />




From fdrake at acm.org  Tue May 30 16:22:42 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 30 May 2000 07:22:42 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where
 to install non-code files)
In-Reply-To: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com>

On Sun, 28 May 2000, Guido van Rossum wrote:
 > > Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
 > > module would provide 'os.environ["HOME"]' similar to the posix
 > > version?  This would certainly simplify the task of application
 > > programmers intending to write portable applications.
 > 
 > This sounds like a nice idea...

  Now that this idea has fermented for a few days, I'm inclined to not
like it.  It smells of making Unix-centric interface to something that
isn't terribly portable as a concept.
  Perhaps there should be a function that does the "right thing",
extracting os.environ["HOME"] if defined, and taking an alternate approach
(os.getcwd() or whatever) otherwise.  I don't think setting
os.environ["HOME"] in the library is a good idea because that changes the
environment that gets published to child processes beyond what the
application does.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From jeremy at alum.mit.edu  Tue May 30 16:33:02 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Tue, 30 May 2000 10:33:02 -0400 (EDT)
Subject: [Python-Dev] SRE snapshot broken
Message-ID: <14643.53534.143126.349006@localhost.localdomain>

I believe I'm looking at the current version.  (It's a file called
snapshot.zip with no version-specific identifying info that I can
find.)

The sre module changed one line in _fixflags from the CVS version.

def _fixflags(flags):
    # convert flag bitmask to sequence
    assert flags == 0
    return ()

The assert flags == 0 is apparently wrong, because it gets called with
an empty tuple if you use sre.search or sre.match.

Also, assuming that simply reverting to the previous test "assert not
flags" fix this bug, is there a test suite that I can run?  Guido
asked me to check in the current snapshot, but it's hard to tell how
to do that correctly.  It's not clear which files belong in the Python
CVS tree, nor is it clear how to test that the build worked.

Jeremy




From guido at python.org  Tue May 30 17:34:04 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 May 2000 10:34:04 -0500
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: Your message of "Tue, 30 May 2000 07:22:42 MST."
             <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com> 
References: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com> 
Message-ID: <200005301534.KAA06322@cj20424-a.reston1.va.home.com>

[Fred]
>   Now that this idea has fermented for a few days, I'm inclined to not
> like it.  It smells of making a Unix-centric interface to something that
> isn't terribly portable as a concept.
>   Perhaps there should be a function that does the "right thing",
> extracting os.environ["HOME"] if defined, and taking an alternate approach
> (os.getcwd() or whatever) otherwise.  I don't think setting
> os.environ["HOME"] in the library is a good idea because that changes the
> environment that gets published to child processes beyond what the
> application does.

The passing on to child processes doesn't sound like a big deal to me.
Either these are Python programs, in which case they might appreciate
that the work has already been done, or they aren't, in which case
they probably don't look at $HOME at all (since apparently they worked
before).

I could see defining a new API, e.g. os.gethomedir(), but that doesn't
help all the programs that currently use $HOME...  Perhaps we could do
both?  (I.e. add os.gethomedir() *and* set $HOME.)
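
(A rough sketch of what os.gethomedir() might do; the non-Unix fallbacks
below are just placeholders, not a worked-out design:)

    import os

    def gethomedir():
        home = os.environ.get("HOME")
        if home:
            return home
        if os.name == "nt":
            # placeholder: NT usually has HOMEDRIVE/HOMEPATH set
            drive = os.environ.get("HOMEDRIVE", "")
            path = os.environ.get("HOMEPATH")
            if path:
                return drive + path
        return os.getcwd()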

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fredrik at pythonware.com  Tue May 30 16:24:59 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:24:59 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>             <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>  <200005301359.IAA05484@cj20424-a.reston1.va.home.com>
Message-ID: <002801bfca44$b533c900$0500a8c0@secret.pythonware.com>

Guido van Rossum wrote:
> I proposed before to see what Perl does -- since we're supposedly
> following Perl's RE syntax anyway.

anyone happen to have 5.6 on their box?

</F>

<project name="sre" complete="97.1%" />




From fredrik at pythonware.com  Tue May 30 16:38:29 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:38:29 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com>
Message-ID: <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
...
> > background: in the current implementation, this decision has to
> > be made at compile time, and a compiled expression can be used
> > with either 8-bit strings or 16-bit strings.
...
> For Unicode objects you should really default to using the 
> Py_UNICODE_ISLINEBREAK() macro which defines all line break
> characters (note that CRLF should be interpreted as a
> single line break; see PyUnicode_Splitlines()). The reason
> here is that Unicode defines how to handle line breaks
> and we should try to stick to the standard as close as possible.
> All other possibilities could still be made available via new
> flags.
> 
> For 8-bit strings I'd suggest sticking to the re definition.

guess my background description wasn't clear:

Once a pattern has been compiled, it will always handle line
endings in the same way. The parser doesn't really care if the
pattern is a unicode string or an 8-bit string (unicode strings
can contain "wide" characters, but that's the only difference).

At the other end, the same compiled pattern can be applied
to either 8-bit or unicode strings.  It's all just characters to
the engine...

Now, I can of course change the engine so that it always uses
chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but the
result is that

    pattern.match(widestring)

won't necessarily match the same thing as

    pattern.match(str(widestring))

even if the wide string only contains plain ASCII.

(another alternative is to recompile the pattern for each target
string type, but that will hurt performance...)

</F>

<project name="sre" complete="97.1%" />




From mal at lemburg.com  Tue May 30 16:57:57 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 16:57:57 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same 
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com>
Message-ID: <3933D6F5.F6BDA39@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> ...
> > > background: in the current implementation, this decision has to
> > > be made at compile time, and a compiled expression can be used
> > > with either 8-bit strings or 16-bit strings.
> ...
> > For Unicode objects you should really default to using the
> > Py_UNICODE_ISLINEBREAK() macro which defines all line break
> > characters (note that CRLF should be interpreted as a
> > single line break; see PyUnicode_Splitlines()). The reason
> > here is that Unicode defines how to handle line breaks
> > and we should try to stick to the standard as close as possible.
> > All other possibilities could still be made available via new
> > flags.
> >
> > For 8-bit strings I'd suggest sticking to the re definition.
> 
> guess my background description wasn't clear:
> 
> Once a pattern has been compiled, it will always handle line
> endings in the same way. The parser doesn't really care if the
> pattern is a unicode string or an 8-bit string (unicode strings
> can contain "wide" characters, but that's the only difference).

Ok.

> At the other end, the same compiled pattern can be applied
> to either 8-bit or unicode strings.  It's all just characters to
> the engine...

Doesn't the engine remember whether the pattern was a string
or Unicode ?
 
> Now, I can of course change the engine so that it always uses
> chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but the
> result is that
> 
>     pattern.match(widestring)
> 
> won't necessarily match the same thing as
> 
>     pattern.match(str(widestring))
> 
> even if the wide string only contains plain ASCII.

Hmm, I wouldn't mind, as long as the engine does the right
thing for Unicode which is to respect the line break
standard defined in Unicode TR13.

Thinking about this some more: I wouldn't even mind if
the engine would use LINEBREAK for all strings :-). It would
certainly make life easier whenever you have to deal with
file input from different platforms, e.g. Mac, Unix and
Windows.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fredrik at pythonware.com  Tue May 30 17:14:00 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 17:14:00 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com>
Message-ID: <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>

M.-A. Lemburg <mal at lemburg.com> wrote:
> BTW, I haven't found any mention of what language and encoding
> the locale 'C' assumes or defines. Currently, the module
> reports these as None, meaning undefined. Are language and
> encoding defined for 'C' ?

IIRC, the C locale (and the POSIX character set) is defined in terms
of a "portable character set".  This set contains all ASCII characters,
but doesn't specify what code points to use.

But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
using Python on a non-ASCII platform?  does it even build and run
on such a beast?)

</F>

<project name="sre" complete="97.1%" />




From fredrik at pythonware.com  Tue May 30 17:19:48 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 17:19:48 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com> <3933D6F5.F6BDA39@lemburg.com>
Message-ID: <00b601bfca4a$7f0aad20$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
> > At the other end, the same compiled pattern can be applied
> > to either 8-bit or unicode strings.  It's all just characters to
> > the engine...
> 
> > Doesn't the engine remember whether the pattern was a string
> > or Unicode ?

The pattern object contains a reference to the original pattern
string, so I guess the answer is "yes, but indirectly".  But the core
engine doesn't really care -- it just follows the instructions in the
compiled pattern.

> Thinking about this some more: I wouldn't even mind if
> the engine would use LINEBREAK for all strings :-). It would
> certainly make life easier whenever you have to deal with
> file input from different platforms, e.g. Mac, Unix and
> Windows.

That's what I originally proposed (and implemented).  But this may
(in theory, at least) break existing code.  If nothing else, it broke the
test suite ;-)

</F>

<project name="sre" complete="97.1%" />




From akuchlin at cnri.reston.va.us  Tue May 30 17:16:14 2000
From: akuchlin at cnri.reston.va.us (Andrew M. Kuchling)
Date: Tue, 30 May 2000 11:16:14 -0400
Subject: [Python-Dev] Re: Extending locale.py
In-Reply-To: <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>; from fredrik@pythonware.com on Tue, May 30, 2000 at 05:14:00PM +0200
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com> <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>
Message-ID: <20000530111614.B7942@amarok.cnri.reston.va.us>

On Tue, May 30, 2000 at 05:14:00PM +0200, Fredrik Lundh wrote:
>But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
>using Python on a non-ASCII platform?  does it even build and run
>on such a beast?)

The OS/390 port of 1.4? (http://www.s390.ibm.com/products/oe/python.html)
But it doesn't look like they ported the regex module at all.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Better get going; your parents still think me imaginary, and I'd hate to
shatter an illusion like that before dinner.
  -- The monster, in STANLEY AND HIS MONSTER #1





From gmcm at hypernet.com  Tue May 30 17:29:39 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Tue, 30 May 2000 11:29:39 -0400
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
Message-ID: <1252421881-3397332@hypernet.com>

Fred L. Drake wrote:

>   Now that this idea has fermented for a few days, I'm inclined to
> not like it.  It smells of making a Unix-centric interface to
> something that isn't terribly portable as a concept.

I've refrained from jumping in here (as, it seems, have all the 
Windows users) because this is a god-awful friggin' mess on 
Windows.


From fdrake at acm.org  Tue May 30 18:10:29 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 09:10:29 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <1252421881-3397332@hypernet.com>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
	<1252421881-3397332@hypernet.com>
Message-ID: <14643.59381.73286.292195@mailhost.beopen.com>

Gordon McMillan writes:
 > From the 10**3 foot view, yes, they have the concept. From 
 > any closer it falls apart miserably.

  So they have the concept, just no implementation.  ;)  Sounds like
leaving it up to the application to interpret their requirements is
the right thing.  Or the right thing is to provide a function to ask
where configuration information should be stored for the
user/application; this would be $HOME under Unix and <whatever> on
Windows.  The only other reason I can think of that $HOME is needed is
for navigation purposes (as in a filesystem browser), and for that the
application needs to deal with the lack of the concept in the
operating system as appropriate.

 > (A cmd.exe "cd" w/o arg acts like "pwd". I notice that the 
 > bash shell requires you to set $HOME, and won't make any 
 > guesses.)

  This very definitely sounds like overloading $HOME is the wrong
thing.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From pf at artcom-gmbh.de  Tue May 30 18:37:41 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 30 May 2000 18:37:41 +0200 (MEST)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <200005301534.KAA06322@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 30, 2000 10:34: 4 am"
Message-ID: <m12wp1R-000DifC@artcom0.artcom-gmbh.de>

> [Fred]
> >   Now that this idea has fermented for a few days, I'm inclined to not
> > like it.  It smells of making a Unix-centric interface to something that
> > isn't terribly portable as a concept.

Yes.  After thinking more carefully and after a closer look at what 
Jack Jansen finally figured out for MacOS (see 
	<http://www.python.org/pipermail/pythonmac-sig/2000-May/003667.html>
) I agree with Fred.  My initial idea to put something into
'os.environ["HOME"]' on those platforms was too simple minded.

> >   Perhaps there should be a function that does the "right thing",
> > extracting os.environ["HOME"] if defined, and taking an alternate approach
> > (os.getcwd() or whatever) otherwise.  
[...]

Every serious (non-trivial) application usually contains something like
"user preferences" or other state information, which should --if possible--
survive the following kinds of events:
  1. An upgrade of the application to a newer version.  This is
     often accomplished by removing the directory tree in which the
     application lives and replacing it by unpacking or installing
     an archive containing the new version of the application.
  2. Another colleague uses the application on the same computer and
     modifies settings to fit his personal taste.

On several versions of WinXX and on MacOS prior to release 9.X (and due
to stability problems with the multiuser capabilities even in MacOS 9)
the second kind of event seems to be rather unimportant to the users
of these platforms, since the OSes are considered as "single user"
systems anyway.  Or in other words:  the users are already used to
this situation.

Only the first kind of event should be solved for all platforms:  
<FANTASY>
    Imagine you are using grail version 4.61 on a daily basis for WWW 
    browsing and one day you decide to install the nifty upgrade 
    grail 4.73 on your computer running WinXX or MacOS X.Y 
    and after doing so you just discover that all your carefully
    sorted bookmarks are gone!  That wouldn't be nice, would it?
</FANTASY>

I see some similarities here to the standard library module 'tempfile',
which supplies (or at least tries to supply ;-)) a cross-platform portable
strategy for all applications which have to store temporary data.

My intention was to have a simple cross-platform portable API to store
and retrieve such user-specific state information (examples: the bookmarks
of a Web browser, themes, color settings, fonts... other GUI settings,
and so on... you get the picture).  On Unices, applications usually use the
idiom 
	os.path.join(os.environ.get("HOME", "."), ".dotfoobar")
or something similar.

Do people remember 'grail'?  I've stolen the following code snippets
from 'grail0.6/grailbase/utils.py' just to demonstrate that this is still
a very common programming problem:
---------------- snip ---------------------
# XXX Unix specific stuff
# XXX (Actually it limps along just fine for Macintosh, too)
 
def getgraildir():
    return getenv("GRAILDIR") or os.path.join(gethome(), ".grail")    
----- snip ------
def gethome():
    try:
        home = getenv("HOME")
        if not home:
            import pwd
            user = getenv("USER") or getenv("LOGNAME")
            if not user:
                pwent = pwd.getpwuid(os.getuid())
            else:
                pwent = pwd.getpwnam(user)
            home = pwent[6]
        return home
    except (KeyError, ImportError):
        return os.curdir
---------------- snap ---------------------
[...]

[Guido van Rossum]:
> I could see defining a new API, e.g. os.gethomedir(), but that doesn't
> help all the programs that currently use $HOME...  Perhaps we could do
> both?  (I.e. add os.gethomedir() *and* set $HOME.)

I'm not sure whether this is really generic enough for the OS module.

Maybe we should introduce a new small standard library module called
'userprefs' or some such?  A programmer with a MacOS or WinXX background
will probably not know what to do with 'os.gethomedir()'.

However, for the time being this module would only contain one simple
function returning a directory pathname, which is guaranteed to exist
and to survive a deinstallation of an application.  Maybe introducing
a new module is overkill?  What do you think?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From fdrake at acm.org  Tue May 30 19:17:56 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 10:17:56 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <m12wp1R-000DifC@artcom0.artcom-gmbh.de>
References: <200005301534.KAA06322@cj20424-a.reston1.va.home.com>
	<m12wp1R-000DifC@artcom0.artcom-gmbh.de>
Message-ID: <14643.63428.387306.455383@mailhost.beopen.com>

Peter Funk writes:
 > <FANTASY>
 >     Imagine you are using grail version 4.61 on a daily basis for WWW 
 >     browsing and one day you decide to install the nifty upgrade 
 >     grail 4.73 on your computer running WinXX or MacOS X.Y 

  Good thing you marked that as fantasy -- I would have asked for the
download URL!  ;)

 > Do people remember 'grail'?  I've just stolen the following code snippets

  Not on good days.  ;)

 > I'm not sure whether this is really generic enough for the OS module.

  The location selected is constrained by the OS, but this isn't an
exposure of operating system functionality, so there should probably
be something else.

 > Maybe we should introduce a new small standard library module called 
 > 'userprefs' or some such?  A programmer with a MacOS or WinXX  background 
 > will probably not know what to do with 'os.gethomedir()'.  
 > 
 > However for the time being this module would only contain one simple 
 > function returning a directory pathname, which is guaranteed to exist 
 > and to survive a deinstallation of an application.  Maybe introducing

  Look at your $HOME on a Unix box; most of the dotfiles are *files*, not
directories, and that's all most applications need; Web browsers are a
special case in this way; there aren't that many things that require a
directory.  Those things which do are often programs that form an
essential part of a user's environment -- Web browsers and email
clients are two good examples I've seen that really seem to have a lot
of things.
  I think what's needed is a function to return the location where the
application can make one directory entry.  The caller is still
responsible for creating a directory to store a larger set of files if
needed.  Something like grailbase.utils.establish_dir() might be a
nice convenience function.
  An additional convenience may be to offer a function which takes the
application name and a dotfile name, and returns the one to use; the
Windows and MacOS (and BeOS?) worlds seem more comfortable with the
longer, mixed-case, more readable names, while the Unix world enjoys
cryptic little names with a dot at the front.
  Ok, so now that I've rambled, the "userprefs" module looks like it
contains:

        get_appdata_root() -- $HOME, or other based on platform
        get_appdata_name() -- "MyApplication Preferences" or ".myapp"
        establish_dir() -- create dir if it doesn't exist

  Maybe this really is a separate module.  ;)
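
  (A rough sketch of the module outlined above, assuming the hypothetical
names stick; the non-Unix branches are only placeholders:)

    import os

    def get_appdata_root():
        # $HOME on Unix; some per-user area elsewhere (placeholder fallback)
        return os.environ.get("HOME") or os.getcwd()

    def get_appdata_name(appname, dotname):
        # ".myapp" on Unix, "MyApplication Preferences" elsewhere
        if os.name == "posix":
            return dotname
        return appname + " Preferences"

    def establish_dir(path):
        # create the directory if it doesn't exist yet, then hand it back
        if not os.path.isdir(path):
            os.mkdir(path)
        return path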


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From mal at lemburg.com  Tue May 30 19:54:32 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 19:54:32 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com> <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>
Message-ID: <39340058.CA3FC798@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal at lemburg.com> wrote:
> > BTW, I haven't found any mention of what language and encoding
> > the locale 'C' assumes or defines. Currently, the module
> > reports these as None, meaning undefined. Are language and
> > encoding defined for 'C' ?
> 
> IIRC, the C locale (and the POSIX character set) is defined in terms
> of a "portable character set".  This set contains all ASCII characters,
> but doesn't specify what code points to use.
> 
> But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
> using Python on a non-ASCII platform?  does it even build and run
> on such a beast?)

Hmm, that would mean having an encoding, but no language
definition available -- setlocale() doesn't work without
language code... I guess it's better to leave things
undefined in that case.

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May 30 19:57:41 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 19:57:41 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same 
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com> <3933D6F5.F6BDA39@lemburg.com> <00b601bfca4a$7f0aad20$0500a8c0@secret.pythonware.com>
Message-ID: <39340115.7E05DA6C@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > > At the other end, the same compiled pattern can be applied
> > > to either 8-bit or unicode strings.  It's all just characters to
> > > the engine...
> >
> > Doesn't the engine remember whether the pattern was a string
> > or Unicode ?
> 
> The pattern object contains a reference to the original pattern
> string, so I guess the answer is "yes, but indirectly".  But the core
> engine doesn't really care -- it just follows the instructions in the
> compiled pattern.
> 
> > Thinking about this some more: I wouldn't even mind if
> > the engine would use LINEBREAK for all strings :-). It would
> > certainly make life easier whenever you have to deal with
> > file input from different platforms, e.g. Mac, Unix and
> > Windows.
> 
> That's what I originally proposed (and implemented).  But this may
> (in theory, at least) break existing code.  If nothing else, it broke the
> test suite ;-)

SRE is new, so what could it break ?

Anyway, perhaps we should wait for some Perl 5.6 wizard to
speak up ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at python.org  Tue May 30 21:16:13 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 May 2000 14:16:13 -0500
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
Message-ID: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>

FYI, here's an important announcement that I just sent to c.l.py.  I'm
very excited that we can finally announce this!

I'll be checking mail sporadically until Thursday morning.  Back on
June 19.

--Guido van Rossum (home page: http://www.python.org/~guido/)

To all Python users and developers:

Python is growing rapidly.  In order to take it to the next level,
I've moved with my core development group to a new employer,
BeOpen.com.  BeOpen.com is a startup company with a focus on open
source communities, and an interest in facilitating next generation
application development.  It is a natural fit for Python.

At BeOpen.com I am the director of a new development team named
PythonLabs.  The team includes three of my former colleagues at CNRI:
Fred Drake, Jeremy Hylton, and Barry Warsaw.  Another familiar face
will join us shortly: Tim Peters.  We have our own website
(www.pythonlabs.com) where you can read more about us, our plans and
our activities.  We've also posted a FAQ there specifically about
PythonLabs, our transition to BeOpen.com, and what it means for the
Python community.

What will change, and what will stay the same? First of all, Python
will remain Open Source.  In fact, everything we produce at PythonLabs
will be released with an Open Source license.  Also, www.python.org
will remain the number one website for the Python community.  CNRI
will continue to host it, and we'll maintain it as a community
project.

What changes is how much time we have for Python.  Previously, Python
was a hobby or side project, which had to compete with our day jobs;
at BeOpen.com we will be focused full time on Python development! This
means that we'll be able to spend much more time on exciting new
projects like Python 3000.  We'll also get support for website
management from BeOpen.com's professional web developers, and we'll
work with their marketing department.

Marketing for Python, you ask? Sure, why not! We want to grow the size
of the Python user and developer community at an even faster pace than
today.  This should benefit everyone: the larger the community, the
more resources will be available to all, and the easier it will be to
find Python expertise when you need it.  We're also planning to make
commercial offerings (within the Open Source guidelines!) to help
Python find its way into the hands of more programmers, especially in
large enterprises where adoption is still lagging.

There's one piece of bad news: Python 1.6 won't be released by June
1st.  There's simply too much left to be done.  We promise that we'll
get it out of the door as soon as possible.  By the way, Python 1.6
will be the last release from CNRI; after that, we'll issue Python
releases from BeOpen.com.

Oh, and to top it all off, I'm going on vacation.  I'm getting married
and will be relaxing on my honeymoon.  For all questions about
PythonLabs, write to pythonlabs-info at beopen.com.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From esr at thyrsus.com  Tue May 30 20:27:18 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 30 May 2000 14:27:18 -0400
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>; from guido@python.org on Tue, May 30, 2000 at 02:16:13PM -0500
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <20000530142718.A24289@thyrsus.com>

Guido van Rossum <guido at python.org>:
> Oh, and to top it all off, I'm going on vacation.  I'm getting married
> and will be relaxing on my honeymoon.

Mazel tov, Guido!

BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
to include it in 1.6?
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The Constitution is not neutral. It was designed to take the
government off the backs of the people.
	-- Justice William O. Douglas 



From fdrake at acm.org  Tue May 30 20:23:25 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 11:23:25 -0700 (PDT)
Subject: [Python-Dev] ascii.py + documentation
In-Reply-To: <20000530142718.A24289@thyrsus.com>
References: <20000530142718.A24289@thyrsus.com>
Message-ID: <14644.1821.67068.165890@mailhost.beopen.com>

Eric S. Raymond writes:
 > BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
 > to include it in 1.6?

Eric,
  Apparently the rest of us haven't heard of it.  Since Guido's a
little distracted right now, perhaps you should send the files to
python-dev for discussion?
  Thanks!


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From gward at mems-exchange.org  Tue May 30 20:25:42 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Tue, 30 May 2000 14:25:42 -0400
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>; from guido@python.org on Tue, May 30, 2000 at 02:16:13PM -0500
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <20000530142541.D20088@mems-exchange.org>

On 30 May 2000, Guido van Rossum said:
> At BeOpen.com I am the director of a new development team named
> PythonLabs.  The team includes three of my former colleagues at CNRI:
> Fred Drake, Jeremy Hylton, and Barry Warsaw.

Ahh, no wonder it's been so quiet around here.  I was wondering where
you guys had gone.  Mystery solved!

(It's a *joke!*  We already *knew* they were leaving...)

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From trentm at activestate.com  Tue May 30 20:26:38 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 30 May 2000 11:26:38 -0700
Subject: [Python-Dev] inspect.py
In-Reply-To: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
References: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
Message-ID: <20000530112638.E18024@activestate.com>

Looks cool, Ping.

Trent


-- 
Trent Mick
trentm at activestate.com



From guido at python.org  Tue May 30 21:34:38 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 May 2000 14:34:38 -0500
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: Your message of "Tue, 30 May 2000 14:27:18 -0400."
             <20000530142718.A24289@thyrsus.com> 
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>  
            <20000530142718.A24289@thyrsus.com> 
Message-ID: <200005301934.OAA07671@cj20424-a.reston1.va.home.com>

> Mazel tov, Guido!

Thanks!

> BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
> to include it in 1.6?

Yes, and probably.  As Fred suggested, could you resend to the patches
list?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From effbot at telia.com  Tue May 30 20:40:13 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 20:40:13 +0200
Subject: [Python-Dev] inspect.py
References: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
Message-ID: <012e01bfca66$7ed61ee0$f2a6b5d4@hagrid>

ping wrote:
> The reason i'm mentioning this here is that, in the course of
> doing that, i put all the introspection work in a separate
> module called "inspect.py".  It's at
> 
>     http://www.lfw.org/python/inspect.py
>
...
>
> I think most of this stuff is quite generally useful, and it
> seems good to wrap this up in a module.  I'd like your thoughts
> on whether this is worth including in the standard library.

haven't looked at the code (yet), but +1 on concept.

(if this goes into 1.6, I no longer have to keep reposting
pointers to my "describe" module...)

</F>




From skip at mojam.com  Tue May 30 20:43:36 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 30 May 2000 13:43:36 -0500 (CDT)
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <14644.3032.900770.450584@beluga.mojam.com>

    Guido> Python is growing rapidly.  In order to take it to the next
    Guido> level, I've moved with my core development group to a new
    Guido> employer, BeOpen.com.

Great news!

    Guido> Oh, and to top it all off, I'm going on vacation.  I'm getting
    Guido> married and will be relaxing on my honeymoon.  For all questions
    Guido> about PythonLabs, write to pythonlabs-info at beopen.com.

Nice to see you are trying to maintain some consistency in the face of huge
professional and personal changes.  I would have worried if you weren't
going to go on vacation!  Congratulations on both moves...

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From esr at thyrsus.com  Tue May 30 20:58:38 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 30 May 2000 14:58:38 -0400
Subject: [Python-Dev] ascii.py + documentation
In-Reply-To: <14644.1821.67068.165890@mailhost.beopen.com>; from fdrake@acm.org on Tue, May 30, 2000 at 11:23:25AM -0700
References: <20000530142718.A24289@thyrsus.com> <14644.1821.67068.165890@mailhost.beopen.com>
Message-ID: <20000530145838.A24339@thyrsus.com>

Fred L. Drake, Jr. <fdrake at acm.org>:
>   Appearantly the rest of us haven't heard of it.  Since Guido's a
> little distracted right now, perhaps you should send the files to
> python-dev for discussion?

Righty-O.  Here they are enclosed.  I wrote this for use with the
curses module; one reason it's useful is that the curses
getch function returns ordinal values rather than characters.  It should
be more generally useful for any Python program with a raw character-by-
character command interface.

The TeX may need trivial markup fixes.  You might want to add a "See also"
to curses.

I'm using this code heavily in my CML2 project, so it has been tested.
For those of you who haven't heard about CML2, I've written a replacement
for the Linux kernel configuration system in Python.  You can find out more
at:

	http://www.tuxedo.org/~esr/kbuild/

The code has some interesting properties, including the ability to
probe its environment and come up in a Tk-based, curses-based, or
line-oriented mode depending on what it sees.

ascii.py will probably not be the last library code this project spawns.
I have another package called menubrowser that is a framework for writing
menu systems. And I have some Python wrapper enhancements for curses in
the works.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The two pillars of `political correctness' are, 
  a) willful ignorance, and
  b) a steadfast refusal to face the truth
	-- George MacDonald Fraser
-------------- next part --------------
#
# ascii.py -- constants and membership tests for ASCII characters
#

NUL	= 0x00	# ^@
SOH	= 0x01	# ^A
STX	= 0x02	# ^B
ETX	= 0x03	# ^C
EOT	= 0x04	# ^D
ENQ	= 0x05	# ^E
ACK	= 0x06	# ^F
BEL	= 0x07	# ^G
BS	= 0x08	# ^H
TAB	= 0x09	# ^I
HT	= 0x09	# ^I
LF	= 0x0a	# ^J
NL	= 0x0a	# ^J
VT	= 0x0b	# ^K
FF	= 0x0c	# ^L
CR	= 0x0d	# ^M
SO	= 0x0e	# ^N
SI	= 0x0f	# ^O
DLE	= 0x10	# ^P
DC1	= 0x11	# ^Q
DC2	= 0x12	# ^R
DC3	= 0x13	# ^S
DC4	= 0x14	# ^T
NAK	= 0x15	# ^U
SYN	= 0x16	# ^V
ETB	= 0x17	# ^W
CAN	= 0x18	# ^X
EM	= 0x19	# ^Y
SUB	= 0x1a	# ^Z
ESC	= 0x1b	# ^[
FS	= 0x1c	# ^\
GS	= 0x1d	# ^]
RS	= 0x1e	# ^^
US	= 0x1f	# ^_
SP	= 0x20	# space
DEL	= 0x7f	# delete

def _ctoi(c):
    if type(c) == type(""):
        return ord(c)
    else:
        return c

def isalnum(c): return isalpha(c) or isdigit(c)
def isalpha(c): return isupper(c) or islower(c)
def isascii(c): return _ctoi(c) <= 127		# ?
def isblank(c): return _ctoi(c) in (9, 32)      # tab or space, as in C's isblank()
def iscntrl(c): return _ctoi(c) <= 31
def isdigit(c): return _ctoi(c) >= 48 and _ctoi(c) <= 57
def isgraph(c): return _ctoi(c) >= 33 and _ctoi(c) <= 126
def islower(c): return _ctoi(c) >= 97 and _ctoi(c) <= 122
def isprint(c): return _ctoi(c) >= 32 and _ctoi(c) <= 126
def ispunct(c): return _ctoi(c) != 32 and not isalnum(c)
def isspace(c): return _ctoi(c) in (9, 10, 11, 12, 13, 32)    # includes space itself
def isupper(c): return _ctoi(c) >= 65 and _ctoi(c) <= 90
def isxdigit(c): return isdigit(c) or \
    (_ctoi(c) >= 65 and _ctoi(c) <= 70) or (_ctoi(c) >= 97 and _ctoi(c) <= 102)

def ctrl(c):
    if type(c) == type(""):
        return chr(_ctoi(c) & 0x1f)
    else:
        return _ctoi(c) & 0x1f

def alt(c):
    if type(c) == type(""):
        return chr(_ctoi(c) | 0x80)
    else:
        return _ctoi(c) | 0x80
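
# A quick usage sketch (hypothetical interactive session, not part of
# the module proper):
#
#     >>> import ascii
#     >>> ascii.ctrl('g') == chr(ascii.BEL)
#     1
#     >>> ascii.isprint(ascii.DEL)
#     0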




-------------- next part --------------
A non-text attachment was scrubbed...
Name: ascii.tex
Type: application/x-tex
Size: 3250 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20000530/2e3b28fb/attachment.bin>

From jeremy at alum.mit.edu  Tue May 30 23:09:13 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Tue, 30 May 2000 17:09:13 -0400 (EDT)
Subject: [Python-Dev] Python 3000 is going to be *really* different
Message-ID: <14644.11769.197518.938252@localhost.localdomain>

http://www.autopreservers.com/autope07.html

Jeremy




From paul at prescod.net  Wed May 31 07:53:47 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 00:53:47 -0500
Subject: [Python-Dev] SIG: python-lang
Message-ID: <3934A8EB.6608B0E1@prescod.net>

I think that we need a forum somewhere between comp.lang.python and
pythondev. Let's call it python-lang.

By virtue of being buried on the "sigs" page, python-lang would be
mostly only accessible to those who have more than a cursory interest in
Python. Furthermore, you would have to go through a simple
administration procedure to join, as you do with any mailman list.

Appropriate topics of python-lang would be new ideas about language
features. Participants would be expected and encouraged to use archives
and FAQs to avoid repetitive topics. Particularly verboten would be
"ritual topics": indentation, case sensitivity, integer division,
language comparisons, etc. These discussions would be redirected loudly
and firmly to comp.lang.python.

Python-dev would remain invitation only but it would focus on the day to
day mechanics of getting new versions of Python out the door.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)



From nhodgson at bigpond.net.au  Wed May 31 08:39:34 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Wed, 31 May 2000 16:39:34 +1000
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com> <1252421881-3397332@hypernet.com>
Message-ID: <019b01bfcaca$fd72ffc0$e3cb8490@neil>

Gordon writes,

> But there's no $HOME as such.
>
> There's
> HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\S
> hell Folders with around 16 subkeys, including AppData
> (which on my system has one entry installed by a program I've
> never used and didn't know I had). But MSOffice uses the
> Personal subkey. Others seem to use the Desktop subkey.

   SHGetSpecialFolderPath(,,CSIDL_APPDATA,) would be the current 'MS
preferred' method for this as it allows roaming (not that I've ever seen
roaming work). If Unix code expects $HOME to be per machine (and so used to
store, for example, window locations which are dependent on screen
resolution) then CSIDL_LOCAL_APPDATA would be a better choice.

   To make these work on 9x and NT 4 Microsoft provides a redistributable
Shfolder.dll.

   Fred writes,

> >  Look at your $HOME on a Unix box; most of the dotfiles are *files*, not
> directories, and that's all most applications need;

   This may have been the case in the past and for people who understand
Unix well enough to maintain it, but for us just-want-it-to-run folks, it's
no longer true. I formatted my Linux partition this week and installed Red
Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
outnumber the dot files 18 to 16.

   Neil




From pf at artcom-gmbh.de  Wed May 31 09:34:34 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Wed, 31 May 2000 09:34:34 +0200 (MEST)
Subject: [Python-Dev] 'userprefs.py': Looking for help for WinXX (was Re: user dirs on Non-Unix platforms...)
In-Reply-To: <019b01bfcaca$fd72ffc0$e3cb8490@neil> from Neil Hodgson at "May 31, 2000  4:39:34 pm"
Message-ID: <m12x31O-000DifC@artcom0.artcom-gmbh.de>

> Gordon writes,
> 
> > But there's no $HOME as such.
> >
> > There's
> > HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\S
> > hell Folders with around 16 subkeys, including AppData
> > (which on my system has one entry installed by a program I've
> > never used and didn't know I had). But MSOffice uses the
> > Personal subkey. Others seem to use the Desktop subkey.

Neil responds:
>    SHGetSpecialFolderPath(,,CSIDL_APPDATA,) would be the current 'MS
> preferred' method for this as it allows roaming (not that I've ever seen
> roaming work). If Unix code expects $HOME to be per machine (and so used to
> store, for example, window locations which are dependent on screen
> resolution) then CSIDL_LOCAL_APPDATA would be a better choice.
> 
>    To make these work on 9x and NT 4 Microsoft provides a redistributable
> Shfolder.dll.

Using a place on the local machine of the user makes more sense to me.

But excuse my ignorance: I've just 'grep'ed through the Python 1.6a2
sources and also through Mark Hammond's Win32 Python extension C++
sources (here on my notebook running Linux) and found nothing called
'SHGetSpecialFolderPath'.  So I believe this API is currently not
exposed at the Python level.  Right?

So it would be very nice if you WinXX gurus, who are more familiar with
that platform, would come up with some Python code snippet which I could
try to include in the upcoming standard library module 'userprefs.py' I
plan to write.  Something like:
    if os.name == 'nt':
        try:
            import win32XYZ
            if hasattr(win32XYZ, 'SHGetSpecialFolderPath'):
                userplace = win32XYZ.SHGetSpecialFolderPath(.....) 
        except ImportError:
            .....
would be very fine.

>    Fred writes,
> 
> >  Look at your $HOME on a Unix box; most of the dotfiles are *files*, not
> > directories, and that's all most applications need;
> 
>    This may have been the case in the past and for people who understand
> Unix well enough to maintain it, but for us just-want-it-to-run folks, its
> no longer true. I formatted my Linux partition this week and installed Red
> Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
> outnumber the dot files 18 to 16.

Fred proposed an API which leaves the decision whether to use a
single file or several files in a special directory up to the
application developer.

I agree with Fred.

Simple applications will use only a simple config file, whereas bigger
applications will need a directory to store several files.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From nhodgson at bigpond.net.au  Wed May 31 10:18:20 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Wed, 31 May 2000 18:18:20 +1000
Subject: [Python-Dev] 'userprefs.py': Looking for help for WinXX (was Re: user dirs on Non-Unix platforms...)
References: <m12x31O-000DifC@artcom0.artcom-gmbh.de>
Message-ID: <006b01bfcad8$c899c2d0$e3cb8490@neil>

> Using a place on the local machine of the user makes more sense to me.
>
> But excuse my ingorance: I've just 'grep'ed through the Python 1.6a2
> sources and also through Mark Hammonds Win32 Python extension c++
> sources (here on my Notebook running Linux) and found nothing called
> 'SHGetSpecialFolderPath'.  So I believe, this API is currently not
> exposed to the Python level.  Right?

   Only through the Win32 Python extensions, I think:

>>> from win32com.shell import shell
>>> from win32com.shell import shellcon
>>> shell.SHGetSpecialFolderPath(0, shellcon.CSIDL_APPDATA)
u'G:\\Documents and Settings\\Neil1\\Application Data'
>>> shell.SHGetSpecialFolderPath(0, shellcon.CSIDL_LOCAL_APPDATA)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: CSIDL_LOCAL_APPDATA
>>> shell.SHGetSpecialFolderPath(0, 0x1c)
u'G:\\Documents and Settings\\Neil1\\Local Settings\\Application Data'

   Looks like CSIDL_LOCAL_APPDATA isn't included yet, but its value is 0x1c.
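
   For Peter's planned 'userprefs.py', a minimal sketch building on the
above could look like this (untested, and the helper name is only an
illustration):

    import os

    def get_user_prefs_dir():
        """Best-effort per-user directory for application preferences."""
        if os.name == 'nt':
            try:
                from win32com.shell import shell, shellcon
                # CSIDL_APPDATA follows a roaming user around; 0x1c is
                # CSIDL_LOCAL_APPDATA for strictly machine-local data.
                return shell.SHGetSpecialFolderPath(0, shellcon.CSIDL_APPDATA)
            except ImportError:
                pass
        # Unix and everything else: fall back to $HOME
        return os.environ.get('HOME', os.curdir)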

   Neil




From effbot at telia.com  Wed May 31 16:05:41 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 31 May 2000 16:05:41 +0200
Subject: [Python-Dev] Q: maybe rlcompleter shouldn't expose __builtins__?
Message-ID: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>

from comp.lang.python:

> Thanks for the info.  This choice of name is very confusing, to say the least.
> I used commandline completion with __buil TAB, and got __builtins__.

a simple way to avoid this problem is to change global_matches
in rlcompleter.py so that it doesn't return this name.  I suggest
changing:

                if word[:n] == text:

to

                if word[:n] == text and word != "__builtins__":
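
for context, here is roughly how the filtered loop in global_matches
would then read (reconstructed from memory, not the verbatim library
source):

    import keyword, __builtin__, __main__

    def global_matches(text):
        """Return keywords, builtins and __main__ names matching 'text'."""
        matches = []
        n = len(text)
        for words in [keyword.kwlist,
                      __builtin__.__dict__.keys(),
                      __main__.__dict__.keys()]:
            for word in words:
                if word[:n] == text and word != "__builtins__":
                    matches.append(word)
        return matches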

Comments?

(should we do a series of double-blind tests first? ;-)

</F>

    "People Propose, Science Studies, Technology Conforms"
    -- Don Norman





From fdrake at acm.org  Wed May 31 16:32:27 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 10:32:27 -0400 (EDT)
Subject: [Python-Dev] Q: maybe rlcompleter shouldn't expose __builtins__?
In-Reply-To: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>
References: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>
Message-ID: <14645.8827.104869.733028@cj42289-a.reston1.va.home.com>

Fredrik Lundh writes:
 > a simple way to avoid this problem is to change global_matches
 > in rlcompleter.py so that it doesn't return this name.  I suggest
 > changing:

  I've made the change in both global_matches() and attr_matches(); we
don't want to see it as a module attribute any more than as a global.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From gstein at lyra.org  Wed May 31 17:04:20 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 08:04:20 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <3934A8EB.6608B0E1@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>

[ The correct forum is probably meta-sig. ]

IMO, I don't see a need for yet another forum. The dividing lines become a
bit too blurry, and it will result in questions like "where do I post
this?" Or "what is the difference between python-lang at python.org and
python-list at python.org?"

Cheers,
-g

On Wed, 31 May 2000, Paul Prescod wrote:
> I think that we need a forum somewhere between comp.lang.python and
> pythondev. Let's call it python-lang.
> 
> By virtue of being buried on the "sigs" page, python-lang would be
> mostly only accessible to those who have more than a cursory interest in
> Python. Furthermore, you would have to go through a simple
> administration procedure to join, as you do with any mailman list.
> 
> Appropriate topics of python-lang would be new ideas about language
> features. Participants would be expected and encouraged to use archives
> and FAQs to avoid repetitive topics. Particularly verboten would be
> "ritual topics": indentation, case sensitivity, integer division,
> language comparisons, etc. These discussions would be redirected loudly
> and firmly to comp.lang.python.
> 
> Python-dev would remain invitation only but it would focus on the day to
> day mechanics of getting new versions of Python out the door.
> 
> -- 
>  Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
> "Hardly anything more unwelcome can befall a scientific writer than 
> having the foundations of his edifice shaken after the work is 
> finished.  I have been placed in this position by a letter from 
> Mr. Bertrand Russell..." 
>  - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/




From fdrake at acm.org  Wed May 31 17:09:03 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 11:09:03 -0400 (EDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <019b01bfcaca$fd72ffc0$e3cb8490@neil>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
	<1252421881-3397332@hypernet.com>
	<019b01bfcaca$fd72ffc0$e3cb8490@neil>
Message-ID: <14645.11023.118707.176016@cj42289-a.reston1.va.home.com>

Neil Hodgson writes:
 > roaming work). If Unix code expects $HOME to be per machine (and so used to
 > store, for example, window locations which are dependent on screen
 > resolution) then CSIDL_LOCAL_APPDATA would be a better choice.

  This makes me think that there's a need for both per-host and
per-user directories, but I don't know of a good strategy for dealing
with this in general.  Many applications have both kinds of data, but
clump it all together.  What "the norm" is on Unix, I don't really
know, but what I've seen is typically that /home/ is often mounted
over NFS, and so shared for many hosts.  I've seen it always be local
as well, which I find really annoying, but it is easier to support
host-local information.  The catch is that very little information is
*really* host-local, especially using X11 (where window
configurations are display-local at most, and the user may prefer them
to be display-size-local ;).
  What it boils down to is that doing too much before the separations
are easily maintained is premature; a lot of that separation needs to
be handled inside the application, which knows what information is
user-specific and what *might* be host- or display-specific.  Trying
to provide these abstractions in the standard library is likely to be
hard to use if sufficient generality is also provided.

I wrote:
 >  Look at your $HOME on Unix box; most of the dotfiles are *files*, not
 > directories, and that's all most applications need;

And Neil commented:
 >    This may have been the case in the past and for people who understand
 > Unix well enough to maintain it, but for us just-want-it-to-run folks, it's
 > no longer true. I formatted my Linux partition this week and installed Red
 > Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
 > outnumber the dot files 18 to 16.

  Interesting!  But I suspect this is still very dependent on what
software you actually use as well; just because something is placed
there in your "standard" install doesn't mean it's useful.  It might
be more interesting to check after you've used that installation for a
year!  Lots of programs add dotfiles on an as-needed basis, and others
never create them, but require the user to create them using a text
editor (though the latter seems to be falling out of favor in these
days of GUI applications!).


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com




From mal at lemburg.com  Wed May 31 18:18:49 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 31 May 2000 18:18:49 +0200
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
Message-ID: <39353B69.D6E74E2C@lemburg.com>

Would there be interest in adding the python-ldap module
(http://sourceforge.net/project/?group_id=2072) to the
core distribution ?

If yes, I think we should approach David Leonard and
ask him if he is willing to donate the lib (which is
in the public domain) to the core.

FYI, LDAP is a well accepted standard network protocol for
querying address and user information.

An older web page with more background is available at: 

   http://www.it.uq.edu.au/~leonard/dc-prj/ldapmodule/

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From paul at prescod.net  Wed May 31 18:24:45 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 11:24:45 -0500
Subject: [Python-Dev] What's that sound?
Message-ID: <39353CCD.1F3E9A0B@prescod.net>

ActiveState announces four new Python-related projects (PythonDirect,
Komodo, Visual Python, ActivePython).

PythonLabs announces four planet-sized-brains are going to be working on
the Python implementation full time.

PythonWare announces PythonWorks.

Is that the sound of pieces falling into place or of a rumbling
avalanche "warming up" before obliterating everything in its path?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From gstein at lyra.org  Wed May 31 18:30:57 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 09:30:57 -0700 (PDT)
Subject: [Python-Dev] What's that sound?
In-Reply-To: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310928270.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> ActiveState announces four new Python-related projects (PythonDirect,
> Komodo, Visual Python, ActivePython).
> 
> PythonLabs announces four planet-sized-brains are going to be working on
> the Python implementation full time.

Five.

> PythonWare announces PythonWorks.
> 
> Is that the sound of pieces falling into place or of a rumbling
> avalanche "warming up" before obliterating everything in its path?

Full-on, robot chubby earthquake.

:-)

I agree with the basic premise: Python *is* going to get a lot more
visibility than it has enjoyed in the past. You might even add that the
latest GNOME release (1.2) has excellent Python support.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From paul at prescod.net  Wed May 31 18:35:23 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 11:35:23 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
Message-ID: <39353F4B.3E78E22E@prescod.net>

Greg Stein wrote:
> 
> [ The correct forum is probably meta-sig. ]
> 
> IMO, I don't see a need for yet another forum. The dividing lines become a
> bit too blurry, and it will result in questions like "where do I post
> this?" Or "what is the difference between python-lang at python.org and
> python-list at python.org?"

Well, you admit that you don't read python-list, right? Most of us
don't, most of the time. Instead we have important discussions about the
language's future on python-dev, where most of the Python community
cannot participate. I'll say it flat out: I'm uncomfortable with that. I
did not include meta-sig (or python-list) because my issue is
really with the accidental elitism of the python-dev setup. If
python-dev participants do not agree to have important linguistic
discussions in an open forum then setting up the forum is a waste of
time. That's why I'm feeling people here out first.
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From gmcm at hypernet.com  Wed May 31 18:54:22 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Wed, 31 May 2000 12:54:22 -0400
Subject: [Python-Dev] What's that sound?
In-Reply-To: <Pine.LNX.4.10.10005310928270.30220-100000@nebula.lyra.org>
References: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <1252330400-4656058@hypernet.com>

[Paul Prescod]
> > PythonLabs announces four planet-sized-brains are going to be
> > working on the Python implementation full time.
[Greg] 
> Five.

No, he said "planet-sized-brains", not "planet-sized-egos".

Just notice how long it takes Barry to figure out who I meant....

- Gordon



From bwarsaw at python.org  Wed May 31 18:56:06 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 12:56:06 -0400 (EDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
References: <39353B69.D6E74E2C@lemburg.com>
Message-ID: <14645.17446.749848.895965@anthem.python.org>

>>>>> "M" == M  <mal at lemburg.com> writes:

    M> Would there be interest in adding the python-ldap module
    M> (http://sourceforge.net/project/?group_id=2072) to the
    M> core distribution ?

I haven't looked at this stuff, but yes, I think a standard LDAP
module would be quite useful.  It's a well enough established
protocol, and it would be good to be able to count on it "being
there".

-Barry



From bwarsaw at python.org  Wed May 31 18:58:51 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 12:58:51 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <14645.17611.318538.986772@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul at prescod.net> writes:

    PP> Is that the sound of pieces falling into place or of a
    PP> rumbling avalanche "warming up" before obliterating everything
    PP> in its path?

Or a big foot hurtling its way earthward?  The question is, what's
that thing under the shadow of the big toe?  I can only vaguely make
out the first of four letters, and I think it's a `P'.

:)

-Barry



From gstein at lyra.org  Wed May 31 18:59:10 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 09:59:10 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39353F4B.3E78E22E@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310945570.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> Greg Stein wrote:
> > [ The correct forum is probably meta-sig. ]
> > 
> > IMO, I don't see a need for yet another forum. The dividing lines become a
> > bit too blurry, and it will result in questions like "where do I post
> > this?" Or "what is the difference between python-lang at python.org and
> > python-list at python.org?"
> 
> Well, you admit that you don't read python-list, right?

Hehe... you make it sound like I'm a criminal on trial :-)

"And do you admit that you don't read that newsgroup? And do you admit
that you harbor irregular thoughts towards c.l.py posters? And do you
admit to obscene thoughts about Salma Hayek?"

Well, yes, no, and damn straight. :-)

> Most of us
> don't, most of the time. Instead we have important discussions about the
> language's future on python-dev, where most of the Python community
> cannot participate. I'll say it flat out: I'm uncomfortable with that. I

I share that concern, and raised it during the formation of python-dev. It
appears that the pipermail archive is truncated (nothing before April last
year). Honestly, though, I would have to say that I am/was more concerned
with the *perception* rather than actual result.

> did not include meta-sig (or python-list) because my issue is
> really with the accidental elitism of the python-dev setup. If

I disagree with the term "accidental elitism." I would call it "purposeful
meritocracy." The people on python-dev have shown over the span of *years*
that they are capable developers, designers, and have a genuine interest
and care about Python's development. Based on each person's merits, Guido
invited them to participate in this forum.

Perhaps "guido-advisors" would be more appropriately named, but I don't
think Guido likes to display his BDFL status more than necessary :-)

> python-dev participants do not agree to have important linguistic
> discussions in an open forum then setting up the forum is a waste of
> time. That's why I'm feeling people here out first.

Personally, I like the python-dev setting. The noise here is zero. There
are some things that I'm not particularly interested in, thus I pay much
less attention to them, but those items are never noise. I *really* like
that aspect, and would not care to start arguing about language
development in a larger forum where noise, spam, uninformed opinions, and
subjective discussions take place.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From fdrake at acm.org  Wed May 31 19:04:13 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 13:04:13 -0400 (EDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
In-Reply-To: <39353B69.D6E74E2C@lemburg.com>
References: <39353B69.D6E74E2C@lemburg.com>
Message-ID: <14645.17933.810181.300650@cj42289-a.reston1.va.home.com>

M.-A. Lemburg writes:
 > Would there be interest in adding the python-ldap module
 > (http://sourceforge.net/project/?group_id=2072) to the
 > core distribution ?

  Probably!  ACAP (Application Configuration Access Protocol) would be
nice as well -- anybody working on that?

 > FYI, LDAP is a well accepted standard network protocol for
 > querying address and user information.

  And lots of other stuff as well.  Jeremy and I contributed to a
project where it was used to store network latency information.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com




From paul at prescod.net  Wed May 31 19:10:58 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:10:58 -0500
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net> <14645.17611.318538.986772@anthem.python.org>
Message-ID: <393547A2.30CB7113@prescod.net>

"Barry A. Warsaw" wrote:
> 
> Or a big foot hurtling its way earthward?  The question is, what's
> that thing under the shadow of the big toe?  I can only vaguely make
> out the first of four letters, and I think it's a `P'.

Look closer, big-egoed-four-stringed-guitar-playing-one. It could just
as easily be a J.

And you know what you get when you put P and J together?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From paul at prescod.net  Wed May 31 19:21:56 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:21:56 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310945570.30220-100000@nebula.lyra.org>
Message-ID: <39354A34.88B8B6ED@prescod.net>

Greg Stein wrote:
> 
> Hehe... you make it sound like I'm a criminal on trial :-)

Sorry about that. But I'll bet you didn't expect this inquisition did
you?

> I share that concern, and raised it during the formation of python-dev. It
> appears that the pipermail archive is truncated (nothing before April last
> year). Honestly, though, I would have to say that I am/was more concerned
> with the *perception* rather than actual result.

Right, that perception is making people in comp.lang.python get a little
frustrated, paranoid, alienated and nasty. And relaying conversations
from here to there and back puts Fredrik in a bad mood which isn't good
for anyone.

> > did not include meta-sig (or python-list) because my issue is
> > really with the accidental elitism of the python-dev setup. If
> 
> I disagree with the term "accidental elitism." I would call it "purposeful
> meritocracy." 

The reason I think that it is accidental is because I don't think that
anyone expected so many of us to abandon comp.lang.python and thus our
direct connection to Python's user base. It just happened that way due
to human nature. That forum is full of stuff that you or I don't care
about -- compiling on AIX, ADO programming on Windows, Perl idioms, LDAP
(oops, that's here!) etc, and this one is noise-free. I'm saying that we
could have a middle ground where we trade a little noise for a little
democracy -- if only in perception.

I think that perl-porters and linux-kernel are open lists? The dictators
and demigods just had to learn to filter a little. By keeping
"python-dev" for immediately important things and implementation
details, we will actually make it easier to get the day to day pumpkin
passing done.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From bwarsaw at python.org  Wed May 31 19:28:04 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 13:28:04 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
	<1252330400-4656058@hypernet.com>
Message-ID: <14645.19364.259837.684595@anthem.python.org>

>>>>> "Gordo" == Gordon McMillan <gmcm at hypernet.com> writes:

    Gordo> No, he said "planet-sized-brains", not "planet-sized-egos".

    Gordo> Just notice how long it takes Barry to figure out who I
    Gordo> meant....

Waaaaiitt a second....

I /do/ have a very large brain.  I keep it in a jar on the headboard
of my bed, surrounded by a candlelit homage to Geddy Lee.  How else do
you think I got so studly playing bass?



From bwarsaw at python.org  Wed May 31 19:35:36 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 13:35:36 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
	<14645.17611.318538.986772@anthem.python.org>
	<393547A2.30CB7113@prescod.net>
Message-ID: <14645.19816.256896.367440@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul at prescod.net> writes:

    PP> Look closer, big-egoed-four-stringed-guitar-playing-one. It
    PP> could just as easily be a J.

<squint> Could be!  The absolute value of my diopter is about as big
as my ego.

    PP> And you know what you get when you put P and J together?

A very tasty sammich!

-Barry



From paul at prescod.net  Wed May 31 19:45:30 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:45:30 -0500
Subject: [Python-Dev] SIG: python-lang
References: <3934A8EB.6608B0E1@prescod.net>
		<Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org> <14645.16679.139843.148933@anthem.python.org>
Message-ID: <39354FBA.E1DEFEFA@prescod.net>

"Barry A. Warsaw" wrote:
> 
> ...
>
> I agree.  I think anybody who'd be interested in python-lang is
> already going to be a member of python-dev 

Huh? What about Greg Ewing, Amit Patel, Martijn Faassen, William
Tanksley, Mike Fletcher, Neel Krishnaswami, the various stackless
groupies and a million others. This is just a short list of people who
have made reasonable language suggestions recently. Those suggestions
are going into the bit-bucket unless one of us happens to notice and
champion them here. But we're too busy thinking about 1.6 to think about
long-term ideas anyhow.

Plus, we hand down decisions (e.g. about string.join) and they have the
exact same parallel discussion over there. All the while, anyone from
PythonDev is telling them: "We've already been through this stuff. We've
already discussed this." which only (understandably) annoys them more.

> and any discussion will
> probably be crossposted to the point where it makes no difference.

I think that python-dev's role should change. I think that it would
handle day to day implementation stuff -- nothing long term. I mean if
the noise level on python-lang was too high then we could retreat to
python-dev again but I'd like to think we wouldn't have to. A couple of
sharp words from Guido or Tim could end a flamewar pretty quickly.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From esr at thyrsus.com  Wed May 31 19:53:10 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Wed, 31 May 2000 13:53:10 -0400
Subject: [Python-Dev] What's that sound?
In-Reply-To: <14645.19364.259837.684595@anthem.python.org>; from bwarsaw@python.org on Wed, May 31, 2000 at 01:28:04PM -0400
References: <39353CCD.1F3E9A0B@prescod.net> <1252330400-4656058@hypernet.com> <14645.19364.259837.684595@anthem.python.org>
Message-ID: <20000531135310.B29319@thyrsus.com>

Barry A. Warsaw <bwarsaw at python.org>:
> Waaaaiitt a second....
> 
> I /do/ have a very large brain.  I keep it in a jar on the headboard
> of my bed, surrounded by a candlelit homage to Geddy Lee.  How else do
> you think I got so studly playing bass?

Ah, yes.  We take you back now to that splendid year of 1978.  Cue a
certain high-voiced Canadian singing

	The trouble with the Perl guys
	is they're quite convinced they're right...

Duuude....
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The world is filled with violence. Because criminals carry guns, we
decent law-abiding citizens should also have guns. Otherwise they will
win and the decent people will lose.
        -- James Earl Jones



From esr at snark.thyrsus.com  Wed May 31 20:05:33 2000
From: esr at snark.thyrsus.com (Eric S. Raymond)
Date: Wed, 31 May 2000 14:05:33 -0400
Subject: [Python-Dev] Constants
Message-ID: <200005311805.OAA29447@snark.thyrsus.com>

I just looked at Jeremy Hylton's warts posting
at <http://starship.python.net/crew/amk/python/writing/warts.html>

It reminded me that one feature I really, really want in Python 3000
is the ability to declare constants.  Assigning to a constant should 
raise an error.

Is this on the to-do list?
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

What, then is law [government]? It is the collective organization of
the individual right to lawful defense."
	-- Frederic Bastiat, "The Law"



From petrilli at amber.org  Wed May 31 20:17:57 2000
From: petrilli at amber.org (Christopher Petrilli)
Date: Wed, 31 May 2000 14:17:57 -0400
Subject: [Python-Dev] Constants
In-Reply-To: <200005311805.OAA29447@snark.thyrsus.com>; from esr@snark.thyrsus.com on Wed, May 31, 2000 at 02:05:33PM -0400
References: <200005311805.OAA29447@snark.thyrsus.com>
Message-ID: <20000531141757.E5766@trump.amber.org>

Eric S. Raymond [esr at snark.thyrsus.com] wrote:
> I just looked at Jeremy Hylton's warts posting
> at <http://starship.python.net/crew/amk/python/writing/warts.html>
> 
> It reminded me that one feature I really, really want in Python 3000
> is the ability to declare constants.  Assigning to a constant should 
> raise an error.
> 
> Is this on the to-do list?

I know this isn't "perfect", but what I often do is have a
Constants.py file that holds all my constants in a class which has
__setattr__ overridden to raise an exception.  This has two benefits:

    1. The attributes are difficult to modify, at least accidentally.
    2. It keeps the namespace less polluted by thousands of constants.

Just an idea; I do this:

     constants = Constants()
     x = constants.foo

Seems clean (reasonably) to me.
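
One way to spell such a class (a guess at the recipe, not necessarily
the original Constants.py) is to let each attribute be bound once and
refuse any rebinding:

     class Constants:
         """Namespace whose attributes can be bound once but not rebound."""
         def __setattr__(self, name, value):
             if name in self.__dict__:
                 raise TypeError("can't rebind constant '%s'" % name)
             self.__dict__[name] = value

     constants = Constants()
     constants.foo = 42
     x = constants.foo      # fine
     # constants.foo = 43   # would raise TypeError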

I think I stole this from the timbot.

Chris
-- 
| Christopher Petrilli
| petrilli at amber.org



From jeremy at beopen.com  Wed May 31 20:07:18 2000
From: jeremy at beopen.com (Jeremy Hylton)
Date: Wed, 31 May 2000 14:07:18 -0400 (EDT)
Subject: [Python-Dev] Constants
In-Reply-To: <200005311805.OAA29447@snark.thyrsus.com>
References: <200005311805.OAA29447@snark.thyrsus.com>
Message-ID: <14645.21718.365823.507322@localhost.localdomain>

Correction: It's Andrew Kuchling's list of language warts.  I
mentioned it in a post on slashdot, where I ventured a guess that the
most substantial changes most new users will see with Python 3000 are
the removal of these warts.

Jeremy



From akuchlin at cnri.reston.va.us  Wed May 31 20:21:04 2000
From: akuchlin at cnri.reston.va.us (Andrew M. Kuchling)
Date: Wed, 31 May 2000 14:21:04 -0400
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <14645.20637.864287.86178@localhost.localdomain>; from jeremy@beopen.com on Wed, May 31, 2000 at 01:49:17PM -0400
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org> <39353F4B.3E78E22E@prescod.net> <14645.20637.864287.86178@localhost.localdomain>
Message-ID: <20000531142104.A8989@amarok.cnri.reston.va.us>

On Wed, May 31, 2000 at 01:49:17PM -0400, Jeremy Hylton wrote:
>I'm actually more worried about the second.  It's been a while since I
>read c.l.py and I'm occasionally disappointed to miss out on
>seemingly interesting threads.  On the other hand, there is no way I
>could manage to read or even filter the volume on that list.

Really?  I read it through Usenet with GNUS, and it takes about a half
hour to go through everything. Skipping threads by subject usually
makes it easy to avoid uninteresting stuff.  

I'd rather see python-dev limited to very narrow, CVS-tree-related
material, such as: should we add this module?  is this change OK?
&c...  The long-winded language speculation threads are better left to
c.l.python, where more people offer opinions, it's more public, and
newsreaders are more suited to coping with the volume.  (Incidentally,
has any progress been made on reviving c.l.py.announce?)

OTOH, newbies have reported fear of posting in c.l.py, because they
feel the group is too advanced, what with everyone sitting around
talking about coroutines and SNOBOL string parsing.  But I think it's
a good thing if newbies see the high-flown chatter and get their minds
stretched. :)

--amk



From gstein at lyra.org  Wed May 31 20:37:32 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 11:37:32 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39354A34.88B8B6ED@prescod.net>
Message-ID: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> Greg Stein wrote:
> > 
> > Hehe... you make it sound like I'm a criminal on trial :-)
> 
> Sorry about that. But I'll bet you didn't expect this inquisition did
> you?

Well, of course not. Nobody expects the Spanish Inquisition!

Hmm. But you're not Spanish. Dang...

> > I share that concern, and raised it during the formation of python-dev. It
> > appears that the pipermail archive is truncated (nothing before April last
> > year). Honestly, though, I would have to say that I am/was more concerned
> > with the *perception* rather than actual result.
> 
> Right, that perception is making people in comp-lang-python get a little
> frustrated, paranoid, alienated and nasty. And relaying conversations
> from here to there and back puts Fredrik in a bad mood which isn't good
> for anyone.

Understood. I don't have a particular solution to the problem, but I also
believe that python-lang is not going to be a benefit/solution.

Hmm. How about this: you stated the premise is to generate proposals for
language features, extensions, additions, whatever. If that is the only
goal, then consider a web-based system: anybody can post a "feature" with
a description/spec/code/whatever; each feature has threaded comments
attached to it; the kicker: each feature has votes (+1/+0/-0/-1).

When you have a feature with a total vote of +73, then you know that it
needs to be looked at in more detail. All votes are open (not anonymous).
Features can be revised, in an effort to remedy issues raised by -1
voters (and thus turn them into +1 votes).

People can review features and votes in a quick pass. If they prefer to
take more time, then they can also review comments.

Of course, this is only a suggestion. I've got so many other projects that
I'd like to code up right now that I would not want to sign up for
something like this :-)

> > > did not include meta-sig (or python-list) because my issue is
> > > really with the accidental elitism of the python-dev setup. If
> > 
> > I disagree with the term "accidental elitism." I would call it "purposeful
> > meritocracy." 
> 
> The reason I think that it is accidental is because I don't think that
> anyone expected so many of us to abandon comp.lang.python and thus our
> direct connection to Python's user base.

Good point.

I would still disagree with your "elitism" term, but the side-effect is
definitely accidental and unfortunate. It may even be arguable whether
python-dev *is* responsible for that. The SIGs had much more traffic
before python-dev, too. I might suggest that the SIGs were the previous
"low-noise" forum (in favor of c.l.py). python-dev yanked focus from the
SIGs, and only a little from c.l.py (I think c.l.py's burgeoning traffic
reduced readership on its own).

> It just happened that way due
> to human nature. That forum is full of stuff that you or I don't care
> about -- compiling on AIX, ADO programming on Windows, Perl idioms, LDAP
> (oops, that's here!) etc, and this one is noise-free. I'm saying that we
> could have a middle ground where we trade a little noise for a little
> democracy -- if only in perception.

Admirable, but I think it would be ineffectual. People would be confused
about where to post. Too many forums, with arbitrary/unclear lines about
which to use.

How do you like your new job at DataChannel? Rate it on 1-100. "83" you
say? Well, why not 82? What is the difference between 82 and 83?

"Why does this post belong on c.l.py, and not on python-lang?"

The result will be cross-posting because people will want to ensure they
reach the right people/forum.

Of course, people will also post to the "wrong" forum. Confusion, lack of
care, whatever.

> I think that perl-porters and linux-kernel are open lists? The dictators
> and demigods just had to learn to filter a little. By keeping
> "python-dev" for immediately important things and implementation
> details, we will actually make it easier to get the day to day pumpkin
> passing done.

Yes, they are. And Dick Hardt has expressed the opinion that perl-porters
is practically useless. He was literally dumbfounded when I told him that
python-dev is (near) zero-noise.

The Linux guys filter very well. I don't know enough of, say, Alan's or
Linus' other mailing subscriptions to know whether that is the only thing
they subscribe to, or just one of many. I could easily see keeping up with
linux-kernel if that was your only mailing list. I also suspect there is
plenty of out-of-band mail going on between Linus and his "lieutenants"
when they forward patches to him (and his inevitable replies, rejections,
etc).

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bwarsaw at python.org  Wed May 31 20:39:46 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 14:39:46 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <3934A8EB.6608B0E1@prescod.net>
	<Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
	<14645.16679.139843.148933@anthem.python.org>
	<39354FBA.E1DEFEFA@prescod.net>
Message-ID: <14645.23666.161619.557413@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul at prescod.net> writes:

    PP> Plus, we hand down decisions about (e.g. string.join) and they
    PP> have the exact, parallel discussion over there. All the while,
    PP> anyone from PythonDev is telling them: "We've already been
    PP> through this stuff. We've already discussed this." which only
    PP> (understandably) annoys them more.

Good point.

    >> and any discussion will probably be crossposted to the point
    >> where it makes no difference.

    PP> I think that python-dev's role should change. I think that it
    PP> would handle day to day implementation stuff -- nothing long
    PP> term. I mean if the noise level on python-lang was too high
    PP> then we could retreat to python-dev again but I'd like to
    PP> think we wouldn't have to. A couple of sharp words from Guido
    PP> or Tim could end a flamewar pretty quickly.

Then I suggest to moderate python-lang.  Would you (and/or others) be
willing to serve as moderators?  I'd support an open subscription
policy in that case.

-Barry



From pingster at ilm.com  Wed May 31 20:41:13 2000
From: pingster at ilm.com (Ka-Ping Yee)
Date: Wed, 31 May 2000 11:41:13 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
Message-ID: <Pine.SGI.3.96.1000531113831.1049307r-100000@happy>

On Wed, 31 May 2000, Greg Stein wrote:
> Hmm. How about this: you stated the premise is to generate proposals for
> language features, extensions, additions, whatever. If that is the only
> goal, then consider a web-based system: anybody can post a "feature" with
> a description/spec/code/whatever; each feature has threaded comments
> attached to it; the kicker: each feature has votes (+1/+0/-0/-1).

Gee, this sounds familiar.  (Hint: starts with an R and has seven
letters.)  Why are we using Jitterbug again?  Does anybody even submit
things there, and still check the Jitterbug indexes regularly?

Okay, Roundup doesn't have voting, but it does already have priorities
and colour-coded statuses, and voting would be trivial to add.


-- ?!ng




From gstein at lyra.org  Wed May 31 21:04:34 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 12:04:34 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.SGI.3.96.1000531113831.1049307r-100000@happy>
Message-ID: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Ka-Ping Yee wrote:
> On Wed, 31 May 2000, Greg Stein wrote:
> > Hmm. How about this: you stated the premise is to generate proposals for
> > language features, extensions, additions, whatever. If that is the only
> > goal, then consider a web-based system: anybody can post a "feature" with
> > a description/spec/code/whatever; each feature has threaded comments
> > attached to it; the kicker: each feature has votes (+1/+0/-0/-1).
> 
> Gee, this sounds familiar.  (Hint: starts with an R and has seven
> letters.)  Why are we using Jitterbug again?  Does anybody even submit
> things there, and still check the Jitterbug indexes regularly?
> 
> Okay, Roundup doesn't have voting, but it does already have priorities
> and colour-coded statuses, and voting would be trivial to add.

Does Roundup have a web-based interface, where I can see all of the
features, their comments, and their votes? Can the person who posted the
original feature/spec update it? (or must they followup with a
modified proposal instead)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bwarsaw at python.org  Wed May 31 21:12:23 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 15:12:23 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
	<39353F4B.3E78E22E@prescod.net>
	<14645.20637.864287.86178@localhost.localdomain>
	<20000531142104.A8989@amarok.cnri.reston.va.us>
Message-ID: <14645.25623.615735.836896@anthem.python.org>

>>>>> "AMK" == Andrew M Kuchling <akuchlin at cnri.reston.va.us> writes:

    AMK> more suited to coping with the volume.  (Incidentally, has
    AMK> any progress been made on reviving c.l.py.announce?)

Not that I'm aware of, sadly.

-Barry



From bwarsaw at python.org  Wed May 31 21:18:09 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 15:18:09 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
	<Pine.SGI.3.96.1000531113831.1049307r-100000@happy>
Message-ID: <14645.25969.657083.55499@anthem.python.org>

>>>>> "KY" == Ka-Ping Yee <pingster at ilm.com> writes:

    KY> Gee, this sounds familiar.  (Hint: starts with an R and has
    KY> seven letters.)  Why are we using Jitterbug again?  Does
    KY> anybody even submit things there, and still check the
    KY> Jitterbug indexes regularly?

Jitterbug blows.

    KY> Okay, Roundup doesn't have voting, but it does already have
    KY> priorities and colour-coded statuses, and voting would be
    KY> trivial to add.

Roundup sounded just so cool when ?!ng described it at the
conference.  I gotta find some time to look at it! :)

-Barry



From pingster at ilm.com  Wed May 31 21:24:07 2000
From: pingster at ilm.com (Ka-Ping Yee)
Date: Wed, 31 May 2000 12:24:07 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>
Message-ID: <Pine.SGI.3.96.1000531121936.1049307v-100000@happy>

On Wed, 31 May 2000, Greg Stein wrote:
> 
> Does Roundup have a web-based interface,

Yes.

> where I can see all of the
> features, their comments, and their votes?

At the moment, you see date of last activity, description,
priority, status, and fixer (i.e. person who has taken
responsibility for the item).  No votes, but as i said,
that would be really easy.

> Can the person who posted the original feature/spec update it?

Each item has a bunch of mail messages attached to it.
Anyone can edit the description, but that's a short one-line
summary; the only way to propose another design right now
is to send in another message.

Hey, i admit it's a bit primitive, but it seems significantly
better than nothing.  The software people at ILM have coped
with it fairly well for a year, and for the most part we like it.

Go play:  http://www.lfw.org/ping/roundup/roundup.cgi

Username: test  Password: test
Username: spam  Password: spam
Username: eggs  Password: eggs


-- ?!ng




From fdrake at acm.org  Wed May 31 21:58:13 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 15:58:13 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.SGI.3.96.1000531121936.1049307v-100000@happy>
References: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>
	<Pine.SGI.3.96.1000531121936.1049307v-100000@happy>
Message-ID: <14645.28373.733094.942361@cj42289-a.reston1.va.home.com>

Ka-Ping Yee writes:
 > On Wed, 31 May 2000, Greg Stein wrote:
 > > Can the person who posted the original feature/spec update it?
 > 
 > Each item has a bunch of mail messages attached to it.
 > Anyone can edit the description, but that's a short one-line
 > summary; the only way to propose another design right now
 > is to send in another message.

  I thought the roundup interface was quite nice, esp. with the nosy
lists and such.  I'm sure there are a number of small issues, but
nothing Ping can't deal with in a matter of minutes.  ;)
  One thing that might need further consideration is that a feature
proposal may need a slightly different sort of support; it makes more
sense to include more than the one-liner summary, and that should be
modifiable as discussions show adjustments may be needed.  That might
be doable by adding a URL to an external document rather than
including the summary in the issues database.
  I'd love to get rid of the Jitterbug thing!


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com




From paul at prescod.net  Wed May 31 22:52:38 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 15:52:38 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
Message-ID: <39357B96.E819537F@prescod.net>

Greg Stein wrote:
> 
> ...
>
> People can review features and votes in a quick pass. If they prefer to
> take more time, then they can also review comments.

I like this idea for its persistence but I'm not convinced that it
serves the same purpose as the give and take of a mailing list with many
subscribers.
 
> Admirable, but I think it would be ineffectual. People would be confused
> about where to post. Too many forums, with arbitrary/unclear lines about
> which to use.

To me, they are clear:

 * anything Python related can go to comp.lang.python, but many people
will not read it.

 * anything that belongs to a particular SIG goes to that sig.

 * any feature suggestions/debates that do not go in a particular SIG
(especially things related to the core language) go to python-lang

 * python-dev is for any message that has the words "CVS", "patch",
"memory leak", "reference count" etc. in it. It is for implementing the
design that Guido refines out of the rough and tumble of python-lang.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
At the same moment that the Justice Department and the Federal Trade 
Commission are trying to restrict the negative consequences of 
monopoly, the Commerce Department and the Congress are helping to 
define new intellectual property rights, rights that have a 
significant potential to create new monopolies. This is the policy 
equivalent of arm-wrestling with yourself.
	- http://www.salon.com/tech/feature/2000/04/07/greenspan/index.html



From gstein at lyra.org  Wed May 31 23:53:13 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 14:53:13 -0700 (PDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
In-Reply-To: <14645.17446.749848.895965@anthem.python.org>
Message-ID: <Pine.LNX.4.10.10005311452150.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Barry A. Warsaw wrote:
> >>>>> "M" == M  <mal at lemburg.com> writes:
> 
>     M> Would there be interest in adding the python-ldap module
>     M> (http://sourceforge.net/project/?group_id=2072) to the
>     M> core distribution ?
> 
> I haven't looked at this stuff, but yes, I think a standard LDAP
> module would be quite useful.  It's a well enough established
> protocol, and it would be good to be able to count on it "being
> there".

My WebDAV module implements an established protocol (an RFC tends to do
that :-), but the API within the module is still in flux (IMO).

Is the LDAP module's API pretty solid? Is it changing?

And is this module a C extension, or a pure Python implementation?

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gmcm at hypernet.com  Wed May 31 23:58:04 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Wed, 31 May 2000 17:58:04 -0400
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39357B96.E819537F@prescod.net>
Message-ID: <1252312176-5752122@hypernet.com>

Paul Prescod  wrote:

> Greg Stein wrote:
> > Admirable, but I think it would be ineffectual. People would be
> > confused about where to post. Too many forums, with
> > arbitrary/unclear lines about which to use.
> 
> To me, they are clear:

Of course they are ;-). While something doesn't seem right 
about the current set up, and c.l.py is still remarkably 
civilized, the fact is that the hotheads who say "I'll never use 
Python again if you do something as brain-dead as [ case-
insensitivity | require (host, addr) tuples | ''.join(list) | ... ]" will 
post their sentiments to every available outlet.
 
I agree the shift of some of these syntax issues from
python-dev to c.l.py was ugly, but the truth is that:
 - no new arguments came from c.l.py
 - the c.l.py discussion was much more emotional
 - you can't keep out the riff-raff without inviting reasonable 
accusations of elitism
 - the vast majority of, erm, "grass-roots" syntax proposals are 
absolutely horrid.

(As you surely know, Paul, from your types-SIG tenure; 
proposing syntax changes without the slightest intention of 
putting any effort into them is a favorite activity of posters.)



- Gordon



From mal at lemburg.com  Mon May  1 12:55:52 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 01 May 2000 12:55:52 +0200
Subject: [Python-Dev] issues with int/long on 64bit platforms - eg 
 stringobject (PR#306)
References: <000001bfb336$d4f512a0$0f2d153f@tim>
Message-ID: <390D62B8.15331407@lemburg.com>

I've just posted a simple patch to the patches list which
implements the idea I posted earlier:

Silent truncation still takes place, but in a somewhat more
natural way ;-) ...

                       /* Silently truncate to INT_MAX/INT_MIN to
                          make passing sys.maxint to 'i' parser
                          markers on 64-bit platforms work just
                          like on 32-bit platforms. Overflow errors
                          are not raised. */
                       else if (ival > INT_MAX)
                               ival = INT_MAX;
                       else if (ival < INT_MIN)
                               ival = INT_MIN;
                       *p = ival;
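
At the Python level the intended effect is simply that the existing
idiom keeps working on 64-bit builds (illustrative only, assuming a
platform where sys.maxint > INT_MAX):

    import sys

    s = "abcabc"
    # sys.maxint passed through the 'i' code now clips to INT_MAX, so
    # this behaves like the slice s[0:sys.maxint]:
    print(s.count("a", 0, sys.maxint))    # -> 2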

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Mon May  1 16:04:08 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 10:04:08 -0400 (EDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <Pine.GSO.4.10.10004292105210.28387-100000@sundial>
References: <Pine.GSO.4.10.10004292105210.28387-100000@sundial>
Message-ID: <14605.36568.455646.598506@seahag.cnri.reston.va.us>

Moshe Zadka writes:
 > 1. I'm not sure what to call this function. Currently, I call it
 > __print_expr__, but I'm not sure it's a good name

  It's not.  ;)  How about printresult?
  Another thing to think about is interface; formatting a result and
"printing" it may be different, and you may want to overload them
separately in an environment like IDLE.  Some people may want to just
say:

	import sys
	sys.formatresult = str

  I'm inclined to think that level of control may be better left to
the application; if one hook is provided as you've described, the
application can build different layers as appropriate.
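
  For instance (a hypothetical sketch only -- printresult and
formatresult are just the names floated here, not an existing
interface), the application side might look like:

    import sys

    def formatresult(obj):
        return str(obj)              # an IDE could override just the formatting

    def printresult(obj):
        sys.stdout.write(formatresult(obj) + "\n")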

 > 2. I haven't yet supplied a default in __builtin__, so the user *must*
 > override this. This is unacceptable, of course.

  You're right!  But a default is easy enough to add.  I'd put it in
sys instead of __builtin__ though.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From moshez at math.huji.ac.il  Mon May  1 16:19:46 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Mon, 1 May 2000 17:19:46 +0300 (IDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <14605.36568.455646.598506@seahag.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005011712410.25942-100000@sundial>

On Mon, 1 May 2000, Fred L. Drake, Jr. wrote:

>   It's not.  ;)  How about printresult?

Hmmmm...better than mine at least.

> 	import sys
> 	sys.formatresult = str

And where does the "don't print if it's None" logic enter? I doubt there
is a really good way to divide the functionality. Of course, specific
IDEs may provide their own hooks.

>   You're right!  But a default is easy enough to add.

I agree. It was more to spur discussion -- with the advantage that there
is already a way to include Python sessions.

> I'd put it in
> sys instead of __builtin__ though.

Hmmm.. that's a Guido Issue(TM). Guido?
--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From fdrake at acm.org  Mon May  1 17:19:10 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 11:19:10 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
Message-ID: <14605.41070.290137.787832@seahag.cnri.reston.va.us>

  The "winreg" module needs some documentation; is anyone here up to
the task?  I don't think I know enough about the registry to write
something reasonable.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From fdrake at acm.org  Mon May  1 17:23:06 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 11:23:06 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
Message-ID: <14605.41306.146320.597637@seahag.cnri.reston.va.us>

I wrote:
 >   The "winreg" module needs some documentation; is anyone here up to
 > the task?  I don't think I know enough about the registry to write
 > something reasonable.

  Of course, as soon as I sent this message I remembered that there's
also the linuxaudiodev module; that needs documentation as well!  (I
guess I'll need to add a Linux-specific chapter; ugh.)  If anyone
wants to document audiodev, perhaps I could avoid the Linux chapter
(with one module) by adding documentation for the portable interface.
  There's also the pyexpat module; Andrew/Paul, did one of you want to
contribute something for that?


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From guido at python.org  Mon May  1 17:26:44 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 11:26:44 -0400
Subject: [Python-Dev] documentation for new modules
In-Reply-To: Your message of "Mon, 01 May 2000 11:23:06 EDT."
             <14605.41306.146320.597637@seahag.cnri.reston.va.us> 
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>  
            <14605.41306.146320.597637@seahag.cnri.reston.va.us> 
Message-ID: <200005011526.LAA20332@eric.cnri.reston.va.us>

>  >   The "winreg" module needs some documentation; is anyone here up to
>  > the task?  I don't think I know enough about the registry to write
>  > something reasonable.

Maybe you could adapt the documentation for the registry functions in
Mark Hammond's win32all?  Not all the APIs are the same but they should
mostly do the same thing...

>   Of course, as soon as I sent this message I remembered that there's
> also the linuxaudiodev module; that needs documentation as well!  (I
> guess I'll need to add a Linux-specific chapter; ugh.)  If anyone
> wants to document audiodev, perhaps I could avoid the Linux chapter
> (with one module) by adding documentation for the portable interface.

There's also sunaudiodev.  Is it documented?  linuxaudiodev should be
mostly the same.

>   There's also the pyexpat module; Andrew/Paul, did one of you want to
> contribute something for that?

I would hope so!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Mon May  1 18:17:06 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 1 May 2000 12:17:06 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <200005011526.LAA20332@eric.cnri.reston.va.us>
References: <14605.41070.290137.787832@seahag.cnri.reston.va.us>
	<14605.41306.146320.597637@seahag.cnri.reston.va.us>
	<200005011526.LAA20332@eric.cnri.reston.va.us>
Message-ID: <14605.44546.568978.296426@seahag.cnri.reston.va.us>

Guido van Rossum writes:
 > Maybe you could adapt the documentation for the registry functions in
 > Mark Hammond's win32all?  Not all the APIs are the same but they should
 > mostly do the same thing...

  I'll take a look at it when I have time, unless anyone beats me to
it.

 > There's also sunaudiodev.  Is it documented?  linuxaudiodev should be
 > mostly the same.

  It's been documented for a long time.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From guido at python.org  Mon May  1 20:02:32 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 14:02:32 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Sat, 29 Apr 2000 09:18:05 CDT."
             <390AEF1D.253B93EF@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us>  
            <390AEF1D.253B93EF@prescod.net> 
Message-ID: <200005011802.OAA21612@eric.cnri.reston.va.us>

[Guido]
> > And this is exactly why encodings will remain important: entities
> > encoded in ISO-2022-JP have no compelling reason to be recoded
> > permanently into ISO10646, and there are lots of forces that make it
> > convenient to keep it encoded in ISO-2022-JP (like existing tools).

[Paul]
> You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is
> a character *set* and not an encoding. ISO-2022-JP says how you should
> represent characters in terms of bits and bytes. ISO10646 defines a
> mapping from integers to characters.

OK.  I really meant recoding in UTF-8 -- I maintain that there are
lots of forces that prevent recoding most ISO-2022-JP documents in
UTF-8.

> They are both important, but separate. I think that this automagical
> re-encoding conflates them.

Who is proposing any automagical re-encoding?

Are you sure you understand what we are arguing about?

*I* am not even sure what we are arguing about.

I am simply saying that 8-bit strings (literals or otherwise) in
Python have always been able to contain encoded strings.

Earlier, you quoted some reference documentation that defines 8-bit
strings as containing characters.  That's taken out of context -- this
was written in a time when there was (for most people anyway) no
difference between characters and bytes, and I really meant bytes.
There's plenty of use of 8-bit Python strings for non-character uses
so your "proof" that 8-bit strings should contain "characters"
according to your definition is invalid.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tree at cymru.basistech.com  Mon May  1 20:05:33 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:05:33 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005011802.OAA21612@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
Message-ID: <14605.51053.369016.283239@cymru.basistech.com>

Guido van Rossum writes:
 > OK.  I really meant recoding in UTF-8 -- I maintain that there are
 > lots of forces that prevent recoding most ISO-2022-JP documents in
 > UTF-8.

Such as?

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From effbot at telia.com  Mon May  1 20:39:52 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 1 May 2000 20:39:52 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]><l03102800b52d80db1290@[193.78.237.154]><200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com>
Message-ID: <009f01bfb39c$a603cc00$34aab5d4@hagrid>

Tom Emerson wrote:
> Guido van Rossum writes:
>  > OK.  I really meant recoding in UTF-8 -- I maintain that there are
>  > lots of forces that prevent recoding most ISO-2022-JP documents in
>  > UTF-8.
> 
> Such as?

ISO-2022-JP includes language/locale information, UTF-8 doesn't.  if
you just recode the character codes, you'll lose important information.

</F>




From tree at cymru.basistech.com  Mon May  1 20:42:40 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:42:40 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <009f01bfb39c$a603cc00$34aab5d4@hagrid>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<14605.51053.369016.283239@cymru.basistech.com>
	<009f01bfb39c$a603cc00$34aab5d4@hagrid>
Message-ID: <14605.53280.55595.335112@cymru.basistech.com>

Fredrik Lundh writes:
 > ISO-2022-JP includes language/locale information, UTF-8 doesn't.  if
 > you just recode the character codes, you'll lose important information.

So encode them using the Plane 14 language tags.

I won't start with whether language/locale should be encoded in a
character encoding... 8-)

          -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From guido at python.org  Mon May  1 20:52:04 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 14:52:04 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 14:05:33 EDT."
             <14605.51053.369016.283239@cymru.basistech.com> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>  
            <14605.51053.369016.283239@cymru.basistech.com> 
Message-ID: <200005011852.OAA21973@eric.cnri.reston.va.us>

> Guido van Rossum writes:
>  > OK.  I really meant recoding in UTF-8 -- I maintain that there are
>  > lots of forces that prevent recoding most ISO-2022-JP documents in
>  > UTF-8.

[Tom Emerson]
> Such as?

The standard forces that work against all change -- existing tools,
user habits, compatibility, etc.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tree at cymru.basistech.com  Mon May  1 20:46:04 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Mon, 1 May 2000 14:46:04 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005011852.OAA21973@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<14605.51053.369016.283239@cymru.basistech.com>
	<200005011852.OAA21973@eric.cnri.reston.va.us>
Message-ID: <14605.53484.225980.235301@cymru.basistech.com>

Guido van Rossum writes:
 > The standard forces that work against all change -- existing tools,
 > user habits, compatibility, etc.

Ah... I misread your original statement, which I took to be a
technical reason why one couldn't convert ISO-2022-JP to UTF-8. Of
course one cannot expect everyone to switch en masse to a new
encoding, pulling their existing documents with them. I'm in full
agreement there.

          -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From paul at prescod.net  Mon May  1 22:38:29 2000
From: paul at prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 15:38:29 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us>  
	            <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>
Message-ID: <390DEB45.D8D12337@prescod.net>

Uche asked for a summary so I cc:ed the xml-sig.

Guido van Rossum wrote:
> 
> ...
>
> OK.  I really meant recoding in UTF-8 -- I maintain that there are
> lots of forces that prevent recoding most ISO-2022-JP documents in
> UTF-8.

Absolutely agree.
 
> Are you sure you understand what we are arguing about?

Here's what I thought we were arguing about:

If you put a bunch of "funny characters" into a Python string literal,
and then compare that string literal against a Unicode object, should
those funny characters be treated as logical units of text (characters)
or as bytes? And if bytes, should some transformation be automatically
performed to have those bytes be reinterpreted as characters according
to some particular encoding scheme (probably UTF-8).

I claim that we should *as far as possible* treat strings as character
lists and not add any new functionality that depends on them being byte
list. Ideally, we could add a byte array type and start deprecating the
use of strings in that manner. Yes, it will take a long time to fix this
bug but that's what happens when good software lives a long time and the
world changes around it.

> Earlier, you quoted some reference documentation that defines 8-bit
> strings as containing characters.  That's taken out of context -- this
> was written in a time when there was (for most people anyway) no
> difference between characters and bytes, and I really meant bytes.

Actually, I think that that was Fredrik. 

Anyhow, you wrote the documentation that way because it was the most
intuitive way of thinking about strings. It remains the most intuitive
way. I think that that was the point Fredrik was trying to make.

We can't make "byte-list" strings go away soon but we can start moving
people towards the "character-list" model. In concrete terms I would
suggest that old fashioned lists be automatically coerced to Unicode by
interpreting each byte as a Unicode character. Trying to go the other
way could cause the moral equivalent of an OverflowError but that's not
a problem. 

>>> a=1000000000000000000000000000000000000L
>>> int(a)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OverflowError: long int too long to convert

And just as with ints and longs, we would expect to eventually unify
strings and unicode strings (but not byte arrays).

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From guido at python.org  Mon May  1 23:32:38 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 17:32:38 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 15:38:29 CDT."
             <390DEB45.D8D12337@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>  
            <390DEB45.D8D12337@prescod.net> 
Message-ID: <200005012132.RAA23319@eric.cnri.reston.va.us>

> > Are you sure you understand what we are arguing about?
> 
> Here's what I thought we were arguing about:
> 
> If you put a bunch of "funny characters" into a Python string literal,
> and then compare that string literal against a Unicode object, should
> those funny characters be treated as logical units of text (characters)
> or as bytes? And if bytes, should some transformation be automatically
> performed to have those bytes be reinterpreted as characters according
> to some particular encoding scheme (probably UTF-8).
> 
> I claim that we should *as far as possible* treat strings as character
> lists and not add any new functionality that depends on them being byte
> list. Ideally, we could add a byte array type and start deprecating the
> use of strings in that manner. Yes, it will take a long time to fix this
> bug but that's what happens when good software lives a long time and the
> world changes around it.
> 
> > Earlier, you quoted some reference documentation that defines 8-bit
> > strings as containing characters.  That's taken out of context -- this
> > was written in a time when there was (for most people anyway) no
> > difference between characters and bytes, and I really meant bytes.
> 
> Actually, I think that that was Fredrik. 

Yes, I came across the post again later.  Sorry.

> Anyhow, you wrote the documentation that way because it was the most
> intuitive way of thinking about strings. It remains the most intuitive
> way. I think that that was the point Fredrik was trying to make.

I just wish he made the point more eloquently.  The eff-bot seems to
be in a crunchy mood lately...

> We can't make "byte-list" strings go away soon but we can start moving
> people towards the "character-list" model. In concrete terms I would
> suggest that old fashioned lists be automatically coerced to Unicode by
> interpreting each byte as a Unicode character. Trying to go the other
> way could cause the moral equivalent of an OverflowError but that's not
> a problem. 
> 
> >>> a=1000000000000000000000000000000000000L
> >>> int(a)
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too long to convert
> 
> And just as with ints and longs, we would expect to eventually unify
> strings and unicode strings (but not byte arrays).

OK, you've made your claim -- like Fredrik, you want to interpret
8-bit strings as Latin-1 when converting (not just comparing!) them to
Unicode.

I don't think I've heard a good *argument* for this rule though.  "A
character is a character is a character" sounds like an axiom to me --
something you can't prove or disprove rationally.

I have a bunch of good reasons (I think) for liking UTF-8: it allows
you to convert between Unicode and 8-bit strings without losses, Tcl
uses it (so displaying Unicode in Tkinter *just* *works*...), it is
not Western-language-centric.

Another reason: while you may claim that your (and /F's, and Just's)
preferred solution doesn't enter into the encodings issue, I claim it
does: Latin-1 is just as much an encoding as any other one.

I claim that as long as we're using an encoding we might as well use
the most accepted 8-bit encoding of Unicode as the default encoding.

I also think that the issue is blown out of proportions: this ONLY
happens when you use Unicode objects, and it ONLY matters when some
other part of the program uses 8-bit string objects containing
non-ASCII characters.  Given the long tradition of using different
encodings in 8-bit strings, at that point it is anybody's guess what
encoding is used, and UTF-8 is a better guess than Latin-1.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 00:17:17 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 18:17:17 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Sat, 29 Apr 2000 21:09:40 +0300."
             <Pine.GSO.4.10.10004292105210.28387-100000@sundial> 
References: <Pine.GSO.4.10.10004292105210.28387-100000@sundial> 
Message-ID: <200005012217.SAA23503@eric.cnri.reston.va.us>

> Continuing the recent debate about what is appropriate to the interactive
> prompt printing, and the wide agreement that whatever we decide, users
> might think otherwise, I've written up a patch to have the user control 
> via a function in __builtin__ the way things are printed at the prompt.
> This is not patches at python level stuff for two reasons:
> 
> 1. I'm not sure what to call this function. Currently, I call it
> __print_expr__, but I'm not sure it's a good name
> 
> 2. I haven't yet supplied a default in __builtin__, so the user *must*
> override this. This is unacceptable, of course.
> 
> I'd just like people to tell me if they think this is worth while, and if
> there is anything I missed.

Thanks for bringing this up again.  I think it should be called
sys.displayhook.  The default could be something like

import __builtin__
def displayhook(obj):
    if obj is None:
        return
    __builtin__._ = obj
    sys.stdout.write("%s\n" % repr(obj))

to be nearly 100% compatible with current practice; or use str(obj) to
do what most people would probably prefer.

(Note that you couldn't do "%s\n" % obj because obj might be a tuple.)
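
A variant of the same sketch using str(), installed under the proposed
name (assuming the interpreter consults sys.displayhook as described):

    import __builtin__
    import sys

    def displayhook(obj):
        if obj is None:
            return
        __builtin__._ = obj
        sys.stdout.write("%s\n" % str(obj))    # str() instead of repr()

    sys.displayhook = displayhook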

--Guido van Rossum (home page: http://www.python.org/~guido/)



From effbot at telia.com  Tue May  2 00:29:41 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 00:29:41 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>             <390DEB45.D8D12337@prescod.net>  <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <017d01bfb3bc$c3734c00$34aab5d4@hagrid>

Guido van Rossum <guido at python.org> wrote:
> I just wish he made the point more eloquently.  The eff-bot seems to
> be in a crunchy mood lately...

I've posted a few thousand messages on this topic, most of which
seem to have been ignored.  if you'd read all my messages, and seen
all the replies, you'd be cranky too...

> I don't think I've heard a good *argument* for this rule though.  "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

maybe, but it's a darn good axiom, and it's used by everyone else.
Perl uses it, Tcl uses it, XML uses it, etc.  see:

http://www.python.org/pipermail/python-dev/2000-April/005218.html

> I have a bunch of good reasons (I think) for liking UTF-8: it allows
> you to convert between Unicode and 8-bit strings without losses, Tcl
> uses it (so displaying Unicode in Tkinter *just* *works*...), it is
> not Western-language-centric.

the "Tcl uses it" is a red herring -- their internal implementation
uses 16-bit integers, and the external interface works very hard
to keep the "strings are character sequences" illusion.

in other words, the length of a string is *always* the number of
characters, the character at index i is *always* the i'th character
in the string, etc.

that's not true in Python 1.6a2.

(as for Tkinter, you only have to add 2-3 lines of code to make it
use 16-bit strings instead...)

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

this is another red herring: my argument is that 8-bit strings should
contain unicode characters, using unicode character codes.  there
should be only one character repertoire, and that repertoire is uni-
code.  for a definition of these terms, see:

http://www.python.org/pipermail/python-dev/2000-April/005225.html

obviously, you can only store 256 different values in a single 8-bit
character (just like you can only store 4294967296 different values
in a single 32-bit int).

to store larger values, use unicode strings (or long integers).

conversion from a small type to a large type always works, conversion
from a large type to a small one may result in an OverflowError.

it has nothing to do with encodings.
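
in code, the analogy looks like this (the values are arbitrary):

    i = 255
    l = long(i)        # small type -> large type: always works
    int(10L ** 20)     # large type -> small type: OverflowError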

> I claim that as long as we're using an encoding we might as well use
> the most accepted 8-bit encoding of Unicode as the default encoding.

yeah, and I claim that it won't fly, as long as it breaks the "strings
are character sequences" rule used by all other contemporary (and
competing) systems.

(if you like, I can post more "fun with unicode" messages ;-)

and as I've mentioned before, there are (at least) two ways to solve
this:

1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and
   Perl).  make sure len(s) returns the number of characters in the
   string, make sure s[i] returns the i'th character (not necessarily
   starting at the i'th byte, and not necessarily one byte), etc.  to
   make this run reasonable fast, use as many implementation tricks
   as you can come up with (I've described three ways to implement
   this in an earlier post).

2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i])
   is a unicode character code, whether s is an 8-bit string or a unicode
   string.

for alternative 1 to work, you need to add some way to explicitly work
with binary strings (like it's done in Perl and Tcl).

alternative 2 doesn't need that; 8-bit strings can still be used to hold
any kind of binary data, as in 1.5.2.  just keep in mind you cannot use
all methods on such an object...
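
to make alternative 2 concrete, a tiny sketch of the intended semantics
(illustration only; the "\244" value is arbitrary):

    # under alternative 2, an 8-bit string holds code points 0-255, so
    # comparing it to a unicode string needs no encoding step at all:
    s = "\244"
    u = u"\244"
    assert len(s) == len(u) == 1
    assert ord(s[0]) == ord(u[0]) == 0xA4
    # s == u would simply be true; binary data can still live in 8-bit
    # strings, you just can't treat it as text.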

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.  Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

I still think it's very unfortunate that you think that unicode strings
are a special kind of strings.  Perl and Tcl don't, so why should we?

</F>




From gward at mems-exchange.org  Tue May  2 00:40:18 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Mon, 1 May 2000 18:40:18 -0400
Subject: [Python-Dev] Comparison inconsistency with ExtensionClass
Message-ID: <20000501184017.A1171@mems-exchange.org>

Hi all --

I seem to have discovered an inconsistency in the semantics of object
comparison between plain old Python instances and ExtensionClass
instances.  (I've cc'd python-dev because it looks as though one *could*
blame Python for the inconsistency, but I don't really understand the
guts of either Python or ExtensionClass enough to know.)

Here's a simple script that shows the difference:

    class Simple:
        def __init__ (self, data):
            self.data = data

        def __repr__ (self):
            return "<%s at %x: %s>" % (self.__class__.__name__,
                                       id(self),
                                       `self.data`)

        def __cmp__ (self, other):
            print "Simple.__cmp__: self=%s, other=%s" % (`self`, `other`)
            return cmp (self.data, other)


    if __name__ == "__main__":
        v1 = 36
        v2 = Simple (36)
        print "v1 == v2?", (v1 == v2 and "yes" or "no")
        print "v2 == v1?", (v2 == v1 and "yes" or "no")
        print "v1 == v2.data?", (v1 == v2.data and "yes" or "no")
        print "v2.data == v1?", (v2.data == v1 and "yes" or "no")

If I run this under Python 1.5.2, then all the comparisons come out true
and my '__cmp__()' method is called twice:

    v1 == v2? Simple.__cmp__: self=<Simple at 1b5148: 36>, other=36
    yes
    v2 == v1? Simple.__cmp__: self=<Simple at 1b5148: 36>, other=36
    yes
    v1 == v2.data? yes
    v2.data == v1? yes


The first one and the last two are obvious, but the second one only
works thanks to a trick in PyObject_Compare():

    if (PyInstance_Check(v) || PyInstance_Check(w)) {
        ...
        if (!PyInstance_Check(v))
	    return -PyObject_Compare(w, v);
        ...
    }

However, if I make Simple an ExtensionClass:

    from ExtensionClass import Base

    class Simple (Base):

Then the "swap v and w and use w's comparison method" no longer works.
Here's the output of the script with Simple as an ExtensionClass:

    v1 == v2? no
    v2 == v1? Simple.__cmp__: self=<Simple at 1b51c0: 36>, other=36
    yes
    v1 == v2.data? yes
    v2.data == v1? yes

It looks as though ExtensionClass would have to duplicate the trick in
PyObject_Compare() that I quoted, since Python has no idea that
ExtensionClass instances really should act like instances.  This smells
to me like a bug in ExtensionClass.  Comments?

BTW, I'm using the ExtensionClass provided with Zope 2.1.4.  Mostly
tested with Python 1.5.2, but also under the latest CVS Python and we
observed the same behaviour.

        Greg



From mhammond at skippinet.com.au  Tue May  2 01:45:02 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 09:45:02 +1000
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <14605.44546.568978.296426@seahag.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>

> Guido van Rossum writes:
>  > Maybe you could adapt the documentation for the registry functions in
>  > Mark Hammond's win32all?  Not all the APIs are the same but they should
>  > mostly do the same thing...
>
>   I'll take a look at it when I have time, unless anyone beats me to it.

I wonder if that anyone could be me? :-)

Note that all the win32api docs for the registry made it into
docstrings - so winreg has OK documentation as it is...

But I will try and put something together.  It will need to be plain
text or HTML, but I assume that is better than nothing!

Give me a few days...

Mark.




From paul at prescod.net  Tue May  2 02:19:20 2000
From: paul at prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 19:19:20 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>  
	            <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <390E1F08.EA91599E@prescod.net>

Sorry for the long message. Of course you need only respond to that
which is interesting to you. I don't think that most of it is redundant.

Guido van Rossum wrote:
> 
> ...
> 
> OK, you've made your claim -- like Fredrik, you want to interpret
> 8-bit strings as Latin-1 when converting (not just comparing!) them to
> Unicode.

If the user provides an explicit conversion function (e.g. UTF-8-decode)
then of course we should use that function. Under my character is a
character is a character model, this "conversion" is morally equivalent
to ROT-13, strupr or some other text->text translation. So you could
apply UTF-8-decode even to a Unicode string as long as each character in
the string has ord()<256 (so that it could be interpreted as a character
representation for a byte).

> I don't think I've heard a good *argument* for this rule though.  "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

I don't see it as an axiom, but rather as a design decision you make to
keep your language simple. Along the lines of "all values are objects"
and (now) all integer values are representable with a single type. Are
you happy with this?

a="\244"
b=u"\244"
assert len(a)==len(b)
assert ord(a[0])==ord(b[0])

# same thing, right?
print b==a
# Traceback (most recent call last):
#  File "<stdin>", line 1, in ?
# UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a
UTF-8 escape sequence. "\244" is a string with one character. It has no
encoding. It is not latin-1. It is not UTF-8. It is a string with one
character and should compare as equal with another string with the same
character.

I would laugh my ass off if I was using Perl and it did something weird
like this to me (as long as it didn't take a month to track down the
bug!). Now it isn't so funny.

> I have a bunch of good reasons (I think) for liking UTF-8: 

I'm not against UTF-8. It could be an internal representation for some
Unicode objects.

> it allows
> you to convert between Unicode and 8-bit strings without losses, 

Here's the heart of our disagreement:

******
I don't want, in Py3K, to think about "converting between Unicode and
8-bit strings." I want strings and I want byte-arrays and I want to
worry about converting between *them*. There should be only one string
type, its characters should all live in the Unicode character repertoire
and the character numbers should all come from Unicode. "Special"
characters can be assigned to the Unicode Private Use Area. Byte arrays
would be entirely separate and would be converted to Unicode strings
with explicit conversion functions.
*****

In the meantime I'm just trying to get other people thinking in this
mode so that the transition is easier. If I see people embedding UTF-8
escape sequences in literal strings today, I'm going to hit them.

I recognize that we can't design the universe right now but we could
agree on this direction and use it to guide our decision-making.

By the way, if we DID think of 8-bit strings as essentially "byte
arrays" then let's use that terminology and imagine some future
documentation:

"Python's string type is equivalent to a list of bytes. For clarity, we
will call this type a byte list from now on. In contexts where a Unicode
character-string is desired, Python automatically converts byte lists to
character strings by doing a UTF-8 decode on them."

What would you think if Java had a default (I say "magical") conversion
from byte arrays to character strings.

The only reason we are discussing this is because Python strings have a
dual personality which was useful in the past but will (IMHO, of course)
become increasingly confusing in the future. We want the best of both
worlds without confusing anybody and I don't think that we can have it.

If you want 8-bit strings to be really byte arrays in perpetuity then
let's be consistent in that view. We can compare them to Unicode as we
would two completely separate types. "U" comes after "S" so unicode
strings always compare greater than 8-bit strings. The use of the word
"string" for both objects can be considered just a historical accident.

> Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), 

I don't follow this entirely. Shouldn't the next version of Tkinter accept
and return Unicode strings? It would be rather ugly for two
Unicode-aware systems (Python and TK) to talk to each other in 8-bit
strings. I mean I don't care what you do at the C level but at the
Python level arguments should be "just strings."

Consider that len() on the TKinter side would return a different value
than on the Python side. 

What about integral indexes into buffers? I'm totally ignorant about
Tkinter, but let me ask: wouldn't Tkinter say (e.g.) that the cursor is
between the 5th and 6th character when in an 8-bit string the equivalent
index might be the 11th or 12th byte?

> it is not Western-language-centric.

If you look at encoding efficiency it is.

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

The fact that my proposal has the same effect as making Latin-1 the
"default encoding" is a near-term side effect of the definition of
Unicode. My long term proposal is to do away with the concept of 8-bit
strings (and thus, conversions from 8-bit to Unicode) altogether. One
string to rule them all!

Is Unicode going to be the canonical Py3K character set or will we have
different objects for different character sets/encodings with different
default (I say "magical") conversions between them. Such a design would
not be entirely insane though it would be a PITA to implement and
maintain. If we aren't ready to establish Unicode as the one true
character set then we should probably make no special concessions for
Unicode at all. Let a thousand string objects bloom!

Even if we agreed to allow many string objects, byte==character should
not be the default string object. Unicode should be the default.

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.  

Won't this be totally common? Most people are going to use 8-bit
literals in their program text but work with Unicode data from XML
parsers, COM, WebDAV, Tkinter, etc?

> Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

If we are guessing then we are doing something wrong. My answer to the
question of "default encoding" falls out naturally from a certain way of
looking at text, popularized in various other languages and increasingly
"the norm" on the Web. If you accept the model (a character is a
character is a character), the right behavior is obvious. 

"\244"==u"\244"

Nobody is ever going to have trouble understanding how this works.
Choose simplicity!

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From mhammond at skippinet.com.au  Tue May  2 02:34:16 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 10:34:16 +1000
Subject: [Python-Dev] Neil Hodgson on python-dev?
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au>

I'd like to propose that we invite Neil Hodgson to join the
python-dev family.

Neil is the author of the Scintilla editor control, now used by
wxPython and Pythonwin...  Smart guy, and very experienced with
Python (scintilla was originally written because he had trouble
converting Pythonwin to be a color'd editor :-)

But most relevant at the moment is his Unicode experience.  He
worked for a long time with Fujitsu, working with Japanese and all
the encoding issues there.  I have heard him echo the exact
sentiments of Andy.  He is also in the process of polishing the
recent Unicode support in Scintilla.

As this Unicode debate seems to be going nowhere fast, and appears
to simply need more people with _experience_, I think he would be
valuable.  Further, he is a pretty quiet guy - you won't find him
offering his opinion on every post that moves through here :-)

Thoughts?

Mark.




From guido at python.org  Tue May  2 02:41:43 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 20:41:43 -0400
Subject: [Python-Dev] Neil Hodgson on python-dev?
In-Reply-To: Your message of "Tue, 02 May 2000 10:34:16 +1000."
             <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBGEPGCJAA.mhammond@skippinet.com.au> 
Message-ID: <200005020041.UAA23648@eric.cnri.reston.va.us>

> I'd like to propose that we invite Neil Hodgson to join the
> python-dev family.

Excellent!

> As this Unicode debate seems to be going nowhere fast, and appears
> to simply need more people with _experience_, I think he would be
> valuable.  Further, he is a pretty quiet guy - you wont find him
> offering his opinion on every post that moves through here :-)

As long as he isn't too quiet on the Unicode thing ;-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 02:53:26 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 20:53:26 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 19:19:20 CDT."
             <390E1F08.EA91599E@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>  
            <390E1F08.EA91599E@prescod.net> 
Message-ID: <200005020053.UAA23665@eric.cnri.reston.va.us>

Paul, we're both just saying the same thing over and over without
convincing each other.  I'll wait till someone who wasn't in this
debate before chimes in.

Have you tried using this?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From effbot at telia.com  Tue May  2 03:26:06 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 03:26:06 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>              <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net>
Message-ID: <002301bfb3d5$8fd57440$34aab5d4@hagrid>

Paul Prescod <paul at prescod.net> wrote:
> I would laugh my ass off if I was using Perl and it did something weird
> like this to me.

you don't have to -- in Perl 5.6, a character is a character...

does anyone on this list follow the perl-porters list?  was this as
controversial over in Perl land as it appears to be over here?

</F>




From tpassin at home.com  Tue May  2 03:55:25 2000
From: tpassin at home.com (tpassin at home.com)
Date: Mon, 1 May 2000 21:55:25 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
Message-ID: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>

Guido van  Rossum wrote, about how to represent strings:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

I'm with Paul and Fredrik on this one - at least about characters being the
atoms of a string.  We **have** to be able to refer to **characters** in a
string, and without guessing.  Otherwise, how could you ever construct a
test, like theString[3]==[a particular japanese ideograph]?  If we do it by
having a "string" datatype, which is really a byte list, and a
"unicodeString" datatype which is a list of abstract characters, I'd say
everyone could get used to working with them.  We'd have to supply
conversion functions, of course.
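
For example, with such a "unicodeString" type the test is direct (the
ideograph below is arbitrary, chosen only for illustration):

    theString = u"abc\u4e00def"
    if theString[3] == u"\u4e00":
        print "found the ideograph at index 3"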

This route might be the easiest to understand for users.  We'd have to be
very clear about what file.read() would return, for example, and all those
similar read and write functions.  And we'd have to work out how real 8-bit
calls (like writing to a socket?) would play with the new types.

For extra clarity, we could leave string the way it is, introduce stringU
(unicode string) **and** string8 (Latin-1 or byte list, whichever seems to
be the best equivalent to the current string).  Then we would deprecate
string in favor of string8.  Then if tcl and perl go to unicode strings we
pass them a stringU, and if they go some other way, we pass them something
else.  COme to think of it, we need some some data type that will continue
to work with c and c++.  Would that be string8 or would we keep string for
that purpose?

Clarity and ease of use for the user should be primary, fast implementations
next.  If we didn't care about ease of use and clarity, we could all use
Scheme or C -- don't lose sight of it.

I'd suggest we could create some use cases or scenarios for this area -
needs input from those who know encodings and low level Python stuff better
than I.  Then we could examine more systematically how well various
approaches would work out.

Regards,
Tom Passin





From mhammond at skippinet.com.au  Tue May  2 04:17:09 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 2 May 2000 12:17:09 +1000
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEPJCJAA.mhammond@skippinet.com.au>

> Guido van Rossum wrote, about how to represent strings:
>
> > Paul, we're both just saying the same thing over and over without
> > convincing each other.  I'll wait till someone who wasn't in this
> > debate before chimes in.

I've chimed in a little, but I'll chime in again :-)

> I'm with Paul and Fredrik on this one - at least about characters being
> the atoms of a string.  We **have** to be able to refer to **characters**
> in a string, and without guessing.  Otherwise, how could you

I see the point, and agree 100% with the intent.  However, reality
does bite.

As far as I can see, the following are immutable:
* There will be 2 types - a string type and a Unicode type.
* History dictates that the string type may hold binary data.

Thus, it is clear that Python simply can not treat characters as the
smallest atoms of strings.  If I understand things correctly, this
is key to Guido's point, and a bit of a communication block.

The issue, to my mind, is how we handle these facts to produce "the
principle of least surprise".  We simply need to accept that Python
1.x will never be able to treat string objects as sequences of
"characters" - only bytes.

However, with my limited understanding of the full issues, it does
appear that the proposal championed by Fredrik, Just and Paul is the
best solution - not because it magically causes Python to treat
strings as characters in all cases, but because it offers the
prinipcal of least surprise.

As I said, I dont really have a deep enough understanding of the
issues, so this is probably (hopefully!?) my last word on the
matter - but that doesnt mean I dont share the concerns raised
here...

Mark.




From guido at python.org  Tue May  2 05:31:54 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 01 May 2000 23:31:54 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 21:55:25 EDT."
             <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> 
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> 
Message-ID: <200005020331.XAA23818@eric.cnri.reston.va.us>

Tom Passin:
> I'm with Paul and Fredrik on this one - at least about characters being the
> atoms of a string.  We **have** to be able to refer to **characters** in a
> string, and without guessing.  Otherwise, how could you ever construct a
> test, like theString[3]==[a particular japanese ideograph]?  If we do it by
> having a "string" datatype, which is really a byte list, and a
> "unicodeString" datatype which is a list of abstract characters, I'd say
> everyone could get used to working with them.  We'd have to supply
> conversion functions, of course.

You seem unfamiliar with the details of the implementation we're
proposing?  We already have two datatypes, 8-bit string (call it byte
array) and Unicode string.  There are conversions between them:
explicit conversions such as u.encode("utf-8") or unicode(s,
"latin-1") and implicit conversions used in situations like u+s or
u==s.  The whole discussion is *only* about what the default
conversion in the latter cases should be -- the rest of the
implementation is rock solid and works well.

Users can accomplish what you are proposing by simply ensuring that
theString is a Unicode string.
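
For example, a minimal sketch of those explicit conversions (the sample
text is arbitrary):

    s = "caf\351"                     # 8-bit string holding Latin-1 bytes
    u = unicode(s, "latin-1")         # explicit 8-bit -> Unicode
    b = u.encode("utf-8")             # explicit Unicode -> 8-bit (UTF-8)
    assert unicode(b, "utf-8") == u   # round-trips without loss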

> This route might be the easiest to understand for users.  We'd have to be
> very clear about what file.read() would return, for example, and all those
> similar read and write functions.  And we'd have to work out how real 8-bit
> calls (like writing to a socket?) would play with the new types.

These are all well defined -- they all deal in 8-bit strings
internally, and all use the default conversions when given Unicode
strings.  Programs that only deal in 8-bit strings don't need to
change.  Programs that want to deal with Unicode and sockets, for
example, must know what encoding to use on the socket, and if it's not
the default encoding, must use explicit conversions.

> For extra clarity, we could leave string the way it is, introduce stringU
> (unicode string) **and** string8 (Latin-1 or byte list, whichever seems to
> be the best equivalent to the current string).  Then we would deprecate
> string in favor of string8.  Then if tcl and perl go to unicode strings we
> pass them a stringU, and if they go some other way, we pass them something
> else.  COme to think of it, we need some some data type that will continue
> to work with c and c++.  Would that be string8 or would we keep string for
> that purpose?

What would be the difference between string and string8?

> Clarity and ease of use for the user should be primary, fast implementations
> next.  If we didn't care about ease of use and clarity, we could all use
> Scheme or c, don't use sight of it.
> 
> I'd suggest we could create some use cases or scenarios for this area -
> needs input from those who know encodings and low level Python stuff better
> than I.  Then we could examine more systematically how well various
> approaches would work out.

Very good.

Here's one usage scenario.

A Japanese user is reading lines from a file encoded in ISO-2022-JP.
The readline() method returns 8-bit strings in that encoding (the file
object doesn't do any decoding).  She realizes that she wants to do
some character-level processing on the file so she decides to convert
the strings to Unicode.

I believe that whether the default encoding is UTF-8 or Latin-1
doesn't matter here -- both are wrong; she needs to write explicit
unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
"better", because interpreting ISO-2022-JP data as UTF-8 will most
likely give an exception (when a \300 range byte isn't followed by a
\200 range byte) -- while interpreting it as Latin-1 will silently do
the wrong thing.  (An explicit error is always better than silent
failure.)
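
In code, the scenario looks roughly like this (sketch only -- it assumes
an "iso-2022-jp" codec has been registered, and "mail.txt" is just a
placeholder name):

    f = open("mail.txt")
    for line in f.readlines():            # 8-bit strings, no decoding done
        u = unicode(line, "iso-2022-jp")  # the explicit conversion she must write
        # ... character-level processing on u ...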

I'd love to discuss other scenarios.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From moshez at math.huji.ac.il  Tue May  2 06:39:12 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Tue, 2 May 2000 07:39:12 +0300 (IDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <200005012217.SAA23503@eric.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005020732200.8759-100000@sundial>

> Thanks for bringing this up again.  I think it should be called
> sys.displayhook.

That should be the easy part -- I'll do it as soon as I'm home.

> The default could be something like
> 
> import __builtin__
import sys # Sorry, I couldn't resist
> def displayhook(obj):
>     if obj is None:
>         return
>     __builtin__._ = obj
>     sys.stdout.write("%s\n" % repr(obj))

This brings up a painful point -- the reason I haven't written the default
is because it was much easier to write it in Python. Of course, I
shouldn't be preaching Python-is-easier-to-write-than-C here, but it
pains me that Python cannot be written with more Python and less C.

A while ago we started talking about the mini-interpreter idea, which
would then freeze Python code into itself, and then it sort of died out.
What has become of it?

--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From just at letterror.com  Tue May  2 07:47:35 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 06:47:35 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005020331.XAA23818@eric.cnri.reston.va.us>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID: <l03102802b534149a9639@[193.78.237.164]>

At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
>Here's one usage scenario.
>
>A Japanese user is reading lines from a file encoded in ISO-2022-JP.
>The readline() method returns 8-bit strings in that encoding (the file
>object doesn't do any decoding).  She realizes that she wants to do
>some character-level processing on the file so she decides to convert
>the strings to Unicode.
>
>I believe that whether the default encoding is UTF-8 or Latin-1
>doesn't matter here -- both are wrong; she needs to write explicit
>unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
>"better", because interpreting ISO-2022-JP data as UTF-8 will most
>likely give an exception (when a \300 range byte isn't followed by a
>\200 range byte) -- while interpreting it as Latin-1 will silently do
>the wrong thing.  (An explicit error is always better than silent
>failure.)

But then it's even better to *always* raise an exception, since it's
entirely possible a string contains valid utf-8 while not *being* utf-8. I
really think the exception argument is moot, since there can *always* be
situations that will pass silently. Encoding issues are silent by nature --
eg. there's no way any system can tell that interpreting MacRoman data as
Latin-1 is wrong, maybe even fatal -- the user will just have to deal with
it. You can argue what you want, but *any* multi-byte encoding stored in an
8-bit string is a buffer, not a string, for all the reasons Fredrik and
Paul have thrown at you, and right they are. Choosing such an encoding as a
default conversion to Unicode makes no sense at all. Recap of the main
arguments:

pro UTF-8:
always reversible when going from Unicode to 8-bit

con UTF-8:
not a string: confusing semantics

pro Latin-1:
simpler semantics

con Latin-1:
non-reversible, western-centric

Given the fact that very often *both* will be wrong, I'd go for the simpler
semantics.

Just





From guido at python.org  Tue May  2 06:51:45 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 00:51:45 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Tue, 02 May 2000 07:39:12 +0300."
             <Pine.GSO.4.10.10005020732200.8759-100000@sundial> 
References: <Pine.GSO.4.10.10005020732200.8759-100000@sundial> 
Message-ID: <200005020451.AAA23940@eric.cnri.reston.va.us>

> > import __builtin__
> import sys # Sorry, I couldn't resist
> > def displayhook(obj):
> >     if obj is None:
> >         return
> >     __builtin__._ = obj
> >     sys.stdout.write("%s\n" % repr(obj))
> 
> This brings up a painful point -- the reason I haven't written the default
> is because it was much easier to write it in Python. Of course, I
> shouldn't be preaching Python-is-easier-to-write-than-C here, but it
> pains me that Python cannot be written with more Python and less C.
> 

But the C code showing how to do it was present in the code you deleted
from ceval.c!

> A while ago we started talking about the mini-interpreter idea,
> which would then freeze Python code into itself, and then it sort of
> died out.  What has become of it?

Nobody sent me a patch :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)



From nhodgson at bigpond.net.au  Tue May  2 07:04:12 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 15:04:12 +1000
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <035501bfb3f3$db87fb10$e3cb8490@neil>

   I'm dropping in a bit late in this thread but can the current problem be
summarised in an example as "how is 'literal' interpreted here"?

s = aUnicodeStringFromSomewhere
DoSomething(s + "<literal>")

   The two options being that literal is either assumed to be encoded in
Latin-1 or UTF-8. I can see some arguments for both sides.

Latin-1: more current code was written in a European locale with an implicit
assumption that all string handling was Latin-1. Current editors are more
likely to be displaying literal as it is meant to be interpreted.

UTF-8: all languages can be written in UTF-8 and more recent editors can
display this correctly. Thus people using non-Roman alphabets can write code
which is interpreted as is seen with no need to remember to call conversion
functions.

   Neil




From tpassin at home.com  Tue May  2 07:07:07 2000
From: tpassin at home.com (tpassin at home.com)
Date: Tue, 2 May 2000 01:07:07 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>  <200005020331.XAA23818@eric.cnri.reston.va.us>
Message-ID: <006101bfb3f4$454f99e0$7cac1218@reston1.va.home.com>

Guido van Rossum said
<snip/>
> What would be the difference between string and string8?

Probably none, except to alert people that string8 might have different
behavior than the present-day string, perhaps when interacting with
unicode - probably its behavior would be specified more tightly (i.e., is it
strictly a list of bytes or does it have some assumption about encoding?) or
changed in some way from what we have now.  Or if it turned out that a lot
of programmers in other languages (perl, tcl, perhaps?) expected "string" to
behave in particular ways, the use of a term like "string8" might reduce
confusion.   Possibly none of these apply - no need for "string8" then.

>
> > Clarity and ease of use for the user should be primary, fast
> > implementations next.  If we didn't care about ease of use and clarity,
> > we could all use Scheme or C, don't lose sight of it.
> >
> > I'd suggest we could create some use cases or scenarios for this area -
> > needs input from those who know encodings and low level Python stuff
> > better than I.  Then we could examine more systematically how well
> > various approaches would work out.
>
> Very good.
>
<snip/>

Tom Passin




From effbot at telia.com  Tue May  2 08:59:03 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 08:59:03 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil>
Message-ID: <003b01bfb404$03cd0560$34aab5d4@hagrid>

Neil Hodgson <nhodgson at bigpond.net.au> wrote:
>    I'm dropping in a bit late in this thread but can the current problem be
> summarised in an example as "how is 'literal' interpreted here"?
> 
> s = aUnicodeStringFromSomewhere
> DoSomething(s + "<literal>")

nope.  the whole discussion centers around what happens
if you type:

    # example 1

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    DoSomething(s + u)

and

    # example 2

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    if len(u) + len(s) == len(u + s):
        print "true"
    else:
        print "not true"

in Guido's design, the first example may or may not result in
an "UTF-8 decoding error: UTF-8 decoding error: unexpected
code byte" exception.  the second example may result in a
similar error, print "true", or print "not true", depending on the
contents of the 8-bit string.

(under the counter proposal, the first example will never
raise an exception, and the second will always print "true")
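
(for completeness: a minimal sketch of the explicit spelling, assuming
the 8-bit data is known to be Latin-1 and using the 1.6 unicode()
built-in; the names are the ones from the examples above:

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    # say which encoding the bytes are in, instead of relying on
    # whatever the default conversion happens to be
    DoSomething(unicode(s, "iso-8859-1") + u)

    # latin-1 maps one byte to one character, so this always holds
    assert len(u) + len(s) == len(u + unicode(s, "iso-8859-1"))

with the encoding spelled out, neither design can surprise you.)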

...

the string literal issue is a slightly different problem.

> The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8. I can see some arguments for both sides.

better make that "two options", not "the two options" ;-)

a more flexible scheme would be to borrow the design from XML
(see http://www.w3.org/TR/1998/REC-xml-19980210). for those
who haven't looked closer at XML, it basically treats the source
file as an encoded unicode character stream, and does all
processing on the decoded side.

replace "entity" with "script file" in the following excerpts, and you
get close:

section 2.2:

    A parsed entity contains text, a sequence of characters,
    which may represent markup or character data.

    A character is an atomic unit of text as specified by
    ISO/IEC 10646.

section 4.3.3:

    Each external parsed entity in an XML document may
    use a different encoding for its characters. All XML
    processors must be able to read entities in either
    UTF-8 or UTF-16. 

    Entities encoded in UTF-16 must begin with the Byte
    Order Mark /.../ XML processors must be able to use
    this character to differentiate between UTF-8 and
    UTF-16 encoded documents.

    Parsed entities which are stored in an encoding other
    than UTF-8 or UTF-16 must begin with a text declaration
    containing an encoding declaration.

(also see appendix F: Autodetection of Character Encodings)

I propose that we adopt a similar scheme for Python -- but not
in 1.6.  the current "dunno, so we just copy the characters" is
good enough for now...
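
for the curious, the autodetection step could be as small as this
rough sketch (guess_script_encoding and declared_encoding are made-up
names; nothing like this exists in the interpreter today):

    def guess_script_encoding(first_bytes, declared_encoding=None):
        # a byte order mark means UTF-16, as in XML section 4.3.3
        if first_bytes[:2] in ('\xfe\xff', '\xff\xfe'):
            return "utf-16"
        # otherwise honour an explicit declaration, if there is one
        if declared_encoding:
            return declared_encoding
        # no BOM, no declaration: the XML default
        return "utf-8"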

</F>




From tim_one at email.msn.com  Tue May  2 09:20:52 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 2 May 2000 03:20:52 -0400
Subject: [Python-Dev] fun with unicode, part 1
In-Reply-To: <200004271523.LAA13614@eric.cnri.reston.va.us>
Message-ID: <000201bfb406$f2f35520$df2d153f@tim>

[Guido asks good questions about how Windows deals w/ Unicode filenames,
 last Thursday, but gets no answers]

> ...
> I'd like to solve this problem, but I have some questions: what *IS*
> the encoding used for filenames on Windows?  This may differ per
> Windows version; perhaps it can differ per drive letter?  Or per
> application or per thread?  On Windows NT, filenames are supposed to
> be Unicode.  (I suppose also on Windows 2000?)  How do I open a file
> with a given Unicode string for its name, in a C program?  I suppose
> there's a Win32 API call for that which has a Unicode variant.
>
> On Windows 95/98, the Unicode variants of the Win32 API calls don't
> exist.  So what is the poor Python runtime to do there?
>
> Can Japanese people use Japanese characters in filenames on Windows
> 95/98?  Let's assume they can.  Since the filesystem isn't Unicode
> aware, the filenames must be encoded.  Which encoding is used?  Let's
> assume they use Microsoft's multibyte encoding.  If they put such a
> file on a floppy and ship it to Linköping, what will Fredrik see as
> the filename?  (I.e., is the encoding fixed by the disk volume, or by
> the operating system?)
>
> Once we have a few answers here, we can solve the problem.  Note that
> sometimes we'll have to refuse a Unicode filename because there's no
> mapping for some of the characters it contains in the filename
> encoding used.

I just thought I'd repeat the questions <wink>.  However, I don't think
you'll really want the answers -- Windows is a legacy-encrusted mess, and
there are always many ways to get a thing done in the end.  For example ...

> Question: how does Fredrik create a file with a Euro
> character (u'\u20ac') in its name?

This particular one is shallower than you were hoping:  in many of the
TrueType fonts (e.g., Courier New but not Courier), Windows extended its
Latin-1 encoding by mapping the Euro symbol to the "control character" 0x80.
So I can get a Euro symbol into a file name just by typing Alt+0+1+2+8.
This is true even on US Win98 (which has no visible Unicode support) -- but
was not supported in US Win95.

i've-been-tracking-down-what-appears-to-be-a-hw-bug-on-a-japanese-laptop-
    at-work-so-can-verify-ms-sure-got-japanese-characters-into-the-
    filenames-somehow-but-doubt-it's-via-unicode-ly y'rs  - tim





From effbot at telia.com  Tue May  2 09:55:49 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 09:55:49 +0200
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim>
Message-ID: <007d01bfb40b$d7693720$34aab5d4@hagrid>

Tim Peters wrote:
> [Guido asks good questions about how Windows deals w/ Unicode filenames,
>  last Thursday, but gets no answers]

you missed Finn Bock's post on how Java does it.

here's another data point:

Tcl uses a system encoding to convert from unicode to a suitable
system API encoding, and uses the following approach to figure out
what that one is:

    windows NT/2000:
        unicode (use wide api)

    windows 95/98:
        "cp%d" % GetACP()
        (note that this is "cp1252" in us and western europe,
        not "iso-8859-1")
  
    macintosh:
        determine encoding for fontId 0 based on (script,
        smScriptLanguage) tuple. if that fails, assume
        "macroman"

    unix:
        figure out the locale from LC_ALL, LC_CTYPE, or LANG.
        use heuristics to map from the locale to an encoding
        (see unix/tclUnixInit). if that fails, assume "iso-8859-1"

I propose adding a similar mechanism to Python, along these lines:

    sys.getdefaultencoding() returns the right thing for windows
    and macintosh, "iso-8859-1" for other platforms.

    sys.setencoding(codec) changes the system encoding.  it's
    used from site.py to set things up properly on unix and other
    non-unicode platforms.
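
    here's roughly what the unix case could look like (just a sketch:
    sys.setencoding() is the proposed call and doesn't exist yet, and
    the helper name is made up):

        import os, string

        def _guess_locale_encoding():
            name = (os.environ.get("LC_ALL") or
                    os.environ.get("LC_CTYPE") or
                    os.environ.get("LANG") or "")
            # "en_US.ISO8859-1" -> "iso8859-1"; otherwise assume latin-1
            if "." in name:
                return string.lower(string.split(name, ".")[1])
            return "iso-8859-1"

        # what site.py might then do on non-unicode platforms:
        # sys.setencoding(_guess_locale_encoding())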

</F>




From nhodgson at bigpond.net.au  Tue May  2 10:22:36 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 18:22:36 +1000
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim>
Message-ID: <004501bfb40f$92ff0980$e3cb8490@neil>

> > I'd like to solve this problem, but I have some questions: what *IS*
> > the encoding used for filenames on Windows?  This may differ per
> > Windows version; perhaps it can differ per drive letter?  Or per
> > application or per thread?  On Windows NT, filenames are supposed to
> > be Unicode.  (I suppose also on Windows 2000?)  How do I open a file
> > with a given Unicode string for its name, in a C program?  I suppose
> > there's a Win32 API call for that which has a Unicode variant.

   It's decided by each file system.

   For FAT file systems, the OEM code page is used. The OEM code page
generally used in the United States is code page 437, which is different from
the code page Windows uses for display. I had to deal with this in a system
where people used fractions (1/4, 1/2 and 3/4) as part of names which had to
be converted into valid file names. For example 1/4 is 0xBC for display but
0xAC when used in a file name.

   In Japan, I think different manufacturers used different encodings with
NEC trying to maintain market control with their own encoding.

   VFAT stores both Unicode long file names and shortened aliases. However
the Unicode variant is hard to get to from Windows 95/98.

   NTFS stores Unicode.

> > On Windows 95/98, the Unicode variants of the Win32 API calls don't
> > exist.  So what is the poor Python runtime to do there?

   Fail the call. All existing files can be opened because they have short
non-Unicode aliases. If a file with a Unicode name can not be created
because the OS doesn't support it then you should give up. Just as you
should give up if you try to save a file with a name that includes a
character not allowed by the file system.

> > Can Japanese people use Japanese characters in filenames on Windows
> > 95/98?

   Yes.

> > Let's assume they can.  Since the filesystem isn't Unicode
> > aware, the filenames must be encoded.  Which encoding is used?  Let's
> > assume they use Microsoft's multibyte encoding.  If they put such a
> > file on a floppy and ship it to Linköping, what will Fredrik see as
> > the filename?  (I.e., is the encoding fixed by the disk volume, or by
> > the operating system?)

   If Fredrik is running a non-Japanese version of Windows 9x, he will see
some 'random' western characters replacing the Japanese.

   Neil




From effbot at telia.com  Tue May  2 10:36:40 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 10:36:40 +0200
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil>
Message-ID: <008501bfb411$8e0502c0$34aab5d4@hagrid>

Neil Hodgson wrote:
>    It's decided by each file system.

...but the system API translates from the active code page to the
encoding used by the file system, right?

on my w95 box, GetACP() returns 1252, and GetOEMCP() returns
850.  

if I create a file with a name containing latin-1 characters, on a
FAT drive, it shows up correctly in the file browser (cp1252), and
also shows up correctly in the MS-DOS window (under cp850).

if I print the same filename to stdout in the same DOS window, I
get gibberish.

> > > On Windows 95/98, the Unicode variants of the Win32 API calls don't
> > > exist.  So what is the poor Python runtime to do there?
> 
>    Fail the call.

...if you fail to convert from unicode to the local code page.

</F>




From mal at lemburg.com  Tue May  2 10:36:43 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 10:36:43 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <390E939B.11B99B71@lemburg.com>

Just a small note on the subject of a character being atomic
which seems to have been forgotten by the discussing parties:

Unicode itself can be understood as a multi-word character
encoding, just like UTF-8. The reason is that Unicode entities
can be combined to produce single display characters (e.g.
u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
Slicing such a combined Unicode string will have the same
effect as slicing UTF-8 data.
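
A concrete illustration of the combination (just a sketch, with the
repr output written from memory):

    s = u"e" + u"\u0301"    # base letter plus combining acute accent
    print len(s)            # 2 -- two code points, one display character
    print repr(s[:1])       # u'e'
    print repr(s[1:])       # u'\u0301' -- a lone combining mark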

It seems that most Latin-1 proponents have single
display characters in mind. While the same is true for
many Unicode entities, there are quite a few cases of
combining characters in Unicode 3.0 and the Unicode
normalization algorithm uses these as a basis for its
work.

So in the end the "UTF-8 doesn't slice" argument holds for
Unicode itself too, just as it also does for many Asian
multi-byte variable length character encodings,
image formats, audio formats, database formats, etc.

You can't really expect slicing to always "just work"
without some knowledge about the data you are slicing.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From ping at lfw.org  Tue May  2 10:42:51 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 01:42:51 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <Pine.LNX.4.10.10005020114250.522-100000@localhost>

I'll warn you that i'm not much experienced or well-informed, but
i suppose i might as well toss in my naive opinion.

At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
> 
> I believe that whether the default encoding is UTF-8 or Latin-1
> doesn't matter for here -- both are wrong, she needs to write explicit
> unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
> "better", because [this] will most likely give an exception...

On Tue, 2 May 2000, Just van Rossum wrote:
> But then it's even better to *always* raise an exception, since it's
> entirely possible a string contains valid utf-8 while not *being* utf-8.

I believe it is time for me to make a truly radical proposal:

    No automatic conversions between 8-bit "strings" and Unicode strings.

If you want to turn UTF-8 into a Unicode string, say so.
If you want to turn Latin-1 into a Unicode string, say so.
If you want to turn ISO-2022-JP into a Unicode string, say so.
Adding a Unicode string and an 8-bit "string" gives an exception.
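
Spelled out, that looks something like this (a sketch; it assumes a
codec for each named encoding is installed -- iso-2022-jp in
particular is not in the core distribution):

    data = open("input.txt", "rb").read()   # 8-bit bytes, encoding unknown

    u = unicode(data, "utf-8")         # i know these bytes are UTF-8
    u = unicode(data, "latin-1")       # ... or Latin-1
    u = unicode(data, "iso-2022-jp")   # ... or ISO-2022-JP

    u + data    # under this proposal: raises an exception, no guessing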

I know this sounds tedious, but at least it stands the least possible
chance of confusing anyone -- and given all i've seen here and in
other i18n and l10n discussions, there's plenty enough confusion to
go around already.


If it turns out automatic conversions *are* absolutely necessary,
then i vote in favour of the simple, direct method promoted by Paul
and Fredrik: just copy the numerical values of the bytes.  The fact
that this happens to correspond to Latin-1 is not really the point;
the main reason is that it satisfies the Principle of Least Surprise.


Okay.  Feel free to yell at me now.


-- ?!ng

P. S.  The scare-quotes when i talk about 8-bit "strings" expose my
sense of them as byte-buffers -- since that *is* all you get when you
read in some bytes from a file.  If you manipulate an 8-bit "string"
as a character string, you are implicitly making the assumption that
the byte values correspond to the character encoding of the character
repertoire you want to work with, and that's your responsibility.

P. P. S.  If always having to specify encodings is really too much,
i'd probably be willing to consider a default-encoding state on the
Unicode class, but it would have to be a stack of values, not a
single value.




From effbot at telia.com  Tue May  2 11:00:07 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 11:00:07 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."             <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
Message-ID: <009701bfb414$d35d0ea0$34aab5d4@hagrid>

M.-A. Lemburg <mal at lemburg.com> wrote:
> Just a small note on the subject of a character being atomic
> which seems to have been forgotten by the discussing parties:
> 
> Unicode itself can be understood as multi-word character
> encoding, just like UTF-8. The reason is that Unicode entities
> can be combined to produce single display characters (e.g.
> u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> Slicing such a combined Unicode string will have the same
> effect as slicing UTF-8 data.

really?  does it result in a decoder error?  or does it just result
in a rendering error, just as if you slice off any trailing character
without looking...

> It seems that most Latin-1 proponents seem to have single
> display characters in mind. While the same is true for
> many Unicode entities, there are quite a few cases of
> combining characters in Unicode 3.0 and the Unicode
> normalization algorithm uses these as basis for its
> work.

do we support automatic normalization in 1.6?

</F>




From ping at lfw.org  Tue May  2 11:46:40 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:46:40 -0700 (PDT)
Subject: [Python-Dev] At the interactive port
In-Reply-To: <Pine.GSO.4.10.10005020732200.8759-100000@sundial>
Message-ID: <Pine.LNX.4.10.10005020242270.522-100000@localhost>

On Tue, 2 May 2000, Moshe Zadka wrote:
> 
> > Thanks for bringing this up again.  I think it should be called
> > sys.displayhook.

I apologize profusely for dropping the ball on this.  I
was going to do it; i have been having a tough time lately
figuring out a Big Life Decision.  (Hate those BLDs.)

I was partway through hacking the patch and didn't get back
to it, but i wanted to at least air the plan i had in mind.
I hope you'll allow me this indulgence.

I was planning to submit a patch that adds the built-in routines

    sys.display
    sys.displaytb

    sys.__display__
    sys.__displaytb__

sys.display(obj) would be implemented as 'print repr(obj)'
and sys.displaytb(tb, exc) would call the same built-in
traceback printer we all know and love.
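
In pure Python the pair would amount to roughly this (a sketch; it
assumes class-based exceptions, and the names are the proposed ones,
not an existing API):

    import sys, traceback

    def display(obj):
        print repr(obj)

    def displaytb(tb, exc):
        traceback.print_exception(exc.__class__, exc, tb, None, sys.stderr)

    sys.display = sys.__display__ = display
    sys.displaytb = sys.__displaytb__ = displaytb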

I assumed that sys.__stdin__ was added to make it easier to
restore sys.stdin to its original value.  In the same vein,
sys.__display__ and sys.__displaytb__ would be saved references
to the original sys.display and sys.displaytb.

I hate to contradict Guido, but i'll gently suggest why i
like "display" better than "displayhook": "display" is a verb,
and i prefer function names to be verbs rather than nouns
describing what the functions are (e.g. "read" rather than
"reader", etc.)


-- ?!ng




From ping at lfw.org  Tue May  2 11:47:34 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:47:34 -0700 (PDT)
Subject: [Python-Dev] Traceback style
Message-ID: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>

This was also going to go out after i posted the
display/displaytb patch.  But anyway, let's see what
you all think.

I propose the following stylistic changes to traceback
printing:

    1.  If there is no function name for a given level
        in the traceback, just omit the ", in ?" at the
        end of the line.

    2.  If a given level of the traceback is in a method,
        instead of just printing the method name, print
        the class and the method name.

    3.  Instead of beginning each line with:
        
            File "foo.py", line 5

        print the line first and drop the quotes:

            Line 5 of foo.py

        In the common interactive case that the file
        is a typed-in string, the current printout is
        
            File "<stdin>", line 1
        
        and the following is easier to read in my opinion:

            Line 1 of <stdin>

Here is an example:

    >>> class Spam:
    ...     def eggs(self):
    ...         return self.ham
    ... 
    >>> s = Spam()
    >>> s.eggs()
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 3, in eggs
    AttributeError: ham

With the suggested changes, this would print as

    Traceback (innermost last):
      Line 1 of <stdin>
      Line 3 of <stdin>, in Spam.eggs
    AttributeError: ham
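
Most of this can be prototyped on top of the existing traceback
module (a sketch; change 2 is left out, since extract_tb() only
reports the function name and finding the class needs frame
inspection):

    import sys, traceback

    def print_compact_tb(tb, exc, file=None):
        file = file or sys.stderr
        file.write("Traceback (innermost last):\n")
        for filename, lineno, funcname, source in traceback.extract_tb(tb):
            line = "  Line %d of %s" % (lineno, filename)
            if funcname and funcname != '?':
                line = line + ", in %s" % funcname
            file.write(line + "\n")
        # assumes a class-based exception
        for text in traceback.format_exception_only(exc.__class__, exc):
            file.write(text)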



-- ?!ng

"In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand."
    -- Gerald Holton




From ping at lfw.org  Tue May  2 11:53:01 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 02:53:01 -0700 (PDT)
Subject: [Python-Dev] Traceback behaviour in exceptional cases
Message-ID: <Pine.LNX.4.10.10004170045510.1157-100000@localhost>

Here is how i was planning to take care of exceptions in
sys.displaytb...


    1.  When the 'sys' module does not contain a 'stderr'
        attribute, Python currently prints 'lost sys.stderr'
        to the original stderr instead of printing the traceback.
        I propose that it proceed to try to print the traceback
        to the real stderr in this case.

    2.  If 'sys.stderr' is buffered, the traceback does not
        appear in the file.  I propose that Python flush
        'sys.stderr' immediately after printing a traceback.

    3.  Tracebacks get printed to whatever object happens to
        be in 'sys.stderr'.  If the object is not a file (or
        other problems occur during printing), nothing gets
        printed anywhere.  I propose that Python warn about
        this on stderr, then try to print the traceback to
        the real stderr as above.

    4.  Similarly, 'sys.displaytb' may cause an exception.
        I propose that when this happens, Python invoke its
        default traceback printer to print the exception from
        'sys.displaytb' as well as the original exception.

#4 may seem a little convoluted, so here is the exact logic
i suggest (described here in Python but to be implemented in C),
where 'handle_exception()' is the routine the interpreter uses
to handle an exception, 'print_exception' is the built-in
exception printer currently implemented in PyErr_PrintEx and
PyTraceBack_Print, and 'err' is the actual, original stderr.

    def print_double_exception(tb, exc, disptb, dispexc, file):
        file.write("Exception occurred during traceback display:\n")
        print_exception(disptb, dispexc, file)
        file.write("\n")
        file.write("Original exception passed to display routine:\n")
        print_exception(tb, exc, file)

    def handle_double_exception(tb, exc, disptb, dispexc):
        if not hasattr(sys, 'stderr'):
            err.write("Missing sys.stderr; printing exception to stderr.\n")
            print_double_exception(tb, exc, disptb, dispexc, err)
            return
        try:
            print_double_exception(tb, exc, disptb, dispexc, sys.stderr)
        except:
            err.write("Error on sys.stderr; printing exception to stderr.\n")
            print_double_exception(tb, exc, disptb, dispexc, err)

    def handle_exception():
        tb, exc = sys.exc_traceback, sys.exc_value
        try:
            sys.displaytb(tb, exc)
        except:
            disptb, dispexc = sys.exc_traceback, sys.exc_value
            try:
                handle_double_exception(tb, exc, disptb, dispexc)
            except: pass

    def default_displaytb(tb, exc):
        if hasattr(sys, 'stderr'):
            print_exception(tb, exc, sys.stderr)
        else:
            print "Missing sys.stderr; printing exception to stderr."
            print_exception(tb, exc, err)

    sys.displaytb = sys.__displaytb__ = default_displaytb



-- ?!ng

"In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand."
    -- Gerald Holton




From mal at lemburg.com  Tue May  2 11:56:21 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 11:56:21 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."             <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>
Message-ID: <390EA645.89E3B22A@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal at lemburg.com> wrote:
> > Just a small note on the subject of a character being atomic
> > which seems to have been forgotten by the discussing parties:
> >
> > Unicode itself can be understood as multi-word character
> > encoding, just like UTF-8. The reason is that Unicode entities
> > can be combined to produce single display characters (e.g.
> > u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> > Slicing such a combined Unicode string will have the same
> > effect as slicing UTF-8 data.
> 
> really?  does it result in a decoder error?  or does it just result
> in a rendering error, just as if you slice off any trailing character
> without looking...

In the example, if you cut off the u"\u0301", the "e" would
appear without the acute accent; cutting off the u"e" would
probably result in a rendering error or, worse, put the accent
over the next character to the left.

UTF-8 is better in this respect: it warns you about
the error by raising an exception when being converted to
Unicode.
 
> > It seems that most Latin-1 proponents seem to have single
> > display characters in mind. While the same is true for
> > many Unicode entities, there are quite a few cases of
> > combining characters in Unicode 3.0 and the Unicode
> > normalization algorithm uses these as basis for its
> > work.
> 
> do we support automatic normalization in 1.6?

No, but it is likely to appear in 1.7... not sure about
the "automatic" though.

FYI: Normalization is needed to make comparing Unicode
strings robust, e.g. u"é" should compare equal to u"e\u0301".
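
To give a feel for it (a sketch; there is no normalization support in
the core yet, so the last step is hypothetical):

    a = u"\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
    b = u"e\u0301"    # "e" plus COMBINING ACUTE ACCENT, two code points

    print a == b          # 0 -- unequal without normalization
    print len(a), len(b)  # 1 2

    # a hypothetical normalize() would map both spellings to the same
    # form, making the comparison succeed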

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From esr at thyrsus.com  Tue May  2 12:16:55 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 2 May 2000 06:16:55 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>; from ping@lfw.org on Tue, May 02, 2000 at 02:47:34AM -0700
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <20000502061655.A16999@thyrsus.com>

Ka-Ping Yee <ping at lfw.org>:
> I propose the following stylistic changes to traceback
> printing:
> 
>     1.  If there is no function name for a given level
>         in the traceback, just omit the ", in ?" at the
>         end of the line.
> 
>     2.  If a given level of the traceback is in a method,
>         instead of just printing the method name, print
>         the class and the method name.
> 
>     3.  Instead of beginning each line with:
>         
>             File "foo.py", line 5
> 
>         print the line first and drop the quotes:
> 
>             Line 5 of foo.py
> 
>         In the common interactive case that the file
>         is a typed-in string, the current printout is
>         
>             File "<stdin>", line 1
>         
>         and the following is easier to read in my opinion:
> 
>             Line 1 of <stdin>
> 
> Here is an example:
> 
>     >>> class Spam:
>     ...     def eggs(self):
>     ...         return self.ham
>     ... 
>     >>> s = Spam()
>     >>> s.eggs()
>     Traceback (innermost last):
>       File "<stdin>", line 1, in ?
>       File "<stdin>", line 3, in eggs
>     AttributeError: ham
> 
> With the suggested changes, this would print as
> 
>     Traceback (innermost last):
>       Line 1 of <stdin>
>       Line 3 of <stdin>, in Spam.eggs
>     AttributeError: ham

IMHO, this is not a good idea.  Emacs users like me want traceback
labels to be *more* like C compiler error messages, not less.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The United States is in no way founded upon the Christian religion
	-- George Washington & John Adams, in a diplomatic message to Malta.



From moshez at math.huji.ac.il  Tue May  2 12:12:14 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Tue, 2 May 2000 13:12:14 +0300 (IDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>

On Mon, 1 May 2000, Guido van Rossum wrote:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

Well, I'm guessing you had someone specific in mind (Neil?), but I want to
say something too, as the only one here (I think) using ISO-8859-8
natively. I much prefer the Fredrik-Paul position, known also as the
character is a character position, to the UTF-8 as default encoding.
Unicode is western-centered -- the first 256 characters are Latin 1. UTF-8
is even more horribly western-centered (or I should say USA centered) --
ASCII documents are the same. I'd much prefer Python to reflect a
fundamental truth about Unicode, which at least makes sure binary-goop can
pass through Unicode and remain unharmed, than to reflect a nasty problem
with UTF-8 (not everything is legal). 

If I'm using Hebrew characters in my source (which I won't for a long
while), I'll use them in Unicode strings only, and make sure I use
Unicode. If I'm reading Hebrew from an ISO-8859-8 file, I'll set a
conversion to Unicode on the fly anyway, since most bidi libraries work on
Unicode. So having UTF-8 conversions magically happen won't help me at
all, and will only cause problem when I use "sort-for-uniqueness" on a
list with mixed binary-goop and Unicode strings. In short, this sounds
like a recipe for disaster.

internationally y'rs, Z.

--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From pf at artcom-gmbh.de  Tue May  2 12:12:26 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 2 May 2000 12:12:26 +0200 (MEST)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us> from "Barry A. Warsaw" at "May 1, 2000 12:18:25 pm"
Message-ID: <m12mZfG-000CnCC@artcom0.artcom-gmbh.de>

Barry A. Warsaw:
> Update of /projects/cvsroot/python/dist/src/Doc/lib
[...]
> 	libos.tex 
[...]
>   Availability: Macintosh, \UNIX{}, Windows.
>   \end{funcdesc}
> --- 703,712 ----
>   \end{funcdesc}
>   
> ! \begin{funcdesc}{utime}{path, times}
> ! Set the access and modified times of the file specified by \var{path}.
> ! If \var{times} is \code{None}, then the file's access and modified
> ! times are set to the current time.  Otherwise, \var{times} must be a
> ! 2-tuple of numbers, of the form \var{(atime, mtime)} which is used to
> ! set the access and modified times, respectively.
>   Availability: Macintosh, \UNIX{}, Windows.
>   \end{funcdesc}

I may have missed something, but I haven't seen a patch to the WinXX
and MacOS implementations of the 'utime' function.  So either the
documentation should explicitly point out that the new additional
signature is only available on Unices, or even better it should be
implemented on all platforms so that programmers intending to write
portable Python do not have to worry about this.

I suggest an additional note saying that this signature has been
added in Python 1.6.  There used to be several such notes all over
the documentation saying for example: "New in version 1.5.2." which
I found very useful in the past!

Regards, Peter



From nhodgson at bigpond.net.au  Tue May  2 12:22:00 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 20:22:00 +1000
Subject: [Python-Dev] fun with unicode, part 1
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil> <008501bfb411$8e0502c0$34aab5d4@hagrid>
Message-ID: <00d101bfb420$4197e510$e3cb8490@neil>

> ...but the system API translates from the active code page to the
> encoding used by the file system, right?

   Yes, although I think that wasn't the case with Win16 and there are still
some situations in which you have to deal with the differences. Copying a
file from the console on Windows 95 to a FAT volume appears to allow use of
the OEM character set with no conversion.

> if I create a file with a name containing latin-1 characters, on a
> FAT drive, it shows up correctly in the file browser (cp1252), and
> also shows up correctly in the MS-DOS window (under cp850).

   Do you have a FAT drive or a VFAT drive? If you format as FAT on 9x or NT
you will get a VFAT volume.

   Neil




From ping at lfw.org  Tue May  2 12:23:26 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 03:23:26 -0700 (PDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <20000502061655.A16999@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005020317030.522-100000@localhost>

On Tue, 2 May 2000, Eric S. Raymond wrote:
>
> Ka-Ping Yee <ping at lfw.org>:
> > 
> > With the suggested changes, this would print as
> > 
> >     Traceback (innermost last):
> >       Line 1 of <stdin>
> >       Line 3 of <stdin>, in Spam.eggs
> >     AttributeError: ham
> 
> IMHO, this is not a good idea.  Emacs users like me want traceback
> labels to be *more* like C compiler error messages, not less.

I suppose Python could go all the way and say things like

    Traceback (innermost last):
      <stdin>:3
      foo.py:25: in Spam.eggs
    AttributeError: ham

but that might be more intimidating for a beginner.

Besides, you Emacs guys have plenty of programmability anyway :)
You would have to do a little parsing to get the file name and
line number from the current format; it's no more work to get
it from the suggested format.

(What i would really like, by the way, is to see the values of
the function arguments on the stack -- but that's a lot of work
to do in C, so implementing this with the help of repr.repr
will probably be the first thing i do with sys.displaytb.)
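
Roughly along these lines, assuming the proposed sys.displaytb hook is
in place (only a sketch; repr.repr keeps oversized values short, and
class-based exceptions are assumed):

    import sys, traceback, repr

    def displaytb(tb, exc):
        sys.stderr.write("Traceback (innermost last):\n")
        while tb is not None:
            frame, code = tb.tb_frame, tb.tb_frame.f_code
            sys.stderr.write("  Line %d of %s, in %s\n"
                             % (tb.tb_lineno, code.co_filename, code.co_name))
            for name in code.co_varnames[:code.co_argcount]:
                value = frame.f_locals.get(name, "<deleted>")
                sys.stderr.write("    %s = %s\n" % (name, repr.repr(value)))
            tb = tb.tb_next
        for text in traceback.format_exception_only(exc.__class__, exc):
            sys.stderr.write(text)

    sys.displaytb = displaytb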


-- ?!ng




From mal at lemburg.com  Tue May  2 12:46:06 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 12:46:06 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>
Message-ID: <390EB1EE.EA557CA9@lemburg.com>

Moshe Zadka wrote:
> 
> I'd much prefer Python to reflect a
> fundamental truth about Unicode, which at least makes sure binary-goop can
> pass through Unicode and remain unharmed, than to reflect a nasty problem
> with UTF-8 (not everything is legal).

Let's not make the same mistake again: Unicode objects should *not*
be used to hold binary data. Please use buffers instead.

BTW, I think that this behaviour should be changed:

>>> buffer('binary') + 'data'
'binarydata'

while:

>>> 'data' + buffer('binary')         
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation

IMHO, buffer objects should never coerce to strings, but instead
return a buffer object holding the combined contents. The
same applies to slicing buffer objects:

>>> buffer('binary')[2:5]
'nar'

should preferably be buffer('nar').

--

Hmm, perhaps we need something like a data string object
to get this 100% right ?!

>>> d = data("...data...")
or
>>> d = d"...data..."
>>> print type(d)
<type 'data'>

>>> 'string' + d
d"string...data..."
>>> u'string' + d
d"s\000t\000r\000i\000n\000g\000...data..."

>>> d[:5]
d"...da"

etc.

Ideally, string and Unicode objects would then be subclasses
of this type in Py3K.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From pf at artcom-gmbh.de  Tue May  2 12:59:55 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 2 May 2000 12:59:55 +0200 (MEST)
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10005020317030.522-100000@localhost> from Ka-Ping Yee at "May 2, 2000  3:23:26 am"
Message-ID: <m12maPD-000CnCC@artcom0.artcom-gmbh.de>

> > Ka-Ping Yee <ping at lfw.org>:
> > > 
> > > With the suggested changes, this would print as
> > > 
> > >     Traceback (innermost last):
> > >       Line 1 of <stdin>
> > >       Line 3 of <stdin>, in Spam.eggs
> > >     AttributeError: ham

> On Tue, 2 May 2000, Eric S. Raymond wrote:
> > IMHO, this is not a good idea.  Emacs users like me want traceback
> > labels to be *more* like C compiler error messages, not less.
> 
Ka-Ping Yee :
[...]
> Besides, you Emacs guys have plenty of programmability anyway :)
> You would have to do a little parsing to get the file name and
> line number from the current format; it's no more work to get
> it from the suggested format.

I like Ping's proposed traceback output.

But besides existing Elisp code there might be other software relying
on a particular format.  As a long-time vim user I have absolutely
no idea about other IDEs.  So before changing the default format this
should be carefully checked.

> (What i would really like, by the way, is to see the values of
> the function arguments on the stack -- but that's a lot of work
> to do in C, so implementing this with the help of repr.repr
> will probably be the first thing i do with sys.displaytb.)

I'm eagerly waiting to see this. ;-)

Regards, Peter



From just at letterror.com  Tue May  2 14:34:57 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 13:34:57 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390E939B.11B99B71@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            	
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>	
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <l03102804b534772fc25b@[193.78.237.142]>

At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
>Just a small note on the subject of a character being atomic
>which seems to have been forgotten by the discussing parties:
>
>Unicode itself can be understood as multi-word character
>encoding, just like UTF-8. The reason is that Unicode entities
>can be combined to produce single display characters (e.g.
>u"e"+u"\u0301" will print "é" in a Unicode aware renderer).

Erm, are you sure Unicode prescribes this behavior, for this
example? I know similar behaviors are specified for certain
languages/scripts, but I didn't know it did that for Latin.

>Slicing such a combined Unicode string will have the same
>effect as slicing UTF-8 data.

Not true. As Fredrik noted: no exception will be raised.

[ Speaking of exceptions,

after I sent off my previous post I realized Guido's
non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
argument can easily be turned around, backfiring at utf-8:

    Defaulting to utf-8 when going from Unicode to 8-bit and
    back only gives the *illusion* things "just work", since it
    will *silently* "work", even if utf-8 is *not* the desired
    8-bit encoding -- as shown by Fredrik's excellent "fun with
    Unicode, part 1" example. Defaulting to Latin-1 will
    warn the user *much* earlier, since it'll barf when
    converting a Unicode string that contains any character
    code > 255. So there.
]
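
A two-line illustration with the 1.6 unicode API (U+20AC is the euro
sign, which has no Latin-1 equivalent):

    u"\u20ac".encode("utf-8")     # silently succeeds: '\xe2\x82\xac'
    u"\u20ac".encode("latin-1")   # raises UnicodeError -- the early warning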

>It seems that most Latin-1 proponents seem to have single
>display characters in mind. While the same is true for
>many Unicode entities, there are quite a few cases of
>combining characters in Unicode 3.0 and the Unicode
>normalization algorithm uses these as basis for its
>work.

Still, two combining characters are still two input characters for
the renderer! They may result in one *glyph*, but trust me,
that's an entirely different can of worms.

However, if you'd be talking about Unicode surrogates,
you'd definitely have a point. How do Java/Perl/Tcl deal with
surrogates?

Just





From nhodgson at bigpond.net.au  Tue May  2 13:40:44 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 2 May 2000 21:40:44 +1000
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil> <003b01bfb404$03cd0560$34aab5d4@hagrid>
Message-ID: <013e01bfb42b$41a3f200$e3cb8490@neil>

>    u = aUnicodeStringFromSomewhere
>    s = an8bitStringFromSomewhere
>
>    DoSomething(s + u)

> in Guido's design, the first example may or may not result in
> an "UTF-8 decoding error: UTF-8 decoding error: unexpected
> code byte" exception.

   I would say it is less surprising for most people for this to follow the
silent-widening of each byte - the Fredrik-Paul position. With the current
scarcity of UTF-8 code, very few people will expect an automatic UTF-8 to
UTF-16 conversion. While complete prohibition of automatic conversion has
some appeal, it will just be more noise to many.

>    u = aUnicodeStringFromSomewhere
>    s = an8bitStringFromSomewhere
>
>    if len(u) + len(s) == len(u + s):
>        print "true"
>    else:
>        print "not true"

> the second example may result in a
> similar error, print "true", or print "not true", depending on the
> contents of the 8-bit string.

   I don't see this as important, as it's trying to take the idea that Unicode
strings are equivalent to 8-bit strings too far. How much further before you have to
break? I always thought of len measuring the number of bytes rather than
characters when applied to strings. The same as strlen in C when you have a
DBCS string.

   I should correct some of the stuff Mark wrote about me. At Fujitsu we did
a lot more DBCS work than Unicode because that's what Japanese code uses.
Even with Java most storage is still DBCS. I was more involved with Unicode
architecture at Reuters 6 or so years ago.

   Neil




From guido at python.org  Tue May  2 13:53:10 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 07:53:10 -0400
Subject: [Python-Dev] At the interactive port
In-Reply-To: Your message of "Tue, 02 May 2000 02:46:40 PDT."
             <Pine.LNX.4.10.10005020242270.522-100000@localhost> 
References: <Pine.LNX.4.10.10005020242270.522-100000@localhost> 
Message-ID: <200005021153.HAA24134@eric.cnri.reston.va.us>

> I was planning to submit a patch that adds the built-in routines
> 
>     sys.display
>     sys.displaytb
> 
>     sys.__display__
>     sys.__displaytb__
> 
> sys.display(obj) would be implemented as 'print repr(obj)'
> and sys.displaytb(tb, exc) would call the same built-in
> traceback printer we all know and love.

Sure.  Though I would recommend separating the patch into two parts,
because their implementations are totally unrelated.

> I assumed that sys.__stdin__ was added to make it easier to
> restore sys.stdin to its original value.  In the same vein,
> sys.__display__ and sys.__displaytb__ would be saved references
> to the original sys.display and sys.displaytb.

Good idea.

> I hate to contradict Guido, but i'll gently suggest why i
> like "display" better than "displayhook": "display" is a verb,
> and i prefer function names to be verbs rather than nouns
> describing what the functions are (e.g. "read" rather than
> "reader", etc.)

Good idea.  But I hate the "displaytb" name (when I read your message
I had no idea what the "tb" stood for until you explained it).

Hm, perhaps we could do showvalue and showtraceback?
("displaytraceback" is a bit long.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:15:28 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:15:28 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: Your message of "Tue, 02 May 2000 12:12:26 +0200."
             <m12mZfG-000CnCC@artcom0.artcom-gmbh.de> 
References: <m12mZfG-000CnCC@artcom0.artcom-gmbh.de> 
Message-ID: <200005021215.IAA24169@eric.cnri.reston.va.us>

> > ! \begin{funcdesc}{utime}{path, times}
> > ! Set the access and modified times of the file specified by \var{path}.
> > ! If \var{times} is \code{None}, then the file's access and modified
> > ! times are set to the current time.  Otherwise, \var{times} must be a
> > ! 2-tuple of numbers, of the form \var{(atime, mtime)} which is used to
> > ! set the access and modified times, respectively.
> >   Availability: Macintosh, \UNIX{}, Windows.
> >   \end{funcdesc}
> 
> I may have missed something, but I haven't seen a patch to the WinXX
> and MacOS implementation of the 'utime' function.  So either the
> documentation should explicitly point out, that the new additional
> signature is only available on Unices or even better it should be
> implemented on all platforms so that programmers intending to write
> portable Python have not to worry about this.

Actually, it works on WinXX (tested on 98).  The utime()
implementation there is the same file as on Unix, so the patch fixed
both platforms.  The MS C library only seems to set the mtime, but
that's okay.

On Mac, I hope that the utime() function in GUSI 2 does this, in which
case Jack Jansen needs to copy Barry's patch.

> I suggest an additional note saying that this signature has been
> added in Python 1.6.  There used to be several such notes all over
> the documentation saying for example: "New in version 1.5.2." which
> I found very useful in the past!

Thanks, you're right!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:19:38 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:19:38 -0400
Subject: [Python-Dev] fun with unicode, part 1
In-Reply-To: Your message of "Tue, 02 May 2000 20:22:00 +1000."
             <00d101bfb420$4197e510$e3cb8490@neil> 
References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil> <008501bfb411$8e0502c0$34aab5d4@hagrid>  
            <00d101bfb420$4197e510$e3cb8490@neil> 
Message-ID: <200005021219.IAA24181@eric.cnri.reston.va.us>

>    Yes, although I think that wasn't the case with Win16 and there are still
> some situations in which you have to deal with the differences. Copying a
> file from the console on Windows 95 to a FAT volume appears to allow use of
> the OEM character set with no conversion.

BTW, MS's use of code pages is full of shit.  Yesterday I was
spell-checking a document that had the name Andre in it (the accent
was missing).  The popup menu suggested Andr* where the * was an upper
case slashed O.  I first thought this was because the menu character
set might be using a different code page, but no -- it must have been
bad in the database, because selecting that entry from the menu
actually inserted the slashed O character.  So they must have been
maintaining their database with a different code page.

Just to indicate that when we sort out the rest of the Unicode debate
(which I'm sure we will :-) there will still be surprises on
Windows...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:22:24 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:22:24 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 03:23:26 PDT."
             <Pine.LNX.4.10.10005020317030.522-100000@localhost> 
References: <Pine.LNX.4.10.10005020317030.522-100000@localhost> 
Message-ID: <200005021222.IAA24192@eric.cnri.reston.va.us>

> > Ka-Ping Yee <ping at lfw.org>:
> > > With the suggested changes, this would print as
> > > 
> > >     Traceback (innermost last):
> > >       Line 1 of <stdin>
> > >       Line 3 of <stdin>, in Spam.eggs
> > >     AttributeError: ham

ESR:
> > IMHO, this is not a good idea.  Emacs users like me want traceback
> > labels to be *more* like C compiler error messages, not less.

Ping:
> I suppose Python could go all the way and say things like
> 
>     Traceback (innermost last):
>       <stdin>:3
>       foo.py:25: in Spam.eggs
>     AttributeError: ham
> 
> but that might be more intimidating for a beginner.
> 
> Besides, you Emacs guys have plenty of programmability anyway :)
> You would have to do a little parsing to get the file name and
> line number from the current format; it's no more work to get
> it from the suggested format.

Not sure -- I think I carefully designed the old format to be one of
the formats that Emacs parses *by default*: File "...", line ...  Your
change breaks this.

> (What i would really like, by the way, is to see the values of
> the function arguments on the stack -- but that's a lot of work
> to do in C, so implementing this with the help of repr.repr
> will probably be the first thing i do with sys.displaytb.)

Yes, this is much easier in Python.  Watch out for values that are
uncomfortably big or recursive or that cause additional exceptions on
displaying.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:26:50 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:26:50 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 12:46:06 +0200."
             <390EB1EE.EA557CA9@lemburg.com> 
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>  
            <390EB1EE.EA557CA9@lemburg.com> 
Message-ID: <200005021226.IAA24203@eric.cnri.reston.va.us>

[MAL]
> Let's not make the same mistake again: Unicode objects should *not*
> be used to hold binary data. Please use buffers instead.

Easier said than done -- Python doesn't really have a buffer data
type.  Or do you mean the array module?  It's not trivial to read a
file into an array (although it's possible, there are even two ways).
Fact is, most of Python's standard library and built-in objects use
(8-bit) strings as buffers.
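
(For what it's worth, two ways that work are roughly:

    import array, os

    f = open("data.bin", "rb")

    # via an intermediate string
    a = array.array('c')
    a.fromstring(f.read())

    # or directly from the file, if you know the size up front
    f.seek(0)
    b = array.array('c')
    b.fromfile(f, os.path.getsize("data.bin"))

neither of which is as direct as just reading into a string.)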

I agree there's no reason to extend this to Unicode strings.

> BTW, I think that this behaviour should be changed:
> 
> >>> buffer('binary') + 'data'
> 'binarydata'
> 
> while:
> 
> >>> 'data' + buffer('binary')         
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: illegal argument type for built-in operation
> 
> IMHO, buffer objects should never coerce to strings, but instead
> return a buffer object holding the combined contents. The
> same applies to slicing buffer objects:
> 
> >>> buffer('binary')[2:5]
> 'nar'
> 
> should preferably be buffer('nar').

Note that a buffer object doesn't hold data!  It's only a pointer to
data.  I can't off-hand explain the asymmetry though.

> --
> 
> Hmm, perhaps we need something like a data string object
> to get this 100% right ?!
> 
> >>> d = data("...data...")
> or
> >>> d = d"...data..."
> >>> print type(d)
> <type 'data'>
> 
> >>> 'string' + d
> d"string...data..."
> >>> u'string' + d
> d"s\000t\000r\000i\000n\000g\000...data..."
> 
> >>> d[:5]
> d"...da"
> 
> etc.
> 
> Ideally, string and Unicode objects would then be subclasses
> of this type in Py3K.

Not clear.  I'd rather do the equivalent of byte arrays in Java, for
which no "string literal" notations exist.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gward at mems-exchange.org  Tue May  2 14:27:51 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Tue, 2 May 2000 08:27:51 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>; from ping@lfw.org on Tue, May 02, 2000 at 02:47:34AM -0700
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <20000502082751.A1504@mems-exchange.org>

On 02 May 2000, Ka-Ping Yee said:
> I propose the following stylistic changes to traceback
> printing:
> 
>     1.  If there is no function name for a given level
>         in the traceback, just omit the ", in ?" at the
>         end of the line.

+0 on this: it doesn't really add anything, but it does neaten things
up.

>     2.  If a given level of the traceback is in a method,
>         instead of just printing the method name, print
>         the class and the method name.

+1 here too: this definitely adds utility.

>     3.  Instead of beginning each line with:
>         
>             File "foo.py", line 5
> 
>         print the line first and drop the quotes:
> 
>             Line 5 of foo.py

-0: adds nothing, cleans nothing up, and just generally breaks things
for no good reason.

>         In the common interactive case that the file
>         is a typed-in string, the current printout is
>         
>             File "<stdin>", line 1
>         
>         and the following is easier to read in my opinion:
> 
>             Line 1 of <stdin>

OK, that's a good reason.  Maybe you could special-case the "<stdin>"
case?  How about

   <stdin>, line 1

?

        Greg



From guido at python.org  Tue May  2 14:30:02 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:30:02 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 11:56:21 +0200."
             <390EA645.89E3B22A@lemburg.com> 
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>  
            <390EA645.89E3B22A@lemburg.com> 
Message-ID: <200005021230.IAA24232@eric.cnri.reston.va.us>

[MAL]
> > > Unicode itself can be understood as multi-word character
> > > encoding, just like UTF-8. The reason is that Unicode entities
> > > can be combined to produce single display characters (e.g.
> > > u"e"+u"\u0301" will print "?" in a Unicode aware renderer).
> > > Slicing such a combined Unicode string will have the same
> > > effect as slicing UTF-8 data.
[/F]
> > really?  does it result in a decoder error?  or does it just result
> > in a rendering error, just as if you slice off any trailing character
> > without looking...
[MAL]
> In the example, if you cut off the u"\u0301", the "e" would
> appear without the acute accent, cutting off the u"e" would
> probably result in a rendering error or worse put the accent
> over the next character to the left.
> 
> UTF-8 is better in this respect: it warns you about
> the error by raising an exception when being converted to
> Unicode.

I think /F's point was that the Unicode standard prescribes different
behavior here: for UTF-8, a missing or lone continuation byte is an
error; for Unicode, accents are separate characters that may be
inserted and deleted in a string but whose display is undefined under
certain conditions.

(I just noticed that this doesn't work in Tkinter but it does work in
wish.  Strange.)
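
[To illustrate at the interactive prompt -- the exact error text may differ:]

    >>> s = u"e" + u"\u0301"          # 'e' followed by COMBINING ACUTE ACCENT
    >>> len(s)
    2
    >>> s[:1]                         # slicing silently drops the accent
    u'e'
    >>> unicode("\xc3", "utf-8")      # a lone UTF-8 lead byte is an error
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: UTF-8 decoding error: ...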

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"?" should compare equal to u"e\u0301".

Aha, then we'll see u == v even though type(u) is type(v) and len(u)
!= len(v).  /F's world will collapse. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 14:31:55 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 08:31:55 -0400
Subject: [Python-Dev] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 01:42:51 PDT."
             <Pine.LNX.4.10.10005020114250.522-100000@localhost> 
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> 
Message-ID: <200005021231.IAA24249@eric.cnri.reston.va.us>

>     No automatic conversions between 8-bit "strings" and Unicode strings.
> 
> If you want to turn UTF-8 into a Unicode string, say so.
> If you want to turn Latin-1 into a Unicode string, say so.
> If you want to turn ISO-2022-JP into a Unicode string, say so.
> Adding a Unicode string and an 8-bit "string" gives an exception.

I'd accept this, with one change: mixing Unicode and 8-bit strings is
okay when the 8-bit strings contain only ASCII (byte values 0 through
127).  That does the right thing when the program is combining
ASCII data (e.g. literals or data files) with Unicode and warns you
when you are using characters for which the encoding matters.  I
believe that this is important because much existing code dealing with
strings can in fact deal with Unicode just fine under these
assumptions.  (E.g. I needed only 4 changes to htmllib/sgmllib to make
it deal with Unicode strings -- those changes were all getattr() and
setattr() calls.)
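
[A sketch of how that rule would behave -- this is the proposal, not what
the current implementation does:]

    >>> u"sp" + "am"                  # pure ASCII on the 8-bit side: fine
    u'spam'
    >>> u"sp" + "\xe4m"               # byte with the top bit set: error
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ...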

When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
bytes in either should make the comparison fail; when ordering is
important, we can make an arbitrary choice e.g. "\377" < u"\200".

Why not Latin-1?  Because it gives us Western-alphabet users a false
sense that our code works, where in fact it is broken as soon as you
change the encoding.

> P. S.  The scare-quotes when i talk about 8-bit "strings" expose my
> sense of them as byte-buffers -- since that *is* all you get when you
> read in some bytes from a file.  If you manipulate an 8-bit "string"
> as a character string, you are implicitly making the assumption that
> the byte values correspond to the character encoding of the character
> repertoire you want to work with, and that's your responsibility.

This is how I think of them too.

> P. P. S.  If always having to specify encodings is really too much,
> i'd probably be willing to consider a default-encoding state on the
> Unicode class, but it would have to be a stack of values, not a
> single value.

Please elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From just at letterror.com  Tue May  2 15:44:30 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:44:30 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200005021230.IAA24232@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 11:56:21 +0200."            
 <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000
 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <009701bfb414$d35d0ea0$34aab5d4@hagrid>             
 <390EA645.89E3B22A@lemburg.com>
Message-ID: <l03102807b5348b0e6e0b@[193.78.237.142]>

At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
>I think /F's point was that the Unicode standard prescribes different
>behavior here: for UTF-8, a missing or lone continuation byte is an
>error; for Unicode, accents are separate characters that may be
>inserted and deleted in a string but whose display is undefined under
>certain conditions.
>
>(I just noticed that this doesn't work in Tkinter but it does work in
>wish.  Strange.)
>
>> FYI: Normalization is needed to make comparing Unicode
>> strings robust, e.g. u"?" should compare equal to u"e\u0301".
>
>Aha, then we'll see u == v even though type(u) is type(v) and len(u)
>!= len(v).  /F's world will collapse. :-)

Does the Unicode spec *really* specify that u should compare equal to v?
This behavior would be the responsibility of a layout engine, a role which
is way beyond the scope of Unicode support in Python, as it is language-
and script-dependent.

Just





From just at letterror.com  Tue May  2 15:39:24 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:39:24 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode debate
In-Reply-To: <Pine.LNX.4.10.10005020114250.522-100000@localhost>
References: <l03102802b534149a9639@[193.78.237.164]>
Message-ID: <l03102806b534883ec4cf@[193.78.237.142]>

At 1:42 AM -0700 02-05-2000, Ka-Ping Yee wrote:
>If it turns out automatic conversions *are* absolutely necessary,
>then i vote in favour of the simple, direct method promoted by Paul
>and Fredrik: just copy the numerical values of the bytes.  The fact
>that this happens to correspond to Latin-1 is not really the point;
>the main reason is that it satisfies the Principle of Least Surprise.

Exactly.

I'm not sure if automatic conversions are absolutely necessary, but seeing
8-bit strings as Latin-1 encoded Unicode strings seems most natural to me.
Heck, even 8-bit strings should have an s.encode() method that would
behave *just* like u.encode(), and unicode(blah) could even *return* an
8-bit string if it turns out the string has no character codes > 255!
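
[A sketch of what that would look like -- s.encode() doesn't exist today,
and the Latin-1 widening is the model proposed here, not current behavior:]

    s = "caf\xe9"                     # 8-bit string, Latin-1 under this model
    u = unicode(s)                    # would simply widen the character codes
    assert u == u"caf\xe9"
    assert s.encode("utf-8") == u.encode("utf-8")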

Conceptually, this gets *very* close to the ideal of "there is only one
string type", and at the same times leaves room for 8-bit strings doubling
as byte arrays for backward compatibility reasons.

(Unicode strings and 8-bit strings could even be the same type, which only
uses wide chars when necessary!)

Just





From just at letterror.com  Tue May  2 15:55:31 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:55:31 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 01:42:51 PDT."            
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
Message-ID: <l03102808b5348d1eea20@[193.78.237.142]>

At 8:31 AM -0400 02-05-2000, Guido van Rossum wrote:
>When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>bytes in either should make the comparison fail; when ordering is
>important, we can make an arbitrary choice e.g. "\377" < u"\200".

Blech. Just document that 8-bit strings *are* Latin-1 unless converted
explicitly, and you're done. It's really much simpler this way, for you as
well as for the users.

>Why not Latin-1?  Because it gives us Western-alphabet users a false
>sense that our code works, where in fact it is broken as soon as you
>change the encoding.

Yeah, and? At least it'll *show* it's broken instead of *silently* doing
the wrong thing with utf-8.

It's like using Python ints all over the place, and suddenly a user of the
application enters data that causes an integer overflow. Boom. Program
needs to be fixed. What's the big deal?

Just





From effbot at telia.com  Tue May  2 15:05:42 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 2 May 2000 15:05:42 +0200
Subject: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>             <390EA645.89E3B22A@lemburg.com>  <200005021230.IAA24232@eric.cnri.reston.va.us>
Message-ID: <00f301bfb437$227bc180$34aab5d4@hagrid>

Guido van Rossum <guido at python.org> wrote:
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"?" should compare equal to u"e\u0301".
> 
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v).  /F's world will collapse. :-)

you're gonna do automatic normalization?  that's interesting.
will this make Python the first language to define strings as
a "sequence of graphemes"?

or was this just the cheap shot it appeared to be?

</F>




From skip at mojam.com  Tue May  2 15:10:22 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 2 May 2000 08:10:22 -0500 (CDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
Message-ID: <14606.54206.559407.213584@beluga.mojam.com>

[... completely eliding Ping's note and stealing his subject ...]

On a not-quite unrelated tack, I wonder if traceback printing can be
enhanced in the case where Python code calls a function or method written in
C (possibly calling multiple C functions), which in turn calls a Python
function that raises an exception.  Currently, the Python functions on
either side of the C functions are printed, but no hint of the C function's
existence is displayed.  Any way to get some indication there's another
function in the middle?

Thanks,

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From mbel44 at dial.pipex.net  Tue May  2 15:46:44 2000
From: mbel44 at dial.pipex.net (Toby Dickenson)
Date: Tue, 02 May 2000 14:46:44 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>

On Tue, 02 May 2000 08:31:55 -0400, Guido van Rossum
<guido at python.org> wrote:

>>     No automatic conversions between 8-bit "strings" and Unicode strings.
>> 
>> If you want to turn UTF-8 into a Unicode string, say so.
>> If you want to turn Latin-1 into a Unicode string, say so.
>> If you want to turn ISO-2022-JP into a Unicode string, say so.
>> Adding a Unicode string and an 8-bit "string" gives an exception.
>
>I'd accept this, with one change: mixing Unicode and 8-bit strings is
>okay when the 8-bit strings contain only ASCII (byte values 0 through
>127).  That does the right thing when the program is combining
>ASCII data (e.g. literals or data files) with Unicode and warns you
>when you are using characters for which the encoding matters.  I
>believe that this is important because much existing code dealing with
>strings can in fact deal with Unicode just fine under these
>assumptions.  (E.g. I needed only 4 changes to htmllib/sgmllib to make
>it deal with Unicode strings -- those changes were all getattr() and
>setattr() calls.)
>
>When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>bytes in either should make the comparison fail; when ordering is
>important, we can make an arbitrary choice e.g. "\377" < u"\200".

I assume 'fail' means 'non-equal', rather than 'raises an exception'?


Toby Dickenson
tdickenson at geminidataloggers.com



From guido at python.org  Tue May  2 15:58:51 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 09:58:51 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 08:10:22 CDT."
             <14606.54206.559407.213584@beluga.mojam.com> 
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>  
            <14606.54206.559407.213584@beluga.mojam.com> 
Message-ID: <200005021358.JAA24443@eric.cnri.reston.va.us>

[Skip]
> On a not-quite unrelated tack, I wonder if traceback printing can be
> enhanced in the case where Python code calls a function or method written in
> C (possibly calling multiple C functions), which in turn calls a Python
> function that raises an exception.  Currently, the Python functions on
> either side of the C functions are printed, but no hint of the C function's
> existence is displayed.  Any way to get some indication there's another
> function in the middle?

In some cases, that's a good thing -- in others, it's not.  There
should probably be an API that a C function can call to add an entry
onto the stack.

It's not going to be a trivial fix though -- you'd have to manufacture
a frame object.

I can see two options: you can do this "on the way out" when you catch
an exception, or you can do this "on the way in" when you are called.
The latter would require you to explicitly get rid of the frame too --
probably both on normal returns and on exception returns.  That seems
hairier than only having to make a call on exception returns; but it
means that the C function is invisible to the Python debugger unless
it fails.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  2 16:00:14 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 10:00:14 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 14:46:44 BST."
             <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> 
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>  
            <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> 
Message-ID: <200005021400.KAA24464@eric.cnri.reston.va.us>

[me]
> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> >bytes in either should make the comparison fail; when ordering is
> >important, we can make an arbitrary choice e.g. "\377" < u"\200".

[Toby] 
> I assume 'fail' means 'non-equal', rather than 'raises an exception'?

Yes, sorry for the ambiguity.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Tue May  2 16:04:17 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 2 May 2000 10:04:17 -0400 (EDT)
Subject: [Python-Dev] documentation for new modules
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>
References: <14605.44546.568978.296426@seahag.cnri.reston.va.us>
	<ECEPKNMJLHAPFFJHDOJBIEPECJAA.mhammond@skippinet.com.au>
Message-ID: <14606.57441.97184.499435@seahag.cnri.reston.va.us>

Mark Hammond writes:
 > I wonder if that anyone could be me? :-)

  I certainly wouldn't object!  ;)

 > But I will try and put something together.  It will need to be plain
 > text or HTML, but I assume that is better than nothing!

  Plain text would be better than HTML.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From just at letterror.com  Tue May  2 17:11:39 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:11:39 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 14:46:44 BST."            
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>             
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
Message-ID: <l0310280fb5349fd24fc5@[193.78.237.142]>

At 10:00 AM -0400 02-05-2000, Guido van Rossum wrote:
>[me]
>> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>> >bytes in either should make the comparison fail; when ordering is
>> >important, we can make an arbitrary choice e.g. "\377" < u"\200".
>
>[Toby]
>> I assume 'fail' means 'non-equal', rather than 'raises an exception'?
>
>Yes, sorry for the ambiguity.

You're going to have a hard time explaining that "\377" != u"\377".

Again, if you define that "all strings are unicode" and that 8-bit strings
contain Unicode characters up to 255, you're all set. Clear semantics, few
surprises, simple implementation, etc. etc.

Just





From guido at python.org  Tue May  2 16:21:28 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 10:21:28 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 16:11:39 BST."
             <l0310280fb5349fd24fc5@[193.78.237.142]> 
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>  
            <l0310280fb5349fd24fc5@[193.78.237.142]> 
Message-ID: <200005021421.KAA24526@eric.cnri.reston.va.us>

[Just]
> You're going to have a hard time explaining that "\377" != u"\377".

I agree.  You are an example of how hard it is to explain: you still
don't understand that for a person using CJK encodings this is in fact
the truth.

> Again, if you define that "all strings are unicode" and that 8-bit strings
> contain Unicode characters up to 255, you're all set. Clear semantics, few
> surprises, simple implementation, etc. etc.

But not all 8-bit strings occurring in programs are Unicode.  Ask
Moshe.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From just at letterror.com  Tue May  2 17:42:24 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:42:24 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021421.KAA24526@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 16:11:39 BST."            
 <l0310280fb5349fd24fc5@[193.78.237.142]> Your message of "Tue, 02 May
 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>             
 <l0310280fb5349fd24fc5@[193.78.237.142]>
Message-ID: <l03102812b534a7430fb6@[193.78.237.142]>

>[Just]
>> You're going to have a hard time explaining that "\377" != u"\377".
>
[GvR]
>I agree.  You are an example of how hard it is to explain: you still
>don't understand that for a person using CJK encodings this is in fact
>the truth.

That depends on the definition of truth: if you document that 8-bit strings
are Latin-1, the above is the truth. Conceptually classifying all other
8-bit encodings as binary goop makes the semantics crystal clear.

>> Again, if you define that "all strings are unicode" and that 8-bit strings
>> contain Unicode characters up to 255, you're all set. Clear semantics, few
>> surprises, simple implementation, etc. etc.
>
>But not all 8-bit strings occurring in programs are Unicode.  Ask
>Moshe.

I know. They can be anything, even binary goop. But that's *only* an
artifact of the fact that 8-bit strings need to double as buffer objects.

Just





From just at letterror.com  Tue May  2 17:45:01 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:45:01 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <l03102812b534a7430fb6@[193.78.237.142]>
References: <200005021421.KAA24526@eric.cnri.reston.va.us> Your message of
 "Tue, 02 May 2000 16:11:39 BST."            
 <l0310280fb5349fd24fc5@[193.78.237.142]> Your message of "Tue, 02 May
 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>
 <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
 <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>             
 <l0310280fb5349fd24fc5@[193.78.237.142]>
Message-ID: <l03102813b534a8484cf9@[193.78.237.142]>

I wrote:
>That depends on the definition of truth: if you document that 8-bit strings
>are Latin-1, the above is the truth.

Oops, I meant of course that "\377" == u"\377" is then the truth...

Sorry,

Just





From mal at lemburg.com  Tue May  2 17:18:21 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:18:21 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Tue, 02 May 2000 11:56:21 +0200."            
	 <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000
	 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
	 <009701bfb414$d35d0ea0$34aab5d4@hagrid>             
	 <390EA645.89E3B22A@lemburg.com> <l03102807b5348b0e6e0b@[193.78.237.142]>
Message-ID: <390EF1BD.E6C7AF74@lemburg.com>

Just van Rossum wrote:
> 
> At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
> >I think /F's point was that the Unicode standard prescribes different
> >behavior here: for UTF-8, a missing or lone continuation byte is an
> >error; for Unicode, accents are separate characters that may be
> >inserted and deleted in a string but whose display is undefined under
> >certain conditions.
> >
> >(I just noticed that this doesn't work in Tkinter but it does work in
> >wish.  Strange.)
> >
> >> FYI: Normalization is needed to make comparing Unicode
> >> strings robust, e.g. u"?" should compare equal to u"e\u0301".

                            ^
                            |

Here's a good example of what encoding errors can do: the
above character was an "e" with acute accent (u"?"). It looks like
some mailer converted it to some other code page and yet
another mailer converted it back to Latin-1, even though the
Content-Type message header clearly states that the
document uses ISO-8859-1.

> >
> >Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> >!= len(v).  /F's world will collapse. :-)
> 
> Does the Unicode spec *really* specify that u should compare equal to v?

The behaviour is needed in order to implement sorting Unicode.
See the www.unicode.org site for more information and the
tech reports describing this.

Note that I haven't mentioned anything about "automatic"
normalization. This should be a method on Unicode strings
and could then be used in sorting compare callbacks.
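
[A sketch of how such a method could be used -- .normalize() is purely
hypothetical here, it doesn't exist yet:]

    def normalized_cmp(u, v):
        # compare canonically normalized forms instead of raw code points
        return cmp(u.normalize(), v.normalize())

    # unicode_words: some list of Unicode strings
    unicode_words.sort(normalized_cmp)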

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May  2 17:55:40 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:55:40 +0200
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <390EFA7B.F6B622F0@lemburg.com>

[Guido going ASCII]

Do you mean going ASCII all the way (using it for all
aspects where Unicode gets converted to a string and cases
where strings get converted to Unicode), or just 
for some aspect of conversion, e.g. just for the silent
conversions from strings to Unicode ?

[BTW, I'm pretty sure that the Latin-1 folks won't like
ASCII for the same reason they don't like UTF-8: it's
simply an inconvenient way to write strings in their favorite
encoding directly in Python source code. My feeling in this
whole discussion is that it's more about convenience than
anything else. Still, it's very amusing ;-) ]

FYI, here's the conversion table of (potentially) all
conversions done by the implementation:

Python:
-------
string + unicode:       unicode(string,'utf-8') + unicode
string.method(unicode): unicode(string,'utf-8').method(unicode)
print unicode:          print unicode.encode('utf-8'); with stdout
                        redirection this can be changed to any
                        other encoding
str(unicode):           unicode.encode('utf-8')
repr(unicode):          repr(unicode.encode('unicode-escape'))


C (PyArg_ParseTuple):
---------------------
"s" + unicode:          same as "s" + unicode.encode('utf-8')
"s#" + unicode:         same as "s#" + unicode.encode('unicode-internal')
"t" + unicode:          same as "t" + unicode.encode('utf-8')
"t#" + unicode:         same as "t#" + unicode.encode('utf-8')

This affects all C modules and builtins. In case a C module
wants to receive a certain predefined encoding, it can
use the new "es" and "es#" parser markers.


Ways to enter Unicode:
----------------------
u'' + string            same as unicode(string,'utf-8')
unicode(string,encname) any supported encoding
u'...unicode-escape...' unicode-escape currently accepts
                        Latin-1 chars as single-char input; using
                        escape sequences any Unicode char can be
                        entered (*)
codecs.open(filename,mode,encname)
                        opens an encoded file for
                        reading and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
                        returns UTF-8 strings based on the input
                        encoding

IO:
---
open(file,'w').write(unicode)
        same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
        same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
        same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
        same as unicode(open(file,'rb').read(),encname)
stdin + stdout
        can be redirected using StreamRecoders to handle any
        of the supported encodings

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May  2 17:27:39 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:27:39 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>  
	            <390EB1EE.EA557CA9@lemburg.com> <200005021226.IAA24203@eric.cnri.reston.va.us>
Message-ID: <390EF3EB.5BCE9EC3@lemburg.com>

Guido van Rossum wrote:
> 
> [MAL]
> > Let's not do the same mistake again: Unicode objects should *not*
> > be used to hold binary data. Please use buffers instead.
> 
> Easier said than done -- Python doesn't really have a buffer data
> type.  Or do you mean the array module?  It's not trivial to read a
> file into an array (although it's possible, there are even two ways).
> Fact is, most of Python's standard library and built-in objects use
> (8-bit) strings as buffers.
> 
> I agree there's no reason to extend this to Unicode strings.
> 
> > BTW, I think that this behaviour should be changed:
> >
> > >>> buffer('binary') + 'data'
> > 'binarydata'
> >
> > while:
> >
> > >>> 'data' + buffer('binary')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > TypeError: illegal argument type for built-in operation
> >
> > IMHO, buffer objects should never coerce to strings, but instead
> > return a buffer object holding the combined contents. The
> > same applies to slicing buffer objects:
> >
> > >>> buffer('binary')[2:5]
> > 'nar'
> >
> > should preferably be buffer('nar').
> 
> Note that a buffer object doesn't hold data!  It's only a pointer to
> data.  I can't off-hand explain the asymmetry though.

Dang, you're right...
 
> > --
> >
> > Hmm, perhaps we need something like a data string object
> > to get this 100% right ?!
> >
> > >>> d = data("...data...")
> > or
> > >>> d = d"...data..."
> > >>> print type(d)
> > <type 'data'>
> >
> > >>> 'string' + d
> > d"string...data..."
> > >>> u'string' + d
> > d"s\000t\000r\000i\000n\000g\000...data..."
> >
> > >>> d[:5]
> > d"...da"
> >
> > etc.
> >
> > Ideally, string and Unicode objects would then be subclasses
> > of this type in Py3K.
> 
> Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> which no "string literal" notations exist.

Anyway, one way or another I think we should make it clear
to users that they should start using some other type for
storing binary data.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May  2 17:24:24 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:24:24 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            	
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>	
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <l03102802b534149a9639@[193.78.237.164]> <l03102804b534772fc25b@[193.78.237.142]>
Message-ID: <390EF327.86D8C3D8@lemburg.com>

Just van Rossum wrote:
> 
> At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
> >Just a small note on the subject of a character being atomic
> >which seems to have been forgotten by the discussing parties:
> >
> >Unicode itself can be understood as multi-word character
> >encoding, just like UTF-8. The reason is that Unicode entities
> >can be combined to produce single display characters (e.g.
> >u"e"+u"\u0301" will print "?" in a Unicode aware renderer).
> 
> Erm, are you sure Unicode prescribes this behavior, for this
> example? I know similar behaviors are specified for certain
> languages/scripts, but I didn't know it did that for Latin.

The details are on the www.unicode.org web-site, buried
in some of the tech reports on normalization and
collation.
 
> >Slicing such a combined Unicode string will have the same
> >effect as slicing UTF-8 data.
> 
> Not true. As Fredrik noted: no exception will be raised.

Huh ? You will always get an exception when you convert
a broken UTF-8 sequence to Unicode. This is per design
of UTF-8 itself which uses the top bit to identify
multi-byte character encodings.

Or can you give an example (perhaps you've found a bug 
that needs fixing) ?

> [ Speaking of exceptions,
> 
> after I sent off my previous post I realized Guido's
> non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
> argument can easily be turned around, backfiring at utf-8:
> 
>     Defaulting to utf-8 when going from Unicode to 8-bit and
>     back only gives the *illusion* things "just work", since it
>     will *silently* "work", even if utf-8 is *not* the desired
>     8-bit encoding -- as shown by Fredrik's excellent "fun with
>     Unicode, part 1" example. Defaulting to Latin-1 will
>     warn the user *much* earlier, since it'll barf when
>     converting a Unicode string that contains any character
>     code > 255. So there.
> ]
> 
> >It seems that most Latin-1 proponents seem to have single
> >display characters in mind. While the same is true for
> >many Unicode entities, there are quite a few cases of
> >combining characters in Unicode 3.0 and the Unicode
> >normalization algorithm uses these as the basis for its
> >work.
> 
> Still, two combining characters are still two input characters for
> the renderer! They may result in one *glyph*, but trust me,
> that's an entirely different can of worms.

No. Please see my other post on the subject...
 
> However, if you'd be talking about Unicode surrogates,
> you'd definitely have a point. How do Java/Perl/Tcl deal with
> surrogates?

Good question... anybody know the answers ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From paul at prescod.net  Tue May  2 18:05:20 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:05:20 -0500
Subject: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <035501bfb3f3$db87fb10$e3cb8490@neil>
Message-ID: <390EFCC0.240BC56B@prescod.net>

Neil, I sincerely appreciate your informed input. I want to emphasize
one ideological difference though. :)

Neil Hodgson wrote:
> 
> ...
>
>    The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8. 

I reject that characterization.

I claim that both strings contain Unicode characters, but one can contain
Unicode characters with higher character codes. UTF-8 versus Latin-1 does not
enter into it. Python strings should not be documented in terms of
encodings any more than Python ints are documented in terms of their
two's complement representation. Then we could describe the default
conversion from integers to floats in terms of their bit-representation.
Ugh!

I accept that the effect is similar to calling Latin-1 the "default", but
that's a side effect of the simple logical model that we are proposing.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From just at letterror.com  Tue May  2 19:33:56 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 18:33:56 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <390EFA7B.F6B622F0@lemburg.com>
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost>
 <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <l03102815b534c1763aa8@[193.78.237.142]>

At 5:55 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>[BTW, I'm pretty sure that the Latin-1 folks won't like
>ASCII for the same reason they don't like UTF-8: it's
>simply an inconvenient way to write strings in their favorite
>encoding directly in Python source code. My feeling in this
>whole discussion is that it's more about convenience than
>anything else. Still, it's very amusing ;-) ]

For the record, I don't want Latin-1 because it's my favorite encoding. It
isn't. Guido's right: I can't even *use* it directly on my platform. I want
it *only* because it's the most logical 8-bit subset of Unicode -- as we
have stated over and over and over again. What's so hard to
understand about this?

Just





From paul at prescod.net  Tue May  2 18:11:13 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:11:13 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            
		 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
		 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
Message-ID: <390EFE21.DAD7749B@prescod.net>

Combining characters are a whole 'nother level of complexity. Character
sets are hard. I don't accept the argument that "Unicode itself has
complexities, so that gives us license to introduce even more
complexities at the character representation level."

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"?" should compare equal to u"e\u0301".

That's a whole 'nother debate at a whole 'nother level of abstraction. I
think we need to get the bytes/characters level right and then we can
worry about display-equivalent characters (or leave that to the Python
programmer to figure out...).
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From paul at prescod.net  Tue May  2 18:13:00 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:13:00 -0500
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com>  
	            <l0310280fb5349fd24fc5@[193.78.237.142]> <200005021421.KAA24526@eric.cnri.reston.va.us>
Message-ID: <390EFE8C.4C10473C@prescod.net>

Guido van Rossum wrote:
> 
> ...
>
> But not all 8-bit strings occurring in programs are Unicode.  Ask
> Moshe.

Where are we going? What's our long-range vision?

Three years from now where will we be? 

1. How will we handle characters? 
2. How will we handle bytes?
3. What will unadorned literal strings "do"?
4. Will literal strings be the same type as byte arrays?

I don't see how we can make decisions today without a vision for the
future. I think that this is the central point in our disagreement. Some
of us are aiming for as much compatibility with where we think we should
be going and others are aiming for as much compatibility as possible
with where we came from.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From just at letterror.com  Tue May  2 19:37:09 2000
From: just at letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 18:37:09 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390EF327.86D8C3D8@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            		
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>		
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>	
 <l03102802b534149a9639@[193.78.237.164]>
 <l03102804b534772fc25b@[193.78.237.142]>
Message-ID: <l03102816b534c2476bce@[193.78.237.142]>

At 5:24 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>> Still, two combining characters are still two input characters for
>> the renderer! They may result in one *glyph*, but trust me,
>> that's an entirely different can of worms.
>
>No. Please see my other post on the subject...

It would help if you'd post some actual doco.

Just





From paul at prescod.net  Tue May  2 18:25:33 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:25:33 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid>  
	            <390EA645.89E3B22A@lemburg.com> <200005021230.IAA24232@eric.cnri.reston.va.us>
Message-ID: <390F017C.91C7A8A0@prescod.net>

Guido van Rossum wrote:
> 
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v).  /F's world will collapse. :-)

There are many levels of equality that are interesting. I don't think we
would move to grapheme equivalence until "the rest of the world" (XML,
Java, W3C, SQL) did. 

If we were going to move to grapheme equivalence (some day), the right
way would be to normalize characters in the construction of the Unicode
string. This is known as "Early normalization":

http://www.w3.org/TR/charmod/#NormalizationApplication

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From ping at lfw.org  Tue May  2 18:43:25 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 09:43:25 -0700 (PDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <20000502082751.A1504@mems-exchange.org>
Message-ID: <Pine.LNX.4.10.10005020939050.522-100000@localhost>

On Tue, 2 May 2000, Greg Ward wrote:
> >         In the common interactive case that the file
> >         is a typed-in string, the current printout is
> >         
> >             File "<stdin>", line 1
> >         
> >         and the following is easier to read in my opinion:
> > 
> >             Line 1 of <stdin>
> 
> OK, that's a good reason.  Maybe you could special-case the "<stdin>"
> case?

...and "<string>", and "<console>", and perhaps others... ?

    File "<string>", line 3

just looks downright clumsy the first time you see it.
(Well, it still looks kinda clumsy to me or i wouldn't be
proposing the change.)

Can someone verify the already-parseable-by-Emacs claim, and
describe how you get Emacs to do something useful with bits
of traceback?  (Alas, i'm not an Emacs user, so understanding
just how the current format is useful would help.)


-- ?!ng




From bwarsaw at python.org  Tue May  2 19:13:03 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Tue, 2 May 2000 13:13:03 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
	<m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
Message-ID: <14607.3231.115841.262068@anthem.cnri.reston.va.us>

>>>>> "PF" == Peter Funk <pf at artcom-gmbh.de> writes:

    PF> I suggest an additional note saying that this signature has
    PF> been added in Python 1.6.  There used to be several such notes
    PF> all over the documentation saying for example: "New in version
    PF> 1.5.2." which I found very useful in the past!

Good point.  Fred, what is the Right Way to do this?

-Barry



From bwarsaw at python.org  Tue May  2 19:16:22 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Tue, 2 May 2000 13:16:22 -0400 (EDT)
Subject: [Python-Dev] Traceback style
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
	<20000502082751.A1504@mems-exchange.org>
Message-ID: <14607.3430.941026.496225@anthem.cnri.reston.va.us>

I concur with Greg's scores.



From guido at python.org  Tue May  2 19:22:02 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 13:22:02 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 08:27:51 EDT."
             <20000502082751.A1504@mems-exchange.org> 
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>  
            <20000502082751.A1504@mems-exchange.org> 
Message-ID: <200005021722.NAA25854@eric.cnri.reston.va.us>

> On 02 May 2000, Ka-Ping Yee said:
> > I propose the following stylistic changes to traceback
> > printing:
> > 
> >     1.  If there is no function name for a given level
> >         in the traceback, just omit the ", in ?" at the
> >         end of the line.

Greg Ward expresses my sentiments:

> +0 on this: it doesn't really add anything, but it does neaten things
> up.
> 
> >     2.  If a given level of the traceback is in a method,
> >         instead of just printing the method name, print
> >         the class and the method name.
> 
> +1 here too: this definitely adds utility.
> 
> >     3.  Instead of beginning each line with:
> >         
> >             File "foo.py", line 5
> > 
> >         print the line first and drop the quotes:
> > 
> >             Line 5 of foo.py
> 
> -0: adds nothing, cleans nothing up, and just generally breaks things
> for no good reason.
> 
> >         In the common interactive case that the file
> >         is a typed-in string, the current printout is
> >         
> >             File "<stdin>", line 1
> >         
> >         and the following is easier to read in my opinion:
> > 
> >             Line 1 of <stdin>
> 
> OK, that's a good reason.  Maybe you could special-case the "<stdin>"
> case?  How about
> 
>    <stdin>, line 1
> 
> ?

I'd special-case any filename that starts with < and ends with > --
those are all made-up names like <string> or <stdin>.  You can display
them however you like, perhaps

  In "<string>", line 3

For regular files I'd leave the formatting alone -- there are tools
out there that parse these.  (E.g. Emacs' Python mode jumps to the
line with the error if you run a file and it begets an exception.)
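
[A rough sketch of that special case -- variable names are made up and the
exact format is of course up for discussion:]

    if filename[:1] == '<' and filename[-1:] == '>':
        print '  In %s, line %d' % (filename, lineno)
    else:
        print '  File "%s", line %d' % (filename, lineno)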

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tree at cymru.basistech.com  Tue May  2 19:14:24 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Tue, 2 May 2000 13:14:24 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390EF327.86D8C3D8@lemburg.com>
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	<l03102802b534149a9639@[193.78.237.164]>
	<l03102804b534772fc25b@[193.78.237.142]>
	<390EF327.86D8C3D8@lemburg.com>
Message-ID: <14607.3312.660077.42872@cymru.basistech.com>

M.-A. Lemburg writes:
 > The details are on the www.unicode.org web-site, buried
 > in some of the tech reports on normalization and
 > collation.

This is described in the Unicode standard itself, and in UTR #15 and
UTR #10. Normalization is an issue with wider implications than just
handling glyph variants: indeed, glyph variants are irrelevant here.

The question is this: should

U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

compare equal to

U+0055 LATIN CAPITAL LETTER U
U+0308 COMBINING DIAERESIS

or not? It depends on the application. Certainly in a database system
I would want these to compare equal.

Perhaps normalization form needs to be an option of the string comparator?

        -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From bwarsaw at python.org  Tue May  2 19:51:17 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Tue, 2 May 2000 13:51:17 -0400 (EDT)
Subject: [Python-Dev] Traceback style
References: <Pine.LNX.4.10.10004170129320.1157-100000@localhost>
	<20000502082751.A1504@mems-exchange.org>
	<200005021722.NAA25854@eric.cnri.reston.va.us>
Message-ID: <14607.5525.160379.760452@anthem.cnri.reston.va.us>

>>>>> "GvR" == Guido van Rossum <guido at python.org> writes:

    GvR> For regular files I'd leave the formatting alone -- there are
    GvR> tools out there that parse these.  (E.g. Emacs' Python mode
    GvR> jumps to the line with the error if you run a file and it
    GvR> begets an exception.)

py-traceback-line-re is what matches those lines.  Its current
definition is

(defconst py-traceback-line-re
  "[ \t]+File \"\\([^\"]+\\)\", line \\([0-9]+\\)"
  "Regular expression that describes tracebacks.")

There are probably also gud.el (and maybe compile.el) regexps that
need to be changed too.  I'd rather see something that outputs the
same regardless of whether it's a real file, or something "fake".
Something like

Line 1 of <stdin>
Line 12 of foo.py

should be fine.  I'm not crazy about something like

File "foo.py", line 12
In <stdin>, line 1

-Barry



From fdrake at acm.org  Tue May  2 19:59:43 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 2 May 2000 13:59:43 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: <14607.3231.115841.262068@anthem.cnri.reston.va.us>
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
	<m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
	<14607.3231.115841.262068@anthem.cnri.reston.va.us>
Message-ID: <14607.6031.770981.424012@seahag.cnri.reston.va.us>

bwarsaw at python.org writes:
 > Good point.  Fred, what is the Right Way to do this?

  Pester me night and day until it gets done (email only!).
  Unless of course you've already seen the check-in messages.  ;)


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From bwarsaw at python.org  Tue May  2 20:05:00 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Tue, 2 May 2000 14:05:00 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
	<m12mZfG-000CnCC@artcom0.artcom-gmbh.de>
	<14607.3231.115841.262068@anthem.cnri.reston.va.us>
	<14607.6031.770981.424012@seahag.cnri.reston.va.us>
Message-ID: <14607.6348.453682.219847@anthem.cnri.reston.va.us>

>>>>> "Fred" == Fred L Drake, Jr <fdrake at acm.org> writes:

    Fred>   Pester me night and day until it gets done (email only!).

Okay, I'll cancel the daily delivery of angry rabid velcro monkeys.

    Fred> Unless of course you've already seen the check-in messages.
    Fred> ;)

Saw 'em.  Thanks.
-Barry


From: <gvwilson at nevex.com>
Date: Tue, 2 May 2000 14:04:56 -0400 (EDT)
To: sc-discuss at software-carpentry.com,
	sc-announce at software-carpentry.com,
	sc-publicity at software-carpentry.com
Subject: [Python-Dev] Software Carpentry Design Competition Finalists
Message-ID: <Pine.LNX.4.10.10005021403560.30804-100000 at akbar.nevex.com>

		Software Carpentry Design Competition

			 First-Round Results

		  http://www.software-carpentry.com

			     May 2, 2000

The Software Carpentry Project is pleased to announce the selection of
finalists in its first Open Source Design Competition.  There were
many strong entries, and we would like to thank everyone who took the
time to participate.

We would also like to invite everyone who has been involved to contact
the teams listed below, and see if there is any way to collaborate in
the second round.  Many of you had excellent ideas that deserve to be
in the final tools, and the more involved you are in discussions over
the next two months, the easier it will be for you to take part in the
ensuing implementation effort.

The 12 entries that are going forward in the "Configuration", "Build",
and "Track" categories are listed below (in alphabetical order).  The
four prize-winning entries in the "Test" category are also listed, but
as is explained there, we are putting this section of the competition
on hold for a couple of months while we try to refine the requirements.
You can inspect these entries on-line at:

         http://www.software-carpentry.com/first-round.html

And so, without further ado...


== Configuration

The final four entries in the "Configuration" category are:

* BuildConf     Vassilis Virvilis

* ConfBase      Stefan Knappmann

* SapCat        Lindsay Todd

* Tan           David Ascher


== Build

The finalists in the "Build" category are:

* Black         David Ascher and Trent Mick

* PyMake        Rich Miller

* ScCons        Steven Knight

* Tromey        Tom Tromey

Honorable mentions in this category go to:

* Forge         Bill Bitner, Justin Patterson, and Gilbert Ramirez

* Quilt         David Lamb


== Track

The four entries to go forward in the "Track" category are:

* Egad          John Martin

* K2            David Belfer-Shevett

* Roundup       Ka-Ping Yee

* Tracker       Ken Manheimer

There is also an honorable mention for:

* TotalTrack    Alex Samuel, Mark Mitchell


== Test

This category was the most difficult one for the judges. First-round
prizes are being awarded to

* AppTest         Linda Timberlake

* TestTalk        Chang Liu

* Thomas          Patrick Campbell-Preston

* TotalQuality    Alex Samuel, Mark Mitchell

However, the judges did not feel that any of these tools would have an
impact on Open Source software development in general, or scientific
and numerical programming in particular.  This is due in large part to
the vagueness of the posted requirements, for which the project
coordinator (Greg Wilson) accepts full responsibility.

We will therefore not be going forward with this category at the
present time.  Instead, the judges and others will develop narrower,
more specific requirements, guidelines, and expectations.  The
category will be re-opened in July 2000.


== Contact

The aim of the Software Carpentry project is to create a new generation of
easy-to-use software engineering tools, and to document both those tools
and the working practices they are meant to support.  The Advanced
Computing Laboratory at Los Alamos National Laboratory is providing
$860,000 of funding for Software Carpentry, which is being administered by
Code Sourcery, LLC.  For more information, contact the project
coordinator, Dr. Gregory V. Wilson, at 'gvwilson at software-carpentry.com',
or on +1 (416) 504 2325 ext. 229.


== Footnote: Entries from CodeSourcery, LLC

Two entries (TotalTrack and TotalQuality) were received from employees
of CodeSourcery, LLC, the company which is hosting the Software
Carpentry web site.  We discussed this matter with Dr. Rod Oldehoeft,
Deputy Director of the Advanced Computing Laboratory at Los Alamos
National Laboratory.  His response was:

    John Reynders [Director of the ACL] and I have discussed this
    matter.  We agree that since the judges who make decisions
    are not affiliated with Code Sourcery, there is no conflict of
    interest. Code Sourcery gains no advantage by hosting the
    Software Carpentry web pages.  Please continue evaluating all
    the entries on their merits, and choose the best for further
    eligibility.

Note that the project coordinator, Greg Wilson, is neither employed by
CodeSourcery, nor a judge in the competition.




From paul at prescod.net  Tue May  2 20:23:24 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 13:23:24 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>  
	            <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <390F1D1C.6EAF7EAD@prescod.net>

Guido van Rossum wrote:
> 
> ....
> 
> Have you tried using this?

Yes. I haven't had large problems with it.

As long as you know what is going on, it doesn't usually hurt anything
because you can just explicitly set up the decoding you want. It's like
the int division problem. You get bitten a few times and then get
careful.

It's the naive user who will be surprised by these random UTF-8 decoding
errors. 

That's why this is NOT a convenience issue (are you listening MAL???).
It's a short and long term simplicity issue. There are lots of languages
where it is de rigueur to discover and work around inconvenient and
confusing default behaviors. I just don't think that we should be ADDING
such behaviors.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From guido at python.org  Tue May  2 20:56:34 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 14:56:34 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 13:23:24 CDT."
             <390F1D1C.6EAF7EAD@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>  
            <390F1D1C.6EAF7EAD@prescod.net> 
Message-ID: <200005021856.OAA26104@eric.cnri.reston.va.us>

> It's the naive user who will be surprised by these random UTF-8 decoding
> errors. 
> 
> That's why this is NOT a convenience issue (are you listening MAL???).
> It's a short and long term simplicity issue. There are lots of languages
> where it is de rigueur to discover and work around inconvenient and
> confusing default behaviors. I just don't think that we should be ADDING
> such behaviors.

So what do you think of my new proposal of using ASCII as the default
"encoding"?  It takes care of "a character is a character" but also
(almost) guarantees an error message when mixing encoded 8-bit strings
with Unicode strings without specifying an explicit conversion --
*any* 8-bit byte with the top bit set is rejected by the default
conversion to Unicode.

I think this is less confusing than Latin-1: when an unsuspecting user
is reading encoded text from a file into 8-bit strings and attempts to
use it in a Unicode context, an error is raised instead of producing
garbage Unicode characters.
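
Roughly, at the prompt (illustrative only -- the exact exception text is
not part of the proposal):

  >>> u"abc" + "def"        # 8-bit operand is pure ASCII: allowed
  u'abcdef'
  >>> u"abc" + "d\351f"     # top bit set: rejected, not silently decoded
  UnicodeError: ASCII decoding error: value out of range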

It encourages the use of Unicode strings for everything beyond ASCII
-- there's no way around ASCII since that's the source encoding etc.,
but Latin-1 is an inconvenient default in most parts of the world.
ASCII is accepted everywhere as the base character set (e.g. for
email and for text-based protocols like FTP and HTTP), just like
English is the one natural language that we can all sue to communicate
(to some extent).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From dieter at handshake.de  Tue May  2 20:44:41 2000
From: dieter at handshake.de (Dieter Maurer)
Date: Tue,  2 May 2000 20:44:41 +0200 (CEST)
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390E1F08.EA91599E@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<390E1F08.EA91599E@prescod.net>
Message-ID: <14607.7798.510723.419556@lindm.dm>

Paul Prescod writes:
 > The fact that my proposal has the same effect as making Latin-1 the
 > "default encoding" is a near-term side effect of the definition of
 > Unicode. My long term proposal is to do away with the concept of 8-bit
 > strings (and thus, conversions from 8-bit to Unicode) altogether. One
 > string to rule them all!
Why must this be a long term proposal?

I would find it quite attractive if
 * the old string type became an immutable list of bytes
 * automatic conversion between byte lists and unicode strings 
   were performed via user customizable conversion functions
   (a la __import__).

Dieter



From paul at prescod.net  Tue May  2 21:01:32 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 14:01:32 -0500
Subject: [Python-Dev] Unicode compromise?
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <390F260C.2314F97E@prescod.net>

Guido van Rossum wrote:
> 
> >     No automatic conversions between 8-bit "strings" and Unicode strings.
> >
> > If you want to turn UTF-8 into a Unicode string, say so.
> > If you want to turn Latin-1 into a Unicode string, say so.
> > If you want to turn ISO-2022-JP into a Unicode string, say so.
> > Adding a Unicode string and an 8-bit "string" gives an exception.
> 
> I'd accept this, with one change: mixing Unicode and 8-bit strings is
> okay when the 8-bit strings contain only ASCII (byte values 0 through
> 127).  

I could live with this compromise as long as we document that a future
version may use the "character is a character" model. I just don't want
people to start depending on a catchable exception being thrown because
that would stop us from ever unifying unmarked literal strings and
Unicode strings.

--

Are there any steps we could take to make a future divorce of strings
and byte arrays easier? What if we added a 

binary_read()

function that returns some form of byte array. The byte array type could
be just like today's string type except that its type object would be
distinct, it wouldn't have as many string-ish methods and it wouldn't
have any auto-conversion to Unicode at all.

People could start to transition code that reads non-ASCII data to the
new function. We could put big warning labels on read() to state that it
might not always be able to read data that is not in some small set of
recognized encodings (probably UTF-8 and UTF-16).

Or perhaps binary_open(). Or perhaps both.

I do not suggest just using the text/binary flag on the existing open
function because we cannot immediately change its behavior without
breaking code.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From jkraai at murlmail.com  Tue May  2 21:46:49 2000
From: jkraai at murlmail.com (jkraai at murlmail.com)
Date: Tue, 2 May 2000 14:46:49 -0500
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
Message-ID: <200005021946.OAA03609@www.polytopic.com>

The ever quotable Guido:
> English is the one natural language that we can all sue to communicate






From paul at prescod.net  Tue May  2 21:23:27 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 14:23:27 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>  
	            <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>
Message-ID: <390F2B2F.2953C72D@prescod.net>

Guido van Rossum wrote:
> 
> ...
> 
> So what do you think of my new proposal of using ASCII as the default
> "encoding"?  

I can live with it. I am mildly uncomfortable with the idea that I could
write a whole bunch of software that works great until some European
inserts one of their name characters. Nevertheless, being hard-assed is
better than being permissive because we can loosen up later.

What do we do about str( my_unicode_string )? Perhaps escape the Unicode
characters with backslashed numbers?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html



From guido at python.org  Tue May  2 21:58:20 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 15:58:20 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 14:23:27 CDT."
             <390F2B2F.2953C72D@prescod.net> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>  
            <390F2B2F.2953C72D@prescod.net> 
Message-ID: <200005021958.PAA26760@eric.cnri.reston.va.us>

[me]
> > So what do you think of my new proposal of using ASCII as the default
> > "encoding"?  

[Paul]
> I can live with it. I am mildly uncomfortable with the idea that I could
> write a whole bunch of software that works great until some European
> inserts one of their name characters.

Better that than when some Japanese insert *their* name characters and
it produces gibberish instead.

> Nevertheless, being hard-assed is
> better than being permissive because we can loosen up later.

Exactly -- just as nobody should *count* on 10**10 raising
OverflowError, nobody (except maybe parts of the standard library :-)
should *count* on unicode("\347") raising ValueError.  I think that's
fine.

> What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> characters with backslashed numbers?

Hm, good question.  Tcl displays unknown characters as \x or \u
escapes.  I think this may make more sense than raising an error.
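
Something along these lines, as a rough sketch of the Tcl-ish fallback
(the name is made up, not a proposed API):

    def str_with_escapes(u):
        # illustrative only: escape unknown characters instead of raising
        out = []
        for ch in u:
            o = ord(ch)
            if o < 128:
                out.append(chr(o))
            elif o < 256:
                out.append("\\x%02x" % o)
            else:
                out.append("\\u%04x" % o)
        return "".join(out)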

But there must be a way to turn on Unicode-awareness on e.g. stdout
and then printing a Unicode object should not use str() (as it
currently does).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Tue May  2 22:47:17 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 2 May 2000 13:47:17 -0700
Subject: [Python-Dev] Cannot declare the largest integer literal.
Message-ID: <20000502134717.A16825@activestate.com>

>>> i = -2147483648
OverflowError: integer literal too large
>>> i = -2147483648L
>>> int(i)   # it *is* a valid integer literal
-2147483648


As far as I traced back:

Python/compile.c::com_atom() calls
Python/compile.c::parsenumber(s = "2147483648") calls
Python/mystrtoul.c::PyOS_strtol() which

returns the ERANGE errno because it is given 2147483648 (which *is* out of
range) rather than -2147483648.


My question: Why is the minus sign not considered part of the "atom", i.e.
the integer literal? Should it be? PyOS_strtol() can properly parse this
integer literal if it is given the whole number with the minus sign.
Otherwise the special case largest negative number will always erroneously be
considered out of range.
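
For illustration (on a 32-bit box; exact message may vary), the
string-to-int path does see the sign and is happy:

>>> int("-2147483648")    # PyOS_strtol() gets the sign here, so no ERANGE
-2147483648
>>> -2147483648           # but the literal is parsed as unary minus on 2147483648
OverflowError: integer literal too large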

I don't know how the tokenizer works in Python. Was there a design decision
to separate the integer literal and the leading sign? And was the effect on
functions like PyOS_strtol() down the pipe missed?


Trent

--
Trent Mick
trentm at activestate.com









From guido at python.org  Tue May  2 22:47:30 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 16:47:30 -0400
Subject: [Python-Dev] Unicode compromise?
In-Reply-To: Your message of "Tue, 02 May 2000 14:01:32 CDT."
             <390F260C.2314F97E@prescod.net> 
References: <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us>  
            <390F260C.2314F97E@prescod.net> 
Message-ID: <200005022047.QAA26828@eric.cnri.reston.va.us>

> I could live with this compromise as long as we document that a future
> version may use the "character is a character" model. I just don't want
> people to start depending on a catchable exception being thrown because
> that would stop us from ever unifying unmarked literal strings and
> Unicode strings.

Agreed (as I've said before).

> --
> 
> Are there any steps we could take to make a future divorce of strings
> and byte arrays easier? What if we added a 
> 
> binary_read()
> 
> function that returns some form of byte array. The byte array type could
> be just like today's string type except that its type object would be
> distinct, it wouldn't have as many string-ish methods and it wouldn't
> have any auto-conversion to Unicode at all.

You can do this now with the array module, although clumsily:

  >>> import array
  >>> f = open("/core", "rb")
  >>> a = array.array('B', [0]) * 1000
  >>> f.readinto(a)
  1000
  >>>

Or if you wanted to read raw Unicode (UTF-16):

  >>> a = array.array('H', [0]) * 1000
  >>> f.readinto(a)
  2000
  >>> u = unicode(a, "utf-16")
  >>> 

There are some performance issues, e.g. you have to initialize the
buffer somehow and that seems a bit wasteful.
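
(A rough alternative that at least skips the dummy buffer --
array.fromstring still copies, though:)

  >>> import array
  >>> f = open("/core", "rb")
  >>> a = array.array('B')
  >>> a.fromstring(f.read(1000))   # assuming at least 1000 bytes are available
  >>> len(a)
  1000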

> People could start to transition code that reads non-ASCII data to the
> new function. We could put big warning labels on read() to state that it
> might not always be able to read data that is not in some small set of
> recognized encodings (probably UTF-8 and UTF-16).
> 
> Or perhaps binary_open(). Or perhaps both.
> 
> I do not suggest just using the text/binary flag on the existing open
> function because we cannot immediately change its behavior without
> breaking code.

A new method makes most sense -- there are definitely situations where
you want to read in text mode for a while and then switch to binary
mode (e.g. HTTP).

I'd like to put this off until after Python 1.6 -- but it deserves
attention.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Wed May  3 01:03:22 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 2 May 2000 16:03:22 -0700
Subject: [Python-Dev] PROPOSAL: exposure of values in limits.h and float.h
Message-ID: <20000502160322.A19101@activestate.com>

I apologize if I am hitting covered ground. What about a module (called
limits or something like that) that would expose some appropriate #define's
in limits.h and float.h.

For example:

limits.FLT_EPSILON could expose the C DBL_EPSILON
limits.FLT_MAX could expose the C DBL_MAX
limits.INT_MAX could expose the C LONG_MAX (although that particular name
would cause confusion with the actual C INT_MAX)


- Does this kind of thing already exist somewhere? Maybe in NumPy.

- If we ever (perhaps in Py3K) turn the basic types into classes then these
  could turn into constant attributes of those classes, i.e.:
  f = 3.14159
  f.EPSILON = <as set by C's DBL_EPSILON>

- I thought of these values being useful when I thought of comparing two
  floats for equality. Doing a straight comparison of floats is
  dangerous/wrong but is it not okay to consider two floats reasonably equal
  iff:
  	-EPSILON < float2 - float1 < EPSILON
  Or maybe that should be two or three EPSILONs. It has been a while since
  I've done any numerical analysis stuff.

  I suppose the answer to my question is: "It depends on the situation."
  Could this algorithm for float comparison be a better default than the
  status quo? I know that Mark H. and others have suggested that Python
  should maybe not provide a float comparison operator at all to beginners.



Trent

--
Trent Mick
trentm at activestate.com




From mal at lemburg.com  Wed May  3 01:11:37 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 01:11:37 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>  
	            <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us>
Message-ID: <390F60A9.A3AA53A9@lemburg.com>

Guido van Rossum wrote:
> 
> > > So what do you think of my new proposal of using ASCII as the default
> > > "encoding"?

How about using unicode-escape or raw-unicode-escape as
default encoding ? (They would have to be adapted to disallow
Latin-1 char input, though.)

The advantage would be that they are compatible with ASCII
while still providing loss-less conversion and since they
use escape characters, you can even read them using an
ASCII based editor.
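
For illustration, roughly what the encoder side does today (the exact
escape spelling may differ in detail):

  >>> u"ab\u1234d\xe9f".encode("unicode-escape")
  'ab\\u1234d\\xe9f'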

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond at skippinet.com.au  Wed May  3 01:12:18 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 3 May 2000 09:12:18 +1000
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <20000502134717.A16825@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBCEBGCKAA.mhammond@skippinet.com.au>

> >>> i = -2147483648
> OverflowError: integer literal too large
> >>> i = -2147483648L
> >>> int(i)   # it *is* a valid integer literal
> -2147483648

I struck this years ago!  At the time, the answer was "yes, its an
implementation flaw thats not worth fixing".

Interestingly, it _does_ work as a hex literal:

>>> 0x80000000
-2147483648
>>> -2147483648
Traceback (OverflowError: integer literal too large
>>>

Mark.




From mal at lemburg.com  Wed May  3 01:05:28 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 01:05:28 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            
			 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
			 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net>
Message-ID: <390F5F38.DD76CAF4@lemburg.com>

Paul Prescod wrote:
> 
> Combining characters are a whole 'nother level of complexity. Charater
> sets are hard. I don't accept that the argument that "Unicode itself has
> complexities so that gives us license to introduce even more
> complexities at the character representation level."
> 
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
> 
> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> think we need to get the bytes/characters level right and then we can
> worry about display-equivalent characters (or leave that to the Python
> programmer to figure out...).

I just wanted to point out that the argument "slicing doesn't
work with UTF-8" is moot.

I do see a point against UTF-8 auto-conversion given the example
that Guido mailed me:

"""
s = 'ab\341\210\264def'        # == str(u"ab\u1234def")
s.find(u"def")

This prints 3 -- the wrong result since "def" is found at s[5:8], not
at s[3:6].
"""

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From tim_one at email.msn.com  Wed May  3 04:20:20 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 2 May 2000 22:20:20 -0400
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <20000502134717.A16825@activestate.com>
Message-ID: <000001bfb4a6$21da7900$922d153f@tim>

[Trent Mick]
> >>> i = -2147483648
> OverflowError: integer literal too large
> >>> i = -2147483648L
> >>> int(i)   # it *is* a valid integer literal
> -2147483648

Python's grammar is such that negative integer literals don't exist; what
you actually have there is the unary minus operator applied to positive
integer literals; indeed,

>>> def f():
	return -42

>>> import dis
>>> dis.dis(f)
          0 SET_LINENO               1

          3 SET_LINENO               2
          6 LOAD_CONST               1 (42)
          9 UNARY_NEGATIVE
         10 RETURN_VALUE
         11 LOAD_CONST               0 (None)
         14 RETURN_VALUE
>>>

Note that, at runtime, the example loads +42, then negates it:  this wart
has deep roots!

> ...
> And was the effect on functions like PyOS_strtol() down the pipe
> missed?

More that it was considered an inconsequential endcase.  It's sure not worth
changing the grammar for <wink>.  I'd rather see Python erase the visible
distinction between ints and longs.





From guido at python.org  Wed May  3 04:31:21 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 02 May 2000 22:31:21 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 01:11:37 +0200."
             <390F60A9.A3AA53A9@lemburg.com> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us>  
            <390F60A9.A3AA53A9@lemburg.com> 
Message-ID: <200005030231.WAA02678@eric.cnri.reston.va.us>

> Guido van Rossum wrote:
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?

[MAL]
> How about using unicode-escape or raw-unicode-escape as
> default encoding ? (They would have to be adapted to disallow
> Latin-1 char input, though.)
> 
> The advantage would be that they are compatible with ASCII
> while still providing loss-less conversion and since they
> use escape characters, you can even read them using an
> ASCII based editor.

No, the backslash should mean itself when encoding from ASCII to
Unicode.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From esr at thyrsus.com  Wed May  3 05:22:20 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 2 May 2000 23:22:20 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <390EFE8C.4C10473C@prescod.net>; from paul@prescod.net on Tue, May 02, 2000 at 11:13:00AM -0500
References: <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <Pine.LNX.4.10.10005020114250.522-100000@localhost> <200005021231.IAA24249@eric.cnri.reston.va.us> <sjmtgss11jaqujkcbppu5keml79lg98gqo@4ax.com> <l0310280fb5349fd24fc5@[193.78.237.142]> <200005021421.KAA24526@eric.cnri.reston.va.us> <390EFE8C.4C10473C@prescod.net>
Message-ID: <20000502232220.B18638@thyrsus.com>

Paul Prescod <paul at prescod.net>:
> Where are we going? What's our long-range vision?
> 
> Three years from now where will we be? 
> 
> 1. How will we handle characters? 
> 2. How will we handle bytes?
> 3. What will unadorned literal strings "do"?
> 4. Will literal strings be the same type as byte arrays?
> 
> I don't see how we can make decisions today without a vision for the
> future. I think that this is the central point in our disagreement. Some
> of us are aiming for as much compatibility with where we think we should
> be going and others are aiming for as much compatibility as possible
> with where we came from.

And *that* is the most insightful statement I have seen in this entire 
foofaraw (which I have carefully been staying right the hell out of). 

Everybody meditate on the above, please.  Then declare your objectives *at
this level* so our Fearless Leader can make an informed decision *at this
level*.  Only then will it make sense to argue encoding theology...
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

"Extremism in the defense of liberty is no vice; moderation in the
pursuit of justice is no virtue."
	-- Barry Goldwater (actually written by Karl Hess)



From tim_one at email.msn.com  Wed May  3 07:05:59 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:05:59 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us>
Message-ID: <000301bfb4bd$463ec280$622d153f@tim>

[Guido]
> When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> bytes in either should make the comparison fail; when ordering is
> important, we can make an arbitrary choice e.g. "\377" < u"\200".

[Toby]
> I assume 'fail' means 'non-equal', rather than 'raises an exception'?

[Guido]
> Yes, sorry for the ambiguity.

Huh!  You sure about that?  If we're setting up a case where meaningful
comparison is impossible, isn't an exception more appropriate?  The current

>>> 83479278 < "42"
1
>>>

probably traps more people than it helps.





From tim_one at email.msn.com  Wed May  3 07:19:28 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:19:28 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <017d01bfb3bc$c3734c00$34aab5d4@hagrid>
Message-ID: <000401bfb4bf$27ec1600$622d153f@tim>

[Fredrik Lundh]
> ...
> (if you like, I can post more "fun with unicode" messages ;-)

By all means!  Exposing a gotcha to ridicule does more good than a dozen
abstract arguments.  But next time stoop to explaining what it is that's
surprising <wink>.





From just at letterror.com  Wed May  3 08:47:07 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 07:47:07 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390F5F38.DD76CAF4@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            		
  <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>			
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
 <390EFE21.DAD7749B@prescod.net>
Message-ID: <l03102800b53572ee87ad@[193.78.237.142]>

[MAL vs. PP]
>> > FYI: Normalization is needed to make comparing Unicode
>> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
>>
>> That's a whole 'nother debate at a whole 'nother level of abstraction. I
>> think we need to get the bytes/characters level right and then we can
>> worry about display-equivalent characters (or leave that to the Python
>> programmer to figure out...).
>
>I just wanted to point out that the argument "slicing doesn't
>work with UTF-8" is moot.

And failed...

I asked two Unicode guru's I happen to know about the normalization issue
(which is indeed not relevant to the current discussion, but it's
fascinating nevertheless!).

(Sorry about the possibly wrong email encoding... "è" is u"\350", "ö" is
u"\366")

John Jenkins replied:
"""
Well, I'm not sure you want to hear the answer -- but it really depends on
what the language is attempting to do.

By and large, Unicode takes the position that "e`" should always be treated
the same as "?". This is a *semantic* equivalence -- that is, they *mean*
the same thing -- and doesn't depend on the display engine to be true.
Unicode also provides a default collation algorithm
(http://www.unicode.org/unicode/reports/tr10/).

At the same time, the standard acknowledges that in real life, string
comparison and collation are complicated, language-specific problems
requiring a lot of work and interaction with the user to do right.

>From the perspective of a programming language, it would best be served IMHO
by implementing the contents of TR10 for string comparison and collation.
That would make "e`" and "è" come out as equivalent.
"""


Dave Opstad replied:
"""
Unicode talks about "canonical decomposition" in order to make it easier
to answer questions like yours. Specifically, in the Unicode 3.0
standard, rule D24 in section 3.6 (page 44) states that:

"Two character sequences are said to be canonical equivalents if their
full canonical decompositions are identical. For example, the sequences
<o, combining-diaeresis> and <ö> are canonical equivalents. Canonical
equivalence is a Unicode property. It should not be confused with
language-specific collation or matching, which may add additional
equivalencies."

So they still have language-specific differences, even if Unicode sees
them as canonically equivalent.

You might want to check this out:

http://www.unicode.org/unicode/reports/tr15/tr15-18.html

It's the latest technical report on these issues, which may help clarify
things further.
"""


It's very deep stuff, which seems more appropriate for an extension than
for builtin comparisons to me.

Just





From tim_one at email.msn.com  Wed May  3 07:47:37 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 01:47:37 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <Pine.GSO.4.10.10005021248200.8983-100000@sundial>
Message-ID: <000501bfb4c3$16743480$622d153f@tim>

[Moshe Zadka]
> ...
> I'd much prefer Python to reflect a fundamental truth about Unicode,
> which at least makes sure binary-goop can pass through Unicode and
> remain unharmed, than to reflect a nasty problem with UTF-8 (not
> everything is legal).

Then you don't want Unicode at all, Moshe.  All the official encoding
schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff
is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of
Unicode not yet having assigned a character to this position, it's that the
standard explicitly makes this sequence illegal and guarantees it will
always be illegal!  the other place this comes up is with surrogates, where
what's legal depends on both parts of a character pair; and, again, the
illegalities here are guaranteed illegal for all time).  UCS-4 is the
closest thing to binary-transparent Unicode encodings get, but even there
the length of a thing is constrained to be a multiple of 4 bytes.  Unicode
and binary goop will never coexist peacefully.





From ping at lfw.org  Wed May  3 07:56:12 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 22:56:12 -0700 (PDT)
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <000301bfb4bd$463ec280$622d153f@tim>
Message-ID: <Pine.LNX.4.10.10005022249330.522-100000@localhost>

On Wed, 3 May 2000, Tim Peters wrote:
> [Toby]
> > I assume 'fail' means 'non-equal', rather than 'raises an exception'?
> 
> [Guido]
> > Yes, sorry for the ambiguity.
> 
> Huh!  You sure about that?  If we're setting up a case where meaningful
> comparison is impossible, isn't an exception more appropriate?  The current
> 
> >>> 83479278 < "42"
> 1
> 
> probably traps more people than it helps.

Yeah, when i said

    No automatic conversions between Unicode strings and 8-bit "strings".

i was about to say

    Raise an exception on any operation attempting to combine or
    compare Unicode strings and 8-bit "strings".

...and then i thought, oh crap, but everything in Python is supposed
to be comparable.

What happens when you have some lists with arbitrary objects in them
and you want to sort them for printing, or to canonicalize them so
you can compare?  It might be too troublesome for list.sort() to
throw an exception because e.g. strings and ints were incomparable,
or 8-bit "strings" and Unicode strings were incomparable...

So -- what's the philosophy, Guido?  Are we committed to "everything
is comparable" (well, "all built-in types are comparable") or not?


-- ?!ng




From tim_one at email.msn.com  Wed May  3 08:40:54 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 02:40:54 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <000701bfb4ca$87b765c0$622d153f@tim>

[MAL]
> I just wanted to point out that the argument "slicing doesn't
> work with UTF-8" is moot.

[Just]
> And failed...

He succeeded for me.  Blind slicing doesn't always "work right" no matter
what encoding you use, because "work right" depends on semantics beyond the
level of encoding.  UTF-8 is no worse than anything else in this respect.





From just at letterror.com  Wed May  3 09:50:11 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 08:50:11 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <000701bfb4ca$87b765c0$622d153f@tim>
References: <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <l03102804b5358971d413@[193.78.237.152]>

[MAL]
> I just wanted to point out that the argument "slicing doesn't
> work with UTF-8" is moot.

[Just]
> And failed...

[Tim]
>He succeeded for me.  Blind slicing doesn't always "work right" no matter
>what encoding you use, because "work right" depends on semantics beyond the
>level of encoding.  UTF-8 is no worse than anything else in this respect.

But the discussion *was* at the level of encoding! Still it is worse, since
an arbitrary utf-8 slice may result in two illegal strings -- slicing "e`"
results in two perfectly legal strings, at the encoding level. Had he used
surrogates as an example, he would've been right... (But even that is an
encoding issue.)
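
Rough illustration, reusing Guido's example string and the current
str()-gives-UTF-8 behaviour (the exact error text is beside the point):

>>> s = 'ab\341\210\264def'        # == str(u"ab\u1234def"), i.e. UTF-8
>>> unicode(s[:4], "utf-8")        # the slice cuts the 3-byte sequence in half
UnicodeError: UTF-8 decoding error: unexpected end of data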

Just





From tim_one at email.msn.com  Wed May  3 09:11:12 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 3 May 2000 03:11:12 -0400
Subject: [Python-Dev] PROPOSAL: exposure of values in limits.h and float.h
In-Reply-To: <20000502160322.A19101@activestate.com>
Message-ID: <000801bfb4ce$c361ea60$622d153f@tim>

[Trent Mick]
> I apologize if I am hitting covered ground. What about a module (called
> limits or something like that) that would expose some appropriate
> #define's
> in limits.h and float.h.

I personally have little use for these.

> For example:
>
> limits.FLT_EPSILON could expose the C DBL_EPSILON
> limits.FLT_MAX could expose the C DBL_MAX

Hmm -- all evidence suggests that your "O" and "A" keys work fine, so where
did the absurdly abbreviated FLT come from <wink>?

> limits.INT_MAX could expose the C LONG_MAX (although that particulay name
> would cause confusion with the actual C INT_MAX)

That one is available as sys.maxint.

> - Does this kind of thing already exist somewhere? Maybe in NumPy.

Dunno.  I compute the floating-point limits when needed with Python code,
and observing what the hardware actually does is a heck of a lot more
trustworthy than platform C header files (and especially when
cross-compiling).

> - If we ever (perhaps in Py3K) turn the basic types into classes
> then these could turn into constant attributes of those classes, i.e.:
>   f = 3.14159
>   f.EPSILON = <as set by C's DBL_EPSILON>

That sounds better.

> - I thought of these values being useful when I thought of comparing
>   two floats for equality. Doing a straight comparison of floats is
>   dangerous/wrong

This is a myth whose only claim to veracity is the frequency and intensity
with which it's mechanically repeated <0.6 wink>.  It's no more dangerous
than adding two floats:  you're potentially screwed if you don't know what
you're doing in either case, but you're in no trouble at all if you do.

> but is it not okay to consider two floats reasonably equal iff:
>   	-EPSILON < float2 - float1 < EPSILON

Knuth (Vol 2) gives a reasonable defn of approximate float equality.  Yours
is measuring absolute error, which is almost never reasonable; relative
error is the measure of interest, but then 0.0 is an especially irksome
comparand.
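
(A rough sketch of one common relative-error test -- not a proposed
default; eps is problem-dependent, and 0.0 still needs special-casing,
which is exactly the point:)

    def approx_equal(x, y, eps):
        # relative error: scale the tolerance by the larger magnitude
        return abs(x - y) <= eps * max(abs(x), abs(y))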

> ...
>   I suppose the answer to my question is: "It depends on the situation."

Yes.

>   Could this algorithm for float comparison be a better default than the
>   status quo?

No.

> I know that Mark H. and others have suggested that Python should maybe
> not provide a float comparison operator at all to beginners.

There's a good case to be made for not exposing *anything* about fp to
beginners, but comparisons aren't especially surprising.  This usually gets
suggested when a newbie is surprised that e.g. 1./49*49 != 1.  Telling them
they *are* equal is simply a lie, and they'll pay for that false comfort
twice over a little bit later down the fp road.  For example, int(1./49*49)
is 0 on IEEE-754 platforms, which is awfully surprising for an expression
that "equals" 1(!).  The next suggestion is then to fudge int() too, and so
on and so on.  It's like the arcade Whack-A-Mole game:  each mole you knock
into its hole pops up two more where you weren't looking.  Before you know
it, not even a bona fide expert can guess what code will actually do
anymore.

the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
    that-can-be-done-ly y'rs  - tim





From effbot at telia.com  Wed May  3 09:34:51 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:34:51 +0200
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: <Pine.LNX.4.10.10005022249330.522-100000@localhost>
Message-ID: <00b201bfb4d3$07a95420$34aab5d4@hagrid>

Ka-Ping Yee <ping at lfw.org> wrote:
> So -- what's the philosophy, Guido?  Are we committed to "everything
> is comparable" (well, "all built-in types are comparable") or not?

in 1.6a2, obviously not:

>>> aUnicodeString < an8bitString
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

in 1.6a3, maybe.

</F>




From effbot at telia.com  Wed May  3 09:48:56 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:48:56 +0200
Subject: [Python-Dev] Unicode debate
References: <000501bfb4c3$16743480$622d153f@tim>
Message-ID: <00ce01bfb4d4$0a7d1820$34aab5d4@hagrid>

Tim Peters <tim_one at email.msn.com> wrote:
> [Moshe Zadka]
> > ...
> > I'd much prefer Python to reflect a fundamental truth about Unicode,
> > which at least makes sure binary-goop can pass through Unicode and
> > remain unharmed, than to reflect a nasty problem with UTF-8 (not
> > everything is legal).
> 
> Then you don't want Unicode at all, Moshe.  All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff
> is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of
> Unicode not yet having assigned a character to this position, it's that the
> standard explicitly makes this sequence illegal and guarantees it will
> always be illegal!

in context, I think what Moshe meant was that with a straight
character code mapping, any 8-bit string can always be mapped
to a unicode string and back again.

given a byte array "b":

    u = unicode(b, "default")
    assert map(ord, u) == map(ord, b)

again, this is no different from casting an integer to a long integer
and back again.  (imaging having to do that on the bits and bytes
level!).

and again, the internal unicode encoding used by the unicode string
type itself, or when serializing that string type, has nothing to do
with that.

</F>




From jack at oratrix.nl  Wed May  3 09:58:31 2000
From: jack at oratrix.nl (Jack Jansen)
Date: Wed, 03 May 2000 09:58:31 +0200
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Python
 bltinmodule.c,2.154,2.155
In-Reply-To: Message by bwarsaw@cnri.reston.va.us (Barry A. Warsaw) ,
	     Tue, 2 May 2000 15:24:09 -0400 (EDT) , <20000502192409.8C44E6636B@anthem.cnri.reston.va.us> 
Message-ID: <20000503075832.18574370CF2@snelboot.oratrix.nl>

> _PyBuiltin_Init_2(): Don't test Py_UseClassExceptionsFlag, just go
> ahead and initialize the class-based standard exceptions.  If this
> fails, we throw a Py_FatalError.

Isn't a Py_FatalError overkill? Or will not having the class-based standard 
exceptions lead to so much havoc later on that it is better than limping on?
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From just at letterror.com  Wed May  3 11:03:16 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 10:03:16 +0100
Subject: [Python-Dev] Unicode comparisons & normalization
Message-ID: <l03102806b535964edb26@[193.78.237.152]>

After quickly browsing through the unicode.org URLs I posted earlier, I
reach the following (possibly wrong) conclusions:

- there is a script and language independent canonical form (but automatic
normalization is indeed a bad idea)
- ideally, unicode comparisons should follow the rules from
http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
for 1.6, if at all...)
- this would indeed mean that it's possible for u == v even though type(u)
is type(v) and len(u) != len(v). However, I don't see how this would
collapse /F's world, as the two strings are at most semantically
equivalent. Their physical difference is real, and still follows the
a-string-is-a-sequence-of-characters rule (!).
- there may be additional customized language-specific sorting rules. I
currently don't see how to implement that without some global variable.
- the sorting rules are very complicated, and should be implemented by
calculating "sort keys". If I understood it correctly, these can take up to
4 bytes per character in its most compact form. Still, for it to be
somewhat speed-efficient, they need to be cached...
- u.find() may need an alternative API, which returns a (begin, end) tuple,
since the match may not have the same length as the search string... (This
is tricky, since you need the begin and end indices in the non-canonical
form...)

Just





From effbot at telia.com  Wed May  3 09:56:25 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 09:56:25 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>             <390F2B2F.2953C72D@prescod.net>  <200005021958.PAA26760@eric.cnri.reston.va.us>
Message-ID: <013c01bfb4d6$da19fb00$34aab5d4@hagrid>

Guido van Rossum <guido at python.org> wrote:
> > What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> > characters with backslashed numbers?
> 
> Hm, good question.  Tcl displays unknown characters as \x or \u
> escapes.  I think this may make more sense than raising an error.

but that's on the display side of things, right?  similar to
repr, in other words.

> But there must be a way to turn on Unicode-awareness on e.g. stdout
> and then printing a Unicode object should not use str() (as it
> currently does).

to throw some extra gasoline on this, how about allowing
str() to return unicode strings?

(extra questions: how about renaming "unicode" to "string",
and getting rid of "unichr"?)

count to ten before replying, please.

</F>




From ping at lfw.org  Wed May  3 10:30:02 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 01:30:02 -0700 (PDT)
Subject: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: <l03102806b535964edb26@[193.78.237.152]>
Message-ID: <Pine.LNX.4.10.10005030116460.522-100000@localhost>

On Wed, 3 May 2000, Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:
> 
> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

I just looked through this document.  Indeed, there's a lot
of work to be done if we want to compare strings this way.

I thought the most striking feature was that this comparison
method does *not* satisfy the common assumption

    a > b  implies  a + c > b + d        (+ is concatenation)

-- in fact, it is specifically designed to allow for cases
where differences in the *later* part of a string can have
greater influence than differences in an earlier part of a
string.  It *does* still guarantee that

    a + b > a

and of course we can still rely on the most basic rules such as

    a > b  and  b > c  implies  a > c

There are sufficiently many significant transformations
described in the UTR 10 document that i'm pretty sure it
is possible for two things to collate equally but not be
equivalent.  (Even after Unicode normalization, there is
still the possibility of rearrangement in step 1.2.)

This would be another motivation for Python to carefully
separate the three types of equality:

    is         identity-equal
    ==         value-equal
    <=>        magnitude-equal

We currently don't distinguish between the last two;
the operator "<=>" is my proposal for how to spell
"magnitude-equal", and in terms of outward behaviour
you can consider (a <=> b) to be (a <= b and a >= b).
I suspect we will find ourselves needing it if we do
rich comparisons anyway.
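
(A throwaway sketch, with a plain function standing in for the proposed
operator:)

    def magnitude_equal(a, b):
        # the outward behaviour of the proposed a <=> b
        return a <= b and a >= b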

(I don't know of any other useful kinds of equality,
but if you've run into this before, do pipe up...)


-- ?!ng




From mal at lemburg.com  Wed May  3 10:15:29 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 10:15:29 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            		
	  <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>			
	 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
	 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>
	 <390EFE21.DAD7749B@prescod.net> <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <390FE021.6F15C1C8@lemburg.com>

Just van Rossum wrote:
> 
> [MAL vs. PP]
> >> > FYI: Normalization is needed to make comparing Unicode
> >> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
> >>
> >> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> >> think we need to get the bytes/characters level right and then we can
> >> worry about display-equivalent characters (or leave that to the Python
> >> programmer to figure out...).
> >
> >I just wanted to point out that the argument "slicing doesn't
> >work with UTF-8" is moot.
> 
> And failed...

Huh ? The pure fact that you can have two (or more)
Unicode characters to represent a single character makes
Unicode itself have the same problems as e.g. UTF-8.

> [Refs about collation and decomposition]
>
> It's very deep stuff, which seems more appropriate for an extension than
> for builtin comparisons to me.

That's what I think too; I never argued for making this
builtin and automatic (don't know where people got this idea
from).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From effbot at telia.com  Wed May  3 11:02:09 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 11:02:09 +0200
Subject: [Python-Dev] Unicode comparisons & normalization
References: <l03102806b535964edb26@[193.78.237.152]>
Message-ID: <018a01bfb4de$7744cc00$34aab5d4@hagrid>

Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:

here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web 
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at
the source, and that it should be sufficient to do binary matching to tell
if two strings are identical.

...

another very interesting thing from that paper is where they identify four
layers of character support:

    Layer 1: Physical representation. This is necessary for
    APIs that expose a physical representation of string data.
    /.../ To avoid problems with duplicates, it is assumed that
    the data is normalized /.../ 

    Layer 2: Indexing based on abstract codepoints. /.../ This
    is the highest layer of abstraction that ensures interoperability
    with very low implementation effort. To avoid problems
    with duplicates, it is assumed that the data is normalized /.../
 
    Layer 3: Combining sequences, user-relevant. /.../ While we
    think that an exact definition of this layer should be possible,
    such a definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is
    least suited for interoperability, but is necessary for certain
    operations, e.g. sorting. 

until now, this discussion has focussed on the boundary between
layer 1 and 2.

that as many python strings as possible should be on the second
layer has always been obvious to me ("a very low implementation
effort" is exactly my style ;-), and leave the rest for the app.

...while Guido and MAL has argued that we should stay on level 1
(apparently because "we've already implemented it" is less effort
than "let's change a little bit")

no wonder they never understand what I'm talking about...

it's also interesting to see that MAL's using layer 3 and 4 issues as an
argument to keep Python's string support at layer 1.  in contrast, the
W3 paper thinks that normalization is a non-issue also on the layer 1
level.  go figure.

...

btw, how about adopting this paper as the "Character Model for Python"?

yes, I'm serious.

</F>

PS. here's my take on Just's normalization points:

> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

note that W3 paper recommends early normalization, and binary
comparision (assuming the same internal representation of the
unicode character codes, of course).

> - this would indeed mean that it's possible for u == v even though type(u)
> is type(v) and len(u) != len(v). However, I don't see how this would
> collapse /F's world, as the two strings are at most semantically
> equivalent. Their physical difference is real, and still follows the
> a-string-is-a-sequence-of-characters rule (!).

yes, but on layer 3 instead of layer 2.

> - there may be additional customized language-specific sorting rules. I
> currently don't see how to implement that without some global variable.

layer 4.

> - the sorting rules are very complicated, and should be implemented by
> calculating "sort keys". If I understood it correctly, these can take up to
> 4 bytes per character in its most compact form. Still, for it to be
> somewhat speed-efficient, they need to be cached...

layer 4.

> - u.find() may need an alternative API, which returns a (begin, end) tuple,
> since the match may not have the same length as the search string... (This
> is tricky, since you need the begin and end indices in the non-canonical
> form...)

layer 3.




From effbot at telia.com  Wed May  3 11:11:26 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 11:11:26 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>              <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com>
Message-ID: <01ed01bfb4df$8feddb60$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> Guido van Rossum wrote:
> > 
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?
> 
> How about using unicode-escape or raw-unicode-escape as
> default encoding ? (They would have to be adapted to disallow
> Latin-1 char input, though.)
> 
> The advantage would be that they are compatible with ASCII
> while still providing loss-less conversion and since they
> use escape characters, you can even read them using an
> ASCII based editor.

umm.  if you disallow latin-1 characters, how can you call this
one loss-less?

looks like political correctness taken to an entirely new level...

</F>




From ping at lfw.org  Wed May  3 10:50:30 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 01:50:30 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <013c01bfb4d6$da19fb00$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005030141580.522-100000@localhost>

On Wed, 3 May 2000, Fredrik Lundh wrote:
> Guido van Rossum <guido at python.org> wrote:
> > But there must be a way to turn on Unicode-awareness on e.g. stdout
> > and then printing a Unicode object should not use str() (as it
> > currently does).
> 
> to throw some extra gasoline on this, how about allowing
> str() to return unicode strings?

You still need to *print* them somehow.  One way or another,
stdout is still just a stream with bytes on it, unless we
augment file objects to understand encodings.

stdout sends bytes to something -- and that something will
interpret the stream of bytes in some encoding (could be
Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:

    1.  You explicitly downconvert to bytes, and specify
        the encoding each time you do.  Then write the
        bytes to stdout (or your file object).

    2.  The file object is smart and can be told what
        encoding to use, and Unicode strings written to
        the file are automatically converted to bytes.

Another thread mentioned having separate read/write and
binary_read/binary_write methods on files.  I suggest
doing it the other way, actually: since read/write operate
on byte streams now, *they* are the binary operations;
the new methods should be the ones that do the extra
encoding/decoding work, and could be called uniread/uniwrite,
uread/uwrite, textread/textwrite, etc.
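
(A rough sketch of the "smart file" flavour -- the names are made up, and
a real one would of course need read support, buffering, etc.:)

    class UniWriter:
        def __init__(self, file, encoding="ascii"):
            self.file = file
            self.encoding = encoding
        def uniwrite(self, u):
            # encode Unicode to bytes on the way out
            self.file.write(u.encode(self.encoding))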

> (extra questions: how about renaming "unicode" to "string",
> and getting rid of "unichr"?)

Would you expect chr(x) to return an 8-bit string when x < 128,
and a Unicode string when x >= 128?


-- ?!ng




From ping at lfw.org  Wed May  3 11:32:31 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:32:31 -0700 (PDT)
Subject: [Python-Dev] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030151150.522-100000@localhost>

On Tue, 2 May 2000, Guido van Rossum wrote:
> > P. P. S.  If always having to specify encodings is really too much,
> > i'd probably be willing to consider a default-encoding state on the
> > Unicode class, but it would have to be a stack of values, not a
> > single value.
> 
> Please elaborate?

On general principle, it seems bad to just have a "set" method
that encourages people to set static state in a way that
irretrievably loses the current state.  For something like this,
you want a "push" method and a "pop" method with which to bracket
a series of operations, so that you can easily write code which
politely leaves other code unaffected.

For example:

    >>> x = unicode("d\351but")        # assume Guido-ASCII wins
    UnicodeError: ASCII encoding error: value out of range
    >>> x = unicode("d\351but", "latin-1")
    >>> x
    u'd\351but'
    >>> print x.encode("latin-1")      # on my xterm with Latin-1 fonts
    début
    >>> x.encode("utf-8")
    'd\303\251but'

Now:

    >>> u"".pushenc("latin-1")         # need a better interface to this?
    >>> x = unicode("d\351but")        # okay now
    >>> x
    u'd\351but'
    >>> u"".pushenc("utf-8")
    >>> x = unicode("d\351but")
    UnicodeError: UTF-8 decoding error: invalid data
    >>> x = unicode("d\303\251but")
    >>> print x.encode("latin-1")
    début
    >>> str(x)
    'd\303\251but'
    >>> u"".popenc()                   # back to the Latin-1 encoding
    >>> str(x)
    'd\351but'
        .
        .
        .
    >>> u"".popenc()                   # back to the ASCII encoding

Similarly, imagine:

    >>> x = u"<Japanese text...>"

    >>> file = open("foo.jis", "w")
    >>> file.pushenc("iso-2022-jp")
    >>> file.uniwrite(x)
        .
        .
        .
    >>> file.popenc()

    >>> import sys
    >>> sys.stdout.write(x)            # bad! x contains chars > 127
    UnicodeError: ASCII encoding error: value out of range

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> sys.stdout.write(x)            # on a kterm with kanji fonts
    <Japanese text...>
        .
        .
        .
    >>> sys.stdout.popenc()

The above examples incorporate the Guido-ASCII proposal, which
makes a fair amount of sense to me now.  How do they look to y'all?
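
For concreteness, a bare-bones sketch of such a stack as a module-level
helper (the names are placeholders, not a proposed interface):

    _encs = ["ASCII"]                  # the default sits at the bottom

    def pushenc(enc):
        _encs.append(enc)

    def popenc():
        if len(_encs) > 1:             # never pop the default itself
            _encs.pop()

    def getenc():
        return _encs[-1]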



This illustrates the remaining wart:

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> print x                        # still bad! str is still doing ASCII
    UnicodeError: ASCII encoding error: value out of range

    >>> u"".pushenc("iso-2022-jp")
    >>> print x                        # on a kterm with kanji fonts
    <Japanese text...>

Writing to files asks the file object to convert from Unicode to
bytes, then write the bytes.

Printing converts the Unicode to bytes first with str(), then
hands the bytes to the file object to write.

This wart is really a larger printing issue.  If we want to
solve it, files have to know what to do with objects, i.e.

    print x

doesn't mean

    sys.stdout.write(str(x) + "\n")

instead it means

    sys.stdout.printout(x)

Hmm.  I think this might deserve a separate subject line.


-- ?!ng




From ping at lfw.org  Wed May  3 11:41:20 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:41:20 -0700 (PDT)
Subject: [Python-Dev] Printing objects on files
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030232360.522-100000@localhost>

The following is all stolen from E: see http://www.erights.org/.

As i mentioned in the previous message, there are reasons that
we might want to enable files to know what it means to print
things on them.

    print x

would mean

    sys.stdout.printout(x)

where sys.stdout is defined something like

    def __init__(self):
        self.encs = ["ASCII"]

    def pushenc(self, enc):
        self.encs.append(enc)
    
    def popenc(self):
        self.encs.pop()
        if not self.encs: self.encs = ["ASCII"]

    def printout(self, x):
        if type(x) is type(u""):
            self.write(x.encode(self.encs[-1]))
        else:   
            x.__print__(self)
        self.write("\n")

and each object would have a __print__ method; for lists, e.g.:

    def __print__(self, file):
        file.write("[")
        if len(self):
            file.printout(self[0])
        for item in self[1:]:
            file.write(", ")
            file.printout(item)
        file.write("]")

for floats, e.g.:

    def __print__(self, file):
        if hasattr(file, "floatprec"):
            prec = file.floatprec
        else:
            prec = 17
        file.write("%%.%df" % prec % self)

The passing of control between the file and the objects to
be printed enables us to make Tim happy:

    >>> l = [1/2, 1/3, 1/4]            # I can dream, can't i?

    >>> print l
    [0.5, 0.33333333333333331, 0.25]

    >>> sys.stdout.floatprec = 6
    >>> print l
    [0.5, 0.333333, 0.25]

Fantasizing about other useful kinds of state beyond "encs"
and "floatprec" ("listmax"? "ratprec"?) and managing this
namespace is left as an exercise to the reader.


-- ?!ng




From ht at cogsci.ed.ac.uk  Wed May  3 11:59:28 2000
From: ht at cogsci.ed.ac.uk (Henry S. Thompson)
Date: 03 May 2000 10:59:28 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Mon, 01 May 2000 20:53:26 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <f5bog6o54zj.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido at python.org> writes:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

OK, I've never contributed to this discussion, but I have a long
history of shipping widely used Python/Tkinter/XML tools (see my
homepage).  I care _very_ much that heretofore I have been unable to
support full XML because of the lack of Unicode support in Python.
I've already started playing with 1.6a2 for this reason.

I notice one apparent mis-communication between the various
contributors:

Treating narrow-strings as consisting of UNICODE code points <= 255 is 
not necessarily the same thing as making Latin-1 the default encoding.
I don't think that, on Paul and Fredrik's account, encodings are relevant to
narrow-strings at all.

I'd rather go right away to the coherent position of byte-arrays,
narrow-strings and wide-strings.  Encodings are only relevant to
conversion between byte-arrays and strings.  Decoding a byte-array
with a UTF-8 encoding into a narrow string might cause
overflow/truncation, just as decoding a byte-array with a UTF-8
encoding into a wide-string might.  The fact that decoding a
byte-array with a Latin-1 encoding into a narrow-string is a memcopy
is just a side-effect of the courtesy of the UNICODE designers wrt the 
code points between 128 and 255.

This is effectively the way our C-based XML toolset (which we embed in 
Python) works today -- we build an 8-bit version which uses char*
strings, and a 16-bit version which uses unsigned short* strings, and
convert from/to byte-streams in any supported encoding at the margins.

I'd like to keep byte-arrays at the margins in Python as well, for all 
the reasons advanced by Paul and Fredrik.

I think treating existing strings as a sort of pun between
narrow-strings and byte-arrays is a recipe for ongoing confusion.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht at cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/



From ping at lfw.org  Wed May  3 11:51:30 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 02:51:30 -0700 (PDT)
Subject: [Python-Dev] Re: Printing objects on files
In-Reply-To: <Pine.LNX.4.10.10005030232360.522-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005030242030.522-100000@localhost>

On Wed, 3 May 2000, Ka-Ping Yee wrote:
> 
> Fantasizing about other useful kinds of state beyond "encs"
> and "floatprec" ("listmax"? "ratprec"?) and managing this
> namespace is left as an exercise to the reader.

Okay, i lied.  Shortly after writing this i realized that it
is probably advisable for all such bits of state to be stored
in stacks, so an interface such as this might do:

    def push(self, key, value):
        if not self.state.has_key(key):
            self.state[key] = []
        self.state[key].append(value)

    def pop(self, key):
        if self.state.has_key(key):
            if len(self.state[key]):
                self.state[key].pop()

    def get(self, key):
        if self.state.has_key(key):
            stack = self.state[key]
            if stack:
                return stack[-1]
        return None

Thus:

    >>> print 1/3
    0.33333333333333331

    >>> sys.stdout.push("float.prec", 6)
    >>> print 1/3
    0.333333

    >>> sys.stdout.pop("float.prec")
    >>> print 1/3
    0.33333333333333331

And once we allow arbitrary strings as keys to the bits
of state, the period is a natural separator we can use
for managing the namespace.

Take the special case for Unicode out of the file object:
    
    def printout(self, x):
        x.__print__(self)
        self.write("\n")

and have the Unicode string do the work:

    def __printon__(self, file):
        file.write(self.encode(file.get("unicode.enc")))

This behaves just right if an encoding of None means ASCII.

If mucking with encodings is sufficiently common, you could
imagine conveniences on file objects such as

    def __init__(self, filename, mode, encoding=None):
        ...
        if encoding:
            self.push("unicode.enc", encoding)

    def pushenc(self, encoding):
        self.push("unicode.enc", encoding)

    def popenc(self):
        self.pop("unicode.enc")


-- ?!ng




From effbot at telia.com  Wed May  3 12:31:34 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 3 May 2000 12:31:34 +0200
Subject: [Python-Dev] Unicode debate
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>
Message-ID: <030a01bfb4ea$c2741e40$34aab5d4@hagrid>

Ka-Ping Yee <ping at lfw.org> wrote:
> > to throw some extra gasoline on this, how about allowing
> > str() to return unicode strings?
> 
> You still need to *print* them somehow.  One way or another,
> stdout is still just a stream with bytes on it, unless we
> augment file objects to understand encodings.
> 
> stdout sends bytes to something -- and that something will
> interpret the stream of bytes in some encoding (could be
> Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:
> 
>     1.  You explicitly downconvert to bytes, and specify
>         the encoding each time you do.  Then write the
>         bytes to stdout (or your file object).
> 
>     2.  The file object is smart and can be told what
>         encoding to use, and Unicode strings written to
>         the file are automatically converted to bytes.

which one's more convenient?

(no, I won't tell you what I prefer. guido doesn't want
more arguments from the old "characters are characters"
proponents, so I gotta trick someone else to spell them
out ;-)

> > (extra questions: how about renaming "unicode" to "string",
> > and getting rid of "unichr"?)
> 
> Would you expect chr(x) to return an 8-bit string when x < 128,
> and a Unicode string when x >= 128?

that will break too much existing code, I think.  but what
about replacing 128 with 256?

</F>




From just at letterror.com  Wed May  3 13:41:27 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 12:41:27 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390FE021.6F15C1C8@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."            		
   <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>				
 <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>	
 <l03102802b534149a9639@[193.78.237.164]> <390E939B.11B99B71@lemburg.com>	
 <390EFE21.DAD7749B@prescod.net> <l03102800b53572ee87ad@[193.78.237.142]>
Message-ID: <l03102800b535bef21708@[193.78.237.152]>

At 10:15 AM +0200 03-05-2000, M.-A. Lemburg wrote:
>Huh ? The pure fact that you can have two (or more)
>Unicode characters to represent a single character makes
>Unicode itself have the same problems as e.g. UTF-8.

It's the different level of abstraction that makes it different.

Even if "e`" is _equivalent_ to the combined character, that doesn't mean
that it _is_ the combined character, on the level of abstraction we are
talking about: it's still 2 characters, and those can be sliced apart
without a problem. Slicing utf-8 doesn't work because it yields invalid
strings, slicing "e`" does work since both halves are valid strings. The
fact that "e`" is semantically equivalent to the combined character doesn't
change that.

Just





From guido at python.org  Wed May  3 13:12:44 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 07:12:44 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: Your message of "Wed, 03 May 2000 01:30:02 PDT."
             <Pine.LNX.4.10.10005030116460.522-100000@localhost> 
References: <Pine.LNX.4.10.10005030116460.522-100000@localhost> 
Message-ID: <200005031112.HAA03138@eric.cnri.reston.va.us>

[Ping]
> This would be another motivation for Python to carefully
> separate the three types of equality:
> 
>     is         identity-equal
>     ==         value-equal
>     <=>        magnitude-equal
> 
> We currently don't distinguish between the last two;
> the operator "<=>" is my proposal for how to spell
> "magnitude-equal", and in terms of outward behaviour
> you can consider (a <=> b) to be (a <= b and a >= b).
> I suspect we will find ourselves needing it if we do
> rich comparisons anyway.

I don't think that this form of equality deserves its own operator.
The Unicode comparison rules are sufficiently hairy that it seems
better to implement them separately, either in a separate module or at
least as a Unicode-object-specific method, and let the == operator do
what it does best: compare the representations.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From guido at python.org  Wed May  3 13:14:54 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 07:14:54 -0400
Subject: [Python-Dev] Unicode comparisons & normalization
In-Reply-To: Your message of "Wed, 03 May 2000 11:02:09 +0200."
             <018a01bfb4de$7744cc00$34aab5d4@hagrid> 
References: <l03102806b535964edb26@[193.78.237.152]>  
            <018a01bfb4de$7744cc00$34aab5d4@hagrid> 
Message-ID: <200005031114.HAA03152@eric.cnri.reston.va.us>

> here's another good paper that covers this, the universe, and everything:

There are a lot of useful pointers being flung around.  Could someone
with more spare cycles than I currently have perhaps collect these and
produce a little write-up "further reading on Unicode comparison and
normalization" (or perhaps a more comprehensive title if warranted) to
be added to the i18n-sig's home page?

--Guido van Rossum (home page: http://www.python.org/~guido/)




From just at letterror.com  Wed May  3 14:26:50 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 13:26:50 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <030a01bfb4ea$c2741e40$34aab5d4@hagrid>
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>
Message-ID: <l03102804b535cb14f243@[193.78.237.149]>

[Ka-Ping Yee]
> Would you expect chr(x) to return an 8-bit string when x < 128,
> and a Unicode string when x >= 128?

[Fredrik Lundh]
> that will break too much existing code, I think.  but what
> about replacing 128 with 256?

Hihi... and *poof* -- we're back to Latin-1 for narrow strings ;-)

Just





From guido at python.org  Wed May  3 14:04:29 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:04:29 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 12:31:34 +0200."
             <030a01bfb4ea$c2741e40$34aab5d4@hagrid> 
References: <Pine.LNX.4.10.10005030141580.522-100000@localhost>  
            <030a01bfb4ea$c2741e40$34aab5d4@hagrid> 
Message-ID: <200005031204.IAA03252@eric.cnri.reston.va.us>

[Ping]
> > stdout sends bytes to something -- and that something will
> > interpret the stream of bytes in some encoding (could be
> > Latin-1, UTF-8, ISO-2022-JP, whatever).  So either:
> > 
> >     1.  You explicitly downconvert to bytes, and specify
> >         the encoding each time you do.  Then write the
> >         bytes to stdout (or your file object).
> > 
> >     2.  The file object is smart and can be told what
> >         encoding to use, and Unicode strings written to
> >         the file are automatically converted to bytes.

[Fredrik]
> which one's more convenient?

Marc-Andre's codec module contains file-like objects that support this
(or could easily be made to).

However the problem is that print *always* first converts the object
using str(), and str() enforces that the result is an 8-bit string.
I'm afraid that loosening this will break too much code.  (This all
really happens at the C level.)

I'm also afraid that this means that str(unicode) may have to be
defined to yield UTF-8.  My argument goes as follows:

1. We want to be able to set things up so that print u"..." does the
   right thing.  (What "the right thing" is, is not defined here,
   as long as the user sees the glyphs implied by u"...".)

2. print u is equivalent to sys.stdout.write(str(u)).

3. str() must always return an 8-bit string.

4. So the solution must involve assigning an object to sys.stdout that
   does the right thing given an 8-bit encoding of u.

5. So we need str(u) to produce a lossless 8-bit encoding of Unicode.

6. UTF-8 is the only sensible candidate.

Note that (apart from print) str() is never implicitly invoked -- all
implicit conversions when Unicode and 8-bit strings are combined
go from 8-bit to Unicode.
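
A hypothetical session showing points 3, 5 and 6 in action, assuming
the proposal is adopted (the exact error text is made up):

    >>> u"a" + "b"                 # mixing coerces 8-bit to Unicode
    u'ab'
    >>> u"a" + "\351"              # ...but only for ASCII data
    UnicodeError: ASCII decoding error: value out of range
    >>> str(u"d\xe9but")           # str() stays 8-bit but lossless,
    'd\303\251but'                 # hence UTF-8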

(There might be an alternative, but it would depend on having yet
another hook (similar to Ping's sys.display) that gets invoked when
printing an object (as opposed to displaying it at the interactive
prompt).  I'm not too keen on this because it would break code that
temporarily sets sys.stdout to a file of its own choosing and then
invokes print -- a common idiom to capture printed output in a string,
for example, which could be embedded deep inside a module.  If the
main program were to install a naive print hook that always sent
output to a designated place, this strategy might fail.)

> > > (extra questions: how about renaming "unicode" to "string",
> > > and getting rid of "unichr"?)
> > 
> > Would you expect chr(x) to return an 8-bit string when x < 128,
> > and a Unicode string when x >= 128?
> 
> that will break too much existing code, I think.  but what
> about replacing 128 with 256?

If the 8-bit Unicode proposal were accepted, this would make sense.
In my "only ASCII is implicitly convertible" proposal, this would be a
mistake, because chr(128) == "\x80" != u"\x80" == unichr(128).

I agree with everyone that things would be much simpler if we had
separate data types for byte arrays and 8-bit character strings.  But
we don't have this distinction yet, and I don't see a quick way to add
it in 1.6 without majorly upsetting the release schedule.

So all of my proposals are to be considered hacks to maintain as much
b/w compatibility as possible while still supporting some form of
Unicode.  The fact that half the time 8-bit strings are really being
used as byte arrays, while Python can't tell the difference, means (to
me) that the default encoding is an important thing to argue about.

I don't know if I want to push it out all the way to Py3k, but I just
don't see a way to implement "a character is a character" in 1.6 given
all the current constraints.  (BTW I promise that 1.7 will be speedy
once 1.6 is out of the door -- there's a lot else that was put off to
1.7.)

Fredrik, I believe I haven't seen your response to my ASCII proposal.
Is it just as bad as UTF-8 to you, or could you live with it?  On a
scale of 0-9 (0: UTF-8, 9: 8-bit Unicode), where is ASCII for you?

Where's my sre snapshot?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Wed May  3 14:16:56 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:16:56 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "03 May 2000 10:59:28 BST."
             <f5bog6o54zj.fsf@cogsci.ed.ac.uk> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>  
            <f5bog6o54zj.fsf@cogsci.ed.ac.uk> 
Message-ID: <200005031216.IAA03274@eric.cnri.reston.va.us>

[Henry S. Thompson]
> OK, I've never contributed to this discussion, but I have a long
> history of shipping widely used Python/Tkinter/XML tools (see my
> homepage).  I care _very_ much that heretofore I have been unable to
> support full XML because of the lack of Unicode support in Python.
> I've already started playing with 1.6a2 for this reason.

Thanks for chiming in!

> I notice one apparent mis-communication between the various
> contributors:
> 
> Treating narrow-strings as consisting of UNICODE code points <= 255 is 
> not necessarily the same thing as making Latin-1 the default encoding.
> I don't think that, on Paul and Fredrik's account, encodings are relevant to
> narrow-strings at all.

I agree that's what they are trying to tell me.

> I'd rather go right away to the coherent position of byte-arrays,
> narrow-strings and wide-strings.  Encodings are only relevant to
> conversion between byte-arrays and strings.  Decoding a byte-array
> with a UTF-8 encoding into a narrow string might cause
> overflow/truncation, just as decoding a byte-array with a UTF-8
> encoding into a wide-string might.  The fact that decoding a
> byte-array with a Latin-1 encoding into a narrow-string is a memcopy
> is just a side-effect of the courtesy of the UNICODE designers wrt the 
> code points between 128 and 255.
> 
> This is effectively the way our C-based XML toolset (which we embed in 
> Python) works today -- we build an 8-bit version which uses char*
> strings, and a 16-bit version which uses unsigned short* strings, and
> convert from/to byte-streams in any supported encoding at the margins.
> 
> I'd like to keep byte-arrays at the margins in Python as well, for all 
> the reasons advanced by Paul and Fredrik.
> 
> I think treating existing strings as a sort of pun between
> narrow-strings and byte-arrays is a recipe for ongoing confusion.

Very good analysis.

Unfortunately this is where we're stuck, until we have a chance to
redesign this kind of thing from scratch.  Python 1.5.2 programs use
strings for byte arrays probably as much as they use them for
character strings.  This is because way back in 1990, when I was
designing Python, I wanted to have the smallest set of basic types, but I
also wanted to be able to manipulate byte arrays somewhat.  Influenced
by K&R C, I chose to make strings and string I/O 8-bit clean so that
you could read a binary "string" from a file, manipulate it, and write
it back to a file, regardless of whether it was character or binary
data.

This model has never been challenged until now.  I agree that the Java
model (byte arrays and strings) or perhaps your proposed model (byte
arrays, narrow and wide strings) looks better.  But, although Python
has had rudimentary support for byte arrays for a while (the array
module, introduced in 1993), the majority of Python code manipulating
binary data still uses string objects.

My ASCII proposal is a compromise that tries to be fair to both uses
for strings.  Introducing byte arrays as a more fundamental type has
been on the wish list for a long time -- I see no way to introduce
this into Python 1.6 without totally botching the release schedule
(June 1st is very close already!).  I'd like to be able to move on,
there are other important things still to be added to 1.6 (Vladimir's
malloc patches, Neil's GC, Fredrik's completed sre...).

For 1.7 (which should happen later this year) I promise I'll reopen
the discussion on byte arrays.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Wed May  3 14:18:39 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:18:39 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Python bltinmodule.c,2.154,2.155
In-Reply-To: Your message of "Wed, 03 May 2000 09:58:31 +0200."
             <20000503075832.18574370CF2@snelboot.oratrix.nl> 
References: <20000503075832.18574370CF2@snelboot.oratrix.nl> 
Message-ID: <200005031218.IAA03288@eric.cnri.reston.va.us>

> > _PyBuiltin_Init_2(): Don't test Py_UseClassExceptionsFlag, just go
> > ahead and initialize the class-based standard exceptions.  If this
> > fails, we throw a Py_FatalError.
> 
> Isn't a Py_FatalError overkill? Or will not having the class-based standard 
> exceptions lead to so much havoc later on that it is better than limping on?

There will be *no* exception objects -- they will all be NULL
pointers.  It's not clear that you will be able to limp very far, and
it's better to have a clear diagnostic at the source of the problem.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Wed May  3 14:22:57 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 08:22:57 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 01:05:59 EDT."
             <000301bfb4bd$463ec280$622d153f@tim> 
References: <000301bfb4bd$463ec280$622d153f@tim> 
Message-ID: <200005031222.IAA03300@eric.cnri.reston.va.us>

> [Guido]
> > When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> > bytes in either should make the comparison fail; when ordering is
> > important, we can make an arbitrary choice e.g. "\377" < u"\200".
> 
> [Toby]
> > I assume 'fail' means 'non-equal', rather than 'raises an exception'?
> 
> [Guido]
> > Yes, sorry for the ambiguity.

[Tim]
> Huh!  You sure about that?  If we're setting up a case where meaningful
> comparison is impossible, isn't an exception more appropriate?  The current
> 
> >>> 83479278 < "42"
> 1
> >>>
> 
> probably traps more people than it helps.

Agreed, but that's the rule we all currently live by, and changing it
is something for Python 3000.

I'm not real strong on this though -- I was willing to live with
exceptions from the UTF-8-to-Unicode conversion.  If we all agree that
it's better for u"\377" == "\377" to raise a precedent-setting
exception than to return false, that's fine with me too.  I do want
u"a" == "a" to be true though (and I believe we all already agree on
that one).

Note that it's not the first precedent -- you can already define
classes whose instances can raise exceptions during comparisons.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Wed May  3 10:56:08 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 10:56:08 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>             <390F2B2F.2953C72D@prescod.net>  <200005021958.PAA26760@eric.cnri.reston.va.us> <013c01bfb4d6$da19fb00$34aab5d4@hagrid>
Message-ID: <390FE9A7.DE5545DA@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum <guido at python.org> wrote:
> > > What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> > > characters with backslashed numbers?
> >
> > Hm, good question.  Tcl displays unknown characters as \x or \u
> > escapes.  I think this may make more sense than raising an error.
> 
> but that's on the display side of things, right?  similar to
> repr, in other words.
> 
> > But there must be a way to turn on Unicode-awareness on e.g. stdout
> > and then printing a Unicode object should not use str() (as it
> > currently does).
> 
> to throw some extra gasoline on this, how about allowing
> str() to return unicode strings?
> 
> (extra questions: how about renaming "unicode" to "string",
> and getting rid of "unichr"?)
> 
> count to ten before replying, please.

1 2 3 4 5 6 7 8 9 10 ... ok ;-)

Guido's problem with printing Unicode can easily be solved
using the standard codecs.StreamRecoder class as I've done
in the example I posted some days ago.

Basically, what the stdout wrapper would do is take strings
as input, convert them to Unicode and then write them,
encoded, to the original stdout. For Unicode objects the
conversion can be skipped and the encoded output written
directly to stdout.

This can be done for any encoding supported by Python; e.g.
you could do the indirection in site.py and then have
Unicode printed as Latin-1 or UTF-8 or one of the many
code pages supported through the mapping codec.
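
A cruder sketch of that indirection than the StreamRecoder example
(it just passes 8-bit strings through instead of recoding them; the
wrapper class itself is made up):

    import sys, codecs

    class StdoutWrapper:
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encode = codecs.lookup(encoding)[0]
        def write(self, data):
            # Unicode objects get encoded on the way out; plain
            # 8-bit strings are written unchanged
            if type(data) is type(u""):
                data = self.encode(data)[0]
            self.stream.write(data)

    sys.stdout = StdoutWrapper(sys.stdout, "latin-1")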

About having str() return Unicode objects: I see str()
as a constructor for string objects, and under that assumption
str() will always have to return string objects.
unicode() does the same for Unicode objects, so renaming
it to something else doesn't really help all that much.

BTW, __str__() has to return strings too. Perhaps we
need __unicode__() and a corresponding slot function too ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed May  3 15:06:27 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 03 May 2000 15:06:27 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>              <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> <01ed01bfb4df$8feddb60$34aab5d4@hagrid>
Message-ID: <39102453.6923B10@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> > >
> > > > > So what do you think of my new proposal of using ASCII as the default
> > > > > "encoding"?
> >
> > How about using unicode-escape or raw-unicode-escape as
> > default encoding ? (They would have to be adapted to disallow
> > Latin-1 char input, though.)
> >
> > The advantage would be that they are compatible with ASCII
> > while still providing loss-less conversion and since they
> > use escape characters, you can even read them using an
> > ASCII based editor.
> 
> umm.  if you disallow latin-1 characters, how can you call this
> one loss-less?

[Guido didn't like this one, so it's probably moot to invest
 any more time in this...]

I meant that the unicode-escape codec should only take ASCII
characters as input and disallow non-escaped Latin-1 characters.

Anyway, I'm out of this discussion... 

I'll wait a week or so until things have been sorted out.

Have fun,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From ping at lfw.org  Wed May  3 15:09:59 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 3 May 2000 06:09:59 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <200005031204.IAA03252@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005030556250.522-100000@localhost>

On Wed, 3 May 2000, Guido van Rossum wrote:
> (There might be an alternative, but it would depend on having yet
> another hook (similar to Ping's sys.display) that gets invoked when
> printing an object (as opposed to displaying it at the interactive
> prompt).  I'm not too keen on this because it would break code that
> temporarily sets sys.stdout to a file of its own choosing and then
> invokes print -- a common idiom to capture printed output in a string,
> for example, which could be embedded deep inside a module.  If the
> main program were to install a naive print hook that always sent
> output to a designated place, this strategy might fail.)

I know this is not a small change, but i'm pretty convinced the
right answer here is that the print hook should call a *method*
on sys.stdout, whatever sys.stdout happens to be.  The details
are described in the other long message i wrote ("Printing objects
on files").

Here is an addendum that might actually make that proposal
feasible enough (compatibility-wise) to fly in the short term:

    print x

does, conceptually:

    try:
        sys.stdout.printout(x)
    except AttributeError:
        sys.stdout.write(str(x))
        sys.stdout.write("\n")

The rest can then be added, and the change in 'print x' will
work nicely for any file objects, but will not break on file-like
substitutes that don't define a 'printout' method.

Any reactions to the other benefit of this proposal -- namely,
the ability to control the printing parameters of object
components as they're being traversed for printing?  That was
actually the original motivation for doing the file.printout
thing: it gives you some of the effect of "passing down str-ness"
that we were discussing so heatedly a little while ago.

The other thing that just might justify this much of a change
is that, as you reasoned clearly in your other message, without
adequate resolution to the printing problem we may have painted
ourselves into a corner with regard to str(u"") conversion, and
i don't like the look of that corner much.  *Even* if we were to
get people to agree that it's okay for str(u"") to produce UTF-8,
it still seems pretty hackish to me that we're forced to choose
this encoding as a way of working around the fact that we can't
simply give the file the thing we want to print.


-- ?!ng




From moshez at math.huji.ac.il  Wed May  3 15:55:37 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Wed, 3 May 2000 16:55:37 +0300 (IDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <000501bfb4c3$16743480$622d153f@tim>
Message-ID: <Pine.GSO.4.10.10005031649040.4859-100000@sundial>

On Wed, 3 May 2000, Tim Peters wrote:

[Moshe Zadka]
> ...
> I'd much prefer Python to reflect a fundamental truth about Unicode,
> which at least makes sure binary-goop can pass through Unicode and
> remain unharmed, than to reflect a nasty problem with UTF-8 (not
> everything is legal).

[Tim Peters]
> Then you don't want Unicode at all, Moshe.  All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences

Of course I don't, and of course you're right. But what I do want is for
my binary goop to pass unharmed through the evil Unicode forest. Which is
why I don't want it to interpret my goop as a sequence of bytes it tries
to decode, but I want the numeric values of my bytes to pass through to
Unicode unharmed -- that means Latin-1 because of the second design
decision of the horribly western-specific Unicode -- the first 256
characters are the same as Latin-1. If it were up to me, I'd use Latin-3,
but it wasn't, so it's not.
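
The property being asked for fits in two lines with the explicit
codec (this much works regardless of what the default becomes):

    goop = "".join(map(chr, range(256)))    # arbitrary binary bytes
    assert unicode(goop, "latin-1").encode("latin-1") == goop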

> (for example, 0xffff
> is illegal in UTF-16 (whether BE or LE)

Tim, one of us must have cracked a chip. 0xffff is the same in BE and LE
-- isn't it?

--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From akuchlin at mems-exchange.org  Wed May  3 16:12:06 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 3 May 2000 10:12:06 -0400 (EDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <200005031216.IAA03274@eric.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <14608.13238.339572.202494@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>been on the wish list for a long time -- I see no way to introduce
>this into Python 1.6 without totally botching the release schedule
>(June 1st is very close already!).  I'd like to be able to move on,

My suggested criterion is that 1.6 not screw things up in a way that
we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
a corner that 

(And can we choose a mailing list for discussing this and stick to it?
 This is being cross-posted to three lists: python-dev, i18-sig, and
 xml-sig!  i18-sig only, maybe?  Or string-sig?)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Chess! I'm tormented by thoughts of strip chess. Pure mind just isn't enough,
Mallah. I long for a body.
  -- The Brain, in DOOM PATROL #34




From akuchlin at mems-exchange.org  Wed May  3 16:15:18 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 3 May 2000 10:15:18 -0400 (EDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <14608.13238.339572.202494@amarok.cnri.reston.va.us>
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
	<14608.13238.339572.202494@amarok.cnri.reston.va.us>
Message-ID: <14608.13430.92985.717058@amarok.cnri.reston.va.us>

Andrew M. Kuchling writes:
>Guido van Rossum writes:
>My suggested criterion is that 1.6 not screw things up in a way that
>we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
>a corner that 

Doh!  To complete that paragraph: Magic conversions assuming UTF-8
do back us into a corner that is hard to get out of later.  Magic
conversions assuming Latin1 or ASCII are a bit better, but I'd lean
toward the draconian solution: we don't know what we're doing, so do
nothing and require the user to explicitly convert between Unicode and
8-bit strings in a user-selected encoding.

--amk



From guido at python.org  Wed May  3 17:48:32 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 11:48:32 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Wed, 03 May 2000 10:15:18 EDT."
             <14608.13430.92985.717058@amarok.cnri.reston.va.us> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us> <14608.13238.339572.202494@amarok.cnri.reston.va.us>  
            <14608.13430.92985.717058@amarok.cnri.reston.va.us> 
Message-ID: <200005031548.LAA03595@eric.cnri.reston.va.us>

> >Guido van Rossum writes:
> >My suggested criterion is that 1.6 not screw things up in a way that
> >we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
> >a corner that 

> Andrew M. Kuchling writes:
> Doh!  To complete that paragraph: Magic conversions assuming UTF-8
> do back us into a corner that is hard to get out of later.  Magic
> conversions assuming Latin1 or ASCII are a bit better, but I'd lean
> toward the draconian solution: we don't know what we're doing, so do
> nothing and require the user to explicitly convert between Unicode and
> 8-bit strings in a user-selected encoding.

GvR responds:
That's what Ping suggested.  My reason for proposing default
conversions from ASCII is that there is much code that deals with
character strings in a fairly abstract sense and that would work out
of the box (or after very small changes) with Unicode strings.  This
code often uses some string literals containing ASCII characters.  An
arbitrary example: code to reformat a text paragraph; another: an XML
parser.  These look for certain ASCII characters given as literals in
the code (" ", "<" and so on) but the algorithm is essentially
independent of what encoding is used for non-ASCII characters.  (I
realize that the text reformatting example doesn't work for all
Unicode characters because its assumption that all characters have
equal width is broken -- but at the very least it should work with
Latin-1 or Greek or Cyrillic stored in Unicode strings.)

It's the same as for ints: a function to calculate the GCD works with
ints as well as long ints without change, even though it references
the int constant 0.  In other words, we want string-processing code to
be just as polymorphic as int-processing code.
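
For instance (two toy functions, not taken from any real module, just
to make the parallel concrete):

    def gcd(a, b):
        # references the int constant 0, yet works for ints and longs
        while b != 0:
            a, b = b, a % b
        return a

    def tag_name(markup):
        # references only ASCII literals, so under the ASCII proposal
        # it works for 8-bit and Unicode strings alike
        return markup[markup.index("<") + 1:markup.index(">")]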

--Guido van Rossum (home page: http://www.python.org/~guido/)



From just at letterror.com  Wed May  3 21:55:24 2000
From: just at letterror.com (Just van Rossum)
Date: Wed, 3 May 2000 20:55:24 +0100
Subject: [Python-Dev] Unicode strings: an alternative
Message-ID: <l03102800b5362642bae3@[193.78.237.149]>

Today I had a relatively simple idea that unites wide strings and narrow
strings in a way that is more backward compatible at the C level. It's quite
possible this has already been considered and rejected for reasons that are
not yet obvious to me, but I'll give it a shot anyway.

The main concept is not to provide a new string type but to extend the
existing string object like so:
- wide strings are stored as if they were narrow strings, simply using two
bytes for each Unicode character.
- there's a flag that specifies whether the string is narrow or wide.
- the ob_size field is the _physical_ length of the data; if the string is
wide, len(s) will return ob_size/2, all other string operations will have
to do similar things.
- there can possibly be an encoding attribute which may specify the used
encoding, if known.

Admittedly, this is tricky and involves quite a bit of effort to implement,
since all string methods need to have a narrow/wide switch. To make it worse,
it hardly offers anything the current solution doesn't. However, it offers
one IMHO _big_ advantage: C code that just passes strings along does not
need to change: wide strings can be seen as narrow strings without any
loss. This allows for __str__() & str() and friends to work with unicode
strings without any change.
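
A toy Python model of the idea, just to pin down the semantics (the
real change is of course at the C level, in the string object itself):

    class FlexString:
        def __init__(self, data, width=1):
            # data is the raw byte buffer; width is 1 (narrow) or
            # 2 (wide), i.e. the proposed flag
            assert width in (1, 2) and len(data) % width == 0
            self.data = data        # what ob_sval holds
            self.width = width
        def __len__(self):
            # ob_size stays physical; len() reports characters
            return len(self.data) / self.width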

Any thoughts?

Just





From tree at cymru.basistech.com  Wed May  3 22:19:05 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Wed, 3 May 2000 16:19:05 -0400 (EDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102800b5362642bae3@[193.78.237.149]>
References: <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <14608.35257.729641.178724@cymru.basistech.com>

Just van Rossum writes:
 > The main concept is not to provide a new string type but to extend the
 > existing string object like so:

This is the most logical thing to do.

 > - wide strings are stored as if they were narrow strings, simply using two
 > bytes for each Unicode character.

I disagree with you here... store them as UTF-8.

 > - there's a flag that specifies whether the string is narrow or wide.

Yup.

 > - the ob_size field is the _physical_ length of the data; if the string is
 > wide, len(s) will return ob_size/2, all other string operations will have
 > to do similar things.

Is it possible to add a logical length field too? I presume it is too
expensive to recalculate the logical (character) length of a string
each time len(s) is called? Doing this is only slightly more time
consuming than a normal strlen: really just O(n) + c, where 'c' is the
constant time needed for table lookup (to get the number of bytes in
the UTF-8 sequence given the start character) and the pointer
manipulation (to add that length to your span pointer).
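
In Python terms the scan boils down to counting lead bytes (a sketch
that ignores malformed input):

    def utf8_len(s):
        # UTF-8 continuation bytes are 0x80-0xBF; anything else
        # starts a new character
        n = 0
        for ch in s:
            if not (0x80 <= ord(ch) < 0xC0):
                n = n + 1
        return n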

 > - there can possibly be an encoding attribute which may specify the used
 > encoding, if known.

So is this used to handle the case where you have a legacy encoding
(ShiftJIS, say) used in your existing strings, so you flag that 8-bit
("narrow" in a way) string as ShiftJIS?

If wide strings are always Unicode, why do you need the encoding?


 > Admittedly, this is tricky and involves quite a bit of effort to implement,
 > since all string methods need to have a narrow/wide switch. To make it worse,
 > it hardly offers anything the current solution doesn't. However, it offers
 > one IMHO _big_ advantage: C code that just passes strings along does not
 > need to change: wide strings can be seen as narrow strings without any
 > loss. This allows for __str__() & str() and friends to work with unicode
 > strings without any change.

If you store wide strings as UCS2 then people using the C interface
lose: strlen() stops working, or will return incorrect
results. Indeed, any of the str*() routines in the C runtime will
break. This is the advantage of using UTF-8 here --- you can still use
strcpy and the like on the C side and have things work.

 > Any thoughts?

I'm doing essentially what you suggest in my Unicode enablement of MySQL.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From skip at mojam.com  Wed May  3 22:51:49 2000
From: skip at mojam.com (Skip Montanaro)
Date: Wed, 3 May 2000 15:51:49 -0500 (CDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14608.35257.729641.178724@cymru.basistech.com>
References: <l03102800b5362642bae3@[193.78.237.149]>
	<14608.35257.729641.178724@cymru.basistech.com>
Message-ID: <14608.37223.787291.236623@beluga.mojam.com>

    Tom> Is it possible to add a logical length field too? I presume it is
    Tom> too expensive to recalculate the logical (character) length of a
    Tom> string each time len(s) is called? Doing this is only slightly more
    Tom> time consuming than a normal strlen: ...

Note that currently the len() method doesn't call strlen() at all.  It just
returns the ob_size field.  Presumably, with Just's proposal len() would
simply return ob_size/width.  If you used a variable width encoding, Just's
plan wouldn't work.  (I don't know anything about string encodings - is
UTF-8 variable width?)




From guido at python.org  Wed May  3 23:22:59 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 17:22:59 -0400
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Your message of "Wed, 03 May 2000 20:55:24 BST."
             <l03102800b5362642bae3@[193.78.237.149]> 
References: <l03102800b5362642bae3@[193.78.237.149]> 
Message-ID: <200005032122.RAA05150@eric.cnri.reston.va.us>

> Today I had a relatively simple idea that unites wide strings and narrow
> strings in a way that is more backward compatible at the C level. It's quite
> possible this has already been considered and rejected for reasons that are
> not yet obvious to me, but I'll give it a shot anyway.
> 
> The main concept is not to provide a new string type but to extend the
> existing string object like so:
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.
> - there's a flag that specifies whether the string is narrow or wide.
> - the ob_size field is the _physical_ length of the data; if the string is
> wide, len(s) will return ob_size/2, all other string operations will have
> to do similar things.
> - there can possibly be an encoding attribute which may specify the used
> encoding, if known.
> 
> Admittedly, this is tricky and involves quite a bit of effort to implement,
> since all string methods need to have a narrow/wide switch. To make it worse,
> it hardly offers anything the current solution doesn't. However, it offers
> one IMHO _big_ advantage: C code that just passes strings along does not
> need to change: wide strings can be seen as narrow strings without any
> loss. This allows for __str__() & str() and friends to work with unicode
> strings without any change.

This seems to have some nice properties, but I think it would cause
problems for existing C code that tries to *interpret* the bytes of a
string: it could very well do the wrong thing for wide strings (since
old C code doesn't check for the "wide" flag).  I'm not sure how much
C code there is that merely passes strings along...  Most C code using
strings makes use of the strings (e.g. open() falls in this category
in my eyes).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tree at cymru.basistech.com  Thu May  4 00:05:39 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Wed, 3 May 2000 18:05:39 -0400 (EDT)
Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14608.37223.787291.236623@beluga.mojam.com>
References: <l03102800b5362642bae3@[193.78.237.149]>
	<14608.35257.729641.178724@cymru.basistech.com>
	<14608.37223.787291.236623@beluga.mojam.com>
Message-ID: <14608.41651.781464.747522@cymru.basistech.com>

Skip Montanaro writes:
 > Note that currently the len() method doesn't call strlen() at all.  It just
 > returns the ob_size field.  Presumably, with Just's proposal len() would
 > simply return ob_size/width.  If you used a variable width encoding, Just's
 > plan wouldn't work.  (I don't know anything about string encodings - is
 > UTF-8 variable width?)

Yes, technically from 1 - 6 bytes per character, though in practice
for Unicode it's 1 - 3.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From guido at python.org  Thu May  4 02:52:39 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 20:52:39 -0400
Subject: [Python-Dev] weird bug in test_winreg
Message-ID: <200005040052.UAA07874@eric.cnri.reston.va.us>

I just noticed a weird traceback in test_winreg.  When I import
test.autotest on Windows, I get a "test failed" notice for
test_winreg.  When I run it by itself the test succeeds.  But when I
first import test.autotest and then import test.test_winreg (which
should rerun the latter, since test.regrtest unloads all test modules
after they have run), I get an AttributeError telling me that 'None'
object has no attribute 'get'.  This is in encodings.__init__.py in
the first call to _cache.get() in search_function.  Somehow this is
called by SetValueEx() in WriteTestData() in test/test_winreg.py.  But
inspection of the encodings module shows that _cache is {}, not None,
and the source shows no evidence of how this could have happened.

Any suggestions?

--Guido van Rossum (home page: http://www.python.org/~guido/)




From guido at python.org  Thu May  4 02:57:50 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 03 May 2000 20:57:50 -0400
Subject: [Python-Dev] weird bug in test_winreg
In-Reply-To: Your message of "Wed, 03 May 2000 20:52:39 EDT."
             <200005040052.UAA07874@eric.cnri.reston.va.us> 
References: <200005040052.UAA07874@eric.cnri.reston.va.us> 
Message-ID: <200005040057.UAA07966@eric.cnri.reston.va.us>

> I just noticed a weird traceback in test_winreg.  When I import
> test.autotest on Windows, I get a "test failed" notice for
> test_winreg.  When I run it by itself the test succeeds.  But when I
> first import test.autotest and then import test.test_winreg (which
> should rerun the latter, since test.regrtest unloads all test modules
> after they have run), I get an AttributeError telling me that 'None'
> object has no attribute 'get'.  This is in encodings.__init__.py in
> the first call to _cache.get() in search_function.  Somehow this is
> called by SetValueEx() in WriteTestData() in test/test_winreg.py.  But
> inspection of the encodings module shows that _cache is {}, not None,
> and the source shows no evidence of how this could have happened.

I may have sounded confused: the problem is not caused by the
reload().  The test fails the first time around when run by
test.autotest.  My suspicion is that another test somehow overwrites
encodings._cache?

--Guido van Rossum (home page: http://www.python.org/~guido/)




From mhammond at skippinet.com.au  Thu May  4 03:20:24 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 4 May 2000 11:20:24 +1000
Subject: [Python-Dev] FW: weird bug in test_winreg
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au>

Oops - I didnt notice the CC - a copy of what I sent to Guido:

-----Original Message-----
From: Mark Hammond [mailto:mhammond at skippinet.com.au]
Sent: Thursday, 4 May 2000 11:13 AM
To: Guido van Rossum
Subject: RE: weird bug in test_winreg


Hah - I was just thinking about this myself.  If I wasn't waiting 24
hours, I would have beaten you to the test_fork1 patch :-)

However, there is something bad going on.  If you remove your test_fork1
patch, and run it from regrtest (_not_ stand alone) you will see the
children threads die with:

  File "L:\src\Python-cvs\Lib\test\test_fork1.py", line 30, in f
    alive[id] = os.getpid()
AttributeError: 'None' object has no attribute 'getpid'

Note the error - os is None!

[The reason it only happens as part of the test is that the children are
created before the main thread fails with the attribute error]

Similarly, I get spurious:

Traceback (most recent call last):
  File ".\test_thread.py", line 103, in task2
    mutex.release()
AttributeError: 'None' object has no attribute 'release'

(Only rarely, and never when run stand-alone - the test_fork1 exception
happens 100% of the time from the test suite)

And of course the test_winreg one.

test_winreg, I guessed, may be caused by the import lock (but it's certainly
not obvious how or why!?).  However, that doesn't explain the others.

I also saw these _before_ I applied the threading patches (and after!)

So I think the problem may be a little deeper?

Mark.




From just at letterror.com  Thu May  4 09:42:00 2000
From: just at letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 08:42:00 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <200005032122.RAA05150@eric.cnri.reston.va.us>
References: Your message of "Wed, 03 May 2000 20:55:24 BST."            
 <l03102800b5362642bae3@[193.78.237.149]>
 <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <l03102800b536d1d8c0bc@[193.78.237.161]>

(Thanks for all the comments. I'll condense my replies into one post.)

[JvR]
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.

[Tom Emerson wrote]
>I disagree with you here... store them as UTF-8.

Erm, utf-8 in a wide string? This makes no sense...

[Skip Montanaro]
>Presumably, with Just's proposal len() would
>simply return ob_size/width.

Right. And if you would allow values for width other than 1 and 2, it opens
the way for UCS-4. Wouldn't that be nice? It's hardly more effort, and
"only" width==1 needs to be special-cased for speed.

>If you used a variable width encoding, Just's plan wouldn't work.

Correct, but neither does the current unicode object. Variable-width encodings
are too messy to see as strings at all: they are only useful as byte arrays.

[GvR]
>This seems to have some nice properties, but I think it would cause
>problems for existing C code that tries to *interpret* the bytes of a
>string: it could very well do the wrong thing for wide strings (since
>old C code doesn't check for the "wide" flag).  I'm not sure how much
>C code there is that merely passes strings along...  Most C code using
>strings makes use of the strings (e.g. open() falls in this category
>in my eyes).

There are probably many cases that fall into this category. But then again,
these cases, especially those that potentially can deal with other
encodings than ascii, are not much helped by a default encoding, as /F
showed.

My idea arose after yesterday's discussions. Some quotes, plus comments:

[GvR]
>However the problem is that print *always* first converts the object
>using str(), and str() enforces that the result is an 8-bit string.
>I'm afraid that loosening this will break too much code.  (This all
>really happens at the C level.)

Guido goes on to explain that this means utf-8 is the only sensible default
in this case. Good reasoning, but I think it's backwards:
- str(unicodestring) should just return unicodestring
- it is important that stdout receives the original unicode object.

[MAL]
>BTW, __str__() has to return strings too. Perhaps we
>need __unicode__() and a corresponding slot function too ?!

This also seems backwards. If it's really too hard to change Python so that
__str__ can return unicode objects, my solution may help.

[Ka-Ping Yee]
>Here is an addendum that might actually make that proposal
>feasible enough (compatibility-wise) to fly in the short term:
>
>    print x
>
>does, conceptually:
>
>    try:
>        sys.stdout.printout(x)
>    except AttributeError:
>        sys.stdout.write(str(x))
>        sys.stdout.write("\n")

The fact that stuff like this is even being *proposed* (not that it's not
smart or anything...) means there's a terrible bottleneck somewhere which
needs fixing. My proposal seems to do that nicely.

Of course, there's no such thing as a free lunch, and I'm sure there are
other corners that'll need fixing, but it appears having to write

    if (!PyString_Check(doc) && !PyUnicode_Check(doc))
        ...

in all places that may accept unicode strings is no fun either.

Yes, some code will break if you throw a wide string at it, but I think
that code is easier repaired with my proposal than with the current
implementation.

It's a big advantage to have only one string type; it makes many problems
we've been discussing easier to talk about.

Just





From effbot at telia.com  Thu May  4 09:46:05 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Thu, 4 May 2000 09:46:05 +0200
Subject: [Python-Dev] Unicode debate
References: <Pine.LNX.4.10.10005030556250.522-100000@localhost>
Message-ID: <002d01bfb59c$cf482280$34aab5d4@hagrid>

Ka-Ping Yee <ping at lfw.org> wrote:
> I know this is not a small change, but i'm pretty convinced the
> right answer here is that the print hook should call a *method*
> on sys.stdout, whatever sys.stdout happens to be.  The details
> are described in the other long message i wrote ("Printing objects
> on files").
> 
> Here is an addendum that might actually make that proposal
> feasible enough (compatibility-wise) to fly in the short term:
> 
>     print x
> 
> does, conceptually:
> 
>     try:
>         sys.stdout.printout(x)
>     except AttributeError:
>         sys.stdout.write(str(x))
>         sys.stdout.write("\n")
> 
> The rest can then be added, and the change in 'print x' will
> work nicely for any file objects, but will not break on file-like
> substitutes that don't define a 'printout' method.

another approach is (simplified):

    try:
        sys.stdout.write(x.encode(sys.stdout.encoding))
    except AttributeError:
        sys.stdout.write(str(x))

or, if str is changed to return any kind of string:

    x = str(x)
    try:
        x = x.encode(sys.stdout.encoding)
    except AttributeError:
        pass
    sys.stdout.write(x)
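
both variants assume a stream object that knows its own encoding;
today's file objects don't have one, so think of something like this
hypothetical wrapper (names made up purely for illustration):

    import sys

    class EncodingWriter:
        # hypothetical stream wrapper: encodes wide strings on the way out
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encoding = encoding
        def write(self, s):
            if type(s) is type(u""):      # wide string: encode it first
                s = s.encode(self.encoding)
            self.stream.write(s)

    sys.stdout = EncodingWriter(sys.stdout, "latin-1")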

</F>




From ht at cogsci.ed.ac.uk  Thu May  4 10:51:39 2000
From: ht at cogsci.ed.ac.uk (Henry S. Thompson)
Date: 04 May 2000 09:51:39 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Wed, 03 May 2000 08:16:56 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido at python.org> writes:

<snip/>

> My ASCII proposal is a compromise that tries to be fair to both uses
> for strings.  Introducing byte arrays as a more fundamental type has
> been on the wish list for a long time -- I see no way to introduce
> this into Python 1.6 without totally botching the release schedule
> (June 1st is very close already!).  I'd like to be able to move on,
> there are other important things still to be added to 1.6 (Vladimir's
> malloc patches, Neil's GC, Fredrik's completed sre...).
> 
> For 1.7 (which should happen later this year) I promise I'll reopen
> the discussion on byte arrays.

I think I hear a moderate consensus developing that the 'ASCII
proposal' is a reasonable compromise given the time constraints.  But
let's not fail to come back to this ASAP -- it _really_ narcs me that
every time I load XML into my Python-based editor I'm going to convert
large amounts of wide-string data into UTF-8 just so Tk can convert it
back to wide-strings in order to display it!

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht at cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/



From just at letterror.com  Thu May  4 13:27:45 2000
From: just at letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 12:27:45 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102800b536d1d8c0bc@[193.78.237.161]>
References: <200005032122.RAA05150@eric.cnri.reston.va.us> Your message of
 "Wed, 03 May 2000 20:55:24 BST."            
 <l03102800b5362642bae3@[193.78.237.149]>
 <l03102800b5362642bae3@[193.78.237.149]>
Message-ID: <l03102809b53709fef820@[193.78.237.126]>

I wrote:
>It's a big advantage to have only one string type; it makes many problems
>we've been discussing easier to talk about.

I think I should've been more explicit about what I meant here. I'll try to
phrase it as an addendum to my proposal -- which suddenly is no longer just
a narrow/wide string unification but narrow/wide/ultrawide, to really be
ready for the future...

As someone else suggested in the discussion, I think it's good if we
separate the encoding from the data type. Meaning that wide strings are no
longer tied to Unicode. This allows for double-byte encodings other than
UCS-2 as well as for safe passing-through of binary goop, but that's not
the main point. The main point is that this will make the behavior of
(wide) strings more understandable and consistent.

The extended string type is simply a sequence of code points, allowing for
0-0xFF for narrow strings, 0-0xFFFF for wide strings, and 0-0xFFFFFFFF for
ultra-wide strings. Upcasting is always safe, downcasting may raise
OverflowError. Depending on the used encoding, this comes as close as
possible to the sequence-of-characters model.
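
A sketch of that model in plain Python (the helper is made up; nothing
like it exists, it just illustrates the rule):

    _MAX_CODE = {1: 0xFF, 2: 0xFFFF, 4: 0xFFFFFFFFL}

    def check_width(codes, width):
        # the code points never change; only the storage width does,
        # and downcasting fails loudly if a code point doesn't fit
        for c in codes:
            if c > _MAX_CODE[width]:
                raise OverflowError("code point %d needs a wider string" % c)
        return codes

    check_width([72, 233], 1)        # fits in a narrow string
    check_width([0x20AC], 2)         # fits in a wide string
    # check_width([0x20AC], 1)       # would raise OverflowError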

The default character set should of course be Unicode -- and it should be
obvious that this implies Latin-1 for narrow strings.

(Additionally: an encoding attribute suddenly makes a whole lot of sense
again.)

Ok, y'all can shoot me now ;-)

Just





From guido at python.org  Thu May  4 14:40:35 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 04 May 2000 08:40:35 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "04 May 2000 09:51:39 BST."
             <f5br9bi1yw4.fsf@cogsci.ed.ac.uk> 
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us>  
            <f5br9bi1yw4.fsf@cogsci.ed.ac.uk> 
Message-ID: <200005041240.IAA08277@eric.cnri.reston.va.us>

> I think I hear a moderate consensus developing that the 'ASCII
> proposal' is a reasonable compromise given the time constraints.  But
> let's not fail to come back to this ASAP -- it _really_ narcs me that
> every time I load XML into my Python-based editor I'm going to convert
> large amounts of wide-string data into UTF-8 just so Tk can convert it
> back to wide-strings in order to display it!

Thanks -- but that's really Tcl's fault, since the only way to get
character data *into* Tcl (or out of it) is through the UTF-8
encoding.

And is your XML really stored on disk in its 16-bit format?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fredrik at pythonware.com  Thu May  4 15:21:25 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 4 May 2000 15:21:25 +0200
Subject: [Python-Dev] Re: Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <f5bog6o54zj.fsf@cogsci.ed.ac.uk> <200005031216.IAA03274@eric.cnri.reston.va.us>             <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>  <200005041240.IAA08277@eric.cnri.reston.va.us>
Message-ID: <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com>

Guido van Rossum <guido at python.org> wrote:
> Thanks -- but that's really Tcl's fault, since the only way to get
> character data *into* Tcl (or out of it) is through the UTF-8
> encoding.

from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm

    Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars)

    Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new
    object or modify an existing object to hold a copy of the
    Unicode string given by unicode and numChars.

    (Tcl_UniChar* is currently the same thing as Py_UNICODE*)

</F>




From guido at python.org  Thu May  4 19:03:58 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 04 May 2000 13:03:58 -0400
Subject: [Python-Dev] FW: weird bug in test_winreg
In-Reply-To: Your message of "Thu, 04 May 2000 11:20:24 +1000."
             <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBOEDACKAA.mhammond@skippinet.com.au> 
Message-ID: <200005041703.NAA13471@eric.cnri.reston.va.us>

Mark Hammond:

> However, there is something bad going on.  If you remove your test_fork1
> patch, and run it from regrtest (_not_ stand alone) you will see the
> children threads die with:
> 
>   File "L:\src\Python-cvs\Lib\test\test_fork1.py", line 30, in f
>     alive[id] = os.getpid()
> AttributeError: 'None' object has no attribute 'getpid'
> 
> Note the error - os is None!
> 
> [The reason is only happens as part of the test is because the children are
> created before the main thread fails with the attribute error]

I don't get this one -- maybe my machine is too slow.  (130 MHz
Pentium.)

> Similarly, I get spurious:
> 
> Traceback (most recent call last):
>   File ".\test_thread.py", line 103, in task2
>     mutex.release()
> AttributeError: 'None' object has no attribute 'release'
> 
> (Only rarely, and never when run stand-alone - the test_fork1 exception
> happens 100% of the time from the test suite)
> 
> And of course the test_winreg one.
> 
> test_winreg, I guessed, may be caused by the import lock (but its certainly
> not obvious how or why!?).  However, that doesnt explain the others.
> 
> I also saw these _before_ I applied the threading patches (and after!)
> 
> So I think the problem may be a little deeper?

It's Vladimir's patch which, after each tests, unloads all modules
that were loaded by that test.  If I change this to only unload
modules whose name starts with "test.", the test_winreg problem goes
away, and I bet yours go away too.

The real reason must be deeper -- there's also the import lock and the
fact that if a submodule of package "test" tries to import "os", a
search for "test.os" is made and if it doesn't find it it sticks None
in sys.modules['test.os'].
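
You can see the caching happen under the current import rules (any
submodule of "test" that imports os will do; regrtest itself does):

    import sys
    import test.regrtest                    # a test submodule that imports os

    print sys.modules.has_key('test.os')    # 1: the failed relative lookup...
    print sys.modules['test.os']            # ...left None behind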

but I don't have time to research this further.

I'm tempted to apply the following change to regrtest.py.  This should
still unload the test modules (so you can rerun an individual test)
but it doesn't touch other modules.  I'll wait 24 hours. :-)

*** regrtest.py	2000/04/21 21:35:06	1.15
--- regrtest.py	2000/05/04 16:56:26
***************
*** 121,127 ****
              skipped.append(test)
          # Unload the newly imported modules (best effort finalization)
          for module in sys.modules.keys():
!             if module not in save_modules:
                  test_support.unload(module)
      if good and not quiet:
          if not bad and not skipped and len(good) > 1:
--- 121,127 ----
              skipped.append(test)
          # Unload the newly imported modules (best effort finalization)
          for module in sys.modules.keys():
!             if module not in save_modules and module.startswith("test."):
                  test_support.unload(module)
      if good and not quiet:
          if not bad and not skipped and len(good) > 1:

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gvwilson at nevex.com  Thu May  4 21:03:54 2000
From: gvwilson at nevex.com (gvwilson at nevex.com)
Date: Thu, 4 May 2000 15:03:54 -0400 (EDT)
Subject: [Python-Dev] Minimal (single-file) Python?
Message-ID: <Pine.LNX.4.10.10005041448010.22917-100000@akbar.nevex.com>

Hi.  Has anyone ever built, or thought about building, a single-file
Python, in which all the "basic" capabilities are included in a single
executable (where "basic" means "can do as much as the Bourne shell")?
Some of the entries in the Software Carpentry competition would like to be
able to bootstrap from as small a starting point as possible.

Thanks,
Greg

p.s. I don't think this is the same problem as moving built-in features of
Python into optionally-loaded libraries, as some of the things in the
'sys', 'string', and 'os' modules would have to move in the other
direction to ensure Bourne shell equivalence.





From just at letterror.com  Thu May  4 23:22:38 2000
From: just at letterror.com (Just van Rossum)
Date: Thu, 4 May 2000 22:22:38 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
Message-ID: <l03102810b5378dda02f5@[193.78.237.126]>

(Boy, is it quiet here all of a sudden ;-)

Sorry for the duplication of stuff, but I'd like to reiterate my points, to
separate them from my implementation proposal, as that's just what it is:
an implementation detail.

These things are important to me:
- get rid of the Unicode-ness of wide strings, in order to
- make narrow and wide strings as similar as possible
- implicit conversion between narrow and wide strings should
  happen purely on the basis of the character codes; no
  assumption at all should be made about the encoding, ie.
  what the character code _means_.
- downcasting from wide to narrow may raise OverflowError if
  there are characters in the wide string that are > 255
- str(s) should always return s if s is a string, whether narrow
  or wide
- file objects need to be responsible for handling wide strings
- the above two points should make it possible for
- if no encoding is known, Unicode is the default, whether
  narrow or wide

The above points seem to have the following consequences:
- the 'u' in \uXXXX notation no longer makes much sense,
  since it is not necessary for the character to be a Unicode
  code point: it's just a 2-byte int. \wXXXX might be an option.
- the u"" notation is no longer necessary: if a string literal
  contains a character > 255 the string should automatically
  become a wide string.
- narrow strings should also have an encode() method.
- the builtin unicode() function might be redundant if:
  - it is possible to specify a source encoding. I'm not sure if
    this is best done through an extra argument for encode()
    or that it should be a new method, eg. transcode().
  - s.encode() or s.transcode() are allowed to output a wide
    string, as in aNarrowString.encode("UCS-2") and
    s.transcode("Mac-Roman", "UCS-2").

My proposal to extend the "old" string type to be able to contain wide
strings is of course largely unrelated to all this. Yet it may provide some
additional C compatibility (especially now that silent conversion to utf-8
is out) as well as a workaround for the
str()-having-to-return-a-narrow-string bottleneck.

Just





From skip at mojam.com  Thu May  4 22:43:42 2000
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 4 May 2000 15:43:42 -0500 (CDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102810b5378dda02f5@[193.78.237.126]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <14609.57598.738381.250872@beluga.mojam.com>

    Just> Sorry for the duplication of stuff, but I'd like to reiterate my
    Just> points, to separate them from my implementation proposal, as
    Just> that's just what it is: an implementation detail.

    Just> These things are important to me:
    ...

For the encoding-challenged like me, does it make sense to explicitly state
that you can't mix character widths within a single string, or is that just
so obvious that I deserve a head slap just for mentioning it?

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From effbot at telia.com  Thu May  4 23:02:35 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Thu, 4 May 2000 23:02:35 +0200
Subject: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]><l03102800b52d80db1290@[193.78.237.154]><200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><f5bog6o54zj.fsf@cogsci.ed.ac.uk><200005031216.IAA03274@eric.cnri.reston.va.us> <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
Message-ID: <007701bfb60c$1543f060$34aab5d4@hagrid>

Henry S. Thompson <ht at cogsci.ed.ac.uk> wrote:
> I think I hear a moderate consensus developing that the 'ASCII
> proposal' is a reasonable compromise given the time constraints.

agreed.

(but even if we settle for "7-bit unicode" in 1.6, there are still a
few issues left to sort out before 1.6 final.  but it might be best
to get back to that after we've added SRE and GC to 1.6a3. we
might all need a short break...)

> But let's not fail to come back to this ASAP

first week in june, promise ;-)

</F>




From mhammond at skippinet.com.au  Fri May  5 01:55:15 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 5 May 2000 09:55:15 +1000
Subject: [Python-Dev] FW: weird bug in test_winreg
In-Reply-To: <200005041703.NAA13471@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBAEEBCKAA.mhammond@skippinet.com.au>

> It's Vladimir's patch which, after each tests, unloads all modules
> that were loaded by that test.  If I change this to only unload
> modules whose name starts with "test.", the test_winreg problem goes
> away, and I bet yours go away too.

They do indeed!

> The real reason must be deeper -- there's also the import lock and the
> fact that if a submodule of package "test" tries to import "os", a
> search for "test.os" is made and if it doesn't find it it sticks None
> in sys.modules['test.os'].
>
> but I don't have time to research this further.

I started to think about this.  The issue is simply that code which
blithely wipes sys.modules[] may cause unexpected results.  While the end
result is a bug, the symptoms are caused by extreme hackiness.

Seeing as my time is also limited, I say we forget it!

> I'm tempted to apply the following change to regrtest.py.  This should
> still unload the test modules (so you can rerun an individual test)
> but it doesn't touch other modules.  I'll wait 24 hours. :-)

The 24 hour time limit is only supposed to apply to _my_ patches - you can
check yours straight in (and if anyone asks, just tell them I said it was
OK) :-)

Mark.




From ht at cogsci.ed.ac.uk  Fri May  5 10:19:07 2000
From: ht at cogsci.ed.ac.uk (Henry S. Thompson)
Date: 05 May 2000 09:19:07 +0100
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Guido van Rossum's message of "Thu, 04 May 2000 08:40:35 -0400"
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
	<f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
	<200005041240.IAA08277@eric.cnri.reston.va.us>
Message-ID: <f5bya5pxvd0.fsf@cogsci.ed.ac.uk>

Guido van Rossum <guido at python.org> writes:

> > I think I hear a moderate consensus developing that the 'ASCII
> > proposal' is a reasonable compromise given the time constraints.  But
> > let's not fail to come back to this ASAP -- it _really_ narcs me that
> > every time I load XML into my Python-based editor I'm going to convert
> > large amounts of wide-string data into UTF-8 just so Tk can convert it
> > back to wide-strings in order to display it!
> 
> Thanks -- but that's really Tcl's fault, since the only way to get
> character data *into* Tcl (or out of it) is through the UTF-8
> encoding.
> 
> And is your XML really stored on disk in its 16-bit format?

No, I have no idea what encoding it's in; my XML parser supports over
a dozen encodings, and quite sensibly always delivers the content, as
per the XML REC, as wide-strings.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht at cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/



From ht at cogsci.ed.ac.uk  Fri May  5 10:21:41 2000
From: ht at cogsci.ed.ac.uk (Henry S. Thompson)
Date: 05 May 2000 09:21:41 +0100
Subject: [Python-Dev] Re: [XML-SIG] Re: Unicode debate
In-Reply-To: "Fredrik Lundh"'s message of "Thu, 4 May 2000 15:21:25 +0200"
References: <l03102805b52ca7830b18@[193.78.237.154]>
	<l03102800b52d80db1290@[193.78.237.154]>
	<200004271501.LAA13535@eric.cnri.reston.va.us>
	<3908F566.8E5747C@prescod.net>
	<200004281450.KAA16493@eric.cnri.reston.va.us>
	<390AEF1D.253B93EF@prescod.net>
	<200005011802.OAA21612@eric.cnri.reston.va.us>
	<390DEB45.D8D12337@prescod.net>
	<200005012132.RAA23319@eric.cnri.reston.va.us>
	<390E1F08.EA91599E@prescod.net>
	<200005020053.UAA23665@eric.cnri.reston.va.us>
	<f5bog6o54zj.fsf@cogsci.ed.ac.uk>
	<200005031216.IAA03274@eric.cnri.reston.va.us>
	<f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
	<200005041240.IAA08277@eric.cnri.reston.va.us>
	<00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com>
Message-ID: <f5bu2gdxv8q.fsf@cogsci.ed.ac.uk>

"Fredrik Lundh" <fredrik at pythonware.com> writes:

> Guido van Rossum <guido at python.org> wrote:
> > Thanks -- but that's really Tcl's fault, since the only way to get
> > character data *into* Tcl (or out of it) is through the UTF-8
> > encoding.
> 
> from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm
> 
>     Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars)
> 
>     Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new
>     object or modify an existing object to hold a copy of the
>     Unicode string given by unicode and numChars.
> 
>     (Tcl_UniChar* is currently the same thing as Py_UNICODE*)
> 

Any way this can be exploited in Tkinter?

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht at cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/



From just at letterror.com  Fri May  5 11:25:37 2000
From: just at letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 10:25:37 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <007701bfb60c$1543f060$34aab5d4@hagrid>
References:  <l03102805b52ca7830b18@[193.78.237.154]><l03102800b52d80db1290@[193.78.237
 .154]><200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@pres
 cod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@p
 rescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D1233
 7@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91
 599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><f5bog6o54z
 j.fsf@cogsci.ed.ac.uk><200005031216.IAA03274@eric.cnri.reston.va.us>
 <f5br9bi1yw4.fsf@cogsci.ed.ac.uk>
Message-ID: <l03102802b5383fd7c128@[193.78.237.126]>

At 11:02 PM +0200 04-05-2000, Fredrik Lundh wrote:
>Henry S. Thompson <ht at cogsci.ed.ac.uk> wrote:
>> I think I hear a moderate consensus developing that the 'ASCII
>> proposal' is a reasonable compromise given the time constraints.
>
>agreed.

This makes no sense: implementing the 7-bit proposal takes more or less
the same time as implementing 8-bit downcasting. Or is it just the
bickering that's too time consuming? ;-)

I worry that if the current implementation goes into 1.6 more or less as it
is now there's no way we can ever go back (before P3K). Or will Unicode
support be marked "experimental" in 1.6? This is not so much about the
7-bit/8-bit proposal but about the dubious unicode() and unichr() functions
and the u"" notation:

- unicode() only takes strings, so is effectively a method of the string type.
- if narrow and wide strings are meant to be as similar as possible,
chr(256) should just return a wide char
- similarly, why is the u"" notation at all needed?

The current design is more complex than needed, and still offers plenty of
surprises. Making it simpler (without integrating the two string types) is
not a huge effort. Seeing the wide string type as independent of Unicode
takes no physical effort at all, as it's just in our heads.

Fixing str() so it can return wide strings might be harder, and can wait
until later. Would be too bad, though.

Just





From ping at lfw.org  Fri May  5 11:21:20 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Fri, 5 May 2000 02:21:20 -0700 (PDT)
Subject: [Python-Dev] Unicode debate
In-Reply-To: <002d01bfb59c$cf482280$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org>

On Thu, 4 May 2000, Fredrik Lundh wrote:
> 
> another approach is (simplified):
> 
>     try:
>         sys.stdout.write(x.encode(sys.stdout.encoding))
>     except AttributeError:
>         sys.stdout.write(str(x))

Indeed, that would work to solve just this specific Unicode
issue -- but there is a lot of flexibility and power to be
gained from the general solution of putting a method on the
stream object, as the example with the formatted list items
showed.  I think it is a good idea, for instance, to leave
decisions about how to print Unicode up to the Unicode object,
and not hardcode bits of it into print.

Guido, have you digested my earlier 'printout' suggestions?


-- ?!ng

"Old code doesn't die -- it just smells that way."
    -- Bill Frantz




From mbel44 at dial.pipex.net  Fri May  5 11:07:46 2000
From: mbel44 at dial.pipex.net (Toby Dickenson)
Date: Fri, 05 May 2000 10:07:46 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102810b5378dda02f5@[193.78.237.126]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <me25hs0diag8d0b6bu5gqjpchdq5q3aig5@4ax.com>

On Thu, 4 May 2000 22:22:38 +0100, Just van Rossum
<just at letterror.com> wrote:

>(Boy, is it quiet here all of a sudden ;-)
>
>Sorry for the duplication of stuff, but I'd like to reiterate my points, to
>separate them from my implementation proposal, as that's just what it is:
>an implementation detail.
>
>These things are important to me:
>- get rid of the Unicode-ness of wide strings, in order to
>- make narrow and wide strings as similar as possible
>- implicit conversion between narrow and wide strings should
>  happen purely on the basis of the character codes; no
>  assumption at all should be made about the encoding, ie.
>  what the character code _means_.
>- downcasting from wide to narrow may raise OverflowError if
>  there are characters in the wide string that are > 255
>- str(s) should always return s if s is a string, whether narrow
>  or wide
>- file objects need to be responsible for handling wide strings
>- the above two points should make it possible for
>- if no encoding is known, Unicode is the default, whether
>  narrow or wide
>
>The above points seem to have the following consequences:
>- the 'u' in \uXXXX notation no longer makes much sense,
>  since it is not neccesary for the character to be a Unicode
>  code point: it's just a 2-byte int. \wXXXX might be an option.
>- the u"" notation is no longer neccesary: if a string literal
>  contains a character > 255 the string should automatically
>  become a wide string.
>- narrow strings should also have an encode() method.
>- the builtin unicode() function might be redundant if:
>  - it is possible to specify a source encoding. I'm not sure if
>    this is best done through an extra argument for encode()
>    or that it should be a new method, eg. transcode().

>  - s.encode() or s.transcode() are allowed to output a wide
>    string, as in aNarrowString.encode("UCS-2") and
>    s.transcode("Mac-Roman", "UCS-2").

One other pleasant consequence:

- String comparisons work character by character, even if the
  representations of those characters have different widths.
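
For instance (hypothetical semantics, not what the current implementation
does):

    narrow = "caf\xe9"        # Latin-1 bytes: code points 99, 97, 102, 233
    wide = u"caf\u00e9"       # the same four code points, two bytes each
    # under the proposal narrow == wide is true, because only the character
    # codes are compared; no encoding assumption is involved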

>My proposal to extend the "old" string type to be able to contain wide
>strings is of course largely unrelated to all this. Yet it may provide some
>additional C compatibility (especially now that silent conversion to utf-8
>is out) as well as a workaround for the
>str()-having-to-return-a-narrow-string bottleneck.


Toby Dickenson
tdickenson at geminidataloggers.com



From just at letterror.com  Fri May  5 13:40:49 2000
From: just at letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 12:40:49 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <me25hs0diag8d0b6bu5gqjpchdq5q3aig5@4ax.com>
References: <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102810b5378dda02f5@[193.78.237.126]>
Message-ID: <l03102805b5385e3de8e8@[193.78.237.127]>

At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote:
>One other pleasant consequence:
>
>- String comparisons work character-by character, even if the
>  representation of those characters have different widths.

Exactly. By saying "(wide) strings are not tied to Unicode" the question
whether wide strings should or should not be sorted according to the
Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
too hard anyway"...

Just





From tree at cymru.basistech.com  Fri May  5 13:46:41 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Fri, 5 May 2000 07:46:41 -0400 (EDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102805b5385e3de8e8@[193.78.237.127]>
References: <l03102810b5378dda02f5@[193.78.237.126]>
	<l03102805b5385e3de8e8@[193.78.237.127]>
Message-ID: <14610.46241.129977.642796@cymru.basistech.com>

Just van Rossum writes:
 > At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote:
 > >One other pleasant consequence:
 > >
 > >- String comparisons work character-by character, even if the
 > >  representation of those characters have different widths.
 > 
 > Exactly. By saying "(wide) strings are not tied to Unicode" the question
 > whether wide strings should or should not be sorted according to the
 > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
 > too hard anyway"...

Wait a second.

There is nothing about Unicode that would prevent you from defining
string equality as byte-level equality.

This strikes me as the wrong way to deal with the complex collation
issues of Unicode.

It seems to me that by default wide-strings compare at the byte-level
(i.e., '=' is a byte level comparison). If you want a normalized
comparison, then you make an explicit function call for that.

This is no different from comparing strings in a case sensitive
vs. case insensitive manner.
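
Something like this (using the normalize() call that a richer unicodedata
module could grow later; it isn't there today):

    import unicodedata

    a = u"e\u0301"           # 'e' followed by a combining acute accent
    b = u"\u00e9"            # the precomposed 'e with acute'

    print a == b                                 # not equal: per-code-point comparison
    print unicodedata.normalize("NFC", a) == b   # equal after explicit normalization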

       -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From just at letterror.com  Fri May  5 15:17:31 2000
From: just at letterror.com (Just van Rossum)
Date: Fri, 5 May 2000 14:17:31 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <14610.46241.129977.642796@cymru.basistech.com>
References: <l03102805b5385e3de8e8@[193.78.237.127]>
 <l03102810b5378dda02f5@[193.78.237.126]>
 <l03102805b5385e3de8e8@[193.78.237.127]>
Message-ID: <l03102808b53877a3e392@[193.78.237.127]>

[Me]
> Exactly. By saying "(wide) strings are not tied to Unicode" the question
> whether wide strings should or should not be sorted according to the
> Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
> too hard anyway"...

[Tom Emerson]
>Wait a second.
>
>There is nothing about Unicode that would prevent you from defining
>string equality as byte-level equality.

Agreed.

>This strikes me as the wrong way to deal with the complex collation
>issues of Unicode.

All I was trying to say was that by looking at it this way, it is even
more obvious that the builtin comparison should not deal with Unicode
sorting & collation issues. It seems you're saying the exact same thing:

>It seems to me that by default wide-strings compare at the byte-level
>(i.e., '=' is a byte level comparison). If you want a normalized
>comparison, then you make an explicit function call for that.

Exactly.

>This is no different from comparing strings in a case sensitive
>vs. case insensitive manner.

Good point. All this taken together still means to me that comparisons
between wide and narrow strings should take place at the character level,
which implies that coercion from narrow to wide is done at the character
level, without looking at the encoding. (Which in my book in turn still
implies that as long as we're talking about Unicode, narrow strings are
effectively Latin-1.)

Just





From tree at cymru.basistech.com  Fri May  5 14:34:35 2000
From: tree at cymru.basistech.com (Tom Emerson)
Date: Fri, 5 May 2000 08:34:35 -0400 (EDT)
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: <l03102808b53877a3e392@[193.78.237.127]>
References: <l03102805b5385e3de8e8@[193.78.237.127]>
	<l03102810b5378dda02f5@[193.78.237.126]>
	<l03102808b53877a3e392@[193.78.237.127]>
Message-ID: <14610.49115.820599.172598@cymru.basistech.com>

Just van Rossum writes:
 > Good point. All this taken together still means to me that comparisons
 > between wide and narrow strings should take place at the character level,
 > which implies that coercion from narrow to wide is done at the character
 > level, without looking at the encoding. (Which in my book in turn still
 > implies that as long as we're talking about Unicode, narrow strings are
 > effectively Latin-1.)

Only true if "wide" strings are encoded in UCS-2 or UCS-4. If "wide
characters" are Unicode, but stored in UTF-8 encoding, then you loose.

Hmmmm... how often do you expect to compare narrow vs. wide strings,
using default comparison (i.e. = or !=)? What if I'm using Latin 3 and
use the byte comparison? I may very well have two strings (one narrow,
one wide) that compare equal, even though they're not. Not exactly
what I would expect.

     -tree

[I'm flying from Seattle to Boston today, so eventually I will
 disappear for a while]

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"



From pf at artcom-gmbh.de  Fri May  5 15:13:05 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 5 May 2000 15:13:05 +0200 (MEST)
Subject: [Python-Dev] wide strings vs. Unicode point of view (was Re: [I18n-sig] Unicode st.... alternative)
In-Reply-To: <l03102805b5385e3de8e8@[193.78.237.127]> from Just van Rossum at "May 5, 2000 12:40:49 pm"
Message-ID: <m12nhuj-000CnCC@artcom0.artcom-gmbh.de>

Just van Rossum:
> Exactly. By saying "(wide) strings are not tied to Unicode" the question
> whether wide strings should or should not be sorted according to the
> Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
> too hard anyway"...

I personally like the idea of speaking of "wide strings" containing wide
character codes instead of Unicode objects.

Unfortunately there are many methods which need to interpret the
content of strings according to some encoding knowledge: for example
'upper()', 'lower()', 'swapcase()', 'lstrip()' and so on need to know
to which class certain characters belong.

This problem was already somewhat visible in 1.5.2, since these methods
were available as library functions from the string module and they did
work with a global state maintained by the 'setlocale()' C-library function.
Quoting from the C library man pages:

"""    The details of what constitutes an uppercase or  lowercase
       letter  depend  on  the  current locale.  For example, the
       default "C" locale does not know about umlauts, so no con?
       version is done for them.

       In some non - English locales, there are lowercase letters
       with no corresponding  uppercase  equivalent;  the  German
       sharp s is one example.
"""

I guess applying 'upper' to a chinese char will not make much sense.
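
The effect is easy to demonstrate with the 1.5.2 string module (the exact
German locale name below is a platform-dependent guess):

    import locale, string

    locale.setlocale(locale.LC_CTYPE, "C")
    print string.upper("\xe4")       # umlaut-a comes back unchanged

    locale.setlocale(locale.LC_CTYPE, "de_DE.ISO8859-1")
    print string.upper("\xe4")       # now maps to "\xc4"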

Now these former string module functions were moved into the Python
object core.  So the current Python string and Unicode object API is
somewhat "western centric".  ;-) At least Marc's implementation in
'unicodectype.c' contains the hard-coded assumption that wide strings
really contain Unicode characters.
print u"???".upper().encode("latin1")
shows "???" independent of the locale setting.  This makes sense.
The output from  print u"???".upper().encode()  however looks ugly
here on my screen... UTF-8 ... blech:? ??

Regards and have a nice weekend, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From guido at python.org  Fri May  5 16:49:52 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 05 May 2000 10:49:52 -0400
Subject: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Fri, 05 May 2000 02:21:20 PDT."
             <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org> 
References: <Pine.LNX.4.10.10005050217230.3976-100000@skuld.lfw.org> 
Message-ID: <200005051449.KAA14138@eric.cnri.reston.va.us>

> Guido, have you digested my earlier 'printout' suggestions?

Not quite, except to the point that they require more thought than to
rush them into 1.6.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Fri May  5 16:54:16 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 05 May 2000 10:54:16 -0400
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Your message of "Thu, 04 May 2000 22:22:38 BST."
             <l03102810b5378dda02f5@[193.78.237.126]> 
References: <l03102810b5378dda02f5@[193.78.237.126]> 
Message-ID: <200005051454.KAA14168@eric.cnri.reston.va.us>

> (Boy, is it quiet here all of a sudden ;-)

Maybe because (according to one report on NPR here) 80% of the world's
email systems are victimized by the ILOVEYOU virus?  You & I are not
affected because it's Windows specific (a visual basic script, I got a
copy mailed to me so I could have a good look :-).  Note that there
are already mutations, one of which pretends to be a joke.

> Sorry for the duplication of stuff, but I'd like to reiterate my points, to
> separate them from my implementation proposal, as that's just what it is:
> an implementation detail.
> 
> These things are important to me:
> - get rid of the Unicode-ness of wide strings, in order to
> - make narrow and wide strings as similar as possible
> - implicit conversion between narrow and wide strings should
>   happen purely on the basis of the character codes; no
>   assumption at all should be made about the encoding, ie.
>   what the character code _means_.
> - downcasting from wide to narrow may raise OverflowError if
>   there are characters in the wide string that are > 255
> - str(s) should always return s if s is a string, whether narrow
>   or wide
> - file objects need to be responsible for handling wide strings
> - the above two points should make it possible for
> - if no encoding is known, Unicode is the default, whether
>   narrow or wide
> 
> The above points seem to have the following consequences:
> - the 'u' in \uXXXX notation no longer makes much sense,
>   since it is not neccesary for the character to be a Unicode
>   code point: it's just a 2-byte int. \wXXXX might be an option.
> - the u"" notation is no longer neccesary: if a string literal
>   contains a character > 255 the string should automatically
>   become a wide string.
> - narrow strings should also have an encode() method.
> - the builtin unicode() function might be redundant if:
>   - it is possible to specify a source encoding. I'm not sure if
>     this is best done through an extra argument for encode()
>     or that it should be a new method, eg. transcode().
>   - s.encode() or s.transcode() are allowed to output a wide
>     string, as in aNarrowString.encode("UCS-2") and
>     s.transcode("Mac-Roman", "UCS-2").
> 
> My proposal to extend the "old" string type to be able to contain wide
> strings is of course largely unrelated to all this. Yet it may provide some
> additional C compatibility (especially now that silent conversion to utf-8
> is out) as well as a workaround for the
> str()-having-to-return-a-narrow-string bottleneck.

I'm not so sure that this is enough.  You seem to propose wide strings
as vehicles for 16-bit values (and maybe later 32-bit values) apart
from their encoding.  We already have a data type for that (the array
module).  The Unicode type does a lot more than storing 16-bit values:
it knows lots of encodings to and from Unicode, and it knows things
like which characters are upper or lower or title case and how to map
between them, which characters are word characters, and so on.  All
this is highly Unicode specific and is part of what people ask for
when they request Unicode support.  (Example: Unicode has
405 characters classified as numeric, according to the isnumeric()
method.)
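
That figure is easy to recompute, give or take the Unicode database
version compiled into the interpreter:

    count = 0
    for i in range(0x10000):
        if unichr(i).isnumeric():
            count = count + 1
    print count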

And by the way, don't worry about the comparison.  I'm not changing
the default comparison (==, cmp()) for Unicode strings to be anything
other than per 16-bit quantity.  However a Unicode object might in addition
have a method to do normalization or whatever, as long as it's language
independent and strictly defined by the Unicode standard.
Language-specific operations belong in separate modules.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Fri May  5 17:07:48 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 05 May 2000 11:07:48 -0400
Subject: [Python-Dev] Moving Unicode debate to i18n-sig@python.org
Message-ID: <200005051507.LAA14262@eric.cnri.reston.va.us>

I've moved all my responses to the Unicode debate to the i18n-sig
mailing list, where it belongs.  Please don't cross-post any more.

If you're interested in this issue but aren't subscribed to the
i18n-sig list, please subscribe at
http://www.python.org/mailman/listinfo/i18n-sig/.

To view the archives, go to http://www.python.org/pipermail/i18n-sig/.

See you there!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From jim at digicool.com  Fri May  5 19:09:34 2000
From: jim at digicool.com (Jim Fulton)
Date: Fri, 05 May 2000 13:09:34 -0400
Subject: [Python-Dev] Pickle diffs anyone?
Message-ID: <3913004E.6CC69857@digicool.com>

Someone recently made a cool proposal for utilizing
diffs to save space taken by old versions in
the Zope object database:

  http://www.zope.org/Members/jim/ZODB/ReverseDiffVersioning

To make this work, we need a good way of diffing pickles.

I thought maybe someone here would have some good suggestions.
I do think that the topic is sort of interesting (for some
definition of "interesting" ;).

The page above is a Wiki page. (Wiki is awesome. If you haven't
seen it before, check out http://joyful.com/zwiki/ZWiki.)
If you are a member of zope.org, you can edit the page directly,
which would be fine with me. :)

Jim

--
Jim Fulton           mailto:jim at digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.



From fdrake at acm.org  Fri May  5 19:14:16 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 5 May 2000 13:14:16 -0400 (EDT)
Subject: [Python-Dev] Pickle diffs anyone?
In-Reply-To: <3913004E.6CC69857@digicool.com>
References: <3913004E.6CC69857@digicool.com>
Message-ID: <14611.360.166536.866583@seahag.cnri.reston.va.us>

Jim Fulton writes:
 > To make this work, we need a good way of diffing pickles.

Jim,
  If the basic requirement is for a binary diff facility, perhaps you
should look into XDelta; I think that's available as a C library as
well as a command line tool, so you should be able to hook it in
fairly easily.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From trentm at activestate.com  Fri May  5 19:25:48 2000
From: trentm at activestate.com (Trent Mick)
Date: Fri, 5 May 2000 10:25:48 -0700
Subject: [Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306)
In-Reply-To: <000001bfb336$d4f512a0$0f2d153f@tim>
References: <NDBBKLNNJCFFMINBECLEOEBKCLAA.trentm@ActiveState.com> <000001bfb336$d4f512a0$0f2d153f@tim>
Message-ID: <20000505102548.B25914@activestate.com>

I posted a couple of patches a couple of days ago to correct the string
methods implementing slice-like optional parameters (count, find, index,
rfind, rindex) to properly clamp slice index values to the proper range (any
PyInt or PyLong value is acceptable now). In fact the slice_index() function
that was being used in ceval.c was reused (renamed to _PyEval_SliceIndex).

As well, the other patch changes PyArg_ParseTuple's 'b', 'h', and 'i'
formatters to raise an OverflowError if they overflow.

Trent

p.s. I thought I would whine here for some more attention. Who needs that
Unicode stuff anyway. ;-)



From fw at deneb.cygnus.argh.org  Fri May  5 18:13:42 2000
From: fw at deneb.cygnus.argh.org (Florian Weimer)
Date: 05 May 2000 18:13:42 +0200
Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
In-Reply-To: Just van Rossum's message of "Fri, 5 May 2000 14:17:31 +0100"
References: <l03102805b5385e3de8e8@[193.78.237.127]>
	<l03102810b5378dda02f5@[193.78.237.126]>
	<l03102805b5385e3de8e8@[193.78.237.127]>
	<l03102808b53877a3e392@[193.78.237.127]>
Message-ID: <8766st5615.fsf@deneb.cygnus.argh.org>

Just van Rossum <just at letterror.com> writes:

> Good point. All this taken together still means to me that comparisons
> between wide and narrow strings should take place at the character level,
> which implies that coercion from narrow to wide is done at the character
> level, without looking at the encoding. (Which in my book in turn still
> implies that as long as we're talking about Unicode, narrow strings are
> effectively Latin-1.)

Sorry for jumping in, I've only recently discovered this list. :-/

At the moment, most of the computing world is not Latin-1 but
Windows-12??.  That's why I don't think this is a good idea at all.



From skip at mojam.com  Fri May  5 21:10:24 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 5 May 2000 14:10:24 -0500 (CDT)
Subject: [Python-Dev] Pickle diffs anyone?
In-Reply-To: <3913004E.6CC69857@digicool.com>
References: <3913004E.6CC69857@digicool.com>
Message-ID: <14611.7328.869011.109768@beluga.mojam.com>

    Jim> Someone recently made a cool proposal for utilizing diffs to save
    Jim> space taken by old versions in the Zope object database:

    Jim>   http://www.zope.org/Members/jim/ZODB/ReverseDiffVersioning

    Jim> To make this work, we need a good way of diffing pickles.

Fred already mentioned a candidate library to do diffs.  If that works, the
only other thing I think you'd need to do is guarantee that dicts are
pickled in a consistent fashion, probably by sorting the keys before
enumerating them.
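
Something as simple as this would do for flat dicts (just a sketch;
nested structures would need a recursive treatment):

    import pickle

    def stable_dumps(d):
        # pickle a dict as a sorted list of (key, value) pairs so that
        # equal dicts always produce byte-identical pickles
        items = d.items()
        items.sort()
        return pickle.dumps(items)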

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From trentm at activestate.com  Fri May  5 23:34:48 2000
From: trentm at activestate.com (Trent Mick)
Date: Fri, 5 May 2000 14:34:48 -0700
Subject: [Python-Dev] should a float overflow or just equal 'inf'
Message-ID: <20000505143448.A10731@activestate.com>

Hi all,

I submitted a patch a couple of days ago to have the 'b', 'i', and 'h'
formatters for PyArg_ParseTuple raise an Overflow exception if they overflow
(currently they just silently overflow). Presuming that this is considered a
good idea, should the same be carried over to floats?

Floats don't really overflow, they just equal 'inf'. Would it be more
desirable to raise an Overflow exception for this? I am inclined to think
that this would *not* be desirable, based on the following quote:

"""
the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
    that-can-be-done-ly y'rs  - tim
"""

In any case, the question stands. I don't really have an idea of the
potential pains that this could cause to (1) efficiency, (2) external code
that expects to deal with 'inf's itself. The reason I ask is because I am
looking at related issues in the Python code these days.


Trent
--
Trent Mick
trentm at activestate.com




From tismer at tismer.com  Sat May  6 16:29:07 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sat, 06 May 2000 16:29:07 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <000001bfb4a6$21da7900$922d153f@tim>
Message-ID: <39142C33.507025B5@tismer.com>


Tim Peters wrote:
> 
> [Trent Mick]
> > >>> i = -2147483648
> > OverflowError: integer literal too large
> > >>> i = -2147483648L
> > >>> int(i)   # it *is* a valid integer literal
> > -2147483648
> 
> Python's grammar is such that negative integer literals don't exist; what
> you actually have there is the unary minus operator applied to positive
> integer literals; indeed,

<disassembly snipped>

Well, knowing that there are more negatives than positives
and then coding it this way in fact looks like a design flaw to me.

A simple solution could be to do the opposite:
Always store a negative number and negate it
for positive numbers. A real negative number
would then end up with two UNARY_NEGATIVE
opcodes in sequence. If we had a simple postprocessor
to remove such sequences at the end, we're done.
As another step, it could also adjust all such consts
and remove those opcodes.

This could be a task for Skip's peephole optimizer.
Why did it never go into the core?

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From tim_one at email.msn.com  Sat May  6 21:13:46 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Sat, 6 May 2000 15:13:46 -0400
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39142C33.507025B5@tismer.com>
Message-ID: <000301bfb78f$33e33d80$452d153f@tim>

[Tim]
> Python's grammar is such that negative integer literals don't
> exist; what you actually have there is the unary minus operator
> applied to positive integer literals; ...

[Christian Tismer]
> Well, knowing that there are more negatives than positives
> and then coding it this way appears in fact as a design flaw to me.

Don't know what you're saying here.  Python's grammar has nothing to do with
the relative number of positive vs negative entities; indeed, in a
2's-complement machine it's not even true that there are more negatives than
positives.  Python generates the unary minus for "negative literals"
because, again, negative literals *don't exist* in the grammar.

> A simple solution could be to do the opposite:
> Always store a negative number and negate it
> for positive numbers.  ...

So long as negative literals don't exist in the grammar, "-2147483648" makes
no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
problem" here worth fixing, although if there is <wink>, it will get fixed
by magic as soon as Python ints and longs are unified.





From tim_one at email.msn.com  Sat May  6 21:47:25 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Sat, 6 May 2000 15:47:25 -0400
Subject: [Python-Dev] should a float overflow or just equal 'inf'
In-Reply-To: <20000505143448.A10731@activestate.com>
Message-ID: <000801bfb793$e70c9420$452d153f@tim>

[Trent Mick]
> I submitted a patch a coupld of days ago to have the 'b', 'i', and 'h'
> formatter for PyArg_ParseTuple raise an Overflow exception if
> they overflow (currently they just silently overflow). Presuming that
> this is considered a good idea, should this be carried to floats.
>
> Floats don't really overflow, they just equal 'inf'. Would it be more
> desireable to raise an Overflow exception for this? I am inclined to think
> that this would *not* be desireable based on the following quote:
>
> """
> the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-
>     that-can-be-done-ly y'rs  - tim
> """
>
> In any case, the question stands. I don't really have an idea of the
> potential pains that this could cause to (1) efficiecy, (2) external code
> that expects to deal with 'inf's itself. The reason I ask is because I am
> looking at related issues in the Python code these days.

Alas, this is the tip of a very large project:  while (I believe) *every*
platform Python runs on now is 754-conformant, Python itself has no idea
what it's doing wrt 754 semantics.  In part this is because ISO/ANSI C has
no idea what it's doing either.  C9X (the next C std) is supposed to supply
portable spellings of ways to get at 754 features, but before then there's
simply nothing portable that can be done.

Guido & I already agreed in principle that Python will eventually follow 754
rules, but with the overflow, divide-by-0, and invalid operation exceptions
*enabled* by default (and the underflow and inexact exceptions disabled by
default).  It does this by accident <0.9 wink> already for, e.g.,

>>> 1. / 0.
Traceback (innermost last):
  File "<pyshell#0>", line 1, in ?
    1. / 0.
ZeroDivisionError: float division
>>>

Under the 754 defaults, that should silently return an infinity instead.  But
neither Guido nor I think the latter is reasonable default behavior, and
having done so before in a previous life I can formally justify changing the
defaults a language exposes.

Anyway, once all that is done, float overflow *will* raise an exception (by
default; there will also be a way to turn that off), unlike what happens
today.

Before then, I guess continuing the current policy of benign neglect (i.e.,
let it overflow silently) is best for consistency.  Without access to all
the 754 features in C, it's not even easy to detect overflow now!  "if (x ==
x * 0.5) overflow();" isn't quite good enough, as it can trigger a spurious
underflow error -- there's really no reasonable way to spell this stuff in
portable C now!
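
For what it's worth, the C test quoted above translates to something like
the following in Python -- a sketch only, assuming IEEE-754 doubles, where
zero and the infinities are the only values equal to half of themselves
(the names are mine, nothing here is in the distribution):

    def looks_overflowed(x):
        # True only for +Inf and -Inf: exclude zero, which also satisfies
        # x * 0.5 == x.  Whether the multiply itself raises a spurious
        # underflow for tiny x is up to the platform's 754 settings,
        # which is exactly the problem described above.
        return x != 0.0 and x * 0.5 == x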





From gstein at lyra.org  Sun May  7 12:25:29 2000
From: gstein at lyra.org (Greg Stein)
Date: Sun, 7 May 2000 03:25:29 -0700 (PDT)
Subject: [Python-Dev] buffer object (was: Unicode debate)
In-Reply-To: <390EF3EB.5BCE9EC3@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>

[ damn, I wish people would pay more attention to changing the subject
  line to reflect the contents of the email ... I could not figure out if
  there were any further responses to this without opening most of those
  dang "Unicode debate" emails. sheesh... ]

On Tue, 2 May 2000, M.-A. Lemburg wrote:
> Guido van Rossum wrote:
> > 
> > [MAL]
> > > Let's not do the same mistake again: Unicode objects should *not*
> > > be used to hold binary data. Please use buffers instead.
> > 
> > Easier said than done -- Python doesn't really have a buffer data
> > type.

The buffer object. We *do* have the type.

> > Or do you mean the array module?  It's not trivial to read a
> > file into an array (although it's possible, there are even two ways).
> > Fact is, most of Python's standard library and built-in objects use
> > (8-bit) strings as buffers.

For historical reasons only. It would be very easy to change these to use
buffer objects, except for the simple fact that callers might expect a
*string* rather than something with string-like behavior.

>...
> > > BTW, I think that this behaviour should be changed:
> > >
> > > >>> buffer('binary') + 'data'
> > > 'binarydata'

In several places, bufferobject.c uses PyString_FromStringAndSize(). It
wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then
copy the data in. A new API could also help out here:

  PyBuffer_CopyMemory(void *ptr, int size)


> > > while:
> > >
> > > >>> 'data' + buffer('binary')
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in ?
> > > TypeError: illegal argument type for built-in operation

The string object can't handle the buffer on the right side. Buffer
objects use the buffer interface, so they can deal with strings on the
right. Therefore: asymmetry :-(

> > > IMHO, buffer objects should never coerce to strings, but instead
> > > return a buffer object holding the combined contents. The
> > > same applies to slicing buffer objects:
> > >
> > > >>> buffer('binary')[2:5]
> > > 'nar'
> > >
> > > should preferably be buffer('nar').

Sure. Wouldn't be a problem. The FromStringAndSize() thing.

> > Note that a buffer object doesn't hold data!  It's only a pointer to
> > data.  I can't off-hand explain the asymmetry though.
> 
> Dang, you're right...

Untrue. There is an API call which will construct a buffer object with its
own memory:

  PyObject * PyBuffer_New(int size)

The resulting buffer object will be read/write, and you can stuff values
into it using the slice notation.


> > > Hmm, perhaps we need something like a data string object
> > > to get this 100% right ?!

Nope. The buffer object is intended to be exactly this.

>...
> > Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> > which no "string literal" notations exist.
> 
> Anyway, one way or another I think we should make it clear
> to users that they should start using some other type for
> storing binary data.

Buffer objects. There are a couple changes to make this a bit easier for
people:

1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to
   create a read/write buffer of a particular size. buffer() should create
   a zero-length read/write buffer.

2) if slice assignment is updated to allow changes to the length (for
   example: buf[1:2] = 'abcdefgh'), then the buffer object definition must
   change. Specifically: when the buffer object owns the memory, it does
   this by appending the memory after the PyObject_HEAD and setting its
   internal pointer to it; when the dealloc() occurs, the target memory
   goes with the object. A flag would need to be added to tell the buffer
   object to do a second free() for the case where a realloc has returned
   a new pointer.
   [ I'm not sure that I would agree with this change, however; but it
     does make them a bit easier to work with; on the other hand, people
     have been working with immutable strings for a long time, so they're
     okay with concatenation, so I'm okay with saying length-altering
     operations must simply be done thru concatenation. ]


IMO, extensions should be using the buffer object for raw bytes. I know
that Mark has been updating some of the Win32 extensions to do this.
Python programs could use the objects if the buffer() builtin is tweaked
to allow a bit more flexibility in the arguments.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sun May  7 13:09:45 2000
From: gstein at lyra.org (Greg Stein)
Date: Sun, 7 May 2000 04:09:45 -0700 (PDT)
Subject: [Python-Dev] introducing byte arrays in 1.6 (was: Unicode debate)
In-Reply-To: <200005031216.IAA03274@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005070406200.7610-100000@nebula.lyra.org>

On Wed, 3 May 2000, Guido van Rossum wrote:
>...
> My ASCII proposal is a compromise that tries to be fair to both uses
> for strings.  Introducing byte arrays as a more fundamental type has
> been on the wish list for a long time -- I see no way to introduce
> this into Python 1.6 without totally botching the release schedule
> (June 1st is very close already!).  I'd like to be able to move on,
> there are other important things still to be added to 1.6 (Vladimir's
> malloc patches, Neil's GC, Fredrik's completed sre...).
> 
> For 1.7 (which should happen later this year) I promise I'll reopen
> the discussion on byte arrays.

See my other note. I think a simple change to the buffer() builtin would
allow read/write byte arrays to be simply constructed.

There are a couple API changes that could be made to bufferobject.[ch]
which could simplify some operations for C code and returning buffer
objects. But changes like that would be preconditioned on accepting the
change in return type from those extensions. For example, the doc may say
something returns a string; while buffer objects are similar to strings in
operation, they are not the *same*. IMO, Python 1.7 would be a good time
to alter return types to buffer objects as appropriate. (but I'm not
averse to doing it today! (to get people used to the difference in
purposes))

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bckfnn at worldonline.dk  Sun May  7 15:37:21 2000
From: bckfnn at worldonline.dk (Finn Bock)
Date: Sun, 07 May 2000 13:37:21 GMT
Subject: [Python-Dev] buffer object
In-Reply-To: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
Message-ID: <39156208.13412015@smtp.worldonline.dk>

[Greg Stein]

>IMO, extensions should be using the buffer object for raw bytes. I know
>that Mark has been updating some of the Win32 extensions to do this.
>Python programs could use the objects if the buffer() builtin is tweaked
>to allow a bit more flexibility in the arguments.

Forgive me for rewinding this to the very beginning. But what is a
buffer object useful for? I'm trying to think about buffer objects in terms
of jpython, so my primary interest is the user experience of buffer
objects.

Please correct my misunderstandings.

- There is not a buffer protocol exposed to python objects (in the way
  the sequence protocol __getitem__ & friends are exposed).
- A buffer object typically gives access to the raw bytes which
  underlie the backing object, regardless of the structure of the
  bytes.
- It is only intended for objects which have a natural byte storage to
  implement the buffer interface.
- Of the builtin objects only string, unicode and array support the
  buffer interface.
- When slicing a buffer object, the result is always a string regardless
  of the buffer object base.
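
To make the CPython side of that concrete, a small (untested here) session
along these lines shows the behaviour being described -- the buffer wraps
the array's raw bytes, and slicing or concatenation hands back a plain
8-bit string:

    import array

    a = array.array('b', [0, 1, 2])
    b = buffer(a)                  # wraps the array's raw byte storage
    print repr(b[0:3])             # a plain string: '\x00\x01\x02'
    print type(b + 'xyz')          # concatenation coerces to a string too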


In jpython, only byte arrays like jarrays.array('b', [0,1,2]) can be
said to have some natural byte storage. The jpython string type doesn't.
It would take some awful bit shifting to present a jpython string as an
array of bytes.

Would it make any sense to have a buffer object which only accepts a byte
array as base? So that jpython would say:

>>> buffer("abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected


Would it make sense to tell python users that they cannot depend on the
portability of using strings (both 8bit and 16bit) as buffer object
base?


Because it is so difficult to look at java storage as a sequence of
bytes, I think I'm all for keeping the buffer() builtin and buffer
object as obscure and unknown as possible <wink>.

regards,
finn



From guido at python.org  Sun May  7 23:29:43 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 07 May 2000 17:29:43 -0400
Subject: [Python-Dev] buffer object
In-Reply-To: Your message of "Sun, 07 May 2000 13:37:21 GMT."
             <39156208.13412015@smtp.worldonline.dk> 
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>  
            <39156208.13412015@smtp.worldonline.dk> 
Message-ID: <200005072129.RAA15850@eric.cnri.reston.va.us>

[Finn Bock]

> Forgive me for rewinding this to the very beginning. But what is a
> buffer object useful for? I'm trying to think about buffer objects in terms
> of jpython, so my primary interest is the user experience of buffer
> objects.
> 
> Please correct my misunderstandings.
> 
> - There is not a buffer protocol exposed to python objects (in the way
>   the sequence protocol __getitem__ & friends are exposed).
> - A buffer object typically gives access to the raw bytes which
>   underlie the backing object, regardless of the structure of the
>   bytes.
> - It is only intended for objects which have a natural byte storage to
>   implement the buffer interface.

All true.

> - Of the builtin objects only string, unicode and array support the
>   buffer interface.

And the new mmap module.

> - When slicing a buffer object, the result is always a string regardless
>   of the buffer object base.
> 
> In jpython, only byte arrays like jarrays.array('b', [0,1,2]) can be
> said to have some natural byte storage. The jpython string type doesn't.
> It would take some awful bit shifting to present a jpython string as an
> array of bytes.

I don't recall why JPython has jarray instead of array -- how do they
differ?  I think it's a shame that similar functionality is embodied
in different APIs.

> Would it make any sense to have a buffer object which only accepts a byte
> array as base? So that jpython would say:
> 
> >>> buffer("abc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: buffer object expected
> 
> 
> Would it make sense to tell python users that they cannot depend on the
> portability of using strings (both 8bit and 16bit) as buffer object
> base?

I think that the portability of many string properties is in danger
with the Unicode proposal.  Supporting this in the next version of
JPython will be a bit tricky.

> Because it is so difficult to look at java storage as a sequence of
> bytes, I think I'm all for keeping the buffer() builtin and buffer
> object as obscure and unknown as possible <wink>.

I basically agree, and in a private email to Greg Stein I've told him
this.  I think that the array module should be promoted to a built-in
function/type, and should be the recommended solution for data
storage.  The buffer API should remain a C-level API, and the buffer()
built-in should be labeled with "for experts only".
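
As a footnote, the "read a file into an array" exercise mentioned earlier
in the thread can be spelled in (at least) the two ways below; the filename
is only a placeholder, and this is a sketch rather than a recommendation:

    import array, os

    a = array.array('b')
    f = open('data.bin', 'rb')                    # placeholder file
    a.fromstring(f.read())                        # way one: via a string
    f.close()

    b = array.array('b')
    f = open('data.bin', 'rb')
    b.fromfile(f, os.path.getsize('data.bin'))    # way two: straight from the file
    f.close()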

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Mon May  8 10:33:01 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 08 May 2000 10:33:01 +0200
Subject: [Python-Dev] buffer object (was: Unicode debate)
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>
Message-ID: <39167BBD.88EB2C64@lemburg.com>

Greg Stein wrote:
> 
> [ damn, I wish people would pay more attention to changing the subject
>   line to reflect the contents of the email ... I could not figure out if
>   there were any further responses to this without opening most of those
>   dang "Unicode debate" emails. sheesh... ]
> 
> On Tue, 2 May 2000, M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> > >
> > > [MAL]
> > > > Let's not do the same mistake again: Unicode objects should *not*
> > > > be used to hold binary data. Please use buffers instead.
> > >
> > > Easier said than done -- Python doesn't really have a buffer data
> > > type.
> 
> The buffer object. We *do* have the type.
> 
> > > Or do you mean the array module?  It's not trivial to read a
> > > file into an array (although it's possible, there are even two ways).
> > > Fact is, most of Python's standard library and built-in objects use
> > > (8-bit) strings as buffers.
> 
> For historical reasons only. It would be very easy to change these to use
> buffer objects, except for the simple fact that callers might expect a
> *string* rather than something with string-like behavior.

Would this be too drastic a change, then ? I think that we should
at least make use of buffers in the standard lib.

>
> >...
> > > > BTW, I think that this behaviour should be changed:
> > > >
> > > > >>> buffer('binary') + 'data'
> > > > 'binarydata'
> 
> In several places, bufferobject.c uses PyString_FromStringAndSize(). It
> wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then
> copy the data in. A new API could also help out here:
> 
>   PyBuffer_CopyMemory(void *ptr, int size)
> 
> > > > while:
> > > >
> > > > >>> 'data' + buffer('binary')
> > > > Traceback (most recent call last):
> > > >   File "<stdin>", line 1, in ?
> > > > TypeError: illegal argument type for built-in operation
> 
> The string object can't handle the buffer on the right side. Buffer
> objects use the buffer interface, so they can deal with strings on the
> right. Therefore: asymmetry :-(
> 
> > > > IMHO, buffer objects should never coerce to strings, but instead
> > > > return a buffer object holding the combined contents. The
> > > > same applies to slicing buffer objects:
> > > >
> > > > >>> buffer('binary')[2:5]
> > > > 'nar'
> > > >
> > > > should preferably be buffer('nar').
> 
> Sure. Wouldn't be a problem. The FromStringAndSize() thing.

Right.
 
Before digging deeper into this, I think we should hear
Guido's opinion on this again: he said that he wanted to
use Java's binary arrays for binary data... perhaps we
need to tweak the array type and make it more directly
accessible (from C and Python) instead.

> > > Note that a buffer object doesn't hold data!  It's only a pointer to
> > > data.  I can't off-hand explain the asymmetry though.
> >
> > Dang, you're right...
> 
> Untrue. There is an API call which will construct a buffer object with its
> own memory:
> 
>   PyObject * PyBuffer_New(int size)
> 
> The resulting buffer object will be read/write, and you can stuff values
> into it using the slice notation.

Yes, but that API is not reachable from within Python,
AFAIK.
 
> > > > Hmm, perhaps we need something like a data string object
> > > > to get this 100% right ?!
> 
> Nope. The buffer object is intended to be exactly this.
> 
> >...
> > > Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> > > which no "string literal" notations exist.
> >
> > Anyway, one way or another I think we should make it clear
> > to users that they should start using some other type for
> > storing binary data.
> 
> Buffer objects. There are a couple changes to make this a bit easier for
> people:
> 
> 1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to
>    create a read/write buffer of a particular size. buffer() should create
>    a zero-length read/write buffer.

This looks a lot like function overloading... I don't think we
should get into this: how about having the buffer() API take
keywords instead ?!

buffer(size=1024,mode='rw') - 1K of owned read write memory
buffer(obj) - read-only referenced memory from obj
buffer(obj,mode='rw') - read-write referenced memory in obj

etc.

Or we could allow passing None as object to obtain an owned
read-write memory block (much like passing NULL to the
C functions).

> 2) if slice assignment is updated to allow changes to the length (for
>    example: buf[1:2] = 'abcdefgh'), then the buffer object definition must
>    change. Specifically: when the buffer object owns the memory, it does
>    this by appending the memory after the PyObject_HEAD and setting its
>    internal pointer to it; when the dealloc() occurs, the target memory
>    goes with the object. A flag would need to be added to tell the buffer
>    object to do a second free() for the case where a realloc has returned
>    a new pointer.
>    [ I'm not sure that I would agree with this change, however; but it
>      does make them a bit easier to work with; on the other hand, people
>      have been working with immutable strings for a long time, so they're
>      okay with concatenation, so I'm okay with saying length-altering
>      operations must simply be done thru concatenation. ]

I don't think I like this either: what happens when the buffer
doesn't own the memory ?
 
> IMO, extensions should be using the buffer object for raw bytes. I know
> that Mark has been updating some of the Win32 extensions to do this.
> Python programs could use the objects if the buffer() builtin is tweaked
> to allow a bit more flexibility in the arguments.

Right.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From bckfnn at worldonline.dk  Mon May  8 21:44:27 2000
From: bckfnn at worldonline.dk (Finn Bock)
Date: Mon, 08 May 2000 19:44:27 GMT
Subject: [Python-Dev] buffer object
In-Reply-To: <200005072129.RAA15850@eric.cnri.reston.va.us>
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org>   <39156208.13412015@smtp.worldonline.dk>  <200005072129.RAA15850@eric.cnri.reston.va.us>
Message-ID: <3917074c.8837607@smtp.worldonline.dk>

[Guido]

>I don't recall why JPython has jarray instead of array -- how do they
>differ?  I think it's a shame that similar functionality is embodied
>in different APIs.

The jarray module is a paper thin factory for the PyArray type which is
primarily (I believe) a wrapper around any existing java array instance.
It exists to make arrays returned from java code useful for jpython.
Since a PyArray must always wrap the original java array, it cannot
resize the array.

In contrast an array instance would own the memory and can resize it as
necessary.

Due to the different purposes I agree with Jim's decision of making the
two modules incompatible. And they are truly incompatible: jarray.array
has reversed the (typecode, seq) arguments.
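
Spelled out, the incompatibility looks roughly like this (the JPython call
is written from the description above, not retested here):

    try:
        import jarray                          # JPython
        a = jarray.array([0, 1, 2], 'b')       # (sequence, typecode)
    except ImportError:
        import array                           # CPython
        a = array.array('b', [0, 1, 2])        # (typecode, sequence)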

OTOH creating a mostly compatible array module for jpython should not be
too hard.
 
regards,
finn





From guido at python.org  Mon May  8 21:55:50 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 15:55:50 -0400
Subject: [Python-Dev] buffer object
In-Reply-To: Your message of "Mon, 08 May 2000 19:44:27 GMT."
             <3917074c.8837607@smtp.worldonline.dk> 
References: <Pine.LNX.4.10.10005070308370.7610-100000@nebula.lyra.org> <39156208.13412015@smtp.worldonline.dk> <200005072129.RAA15850@eric.cnri.reston.va.us>  
            <3917074c.8837607@smtp.worldonline.dk> 
Message-ID: <200005081955.PAA21928@eric.cnri.reston.va.us>

> >I don't recall why JPython has jarray instead of array -- how do they
> >differ?  I think it's a shame that similar functionality is embodied
> >in different APIs.
> 
> The jarray module is a paper thin factory for the PyArray type which is
> primarily (I believe) a wrapper around any existing java array instance.
> It exists to make arrays returned from java code useful for jpython.
> Since a PyArray must always wrap the original java array, it cannot
> resize the array.

Understood.  This is a bit like the buffer API in CPython then (except
for Greg's vision where the buffer object manages storage as well :-).

> In contrast an array instance would own the memory and can resize it as
> necessary.

OK, this makes sense.

> Due to the different purposes I agree with Jim's decision of making the
> two modules incompatible. And they are truly incompatible: jarray.array
> has reversed the (typecode, seq) arguments.

This I'm not so sure of.  Why be different just to be different?

> OTOH creating a mostly compatible array module for jpython should not be
> too hard.

OK, when we make array() a built-in, this should be done for Java too.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Mon May  8 22:29:21 2000
From: trentm at activestate.com (Trent Mick)
Date: Mon, 8 May 2000 13:29:21 -0700
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <200005081400.KAA19889@eric.cnri.reston.va.us>
References: <20000503161656.A20275@activestate.com> <200005081400.KAA19889@eric.cnri.reston.va.us>
Message-ID: <20000508132921.A31981@activestate.com>

On Mon, May 08, 2000 at 10:00:30AM -0400, Guido van Rossum wrote:
> > Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an
> > Overflow exception if they overflow (previously they just silently
> > overflowed).
> 
> Trent,
> 
> There's one issue with this: I believe the 'b' format is mostly used
> with unsigned character arguments in practice.
> However on systems
> with default signed characters, CHAR_MAX is 127 and values 128-255 are
> rejected.  I'll change the overflow test to:
> 
> 	else if (ival > CHAR_MAX && ival >= 256) {
> 
> if that's okay with you.
> 
Okay, I guess. Two things:

1. In a way this defeats the main purpose of the checks. Now a silent overflow
could happen for a signed byte value over CHAR_MAX. The only way to
automatically do the bounds checking is if the exact type is known, i.e.
different formatters for signed and unsigned integral values. I don't know if
this is desired (is it?). The obvious choice of 'u' prefixes to specify
unsigned is obviously not an option.

Another option might be to document 'b' as for unsigned chars and 'h', 'i',
'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX]
for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
limiting the user to unsigned or signed depending on the formatter. (Which
again, means that it would be nice to have different formatters for signed
and unsigned.) I think that the bounds checking is false security unless
these restrictions are made.


2. The above aside, I would be more inclined to change the line in question to:

   else if (ival > UCHAR_MAX) {

as this is more explicit about what is being done.

> Another issue however is that there are probably cases where an 'i'
> format is used (which can't overflow on 32-bit architectures) but
> where the int value is then copied into a short field without an
> additional check...  I'm not sure how to fix this except by a complete
> inspection of all code...  Not clear if it's worth it.

Yes, a complete code inspection seems to be the only way. That is some of
what I am doing. Again, I have two questions:

1. There are a fairly large number of downcasting cases in the Python code
(not necessarily tied to PyArg_ParseTuple results). I was wondering if you
think a generalized check on each such downcast would be advisable. This
would take the form of some macro that would do a bounds check before doing
the cast. For example (a common one is the cast of strlen's size_t return
value to int, because Python strings use int for their length, this is a
downcast on 64-bit systems):

  size_t len = strlen(s);
  obj = PyString_FromStringAndSize(s, len);

would become
  
  size_t len = strlen(s);
  obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));

CAST_TO_INT would ensure that 'len' did not overflow and would raise an
exception otherwise.

Pros:

- should never have to worry about overflows again
- easy to find (given MSC warnings) and easy to code in (straightforward)

Cons:

- more code, more time to execute
- looks ugly
- have to check PyErr_Occurred every time a cast is done


I would like other people's opinion on this kind of change. There are three
possible answers:

  +1 this is a bad change idea because...<reason>
  -1 this is a good idea, go for it
  +0 (most likely) This is probably a good idea for some case where the
     overflow *could* happen, however the strlen example that you gave is
	 *not* such a situation. As Tim Peters said: 2GB limit on string lengths
	 is a good assumption/limitation.



2. Microsoft's compiler gives good warnings for casts where information loss
is possible. However, I cannot find a way to get similar warnings from gcc.
Does anyone know if that is possible? I.e.

	int i = 123456;
	short s = i;  // should warn about possible loss of information

should give a compiler warning.


Thanks,
Trent

-- 
Trent Mick
trentm at activestate.com



From trentm at activestate.com  Mon May  8 23:26:51 2000
From: trentm at activestate.com (Trent Mick)
Date: Mon, 8 May 2000 14:26:51 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005081416.KAA20158@eric.cnri.reston.va.us>
References: <20000505135817.A9859@activestate.com> <200005081416.KAA20158@eric.cnri.reston.va.us>
Message-ID: <20000508142651.C8000@activestate.com>

On Mon, May 08, 2000 at 10:16:42AM -0400, Guido van Rossum wrote:
> > The patch to config.h looks big but it really is not. These are the effective
> > changes:
> > - MS_WINxx are keyed off _WINxx
> > - SIZEOF_VOID_P is set to 8 for Win64
> > - COMPILER string is changed appropriately for Win64
>
> One thing worries me: if COMPILER is changed, that changes
> sys.platform to "win64", right?  I'm sure that will break plenty of
> code which currently tests for sys.platform=="win32" but really wants
> to test for any form of Windows.  Maybe sys.platform should remain
> win32?
> 

No, but yes. :( Actually I forgot to mention that my config.h patch changes
the PLATFORM #define from win32 to win64. So yes, you are correct. And, yes
(Sigh) you are right that this will break tests for sys.platform == "win32".

So I guess the simplest thing to do is to leave it as win32 following the
same reasoning for defining MS_WIN32 on Win64:

>  The idea is that the common case is
>  that code specific to Win32 will also work on Win64 rather than being
>  specific to Win32 (i.e. there is more the same than different in Win32 and
>  Win64).
 

What if someone needs to do something in Python code for either Win32 or
Win64 but not both? Or should this never be necessary (not likely). I would
like Mark H's opinion on this stuff.


Trent

-- 
Trent Mick
trentm at activestate.com



From tismer at tismer.com  Mon May  8 23:52:54 2000
From: tismer at tismer.com (Christian Tismer)
Date: Mon, 08 May 2000 23:52:54 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <000301bfb78f$33e33d80$452d153f@tim>
Message-ID: <39173736.2A776348@tismer.com>


Tim Peters wrote:
> 
> [Tim]
> > Python's grammar is such that negative integer literals don't
> > exist; what you actually have there is the unary minus operator
> > applied to positive integer literals; ...
> 
> [Christian Tismer]
> > Well, knowing that there are more negatives than positives
> > and then coding it this way appears in fact as a design flaw to me.
> 
> Don't know what you're saying here. 

On a 2's-complement machine, there are 2**(n-1) negatives, zero, and
2**(n-1)-1 positives. The most negative number cannot be inverted.
Most machines today use the 2's complement.

> Python's grammar has nothing to do with
> the relative number of positive vs negative entities; indeed, in a
> 2's-complement machine it's not even true that there are more negatives than
> positives. 

If I read this as a 1's-complement machine then I believe it.
But we don't need to split hairs on known stuff :-)

> Python generates the unary minus for "negative literals"
> because, again, negative literals *don't exist* in the grammar.

Yes. If I know the facts and don't build negative literals into
the grammar, then I call it an oversight. Not too bad but not nice.

> > A simple solution could be to do the opposite:
> > Always store a negative number and negate it
> > for positive numbers.  ...
> 
> So long as negative literals don't exist in the grammar, "-2147483648" makes
> no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> problem" here worth fixing, although if there is <wink>, it will get fixed
> by magic as soon as Python ints and longs are unified.

I'd change the grammar.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From gstein at lyra.org  Mon May  8 23:54:31 2000
From: gstein at lyra.org (Greg Stein)
Date: Mon, 8 May 2000 14:54:31 -0700 (PDT)
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39173736.2A776348@tismer.com>
Message-ID: <Pine.LNX.4.10.10005081452130.18798-100000@nebula.lyra.org>

On Mon, 8 May 2000, Christian Tismer wrote:
>...
> > So long as negative literals don't exist in the grammar, "-2147483648" makes
> > no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> > problem" here worth fixing, although if there is <wink>, it will get fixed
> > by magic as soon as Python ints and longs are unified.
> 
> I'd change the grammar.

That would be very difficult, with very little positive benefit. As Mark
said, use 0x80000000 if you want that number.

Consider that the grammar would probably want to deal with things like
  - 1234
or
  -0xA

Instead, the grammar sees two parts: "-" and "NUMBER" without needing to
complicate the syntax for NUMBER.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From tismer at tismer.com  Tue May  9 00:09:43 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 09 May 2000 00:09:43 +0200
Subject: [Python-Dev] Cannot declare the largest integer literal.
References: <Pine.LNX.4.10.10005081452130.18798-100000@nebula.lyra.org>
Message-ID: <39173B27.4B3BEB40@tismer.com>


Greg Stein wrote:
> 
> On Mon, 8 May 2000, Christian Tismer wrote:
> >...
> > > So long as negative literals don't exist in the grammar, "-2147483648" makes
> > > no sense on a 2's-complement machine with 32-bit C longs.  There isn't "a
> > > problem" here worth fixing, although if there is <wink>, it will get fixed
> > > by magic as soon as Python ints and longs are unified.
> >
> > I'd change the grammar.
> 
> That would be very difficult, with very little positive benefit. As Mark
> said, use 0x80000000 if you want that number.
> 
> Consider that the grammar would probably want to deal with things like
>   - 1234
> or
>   -0xA
> 
> Instead, the grammar sees two parts: "-" and "NUMBER" without needing to
> complicate the syntax for NUMBER.

Right. That was the reason for my first, dumb, proposal:
Always interpret a number as negative and negate it once more.
That makes it positive. In a post process, remove double-negates.
This leaves negations always where they are allowed: On negatives.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From gstein at lyra.org  Tue May  9 00:11:00 2000
From: gstein at lyra.org (Greg Stein)
Date: Mon, 8 May 2000 15:11:00 -0700 (PDT)
Subject: [Python-Dev] Cannot declare the largest integer literal.
In-Reply-To: <39173B27.4B3BEB40@tismer.com>
Message-ID: <Pine.LNX.4.10.10005081508490.18798-100000@nebula.lyra.org>

On Tue, 9 May 2000, Christian Tismer wrote:
>...
> Right. That was the reason for my first, dumb, proposal:
> Always interpret a number as negative and negate it once more.
> That makes it positive. In a post process, remove double-negates.
> This leaves negations always where they are allowed: On negatives.

IMO, that is a non-intuitive hack. It would increase the complexity of
Python's parsing internals. Again, with little measurable benefit.

I do not believe that I've run into a case of needing -2147483648 in the
source of one of my programs. If I had, then I'd simply switch to
0x80000000 and/or assign it to INT_MIN.

-1 on making Python more complex to support this single integer value.
   Users should be pointed to 0x80000000 to represent it. (a FAQ entry
   and/or comment in the language reference would be a Good Thing)
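
For reference, the workaround looks like this on a 32-bit build, where a
plain Python int is a 32-bit C long (this just restates the thread, it is
not a fresh test):

    # The hex spelling is read as a 32-bit pattern, so it already *is*
    # the most negative plain int:
    most_negative = 0x80000000        # == -2147483648 on a 32-bit build
    # The decimal spelling  -2147483648  fails instead, because the
    # literal 2147483648 is out of range before the unary minus is
    # ever applied.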


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From mhammond at skippinet.com.au  Tue May  9 00:15:17 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 08:15:17 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <20000508142651.C8000@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au>

[Trent]
> What if someone needs to do something in Python code for either Win32 or
> Win64 but not both? Or should this never be necessary (not
> likely). I would
> like Mark H's opinion on this stuff.

OK :-)

I have always thought that it _would_ move to "win64", and the official way
of checking for "Windows" will be sys.platform[:3]=="win".

In fact, I've noticed Guido use this idiom (both stand-alone, and as: if
sys.platform[:3] in ["win", "mac"])

It will no doubt cause a bit of pain, but IMO it is cleaner...

Mark.




From guido at python.org  Tue May  9 04:14:07 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 22:14:07 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 08:15:17 +1000."
             <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> 
Message-ID: <200005090214.WAA22419@eric.cnri.reston.va.us>

> [Trent]
> > What if someone needs to do something in Python code for either Win32 or
> > Win64 but not both? Or should this never be necessary (not
> > likely). I would
> > like Mark H's opinion on this stuff.

[Mark]
> OK :-)
> 
> I have always thought that it _would_ move to "win64", and the official way
> of checking for "Windows" will be sys.platform[:3]=="win".
> 
> > In fact, I've noticed Guido use this idiom (both stand-alone, and as: if
> sys.platform[:3] in ["win", "mac"])
> 
> It will no doubt cause a bit of pain, but IMO it is cleaner...

Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
symbol is defined even on Win64 systems -- to test for Win64, you must
test the _WIN64 symbol.  The two variants are more similar than they
are different.

While testing sys.platform isn't quite the same thing, I think that
the same reasoning goes: a win64 system is everything that a win32
system is, and then some.

So I'd vote for leaving sys.platform alone (i.e. "win32" in both
cases), and providing another way to test for win64-ness.

I wish we had had the foresight to set sys.platform to 'windows', but
since we hadn't, I think we'll have to live with the consequences.

The changes that Trent had to make in the standard library are only
the tip of the iceberg...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  9 04:24:50 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 22:24:50 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: Your message of "Mon, 08 May 2000 13:29:21 PDT."
             <20000508132921.A31981@activestate.com> 
References: <20000503161656.A20275@activestate.com> <200005081400.KAA19889@eric.cnri.reston.va.us>  
            <20000508132921.A31981@activestate.com> 
Message-ID: <200005090224.WAA22457@eric.cnri.reston.va.us>

[Trent]
> > > Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an
> > > Overflow exception if they overflow (previously they just silently
> > > overflowed).

[Guido]
> > There's one issue with this: I believe the 'b' format is mostly used
> > with unsigned character arguments in practice.
> > However on systems
> > with default signed characters, CHAR_MAX is 127 and values 128-255 are
> > rejected.  I'll change the overflow test to:
> > 
> > 	else if (ival > CHAR_MAX && ival >= 256) {
> > 
> > if that's okay with you.

[Trent]
> Okay, I guess. Two things:
> 
> 1. In a way this defeats the main purpose of the checks. Now a silent overflow
> could happen for a signed byte value over CHAR_MAX. The only way to
> automatically do the bounds checking is if the exact type is known, i.e.
> different formatters for signed and unsigned integral values. I don't know if
> this is desired (is it?). The obvious choice of 'u' prefixes to specify
> unsigned is obviously not an option.

The struct module uses upper case for unsigned.  I think this is
overkill here, and would add a lot of code (if applied systematically)
that would rarely be used.
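
For reference, the struct convention looks like this -- the same byte read
back signed or unsigned depending on the case of the format code (a quick
sketch, not part of the patch under discussion):

    import struct

    print struct.unpack('B', struct.pack('B', 200))   # (200,)  unsigned byte
    print struct.unpack('b', struct.pack('B', 200))   # (-56,)  same bits, signed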

> Another option might be to document 'b' as for unsigned chars and 'h', 'i',
> 'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX]
> for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
> limiting the user to unsigned or signed depending on the formatter. (Which
> again, means that it would be nice to have different formatters for signed
> and unsigned.) I think that the bounds checking is false security unless
> these restrictions are made.

I like this: 'b' is unsigned, the others are signed.

> 2. The above aside, I would be more inclined to change the line in question to:
> 
>    else if (ival > UCHAR_MAX) {
> 
> as this is more explicit about what is being done.

Agreed.

> > Another issue however is that there are probably cases where an 'i'
> > format is used (which can't overflow on 32-bit architectures) but
> > where the int value is then copied into a short field without an
> > additional check...  I'm not sure how to fix this except by a complete
> > inspection of all code...  Not clear if it's worth it.
> 
> Yes, a complete code inspection seems to be the only way. That is some of
> what I am doing. Again, I have two questions:
> 
> 1. There are a fairly large number of downcasting cases in the Python code
> (not necessarily tied to PyArg_ParseTuple results). I was wondering if you
> think a generalized check on each such downcast would be advisable. This
> would take the form of some macro that would do a bounds check before doing
> the cast. For example (a common one is the cast of strlen's size_t return
> value to int, because Python strings use int for their length, this is a
> downcast on 64-bit systems):
> 
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, len);
> 
> would become
>   
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
> 
> CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> exception otherwise.
> 
> Pros:
> 
> - should never have to worry about overflows again
> - easy to find (given MSC warnings) and easy to code in (straightforward)
> 
> Cons:
> 
> - more code, more time to execute
> - looks ugly
> - have to check PyErr_Occurred every time a cast is done

How would the CAST_TO_INT macro signal an error?  C doesn't have
exceptions.  If we have to add checks, I'd prefer to write

  size_t len = strlen(s);
  if (INT_OVERFLOW(len))
     return NULL; /* Or whatever is appropriate in this context */
  obj = PyString_FromStringAndSize(s, len);

> I would like other people's opinion on this kind of change. There are three
> possible answers:
> 
>   +1 this is a bad change idea because...<reason>
>   -1 this is a good idea, go for it
>   +0 (most likely) This is probably a good idea for some case where the
>      overflow *could* happen, however the strlen example that you gave is
> 	 *not* such a situation. As Tim Peters said: 2GB limit on string lengths
> 	 is a good assumption/limitation.

-0

> 2. Microsoft's compiler gives good warnings for casts where information loss
> is possible. However, I cannot find a way to get similar warnings from gcc.
> Does anyone know if that is possible? I.e.
> 
> 	int i = 123456;
> 	short s = i;  // should warn about possible loss of information
> 
> should give a compiler warning.

Beats me :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mhammond at skippinet.com.au  Tue May  9 04:29:50 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 12:29:50 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au>

> > It will no doubt cause a bit of pain, but IMO it is cleaner...
>
> Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
> symbol is defined even on Win64 systems -- to test for Win64, you must
> test the _WIN64 symbol.  The two variants are more similar than they
> are different.

Yes, but still, one day, (if MS have their way :-) win32 will be "legacy".

eg, imagine we were having the same debate about 5 years ago, but there was
a more established Windows 3.1 port available.

If we believed the hype, we probably _would_ have gone with "windows" for
both platforms, in the hope that they are more similar than different
(after all, that _was_ the story back then).

> The changes that Trent had to make in the standard library are only
> the tip of the iceberg...

Yes, but OTOH, the fact we explicitly use "win32" means people shouldn't
really expect code to work on Win64.  If nothing else, it will be a good
opportunity to examine the situation as each occurrence is found.  It will
be quite some time before many people play with the Win64 port seriously
(just like the first NT ports when I first came on the scene :-)

So, I remain a +0 on this - i.e., I don't really care personally, but think
"win64" is the right thing.  In any case, I'm happy to rely on Guido's time
machine...

Mark.




From mhammond at skippinet.com.au  Tue May  9 04:36:59 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 12:36:59 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au>

One more data point:

Windows CE uses "wince", and I certainly don't believe this should be
"win32" (although if you read the CE marketing stuff, they would have you
believe it is close enough that we should :-).

So to be _truly_ "windows portable", you will still need [:3]=="win" anyway
:-)

Mark.




From guido at python.org  Tue May  9 05:16:34 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 23:16:34 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 12:29:50 +1000."
             <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au> 
Message-ID: <200005090316.XAA22614@eric.cnri.reston.va.us>

To help me understand the significance of win64 vs. win32, can you
list the major differences?  I thought that the main thing was that
pointers are 64 bits, and that otherwise the APIs are the same.  In
fact, I don't know if WIN64 refers to Windows running on 64-bit
machines (e.g. Alphas) only, or that it is possible to have win64 on a
32-bit machine (e.g. Pentium).

If it's mostly a matter of pointer size, this is almost completely
hidden at the Python level, and I don't think it's worth changing the
platform name.  All of the changes that Trent found were really tests
for the presence of Windows APIs like the registry...

I could defend calling it Windows in comments but having sys.platform
be "win32".  Like uname on Solaris 2.7 returns SunOS 5.7 -- there's
too much old code that doesn't deserve to be broken.  (And it's not
like we have an excuse that it was always documented this way -- this
wasn't documented very clearly at all...)

It's-spelt-Raymond-Luxury-Yach-t-but-it's-pronounced-Throatwobbler-Mangrove,

--Guido van Rossum (home page: http://www.python.org/~guido/)




From guido at python.org  Tue May  9 05:19:19 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 08 May 2000 23:19:19 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 12:36:59 +1000."
             <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBCEHMCKAA.mhammond@skippinet.com.au> 
Message-ID: <200005090319.XAA22627@eric.cnri.reston.va.us>

> Windows CE uses "wince", and I certainly don't believe this should be
> "win32" (although if you read the CE marketing stuff, they would have you
> believe it is close enough that we should :-).
> 
> So to be _truly_ "windows portable", you will still need [:3]=="win" anyway
> :-)

That's a feature :-).  Too many things we think we know are true on
Windows don't hold on Win/CE, so it's worth being more precise.

I don't believe this is the case for Win64, but I have to admit I
speak from a position of ignorance -- I am clueless as to what defines
Win64.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From nhodgson at bigpond.net.au  Tue May  9 05:35:16 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 9 May 2000 13:35:16 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHLCKAA.mhammond@skippinet.com.au>  <200005090316.XAA22614@eric.cnri.reston.va.us>
Message-ID: <035e01bfb968$9ad8cca0$e3cb8490@neil>

> To help me understand the significance of win64 vs. win32, can you
> list the major differences?  I thought that the main thing was that
> pointers are 64 bits, and that otherwise the APIs are the same.  In
> fact, I don't know if WIN64 refers to Windows running on 64-bit
> machines (e.g. Alphas) only, or that it is possible to have win64 on a
> 32-bit machine (e.g. Pentium).

   The 64 bit pointer change propagates to related types like size_t and
window procedure parameters. Running the 64 bit checker over Scintilla found
one real problem and a large number of strlen calls returning 64 bit size_ts
where only ints were expected.

   64 bit machines will continue to run Win32 code but it is unlikely that
32 bit machines will be taught to run Win64 code.

   Mixed operations, calling between 32 bit and 64 bit code and vice-versa
will be fun. Microsoft (unlike IBM with OS/2) never really did the right
thing for the 16->32 bit conversion. Is there any information yet on mixed
size applications?

   Neil





From mhammond at skippinet.com.au  Tue May  9 06:06:25 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 9 May 2000 14:06:25 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005090316.XAA22614@eric.cnri.reston.va.us>
Message-ID: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>

> To help me understand the significance of win64 vs. win32, can you
> list the major differences?  I thought that the main thing was that

I just saw Neils, and Trent may have other input.

However, the point I was making is that 5 years ago, MS were telling us
that the Win32 API was almost identical to the Win16 API, except for the
size of pointers, and dropping of the "memory model" abominations.

The Windows CE department is telling us that CE is, or will be, basically
the same as Win32, except it is a Unicode only platform.  Again, with 1.6,
this should be hidden from the Python programmer.

Now all we need is "win64s" - it will respond to Neil's criticism that
mixed mode programs are a pain, and MS will tell us that "win64s" will
solve all our problems, and allow win32 to run 64 bit programs well into
the future.  Until everyone in the world realizes it sucks, and MS promptly
says it was only ever a hack in the first place, and everyone should be on
Win64 by now anyway :-)

Its-times-like-this-we-really-need-that-time-machine-ly,

Mark.




From tim_one at email.msn.com  Tue May  9 08:54:51 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 9 May 2000 02:54:51 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <200005090224.WAA22457@eric.cnri.reston.va.us>
Message-ID: <000101bfb983$7a34d3c0$592d153f@tim>

[Trent]
> 1. There are a fairly large number of downcasting cases in the
> Python code (not necessarily tied to PyArg_ParseTuple results). I
> was wondering if you think a generalized check on each such
> downcast would be advisable. This would take the form of some macro
> that would do a bounds check before doing the cast. For example (a
> common one is the cast of strlen's size_t return value to int,
> because Python strings use int for their length, this is a downcast
> on 64-bit systems):
>
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, len);
>
> would become
>
>   size_t len = strlen(s);
>   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
>
> CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> exception otherwise.

[Guido]
> How would the CAST_TO_INT macro signal an erro?  C doesn't have
> exceptions.  If we have to add checks, I'd prefer to write
>
>   size_t len = strlen(s);
>   if (INT_OVERFLOW(len))
>      return NULL; /* Or whatever is appropriate in this context */
>   obj = PyString_FromStringAndSize(s, len);

Of course we have to add checks -- strlen doesn't return an int!  It hasn't
since about a year after Python was first written (ANSI C changed the rules,
and Python is long overdue in catching up -- if you want people to stop
passing multiple args to append, set a good example in our use of C <0.5
wink>).

[Trent]
> I would like other people's opinion on this kind of change.
> There are three possible answers:

Please don't change the rating scheme we've been using:  -1 is a veto, +1 is
a hurrah, -0 and +0 are obvious <ahem>.

>   +1 this is a bad change idea because...<reason>
>   -1 this is a good idea, go for it

That one, except spelled +1.

>   +0 (most likely) This is probably a good idea for some case
> where the overflow *could* happen, however the strlen example that
> you gave is *not* such a situation. As Tim Peters said: 2GB limit on
> string lengths is a good assumption/limitation.

No, it's a defensible limitation, but it's *never* a valid assumption.  The
check isn't needed anywhere we can prove a priori that it could never fail
(in which case we're not assuming anything), but it's always needed when we
can't so prove (in which case skipping the check would be a bad assumption).
In the absence of any context, your strlen example above definitely needs
the check.

An alternative would be to promote the size member from int to size_t;
that's no actual change on the 32-bit machines Guido generally assumes
without realizing it, and removes an arbitrary (albeit defensible)
limitation on some 64-bit machines at the cost of (just possibly, due to
alignment vagaries) boosting var objects' header size on the latter.

correctness-doesn't-happen-by-accident-ly y'rs  - tim





From guido at python.org  Tue May  9 12:48:16 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 09 May 2000 06:48:16 -0400
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: Your message of "Tue, 09 May 2000 02:54:51 EDT."
             <000101bfb983$7a34d3c0$592d153f@tim> 
References: <000101bfb983$7a34d3c0$592d153f@tim> 
Message-ID: <200005091048.GAA22912@eric.cnri.reston.va.us>

> An alternative would be to promote the size member from int to size_t;
> that's no actual change on the 32-bit machines Guido generally assumes
> without realizing it, and removes an arbitrary (albeit defensible)
> limitation on some 64-bit machines at the cost of (just possibly, due to
> alignment vagaries) boosting var objects' header size on the latter.

Then the signatures of many, many functions would have to be changed
to take or return size_t, too -- almost anything in the Python/C API
that *conceptually* is a size_t is declared as int; the ob_size field
is only the tip of the iceberg.

We'd also have to change the size of Python ints (currently long) to
an integral type that can hold a size_t; on Windows (and I believe
*only* on Windows) this is a long long, or however they spell it
(except size_t is typically unsigned).

This all is a major reworking -- not good for 1.6, even though I agree
it needs to be done eventually.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May  9 13:08:25 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 09 May 2000 07:08:25 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 14:06:25 +1000."
             <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> 
Message-ID: <200005091108.HAA22983@eric.cnri.reston.va.us>

> > To help me understand the significance of win64 vs. win32, can you
> > list the major differences?  I thought that the main thing was that
> 
> I just saw Neils, and Trent may have other input.
> 
> However, the point I was making is that 5 years ago, MS were telling us
> that the Win32 API was almost identical to the Win16 API, except for the
> size of pointers, and dropping of the "memory model" abominations.
> 
> The Windows CE department is telling us that CE is, or will be, basically
> the same as Win32, except it is a Unicode only platform.  Again, with 1.6,
> this should be hidden from the Python programmer.
> 
> Now all we need is "win64s" - it will respond to Neil's criticism that
> mixed mode programs are a pain, and MS will tell us what "win64s" will
> solve all our problems, and allow win32 to run 64 bit programs well into
> the future.  Until everyone in the world realizes it sucks, and MS promptly
> says it was only ever a hack in the first place, and everyone should be on
> Win64 by now anyway :-)

OK, I am beginning to get the picture.

The win16-win32-win64 distinction mostly affects the C API.  I agree
that the win16/win32 distinction was huge -- while they provided
backwards compatible APIs, most of these were quickly deprecated.  The
user experience was also completely different.  And huge amounts of
functionality were only available in the win32 version (e.g. the
registry), win32s notwithstanding.

I don't see the same difference for the win32/win64 API.  Yes, all the
APIs have changed -- but only in a way you would *expect* them to
change in a 64-bit world.  From the descriptions of differences, the
user experience and the sets of APIs available are basically the same,
but the APIs are tweaked to allow 64-bit values where this makes
sense.  This is a big deal for MS developers because of MS's
insistence on fixing the sizes of all datatypes -- POSIX developers
are used to typedefs that have platform-dependent widths, but MS in
its wisdom has decided that it should be okay to know that a long is
exactly 32 bits.

Again, the Windows/CE user experience is quite different, so I agree
on making the user-visible platform different there.  But I still
don't see that the user experience for win64 will be any different
than for win32.

Another view: win32 was my way of saying the union of Windows 95,
Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
platforms.  If Windows 2000 is sufficiently different to the user, it
deserves a different platform id (win2000?).

Is there a connection between Windows 2000 and _WIN64?

--Guido van Rossum (home page: http://www.python.org/~guido/)




From mal at lemburg.com  Tue May  9 11:09:40 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 09 May 2000 11:09:40 +0200
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us>
Message-ID: <3917D5D3.A8CD1B3E@lemburg.com>

Guido van Rossum wrote:
> 
> > [Trent]
> > > What if someone needs to do something in Python code for either Win32 or
> > > Win64 but not both? Or should this never be necessary (not
> > > likely). I would
> > > like Mark H's opinion on this stuff.
> 
> [Mark]
> > OK :-)
> >
> > I have always thought that it _would_ move to "win64", and the official way
> > of checking for "Windows" will be sys.platform[:3]=="win".
> >
> > In fact, I've noticed Guido use this idiom (both stand-alone, and as: if
> > sys.platform[:3] in ["win", "mac"])
> >
> > It will no doubt cause a bit of pain, but IMO it is cleaner...
> 
> Hmm...  I'm not sure I agree.  I read in the comments that the _WIN32
> symbol is defined even on Win64 systems -- to test for Win64, you must
> test the _WIN64 symbol.  The two variants are more similar than they
> are different.
> 
> While testing sys.platform isn't quite the same thing, I think that
> the same reasoning goes: a win64 system is everything that a win32
> system is, and then some.
> 
> So I'd vote for leaving sys.platform alone (i.e. "win32" in both
> cases), and providing another way to test for win64-ness.

Just curious, what's the output of platform.py on Win64 ?
(You can download platform.py from my Python Pages.)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake at acm.org  Tue May  9 20:53:37 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 9 May 2000 14:53:37 -0400 (EDT)
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005091108.HAA22983@eric.cnri.reston.va.us>
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>
	<200005091108.HAA22983@eric.cnri.reston.va.us>
Message-ID: <14616.24241.26240.247048@seahag.cnri.reston.va.us>

Guido van Rossum writes:
 > Another view: win32 was my way of saying the union of Windows 95,
 > Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
 > platforms.  If Windows 2000 is sufficiently different to the user, it
 > deserves a different platform id (win2000?).
 > 
 > Is there a connection between Windows 2000 and _WIN64?

  Since no one else has responded, here's some stuff from MS on the
topic of Win64:

http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp

This document talks only of the Itanium (IA64) processor, and doesn't
mention the Alpha at all.  I know the NT shipping on Alpha machines is
Win32, though the actual application code can be 64-bit (think "32-bit
Solaris on an Ultra"); just the system APIs are 32 bits.
  The last link on the page links to some more detailed technical
information on moving application code to Win64.


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From guido at python.org  Tue May  9 20:57:21 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 09 May 2000 14:57:21 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: Your message of "Tue, 09 May 2000 14:53:37 EDT."
             <14616.24241.26240.247048@seahag.cnri.reston.va.us> 
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> <200005091108.HAA22983@eric.cnri.reston.va.us>  
            <14616.24241.26240.247048@seahag.cnri.reston.va.us> 
Message-ID: <200005091857.OAA24731@eric.cnri.reston.va.us>

>   Since no one else has responded, here's some stuff from MS on the
> topic of Win64:
> 
> http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp

Thanks, this makes more sense.  I guess that Trent's interest in Win64
has to do with an early shipment of Itaniums that ActiveState might
have received. :-)

The document confirms my feeling that WIN64 vs WIN32, unlike WIN32 vs
WIN16, is mostly a compiler issue, and not a user experience or OS
functionality issue.  The table lists increased limits, not new
software subsystems.

So I still think that sys.platform should be 'win32', to avoid
breaking existing apps.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From gstein at lyra.org  Tue May  9 20:56:34 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 9 May 2000 11:56:34 -0700 (PDT)
Subject: [Python-Dev] win64 (was: [Patches] PC\config.[hc] changes for Win64)
In-Reply-To: <14616.24241.26240.247048@seahag.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Fred L. Drake, Jr. wrote:
> Guido van Rossum writes:
>  > Another view: win32 was my way of saying the union of Windows 95,
>  > Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows
>  > platforms.  If Windows 2000 is sufficiently different to the user, it
>  > deserves a different platform id (win2000?).
>  > 
>  > Is there a connection between Windows 2000 and _WIN64?
> 
>   Since no one else has responded, here's some stuff from MS on the
> topic of Win64:
> 
> http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp
> 
> This document talks only of the Itanium (IA64) processor, and doesn't
> mention the Alpha at all.  I know the NT shipping on Alpha machines is
> Win32, though the actual application code can be 64-bit (think "32-bit
> Solaris on an Ultra"); just the system APIs are 32 bits.

Windows is no longer made/sold for the Alpha processor. That was canned in
August of '99, I believe. Possibly August 98.

Basically, Windows is just the x86 family, and Win/CE for various embedded
processors.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From fdrake at acm.org  Tue May  9 21:06:49 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 9 May 2000 15:06:49 -0400 (EDT)
Subject: [Python-Dev] Re: win64 (was: [Patches] PC\config.[hc] changes for Win64)
In-Reply-To: <Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>
References: <14616.24241.26240.247048@seahag.cnri.reston.va.us>
	<Pine.LNX.4.10.10005091154400.3314-100000@nebula.lyra.org>
Message-ID: <14616.25033.883165.800216@seahag.cnri.reston.va.us>

Greg Stein writes:
 > Windows is no longer made/sold for the Alpha processor. That was canned in
 > August of '99, I believe. Possibly August 98.

  <sigh/>


  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives




From trentm at activestate.com  Tue May  9 21:49:57 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 9 May 2000 12:49:57 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <200005091857.OAA24731@eric.cnri.reston.va.us>
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au> <200005091108.HAA22983@eric.cnri.reston.va.us> <14616.24241.26240.247048@seahag.cnri.reston.va.us> <200005091857.OAA24731@eric.cnri.reston.va.us>
Message-ID: <20000509124957.A21838@activestate.com>

> Thanks, this makes more sense.  I guess that Trent's interest in Win64
> has to do with an early shipment of Itaniums that ActiveState might
> have received. :-)

Could be.... Or maybe we don't have any Itanium boxes. :)

Here is a good link on MSDN:

Getting Ready for 64-bit Windows
http://msdn.microsoft.com/library/psdk/buildapp/64bitwin_410z.htm

More specifically this (presuming it is being kept up to date) documents the
changes to the Win32 API for 64-bit Windows:
http://msdn.microsoft.com/library/psdk/buildapp/64bitwin_9xo3.htm
I am not a Windows programmer, but the changes are pretty minimal.

Summary:

Points for sys.platform == "win32" on Win64:
Pros:
- will not break existing sys.platform checks
- it would be nicer for the casual Python programmer to have platform issues
  hidden; one symbol for the common Windows OSes is closer to the Pythonic
  ideal than "the first three characters of the platform string are 'win'".
Cons:
- may need to add some other mechanism to differentiate Win32 and Win64 in
  Python code
- "win32" is a little misleading in that it refers to an API supported on
  Win32 and Win64 ("windows" would be more accurate, but too late for that)
  

Points for sys.platform == "win64" on Win64:
Pros:
- seems logically cleaner, given that the Win64 API may diverge from the
  Win32 API and there is no other current mechanism to differentiate Win32 and
  Win64 in Python code
Cons:
- may break existing sys.platform checks when run on Win64


Opinion:

I see the two choices ("win32" or "win64") as a trade off between:
- Use "win32" because a common user experience should translate to a common
  way to check for that environment, i.e. one value for sys.platform.
  Unfortunately we are stuck with "win32" instead of something like
  "windows".
- Use "win64" because it is not a big deal for the user to check for
  sys.platform[:3]=="win" and this way a mechanism exists to differentiate
  btwn Win32 and Win64 should it be necessary.
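
To make the trade-off concrete, here is roughly what the check would look
like under each choice (illustrative only):

    import sys

    def on_windows():
        # If Win64 keeps sys.platform == "win32", existing checks keep
        # working unchanged on both Win32 and Win64.
        return sys.platform == "win32"

    def on_windows_prefix():
        # If Win64 reports "win64" instead, scripts need the prefix idiom
        # to cover both (and existing == "win32" checks break on Win64).
        return sys.platform[:3] == "win"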

I am inclined to pick "win32" because:

1. While it may be confusing to the Python scriptor on Win64 that he has to
   check for win*32*, that is something that he will learn the first time. It
   is better than the alternative of the scriptor happily using "win64" and
   then that code not running on Win32 for no good reason. 
2. The main question is: is Win64 so much more like Win32 than different from
   it that the common-case Python programmer should never have to make the
   distinction in his Python code?  Or, at least, is such differentiation by
   the Python scriptor rare enough that some other provided mechanism is
   sufficient (even preferable)?
3. Guido has expressed that he favours this option. :) 

then change "win32" to "windows" in Py3K.



Trent

-- 
Trent Mick
trentm at activestate.com



From trentm at activestate.com  Tue May  9 22:05:53 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 9 May 2000 13:05:53 -0700
Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception
In-Reply-To: <000101bfb983$7a34d3c0$592d153f@tim>
References: <200005090224.WAA22457@eric.cnri.reston.va.us> <000101bfb983$7a34d3c0$592d153f@tim>
Message-ID: <20000509130553.D21443@activestate.com>

[Trent]
> > Another option might be to document 'b' as for unsigned chars and 'h', 'i',
> > 'l' as signed integral values and then set the bounds checks ([0,
> > UCHAR_MAX]
> > for 'b')  appropriately. Can we clamp these formatters so? I.e. we would be
> > limiting the user to unsigned or signed depending on the formatter. (Which
> > again, means that it would be nice to have different formatters for signed
> > and unsigned.) I think that the bounds checking is false security unless
> > these restrictions are made.
[guido]
> 
> I like this: 'b' is unsigned, the others are signed.

Okay, I will submit a patch for this then.  The 'b' formatter will limit
values to [0, UCHAR_MAX].

> [Trent]
> > 1. There are a fairly large number of downcasting cases in the
> > Python code (not necessarily tied to PyArg_ParseTuple results). I
> > was wondering if you think a generalized check on each such
> > downcast would be advisable. This would take the form of some macro
> > that would do a bounds check before doing the cast. For example (a
> > common one is the cast of strlen's size_t return value to int,
> > because Python strings use int for their length, this is a downcast
> > on 64-bit systems):
> >
> >   size_t len = strlen(s);
> >   obj = PyString_FromStringAndSize(s, len);
> >
> > would become
> >
> >   size_t len = strlen(s);
> >   obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
> >
> > CAST_TO_INT would ensure that 'len' did not overflow and would raise an
> > exception otherwise.
> 
> [Guido]
> > How would the CAST_TO_INT macro signal an error?  C doesn't have
> > exceptions.  If we have to add checks, I'd prefer to write
> >
> >   size_t len = strlen(s);
> >   if (INT_OVERFLOW(len))
> >      return NULL; /* Or whatever is appropriate in this context */
> >   obj = PyString_FromStringAndSize(s, len);
> 
[Tim]
> Of course we have to add checks -- strlen doesn't return an int!  It hasn't
> since about a year after Python was first written (ANSI C changed the rules,
> and Python is long overdue in catching up -- if you want people to stop
> passing multiple args to append, set a good example in our use of C <0.5
> wink>).
>
> The
> check isn't needed anywhere we can prove a priori that it could never fail
> (in which case we're not assuming anything), but it's always needed when we
> can't so prove (in which case skipping the check would be a bad
> assumption).
> In the absence of any context, your strlen example above definitely needs
> the check.
>

Okay, I just wanted a go-ahead that this kind of thing was desired.  I will
try to find the points where these overflows *can* happen and then I'll add
checks in a manner closer to Guido's syntax above.
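
Something like this minimal sketch is what I have in mind (INT_OVERFLOW is
just the hypothetical macro from Guido's example above, and string_from_c is
a made-up caller, only to show the pattern):

    #include "Python.h"
    #include <limits.h>
    #include <string.h>

    /* Hypothetical helper: true iff a size_t value won't fit in an int. */
    #define INT_OVERFLOW(n) ((n) > (size_t)INT_MAX)

    static PyObject *
    string_from_c(const char *s)
    {
        size_t len = strlen(s);
        if (INT_OVERFLOW(len)) {
            PyErr_SetString(PyExc_OverflowError,
                            "C string too long for a Python string");
            return NULL;
        }
        return PyString_FromStringAndSize(s, (int)len);
    }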

> 
> [Trent]
> > I would like other people's opinion on this kind of change.
> > There are three possible answers:
> 
> Please don't change the rating scheme we've been using:  -1 is a veto, +1 is
> a hurrah, -0 and +0 are obvious <ahem>.
> 
> >   +1 this is a bad change idea because...<reason>
> >   -1 this is a good idea, go for it
> 
Whoa, sorry Tim. I mixed up the +/- there. I did not intend to change the
voting system.

[Tim]
> An alternative would be to promote the size member from int to size_t;
> that's no actual change on the 32-bit machines Guido generally assumes
> without realizing it, and removes an arbitrary (albeit defensible)
> limitation on some 64-bit machines at the cost of (just possibly, due to
> alignment vagaries) boosting var objects' header size on the latter.
> 
I agree with Guido that this is too big an immediate change. I'll just try to
find and catch the possible overflows.


Thanks,
Trent

-- 
Trent Mick
trentm at activestate.com



From gstein at lyra.org  Tue May  9 22:14:19 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 9 May 2000 13:14:19 -0700 (PDT)
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c)
In-Reply-To: <200005091953.PAA28201@seahag.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Fred Drake wrote:
> Update of /projects/cvsroot/python/dist/src/Objects
> In directory seahag.cnri.reston.va.us:/home/fdrake/projects/python/Objects
> 
> Modified Files:
> 	unicodeobject.c 
> Log Message:
> 
> M.-A. Lemburg <mal at lemburg.com>:
> Added support for user settable default encodings. The
> current implementation uses a per-process global which
> defines the value of the encoding parameter in case it
> is set to NULL (meaning: use the default encoding).

Umm... maybe I missed something, but I thought there was pretty broad
feelings *against* having a global like this. This kind of thing is just
nasty.

1) Python modules can't change it, nor can they rely on it being a
   particular value
2) a mutable, global variable is just plain wrong. The InterpreterState
   and ThreadState structures were created *specifically* to avoid adding
   crap variables like this.
3) allowing a default other than utf-8 is sure to cause gotchas and
   surprises. Some code is going to rightly assume that the default is
   just that, but be horribly broken when an application changes it.

Somebody please say this is hugely experimental. And then say why it isn't
just a private patch, rather than sitting in CVS.

:-(

-g

-- 
Greg Stein, http://www.lyra.org/




From guido at python.org  Tue May  9 22:24:05 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 09 May 2000 16:24:05 -0400
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c)
In-Reply-To: Your message of "Tue, 09 May 2000 13:14:19 PDT."
             <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org> 
References: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org> 
Message-ID: <200005092024.QAA25835@eric.cnri.reston.va.us>

> Umm... maybe I missed something, but I thought there was pretty broad
> feelings *against* having a global like this. This kind of thing is just
> nasty.
> 
> 1) Python modules can't change it, nor can they rely on it being a
>    particular value
> 2) a mutable, global variable is just plain wrong. The InterpreterState
>    and ThreadState structures were created *specifically* to avoid adding
>    crap variables like this.
> 3) allowing a default other than utf-8 is sure to cause gotchas and
>    surprises. Some code is going to rightly assume that the default is
>    just that, but be horribly broken when an application changes it.
> 
> Somebody please say this is hugely experimental. And then say why it isn't
> just a private patch, rather than sitting in CVS.

Watch your language.

Marc did this at my request.  It is my intention that the encoding be
hardcoded at compile time.  But while there's a discussion going about
what the hardcoded encoding should *be*, it would seem handy to have a
quick way to experiment.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Tue May  9 22:33:40 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 9 May 2000 13:33:40 -0700 (PDT)
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ...
 unicodeobject.c)
In-Reply-To: <200005092024.QAA25835@eric.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005091331300.3314-100000@nebula.lyra.org>

On Tue, 9 May 2000, Guido van Rossum wrote:
>...
> Watch your language.

Yes, Dad :-) Sorry...

> Marc did this at my request.  It is my intention that the encoding be
> hardcoded at compile time.  But while there's a discussion going about
> what the hardcoded encoding should *be*, it would seem handy to have a
> quick way to experiment.

Okee dokee... That was one of my questions: is this experimental or not?

It is still a bit frightening, though, if it might get left in there, for
the reasons I listed (to name a few) ... :-(

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Tue May  9 23:35:16 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 09 May 2000 23:35:16 +0200
Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... 
 unicodeobject.c)
References: <Pine.LNX.4.10.10005091308470.3314-100000@nebula.lyra.org> <200005092024.QAA25835@eric.cnri.reston.va.us>
Message-ID: <39188494.61424A7@lemburg.com>

Guido van Rossum wrote:
> 
> > Umm... maybe I missed something, but I thought there was pretty broad
> > feelings *against* having a global like this. This kind of thing is just
> > nasty.
> >
> > 1) Python modules can't change it, nor can they rely on it being a
> >    particular value
> > 2) a mutable, global variable is just plain wrong. The InterpreterState
> >    and ThreadState structures were created *specifically* to avoid adding
> >    crap variables like this.
> > 3) allowing a default other than utf-8 is sure to cause gotchas and
> >    surprises. Some code is going to rightly assume that the default is
> >    just that, but be horribly broken when an application changes it.

Hmm, the patch notice says it all I guess:

This patch fixes a few bugglets and adds an experimental
feature which allows setting the string encoding assumed
by the Unicode implementation at run-time.

The current implementation uses a process global for
the string encoding. This should subsequently be changed
to a thread state variable, so that the setting can
be done on a per thread basis.

Note that only the coercions from strings to Unicode
are affected by the encoding parameter. The "s" parser
marker still returns UTF-8. (str(unicode) also returns
the string encoding -- unlike what I wrote in the original
patch notice.)

The main intent of this patch is to provide a test
bed for the ongoing Unicode debate, e.g. to have the
implementation use 'latin-1' as the default string encoding,
put

import sys
sys.set_string_encoding('latin-1')

in your site.py file.

> > Somebody please say this is hugely experimental. And then say why it isn't
> > just a private patch, rather than sitting in CVS.
> 
> Watch your language.
> 
> Marc did this at my request.  It is my intention that the encoding be
> hardcoded at compile time.  But while there's a discussion going about
> what the hardcoded encoding should *be*, it would seem handy to have a
> quick way to experiment.

Right and that's what the intent was behind adding a global
and some APIs to change it first... there are a few ways this
could one day get finalized:

1. hardcode the encoding (UTF-8 was previously hard-coded)
2. make the encoding a compile time option
3. make the encoding a per-process option
4. make the encoding a per-thread option
5. make the encoding a per-process setting which is deduced
   from env. vars such as LC_ALL, LC_CTYPE, LANG or system
   APIs which can be used to get at the currently
   active locale encoding

Note that I have named the APIs sys.get/set_string_encoding()...
I've done that on purpose, because I have a feeling that
changing the conversion from Unicode to strings from UTF-8
to an encoding not capable of representing all Unicode
characters won't get us very far. Also, changing this is
rather tricky due to the way the buffer API works.

The other way around needs some experimenting though and this
is what the patch implements: it allows you to change the
string encoding assumption to test various
possibilities, e.g. ascii, latin-1, unicode-escape,
<your favourite local encoding> etc. without having to
recompile the interpreter every time.

Have fun with it :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mhammond at skippinet.com.au  Wed May 10 00:58:19 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 10 May 2000 08:58:19 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <20000509124957.A21838@activestate.com>
Message-ID: <ECEPKNMJLHAPFFJHDOJBAEJFCKAA.mhammond@skippinet.com.au>

Geez - Fred is posting links to the MS site, and I'm battling ipchains and
DHCP on my newly installed Debian box - what is this world coming to!?!?!

> I am inclined to pick "win32" because:

OK - I'm sold.

Mark.




From nhodgson at bigpond.net.au  Wed May 10 01:17:27 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Wed, 10 May 2000 09:17:27 +1000
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBMEHNCKAA.mhammond@skippinet.com.au>
Message-ID: <009a01bfba0c$bdf13a20$e3cb8490@neil>

> Now all we need is "win64s" - it will respond to Neil's criticism that
> mixed mode programs are a pain, and MS will tell us what "win64s" will
> solve all our problems, and allow win32 to run 64 bit programs well into
> the future.  Until everyone in the world realizes it sucks, and MS
promptly
> says it was only ever a hack in the first place, and everyone should be on
> Win64 by now anyway :-)

   Maybe someone has made noise about this before I joined the discussion,
but I see the absence of a mixed mode being a big problem for users. I don't
think that there will be the "quick clean" migration from 32 to 64 that
there was for 16 to 32. It doesn't offer that much for most applications. So
there will need to be both 32 bit and 64 bit versions of Python present on
machines. With duplicated libraries. Each DLL should be available in both 32
and 64 bit form. The IDEs will have to be available in both forms as they
are loading, running and debugging code of either width. Users will have to
remember to run a different Python if they are using libraries of the
non-default width.

   Neil





From czupancic at beopen.com  Wed May 10 01:44:20 2000
From: czupancic at beopen.com (Christian Zupancic)
Date: Tue, 09 May 2000 16:44:20 -0700
Subject: [Python-Dev] Python Query
Message-ID: <3918A2D4.B0FE7DDF@beopen.com>

======================================================================
Greetings Python Developers,

Please participate in a small survey about Python for BeOpen.com that we
are conducting with the guidance of our advisor, and the creator of
Python, Guido van Rossum. In return for answering just five short
questions, I will mail you up to three (3) BeOpen T-shirts-- highly
esteemed by select trade-show attendees as "really cool". In addition,
three lucky survey participants will receive a Life-Size Inflatable
Penguin (as they say, "very cool").

- Why do you prefer Python over other languages, e.g. Perl?


- What do you consider to be (a) competitor(s) to Python?


- What are Python's strong points and weaknesses?


- What other languages do you program in?


- If you had one wish about Python, what would it be?


- For Monty Python fans only:
What is the average airspeed of a swallow (European, non-migratory)?

 THANKS! That wasn't so bad, was it?  Make sure you've attached a
business card or address of some sort so I know where to send your
prizes.

Best Regards,
Christian Zupancic
Market Analyst, BeOpen.com

From trentm at activestate.com  Wed May 10 01:45:36 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 9 May 2000 16:45:36 -0700
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <3917D5D3.A8CD1B3E@lemburg.com>
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us> <3917D5D3.A8CD1B3E@lemburg.com>
Message-ID: <20000509164536.A31366@activestate.com>

On Tue, May 09, 2000 at 11:09:40AM +0200, M.-A. Lemburg wrote:
> Just curious, what's the output of platform.py on Win64 ?
> (You can download platform.py from my Python Pages.)

I get the following:

"""
The system cannot find the path specified
win64-32bit
"""

Sorry, I did not hunt down the "path" error message.

Trent

-- 
Trent Mick
trentm at activestate.com



From tim_one at email.msn.com  Wed May 10 06:53:20 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 10 May 2000 00:53:20 -0400
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
In-Reply-To: <009a01bfba0c$bdf13a20$e3cb8490@neil>
Message-ID: <000301bfba3b$a9e11300$022d153f@tim>

[Neil Hodgson]
>    Maybe someone has made noise about this before I joined the
> discussion, but I see the absence of a mixed mode being a big
> problem for users. ...

Intel doesn't -- they're not positioning Itanium for the consumer market.
They're going after the high-performance server market with this, and most
signs are that MS is too.

> ...
> It doesn't offer that much for most applications.

Bingo.

plenty-of-time-to-panic-later-if-end-users-ever-care-ly y'rs  - tim





From mal at lemburg.com  Wed May 10 09:47:43 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 May 2000 09:47:43 +0200
Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64
References: <ECEPKNMJLHAPFFJHDOJBKEHECKAA.mhammond@skippinet.com.au> <200005090214.WAA22419@eric.cnri.reston.va.us> <3917D5D3.A8CD1B3E@lemburg.com> <20000509164536.A31366@activestate.com>
Message-ID: <3919141F.89DC215E@lemburg.com>

Trent Mick wrote:
> 
> On Tue, May 09, 2000 at 11:09:40AM +0200, M.-A. Lemburg wrote:
> > Just curious, what's the output of platform.py on Win64 ?
> > (You can download platform.py from my Python Pages.)
> 
> I get the following:
> 
> """
> The system cannot find the path specified

Hmm, this probably originates from platform.py trying
to find the "file" command which is used on Unix.

> win64-32bit

Now this looks interesting ... 32-bit Win64 ;-)

> """
> 
> Sorry, I did not hunt down the "path" error message.
> 
> Trent
> 
> --
> Trent Mick
> trentm at activestate.com
> 

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From guido at python.org  Wed May 10 18:52:49 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 10 May 2000 12:52:49 -0400
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Tools/idle browser.py,NONE,1.1
In-Reply-To: Your message of "Wed, 10 May 2000 12:47:30 EDT."
             <200005101647.MAA30408@seahag.cnri.reston.va.us> 
References: <200005101647.MAA30408@seahag.cnri.reston.va.us> 
Message-ID: <200005101652.MAA28936@eric.cnri.reston.va.us>

Fred,

"browser" is a particularly non-descriptive name for this module.

Perhaps it's not too late to rename it to e.g. "BrowserControl"?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Wed May 10 22:14:46 2000
From: trentm at activestate.com (Trent Mick)
Date: Wed, 10 May 2000 13:14:46 -0700
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <000201bfba3b$a74ad7c0$022d153f@tim>
References: <20000509162504.A31192@activestate.com> <000201bfba3b$a74ad7c0$022d153f@tim>
Message-ID: <20000510131446.A25926@activestate.com>

On Wed, May 10, 2000 at 12:53:16AM -0400, Tim Peters wrote:
> [Trent Mick]
> > Discussion:
> >
> > Okay, it is debatable to call float_hash and complex_hash broken,
> > but their code presumed that sizeof(long) was 32-bits. As a result
> > the hashed values for floats and complex values were not the same
> > on a 64-bit *nix system as on a 32-bit *nix system. With this
> > patch they are.
> 
> The goal is laudable but the analysis seems flawed.  For example, this new
> comment:

Firstly, I should have admitted my ignorance with regards to hash functions.


> Looks to me like the real problem in the original was here:
> 
>     x = hipart + (long)fractpart + (long)intpart + (expo << 15);
>                                    ^^^^^^^^^^^^^
> 
> The difficulty is that intpart may *not* fit in 32 bits, so the cast of
> intpart to long is ill-defined when sizeof(long) == 4.

> 
> That is, the hash function truly is broken for "large" values with a
> fractional part, and I expect your after-patch code suffers the same
> problem: 

Yes it did.


> The
> solution to this is to break intpart in this branch into pieces no larger
> than 32 bits too 

Okay, here is another try (only for floatobject.c) for discussion.  If it
looks good then I will submit a patch for both float and complex objects.
The idea: do the same for 'intpart' as was done for 'fractpart'.


static long
float_hash(v)
    PyFloatObject *v;
{
    double intpart, fractpart;
    long x;

    fractpart = modf(v->ob_fval, &intpart);

    if (fractpart == 0.0) {
        /* ... snip ... */
    }
    else {
        int expo;
        long hipart;

        fractpart = frexp(fractpart, &expo);
        fractpart = fractpart * 2147483648.0; 
        hipart = (long)fractpart; 
        fractpart = (fractpart - (double)hipart) * 2147483648.0;

        x = hipart + (long)fractpart + (expo << 15); /* combine the fract parts */

        intpart = frexp(intpart, &expo);
        intpart = intpart * 2147483648.0;
        hipart = (long)intpart;
        intpart = (intpart - (double)hipart) * 2147483648.0;

        x += hipart + (long)intpart + (expo << 15); /* add in the int parts */
    }
    if (x == -1)
        x = -2;
    return x;
}




> Note this consequence under the Win32 Python:

With this change, on Linux32:

>>> base = 2.**40 + 0.5
>>> base
1099511627776.5
>>> for i in range(32, 45):
...     x = base + 2.**i
...     print x, hash(x)
...
1.10380659507e+12 -2141945856
1.10810156237e+12 -2137751552
1.11669149696e+12 -2129362944
1.13387136614e+12 -2112585728
1.16823110451e+12 -2079031296
1.23695058125e+12 -2011922432
1.37438953472e+12 -1877704704
1.64926744166e+12 -1609269248
2.19902325555e+12 -2146107392
3.29853488333e+12 -1609236480
5.49755813888e+12 -1877639168
9.89560464998e+12 -2011824128
1.86916976722e+13 -2078900224


On Linux64:

>>> base = 2.**40 + 0.5
>>> base
1099511627776.5
>>> for i in range(32, 45):
...     x = base + 2.**i
...     print x, hash(x)
...
1.10380659507e+12 2153021440
1.10810156237e+12 2157215744
1.11669149696e+12 2165604352
1.13387136614e+12 2182381568
1.16823110451e+12 2215936000
1.23695058125e+12 2283044864
1.37438953472e+12 2417262592
1.64926744166e+12 2685698048
2.19902325555e+12 2148859904
3.29853488333e+12 2685730816
5.49755813888e+12 2417328128
9.89560464998e+12 2283143168
1.86916976722e+13 2216067072

>-- and that should also fix your 64-bit woes "by magic".
> 

As you can see it did not, but for another reason. The summation of the parts
overflows 'x'. Is this a problem? I.e., does it matter if a hash function
returns an overflowed integral value (my hash function ignorance is showing)?
And if this does not matter, does it matter that a hash returns different
values on different platforms?


> a hash function should never ignore any bit in its input. 

Which brings up a question regarding instance_hash(), func_hash(),
meth_hash(), HKEY_hash() [or whatever it is called], and others which cast a
pointer to a long (discarding the upper half of the pointer on Win64). Do
these really need to be fixed?  Am I nitpicking too much on this whole thing?


Thanks,
Trent

-- 
Trent Mick
trentm at activestate.com



From tim_one at email.msn.com  Thu May 11 06:13:29 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 May 2000 00:13:29 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <20000510131446.A25926@activestate.com>
Message-ID: <000b01bfbaff$43d320c0$2aa0143f@tim>

[Trent Mick]
> ...
> Okay here is another try (only for floatobject.c) for discussion.
> If it looks good then I will submit a patch for float and complex
> objects. So do the same for 'intpart' as was done for 'fractpart'.
>
>
> static long
> float_hash(v)
>     PyFloatObject *v;
> {
>     double intpart, fractpart;
>     long x;
>
>     fractpart = modf(v->ob_fval, &intpart);
>
>     if (fractpart == 0.0) {
> 		// ... snip ...
>     }
>     else {
>         int expo;
>         long hipart;
>
>         fractpart = frexp(fractpart, &expo);
>         fractpart = fractpart * 2147483648.0;

It's OK to use "*=" in C <wink>.

Would like a comment that this is 2**31 (which makes the code obvious <wink>
instead of mysterious).  A comment block at the top would help too, like

/* Use frexp to get at the bits in intpart and fractpart.
 * Since the VAX D double format has 56 mantissa bits, which is the
 * most of any double format in use, each of these parts may have as
 * many as (but no more than) 56 significant bits.
 * So, assuming sizeof(long) >= 4, each part can be broken into two longs;
 * frexp and multiplication are used to do that.
 * Also, since the Cray double format has 15 exponent bits, which is the
 * most of any double format in use, shifting the exponent field left by
 * 15 won't overflow a long (again assuming sizeof(long) >= 4).
 */

And this code has gotten messy enough that it's probably better to package it
in a utility function rather than duplicate it.

Another approach would be to play with the bits directly, via casting
tricks.  But then you have to wrestle with platform crap like endianness.

>         hipart = (long)fractpart;
>         fractpart = (fractpart - (double)hipart) * 2147483648.0;
>
>         x = hipart + (long)fractpart + (expo << 15); /* combine
> the fract parts */
>
>         intpart = frexp(intpart, &expo);
>         intpart = intpart * 2147483648.0;
>         hipart = (long)intpart;
>         intpart = (intpart - (double)hipart) * 2147483648.0;
>
>         x += hipart + (long)intpart + (expo << 15); /* add in the
> int parts */

There's no point adding in (expo << 15) a second time.

> With this change, on Linux32:
> ...
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 -2141945856
> 1.10810156237e+12 -2137751552
> 1.11669149696e+12 -2129362944
> 1.13387136614e+12 -2112585728
> 1.16823110451e+12 -2079031296
> 1.23695058125e+12 -2011922432
> 1.37438953472e+12 -1877704704
> 1.64926744166e+12 -1609269248
> 2.19902325555e+12 -2146107392
> 3.29853488333e+12 -1609236480
> 5.49755813888e+12 -1877639168
> 9.89560464998e+12 -2011824128
> 1.86916976722e+13 -2078900224
>
>
> On Linux64:
>
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 2153021440
> 1.10810156237e+12 2157215744
> 1.11669149696e+12 2165604352
> 1.13387136614e+12 2182381568
> 1.16823110451e+12 2215936000
> 1.23695058125e+12 2283044864
> 1.37438953472e+12 2417262592
> 1.64926744166e+12 2685698048
> 2.19902325555e+12 2148859904
> 3.29853488333e+12 2685730816
> 5.49755813888e+12 2417328128
> 9.89560464998e+12 2283143168
> 1.86916976722e+13 2216067072

>>-- and that should also fix your 64-bit woes "by magic".

> As you can see it did not, but for another reason.

I read your original complaint as being that hash(double) yielded different
results between two *64* bit platforms (Linux64 vs Win64), but what you
showed above appears to be a comparison between a 64-bit platform and a
32-bit platform, where presumably sizeof(long) is 8 on the former but 4
on the latter.  If so, of *course* results may be different:  hash returns a
C long, and they're different sizes across these platforms.

In any case, the results above aren't really different!

>>> hex(-2141945856)  # 1st result from Linux32
'0x80548000'
>>> hex(2153021440L)  # 1st result from Linux64
'0x80548000L'
>>>

That is, the bits are the same.  How much more do you want from me <wink>?

> The summation of the parts overflows 'x'. Is this a problem? I.e., does
> it matter if a hash function returns an overflowed integral value (my
> hash function ignorance is showing)?

Overflow generally doesn't matter.  In fact, it's usual <wink>; e.g., the
hash for strings iterates over

    x = (1000003*x) ^ *p++;

and overflows madly.  The saving grace is that C defines integer overflow in
such a way that losing the high bits on every operation yields the same
result as if the entire result were computed to infinite precision and the
high bits tossed only at the end.  So overflow doesn't hurt this from being
as reproducible as possible, given that Python's int size is different.

Overflow can be avoided by using xor instead of addition, but addition is
generally preferred because it helps to "scramble" the bits a little more.

> And if this does not matter, does it matter that a hash returns different
> values on different platforms?

No, and it doesn't always stay the same from release to release on a single
platform.  For example, your patch above will change hash(double) on Win32!

>> a hash function should never ignore any bit in its input.

> Which brings up a question regarding instance_hash(), func_hash(),
> meth_hash(), HKEY_hash() [or whatever it is called], and other
> which cast a pointer to a long (discarding the upperhalf of the
> pointer on Win64). Do these really need to be fixed. Am I nitpicking
> too much on this whole thing?

I have to apologize (although only semi-sincerely) for not being meaner
about this when I did the first 64-bit port.  I did that for my own use, and
avoided the problem areas rather than fix them.  But unless a language dies,
you end up paying for every hole in the end, and the sooner they're plugged
the less it costs.

That is, no, you're not nitpicking too much!  Everyone else probably thinks
you are <wink>, *but*, they're not running on 64-bit platforms yet so these
issues are still invisible to their gut radar.  I'll bet your life that
every hole remaining will trip up an end user eventually -- and they're the
ones least able to deal with the "mysterious problems".






From guido at python.org  Thu May 11 15:01:10 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 11 May 2000 09:01:10 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: Your message of "Thu, 11 May 2000 00:13:29 EDT."
             <000b01bfbaff$43d320c0$2aa0143f@tim> 
References: <000b01bfbaff$43d320c0$2aa0143f@tim> 
Message-ID: <200005111301.JAA00512@eric.cnri.reston.va.us>

I have to admit I have no clue about the details of this debate any
more, and I'm cowardly awaiting a patch submission that Tim approves
of.  (I'm hoping a day will come when Tim can check it in himself. :-)

In the mean time, I'd like to emphasize the key invariant here: we
must ensure that (a==b) => (hash(a)==hash(b)).  One quick way to deal
with this could be the following pseudo C:

    PyObject *double_hash(double x)
    {
        long l = (long)x;
        if ((double)l == x)
	    return long_hash(l);
	...double-specific code...
    }

This code makes one assumption: that if there exists a long l equal to
a double x, the cast (long)x should yield l...
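
A quick interpreter check of the easy case, which should already hold today:

    >>> hash(2) == hash(2.0)
    1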

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Fri May 12 00:14:45 2000
From: trentm at activestate.com (Trent Mick)
Date: Thu, 11 May 2000 15:14:45 -0700
Subject: [Python-Dev] testing the C API in the test suite (was: bug in PyLong_FromLongLong (PR#324))
In-Reply-To: <200005111323.JAA00637@eric.cnri.reston.va.us>
References: <200005111323.JAA00637@eric.cnri.reston.va.us>
Message-ID: <20000511151445.B15936@activestate.com>

> Date:    Wed, 10 May 2000 15:37:30 -0400
> From:    Thomas.Malik at t-online.de
> To:      python-bugs-list at python.org
> cc:      bugs-py at python.org
> Subject: [Python-bugs-list] bug in PyLong_FromLongLong (PR#324)
> 
> Full_Name: Thomas Malik
> Version: 1.5.2
> OS: all
> Submission from: p3e9ed447.dip.t-dialin.net (62.158.212.71)
> 
> 
> there's a bug in PyLong_FromLongLong, resulting in truncation of negative
> 64 bit integers. PyLong_FromLongLong starts with: 
> 	if( ival <= (LONG_LONG)LONG_MAX ) {
> 		return PyLong_FromLong( (long)ival );
> 	}
> 	else if( ival <= (unsigned LONG_LONG)ULONG_MAX ) {
> 		return PyLong_FromUnsignedLong( (unsigned long)ival );
> 	}
> 	else {
>              ....
> 
> Now, if ival is smaller than -LONG_MAX, it falls outside the long integer range
> (being a 64 bit negative integer), but gets handled by the first if-then-case
> in the above code ('cause it is, of course, smaller than LONG_MAX). This
> results in truncation of the 64 bit negative integer to a more or less
> arbitrary 32 bit number. The way to fix it is to compare the absolute value
> of ival against LONG_MAX in the first condition. The second condition
> (ULONG_MAX) must, at least, check whether ival is positive. 
> 

To test this error I found the easiest way was to make a C extension module
to Python that called the C API functions under test directly. I can't
quickly think of a way I could have shown this error *clearly* at the Python
level without a specialized extension module. This has been true for other
things that I have been testing.

Would it make sense to create a standard extension module (called '__test' or
something like that) in which direct tests on the C API could be made? This
would be hooked into the standard testsuite via a test_capi.py that would:
- import __test
- run every exported function in __test (or every one starting with 'test_',
  or whatever)
- the ImportError could continue to be used to signify skipping, etc
  (although, I think that a new, more explicit TestSuiteError class would be
  more appropriate and clear)

Does something like this already exist that I am missing?
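
A rough sketch of the Python side, just to make the idea concrete (the
'__test' module name and the 'test_' prefix are hypothetical, as above):

    # test_capi.py -- sketch only; assumes a C extension module named
    # '__test' whose test entry points are named 'test_*' and raise an
    # exception on failure.
    import __test

    def run_all():
        names = dir(__test)
        names.sort()
        for name in names:
            if name[:5] == 'test_':
                func = getattr(__test, name)
                if callable(func):
                    print 'test_capi:', name
                    func()

    run_all()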

This would make testing some things a lot easier, and clearer. Where
some interface is exposed to the Python programmer it is appropriate to test
it at the Python level. Python also provides a C API and it would be
appropriate to test that at the C level.

I would like to hear some people's thoughts before I go off and put anything
together.

Thanks,
Trent


-- 
Trent Mick
trentm at activestate.com



From DavidA at ActiveState.com  Fri May 12 00:16:43 2000
From: DavidA at ActiveState.com (David Ascher)
Date: Thu, 11 May 2000 15:16:43 -0700
Subject: [Python-Dev] c.l.p.announce
Message-ID: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>

What's the status of comp.lang.python.announce and the 'reviving' thereof?

--david



From tim_one at email.msn.com  Fri May 12 04:58:35 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 May 2000 22:58:35 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <200005111301.JAA00512@eric.cnri.reston.va.us>
Message-ID: <000001bfbbbd$f74572c0$9ca2143f@tim>

[Guido]
> I have to admit I have no clue about the details of this debate any
> more,

Na, there's no debate here.  I believe I confused things by misunderstanding
what Trent's original claim was (sorry, Trent!), but we bumped into real
flaws in the current hash anyway (even on 32-bit machines).  I don't think
there's any actual disagreement about anything here.

> and I'm cowardly awaiting a patch submission that Tim approves
> of.

As am I <wink>.

> (I'm hoping a day will come when Tim can check it in himself. :-)

Well, all you have to do to make that happen is get a real job and then hire
me <wink>.

> In the mean time, I'd like to emphasize the key invariant here: we
> must ensure that (a==b) => (hash(a)==hash(b)).

Absolutely.  That's already true, and is so non-controversial that Trent
elided ("...") the code for that in his last post.

> One quick way to deal with this could be the following pseudo C:
>
>     PyObject *double_hash(double x)
>     {
>         long l = (long)x;
>         if ((double)l == x)
> 	    return long_hash(l);
> 	...double-specific code...
>     }
>
> This code makes one assumption: that if there exists a long l equal to
> a double x, the cast (long)x should yield l...

No, that fails on two counts:

1.  If x is "too big" to fit in a long (and a great many doubles are),
    the cast to long is undefined.  Don't know about all current platforms,
    but on the KSR platform such casts raised a fatal hardware
    exception.  The current code already accomplishes this part in a
    safe way (which Trent's patch improves by using a symbol instead of
    the current hard-coded hex constant).

2.  The key invariant needs to be preserved also when x is an exact
    integral value that happens to be (possibly very!) much bigger than
    a C long; e.g.,

>>> long(1.23e300)  # 1.23e300 is an integer! albeit not the one you think
12299999999999999456195024356787918820614965027709909500456844293279
60298864608335541984218516600989160291306221939122973741400364055485
57167627474369519296563706976894811817595986395177079943535811102573
51951343133141138298152217970719263233891682157645730823560232757272
73837119288529943287157489664L
>>> hash(1.23e300) == hash(_)
1
>>>

The current code already handles that correctly too.  All the problems occur
when the double has a non-zero fractional part, and Trent knows how to fix
that now.  hash(x) may differ across platforms because sizeof(long) differs
across platforms, but that's just as true of strings as floats (i.e., Python
has never computed platform-independent hashes -- if that bothers *you*
(doesn't bother me), that's the part you should chime in on).





From guido at python.org  Fri May 12 14:24:25 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 08:24:25 -0400
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: Your message of "Thu, 11 May 2000 15:16:43 PDT."
             <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> 
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> 
Message-ID: <200005121224.IAA06063@eric.cnri.reston.va.us>

> What's the status of comp.lang.python.announce and the 'reviving' thereof?

Good question.  Several of us here at CNRI have volunteered to become
moderators.  I think we may have to start faking Approved: headers in
the mean time...

(I wonder if we can make posts to python-announce at python.com be
forwarded to c.l.py.a with such a header automatically tacked on?)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Fri May 12 15:43:37 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 15:43:37 +0200
Subject: [Python-Dev] Unicode and its partners...
Message-ID: <391C0A89.819A33EA@lemburg.com>

It got a little silent around the 7-bit vs. 8-bit vs. UTF-8
discussion. 

Not that I would like it to restart (I think everybody has
made their point), but it kind of surprised me that now with the
ability to actually set the default string encoding at run-time,
noone seems to have played around with it...

>>> import sys
>>> sys.set_string_encoding('unicode-escape')
>>> "abc???" + u"abc"
u'abc\344\366\374abc'
>>> "abc???\u1234" + u"abc"
u'abc\344\366\374\u1234abc'
>>> print "abc???\u1234" + u"abc"
abc\344\366\374\u1234abc

Any takers ?

BTW, has anyone tried to use the codec design for tasks
other than converting text?  It should also be usable for,
e.g., compressing/decompressing or other data-oriented
content.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From effbot at telia.com  Fri May 12 16:25:24 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Fri, 12 May 2000 16:25:24 +0200
Subject: [Python-Dev] Unicode and its partners...
References: <391C0A89.819A33EA@lemburg.com>
Message-ID: <026901bfbc1d$efe06fc0$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> It got a little silent around the 7-bit vs. 8-bit vs. UTF-8
> discussion. 

that's only because I've promised Guido to prepare SRE
for the next alpha, before spending more time trying to
get this one done right ;-)

and as usual, the last 10% takes 90% of the effort :-(

</F>




From akuchlin at mems-exchange.org  Fri May 12 16:27:21 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Fri, 12 May 2000 10:27:21 -0400 (EDT)
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: <200005121224.IAA06063@eric.cnri.reston.va.us>
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>
	<200005121224.IAA06063@eric.cnri.reston.va.us>
Message-ID: <14620.5321.510321.341870@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>(I wonder if we can make posts to python-announce at python.com be
>forwarded to c.l.py.a with such a header automatically tacked on?)

Probably not a good idea; if the e-mail address is on the Web site, it
probably gets a certain amount of spam that would need to be filtered
out.  

--amk



From guido at python.org  Fri May 12 16:31:55 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 10:31:55 -0400
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: Your message of "Fri, 12 May 2000 10:27:21 EDT."
             <14620.5321.510321.341870@amarok.cnri.reston.va.us> 
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> <200005121224.IAA06063@eric.cnri.reston.va.us>  
            <14620.5321.510321.341870@amarok.cnri.reston.va.us> 
Message-ID: <200005121431.KAA06538@eric.cnri.reston.va.us>

> Guido van Rossum writes:
> >(I wonder if we can make posts to python-announce at python.com be
> >forwarded to c.l.py.a with such a header automatically tacked on?)
> 
> Probably not a good idea; if the e-mail address is on the Web site, it
> probably gets a certain amount of spam that would need to be filtered
> out.  

OK, let's make it a moderated mailman mailing list; we can make
everyone on python-dev (who wants to) a moderator.  Barry, is there an
easy way to add additional headers to messages posted by mailman to
the news gateway?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From jcollins at pacificnet.net  Fri May 12 17:39:28 2000
From: jcollins at pacificnet.net (Jeffery D. Collins)
Date: Fri, 12 May 2000 08:39:28 -0700
Subject: [Python-Dev] c.l.p.announce
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com> <200005121224.IAA06063@eric.cnri.reston.va.us>  
	            <14620.5321.510321.341870@amarok.cnri.reston.va.us> <200005121431.KAA06538@eric.cnri.reston.va.us>
Message-ID: <391C25B0.EC327BCF@pacificnet.net>

I volunteer to moderate.

Jeff


Guido van Rossum wrote:

> > Guido van Rossum writes:
> > >(I wonder if we can make posts to python-announce at python.com be
> > >forwarded to c.l.py.a with such a header automatically tacked on?)
> >
> > Probably not a good idea; if the e-mail address is on the Web site, it
> > probably gets a certain amount of spam that would need to be filtered
> > out.
>
> OK, let's make it a moderated mailman mailing list; we can make
> everyone on python-dev (who wants to) a moderator.  Barry, is there an
> easy way to add additional headers to messages posted by mailman to
> the news gateway?
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>




From bwarsaw at python.org  Fri May 12 17:41:01 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Fri, 12 May 2000 11:41:01 -0400 (EDT)
Subject: [Python-Dev] c.l.p.announce
References: <PLEJJNOHDIGGLDPOGPJJCEACCCAA.DavidA@ActiveState.com>
	<200005121224.IAA06063@eric.cnri.reston.va.us>
	<14620.5321.510321.341870@amarok.cnri.reston.va.us>
	<200005121431.KAA06538@eric.cnri.reston.va.us>
Message-ID: <14620.9741.164735.998570@anthem.cnri.reston.va.us>

>>>>> "GvR" == Guido van Rossum <guido at python.org> writes:

    GvR> OK, let's make it a moderated mailman mailing list; we can
    GvR> make everyone on python-dev (who wants to) a moderator.
    GvR> Barry, is there an easy way to add additional headers to
    GvR> messages posted by mailman to the news gateway?

No, but I'll add that.  It might be a little while before I push the
changes out to python.org; I've got a bunch of things I need to test
first.

-Barry



From mal at lemburg.com  Fri May 12 17:47:55 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 17:47:55 +0200
Subject: [Python-Dev] Landmark
Message-ID: <391C27AB.2F5339D6@lemburg.com>

While trying to configure an in-package Python interpreter
I found that the interpreter still uses 'string.py' as
landmark for finding the standard library.

Since string.py is being deprecated, I think we should
consider a new landmark (such as os.py) or maybe even a
whole new strategy for finding the standard lib location.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at python.org  Fri May 12 21:04:50 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 15:04:50 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 17:47:55 +0200."
             <391C27AB.2F5339D6@lemburg.com> 
References: <391C27AB.2F5339D6@lemburg.com> 
Message-ID: <200005121904.PAA08166@eric.cnri.reston.va.us>

> While trying to configure an in-package Python interpreter
> I found that the interpreter still uses 'string.py' as
> landmark for finding the standard library.

Oops.

> Since string.py is being deprecated, I think we should
> consider a new landmark (such as os.py) or maybe even a
> whole new strategy for finding the standard lib location.

I don't see a need for a new strategy, but I'll gladly accept patches
that look for os.py.  Note that there are several versions of that
code: Modules/getpath.c, PC/getpathp.c, PC/os2vacpp/getpathp.c.
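
(For reference, the landmark search in those files boils down to roughly
this, sketched in Python -- the exact directories and version string vary
per platform, so take it as an approximation only:)

import os

def search_for_prefix(argv0_path, landmark="os.py"):
    # walk up from the directory holding the executable, looking for
    # <dir>/lib/python1.6/<landmark>
    dir = argv0_path
    while dir:
        if os.path.isfile(os.path.join(dir, "lib", "python1.6", landmark)):
            return dir
        parent = os.path.dirname(dir)
        if parent == dir:
            break
        dir = parent
    return None        # caller falls back to the compiled-in prefix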

--Guido van Rossum (home page: http://www.python.org/~guido/)




From gmcm at hypernet.com  Fri May 12 21:50:56 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 12 May 2000 15:50:56 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: <200005121904.PAA08166@eric.cnri.reston.va.us>
References: Your message of "Fri, 12 May 2000 17:47:55 +0200."             <391C27AB.2F5339D6@lemburg.com> 
Message-ID: <1253961418-52039567@hypernet.com>

[MAL]
> > Since string.py is being deprecated, I think we should
> > consider a new landmark (such as os.py) or maybe even a
> > whole new strategy for finding the standard lib location.
[GvR]
> I don't see a need for a new strategy

I'll argue for (a choice of) new strategy. The getpath & friends 
code spends a whole lot of time and energy trying to reverse 
engineer things like developer builds and strange sys-admin 
pranks. I agree that code shouldn't die. But it creates painful 
startup times when Python is being used for something like 
CGI.

How about something on the command line that says (pick 
one or come up with another choice):
 - PYTHONPATH is *it*
 - use PYTHONPATH and .pth files found <here>
 - start in <sys.prefix>/lib/python<sys.version[:3]> and add 
PYTHONPATH
 - there's a .pth file <here> with the whole list
 - pretty much any permutation of the above elements

The idea being to avoid a few hundred system calls when a 
dozen or so will suffice. Default behavior should still be to 
magically get it right.


- Gordon



From guido at python.org  Fri May 12 22:29:05 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 16:29:05 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 15:50:56 EDT."
             <1253961418-52039567@hypernet.com> 
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com>  
            <1253961418-52039567@hypernet.com> 
Message-ID: <200005122029.QAA08252@eric.cnri.reston.va.us>

> [MAL]
> > > Since string.py is being deprecated, I think we should
> > > consider a new landmark (such as os.py) or maybe even a
> > > whole new strategy for finding the standard lib location.
> [GvR]
> > I don't see a need for a new strategy
> 
> I'll argue for (a choice of) new strategy. The getpath & friends 
> code spends a whole lot of time and energy trying to reverse 
> engineer things like developer builds and strange sys-admin 
> pranks. I agree that code shouldn't die. But it creates painful 
> startup times when Python is being used for something like 
> CGI.
> 
> How about something on the command line that says (pick 
> one or come up with another choice):
>  - PYTHONPATH is *it*
>  - use PYTHONPATH and .pth files found <here>
>  - start in <sys.prefix>/lib/python<sys.version[:3]> and add 
> PYTHONPATH
>  - there's a .pth file <here> with the whole list
>  - pretty much any permutation of the above elements
> 
> The idea being to avoid a few hundred system calls when a 
> dozen or so will suffice. Default behavior should still be to 
> magically get it right.

I'm not keen on changing the meaning of PYTHONPATH, but if you're
willing and able to set an environment variable, you can set
PYTHONHOME and it will abandon the search.  If you want a command line
option for CGI, an option to set PYTHONHOME makes sense.
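
(For the record, setting the variable is all it takes; e.g. a tiny CGI
wrapper could do it today -- the paths below are made up:)

import os, sys

os.environ["PYTHONHOME"] = "/usr/local"        # trust this, skip the search
os.execv("/usr/local/bin/python",
         ["python", "/home/www/cgi-bin/real-script.py"] + sys.argv[1:])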

--Guido van Rossum (home page: http://www.python.org/~guido/)



From weeks at golden.dtc.hp.com  Fri May 12 22:29:52 2000
From: weeks at golden.dtc.hp.com ( (Greg Weeks))
Date: Fri, 12 May 2000 13:29:52 -0700
Subject: [Python-Dev] "is", "==", and sameness
Message-ID: <200005122029.AA126653392@golden.dtc.hp.com>

>From the Python Reference Manual [emphasis added]:

    Types affect almost all aspects of object behavior. Even the importance
    of object IDENTITY is affected in some sense: for immutable types,
    operations that compute new values may actually return a reference to
    any existing object with the same type and value, while for mutable
    objects this is not allowed.

This seems to be saying that two immutable objects are (in some sense) the
same iff they have the same type and value, while two mutable objects are
the same iff they have the same id().  I heartily agree, and I think that
this notion of sameness is the single most useful variant of the "equals"
relation.

Indeed, I think it worthwhile to consider modifying the "is" operator to
compute this notion of sameness.  (This would break only exceedingly
strange user code.)  "is" would then be the natural comparator of
dictionary keys, which could then be any object.

The usefulness of this idea is limited by the absence of user-definable
immutable instances.  It might be nice to be able to declare a class -- eg,
Point -- to have immutable instances.  This declaration would promise
that:

1.  When the expression Point(3.0,4.0) is evaluated, its reference count
    will be zero.

2.  After Point(3.0,4.0) is evaluated, its attributes will not be changed.


I sent the above thoughts to Guido, who graciously and politely responded
that they struck him as somewhere between bad and poorly presented.  (Which
surprised me.  I would have guessed that the ideas were already in his
head.)  Nevertheless, he mentioned passing them along to you, so I have.


Regards,
Greg



From gmcm at hypernet.com  Sat May 13 00:05:46 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 12 May 2000 18:05:46 -0400
Subject: [Python-Dev] "is", "==", and sameness
In-Reply-To: <200005122029.AA126653392@golden.dtc.hp.com>
Message-ID: <1253953328-52526193@hypernet.com>

Greg Weeks wrote:

> >From the Python Reference Manual [emphasis added]:
> 
>     Types affect almost all aspects of object behavior. Even the
>     importance of object IDENTITY is affected in some sense: for
>     immutable types, operations that compute new values may
>     actually return a reference to any existing object with the
>     same type and value, while for mutable objects this is not
>     allowed.
> 
> This seems to be saying that two immutable objects are (in some
> sense) the same iff they have the same type and value, while two
> mutable objects are the same iff they have the same id().  I
> heartily agree, and I think that this notion of sameness is the
> single most useful variant of the "equals" relation.

Notice the "may" in the reference text.

>>> 88 + 11 is 98 + 1
1
>>> 100 + 3 is 101 + 2
0
>>>

Python goes to the effort of keeping singleton instances of the 
integers less than 100. In certain situations, a similar effort is 
invested in strings. But it is by no means the general case, 
and (unless you've got a solution) it would be expensive to 
make it so.
 
> Indeed, I think it worthwhile to consider modifying the "is"
> operator to compute this notion of sameness.  (This would break
> only exceedingly strange user code.)  "is" would then be the
> natural comparator of dictionary keys, which could then be any
> object.

The implications don't follow. The restriction that dictionary 
keys be immutable is not because of the comparison method. 
It's the principle of "least surprise". Use a mutable object as a 
dict key. Now mutate the object. Now the key / value pair in 
the dictionary is inaccessible. That is, there is some pair (k,v) 
in dict.items() where dict[k] does not yield v.
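
A short demonstration (the class is made up; any value-hashed mutable
object would do):

class Key:
    def __init__(self, v): self.v = v
    def __hash__(self): return hash(self.v)
    def __cmp__(self, other): return cmp(self.v, other.v)

k = Key(1)
d = {k: 'the value'}
k.v = 2                    # mutate the key
print d.items()            # the pair is still in there...
print d.has_key(k)         # 0 -- ...but you can't reach it through k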
 
> The usefulness of this idea is limited by the absence of
> user-definable immutable instances.  It might be nice to be able
> to declare a class -- eg, Point -- to have immutable
> instances.  This declaration would promise that:
> 
> 1.  When the expression Point(3.0,4.0) is evaluated, its
> reference count
>     will be zero.

That's a big change from the way Python works:

>>> sys.getrefcount(None)
167
>>>
 
> 2.  After Point(3.0,4.0) is evaluated, its attributes will not be
> changed.

You can make an instance effectively immutable (by messing 
with __setattr__). You can override __hash__ to return 
something suitable (eg, hash(id(self))), and then use an 
instance as a dict key. You don't even need to do the first to 
do the latter.
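
A rough sketch of both tricks together (untested):

class Point:
    def __init__(self, x, y):
        # sneak past our own __setattr__
        self.__dict__['x'] = x
        self.__dict__['y'] = y
    def __setattr__(self, name, value):
        raise TypeError, "Point instances are immutable"
    def __hash__(self):
        return hash(id(self))          # identity-based, as above

p = Point(3.0, 4.0)
d = {p: 'origin-ish'}                  # usable as a dict key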

- Gordon



From mal at lemburg.com  Fri May 12 23:25:02 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 23:25:02 +0200
Subject: [Python-Dev] Landmark
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com>  
	            <1253961418-52039567@hypernet.com> <200005122029.QAA08252@eric.cnri.reston.va.us>
Message-ID: <391C76AE.A3118AF1@lemburg.com>

Guido van Rossum wrote:
> [Gordon]
> > [MAL]
> > > > Since string.py is being deprecated, I think we should
> > > > consider a new landmark (such as os.py) or maybe even a
> > > > whole new strategy for finding the standard lib location.
> > [GvR]
> > > I don't see a need for a new strategy
> >
> > I'll argue for (a choice of) new strategy.
> 
> I'm not keen on changing the meaning of PYTHONPATH, but if you're
> willing and able to set an environment variable, you can set
> PYTHONHOME and it will abandon the search.  If you want a command line
> option for CGI, an option to set PYTHONHOME makes sense.

The routines will still look for the landmark though (which
is what surprised me and made me look deeper -- setting
PYTHONHOME didn't work for me because I had only .pyo files
in the lib/python1.5 dir).

Perhaps Python should put more trust into the setting of
PYTHONHOME ?!

[And of course the landmark should change to something like
 os.py -- I'll try to submit a patch for this.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at python.org  Sat May 13 02:53:27 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 12 May 2000 20:53:27 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 23:25:02 +0200."
             <391C76AE.A3118AF1@lemburg.com> 
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com> <1253961418-52039567@hypernet.com> <200005122029.QAA08252@eric.cnri.reston.va.us>  
            <391C76AE.A3118AF1@lemburg.com> 
Message-ID: <200005130053.UAA08687@eric.cnri.reston.va.us>

[me]
> > I'm not keen on changing the meaning of PYTHONPATH, but if you're
> > willing and able to set an environment variable, you can set
> > PYTHONHOME and it will abandon the search.  If you want a command line
> > option for CGI, an option to set PYTHONHOME makes sense.

[MAL]
> The routines will still look for the landmark though (which
> is what surprised me and made me look deeper -- setting
> PYTHONHOME didn't work for me because I had only .pyo files
> in the lib/python1.5 dir).
> 
> Perhaps Python should put more trust into the setting of
> PYTHONHOME ?!

Yes!  Note that PC/getpathp.c already trusts PYTHONHOME 100% --
Modules/getpath.c should follow suit.

> [And of course the landmark should change to something like
>  os.py -- I'll try to submit a patch for this.]

Maybe you can combine the two?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From effbot at telia.com  Sat May 13 14:56:41 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 14:56:41 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
Message-ID: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>

in the current 're' engine, a newline is chr(10) and nothing
else.

however, in the new unicode aware engine, I used the new
LINEBREAK predicate instead, but it turned out to break one
of the tests in the current test suite:

    sre.match('a\rb', 'a.b') => None

(unicode adds chr(13), chr(28), chr(29), chr(30), and also
unichr(133), unichr(8232), and unichr(8233) to the list of
line breaking codes)

what's the best way to deal with this?  I see three alter-
natives:

a) stick to the old definition, and use chr(10) also for
   unicode strings

b) use different definitions for 8-bit strings and unicode
   strings; if given an 8-bit string, use chr(10); if given
   a 16-bit string, use the LINEBREAK predicate.

c) use LINEBREAK in either case.

I think (c) is the "right thing", but it's the only one that may
break existing code...
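
(fwiw, the type dispatch in (b) would be something along these lines --
the names are made up:)

import types

NARROW_LINEBREAKS = "\n"
WIDE_LINEBREAKS = (u"\n\r" + unichr(28) + unichr(29) + unichr(30) +
                   unichr(133) + unichr(8232) + unichr(8233))

def linebreaks(s):
    # pick the line-break set based on the string type
    if type(s) is types.UnicodeType:
        return WIDE_LINEBREAKS
    return NARROW_LINEBREAKS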

</F>




From bckfnn at worldonline.dk  Sat May 13 15:47:10 2000
From: bckfnn at worldonline.dk (Finn Bock)
Date: Sat, 13 May 2000 13:47:10 GMT
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <391d5b7f.3713359@smtp.worldonline.dk>

On Sat, 13 May 2000 14:56:41 +0200, you wrote:

>in the current 're' engine, a newline is chr(10) and nothing
>else.
>
>however, in the new unicode aware engine, I used the new
>LINEBREAK predicate instead, but it turned out to break one
>of the tests in the current test suite:
>
>    sre.match('a\rb', 'a.b') => None
>
>(unicode adds chr(13), chr(28), chr(29), chr(30), and also
>unichr(133), unichr(8232), and unichr(8233) to the list of
>line breaking codes)
>
>what's the best way to deal with this?  I see three alter-
>natives:
>
>a) stick to the old definition, and use chr(10) also for
>   unicode strings

In the ORO matcher that comes with jpython, the dot matches all but
chr(10). But that is bad IMO. Unicode should use the LINEBREAK
predicate.

regards,
finn



From effbot at telia.com  Sat May 13 16:14:32 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 16:14:32 +0200
Subject: [Python-Dev] for the todo list: cStringIO uses string.joinfields
Message-ID: <00a101bfbce5$91dbd860$34aab5d4@hagrid>

the O_writelines function in Modules/cStringIO contains the
following code:

  if (!string_joinfields) {
    UNLESS(string_module = PyImport_ImportModule("string")) {
      return NULL;
    }

    UNLESS(string_joinfields=
        PyObject_GetAttrString(string_module, "joinfields")) {
      return NULL;
    }

    Py_DECREF(string_module);
  }

I suppose someone should fix this some day...
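
(at the Python level, all that machinery amounts to roughly this, which
is why the string dependency looks easy to drop:)

# roughly what O_writelines does today:
import string
def writelines(f, lines):
    f.write(string.joinfields(lines, ''))

# a string-module-free equivalent would be to just loop:
def writelines2(f, lines):
    for line in lines:
        f.write(line)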

(btw, the C API reference implies that ImportModule doesn't
use import hooks.  does that mean that cStringIO doesn't work
under e.g. Gordon's installer?)

</F>




From effbot at telia.com  Sat May 13 16:36:30 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 16:36:30 +0200
Subject: [Python-Dev] cvs for dummies
Message-ID: <000d01bfbce8$a3466f40$34aab5d4@hagrid>

what's the best way to make sure that a "cvs update" really brings
everything up to date, even if you've accidentally changed some-
thing in your local workspace?

</F>




From moshez at math.huji.ac.il  Sat May 13 16:58:17 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Sat, 13 May 2000 17:58:17 +0300 (IDT)
Subject: [Python-Dev] unicode regex quickie: should a newline be the same
 thing as a linebreak?
In-Reply-To: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <Pine.GSO.4.10.10005131755560.14940-100000@sundial>

On Sat, 13 May 2000, Fredrik Lundh wrote:

> what's the best way to deal with this?  I see three alter-
> natives:
> 
> a) stick to the old definition, and use chr(10) also for
>    unicode strings

If we also supply a \something (is \l taken?) for LINEBREAK, people can
then write \l (or [^\l]) whenever they need to deal with Unicode line
breaks explicitly. Just a suggestion for getting close to the right thing
without breaking existing code.

--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From fdrake at acm.org  Sat May 13 17:22:12 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Sat, 13 May 2000 11:22:12 -0400 (EDT)
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
Message-ID: <14621.29476.390092.610442@newcnri.cnri.reston.va.us>

Fredrik Lundh writes:
 > what's the best way to make sure that a "cvs update" really brings
 > everything up to date, even if you've accidentally changed some-
 > thing in your local workspace?

  Delete the file(s) that got changed and cvs update again.


  -Fred

--
Fred L. Drake, Jr.           <fdrake at acm.org>
Corporation for National Research Initiatives




From effbot at telia.com  Sat May 13 17:28:02 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 17:28:02 +0200
Subject: [Python-Dev] cvs for dummies
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid> <14621.29476.390092.610442@newcnri.cnri.reston.va.us>
Message-ID: <001901bfbcef$d4672b80$34aab5d4@hagrid>

Fred L. Drake, Jr. wrote:
> Fredrik Lundh writes:
>  > what's the best way to make sure that a "cvs update" really brings
>  > everything up to date, even if you've accidentally changed some-
>  > thing in your local workspace?
> 
>   Delete the file(s) that got changed and cvs update again.

okay, what's the best way to get a list of locally changed files?

(in this case, one file ended up with neat little <<<<<<< and
>>>>>> marks in it...  several weeks and about a dozen CVS
updates after I'd touched it...)

</F>




From gmcm at hypernet.com  Sat May 13 18:25:42 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 12:25:42 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: <00a101bfbce5$91dbd860$34aab5d4@hagrid>
Message-ID: <1253887332-56495837@hypernet.com>

Fredrik wrote:

> (btw, the C API reference implies that ImportModule doesn't
> use import hooks.  does that mean that cStringIO doesn't work
> under e.g. Gordon's installer?)

You have to fool C code that uses ImportModule by doing an 
import first in your Python code. It's the same for freeze. It's 
tiresome tracking this stuff down. For example, to use shelve:

# this is needed because of the use of __import__ in anydbm 
# (modulefinder does not follow __import__)
import dbhash
# the next 2 are needed because cPickle won't use our import
# hook so we need them already in sys.modules when
# cPickle starts
import string
import copy_reg
# now it will work
import shelve

Imagine the c preprocessor letting you do
#define snarf #include
and then trying to use a dependency tracker.


- Gordon



From effbot at telia.com  Sat May 13 20:09:44 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 20:09:44 +0200
Subject: [Python-Dev] hey, who broke the array module?
Message-ID: <006e01bfbd06$6ba21120$34aab5d4@hagrid>

sigh.  never resync the CVS repository until you've fixed all
bugs in your *own* code ;-)

in 1.5.2:

>>> array.array("h", [65535])
array('h', [-1])

>>> array.array("H", [65535])
array('H', [65535])

in the current CVS version:

>>> array.array("h", [65535])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: signed short integer is greater than maximum

okay, this might break some existing code -- but one
can always argue that such code was already broken.

on the other hand:

>>> array.array("H", [65535])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: signed short integer is greater than maximum

oops.

dunno if the right thing would be to add support for various kinds
of unsigned integers to Python/getargs.c, or to hack around this
in the array module...
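
(the hack-around version is just a per-typecode range check; in Python
terms, something like this for the short codes:)

def check_short(x, signed):
    # bounds the array module would have to enforce itself
    if signed:
        lo, hi = -32768, 32767
    else:
        lo, hi = 0, 65535
    if not (lo <= x <= hi):
        raise OverflowError, "short integer out of range"
    return x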

</F>




From mhammond at skippinet.com.au  Sat May 13 21:19:44 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun, 14 May 2000 05:19:44 +1000
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <001901bfbcef$d4672b80$34aab5d4@hagrid>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOENPCKAA.mhammond@skippinet.com.au>

> >   Delete the file(s) that got changed and cvs update again.
>
> okay, what's the best way to get a list of locally changed files?

Diff the directory.  Or better still, use wincvs - nice little red icons
for the changed files.

> (in this case, one file ended up with neat little <<<<<<< and
> >>>>>> marks in it...  several weeks and about a dozen CVS
> updates after I'd touched it...)

This happens when CVS can't manage to perform a successful merge.  Your
original is still there, but with a funky name (in the same directory - it
should be obvious).

WinCVS also makes this a little more obvious - the icon has a special
"conflict" indicator, and the console messages also reflect the conflict in
red.

Mark.




From tismer at tismer.com  Sat May 13 22:32:45 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 22:32:45 +0200
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no>
Message-ID: <391DBBED.B252E597@tismer.com>


Magnus Lie Hetland wrote:
> 
> Aahz Maruch <aahz at netcom.com> wrote in message
> news:8fh9ki$51h$1 at slb3.atl.mindspring.net...
> > In article <rhgmhs07ulsob3pptd6eh4f2ag4qj911bj at 4ax.com>,
> > Ben Wolfson  <rumjuggler at cryptarchy.org> wrote:
> > >
> > >', '.join(['foo', 'bar', 'baz'])
> >
> > This only works in Python 1.6, which is only released as an alpha at
> > this point.  I suggest rather strongly that we avoid 1.6-specific idioms
> > until 1.6 gets released, particularly in relation to FAQ-type questions.
> 
> This is indeed a bit strange IMO... If I were to join the elements of a
> list I would rather ask the list to do it than some string... I.e.
> 
>    ['foo', 'bar', 'baz'].join(', ')
> 
> (...although it is the string that joins the elements in the resulting
> string...)

I believe the notion of "everything is an object, and objects
provide all their functionality" is a bit stressed in Python 1.6.
The above example touches the limits where I'd just say
"OO isn't always the right thing, and always OO is the wrong thing".

A clear advantage of 1.6's string methods is that much code
becomes shorter and easier to read, since the nesting level
of braces is reduced quite a bit. The notation also reads more
in the order in which actions are actually performed.

The split/join issue is really on the edge where I begin to not
like it.
It is clear that the join method *must* be performed as a method
of the joining character, since the method expects a list as its
argument. It doesn't make sense to use a list method, since
lists have nothing to do with strings.
Furthermore, the argument to join can be any sequence. Adding
a join method to every sequence type, just because we want to join
some strings, would be overkill.
So the " ".join(seq) notation is the only possible compromise,
IMHO.
It is actually arguable whether this is still "Pythonic".
What you want is to join a list of strings with some other string.
This is neither a natural method of the list, nor of the joining
string in the first place.
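
(Side by side, for reference:)

>>> import string
>>> words = ['a', 'list', 'of', 'strings']
>>> string.join(words, ", ")        # the 1.5.2 spelling
'a, list, of, strings'
>>> ", ".join(words)                # the 1.6 spelling -- same result
'a, list, of, strings'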

If it came to the point where the string module had some extra
methods which operate on, say, two lists of strings, we would
have been totally lost, and enforcing some OO method to support
it would be completely off the road.

It is already a little strange that most string methods
return new objects all the time, since strings are immutable.

join is a really extreme design; compared with the other
string functions, which became more readable, I think it is
counter-intuitive and not the way people think.
They think "I want to join this list with this string".

Furthermore, you still have to import string, in order to use
its constants.

Instead of using a module with constants and functions, we
now always have to refer to instances and use their methods.
It has some benefits in simple cases.

But if there are a number of different objects handled
by a function, I think enforcing it to be a method of
one of the objects is the wrong way, OO overdone.

doing-OO-only-if-it-looks-natural-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From guido at python.org  Sat May 13 22:39:19 2000
From: guido at python.org (Guido van Rossum)
Date: Sat, 13 May 2000 16:39:19 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: Your message of "Sat, 13 May 2000 12:25:42 EDT."
             <1253887332-56495837@hypernet.com> 
References: <1253887332-56495837@hypernet.com> 
Message-ID: <200005132039.QAA09114@eric.cnri.reston.va.us>

> Fredrik wrote:
> 
> > (btw, the C API reference implies that ImportModule doesn't
> > use import hooks.  does that mean that cStringIO doesn't work
> > under e.g. Gordon's installer?)
> 
> You have to fool C code that uses ImportModule by doing an 
> import first in your Python code. It's the same for freeze. It's 
> tiresome tracking this stuff down. For example, to use shelve:
> 
> # this is needed because of the use of __import__ in anydbm 
> # (modulefinder does not follow __import__)
> import dbhash
> # the next 2 are needed because cPickle won't use our import
> # hook so we need them already in sys.modules when
> # cPickle starts
> import string
> import copy_reg
> # now it will work
> import shelve

Hm, the way I read the code (but I didn't write it!) it calls
PyImport_Import, which is a higher level function that *does* use the
__import__ hook.  Maybe this wasn't always the case?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Sat May 13 22:43:32 2000
From: guido at python.org (Guido van Rossum)
Date: Sat, 13 May 2000 16:43:32 -0400
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: Your message of "Sat, 13 May 2000 13:47:10 GMT."
             <391d5b7f.3713359@smtp.worldonline.dk> 
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>  
            <391d5b7f.3713359@smtp.worldonline.dk> 
Message-ID: <200005132043.QAA09151@eric.cnri.reston.va.us>

[Swede]
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a\rb', 'a.b') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
> >
> >what's the best way to deal with this?  I see three alter-
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings

[Finn]
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

There's no need for invention.  We're supposed to be as close to Perl
as reasonable.  What does Perl do?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gmcm at hypernet.com  Sat May 13 22:54:09 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 16:54:09 -0400
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
In-Reply-To: <391DBBED.B252E597@tismer.com>
Message-ID: <1253871224-57464726@hypernet.com>

Christian wrote:

> The split/join issue is really on the edge where I begin to not
> like it. It is clear that the join method *must* be performed as
> a method of the joining character, since the method expects a
> list as its argument.

We've been through this a number of times on c.l.py.

"What is this trash - I want list.join(sep)!"

After some head banging (often quite violent - ie, 4 or 5 
exchanges), they get that list.join(sep) sucks. But they still 
swear they'll never use sep.join(list).

So you end up saying "Well, string.join still works".

We'll need a pre-emptive FAQ entry with the link bound to a 
key stroke. Or a big increase in the PSU budget...

- Gordon



From gmcm at hypernet.com  Sat May 13 22:54:09 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 16:54:09 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: <200005132039.QAA09114@eric.cnri.reston.va.us>
References: Your message of "Sat, 13 May 2000 12:25:42 EDT."             <1253887332-56495837@hypernet.com> 
Message-ID: <1253871222-57464840@hypernet.com>

[Fredrik]
> > > (btw, the C API reference implies that ImportModule doesn't
> > > use import hooks.  does that mean that cStringIO doesn't work
> > > under e.g. Gordon's installer?)
[Guido]
> Hm, the way I read the code (but I didn't write it!) it calls
> PyImport_Import, which is a higher level function that *does* use
> the __import__ hook.  Maybe this wasn't always the case?

In stock 1.5.2 it's PyImport_ImportModule. Same in cPickle. 
I'm delighted to see them moving towards PyImport_Import.


- Gordon



From effbot at telia.com  Sat May 13 23:40:01 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sat, 13 May 2000 23:40:01 +0200
Subject: [Python-Dev] Re: [Patches] getpath patch
References: <391D2BC6.95E4FD3E@lemburg.com>
Message-ID: <001501bfbd23$cc45e160$34aab5d4@hagrid>

MAL wrote:
> Note: Python will dump core if it cannot find the exceptions
> module. Perhaps we should add a builtin _exceptions module
> (basically a frozen exceptions.py) which is then used as
> fallback solution ?!

or use this one:
http://w1.132.telia.com/~u13208596/exceptions.htm

</F>




From bwarsaw at python.org  Sat May 13 23:40:47 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Sat, 13 May 2000 17:40:47 -0400 (EDT)
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com>
	<8fe76b$684$1@newshost.accu.uu.nl>
	<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
	<8fh9ki$51h$1@slb3.atl.mindspring.net>
	<8fk4mh$i4$1@kopp.stud.ntnu.no>
	<391DBBED.B252E597@tismer.com>
Message-ID: <14621.52191.448037.799287@anthem.cnri.reston.va.us>

>>>>> "CT" == Christian Tismer <tismer at tismer.com> writes:

    CT> If it came to the point where the string module had some extra
    CT> methods which operate on two lists of string perhaps, we would
    CT> have been totally lost, and enforcing some OO method to
    CT> support it would be completely off the road.

The new .join() method reads a bit better if you first name the
glue string:

space = ' '
name = space.join(['Barry', 'Aloisius', 'Warsaw'])

But yes, it does look odd when used like

' '.join(['Christian', 'Aloisius', 'Tismer'])

I still think it's nice not to have to import string "just" to get the
join functionality, but remember of course that string.join() isn't
going away, so you can still use this if you like it better.

Alternatively, there has been talk about moving join() into the
built-ins, but I'm not sure if the semantics of that have been nailed
down.

-Barry



From tismer at tismer.com  Sat May 13 23:48:37 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 23:48:37 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
References: <1253871224-57464726@hypernet.com>
Message-ID: <391DCDB5.4FCAB97F@tismer.com>


Gordon McMillan wrote:
> 
> Christian wrote:
> 
> > The split/join issue is really on the edge where I begin to not
> > like it. It is clear that the join method *must* be performed as
> > a method of the joining character, since the method expects a
> > list as its argument.
> 
> We've been through this a number of times on c.l.py.

I know. It just came up when I really used it, when
I read through this huge patch from Fred Gansevles, and when
I see people wondering about it.
After all, it is no surprise. They are right.
If we have to change their mind in order to understand
a basic operation, then we are wrong, not they.

> "What is this trash - I want list.join(sep)!"
> 
> After some head banging (often quite violent - ie, 4 or 5
> exchanges), they get that list.join(sep) sucks. But they still
> swear they'll never use sep.join(list).
> 
> So you end up saying "Well, string.join still works".

And it is the cleanest possible way to go, IMHO.
Unless we had some compound object methods, like

(somelist, somestring).join()

> We'll need a pre-emptive FAQ entry with the link bound to a
> key stroke. Or a big increase in the PSU budget...

We should reconsider the OO pattern.
The user's complaining is natural. " ".join() is not.
We might have gone too far. 

Python isn't just OO, it is better.

Joining lists of strings is joining lists of strings.
This is not a method of a string in the first place.
And not a method of a sequence in the first place.

Making it a method of the joining string now appears to be
a hack to me. (Sorry, Tim, the idea was great in the first place)

I am now
+1 on leaving join() to the string module
-1 on making some filler.join() to be the preferred joining way.

this-was-my-most-conservative-day-since-years-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From tismer at tismer.com  Sat May 13 23:55:43 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 23:55:43 +0200
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com>
		<8fe76b$684$1@newshost.accu.uu.nl>
		<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
		<8fh9ki$51h$1@slb3.atl.mindspring.net>
		<8fk4mh$i4$1@kopp.stud.ntnu.no>
		<391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us>
Message-ID: <391DCF5F.BA981607@tismer.com>


"Barry A. Warsaw" wrote:
> 
> >>>>> "CT" == Christian Tismer <tismer at tismer.com> writes:
> 
>     CT> If it came to the point where the string module had some extra
>     CT> methods which operate on two lists of string perhaps, we would
>     CT> have been totally lost, and enforcing some OO method to
>     CT> support it would be completely off the road.
> 
> The new .join() method reads a bit better if you first name the
> glue string:
> 
> space = ' '
> name = space.join(['Barry', 'Aloisius', 'Warsaw'])

Agreed.

> But yes, it does look odd when used like
> 
> ' '.join(['Christian', 'Aloisius', 'Tismer'])

I'd love that Aloisius, really. I'll ask my parents for a renaming :-)

> I still think it's nice not to have to import string "just" to get the
> join functionality, but remember of course that string.join() isn't
> going away, so you can still use this if you like it better.

Sure, and I'm glad to be able to use string methods without ugly
imports. It just came back to me when my former colleague Axel last
met me and I showed him the 1.6 alpha with its string methods
(while looking over Fred's huge patch); he said
"Well, quite nice. So they now go the same wrong way as Java did?
The OO pattern is dead. This example shows why."

> Alternatively, there has been talk about moving join() into the
> built-ins, but I'm not sure if the semantics of that have been nailed
> down.

Sounds like a good alternative.

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From martin at loewis.home.cs.tu-berlin.de  Sun May 14 23:39:52 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 14 May 2000 23:39:52 +0200
Subject: [Python-Dev] Unicode
Message-ID: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>

> comments?  (for obvious reasons, I'm especially interested in comments
> from people using non-ASCII characters on a daily basis...)

> nobody?

Hi Fredrik,

I think the problem you are trying to pin down is not real. My guideline
for using Unicode in Python 1.6 will be that people should be very careful
*not* to mix byte strings and Unicode strings. If you are processing text
data obtained from a narrow-string source, you'll always have to make
an explicit decision about what the encoding is.
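
For example (with a made-up file, assuming the data happens to be
Latin-1):

data = open("somefile.txt").read()    # a plain byte string
text = unicode(data, "latin-1")       # explicit decision on the way in
out  = text.encode("utf-8")           # explicit decision on the way out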

If you follow this guideline, I think the Unicode type of Python 1.6
will work just fine.

If you use Unicode text *a lot*, you may find the need to combine it
with plain byte strings in a more convenient way. This is the time you
should look at the implicit conversion stuff, and see which parts of the
functionality are useful. You then don't need to memorize *all* the
rules where implicit conversion would work - just the cases you care
about.

That may all look difficult - it probably is. But then, it is not more
difficult than tuples vs. lists: why does

>>> [a,b,c] = (1,2,3)

work, and

>>> [1,2]+(3,4)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation

does not?

Regards,
Martin



From tim_one at email.msn.com  Mon May 15 01:51:41 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Sun, 14 May 2000 19:51:41 -0400
Subject: [Python-Dev] Memory woes under Windows
Message-ID: <000001bfbdff$5bcdfe40$192d153f@tim>

[Noah, I'm wondering whether this is related to our W98 NatSpeak woes --
Python grows its lists much like a certain product we both work on <ahem>
grows its arrays ...]

Here's a simple test case:

from time import clock

def run():
    n = 1
    while n < 4000000:
        a = []
        push = a.append
        start = clock()
        for i in xrange(n):
            push(1)
        finish = clock()
        print "%10d push  %10.3f" % (n, round(finish - start, 3))
        n = n + n

for i in (1, 2, 3):
    try:
        run()
    except MemoryError:
        print "Got a memory error"

So run() builds a number of power-of-2 sized lists, each by appending one
element at a time.  It prints the list length and elapsed time to build each
one (on Windows, this is basically wall-clock time, and is derived from the
Pentium's high-resolution cycle timer).  The driver simply runs this 3
times, reporting any MemoryError that pops up.

The largest array constructed has 2M elements, so consumes about 8Mb -- no
big deal on most machines these days.

Here's what happens on my new laptop (damn, this thing is fast! -- usually):

Win98 (Second Edition)
600MHz Pentium III
160Mb RAM
Python 1.6a2 from python.org, via the Windows installer

         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.011
      4096 push       0.020
      8192 push       0.053
     16384 push       0.074
     32768 push       0.163
     65536 push       0.262
    131072 push       0.514
    262144 push       0.713
    524288 push       1.440
   1048576 push       2.961
Got a memory error
         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.007
      4096 push       0.014
      8192 push       0.029
     16384 push       0.057
     32768 push       0.116
     65536 push       0.231
    131072 push       0.474
    262144 push       2.361
    524288 push      24.059
   1048576 push      67.492
Got a memory error
         1 push       0.000
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.000
       256 push       0.001
       512 push       0.001
      1024 push       0.003
      2048 push       0.007
      4096 push       0.014
      8192 push       0.028
     16384 push       0.057
     32768 push       0.115
     65536 push       0.232
    131072 push       0.462
    262144 push       2.349
    524288 push      23.982
   1048576 push      67.257
Got a memory error

Commentary:  The first time it runs, the timing behavior is
indistinguishable from O(N).  But realloc returns NULL at some point when
growing the 2M array!  There "should be" huge gobs of memory available.

The 2nd and 3rd runs are very similar to each other, both blow up at about
the same time, but both run *very* much slower than the 1st run before that
point as the list size gets non-trivial -- and, while the output doesn't
show this, the disk starts thrashing too.

It's *not* the case that Win98 won't give Python more than 8Mb of memory.
For example,

>>> a = [1]*30000000  # that's 30M
>>>

works fine and fast on this machine, with no visible disk traffic [Noah,
that line sucks up about 120Mb from malloc in one shot].

So, somehow or other, masses of allocations are confusing the system memory
manager nearly to death (implying we should use Vladimir's PyMalloc under
Windows after grabbing every byte the machine has <0.6 wink>).

My belief is that the Windows 1.6a2 from python.org was compiled with VC6,
yes?  Scream if that's wrong.

This particular test case doesn't run any better under my Win95 (original)
P5-166 with 32Mb RAM using Python 1.5.2.  But at work, we've got a
(unfortunately huge, and C++) program that runs much slower on a
large-memory W98 machine than a small-memory W95 one, due to disk thrashing.
It's a mystery!  If anyone has a clue about any of this, spit it out <wink>.

[Noah, I watched the disk cache size while running the above, and it's not
the problem -- while W98 had allocated about 100Mb for disk cache at the
start, it gracefully gave that up as the program's memory demands increased]

just-another-day-with-windows-ly y'rs  - tim





From mhammond at skippinet.com.au  Mon May 15 02:28:05 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon, 15 May 2000 10:28:05 +1000
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000001bfbdff$5bcdfe40$192d153f@tim>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEOHCKAA.mhammond@skippinet.com.au>

This is definitely weird!  As you only mentioned Win9x, I thought I would
give it a go on Win2k.

This is from a CVS update of only a few days ago, but it is a non-debug
build.  PII266 with 196MB ram:

         1 push       0.001
         2 push       0.000
         4 push       0.000
         8 push       0.000
        16 push       0.000
        32 push       0.000
        64 push       0.000
       128 push       0.001
       256 push       0.001
       512 push       0.003
      1024 push       0.006
      2048 push       0.011
      4096 push       0.040
      8192 push       0.043
     16384 push       0.103
     32768 push       0.203
     65536 push       0.583

Things are looking OK to here - the behaviour Tim expected.  But then
things seem to start going a little wrong:

    131072 push       1.456
    262144 push       4.763
    524288 push      16.119
   1048576 push      60.765

All of a sudden we seem to hit N*N behaviour?

I gave up waiting for the next one.  Performance monitor was showing CPU at
100%, but the Python process was only sitting on around 15MB of RAM (and
growing _very_ slowly - at the rate you would expect).  Machine had tons of
ram showing as available, and the disk was not thrashing - ie, Windows
definitely had lots of mem available, and I have no reason to believe that
a malloc() would fail here - but certainly no one would ever want to wait
and see :-)

This was all definitely built with MSVC6, SP3.

no-room-should-ever-have-more-than-one-windows-ly y'rs

Mark.




From gstein at lyra.org  Mon May 15 06:08:33 2000
From: gstein at lyra.org (Greg Stein)
Date: Sun, 14 May 2000 21:08:33 -0700 (PDT)
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <001901bfbcef$d4672b80$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005142108070.28031-100000@nebula.lyra.org>

On Sat, 13 May 2000, Fredrik Lundh wrote:
> Fred L. Drake, Jr. wrote:
> > Fredrik Lundh writes:
> >  > what's the best way to make sure that a "cvs update" really brings
> >  > everything up to date, even if you've accidentally changed some-
> >  > thing in your local workspace?
> > 
> >   Delete the file(s) that got changed and cvs update again.
> 
> okay, what's the best way to get a list of locally changed files?

I use the following:

% cvs stat | fgrep Local


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From tim_one at email.msn.com  Mon May 15 09:34:39 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 15 May 2000 03:34:39 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBOEOHCKAA.mhammond@skippinet.com.au>
Message-ID: <000001bfbe40$07f14520$b82d153f@tim>

[Mark Hammond]
> This is definitely weird!  As you only mentioned Win9x, I thought I would
> give it a go on Win2k.

Thanks, Mark!  I've only got W9X machines at home.

> This is from a CVS update of only a few days ago, but it is a non-debug
> build.  PII266 with 196MB ram:
>
>          1 push       0.001
>          2 push       0.000
>          4 push       0.000
>          8 push       0.000
>         16 push       0.000
>         32 push       0.000
>         64 push       0.000
>        128 push       0.001
>        256 push       0.001
>        512 push       0.003
>       1024 push       0.006
>       2048 push       0.011
>       4096 push       0.040
>       8192 push       0.043
>      16384 push       0.103
>      32768 push       0.203
>      65536 push       0.583
>
> Things are looking OK to here - the behaviour Tim expected.  But then
> things seem to start going a little wrong:
>
>     131072 push       1.456
>     262144 push       4.763
>     524288 push      16.119
>    1048576 push      60.765

So that acts like my Win95 (which I didn't show), and somewhat like my 2nd &
3rd Win98 runs.

> All of a sudden we seem to hit N*N behaviour?

*That* part really isn't too surprising.  Python "overallocates", but by a
fixed amount independent of the current size.  This leads to quadratic-time
behavior "in theory" once a vector gets large enough.  Guido's cultural myth
for why that theory shouldn't matter is that if you keep appending to the
same vector, the OS will eventually move it to the end of the address space,
whereupon further growth simply boosts the VM high-water mark without
actually moving anything.  I call that "a cultural myth" because some
flavors of Unix did used to work that way, and some may still -- I doubt
it's ever been a valid argument under Windows, though. (you, of all people,
know how much Python's internal strategies were informed by machines nobody
uses <wink>).
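
(For the record, here's the arithmetic behind that theory -- a pure-Python
toy that just counts element copies, pretending realloc always moves the
block:)

def copies(n, grow):
    # append n items one at a time; count how many existing elements
    # get copied whenever the vector has to be grown
    size = cap = copied = 0
    for i in xrange(n):
        if size == cap:
            cap = grow(cap)
            copied = copied + size
        size = size + 1
    return copied

print copies(100000, lambda cap: cap + 100)    # fixed increment: ~50M copies
print copies(100000, lambda cap: 2 * cap + 8)  # doubling: ~130K copies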

So I was more surprised up to this point by the supernatural linearity of my
first W98 run (which is reproducible, btw).  But my 2nd & 3rd W98 runs (also
reproducible), and unlike your W2K run, show *worse* than quadratic
behavior.

> I gave up waiting for the next one.

Under both W98 and W95, the next one does eventually hit the MemoryError for
me, but it does take a long time.  If I thought it would help, I'd measure
it.  And *this* one is surprising, because, as you say:

> Performance monitor was showing CPU at 100%, but the Python process
> was only sitting on around 15MB of RAM (and growing _very_ slowly -
> at the rate you would expect).  Machine had tons of ram showing as
> available, and the disk was not thrashing - ie, Windows definitely
> had lots of mem available, and I have no reason to believe that
> a malloc() would fail here - but certainly no one would ever want to wait
> and see :-)

How long did you wait?  If less than 10 minutes, perhaps not long enough.  I
certainly didn't expect a NULL return either, even on my tiny machine, and
certainly not on the box with 20x more RAM than the list needs.

> This was all definitely built with MSVC6, SP3.

Again good to know.  I'll chew on this, but don't expect a revelation soon.

> no-room-should-ever-have-more-than-one-windows-ly y'rs

Hmm.  I *did* run these in different rooms <wink>.

no-accounting-for-windows-ly y'rs  - tim





From tim_one at email.msn.com  Mon May 15 09:34:51 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 15 May 2000 03:34:51 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <391DCDB5.4FCAB97F@tismer.com>
Message-ID: <000301bfbe40$0e2a49a0$b82d153f@tim>

[Christian Tismer]
> ...
> After all, it is no surprise. They are right.
> If we have to change their mind in order to understand
> a basic operation, then we are wrong, not they.

Huh!  I would not have guessed that you'd give up on Stackless that easily
<wink>.

> ...
> Making it a method of the joining string now appears to be
> a hack to me. (Sorry, Tim, the idea was great in the first place)

Just the opposite here:  it looked like a hack the first time I thought of
it, but has gotten more charming with each use.  space.join(sequence) is so
pretty it aches.

redefining-truth-all-over-the-place-ly y'rs  - tim





From gward at mems-exchange.org  Mon May 15 15:30:54 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Mon, 15 May 2000 09:30:54 -0400
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <000d01bfbce8$a3466f40$34aab5d4@hagrid>; from effbot@telia.com on Sat, May 13, 2000 at 04:36:30PM +0200
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
Message-ID: <20000515093053.A5765@mems-exchange.org>

On 13 May 2000, Fredrik Lundh said:
> what's the best way to make sure that a "cvs update" really brings
> everything up to date, even if you've accidentally changed some-
> thing in your local workspace?

Try the attached script -- it's basically the same as Greg Stein's "cvs
status | grep Local", but beefed-up and overkilled.

Example:

  $ cvstatus -l
  .cvsignore                     Up-to-date        2000-05-02 14:31:04
  Makefile.in                    Locally Modified  2000-05-12 12:25:39
  README                         Up-to-date        2000-05-12 12:34:42
  acconfig.h                     Up-to-date        2000-05-12 12:25:40
  config.h.in                    Up-to-date        2000-05-12 12:25:40
  configure                      Up-to-date        2000-05-12 12:25:40
  configure.in                   Up-to-date        2000-05-12 12:25:40
  install-sh                     Up-to-date        1998-08-13 12:08:45

...so yeah, it generates a lot of output when run on a large working
tree, eg. Python's.  But not as much as "cvs status" on its own.  ;-)

        Greg

PS. I just noticed it uses the "#!/usr/bin/env" hack with a command-line
option for the interpreter, which doesn't work on Linux.  ;-(  You may
have to hack the shebang line to make it work.

-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367
-------------- next part --------------
#!/usr/bin/env perl -w

#
# cvstatus
#
# runs "cvs status" (with optional file arguments), filtering out
# uninteresting stuff and putting in the last-modification time
# of each file.
#
# Usage: cvstatus [files]
#
# GPW 1999/02/17
#
# $Id: cvstatus,v 1.4 2000/04/14 14:56:14 gward Exp $
#

use strict;
use POSIX 'strftime';

my @files = @ARGV;

# Open a pipe to a forked child process
my $pid = open (CVS, "-|");
die "couldn't open pipe: $!\n" unless defined $pid;

# In the child -- run "cvs status" (with optional list of files
# from command line)
unless ($pid)
{
   open (STDERR, ">&STDOUT");           # merge stderr with stdout
   exec 'cvs', 'status', @files;
   die "couldn't exec cvs: $!\n";
}

# In the parent -- read "cvs status" output from the child
else
{
   my $dir = '';
   while (<CVS>)
   {
      my ($filename, $status, $mtime);
      if (/Examining (.*)/)
      {
         $dir = $1;
         if (! -d $dir)
         {
            warn "huh? no directory called $dir!";
            $dir = '';
         }
         elsif ($dir eq '.')
            { $dir = ''; }
         else
            { $dir .= '/' unless $dir =~ m|/$|; }
      }
      elsif (($filename, $status) = /^File: \s* (\S+) \s* Status: \s* (.*)/x)
      {
         $filename = $dir . $filename;
         if ($mtime = (stat $filename)[9])
         {
            $mtime = strftime ("%Y-%m-%d %H:%M:%S", localtime $mtime);
            printf "%-30.30s %-17s %s\n", $filename, $status, $mtime;
         }
         else
         {
            #warn "couldn't stat $filename: $!\n";
            printf "%-30.30s %-17s ???\n", $filename, $status;
         }
      }
   }

   close (CVS);
   warn "cvs failed\n" unless $? == 0;
}

From trentm at activestate.com  Mon May 15 23:09:58 2000
From: trentm at activestate.com (Trent Mick)
Date: Mon, 15 May 2000 14:09:58 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <006e01bfbd06$6ba21120$34aab5d4@hagrid>
References: <006e01bfbd06$6ba21120$34aab5d4@hagrid>
Message-ID: <20000515140958.C20418@activestate.com>

I broke it with my patches to test overflow for some of the PyArg_Parse*()
formatting characters. The upshot of testing for overflow is that now those
formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
unsigned-ness as appropriate (you have to know if the value is signed or
unsigned to know what limits to check against for overflow). Two
possibilities presented themselves:

1. Enforce 'b' as unsigned char (the common usage) and the rest as signed
values (short, int, and long). If you want a signed char or an unsigned
short, you have to work around it yourself.

2. Add formatting characters or modifiers for signed and unsigned versions of
all the integral types to PyArg_Parse*() in getargs.c.

Guido preferred the former because (my own interpretation of the reasons) it
covers the common case and keeps the clutter and feature creep down. It is
debatable whether or not we really need signed and unsigned for all of them.
See the following threads on python-dev and patches:
  make 'b' formatter an *unsigned* char
  issues with int/long on 64bit platforms - eg stringobject (PR#306) 
  make 'b','h','i' raise overflow exception
  
Possible code breakage is the drawback.
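
For illustration, here is a rough Python sketch of the kind of limit check
involved (this is just my paraphrase of the idea, not the getargs.c code; the
real bounds come from the platform's C types):

import struct

def c_limits(code):
    # Size (in bits) of the C type behind the format code, on this platform.
    bits = 8 * struct.calcsize(code)
    if code == 'b':
        # unsigned char -- the common usage
        return 0, (1L << bits) - 1
    # 'h', 'i', 'l' are treated as signed short/int/long
    return -(1L << (bits - 1)), (1L << (bits - 1)) - 1

def check_overflow(code, value):
    lo, hi = c_limits(code)
    if value < lo or value > hi:
        raise OverflowError("value out of range for format '%s'" % code)
    return value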


[Fredrik Lundh wrote]:
> sigh.  never resync the CVS repository until you've fixed all
> bugs in your *own* code ;-)

Sorry, I guess. The test suite did not catch this, so it was hard for me to
know that the bug had been introduced. My patches add tests for these cases
to the test suite.

> 
> in 1.5.2:
> 
> >>> array.array("h", [65535])
> array('h', [-1])
> 
> >>> array.array("H", [65535])
> array('H', [65535])
> 
> in the current CVS version:
> 
> >>> array.array("h", [65535])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: signed short integer is greater than maximum
> 
> okay, this might break some existing code -- but one
> can always argue that such code were already broken.

Yes.

> 
> on the other hand:
> 
> >>> array.array("H", [65535])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: signed short integer is greater than maximum
> 
> oops.
> 
oops. See my patch that fixes this for 'H', and 'b', and 'I', and 'L'.


> dunno if the right thing would be to add support for various kinds
> of unsigned integers to Python/getargs.c, or to hack around this
> in the array module...
> 
My patch does the latter and that would be my suggestion because:
(1) Guido didn't like the idea of adding more formatters to getargs.c (see
above)
(2) Adding support for unsigned and signed versions in getargs.c could be
confusing because the formatting characters could not be the same as in the
array module: 'L' is already used for LONG_LONG types in
PyArg_Parse*().
(3) KISS and the common case. Keep the number of formatters for
PyArg_Parse*() short and simple. I would presume that the common case user
does not really need the extra support.


Trent


-- 
Trent Mick
trentm at activestate.com



From mhammond at skippinet.com.au  Tue May 16 08:22:53 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 16 May 2000 16:22:53 +1000
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
Message-ID: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au>

For about the 1,000,000th time in my life (no exaggeration :-), I just
typed "python.exe foo" - I forgot the .py.

It would seem a simple and useful change to append a ".py" extension and
try again, instead of dying the first time around - i.e., all we would be
changing is that we continue to run where we previously failed.

Is there a good reason why we don't do this?

Mark.




From mal at lemburg.com  Tue May 16 00:07:53 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 00:07:53 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same 
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <391d5b7f.3713359@smtp.worldonline.dk>
Message-ID: <39207539.F1C14A25@lemburg.com>

Finn Bock wrote:
> 
> On Sat, 13 May 2000 14:56:41 +0200, you wrote:
> 
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a\rb', 'a.b') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
>
> >what's the best way to deal with this?  I see three alter-
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings
> 
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

+1 on that one... just like \s should use Py_UNICODE_ISSPACE()
and \d Py_UNICODE_ISDECIMAL().

BTW, how have you implemented the locale aware \w and \W
for Unicode ? Unicode doesn't have any locales, but quite a
lot more alphanumeric characters (or equivalents) and there
currently is no Py_UNICODE_ISALPHA() in the core.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Mon May 15 23:50:39 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 May 2000 23:50:39 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com>
		<8fe76b$684$1@newshost.accu.uu.nl>
		<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
		<8fh9ki$51h$1@slb3.atl.mindspring.net>
		<8fk4mh$i4$1@kopp.stud.ntnu.no>
		<391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us>
Message-ID: <3920712F.1FD0B910@lemburg.com>

"Barry A. Warsaw" wrote:
> 
> >>>>> "CT" == Christian Tismer <tismer at tismer.com> writes:
> 
>     CT> If it came to the point where the string module had some extra
>     CT> methods which operate on two lists of string perhaps, we would
>     CT> have been totally lost, and enforcing some OO method to
>     CT> support it would be completely off the road.
> 
> The new .join() method reads a bit better if you first name the
> glue string:
> 
> space = ' '
> name = space.join(['Barry', 'Aloisius', 'Warsaw'])
> 
> But yes, it does look odd when used like
> 
> ' '.join(['Christian', 'Aloisius', 'Tismer'])
> 
> I still think it's nice not to have to import string "just" to get the
> join functionality, but remember of course that string.join() isn't
> going away, so you can still use this if you like it better.

string.py is deprecated, AFAIK (not that it'll go away anytime
soon, but using string methods directly is really the better,
more readable and faster approach).
 
> Alternatively, there has been talk about moving join() into the
> built-ins, but I'm not sure if the semantics of that have been nailed
> down.

This is probably the way to go. Semantics should probably
be:

	join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)

and should work with any type providing addition or
concat slot methods.

Patches anyone ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May 16 10:21:46 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 10:21:46 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <3921051A.56C7B63E@lemburg.com>

"Martin v. Loewis" wrote:
> 
> > comments?  (for obvious reasons, I'm especially interested in comments
> > from people using non-ASCII characters on a daily basis...)
> 
> > nobody?
> 
> Hi Frederik,
> 
> I think the problem you try to see is not real. My guideline for using
> Unicode in Python 1.6 will be that people should be very careful to
> *not* mix byte strings and Unicode strings. If you are processing text
> data, obtained from a narrow-string source, you'll always have to make
> an explicit decision what the encoding is.

Right, that's the way to go :-)
 
> If you follow this guideline, I think the Unicode type of Python 1.6
> will work just fine.
> 
> If you use Unicode text *a lot*, you may find the need to combine them
> with plain byte text in a more convenient way. This is the time you
> should look at the implicit conversion stuff, and see which of the
> functionality is useful. You then don't need to memorize *all* the
> rules where implicit conversion would work - just the cases you care
> about.

It's better not to rely on the implicit conversions. These
are really only there to ease porting applications to Unicode
and perhaps make some existing APIs deal with Unicode without
even knowing about it -- of course this will not always work
and those places will need some extra porting effort to make
them useful w/r to Unicode. open() is one such candidate.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fredrik at pythonware.com  Tue May 16 11:30:54 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 May 2000 11:30:54 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>

Martin v. Loewis wrote:
> I think the problem you try to see is not real.

it is real.  I won't repeat the arguments one more time; please read
the W3C character model note and the python-dev archives, and read
up on the unicode support in Tcl and Perl.

> But then, it is not more difficult than tuples vs. lists

your examples always behave the same way, no matter what's in the
containers.  that's not true for MAL's design.

</F>




From guido at python.org  Tue May 16 12:03:07 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 16 May 2000 06:03:07 -0400
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
In-Reply-To: Your message of "Tue, 16 May 2000 16:22:53 +1000."
             <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au> 
References: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au> 
Message-ID: <200005161003.GAA12247@eric.cnri.reston.va.us>

> For about the 1,000,000th time in my life (no exaggeration :-), I just
> typed "python.exe foo" - I forgot the .py.
> 
> It would seem a simple and useful change to append a ".py" extension and
> try again, instead of dying the first time around - i.e., all we would be
> changing is that we continue to run where we previously failed.
> 
> Is there a good reason why we don't do this?

Just inertia, plus it's "not the Unix way".  I agree it's a good idea.
(I also found in user testing that IDLE definitely has to supply the
".py" when saving a module if the user didn't.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From skip at mojam.com  Tue May 16 16:52:59 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 16 May 2000 09:52:59 -0500 (CDT)
Subject: [Python-Dev] join() et al.
In-Reply-To: <3920712F.1FD0B910@lemburg.com>
References: <391A3FD4.25C87CB4@san.rr.com>
	<8fe76b$684$1@newshost.accu.uu.nl>
	<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
	<8fh9ki$51h$1@slb3.atl.mindspring.net>
	<8fk4mh$i4$1@kopp.stud.ntnu.no>
	<391DBBED.B252E597@tismer.com>
	<14621.52191.448037.799287@anthem.cnri.reston.va.us>
	<3920712F.1FD0B910@lemburg.com>
Message-ID: <14625.24779.329534.364663@beluga.mojam.com>

    >> Alternatively, there has been talk about moving join() into the
    >> built-ins, but I'm not sure if the semantics of that have been nailed
    >> down.

    Marc> This is probably the way to go. Semantics should probably
    Marc> be:

    Marc> 	join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)

    Marc> and should work with any type providing addition or concat slot
    Marc> methods.

Of course, while it will always yield what you ask for, it might not always
yield what you expect:

    >>> seq = [1,2,3]
    >>> sep = 5
    >>> reduce(lambda x,y: x + sep + y, seq)
    16

;-)

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From effbot at telia.com  Tue May 16 17:22:06 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 17:22:06 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com><8fe76b$684$1@newshost.accu.uu.nl><rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com><8fh9ki$51h$1@slb3.atl.mindspring.net><8fk4mh$i4$1@kopp.stud.ntnu.no><391DBBED.B252E597@tismer.com><14621.52191.448037.799287@anthem.cnri.reston.va.us><3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com>
Message-ID: <000d01bfbf4a$85321400$34aab5d4@hagrid>

>     Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
>
> Of course, while it will always yield what you ask for, it might not always
> yield what you expect:
> 
>     >>> seq = [1,2,3]
>     >>> sep = 5
>     >>> reduce(lambda x,y: x + sep + y, seq)
>     16

not to mention:

>>> print join([], " ")
TypeError: reduce of empty sequence with no initial value

...

</F>




From mal at lemburg.com  Tue May 16 19:15:05 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 May 2000 19:15:05 +0200
Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com><8fe76b$684$1@newshost.accu.uu.nl><rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com><8fh9ki$51h$1@slb3.atl.mindspring.net><8fk4mh$i4$1@kopp.stud.ntnu.no><391DBBED.B252E597@tismer.com><14621.52191.448037.799287@anthem.cnri.reston.va.us><3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com> <000d01bfbf4a$85321400$34aab5d4@hagrid>
Message-ID: <39218219.9E8115E2@lemburg.com>

Fredrik Lundh wrote:
> 
> >     Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
> >
> > Of course, while it will always yield what you ask for, it might not always
> > yield what you expect:
> >
> >     >>> seq = [1,2,3]
> >     >>> sep = 5
> >     >>> reduce(lambda x,y: x + sep + y, seq)
> >     16
> 
> not to mention:
> 
> >>> print join([], " ")
> TypeError: reduce of empty sequence with no initial value

Ok, here's a more readable and semantically useful definition:

def join(sequence,sep=''):

    # Special case: empty sequence
    if len(sequence) == 0:
        try:
            return 0*sep
        except TypeError:
            return sep[0:0]
        
    # Normal case
    x = None
    for y in sequence:
        if x is None:
            x = y
        elif sep:
            x = x + sep + y
        else:
            x = x + y
    return x

Examples:

>>> join((1,2,3))
6

>>> join(((1,2),(3,4)),('x',))
(1, 2, 'x', 3, 4)

>>> join(('a','b','c'), ' ')
'a b c'

>>> join(())
''

>>> join((),())
()

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From paul at prescod.net  Tue May 16 19:58:33 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 12:58:33 -0500
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
Message-ID: <39218C49.C66FEEDE@prescod.net>

"Martin v. Loewis" wrote:
> 
> ...
>
> I think the problem you try to see is not real. My guideline for using
> Unicode in Python 1.6 will be that people should be very careful to
> *not* mix byte strings and Unicode strings. 

I think that as soon as we are adding admonitions to documentation that
things "probably don't behave as you expect, so be careful", we have
failed. Sometimes failure is unavoidable (e.g. floats do not act
rationally -- deal with it). But let's not pretend that failure is
success.

> If you are processing text
> data, obtained from a narrow-string source, you'll always have to make
> an explicit decision what the encoding is.

Are Python literals a "narrow string source"? It seems blatantly clear
to me that the "encoding" of Python literals should be determined at
compile time, not runtime. Byte arrays from a file are different. 

> If you use Unicode text *a lot*, you may find the need to combine them
> with plain byte text in a more convenient way. 

Unfortunately there will be many people with no interest in Unicode
who will be dealing with it merely because that is the way APIs are
going: XML APIs, Windows APIs, Tk, DCOM, SOAP, WebDAV, even some X/Unix
APIs. Unicode is the new ASCII.

I want to get a (Unicode) string from an XML document or SOAP request,
compare it to a string literal and never think about Unicode.

> ...
> why does
> 
> >>> [a,b,c] = (1,2,3)
> 
> work, and
> 
> >>> [1,2]+(3,4)
> ...
> 
> does not?

I dunno. If there is no good reason then it is a bug that should be
fixed. The __radd__ operator on lists should iterate over its argument
as a sequence.

As Fredrik points out, though, this situation is not as dangerous as
auto-conversions because

 a) the restriction could be loosened later without breaking code

 b) the operation always fails: it never does the wrong thing silently,
and it never succeeds for some inputs while failing for others.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)



From skip at mojam.com  Tue May 16 20:15:40 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 16 May 2000 13:15:40 -0500 (CDT)
Subject: [Python-Dev] join() et al.
In-Reply-To: <39218219.9E8115E2@lemburg.com>
References: <391A3FD4.25C87CB4@san.rr.com>
	<8fe76b$684$1@newshost.accu.uu.nl>
	<rhgmhs07ulsob3pptd6eh4f2ag4qj911bj@4ax.com>
	<8fh9ki$51h$1@slb3.atl.mindspring.net>
	<8fk4mh$i4$1@kopp.stud.ntnu.no>
	<391DBBED.B252E597@tismer.com>
	<14621.52191.448037.799287@anthem.cnri.reston.va.us>
	<3920712F.1FD0B910@lemburg.com>
	<14625.24779.329534.364663@beluga.mojam.com>
	<000d01bfbf4a$85321400$34aab5d4@hagrid>
	<39218219.9E8115E2@lemburg.com>
Message-ID: <14625.36940.160373.900909@beluga.mojam.com>

    Marc> Ok, here's a more readable and semantically useful definition:
    ...

    >>> join((1,2,3))
    6

My point was that the verb "join" doesn't connote "sum".  The idea of
"join"ing a sequence suggests (to me) that the individual sequence elements
are still identifiable in the result, so "join((1,2,3))" would look
something like "123" or "1 2 3" or "10203", not "6".

It's not a huge deal to me, but I think it mildly violates the principle of
least surprise when you try to apply it to sequences of non-strings.

To extend this into the absurd, what should the following code display?

    class Spam: pass

    eggs = Spam()
    bacon = Spam()
    toast = Spam()

    print join((eggs,bacon,toast))

If a join builtin is supposed to be applicable to all types, we need to
decide what the semantics are going to be for all types.  Maybe all that
needs to happen is that you stringify any non-string elements before
applying the + operator (just one possibility among many, not necessarily
one I recommend).  If you want to limit join's inputs to (or only make it
semantically meaningful for) sequences of strings, then it should probably
not be a builtin, no matter how visually annoying you find

    " ".join(["a","b","c"])

Skip



From effbot at telia.com  Tue May 16 20:26:10 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 20:26:10 +0200
Subject: [Python-Dev] homer-dev, anyone?
Message-ID: <009d01bfbf64$b779a260$34aab5d4@hagrid>

http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40

</F>




From martin at loewis.home.cs.tu-berlin.de  Tue May 16 20:43:34 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 16 May 2000 20:43:34 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
	(fredrik@pythonware.com)
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
Message-ID: <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>

> it is real.  I won't repeat the arguments one more time; please read
> the W3C character model note and the python-dev archives, and read
> up on the unicode support in Tcl and Perl.

I did read all that, so there really is no point in repeating the
arguments - yet I'm still not convinced. One of the causes may be that
all your commentary either

- discusses an alternative solution to the existing one, merely
  pointing out the difference, without any strong selling point
- explains small examples that work counter-intuitively

I'd like to know whether you have an example of a real-world
big-application problem that could not be conveniently implemented
using the new Unicode API. For all the examples I can think of where
Unicode would matter (XML processing, CORBA wstring mapping,
internationalized messages and GUIs), it would work just fine.

So while it may not be perfect, I think it is good enough. Perhaps my
problem is that I'm not a perfectionist :-)

However, one remark from http://www.w3.org/TR/charmod/ reminded me of
an earlier proposal by Bill Janssen. The Character Model says

# Because encoded text cannot be interpreted and processed without
# knowing the encoding, it is vitally important that the character
# encoding is known at all times and places where text is exchanged or
# stored.

While they were considering document encodings, I think this applies
in general. Bill Janssen's proposal was that each (narrow) string
should have an attribute .encoding. If set, you'll know what encoding
a string has. If not set, it is a byte string, subject to the default
encoding. I'd still like to see that as a feature in Python.
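
A minimal sketch of how I read that proposal (the class and its methods below
are made up for illustration, not an actual implementation):

class TaggedString:
    def __init__(self, data, encoding=None):
        self.data = data          # the raw bytes
        self.encoding = encoding  # None means "plain bytes, default applies"

    def to_unicode(self, default="ascii"):
        # Decode using the attached encoding, or the default if none is set.
        return unicode(self.data, self.encoding or default)

s = TaggedString("M\xfcller", "latin-1")
u = s.to_unicode()                # decoded with the attached encoding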

Regards,
Martin



From paul at prescod.net  Tue May 16 20:49:46 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 13:49:46 -0500
Subject: [Python-Dev] homer-dev, anyone?
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>
Message-ID: <3921984A.8CDE8E1D@prescod.net>

I hope that if Python were renamed we would not choose yet another name
which turns up hundreds of false hits in web engines. Perhaps Homr or
Home_r. Or maybe Pythahn.

Fredrik Lundh wrote:
> 
> http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
> 
> </F>
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)



From tismer at tismer.com  Tue May 16 21:01:21 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:01:21 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
Message-ID: <39219B01.A4EE0920@tismer.com>


Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > After all, it is no surprize. They are right.
> > If we have to change their mind in order to understand
> > a basic operation, then we are wrong, not they.
> 
> Huh!  I would not have guessed that you'd give up on Stackless that easily
> <wink>.

Noh, I didn't give up Stackless, but fishing for soles.
After Just v. R. has become my most ambitious user,
I'm happy enough.

(Again, better don't take me too serious :)

> > ...
> > Making it a method of the joining string now appears to be
> > a hack to me. (Sorry, Tim, the idea was great in the first place)
> 
> Just the opposite here:  it looked like a hack the first time I thought of
> it, but has gotten more charming with each use.  space.join(sequence) is so
> pretty it aches.

It is absolutely fantastic.
The most uninteresting stuff in the join is the separator,
and it has the power to merge thousands of strings
together, without asking the sequence at all
 - give all power to the suppressed, long live the Python anarchy :-)

We now just have to convince the user no longer to think
of *what* to join in the first place, but how.

> redefining-truth-all-over-the-place-ly y'rs  - tim

" "-is-small-but-sooo-strong---lets-elect-new-users - ly y'rs - chris

p.s.: no this is *no* offense, just kidding.

" ".join(":-)", ":^)", "<wink> ") * 42

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From tismer at tismer.com  Tue May 16 21:10:42 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:10:42 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim> <39219B01.A4EE0920@tismer.com>
Message-ID: <39219D32.BD82DE83@tismer.com>

Oh, while we are at it...

Christian Tismer wrote:
> " ".join(":-)", ":^)", "<wink> ") * 42

is actually wrong, since it needs a sequence, not just
the arg tuple. Wouldn't it make sense to allow this?
Exactly the opposite of list.append(), since in this
case we are just expecting strings?

While I have to say that

>>> " ".join("123")
'1 2 3'
>>> 

is not a feature to me but just annoying ;-)

ciao again - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From effbot at telia.com  Tue May 16 21:30:49 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 16 May 2000 21:30:49 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
Message-ID: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>

Martin v. Loewis wrote:
> > it is real.  I won't repeat the arguments one more time; please read
> > the W3C character model note and the python-dev archives, and read
> > up on the unicode support in Tcl and Perl.
> 
> I did read all that, so there really is no point in repeating the
> arguments - yet I'm still not convinced. One of the causes may be that
> all your commentary either
> 
> - discusses an alternative solution to the existing one, merely
>   pointing out the difference, without any strong selling point
> - explains small examples that work counter-intuitively

umm.  I could have sworn that getting rid of counter-intuitive
behaviour was rather important in python.  maybe we're using
the language in radically different ways?

> I'd like to know whether you have an example of a real-world
> big-application problem that could not be conveniently implemented
> using the new Unicode API. For all the examples I can think where
> Unicode would matter (XML processing, CORBA wstring mapping,
> internationalized messages and GUIs), it would work just fine.

of course I can kludge my way around the flaws in MAL's design,
but why should I have to do that? it's broken. fixing it is easy.

> Perhaps my problem is that I'm not a perfectionist :-)

perfectionist or not, I only want Python's Unicode support to
be as intuitive as anything else in Python.  as it stands right
now, Perl and Tcl's Unicode support is intuitive.  Python's not.

(it also backs us into a corner -- once you mess this one up,
you cannot fix it in Py3K without breaking lots of code.  that's
really bad).

in contrast, Guido's compromise proposal allows us to do this
the right way in 1.7/Py3K (i.e. teach python about source code
encodings, system api encodings, and stream i/o encodings).

btw, I thought we'd all agreed on GvR's solution for 1.6?

what did I miss?

> So while it may not be perfect, I think it is good enough.

so tell me, if "good enough" is what we're aiming at, why isn't
my counter-proposal good enough?

if not else, it's much easier to document...

</F>




From skip at mojam.com  Tue May 16 21:30:08 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 16 May 2000 14:30:08 -0500 (CDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
In-Reply-To: <39219D32.BD82DE83@tismer.com>
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
	<39219B01.A4EE0920@tismer.com>
	<39219D32.BD82DE83@tismer.com>
Message-ID: <14625.41408.423282.529732@beluga.mojam.com>

    Christian> While I have to say that

    >>>> " ".join("123")
    Christian> '1 2 3'
    >>>> 

    Christian> is not a feature to me but just annoying ;-)

More annoying than

    >>> import string
    >>> string.join("123")
    '1 2 3'

? ;-)

a-sequence-is-a-sequence-ly y'rs,

Skip



From tismer at tismer.com  Tue May 16 21:43:33 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 16 May 2000 21:43:33 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfbe40$0e2a49a0$b82d153f@tim>
		<39219B01.A4EE0920@tismer.com>
		<39219D32.BD82DE83@tismer.com> <14625.41408.423282.529732@beluga.mojam.com>
Message-ID: <3921A4E5.9BDEBF49@tismer.com>


Skip Montanaro wrote:
> 
>     Christian> While I have to say that
> 
>     >>>> " ".join("123")
>     Christian> '1 2 3'
>     >>>>
> 
>     Christian> is not a feature to me but just annoying ;-)
> 
> More annoying than
> 
>     >>> import string
>     >>> string.join("123")
>     '1 2 3'
> 
> ? ;-)

You are right. Equally bad, just in different flavor.
*gulp* this is going to be a can of worms since...

> a-sequence-is-a-sequence-ly y'rs,

Then a string should better not be a sequence.

The number of places where I really used the string sequence
protocol to advantage is outnumbered by a factor of ten
by the cases where I forgot to tupleize and got a bad
result. A traceback is better than a sequence here.

oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris

p.s.: the Spanish Inquisition can't get me since I'm in Russia
until Sunday - omsk

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From guido at python.org  Tue May 16 21:49:17 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 16 May 2000 15:49:17 -0400
Subject: [Python-Dev] Unicode
In-Reply-To: Your message of "Tue, 16 May 2000 21:30:49 +0200."
             <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> 
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>  
            <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> 
Message-ID: <200005161949.PAA16607@eric.cnri.reston.va.us>

> in contrast, Guido's compromise proposal allows us to do this
> the right way in 1.7/Py3K (i.e. teach python about source code
> encodings, system api encodings, and stream i/o encodings).
> 
> btw, I thought we'd all agreed on GvR's solution for 1.6?
> 
> what did I miss?

Nothing.  We are going to do that (my "ASCII" proposal).  I'm just
waiting for the final SRE code first.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Tue May 16 22:01:46 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 16 May 2000 16:01:46 -0400
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: Your message of "Tue, 16 May 2000 13:49:46 CDT."
             <3921984A.8CDE8E1D@prescod.net> 
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>  
            <3921984A.8CDE8E1D@prescod.net> 
Message-ID: <200005162001.QAA16657@eric.cnri.reston.va.us>

> I hope that if Python were renamed we would not choose yet another name
> which turns up hundreds of false hits in web engines. Perhaps Homr or
> Home_r. Or maybe Pythahn.

Actually, I'd like to call the next version Throatwobbler Mangrove.
But you'd have to pronounce it Raymond Luxyry Yach-t.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From akuchlin at mems-exchange.org  Tue May 16 22:10:22 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 May 2000 16:10:22 -0400 (EDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
	<005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
	<200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
	<00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <14625.43822.773966.59550@amarok.cnri.reston.va.us>

Fredrik Lundh writes:
>perfectionist or not, I only want Python's Unicode support to
>be as intuitive as anything else in Python.  as it stands right
>now, Perl and Tcl's Unicode support is intuitive.  Python's not.

I don't know about Tcl, but Perl 5.6's Unicode support is still
considered experimental.  Consider the following excerpts, for
example.  (And Fredrik's right; we shouldn't release a 1.6 with broken
support, or we'll pay for it for *years*...  But if GvR's ASCII
proposal is considered OK, then great!)

========================
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-04/msg00084.html:

>Ah, yes. Unicode. But after two years of work, the one thing that users
>will want to do - open and read Unicode data - is still not there.
>Who cares if stuff's now represented internally in Unicode if they can't
>read the files they need to.

This is a "big" (as in "huge") disappointment for me as well.  I hope
we'll do better next time.

========================
http://www.egroups.com/message/perl5-porters/67906:
But given that interpretation, I'm amazed at how many operators seem
to be broken with UTF8.    It certainly supports Ilya's contention of
"pre-alpha".

Here's another example:
 
  DB<1> x (256.255.254 . 257.258.259) eq (256.255.254.257.258.259)
0  ''
  DB<2>

Rummaging with Devel::Peek shows that in this case, it's the fault of
the . operator.

And eq is broken as well:

  DB<11> x "\x{100}" eq "\xc4\x80"
0  1
  DB<12>

Aaaaargh!

========================
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-03/msg00971.html:

A couple problems here...passage through a hash key removes the UTF8
flag (as might be expected).  Even if keys were to attempt to restore
the UTF8 flag (ala Convert::UTF::decode_utf8) or hash keys were real
SVs, what then do you do with $h{"\304\254"} and the like?

Suggestions:

1. Leave things as they are, but document UTF8 hash keys as experimental
and subject to change.

or 2. When under use bytes, leave things as they are.  Otherwise, have
keys turn on the utf8 flag if appropriate.  Also give a warning when
using a hash key like "\304\254" since keys will in effect return a
different string that just happens to have the same internal encoding.

========================

 



From paul at prescod.net  Tue May 16 22:36:42 2000
From: paul at prescod.net (Paul Prescod)
Date: Tue, 16 May 2000 15:36:42 -0500
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
Message-ID: <3921B15A.73EF6355@prescod.net>

"Martin v. Loewis" wrote:
> 
> ...
> 
> I'd like to know whether you have an example of a real-world
> big-application problem that could not be conveniently implemented
> using the new Unicode API. For all the examples I can think where
> Unicode would matter (XML processing, CORBA wstring mapping,
> internationalized messages and GUIs), it would work just fine.

Of course an implicit behavior can never get in the way of
big-application building. The question is about principle of least
surprise, and simplicity of explanation and understanding.

 I'm-told-that-even-Perl-and-C++-can-be-used-for-big-apps -ly yrs

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)



From martin at loewis.home.cs.tu-berlin.de  Wed May 17 00:02:10 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 17 May 2000 00:02:10 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> (effbot@telia.com)
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <200005162202.AAA02125@loewis.home.cs.tu-berlin.de>

> perfectionist or not, I only want Python's Unicode support to
> be as intuitive as anything else in Python.  as it stands right
> now, Perl and Tcl's Unicode support is intuitive.  Python's not.

I haven't much experience with Perl, but I don't think Tcl is
intuitive in this area. I really think that they got it all wrong.
They use the string type for "plain bytes", just as we do, but then
have the notion of "correct" and "incorrect" UTF-8 (i.e. strings with
violations of the encoding rule). For a "plain bytes" string, the
following might happen

- the string is scanned for non-UTF-8 characters
- if any are found, the string is converted into UTF-8, essentially
  treating the original string as Latin-1.
- it then continues to use the UTF-8 "version" of the original string,
  and converts it back on demand.

Maybe I got something wrong, but the Unicode support in Tcl makes me
worry very much.

> btw, I thought we'd all agreed on GvR's solution for 1.6?
> 
> what did I miss?

I like the 'only ASCII is converted' approach very much, so I'm not
objecting to that solution - just as I wasn't objecting to the
previous one.

> so tell me, if "good enough" is what we're aiming at, why isn't
> my counter-proposal good enough?

Do you mean the one in

http://www.python.org/pipermail/python-dev/2000-April/005218.html

which I suppose is the same one as the "java-like approach"? AFAICT,
all it does is to change the default encoding from UTF-8 to Latin-1.
I can't follow why this should be *better*, but it would certainly be
as good... In comparison, restricting the "character" interpretation
of the string type (in terms of your proposal) to 7-bit characters
has the advantage that it is less error-prone, as Guido points out.

Regards,
Martin



From mal at lemburg.com  Wed May 17 00:59:45 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 May 2000 00:59:45 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
Message-ID: <3921D2E1.6282AA8F@lemburg.com>

Fredrik Lundh wrote:
> 
> of course I can kludge my way around the flaws in MAL's design,
> but why should I have to do that? it's broken. fixing it is easy.

Look Fredrik, it's not *my* design. All this was discussed in
public and in several rounds late last year. If someone made
a mistake and "broke" anything, then we all did... I still
don't think so, but that's my personal opinion.

--

Now to get back to some non-flammable content: 

Has anyone played around with the latest sys.set_string_encoding()
patches ? I would really like to know what you think.

The idea behind it is that you can define which encoding the Unicode
implementation should assume when it sees an
8-bit string. The encoding is used for coercion, str(unicode)
and printing. It is currently *not* used for the "s"
parser marker and hash values (mainly due to internal issues).

See my patch comments for details.
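
As a rough usage sketch (the exact signature and behaviour are whatever the
patch defines; this only illustrates the intent described above):

import sys
sys.set_string_encoding("latin-1")   # declare how 8-bit strings are encoded

s = "h\xe9llo"                       # an 8-bit Latin-1 string
u = u"caf\xe9 " + s                  # coercion now decodes s as Latin-1
print u                              # printing uses the declared encoding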

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From tim_one at email.msn.com  Wed May 17 08:45:59 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 May 2000 02:45:59 -0400
Subject: [Python-Dev] join() et al.
In-Reply-To: <14625.36940.160373.900909@beluga.mojam.com>
Message-ID: <000701bfbfcb$8f6cc600$b52d153f@tim>

[Skip Montanaro]
> ...
> It's not a huge deal to me, but I think it mildly violates the
> principle of least surprise when you try to apply it to sequences
> of non-strings.

When sep.join(seq) was first discussed, half the debate was whether str()
should be magically applied to seq's elements.  I still favor doing that, as
I have often explained the TypeError in e.g.

    string.join(some_mixed_list_of_strings_and_numbers)

to people and agree with their next complaint:  their intent was obvious,
since string.join *produces* a string.  I've never seen an instance of this
error that was appreciated (i.e., it never exposed an error in program logic
or concept, it's just an anal gripe about an arbitrary and unnatural
restriction).  Not at all like

    "42" + 42

where the intent is unknowable.

> To extend this into the absurd, what should the following code display?
>
>     class Spam: pass
>
>     eggs = Spam()
>     bacon = Spam()
>     toast = Spam()
>
>     print join((eggs,bacon,toast))

Note that we killed the idea of a new builtin join last time around.  It's
the kind of muddy & gratuitous hypergeneralization Guido will veto if we
don't kill it ourselves.  That said,

    space.join((eggs, bacon, toast))

should <wink> produce

    str(egg) + space + str(bacon) + space + str(toast)

although how Unicode should fit into all this was never clear to me.

> If a join builtin is supposed to be applicable to all types, we need to
> decide what the semantics are going to be for all types.

See above.

> Maybe all that needs to happen is that you stringify any non-string
> elements before applying the + operator (just one possibility among
> many, not necessarily one I recommend).

In my experience, that it *doesn't* do that today is a common source of
surprise & mild irritation.  But I insist that "stringify" return a string
in this context, and that "+" is simply shorthand for "string catenation".
Generalizing this would be counterproductive.

> If you want to limit join's inputs to (or only make it semantically
> meaningful for) sequences of strings, then it should probably
> not be a builtin, no matter how visually annoying you find
>
>     " ".join(["a","b","c"])

This is one of those "doctor, doctor, it hurts when I stick an onion up my
ass!" things <wink>.  space.join(etc) reads beautifully, and anyone who
doesn't spell it that way but hates the above is picking at a scab they
don't *want* to heal <0.3 wink>.

having-said-nothing-new-he-signs-off-ly y'rs  - tim





From tim_one at email.msn.com  Wed May 17 09:12:27 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 May 2000 03:12:27 -0400
Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
In-Reply-To: <ECEPKNMJLHAPFFJHDOJBEEPDCKAA.mhammond@skippinet.com.au>
Message-ID: <000801bfbfcf$424029e0$b52d153f@tim>

[Mark Hammond]
> For about the 1,000,000th time in my life (no exaggeration :-), I just
> typed "python.exe foo" - I forgot the .py.

Mark, is this an Australian thing?  That is, you must be the only person on
earth (besides a guy I know from New Zealand -- Australia, New Zealand, same
thing to American eyes <wink>) who puts ".exe" at the end of "python"!  I'm
speculating that you think backwards because you're upside-down down there.

throwing-another-extension-on-the-barbie-mate-ly y'rs  - tim





From effbot at telia.com  Wed May 17 09:36:03 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 17 May 2000 09:36:03 +0200
Subject: [Python-Dev] Unicode
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> <200005162202.AAA02125@loewis.home.cs.tu-berlin.de>
Message-ID: <004f01bfbfd3$0dd17a20$34aab5d4@hagrid>

Martin v. Loewis wrote:
> > perfectionist or not, I only want Python's Unicode support to
> > be as intuitive as anything else in Python.  as it stands right
> > now, Perl and Tcl's Unicode support is intuitive.  Python's not.
> 
> I haven't much experience with Perl, but I don't think Tcl is
> intuitive in this area. I really think that they got it all wrong.

"all wrong"?

Tcl works hard to maintain the "characters are characters" model
(implementation level 2), just like Perl.  the length of a string is
always the number of characters, slicing works as it should, the
internal representation is as efficient as you can make it.

but yes, they have a somewhat dubious autoconversion mechanism
in there.  if something isn't valid UTF-8, it's assumed to be Latin-1.

scary, huh?  not really, if you step back and look at how UTF-8 was
designed.  quoting from RFC 2279:

    "UTF-8 strings can be fairly reliably recognized as such by a
    simple algorithm, i.e. the probability that a string of characters
    in any other encoding appears as valid UTF-8 is low, diminishing
    with increasing string length."
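
(a toy sketch of that heuristic in python terms -- my illustration, not
tcl's actual code:)

def guess_decode(raw):
    try:
        return unicode(raw, "utf-8")    # valid UTF-8?  take it as such
    except UnicodeError:
        return unicode(raw, "latin-1")  # otherwise assume Latin-1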

besides, their design is based on the plan 9 rune stuff.  that code
was written by the inventors of UTF-8, who have this to say:

    "There is little a rune-oriented program can do when given bad
    data except exit, which is unreasonable, or carry on. Originally
    the conversion routines, described below, returned errors when
    given invalid UTF, but we found ourselves repeatedly checking
    for errors and ignoring them. We therefore decided to convert
    a bad sequence to a valid rune and continue processing.

    "This technique does have the unfortunate property that con-
    verting invalid UTF byte strings in and out of runes does not
    preserve the input, but this circumstance only occurs when
    non-textual input is given to a textual program."

so let's see: they aimed for a high level of unicode support (layer
2, stream encodings, and system api encodings, etc), they've based
their design on work by the inventors of UTF-8, they have several
years of experience using their implementation in real life, and you
seriously claim that they got it "all wrong"?

that's weird.

> AFAICT, all it does is to change the default encoding from UTF-8
> to Latin-1.

now you're using "all" in that strange way again...  check the archives
for the full story (hint: a conceptual design model isn't the same thing
as a C implementation)

> I can't follow why this should be *better*, but it would be certainly
> as good... In comparison, restricting the "character" interpretation
> of the string type (in terms of your proposal) to 7-bit characters
> has the advantage that it is less error-prone, as Guido points out.

the main reason for that is that Python 1.6 doesn't have any way to
specify source encodings.  add that, so you no longer have to guess
what a string *literal* really is, and that problem goes away.  but
that's something for 1.7.

</F>




From mal at lemburg.com  Wed May 17 10:56:19 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 May 2000 10:56:19 +0200
Subject: [Python-Dev] join() et al.
References: <000701bfbfcb$8f6cc600$b52d153f@tim>
Message-ID: <39225EB3.8D2C9A26@lemburg.com>

Tim Peters wrote:
> 
> [Skip Montanaro]
> > ...
> > It's not a huge deal to me, but I think it mildly violates the
> > principle of least surprise when you try to apply it to sequences
> > of non-strings.
> 
> When sep.join(seq) was first discussed, half the debate was whether str()
> should be magically applied to seq's elements.  I still favor doing that, as
> I have often explained the TypeError in e.g.
> 
>     string.join(some_mixed_list_of_strings_and_numbers)
> 
> to people and agree with their next complaint:  their intent was obvious,
> since string.join *produces* a string.  I've never seen an instance of this
> error that was appreciated (i.e., it never exposed an error in program logic
> or concept, it's just an anal gripe about an arbitrary and unnatural
> restriction).  Not at all like
> 
>     "42" + 42
> 
> where the intent is unknowable.

Uhm, aren't we discussing a generic sequence join API here ?

For strings, I think that " ".join(seq) is just fine... but it
would be nice to have similar functionality for other sequence
items as well, e.g. for sequences of sequences.
 
> > To extend this into the absurd, what should the following code display?
> >
> >     class Spam: pass
> >
> >     eggs = Spam()
> >     bacon = Spam()
> >     toast = Spam()
> >
> >     print join((eggs,bacon,toast))
> 
> Note that we killed the idea of a new builtin join last time around.  It's
> the kind of muddy & gratuitous hypergeneralization Guido will veto if we
> don't kill it ourselves.

We did ? (I must have been too busy hacking Unicode ;-)

Well, in that case I'd still be interested in hearing about
your thoughts so that I can integrate such a beast in mxTools.
The acceptance level needed for doing that is much lower than
for the core builtins ;-)

>  That said,
> 
>     space.join((eggs, bacon, toast))
> 
> should <wink> produce
> 
>     str(egg) + space + str(bacon) + space + str(toast)
> 
> although how Unicode should fit into all this was never clear to me.

But that would mask errors and, even worse, "work around" coercion,
which is not a good idea, IMHO. Note that the need to coerce to
Unicode was the reason why the implicit str() in " ".join() was
removed from Barry's original string methods implementation.

space.join(map(str,seq)) is much clearer in this respect: it
forces the user to think about what the join should do with non-
string types.

> > If a join builtin is supposed to be applicable to all types, we need to
> > decide what the semantics are going to be for all types.
> 
> See above.
> 
> > Maybe all that needs to happen is that you stringify any non-string
> > elements before applying the + operator (just one possibility among
> > many, not necessarily one I recommend).
> 
> In my experience, that it *doesn't* do that today is a common source of
> surprise & mild irritation.  But I insist that "stringify" return a string
> in this context, and that "+" is simply shorthand for "string catenation".
> Generalizing this would be counterproductive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Wed May 17 16:12:01 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Wed, 17 May 2000 07:12:01 -0700 (PDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <004f01bfbfd3$0dd17a20$34aab5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>

On Wed, 17 May 2000, Fredrik Lundh wrote:
 > the main reason for that is that Python 1.6 doesn't have any way to
 > specify source encodings.  add that, so you no longer have to guess
 > what a string *literal* really is, and that problem goes away.  but

  You seem to be familiar with the Tcl work, so I'll ask you
this question:  Does Tcl have a way to specify source encoding?
I'm not aware of it, but I've only had time to follow the Tcl
world very lightly these past few years.  ;)


  -Fred

--
Fred L. Drake, Jr.  <fdrake at acm.org>




From effbot at telia.com  Wed May 17 16:29:32 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 17 May 2000 16:29:32 +0200
Subject: [Python-Dev] Unicode
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
Message-ID: <018101bfc00c$52be3180$34aab5d4@hagrid>

Fred L. Drake wrote:
> On Wed, 17 May 2000, Fredrik Lundh wrote:
>  > the main reason for that is that Python 1.6 doesn't have any way to
>  > specify source encodings.  add that, so you no longer have to guess
>  > what a string *literal* really is, and that problem goes away.  but
> 
>   You seem to be familiar with the Tcl work, so I'll ask you
> this question:  Does Tcl have a way to specify source encoding?

Tcl has a system encoding (which is used when passing strings
through system APIs), and file/channel-specific encodings.

(for info on how they initialize the system encoding, see earlier
posts).

unfortunately, they're using the system encoding also for source
code.  for portable code, they recommend sticking to ASCII or
using "bootstrap scripts", e.g:

    set fd [open "app.tcl" r]
    fconfigure $fd -encoding euc-jp
    set jpscript [read $fd]
    close $fd
    eval $jpscript

we can surely do better in 1.7...
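
for comparison, a hypothetical python equivalent of that bootstrap trick
(just a sketch -- the codecs-based helper and "app.py" are made-up examples):

import codecs

def run_with_encoding(path, encoding):
    f = codecs.open(path, "r", encoding)   # decode while reading
    source = f.read()
    f.close()
    exec source in {}                      # run the decoded source

# run_with_encoding("app.py", "euc-jp")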

</F>




From jeremy at alum.mit.edu  Thu May 18 00:38:20 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Wed, 17 May 2000 15:38:20 -0700 (PDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <3921D2E1.6282AA8F@lemburg.com>
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de>
	<005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com>
	<200005161843.UAA01118@loewis.home.cs.tu-berlin.de>
	<00ed01bfbf6d$41c2f720$34aab5d4@hagrid>
	<3921D2E1.6282AA8F@lemburg.com>
Message-ID: <14627.8028.887219.978041@localhost.localdomain>

>>>>> "MAL" == M -A Lemburg <mal at lemburg.com> writes:

  MAL> Fredrik Lundh wrote:
  >>  of course I can kludge my way around the flaws in MAL's design,
  >> but why should I have to do that? it's broken. fixing it is easy.

  MAL> Look Fredrik, it's not *my* design. All this was discussed in
  MAL> public and in several rounds late last year. If someone made a
  MAL> mistake and "broke" anything, then we all did... I still don't
  MAL> think so, but that's my personal opinion.

I find it's best to avoid referring to a design as "so-and-so's design"
unless you've got something specifically complimentary to say.  Using
the person's name in combination with some criticism of the design
tends to produce a defensive reaction.  Perhaps it would help make
this discussion less contentious.

Jeremy




From martin at loewis.home.cs.tu-berlin.de  Thu May 18 00:55:21 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Thu, 18 May 2000 00:55:21 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
	(fdrake@acm.org)
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com>
Message-ID: <200005172255.AAA01245@loewis.home.cs.tu-berlin.de>

>   You seem to be familiar with the Tcl work, so I'll ask you
> this question:  Does Tcl have a way to specify source encoding?
> I'm not aware of it, but I've only had time to follow the Tcl
> world very lightly these past few years.  ;)

To my knowledge, no. Tcl (at least 8.3) supports the \u notation for
Unicode escapes, and treats all other source code as
Latin-1. encoding(n) says

# However, because the source command always reads files using the
# ISO8859-1 encoding, Tcl will treat each byte in the file as a
# separate character that maps to the 00 page in Unicode.

Regards
Martin




From tim_one at email.msn.com  Thu May 18 06:34:13 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:13 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <3921A4E5.9BDEBF49@tismer.com>
Message-ID: <000301bfc082$51ce0180$6c2d153f@tim>

[Christian Tismer]
> ...
> Then a string should better not be a sequence.
>
> The number of places where I really used the string sequence
> protocol to take advantage of it is outperfomed by a factor
> of ten by cases where I missed to tupleise and got a bad
> result. A traceback is better than a sequence here.

Alas, I think

    for ch in string:
        muck w/ the character ch

is a common idiom.

> oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris

The "sequenenceness" of strings does get in the way often enough.  Strings
have the amazing property that, since characters are also strings,

    while 1:
        string = string[0]

never terminates with an error.  This often manifests as unbounded recursion
in generic functions that crawl over nested sequences (the first time you
code one of these, you try to stop the recursion on a "is it a sequence?"
test, and then someone passes in something containing a string and it
descends forever).  And we also have that

    format % values

requires "values" to be specifically a tuple rather than any old sequence,
else the current

    "%s" % some_string

could be interpreted the wrong way.

There may be some hope in that the "for/in" protocol is now conflated with
the __getitem__ protocol, so if Python grows a more general iteration
protocol, perhaps we could back away from the sequenceness of strings
without harming "for" iteration over the characters ...





From tim_one at email.msn.com  Thu May 18 06:34:05 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:05 -0400
Subject: [Python-Dev] join() et al.
In-Reply-To: <39225EB3.8D2C9A26@lemburg.com>
Message-ID: <000001bfc082$4d9d5020$6c2d153f@tim>

[M.-A. Lemburg]
> ...
> Uhm, aren't we discussing a generic sequence join API here ?

It depends on whether your "we" includes me <wink>.

> Well, in that case I'd still be interested in hearing about
> your thoughts so that I can integrate such a beast in mxTools.
> The acceptance level needed for doing that is much lower than
> for the core builtins ;-)

Heh heh.  Python already has a generic sequence join API, called "reduce".
What else do you want beyond that?  There's nothing else I want, and I don't
even want reduce <0.9 wink>.  You can mine any modern Lisp, or any ancient
APL, for more of this ilk.  NumPy has some use for stuff like this, but
effective schemes require dealing with multiple dimensions intelligently,
and then you're in the proper domain of matrices rather than sequences.
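
For the record, here is what "join via reduce" looks like next to the
string method; this is only an illustration, not something anyone is
proposing:

    from functools import reduce  # reduce is a builtin in 1.5.2; the import
                                  # is only needed in newer Pythons

    pieces = ("eggs", "bacon", "toast")
    via_reduce = reduce(lambda a, b: a + " " + b, pieces)  # 'eggs bacon toast'
    via_method = " ".join(pieces)                          # same result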

> >  That said,
> >
> >     space.join((eggs, bacon, toast))
> >
> > should <wink> produce
> >
> >     str(egg) + space + str(bacon) + space + str(toast)
> >
> > although how Unicode should fit into all this was never clear to me.

> But that would mask errors and,

As I said elsewhere in the msg, I have never seen this "error" do anything
except irritate a user whose intent was the utterly obvious one (i.e.,
convert the object to a string, then catenate it).

> even worse, "work around" coercion, which is not a good idea, IMHO.
> Note that the need to coerce to Unicode was the reason why the
> implicit str() in " ".join() was removed from Barry's original string
> methods implementation.

I'm hoping that in P3K we have only one string type, and then the ambiguity
goes away.  In the meantime, it's a good reason to drop Unicode support
<snicker>.

> space.join(map(str,seq)) is much clearer in this respect: it
> forces the user to think about what the join should do with non-
> string types.

They're producing a string; they want join to turn the pieces into strings;
it's a no-brainer unless join is hypergeneralized into terminal obscurity
(like, indeed, Python's "reduce").

simple-tools-for-tedious-little-tasks-ly y'rs  - tim





From tim_one at email.msn.com  Thu May 18 06:34:11 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:11 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <39219B01.A4EE0920@tismer.com>
Message-ID: <000201bfc082$50909f80$6c2d153f@tim>


[Christian Tismer]
> ...
> After all, it is no surprise. They are right.
> If we have to change their mind in order to understand
> a basic operation, then we are wrong, not they.

[Tim]
> Huh!  I would not have guessed that you'd give up on Stackless
> that easily <wink>.

[Chris]
> Noh, I didn't give up Stackless, but fishing for soles.
> After Just v. R. has become my most ambitious user,
> I'm happy enough.

I suspect you missed the point:  Stackless is the *ultimate* exercise in
"changing their mind in order to understand a basic operation".  I was
tweaking you, just as you're tweaking me <smile!>.

> It is absolutely phantastic.
> The most uninteresting stuff in the join is the separator,
> and it has the power to merge thousands of strings
> together, without asking the sequence at all
>  - give all power to the suppressed, long live the Python anarchy :-)

Exactly!  Just as love has the power to bind thousands of incompatible
humans without asking them either:  a vote for space.join() is a vote for
peace on earth.

while-a-generic-join-builtin-is-a-vote-for-war<wink>-ly y'rs  - tim





From tim_one at email.msn.com  Thu May 18 06:34:17 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:17 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000001bfbe40$07f14520$b82d153f@tim>
Message-ID: <000401bfc082$54211940$6c2d153f@tim>

Just a brief note on the little list-grower I posted.  Upon more digging
this doesn't appear to have any relation to Dragon's Win98 headaches, so I
haven't looked at it much more.  Two data points:

1. Gordon McM and I both tried it under NT 4 systems (thanks, G!), and
   those are the only Windows platforms under which no MemoryError is
   raised.  But the runtime behavior is very clearly quadratic-time (in
   the ultimate length of the list) under NT.

2. Win98 comes with very few diagnostic tools useful at this level.  The
   Python process does *not* grow to an unreasonable size.  However, using
   a freeware heap walker I quickly determined that Python sprays
   data *all over* its entire 2Gb virtual heap space while running this
   thing, and then the memory error occurs.  The dump file for the system
   heap memory blocks (just listing the start address, length, & status of
   each block) is about 128Kb and I haven't had time to analyze it.  It's
   clearly terribly fragmented, though.  The mystery here is why Win98
   isn't coalescing all the gazillions of free areas to come up with a big-
   enough contiguous chunk to satisfy the request (according to me <wink>,
   the program doesn't create any long-lived data other than the list --
   it appends "1" each time, and uses xrange).

Dragon's Win98 woes appear due to something else:  right after a Win98
system w/ 64Mb RAM is booted, about half the memory is already locked (not
just committed)!  Dragon's product needs more than the remaining 32Mb to
avoid thrashing.  Even stranger, killing every process after booting
releases an insignificant amount of that locked memory.  Strange too, on my
Win98 w/ 160Mb of RAM, upon booting Win98 a massive 50Mb is locked.  This is
insane, and we haven't been able to figure out on whose behalf all this
memory is being allocated.

personally-like-win98-a-lot-but-then-i-bought-a-lot-of-ram-ly y'rs
    - tim





From moshez at math.huji.ac.il  Thu May 18 07:36:09 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Thu, 18 May 2000 08:36:09 +0300 (IDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
In-Reply-To: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <Pine.GSO.4.10.10005180827490.14709-100000@sundial>

[Tim Peters, on sequenceness of strings]
>     for ch in string:
>         muck w/ the character ch
> 
> is a common idiom.

Hmmmm...if you add a new method,

for ch in string.as_sequence():
	muck w/ the character ch

You'd solve this.

But you won't manage to convince me that you haven't used things like

string[3:5]+string[6:] to get all the characters that...

The real problem (as I see it, from my very strange POV) is that Python
uses strings for two distinct uses:

1 -- Symbols
2 -- Arrays of characters

"Symbols" are ``run-time representation of identifiers''. For example,
getattr's "prototype" "should be"

getattr(object, symbol, object=None)

While re's search method should be

re_object.search(string)

Of course, there are symbol->string and string->symbol functions, just as
there are list->tuple and tuple->list functions. 

BTW, this would also solve problems if you want to go case-insensitive in
Py3K: == is case-sensitive on strings, but case-insensitive on symbols.
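
A two-minute sketch of what I mean; the Symbol class here is entirely
hypothetical, not a proposed implementation:

    class Symbol:
        # run-time representation of an identifier; == ignores case
        def __init__(self, name):
            self.name = name
        def __eq__(self, other):
            return (isinstance(other, Symbol) and
                    self.name.lower() == other.name.lower())
        def __hash__(self):
            return hash(self.name.lower())

    Symbol("Spam") == Symbol("spam")    # symbols: equal, case ignored
    "Spam" == "spam"                    # strings: not equal, case matters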

i've-got-this-on-my-chest-since-the-python-conference-and-it-was-a-
  good-opportunity-to-get-it-off-ly y'rs, Z.
--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From ping at lfw.org  Thu May 18 06:37:42 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 17 May 2000 21:37:42 -0700 (PDT)
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON
 FEATURE:))
In-Reply-To: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <Pine.LNX.4.10.10005172133490.775-100000@skuld.lfw.org>

On Thu, 18 May 2000, Tim Peters wrote:
> There may be some hope in that the "for/in" protocol is now conflated with
> the __getitem__ protocol, so if Python grows a more general iteration
> protocol, perhaps we could back away from the sequenceness of strings
> without harming "for" iteration over the characters ...

But there's no way we can back away from

    spam = eggs[hack:chop] + ham[slice:dice]

on strings.  It's just too ideal.

Perhaps eventually the answer will be a character type?

Or perhaps no change at all.  I've not had the pleasure of running
into these problems with characters-being-strings before, even though
your survey of the various gotchas now makes that kind of surprising.


-- ?!ng

"Happiness isn't something you experience; it's something you remember."
    -- Oscar Levant




From mal at lemburg.com  Thu May 18 11:43:57 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 May 2000 11:43:57 +0200
Subject: [Python-Dev] join() et al.
References: <000001bfc082$4d9d5020$6c2d153f@tim>
Message-ID: <3923BB5D.47A28CBE@lemburg.com>

Tim Peters wrote:
> 
> [M.-A. Lemburg]
> > ...
> > Uhm, aren't we discussing a generic sequence join API here ?
> 
> It depends on whether your "we" includes me <wink>.
> 
> > Well, in that case I'd still be interested in hearing about
> > your thoughts so that I can integrate such a beast in mxTools.
> > The acceptance level needed for doing that is much lower than
> > for the core builtins ;-)
> 
> Heh heh.  Python already has a generic sequence join API, called "reduce".
> What else do you want beyond that?  There's nothing else I want, and I don't
> even want reduce <0.9 wink>.  You can mine any modern Lisp, or any ancient
> APL, for more of this ilk.  NumPy has some use for stuff like this, but
> effective schemes require dealing with multiple dimensions intelligently,
> and then you're in the proper domain of matrices rather than sequences.

The idea behind a generic join() API was that it could be
used to make algorithms dealing with sequences polymorphic --
but you're right: this goal is probably too far-fetched.

> > >  That said,
> > >
> > >     space.join((eggs, bacon, toast))
> > >
> > > should <wink> produce
> > >
> > >     str(egg) + space + str(bacon) + space + str(toast)
> > >
> > > although how Unicode should fit into all this was never clear to me.
> 
> > But that would mask errors and,
> 
> As I said elsewhere in the msg, I have never seen this "error" do anything
> except irritate a user whose intent was the utterly obvious one (i.e.,
> convert the object to a string, then catenate it).
> 
> > even worse, "work around" coercion, which is not a good idea, IMHO.
> > Note that the need to coerce to Unicode was the reason why the
> > implicit str() in " ".join() was removed from Barry's original string
> > methods implementation.
> 
> I'm hoping that in P3K we have only one string type, and then the ambiguity
> goes away.  In the meantime, it's a good reason to drop Unicode support
> <snicker>.

I'm hoping for that too... it should be Unicode everywhere if you'd
ask me.

In the meantime we can test drive this goal using the -U command
line option: it turns "" into u"" without any source code change.
The fun part about this is that running python in -U mode
reveals quite a few places where the standard lib doesn't handle
Unicode properly, so there's a lot of work ahead...

> > space.join(map(str,seq)) is much clearer in this respect: it
> > forces the user to think about what the join should do with non-
> > string types.
> 
> They're producing a string; they want join to turn the pieces into strings;
> it's a no-brainer unless join is hypergeneralized into terminal obscurity
> (like, indeed, Python's "reduce").

Hmm, the Unicode implementation does these implicit
conversions during coercion and you've all seen the success...
are you sure you want more of this ? 

We could have "".join() apply str() for all objects *except* Unicode.
1 + "2" == "12" would also be an option, or maybe 1 + "2" == 3 ? ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jack at oratrix.nl  Thu May 18 12:01:16 2000
From: jack at oratrix.nl (Jack Jansen)
Date: Thu, 18 May 2000 12:01:16 +0200
Subject: [Python-Dev] hey, who broke the array module? 
In-Reply-To: Message by Trent Mick <trentm@activestate.com> ,
	     Mon, 15 May 2000 14:09:58 -0700 , <20000515140958.C20418@activestate.com>
Message-ID: <20000518100116.F06AB370CF2@snelboot.oratrix.nl>

> I broke it with my patches to test overflow for some of the PyArg_Parse*()
> formatting characters. The upshot of testing for overflow is that now those
> formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> unsigned-ness as appropriate (you have to know if the value is signed or
> unsigned to know what limits to check against for overflow). Two
> possibilities presented themselves:

I think this is a _very_ bad idea. I have a few thousand (literally) routines 
calling to Macintosh system calls that use "h" for 16 bit flag-word values, 
and the constants are all of the form

kDoSomething = 0x0001
kDoSomethingElse = 0x0002
...
kDoSomethingEvenMoreBrilliant = 0x8000

I'm pretty sure other operating systems have lots of calls with similar 
problems. I would strongly suggest using a new format char if you want 
overflow-tested integers.
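
To make the problem concrete, here is a tiny model of the limits involved
(illustrative Python only, not the actual PyArg_Parse*() code; the variable
names are made up):

    SHRT_MIN, SHRT_MAX, USHRT_MAX = -32768, 32767, 65535

    kDoSomethingEvenMoreBrilliant = 0x8000  # 32768 -- a perfectly good flag bit

    # a signed-only check, as the patched "h" now does: 32768 > SHRT_MAX,
    # so the flag word is rejected even though it fits in 16 bits
    fits_signed_h = SHRT_MIN <= kDoSomethingEvenMoreBrilliant <= SHRT_MAX

    # a relaxed [SHRT_MIN, USHRT_MAX] check would accept it, while still
    # rejecting anything that needs more than 16 bits
    fits_16_bits = SHRT_MIN <= kDoSomethingEvenMoreBrilliant <= USHRT_MAX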
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From trentm at activestate.com  Thu May 18 18:56:47 2000
From: trentm at activestate.com (Trent Mick)
Date: Thu, 18 May 2000 09:56:47 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <20000518100116.F06AB370CF2@snelboot.oratrix.nl>
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl>
Message-ID: <20000518095647.D32135@activestate.com>

On Thu, May 18, 2000 at 12:01:16PM +0200, Jack Jansen wrote:
> > I broke it with my patches to test overflow for some of the PyArg_Parse*()
> > formatting characters. The upshot of testing for overflow is that now those
> > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> > unsigned-ness as appropriate (you have to know if the value is signed or
> > unsigned to know what limits to check against for overflow). Two
> > possibilities presented themselves:
> 
> I think this is a _very_ bad idea. I have a few thousand (literally) routines 
> calling to Macintosh system calls that use "h" for 16 bit flag-word values, 
> and the constants are all of the form
> 
> kDoSomething = 0x0001
> kDoSomethingElse = 0x0002
> ...
> kDoSomethingEvenMoreBrilliant = 0x8000
> 
> I'm pretty sure other operating systems have lots of calls with similar 
> problems. I would strongly suggest using a new format char if you want 
> overflow-tested integers.

Sigh. What do you think Guido? This is your call.

1. go back to no bounds testing
2. bounds check for [SHRT_MIN, USHRT_MAX] etc (this would allow signed and
unsigned values but is sort of false security for bounds checking)
3. keep it the way it is: 'b' is unsigned and the rest are signed
4. add new format characters or a modifying character for signed and unsigned
versions of these.

Trent

-- 
Trent Mick
trentm at activestate.com



From guido at python.org  Fri May 19 00:05:45 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 18 May 2000 15:05:45 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: Your message of "Thu, 18 May 2000 09:56:47 PDT."
             <20000518095647.D32135@activestate.com> 
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl>  
            <20000518095647.D32135@activestate.com> 
Message-ID: <200005182205.PAA12830@cj20424-a.reston1.va.home.com>

> On Thu, May 18, 2000 at 12:01:16PM +0200, Jack Jansen wrote:
> > > I broke it with my patches to test overflow for some of the PyArg_Parse*()
> > > formatting characters. The upshot of testing for overflow is that now those
> > > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or
> > > unsigned-ness as appropriate (you have to know if the value is signed or
> > > unsigned to know what limits to check against for overflow). Two
> > > possibilities presented themselves:
> > 
> > I think this is a _very_ bad idea. I have a few thousand (literally) routines 
> > calling to Macintosh system calls that use "h" for 16 bit flag-word values, 
> > and the constants are all of the form
> > 
> > kDoSomething = 0x0001
> > kDoSomethingElse = 0x0002
> > ...
> > kDoSomethingEvenMoreBrilliant = 0x8000
> > 
> > I'm pretty sure other operating systems have lots of calls with similar 
> > problems. I would strongly suggest using a new format char if you want 
> > overflow-tested integers.
> 
> Sigh. What do you think Guido? This is your call.
> 
> 1. go back to no bounds testing
> 2. bounds check for [SHRT_MIN, USHRT_MAX] etc (this would allow signed and
> unsigned values but is sort of false security for bounds checking)
> 3. keep it the way it is: 'b' is unsigned and the rest are signed
> 4. add new format characters or a modifying character for signed and unsigned
> versions of these.

Sigh indeed.  Ideally, we'd introduce H for unsigned and then lock
Jack in a room with his Macintosh computer for 48 hours to fix all his
code...

Jack, what do you think?  Is this acceptable?  (I don't know if you're
still into S&M :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From trentm at activestate.com  Thu May 18 22:38:59 2000
From: trentm at activestate.com (Trent Mick)
Date: Thu, 18 May 2000 13:38:59 -0700
Subject: [Python-Dev] hey, who broke the array module?
In-Reply-To: <200005182249.PAA13020@cj20424-a.reston1.va.home.com>
References: <trentm@activestate.com> <20000518100116.F06AB370CF2@snelboot.oratrix.nl> <20000518095647.D32135@activestate.com> <200005182205.PAA12830@cj20424-a.reston1.va.home.com> <20000518121723.A3252@activestate.com> <200005182225.PAA12950@cj20424-a.reston1.va.home.com> <20000518123029.A3330@activestate.com> <200005182249.PAA13020@cj20424-a.reston1.va.home.com>
Message-ID: <20000518133859.A3665@activestate.com>

On Thu, May 18, 2000 at 03:49:59PM -0700, Guido van Rossum wrote:
> 
> Maybe we can come up with a modifier for signed or unsigned range
> checking?

Ha! How about 'u'? :) Or 's'? :)

I really can't think of a nice answer for this. Could introduce completely
separate formatter characters that do the range checking and remove range
checking from the current formatters. That is an ugly kludge. Could introduce
a separate PyArg_CheckedParse*() or something like that and slowly migrate to
it. This one could use something other than "L" for LONG_LONG.

I think the long term solution should be:
 - have bounds-checked signed and unsigned version of all the integral types
 - call them i/I, b/B, etc. (a la array module)
 - use something other than "L" for LONG_LONG (as you said, q/Q maybe)

The problem is to find a satisfactory migratory path to that.

Sorry, I don't have an answer. Just more questions.

Trent


p.s. If you were going to check in my associated patch: it has a problem with
tab usage in test_array.py, so I will resubmit it soon (in a couple of days).

-- 
Trent Mick
trentm at activestate.com



From guido at python.org  Fri May 19 17:06:52 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 19 May 2000 08:06:52 -0700
Subject: [Python-Dev] repr vs. str and locales again
Message-ID: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>

The email below suggests a simple solution to a problem that
e.g. François Pinard brought up long ago; repr() of a string turns
all non-ASCII chars into \oct escapes.  Jyrki's solution: use
isprint(), which makes it locale-dependent.  I can live with this.

It needs a Py_CHARMASK() call but otherwise seems to be fine.

Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
similar patch for unicode strings (once the ASCII proposal is
implemented).

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Fri, 19 May 2000 10:48:29 +0300
From:    Jyrki Kuoppala <jkp at kaapeli.fi>
To:      guido at python.org
Subject: python bug?: python 1.5.2 fails to print printable 8-bit characters in
	   strings

I'm not sure if this exactly is a bug, ie. whether python 1.5.2 is
supposed to support locales and 8-bit characters.  However, on Linux
Debian "unstable" distribution the diff below makes python 1.5.2
handle printable 8-bit characters as one would expect.

Problem description:

python doesn't properly print printable 8-bit characters for the current
locale.

Details:

With no locale set, 8-bit characters in quoted strings print as
backslash-escapes, which I guess is OK:

$ unset LC_ALL
$ python
Python 1.5.2 (#0, Apr  3 2000, 14:46:48)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'k\344\344k')
>>>

But with a locale with a printable 'ä' character (octal 344) I get:

$ export LC_ALL=fi_FI
$ python
Python 1.5.2 (#0, Apr  3 2000, 14:46:48)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'k\344\344k')
>>>

I should be getting (output from python patched with the enclosed patch):

$ export LC_ALL=fi_FI
$ python
Python 1.5.2 (#0, May 18 2000, 14:43:46)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'kääk')
>>>                              

This hits for example when Zope with squishdot weblog (squishdot
0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles -
strings with valid Latin1 characters get indexed as backslash-escaped
octal codes, and thus become unsearchable.

I am using debian unstable, kernels 2.2.15pre10 and 2.0.36, libc 2.1.3.

I suggest that the test for printability in python-1.5.2
/Objects/stringobject.c be fixed to use isprint() which takes the
locale into account:

--- python-1.5.2/Objects/stringobject.c.orig	Thu Oct  8 05:17:48 1998
+++ python-1.5.2/Objects/stringobject.c	Thu May 18 14:36:28 2000
@@ -224,7 +224,7 @@
 		c = op->ob_sval[i];
 		if (c == quote || c == '\\')
 			fprintf(fp, "\\%c", c);
-		else if (c < ' ' || c >= 0177)
+		else if (! isprint (c))
 			fprintf(fp, "\\%03o", c & 0377);
 		else
 			fputc(c, fp);
@@ -260,7 +260,7 @@
 			c = op->ob_sval[i];
 			if (c == quote || c == '\\')
 				*p++ = '\\', *p++ = c;
-			else if (c < ' ' || c >= 0177) {
+			else if (! isprint (c)) {
 				sprintf(p, "\\%03o", c & 0377);
 				while (*p != '\0')
 					p++;



//Jyrki

------- End of Forwarded Message




From guido at python.org  Fri May 19 17:13:01 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 19 May 2000 08:13:01 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: Your message of "Fri, 19 May 2000 11:25:43 +0200."
             <39250897.6F42@cnet.francetelecom.fr> 
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org>  
            <39250897.6F42@cnet.francetelecom.fr> 
Message-ID: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>

[Quoting the entire mail because I've added python-dev to the cc:
list]

> Subject: Re: Python multiplexing is too hard (was: Network statistics program)
> From: Alexandre Ferrieux <alexandre.ferrieux at cnet.francetelecom.fr>
> To: Guido van Rossum <guido at python.org>
> Cc: claird at starbase.neosoft.com
> Date: Fri, 19 May 2000 11:25:43 +0200
> Delivery-Date: Fri May 19 05:26:59 2000
> 
> Guido van Rossum wrote:
> > 
> > Cameron Laird wrote:
> > >                    .
> > > Right.  asyncore is nice--but restricted to socket
> > > connections.  For many applications, that's not a
> > > restriction at all.  However, it'd be nice to have
> > > such a handy interface for communication with
> > > same-host processes; that's why I mentioned popen*().
> > > Does no one else perceive a gap there, in convenient
> > > asynchronous piped IPC?  Do folks just fall back on
> > > select() for this case?
> > 
> > Hm, really?  For same-host processes, threads would
> > do the job nicely I'd say.
> 
> Overkill.
> 
> >  Or you could probably
> > use unix domain sockets (popen only really works on
> > Unix, so that's not much of a restriction).
> 
> Overkill.
> 
> > Also note that often this is needed in the context
> > of a GUI app; there something integrated in the GUI
> > main loop is recommended.  (E.g. the file events that
> > Moshe mentioned.)
> 
> Okay so your answer is, The Python Way of doing it is to use Tcl.
> That's pretty disappointing, I'm sorry to say...
> 
> Consider:
> 
> 	- In Tcl, as you said, this is nicely integrated with the GUI's 
> 	  event queue:
> 		- on unix, by a an additional bit on X's fd (socket) in 
> 		  the select()
> 		- on 'doze, everything is brought back to messages 
> 		  anyway.
> 
> 	And, in both cases, it works with pipes, sockets, serial or other
> devices. Uniform, clean.
> 
> 	- In python "popen only really works on Unix": are you satisfied with
> that state of affairs ? I understand (and value) Python's focus on
> algorithms and data structures, and working around OS shortcomings is a
> boring, ancillary task. But what about the potential gain ?
> 
> I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
> just so beautiful inside. But while Tcl is weaker in the algorithms, it
> is stronger in the os-wrapping library, and taught me to love high-level
> abstractions. [fileevent] shines in this respect, and I'll miss it in
> Python.
> 		
> -Alex

Alex, it's disappointing to me too!  There just isn't anything
currently in the library to do this, and I haven't written apps that
need this often enough to have a good feel for what kind of
abstraction is needed.

However perhaps we can come up with a design for something better?  Do
you have a suggestion here?

I agree with your comment that higher-level abstractions around OS
stuff are needed -- I learned system programming long ago, in C, and
I'm "happy enough" with the current state of affairs, but I agree that
for many people this is a problem, and there's no reason why Python
couldn't do better...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fredrik at pythonware.com  Fri May 19 14:44:55 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 19 May 2000 14:44:55 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <002f01bfc190$09870c00$0500a8c0@secret.pythonware.com>

Guido van Rossum wrote:
> Jyrki's solution: use isprint(), which makes it locale-dependent.
> I can live with this.
> 
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
> 
> Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
> similar patch for unicode strings (once the ASCII proposal is
> implemented).

does ctype-related locale stuff really mix well with unicode?

if yes, -0. if no, +0.

(intuitively, I'd say no -- deprecate in 1.6, remove in 1.7)

(btw, what about "eval(repr(s)) == s" ?)

</F>




From mal at lemburg.com  Fri May 19 14:30:08 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 May 2000 14:30:08 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <392533D0.965E47E4@lemburg.com>

Guido van Rossum wrote:
> 
> The email below suggests a simple solution to a problem that
> e.g. François Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.
> 
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
> 
> Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
> similar patch for unicode strings (once the ASCII proposal is
> implemented).

The subject line is a bit misleading: the patch only touches
tp_print, not repr() output. And this is good, IMHO, since
otherwise eval(repr(string)) wouldn't necessarily result
in string.

Unicode objects don't implement a tp_print slot... perhaps
they should ?

--

About the ASCII proposal:

Would you be satisfied with what

import sys
sys.set_string_encoding('ascii')

currently implements ?

There are several places where an encoding comes into play with
the Unicode implementation. The above API currently changes
str(unicode), print unicode and the assumption made by the
implementation during coercion of strings to Unicode.

It does not change the encoding used to implement the "s"
or "t" parser markers and also doesn't change the way the
Unicode hash value is computed (these are currently still
hard-coded as UTF-8).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gward at mems-exchange.org  Fri May 19 14:45:12 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 19 May 2000 08:45:12 -0400
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>; from guido@python.org on Fri, May 19, 2000 at 08:06:52AM -0700
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <20000519084511.A14717@mems-exchange.org>

On 19 May 2000, Guido van Rossum said:
> The email below suggests a simple solution to a problem that
> e.g. François Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

For "ASCII" strings in this day and age -- which are often not
necessarily plain ol' 7-bit ASCII -- I'd say that "32 <= c <= 127" is
not the right way to determine printability.  'isprint()' seems much
more appropriate to me.

Are there other areas of Python that should be locale-sensitive but
aren't?  A minor objection to this patch is that it's a creeping change
that brings in a little bit of locale-sensitivity without addressing a
(possibly) wider problem.  However, I will immediately shoot down my own
objection on the grounds that if we try to fix everything all at once,
then nothing will ever get fixed.  Locale sensitivity strikes me as the
sort of thing that *can* be a "creeping" change -- just fix the bits
that bug people most, and eventually all the important bits will be
fixed.

I have no expertise and therefore no opinion on such a change for
Unicode strings.

        Greg



From pf at artcom-gmbh.de  Fri May 19 14:44:00 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 19 May 2000 14:44:00 +0200 (MEST)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 19, 2000  8: 6:52 am"
Message-ID: <m12sm8G-000CnCC@artcom0.artcom-gmbh.de>

Guido van Rossum asks:
> The email below suggests a simple solution to a problem that
> e.g. François Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

How portable is the locale awareness of 'isprint' among
traditional Unix environments, WinXX and MacOS?  This works fine on
my favorite development platform (Linux), but an accidental use of
this new 'feature' might hurt the portability of my Python apps to
other platforms.  If 'isprint' honors the locale in a similar way
on other important platforms I would like this.  Otherwise I would
prefer the current behaviour so that I can deal with it during the
early stages of development on my Linux boxes.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From bwarsaw at python.org  Fri May 19 20:51:23 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Fri, 19 May 2000 11:51:23 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
	<20000519084511.A14717@mems-exchange.org>
Message-ID: <14629.36139.735410.272339@localhost.localdomain>

>>>>> "GW" == Greg Ward <gward at mems-exchange.org> writes:

    GW> Locale sensitivity strikes me as the sort of thing that *can*
    GW> be a "creeping" change -- just fix the bits that bug people
    GW> most, and eventually all the important bits will be fixed.

Another decidedly ignorant Anglophone here, but one problem that I see
with localizing stuff is that locale is app- (or at least thread-)
global, isn't it?  That would suck for applications like Mailman which
are (going to be) multilingual in the sense that a single instance of
the application will serve up documents in many languages, as opposed
to serving up documents in just one of a choice of languages.

If it seems I don't know what I'm talking about, you're probably
right.  I just wanted to point out that there are applications that have to
deal with many languages at the same time.

-Barry




From effbot at telia.com  Fri May 19 18:46:39 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Fri, 19 May 2000 18:46:39 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com><20000519084511.A14717@mems-exchange.org> <14629.36139.735410.272339@localhost.localdomain>
Message-ID: <00e001bfc1b1$d0c1d7c0$34aab5d4@hagrid>

Barry Warsaw wrote:
> Another decidedly ignorant Anglophone here, but one problem that I see
> with localizing stuff is that locale is app- (or at least thread-)
> global, isn't it?  That would suck for applications like Mailman which
> are (going to be) multilingual in the sense that a single instance of
> the application will serve up documents in many languages, as opposed
> to serving up documents in just one of a choice of languages.
> 
> If it seems I don't know what I'm talking about, you're probably
> right.  I just wanted to point out that there are applications that have to
> deal with many languages at the same time.

Applications may also have to deal with output devices (e.g. GUI
toolkits, printers, communication links) that don't necessarily have
the same restrictions as the "default console".

better do it the right way: deal with encodings at the boundaries,
not inside the application.
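
in code, something like this (a minimal sketch using the new unicode
API; the stream names and encodings are made up):

    # keep text as unicode inside the application...
    text = u"k\344\344k"

    def emit(stream, text, encoding):
        # ...and pick an encoding only at the boundary, per output device
        stream.write(text.encode(encoding))

    # emit(console, text, "iso-8859-1")
    # emit(logfile, text, "utf-8")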

</F>




From gward at mems-exchange.org  Fri May 19 19:03:18 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 19 May 2000 13:03:18 -0400
Subject: [Python-Dev] Dynamic linking problem on Solaris
Message-ID: <20000519130317.A16111@mems-exchange.org>

Hi all --

interesting problem with building Robin Dunn's extension for BSD DB 2.x
as a shared object on Solaris 2.6 for Python 1.5.2 with GCC 2.8.1 and
Sun's linker.  (Yes, all of those things seem to matter.)

DB 2.x (well, at least 2.7.7) contains this line of C code:

    *mbytesp = sb.st_size / MEGABYTE;

where 'sb' is a 'struct stat' -- ie. 'sb.st_size' is a long long, which
I believe is 64 bits on Solaris.  Anyways, GCC compiles this division
into a subroutine call -- I guess the SPARC doesn't have a 64-bit
divide, or if it does then GCC doesn't know about it.

Of course, the subroutine in question -- '__cmpdi2' -- is defined in
libgcc.a.  So if you write a C application that uses BSD DB 2.x, and
compile and link it with GCC, no problem -- everything is controlled by
GCC, so libgcc.a gets linked in at the appropriate time, the linker
finds '__cmpdi2' and includes it in your binary executable, and
everything works.

However, if you're building a Python extension that uses BSD DB 2.x,
there's a problem: the default command for creating a shared extension
on Solaris is "ld -G" -- this is in Python's Makefile, so it affects
extension building with either Makefile.pre.in or the Distutils.

However, since "ld" is Sun's "ld", it doesn't know anything about
libgcc.a.  And, since presumably no 64-bit division is done in Python
itself, '__cmpdi2' isn't already present in the Python binary.  The
result: when you attempt to load the extension, you die:

  $ python -c "import dbc"
  Traceback (innermost last):
    File "<string>", line 1, in ?
  ImportError: ld.so.1: python: fatal: relocation error: file ./dbcmodule.so: symbol __cmpdi2: referenced symbol not found

The workaround turns out to be fairly easy, and there are actually two
of them.  First, add libgcc.a to the link command, ie. instead of

  ld -G  db_wrap.o  -L/usr/local/BerkeleyDB/lib -ldb -o dbcmodule.so

use

  ld -G  db_wrap.o  -L/usr/local/BerkeleyDB/lib -ldb \
    /depot/gnu/plat/lib/gcc-lib/sparc-sun-solaris2.6/2.8.1/libgcc.a \
    -o dbcmodule.so

(where the location of libgcc.a is variable, but invariably hairy).  Or,
it turns out that you can just use "gcc -G" to create the extension:

  gcc -G db_wrap.o -ldb -o dbcmodule.so

Seems to me that the latter is a no-brainer.

So the question arises: why is the default command for building
extensions on Solaris "ld -G" instead of "gcc -G"?  I'm inclined to go
edit my installed Makefile to make this permanent... what will that
break?

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From bwarsaw at python.org  Fri May 19 22:09:09 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Fri, 19 May 2000 13:09:09 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
	<20000519084511.A14717@mems-exchange.org>
	<14629.36139.735410.272339@localhost.localdomain>
	<00e001bfc1b1$d0c1d7c0$34aab5d4@hagrid>
Message-ID: <14629.40805.180119.929694@localhost.localdomain>

>>>>> "FL" == Fredrik Lundh <effbot at telia.com> writes:

    FL> better do it the right way: deal with encodings at the
    FL> boundaries, not inside the application.

Sounds good to me. :)




From ping at lfw.org  Fri May 19 19:04:18 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Fri, 19 May 2000 10:04:18 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005190957260.2892-100000@localhost>

On Fri, 19 May 2000, Ka-Ping Yee wrote:
> 
> Changing the behaviour of repr() (a function that internally
> converts data into data)

Clarification: what i meant by the above is, repr() is not
explicitly an input or an output function.  It does "some
internal computation".

Here is one alternative:

    repr(obj, **kw): options specified in kw dict
                     
        push each element in kw dict into sys.repr_options
        now do the normal conversion, referring to whatever
            options are relevant (such as "locale" if doing strings)
        for looking up any option, first check kw dict,
            then look for sys.repr_options[option]
        restore sys.repr_options

This is ugly and i still like printon/printout better, but
at least it's a smaller change and won't prevent the implementation
of printon/printout later.

This suggestion is not thread-safe.


-- ?!ng

"Simple, yet complex."
    -- Lenore Snell




From ping at lfw.org  Fri May 19 18:56:50 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Fri, 19 May 2000 09:56:50 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>

On Fri, 19 May 2000, Guido van Rossum wrote:
> The email below suggests a simple solution to a problem that
> e.g. François Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> isprint(), which makes it locale-dependent.  I can live with this.

Changing the behaviour of repr() (a function that internally
converts data into data) based on a fixed global system parameter
makes me uncomfortable.  Wouldn't it make more sense for the
locale business to be a property of the stream that the string
is being printed on?

This was the gist of my proposal for files having a printout
method a while ago.  I understand if that proposal is a bit too
much of a change to swallow at once, but i'd like to ensure the
door stays open to let it be possible in the future.

Surely there are other language systems that deal with the
issue of "nicely" printing their own data structures for human
interpretation... anyone have any experience to share?  The
printout/printon thing originally comes from Smalltalk, i believe.

(...which reminds me -- i played with Squeak the other day and
thought to myself, it would be cool to browse and edit code in
Python with a system browser like that.)


Note, however:

> This hits for example when Zope with squishdot weblog (squishdot
> 0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles -
> strings with valid Latin1 characters get indexed as backslash-escaped
> octal codes, and thus become unsearchable.

The above comment in particular strikes me as very fishy.
How on earth can the escaping behaviour of repr() affect the
indexing of text?  Surely when you do a search, you search for
exactly what you asked for.

And does the above mean that, with Jyrki's proposed fix, the
sorting and searching behaviour of Squishdot will suddenly
change, and magically differ from locale to locale?  Is that
something we want?  (That last is not a rhetorical question --
my gut says no, but i don't actually have enough experience
working with these issues to know the answer.)


-- ?!ng

"Simple, yet complex."
    -- Lenore Snell




From mal at lemburg.com  Fri May 19 21:06:24 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 May 2000 21:06:24 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005190947520.2892-100000@localhost>
Message-ID: <392590B0.5CA4F31D@lemburg.com>

Ka-Ping Yee wrote:
> 
> On Fri, 19 May 2000, Guido van Rossum wrote:
> > The email below suggests a simple solution to a problem that
> > e.g. François Pinard brought up long ago; repr() of a string turns
> > all non-ASCII chars into \oct escapes.  Jyrki's solution: use
> > isprint(), which makes it locale-dependent.  I can live with this.
> 
> Changing the behaviour of repr() (a function that internally
> converts data into data) based on a fixed global system parameter
> makes me uncomfortable.  Wouldn't it make more sense for the
> locale business to be a property of the stream that the string
> is being printed on?

Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
string_print API which is used for the tp_print slot, so the
only effect to be seen is when printing a string to a real
file object (tp_print is only used by PyObject_Print() and that
API is only used for writing to real PyFileObjects -- all
other streams get the output of str() or repr()).

Perhaps we should drop tp_print for strings altogether and
let str() and repr() decide what to do... (this is
what Unicode objects do). The only good reason for implementing
tp_print is to write huge amounts of data to a stream without
creating intermediate objects -- not really needed for strings,
since these *are* the intermediate object usually created for
just this purpose ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jeremy at alum.mit.edu  Sat May 20 02:46:11 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Fri, 19 May 2000 17:46:11 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
Message-ID: <14629.57427.9434.623247@localhost.localdomain>

I applied the recent changes to the CVS httplib to Greg's httplib
(call it httplib11) this afternoon.  The result is included below.  I
think this is quite close to checking in, but it could use a slightly
better test suite.

There are a few outstanding questions.

httplib11 does not implement the debuglevel feature.  I don't think
it's important, but it is currently documented and may be used.
Guido, should we implement it?

httplib w/SSL uses a constructor with this prototype:
    def __init__(self, host='', port=None, **x509):
It looks like the x509 dictionary should contain two variables --
key_file and cert_file.  Since we know what the entries are, why not
make them explicit?
    def __init__(self, host='', port=None, cert_file=None, key_file=None):
(Or reverse the two arguments if that is clearer.)

The FakeSocket class in CVS has a comment after the makefile def line
that says "hopefully, never have to write."  It won't do at all the
right thing when called with a write mode, so it ought to raise an
exception.  Any reason it doesn't?

I'd like to add a couple of test cases that use HTTP/1.1 to get some
pages from python.org, including one that uses the chunked encoding.
Just haven't gotten around to it.  Question on that front: Does it
make sense to incorporate the test function in the module with the std
regression test suite?  In general, I would think so.  In this
particular case, the test could fail because of host networking
problems.  I think that's okay as long as the error message is clear
enough. 

Jeremy
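
A minimal usage sketch for the class below (the host is only an example,
and this isn't one of the promised test cases):

    h = HTTPConnection("www.python.org")
    h.request("GET", "/")
    errcode, errmsg, response = h.getreply()
    body = response.read()
    h.close()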

"""HTTP/1.1 client library"""

# Written by Greg Stein.

import socket
import string
import mimetools

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

error = 'httplib.error'

HTTP_PORT = 80
HTTPS_PORT = 443

class HTTPResponse(mimetools.Message):
    __super_init = mimetools.Message.__init__
    
    def __init__(self, fp, version, errcode):
        self.__super_init(fp, 0)

        if version == 'HTTP/1.0':
            self.version = 10
        elif version[:7] == 'HTTP/1.':
            self.version = 11 # use HTTP/1.1 code for HTTP/1.x where x>=1
        else:
            raise error, 'unknown HTTP protocol'

        # are we using the chunked-style of transfer encoding?
        tr_enc = self.getheader('transfer-encoding')
        if tr_enc:
            if string.lower(tr_enc) != 'chunked':
                raise error, 'unknown transfer-encoding'
            self.chunked = 1
            self.chunk_left = None
        else:
            self.chunked = 0

        # will the connection close at the end of the response?
        conn = self.getheader('connection')
        if conn:
            conn = string.lower(conn)
            # a "Connection: close" will always close the
            # connection. if we don't see that and this is not
            # HTTP/1.1, then the connection will close unless we see a
            # Keep-Alive header. 
            self.will_close = string.find(conn, 'close') != -1 or \
                              ( self.version != 11 and \
                                not self.getheader('keep-alive') )
        else:
            # for HTTP/1.1, the connection will always remain open
            # otherwise, it will remain open IFF we see a Keep-Alive header
            self.will_close = self.version != 11 and \
                              not self.getheader('keep-alive')

        # do we have a Content-Length?
        # NOTE: RFC 2616, S4.4, #3 says we ignore this if tr_enc is "chunked"
        length = self.getheader('content-length')
        if length and not self.chunked:
            self.length = int(length)
        else:
            self.length = None

        # does the body have a fixed length? (of zero)
        if (errcode == 204 or               # No Content
            errcode == 304 or               # Not Modified
            100 <= errcode < 200):          # 1xx codes
            self.length = 0

        # if the connection remains open, and we aren't using chunked, and
        # a content-length was not provided, then assume that the connection
        # WILL close.
        if not self.will_close and \
           not self.chunked and \
           self.length is None:
            self.will_close = 1

        # if there is no body, then close NOW. read() may never be
        # called, thus we will never mark self as closed.
        if self.length == 0:
            self.close()

    def close(self):
        if self.fp:
            self.fp.close()
            self.fp = None

    def isclosed(self):
        # NOTE: it is possible that we will not ever call self.close(). This
        #       case occurs when will_close is TRUE, length is None, and we
        #       read up to the last byte, but NOT past it.
        #
        # IMPLIES: if will_close is FALSE, then self.close() will ALWAYS be
        #          called, meaning self.isclosed() is meaningful.
        return self.fp is None

    def read(self, amt=None):
        if self.fp is None:
            return ''

        if self.chunked:
            chunk_left = self.chunk_left
            value = ''
            while 1:
                if chunk_left is None:
                    line = self.fp.readline()
                    i = string.find(line, ';')
                    if i >= 0:
                        line = line[:i]     # strip chunk-extensions
                    chunk_left = string.atoi(line, 16)
                    if chunk_left == 0:
                        break
                if amt is None:
                    value = value + self.fp.read(chunk_left)
                elif amt < chunk_left:
                    value = value + self.fp.read(amt)
                    self.chunk_left = chunk_left - amt
                    return value
                elif amt == chunk_left:
                    value = value + self.fp.read(amt)
                    self.fp.read(2)    # toss the CRLF at the end of the chunk
                    self.chunk_left = None
                    return value
                else:
                    value = value + self.fp.read(chunk_left)
                    amt = amt - chunk_left

                # we read the whole chunk, get another
                self.fp.read(2)        # toss the CRLF at the end of the chunk
                chunk_left = None

            # read and discard trailer up to the CRLF terminator
            ### note: we shouldn't have any trailers!
            while 1:
                line = self.fp.readline()
                if line == '\r\n':
                    break

            # we read everything; close the "file"
            self.close()

            return value

        elif amt is None:
            # unbounded read
            if self.will_close:
                s = self.fp.read()
            else:
                s = self.fp.read(self.length)
            self.close()      # we read everything
            return s

        if self.length is not None:
            if amt > self.length:
                # clip the read to the "end of response"
                amt = self.length
            self.length = self.length - amt

        s = self.fp.read(amt)

        # close our "file" if we know we should
        ### I'm not sure about the len(s) < amt part; we should be
        ### safe because we shouldn't be using non-blocking sockets
        if self.length == 0 or len(s) < amt:
            self.close()

        return s


class HTTPConnection:

    _http_vsn = 11
    _http_vsn_str = 'HTTP/1.1'

    response_class = HTTPResponse
    default_port = HTTP_PORT

    def __init__(self, host, port=None):
        self.sock = None
        self.response = None
        self._set_hostport(host, port)

    def _set_hostport(self, host, port):
        if port is None:
            i = string.find(host, ':')
            if i >= 0:
                port = int(host[i+1:])
                host = host[:i]
            else:
                port = self.default_port
        self.host = host
        self.port = port
        self.addr = host, port

    def connect(self):
        """Connect to the host and port specified in __init__."""
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect(self.addr)

    def close(self):
        """Close the connection to the HTTP server."""
        if self.sock:
            self.sock.close() # close it manually... there may be other refs
            self.sock = None
        if self.response:
            self.response.close()
            self.response = None

    def send(self, str):
        """Send `str' to the server."""
        if self.sock is None:
            self.connect()

        # send the data to the server. if we get a broken pipe, then close
        # the socket. we want to reconnect when somebody tries to send again.
        #
        # NOTE: we DO propagate the error, though, because we cannot simply
        #       ignore the error... the caller will know if they can retry.
        try:
            self.sock.send(str)
        except socket.error, v:
            if v[0] == 32:    # Broken pipe
                self.close()
            raise

    def putrequest(self, method, url):
        """Send a request to the server.

        `method' specifies an HTTP request method, e.g. 'GET'.
        `url' specifies the object being requested, e.g.
        '/index.html'.
        """
        if self.response is not None:
            if not self.response.isclosed():
                ### implies half-duplex!
                raise error, 'prior response has not been fully handled'
            self.response = None

        if not url:
            url = '/'
        str = '%s %s %s\r\n' % (method, url, self._http_vsn_str)

        try:
            self.send(str)
        except socket.error, v:
            if v[0] != 32:    # Broken pipe
                raise
            # try one more time (the socket was closed; this will reopen)
            self.send(str)

        self.putheader('Host', self.host)

        if self._http_vsn == 11:
            # Issue some standard headers for better HTTP/1.1 compliance

            # note: we are assuming that clients will not attempt to set these
            #     headers since *this* library must deal with the consequences.
            #     this also means that when the supporting libraries are
            #     updated to recognize other forms, then this code should be
            #     changed (removed or updated).

            # we only want a Content-Encoding of "identity" since we don't
            # support encodings such as x-gzip or x-deflate.
            self.putheader('Accept-Encoding', 'identity')

            # we can accept "chunked" Transfer-Encodings, but no others
            # NOTE: no TE header implies *only* "chunked"
            #self.putheader('TE', 'chunked')

            # if TE is supplied in the header, then it must appear in a
            # Connection header.
            #self.putheader('Connection', 'TE')

        else:
            # For HTTP/1.0, the server will assume "not chunked"
            pass

    def putheader(self, header, value):
        """Send a request header line to the server.

        For example: h.putheader('Accept', 'text/html')
        """
        str = '%s: %s\r\n' % (header, value)
        self.send(str)

    def endheaders(self):
        """Indicate that the last header line has been sent to the server."""

        self.send('\r\n')

    def request(self, method, url, body=None, headers={}):
        """Send a complete request to the server."""

        try:
            self._send_request(method, url, body, headers)
        except socket.error, v:
            if v[0] != 32:    # Broken pipe
                raise
            # try one more time
            self._send_request(method, url, body, headers)

    def _send_request(self, method, url, body, headers):
        self.putrequest(method, url)

        if body:
            self.putheader('Content-Length', str(len(body)))
        for hdr, value in headers.items():
            self.putheader(hdr, value)
        self.endheaders()

        if body:
            self.send(body)

    def getreply(self):
        """Get a reply from the server.

        Returns a tuple consisting of:
        - server response code (e.g. '200' if all goes well)
        - server response string corresponding to response code
        - any RFC822 headers in the response from the server

        """
        file = self.sock.makefile('rb')
        line = file.readline()
        try:
            [ver, code, msg] = string.split(line, None, 2)
        except ValueError:
            try:
                [ver, code] = string.split(line, None, 1)
                msg = ""
            except ValueError:
                self.close()
                return -1, line, file
        if ver[:5] != 'HTTP/':
            self.close()
            return -1, line, file
        errcode = int(code)
        errmsg = string.strip(msg)
        response = self.response_class(file, ver, errcode)
        if response.will_close:
            # this effectively passes the connection to the response
            self.close()
        else:
            # remember this, so we can tell when it is complete
            self.response = response
        return errcode, errmsg, response

class FakeSocket:
    def __init__(self, sock, ssl):
        self.__sock = sock
        self.__ssl = ssl
        return

    def makefile(self, mode):           # hopefully, never have to write
        # XXX add assert about mode != w???
        msgbuf = ""
        while 1:
            try:
                msgbuf = msgbuf + self.__ssl.read()
            except socket.sslerror, msg:
                break
        return StringIO(msgbuf)

    def send(self, stuff, flags = 0):
        return self.__ssl.write(stuff)

    def recv(self, len = 1024, flags = 0):
        return self.__ssl.read(len)

    def __getattr__(self, attr):
        return getattr(self.__sock, attr)

class HTTPSConnection(HTTPConnection):
    """This class allows communication via SSL."""
    __super_init = HTTPConnection.__init__

    default_port = HTTPS_PORT

    def __init__(self, host, port=None, **x509):
        self.__super_init(host, port)
        self.key_file = x509.get('key_file')
        self.cert_file = x509.get('cert_file')

    def connect(self):
        """Connect to a host onf a given port
        
        Note: This method is automatically invoked by __init__, if a host
        is specified during instantiation.
        """
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect(self.addr)
        ssl = socket.ssl(sock, self.key_file, self.cert_file)
        self.sock = FakeSocket(sock, ssl)

class HTTPMixin:
    """Mixin for compatibility with httplib.py from 1.5.

    Requires that the inheriting class define the following attributes:
    super_init
    super_connect
    super_putheader
    super_getreply
    """

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def connect(self, host=None, port=None):
        "Accept arguments to set the host/port, since the superclass doesn't."
        if host is not None:
            self._set_hostport(host, port)
        self.super_connect()

    def set_debuglevel(self, debuglevel):
        "The class no longer supports the debuglevel."
        pass

    def getfile(self):
        "Provide a getfile, since the superclass' use of HTTP/1.1 prevents it."
        return self.file

    def putheader(self, header, *values):
        "The superclass allows only one value argument."
        self.super_putheader(header, string.joinfields(values,'\r\n\t'))

    def getreply(self):
        "Compensate for an instance attribute shuffling."
        errcode, errmsg, response = self.super_getreply()
        if errcode == -1:
            self.file = response  # response is the "file" when errcode==-1
            self.headers = None
            return -1, errmsg, None

        self.headers = response
        self.file = response.fp
        return errcode, errmsg, response

class HTTP(HTTPMixin, HTTPConnection):
    super_init = HTTPConnection.__init__
    super_connect = HTTPConnection.connect
    super_putheader = HTTPConnection.putheader
    super_getreply = HTTPConnection.getreply

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def __init__(self, host='', port=None):
        "Provide a default host, since the superclass requires one."
        # Note that we may pass an empty string as the host; this will throw
        # an error when we attempt to connect. Presumably, the client code
        # will call connect before then, with a proper host.
        self.super_init(host, port)

class HTTPS(HTTPMixin, HTTPSConnection):
    super_init = HTTPSConnection.__init__
    super_connect = HTTPSConnection.connect
    super_putheader = HTTPSConnection.putheader
    super_getreply = HTTPSConnection.getreply

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def __init__(self, host='', port=None, **x509):
        "Provide a default host, since the superclass requires one."
        # Note that we may pass an empty string as the host; this will throw
        # an error when we attempt to connect. Presumably, the client code
        # will call connect before then, with a proper host.
        self.super_init(host, port, **x509)

def test():
    """Test this module.

    The test consists of retrieving and displaying the Python
    home page, along with the error code and error string returned
    by the www.python.org server.
    """

    import sys
    import getopt
    opts, args = getopt.getopt(sys.argv[1:], 'd')
    dl = 0
    for o, a in opts:
        if o == '-d': dl = dl + 1
    host = 'www.python.org'
    selector = '/'
    if args[0:]: host = args[0]
    if args[1:]: selector = args[1]
    h = HTTP()
    h.set_debuglevel(dl)
    h.connect(host)
    h.putrequest('GET', selector)
    h.endheaders()
    errcode, errmsg, headers = h.getreply()
    print 'errcode =', errcode
    print 'errmsg  =', errmsg
    print
    if headers:
        for header in headers.headers: print string.strip(header)
    print
    print h.getfile().read()

    if hasattr(socket, 'ssl'):
        host = 'www.c2.net'
        hs = HTTPS()
        hs.connect(host)
        hs.putrequest('GET', selector)
        hs.endheaders()
        errcode, errmsg, headers = hs.getreply()
        print 'errcode =', errcode
        print 'errmsg  =', errmsg
        print
        if headers:
            for header in headers.headers: print string.strip(header)
        print
        print hs.getfile().read()

if __name__ == '__main__':
    test()




From claird at starbase.neosoft.com  Sat May 20 00:02:47 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Fri, 19 May 2000 17:02:47 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <200005192202.RAA48753@starbase.neosoft.com>

	From guido at cj20424-a.reston1.va.home.com  Fri May 19 07:26:16 2000
			.
			.
			.
	> Consider:
	> 
	> 	- In Tcl, as you said, this is nicely integrated with the GUI's 
	> 	  event queue:
	> 		- on unix, by a an additional bit on X's fd (socket) in 
	> 		  the select()
	> 		- on 'doze, everything is brought back to messages 
	> 		  anyway.
	> 
	> 	And, in both cases, it works with pipes, sockets, serial or other
	> devices. Uniform, clean.
	> 
	> 	- In python "popen only really works on Unix": are you satisfied with
	> that state of affairs ? I understand (and value) Python's focus on
	> algorithms and data structures, and worming around OS misgivings is a
	> boring, ancillary task. But what about the potential gain ?
	> 
	> I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
	> just so beautiful inside. But while Tcl is weaker in the algorithms, it
	> is stronger in the os-wrapping library, and taught me to love high-level
	> abstractions. [fileevent] shines in this respect, and I'll miss it in
	> Python.
	> 		
	> -Alex

	Alex, it's disappointing to me too!  There just isn't anything
	currently in the library to do this, and I haven't written apps that
	needs this often enough to have a good feel for what kind of
	abstraction is needed.

	However perhaps we can come up with a design for something better?  Do
	you have a suggestion here?

	I agree with your comment that higher-level abstractions around OS
	stuff are needed -- I learned system programming long ago, in C, and
	I'm "happy enough" with the current state of affairs, but I agree that
	for many people this is a problem, and there's no reason why Python
	couldn't do better...

	--Guido van Rossum (home page: http://www.python.org/~guido/)
Great questions!  Alex and I are both working
on answers, I think; we're definitely not ig-
noring this.  More, in time.

One thing of which I'm certain:  I do NOT like
documentation entries that say things like
"select() doesn't really work except under Unix"
(still true?  Maybe that's been fixed?).  As a
user, I just find that intolerable.  Sufficiently
intolerable that I'll help change the situation?
Well, I'm working on that part now ...



From guido at python.org  Sat May 20 03:19:20 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 19 May 2000 18:19:20 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: Your message of "Fri, 19 May 2000 17:02:47 CDT."
             <200005192202.RAA48753@starbase.neosoft.com> 
References: <200005192202.RAA48753@starbase.neosoft.com> 
Message-ID: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>

> One thing of which I'm certain:  I do NOT like
> documentation entries that say things like
> "select() doesn't really work except under Unix"
> (still true?  Maybe that's been fixed?).

Hm, that's bogus.  It works well under Windows -- with the restriction
that it only works for sockets, but for sockets it works as well as
on Unix.  It also works well on the Mac.  I wonder where that note
came from (it's probably 6 years old :-).

Fred...?

> As a
> user, I just find that intolerable.  Sufficiently
> intolerable that I'll help change the situation?
> Well, I'm working on that part now ...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From claird at starbase.neosoft.com  Sat May 20 00:37:48 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Fri, 19 May 2000 17:37:48 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>
Message-ID: <200005192237.RAA49766@starbase.neosoft.com>

	From guido at cj20424-a.reston1.va.home.com  Fri May 19 17:32:39 2000
			.
			.
			.
	> One thing of which I'm certain:  I do NOT like
	> documentation entries that say things like
	> "select() doesn't really work except under Unix"
	> (still true?  Maybe that's been fixed?).

	Hm, that's bogus.  It works well under Windows -- with the restriction
	that it only works for sockets, but for sockets it works as well as
	on Unix.  it also works well on the Mac.  I wonder where that note
	came from (it's probably 6 years old :-).

	Fred...?
			.
			.
			.
I sure don't mean to propagate misinformation.
I'll make it more of a habit to forward such
items to Fred as I find them.



From guido at python.org  Sat May 20 03:30:30 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 19 May 2000 18:30:30 -0700
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: Your message of "Fri, 19 May 2000 17:46:11 PDT."
             <14629.57427.9434.623247@localhost.localdomain> 
References: <14629.57427.9434.623247@localhost.localdomain> 
Message-ID: <200005200130.SAA02265@cj20424-a.reston1.va.home.com>

> I applied the recent changes to the CVS httplib to Greg's httplib
> (call it httplib11) this afternoon.  The result is included below.  I
> think this is quite close to checking in, but it could use a slightly
> better test suite.

Thanks -- but note that I don't have the time to review the code.

> There are a few outstanding questions.
> 
> httplib11 does not implement the debuglevel feature.  I don't think
> it's important, but it is currently documented and may be used.
> Guido, should we implement it?

I think the solution is to provide the API but ignore the call or
argument.

> httplib w/SSL uses a constructor with this prototype:
>     def __init__(self, host='', port=None, **x509):
> It looks like the x509 dictionary should contain two variables --
> key_file and cert_file.  Since we know what the entries are, why not
> make them explicit?
>     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> (Or reverse the two arguments if that is clearer.)

The reason for the **x509 syntax (I think -- I didn't introduce it) is
that it *forces* the user to use keyword args, which is a good thing
for such an advanced feature.  However there should be code that
checks that no other keyword args are present.
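
To make that concrete, a minimal sketch of such a check, written against
the HTTPSConnection constructor quoted above (the TypeError and its
message text are just illustrative, not part of the posted code):

    def __init__(self, host, port=None, **x509):
        self.__super_init(host, port)
        # reject anything other than the two recognized keyword arguments
        for key in x509.keys():
            if key not in ('cert_file', 'key_file'):
                raise TypeError, 'unexpected keyword argument: %s' % key
        self.key_file = x509.get('key_file')
        self.cert_file = x509.get('cert_file')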

> The FakeSocket class in CVS has a comment after the makefile def line
> that says "hopefully, never have to write."  It won't do at all the
> right thing when called with a write mode, so it ought to raise an
> exception.  Any reason it doesn't?

Probably laziness on the part of the code.  Thanks for this code review (I guess I
was in a hurry when I checked that code in :-).

> I'd like to add a couple of test cases that use HTTP/1.1 to get some
> pages from python.org, including one that uses the chunked encoding.
> Just haven't gotten around to it.  Question on that front: Does it
> make sense to incorporate the test function in the module with the std
> regression test suite?  In general, I would think so.  In this
> particular case, the test could fail because of host networking
> problems.  I think that's okay as long as the error message is clear
> enough. 

Yes, I agree.  Maybe it should raise ImportError when the network is
unreachable -- this is the one exception that the regrtest module
considers non-fatal.
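
Something along these lines would do for the test module (only a sketch;
the host name and the message text are illustrative):

    import socket
    import httplib

    try:
        h = httplib.HTTPConnection('www.python.org')
        h.connect()
    except socket.error, reason:
        # regrtest considers ImportError non-fatal, so a dead network
        # skips the test instead of failing the suite
        raise ImportError, 'network is unreachable: %s' % str(reason)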

--Guido van Rossum (home page: http://www.python.org/~guido/)



From DavidA at ActiveState.com  Sat May 20 00:38:16 2000
From: DavidA at ActiveState.com (David Ascher)
Date: Fri, 19 May 2000 15:38:16 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005192237.RAA49766@starbase.neosoft.com>
Message-ID: <PLEJJNOHDIGGLDPOGPJJGEIPCDAA.DavidA@ActiveState.com>

> 	> One thing of which I'm certain:  I do NOT like
> 	> documentation entries that say things like
> 	> "select() doesn't really work except under Unix"
> 	> (still true?  Maybe that's been fixed?).
>
> 	Hm, that's bogus.  It works well under Windows -- with the
> restriction
> 	that it only works for sockets, but for sockets it works as well as
> 	on Unix.  it also works well on the Mac.  I wonder where that note
> 	came from (it's probably 6 years old :-).

I'm pretty sure I know where it came from -- it came from Sam Rushing's
tutorial on how to use Medusa, which was more or less cut & pasted into the
doc, probably at the time that asyncore and asynchat were added to the
Python core.  IMO, it's not the best part of the Python doc -- it is much
too low to the ground, and assumes the reader already understands much about
I/O, sync/async issues, and cares mostly about high performance.  All of
which are true of wonderful Sam, most of which are not true of the average
Python user.

While we're complaining about doc, asynchat is not documented, I believe.
Alas, I'm unable to find the time to write up said documentation.

--david

PS: I'm not sure that multiplexing can be made _easy_.  Issues like
block/nonblocking communications channels, multithreading etc. are hard to
ignore, as much as one might want to.




From gstein at lyra.org  Sat May 20 00:38:59 2000
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 May 2000 15:38:59 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: <200005200130.SAA02265@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005191535180.6486-100000@nebula.lyra.org>

On Fri, 19 May 2000, Guido van Rossum wrote:
> > I applied the recent changes to the CVS httplib to Greg's httplib
> > (call it httplib11) this afternoon.  The result is included below.  I
> > think this is quite close to checking in,

I'll fold the changes into my copy here (at least), until we're ready to
check into Python itself.

THANK YOU for doing this work. It is the "heavy lifting" part that I just
haven't had a chance to get to myself.

I have a small, local change dealing with the 'Host' header (it shouldn't
be sent automatically for HTTP/1.0; some httplib users already send it
and having *two* in the output headers will make some servers puke).
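
In other words, roughly this in putrequest() (a sketch of the idea, not
necessarily the exact patch):

    # only add Host automatically when speaking HTTP/1.1; 1.0 callers
    # may already be supplying it themselves
    if self._http_vsn == 11:
        self.putheader('Host', self.host)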

> > but it could use a slightly
> > better test suite.
> 
> Thanks -- but note that I don't have the time to review the code.

I'm reviewing it, too. Gotta work around the fact that Jeremy re-indented
the code, though... :-)

> > There are a few outstanding questions.
> > 
> > httplib11 does not implement the debuglevel feature.  I don't think
> > it's important, but it is currently documented and may be used.
> > Guido, should we implement it?
> 
> I think the solution is to provide the API ignore the call or
> argument.

Can do: ignore the debuglevel feature.

> > httplib w/SSL uses a constructor with this prototype:
> >     def __init__(self, host='', port=None, **x509):
> > It looks like the x509 dictionary should contain two variables --
> > key_file and cert_file.  Since we know what the entries are, why not
> > make them explicit?
> >     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> > (Or reverse the two arguments if that is clearer.)
> 
> The reason for the **x509 syntax (I think -- I didn't introduce it) is
> that it *forces* the user to use keyword args, which is a good thing
> for such an advanced feature.  However there should be code that
> checks that no other keyword args are present.

Can do: raise an error if other keyword args are present.

> > The FakeSocket class in CVS has a comment after the makefile def line
> > that says "hopefully, never have to write."  It won't do at all the
> > right thing when called with a write mode, so it ought to raise an
> > exception.  Any reason it doesn't?
> 
> Probably laziness of the code.  Thanks for this code review (I guess I
> was in a hurry when I checked that code in :-).

+1 on raising an exception.

> 
> > I'd like to add a couple of test cases that use HTTP/1.1 to get some
> > pages from python.org, including one that uses the chunked encoding.
> > Just haven't gotten around to it.  Question on that front: Does it
> > make sense to incorporate the test function in the module with the std
> > regression test suite?  In general, I would think so.  In this
> > particular case, the test could fail because of host networking
> > problems.  I think that's okay as long as the error message is clear
> > enough. 
> 
> Yes, I agree.  Maybe it should raise ImportError when the network is
> unreachable -- this is the one exception that the regrtest module
> considers non-fatal.

+1 on shifting to the test modules.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bckfnn at worldonline.dk  Sat May 20 17:19:09 2000
From: bckfnn at worldonline.dk (Finn Bock)
Date: Sat, 20 May 2000 15:19:09 GMT
Subject: [Python-Dev] Heads up: unicode file I/O in JPython.
Message-ID: <392690f3.17235923@smtp.worldonline.dk>

I have recently released errata-07 which improves on JPython's ability
to handle unicode characters as well as binary data read from and
written to python files.

The conversions can be described as

- I/O to a file opened in binary mode will read/write the low 8 bits
  of each char. Writing Unicode chars >0xFF will cause silent
  truncation [*].

- I/O to a file opened in text mode will push the character 
  through the default encoding for the platform (in addition to 
  handling CR/LF issues).

This breaks completely with Python 1.6a2, but I believe that it is close
to the expectations of Java users. (The current JPython-1.1 behavior is
completely useless for both characters and binary data. It only barely
manages to handle 7-bit ASCII.)

In JPython (with the errata) we can do:

  f = open("test207.out", "w")
  f.write("\x20ac") # On my w2k platform this writes 0x80 to the file.
  f.close()

  f = open("test207.out", "r")
  print hex(ord(f.read()))
  f.close()

  f = open("test207.out", "wb")
  f.write("\x20ac") # On all platforms this writes 0xAC to the file.
  f.close()

  f = open("test207.out", "rb")
  print hex(ord(f.read()))
  f.close()

With the output of:

  0x20ac
  0xac

I do not expect anything like this in CPython. I just hope that all
unicode advice given on c.l.py comes with the modifier that JPython
might do it differently.

regards,
finn

    http://sourceforge.net/project/filelist.php?group_id=1842

[*] Silent overflow is bad, but it is at least twice as fast as having
to check each char for overflow.





From esr at netaxs.com  Sun May 21 00:36:56 2000
From: esr at netaxs.com (Eric Raymond)
Date: Sat, 20 May 2000 18:36:56 -0400
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <200005162001.QAA16657@eric.cnri.reston.va.us>; from Guido van Rossum on Tue, May 16, 2000 at 04:01:46PM -0400
References: <009d01bfbf64$b779a260$34aab5d4@hagrid> <3921984A.8CDE8E1D@prescod.net> <200005162001.QAA16657@eric.cnri.reston.va.us>
Message-ID: <20000520183656.F7487@unix3.netaxs.com>

On Tue, May 16, 2000 at 04:01:46PM -0400, Guido van Rossum wrote:
> > I hope that if Python were renamed we would not choose yet another name
> > which turns up hundreds of false hits in web engines. Perhaps Homr or
> > Home_r. Or maybe Pythahn.
> 
> Actually, I'd like to call the next version Throatwobbler Mangrove.
> But you'd have to pronounce it Raymond Luxury Yach-t.

Great.  I'll take a J-class kitted for open-ocean sailing, please.  Do
I get a side of bikini babes with that?
-- 
	<a href="http://www.tuxedo.org/~esr/home.html">Eric S. Raymond</a>



From ping at lfw.org  Sun May 21 12:30:05 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Sun, 21 May 2000 03:30:05 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <392590B0.5CA4F31D@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005210329160.420-100000@localhost>

On Fri, 19 May 2000, M.-A. Lemburg wrote:
> Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> string_print API which is used for the tp_print slot,

Very sorry!  I didn't actually look to see where the patch
was being applied.

But then how can this have any effect on squishdot's indexing?



-- ?!ng

"All models are wrong; some models are useful."
    -- George Box





From pf at artcom-gmbh.de  Sun May 21 17:54:06 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Sun, 21 May 2000 17:54:06 +0200 (MEST)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <Pine.LNX.4.10.10005210329160.420-100000@localhost> from Ka-Ping Yee at "May 21, 2000  3:30: 5 am"
Message-ID: <m12tY3K-000CnvC@artcom0.artcom-gmbh.de>

Hi!

Ka-Ping Yee:
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> > string_print API which is used for the tp_print slot,
> 
> Very sorry!  I didn't actually look to see where the patch
> was being applied.
> 
> But then how can this have any effect on squishdot's indexing?

Sigh.  Let me explain this in some detail.

What do you see here: ????????  If all went well, you should
see some Umlauts, which occur quite often in German words like
"Begrüssung", "ätzend" or "Grützkacke" and so on.

During the late 80s we here in Germany spent a lot of our free time
patching open source tools like 'elm', 'B-News', 'less' and
others to make them "8-Bit clean".  For example, on ancient Unices
like SCO Xenix, the implementations of C-library functions
like 'isprint' and 'islower' were out of reach.

After several years everybody seemed to agree on ISO-8859-1 as the new
European standard character set, which was also often loosely called
8-bit ASCII, because ASCII is a true subset of ISO Latin-1.  At least
the German versions of Windows used ISO-8859-1.

As the WWW began to gain popularity, nobody with a sane mind really
used those splendid ASCII escapes like '&auml;' instead of 'ä'.  The
same holds true for the TeX users' community, where everybody was happy
to type real umlauts instead of the ugly backslash escape sequences
used before: \"a\"o\"u ...

To make it short: A lot of effort has been spent to make *ALL* programs
"8-Bit clean"; that is, to move the bytes through without translating
them from or into a bunch of incompatible multi-byte sequences,
which nobody can read or even wants to look at.

Now to get back to your question:  There are several nice HTML indexing
engines out there.  I personally use HTDig.  At least on Linux these
programs deal fine with HTML files containing 8-bit chars.

But if for some reason Umlauts end up as octal escapes ('\344' instead of 'ä')
due to the use of a Python 'print some_tuple' during the creation of HTML
files, a search engine will be unable to find those words with escaped
umlauts.
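
A small illustration of the effect (assuming a Latin-1 terminal and the
current string repr() behaviour):

    >>> s = 'Begr\344ssung'     # 0344 octal == 0xE4 == 'ä' in Latin-1
    >>> print s                 # str(): the byte passes through untouched
    Begrüssung
    >>> print (s,)              # tuple items are printed via repr()
    ('Begr\344ssung',)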

Mit freundlichen Grüßen, Peter
P.S.: Hope you didn't find my explanation boring or off-topic.



From effbot at telia.com  Sun May 21 18:26:00 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Sun, 21 May 2000 18:26:00 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <m12tY3K-000CnvC@artcom0.artcom-gmbh.de>
Message-ID: <005601bfc341$40eb0d60$34aab5d4@hagrid>

Peter Funk <pf at artcom-gmbh.de> wrote:
> But if for some reason Umlauts end up as octal escapes ('\344' instead of 'ä')
> due to the use of a Python 'print some_tuple' during the creation of HTML
> files, a search engine will be unable to find those words with escaped
> umlauts.

umm.  why would anyone use "print some_tuple" when generating
HTML pages?  what if the tuple contains something that results in
a "<" character?

</F>




From guido at python.org  Sun May 21 23:20:03 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 21 May 2000 14:20:03 -0700
Subject: [Python-Dev] Is the tempfile module really a security risk?
Message-ID: <200005212120.OAA05258@cj20424-a.reston1.va.home.com>

Every few months I receive patches that purport to make the tempfile
module more secure.  I've never felt that it is a problem.  What is
with these people?  My feeling about these suggestions has always been
that they have read about similar insecurities in C code run by the
super-user, and are trying to get the teacher's attention by proposing
something clever.

Or is there really a problem?  Is anyone in this forum aware of
security issues with tempfile?  Should I worry?  Is the
"random-tempfile" patch that the poster below suggested worth
applying?

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Sun, 21 May 2000 19:34:43 +0200
From:    Ragnar Kjørstad <ragnark at vestdata.no>
To:      Guido van Rossum <guido at python.org>
cc:      patches at python.org
Subject: Re: [Patches] Patch to make tempfile return random filenames

On Sun, May 21, 2000 at 12:17:08PM -0700, Guido van Rossum wrote:
> Hm, I don't like this very much.  Random sequences have a small but
> nonzero probability of generating the same number in rapid succession
> -- probably one in a million or so.  It would be very bad if one in a
> million runs of a particular application crashed for this reason.
> 
> A better way to prevent this kind of attack (if you care about it) is
> to use tempfile.TemporaryFile(), which avoids this vulnerability in a
> different way.
> 
> (Also note the test for os.path.exists() so that an attacker would
> have to use very precise timing to make this work.)

1. The os.path.exists part does not solve the problem. It leaves a race
condition that is not very hard to get around, by having a program
create and delete the file at maximum speed. It will have a 50%
chance of breaking your program.

2. O_EXCL does not always work. E.g. it does not work over NFS - there
are probably other broken implementations too.

3. Even if tempfile.TemporaryFile had been sufficient, providing mktemp in
this dangerous way is not good. Many are likely to use it either not
thinking about the problem at all, or assuming it's solved in the
module.

4. The problems you describe can easily be overcome. I removed the
counter and the file-exists check because I figured they were no longer
needed. I was wrong. Either a larger random number should be used, and/or
the counter, and/or the file-exists check. Personally I would want the
random part to be large enough not to have to worry about collisions,
either by chance, after a fork, or by deliberate attack.


Do you want a new patch that addresses these problems better?


- -- 
Ragnar Kjørstad

_______________________________________________
Patches mailing list
Patches at python.org
http://www.python.org/mailman/listinfo/patches

------- End of Forwarded Message




From guido at python.org  Mon May 22 00:05:58 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 21 May 2000 15:05:58 -0700
Subject: [Python-Dev] ANNOUNCE: Python CVS tree moved to SourceForge
Message-ID: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>

I'm happy to announce that we've moved the Python CVS tree to
SourceForge.  SourceForge (www.sourceforge.net) is a free service to
Open Source developers run by VA Linux.

The change has two advantages for us: (1) we no longer have to deal
with the mirroring of our writable CVS repository to the read-only
mirror at cvs.python.org (which will soon be decommissioned); (2) we
will be able to add new developers with checkin privileges.  In
addition, we benefit from the high visibility and availability of
SourceForge.

Instructions on how to access the Python SourceForge tree are here:

  http://sourceforge.net/cvs/?group_id=5470

If you have an existing working tree that points to the cvs.python.org
repository, you may want to retarget it to the SourceForge tree.  This
can be done painlessly with Greg Ward's cvs_chroot script:

  http://starship.python.net/~gward/python/

The email notification to python-checkins at python.org still works
(although during the transition a few checkin messages may have been
lost).

While I've got your attention, please remember that the proper
procedure for submitting patches is described here:

  http://www.python.org/patches/

We've accumulated quite the backlog of patches to be processed during
the transition; we'll start working on these ASAP.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Sun May 21 22:54:23 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sun, 21 May 2000 22:54:23 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost>
Message-ID: <39284CFF.5D9C9B13@lemburg.com>

Ka-Ping Yee wrote:
> 
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> > string_print API which is used for the tp_print slot,
> 
> Very sorry!  I didn't actually look to see where the patch
> was being applied.
> 
> But then how can this have any effect on squishdot's indexing?

The only possible reason I can see is that this squishdot
application uses 'print' to write the data -- perhaps
it pipes it through some other tool ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From effbot at telia.com  Mon May 22 01:24:02 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 01:24:02 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com>
Message-ID: <004f01bfc37b$b1551480$34aab5d4@hagrid>

M.-A. Lemburg <mal at lemburg.com> wrote:
> > But then how can this have any effect on squishdot's indexing?
> 
> The only possible reason I can see is that this squishdot
> application uses 'print' to write the data -- perhaps
> it pipes it through some other tool ?

but doesn't the patch only affects code that manages to call tp_print
without the PRINT_RAW flag?  (that is, in "repr" mode rather than "str"
mode)

or to put it another way, if they manage to call tp_print without the
PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

or am I just totally confused?

</F>




From guido at python.org  Mon May 22 05:47:16 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 21 May 2000 20:47:16 -0700
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: Your message of "Mon, 22 May 2000 01:24:02 +0200."
             <004f01bfc37b$b1551480$34aab5d4@hagrid> 
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com>  
            <004f01bfc37b$b1551480$34aab5d4@hagrid> 
Message-ID: <200005220347.UAA06235@cj20424-a.reston1.va.home.com>

Let's reboot this thread.  Never mind the details of the actual patch,
or why it would affect a particular index.

Obviously if we're going to patch string_print() we're also going to
patch string_repr() (and vice versa) -- the former (without the
Py_PRINT_RAW flag) is supposed to be an optimization of the latter.
(I hadn't even read the patch that far to realize that it only did one
and not the other.)

The point is simply this.

The repr() function for a string turns it into a valid string literal.
There's considerable freedom allowed in this conversion, some of which
is taken (e.g. it prefers single quotes but will use double quotes
when the string contains single quotes).

For safety reasons, control characters are replaced by their octal
escapes.  This is also done for non-ASCII characters.

Lots of people, most of them living in countries where Latin-1 (or
another 8-bit ASCII superset) is in actual use, would prefer that
non-ASCII characters be left alone rather than changed into
octal escapes.  I think it's not unreasonable to ask that what they
consider printable characters aren't treated as control characters.

I think that using the locale to guide this is reasonable.  If the
locale is set to imply Latin-1, then we can assume that most output
devices are capable of displaying those characters.  What good does
converting those characters to octal escapes do us then?  If the input
string was in fact binary goop, then the output will be unreadable
goop -- but it won't screw up the output device (as control characters
are wont to do, which is the main reason to turn them into octal
escapes).
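
To make the idea concrete, here is a minimal sketch of that kind of
escaping (not the actual patch; it assumes, as the locale module does on
most platforms, that setlocale() also updates string.letters, and it
ignores quote/backslash handling for brevity):

    import locale, string

    locale.setlocale(locale.LC_ALL, "")    # pick up the user's locale

    def escape_byte(c):
        # keep plain ASCII printables and whatever the current locale
        # considers a letter; everything else becomes an octal escape
        if ' ' <= c < '\177' or c in string.letters:
            return c
        return '\\%03o' % ord(c)

    def quote_string(s):
        return "'" + string.join(map(escape_byte, s), '') + "'"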

So I don't see how the patch can do much harm, I don't expect that it
will break much code, and I see a real value for those who use
Latin-1 or other 8-bit supersets of ASCII.

The one objection could be that the locale may be obsolescent -- but
I've only heard /F vent an opinion about that; personally, I doubt
that we will be able to remove the locale any time soon, even if we
invent a better way.  Plus, I think that "better way" should address
this issue anyway.  If the locale eventually disappears, the feature
automatically disappears with it, because you *have* to make a
locale.setlocale() call before the behavior of repr() changes.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From pf at artcom-gmbh.de  Mon May 22 08:18:22 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 08:18:22 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005220347.UAA06235@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 21, 2000  8:47:16 pm"
Message-ID: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>

Guido van Rossum:
[...]
> The one objection could be that the locale may be obsolescent -- but
> I've only heard /F vent an opinion about that; personally, I doubt
> that we will be able to remove the locale any time soon, even if we
> invent a better way.  

AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)

Although I understand Barry's and Ping's objections against a global state,
it used to work very well:  On a typical single user Linux system the
user chooses his locale during the first stages of system setup and
never has to think about it again.  On multi user systems the locale
of individual accounts may be customized using several environment
variables, which can override the default locale of the system.

> Plus, I think that "better way" should address
> this issue anyway.  If the locale eventually disappears, the feature
> automatically disappears with it, because you *have* to make a
> locale.setlocale() call before the behavior of repr() changes.

The last sentence is at least not the whole truth.

On POSIX systems there are several environment variables used to
control the default locale settings for a user's session.  For example,
on my SuSE Linux system, currently running in the German locale, the
environment variable LC_CTYPE=de_DE is automatically set by the file
/etc/profile during login, which automatically causes the C-library
function toupper('ä') to return an 'Ä' ---you should see
a lower case a-umlaut as argument and an upper case umlaut as return
value--- without all applications having to call 'setlocale' explicitly.

So this simply works as intended, without having to add calls
to 'setlocale' to every application program using these C-library functions.
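
The same effect is visible from Python, assuming a de_DE locale is
installed and the C library honours it (the interpreter session below is
only illustrative):

    >>> import locale, string
    >>> locale.setlocale(locale.LC_ALL, '')   # obey LC_CTYPE etc. from the environment
    'de_DE'
    >>> string.upper('\344')                  # 0xE4, lower case a-umlaut
    '\304'

(0304 octal is 0xC4, the upper case a-umlaut.)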

Regards, Peter.



From tim_one at email.msn.com  Mon May 22 08:59:16 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 22 May 2000 02:59:16 -0400
Subject: [Python-Dev] Is the tempfile module really a security risk?
In-Reply-To: <200005212120.OAA05258@cj20424-a.reston1.va.home.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEECEGBAA.tim_one@email.msn.com>

[Guido]
> Every few months I receive patches that purport to make the tempfile
> module more secure.  I've never felt that it is a problem.  What is
> with these people?

Doing a google search on

    tempfile security

turns up hundreds of rants.  Have fun <wink>.  There does appear to be a
real vulnerability here somewhere (not necessarily Python), but the closest
I found to a clear explanation in 10 minutes was an annoyed paragraph,
saying that if I didn't already understand the problem I should turn in my
Unix Security Expert badge immediately.  Unfortunately, Bill Gates never
issued one of those to me.

> ...
> Is the "random-tempfile" patch that the poster below suggested worth
> applying?

Certainly not the patch he posted!  And for reasons I sketched in my
patches-list commentary, I doubt any hack based on pseudo-random numbers
*can* solve anything.

assuming-there's-indeed-something-in-need-of-solving-ly y'rs  - tim





From effbot at telia.com  Mon May 22 09:20:50 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 09:20:50 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>
Message-ID: <008001bfc3be$7e5eae40$34aab5d4@hagrid>

Peter Funk wrote:
> AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)

you're missing the point -- now that we've added unicode support to
Python, the old 8-bit locale *ctype* stuff no longer works.  while some
platforms implement a wctype interface, it's not widely available, and it's
not always unicode.

so in order to provide platform-independent unicode support, Python 1.6
comes with unicode-aware and fully portable replacements for the ctype
functions.

the code is already in there...

> On POSIX systems there are a several environment variables used to
> control the default locale settings for a users session.  For example
> on my SuSE Linux system currently running in the german locale the
> environment variable LC_CTYPE=de_DE is automatically set by a file
> /etc/profile during login, which causes automatically the C-library
> function toupper('?') to return an '?' ---you should see
> a lower case a-umlaut as argument and an upper case umlaut as return
> value--- without having all applications to call 'setlocale' explicitly.
>
> So this simply works well as intended without having to add calls
> to 'setlocale' to all application program using this C-library functions.

note that this leaves us with four string flavours in 1.6:

- 8-bit binary arrays.  may contain binary goop, or text in some strange
  encoding.  upper, strip, etc should not be used.

- 8-bit text strings using the system encoding.  upper, strip, etc works
  as long as the locale is properly configured.

- 8-bit unicode text strings.  upper, strip, etc may work, as long as the
  system encoding is a subset of unicode -- which means US ASCII or
  ISO Latin 1.

- wide unicode text strings.  upper, strip, etc always works.

is this complexity really worth it?

</F>




From gstein at lyra.org  Mon May 22 09:47:50 2000
From: gstein at lyra.org (Greg Stein)
Date: Mon, 22 May 2000 00:47:50 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: <Pine.LNX.4.10.10005191535180.6486-100000@nebula.lyra.org>
Message-ID: <Pine.LNX.4.10.10005220045170.30706-100000@nebula.lyra.org>

I've integrated all of these changes into the httplib.py posted on my
pages at:
    http://www.lyra.org/greg/python/

The actual changes are visible thru ViewCVS at:
    http://www.lyra.org/cgi-bin/viewcvs.cgi/gjspy/httplib.py/


The test code is still in there, until a test_httplib can be written.

Still missing: doc for the new-style semantics.

Cheers,
-g

On Fri, 19 May 2000, Greg Stein wrote:
> On Fri, 19 May 2000, Guido van Rossum wrote:
> > > I applied the recent changes to the CVS httplib to Greg's httplib
> > > (call it httplib11) this afternoon.  The result is included below.  I
> > > think this is quite close to checking in,
> 
> I'll fold the changes into my copy here (at least), until we're ready to
> check into Python itself.
> 
> THANK YOU for doing this work. It is the "heavy lifting" part that I just
> haven't had a chance to get to myself.
> 
> I have a small, local change dealing with the 'Host' header (it shouldn't
> be sent automatically for HTTP/1.0; some httplib users already send it
> and having *two* in the output headers will make some servers puke).
> 
> > > but it could use a slightly
> > > better test suite.
> > 
> > Thanks -- but note that I don't have the time to review the code.
> 
> I'm reviewing it, too. Gotta work around the fact that Jeremy re-indented
> the code, though... :-)
> 
> > > There are a few outstanding questions.
> > > 
> > > httplib11 does not implement the debuglevel feature.  I don't think
> > > it's important, but it is currently documented and may be used.
> > > Guido, should we implement it?
> > 
> > I think the solution is to provide the API ignore the call or
> > argument.
> 
> Can do: ignore the debuglevel feature.
> 
> > > httplib w/SSL uses a constructor with this prototype:
> > >     def __init__(self, host='', port=None, **x509):
> > > It looks like the x509 dictionary should contain two variables --
> > > key_file and cert_file.  Since we know what the entries are, why not
> > > make them explicit?
> > >     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> > > (Or reverse the two arguments if that is clearer.)
> > 
> > The reason for the **x509 syntax (I think -- I didn't introduce it) is
> > that it *forces* the user to use keyword args, which is a good thing
> > for such an advanced feature.  However there should be code that
> > checks that no other keyword args are present.
> 
> Can do: raise an error if other keyword args are present.
> 
> > > The FakeSocket class in CVS has a comment after the makefile def line
> > > that says "hopefully, never have to write."  It won't do at all the
> > > right thing when called with a write mode, so it ought to raise an
> > > exception.  Any reason it doesn't?
> > 
> > Probably laziness of the code.  Thanks for this code review (I guess I
> > was in a hurry when I checked that code in :-).
> 
> +1 on raising an exception.
> 
> > 
> > > I'd like to add a couple of test cases that use HTTP/1.1 to get some
> > > pages from python.org, including one that uses the chunked encoding.
> > > Just haven't gotten around to it.  Question on that front: Does it
> > > make sense to incorporate the test function in the module with the std
> > > regression test suite?  In general, I would think so.  In this
> > > particular case, the test could fail because of host networking
> > > problems.  I think that's okay as long as the error message is clear
> > > enough. 
> > 
> > Yes, I agree.  Maybe it should raise ImportError when the network is
> > unreachable -- this is the one exception that the regrtest module
> > considers non-fatal.
> 
> +1 on shifting to the test modules.
> 
> Cheers,
> -g
> 
> -- 
> Greg Stein, http://www.lyra.org/
> 
> 

-- 
Greg Stein, http://www.lyra.org/




From alexandre.ferrieux at cnet.francetelecom.fr  Mon May 22 10:25:21 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Mon, 22 May 2000 10:25:21 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org>  
	            <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <3928EEF1.693F@cnet.francetelecom.fr>

Guido van Rossum wrote:
> 
> > From: Alexandre Ferrieux <alexandre.ferrieux at cnet.francetelecom.fr>
> >
> > I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
> > just so beautiful inside. But while Tcl is weaker in the algorithms, it
> > is stronger in the os-wrapping library, and taught me to love high-level
> > abstractions. [fileevent] shines in this respect, and I'll miss it in
> > Python.
> 
> Alex, it's disappointing to me too!  There just isn't anything
> currently in the library to do this, and I haven't written apps that
> needs this often enough to have a good feel for what kind of
> abstraction is needed.

Thanks for the empathy. Apologies for my slight overreaction.

> However perhaps we can come up with a design for something better?  Do
> you have a suggestion here?

Yup. One easy answer is 'just copy from Tcl'...

Seriously, I'm really too new to Python to suggest the details or even
the *style* of this 'level 2 API to multiplexing'. However, I can sketch
the implementation since select() (from C or Tcl) is the one primitive I
most depend on !

Basically, as shortly mentioned before, the key problem is the
heterogeneity of seemingly-selectable things in Windoze. On unix, not
only does select() work with
all descriptor types on which it makes sense, but also the fd used by
Xlib is accessible; hence clean multiplexing even with a GUI package is
trivial. Now to the real (rotten) meat, that is M$'s. Facts:

	1. 'Handle' types are not equal. Unnamed pipes are (surprise!) not
selectable. Why ? Ask a relative in Redmond...

	2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
handles. They are selectable, but for example you can't use 'em for
redirections. Okay, in our case we don't care; I only mention it because
it's scary and could pop back into your face some time later.

	3. The GUI API doesn't expose a descriptor (handle), but fortunately
(though disgustingly) there is a special syscall to wait on both "the
message queue" and selectable handles: MsgWaitForMultipleObjects. So its
doable, if not beautiful.

The Tcl solution to (1.), which is the only real issue, is to have a
separate thread blockingly read 1 byte from the pipe, and then post a
message back to the main thread to awaken it (yes, ugly code to handle
that extra byte and integrate it with the buffering scheme).

In summary, why not peruse Tcl's hard-won experience on
selecting-on-windoze-pipes ?

Then, for the API exposed to the Python programmer, the Tclly exposed
one is a starter:

	fileevent $channel readable|writable callback
	...
	vwait breaker_variable

Explanation for non-Tclers: fileevent hooks the callback, vwait does a
loop of select(). The callback(s) is(are) called without breaking the
loop, unless $breaker_variable is set, at which time vwait returns.

One note about 'breaker_variable': I'm not sure I like it. I'd prefer
something based on exceptions. I don't quite understand why it's not
already this way in Tcl (which has (kindof) first-class exceptions), but
let's not repeat the mistake: let's suggest that (the equivalent of)
vwait loops forever, only to be broken out by an exception from within
one of the callbacks.
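
For the sake of discussion, here is a very small sketch of what such an
API could look like in Python on top of select() (names and shape are
made up, and it only handles the "readable" side):

    import select

    class EventLoop:
        def __init__(self):
            self.readers = {}              # channel -> callback

        def fileevent(self, channel, callback):
            # register `callback' to be called when `channel' is readable
            self.readers[channel] = callback

        def vwait(self):
            # loop forever; a callback breaks out by raising an exception
            while 1:
                r, w, x = select.select(self.readers.keys(), [], [])
                for channel in r:
                    self.readers[channel](channel)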

HTH,

-Alex



From mal at lemburg.com  Mon May 22 10:56:10 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 22 May 2000 10:56:10 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <Pine.LNX.4.10.10005210329160.420-100000@localhost> <39284CFF.5D9C9B13@lemburg.com> <004f01bfc37b$b1551480$34aab5d4@hagrid> <3928F437.D4DB3C25@lemburg.com>
Message-ID: <3928F62A.94980623@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal at lemburg.com> wrote:
> > > But then how can this have any effect on squishdot's indexing?
> >
> > The only possible reason I can see is that this squishdot
> > application uses 'print' to write the data -- perhaps
> > it pipes it through some other tool ?
> 
> but doesn't the patch only affects code that manages to call tp_print
> without the PRINT_RAW flag?  (that is, in "repr" mode rather than "str"
> mode)

Right.
 
> or to put it another way, if they manage to call tp_print without the
> PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

Looking at the code, the 'print' statement doesn't set
PRINT_RAW -- still the output is written literally to
stdout. Don't know where PRINT_RAW gets set... perhaps
they use PyFile_WriteObject() directly ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From pf at artcom-gmbh.de  Mon May 22 11:44:14 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 11:44:14 +0200 (MEST)
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
Message-ID: <m12tokw-000DieC@artcom0.artcom-gmbh.de>

[Guido]
> Every few months I receive patches that purport to make the tempfile
> module more secure.  I've never felt that it is a problem.  What is
> with these people?

[Tim]
> Doing a google search on
> 
>     tempfile security
> 
> turns up hundreds of rants.  Have fun <wink>.  There does appear to be a
> real vulnerability here somewhere (not necessarily Python), but the closest
> I found to a clear explanation in 10 minutes was an annoyed paragraph,
> saying that if I didn't already understand the problem I should turn in my
> Unix Security Expert badge immediately.  Unfortunately, Bill Gates never
> issued one of those to me.

On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a 
working example which exploits this vulnerability in older versions
of GCC.

The basic idea is indeed very simple:  Since the /tmp directory is
writable for any user, the bad guy can create a symbolic link in /tmp
pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
program will then overwrite this arbitrary file (where the programmer
really wanted to write something to his tempfile instead).  Since this
will happen with the access permissions of the process running this
program, this opens a bunch of vulnerabilities in many programs
writing something into temporary files with predictable file names.
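
The usual defence, translated to Python, is to create the file with
O_EXCL, so that an existing file or symlink makes the open fail instead
of being silently followed (a sketch only -- and note Ragnar's caveat
above that O_EXCL is not reliable over NFS):

    import os

    def create_private_tempfile(name):
        # the open fails (os.error) if `name' already exists, even as a
        # symlink; mode 0600 keeps the file private to the current user
        fd = os.open(name, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0600)
        return os.fdopen(fd, 'w')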

www.cert.org is another great place to look for security related info.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From claird at starbase.neosoft.com  Mon May 22 13:31:08 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 06:31:08 -0500 (CDT)
Subject: [Python-Dev] Re: Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <200005221131.GAA39671@starbase.neosoft.com>

	From alexandre.ferrieux at cnet.francetelecom.fr  Mon May 22 03:40:13 2000
			.
			.
			.
	> Alex, it's disappointing to me too!  There just isn't anything
	> currently in the library to do this, and I haven't written apps that
	> needs this often enough to have a good feel for what kind of
	> abstraction is needed.

	Thanks for the empathy. Apologies for my slight overreaction.

	> However perhaps we can come up with a design for something better?  Do
	> you have a suggestion here?

	Yup. One easy answer is 'just copy from Tcl'...

	Seriously, I'm really too new to Python to suggest the details or even
	the *style* of this 'level 2 API to multiplexing'. However, I can sketch
	the implementation since select() (from C or Tcl) is the one primitive I
	most depend on !

	Basically, as shortly mentioned before, the key problem is the
	heterogeneity of seemingly-selectable things in Windoze. On unix, not
	only does select() work with
	all descriptor types on which it makes sense, but also the fd used by
	Xlib is accessible; hence clean multiplexing even with a GUI package is
	trivial. Now to the real (rotten) meat, that is M$'s. Facts:

		1. 'Handle' types are not equal. Unnames pipes are (surprise!) not
	selectable. Why ? Ask a relative in Redmond...

		2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
	handles. They are selectable, but for example you can't use'em for
	redirections. Okay in our case we don't care. I only mention it cause
	its scary and could pop back into your face some time later.

		3. The GUI API doesn't expose a descriptor (handle), but fortunately
	(though disgustingly) there is a special syscall to wait on both "the
	message queue" and selectable handles: MsgWaitForMultipleObjects. So its
	doable, if not beautiful.

	The Tcl solution to (1.), which is the only real issue, is to have a
	separate thread blockingly read 1 byte from the pipe, and then post a
	message back to the main thread to awaken it (yes, ugly code to handle
	that extra byte and integrate it with the buffering scheme).

	In summary, why not peruse Tcl's hard-won experience on
	selecting-on-windoze-pipes ?

	Then, for the API exposed to the Python programmer, the Tclly exposed
	one is a starter:

		fileevent $channel readable|writable callback
		...
		vwait breaker_variable

	Explanation for non-Tclers: fileevent hooks the callback, vwait does a
	loop of select(). The callback(s) is(are) called without breaking the
	loop, unless $breaker_variable is set, at which time vwait returns.

	One note about 'breaker_variable': I'm not sure I like it. I'd prefer
	something based on exceptions. I don't quite understand why it's not
	already this way in Tcl (which has (kindof) first-class exceptions), but
	let's not repeat the mistake: let's suggest that (the equivalent of)
	vwait loops forever, only to be broken out by an exception from within
	one of the callbacks.
			.
			.
			.
I've copied everything Alex wrote, because he writes for
me, also.

As much as I welcome it, I can't answer Guido's question,
"What should the API look like?"  I've been mulling this
over, and concluded I don't have sufficiently deep
knowledge to be trustworthy on this.

Instead, I'll just give a bit of personal testimony.  I
made the rather coy c.l.p posting, in which I sincerely
asked, "How do you expert Pythoneers do it?" (my
paraphrase), without disclosing either that Alex and I have
been discussing this, or that the Tcl interface we both
know is simply a delight to me.

Here's the delight.  Guido asked, approximately, "What's
the point?  Do you need this for more than the keeping-
the-GUI-responsive-for-which-there's-already-a-notifier-
around case?"  The answer is, yes.  It's a good question,
though.  I'll repeat what Alex has said, with my own
emphasis:  Tcl gives a uniform command API for
* files (including I/O ports, ...)
* subprocesses
* TCP socket connections
and allows the same fcntl()-like configuration of them
all as to encodings, blocking, buffering, and character
translation.  As a programmer, I use this stuff
CONSTANTLY, and very happily.  It's not just for GUIs; 
several of my mission-critical delivered products have
Tcl-coded daemons to monitor hardware, manage customer
transactions, ...  It's simply wonderful to be able to
evolve a protocol from a socket connection to an fopen()
read to ...

Tcl is GREAT at "gluing".  Python can do it, but Tcl has
a couple of years of refinement in regard to portability
issues of managing subprocesses.  I really, *really*
miss this stuff when I work with a language other than
Tcl.

I don't often whine, "Language A isn't language B."  I'm
happy to let individual character come out.  This is,
for me, an exceptional case.  It's not that Python doesn't
do it the Tcl way; it's that the Tcl way is wonderful, and
moreover that Python doesn't feel to me to have much of an
alternative answer.  I conclude that there might be
something for Python to learn here.

A colleague has also written an even higher-level wrapper in
Tcl for asynchronous sockets.  I'll likely explain more
about it <URL:http://www-users.cs.umn.edu/~dejong/tcl/EasySocket.tar.gz>
in a follow-up.

Conclusion for now:  Alex and I like Python so much that we
want you guys to know that better piping-gluing-networking
truly is possible, and even worthwhile.  This is sort of
like the emigrants who've reported, "Yeah, here's the
stuff about CPAN that's cool, and how we can have it, too."
Through it all, we absolutely want Python to continue to be
Python.



From guido at python.org  Mon May 22 17:09:44 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 08:09:44 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 08:18:22 +0200."
             <m12tlXi-000CnvC@artcom0.artcom-gmbh.de> 
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de> 
Message-ID: <200005221509.IAA06955@cj20424-a.reston1.va.home.com>

> From: pf at artcom-gmbh.de (Peter Funk)
> 
> Guido van Rossum:
> [...]
> > The one objection could be that the locale may be obsolescent -- but
> > I've only heard /F vent an opinion about that; personally, I doubt
> > that we will be able to remove the locale any time soon, even if we
> > invent a better way.  
> 
> AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> 
> Although I understand Barry's and Ping's objections against a global state,
> it used to work very well:  On a typical single user Linux system the
> user chooses his locale during the first stages of system setup and
> never has to think about it again.  On multi user systems the locale
> of individual accounts may be customized using several environment
> variables, which can override the default locale of the system.
> 
> > Plus, I think that "better way" should address
> > this issue anyway.  If the locale eventually disappears, the feature
> > automatically disappears with it, because you *have* to make a
> > locale.setlocale() call before the behavior of repr() changes.
> 
> The last sentence is at least not the whole truth.
> 
> On POSIX systems there are several environment variables used to
> control the default locale settings for a user's session.  For example
> on my SuSE Linux system currently running in the German locale the
> environment variable LC_CTYPE=de_DE is automatically set by a file
> /etc/profile during login, which automatically causes the C-library
> function toupper('ä') to return an 'Ä' ---you should see
> a lower case a-umlaut as argument and an upper case umlaut as return
> value--- without all applications having to call 'setlocale' explicitly.
> 
> So this simply works as intended without having to add calls
> to 'setlocale' to every application program using these C-library functions.

I don't believe that.  According to the ANSI standard, a C program
*must* call setlocale(LC_..., "") if it wants the environment
variables to be honored; without this call, the locale is always the
"C" locale, which should *not* honor the environment variables.

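A quick way to see the difference from Python itself (assuming a Latin-1
locale such as de_DE is installed; the 8-bit string routines go through
the C library's toupper()/tolower(), so they follow the same rule):

    import locale, string
    print string.upper("\344")            # "C" locale: the a-umlaut is unchanged
    locale.setlocale(locale.LC_ALL, "")   # now honor $LANG / $LC_CTYPE
    print string.upper("\344")            # an upper case umlaut under de_DE
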
--Guido van Rossum (home page: http://www.python.org/~guido/)



From tismer at tismer.com  Mon May 22 14:40:51 2000
From: tismer at tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 14:40:51 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfc082$51ce0180$6c2d153f@tim>
Message-ID: <39292AD2.F5080E35@tismer.com>

Hi, I'm back from White Russia (yup, a survivor) :-)

Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > Then a string should better not be a sequence.
> >
> > The number of places where I really used the string sequence
> > protocol to take advantage of it is outperformed by a factor
> > of ten by cases where I failed to tupleise and got a bad
> > result. A traceback is better than a sequence here.
> 
> Alas, I think
> 
>     for ch in string:
>         muck w/ the character ch
> 
> is a common idiom.

Sure.
And now for my proposal:

Strings should be strings, but not sequences.
Slicing is ok, and it will always yield strings.
Indexing would either
a - not yield anything but an exception
b - yield integers instead of 1-char strings

The above idiom would read like this:

Version a: Access string elements via a coercion like tuple() or list():

    for ch in tuple(string):
        muck w/ the character ch

Version b: Access string elements as integer codes:

    for c in string:
        # either:
        ch = chr(c)
        muck w/ the character ch
        # or:
        muck w/ the character code c

> > oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris
> 
> The "sequenceness" of strings does get in the way often enough.  Strings
> have the amazing property that, since characters are also strings,
> 
>     while 1:
>         string = string[0]
> 
> never terminates with an error.  This often manifests as unbounded recursion
> in generic functions that crawl over nested sequences (the first time you
> code one of these, you try to stop the recursion on a "is it a sequence?"
> test, and then someone passes in something containing a string and it
> descends forever).  And we also have that
> 
>     format % values
> 
> requires "values" to be specifically a tuple rather than any old sequence,
> else the current
> 
>     "%s" % some_string
> 
> could be interpreted the wrong way.
> 
> There may be some hope in that the "for/in" protocol is now conflated with
> the __getitem__ protocol, so if Python grows a more general iteration
> protocol, perhaps we could back away from the sequenceness of strings
> without harming "for" iteration over the characters ...
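
(Purely as an illustration of the trap Tim mentions -- a toy flattener
has to special-case strings, or it descends forever because a character
is again a string; the function is hypothetical:)

    def flatten(seq, result=None):
        if result is None:
            result = []
        for item in seq:
            if type(item) == type(""):
                # without this test the recursion never bottoms out:
                # "a"[0] == "a", which is again a sequence...
                result.append(item)
                continue
            try:
                item[0:0]               # crude "is it a sequence?" probe
            except (TypeError, AttributeError):
                result.append(item)
            else:
                flatten(item, result)
        return result

    print flatten([1, "abc", (2, [3, "de"])])   # [1, 'abc', 2, 3, 'de']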

O-K!
We seem to have a similar conclusion: it would be better if strings
were not sequences, after all.  How to achieve this seems to be
kind of a problem, of course.

Oh, there is another idiom possible!
How about this, after we have the new string methods :-)

    for ch in string.split():
        muck w/ the character ch

Ok, in the long term, we need to rethink iteration of course.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From tismer at tismer.com  Mon May 22 14:55:21 2000
From: tismer at tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 14:55:21 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000201bfc082$50909f80$6c2d153f@tim>
Message-ID: <39292E38.A5A89270@tismer.com>


Tim Peters wrote:
> 
> [Christian Tismer]
> > ...
> > After all, it is no surprise. They are right.
> > If we have to change their mind in order to understand
> > a basic operation, then we are wrong, not they.
> 
> [Tim]
> > Huh!  I would not have guessed that you'd give up on Stackless
> > that easily <wink>.
> 
> [Chris]
> > Noh, I didn't give up Stackless, but fishing for soles.
> > After Just v. R. has become my most ambitious user,
> > I'm happy enough.
> 
> I suspect you missed the point:  Stackless is the *ultimate* exercise in
> "changing their mind in order to understand a basic operation".  I was
> tweaking you, just as you're tweaking me <smile!>.

Squeek! Peace on earth :-)

And you are almost right on Stackless.
Almost, since I know of at least three new Python users who came
to Python *because* it has Stackless + Continuations. This is a very
new aspect to me.
Things are getting interesting now: Today I got a request from CCP
regarding continuations: they will build a massively parallel
multiplayer game with that. http://www.ccp.cc/eve

> > It is absolutely phantastic.
> > The most uninteresting stuff in the join is the separator,
> > and it has the power to merge thousands of strings
> > together, without asking the sequence at all
> >  - give all power to the suppressed, long live the Python anarchy :-)
> 
> Exactly!  Just as love has the power to bind thousands of incompatible
> humans without asking them either:  a vote for space.join() is a vote for
> peace on earth.

hmmm - that's so nice...

So let's drop a generic join, and use string.love() instead.

> while-a-generic-join-builtin-is-a-vote-for-war<wink>-ly y'rs  - tim

join-is-a-peacemaker-like-a-Winchester-Cathedral-ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From claird at starbase.neosoft.com  Mon May 22 15:09:03 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 08:09:03 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <200005221309.IAA41866@starbase.neosoft.com>

	From guido at cj20424-a.reston1.va.home.com  Fri May 19 07:26:16 2000
			.
			.
			.
	Alex, it's disappointing to me too!  There just isn't anything
	currently in the library to do this, and I haven't written apps that
	need this often enough to have a good feel for what kind of
	abstraction is needed.

	However perhaps we can come up with a design for something better?  Do
	you have a suggestion here?
Review:  Alex and I have so far presented
the Tcl way.  We're still a bit off-balance
at the generosity of spirit that's
listening to us so respectfully.  Still ahead is
the hard work of designing an interface or
higher-level abstraction that's right for
Python.

The good thing, of course, is that this is
absolutely not a language issue at all.
Python is more than sufficiently expressive
for this matter.  All we're doing is working
to insert the right thing in the (a) library.

	I agree with your comment that higher-level abstractions around OS
	stuff are needed -- I learned system programming long ago, in C, and
	I'm "happy enough" with the current state of affairs, but I agree that
	for many people this is a problem, and there's no reason why Python
	couldn't do better...
I've got a whole list of "higher-level
abstractions around OS stuff" that I've been
collecting.  Maybe I'll make it fit for
others to see once we're through this affair
...

	--Guido van Rossum (home page: http://www.python.org/~guido/)




From guido at python.org  Mon May 22 18:16:08 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:16:08 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 09:20:50 +0200."
             <008001bfc3be$7e5eae40$34aab5d4@hagrid> 
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>  
            <008001bfc3be$7e5eae40$34aab5d4@hagrid> 
Message-ID: <200005221616.JAA07234@cj20424-a.reston1.va.home.com>

> From: "Fredrik Lundh" <effbot at telia.com>
>
> Peter Funk wrote:
> > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> 
> you're missing the point -- now that we've added unicode support to
> Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> platforms implement a wctype interface, it's not widely available, and it's
> not always unicode.

Huh?  We were talking strictly 8-bit strings here.  The locale support
hasn't changed there.

> so in order to provide platform-independent unicode support, Python 1.6
> comes with unicode-aware and fully portable replacements for the ctype
> functions.

For those who only need Latin-1 or another 8-bit ASCII superset, the
Unicode stuff is overkill.

> the code is already in there...
> 
> > On POSIX systems there are several environment variables used to
> > control the default locale settings for a user's session.  For example
> > on my SuSE Linux system currently running in the German locale the
> > environment variable LC_CTYPE=de_DE is automatically set by a file
> > /etc/profile during login, which automatically causes the C-library
> > function toupper('ä') to return an 'Ä' ---you should see
> > a lower case a-umlaut as argument and an upper case umlaut as return
> > value--- without all applications having to call 'setlocale' explicitly.
> >
> > So this simply works as intended without having to add calls
> > to 'setlocale' to every application program using these C-library functions.
> 
> note that this leaves us with four string flavours in 1.6:
> 
> - 8-bit binary arrays.  may contain binary goop, or text in some strange
>   encoding.  upper, strip, etc should not be used.

These are not strings.

> - 8-bit text strings using the system encoding.  upper, strip, etc works
>   as long as the locale is properly configured.
> 
> - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
>   system encoding is a subset of unicode -- which means US ASCII or
>   ISO Latin 1.

This is a figment of your imagination.  You can use 8-bit text strings
to contain Latin-1, but you have to set your locale to match.

> - wide unicode text strings.  upper, strip, etc always works.
> 
> is this complexity really worth it?

From a backwards compatibility point of view, yes.  Basically,
programs that don't use Unicode should see no change in semantics.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From pf at artcom-gmbh.de  Mon May 22 15:02:18 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 15:02:18 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005221509.IAA06955@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 22, 2000  8: 9:44 am"
Message-ID: <m12trqc-000DieC@artcom0.artcom-gmbh.de>

Hi!

[...]
[me]:
> > So this simply works as intended without having to add calls
> > to 'setlocale' to every application program using these C-library functions.

[Guido van Rossum]:
> I don't believe that.  According to the ANSI standard, a C program
> *must* call setlocale(LC_..., "") if it wants the environment
> variables to be honored; without this call, the locale is always the
> "C" locale, which should *not* honor the environment variables.

pf at pefunbk> python 
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string
>>> print string.upper("ä")
Ä
>>> 

This was the vanilla Python 1.5.2 as originally delivered by SuSE Linux.  
But yes, you are right. :-(  My memory was confused by this practical
experience.  Now I'd like to quote from the man pages here:

man toupper:
[...]
BUGS
       The details of what constitutes an uppercase or  lowercase
       letter  depend  on  the  current locale.  For example, the
       default "C" locale does not know about umlauts, so no
       conversion is done for them.

       In some non-English locales, there are lowercase letters
       with no corresponding  uppercase  equivalent;  the  German
       sharp s is one example.

man setlocale:
[...]
       A  program  may be made portable to all locales by calling
       setlocale(LC_ALL, "" ) after program   initialization,  by
       using  the  values  returned  from a localeconv() call for
       locale-dependent  information and by using  strcoll()  or
       strxfrm() to compare strings.
[...]
   CONFORMING TO
       ANSI C, POSIX.1

       Linux  (that  is,  libc) supports the portable locales "C"
       and "POSIX".  In the good old days there used to be
       support for the European Latin-1 "ISO-8859-1" locale (e.g. in
       libc-4.5.21 and  libc-4.6.27),  and  the  Russian  "KOI-8"
       (more  precisely,  "koi-8r") locale (e.g. in libc-4.6.27),
       so that having an environment variable LC_CTYPE=ISO-8859-1
       sufficed to make isprint() return the right answer.  These
       days non-English speaking Europeans have  to  work  a  bit
       harder, and must install actual locale files.
[...]

In recent Linux distributions almost every Linux C program seems to
contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy
to forget about it.  However, the core Python interpreter does not.
It seems the Linux C library is not fully ANSI compliant in this case:
it appears to honour the setting of $LANG regardless of whether a program
calls 'setlocale' or not.

Regards, Peter



From guido at python.org  Mon May 22 18:31:50 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:31:50 -0700
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: Your message of "Mon, 22 May 2000 10:25:21 +0200."
             <3928EEF1.693F@cnet.francetelecom.fr> 
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>  
            <3928EEF1.693F@cnet.francetelecom.fr> 
Message-ID: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>

> Yup. One easy answer is 'just copy from Tcl'...

Tcl seems to be your only frame of reference.  I think it's too early
to say that borrowing Tcl's design is right for Python.  Don't forget
that part of Tcl's design was guided by the desire for backwards
compatibility with Tcl's strong (stronger than Python I find!) Unix
background.

> Seriously, I'm really too new to Python to suggest the details or even
> the *style* of this 'level 2 API to multiplexing'. However, I can sketch
> the implementation since select() (from C or Tcl) is the one primitive I
> most depend on !
> 
> Basically, as shortly mentioned before, the key problem is the
> heterogeneity of seemingly-selectable things in Windoze. On unix, not
> only does select() work with
> all descriptor types on which it makes sense, but also the fd used by
> Xlib is accessible; hence clean multiplexing even with a GUI package is
> trivial. Now to the real (rotten) meat, that is M$'s. Facts:

Note that on Windows, select() is part of SOCKLIB, which explains why
it only understands sockets.  Native Windows code uses the
wait-for-event primitives that you are describing, and these are
powerful enough to wait on named pipes, sockets, and GUI events.
Complaining about the select interface on Windows isn't quite fair.

> 	1. 'Handle' types are not equal. Unnamed pipes are (surprise!) not
> selectable. Why ? Ask a relative in Redmond...

Can we cut the name-calling?

> 	2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
> handles. They are selectable, but for example you can't use'em for
> redirections. Okay in our case we don't care. I only mention it 'cause
> it's scary and could pop back into your face some time later.

Handles are a much more low-level concept than file descriptors.  Get
used to it.

> 	3. The GUI API doesn't expose a descriptor (handle), but fortunately
> (though disgustingly) there is a special syscall to wait on both "the
> message queue" and selectable handles: MsgWaitForMultipleObjects. So it's
> doable, if not beautiful.
> 
> The Tcl solution to (1.), which is the only real issue,

Why is (1) the only issue?  Maybe in Tcl-land...

> is to have a
> separate thread blockingly read 1 byte from the pipe, and then post a
> message back to the main thread to awaken it (yes, ugly code to handle
> that extra byte and integrate it with the buffering scheme).

Or the exposed API could deal with this in a different way.

> In summary, why not peruse Tcl's hard-won experience on
> selecting-on-windoze-pipes ?

Because it's designed for Tcl.

> Then, for the API exposed to the Python programmer, the Tclly exposed
> one is a starter:
> 
> 	fileevent $channel readable|writable callback
> 	...
> 	vwait breaker_variable
> 
> Explanation for non-Tclers: fileevent hooks the callback, vwait does a
> loop of select(). The callback(s) is(are) called without breaking the
> loop, unless $breaker_variable is set, at which time vwait returns.

Sorry, you've lost me here.  Fortunately there's more info at
http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
very complicated, and I'm not sure why you rejected my earlier
suggestion to use threads outright as "too complicated".  After
reading that man page, threads seem easy compared to the caution one
has to exert when using non-blocking I/O.

> One note about 'breaker_variable': I'm not sure I like it. I'd prefer
> something based on exceptions. I don't quite understand why it's not
> already this way in Tcl (which has (kindof) first-class exceptions), but
> let's not repeat the mistake: let's suggest that (the equivalent of)
> vwait loops forever, only to be broken out by an exception from within
> one of the callbacks.

Vwait seems to be part of the Tcl event model.  Maybe we would need to
think about an event model for Python?  On the other hand, Python is
at the mercy of the event model of whatever GUI package it is using --
which could be Tk, or wxWindows, or Gtk, or native Windows, or native
MacOS, or any of a number of other event models.

Perhaps this is an issue that each GUI package available to Python
will have to deal with separately...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Mon May 22 18:49:24 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:49:24 -0700
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: Your message of "Mon, 22 May 2000 14:40:51 +0200."
             <39292AD2.F5080E35@tismer.com> 
References: <000301bfc082$51ce0180$6c2d153f@tim>  
            <39292AD2.F5080E35@tismer.com> 
Message-ID: <200005221649.JAA07398@cj20424-a.reston1.va.home.com>

Christian, there was a smiley in your signature, so I can safely
ignore it, right?  It doesn't make sense at all to me to make "abc"[0]
return 97 instead of "a".

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Mon May 22 18:54:35 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:54:35 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 15:02:18 +0200."
             <m12trqc-000DieC@artcom0.artcom-gmbh.de> 
References: <m12trqc-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005221654.JAA07426@cj20424-a.reston1.va.home.com>

> pf at pefunbk> python 
> Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> import string
> >>> print string.upper("ä")
> Ä
> >>> 

This threw me off too.  However try this:

python -c 'print "ä".upper()'

It will print "ä".  A mystery?  No, the GNU readline library calls
setlocale().  It is wrong, but I can't help it.  But it only affects
interactive use of Python.

> In recent Linux distributions almost every Linux C program seems to
> contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy
> to forget about it.  However, the core Python interpreter does not.
> It seems the Linux C library is not fully ANSI compliant in this case:
> it appears to honour the setting of $LANG regardless of whether a program
> calls 'setlocale' or not.

No, the explanation is in GNU readline.

Compile this little program and see for yourself:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
	printf("toupper(%c) = %c\n", 'ä', toupper('ä'));	/* 'ä' is a lower case a-umlaut */
	return 0;
}

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tismer at tismer.com  Mon May 22 16:11:37 2000
From: tismer at tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 16:11:37 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON 
 FEATURE:))
References: <000301bfc082$51ce0180$6c2d153f@tim>  
	            <39292AD2.F5080E35@tismer.com> <200005221649.JAA07398@cj20424-a.reston1.va.home.com>
Message-ID: <39294019.3CB47800@tismer.com>


Guido van Rossum wrote:
> 
> Christian, there was a smiley in your signature, so I can safely
> ignore it, right?  It doesn't make sense at all to me to make "abc"[0]
> return 97 instead of "a".

There was a smiley, but mostly because I cannot
decide what I want. I'm quite convinced that strings should
better not be sequences, at least not sequences of strings.

"abc"[0:1] would be enough, "abc"[0] isn't worth the side effects,
as listed in Tim's posting.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From fdrake at acm.org  Mon May 22 16:12:54 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 07:12:54 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005220710520.13789-100000@mailhost.beopen.com>

On Fri, 19 May 2000, Guido van Rossum wrote:
 > Hm, that's bogus.  It works well under Windows -- with the restriction
 > that it only works for sockets, but for sockets it works as well as
 > on Unix.  It also works well on the Mac.  I wonder where that note
 > came from (it's probably 6 years old :-).

  Is that still in there?  If I could get a pointer from someone I'll be
able to track it down.  I didn't see it in the select or socket module
documents, and a quick grep didn't find 'really work'.
  It's definitely fixable if we can find it.  ;)


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From fdrake at acm.org  Mon May 22 16:21:48 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 07:21:48 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <PLEJJNOHDIGGLDPOGPJJGEIPCDAA.DavidA@ActiveState.com>
Message-ID: <Pine.LNX.4.10.10005220718250.13789-100000@mailhost.beopen.com>

On Fri, 19 May 2000, David Ascher wrote:
 > I'm pretty sure I know where it came from -- it came from Sam Rushing's
 > tutorial on how to use Medusa, which was more or less cut & pasted into the
 > doc, probably at the time that asyncore and asynchat were added to the
 > Python core.  IMO, it's not the best part of the Python doc -- it is much
 > too low-to-the ground, and assumes the reader already understands much about
 > I/O, sync/async issues, and cares mostly about high performance.  All of

  It's a fairly young section, and I haven't had as much time to review
and edit that or some of the other young sections.  I'll try to pay
particular attention to these as I work on the 1.6 release.

 > which are true of wonderful Sam, most of which are not true of the average
 > Python user.
 > 
 > While we're complaining about doc, asynchat is not documented, I believe.
 > Alas, I'm unable to find the time to write up said documentation.

  Should that situation change, I'll gladly accept a section on asynchat!
Or, if anyone else has time to contribute...??


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From skip at mojam.com  Mon May 22 16:25:00 2000
From: skip at mojam.com (Skip Montanaro)
Date: Mon, 22 May 2000 09:25:00 -0500 (CDT)
Subject: [Python-Dev] ANNOUNCE: Python CVS tree moved to SourceForge
In-Reply-To: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>
References: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>
Message-ID: <14633.17212.650090.540777@beluga.mojam.com>

    Guido> If you have an existing working tree that points to the
    Guido> cvs.python.org repository, you may want to retarget it to the
    Guido> SourceForge tree.  This can be done painlessly with Greg Ward's
    Guido> cvs_chroot script:

    Guido>   http://starship.python.net/~gward/python/

I tried this with (so far) no apparent success.  I ran cvs_chroot as

    cvs_chroot :pserver:anonymous at cvs.python.sourceforge.net:/cvsroot/python

It warned me about some directories that didn't match the top level
directory.  "No problem", I thought.  I figured they were for the nondist
portions of the tree.  When I tried a cvs update after logging in to the
SourceForge cvs server I got tons of messages that looked like:

    cvs update: move away dist/src/Tools/scripts/untabify.py; it is in the way
    C dist/src/Tools/scripts/untabify.py

It doesn't look like untabify.py has been hosed, but the warnings worry me.
Anyone else encounter this problem?  If so, what's its meaning?

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From alexandre.ferrieux at cnet.francetelecom.fr  Mon May 22 16:51:56 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Mon, 22 May 2000 16:51:56 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <Pine.GSO.4.10.10005180810180.14709-100000@sundial> <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>  
	            <3928EEF1.693F@cnet.francetelecom.fr> <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <3929498C.1941@cnet.francetelecom.fr>

Guido van Rossum wrote:
> 
> > Yup. One easy answer is 'just copy from Tcl'...
> 
> Tcl seems to be your only frame of reference.

Nope, but I'll welcome any proof of existence of similar abstractions
(for multiplexing) elsewhere.

>  I think it's too early
> to say that borrowing Tcl's design is right for Python.  Don't forget
> that part of Tcl's design was guided by the desire for backwards
> compatibility with Tcl's strong (stronger than Python I find!) Unix
> background.

I don't quite get how the 'unix background' comes into play here, since
[fileevent] is now implemented and works correctly on all platforms.

If you are talking about the API as seen from above, I don't understand
why 'hooking a callback' and 'multiplexing event sources' are a unix
specificity, and/or why it should be avoided outside unix.

> > Seriously, I'm really too new to Python to suggest the details or even
> > the *style* of this 'level 2 API to multiplexing'. However, I can sketch
> > the implementation since select() (from C or Tcl) is the one primitive I
> > most depend on !
> >
> > Basically, as shortly mentioned before, the key problem is the
> > heterogeneity of seemingly-selectable things in Windoze. On unix, not
> > only does select() work with
> > all descriptor types on which it makes sense, but also the fd used by
> > Xlib is accessible; hence clean multiplexing even with a GUI package is
> > trivial. Now to the real (rotten) meat, that is M$'s. Facts:
> 
> Note that on Windows, select() is part of SOCKLIB, which explains why
> it only understands sockets.  Native Windows code uses the
> wait-for-event primitives that you are describing, and these are
> powerful enough to wait on named pipes, sockets, and GUI events.
> Complaining about the select interface on Windows isn't quite fair.

Sorry, you missed the point. Here I used the term 'select()' as a 
generic one (I didn't want to pollute a general discussion with
OS-specific names...). On windows it means MsgWaitForMultipleObjects.

Now as you said "these are powerful enough to wait on named pipes,
sockets, and GUI events"; I won't deny the obvious truth. However,
again, they don't work on *unnamed pipes* (which are the only ones in
'95). That's my sole reason for complaining, and I'm afraid it is fair
;-)

> >       1. 'Handle' types are not equal. Unnamed pipes are (surprise!) not
> > selectable. Why ? Ask a relative in Redmond...
> 
> Can we cut the name-calling?

Yes we can :^P

> 
> >       2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true
> > handles. They are selectable, but for example you can't use'em for
> > redirections. Okay in our case we don't care. I only mention it 'cause
> > it's scary and could pop back into your face some time later.
> 
> Handles are a much more low-level concept than file descriptors.  Get
> used to it.

Take it easy, I meant to help. Low-level as they may be, can you explain
why *some* can be passed to CreateProcess as redirections, and *some*
can't?
Obviously there *is* some attempt to unify things in Windows (if only
the single name of 'handle'); and just as clearly it is not completely
successful.

> >       3. The GUI API doesn't expose a descriptor (handle), but fortunately
> > (though disgustingly) there is a special syscall to wait on both "the
> > message queue" and selectable handles: MsgWaitForMultipleObjects. So it's
> > doable, if not beautiful.
> >
> > The Tcl solution to (1.), which is the only real issue,
> 
> Why is (1) the only issue?

Because for (2) we don't care (no need for redirections in our case) and
for (3) the judgement is only aesthetic. 

>  Maybe in Tcl-land...

Come on, I'm emigrating from Tcl to Python with open palms, as Cameron
puts it.
I've already mentioned the outstanding beauty of Python's internal
design, and in comparison Tcl is absolutely awful. Even at the (script)
API level, some of the early choices in Tcl are disgusting (and some
recent ones too...). I'm really turning to Python with the greatest
pleasure - please don't interpret my arguments as yet another Lang1 vs.
Lang2 flamewar.

> > is to have a
> > separate thread blockingly read 1 byte from the pipe, and then post a
> > message back to the main thread to awaken it (yes, ugly code to handle
> > that extra byte and integrate it with the buffering scheme).
> 
> Or the exposed API could deal with this in a different way.

Please elaborate ?

> > In summary, why not peruse Tcl's hard-won experience on
> > selecting-on-windoze-pipes ?
> 
> Because it's designed for Tcl.

I said 'why not' as a positive suggestion.
I didn't expect you to actually say why not...

Moreover, I don't understand 'designed for Tcl'. What's specific to Tcl
in unifying descriptor types?

> > Then, for the API exposed to the Python programmer, the Tclly exposed
> > one is a starter:
> >
> >       fileevent $channel readable|writable callback
> >       ...
> >       vwait breaker_variable
> >
> > Explanation for non-Tclers: fileevent hooks the callback, vwait does a
> > loop of select(). The callback(s) is(are) called without breaking the
> > loop, unless $breaker_variable is set, at which time vwait returns.
> 
> Sorry, you've lost me here.  Fortunately there's more info at
> http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
> very complicated,

Ahem, self-destroying argument: "Fortunately ... very complicated".

While I agree the fileevent manpage is longer than it should be, I fail
to see what's complicated in the model of 'hooking a callback for a
given kind of event'.

> and I'm not sure why you rejected my earlier
> suggestion to use threads outright as "too complicated".

Not on the same level. You're complaining about the script-level API (or
its documentation, more precisely!). I dismissed the thread-based
*implementation* as overkill in terms of resource consumption (thread
context + switching + ITC) on platforms which can use select() (for anon
pipes on Windows, as already explained, the thread is unavoidable).

>  After
> reading that man page, threads seem easy compared to the caution one
> has to exert when using non-blocking I/O.

Oh, I get it. The problem is, *that* manpage unfortunately tries to
explain event-based and non-blocking I/O at the same time (presumably
because the average user will never follow the 'See Also' links). That's
a blatant pedagogic mistake. Let me try:

	fileevent <channel> readable|writable <script>

	Hooks <script> to be called back whenever the given <channel> becomes
readable|writable. 'Whenever' here means from within event processing
primitives (vwait, update).

	Example:

		# whenever a new line comes down the socket, display it.

		set s [socket $host $port]
		fileevent $s readable gotdata
		proc gotdata {} {global s;puts "New data: [gets $s]"}
		vwait forever

To answer a potential question about blockingness, yes, in the example
above the [gets] will block until a complete line is received. But
mentioning this fact in the manpage is uselessly misleading because the
fileevent mechanism obviously allows one to implement any kind of protocol,
line-based or not, terminator- or size-header-based or not. Uses with
blocking and nonblocking [read] and mixes thereof are immediate
consequences of this classification.
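
For comparison, a rough Python analogue of the example above, using
nothing but the select module (the host/port and the handler table are
placeholders, not a proposed API):

	import socket, select

	host, port = "localhost", 12345          # placeholders

	def gotdata(sock):
	    print "New data:", sock.recv(1024)

	s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
	s.connect((host, port))
	handlers = {s: gotdata}                  # fileevent $s readable gotdata
	while 1:                                 # vwait forever
	    readable, dummy, dummy = select.select(handlers.keys(), [], [])
	    for obj in readable:
	        handlers[obj](obj)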

Hope this helps.

> Vwait seems to be part of the Tcl event model.

Hardly. It's just the Tcl name for the primitive that (blockingly) calls
select() (generic term - see above).

>  Maybe we would need to think about an event model for Python?

With pleasure - please define 'model'. Do you mean callbacks vs.
explicit decoding of an event structure? Do you mean blocking select()
vs. something more asynchronous like threads or signals ?

> On the other hand, Python is
> at the mercy of the event model of whatever GUI package it is using --
> which could be Tk, or wxWindows, or Gtk, or native Windows, or native
> MacOS, or any of a number of other event models.

Why should Python be alone in being exposed to this diversity?  Don't
assume that Tk is the only option for Tcl.  The Tcl/C API even exposes
the proper hooks to integrate any new event source, like a GUI package.

Again, I'm not interested in Tcl vs. Python here (and anyway Python wins
!!!). I just want to extract what's truly orthogonal to specific design
choices. As it turns out, what you call 'the Tcl event model' can
happily be transported to any (imperative) lang.

I can even be more precise: a random GUI package can be used this way
iff the two following conditions hold:

	(a) Its queue can awaken a select()-like primitive.
	(b) Its queue can be Peek'ed (to check for buffered msgs
                                      before blocking again)

> Perhaps this is an issue that each GUI package available to Python
> will have to deal with separately...

The characterization is given just above. To me it looks generic enough
to build an abstraction upon it. It's been done for Tcl, and is utterly
independent from its design peculiarities. Now everything depends on
whether abstraction is sought or not...

-Alex



From pf at artcom-gmbh.de  Mon May 22 17:01:50 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 17:01:50 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005221654.JAA07426@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 22, 2000  9:54:35 am"
Message-ID: <m12ttiI-000DieC@artcom0.artcom-gmbh.de>

Hi, 

Guido van Rossum:
> > pf at pefunbk> python 
> > Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
> > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> > >>> import string
> > >>> print string.upper("ä")
> > Ä
> > >>> 
> 
> This threw me off too.  However try this:
> 
> python -c 'print "ä".upper()'

Yes, you are right.  :-(

Conclusion:  If the 'locale' module ever becomes deprecated,
then ...ummm...  we poor mortals will simply have to add a line
'import readline' to our Python programs.  Nifty... ;-)

Regards, Peter



From claird at starbase.neosoft.com  Mon May 22 17:19:21 2000
From: claird at starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 10:19:21 -0500 (CDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <200005221519.KAA45253@starbase.neosoft.com>

	From guido at cj20424-a.reston1.va.home.com  Mon May 22 08:45:58 2000
			.
			.
			.
	Tcl seems to be your only frame of reference.  I think it's too early
	to say that borrowing Tcl's design is right for Python.  Don't forget
	that part of Tcl's design was guided by the desire for backwards
	compatibility with Tcl's strong (stronger than Python I find!) Unix
	background.
Right.  We quite agree.  Both of us came
to this looking to learn in the first place
what *is* right for Python.
			.
		[various points]
			.
			.
	> Then, for the API exposed to the Python programmer, the Tclly exposed
	> one is a starter:
	> 
	> 	fileevent $channel readable|writable callback
	> 	...
	> 	vwait breaker_variable
	> 
	> Explanation for non-Tclers: fileevent hooks the callback, vwait does a
	> loop of select(). The callback(s) is(are) called without breaking the
	> loop, unless $breaker_variable is set, at which time vwait returns.

	Sorry, you've lost me here.  Fortunately there's more info at
	http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
	very complicated, and I'm not sure why you rejected my earlier
	suggestion to use threads outright as "too complicated".  After
	reading that man page, threads seem easy compared to the caution one
	has to exert when using non-blocking I/O.

	> One note about 'breaker_variable': I'm not sure I like it. I'd prefer
	> something based on exceptions. I don't quite understand why it's not
	> already this way in Tcl (which has (kindof) first-class exceptions), but
	> let's not repeat the mistake: let's suggest that (the equivalent of)
	> vwait loops forever, only to be broken out by an exception from within
	> one of the callbacks.

	Vwait seems to be part of the Tcl event model.  Maybe we would need to
	think about an event model for Python?  On the other hand, Python is
	at the mercy of the event model of whatever GUI package it is using --
	which could be Tk, or wxWindows, or Gtk, or native Windows, or native
	MacOS, or any of a number of other event models.

	Perhaps this is an issue that each GUI package available to Python
	will have to deal with separately...

	--Guido van Rossum (home page: http://www.python.org/~guido/)
There are a lot of issues here.  I've got clients
with emergencies that'll keep me busy all week,
and will be able to respond only sporadically.
For now, I want to emphasize that Alex and I both
respect Python as itself; it would simply be alien
to us to do the all-too-common trick of whining,
"Why can't it be like this other language I just
left?"

Tcl's event model has been more successful than
any of you probably realize.  You deserve to know
that.

Should Python have an event model?  I'm not
convinced.  I want to work with Python threading a
bit more.  It could be that it answers all the
needs Python has in this regard.  The documentation
Guido found "very complicated" above we think of
as ...--well, I want to conclude by saying I find
this discussion productive, and appreciate your
patience in entertaining it.  Daemon construction
is a lot of what I do, and, more broadly, I like to
think about useful OS service abstractions.  I'll
be back as soon as I have something to contribute.



From effbot at telia.com  Mon May 22 17:13:58 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 17:13:58 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12ttiI-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <00b801bfc400$5c0d8fe0$34aab5d4@hagrid>

Peter Funk wrote:
> Conclusion:  If the 'locale' module ever becomes deprecated,
> then ...ummm...  we poor mortals will simply have to add a line
> 'import readline' to our Python programs.  Nifty... ;-)

won't help if python is changed to use the *unicode*
ctype functions...

...but on the other hand, if you use unicode strings for
anything that is not plain ASCII, upper and friends will
do the right thing even if you forget to import readline.

</F>




From effbot at telia.com  Mon May 22 17:37:01 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Mon, 22 May 2000 17:37:01 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>             <008001bfc3be$7e5eae40$34aab5d4@hagrid>  <200005221616.JAA07234@cj20424-a.reston1.va.home.com>
Message-ID: <00e301bfc403$abac3940$34aab5d4@hagrid>

Guido van Rossum <guido at python.org> wrote:
> > Peter Funk wrote:
> > > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> > 
> > you're missing the point -- now that we've added unicode support to
> > Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> > platforms implement a wctype interface, it's not widely available, and it's
> > not always unicode.
> 
> Huh?  We were talking strictly 8-bit strings here.  The locale support
> hasn't changed there.

I meant that the locale support, even though it's part of POSIX, isn't
good enough for unicode support...

> > so in order to provide platform-independent unicode support, Python 1.6
> > comes with unicode-aware and fully portable replacements for the ctype
> > functions.
> 
> For those who only need Latin-1 or another 8-bit ASCII superset, the
> Unicode stuff is overkill.

why?

besides, overkill or not:

> > the code is already in there...

> > note that this leaves us with four string flavours in 1.6:
> > 
> > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> >   encoding.  upper, strip, etc should not be used.
> 
> These are not strings.

depends on who you're asking, of course:

>>> b = fetch_binary_goop()
>>> type(b)
<type 'string'>
>>> dir(b)
['capitalize', 'center', 'count', 'endswith', 'expandtabs', ...

> > - 8-bit text strings using the system encoding.  upper, strip, etc works
> >   as long as the locale is properly configured.
> > 
> > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> >   system encoding is a subset of unicode -- which means US ASCII or
> >   ISO Latin 1.
> 
> This is a figment of your imagination.  You can use 8-bit text strings
> to contain Latin-1, but you have to set your locale to match.

if that's a supported feature (instead of being deprecated in favour
of unicode), maybe we should base the default unicode/string
conversions on the locale too?

background:

until now, I've been convinced that the goal should be to have two
"string-like" types: binary arrays for binary goop (including encoded
text), and a Unicode-based string type for text.  afaik, that's the
solution used in Tcl and Perl, and it's also "conceptually compatible"
with things like Java, Windows NT, and XML (and everything else from
the web universe).

given that, it has been clear to me that anything that is not compatible
with this model should be removed as soon as possible (and deprecated
as soon as we understand why it won't fly under the new scheme).

but if backwards compatibility is more important than a minimalistic
design, maybe we need three different "string-like" types:

-- binary arrays (still implemented by the 8-bit string type in 1.6)

-- 8-bit old-style strings (using the "system encoding", as defined
   by the locale.  if the locale is not set, they're assumed to contain
   ASCII)

-- unicode strings (possibly using a "polymorphic" internal representation)

this also solves the default conversion problem: use the locale
environment variables to determine the default encoding, and call
sys.set_string_encoding from site.py (see my earlier post for details).
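
for what it's worth, the site.py hook could be as small as this (the
environment parsing is only a sketch, and the hasattr guard is there
because set_string_encoding only exists in the current CVS tree):

    import os, sys

    def guess_encoding():
        # derive a default encoding from the POSIX locale variables,
        # e.g. LANG=de_DE.ISO-8859-1 -> "ISO-8859-1"
        for name in ("LC_ALL", "LC_CTYPE", "LANG"):
            value = os.environ.get(name)
            if value and "." in value:
                return value.split(".")[1]
        return "ascii"

    if hasattr(sys, "set_string_encoding"):
        sys.set_string_encoding(guess_encoding())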

what have I missed this time?

</F>

PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

>>> sys
... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...

looks a little strange...




From gmcm at hypernet.com  Mon May 22 18:08:07 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Mon, 22 May 2000 12:08:07 -0400
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <1253110775-103205454@hypernet.com>

Alexandre Ferrieux wrote:

> The Tcl solution to (1.), which is the only real issue, is to
> have a separate thread blockingly read 1 byte from the pipe, and
> then post a message back to the main thread to awaken it (yes,
> ugly code to handle that extra byte and integrate it with the
> buffering scheme).

What's the actual mechanism here? A (dummy) socket so 
"select" works? The WSAEvent... stuff (to associate sockets 
with waitable events) and WaitForMultiple...? The 
WSAAsync... stuff (creates Windows msgs when socket stuff 
happens) with MsgWait...? Some other combination?

Is the mechanism different if it's a console app (vs GUI)?

I'd assume in a GUI, the fileevent-checker gets integrated with 
the message pump. In a console app, how does it get control?

 
> In summary, why not peruse Tcl's hard-won experience on
> selecting-on-windoze-pipes ?
> 
> Then, for the API exposed to the Python programmer, the Tclly
> exposed one is a starter:
> 
>  fileevent $channel readable|writable callback
>  ...
>  vwait breaker_variable
> 
> Explanation for non-Tclers: fileevent hooks the callback, vwait
> does a loop of select(). The callback(s) is(are) called without
> breaking the loop, unless $breaker_variable is set, at which time
> vwait returns.
> 
> One note about 'breaker_variable': I'm not sure I like it. I'd
> prefer something based on exceptions. I don't quite understand
> why it's not already this way in Tcl (which has (kindof)
> first-class exceptions), but let's not repeat the mistake: let's
> suggest that (the equivalent of) vwait loops forever, only to be
> broken out by an exception from within one of the callbacks.
> 
> HTH,
> 
> -Alex
> 



- Gordon



From ping at lfw.org  Mon May 22 18:29:54 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:29:54 -0700 (PDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
 multiplexing is too hard)
In-Reply-To: <200005221519.KAA45253@starbase.neosoft.com>
Message-ID: <Pine.LNX.4.10.10005220923360.461-100000@localhost>

On Mon, 22 May 2000, Cameron Laird wrote:
> 
> Tcl's event model has been more successful than
> any of you probably realize.  You deserve to know
> that.

Events are a very powerful concurrency model (arguably
more reliable because they are easier to understand
than threads).  My friend Mark Miller has designed a
language called E (http://www.erights.org/) that uses
an event model for all object messaging, and i would
be interested in exploring how we can apply those ideas
to improve Python.

> Should Python have an event model?  I'm not con-
> vinced.

Indeed.  This would be a huge core change, way too
large to be feasible.  But i do think it would be
excellent to simply provide more facilities for
helping people use whatever model they want, and,
given the toolkit, let people build great things.

What you described sounded like it could be implemented
fairly easily with some functions like

    register(handle, mode, callback)
        or file.register(mode, callback)

        Put 'callback' in a dictionary of files
        to be watched for mode 'mode'.

    mainloop(timeout)

        Repeat (forever or until 'timeout') a
        'select' on all the files that have been
        registered, and do calls to the callbacks
        that have been registered.

Presumably there would be some exception that a
callback could raise to quietly exit the 'select'
loop.
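
A minimal sketch of such a toolkit (the function names follow the
description above; the module-level tables and the ExitLoop exception
are just assumptions):

    import select, time

    _readers = {}                  # handle -> callback, fired when readable
    _writers = {}                  # handle -> callback, fired when writable

    class ExitLoop(Exception):
        pass                       # a callback raises this to leave mainloop()

    def register(handle, mode, callback):
        if mode == "r":
            _readers[handle] = callback
        else:
            _writers[handle] = callback

    def mainloop(timeout=None):
        # Repeat a select() over everything registered and fire the
        # callbacks, until 'timeout' seconds pass or ExitLoop is raised.
        deadline = timeout is not None and time.time() + timeout
        try:
            while 1:
                if timeout is None:
                    wait = None
                else:
                    wait = deadline - time.time()
                    if wait <= 0:
                        return
                r, w, dummy = select.select(_readers.keys(),
                                            _writers.keys(), [], wait)
                for handle in r:
                    _readers[handle](handle)
                for handle in w:
                    _writers[handle](handle)
        except ExitLoop:
            pass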

    1. How does Tcl handle exiting the loop?
       Is there a way for a callback to break
       out of the vwait?

    2. How do you unregister these callbacks in Tcl?

    


-- ?!ng




From ping at lfw.org  Mon May 22 18:23:23 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:23:23 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network
 statistics program)
In-Reply-To: <200005221309.IAA41866@starbase.neosoft.com>
Message-ID: <Pine.LNX.4.10.10005220917350.461-100000@localhost>

On Mon, 22 May 2000, Cameron Laird wrote:
> I've got a whole list of "higher-level
> abstractions around OS stuff" that I've been
> collecting.  Maybe I'll make it fit for
> others to see once we're through this affair

Absolutely!  I've thought about this too.  A nice "child process
management" module would be very convenient to have -- i've done
such stuff before -- though i don't know enough about Windows
semantics to make one that works on multiple platforms.  Some
sort of (hypothetical)

    delegate.spawn(function) - return a child object or id
    delegate.kill(id) - kill child

etc. could possibly free us from some of the system dependencies
of fork, signal, etc.

I currently have a module called "delegate" which can run a
function in a child process for you.  It uses pickle() to send
the return value of the function back to the parent (via an
unnamed pipe).  Again, Unix-specific -- but it would be very
cool if we could provide this functionality in a module.  My
module provides just two things, but it's already very useful:

    delegate.timeout(function, timeout) - run the 'function' in
        a child process; if the function doesn't finish in
        'timeout' seconds, kill it and raise an exception;
        otherwise, return the return value of the function

    delegate.parallelize(function, [work, work, work...]) -
        fork off many children (you can specify how many if
        you want) and set each one to work calling the 'function'
        with one of the 'work' items, queueing up work for
        each of the children until all the work gets done.
        Return the results in a dictionary mapping each 'work'
        item to its result.
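
A rough, Unix-only sketch of what the 'timeout' half could look like
under the hood -- fork a child, pickle the result back over an unnamed
pipe, poll with waitpid (small results assumed, and no propagation of
exceptions from the child):

    import os, time, signal, pickle

    class Timeout(Exception):
        pass

    def timeout(function, seconds, args=()):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                          # child: run it, pickle the result
            os.close(r)
            os.write(w, pickle.dumps(apply(function, args)))
            os._exit(0)
        os.close(w)                           # parent: poll for the child
        deadline = time.time() + seconds
        while 1:
            done, status = os.waitpid(pid, os.WNOHANG)
            if done:
                break
            if time.time() > deadline:
                os.kill(pid, signal.SIGKILL)  # too slow: kill it and complain
                os.waitpid(pid, 0)
                os.close(r)
                raise Timeout("child did not finish in %s seconds" % seconds)
            time.sleep(0.05)
        data = ""
        while 1:                              # drain the pipe up to EOF
            chunk = os.read(r, 4096)
            if not chunk:
                break
            data = data + chunk
        os.close(r)
        return pickle.loads(data)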


-- ?!ng




From ping at lfw.org  Mon May 22 18:17:01 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Mon, 22 May 2000 09:17:01 -0700 (PDT)
Subject: Some information about locale (was Re: [Python-Dev] repr vs.
 str and locales again)
In-Reply-To: <200005221616.JAA07234@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005220914000.461-100000@localhost>

On Mon, 22 May 2000, Guido van Rossum wrote:
> > note that this leaves us with four string flavours in 1.6:
> > 
> > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> >   encoding.  upper, strip, etc should not be used.
> 
> These are not strings.

Indeed -- but at the moment, we're letting people continue to
use strings this way, since they already do it.

> > - 8-bit text strings using the system encoding.  upper, strip, etc works
> >   as long as the locale is properly configured.
> > 
> > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> >   system encoding is a subset of unicode -- which means US ASCII or
> >   ISO Latin 1.
> 
> This is a figment of your imagination.  You can use 8-bit text strings
> to contain Latin-1, but you have to set your locale to match.

I would like it to be only the latter, as Fred, i, and others
have previously suggested, and as corresponds to your ASCII
proposal for treatment of 8-bit strings.

But doesn't the current locale-dependent behaviour of upper()
etc. mean that strings are getting interpreted in the first way?

> > is this complexity really worth it?
> 
> From a backwards compatibility point of view, yes.  Basically,
> programs that don't use Unicode should see no change in semantics.

I'm afraid i have to agree with this, because i don't see any
other option that lets us escape from any of these four ways
of using strings...


-- ?!ng




From fdrake at acm.org  Mon May 22 19:05:46 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 10:05:46 -0700 (PDT)
Subject: Some information about locale (was Re: [Python-Dev] repr vs.
 str and locales again)
In-Reply-To: <Pine.LNX.4.10.10005220914000.461-100000@localhost>
Message-ID: <Pine.LNX.4.10.10005221004220.14844-100000@mailhost.beopen.com>

On Mon, 22 May 2000, Ka-Ping Yee wrote:
 > I would like it to be only the latter, as Fred, i, and others

  Please refer to Fredrik as Fredrik or /F; I don't think anyone else
refers to him as "Fred", and I got really confused when I saw this!  ;)


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From pf at artcom-gmbh.de  Mon May 22 19:17:40 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 19:17:40 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <00e301bfc403$abac3940$34aab5d4@hagrid> from Fredrik Lundh at "May 22, 2000  5:37: 1 pm"
Message-ID: <m12tvpk-000DieC@artcom0.artcom-gmbh.de>

Hi!

Fredrik Lundh:
[...]
> > > so in order to provide platform-independent unicode support, Python 1.6
> > > comes with unicode-aware and fully portable replacements for the ctype
> > > functions.
> > 
> > For those who only need Latin-1 or another 8-bit ASCII superset, the
> > Unicode stuff is overkill.
> 
> why?

Going from 8 bit strings to 16 bit strings doubles the memory 
requirements, right?

As long as we only deal with English, Spanish, French, Swedish, Italian
and several other languages, 8 bit strings work out pretty well.  
Unicode will be neat if you can afford the additional space.
People using Python on small computers in western countries
probably don't want to double the size of their data structures
for no reasonable benefit.

> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> if that's a supported feature (instead of being deprecated in favour
> for unicode), maybe we should base the default unicode/string con-
> versions on the locale too?

Many locales effectively use Latin1 but for some other locales there
is a difference:

$ LANG="es_ES" python  # Español uses Latin-1, the same as "de_DE"
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string; print string.upper("???")
???

$ LANG="ru_RU" python  # This uses ISO 8859-5 
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string; print string.upper("???")
???

I don't know how many people in Russia, for example, already depend
on this behaviour.  I suggest it should stay as is.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From guido at python.org  Mon May 22 22:38:17 2000
From: guido at python.org (Guido van Rossum)
Date: Mon, 22 May 2000 15:38:17 -0500
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 09:17:01 MST."
             <Pine.LNX.4.10.10005220914000.461-100000@localhost> 
References: <Pine.LNX.4.10.10005220914000.461-100000@localhost> 
Message-ID: <200005222038.PAA01284@cj20424-a.reston1.va.home.com>

[Fredrik]
> > > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> > >   encoding.  upper, strip, etc should not be used.

[Guido]
> > These are not strings.

[Ping]
> Indeed -- but at the moment, we're letting people continue to
> use strings this way, since they already do it.

Oops, mistake.  I thought that Fredrik (not Fred! that's another
person in this context!) meant the array module, but upon re-reading
he didn't.

> > > - 8-bit text strings using the system encoding.  upper, strip, etc works
> > >   as long as the locale is properly configured.
> > > 
> > > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> > >   system encoding is a subset of unicode -- which means US ASCII or
> > >   ISO Latin 1.
> > 
> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> I would like it to be only the latter, as Fred, i, and others
Fredrik, right?
> have previously suggested, and as corresponds to your ASCII
> proposal for treatment of 8-bit strings.
> 
> But doesn't the current locale-dependent behaviour of upper()
> etc. mean that strings are getting interpreted in the first way?

That's what I meant to say -- 8-bit strings use the system encoding
guided by the locale.

> > > is this complexity really worth it?
> > 
> > From a backwards compatibility point of view, yes.  Basically,
> > programs that don't use Unicode should see no change in semantics.
> 
> I'm afraid i have to agree with this, because i don't see any
> other option that lets us escape from any of these four ways
> of using strings...

Which is why I find Fredrik's attitude unproductive.

And where's the SRE release?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Mon May 22 22:53:55 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 22 May 2000 22:53:55 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and 
 locales again)
References: <m12tlXi-000CnvC@artcom0.artcom-gmbh.de>             <008001bfc3be$7e5eae40$34aab5d4@hagrid>  <200005221616.JAA07234@cj20424-a.reston1.va.home.com> <00e301bfc403$abac3940$34aab5d4@hagrid>
Message-ID: <39299E63.CD996D7D@lemburg.com>

Fredrik Lundh wrote:
> 
> > > - 8-bit text strings using the system encoding.  upper, strip, etc works
> > >   as long as the locale is properly configured.
> > >
> > > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> > >   system encoding is a subset of unicode -- which means US ASCII or
> > >   ISO Latin 1.
> >
> > This is a figment of your imagination.  You can use 8-bit text strings
> > to contain Latin-1, but you have to set your locale to match.
> 
> if that's a supported feature (instead of being deprecated in favour
> for unicode), maybe we should base the default unicode/string con-
> versions on the locale too?

This was proposed by Guido some time ago... the discussion
ended with the problem of extracting the encoding definition
from the locale names. There are some ways to solve this
problem (static mappings, fancy LANG variables etc.), but
AFAIK, there is no widely used standard on this yet, so
in the end you're stuck with defining the encoding by hand...
e.g.
	setenv LANG de_DE:latin-1

Perhaps we should help out a little and provide Python with
a parser for the LANG variable with some added magic
to provide useful defaults ?!
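
Just to make that concrete -- a rough sketch of such a helper (the
name and the "magic" are made up, and real locale strings vary a lot
more than this):

    import os

    def guess_locale_encoding(default="ascii"):
        # Look at the usual POSIX locale variables and try to pull an
        # encoding out of values like "de_DE.ISO8859-1" or "de_DE:latin-1".
        for name in ("LC_ALL", "LC_CTYPE", "LANG"):
            value = os.environ.get(name)
            if not value or value in ("C", "POSIX"):
                continue
            for sep in (".", ":"):
                if sep in value:
                    return value.split(sep, 1)[1] or default
            return default     # a locale is set, but no explicit encoding
        return default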

> [...]
> 
> this also solves the default conversion problem: use the locale environ-
> ment variables to determine the default encoding, and call
> sys.set_string_encoding from site.py (see my earlier post for details).

Right, that would indeed open up a path for consent...

> </F>
> 
> PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

Perhaps... these were really only added as an experimental feature
to test the various possibilities (and a possible implementation).

My original intention was to remove these after final consent
-- perhaps we should keep the functionality (expanded
to a per thread setting; the global is a temporary hack) ?!
 
> >>> sys
> ... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...
> 
> looks a little strange...

True; see above for the reason why ;-)

PS: What do you think about the current internal design of
sys.set_string_encoding() ? Note that hash() and the "st"
parser markers still use UTF-8.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From tim_one at email.msn.com  Tue May 23 04:21:00 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Mon, 22 May 2000 22:21:00 -0400
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: <m12tokw-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <000501bfc45d$893ad4c0$9ea2143f@tim>

[Peter Funk]
> On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a
> working example which exploits this vulnerability in older versions
> of GCC.
>
> The basic idea is indeed very simple:  Since the /tmp directory is
> writable for any user, the bad guy can create a symbolic link in /tmp
> pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
> program will than overwrite this arbitrary file (where the programmer
> really wanted to write something to his tempfile instead).  Since this
> will happen with the access permissions of the process running this
> program, this opens a bunch of vulnerabilities in many programs
> writing something into temporary files with predictable file names.

I can understand all that, but does it have anything to do with Python's
tempfile module?  gcc wasn't fixed by changing glibc, right?  Playing games
with the file *names* doesn't appear to me to solve anything; the few posts
I bumped into where that was somehow viewed as a Good Thing were about
Solaris systems, where Sun kept the source for generating the "new,
improved, messy" names secret.  In Python, any attacker can read the code
for anything we do, which makes it much clearer that a name-game approach
is half-assed.

and-people-whine-about-worming-around-bad-decisions-in-
    windows<wink>-ly y'rs  - tim





From tim_one at email.msn.com  Tue May 23 07:15:46 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 23 May 2000 01:15:46 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <39294019.3CB47800@tismer.com>
Message-ID: <000401bfc475$f3a110a0$612d153f@tim>

[Christian Tismer]
> There was a smiley, but for the most part, since I cannot
> decide what I want. I'm quite convinced that strings should
> better not be sequences, at least not sequences of strings.
>
> "abc"[0:1] would be enough, "abc"[0] isn't worth the side effects,
> as listed in Tim's posting.

Oh, it's worth a lot more than those!  As Ping testified, the gotchas I
listed really don't catch many people, while string[index] is about as
common as integer+1.

The need for tuples specifically in "format % values" can be wormed around
by special-casing the snot out of a string in the "values" position.

The non-termination of repeated "string = string[0]" *could* be stopped by
introducing a distinct character type.  Trying to formalize the current type
of a string is messy ("string = sequence of string" is a bit paradoxical
<wink>).  The notion that a string is a sequence of characters instead is
vanilla and wholly natural.  OTOH, drawing that distinction at the type
level may well be more trouble in practice than it buys in theory!
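
Concretely, that's the loop in question -- indexing a string never
bottoms out in a non-sequence:

    >>> s = "abc"
    >>> s = s[0]     # now "a", a string of length 1
    >>> s = s[0]     # still "a" -- and so on, forever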

So I don't know what I want either -- but I don't want *much* <wink>.

first-do-no-harm-ly y'rs  - tim





From moshez at math.huji.ac.il  Tue May 23 07:27:12 2000
From: moshez at math.huji.ac.il (Moshe Zadka)
Date: Tue, 23 May 2000 08:27:12 +0300 (IDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
 multiplexing is too hard)
In-Reply-To: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.GSO.4.10.10005230824130.12103-100000@sundial>

On Mon, 22 May 2000, Guido van Rossum wrote:

> Can we cut the name-calling?

Hey, what's life without a MS bashing now and then <wink>?

> Vwait seems to be part of the Tcl event model.  Maybe we would need to
> think about an event model for Python?  On the other hand, Python is
> at the mercy of the event model of whatever GUI package it is using --
> which could be Tk, or wxWindows, or Gtk, or native Windows, or native
> MacOS, or any of a number of other event models.

But that's sort of the point: Python needs a non-GUI event model, to 
use with daemons which need to handle many files.  Every GUI package
has its own event model, and Python would have one event model
that's not tied to a GUI package.

that-only-proves-we-have-a-problem-ly y'rs, Z.
--
Moshe Zadka <moshez at math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com




From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 09:16:50 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:16:50 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com>
Message-ID: <392A3062.E1@cnet.francetelecom.fr>

Gordon McMillan wrote:
> 
> Alexandre Ferrieux wrote:
> 
> > The Tcl solution to (1.), which is the only real issue, is to
> > have a separate thread blockingly read 1 byte from the pipe, and
> > then post a message back to the main thread to awaken it (yes,
> > ugly code to handle that extra byte and integrate it with the
> > buffering scheme).
> 
> What's the actual mechanism here? A (dummy) socket so
> "select" works? The WSAEvent... stuff (to associate sockets
> with waitable events) and WaitForMultiple...? The
> WSAAsync... stuff (creates Windows msgs when socket stuff
> happens) with MsgWait...? Some other combination?

Other. Forget about sockets here, we're talking about true anonymous
pipes, under 95 and NT. Since they are not waitable nor peekable,
the only remaining option is to read in blocking mode from a dedicated
thread. Then of course, this thread reports back to the main
MsgWaitForMultiple with PostThreadMessage.

> Is the mechanism different if it's a console app (vs GUI)?

No. Why should it ?

> I'd assume in a GUI, the fileevent-checker gets integrated with
> the message pump.

The converse: MsgWaitForMultiple integrates the thread's message queue
which is a superset of the GUI's event stream.

-Alex



From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 09:36:35 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:36:35 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python  multiplexing is too hard)
References: <Pine.LNX.4.10.10005220923360.461-100000@localhost>
Message-ID: <392A3503.7C72@cnet.francetelecom.fr>

Ka-Ping Yee wrote:
> 
> > Should Python have an event model?  I'm not con-
> > vinced.
> 
> Indeed.  This would be a huge core change, way too
> large to be feasible. 

Warning here. What would indeed need a huge core change
is a pervasive use of events like in E. 'Having an event model'
is often interpreted in a less extreme way, simply meaning
'having the proper set of primitives at hand'.
Our discussion (and your comments below too, agreed !) was focussed
on the latter, so we're only talking about a pure library issue.
Asking any change in the Python lang itself for such a peripheral need
never even remotely crossed my mind !

> But i do think it would be
> excellent to simply provide more facilities for
> helping people use whatever model they want; given
> such a toolkit, people will build great things.

Right.

> What you described sounded like it could be implemented
> fairly easily with some functions like
> 
>     register(handle, mode, callback)
>         or file.register(mode, callback)
> 
>         Put 'callback' in a dictionary of files
>         to be watched for mode 'mode'.
> 
>     mainloop(timeout)
> 
>         Repeat (forever or until 'timeout') a
>         'select' on all the files that have been
>         registered, and do calls to the callbacks
>         that have been registered.
> 
> Presumably there would be some exception that a
> callback could raise to quietly exit the 'select'
> loop.

Great !!! That's exactly the kind of Pythonic translation I was
expecting. Thanks !

>     1. How does Tcl handle exiting the loop?
>        Is there a way for a callback to break
>        out of the vwait?

Yes, as explained before, in Tcl the loop-breaker is a write sentinel
on a variable. When a callback wants to break out, it simply sets the
var. But as also mentioned before, I'd prefer an exception-based
mechanism as you summarized.

>     2. How do you unregister these callbacks in Tcl?

We just register an empty string as the callback name (script).
But this is just a random API choice. Anything more Pythonic is welcome
(an explicit unregister function is okay for me).

-Alex



From nhodgson at bigpond.net.au  Tue May 23 09:47:14 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 23 May 2000 17:47:14 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <047c01bfc48b$1d2addb0$e3cb8490@neil>

> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,
> the only remaining option is to read in blocking mode from a dedicated
> thread. ...

   Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.

   Neil





From effbot at telia.com  Tue May 23 09:50:56 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 09:50:56 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <00d901bfc48b$a40432a0$f2a6b5d4@hagrid>

Alexandre Ferrieux wrote:
> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,

I thought PeekNamedPipe worked just fine on anonymous pipes.

or are "true anonymous pipes" not the same thing as anonymous
pipes created by CreatePipe?

</F>




From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 09:51:07 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 09:51:07 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr> <047c01bfc48b$1d2addb0$e3cb8490@neil>
Message-ID: <392A386B.4112@cnet.francetelecom.fr>

Neil Hodgson wrote:
> 
> > Other. Forget about sockets here, we're talking about true anonymous
> > pipes, under 95 and NT. Since they are not waitable nor peekable,
> > the only remaining option is to read in blocking mode from a dedicated
> > thread. ...
> 
>    Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.

Hmmm... You're right, it's documented as such. But I seem to recall we
encountered a problem when actually using it. I'll check with Gordon
Chaffee (Cc of this msg).

-Alex



From mhammond at skippinet.com.au  Tue May 23 09:57:18 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 17:57:18 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A3062.E1@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBMEFOCLAA.mhammond@skippinet.com.au>

> Other. Forget about sockets here, we're talking about true anonymous
> pipes, under 95 and NT. Since they are not waitable nor peekable,
> the only remaining option is to read in blocking mode from a dedicated
> thread. Then of course, this thread reports back to the main
> MsgWaitForMultiple with PostThreadMessage.

Or maybe just with SetEvent(), as the main thread may just be using
WaitForMultipleObjects() - it really depends on whether the app has a
message loop or not.

>
> > Is the mechanism different if it's a console app (vs GUI)?
>
> No. Why should it ?

Because it generally wont have a message loop.  This is also commonly true
for NT services - they only wait on settable objects and if they dont
create a window generally dont need a message loop.   However, it is
precisely these apps that the proposal offers the most benefits to.

> > I'd assume in a GUI, the fileevent-checker gets integrated with
> > the message pump.
>
> The converse: MsgWaitForMultiple integrates the thread's message queue
> which is a superset of the GUI's event stream.

But what happens when we dont own the message loop?  Eg, IDLE is based on
Tk, Pythonwin on MFC, wxPython on wxWindows, and so on.  Generally, the
primary message loops are coded in C/C++, and wont provide this level of
customization.

Ironically, Tk seems to be one of the worst for this.  For example, Guido
and I recently(ish) both added threading support to our respective IDEs.
MFC was quite simple to do, as it used a "standard" windows message loop.

From nhodgson at bigpond.net.au  Tue May 23 10:04:03 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Tue, 23 May 2000 18:04:03 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <1253110775-103205454@hypernet.com> <392A3062.E1@cnet.francetelecom.fr> <047c01bfc48b$1d2addb0$e3cb8490@neil> <392A386B.4112@cnet.francetelecom.fr>
Message-ID: <04b001bfc48d$76dbeaa0$e3cb8490@neil>

> >    Anonymous pipes are peekable on both 95 and NT with PeekNamedPipe.
>
> Hmmm... You're right, it's documented as such. But I seem to recall we
> encountered a problem when actually using it. I'll check with Gordon
> Chaffee (Cc of this msg).

   I can vouch that this does work on 95, NT and W2K as I have been using it
in my SciTE editor for the past year as the means for gathering output from
running tool programs. There was a fiddle required to ensure all output was
retrieved on 95 but it works well with that implemented.
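
   For the record, the peek-then-read pattern via the win32pipe and
win32file extensions looks roughly like this (poll_pipe is just an
illustrative name):

    import win32pipe, win32file

    def poll_pipe(handle):
        # Non-blocking check of an anonymous pipe handle (e.g. the read
        # end of a CreatePipe() attached to a child's stdout): peek
        # first, and read only if something is already buffered.
        data, avail, msg_left = win32pipe.PeekNamedPipe(handle, 0)
        if avail == 0:
            return None                  # nothing there yet
        rc, data = win32file.ReadFile(handle, avail)
        return data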

   Neil





From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 10:11:53 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 10:11:53 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBMEFOCLAA.mhammond@skippinet.com.au>
Message-ID: <392A3D49.5129@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> > Other. Forget about sockets here, we're talking about true anonymous
> > pipes, under 95 and NT. Since they are not waitable nor peekable,
> > the only remaining option is to read in blocking mode from a dedicated
> > thread. Then of course, this thread reports back to the main
> > MsgWaitForMultiple with PostThreadMessage.
> 
> Or maybe just with SetEvent(), as the main thread may just be using
> WaitForMultipleObjects() - it really depends on whether the app has a
> message loop or not.

Yes but why emphasize the differences when you can instead wipe them out
by using MsgWaitForMultiple which integrates all sources ? Even if there's
no message stream, it's fine !

> > > Is the mechanism different if it's a console app (vs GUI)?
> >
> > No. Why should it ?
> 
> Because it generally wont have a message loop.  This is also commonly true
> for NT services - they only wait on settable objects and if they dont
> create a window generally dont need a message loop.   However, it is
> precisely these apps that the proposal offers the most benefits to.

Yes, but see above: how would it hurt them to call MsgWait* instead of
Wait* ?

> > > I'd assume in a GUI, the fileevent-checker gets integrated with
> > > the message pump.
> >
> > The converse: MsgWaitForMultiple integrates the thread's message queue
> > which is a superset of the GUI's event stream.
> 
> But what happens when we dont own the message loop?  Eg, IDLE is based on
> Tk, Pythonwin on MFC, wxPython on wxWindows, and so on.  Generally, the
> primary message loops are coded in C/C++, and wont provide this level of
> customization.

Can you be more precise ? Which one(s) do(es)/n't fulfill the two
conditions mentioned earlier ? I do agree with the fact that the primary
msg loop of a random GUI package is a black box, however it must use one
of the IPC mechanisms provided by the OS. Unifying them is not uniformly
trivial (that's the point of this discussion), but since even on Windows
it is doable (MsgWait*), I fail to see by what magic a GUI package could
bypass its supervision.

> Ironically, Tk seems to be one of the worst for this.

Possibly. Personally I don't like Tk very much, at least from an
implementation standpoint. But precisely, the fact that the model
described so far can accommodate *even* Tk is a proof of generality !

> and I recently(ish) both added threading support to our respective IDEs.
> MFC was quite simple to do, as it used a "standard" windows message loop.
> From all accounts, Guido had quite a difficult time due to some of the
> assumptions made in the message loop.  The other anecdote I have relates to
> debugging.  The Pythonwin debugger is able to live happily under most other
> GUI applications - eg, those written in VB, Delphi, etc.  Pythonwin creates
> a new "standard" message loop under these apps, and generally things work
> well.  However, Tkinter based apps remain un-debuggable using Pythonwin due
> to the assumptions made by the message loop.  This is probably my most
> oft-requested feature addition!!

As you said, all this is due to the assumptions made in Tk. Clearly a
mistake not to repeat, and also orthogonal to the issue of unifying IPC
mechanisms and the API to their multiplexing.

-Alex



From mhammond at skippinet.com.au  Tue May 23 10:24:39 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 18:24:39 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A3D49.5129@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEGACLAA.mhammond@skippinet.com.au>

> Yes but why emphasize the differences when you can instead wipe them out
> by using MsgWaitForMultiple which integrates all sources ? Even if there's
> no message stream, it's fine !

Agreed - as I said, it is with these apps that I think it has the most
chance of success.


> Can you be more precise ? Which one(s) do(es)/n't fulfill the two
> conditions mentioned earlier ? I do agree with the fact that the primary
> msg loop of a random GUI package is a black box, however it must use one
> of the IPC mechanisms provided by the OS. Unifying them is not uniformly
> trivial (that's the point of this discussion), but since even on Windows
> it is doable (MsgWait*), I fail to see by what magic a GUI package could
> bypass its supervision.

The only way I could see this working would be to use real, actual Windows
messages on Windows.  Python would need to nominate a special message that
it knows will not conflict with any GUI environments Python may need to run
in.

Each GUI package maintainer would then need to add some special logic in
their message hooking code.  When their black-box message loop delivers
this special message, the framework would need to enter the Python
"event-loop", where it does its stuff - until a new message arrives. It
would need to return, unwind back to the original message pump where it
will be processed as normal, and the entire process repeats.  The process
of waking other objects needn't be GUI toolkit dependent - as you said, it
need only place the well-known message in the thread's message queue using
PostThreadMessage().

Unless Im missing something?

Mark.




From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 10:38:06 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 10:38:06 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBGEGACLAA.mhammond@skippinet.com.au>
Message-ID: <392A436E.4027@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> > Can you be more precise ? Which one(s) do(es)/n't fulfill the two
> > conditions mentioned earlier ? I do agree with the fact that the primary
> > msg loop of a random GUI package is a black box, however it must use one
> > of the IPC mechanisms provided by the OS. Unifying them is not uniformly
> > trivial (that's the point of this discussion), but since even on Windows
> > it is doable (MsgWait*), I fail to see by what magic a GUI package could
> > bypass its supervision.
> 
> The only way I could see this working would be to use real, actual Windows
> messages on Windows.  Python would need to nominate a special message that
> it knows will not conflict with any GUI environments Python may need to run
> in.

Why use a special message ? MsgWait* does multiplex true Windows Message
*and* other IPC mechanisms. So if a package uses messages, it will
awaken MsgWait* by its 'message queue' side, while if the package uses a
socket or a pipe, it will awaken it by its 'waitable handle' side
(provided, of course, that you can get your hands on that handle and
pass it in the list of objects to wait for...).
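
Roughly, in Python terms (assuming the win32event/win32gui extensions;
handle_ready() is a hypothetical callback, and this glosses over a lot
of real-world detail):

    import win32event, win32gui

    def wait_loop(handles):
        # One wait for both waitable handles and the thread's message
        # queue, so window messages and other IPC share a single loop.
        while 1:
            rc = win32event.MsgWaitForMultipleObjects(
                handles, 0, win32event.INFINITE, win32event.QS_ALLEVENTS)
            if rc == win32event.WAIT_OBJECT_0 + len(handles):
                # A window message arrived: let the GUI code pump it.
                win32gui.PumpWaitingMessages()
            elif win32event.WAIT_OBJECT_0 <= rc < win32event.WAIT_OBJECT_0 + len(handles):
                # One of our handles was signalled.
                handle_ready(handles[rc - win32event.WAIT_OBJECT_0])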

> Each GUI package maintainer would then need to add some special logic in
> their message hooking code.  When their black-box message loop delivers
> this special message, the framework would need to enter the Python
> "event-loop", where it does its stuff - until a new message arrives.

The key is that there wouldn't be two separate Python/GUI evloops.
That's the reason for the (a) condition: be able to awaken a
multiplexing syscall.

> Unless Im missing something?

I believe the next thing to do is to enumerate which GUI packages
fulfill the following conditions ((a) updated to (a') to reflect the
first paragraph of this msg):

	(a') Its internal event source is either the vanilla Windows Message
queue, or an IPC channel which can be exposed to the outer framework
(for enlisting in a select()-like call), like the socket of an X
connection.

	(b) Its queue can be Peek'ed (to check for buffered msgs before
blocking again)

HTH,

-Alex



From pf at artcom-gmbh.de  Tue May 23 10:39:11 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 10:39:11 +0200 (MEST)
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: <000501bfc45d$893ad4c0$9ea2143f@tim> from Tim Peters at "May 22, 2000 10:21: 0 pm"
Message-ID: <m12uADY-000DieC@artcom0.artcom-gmbh.de>

> [Peter Funk(me)]
> > On <http://www.insecure.org/sploits/gcc.tmpfiles.html> you can find a
> > working example which exploits this vulnerability in older versions
> > of GCC.
> >
> > The basic idea is indeed very simple:  Since the /tmp directory is
> > writable for any user, the bad guy can create a symbolic link in /tmp
> > pointing to some arbitrary file (e.g. to /etc/passwd).  The attacked
> > program will than overwrite this arbitrary file (where the programmer
> > really wanted to write something to his tempfile instead).  Since this
> > will happen with the access permissions of the process running this
> > program, this opens a bunch of vulnerabilities in many programs
> > writing something into temporary files with predictable file names.
 
[Tim Peters]:
> I can understand all that, but does it have anything to do with Python's
> tempfile module?  gcc wasn't fixed by changing glibc, right?  

Okay.  But people seem to have the opinion that "application programmers"
are dumb and "system programmers" are clever and smart. ;-)  So they seem
to think that the library should solve possible security issues.
I don't share this opinion, but if some problem can be solved once
and for all in a library, this is better than having to solve it over and
over again in each application.

Concerning 'tempfile' this would either involve changing (or extending)
the interface (IMO a better approach to this class of problems) or, if the
goal is to solve this for existing applications already using 'tempfile',
playing games with the filenames returned from 'mktemp()'.  This would
require making them truly random... which AFAIK can't be achieved with
traditional coding techniques and would require access to a secure white
noise generator.  But maybe I'm wrong.

> Playing games
> with the file *names* doesn't appear to me to solve anything; the few posts
> I bumped into where that was somehow viewed as a Good Thing were about
> Solaris systems, where Sun kept the source for generating the "new,
> improved, messy" names secret.  In Python, any attacker can read the code
> for anything we do, which makes it much clearer that a name-game approach
> is half-assed.

I agree.  But I think we should at least extend the documentation
of 'tempfile' (Fred?) to guide people not to write Python code like
	mytemp = open(tempfile.mktemp(), "w")
in programs that are intended to be used on Unix systems by arbitrary
users (possibly 'root').  Even better:  Someone with enough spare time
should add a new function 'mktempfile()', which creates a temporary
file, takes care of the security issue and then returns the file
handle.  This implementation must take care of race conditions using
'os.open' with the following flags:

       O_CREAT If the file does not exist it will be created.
       O_EXCL  When used with O_CREAT, if the file already exists
	       it is an error and the open will fail.
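
A rough sketch of such a 'mktempfile()' (purely illustrative; the name
generation is deliberately simple-minded, since it is O_EXCL that
closes the race, not the randomness):

    import os, errno, random, tempfile

    def mktempfile(suffix="", mode="w+b"):
        # O_EXCL makes the open fail instead of following a planted
        # symlink or reusing an existing file; loop until a fresh
        # name works.  Mode 0600 keeps other users out of the file.
        flags = os.O_RDWR | os.O_CREAT | os.O_EXCL
        while True:
            name = os.path.join(tempfile.gettempdir(),
                                "tmp%08x%s" % (random.getrandbits(32), suffix))
            try:
                fd = os.open(name, flags, 0o600)
            except OSError as e:
                if e.errno == errno.EEXIST:
                    continue             # name already taken, try again
                raise
            return os.fdopen(fd, mode)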

> and-people-whine-about-worming-around-bad-decisions-in-
>     windows<wink>-ly y'rs  - tim

I don't whine.  But currently I've more problems with my GUI app using
Tkinter&Pmw on the Mac <wink>.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60
Whoever thinks himself too important for small tasks
is usually too small for important tasks.     --      Jacques Tati



From mhammond at skippinet.com.au  Tue May 23 11:00:54 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 23 May 2000 19:00:54 +1000
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <392A436E.4027@cnet.francetelecom.fr>
Message-ID: <ECEPKNMJLHAPFFJHDOJBOEGBCLAA.mhammond@skippinet.com.au>

> Why use a special message ? MsgWait* does multiplex true Windows Message
> *and* other IPC mechanisms.

But the point was that Python programs need to live inside other GUI
environments, and that these GUI environments provide their own message
loops that we must consider a black box.

So, we can not change the existing message loop to use MsgWait*().  We can
not replace their message loop with one of our own that _does_ do this, as
their message loop is likely to have its own special requirements (eg,
MFC's has idle-time processing, etc).

So I can't see a way out of this bind, other than to come up with a way to
live _in_ a 3rd party, immutable message loop.  My message tried to outline
what would be required, for example, to make Pythonwin use such a Python
driven event loop while still using the MFC message loop.

> The key is that there wouldn't be two separate Python/GUI evloops.
> That's the reason for the (a) condition: be able to awaken a
> multiplexing syscall.

I'm not sure that is feasible.  With what I know about MFC, I almost
certainly would not attempt to integrate such a scheme with Pythonwin.  I
obviously can not speak for the other GUI toolkit maintainers.

> I believe the next thing to do is to enumerate which GUI packages
> fulfill the following conditions ((a) updated to (a') to reflect the
> first paragraph of this msg):

That would certainly help.  I believe it is safe to say there are 3 major
GUI environments for Python currently released: Tkinter, wxPython and
Pythonwin.  I know Pythonwin does not qualify.  We both know Tkinter does
not qualify.  I dont know enough about wxPython, but even if it _does_
qualify, the simple fact that Tkinter doesnt would appear to be the
show-stopper...

Dont get me wrong - its a noble goal that I _have_ pondered myself in the
past - but I can't see a good solution.

Mark.




From ping at lfw.org  Tue May 23 11:41:01 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 23 May 2000 02:41:01 -0700 (PDT)
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python
  multiplexing is too hard)
In-Reply-To: <392A3503.7C72@cnet.francetelecom.fr>
Message-ID: <Pine.LNX.4.10.10005230230130.461-200000@localhost>

On Tue, 23 May 2000, Alexandre Ferrieux wrote:
> 
> Great !!! That's exactly the kind of Pythonic translation I was
> expecting. Thanks !

Here's a straw man.  Try the attached module.  To test it, run:

    python ./watcher.py 10203

then telnet to port 10203 on the local machine.  You can open
several telnet connections to port 10203 at once.

In one session:

    skuld[1041]% telnet localhost 10203
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    >>> 1 + 2
    3
    >>> spam = 3

In another session:

    skuld[1008]% telnet localhost 10203
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    >>> spam
    3

> We just register an empty string as the callback name (script).
> But this is just a random API choice. Anything more Pythonic is welcome
> (an explicit unregister function is okay for me).

So is there no way to register more than one callback on a
particular file?  Do you ever find yourself wanting to do that?


-- ?!ng
-------------- next part --------------
"""Watcher module, by Ka-Ping Yee (22 May 2000).

This module implements event handling on files.  To use it, create a
Watcher object, and register callbacks on the Watcher with the watch()
method.  When ready, call the go() method to start the main loop."""

import select

class StopWatching:
    """Callbacks may raise this exception to exit the main loop."""
    pass

class Watcher:
    """This class provides the ability to register callbacks on file events.
    Each instance represents one mapping from file events to callbacks."""

    def __init__(self):
        self.readers = {}
        self.writers = {}
        self.errhandlers = {}
        self.dicts = [("r", self.readers), ("w", self.writers),
                      ("e", self.errhandlers)]

    def watch(self, handle, callback, modes="r"):
        """Register a callback on a file handle for specified events.
        The 'handle' argument may be a file object or any object providing
        a faithful 'fileno()' method (this includes sockets).  The 'modes'
        argument is a string containing any of the chars "r", "w", or "e"
        to specify that the callback should be triggered when the file
        becomes readable, writable, or encounters an error, respectively.
        The 'callback' should be a function that expects to be called with
        the three arguments (watcher, handle, mode)."""
        fd = handle.fileno()
        for mode, dict in self.dicts:
            if mode in modes: dict[fd] = (handle, callback)

    def unwatch(self, handle, modes="r"):
        """Unregister any callbacks on a file for the specified events.
        The 'handle' argument should be a file object and the 'modes'
        argument should contain one or more of the chars "r", "w", or "e"."""
        fd = handle.fileno()
        for mode, dict in self.dicts:
            if mode in modes and dict.has_key(fd): del dict[fd]
            
    def go(self, timeout=None):
        """Loop forever, watching for file events and triggering callbacks,
        until somebody raises an exception.  The StopWatching exception
        provides a quiet way to exit the event loop.  If a timeout is 
        specified, the loop will exit after that many seconds pass by with
        no events occurring."""
        try:
            while self.readers or self.writers or self.errhandlers:
                rd, wr, ex = select.select(self.readers.keys(),
                                           self.writers.keys(),
                                           self.errhandlers.keys(), timeout)
                if not (rd + wr + ex): break
                for fds, (mode, dict) in map(None, [rd, wr, ex], self.dicts):
                    for fd in fds:
                        handle, callback = dict[fd]
                        callback(self, handle, mode)
        except StopWatching: pass

if __name__ == "__main__":
    import sys, socket, code
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("localhost", 10203)) # Five is RIGHT OUT.
    s.listen(1)
    consoles = {}
    locals = {} # Share locals, just for fun.

    class Redirector:
        def __init__(self, write):
            self.write = write

    def getline(handle):
        line = ""
        while 1:
            ch = handle.recv(1)
            line = line + ch
            if not ch or ch == "\n": return line

    def read(watcher, handle, mode):
        line = getline(handle)
        if line:
            if line[-2:] == "\r\n": line = line[:-2]
            if line[-1:] == "\n": line = line[:-1]
            out, err = sys.stdout, sys.stderr
            sys.stdout = sys.stderr = Redirector(handle.send)
            more = consoles[handle].push(line)
            handle.send(more and "... " or ">>> ")
            sys.stdout, sys.stderr = out, err
        else:
            watcher.unwatch(handle)
            handle.close()

    def connect(watcher, handle, mode):
        ns, addr = handle.accept()
        consoles[ns] = code.InteractiveConsole(locals, "<%s:%d>" % addr)
        watcher.watch(ns, read)
        ns.send(">>> ")

    w = Watcher()
    w.watch(s, connect)
    w.go()

From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 11:54:31 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 11:54:31 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python   multiplexing is too hard)
References: <Pine.LNX.4.10.10005230230130.461-200000@localhost>
Message-ID: <392A5557.72F7@cnet.francetelecom.fr>

Ka-Ping Yee wrote:
> 
> On Tue, 23 May 2000, Alexandre Ferrieux wrote:
> >
> > Great !!! That's exactly the kind of Pythonic translation I was
> > expecting. Thanks !
> 
> Here's a straw man.  <watcher.py>

Nice. Now what's left to do is make select.select() truly
crossplatform...

> So is there no way to register more than one callback on a
> particular file?

Nope - it's considered the responsibility of higher layers.

> Do you ever find yourself wanting to do that?

Seldom, but it happened to me once, and I did exactly that: a layer
above.

-Alex



From mal at lemburg.com  Tue May 23 12:10:20 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 12:10:20 +0200
Subject: [Python-Dev] String encoding
Message-ID: <392A590C.E41239D3@lemburg.com>

The recent discussion about repr() et al. brought up the idea
of a locale based string encoding again.

A support module for querying the encoding used in the current
locale together with the experimental hook to set the string
encoding could yield a compromise which satisfies ASCII, Latin-1
and UTF-8 proponents.

The idea is to use the site.py module to customize the interpreter
from within Python (rather than making the encoding a compile
time option). This is easily doable using the (yet to be written)
support module and the sys.setstringencoding() hook.

The default encoding would be 'ascii' and could then be changed
to whatever the user or administrator wants it to be on a per
site basis. Furthermore, the encoding should be settable on
a per thread basis inside the interpreter (Python threads
do not seem to inherit any per-thread globals, so the
encoding would have to be set for all new threads).

E.g. a site.py module could look like this:

"""
import locale,sys

# Get encoding, defaulting to 'ascii' in case it cannot be
# determined
defenc = locale.get_encoding('ascii')

# Set main thread's string encoding
sys.setstringencoding(defenc)

This would result in the Unicode implementation assuming
defenc as the encoding of strings.
"""

Minor nit: due to the implementation, the C parser markers
"s" and "t" and the hash() value calculation will still need
to work with a fixed encoding which still is UTF-8. C APIs
which want to support Unicode should be fixed to use "es"
or query the object directly and then apply proper, possibly
OS dependent conversion.

Before starting off into implementing the above, I'd like to
hear some comments...

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From alexandre.ferrieux at cnet.francetelecom.fr  Tue May 23 12:16:42 2000
From: alexandre.ferrieux at cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Tue, 23 May 2000 12:16:42 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <ECEPKNMJLHAPFFJHDOJBOEGBCLAA.mhammond@skippinet.com.au>
Message-ID: <392A5A8A.1534@cnet.francetelecom.fr>

Mark Hammond wrote:
> 
> >
> > I'd really like to challenge that 'almost'...
> 
> Sure - but the problem is simply that MFC has a non-standard message - so
> it _must_ be like I described in my message - and you will agree that
> sounds messy.
> 
> If such a set of multiplexing primitives took off, and people found them
> useful, and started complaining they dont work in Pythonwin, then I will
> surely look at it again.  Its too much work and too intrusive for a
> proof-of-concept effort.

Okay, fine.

> > I understand. Maybe I underestimated some of the difficulties. However,
> > I'd still like to separate what can be separated. The unfriendliness to
> > Python debuggers is sad news to me, but is not strictly related to the
> > problem of heterogeneous multiplexing: if I were to design a debugger
> > from scratch for a random language, I believe I'd arrange for the IPC
> > channel used to be more transparent. IOW, the very fact of using the
> > message queue for the debugging IPC *is* the culprit ! In unix, the
> > ptrace() or /proc interfaces have never walked on the toes of any
> > package, GUI or not...
> 
> The unfriendliness is purely related to Pythonwin, and not the general
> Python debugger.  I agree 100% that an RPC type mechanism is far better for
> a debugger.  It was just an anecdote to show how fickle these message loops
> can be (and therefore the complex requirements they have).

Okay, so maybe it's time to summarize what we agreed on:

	(1) 'tearing open' the main loop of a GUI package is tricky in the
general case.
	(2) perusing undefined WM_* messages requires care...

	(3) on the other hand, all other IPC channels are multiplexable. Even
for the worst case (pipes on Windows) at least 1 (1.5?) method has been
identified.

The temporary conclusion, as far as I understand, is that nobody in the
Python community has the spare time and energy to tackle (1), that (2)
is tricky due to an unfortunate choice in the implementation of some
debuggers, and that the seemingly appealing unification outlined by (3)
is not enough of a motivation...

Under these conditions, clearly the only option is to put the blackbox
GUI loop inside a separate thread and arrange for it to use a
well-chosen IPC channel to awaken (something like) the Watcher.go()
proposed by Ka-Ping Yee.
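
For example (only a sketch, assuming a Unix socketpair() as the
"well-chosen IPC channel"): run the GUI loop in its own thread and
give the Watcher one end of the pair to select() on:

    import socket, threading

    def attach_gui(watcher, run_gui_loop):
        # 'run_gui_loop' is the black-box GUI main loop; it runs in its
        # own thread, gets the write end of a socket pair, and sends a
        # byte whenever it has something for the main loop to look at.
        wake_recv, wake_send = socket.socketpair()

        def on_wakeup(watcher, handle, mode):
            handle.recv(1)               # drain the wakeup byte
            # ... pick up whatever the GUI thread queued for us ...

        watcher.watch(wake_recv, on_wakeup)
        t = threading.Thread(target=run_gui_loop, args=(wake_send,))
        t.daemon = True
        t.start()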

Now there's still the issue of actually making select.select()
crossplatform.
Any takers ?

-Alex



From gstein at lyra.org  Tue May 23 12:57:21 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 May 2000 03:57:21 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <392A590C.E41239D3@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005230356230.25623-100000@nebula.lyra.org>

I still think that having any kind of global setting is going to be
troublesome. Whether it is per-thread or not, it still means that Module
Foo cannot alter the value without interfering with Module Bar.

Cheers,
-g

On Tue, 23 May 2000, M.-A. Lemburg wrote:

> The recent discussion about repr() et al. brought up the idea
> of a locale based string encoding again.
> 
> A support module for querying the encoding used in the current
> locale together with the experimental hook to set the string
> encoding could yield a compromise which satisfies ASCII, Latin-1
> and UTF-8 proponents.
> 
> The idea is to use the site.py module to customize the interpreter
> from within Python (rather than making the encoding a compile
> time option). This is easily doable using the (yet to be written)
> support module and the sys.setstringencoding() hook.
> 
> The default encoding would be 'ascii' and could then be changed
> to whatever the user or administrator wants it to be on a per
> site basis. Furthermore, the encoding should be settable on
> a per thread basis inside the interpreter (Python threads
> do not seem to inherit any per-thread globals, so the
> encoding would have to be set for all new threads).
> 
> E.g. a site.py module could look like this:
> 
> """
> import locale,sys
> 
> # Get encoding, defaulting to 'ascii' in case it cannot be
> # determined
> defenc = locale.get_encoding('ascii')
> 
> # Set main thread's string encoding
> sys.setstringencoding(defenc)
> 
> This would result in the Unicode implementation assuming
> defenc as the encoding of strings.
> """
> 
> Minor nit: due to the implementation, the C parser markers
> "s" and "t" and the hash() value calculation will still need
> to work with a fixed encoding which still is UTF-8. C APIs
> which want to support Unicode should be fixed to use "es"
> or query the object directly and then apply proper, possibly
> OS dependent conversion.
> 
> Before starting off into implementing the above, I'd like to
> hear some comments...
> 
> Thanks,
> -- 
> Marc-Andre Lemburg
> ______________________________________________________________________
> Business:                                      http://www.lemburg.com/
> Python Pages:                           http://www.lemburg.com/python/
> 
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Tue May 23 13:38:41 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 23 May 2000 13:38:41 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com>
Message-ID: <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
> The recent discussion about repr() et al. brought up the idea
> of a locale based string encoding again.

before proceeding down this (not very slippery but slightly
unfortunate, imho) slope, I think we should decide whether

    assert eval(repr(s)) == s

should be true for strings.

if this isn't important, nothing stops you from changing 'repr'
to use isprint, without having to make sure that you can still
parse the resulting string.

but if it is important, you cannot really change 'repr' without
addressing the big issue.

so assuming that the assertion must hold, and that changing
'repr' to be locale-dependent is a good idea, let's move on:

> A support module for querying the encoding used in the current
> locale together with the experimental hook to set the string
> encoding could yield a compromise which satisfies ASCII, Latin-1
> and UTF-8 proponents.

agreed.

> The idea is to use the site.py module to customize the interpreter
> from within Python (rather than making the encoding a compile
> time option). This is easily doable using the (yet to be written)
> support module and the sys.setstringencoding() hook.

agreed.

note that parsing LANG (etc) variables on a POSIX platform is
easy enough to do in Python (either in site.py or in locale.py).
no need for external support modules for Unix, in other words.

for windows, I suggest adding GetACP() to the _locale module,
and let the glue layer (site.py 0or locale.py) do:

    if sys.platform == "win32":
        sys.setstringencoding("cp%d" % GetACP())

on mac, I think you can determine the encoding by inspecting the
system font, and fall back to "macroman" if that doesn't work out.
but figuring out the right way to do that is best left to anyone who
actually has access to a Mac.  in the meantime, just make it:

    elif sys.platform == "mac":
        sys.setstringencoding("macroman")

> The default encoding would be 'ascii' and could then be changed
> to whatever the user or administrator wants it to be on a per
> site basis. 

Tcl defaults to "iso-8859-1" on all platforms except the Mac.  assuming
that the vast majority of non-Mac platforms are either modern Unixes
or Windows boxes, that makes a lot more sense than US ASCII...

in other words:

    else:
        # try to determine encoding from POSIX locale environment
        # variables
        ...
        # and if that doesn't yield anything:
        sys.setstringencoding("iso-latin-1")

> Furthermore, the encoding should be settable on a per thread basis
> inside the interpreter (Python threads do not seem to inherit any
> per-thread globals, so the encoding would have to be set for all
> new threads).

is the C/POSIX locale setting thread specific?

if not, I think the default encoding should be a global setting, just
like the system locale itself.  otherwise, you'll just be addressing a
real problem (thread/module/function/class/object specific locale
handling), but not really solving it...

better use unicode strings and explicit encodings in that case.

> Minor nit: due to the implementation, the C parser markers
> "s" and "t" and the hash() value calculation will still need
> to work with a fixed encoding which still is UTF-8.

can this be fixed?  or rather, what changes to the buffer api
are required if we want to work around this problem?

> C APIs which want to support Unicode should be fixed to use
> "es" or query the object directly and then apply proper, possibly
> OS dependent conversion.

for convenience, it might be a good idea to have a "wide system
encoding" too, and special parser markers for that purpose.

or can we assume that all wide system API's use unicode all the
time?

unproductive-ly yrs /F




From pf at artcom-gmbh.de  Tue May 23 14:02:17 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 14:02:17 +0200 (MEST)
Subject: [Python-Dev] String encoding
In-Reply-To: <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> from Fredrik Lundh at "May 23, 2000  1:38:41 pm"
Message-ID: <m12uDO5-000DieC@artcom0.artcom-gmbh.de>

Hi Fredrik!

you wrote:
> before proceeding down this (not very slippery but slightly
> unfortunate, imho) slope, I think we should decide whether
> 
>     assert eval(repr(s)) == s
> 
> should be true for strings.
[...]

What's the problem with this one?  I've played around with several
locale settings here and I observed no problems, while doing:

>>> import string
>>> s = string.join(map(chr, range(128,256)),"")
>>> assert eval('"'+s+'"') == s

What do you fear here if 'repr' outputs characters from the
upper half of the charset without quoting them as octal sequences?
I don't understand.

Regards, Peter



From fredrik at pythonware.com  Tue May 23 15:09:11 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 23 May 2000 15:09:11 +0200
Subject: [Python-Dev] String encoding
References: <m12uDO5-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <01fa01bfc4b8$16e64700$0500a8c0@secret.pythonware.com>

Peter wrote:
>
> >     assert eval(repr(s)) == s
>
> What's the problem with this one?  I've played around with several
> locale settings here and I observed no problems, while doing:

what if the default encoding for source code is different
from the locale?  (think UTF-8 source code)

(no, that's not supported by 1.6.  but if we don't consider that
case now, we won't be able to support source encodings in the
future -- unless the above assertion isn't important, of course).

</F>




From mal at lemburg.com  Tue May 23 13:14:46 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 13:14:46 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005230356230.25623-100000@nebula.lyra.org>
Message-ID: <392A6826.CCDD2246@lemburg.com>

Greg Stein wrote:
> 
> I still think that having any kind of global setting is going to be
> troublesome. Whether it is per-thread or not, it still means that Module
> Foo cannot alter the value without interfering with Module Bar.

True. 

The only reasonable place to alter the setting is in
site.py for the main thread. I think the setting should be
inherited by child threads, but I'm not sure whether this is
possible or not.
 
Modules that would need to change the settings are better
(re)designed in a way that doesn't rely on the setting at all, e.g.
work on Unicode exclusively which doesn't introduce the need
in the first place.

And then, no one is forced to alter the ASCII default to begin
with :-) The good thing about exposing this mechanism in Python
is that it gets user attention...

> Cheers,
> -g
> 
> On Tue, 23 May 2000, M.-A. Lemburg wrote:
> 
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> >
> > A support module for querying the encoding used in the current
> > locale together with the experimental hook to set the string
> > encoding could yield a compromise which satisfies ASCII, Latin-1
> > and UTF-8 proponents.
> >
> > The idea is to use the site.py module to customize the interpreter
> > from within Python (rather than making the encoding a compile
> > time option). This is easily doable using the (yet to be written)
> > support module and the sys.setstringencoding() hook.
> >
> > The default encoding would be 'ascii' and could then be changed
> > to whatever the user or administrator wants it to be on a per
> > site basis. Furthermore, the encoding should be settable on
> > a per thread basis inside the interpreter (Python threads
> > do not seem to inherit any per-thread globals, so the
> > encoding would have to be set for all new threads).
> >
> > E.g. a site.py module could look like this:
> >
> > """
> > import locale,sys
> >
> > # Get encoding, defaulting to 'ascii' in case it cannot be
> > # determined
> > defenc = locale.get_encoding('ascii')
> >
> > # Set main thread's string encoding
> > sys.setstringencoding(defenc)
> >
> > This would result in the Unicode implementation to assume
> > defenc as encoding of strings.
> > """
> >
> > Minor nit: due to the implementation, the C parser markers
> > "s" and "t" and the hash() value calculation will still need
> > to work with a fixed encoding which still is UTF-8. C APIs
> > which want to support Unicode should be fixed to use "es"
> > or query the object directly and then apply proper, possibly
> > OS dependent conversion.
> >
> > Before starting off into implementing the above, I'd like to
> > hear some comments...
> >
> > Thanks,
> > --
> > Marc-Andre Lemburg
> > ______________________________________________________________________
> > Business:                                      http://www.lemburg.com/
> > Python Pages:                           http://www.lemburg.com/python/
> >
> >
> > _______________________________________________
> > Python-Dev mailing list
> > Python-Dev at python.org
> > http://www.python.org/mailman/listinfo/python-dev
> >
> 
> --
> Greg Stein, http://www.lyra.org/
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From pf at artcom-gmbh.de  Tue May 23 16:29:58 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 16:29:58 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
Message-ID: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>

Python 1.6 reports a bad magic error, when someone tries to import a .pyc
file compiled by Python 1.5.2.  AFAIK only new features have been
added.  So why isn't it possible to use these old files in Python 1.6?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From dan at cgsoftware.com  Tue May 23 16:43:44 2000
From: dan at cgsoftware.com (Daniel Berlin)
Date: Tue, 23 May 2000 10:43:44 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <BAEMJNPFHMFPEFAGBKKAMEJBCCAA.dan@cgsoftware.com>

Because of the unicode changes, AFAIK.
Or was it the multi-arg vs single-arg append and friends?
Anyway, the point is that there were incompatible changes made, and thus
the magic was changed.
--Dan
>
>
> Python 1.6 reports a bad magic error, when someone tries to import a .pyc
> file compiled by Python 1.5.2.  AFAIK only new features have been
> added.  So why isn't it possible to use these old files in Python 1.6?
>
> Regards, Peter
> --
> Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany,
> Fax:+49 4222950260
> office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)




From fdrake at acm.org  Tue May 23 16:47:52 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 07:47:52 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6.  Why not?
In-Reply-To: <m12uFh0-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, Peter Funk wrote:
 > Python 1.6 reports a bad magic error, when someone tries to import a .pyc
 > file compiled by Python 1.5.2.  AFAIK only new features have been
 > added.  So why isn't it possible to use these old files in Python 1.6?

Peter,
  In theory, perhaps it could; I don't know if the extra work is worth it,
however.
  What's happening is that the .pyc magic number changed because the
marshal format has been extended to support Unicode string objects.  The
old format should still be readable, but there's nothing in the .pyc
loader that supports the acceptance of multiple versions of the marshal
format.
  Is there reason to think that's a substantial problem for users, given
the automatic recompilation of bytecode from source?  The only serious
problems I can see are when multiple versions of the interpreter are being
used on the same collection of source files (because the re-compilation
occurs more often and affects performance), and when *only* .pyc/.pyo
files are available.
  Do you have reason to suspect that either case is sufficiently common to
complicate the .pyc loader, or is there another reason that I've missed
(very possible, I admit)?


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>





From mal at lemburg.com  Tue May 23 16:20:19 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 16:20:19 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>
Message-ID: <392A93A3.91188372@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> 
> before proceeding down this (not very slippery but slightly
> unfortunate, imho) slope, I think we should decide whether
> 
>     assert eval(repr(s)) == s
> 
> should be true for strings.
> 
> if this isn't important, nothing stops you from changing 'repr'
> to use isprint, without having to make sure that you can still
> parse the resulting string.
> 
> but if it is important, you cannot really change 'repr' without
> addressing the big issue.

This is a different discussion which I don't really want to
get into... I don't have any need for repr() being locale
dependent, since I only use it for debugging purposes and
never to rebuild objects (marshal and pickle are much better
at that).

BTW, repr(unicode) is not affected by the string encoding:
it always returns unicode-escape.
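
For example (just an illustration; exact escapes per the current
CVS code, and independent of any locale setting):

>>> repr(u"a\u1234")
"u'a\\u1234'"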

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May 23 16:47:40 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 16:47:40 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com>
Message-ID: <392A9A0C.2E297072@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > The recent discussion about repr() et al. brought up the idea
> > of a locale based string encoding again.
> > [...]
> >
> > A support module for querying the encoding used in the current
> > locale together with the experimental hook to set the string
> > encoding could yield a compromise which satisfies ASCII, Latin-1
> > and UTF-8 proponents.
> 
> agreed.
> 
> > The idea is to use the site.py module to customize the interpreter
> > from within Python (rather than making the encoding a compile
> > time option). This is easily doable using the (yet to be written)
> > support module and the sys.setstringencoding() hook.
> 
> agreed.
> 
> note that parsing LANG (etc) variables on a POSIX platform is
> easy enough to do in Python (either in site.py or in locale.py).
> no need for external support modules for Unix, in other words.

Agreed... locale.py (and the builtin _locale module) is probably
the right place to put such a parser.
 
> for windows, I suggest adding GetACP() to the _locale module,
> and let the glue layer (site.py or locale.py) do:
> 
>     if sys.platform == "win32":
>         sys.setstringencoding("cp%d" % GetACP())
> 
> on mac, I think you can determine the encoding by inspecting the
> system font, and fall back to "macroman" if that doesn't work out.
> but figuring out the right way to do that is best left to anyone who
> actually has access to a Mac.  in the meantime, just make it:
> 
>     elif sys.platform == "mac":
>         sys.setstringencoding("macroman")
> 
> > The default encoding would be 'ascii' and could then be changed
> > to whatever the user or administrator wants it to be on a per
> > site basis.
> 
> Tcl defaults to "iso-8859-1" on all platforms except the Mac.  assuming
> that the vast majority of non-Mac platforms are either modern Unixes
> or Windows boxes, that makes a lot more sense than US ASCII...
> 
> in other words:
> 
>     else:
>         # try to determine encoding from POSIX locale environment
>         # variables
>         ...
> 
>     else:
>         sys.setstringencoding("iso-latin-1")

That's a different topic which I don't want to revive ;-)

With the above tools you can easily code the latin-1 default
into your site.py.

> > Furthermore, the encoding should be settable on a per thread basis
> > inside the interpreter (Python threads do not seem to inherit any
> > per-thread globals, so the encoding would have to be set for all
> > new threads).
> 
> is the C/POSIX locale setting thread specific?

Good question -- I don't know.

> if not, I think the default encoding should be a global setting, just
> like the system locale itself.  otherwise, you'll just be addressing a
> real problem (thread/module/function/class/object specific locale
> handling), but not really solving it...
>
> better use unicode strings and explicit encodings in that case.

Agreed.
 
> > Minor nit: due to the implementation, the C parser markers
> > "s" and "t" and the hash() value calculation will still need
> > to work with a fixed encoding which still is UTF-8.
> 
> can this be fixed?  or rather, what changes to the buffer api
> are required if we want to work around this problem?

The problem is that "s" and "t" return C pointers to some
internal data structure of the object. It has to be assured
that this data remains intact at least as long as the object
itself exists.

AFAIK, this cannot be fixed without creating a memory leak.
 
The "es" parser marker uses a different strategy, BTW: the
data is copied into a buffer, thus detaching the object
from the data.

> > C APIs which want to support Unicode should be fixed to use
> > "es" or query the object directly and then apply proper, possibly
> > OS dependent conversion.
> 
> for convenience, it might be a good idea to have a "wide system
> encoding" too, and special parser markers for that purpose.
> 
> or can we assume that all wide system API's use unicode all the
> time?

At least in all references I've seen (e.g. ODBC, wchar_t
implementations, etc.) "wide" refers to Unicode.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Tue May 23 17:13:59 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 08:13:59 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <392A9A0C.2E297072@lemburg.com>
Message-ID: <Pine.LNX.4.10.10005230805200.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, M.-A. Lemburg wrote:
 > The problem is that "s" and "t" return C pointers to some
 > internal data structure of the object. It has to be assured
 > that this data remains intact at least as long as the object
 > itself exists.
 > 
 > AFAIK, this cannot be fixed without creating a memory leak.
 >  
 > The "es" parser marker uses a different strategy, BTW: the
 > data is copied into a buffer, thus detaching the object
 > from the data.
 > 
 > > > C APIs which want to support Unicode should be fixed to use
 > > > "es" or query the object directly and then apply proper, possibly
 > > > OS dependent conversion.
 > > 
 > > for convenience, it might be a good idea to have a "wide system
 > > encoding" too, and special parser markers for that purpose.
 > > 
 > > or can we assume that all wide system API's use unicode all the
 > > time?
 > 
 > At least in all references I've seen (e.g. ODBC, wchar_t
 > implementations, etc.) "wide" refers to Unicode.

  On Linux, wchar_t is 4 bytes; that's not just Unicode.  Doesn't ISO
10646 require a 32-bit space?
  I recall a fair bit of discussion about wchar_t when it was introduced
to ANSI C, and the character set and encoding were specifically not made
part of the specification.  Making a requirement that wchar_t be Unicode
doesn't make a lot of sense, and opens up potential portability issues.

-1 on any assumption that wchar_t is usefully portable.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From effbot at telia.com  Tue May 23 17:16:42 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 17:16:42 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A93A3.91188372@lemburg.com>
Message-ID: <023d01bfc4cb$3b0ee3e0$f2a6b5d4@hagrid>

M.-A. Lemburg <mal at lemburg.com> wrote:
> > before proceeding down this (not very slippery but slightly
> > unfortunate, imho) slope, I think we should decide whether
> > 
> >     assert eval(repr(s)) == s
> > 
> > should be true for strings.

footnote: as far as I can tell, the language reference says it should:
http://www.python.org/doc/current/ref/string-conversions.html

> This is a different discussion which I don't really want to
> get into... I don't have any need for repr() being locale
> dependent, since I only use it for debugging purposes and
> never to rebuild objects (marshal and pickle are much better
> at that).

in other words, you leave it to 'pickle' to call 'repr' for you ;-)

</F>




From pf at artcom-gmbh.de  Tue May 23 17:23:48 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 23 May 2000 17:23:48 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com> from "Fred L. Drake" at "May 23, 2000  7:47:52 am"
Message-ID: <m12uGX6-000DieC@artcom0.artcom-gmbh.de>

Fred, 
Thank you for your quick response.

Fred L. Drake:
> Peter,
>   In theory, perhaps it could; I don't know if the extra work is worth it,
> however.
[...]
>   Do you have reason to suspect that either case is sufficiently common to
> complicate the .pyc loader, or is there another reason that I've missed
> (very possible, I admit)?

Well, currently we (our company) deliver no source code to our
customers.  I don't want to discuss this policy and the reasoning
behind here.  But this situation may also apply to other commercial
software vendors using Python.

During late 2000 there may be several customers out there running
Python 1.6 and others still running Python 1.5.2.  So we will have
several choices to deal with this situation:
   1. Supply two different binary distribution packages: 
      one containing 1.5.2 .pyc files and one containing 1.6 .pyc files.
      This will introduce some new logistic problems.
   2. Upgrade to Python 1.6 at each customer site at once. 
      This will be difficult.
   3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files
      and supply our own patched Python distribution.
      (and this would also be "carrying owls to Athen" for Linux systems)
    [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ careless translated 
      german idiom makes any sense in english ;-) ]
      I personally don't like this.
   4. Change our policy and distribute also the .py sources.  Beside the
      difficulty to convince the management about this one, this also
      introduces new technical "challenges".  The Unix text files have to be
      converted from LF lineends to CR lineends or MacPython wouldn't be 
      able to parse the files.  So the mac source distributions
      must be build from a different directory tree.

No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or 
some such somewhere in the 1.6 unmarshaller should do the trick.
But I don't want to submit a patch if God^H^HGuido thinks this isn't
worth the effort. <wink>
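
(for anyone who wants to poke at this: the magic number is just the
first four bytes of a .pyc file, so a quick check from Python itself
could look roughly like this -- an illustration only, not the proposed
patch, and "foo.pyc" is just a placeholder name:)

    import imp

    def pyc_magic(filename):
        # the first four bytes of a .pyc file hold the marshal magic number
        f = open(filename, "rb")
        magic = f.read(4)
        f.close()
        return magic

    print pyc_magic("foo.pyc") == imp.get_magic()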

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From esr at thyrsus.com  Tue May 23 17:40:50 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 23 May 2000 11:40:50 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6.  Why not?
In-Reply-To: <m12uGX6-000DieC@artcom0.artcom-gmbh.de>; from pf@artcom-gmbh.de on Tue, May 23, 2000 at 05:23:48PM +0200
References: <Pine.LNX.4.10.10005230739510.22456-100000@mailhost.beopen.com> <m12uGX6-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000523114050.A4781@thyrsus.com>

Peter Funk <pf at artcom-gmbh.de>:
>       (and this would also be "carrying owls to Athen" for Linux systems)
>     [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ careless translated 
>       german idiom makes any sense in english ;-) ]

There is a precise equivalent: "carrying coals to Newcastle".
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

"Are we to understand," asked the judge, "that you hold your own interests
above the interests of the public?"

"I hold that such a question can never arise except in a society of cannibals."
	-- Ayn Rand



From effbot at telia.com  Tue May 23 17:41:46 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 17:41:46 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A9A0C.2E297072@lemburg.com>
Message-ID: <024001bfc4cd$68210f00$f2a6b5d4@hagrid>

M.-A. Lemburg wrote:
> That's a different topic which I don't want to revive ;-)

in a way, you've already done that -- if you're setting the system encoding
in the site.py module, lots of people will end up with the encoding set to ISO
Latin 1 or its windows superset.

one might of course the system encoding if the user actually calls setlocale,
but there's no way for python to trap calls to that function from a submodule
(e.g. readline), so it's easy to get out of sync.  hmm.

(on the other hand, I'd say it's far more likely that americans are among the
few who don't know how to set the locale, so defaulting to us ascii might be
best after all -- even if their computers really use iso-latin-1, we don't have
to cause unnecessary confusion...)

...

but I guess you're right: let's be politically correct and pretend that this really
is a completely different issue ;-)

</F>




From effbot at telia.com  Tue May 23 18:04:38 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 18:04:38 +0200
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
References: <Pine.LNX.4.10.10005220914000.461-100000@localhost>  <200005222038.PAA01284@cj20424-a.reston1.va.home.com>
Message-ID: <027f01bfc4d0$99c48ca0$f2a6b5d4@hagrid>

> Which is why I find Fredrik's attitude unproductive.

given that locale support isn't included if you make a default build,
I don't think deprecating it would hurt that many people...

but that's me; when designing libraries, I've always strived to find
the *minimal* set of functions (and code) that makes it possible for
a programmer to do her job well.  I'm especially wary of blind alleys
(sure, you can use locale, but that'll only take you this far, and you
have to start all over if you want to do it right).

btw, talking about productivity, go check out the case sensitivity
threads on comp.lang.python.  imagine if all those people hammered
away on the 1.6 alpha instead...

> And where's the SRE release?

at the usual place:

    http://w1.132.telia.com/~u13208596/sre/index.htm

still one showstopper left, which is why I haven't made the long-
awaited public "now it's finished, dammit" announcement yet.  but
it shouldn't be that far away.

</F>




From fdrake at acm.org  Tue May 23 18:11:14 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 23 May 2000 09:11:14 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6.  Why not?
In-Reply-To: <20000523114050.A4781@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005230904570.22456-100000@mailhost.beopen.com>

On Tue, 23 May 2000, Eric S. Raymond wrote:
 > Peter Funk <pf at artcom-gmbh.de>:
 > >       (and this would also be "carrying owls to Athen" for Linux systems)
 > >     [I don't know whether this ^^^^^^^^^^^^^^^^^^^^^^ careless translated 
 > >       german idiom makes any sense in english ;-) ]
 > 
 > There is a precise equivalent: "carrying coals to Newcastle".

  That's interesting... I've never heard either, but I think I can guess
the meaning now.
  I agree; it looks like there's some work to do in getting the .pyc
loader to be a little more concerned about importing compatible marshal
formats.  I have an idea about how I'd like to see in done which may be a
little less magical.  I'll work up a patch later this week.
  I won't check in any changes for this until we've heard from Guido on
the matter, and he'll probably be unavailable for the next couple of days.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From effbot at telia.com  Tue May 23 18:26:43 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 18:26:43 +0200
Subject: [Python-Dev] Unicode
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com> <200005172255.AAA01245@loewis.home.cs.tu-berlin.de>
Message-ID: <031101bfc4d3$afac2020$f2a6b5d4@hagrid>

Martin v. Loewis wrote:
> To my knowledge, no. Tcl (at least 8.3) supports the \u notation for
> Unicode escapes, and treats all other source code as
> Latin-1. encoding(n) says
> 
> # However, because the source command always reads files using the
> # ISO8859-1 encoding, Tcl will treat each byte in the file as a
> # separate character that maps to the 00 page in Unicode.

as far as I can tell from digging through the sources, the "source"
command uses the system encoding.  and from the look of it, it's
not always iso-latin-1...

</F>




From mal at lemburg.com  Tue May 23 18:48:08 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 18:48:08 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005230805200.22456-100000@mailhost.beopen.com>
Message-ID: <392AB648.368663A8@lemburg.com>

"Fred L. Drake" wrote:
> 
> On Tue, 23 May 2000, M.-A. Lemburg wrote:
>  > The problem is that "s" and "t" return C pointers to some
>  > internal data structure of the object. It has to be assured
>  > that this data remains intact at least as long as the object
>  > itself exists.
>  >
>  > AFAIK, this cannot be fixed without creating a memory leak.
>  >
>  > The "es" parser marker uses a different strategy, BTW: the
>  > data is copied into a buffer, thus detaching the object
>  > from the data.
>  >
>  > > > C APIs which want to support Unicode should be fixed to use
>  > > > "es" or query the object directly and then apply proper, possibly
>  > > > OS dependent conversion.
>  > >
>  > > for convenience, it might be a good idea to have a "wide system
>  > > encoding" too, and special parser markers for that purpose.
>  > >
>  > > or can we assume that all wide system API's use unicode all the
>  > > time?
>  >
>  > At least in all references I've seen (e.g. ODBC, wchar_t
>  > implementations, etc.) "wide" refers to Unicode.
> 
>   On Linux, wchar_t is 4 bytes; that's not just Unicode.  Doesn't ISO
> 10646 require a 32-bit space?

It is; Unicode is definitely moving in the 32-bit direction.

>   I recall a fair bit of discussion about wchar_t when it was introduced
> to ANSI C, and the character set and encoding were specifically not made
> part of the specification.  Making a requirement that wchar_t be Unicode
> doesn't make a lot of sense, and opens up potential portability issues.
> 
> -1 on any assumption that wchar_t is usefully portable.

Ok... so could be that Fredrik has a point there, but I'm
not deep enough into this to be able to comment.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May 23 19:15:17 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 May 2000 19:15:17 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A93A3.91188372@lemburg.com> <023d01bfc4cb$3b0ee3e0$f2a6b5d4@hagrid>
Message-ID: <392ABCA5.EC84824F@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal at lemburg.com> wrote:
> > > before proceeding down this (not very slippery but slightly
> > > unfortunate, imho) slope, I think we should decide whether
> > >
> > >     assert eval(repr(s)) == s
> > >
> > > should be true for strings.
> 
> footnote: as far as I can tell, the language reference says it should:
> http://www.python.org/doc/current/ref/string-conversions.html
> 
> > This is a different discussion which I don't really want to
> > get into... I don't have any need for repr() being locale
> > dependent, since I only use it for debugging purposes and
> > never to rebuild objects (marshal and pickle are much better
> > at that).
> 
> in other words, you leave it to 'pickle' to call 'repr' for you ;-)

Ooops... now this gives a totally new ring to changing
repr(). Hehe, perhaps we need a string.encode() method
too ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From martin at loewis.home.cs.tu-berlin.de  Tue May 23 23:44:11 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 23 May 2000 23:44:11 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: <031101bfc4d3$afac2020$f2a6b5d4@hagrid> (effbot@telia.com)
References: <Pine.LNX.4.10.10005170708500.4723-100000@mailhost.beopen.com> <200005172255.AAA01245@loewis.home.cs.tu-berlin.de> <031101bfc4d3$afac2020$f2a6b5d4@hagrid>
Message-ID: <200005232144.XAA01129@loewis.home.cs.tu-berlin.de>

> > # However, because the source command always reads files using the
> > # ISO8859-1 encoding, Tcl will treat each byte in the file as a
> > # separate character that maps to the 00 page in Unicode.
> 
> as far as I can tell from digging through the sources, the "source"
> command uses the system encoding.  and from the look of it, it's
> not always iso-latin-1...

Indeed, this appears to be an error in the documentation. sourcing

encoding convertto utf-8 ?

has an outcome depending on the system encoding; just try koi8-r to
see the difference.

Regards,
Martin




From effbot at telia.com  Tue May 23 23:57:57 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 23 May 2000 23:57:57 +0200
Subject: [Python-Dev] homer-dev, anyone?
References: <009d01bfbf64$b779a260$34aab5d4@hagrid>
Message-ID: <008a01bfc502$17765260$f2a6b5d4@hagrid>

    http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
    "May 11: In a press conference held early this morning, Guido van Rossum
    ... announced that his most famous project will be undergoing a name
    change ..."

    http://www.scriptics.com/company/news/press_release_ajuba.html
    "May 22: Scriptics Corporation ... today announced that it has changed its
    name ..."

...




From akuchlin at mems-exchange.org  Wed May 24 01:33:28 2000
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 23 May 2000 19:33:28 -0400 (EDT)
Subject: [Python-Dev] Updated curses module in CVS
Message-ID: <200005232333.TAA16068@amarok.cnri.reston.va.us>

Today I checked in a new version of the curses module that will only
work with ncurses and/or SYSV curses.  I've tried compiling it on
Linux with ncurses 5.0, and on Solaris; there are also #ifdef's to
make it work with some version of SGI's curses.

I'd appreciate it if people could try the module with the curses
implementations on other platforms: Tru64, AIX, *BSDs (though they use
ncurses, maybe they're some versions behind), etc.  Please let me know
of your results through e-mail.

And if you have code that used the old curses module, and breaks with
the new module, please let me know; the goal is to have 100%
backward-compatibility.

Also, here's a list of ncurses functions that aren't yet supported;
should I make adding them a priority?  (Most of them seem to be pretty
marginal, except for the mouse-related functions which I want to add
next.)

addchnstr addchstr chgat color_set copywin define_key del_curterm
delscreen dupwin getmouse inchnstr inchstr innstr keyok mcprint
mouseinterval mousemask mvaddchnstr mvaddchstr mvchgat mvcur
mvinchnstr mvinchstr mvinnstr mmvwaddchnstr mvwaddchstr mvwchgat
mvwgetnstr mvwinchnstr mvwinchstr mvwinnstr napms newterm overlay
overwrite resetty resizeterm restartterm ripoffline savetty scr_dump
scr_init scr_restore scr_set scrl set_curterm set_term setterm
setupterm slk_attr slk_attr_off slk_attr_on slk_attr_set slk_attroff
slk_attron slk_attrset slk_clear slk_color slk_init slk_label
slk_noutrefresh slk_refresh slk_restore slk_set slk_touch tgetent
tgetflag tgetnum tgetstr tgoto tigetflag tigetnum tigetstr timeout
tparm tputs tputs typeahead ungetmouse use_default_colors vidattr
vidputs waddchnstr waddchstr wchgat wcolor_set wcursyncup wenclose
winchnstr winchstr winnstr wmouse_trafo wredrawln wscrl wtimeout

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
..signature has giant ASCII graphic: Forced to read "War And Peace" at 110
baud on a Braille terminal after having fingers rubbed with sandpaper.
  -- Kibo, in the Happynet Manifesto



From gstein at lyra.org  Wed May 24 02:00:55 2000
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 May 2000 17:00:55 -0700 (PDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <008a01bfc502$17765260$f2a6b5d4@hagrid>
Message-ID: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>

what a dumb name...


On Tue, 23 May 2000, Fredrik Lundh wrote:

> 
>     http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
>     "May 11: In a press conference held early this morning, Guido van Rossum
>     ... announced that his most famous project will be undergoing a name
>     change ..."
> 
>     http://www.scriptics.com/company/news/press_release_ajuba.html
>     "May 22: Scriptics Corporation ... today announced that it has changed its
>     name ..."
> 
> ...
> 
> 
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/




From klm at digicool.com  Wed May 24 02:33:57 2000
From: klm at digicool.com (Ken Manheimer)
Date: Tue, 23 May 2000 20:33:57 -0400 (EDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
Message-ID: <Pine.LNX.4.21.0005232030340.31343-100000@korak.digicool.com>

On Tue, 23 May 2000, Greg Stein wrote:

> what a dumb name...
> On Tue, 23 May 2000, Fredrik Lundh wrote:
> 
> >     http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
> >     "May 11: In a press conference held early this morning, Guido van Rossum
> >     ... announced that his most famous project will be undergoing a name
> >     change ..."

Huh.  I dunno what's so dumb about it.  But i definitely was tickled by:

  !STOP PRESS! Microsoft Corporation announced this afternoon that it had
  aquired rights to use South Park characters in its software. The first
  such product, formerly known as Visual J++, will now be known as Kenny.
  !STOP PRESS!

:->

Ken
klm at digicool.com

(No relation.)




From esr at thyrsus.com  Wed May 24 02:47:50 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 23 May 2000 20:47:50 -0400
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <200005232333.TAA16068@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Tue, May 23, 2000 at 07:33:28PM -0400
References: <200005232333.TAA16068@amarok.cnri.reston.va.us>
Message-ID: <20000523204750.A6107@thyrsus.com>

Andrew M. Kuchling <akuchlin at mems-exchange.org>:
> Also, here's a list of ncurses functions that aren't yet supported;
> should I make adding them a priority?  (Most of them seem to be pretty
> marginal, except for the mouse-related functions which I want to add
> next.)
> 
> addchnstr addchstr chgat color_set copywin define_key del_curterm
> delscreen dupwin getmouse inchnstr inchstr innstr keyok mcprint
> mouseinterval mousemask mvaddchnstr mvaddchstr mvchgat mvcur
> mvinchnstr mvinchstr mvinnstr mmvwaddchnstr mvwaddchstr mvwchgat
> mvwgetnstr mvwinchnstr mvwinchstr mvwinnstr napms newterm overlay
> overwrite resetty resizeterm restartterm ripoffline savetty scr_dump
> scr_init scr_restore scr_set scrl set_curterm set_term setterm
> setupterm slk_attr slk_attr_off slk_attr_on slk_attr_set slk_attroff
> slk_attron slk_attrset slk_clear slk_color slk_init slk_label
> slk_noutrefresh slk_refresh slk_restore slk_set slk_touch tgetent
> tgetflag tgetnum tgetstr tgoto tigetflag tigetnum tigetstr timeout
> tparm tputs tputs typeahead ungetmouse use_default_colors vidattr
> vidputs waddchnstr waddchstr wchgat wcolor_set wcursyncup wenclose
> winchnstr winchstr winnstr wmouse_trafo wredrawln wscrl wtimeout

I think you're right to put the mouse support at highest priority.

I'd say napms() and the overlay/overwrite/copywin group are moderately
important.  So are the functions in the curs_inopts(3x) group -- when
you need those, nothing else will do.  

You can certainly pretty much forget the slk_* group; I only
implemented those for the sake of excruciating completeness.
Likewise for the mv* variants.  

Here's a function that ought to be in the Python wrapper associated with
the module:

import curses, traceback

def traceback_wrapper(func, *rest):
    "Call a hook function, guaranteeing curses cleanup on error or exit."
    try:
        # Initialize curses
        stdscr = curses.initscr()
        # Turn off echoing of keys, and enter cbreak mode,
        # where no buffering is performed on keyboard input
        curses.noecho() ; curses.cbreak()

        # In keypad mode, escape sequences for special keys
        # (like the cursor keys) will be interpreted and
        # a special value like curses.KEY_LEFT will be returned
        stdscr.keypad(1)

        # Run the hook.  Supply the screen window object as first argument
        apply(func, (stdscr,) + rest)

        # Set everything back to normal
        stdscr.keypad(0)
        curses.echo() ; curses.nocbreak()
        curses.endwin()          # Terminate curses
    except:
        # In the event of an error, restore the terminal
        # to a sane state.
        stdscr.keypad(0)
        curses.echo() ; curses.nocbreak()
        curses.endwin()
        traceback.print_exc()    # Print the exception

(Does this case mean, perhaps, that the Python interpreter ought to allow
setting a stack of hooks to be executed just before traceback-emission time?)

I'd also be willing to write a Python function that implements Emacs-style
keybindings for field editing, if that's interesting.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

Don't think of it as `gun control', think of it as `victim
disarmament'. If we make enough laws, we can all be criminals.



From skip at mojam.com  Wed May 24 03:40:02 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 23 May 2000 20:40:02 -0500 (CDT)
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
References: <008a01bfc502$17765260$f2a6b5d4@hagrid>
	<Pine.LNX.4.10.10005231700470.31927-100000@nebula.lyra.org>
Message-ID: <14635.13042.415077.857803@beluga.mojam.com>

Regarding "Ajuba", Greg wrote:

    what a dumb name...

The top 10 reasons why "Ajuba" is a great name for the former Scriptics:

   10. An accounting error left waaay too much money in the marketing
       budget.  They felt they had to spend it or risk a budget cut next
       year.

    9. It would make a cool name for a dance.  They will now be able to do
       the "Ajuba" at the company's Friday afternoon beer busts.

    8. It's almost palindromic, giving the company's art department all
       sorts of cool nearly symmetric logo possibilities.

    7. It has 7 +/- 2 letters, so when purchasing managers from other
       companies see it flash by in the background of a baseball or
       basketball game on TV they'll be able to remember it.

    6. No programming languages already exist with that name.

    5. It doesn't mean anything bad in any known Indo-European, Asian or
       African language so they won't risk losing market share (what market
       share?) in some obscure third-world country because it means "take a
       flying leap".

    4. It's not already registered in .com, .net, .edu or .org.

    3. No prospective employee will associate the new company name with the
       old, so they'll be able to pull in lots of resumes from people who
       would never have stooped to programming in Tcl for a living.

    2. It's more pronounceable than "Tcl" or "Tcl/Tk" by just about anybody
       who has ever seen English in print.

    1. It doesn't suggest anything, so the company is free to redirect its
       focus any way it wants, including replacing Tcl with Python in future
       versions of its products.

;-)

Skip



From gward at mems-exchange.org  Wed May 24 04:43:53 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Tue, 23 May 2000 22:43:53 -0400
Subject: [Python-Dev] Supporting non-Microsoft compilers
Message-ID: <20000523224352.A997@mems-exchange.org>

A couple of people are working on support in the Distutils for building
extensions on Windows with non-Microsoft compilers.  I think this is
crucial; I hate the idea of requiring people to either download a binary
or shell out megabucks (and support Chairman Bill's monopoly) just to
use some handy Python extension.  (OK, OK, more likely they'll go
without the extension, or go without Python.  But still...)

However, it seems like it would be nice if people could build Python
itself with (eg.) cygwin's gcc or Borland's compiler.  (It might be
essential to properly support building extensions with gcc.)  Has anyone
done anything towards that goal?  It appears that there is at least one
patch floating around that advises people to hack up their installed
config.h, and drop a libpython.a somewhere in the installation, in order
to compile extensions with cygwin gcc and/or mingw32.  This strikes me
as sub-optimal: can at least the required changes to config.h be made to
allow building Python with one of the Windows gcc ports?

I would be willing to hold my nose and struggle with cygwin for a little
while in Windows in dull moments at work -- had to reboot my Linux box
into Windows today in order to try building CXX (since my VMware
trial license expired), so I might as well leave it there until it
crashes and play with cygwin.

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From gward at mems-exchange.org  Wed May 24 04:49:23 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Tue, 23 May 2000 22:49:23 -0400
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
Message-ID: <20000523224923.A1008@mems-exchange.org>

My post on this from last week was met with a deafening silence, so I
will try to be short and to-the-point this time:

   Why are shared extensions on Solaris linked with "ld -G" instead of
   "gcc -G" when gcc is the compiler used to compile Python and
   extensions?

Is it historical?  Ie. did some versions of Solaris and/or gcc not do
the right thing here?  Could we detect that bogosity in "configure", and
only use "ld -G" if it's necessary, and use "gcc -G" by default?

The reason that using "ld -G" is the wrong thing is that libgcc.a is not
referenced when creating the .so file.  If the object code happens to
reference functions in libgcc.a that are not referenced anywhere in the
Python core, then importing the .so fails.  This happens if there is a
64-bit divide in the object code.  See my post of May 19 for details.

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From fredrik at pythonware.com  Wed May 24 11:42:57 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 24 May 2000 11:42:57 +0200
Subject: [Python-Dev] String encoding
References: <392A590C.E41239D3@lemburg.com> <016b01bfc4ab$72be9e90$0500a8c0@secret.pythonware.com> <392A9A0C.2E297072@lemburg.com> <024001bfc4cd$68210f00$f2a6b5d4@hagrid>
Message-ID: <009b01bfc564$71606f10$0500a8c0@secret.pythonware.com>

> one might of course the system encoding if the user actually calls setlocale,

I think that was supposed to be:

  one might of course SET the system encoding ONLY if the user actually calls setlocale,

or something...

</F>




From gmcm at hypernet.com  Wed May 24 14:24:20 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Wed, 24 May 2000 08:24:20 -0400
Subject: [Python-Dev] Supporting non-Microsoft compilers
In-Reply-To: <20000523224352.A997@mems-exchange.org>
Message-ID: <1252951401-112791664@hypernet.com>

Greg Ward wrote:

> However, it seems like it would be nice if people could build
> Python itself with (eg.) cygwin's gcc or Borland's compiler.  (It
> might be essential to properly support building extensions with
> gcc.)  Has anyone done anything towards that goal?

Robert Kern (mingw32) and Gordon Williams (Borland).

> It appears
> that there is at least one patch floating around that advises
> people to hack up their installed config.h, and drop a
> libpython.a somewhere in the installation, in order to compile
> extensions with cygwin gcc and/or mingw32.  This strikes me as
> sub-optimal: can at least the required changes to config.h be
> made to allow building Python with one of the Windows gcc ports?

Robert's starship pages (kernr/mingw32) has a config.h 
patched for mingw32.

I believe someone else built Python using cygwin without 
much trouble. But mingw32 is the preferred target - cygwin is 
slow, doesn't thread, has a viral GPL license and only gets 
along with binaries built with cygwin.
 
Robert's web pages talk about a patched mingw32. I don't 
*think* that's true anymore (at least I found no problems in
my limited testing of an unpatched mingw32). The difference 
between mingw32 and cygwin is just what runtime they're built 
for.


- Gordon



From guido at python.org  Wed May 24 16:17:29 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 24 May 2000 09:17:29 -0500
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
In-Reply-To: Your message of "Tue, 23 May 2000 10:39:11 +0200."
             <m12uADY-000DieC@artcom0.artcom-gmbh.de> 
References: <m12uADY-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005241417.JAA07367@cj20424-a.reston1.va.home.com>

> I agree.  But I think we should at least extend the documentation
> of 'tempfile' (Fred?) to guide people not to write Pythoncode like
> 	mytemp = open(tempfile.mktemp(), "w")
> in programs that are intended to be used on Unix systems by arbitrary
> users (possibly 'root').  Even better:  Someone with enough spare time 
> should add a new function 'mktempfile()', which creates a temporary 
> file and takes care of the security issue and then returns the file
> handle.  This implementation must take care of race conditions using
> 'os.open' with the following flags:
> 
>        O_CREAT If the file does not exist it will be created.
>        O_EXCL  When used with O_CREAT, if the file already  exist
> 	       it is  an error and the open will fail. 

Have you read a recent (CVS) version of tempfile.py?  It has all this
in the class TemporaryFile()!
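
(For readers following along: the race-free part is really just a couple
of lines around os.open(); a simplified sketch, not the actual tempfile
code, would be something like:)

    import os

    def open_exclusive(filename, mode=0600):
        # O_EXCL together with O_CREAT makes the open fail if the file
        # already exists, which closes the symlink/race window
        fd = os.open(filename, os.O_CREAT | os.O_EXCL | os.O_RDWR, mode)
        return os.fdopen(fd, "w+b")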

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Wed May 24 17:11:12 2000
From: guido at python.org (Guido van Rossum)
Date: Wed, 24 May 2000 10:11:12 -0500
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: Your message of "Tue, 23 May 2000 17:23:48 +0200."
             <m12uGX6-000DieC@artcom0.artcom-gmbh.de> 
References: <m12uGX6-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005241511.KAA07512@cj20424-a.reston1.va.home.com>

>    3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files

I agree that this is the correct solution.

> No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or 
> some such somewhere in the 1.6 unmarshaller should do the trick.
> But I don't want to submit a patch, if God^H^HGuido thinks, this isn't
> worth the effort. <wink>

That's BDFL for you, thank you. ;-)

Before accepting the trivial patch, I would like to see some analysis
that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.
This means you have to prove that (a) the 1.5.2 marshal format is a
subset of the 1.6 marshal format (easy enough probably) and (b) the
1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
one seems a little trickier; I don't remember if we moved opcodes or 
changed existing opcodes' semantics.  You may be lucky, but it will
cause an extra constraint on the evolution of the bytecode, so I'm
somewhat reluctant.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From ping at lfw.org  Wed May 24 17:56:49 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Wed, 24 May 2000 08:56:49 -0700 (PDT)
Subject: [Python-Dev] 1.6 release date
Message-ID: <Pine.LNX.4.10.10005240855340.465-100000@localhost>

Sorry if i missed an earlier announcement on this topic.

The web page about 1.6 currently says that Python 1.6 will
be released on June 1.  Is that still the target date?


-- ?!ng




From tismer at tismer.com  Wed May 24 20:37:05 2000
From: tismer at tismer.com (Christian Tismer)
Date: Wed, 24 May 2000 20:37:05 +0200
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in 
 Python 1.6. Why not?
References: <m12uGX6-000DieC@artcom0.artcom-gmbh.de> <200005241511.KAA07512@cj20424-a.reston1.va.home.com>
Message-ID: <392C2151.93A0DF24@tismer.com>


Guido van Rossum wrote:
> 
> >    3. Patch the 1.6 unmarshaller to support loading 1.5.2 .pyc files
> 
> I agree that this is the correct solution.
> 
> > No choice looks very attractive.  Adding a '|| (magic == 0x994e)' or
> > some such somewhere in the 1.6 unmarshaller should do the trick.
> > But I don't want to submit a patch, if God^H^HGuido thinks, this isn't
> > worth the effort. <wink>
> 
> That's BDFL for you, thank you. ;-)
> 
> Before accepting the trivial patch, I would like to see some analysis
> that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.
> This means you have to prove that (a) the 1.5.2 marshal format is a
> subset of the 1.6 marshal format (easy enough probably) and (b) the
> 1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
> one seems a little trickier; I don't remember if we moved opcodes or
> changed existing opcodes' semantics.  You may be lucky, but it will
> cause an extra constraint on the evolution of the bytecode, so I'm
> somewhat reluctant.

Be assured, I know the opcodes by heart.
We only appended to the end of opcode space, there are no changes.
But I can't tell about marshal.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From gstein at lyra.org  Wed May 24 22:15:24 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 24 May 2000 13:15:24 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <009b01bfc564$71606f10$0500a8c0@secret.pythonware.com>
Message-ID: <Pine.LNX.4.10.10005241313300.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Fredrik Lundh wrote:
> > one might of course the system encoding if the user actually calls setlocale,
> 
> I think that was supposed to be:
> 
>   one might of course SET the system encoding ONLY if the user actually calls setlocale,
> 
> or something...

Bleh. Global switches are bogus. Since you can't depend on the setting,
and you can't change it (for fear of busting something else), then you
have to be explicit about your encoding all the time. Since you're never
going to rely on a global encoding, then why keep it?

This global encoding (per thread or not) just reminds me of the single
hook for import, all over again.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/





From pf at artcom-gmbh.de  Wed May 24 23:34:19 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Wed, 24 May 2000 23:34:19 +0200 (MEST)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: <200005241511.KAA07512@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 24, 2000 10:11:12 am"
Message-ID: <m12uinD-000DieC@artcom0.artcom-gmbh.de>

[...about accepting 1.5.2 generated .pyc files...]

Guido van Rossum:
> Before accepting the trivial patch, I would like to see some analysis
> that shows that in fact all 1.5.2 .pyc files work correctly with 1.6.

Would it be sufficient, if a Python 1.6a2 interpreter executable containing
such a trivial patch is able to process the test suite in a 1.5.2 tree with 
all the .py-files removed?  But some list.append calls with multiple args 
might cause errors.

> This means you have to prove that (a) the 1.5.2 marshal format is a
> subset of the 1.6 marshal format (easy enough probably) and (b) the
> 1.5.2 bytecode opcodes are a subset of the 1.6 bytecode opcodes.  That
> one seems a little trickier; I don't remember if we moved opcodes or 
> changed existing opcodes' semantics.  You may be lucky, but it will
> cause an extra constraint on the evolution of the bytecode, so I'm
> somewhat reluctant.

I feel the byte code format is rather mature and future evolution
is unlikely to remove or move opcodes to new values or change the 
semantics of existing opcodes in an incompatible way.  As has been
shown, it is even possible to solve the 1/2 == 0.5 issue with
upward compatible extension of the format.

But I feel unable to provide a formal proof other than comparing
1.5.2/Include/opcode.h, 1.5.2/Python/marshal.c and import.c
with the 1.6 ones.

There are certainly others here on python-dev who can do better.
Christian?

BTW: import.c contains the  following comment:
/* XXX Perhaps the magic number should be frozen and a version field
   added to the .pyc file header? */

Judging from my decade long experience with exotic image and CAD data 
formats I think this is always the way to go for binary data files.  
Using this method newer versions of a program can always recognize
the file format version and convert files generated by older versions
in an appropriate way.

Regards, Peter



From esr at thyrsus.com  Thu May 25 00:02:15 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Wed, 24 May 2000 18:02:15 -0400
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: <m12uinD-000DieC@artcom0.artcom-gmbh.de>; from pf@artcom-gmbh.de on Wed, May 24, 2000 at 11:34:19PM +0200
References: <200005241511.KAA07512@cj20424-a.reston1.va.home.com> <m12uinD-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000524180215.A10281@thyrsus.com>

Peter Funk <pf at artcom-gmbh.de>:
> BTW: import.c contains the  following comment:
> /* XXX Perhaps the magic number should be frozen and a version field
>    added to the .pyc file header? */
> 
> Judging from my decade long experience with exotic image and CAD data 
> formats I think this is always the way to go for binary data files.  
> Using this method newer versions of a program can always recognize
> the file format version and convert files generated by older versions
> in an appropriate way.

I have similar experience, notably with hacking graphics file formats.
I concur with this recommendation.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The end move in politics is always to pick up a gun.
	-- R. Buckminster Fuller



From gstein at lyra.org  Wed May 24 23:58:48 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 24 May 2000 14:58:48 -0700 (PDT)
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in
 Python 1.6. Why not?
In-Reply-To: <20000524180215.A10281@thyrsus.com>
Message-ID: <Pine.LNX.4.10.10005241457000.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Eric S. Raymond wrote:
> Peter Funk <pf at artcom-gmbh.de>:
> > BTW: import.c contains the  following comment:
> > /* XXX Perhaps the magic number should be frozen and a version field
> >    added to the .pyc file header? */
> > 
> > Judging from my decade long experience with exotic image and CAD data 
> > formats I think this is always the way to go for binary data files.  
> > Using this method newer versions of a program can always recognize
> > the file format version and convert files generated by older versions
> > in an appropriate way.
> 
> I have similar experience, notably with hacking graphics file formats.
> I concur with this recommendation.

One more +1 here.

In another thread (right now, actually), I'm discussing how you can hook
up Linux to recognize .pyc files and directly execute them with the Python
interpreter (e.g. no need for #!/usr/bin/env python at the head of the
file). But if that magic number keeps changing, then it makes it a bit
harder to set this up.
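
For concreteness, one way to do this is via the kernel's binfmt_misc
facility.  A minimal, untested sketch (the interpreter path is an
assumption; the magic bytes are the current 1.6 value quoted elsewhere
in this thread):

    # Register .pyc files with binfmt_misc, keyed on the magic number,
    # so the kernel hands them to the Python interpreter directly.
    entry = r':pyc:M::\xfc\xc4\x0d\x0a::/usr/local/bin/python:'
    f = open('/proc/sys/fs/binfmt_misc/register', 'w')
    f.write(entry)
    f.close()

Of course, every time the magic number changes, the registration has to
be redone with the new bytes -- which is exactly the annoyance above.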

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From akuchlin at mems-exchange.org  Thu May 25 00:22:46 2000
From: akuchlin at mems-exchange.org (Andrew Kuchling)
Date: Wed, 24 May 2000 18:22:46 -0400 (EDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <20000523204750.A6107@thyrsus.com>
References: <200005232333.TAA16068@amarok.cnri.reston.va.us>
	<20000523204750.A6107@thyrsus.com>
Message-ID: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>

Eric S. Raymond writes:
>Here's a function that ought to be in the Python wrapper associated with
>the module:

There currently is no such wrapper, but there probably should be.
Guess I'll rename the module to _curses, and add a curses.py file.  Or
should there be a curses package, instead?  That would leave room for
more future expansion.  Guido, any opinion?

--amk



From gstein at lyra.org  Thu May 25 00:38:07 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 24 May 2000 15:38:07 -0700 (PDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org>

On Wed, 24 May 2000, Andrew Kuchling wrote:
> Eric S. Raymond writes:
> >Here's a function that ought to be in the Python wrapper associated with
> >the module:

Dang. Deleted Eric's note accidentally. Note that the proposed wrapper can
be simplified by using try/finally.
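
Something along these lines (a minimal sketch; the name and the exact
set of mode switches are illustrative, not Eric's original):

    import curses

    def wrapper(func, *args):
        # Initialize curses, call func(stdscr, ...), and restore the
        # terminal no matter how func exits.
        stdscr = curses.initscr()
        try:
            curses.noecho()
            curses.cbreak()
            stdscr.keypad(1)
            return apply(func, (stdscr,) + args)
        finally:
            stdscr.keypad(0)
            curses.echo()
            curses.nocbreak()
            curses.endwin()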

> There currently is no such wrapper, but there probably should be.
> Guess I'll rename the module to _curses, and add a curses.py file.  Or
> should there be a curses package, instead?  That would leave room for
> more future expansion.  Guido, any opinion?

Just a file. IMO, a package would be overkill.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From esr at thyrsus.com  Thu May 25 02:26:49 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Wed, 24 May 2000 20:26:49 -0400
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Wed, May 24, 2000 at 06:22:46PM -0400
References: <200005232333.TAA16068@amarok.cnri.reston.va.us> <20000523204750.A6107@thyrsus.com> <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <20000524202649.B10384@thyrsus.com>

Andrew Kuchling <akuchlin at mems-exchange.org>:
> Eric S. Raymond writes:
> >Here's a function that ought to be in the Python wrapper associated with
> >the module:
> 
> There currently is no such wrapper, but there probably should be.
> Guess I'll rename the module to _curses, and add a curses.py file.  Or
> should there be a curses package, instead?  That would leave room for
> more future expansion.  Guido, any opinion?

I'll supply a field-editor function with Emacs-like bindings, too.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

Never trust a man who praises compassion while pointing a gun at you.



From fdrake at acm.org  Thu May 25 04:36:59 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Wed, 24 May 2000 19:36:59 -0700 (PDT)
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: <14636.22070.257835.933767@newcnri.cnri.reston.va.us>
Message-ID: <Pine.LNX.4.10.10005241933380.624-100000@mailhost.beopen.com>

On Wed, 24 May 2000, Andrew Kuchling wrote:
 > There currently is no such wrapper, but there probably should be.
 > Guess I'll rename the module to _curses, and add a curses.py file.  Or
 > should there be a curses package, instead?  That would leave room for
 > more future expansion.  Guido, any opinion?

  I think a package makes sense; some of the libraries that provide widget
sets on top of ncurses would be prime candidates for inclusion.
  The structure should probably be something like:

	curses/
	    __init__.py		# from _curses import *, docstring
	    _curses.so		# current curses module
	    ...


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From gstein at lyra.org  Thu May 25 12:58:27 2000
From: gstein at lyra.org (Greg Stein)
Date: Thu, 25 May 2000 03:58:27 -0700 (PDT)
Subject: [Python-Dev] Larry's need for metacharacters...
Message-ID: <Pine.LNX.4.10.10005250355450.13822-100000@nebula.lyra.org>

[ paraphrased from a LWN letter to the editor ]

Regarding the note posted here last week about Perl development stopping
cuz Larry can't figure out any more characters to place after the '$'
character (to create "special" things) ...

Note that Larry became interested in Unicode a few years ago...

Note that Perl now supports Unicode throughout... *including* variable
names...

Coincidence? I think not!

$\uAB56 = 1;


:-)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Thu May 25 14:22:09 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 25 May 2000 14:22:09 +0200
Subject: [Python-Dev] String encoding
References: <Pine.LNX.4.10.10005241313300.7932-100000@nebula.lyra.org>
Message-ID: <392D1AF1.5AA15F2F@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 24 May 2000, Fredrik Lundh wrote:
> > > one might of course the system encoding if the user actually calls setlocale,
> >
> > I think that was supposed to be:
> >
> >   one might of course SET the system encoding ONLY if the user actually calls setlocale,
> >
> > or something...
> 
> Bleh. Global switches are bogus. Since you can't depend on the setting,
> and you can't change it (for fear of busting something else),

Sure you can: in site.py before any other code using Unicode
gets executed.

> then you
> have to be explicit about your encoding all the time. Since you're never
> going to rely on a global encoding, then why keep it?

For the same reason you use setlocale() in C (and Python): to
make programs portable to other locales without too much
fuss.

> This global encoding (per thread or not) just reminds me of the single
> hook for import, all over again.

Think of it as a configuration switch which is made settable
via a Python interface -- much like the optimize switch or
the debug switch (which are settable via Python APIs in mxTools).
The per-thread implementation is mainly a design question: I
think globals should always be implemented on a per-thread basis.

Hmm, I wish Guido would comment on the idea of keeping the
runtime settable encoding...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at python.org  Thu May 25 17:30:26 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:30:26 -0500
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: Your message of "Tue, 23 May 2000 22:49:23 -0400."
             <20000523224923.A1008@mems-exchange.org> 
References: <20000523224923.A1008@mems-exchange.org> 
Message-ID: <200005251530.KAA11785@cj20424-a.reston1.va.home.com>

[Greg Ward]
> My post on this from last week was met with a deafening silence, so I
> will try to be short and to-the-point this time:
> 
>    Why are shared extensions on Solaris linked with "ld -G" instead of
>    "gcc -G" when gcc is the compiler used to compile Python and
>    extensions?
> 
> Is it historical?  Ie. did some versions of Solaris and/or gcc not do
> the right thing here?  Could we detect that bogosity in "configure", and
> only use "ld -G" if it's necessary, and use "gcc -G" by default?
> 
> The reason that using "ld -G" is the wrong thing is that libgcc.a is not
> referenced when creating the .so file.  If the object code happens to
> reference functions in libgcc.a that are not referenced anywhere in the
> Python core, then importing the .so fails.  This happens if there is a
> 64-bit divide in the object code.  See my post of May 19 for details.

Two excuses: (1) long ago, you really needed to use ld instead of cc
to create a shared library, because cc didn't recognize the flags or
did other things that shouldn't be done to shared libraries; (2) I
didn't know there was a problem with using ld.

Since you have now provided a patch which seems to work, why don't you
check it in...?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Thu May 25 17:35:10 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:35:10 -0500
Subject: [Python-Dev] 1.6 release date
In-Reply-To: Your message of "Wed, 24 May 2000 08:56:49 MST."
             <Pine.LNX.4.10.10005240855340.465-100000@localhost> 
References: <Pine.LNX.4.10.10005240855340.465-100000@localhost> 
Message-ID: <200005251535.KAA11834@cj20424-a.reston1.va.home.com>

[Ping]
> The web page about 1.6 currently says that Python 1.6 will
> be released on June 1.  Is that still the target date?

Obviously I won't make that date...  I'm holding back an official
announcement of the delay until next week so I can combine it with
some good news. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Thu May 25 17:51:44 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:51:44 -0500
Subject: [Python-Dev] importing .pyc-files generated by Python 1.5.2 in Python 1.6. Why not?
In-Reply-To: Your message of "Wed, 24 May 2000 23:34:19 +0200."
             <m12uinD-000DieC@artcom0.artcom-gmbh.de> 
References: <m12uinD-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005251551.KAA11897@cj20424-a.reston1.va.home.com>

Given Christian Tismer's testimonial and inspection of marshal.c, I
think Peter's small patch is acceptable.

A bigger question is whether we should freeze the magic number and add
a version number.  In theory I'm all for that, but it means more
changes; there are several tools (e.g. Lib/py_compile.py,
Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
intimate knowledge of the .pyc file format that would have to be
modified to match.

The current format of a .pyc file is as follows:

bytes 0-3   magic number
bytes 4-7   timestamp (mtime of .py file)
bytes 8-*   marshalled code object

The magic number itself is used to convey various bits of information,
all implicit:

- the Python version
- whether \r and \n are swapped (some old Mac compilers did this)
- whether all string literals are Unicode (experimental -U flag)

The current (1.6) value of the magic number (as a string -- the .pyc
file format is byte order independent) is '\374\304\015\012' on most
platforms; it's '\374\304\012\015' for the old Mac compilers
mentioned; and it's '\375\304\015\012' with -U.
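
For illustration, a minimal sketch of a reader for this layout (error
handling kept rudimentary):

    import imp, marshal, struct

    def read_pyc(path):
        # 4-byte magic, 4-byte mtime, then the marshalled code object.
        f = open(path, 'rb')
        if f.read(4) != imp.get_magic():
            raise ValueError, 'bad or out-of-date magic number'
        # the mtime is written little-endian by marshal's w_long()
        mtime = struct.unpack('<l', f.read(4))[0]
        code = marshal.load(f)
        f.close()
        return mtime, code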

Can anyone come up with a proposal?  I'm swamped!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Thu May 25 17:52:54 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 10:52:54 -0500
Subject: [Python-Dev] Updated curses module in CVS
In-Reply-To: Your message of "Wed, 24 May 2000 15:38:07 MST."
             <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org> 
References: <Pine.LNX.4.10.10005241527030.7932-100000@nebula.lyra.org> 
Message-ID: <200005251552.KAA11922@cj20424-a.reston1.va.home.com>

> > There currently is no such wrapper, but there probably should be.
> > Guess I'll rename the module to _curses, and add a curses.py file.  Or
> > should there be a curses package, instead?  That would leave room for
> > more future expansion.  Guido, any opinion?

Whatever -- either way is fine with me!

--Guido van Rossum (home page: http://www.python.org/~guido/)



From DavidA at ActiveState.com  Thu May 25 23:42:51 2000
From: DavidA at ActiveState.com (David Ascher)
Date: Thu, 25 May 2000 14:42:51 -0700
Subject: [Python-Dev] ActiveState news
Message-ID: <PLEJJNOHDIGGLDPOGPJJMELECEAA.DavidA@ActiveState.com>

While not a technical point, I thought I'd mention to this group that
ActiveState just announced several things, including some Python-related
projects.  See www.ActiveState.com for details.

--david

PS: In case anyone's still under the delusion that cool Python jobs are hard
to find, let me know. =)




From bwarsaw at python.org  Sat May 27 00:42:10 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Fri, 26 May 2000 18:42:10 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
Message-ID: <14638.64962.118047.467438@localhost.localdomain>

Hi all,

I've taken /F's C implementation of the standard class-based
exceptions, implemented the stuff he left out, proofread for reference
counting issues, hacked a bit more, and integrated it with the 1.6
interpreter.  Everything seems to work well; i.e. the regression test
suite passes and I don't get any core dumps ;).

I don't have the ability right now to Purify things[1], but I've tried
to be very careful in handling reference counting.  Since I've been
hacking on this all day, it could definitely use another set of eyes.
I think rather than email a huge patch kit, I'll just go ahead and
check the changes in.  Please take a look and give it a hard twist.

Thanks to /F for the excellent head start!
-Barry

[1] Purify was one of the coolest products on Solaris, but alas it
doesn't seem like they'll ever support Linux.  What do you all use to
do similar memory verification tests on Linux?  Or do you just not?



From bwarsaw at python.org  Sat May 27 01:24:48 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Fri, 26 May 2000 19:24:48 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
Message-ID: <14639.1984.920885.635040@localhost.localdomain>

I'm all done checking this stuff in.
-Barry



From gstein at lyra.org  Fri May 26 01:29:19 2000
From: gstein at lyra.org (Greg Stein)
Date: Thu, 25 May 2000 16:29:19 -0700 (PDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Modules _exceptions.c,NONE,1.1
In-Reply-To: <200005252318.QAA25455@slayer.i.sourceforge.net>
Message-ID: <Pine.LNX.4.10.10005251627061.16846-100000@nebula.lyra.org>

On Thu, 25 May 2000, Barry Warsaw wrote:
> Update of /cvsroot/python/python/dist/src/Modules
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv25441
> 
> Added Files:
> 	_exceptions.c 
> Log Message:
> Built-in class-based standard exceptions.  Written by Fredrik Lundh.
> Modified, proofread, and integrated for Python 1.6 by Barry Warsaw.

Since the added files are not emailed, you can easily see this file at:

http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/python/dist/src/Modules/_exceptions.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=python


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gward at python.net  Fri May 26 00:33:54 2000
From: gward at python.net (Greg Ward)
Date: Thu, 25 May 2000 18:33:54 -0400
Subject: [Python-Dev] Terminology question
Message-ID: <20000525183354.A422@beelzebub>

A question of terminology: frequently in the Distutils docs I need to
refer to the package-that-is-not-a-package, ie. the "root" or "empty"
package.  I can't decide if I prefer "root package", "empty package" or
what.  ("Empty" just means the *name* is empty, so it's probably not a
very good thing to say "empty package" -- but "package with no name" or
"unnamed package" aren't much better.)

Is there some accepted convention that I have missed?

Here's the definition I've just written for the "Distribution Python
Modules" manual:

\item[root package] the ``package'' that modules not in a package live
  in.  The vast majority of the standard library is in the root package,
  as are many small, standalone third-party modules that don't belong to
  a larger module collection.  (The root package isn't really a package,
  since it doesn't have an \file{\_\_init\_\_.py} file.  But we have to
  call it something.)

Confusing enough?  I thought so...

        Greg
-- 
Greg Ward - Unix nerd                                   gward at python.net
http://starship.python.net/~gward/
Beware of altruism.  It is based on self-deception, the root of all evil.



From guido at python.org  Fri May 26 03:50:24 2000
From: guido at python.org (Guido van Rossum)
Date: Thu, 25 May 2000 20:50:24 -0500
Subject: [Python-Dev] Terminology question
In-Reply-To: Your message of "Thu, 25 May 2000 18:33:54 -0400."
             <20000525183354.A422@beelzebub> 
References: <20000525183354.A422@beelzebub> 
Message-ID: <200005260150.UAA10169@cj20424-a.reston1.va.home.com>

Greg,

If you have to refer to it as a package (which I don't doubt), the
correct name is definitely the "root package".

A possible clarification of your glossary entry:

\item[root package] the root of the hierarchy of packages.  (This
isn't really a package, since it doesn't have an
\file{\_\_init\_\_.py} file.  But we have to call it something.)  The
vast majority of the standard library is in the root package, as are
many small, standalone third-party modules that don't belong to a
larger module collection.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gward at python.net  Fri May 26 04:22:03 2000
From: gward at python.net (Greg Ward)
Date: Thu, 25 May 2000 22:22:03 -0400
Subject: [Python-Dev] Where to install non-code files
Message-ID: <20000525222203.A1114@beelzebub>

Another one for the combined distutils/python-dev braintrust; apologies
to those of you on both lists, but this is yet another distutils issue
that treads on python-dev territory.

The problem is this: some module distributions need to install files
other than code (modules, extensions, and scripts).  One example close
to home is the Distutils; it has a "system config file" and will soon
have a stub executable for creating Windows installers.

On Windows and Mac OS, clearly these should go somewhere under
sys.prefix: this is the directory for all things Python, including
third-party module distributions.  If Brian Hooper distributes a module
"foo" that requires a data file containing character encoding data (yes,
this is based on a true story), then the module belongs in (eg.)
C:\Python and the data file in (?) C:\Python\Data.  (Maybe
C:\Python\Data\foo, but that's a minor wrinkle.)

Any disagreement so far?

Anyways, what's bugging me is where to put these files on Unix.
<prefix>/lib/python1.x is *almost* the home for all things Python, but
not quite.  (Let's ignore platform-specific files for now: they don't
count as "miscellaneous data files", which is what I'm mainly concerned
with.)

Currently, misc. data files are put in <prefix>/share, and the
Distutil's config file is searched for in the directory of the distutils
package -- ie. site-packages/distutils under 1.5.2 (or
~/lib/python/distutils if that's where you installed it, or ./distutils
if you're running from the source directory, etc.).  I'm not thrilled
with either of these.

My inclination is to nominate a directory under <prefix>lib/python1.x
for these sort of files: not sure if I want to call it "etc" or "share"
or "data" or what, but it would be treading in Python-space.  It would
break the ability to have a standard library package called "etc" or
"share" or "data" or whatever, but dammit it's convenient.

Better ideas?

        Greg
-- 
Greg Ward - "always the quiet one"                      gward at python.net
http://starship.python.net/~gward/
I have many CHARTS and DIAGRAMS..



From mhammond at skippinet.com.au  Fri May 26 04:35:47 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 26 May 2000 12:35:47 +1000
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000525222203.A1114@beelzebub>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEKACLAA.mhammond@skippinet.com.au>

> On Windows and Mac OS, clearly these should go somewhere under
> sys.prefix: this is the directory for all things Python, including
> third-party module distributions.  If Brian Hooper distributes a module
> "foo" that requires a data file containing character encoding data (yes,
> this is based on a true story), then the module belongs in (eg.)
> C:\Python and the data file in (?) C:\Python\Data.  (Maybe
> C:\Python\Data\foo, but that's a minor wrinkle.)
>
> Any disagreement so far?

A little.  I don't think we need a new dump for arbitrary files that no one
can associate with their application.

Why not put the data with the code?  It is quite trivial for a Python
package or module to find its own location, and this way we are not
dependent on anything.
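
For example, something like this minimal sketch inside the module
itself (the data file name is illustrative):

    import os

    # Directory this module was loaded from; a data file shipped
    # alongside the code can be found relative to it.
    _here = os.path.split(os.path.abspath(__file__))[0]
    DATAFILE = os.path.join(_here, 'foo.dat')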

Why assume packages are installed _under_ Python?  Why not just assume the
package is _reachable_ by Python.  Once our package/module is being
executed by Python, we know exactly where we are.

On my machine, there is no "data" equivalent; the closest would be
"python-cvs\pcbuild\data", and that certainly doesn't make sense.  Why can't
I just place it where I put all my other Python extensions, ensure it is on
the PythonPath, and have it "just work"?

It sounds a little complicated - do we provide an API for this magic
location, or does everybody cut-and-paste a reference implementation for
locating it?  Either way sounds pretty bad - the API shouldn't be
distutils-dependent (I may not have installed this package via distutils),
and really Python itself shouldn't care about this...

So all in all, I don't think it is a problem we need to push up to this
level - let each package author do whatever makes sense, and point out how
trivial it would be if you assumed code and data in the same place/tree.

[If the data is considered read/write, then you need a better answer
anyway, as you can't assume "c:\python\data" is writable (when actually
running the code) anymore than "c:\python\my_package" is]

Mark.




From fdrake at acm.org  Fri May 26 05:05:40 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Thu, 25 May 2000 20:05:40 -0700 (PDT)
Subject: [Python-Dev] Re: [Distutils] Terminology question
In-Reply-To: <20000525183354.A422@beelzebub>
Message-ID: <Pine.LNX.4.10.10005252003180.7550-100000@mailhost.beopen.com>

On Thu, 25 May 2000, Greg Ward wrote:
 > A question of terminology: frequently in the Distutils docs I need to
 > refer to the package-that-is-not-a-package, ie. the "root" or "empty"
 > package.  I can't decide if I prefer "root package", "empty package" or
 > what.  ("Empty" just means the *name* is empty, so it's probably not a
 > very good thing to say "empty package" -- but "package with no name" or
 > "unnamed package" aren't much better.)

  Well, it's not a package -- it's similar to Java's unnamed package, but
the idea that it's a package has never been advanced.  Why not just call
it the global module space (or namespace)?  That's the only way I've heard
it described, and it's more clear than "empty package" or "unnamed
package".


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From fdrake at acm.org  Fri May 26 06:47:10 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Thu, 25 May 2000 21:47:10 -0700 (PDT)
Subject: [Python-Dev] C implementation of exceptions module
In-Reply-To: <14638.64962.118047.467438@localhost.localdomain>
Message-ID: <Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>

On Fri, 26 May 2000, Barry A. Warsaw wrote:
 > [1] Purify was one of the coolest products on Solaris, but alas it
 > doesn't seem like they'll ever support Linux.  What do you all use to
 > do similar memory verification tests on Linux?  Or do you just not?

  I'm not aware of anything as good, but there's "memprof" (check for it
with "rpm -q"), and I think a few others.  Checker is a malloc() & friends
implementation that can be used to detect memory errors:

	http://www.gnu.org/software/checker/checker.html

and there's ElectricFence from Bruce Perens:

	http://www.perens.com/FreeSoftware/

(There's a MailMan-related link there as well that you might be interested
in!)
  There may be others, and I can't speak to the quality of these as I've
not used any of them (yet).  memprof and ElectricFence were installed on
my Mandrake box without my doing anything about it; I don't know if RedHat
installs them on a stock development box.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From tim_one at email.msn.com  Fri May 26 07:27:13 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 26 May 2000 01:27:13 -0400
Subject: [Python-Dev] ActiveState news
In-Reply-To: <PLEJJNOHDIGGLDPOGPJJMELECEAA.DavidA@ActiveState.com>
Message-ID: <000501bfc6d3$0c3a2700$c52d153f@tim>

[David Ascher]
> While not a technical point, I thought I'd mention to this group that
> ActiveState just announced several things, including some Python-related
> projects.  See www.ActiveState.com for details.

Thanks for pointing that out!  I just took a natural opportunity to plug the
Visual Studio integration on c.l.py:  it's very important that we do
everything we can to support and promote commercial Python endeavors at
every conceivable opportunity <wink>.

> PS: In case anyone's still under the delusion that cool Python
> jobs are hard to find, let me know. =)

Ditto cool speech recognition jobs in small companies about to be devoured
by Belgian conquerors.  And if anyone is under the illusion that golden
handcuffs don't bind, I can set 'em  straight on that one too.

hiring-is-darned-hard-everywhere-ly y'rs  - tim





From gstein at lyra.org  Fri May 26 09:48:12 2000
From: gstein at lyra.org (Greg Stein)
Date: Fri, 26 May 2000 00:48:12 -0700 (PDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise
 on case-sensitivity)
In-Reply-To: <000401bfc6d3$0afb3e60$c52d153f@tim>
Message-ID: <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>

On Fri, 26 May 2000, Tim Peters wrote:
>...
> PS:  Barry's exception patch appears to have broken the CVS Windows build
> (nothing links anymore; all the PyExc_xxx symbols aren't found; no time to
> dig more now).

The .dsp file(s) need to be updated to include the new _exceptions.c file
in their build and link step. (the symbols moved there)

IMO, it seems it would be Better(tm) to put _exceptions.c into the Python/
directory. Dependencies from the core out to Modules/ seem a bit weird.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From pf at artcom-gmbh.de  Fri May 26 10:23:18 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 10:23:18 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <200005251551.KAA11897@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 25, 2000 10:51:44 am"
Message-ID: <m12vFOo-000DieC@artcom0.artcom-gmbh.de>

[Guido van Rossum]:
> Given Christian Tismer's testimonial and inspection of marshal.c, I
> think Peter's small patch is acceptable.
> 
> A bigger question is whether we should freeze the magic number and add
> a version number.  In theory I'm all for that, but it means more
> changes; there are several tools (e.c. Lib/py_compile.py,
> Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> intimate knowledge of the .pyc file format that would have to be
> modified to match.
> 
> The current format of a .pyc file is as follows:
> 
> bytes 0-3   magic number
> bytes 4-7   timestamp (mtime of .py file)
> bytes 8-*   marshalled code object

Proposal:
The future format (Python 1.6 and newer) of a .pyc file should be as follows:

bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
bytes 4-7   a version number (which should be == 1 in Python 1.6)
bytes 8-11  timestamp (mtime of .py file) (same as earlier)
bytes 12-*  marshalled code object (same as earlier)

> The magic number itself is used to convey various bits of information,
> all implicit:
[...]
This mechanism to construct the magic number should not be changed.

But once again a new value must be chosen to prevent havoc with .pyc
files floating around from people who have already played with the
Python 1.6 alpha releases.  This change should definitely be the last
one ever needed during the future lifetime of Python.

The unmarshaller should do the following with the magic number read:
If the read magic is the old magic number from 1.5.2, skip reading a
version number and assume 0 as the version number.

If the read magic is this new value instead, it should also read the
version number and raise a new 'ByteCodeTooNew' exception if the read
version number is greater than a #defined version number of this
Python interpreter.

If future incompatible extensions to the byte code format will happen, 
then this number should be incremented to 2, 3 and so on.
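
In code, the reader side of this proposal might look roughly like the
following sketch (the new magic value and the exception used are
illustrative placeholders, not part of the proposal):

    import imp, marshal, struct

    OLD_MAGIC = imp.get_magic()      # stands in for the frozen 1.5.2 value
    NEW_MAGIC = '\373\313\015\012'   # hypothetical new, final value
    MY_VERSION = 1                   # version compiled into this interpreter

    def load_pyc(f):
        magic = f.read(4)
        if magic == OLD_MAGIC:
            version = 0
        elif magic == NEW_MAGIC:
            version = struct.unpack('<l', f.read(4))[0]
            if version > MY_VERSION:
                raise ImportError, 'byte code too new (version %d)' % version
        else:
            raise ImportError, 'bad magic number'
        mtime = struct.unpack('<l', f.read(4))[0]
        return mtime, marshal.load(f)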

For safety, 'imp.get_magic()' should return the old 1.5.2 magic
number and only 'imp.get_magic(imp.PYC_FINAL)' should return the new 
final magic number.  A new function 'imp.get_version()' should be 
introduced, which will return the current compiled in version number
of this Python interpreter.

Of course all Python modules reading .pyc files must be changed
accordingly, so that they are able to deal with new .pyc files.
This shouldn't be too hard.

This proposed change of the .pyc file format must be described in the final
Python 1.6 announcement, in case there are people out there who borrowed
code from 'Tools/scripts/checkpyc.py' or some such.

Regards, Peter



From mal at lemburg.com  Fri May 26 10:37:53 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 10:37:53 +0200
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
Message-ID: <392E37E1.75AC4D0E@lemburg.com>

> Update of /cvsroot/python/python/dist/src/Lib
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv25262
> 
> Modified Files:
>         exceptions.py 
> Log Message:
> For backwards compatibility, simply import everything from the
> _exceptions module, including __doc__.

Hmm, wasn't _exceptions supposed to be a *fall back* solution for
the case where the exceptions.py module is not found ? It now
looks like _exceptions replaces exceptions.py...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri May 26 12:48:05 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 12:48:05 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vFOo-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <392E5665.8CB1C260@lemburg.com>

Peter Funk wrote:
> 
> [Guido van Rossum]:
> > Given Christian Tismer's testimonial and inspection of marshal.c, I
> > think Peter's small patch is acceptable.
> >
> > A bigger question is whether we should freeze the magic number and add
> > a version number.  In theory I'm all for that, but it means more
> > changes; there are several tools (e.c. Lib/py_compile.py,
> > Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> > intimate knowledge of the .pyc file format that would have to be
> > modified to match.
> >
> > The current format of a .pyc file is as follows:
> >
> > bytes 0-3   magic number
> > bytes 4-7   timestamp (mtime of .py file)
> > bytes 8-*   marshalled code object
> 
> Proposal:
> The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> 
> bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> bytes 4-7   a version number (which should be == 1 in Python 1.6)
> bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> bytes 12-*  marshalled code object (same as earlier)

This will break all tools relying on having the code object available
in bytes[8:] and believe me: there are lots of those around ;-)

You cannot really change the file header, only add things to the end
of the PYC file...

Hmm, or perhaps we should move the version number to the code object
itself... after all, the changes we want to refer to
using the version number are located in the code object and not the
PYC file layout. Unmarshalling it would then raise the error.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gmcm at hypernet.com  Fri May 26 13:53:14 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 26 May 2000 07:53:14 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000525222203.A1114@beelzebub>
Message-ID: <1252780469-123073242@hypernet.com>

Greg Ward wrote:

[installing data files]

> On Windows and Mac OS, clearly these should go somewhere under
> sys.prefix: this is the directory for all things Python,
> including third-party module distributions.  If Brian Hooper
> distributes a module "foo" that requires a data file containing
> character encoding data (yes, this is based on a true story),
> then the module belongs in (eg.) C:\Python and the data file in
> (?) C:\Python\Data.  (Maybe C:\Python\Data\foo, but that's a
> minor wrinkle.)
> 
> Any disagreement so far?

Yeah. I tend to install stuff outside the sys.prefix tree and then 
use .pth files. I realize I'm, um, unique in this regard but I lost 
everything in some upgrade gone bad. (When a Windows de-
install goes wrong, your only option is to do some manual 
directory and registry pruning.)

I often do much the same on my Linux box, but I don't worry 
about it as much - upgrading is not "click and pray" there. 
(Hmm, I guess it is if you use rpms.)
 
So for Windows, I agree with Mark - put the data with the 
module. On a real OS, I guess I'd be inclined to put global 
data with the module, but user data in ~/.<something>.

> Greg Ward - "always the quiet one"                     
<snort>


- Gordon



From pf at artcom-gmbh.de  Fri May 26 13:50:02 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 13:50:02 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <392E5665.8CB1C260@lemburg.com> from "M.-A. Lemburg" at "May 26, 2000 12:48: 5 pm"
Message-ID: <m12vIcs-000DieC@artcom0.artcom-gmbh.de>

[M.-A. Lemburg]:
> > Proposal:
> > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > 
> > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > bytes 12-*  marshalled code object (same as earlier)
> 
> This will break all tools relying on having the code object available
> in bytes[8:] and believe me: there are lots of those around ;-)

In some way, this is intentional:  If these tools (are there really
that many out there that munge .pyc byte code files?) simply use
'imp.get_magic()' and then silently assume a specific content of the
marshalled code object, they probably need changes anyway, since the
code needed to deal with the new unicode object is missing from them.

> You cannot really change the file header, only add things to the end
> of the PYC file...

Why?  Will this idea really cause such earth-shaking grumbling?
Please review this in the context of my proposal to change 'imp.get_magic()'
to return the old 1.5.2 MAGIC when called without a parameter.

> Hmm, or perhaps we should move the version number to the code object
> itself... after all, the changes we want to refer to
> using the version number are located in the code object and not the
> PYC file layout. Unmarshalling it would then raise the error.

Since the file layout is a very thin layer around the marshalled
code object, this really makes no big difference to me.  But it
will be harder to come up with reasonable entries for /etc/magic [1]
and similar mechanisms.

Putting the version number at the end of the file is possible.
But such a solution is somewhat "dirty" and only gives the false
impression that the general file layout (pyc[8:] instead of pyc[12:])
is something you can rely on until the end of time.  Hardcoding the
size of an unpadded header (something like using buffer[8:]) is IMO
bad style anyway.

Regards, Peter
[1]: /etc/magic on Unices is a small textual data base used by the 'file' 
     command to identify the type of a file by looking at the first
     few bytes.  Unix file managers may either use /etc/magic directly
     or a similar scheme to associate files with mimetypes and/or default
     applications.



From guido at python.org  Fri May 26 15:10:30 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:10:30 -0500
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
In-Reply-To: Your message of "Fri, 26 May 2000 00:48:12 MST."
             <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org> 
References: <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org> 
Message-ID: <200005261310.IAA11256@cj20424-a.reston1.va.home.com>

> The .dsp file(s) need to be updated to include the new _exceptions.c file
> in their build and link step. (the symbols moved there)

I'll take care of this.

> IMO, it seems it would be Better(tm) to put _exceptions.c into the Python/
> directory. Dependencies from the core out to Modules/ seems a bit weird.

Good catch!  Since Barry's contemplating renaming it to exceptions.c
anyway that would be a good time to move it.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Fri May 26 15:13:06 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:13:06 -0500
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: Your message of "Fri, 26 May 2000 07:53:14 -0400."
             <1252780469-123073242@hypernet.com> 
References: <1252780469-123073242@hypernet.com> 
Message-ID: <200005261313.IAA11285@cj20424-a.reston1.va.home.com>

> So for Windows, I agree with Mark - put the data with the 
> module. On a real OS, I guess I'd be inclined to put global 
> data with the module, but user data in ~/.<something>.

Aha!  Good distinction.

Modifiable data needs to go in a per-user directory, even on Windows,
outside the Python tree.

But static data needs to go in the same directory as the module that
uses it.  (We use this in the standard test package, for example.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gward at mems-exchange.org  Fri May 26 14:24:23 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 08:24:23 -0400
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: <200005251530.KAA11785@cj20424-a.reston1.va.home.com>; from guido@python.org on Thu, May 25, 2000 at 10:30:26AM -0500
References: <20000523224923.A1008@mems-exchange.org> <200005251530.KAA11785@cj20424-a.reston1.va.home.com>
Message-ID: <20000526082423.A12100@mems-exchange.org>

On 25 May 2000, Guido van Rossum said:
> Two excuses: (1) long ago, you really needed to use ld instead of cc
> to create a shared library, because cc didn't recognize the flags or
> did other things that shouldn't be done to shared libraries; (2) I
> didn't know there was a problem with using ld.
> 
> Since you have now provided a patch which seems to work, why don't you
> check it in...?

Done.  I presume checking in configure.in and configure at the same time
is the right thing to do?  (I checked, and running "autoconf" on the
original configure.in regenerated exactly what's in CVS.)

        Greg



From thomas.heller at ion-tof.com  Fri May 26 14:28:49 2000
From: thomas.heller at ion-tof.com (Thomas Heller)
Date: Fri, 26 May 2000 14:28:49 +0200
Subject: [Distutils] Re: [Python-Dev] Where to install non-code files
References: <1252780469-123073242@hypernet.com>  <200005261313.IAA11285@cj20424-a.reston1.va.home.com>
Message-ID: <01ee01bfc70d$f1f17a20$4500a8c0@thomasnb>

[Guido writes]
> Modifyable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.
> 
This seems to be the value of the key "AppData" stored under
  HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders

Right?

Thomas




From guido at python.org  Fri May 26 15:35:40 2000
From: guido at python.org (Guido van Rossum)
Date: Fri, 26 May 2000 08:35:40 -0500
Subject: [Python-Dev] Extension on Solaris: ld -G or gcc -G?
In-Reply-To: Your message of "Fri, 26 May 2000 08:24:23 -0400."
             <20000526082423.A12100@mems-exchange.org> 
References: <20000523224923.A1008@mems-exchange.org> <200005251530.KAA11785@cj20424-a.reston1.va.home.com>  
            <20000526082423.A12100@mems-exchange.org> 
Message-ID: <200005261335.IAA11410@cj20424-a.reston1.va.home.com>

> Done.  I presume checking in configure.in and configure at the same time
> is the right thing to do?  (I checked, and running "autoconf" on the
> original configure.in regenerated exactly what's in CVS.)

Yes.  What I usually do is manually bump the version number in
configure before checking it in (it references the configure.in
version), but that's a minor nit...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From pf at artcom-gmbh.de  Fri May 26 14:36:36 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Fri, 26 May 2000 14:36:36 +0200 (MEST)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <200005261313.IAA11285@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 26, 2000  8:13: 6 am"
Message-ID: <m12vJLw-000DieC@artcom0.artcom-gmbh.de>

[Guido van Rossum]
[...]
> Modifyable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.

Is there a reliable algorithm to find a "per-user" directory on any
Win95/98/NT/2000 system?  On MacOS?  

Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
module would provide 'os.environ["HOME"]' similar to the posix
version?  This would certainly simplify the task of application
programmers intending to write portable applications.
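
Something like the following sketch, perhaps (the Windows variable
names are the usual NT conventions, not something Python guarantees):

    import os

    def get_homedir():
        # Prefer HOME; fall back to the HOMEDRIVE/HOMEPATH pair that
        # NT sets; give up and use the current directory otherwise.
        if os.environ.has_key('HOME'):
            return os.environ['HOME']
        if os.environ.has_key('HOMEDRIVE') and os.environ.has_key('HOMEPATH'):
            return os.environ['HOMEDRIVE'] + os.environ['HOMEPATH']
        return os.curdir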

Regards, Peter



From bwarsaw at python.org  Sat May 27 14:46:44 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Sat, 27 May 2000 08:46:44 -0400 (EDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise
 on case-sensitivity)
References: <000401bfc6d3$0afb3e60$c52d153f@tim>
	<Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>
Message-ID: <14639.50100.383806.969434@localhost.localdomain>

>>>>> "GS" == Greg Stein <gstein at lyra.org> writes:

    GS> On Fri, 26 May 2000, Tim Peters wrote:
    >> ...  PS: Barry's exception patch appears to have broken the CVS
    >> Windows build (nothing links anymore; all the PyExc_xxx symbols
    >> aren't found; no time to dig more now).

    GS> The .dsp file(s) need to be updated to include the new
    GS> _exceptions.c file in their build and link step. (the symbols
    GS> moved there)

    GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
    GS> the Python/ directory. Dependencies from the core out to
    GS> Modules/ seems a bit weird.

Guido made the suggestion to move _exceptions.c to exceptions.c
anyway.  Should we move the file to the other directory too?  Get out
your pluses and minuses.

-Barry



From bwarsaw at python.org  Sat May 27 14:49:01 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Sat, 27 May 2000 08:49:01 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
	<392E37E1.75AC4D0E@lemburg.com>
Message-ID: <14639.50237.999048.146898@localhost.localdomain>

>>>>> "M" == M  <mal at lemburg.com> writes:

    M> Hmm, wasn't _exceptions supposed to be a *fall back* solution
    M> for the case where the exceptions.py module is not found ? It
    M> now looks like _exceptions replaces exceptions.py...

I see no reason to keep both of them around.  Too much of a
synchronization headache.

-Barry



From mhammond at skippinet.com.au  Fri May 26 15:12:49 2000
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 26 May 2000 23:12:49 +1000
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <m12vJLw-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <ECEPKNMJLHAPFFJHDOJBGEKICLAA.mhammond@skippinet.com.au>

> Is there a reliable algorithm to find a "per-user" directory on any
> Win95/98/NT/2000 system?

Ahhh - where to start.  SHGetFolderLocation offers the following
alternatives:

CSIDL_APPDATA
Version 4.71. File system directory that serves as a common repository for
application-specific data. A typical path is C:\Documents and
Settings\username\Application Data

CSIDL_COMMON_APPDATA
Version 5.0. Application data for all users. A typical path is C:\Documents
and Settings\All Users\Application Data.

CSIDL_LOCAL_APPDATA
Version 5.0. File system directory that serves as a data repository for
local (non-roaming) applications. A typical path is C:\Documents and
Settings\username\Local Settings\Application Data.

CSIDL_PERSONAL
File system directory that serves as a common repository for documents. A
typical path is C:\Documents and Settings\username\My Documents.

Plus a few I didn't bother listing...

<sigh>
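
That said, with the Win32 extensions installed, something like this
minimal sketch should resolve the per-user application data directory
(untested here; assumes the win32com.shell module is available):

    from win32com.shell import shell, shellcon

    # Resolve CSIDL_APPDATA to a real path for the current user.
    pidl = shell.SHGetSpecialFolderLocation(0, shellcon.CSIDL_APPDATA)
    appdata = shell.SHGetPathFromIDList(pidl)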

Mark.




From jlj at cfdrc.com  Fri May 26 15:20:34 2000
From: jlj at cfdrc.com (Lyle Johnson)
Date: Fri, 26 May 2000 08:20:34 -0500
Subject: [Python-Dev] RE: [Distutils] Terminology question
In-Reply-To: <20000525183354.A422@beelzebub>
Message-ID: <003c01bfc715$2c8fde90$4e574dc0@cfdrc.com>

How about "PWAN", the "package without a name"? ;)

> -----Original Message-----
> From: distutils-sig-admin at python.org
> [mailto:distutils-sig-admin at python.org]On Behalf Of Greg Ward
> Sent: Thursday, May 25, 2000 5:34 PM
> To: distutils-sig at python.org; python-dev at python.org
> Subject: [Distutils] Terminology question
> 
> 
> A question of terminology: frequently in the Distutils docs I need to
> refer to the package-that-is-not-a-package, ie. the "root" or "empty"
> package.  I can't decide if I prefer "root package", "empty package" or
> what.  ("Empty" just means the *name* is empty, so it's probably not a
> very good thing to say "empty package" -- but "package with no name" or
> "unnamed package" aren't much better.)
> 
> Is there some accepted convention that I have missed?
> 
> Here's the definition I've just written for the "Distribution Python
> Modules" manual:
> 
> \item[root package] the ``package'' that modules not in a package live
>   in.  The vast majority of the standard library is in the root package,
>   as are many small, standalone third-party modules that don't belong to
>   a larger module collection.  (The root package isn't really a package,
>   since it doesn't have an \file{\_\_init\_\_.py} file.  But we have to
>   call it something.)
> 
> Confusing enough?  I thought so...
> 
>         Greg
> -- 
> Greg Ward - Unix nerd                                   gward at python.net
> http://starship.python.net/~gward/
> Beware of altruism.  It is based on self-deception, the root of all evil.
> 
> _______________________________________________
> Distutils-SIG maillist  -  Distutils-SIG at python.org
> http://www.python.org/mailman/listinfo/distutils-sig
> 



From skip at mojam.com  Fri May 26 10:25:49 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 26 May 2000 03:25:49 -0500 (CDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Lib exceptions.py,1.18,1.19
In-Reply-To: <14639.50237.999048.146898@localhost.localdomain>
References: <200005252315.QAA25271@slayer.i.sourceforge.net>
	<392E37E1.75AC4D0E@lemburg.com>
	<14639.50237.999048.146898@localhost.localdomain>
Message-ID: <14638.13581.195350.511944@beluga.mojam.com>

    M> Hmm, wasn't _exceptions supposed to be a *fall back* solution for the
    M> case where the exceptions.py module is not found ? It now looks like
    M> _exceptions replaces exceptions.py...

    BAW> I see no reason to keep both of them around.  Too much of a
    BAW> synchronization headache.

Well, wait a minute.  Is Nick's third revision of his
AttributeError/NameError enhancement still on the table?  If so,
exceptions.py is the right place to put it.  In that case, I would recommend
that exceptions.py still be the file that is loaded.  It would take care of
importing _exceptions.
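
Concretely, the top of exceptions.py could stay as thin as this sketch
(mirroring what the checkin log describes), with the pure-Python
additions below it:

    # exceptions.py -- thin wrapper over the C implementation
    from _exceptions import *
    from _exceptions import __doc__

    # pure-Python enhancements (e.g. Nick's AttributeError/NameError
    # work) would go here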

Oh, BTW.. +1 on Nick's latest version.

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From gward at mems-exchange.org  Fri May 26 15:27:16 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 09:27:16 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <1252780469-123073242@hypernet.com>; from gmcm@hypernet.com on Fri, May 26, 2000 at 07:53:14AM -0400
References: <20000525222203.A1114@beelzebub> <1252780469-123073242@hypernet.com>
Message-ID: <20000526092716.B12100@mems-exchange.org>

On 26 May 2000, Gordon McMillan said:
> Yeah. I tend to install stuff outside the sys.prefix tree and then 
> use .pth files. I realize I'm, um, unique in this regard but I lost 
> everything in some upgrade gone bad. (When a Windows de-
> install goes wrong, your only option is to do some manual 
> directory and registry pruning.)

I think that's appropriate for Python "applications" -- in fact, now
that Distutils can install scripts and miscellaneous data, about the
only thing needed to properly support "applications" is an easy way for
developers to say, "Please give me my own directory and create a .pth
file".  (Actually, the .pth file should only be one way to install an
application: you might not want your app's Python library to muck up
everybody else's Python path.  An idea AMK and I cooked up yesterday
would be an addition to the Distutils "build_scripts" command: along
with frobbing the #! line to point to the right Python interpreter, add
a second line:
  import sys ; sys.path.append(path-to-this-app's-python-lib)

Or maybe "sys.path.insert(0, ...)".)

Anyways, that's neither here nor there.  Except that applications that
get their own directory should be free to put their (static) data files
wherever they please, rather than having to put them in the app's Python
library.

I'm more concerned with the what the Distutils works best with now,
though: module distributions.  I think you guys have convinced me;
static data should normally sit with the code.  I think I'll make that
the default (instead of prefix + "share"), but give developers a way to
override it.  So eg.:

   data_files = ["this.dat", "that.cfg"]

will put the files in the same place as the code (which could be a bit
tricky to figure out, what with the vagaries of package-ization and
"extra" install dirs);

   data_files = [("share", ["this.dat"]), ("etc", ["that.cfg"])]

would put the data file in (eg.) /usr/local/share and the config file in
/usr/local/etc.  This obviously makes the module writer's job harder: he
has to grovel from sys.prefix looking for the files that he expects to
have been installed with his modules.  But if someone really wants to do
this, they should be allowed to.

Finally, you could also put absolute directories in 'data_files',
although this would not be recommended.

> (Hmm, I guess it is if you use rpms.)

All the smart Unix installers (RPM, Debian, FreeBSD, ...?) I know of
have some sort of dependency mechanism, which works to varying degrees
of "work".  I'm only familar with RPM, and my usual response to a
dependency warning is "dammit, I know what I'm doing", and then I rerun
"rpm --nodeps" to ignore the dependency checking.  (This usually arises
because I build my own Perl and Python, and don't use Red Hat's -- I
just make /usr/bin/{perl,python} symlinks to /usr/local/bin, which RPM
tends to whine about.)  But it's nice to know that someone is watching.
;-)

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From gward at mems-exchange.org  Fri May 26 15:30:29 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 09:30:29 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <200005261313.IAA11285@cj20424-a.reston1.va.home.com>; from guido@python.org on Fri, May 26, 2000 at 08:13:06AM -0500
References: <1252780469-123073242@hypernet.com> <200005261313.IAA11285@cj20424-a.reston1.va.home.com>
Message-ID: <20000526093028.C12100@mems-exchange.org>

On 26 May 2000, Guido van Rossum said:
> Modifyable data needs to go in a per-user directory, even on Windows,
> outside the Python tree.
> 
> But static data needs to go in the same directory as the module that
> uses it.  (We use this in the standard test package, for example.)

What about the Distutils system config file (pydistutils.cfg)?  This is
something that should only be modified by the sysadmin, and sets the
site-wide policy for building and installing Python modules.  Does this
belong in the code directory?  (I hope so, because that's where it goes
now...)

(Under Unix, users can have a personal Distutils config file that
overrides the system config (~/.pydistutils.cfg), and every module
distribution can have a setup.cfg that overrides both of them.  On
Windows and Mac OS, there are only two config files: system and
per-distribution.)

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From gward at mems-exchange.org  Fri May 26 16:30:15 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 10:30:15 -0400
Subject: [Python-Dev] py_compile and CR in source files
Message-ID: <20000526103014.A18937@mems-exchange.org>

Just made an unpleasant discovery: if a Python source file has CR-LF
line-endings, you can import it just fine under Unix.  But attempting to
'py_compile.compile()' it fails with a SyntaxError at the first
line-ending.

Arrghh!  This means that Distutils will either have to check/convert
line-endings at build-time (hey, finally, a good excuse for the
"build_py" command), or implicitly compile modules by importing them
(instead of using 'py_compile.compile()').

Perhaps I should "build" modules by line-at-a-time copying -- currently
it copies them in 16k chunks, which would make it hard to fix line
endings.  Hmmm.
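
Something like this minimal sketch is what I have in mind for the
line-at-a-time copy (illustrative only):

    def copy_py(src, dst):
        # Copy a .py file line by line, converting CR-LF endings to LF
        # so that py_compile.compile() is happy on Unix.
        fin = open(src, 'rb')
        fout = open(dst, 'wb')
        while 1:
            line = fin.readline()
            if not line:
                break
            if line[-2:] == '\r\n':
                line = line[:-2] + '\n'
            fout.write(line)
        fin.close()
        fout.close()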

        Greg



From skip at mojam.com  Fri May 26 11:39:39 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 26 May 2000 04:39:39 -0500 (CDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <20000526103014.A18937@mems-exchange.org>
References: <20000526103014.A18937@mems-exchange.org>
Message-ID: <14638.18011.331703.867404@beluga.mojam.com>

    Greg> Arrghh!  This means that Distutils will either have to
    Greg> check/convert line-endings at build-time (hey, finally, a good
    Greg> excuse for the "build_py" command), or implicitly compile modules
    Greg> by importing them (instead of using 'py_compile.compile()').

I don't think you can safely compile modules by importing them.  You have no
idea what the side effects of the import might be.

How about fixing py_compile.compile() instead?

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From mal at lemburg.com  Fri May 26 16:27:03 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 16:27:03 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vIcs-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <392E89B7.D6BC572D@lemburg.com>

Peter Funk wrote:
> 
> [M.-A. Lemburg]:
> > > Proposal:
> > > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > >
> > > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > > bytes 12-*  marshalled code object (same as earlier)
> >
> > This will break all tools relying on having the code object available
> > in bytes[8:] and believe me: there are lots of those around ;-)
> 
> In some way, this is intentional:  If these tools (are there really
> that many out there that munge .pyc byte code files?) simply use
> 'imp.get_magic()' and then silently assume a specific content of the
> marshalled code object, they probably need changes anyway, since the
> code needed to deal with the new unicode object is missing from them.

That's why I proposed to change the marshalled code object
and not the PYC file: the problem is not only related to 
PYC files, it touches all areas where marshal is used. If 
you try to load a code object using Unicode in Python 1.5
you'll get all sorts of errors, e.g. EOFError, SystemError.
 
Since marshal uses a specific format, that format should
receive the version number.

Ideally that version would be prepended to the format (not sure
whether this is possible), so that the PYC file layout
would then look like this:

word 0: magic
word 1: timestamp
word 2: version in the marshalled code object
word 3-*: rest of the marshalled code object

Please make sure that options such as the -U option are
also respected...

--

A different approach to all this would be fixing only the
first two bytes of the magic word, e.g.

byte 0: 'P'
byte 1: 'Y'
byte 2: version number (counting from 1)
byte 3: option byte (8 bits: one for each option;
                     bit 0: -U cmd switch)

This would be b/w compatible and still provide file(1)
with enough information to be able to tell the file type.
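
For illustration, a tool could then pick the magic word apart like this
(just a sketch of the proposal, not code for the current format):

    def split_magic(magic):
        # magic is the first four bytes of the .pyc file
        assert magic[:2] == 'PY'
        version = ord(magic[2])
        options = ord(magic[3])
        unicode_mode = options & 1          # bit 0: the -U cmd switch
        return version, unicode_mode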

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri May 26 16:49:23 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 26 May 2000 16:49:23 +0200
Subject: [Python-Dev] Extending locale.py
Message-ID: <392E8EF3.CDA61525@lemburg.com>

To make it a little easier to move in the direction of having the
string encoding depend on the locale settings, I've started
to hack away at an extension of the locale.py module.

The module provides enough information to be able to set the string
encoding in site.py at startup. 

Additional code for _localemodule.c would be nice for platforms
which use other APIs to get at the active code page, e.g. on
Windows and Macs.

Please try it on your platform and tell me what you think
of the APIs.
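
The kind of thing I'd like site.py to be able to do eventually is
roughly this (the names below are placeholders only -- see the attached
module for the actual APIs):

    import locale
    language, encoding = locale.get_default()   # placeholder name, not a real API
    if encoding is not None:
        # make <encoding> the default used for string/Unicode conversions
        pass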

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: localex.py
Type: text/python
Size: 26105 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20000526/1e6fcf39/attachment-0001.bin>

From gmcm at hypernet.com  Fri May 26 16:56:27 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 26 May 2000 10:56:27 -0400
Subject: [Python-Dev] Where to install non-code files
In-Reply-To: <20000526092716.B12100@mems-exchange.org>
References: <1252780469-123073242@hypernet.com>; from gmcm@hypernet.com on Fri, May 26, 2000 at 07:53:14AM -0400
Message-ID: <1252769476-123734481@hypernet.com>

Greg Ward wrote:

> On 26 May 2000, Gordon McMillan said:
> > Yeah. I tend to install stuff outside the sys.prefix tree and
> > then use .pth files. I realize I'm, um, unique in this regard
> > but I lost everything in some upgrade gone bad. (When a Windows
> > de-install goes wrong, your only option is to do some manual
> > directory and registry pruning.)
> 
> I think that's appropriate for Python "applications" -- in fact,
> now that Distutils can install scripts and miscellaneous data,
> about the only thing needed to properly support "applications" is
> an easy way for developers to say, "Please give me my own
> directory and create a .pth file". 

Hmm. I see an application as a module distribution that 
happens to have a script. (Or maybe I see a module 
distribution as a scriptless app ;-)).

At any rate, I don't see the need to dignify <prefix>/share and 
friends with an official position.

> (Actually, the .pth file
> should only be one way to install an application: you might not
> want your app's Python library to muck up everybody else's Python
> path.  An idea AMK and I cooked up yesterday would be an addition
> to the Distutils "build_scripts" command: along with frobbing the
> #! line to point to the right Python interpreter, add a second
> line:
>   import sys ; sys.path.append(path-to-this-app's-python-lib)
> 
> Or maybe "sys.path.insert(0, ...)".

$PYTHONSTARTUP ??

Never really had to deal with this. On my RH box, 
/usr/bin/python is my build. At a client site which had 1.4 
installed, I built 1.5 into $HOME/bin with a hacked getpath.c.

> I'm more concerned with the what the Distutils works best with
> now, though: module distributions.  I think you guys have
> convinced me; static data should normally sit with the code.  I
> think I'll make that the default (instead of prefix + "share"),
> but give developers a way to override it.  So eg.:
> 
>    data_files = ["this.dat", "that.cfg"]
> 
> will put the files in the same place as the code (which could be
> a bit tricky to figure out, what with the vagaries of
> package-ization and "extra" install dirs);

That's an artifact of your code ;-). If you figured it out once, 
you stand at least a 50% chance of getting the same answer 
a second time <.5 wink>.
 


- Gordon



From gward at mems-exchange.org  Fri May 26 17:06:09 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Fri, 26 May 2000 11:06:09 -0400
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.18011.331703.867404@beluga.mojam.com>; from skip@mojam.com on Fri, May 26, 2000 at 04:39:39AM -0500
References: <20000526103014.A18937@mems-exchange.org> <14638.18011.331703.867404@beluga.mojam.com>
Message-ID: <20000526110608.F9083@mems-exchange.org>

On 26 May 2000, Skip Montanaro said:
> I don't think you can safely compile modules by importing them.  You have no
> idea what the side effects of the import might be.

Yeah, that's my concern.

> How about fixing py_compile.compile() instead?

Would be a good thing to do this for Python 1.6, but I can't go back and
fix all the Python 1.5.2 installations out there.

Does anyone know of any good reason why 'import' and
'py_compile.compile()' behave differently?  Or is it something easily
fixable?

        Greg



From tim_one at email.msn.com  Fri May 26 17:41:57 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 26 May 2000 11:41:57 -0400
Subject: [Python-Dev] Memory woes under Windows
In-Reply-To: <000401bfc082$54211940$6c2d153f@tim>
Message-ID: <LNBBLJKPBEHFEDALKOLCKELFGBAA.tim_one@email.msn.com>

Just polishing part of this off, for the curious:

> ...
> Dragon's Win98 woes appear due to something else:  right after a Win98
> system w/ 64Mb RAM is booted, about half the memory is already locked (not
> just committed)!  Dragon's product needs more than the remaining 32Mb to
> avoid thrashing.  Even stranger, killing every process after booting
> releases an insignificant amount of that locked memory. ...

That turned out to be (mostly) irrelevant, and even if it were relevant it
turns out you can reduce the locked memory (to what appears to be an
undocumented minimum) and the file-cache size (to what is a documented
minimum) just by malloc'ing, zero'ing and free'ing a few giant arrays
(Windows malloc()-- unlike Linux's --returns a pointer to committed memory;
Windows has other calls if you really want memory you can't trust <0.5
wink>).

The next red herring was much funnier:  we couldn't reproduce the problem
when running the recognizer by hand (from a DOS box cmdline)!  But, run it
as Research did, system()'ed from a small Perl script, and it magically ran
3x slower, with monstrous disk thrashing.  So I had a great time besmirching
Perl's reputation <wink>.

Alas, it turned out the *real* trigger was something else entirely, that
we've known about for years but have never understood:  from inside the Perl
script, people used UNC paths to various network locations.  Like

    \\earwig\research2\data5\natspeak\testk\big55.voc

Exactly the same locations were referenced when people ran it "by hand", but
when people do it by hand, they naturally map a drive letter first, in order
to reduce typing.  Like

    net use N: \\earwig\research2\data5\natspeak

once and then

    N:\testk\big55.voc

in their command lines.

This difference alone can make a *huge* timing difference!  Like I said,
we've never understood why.  Could simply be a bug in Dragon's
out-of-control network setup, or a bug in MS's networking code, or a bug in
Novell's server code -- I don't think we'll ever know.  The number of
IQ-hours that have gone into *trying* to figure this out over the years
could probably have carried several startups to successful IPOs <0.9 wink>.

One last useless clue:  do all this on a Win98 with 128Mb RAM, and the
timing difference goes away.  Ditto Win95, but much less RAM is needed.  It
sometimes acts like a UNC path consumes 32Mb of dedicated RAM!

Apart from this UNC-vs-mapped-drive issue, over many hours of dead-end
scenarios I was pleased to see that Win98 appears to do a good job of
reallocating physical RAM in response to changing demands, & in particular
better than Win95.  There's no problem here at all!

The original test case I posted-- showing massive heap fragmentation under
Win95, Win98, and W2K (but not NT), when growing a large Python list one
element at a time --remains an as-yet unstudied mystery.  I can easily make
*that* problem go away by, e.g., doing

    a = [1]*3000000
    del a

from time to time, apparently just to convince the Windows malloc that it
would be a wise idea to allocate a lot more than it thinks it needs from
time to time.  This suggests (untested) that it *could* be a huge win for
huge lists under Windows to overallocate huge lists by more than Python does
today.  I'll look into that "someday".





From gstein at lyra.org  Fri May 26 17:46:09 2000
From: gstein at lyra.org (Greg Stein)
Date: Fri, 26 May 2000 08:46:09 -0700 (PDT)
Subject: [Python-Dev] exceptions.c location (was: Win32 build)
In-Reply-To: <14639.50100.383806.969434@localhost.localdomain>
Message-ID: <Pine.LNX.4.10.10005260845130.23146-100000@nebula.lyra.org>

On Sat, 27 May 2000, Barry A. Warsaw wrote:
> >>>>> "GS" == Greg Stein <gstein at lyra.org> writes:
> 
>     GS> On Fri, 26 May 2000, Tim Peters wrote:
>     >> ...  PS: Barry's exception patch appears to have broken the CVS
>     >> Windows build (nothing links anymore; all the PyExc_xxx symbols
>     >> aren't found; no time to dig more now).
> 
>     GS> The .dsp file(s) need to be updated to include the new
>     GS> _exceptions.c file in their build and link step. (the symbols
>     GS> moved there)
> 
>     GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
>     GS> the Python/ directory. Dependencies from the core out to
>     GS> Modules/ seems a bit weird.
> 
Guido made the suggestion to move _exceptions.c to exceptions.c
anyway.  Should we move the file to the other directory too?  Get out
> your plusses and minuses.

+1 for moving it to Python/ (where bltinmodule.c and sysmodule.c exist)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Fri May 26 18:18:14 2000
From: gstein at lyra.org (Greg Stein)
Date: Fri, 26 May 2000 09:18:14 -0700 (PDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <20000526110608.F9083@mems-exchange.org>
Message-ID: <Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>

On Fri, 26 May 2000, Greg Ward wrote:
> On 26 May 2000, Skip Montanaro said:
> > I don't think you can safely compile modules by importing them.  You have no
> > idea what the side effects of the import might be.
> 
> Yeah, that's my concern.

I agree. You can't just import them.

> > How about fixing py_compile.compile() instead?
> 
> Would be a good thing to do this for Python 1.6, but I can't go back and
> fix all the Python 1.5.2 installations out there.

You and your 1.5 compatibility... :-)

> Does anyone know if any good reasons why 'import' and
> 'py_compile.compile()' are different?  Or is it something easily
> fixable?

I seem to recall needing to put an extra carriage return on the file, but
that the Python parser was fine with the different newline concepts. Guido
explained the difference once to me, but I don't recall offhand -- I'd
have to crawl back thru the email. Just yell over the cube at him to find
out.

*ponder*

Well, assuming that it is NOT okay with \r\n in there, then read the whole
blob in and use string.replace() on it.
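
In other words, something as simple as (untested, with path being the
source file's name):

    import string
    text = open(path, "rb").read()
    text = string.replace(text, "\r\n", "\n")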

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/





From skip at mojam.com  Fri May 26 18:30:08 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 26 May 2000 11:30:08 -0500 (CDT)
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>
References: <20000526110608.F9083@mems-exchange.org>
	<Pine.LNX.4.10.10005260913420.23146-100000@nebula.lyra.org>
Message-ID: <14638.42640.835838.859270@beluga.mojam.com>

    Greg> Well, assuming that it is NOT okay with \r\n in there, then read
    Greg> the whole blob in and use string.replace() on it.

I thought of that too, but quickly dismissed it.  You may have a CRLF pair
embedded in a triple-quoted string.  Those should be left untouched.

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From fdrake at acm.org  Fri May 26 19:18:00 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Fri, 26 May 2000 10:18:00 -0700 (PDT)
Subject: [Distutils] Re: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.42640.835838.859270@beluga.mojam.com>
Message-ID: <Pine.LNX.4.10.10005261014420.12340-100000@mailhost.beopen.com>

On Fri, 26 May 2000, Skip Montanaro wrote:
 > I thought of that too, but quickly dismissed it.  You may have a CRLF pair
 > embedded in a triple-quoted string.  Those should be left untouched.

  No, it would be OK to do the replacement; source files are supposed to
be treated as text, meaning that line ends should be represented as \n.
We're not talking about changing the values of the strings: line ends
inside them are treated as \n anyway, and that's what gets incorporated
into the value of the string.  This has no impact on the explicit
inclusion of \r or \r\n escapes in strings.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From bwarsaw at python.org  Fri May 26 19:32:02 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Fri, 26 May 2000 13:32:02 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
	<Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
Message-ID: <14638.46354.960974.536560@localhost.localdomain>

>>>>> "Fred" == Fred L Drake <fdrake at acm.org> writes:

    Fred> and there's ElectricFence from Bruce Perens:

    Fred> 	http://www.perens.com/FreeSoftware/

Yup, this comes with RH6.2 and is fairly easy to hook up; just link
with -lefence and go.  Running an efenced python over the whole test
suite fails miserably, but running it over just
Lib/test/test_exceptions.py has already (quickly) revealed one
refcounting bug, for which I will check in a fix later today (as I move
Modules/_exceptions.c to Python/exceptions.c).

    Fred> (There's a MailMan related link there are well you might be
    Fred> interested in!)

Indeed!  I've seen Bruce contribute on the various Mailman mailing
lists.

-Barry



From skip at mojam.com  Fri May 26 19:48:46 2000
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 26 May 2000 12:48:46 -0500 (CDT)
Subject: [Python-Dev] C implementation of exceptions module
In-Reply-To: <14638.46354.960974.536560@localhost.localdomain>
References: <14638.64962.118047.467438@localhost.localdomain>
	<Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
	<14638.46354.960974.536560@localhost.localdomain>
Message-ID: <14638.47358.724731.392760@beluga.mojam.com>

    BAW> Yup, this comes with RH6.2 and is fairly easy to hook up; just link
    BAW> with -lefence and go.

Hmmm...  Sounds like an extra configure flag waiting to be added...

Skip



From bwarsaw at python.org  Fri May 26 20:38:19 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Fri, 26 May 2000 14:38:19 -0400 (EDT)
Subject: [Python-Dev] C implementation of exceptions module
References: <14638.64962.118047.467438@localhost.localdomain>
	<Pine.LNX.4.10.10005252132280.7550-100000@mailhost.beopen.com>
	<14638.46354.960974.536560@localhost.localdomain>
	<14638.47358.724731.392760@beluga.mojam.com>
Message-ID: <14638.50331.542338.196305@localhost.localdomain>

>>>>> "SM" == Skip Montanaro <skip at mojam.com> writes:

    BAW> Yup, this comes with RH6.2 and is fairly easy to hook up;
    BAW> just link with -lefence and go.

    SM> Hmmm...  Sounds like an extra configure flag waiting to be
    SM> added...

I dunno.  I just did a "make -k OPT=-g LIBC=-lefence".

-Barry



From trentm at activestate.com  Fri May 26 20:55:55 2000
From: trentm at activestate.com (Trent Mick)
Date: Fri, 26 May 2000 11:55:55 -0700
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
In-Reply-To: <14639.50100.383806.969434@localhost.localdomain>
References: <000401bfc6d3$0afb3e60$c52d153f@tim> <Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org> <14639.50100.383806.969434@localhost.localdomain>
Message-ID: <20000526115555.C32427@activestate.com>

On Sat, May 27, 2000 at 08:46:44AM -0400, Barry A. Warsaw wrote:
> 
> >>>>> "GS" == Greg Stein <gstein at lyra.org> writes:
> 
>     GS> On Fri, 26 May 2000, Tim Peters wrote:
>     >> ...  PS: Barry's exception patch appears to have broken the CVS
>     >> Windows build (nothing links anymore; all the PyExc_xxx symbols
>     >> aren't found; no time to dig more now).
> 
>     GS> The .dsp file(s) need to be updated to include the new
>     GS> _exceptions.c file in their build and link step. (the symbols
>     GS> moved there)
> 
>     GS> IMO, it seems it would be Better(tm) to put _exceptions.c into
>     GS> the Python/ directory. Dependencies from the core out to
>     GS> Modules/ seems a bit weird.
> 
> Guido made the suggestion to move _exceptions.c to exceptions.c
> anyway.  Should we move the file to the other directory too?  Get out
> your plusses and minuses.
> 
+1 moving exceptions.c to Python/


Trent

-- 
Trent Mick
trentm at activestate.com


From trentm at activestate.com  Fri May 26 20:39:40 2000
From: trentm at activestate.com (Trent Mick)
Date: Fri, 26 May 2000 11:39:40 -0700
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <m12vFOo-000DieC@artcom0.artcom-gmbh.de>
References: <200005251551.KAA11897@cj20424-a.reston1.va.home.com> <m12vFOo-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000526113940.B32427@activestate.com>

On Fri, May 26, 2000 at 10:23:18AM +0200, Peter Funk wrote:
> [Guido van Rossum]:
> > Given Christian Tismer's testimonial and inspection of marshal.c, I
> > think Peter's small patch is acceptable.
> > 
> > A bigger question is whether we should freeze the magic number and add
> > a version number.  In theory I'm all for that, but it means more
> > changes; there are several tools (e.c. Lib/py_compile.py,
> > Tools/freeze/modulefinder.py and Tools/scripts/checkpyc.py) that have
> > intimate knowledge of the .pyc file format that would have to be
> > modified to match.
> > 
> > The current format of a .pyc file is as follows:
> > 
> > bytes 0-3   magic number
> > bytes 4-7   timestamp (mtime of .py file)
> > bytes 8-*   marshalled code object
> 
> Proposal:
> The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> 
> bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> bytes 4-7   a version number (which should be == 1 in Python 1.6)
> bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> bytes 12-*  marshalled code object (same as earlier)
> 

This may be important: timestamps (as represented by the time_t type) are 8
bytes wide on 64-bit Linux and Win64. However, it will be a while (another 38
years) before time_t starts overflowing past 31 bits (it is a signed value).

The use of a 4 byte timestamp in the .pyc files constitutes an assumption
that this value will fit in 4 bytes. The best portable way of handling this
issue (I think) is to just add an overflow check in import.c, where
PyOS_GetLastModificationTime (which now properly returns time_t) is called,
and raise an exception if the time_t return value overflows 4 bytes.
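
In Python terms the check amounts to something like this (the real patch
would of course live in import.c, at the C level; the file name here is
just an example):

    import os
    mtime = os.stat("spam.py")[8]        # st_mtime as returned by the OS
    if mtime > 0x7fffffffL:              # won't fit in a signed 4-byte slot
        raise OverflowError("modification time does not fit in 4 bytes")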

I have been going through the Python code looking for possible overflow cases
for Win64 and Linux64 of late, so I will submit these patches (Real Soon Now
(tm)).

Cheers,
Trent

-- 
Trent Mick
trentm at activestate.com



From bwarsaw at python.org  Fri May 26 21:11:40 2000
From: bwarsaw at python.org (bwarsaw at python.org)
Date: Fri, 26 May 2000 15:11:40 -0400 (EDT)
Subject: [Python-Dev] Win32 build (was: RE: [Patches] From comp.lang.python: A compromise on case-sensitivity)
References: <000401bfc6d3$0afb3e60$c52d153f@tim>
	<Pine.LNX.4.10.10005260045420.21092-100000@nebula.lyra.org>
	<14639.50100.383806.969434@localhost.localdomain>
	<20000526115555.C32427@activestate.com>
Message-ID: <14638.52332.741025.292435@localhost.localdomain>

>>>>> "TM" == Trent Mick <trentm at activestate.com> writes:

    TM> +1 moving exceptions.c to Python/

Done.  And it looks like someone with a more accessible Windows setup
is going to have to modify the .dsp files.

-Barry



From jeremy at alum.mit.edu  Fri May 26 23:40:53 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Fri, 26 May 2000 17:40:53 -0400 (EDT)
Subject: [Python-Dev] Guido is offline
Message-ID: <14638.61285.606894.914184@localhost.localdomain>

FYI: Guido's cable modem service is giving him trouble and he's unable
to read email at the moment.  He wanted me to let you know that lack
of response isn't for lack of interest.  I imagine he won't be fully
responsive until after the holiday weekend :-).

Jeremy




From tim_one at email.msn.com  Sat May 27 06:53:14 2000
From: tim_one at email.msn.com (Tim Peters)
Date: Sat, 27 May 2000 00:53:14 -0400
Subject: [Python-Dev] py_compile and CR in source files
In-Reply-To: <14638.42640.835838.859270@beluga.mojam.com>
Message-ID: <000001bfc797$781e3d20$cd2d153f@tim>

[GregS]
> Well, assuming that it is NOT okay with \r\n in there, then read
> the whole blob in and use string.replace() on it.
>
[Skip Montanaro]
> I thought of that too, but quickly dismissed it.  You may have a CRLF pair
> embedded in a triple-quoted string.  Those should be left untouched.

Why?  When Python compiles a module "normally", line-ends get normalized,
and the CRLF pairs on Windows vanish anyway.  For example, here's cr.py:

def f():
    s = """a
b
c
d
"""
    for ch in s:
        print ord(ch),
    print

f()
import dis
dis.dis(f)

I'm running on Win98 as I type, and the source file has CRLF line ends.

C:\Python16>python misc/cr.py
97 10 98 10 99 10 100 10

That line shows that only the LFs survived.  The rest shows why:

          0 SET_LINENO               1

          3 SET_LINENO               2
          6 LOAD_CONST               1 ('a\012b\012c\012d\012')
          9 STORE_FAST               0 (s)
          etc

That is, as far as the generated code is concerned, the CRs never existed.

60-years-of-computers-and-we-still-can't-agree-on-how-to-end-a-line-ly
    y'rs  - tim





From martin at loewis.home.cs.tu-berlin.de  Sun May 28 08:28:55 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 28 May 2000 08:28:55 +0200
Subject: [Python-Dev] String encoding
Message-ID: <200005280628.IAA01239@loewis.home.cs.tu-berlin.de>

Fred L. Drake wrote

> I recall a fair bit of discussion about wchar_t when it was
> introduced to ANSI C, and the character set and encoding were
> specifically not made part of the specification.  Making a
> requirement that wchar_t be Unicode doesn't make a lot of sense, and
> opens up potential portability issues.

In ISO (!) C99, an implementation may define __STDC_ISO_10646__ to
indicate that wchar_t is Unicode. The exact wording is

# A decimal constant of the form yyyymmL (for example, 199712L),
# intended to indicate that values of type wchar_t are the coded
# representations of the characters defined by ISO/IEC 10646, along
# with all amendments and technical corrigenda as of the specified
# year and month.

Of course, at the moment, there are few, if any, implementations that
define this macro.

Regards,
Martin



From martin at loewis.home.cs.tu-berlin.de  Sun May 28 12:34:01 2000
From: martin at loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 28 May 2000 12:34:01 +0200
Subject: [Python-Dev] Patch: AttributeError and NameError: second attempt.
Message-ID: <200005281034.MAA04765@loewis.home.cs.tu-berlin.de>

[thread moved, since I can't put in proper References headers, anyway,
 just by looking at the archive]
> 1) I rewrite the stuff that went into exceptions.py in C, and stick it
>   in the _exceptions module.  I don't much like this idea, since it 
>   kills the advantage noted above.

>2) I leave the stuff that's in C already in C.  I add C __str__ methods 
>   to AttributeError and NameError, which dispatch to helper functions
>   in the python 'exceptions' module, if that module is available.

>Which is better, or is there a third choice available?

There is a third choice: Patch AttributeError afterwards. I.e. in
site.py, say

def _AttributeError_str(self):
    # build the friendly message from the exception's attributes here
    return "..."

AttributeError.__str__ = _AttributeError_str

Guido said
> This kind of user-friendliness should really be in the tools, not in
> the core language implementation!

And I think Nick's patch exactly follows this guideline. Currently,
the C code raising AttributeError tries to be friendly, formatting a
string, and passing it to the AttributeError.__init__. With his patch,
the AttributeError just gets enough information so that tools later
can be friendly - actually printing anything is done in Python code.

Fred said
>   I see no problem with the functionality from Nick's patch; this is
> exactly the sort of thing that's needed, including at the basic
> interactive prompt.

I agree. Much of the strength of this approach is lost if it only
works inside tools. When I get an AttributeError, I'd like to see
right away what the problem is. If I had to fire up IDLE and re-run it
first, I'd rather stare at my code long enough to see the problem.

Regards,
Martin




From tismer at tismer.com  Sun May 28 17:02:41 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sun, 28 May 2000 17:02:41 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12vIcs-000DieC@artcom0.artcom-gmbh.de> <392E89B7.D6BC572D@lemburg.com>
Message-ID: <39313511.4A312B4A@tismer.com>


"M.-A. Lemburg" wrote:
> 
> Peter Funk wrote:
> >
> > [M.-A. Lemburg]:
> > > > Proposal:
> > > > The future format (Python 1.6 and newer) of a .pyc file should be as follows:
> > > >
> > > > bytes 0-3   a new magic number, which should be definitely frozen in 1.6.
> > > > bytes 4-7   a version number (which should be == 1 in Python 1.6)
> > > > bytes 8-11  timestamp (mtime of .py file) (same as earlier)
> > > > bytes 12-*  marshalled code object (same as earlier)

<snip/>

> A different approach to all this would be fixing only the
> first two bytes of the magic word, e.g.
> 
> byte 0: 'P'
> byte 1: 'Y'
> byte 2: version number (counting from 1)
> byte 3: option byte (8 bits: one for each option;
>                      bit 0: -U cmd switch)
> 
> This would be b/w compatible and still provide file(1)
> with enough information to be able to tell the file type.

I think this approach is simple and powerful enough
to survive Py3000.
Peter's approach is of course nicer and cleaner from
a "redo from scratch" point of view. But then, I'd even
vote for a better format that includes another field
which names the header size explicitly.

For simplicity, compatibility and ease of change,
I vote with +1 for adopting the solution of

byte 0: 'P'
byte 1: 'Y'
byte 2: version number (counting from 1)
byte 3: option byte (8 bits: one for each option;
                     bit 0: -U cmd switch)

If that turns out to be insufficient in some future,
do a complete redesign.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From pf at artcom-gmbh.de  Sun May 28 18:23:52 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Sun, 28 May 2000 18:23:52 +0200 (MEST)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <39313511.4A312B4A@tismer.com> from Christian Tismer at "May 28, 2000  5: 2:41 pm"
Message-ID: <m12w5qy-000DieC@artcom0.artcom-gmbh.de>

[...]
> For simplicity, compatibility and ease of change,
> I vote with +1 for adopting the solution of
> 
> byte 0: 'P'
> byte 1: 'Y'
> byte 2: version number (counting from 1)
> byte 3: option byte (8 bits: one for each option;
>                      bit 0: -U cmd switch)
> 
> If that turns out to be insufficient in some future,
> do a complete redesign.

What about the CR/LF issue with some Mac Compilers (see
Guido's mail for details)?  Can we simply drop this?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From tismer at tismer.com  Sun May 28 18:51:20 2000
From: tismer at tismer.com (Christian Tismer)
Date: Sun, 28 May 2000 18:51:20 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <m12w5qy-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <39314E88.6AD944CE@tismer.com>


Peter Funk wrote:
> 
> [...]
> > For simplicity, compatibility and ease of change,
> > I vote with +1 for adopting the solution of
> >
> > byte 0: 'P'
> > byte 1: 'Y'
> > byte 2: version number (counting from 1)
> > byte 3: option byte (8 bits: one for each option;
> >                      bit 0: -U cmd switch)
> >
> > If that turns out to be insufficient in some future,
> > do a complete redesign.
> 
> What about the CR/LF issue with some Mac Compilers (see
> Guido's mail for details)?  Can we simply drop this?

Well, forgot about that.
How about swapping bytes 0 and 1?

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From guido at python.org  Mon May 29 01:54:11 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 28 May 2000 18:54:11 -0500
Subject: [Python-Dev] Guido is offline
In-Reply-To: Your message of "Fri, 26 May 2000 17:40:53 -0400."
             <14638.61285.606894.914184@localhost.localdomain> 
References: <14638.61285.606894.914184@localhost.localdomain> 
Message-ID: <200005282354.SAA02034@cj20424-a.reston1.va.home.com>

> FYI: Guido's cable modem service is giving him trouble and he's unable
> to read email at the moment.  He wanted me to let you know that lack
> of response isn't for lack of interest.  I imagine he won't be fully
> responsive until after the holiday weekend :-).

I'm finally back online now, but can't really enjoy it, because my
in-laws are here... So I have 300 unread emails that will remain
unread until Tuesday. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at python.org  Mon May 29 02:00:39 2000
From: guido at python.org (Guido van Rossum)
Date: Sun, 28 May 2000 19:00:39 -0500
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: Your message of "Fri, 26 May 2000 14:36:36 +0200."
             <m12vJLw-000DieC@artcom0.artcom-gmbh.de> 
References: <m12vJLw-000DieC@artcom0.artcom-gmbh.de> 
Message-ID: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>

> > Modifiable data needs to go in a per-user directory, even on Windows,
> > outside the Python tree.
> 
> Is there a reliable algorithm to find a "per-user" directory on any
> Win95/98/NT/2000 system?  On MacOS?  

I don't know -- often $HOME is set on Windows.  E.g. IDLE uses $HOME
if set and otherwise the current directory.

The Mac doesn't have an environment at all.

> Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
> module would provide 'os.environ["HOME"]' similar to the posix
> version?  This would certainly simplify the task of application
> programmers intending to write portable applications.

This sounds like a nice idea...

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Mon May 29 21:58:41 2000
From: gstein at lyra.org (Greg Stein)
Date: Mon, 29 May 2000 12:58:41 -0700 (PDT)
Subject: [Python-Dev] Proposal: .pyc file format change
In-Reply-To: <39314E88.6AD944CE@tismer.com>
Message-ID: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org>

I don't think we should have a two-byte magic value. Especially where
those two bytes are printable, 7-bit ASCII.

"But it is four bytes," you say. Nope. It is two plus a couple parameters
that can now change over time.

To ensure uniqueness, I think a four-byte magic should stay.

I would recommend the approach of adding opcodes into the marshal format.
Specifically, 'V' followed by a single byte. That can only occur at the
beginning. If it is not present, then you know that you have an old
marshal value.
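
Roughly, a reader would then do this (a sketch of the idea only -- this
is not what marshal.c actually looks like):

    def marshal_version(data):
        if data[:1] == 'V':
            return ord(data[1])          # new style: 'V' plus one version byte
        return 0                         # old style marshal data, no version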

Cheers,
-g

On Sun, 28 May 2000, Christian Tismer wrote:
> Peter Funk wrote:
> > 
> > [...]
> > > For simplicity, compatibility and ease of change,
> > > I vote with +1 for adopting the solution of
> > >
> > > byte 0: 'P'
> > > byte 1: 'Y'
> > > byte 2: version number (counting from 1)
> > > byte 3: option byte (8 bits: one for each option;
> > >                      bit 0: -U cmd switch)
> > >
> > > If that turns out to be insufficient in some future,
> > > do a complete redesign.
> > 
> > What about the CR/LF issue with some Mac Compilers (see
> > Guido's mail for details)?  Can we simply drop this?
> 
> Well, forgot about that.
> How about swapping bytes 0 and 1?
> 
> -- 
> Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
> Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
> Kaunstr. 26                  :    *Starship* http://starship.python.net
> 14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
> PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
>      where do you want to jump today?   http://www.stackless.com
> 

-- 
Greg Stein, http://www.lyra.org/




From pf at artcom-gmbh.de  Tue May 30 09:08:15 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 30 May 2000 09:08:15 +0200 (MEST)
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
In-Reply-To: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> from Greg Stein at "May 29, 2000 12:58:41 pm"
Message-ID: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>

Greg Stein:
> I don't think we should have a two-byte magic value. Especially where
> those two bytes are printable, 7-bit ASCII.
[...]
> To ensure uniqueness, I think a four-byte magic should stay.

Looking at /etc/magic I see many 16-bit magic numbers kept around
from the good old days.  But you are right: Choosing a four-byte magic
value would make the chance of a clash with some other file format
much less likely.

> I would recommend the approach of adding opcodes into the marshal format.
> Specifically, 'V' followed by a single byte. That can only occur at the
> beginning. If it is not present, then you know that you have an old
> marshal value.

But this would not solve the problem with 8 byte versus 4 byte timestamps
in the header on 64-bit OSes.  Trent Mick pointed this out.

I think the situation we have now is very unsatisfactory:  I don't
see a reasonable solution which allows us to keep the length of the
header before the marshal block at a fixed 8 bytes together
with a frozen 4 byte magic number.

Moving the version number into the marshal doesn't help to resolve
this conflict.  So either you have to accept a new magic on 64 bit
systems or you have to enlarge the header.

To come up with a new proposal, the following questions should be answered:
  1. Is there really too much code out there, which depends on 
     the hardcoded assumption, that the marshal part of a .pyc file 
     starts at byte 8?  I see no further evidence for or against this.
     MAL pointed this out in 
     <http://www.python.org/pipermail/python-dev/2000-May/005756.html>
  2. If we decide to enlarge the header, do we really need a new
     header field defining the length of the header ? 
     This was proposed by Christian Tismer in 
     <http://www.python.org/pipermail/python-dev/2000-May/005792.html>
  3. The 'imp' module exposes some of the structure of a .pyc file
     through the function 'get_magic()'.  I proposed changing the signature of
     'imp.get_magic()' in an upward compatible way.  I also proposed 
     adding a new function 'imp.get_version()'.  What do you think about 
     this idea?
  4. Greg proposed prepending the version number to the marshal
     format.  If we do this, we definitely need a frozen way to find
     out where the marshalled code object actually starts.  This also has
     the disadvantage of making it slightly harder to come up with an
     /etc/magic definition which displays the version number of a .pyc
     file.

If we decide to move the version number into the marshal, we can
also move the .py timestamp there.  This way the timestamp will be handled
in the same way as large integer literals.  Quoting from the docs:

"""Caveat: On machines where C's long int type has more than 32 bits
   (such as the DEC Alpha), it is possible to create plain Python
   integers that are longer than 32 bits. Since the current marshal
   module uses 32 bits to transfer plain Python integers, such values
   are silently truncated. This particularly affects the use of very
   long integer literals in Python modules -- these will be accepted
   by the parser on such machines, but will be silently truncated
   when the module is read from the .pyc instead.
   [...]
   A solution would be to refuse such literals in the parser, since
   they are inherently non-portable. Another solution would be to let
   the marshal module raise an exception when an integer value would
   be truncated. At least one of these solutions will be implemented
   in a future version."""

Should this be 1.6?  Changing the format of .pyc files over and over
again in the 1.x series doesn't look very attractive.

Regards, Peter



From trentm at activestate.com  Tue May 30 09:46:09 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 30 May 2000 00:46:09 -0700
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
In-Reply-To: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <20000530004609.A16383@activestate.com>

On Tue, May 30, 2000 at 09:08:15AM +0200, Peter Funk wrote:
> > I would recommend the approach of adding opcodes into the marshal format.
> > Specifically, 'V' followed by a single byte. That can only occur at the
> > beginning. If it is not present, then you know that you have an old
> > marshal value.
> 
> But this would not solve the problem with 8 byte versus 4 byte timestamps
> in the header on 64-bit OSes.  Trent Mick pointed this out.
> 

I kind of intimated but did not make it clear: I wouldn't worry about the
limitations of a 4 byte timestamp too much. That value is not going to
overflow for another 38 years. Presumably the .pyc header (if such a thing
even still exists then) will change by then.


[peter summarizes .pyc header format options]

> 
> If we decide to move the version number into the marshal, we can
> also move the .py-timestamp there.  This way the timestamp will be handled
> in the same way as large integer literals.  Quoting from the docs:
> 
> """Caveat: On machines where C's long int type has more than 32 bits
>    (such as the DEC Alpha), it is possible to create plain Python
>    integers that are longer than 32 bits. Since the current marshal
>    module uses 32 bits to transfer plain Python integers, such values
>    are silently truncated. This particularly affects the use of very
>    long integer literals in Python modules -- these will be accepted
>    by the parser on such machines, but will be silently truncated
>    when the module is read from the .pyc instead.
>    [...]
>    A solution would be to refuse such literals in the parser, since
>    they are inherently non-portable. Another solution would be to let
>    the marshal module raise an exception when an integer value would
>    be truncated. At least one of these solutions will be implemented
>    in a future version."""
> 
> Should this be 1.6?  Changing the format of .pyc files over and over
> again in the 1.x series doesn't look very attractive.
> 
I *hope* it gets into 1.6, because I have implemented the latter suggestion
from the docs that you quoted (raise an exception if truncating a PyInt to
32 bits would cause data loss) and will be submitting a patch for it on
Wed or Thurs.

Ciao,
Trent

-- 
Trent Mick
trentm at activestate.com



From effbot at telia.com  Tue May 30 10:21:10 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 10:21:10 +0200
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org> <m12wg8N-000DieC@artcom0.artcom-gmbh.de> <20000530004609.A16383@activestate.com>
Message-ID: <009901bfca10$040531c0$f2a6b5d4@hagrid>

Trent Mick wrote:
> > But this would not solve the problem with 8 byte versus 4 byte timestamps
> > in the header on 64-bit OSes.  Trent Mick pointed this out.
> 
> I kind of intimated but did not make it clear: I wouldn't worry about the
> limitations of a 4 byte timestamp too much. That value is not going to
> overflow for another 38 years. Presumably the .pyc header (if such a thing
> even still exists then) will change by then.

note that py_compile (which is used to create PYC files after installation,
among other things) treats the time as an unsigned integer.

so in other words, if we fix the built-in "PYC compiler" so it does the same
thing before 2038, we can spend another 68 years on coming up with a
really future proof design... ;-)

I really hope Py3K will be out before 2106.

as for the other changes: *please* don't break the header layout in the
1.X series.  and *please* don't break the "if the magic is the same, I can
unmarshal and run this code blob without crashing the interpreter" rule
(raising an exception would be okay, though).

</F>




From mal at lemburg.com  Tue May 30 10:10:25 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 10:10:25 +0200
Subject: Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: 
 .pyc file format change)
References: <m12wg8N-000DieC@artcom0.artcom-gmbh.de>
Message-ID: <39337771.D78BAAF5@lemburg.com>

Peter Funk wrote:
> 
> Greg Stein:
> > I don't think we should have a two-byte magic value. Especially where
> > those two bytes are printable, 7-bit ASCII.
> [...]
> > To ensure uniqueness, I think a four-byte magic should stay.
> 
> Looking at /etc/magic I see many 16-bit magic numbers kept around
> from the good old days.  But you are right: Choosing a four-byte magic
> value would make the chance of a clash with some other file format
> much less likely.

Just for the record: the current /etc/magic I have on my Linux
machine doesn't know anything about PYC or PYO files, so I
don't really see much of a problem here -- no one seems to
be interested in finding out the file type for these
files anyway ;-)

Also, I don't really get the 16-bit magic argument: we still
have a 32-bit magic number -- one with a 16-bit fixed value and
predefined ranges for the remaining 16 bits. This already is
much better than what we have now w.r.t. making file(1) work
on PYC files.
 
> > I would recommend the approach of adding opcodes into the marshal format.
> > Specifically, 'V' followed by a single byte. That can only occur at the
> > beginning. If it is not present, then you know that you have an old
> > marshal value.
> 
> But this would not solve the problem with 8 byte versus 4 byte timestamps
> in the header on 64-bit OSes.  Trent Mick pointed this out.

The switch to 8 byte timestamps is only needed when the current
4 bytes can no longer hold the timestamp value. That will happen
in 2038...

Note that import.c writes the timestamp in 4 bytes until it
reaches an overflow situation.

> I think, the situation we have now, is very unsatisfactory:  I don't
> see a reasonable solution, which allows us to keep the length of the
> header before the marshal-block at a fixed length of 8 bytes together
> with a frozen 4 byte magic number.

Adding a version to the marshal format is a Good Thing --
independent of this discussion.
 
> Moving the version number into the marshal doesn't help to resolve
> this conflict.  So either you have to accept a new magic on 64 bit
> systems or you have to enlarge the header.

No you don't... please read the code: marshal only writes
8 bytes in case 4 bytes aren't enough to hold the value.
 
> To come up with a new proposal, the following questions should be answered:
>   1. Is there really too much code out there, which depends on
>      the hardcoded assumption, that the marshal part of a .pyc file
>      starts at byte 8?  I see no further evidence for or against this.
>      MAL pointed this out in
>      <http://www.python.org/pipermail/python-dev/2000-May/005756.html>

I have several references in my tool collection, the import
stuff uses it, old import hooks (remember ihooks ?) also do, etc.

>   2. If we decide to enlarge the header, do we really need a new
>      header field defining the length of the header ?
>      This was proposed by Christian Tismer in
>      <http://www.python.org/pipermail/python-dev/2000-May/005792.html>

In Py3K we can do this right (breaking things is allowed)...
and I agree with Christian that a proper file format needs
a header length field too. Basically, these values have to
be present, IMHO:

1. Magic
2. Version
3. Length of Header
4. (Header Attribute)*n
-- Start of Data ---

Header Attribute can be pretty much anything -- timestamps,
names of files or other entities, bit sizes, architecture
flags, optimization settings, etc.
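
In Python terms, reading such a header might look roughly like this
(field sizes and byte order are assumptions; nothing is agreed on yet):

    import struct

    def read_header(f):
        magic, version, hdrlen = struct.unpack("<4sll", f.read(12))
        attributes = f.read(hdrlen - 12)     # the (Header Attribute)*n block
        return magic, version, attributes    # marshalled data starts here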

>   3. The 'imp' module exposes somewhat the structure of an .pyc file
>      through the function 'get_magic()'.  I proposed changing the signature of
>      'imp.get_magic()' in an upward compatible way.  I also proposed
>      adding a new function 'imp.get_version()'.  What do you think about
>      this idea?

imp.get_magic() would have to return the proposed 32-bit value
('PY' + version byte + option byte).

I'd suggest adding additional functions which can read and write the
header given a PYCHeader object which would hold the 
values version and options.

>   4. Greg proposed prepending the version number to the marshal
>      format.  If we do this, we definitely need a frozen way to find
>      out, where the marshalled code object actually starts.  This has
>      also the disadvantage of making the task to come up with a /etc/magic
>      definition whichs displays the version number of a .pyc file slightly
>      harder.
> 
> If we decide to move the version number into the marshal, we can
> also move the .py-timestamp there.  This way the timestamp will be handled
> in the same way as large integer literals.  Quoting from the docs:
> 
> """Caveat: On machines where C's long int type has more than 32 bits
>    (such as the DEC Alpha), it is possible to create plain Python
>    integers that are longer than 32 bits. Since the current marshal
>    module uses 32 bits to transfer plain Python integers, such values
>    are silently truncated. This particularly affects the use of very
>    long integer literals in Python modules -- these will be accepted
>    by the parser on such machines, but will be silently truncated
>    when the module is read from the .pyc instead.
>    [...]
>    A solution would be to refuse such literals in the parser, since
>    they are inherently non-portable. Another solution would be to let
>    the marshal module raise an exception when an integer value would
>    be truncated. At least one of these solutions will be implemented
>    in a future version."""
> 
> Should this be 1.6?  Changing the format of .pyc files over and over
> again in the 1.x series doesn't look very attractive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From ping at lfw.org  Tue May 30 11:48:50 2000
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 30 May 2000 02:48:50 -0700 (PDT)
Subject: [Python-Dev] inspect.py
Message-ID: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>

I just posted the HTML document generator script i promised
to do at IPC8.  It's at http://www.lfw.org/python/ (see the
bottom of the page).

The reason i'm mentioning this here is that, in the course of
doing that, i put all the introspection work in a separate
module called "inspect.py".  It's at

    http://www.lfw.org/python/inspect.py

It tries to encapsulate the interface provided by func_*, co_*,
et al. with something a little richer.  It can handle anonymous
(tuple) arguments for you, for example.  It can also get the
source code of any function, method, or class for you, as long
as the original .py file is still available.  And more stuff
like that.
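
For example, usage looks roughly like this (the names here are just a
guess at the flavour of the API -- check the module itself for what it
actually exports):

    import inspect

    def f((a, b), c=1):
        return a + b + c

    print inspect.getsource(f)     # the function's source text, if the
                                   # original .py file is still available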

I think most of this stuff is quite generally useful, and it
seems good to wrap this up in a module.  I'd like your thoughts
on whether this is worth including in the standard library.



-- ?!ng

"To be human is to continually change.  Your desire to remain as you are
is what ultimately limits you."
    -- The Puppet Master, Ghost in the Shell




From effbot at telia.com  Tue May 30 12:26:29 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 12:26:29 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>

I wrote:

> what's the best way to deal with this?  I see three alter-
> natives:
> 
> a) stick to the old definition, and use chr(10) also for
>    unicode strings
> 
> b) use different definitions for 8-bit strings and unicode
>    strings; if given an 8-bit string, use chr(10); if given
>    a 16-bit string, use the LINEBREAK predicate.
> 
> c) use LINEBREAK in either case.
> 
> > I think (c) is the "right thing", but it's the only one that may
> break existing code...

I'm probably getting old, but I don't remember if anyone followed
up on this, and I don't have time to check the archives right now.

so for the upcoming "feature complete" release, I've decided to
stick to (a).

...

for the next release, I suggest implementing a fourth alternative:

d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
   use chr(10).

background: in the current implementation, this decision has to
be made at compile time, and a compiled expression can be used
with either 8-bit strings or 16-bit strings.

a fifth alternative would be to use the locale flag to tell the
difference between unicode and 8-bit characters:

e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).

comments?

</F>

<project name="sre" complete="97.1%" />




From tismer at tismer.com  Tue May 30 13:24:55 2000
From: tismer at tismer.com (Christian Tismer)
Date: Tue, 30 May 2000 13:24:55 +0200
Subject: [Python-Dev] Proposal: .pyc file format change
References: <Pine.LNX.4.10.10005291255440.14857-100000@nebula.lyra.org>
Message-ID: <3933A507.9FA6ABD6@tismer.com>


Greg Stein wrote:
> 
> I don't think we should have a two-byte magic value. Especially where
> those two bytes are printable, 7-bit ASCII.
> 
> "But it is four bytes," you say. Nope. It is two plus a couple parameters
> that can now change over time.
> 
> To ensure uniqueness, I think a four-byte magic should stay.
> 
> I would recommend the approach of adding opcodes into the marshal format.
> Specifically, 'V' followed by a single byte. That can only occur at the
> beginning. If it is not present, then you know that you have an old
> marshal value.

Fine with me, too!
Everything that keeps the current 8 byte header intact
and doesn't break much code is fine with me. Moving
additional info intot he marshalled obejcts themselves
gives even more flexibility than any header extension.
Yes I'm all for it.

ciao - chris++

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com



From mal at lemburg.com  Tue May 30 13:36:00 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 13:36:00 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com>
Message-ID: <3933A7A0.5FAAC5FD@lemburg.com>

Here is my second version of the module. It is somewhat more
flexible and also smaller in size.

BTW, I haven't found any mention of what language and encoding
the locale 'C' assumes or defines. Currently, the module
reports these as None, meaning undefined. Are language and
encoding defined for 'C' ?

(Sorry for posting the whole module -- starship seems to be
down again...)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: localex.py
Type: text/python
Size: 19642 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20000530/e3318342/attachment-0001.bin>

From guido at python.org  Tue May 30 15:59:37 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 May 2000 08:59:37 -0500
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: Your message of "Tue, 30 May 2000 12:26:29 +0200."
             <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> 
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>  
            <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> 
Message-ID: <200005301359.IAA05484@cj20424-a.reston1.va.home.com>

> From: "Fredrik Lundh" <effbot at telia.com>
> 
> I wrote:
> 
> > what's the best way to deal with this?  I see three alter-
> > natives:
> > 
> > a) stick to the old definition, and use chr(10) also for
> >    unicode strings
> > 
> > b) use different definitions for 8-bit strings and unicode
> >    strings; if given an 8-bit string, use chr(10); if given
> >    a 16-bit string, use the LINEBREAK predicate.
> > 
> > c) use LINEBREAK in either case.
> > 
> > I think (c) is the "right thing", but it's the only one that may
> > break existing code...
> 
> I'm probably getting old, but I don't remember if anyone followed
> up on this, and I don't have time to check the archives right now.
> 
> so for the upcoming "feature complete" release, I've decided to
> stick to (a).
> 
> ...
> 
> for the next release, I suggest implementing a fourth alternative:
> 
> d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
>    use chr(10).
> 
> background: in the current implementation, this decision has to
> be made at compile time, and a compiled expression can be used
> with either 8-bit strings or 16-bit strings.
> 
> a fifth alternative would be to use the locale flag to tell the
> difference between unicode and 8-bit characters:
> 
> e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).
> 
> comments?

I proposed before to see what Perl does -- since we're supposedly
following Perl's RE syntax anyway.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Tue May 30 14:03:17 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 14:03:17 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same 
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>
Message-ID: <3933AE05.4640A75D@lemburg.com>

Fredrik Lundh wrote:
> 
> I wrote:
> 
> > what's the best way to deal with this?  I see three alter-
> > natives:
> >
> > a) stick to the old definition, and use chr(10) also for
> >    unicode strings
> >
> > b) use different definitions for 8-bit strings and unicode
> >    strings; if given an 8-bit string, use chr(10); if given
> >    a 16-bit string, use the LINEBREAK predicate.
> >
> > c) use LINEBREAK in either case.
> >
> > I think (c) is the "right thing", but it's the only one that may
> > break existing code...
> 
> I'm probably getting old, but I don't remember if anyone followed
> up on this, and I don't have time to check the archives right now.
> 
> so for the upcoming "feature complete" release, I've decided to
> stick to (a).
> 
> ...
> 
> for the next release, I suggest implementing a fourth alternative:
> 
> d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
>    use chr(10).
> 
> background: in the current implementation, this decision has to
> be made at compile time, and a compiled expression can be used
> with either 8-bit strings or 16-bit strings.
> 
> a fifth alternative would be to use the locale flag to tell the
> difference between unicode and 8-bit characters:
> 
> e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).
> 
> comments?

For Unicode objects you should really default to using the 
Py_UNICODE_ISLINEBREAK() macro which defines all line break
characters (note that CRLF should be interpreted as a
single line break; see PyUnicode_Splitlines()). The reason
here is that Unicode defines how to handle line breaks
and we should try to stick to the standard as closely as possible.
All other possibilities could still be made available via new
flags.

For 8-bit strings I'd suggest sticking to the re definition.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Tue May 30 15:40:53 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 30 May 2000 06:40:53 -0700 (PDT)
Subject: [Python-Dev] String encoding
In-Reply-To: <200005280628.IAA01239@loewis.home.cs.tu-berlin.de>
Message-ID: <Pine.LNX.4.10.10005300638110.21070-100000@mailhost.beopen.com>

On Sun, 28 May 2000, Martin v. Loewis wrote:
 > In ISO (!) C99, an implementation may define __STDC_ISO_10646__ to
 > indicate that wchar_t is Unicode. The exact wording is

  This is a real improvement!  I've seen brief summaries of the changes
in C99, but I should take a little time to become more familiar with them.
It looked like a real improvement.

 > Of course, at the moment, there are few, if any, implementations that
 > define this macro.

  I think the gcc people are still working on it, but that's to be
expected; there's a lot of things they're still working on.  ;)


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From fredrik at pythonware.com  Tue May 30 16:23:46 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:23:46 +0200
Subject: [Python-Dev] Q: join vs. __join__ ?
Message-ID: <001101bfca42$ae9f9710$0500a8c0@secret.pythonware.com>

(re: yet another endless thread on comp.lang.python)

how about renaming the "join" method to "__join__", so we can
argue that it doesn't really exist.

</F>

<project name="sre" complete="97.1%" />




From fdrake at acm.org  Tue May 30 16:22:42 2000
From: fdrake at acm.org (Fred L. Drake)
Date: Tue, 30 May 2000 07:22:42 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where
 to install non-code files)
In-Reply-To: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
Message-ID: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com>

On Sun, 28 May 2000, Guido van Rossum wrote:
 > > Idea: Wouldn't it be nice if the 'nt' and 'mac' versions of the 'os'
 > > module would provide 'os.environ["HOME"]' similar to the posix
 > > version?  This would certainly simplify the task of application
 > > programmers intending to write portable applications.
 > 
 > This sounds like a nice idea...

  Now that this idea has fermented for a few days, I'm inclined to not
like it.  It smells of making a Unix-centric interface to something that
isn't terribly portable as a concept.
  Perhaps there should be a function that does the "right thing",
extracting os.environ["HOME"] if defined, and taking an alternate approach
(os.getcwd() or whatever) otherwise.  I don't think setting
os.environ["HOME"] in the library is a good idea because that changes the
environment that gets published to child processes beyond what the
application does.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From jeremy at alum.mit.edu  Tue May 30 16:33:02 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Tue, 30 May 2000 10:33:02 -0400 (EDT)
Subject: [Python-Dev] SRE snapshot broken
Message-ID: <14643.53534.143126.349006@localhost.localdomain>

I believe I'm looking at the current version.  (It's a file called
snapshot.zip with no version-specific identifying info that I can
find.)

The sre module changed one line in _fixflags from the CVS version.

def _fixflags(flags):
    # convert flag bitmask to sequence
    assert flags == 0
    return ()

The assert flags == 0 is apparently wrong, because it gets called with
an empty tuple if you use sre.search or sre.match.

Also, assuming that simply reverting to the previous test "assert not
flags" fixes this bug, is there a test suite that I can run?  Guido
asked me to check in the current snapshot, but it's hard to tell how
to do that correctly.  It's not clear which files belong in the Python
CVS tree, nor is it clear how to test that the build worked.
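
For reference, the reverted version mentioned above would presumably
read:

def _fixflags(flags):
    # convert flag bitmask to sequence
    assert not flags        # accepts 0 as well as an empty tuple
    return ()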

Jeremy




From guido at python.org  Tue May 30 17:34:04 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 May 2000 10:34:04 -0500
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: Your message of "Tue, 30 May 2000 07:22:42 MST."
             <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com> 
References: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com> 
Message-ID: <200005301534.KAA06322@cj20424-a.reston1.va.home.com>

[Fred]
>   Now that this idea has fermented for a few days, I'm inclined to not
> like it.  It smells of making a Unix-centric interface to something that
> isn't terribly portable as a concept.
>   Perhaps there should be a function that does the "right thing",
> extracting os.environ["HOME"] if defined, and taking an alternate approach
> (os.getcwd() or whatever) otherwise.  I don't think setting
> os.environ["HOME"] in the library is a good idea because that changes the
> environment that gets published to child processes beyond what the
> application does.

The passing on to child processes doesn't sound like a big deal to me.
Either these are Python programs, in which case they might appreciate
that the work has already been done, or they aren't, in which case
they probably don't look at $HOME at all (since apparently they worked
before).

I could see defining a new API, e.g. os.gethomedir(), but that doesn't
help all the programs that currently use $HOME...  Perhaps we could do
both?  (I.e. add os.gethomedir() *and* set $HOME.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fredrik at pythonware.com  Tue May 30 16:24:59 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:24:59 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>             <001d01bfca21$8549c8c0$f2a6b5d4@hagrid>  <200005301359.IAA05484@cj20424-a.reston1.va.home.com>
Message-ID: <002801bfca44$b533c900$0500a8c0@secret.pythonware.com>

Guido van Rossum wrote:
> I proposed before to see what Perl does -- since we're supposedly
> following Perl's RE syntax anyway.

anyone happen to have 5.6 on their box?

</F>

<project name="sre" complete="97.1%" />




From fredrik at pythonware.com  Tue May 30 16:38:29 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 16:38:29 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com>
Message-ID: <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
...
> > background: in the current implementation, this decision has to
> > be made at compile time, and a compiled expression can be used
> > with either 8-bit strings or 16-bit strings.
...
> For Unicode objects you should really default to using the 
> Py_UNICODE_ISLINEBREAK() macro which defines all line break
> characters (note that CRLF should be interpreted as a
> single line break; see PyUnicode_Splitlines()). The reason
> here is that Unicode defines how to handle line breaks
> and we should try to stick to the standard as closely as possible.
> All other possibilities could still be made available via new
> flags.
> 
> For 8-bit strings I'd suggest sticking to the re definition.

guess my background description wasn't clear:

Once a pattern has been compiled, it will always handle line
endings in the same way. The parser doesn't really care if the
pattern is a unicode string or an 8-bit string (unicode strings
can contain "wide" characters, but that's the only difference).

At the other end, the same compiled pattern can be applied
to either 8-bit or unicode strings.  It's all just characters to
the engine...

Now, I can of course change the engine so that it always uses
chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but the
result is that

    pattern.match(widestring)

won't necessarily match the same thing as

    pattern.match(str(widestring))

even if the wide string only contains plain ASCII.

(another alternative is to recompile the pattern for each target
string type, but that will hurt performance...)
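
to make the difference concrete, a rough sketch (the behaviour in the
comments is the *hypothetical* per-type rule, not something sre does
today):

    import re   # the point is the line-ending rule, not the module

    p = re.compile("foo$", re.M)
    s = u"foo\rbar"       # plain ASCII, but CR is a Unicode linebreak

    p.search(s)           # LINEBREAK rule: "foo" would match before the CR
    p.search(str(s))      # chr(10) rule: no match, CR is not a newline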

</F>

<project name="sre" complete="97.1%" />




From mal at lemburg.com  Tue May 30 16:57:57 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 16:57:57 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same 
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com>
Message-ID: <3933D6F5.F6BDA39@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> ...
> > > background: in the current implementation, this decision has to
> > > be made at compile time, and a compiled expression can be used
> > > with either 8-bit strings or 16-bit strings.
> ...
> > For Unicode objects you should really default to using the
> > Py_UNICODE_ISLINEBREAK() macro which defines all line break
> > characters (note that CRLF should be interpreted as a
> > single line break; see PyUnicode_Splitlines()). The reason
> > here is that Unicode defines how to handle line breaks
> > and we should try to stick to the standard as closely as possible.
> > All other possibilities could still be made available via new
> > flags.
> >
> > For 8-bit strings I'd suggest sticking to the re definition.
> 
> guess my background description wasn't clear:
> 
> Once a pattern has been compiled, it will always handle line
> endings in the same way. The parser doesn't really care if the
> pattern is a unicode string or an 8-bit string (unicode strings
> can contain "wide" characters, but that's the only difference).

Ok.

> At the other end, the same compiled pattern can be applied
> to either 8-bit or unicode strings.  It's all just characters to
> the engine...

Doesn't the engine remember whether the pattern was a string
or Unicode ?
 
> Now, I can of course change the engine so that it always uses
> chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but the
> result is that
> 
>     pattern.match(widestring)
> 
> won't necessarily match the same thing as
> 
>     pattern.match(str(widestring))
> 
> even if the wide string only contains plain ASCII.

Hmm, I wouldn't mind, as long as the engine does the right
thing for Unicode which is to respect the line break
standard defined in Unicode TR13.

Thinking about this some more: I wouldn't even mind if
the engine would use LINEBREAK for all strings :-). It would
certainly make life easier whenever you have to deal with
file input from different platforms, e.g. Mac, Unix and
Windows.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fredrik at pythonware.com  Tue May 30 17:14:00 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 17:14:00 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com>
Message-ID: <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>

M.-A. Lemburg <mal at lemburg.com> wrote:
> BTW, I haven't found any mention of what language and encoding
> the locale 'C' assumes or defines. Currently, the module
> reports these as None, meaning undefined. Are language and
> encoding defined for 'C' ?

IIRC, the C locale (and the POSIX character set) is defined in terms
of a "portable character set".  This set contains all ASCII characters,
but doesn't specify what code points to use.

But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
using Python on a non-ASCII platform?  does it even build and run
on such a beast?)

</F>

<project name="sre" complete="97.1%" />




From fredrik at pythonware.com  Tue May 30 17:19:48 2000
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 30 May 2000 17:19:48 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com> <3933D6F5.F6BDA39@lemburg.com>
Message-ID: <00b601bfca4a$7f0aad20$0500a8c0@secret.pythonware.com>

M.-A. Lemburg wrote:
> > At the other end, the same compiled pattern can be applied
> > to either 8-bit or unicode strings.  It's all just characters to
> > the engine...
> 
> Doesn't the engine remember whether the pattern was a string
> or Unicode ?

The pattern object contains a reference to the original pattern
string, so I guess the answer is "yes, but indirectly".  But the core
engine doesn't really care -- it just follows the instructions in the
compiled pattern.

> Thinking about this some more: I wouldn't even mind if
> the engine would use LINEBREAK for all strings :-). It would
> certainly make life easier whenever you have to deal with
> file input from different platforms, e.g. Mac, Unix and
> Windows.

That's what I originally proposed (and implemented).  But this may
(in theory, at least) break existing code.  If nothing else, it broke the
test suite ;-)

</F>

<project name="sre" complete="97.1%" />




From akuchlin at cnri.reston.va.us  Tue May 30 17:16:14 2000
From: akuchlin at cnri.reston.va.us (Andrew M. Kuchling)
Date: Tue, 30 May 2000 11:16:14 -0400
Subject: [Python-Dev] Re: Extending locale.py
In-Reply-To: <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>; from fredrik@pythonware.com on Tue, May 30, 2000 at 05:14:00PM +0200
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com> <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>
Message-ID: <20000530111614.B7942@amarok.cnri.reston.va.us>

On Tue, May 30, 2000 at 05:14:00PM +0200, Fredrik Lundh wrote:
>But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
>using Python on a non-ASCII platform?  does it even build and run
>on such a beast?)

The OS/390 port of 1.4? (http://www.s390.ibm.com/products/oe/python.html)
But it doesn't look like they ported the regex module at all.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Better get going; your parents still think me imaginary, and I'd hate to
shatter an illusion like that before dinner.
  -- The monster, in STANLEY AND HIS MONSTER #1





From gmcm at hypernet.com  Tue May 30 17:29:39 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Tue, 30 May 2000 11:29:39 -0400
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <Pine.LNX.4.10.10005300715520.21070-100000@mailhost.beopen.com>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
Message-ID: <1252421881-3397332@hypernet.com>

Fred L. Drake wrote:

>   Now that this idea has fermented for a few days, I'm inclined
>   to not
> like it.  It smells of making a Unix-centric interface to something
> that isn't terribly portable as a concept.

I've refrained from jumping in here (as, it seems, have all the 
Windows users) because this is a god-awful friggin' mess on 
Windows.


From fdrake at acm.org  Tue May 30 18:10:29 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 09:10:29 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <1252421881-3397332@hypernet.com>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
	<1252421881-3397332@hypernet.com>
Message-ID: <14643.59381.73286.292195@mailhost.beopen.com>

Gordon McMillan writes:
 > From the 10**3 foot view, yes, they have the concept. From 
 > any closer it falls apart miserably.

  So they have the concept, just no implementation.  ;)  Sounds like
leaving it up to the application to interpret their requirements is
the right thing.  Or the right thing is to provide a function to ask
where configuration information should be stored for the
user/application; this would be $HOME under Unix and <whatever> on
Windows.  The only other reason I can think of that $HOME is needed is
for navigation purposes (as in a filesystem browser), and for that the
application needs to deal with the lack of the concept in the
operating system as appropriate.

 > (An cmd.exe "cd" w/o arg acts like "pwd". I notice that the 
 > bash shell requires you to set $HOME, and won't make any 
 > guesses.)

  This very definitely sounds like overloading $HOME is the wrong
thing.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From pf at artcom-gmbh.de  Tue May 30 18:37:41 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Tue, 30 May 2000 18:37:41 +0200 (MEST)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <200005301534.KAA06322@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 30, 2000 10:34: 4 am"
Message-ID: <m12wp1R-000DifC@artcom0.artcom-gmbh.de>

> [Fred]
> >   Now that this idea has fermented for a few days, I'm inclined to not
> > like it.  It smells of making a Unix-centric interface to something that
> > isn't terribly portable as a concept.

Yes.  After thinking more carefully and after a closer look at what 
Jack Jansen finally figured out for MacOS (see 
	<http://www.python.org/pipermail/pythonmac-sig/2000-May/003667.html>
) I agree with Fred.  My initial idea to put something into
'os.environ["HOME"]' on those platforms was too simple-minded.

> >   Perhaps there should be a function that does the "right thing",
> > extracting os.environ["HOME"] if defined, and taking an alternate approach
> > (os.getcwd() or whatever) otherwise.  
[...]

Every serious (non-trivial) application usually contains something like 
"user preferences" or other state information, which should --if possible-- 
survive the following kinds of events:
  1. An upgrade of the application to a newer version.  This is
     often accomplished by removing the directory tree in which the
     application lives and replacing it by unpacking or installing
     an archive containing the new version of the application.
  2. Another colleague uses the application on the same computer and
     modifies settings to fit his personal taste.

On several versions of WinXX and on MacOS prior to release 9.X (and due
to stability problems with the multiuser capabilities even in MacOS 9)
the second kind of event seems to be rather unimportant to the users
of these platforms, since the OSes are considered "single user"
systems anyway.  In other words, the users are already used to
this situation.

Only the first kind of event needs to be solved on all platforms:  
<FANTASY>
    Imagine you are using grail version 4.61 on a daily basis for WWW 
    browsing and one day you decide to install the nifty upgrade 
    grail 4.73 on your computer running WinXX or MacOS X.Y 
    and after doing so you discover that all your carefully
    sorted bookmarks are gone!  That wouldn't be nice, would it?
</FANTASY>

I see some similarities here to the standard library module 'tempfile',
which supplies (or at least tries to ;-) ) a cross-platform portable
strategy for all applications that have to store temporary data.

My intention was to have a simple cross-platform portable API to store
and retrieve such user-specific state information (examples: the bookmarks
of a Web browser, themes, color settings, fonts...  other GUI settings, 
and so on... you get the picture).  On Unices, applications usually use the
idiom 
	os.path.join(os.environ.get("HOME", "."), ".dotfoobar")
or something similar.

Do people remember 'grail'?  I've just stolen the following code snippets
from 'grail0.6/grailbase/utils.py' to demonstrate that this is still 
a very common programming problem:
---------------- snip ---------------------
# XXX Unix specific stuff
# XXX (Actually it limps along just fine for Macintosh, too)
 
def getgraildir():
    return getenv("GRAILDIR") or os.path.join(gethome(), ".grail")    
----- snip ------
def gethome():
    try:
        home = getenv("HOME")
        if not home:
            import pwd
            user = getenv("USER") or getenv("LOGNAME")
            if not user:
                pwent = pwd.getpwuid(os.getuid())
            else:
                pwent = pwd.getpwnam(user)
            home = pwent[6]
        return home
    except (KeyError, ImportError):
        return os.curdir
---------------- snap ---------------------
[...]

[Guido van Rossum]:
> I could see defining a new API, e.g. os.gethomedir(), but that doesn't
> help all the programs that currently use $HOME...  Perhaps we could do
> both?  (I.e. add os.gethomedir() *and* set $HOME.)

I'm not sure whether this is really generic enough for the OS module.

Maybe we should introduce a new small standard library module called 
'userprefs' or some such?  A programmer with a MacOS or WinXX  background 
will probably not know what to do with 'os.gethomedir()'.  

However for the time being this module would only contain one simple 
function returning a directory pathname, which is guaranteed to exist 
and to survive a deinstallation of an application.  Maybe introducing
a new module is overkill?  What do you think?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From fdrake at acm.org  Tue May 30 19:17:56 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 10:17:56 -0700 (PDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <m12wp1R-000DifC@artcom0.artcom-gmbh.de>
References: <200005301534.KAA06322@cj20424-a.reston1.va.home.com>
	<m12wp1R-000DifC@artcom0.artcom-gmbh.de>
Message-ID: <14643.63428.387306.455383@mailhost.beopen.com>

Peter Funk writes:
 > <FANTASY>
 >     Imagine you are using grail version 4.61 on a daily basis for WWW 
 >     browsing and one day you decide to install the nifty upgrade 
 >     grail 4.73 on your computer running WinXX or MacOS X.Y 

  Good thing you marked that as fantasy -- I would have asked for the
download URL!  ;)

 > Do people remember 'grail'?  I've just stolen the following code snippets

  Not on good days.  ;)

 > I'm not sure whether this is really generic enough for the OS module.

  The location selected is constrained by the OS, but this isn't an
exposure of operating system functionality, so there should probably
be something else.

 > Maybe we should introduce a new small standard library module called 
 > 'userprefs' or some such?  A programmer with a MacOS or WinXX  background 
 > will probably not know what to do with 'os.gethomedir()'.  
 > 
 > However for the time being this module would only contain one simple 
 > function returning a directory pathname, which is guaranteed to exist 
 > and to survive a deinstallation of an application.  Maybe introducing

  Look at your $HOME on Unix box; most of the dotfiles are *files*, not
directories, and that's all most applications need; Web browsers are a
special case in this way; there aren't that many things that require a
directory.  Those things which do are often programs that form an
essential part of a user's environment -- Web browsers and email
clients are two good examples I've seen that really seem to have a lot
of things.
  I think what's needed is a function to return the location where the
application can make one directory entry.  The caller is still
responsible for creating a directory to store a larger set of files if
needed.  Something like grailbase.utils.establish_dir() might be a
nice convenience function.
  An additional convenience may be to offer a function which takes the
application name and a dotfile name, and returns the one to use; the
Windows and MacOS (and BeOS?) worlds seem more comfortable with the
longer, mixed-case, more readable names, while the Unix world enjoys
cryptic little names with a dot at the front.
  Ok, so now that I've rambled, the "userprefs" module looks like it
contains:

        get_appdata_root() -- $HOME, or other based on platform
        get_appdata_name() -- "MyApplication Preferences" or ".myapp"
        establish_dir() -- create dir if it doesn't exist

  Maybe this really is a separate module.  ;)
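
  A very rough sketch of what that might look like (the names and the
non-Unix behavior are pure guesswork at this point):

        import os

        def get_appdata_root():
            # $HOME on Unix; other platforms would plug in their own
            # notion of a per-user data area here
            return os.environ.get("HOME", os.getcwd())

        def get_appdata_name(appname, dotname):
            # ".myapp" on Unix, "MyApplication Preferences" elsewhere
            if os.name == "posix":
                return dotname
            return appname + " Preferences"

        def establish_dir(dir):
            # create the directory if it doesn't exist yet, return it
            if not os.path.isdir(dir):
                os.mkdir(dir)
            return dir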


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From mal at lemburg.com  Tue May 30 19:54:32 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 19:54:32 +0200
Subject: [Python-Dev] Re: Extending locale.py
References: <392E8EF3.CDA61525@lemburg.com> <3933A7A0.5FAAC5FD@lemburg.com> <00a001bfca49$af8bc7a0$0500a8c0@secret.pythonware.com>
Message-ID: <39340058.CA3FC798@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal at lemburg.com> wrote:
> > BTW, I haven't found any mention of what language and encoding
> > the locale 'C' assumes or defines. Currently, the module
> > reports these as None, meaning undefined. Are language and
> > encoding defined for 'C' ?
> 
> IIRC, the C locale (and the POSIX character set) is defined in terms
> of a "portable character set".  This set contains all ASCII characters,
> but doesn't specify what code points to use.
> 
> But I think it's safe to assume 7-bit US ASCII.  (Is anyone anywhere
> using Python on a non-ASCII platform?  does it even build and run
> on such a beast?)

Hmm, that would mean having an encoding, but no language
definition available -- setlocale() doesn't work without
language code... I guess it's better to leave things
undefined in that case.

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue May 30 19:57:41 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 30 May 2000 19:57:41 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same 
 thing as a linebreak?
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <001d01bfca21$8549c8c0$f2a6b5d4@hagrid> <3933AE05.4640A75D@lemburg.com> <002901bfca44$b99ffcc0$0500a8c0@secret.pythonware.com> <3933D6F5.F6BDA39@lemburg.com> <00b601bfca4a$7f0aad20$0500a8c0@secret.pythonware.com>
Message-ID: <39340115.7E05DA6C@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > > At the other end, the same compiled pattern can be applied
> > > to either 8-bit or unicode strings.  It's all just characters to
> > > the engine...
> >
> > Doesn't the engine remember whether the pattern was a string
> > or Unicode ?
> 
> The pattern object contains a reference to the original pattern
> string, so I guess the answer is "yes, but indirectly".  But the core
> engine doesn't really care -- it just follows the instructions in the
> compiled pattern.
> 
> > Thinking about this some more: I wouldn't even mind if
> > the engine would use LINEBREAK for all strings :-). It would
> > certainly make life easier whenever you have to deal with
> > file input from different platforms, e.g. Mac, Unix and
> > Windows.
> 
> That's what I originally proposed (and implemented).  But this may
> (in theory, at least) break existing code.  If nothing else, it broke the
> test suite ;-)

SRE is new, so what could it break ?

Anyway, perhaps we should wait for some Perl 5.6 wizard to
speak up ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at python.org  Tue May 30 21:16:13 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 May 2000 14:16:13 -0500
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
Message-ID: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>

FYI, here's an important announcement that I just sent to c.l.py.  I'm
very excited that we can finally announce this!

I'll be checking mail sporadically until Thursday morning.  Back on
June 19.

--Guido van Rossum (home page: http://www.python.org/~guido/)

To all Python users and developers:

Python is growing rapidly.  In order to take it to the next level,
I've moved with my core development group to a new employer,
BeOpen.com.  BeOpen.com is a startup company with a focus on open
source communities, and an interest in facilitating next generation
application development.  It is a natural fit for Python.

At BeOpen.com I am the director of a new development team named
PythonLabs.  The team includes three of my former colleagues at CNRI:
Fred Drake, Jeremy Hylton, and Barry Warsaw.  Another familiar face
will join us shortly: Tim Peters.  We have our own website
(www.pythonlabs.com) where you can read more about us, our plans and
our activities.  We've also posted a FAQ there specifically about
PythonLabs, our transition to BeOpen.com, and what it means for the
Python community.

What will change, and what will stay the same? First of all, Python
will remain Open Source.  In fact, everything we produce at PythonLabs
will be released with an Open Source license.  Also, www.python.org
will remain the number one website for the Python community.  CNRI
will continue to host it, and we'll maintain it as a community
project.

What changes is how much time we have for Python.  Previously, Python
was a hobby or side project, which had to compete with our day jobs;
at BeOpen.com we will be focused full time on Python development! This
means that we'll be able to spend much more time on exciting new
projects like Python 3000.  We'll also get support for website
management from BeOpen.com's professional web developers, and we'll
work with their marketing department.

Marketing for Python, you ask? Sure, why not! We want to grow the size
of the Python user and developer community at an even faster pace than
today.  This should benefit everyone: the larger the community, the
more resources will be available to all, and the easier it will be to
find Python expertise when you need it.  We're also planning to make
commercial offerings (within the Open Source guidelines!) to help
Python find its way into the hands of more programmers, especially in
large enterprises where adoption is still lagging.

There's one piece of bad news: Python 1.6 won't be released by June
1st.  There's simply too much left to be done.  We promise that we'll
get it out of the door as soon as possible.  By the way, Python 1.6
will be the last release from CNRI; after that, we'll issue Python
releases from BeOpen.com.

Oh, and to top it all off, I'm going on vacation.  I'm getting married
and will be relaxing on my honeymoon.  For all questions about
PythonLabs, write to pythonlabs-info at beopen.com.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From esr at thyrsus.com  Tue May 30 20:27:18 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 30 May 2000 14:27:18 -0400
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>; from guido@python.org on Tue, May 30, 2000 at 02:16:13PM -0500
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <20000530142718.A24289@thyrsus.com>

Guido van Rossum <guido at python.org>:
> Oh, and to top it all off, I'm going on vacation.  I'm getting married
> and will be relaxing on my honeymoon.

Mazel tov, Guido!

BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
to include it in 1.6?
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The Constitution is not neutral. It was designed to take the
government off the backs of the people.
	-- Justice William O. Douglas 



From fdrake at acm.org  Tue May 30 20:23:25 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 30 May 2000 11:23:25 -0700 (PDT)
Subject: [Python-Dev] ascii.py + documentation
In-Reply-To: <20000530142718.A24289@thyrsus.com>
References: <20000530142718.A24289@thyrsus.com>
Message-ID: <14644.1821.67068.165890@mailhost.beopen.com>

Eric S. Raymond writes:
 > BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
 > to include it in 1.6?

Eric,
  Apparently the rest of us haven't heard of it.  Since Guido's a
little distracted right now, perhaps you should send the files to
python-dev for discussion?
  Thanks!


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From gward at mems-exchange.org  Tue May 30 20:25:42 2000
From: gward at mems-exchange.org (Greg Ward)
Date: Tue, 30 May 2000 14:25:42 -0400
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>; from guido@python.org on Tue, May 30, 2000 at 02:16:13PM -0500
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <20000530142541.D20088@mems-exchange.org>

On 30 May 2000, Guido van Rossum said:
> At BeOpen.com I am the director of a new development team named
> PythonLabs.  The team includes three of my former colleagues at CNRI:
> Fred Drake, Jeremy Hylton, and Barry Warsaw.

Ahh, no wonder it's been so quiet around here.  I was wondering where
you guys had gone.  Mystery solved!

(It's a *joke!*  We already *knew* they were leaving...)

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367



From trentm at activestate.com  Tue May 30 20:26:38 2000
From: trentm at activestate.com (Trent Mick)
Date: Tue, 30 May 2000 11:26:38 -0700
Subject: [Python-Dev] inspect.py
In-Reply-To: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
References: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
Message-ID: <20000530112638.E18024@activestate.com>

Looks cool, Ping.

Trent


-- 
Trent Mick
trentm at activestate.com



From guido at python.org  Tue May 30 21:34:38 2000
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 May 2000 14:34:38 -0500
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: Your message of "Tue, 30 May 2000 14:27:18 -0400."
             <20000530142718.A24289@thyrsus.com> 
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>  
            <20000530142718.A24289@thyrsus.com> 
Message-ID: <200005301934.OAA07671@cj20424-a.reston1.va.home.com>

> Mazel tov, Guido!

Thanks!

> BTW, did you receive the ascii.py module and docs I sent you?  Do you plan
> to include it in 1.6?

Yes, and probably.  As Fred suggested, could you resend to the patches
list?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From effbot at telia.com  Tue May 30 20:40:13 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Tue, 30 May 2000 20:40:13 +0200
Subject: [Python-Dev] inspect.py
References: <Pine.LNX.4.10.10005300243590.2697-100000@localhost>
Message-ID: <012e01bfca66$7ed61ee0$f2a6b5d4@hagrid>

ping wrote:
> The reason i'm mentioning this here is that, in the course of
> doing that, i put all the introspection work in a separate
> module called "inspect.py".  It's at
> 
>     http://www.lfw.org/python/inspect.py
>
...
>
> I think most of this stuff is quite generally useful, and it
> seems good to wrap this up in a module.  I'd like your thoughts
> on whether this is worth including in the standard library.

haven't looked at the code (yet), but +1 on concept.

(if this goes into 1.6, I no longer have to keep reposting
pointers to my "describe" module...)

</F>




From skip at mojam.com  Tue May 30 20:43:36 2000
From: skip at mojam.com (Skip Montanaro)
Date: Tue, 30 May 2000 13:43:36 -0500 (CDT)
Subject: [Python-Dev] ANNOUNCEMENT: Python Development Team Moves to BeOpen.com
In-Reply-To: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
References: <200005301916.OAA07467@cj20424-a.reston1.va.home.com>
Message-ID: <14644.3032.900770.450584@beluga.mojam.com>

    Guido> Python is growing rapidly.  In order to take it to the next
    Guido> level, I've moved with my core development group to a new
    Guido> employer, BeOpen.com.

Great news!

    Guido> Oh, and to top it all off, I'm going on vacation.  I'm getting
    Guido> married and will be relaxing on my honeymoon.  For all questions
    Guido> about PythonLabs, write to pythonlabs-info at beopen.com.

Nice to see you are trying to maintain some consistency in the face of huge
professional and personal changes.  I would have worried if you weren't
going to go on vacation!  Congratulations on both moves...

-- 
Skip Montanaro, skip at mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did not
ask for this role...  We may not be suited to it, but here we are."
- Stephen Jay Gould



From esr at thyrsus.com  Tue May 30 20:58:38 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 30 May 2000 14:58:38 -0400
Subject: [Python-Dev] ascii.py + documentation
In-Reply-To: <14644.1821.67068.165890@mailhost.beopen.com>; from fdrake@acm.org on Tue, May 30, 2000 at 11:23:25AM -0700
References: <20000530142718.A24289@thyrsus.com> <14644.1821.67068.165890@mailhost.beopen.com>
Message-ID: <20000530145838.A24339@thyrsus.com>

Fred L. Drake, Jr. <fdrake at acm.org>:
>   Appearantly the rest of us haven't heard of it.  Since Guido's a
> little distracted right now, perhaps you should send the files to
> python-dev for discussion?

Righty-O.  Here they are enclosed.  I wrote this for use with the
curses module; one reason it's useful is because the curses
getch function returns ordinal values rather than characters.  It should
be more generally useful for any Python program with a raw character-by-
character command interface.

The TeX may need trivial markup fixes.  You might want to add a "See also"
to curses.

I'm using this code heavily in my CML2 project, so it has been tested.
For those of you who haven't heard about CML2, I've written a replacement
for the Linux kernel configuration system in Python.  You can find out more
at:

	http://www.tuxedo.org/~esr/kbuild/

The code has some interesting properties, including the ability to
probe its environment and come up in a Tk-based, curses-based, or
line-oriented mode depending on what it sees.

ascii.py will probably not be the last library code this project spawns.
I have another package called menubrowser that is a framework for writing
menu systems. And I have some Python wrapper enhancements for curses in
the works.
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The two pillars of `political correctness' are, 
  a) willful ignorance, and
  b) a steadfast refusal to face the truth
	-- George MacDonald Fraser
-------------- next part --------------
#
# ascii.py -- constants and membership tests for ASCII characters
#

NUL	= 0x00	# ^@
SOH	= 0x01	# ^A
STX	= 0x02	# ^B
ETX	= 0x03	# ^C
EOT	= 0x04	# ^D
ENQ	= 0x05	# ^E
ACK	= 0x06	# ^F
BEL	= 0x07	# ^G
BS	= 0x08	# ^H
TAB	= 0x09	# ^I
HT	= 0x09	# ^I
LF	= 0x0a	# ^J
NL	= 0x0a	# ^J
VT	= 0x0b	# ^K
FF	= 0x0c	# ^L
CR	= 0x0d	# ^M
SO	= 0x0e	# ^N
SI	= 0x0f	# ^O
DLE	= 0x10	# ^P
DC1	= 0x11	# ^Q
DC2	= 0x12	# ^R
DC3	= 0x13	# ^S
DC4	= 0x14	# ^T
NAK	= 0x15	# ^U
SYN	= 0x16	# ^V
ETB	= 0x17	# ^W
CAN	= 0x18	# ^X
EM	= 0x19	# ^Y
SUB	= 0x1a	# ^Z
ESC	= 0x1b	# ^[
FS	= 0x1c	# ^\
GS	= 0x1d	# ^]
RS	= 0x1e	# ^^
US	= 0x1f	# ^_
SP	= 0x20	# space
DEL	= 0x7f	# delete

def _ctoi(c):
    if type(c) == type(""):
        return ord(c)
    else:
        return c

def isalnum(c): return isalpha(c) or isdigit(c)
def isalpha(c): return isupper(c) or islower(c)
def isascii(c): return _ctoi(c) <= 127		# ?
def isblank(c): return _ctoi(c) in (9, 32)	# tab or space
def iscntrl(c): return _ctoi(c) <= 31
def isdigit(c): return _ctoi(c) >= 48 and _ctoi(c) <= 57
def isgraph(c): return _ctoi(c) >= 33 and _ctoi(c) <= 126
def islower(c): return _ctoi(c) >= 97 and _ctoi(c) <= 122
def isprint(c): return _ctoi(c) >= 32 and _ctoi(c) <= 126
def ispunct(c): return _ctoi(c) != 32 and not isalnum(c)
def isspace(c): return _ctoi(c) in (9, 10, 11, 12, 13, 32)	# includes space
def isupper(c): return _ctoi(c) >= 65 and _ctoi(c) <= 90
def isxdigit(c): return isdigit(c) or \
    (_ctoi(c) >= 65 and _ctoi(c) <= 70) or (_ctoi(c) >= 97 and _ctoi(c) <= 102)

def ctrl(c):
    if type(c) == type(""):
        return chr(_ctoi(c) & 0x1f)
    else:
        return _ctoi(c) & 0x1f

def alt(c):
    if type(c) == type(""):
        return chr(_ctoi(c) | 0x80)
    else:
        return _ctoi(c) | 0x80
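
# A possible usage sketch (hypothetical interactive session):
#
#     >>> import ascii
#     >>> ascii.isdigit('7'), ascii.isdigit('a')
#     (1, 0)
#     >>> ascii.ctrl('c') == chr(ascii.ETX)
#     1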




-------------- next part --------------
A non-text attachment was scrubbed...
Name: ascii.tex
Type: application/x-tex
Size: 3250 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20000530/2e3b28fb/attachment-0001.bin>

From jeremy at alum.mit.edu  Tue May 30 23:09:13 2000
From: jeremy at alum.mit.edu (Jeremy Hylton)
Date: Tue, 30 May 2000 17:09:13 -0400 (EDT)
Subject: [Python-Dev] Python 3000 is going to be *really* different
Message-ID: <14644.11769.197518.938252@localhost.localdomain>

http://www.autopreservers.com/autope07.html

Jeremy




From paul at prescod.net  Wed May 31 07:53:47 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 00:53:47 -0500
Subject: [Python-Dev] SIG: python-lang
Message-ID: <3934A8EB.6608B0E1@prescod.net>

I think that we need a forum somewhere between comp.lang.python and
pythondev. Let's call it python-lang.

By virtue of being buried on the "sigs" page, python-lang would be
mostly only accessible to those who have more than a cursory interest in
Python. Furthermore, you would have to go through a simple
administration procedure to join, as you do with any mailman list.

Appropriate topics for python-lang would be new ideas about language
features. Participants would be expected and encouraged to use archives
and FAQs to avoid repetitive topics. Particularly verboten would be
"ritual topics": indentation, case sensitivity, integer division,
language comparisons, etc. These discussions would be redirected loudly
and firmly to comp.lang.python.

Python-dev would remain invitation-only, but it would focus on the
day-to-day mechanics of getting new versions of Python out the door.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than 
having the foundations of his edifice shaken after the work is 
finished.  I have been placed in this position by a letter from 
Mr. Bertrand Russell..." 
 - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)



From nhodgson at bigpond.net.au  Wed May 31 08:39:34 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Wed, 31 May 2000 16:39:34 +1000
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com> <1252421881-3397332@hypernet.com>
Message-ID: <019b01bfcaca$fd72ffc0$e3cb8490@neil>

Gordon writes,

> But there's no $HOME as such.
>
> There's
> HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\S
> hell Folders with around 16 subkeys, including AppData
> (which on my system has one entry installed by a program I've
> never used and didn't know I had). But MSOffice uses the
> Personal subkey. Others seem to use the Desktop subkey.

   SHGetSpecialFolderPath(,,CSIDL_APPDATA,) would be the current 'MS
preferred' method for this as it allows roaming (not that I've ever seen
roaming work). If Unix code expects $HOME to be per machine (and so used to
store, for example, window locations which are dependent on screen
resolution) then CSIDL_LOCAL_APPDATA would be a better choice.

   To make these work on 9x and NT 4 Microsoft provides a redistributable
Shfolder.dll.

   Fred writes,

>  Look at your $HOME on Unix box; most of the dotfiles are *files*, not
> directories, and that's all most applications need;

   This may have been the case in the past and for people who understand
Unix well enough to maintain it, but for us just-want-it-to-run folks, it's
no longer true. I formatted my Linux partition this week and installed Red
Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
outnumber the dot files 18 to 16.

   Neil




From pf at artcom-gmbh.de  Wed May 31 09:34:34 2000
From: pf at artcom-gmbh.de (Peter Funk)
Date: Wed, 31 May 2000 09:34:34 +0200 (MEST)
Subject: [Python-Dev] 'userprefs.py': Looking for help for WinXX (was Re: user dirs on Non-Unix platforms...)
In-Reply-To: <019b01bfcaca$fd72ffc0$e3cb8490@neil> from Neil Hodgson at "May 31, 2000  4:39:34 pm"
Message-ID: <m12x31O-000DifC@artcom0.artcom-gmbh.de>

> Gordon writes,
> 
> > But there's no $HOME as such.
> >
> > There's
> > HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\S
> > hell Folders with around 16 subkeys, including AppData
> > (which on my system has one entry installed by a program I've
> > never used and didn't know I had). But MSOffice uses the
> > Personal subkey. Others seem to use the Desktop subkey.

Neil responds:
>    SHGetSpecialFolderPath(,,CSIDL_APPDATA,) would be the current 'MS
> preferred' method for this as it allows roaming (not that I've ever seen
> roaming work). If Unix code expects $HOME to be per machine (and so used to
> store, for example, window locations which are dependent on screen
> resolution) then CSIDL_LOCAL_APPDATA would be a better choice.
> 
>    To make these work on 9x and NT 4 Microsoft provides a redistributable
> Shfolder.dll.

Using a place on the local machine of the user makes more sense to me.

But excuse my ignorance: I've just 'grep'ed through the Python 1.6a2
sources and also through Mark Hammond's Win32 Python extension C++
sources (here on my Notebook running Linux) and found nothing called
'SHGetSpecialFolderPath'.  So I believe this API is currently not
exposed to the Python level.  Right?

So it would be very nice if you WinXX gurus, more familiar with the WinXX
platform, would come up with some Python code snippet which I could try
to include in an upcoming standard library 'userprefs.py' I plan to
write.  Something like:
    if os.name == 'nt':
        try:
            import win32XYZ
            if hasattr(win32XYZ, 'SHGetSpecialFolderPath'):
                userplace = win32XYZ.SHGetSpecialFolderPath(.....) 
        except ImportError:
            .....
would be very fine.

>    Fred writes,
> 
> >  Look at your $HOME on Unix box; most of the dotfiles are *files*, not
> > directories, and that's all most applications need;
> 
>    This may have been the case in the past and for people who understand
 > Unix well enough to maintain it, but for us just-want-it-to-run folks, it's
> no longer true. I formatted my Linux partition this week and installed Red
> Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
> outnumber the dot files 18 to 16.

Fred proposed an API which leaves the decision whether to use a
single file or to use several files in a special directory up to the
application developer.  

I agree with Fred.  

Simple applications will use only a simple config file, whereas bigger
applications will need a directory to store several files.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)



From nhodgson at bigpond.net.au  Wed May 31 10:18:20 2000
From: nhodgson at bigpond.net.au (Neil Hodgson)
Date: Wed, 31 May 2000 18:18:20 +1000
Subject: [Python-Dev] 'userprefs.py': Looking for help for WinXX (was Re: user dirs on Non-Unix platforms...)
References: <m12x31O-000DifC@artcom0.artcom-gmbh.de>
Message-ID: <006b01bfcad8$c899c2d0$e3cb8490@neil>

> Using a place on the local machine of the user makes more sense to me.
>
> But excuse my ignorance: I've just 'grep'ed through the Python 1.6a2
> sources and also through Mark Hammond's Win32 Python extension C++
> sources (here on my Notebook running Linux) and found nothing called
> 'SHGetSpecialFolderPath'.  So I believe this API is currently not
> exposed to the Python level.  Right?

   Only through the Win32 Python extensions, I think:

>>> from win32com.shell import shell
>>> from win32com.shell import shellcon
>>> shell.SHGetSpecialFolderPath(0, shellcon.CSIDL_APPDATA)
u'G:\\Documents and Settings\\Neil1\\Application Data'
>>> shell.SHGetSpecialFolderPath(0, shellcon.CSIDL_LOCAL_APPDATA)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: CSIDL_LOCAL_APPDATA
>>> shell.SHGetSpecialFolderPath(0, 0x1c)
u'G:\\Documents and Settings\\Neil1\\Local Settings\\Application Data'

   Looks like CSIDL_LOCAL_APPDATA isn't included yet, but its value is 0x1c.
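
   Glued into Peter's sketch, that might give something like this (rough
and untested, using only the calls shown above; everything else falls
back to $HOME):

import os

def get_appdata_dir():
    if os.name == "nt":
        try:
            from win32com.shell import shell, shellcon
            return shell.SHGetSpecialFolderPath(0, shellcon.CSIDL_APPDATA)
        except ImportError:
            pass   # win32 extensions not installed, fall through
    return os.environ.get("HOME", os.getcwd())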

   Neil




From effbot at telia.com  Wed May 31 16:05:41 2000
From: effbot at telia.com (Fredrik Lundh)
Date: Wed, 31 May 2000 16:05:41 +0200
Subject: [Python-Dev] Q: maybe rlcompleter shouldn't expose __builtins__?
Message-ID: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>

from comp.lang.python:

> Thanks for the info.  This choice of name is very confusing, to say the least.
> I used commandline completion with __buil TAB, and got __builtins__.

a simple way to avoid this problem is to change global_matches
in rlcompleter.py so that it doesn't return this name.  I suggest
changing:

                if word[:n] == text:

to

                if word[:n] == text and word != "__builtins__":
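
in context, a simplified global_matches would then look roughly like
this (sketch only; the real method lives on the Completer class and
also scans keyword.kwlist):

    import __main__, __builtin__

    def global_matches(text):
        matches = []
        n = len(text)
        for namespace in (__main__.__dict__, __builtin__.__dict__):
            for word in namespace.keys():
                if word[:n] == text and word != "__builtins__":
                    matches.append(word)
        return matches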

Comments?

(should we do a series of double-blind tests first? ;-)

</F>

    "People Propose, Science Studies, Technology Conforms"
    -- Don Norman





From fdrake at acm.org  Wed May 31 16:32:27 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 10:32:27 -0400 (EDT)
Subject: [Python-Dev] Q: maybe rlcompleter shouldn't expose __builtins__?
In-Reply-To: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>
References: <014901bfcb09$4f4db4a0$f2a6b5d4@hagrid>
Message-ID: <14645.8827.104869.733028@cj42289-a.reston1.va.home.com>

Fredrik Lundh writes:
 > a simple way to avoid this problem is to change global_matches
 > in rlcompleter.py so that it doesn't return this name.  I suggest
 > changing:

  I've made the change in both global_matches() and attr_matches(); we
don't want to see it as a module attribute any more than as a global.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>




From gstein at lyra.org  Wed May 31 17:04:20 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 08:04:20 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <3934A8EB.6608B0E1@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>

[ The correct forum is probably meta-sig. ]

IMO, I don't see a need for yet another forum. The dividing lines become a
bit too blurry, and it will result in questions like "where do I post
this?" Or "what is the difference between python-lang at python.org and
python-list at python.org?"

Cheers,
-g

On Wed, 31 May 2000, Paul Prescod wrote:
> I think that we need a forum somewhere between comp.lang.python and
> pythondev. Let's call it python-lang.
> 
> By virtue of being buried on the "sigs" page, python-lang would be
> mostly only accessible to those who have more than a cursory interest in
> Python. Furthermore, you would have to go through a simple
> administration procedure to join, as you do with any mailman list.
> 
> Appropriate topics for python-lang would be new ideas about language
> features. Participants would be expected and encouraged to use archives
> and FAQs to avoid repetitive topics. Particularly verboten would be
> "ritual topics": indentation, case sensitivity, integer division,
> language comparisons, etc. These discussions would be redirected loudly
> and firmly to comp.lang.python.
> 
> Python-dev would remain invitation only but it would focus on the day to
> day mechanics of getting new versions of Python out the door.
> 
> -- 
>  Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
> "Hardly anything more unwelcome can befall a scientific writer than 
> having the foundations of his edifice shaken after the work is 
> finished.  I have been placed in this position by a letter from 
> Mr. Bertrand Russell..." 
>  - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)

-- 
Greg Stein, http://www.lyra.org/




From fdrake at acm.org  Wed May 31 17:09:03 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 11:09:03 -0400 (EDT)
Subject: Per user dirs on Non-Unix platforms (was Re: [Python-Dev] Where to install non-code files)
In-Reply-To: <019b01bfcaca$fd72ffc0$e3cb8490@neil>
References: <200005290000.TAA02136@cj20424-a.reston1.va.home.com>
	<1252421881-3397332@hypernet.com>
	<019b01bfcaca$fd72ffc0$e3cb8490@neil>
Message-ID: <14645.11023.118707.176016@cj42289-a.reston1.va.home.com>

Neil Hodgson writes:
 > roaming work). If Unix code expects $HOME to be per machine (and so used to
 > store, for example, window locations which are dependent on screen
 > resolution) then CSIDL_LOCAL_APPDATA would be a better choice.

  This makes me think that there's a need for both per-host and
per-user directories, but I don't know of a good strategy for dealing
with this in general.  Many applications have both kinds of data, but
clump it all together.  What "the norm" is on Unix, I don't really
know, but what I've seen is typically that /home/ is often mounted
over NFS, and so shared for many hosts.  I've seen it always be local
as well, which I find really annoying, but it is easier to support
host-local information.  The catch is that very little information is
*really* host-local, especially using X11 (where window
configurations are display-local at most, and the user may prefer them
to be display-size-local ;).
  What it boils down to is that doing too much before the separations
can be maintained easily is premature; a lot of that separation needs
to be handled inside the application, which knows what information is
user-specific and what *might* be host- or display-specific.  An
abstraction in the standard library general enough to cover all these
cases is likely to be hard to use.
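  (For illustration only: the sort of helper an application might write
for itself today.  The name and layout here are invented; nothing like
this is being proposed for the standard library.)

    import os

    def user_config_dir(appname):
        # Per-user location only; whether it is host-local or shared
        # over the network is exactly the distinction the application
        # still has to make for itself.
        if os.name == "nt":
            base = os.environ.get("APPDATA") or os.environ.get("USERPROFILE", ".")
        else:
            base = os.environ.get("HOME", ".")
        return os.path.join(base, "." + appname)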

I wrote:
 >  Look at your $HOME on Unix box; most of the dotfiles are *files*, not
 > directories, and that's all most applications need;

And Neil commented:
 >    This may have been the case in the past and for people who understand
 > Unix well enough to maintain it, but for us just-want-it-to-run folks, it's
 > no longer true. I formatted my Linux partition this week and installed Red
 > Hat 6.2 and Gnome 1.2 and then used a few applications. The dot directories
 > outnumber the dot files 18 to 16.

  Interesting!  But I suspect this is still very dependent on what
software you actually use as well; just because something is placed
there in your "standard" install doesn't mean it's useful.  It might
be more interesting to check after you've used that installation for a
year!  Lots of programs add dotfiles on an as-needed basis, and others
never create them, but require the user to create them using a text
editor (though the latter seems to be falling out of favor in these
days of GUI applications!).


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com




From mal at lemburg.com  Wed May 31 18:18:49 2000
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 31 May 2000 18:18:49 +0200
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
Message-ID: <39353B69.D6E74E2C@lemburg.com>

Would there be interest in adding the python-ldap module
(http://sourceforge.net/project/?group_id=2072) to the
core distribution ?

If yes, I think we should approach David Leonard and
ask him if he is willing to donate the lib (which is
in the public domain) to the core.

FYI, LDAP is a well accepted standard network protocol for
querying address and user information.
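
Typical usage looks something like this (a rough sketch from memory;
the module's exact API may well differ, and the server and DN names
here are made up):

    import ldap

    # connect, bind anonymously, then search a subtree for a person
    l = ldap.initialize("ldap://ldap.example.com")
    l.simple_bind_s()
    results = l.search_s("ou=people,dc=example,dc=com",
                         ldap.SCOPE_SUBTREE,
                         "(cn=Jane Doe)")
    for dn, attrs in results:
        print dn, attrs.get("mail")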

An older web page with more background is available at: 

   http://www.it.uq.edu.au/~leonard/dc-prj/ldapmodule/

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From paul at prescod.net  Wed May 31 18:24:45 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 11:24:45 -0500
Subject: [Python-Dev] What's that sound?
Message-ID: <39353CCD.1F3E9A0B@prescod.net>

ActiveState announces four new Python-related projects (PythonDirect,
Komodo, Visual Python, ActivePython).

PythonLabs announces four planet-sized-brains are going to be working on
the Python implementation full time.

PythonWare announces PythonWorks.

Is that the sound of pieces falling into place or of a rumbling
avalanche "warming up" before obliterating everything in its path?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From gstein at lyra.org  Wed May 31 18:30:57 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 09:30:57 -0700 (PDT)
Subject: [Python-Dev] What's that sound?
In-Reply-To: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310928270.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> ActiveState announces four new Python-related projects (PythonDirect,
> Komodo, Visual Python, ActivePython).
> 
> PythonLabs announces four planet-sized-brains are going to be working on
> the Python implementation full time.

Five.

> PythonWare announces PythonWorks.
> 
> Is that the sound of pieces falling into place or of a rumbling
> avalanche "warming up" before obliterating everything in its path?

Full-on, robot chubby earthquake.

:-)

I agree with the basic premise: Python *is* going to get a lot more
visibility than it has enjoyed in the past. You might even add that the
latest GNOME release (1.2) has excellent Python support.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From paul at prescod.net  Wed May 31 18:35:23 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 11:35:23 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
Message-ID: <39353F4B.3E78E22E@prescod.net>

Greg Stein wrote:
> 
> [ The correct forum is probably meta-sig. ]
> 
> IMO, I don't see a need for yet another forum. The dividing lines become a
> bit too blurry, and it will result in questions like "where do I post
> this?" Or "what is the difference between python-lang at python.org and
> python-list at python.org?"

Well, you admit that you don't read python-list, right? Most of us
don't, most of the time. Instead we have important discussions about the
language's future on python-dev, where most of the Python community
cannot participate. I'll say it flat out: I'm uncomfortable with that. I
did not include meta-sig (or python-list) because my issue is
really with the accidental elitism of the python-dev setup. If
python-dev participants do not agree to have important linguistic
discussions in an open forum then setting up the forum is a waste of
time. That's why I'm feeling people here out first.
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From gmcm at hypernet.com  Wed May 31 18:54:22 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Wed, 31 May 2000 12:54:22 -0400
Subject: [Python-Dev] What's that sound?
In-Reply-To: <Pine.LNX.4.10.10005310928270.30220-100000@nebula.lyra.org>
References: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <1252330400-4656058@hypernet.com>

[Paul Prescod]
> > PythonLabs announces four planet-sized-brains are going to be
> > working on the Python implementation full time.
[Greg] 
> Five.

No, he said "planet-sized-brains", not "planet-sized-egos".

Just notice how long it takes Barry to figure out who I meant....

- Gordon



From bwarsaw at python.org  Wed May 31 18:56:06 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 12:56:06 -0400 (EDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
References: <39353B69.D6E74E2C@lemburg.com>
Message-ID: <14645.17446.749848.895965@anthem.python.org>

>>>>> "M" == M  <mal at lemburg.com> writes:

    M> Would there be interest in adding the python-ldap module
    M> (http://sourceforge.net/project/?group_id=2072) to the
    M> core distribution ?

I haven't looked at this stuff, but yes, I think a standard LDAP
module would be quite useful.  It's a well enough established
protocol, and it would be good to be able to count on it "being
there".

-Barry



From bwarsaw at python.org  Wed May 31 18:58:51 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 12:58:51 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
Message-ID: <14645.17611.318538.986772@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul at prescod.net> writes:

    PP> Is that the sound of pieces falling into place or of a
    PP> rumbling avalanche "warming up" before obliterating everything
    PP> in its path?

Or a big foot hurtling its way earthward?  The question is, what's
that thing under the shadow of the big toe?  I can only vaguely make
out the first of four letters, and I think it's a `P'.

:)

-Barry



From gstein at lyra.org  Wed May 31 18:59:10 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 09:59:10 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39353F4B.3E78E22E@prescod.net>
Message-ID: <Pine.LNX.4.10.10005310945570.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> Greg Stein wrote:
> > [ The correct forum is probably meta-sig. ]
> > 
> > IMO, I don't see a need for yet another forum. The dividing lines become a
> > bit too blurry, and it will result in questions like "where do I post
> > this?" Or "what is the difference between python-lang at python.org and
> > python-list at python.org?"
> 
> Well, you admit that yhou don't read python-list, right?

Hehe... you make it sound like I'm a criminal on trial :-)

"And do you admit that you don't read that newsgroup? And do you admit
that you harbor irregular thoughts towards c.l.py posters? And do you
admit to obscene thoughts about Salma Hayek?"

Well, yes, no, and damn straight. :-)

> Most of us
> don't, most of the time. Instead we have important discussions about the
> language's future on python-dev, where most of the Python community
> cannot participate. I'll say it flat out: I'm uncomfortable with that. I

I share that concern, and raised it during the formation of python-dev. It
appears that the pipermail archive is truncated (nothing before April last
year). Honestly, though, I would have to say that I am/was more concerned
with the *perception* rather than actual result.

> did not include meta-sig because (or python-list) because my issue is
> really with the accidental elitism of the python-dev setup. If

I disagree with the term "accidental elitism." I would call it "purposeful
meritocracy." The people on python-dev have shown over the span of *years*
that they are capable developers, designers, and have a genuine interest
and care about Python's development. Based on each person's merits, Guido
invited them to participate in this forum.

Perhaps "guido-advisors" would be more appropriately named, but I don't
think Guido likes to display his BDFL status more than necessary :-)

> python-dev participants do not agree to have important linguistic
> discussions in an open forum then setting up the forum is a waste of
> time. That's why I'm feeling people here out first.

Personally, I like the python-dev setting. The noise here is zero. There
are some things that I'm not particularly interested in, thus I pay much
less attention to them, but those items are never noise. I *really* like
that aspect, and would not care to start arguing about language
development in a larger forum where noise, spam, uninformed opinions, and
subjective discussions take place.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From fdrake at acm.org  Wed May 31 19:04:13 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 13:04:13 -0400 (EDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
In-Reply-To: <39353B69.D6E74E2C@lemburg.com>
References: <39353B69.D6E74E2C@lemburg.com>
Message-ID: <14645.17933.810181.300650@cj42289-a.reston1.va.home.com>

M.-A. Lemburg writes:
 > Would there be interest in adding the python-ldap module
 > (http://sourceforge.net/project/?group_id=2072) to the
 > core distribution ?

  Probably!  ACAP (Application Configuration Access Protocol) would be
nice as well -- anybody working on that?

 > FYI, LDAP is a well accepted standard network protocol for
 > querying address and user information.

  And lots of other stuff as well.  Jeremy and I contributed to a
project where it was used to store network latency information.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com




From paul at prescod.net  Wed May 31 19:10:58 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:10:58 -0500
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net> <14645.17611.318538.986772@anthem.python.org>
Message-ID: <393547A2.30CB7113@prescod.net>

"Barry A. Warsaw" wrote:
> 
> Or a big foot hurtling its way earthward?  The question is, what's
> that thing under the shadow of the big toe?  I can only vaguely make
> out the first of four letters, and I think it's a `P'.

Look closer, big-egoed-four-stringed-guitar-playing-one. It could just
as easily be a J.

And you know what you get when you put P and J together?

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From paul at prescod.net  Wed May 31 19:21:56 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:21:56 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310945570.30220-100000@nebula.lyra.org>
Message-ID: <39354A34.88B8B6ED@prescod.net>

Greg Stein wrote:
> 
> Hehe... you make it sound like I'm a criminal on trial :-)

Sorry about that. But I'll bet you didn't expect this inquisition did
you?

> I share that concern, and raised it during the formation of python-dev. It
> appears that the pipermail archive is truncated (nothing before April last
> year). Honestly, though, I would have to say that I am/was more concerned
> with the *perception* rather than actual result.

Right, that perception is making people in comp.lang.python get a little
frustrated, paranoid, alienated and nasty. And relaying conversations
from here to there and back puts Fredrik in a bad mood which isn't good
for anyone.

> > did not include meta-sig because (or python-list) because my issue is
> > really with the accidental elitism of the python-dev setup. If
> 
> I disagree with the term "accidental elitism." I would call it "purposeful
> meritocracy." 

The reason I think that it is accidental is because I don't think that
anyone expected so many of us to abandon comp.lang.python and thus our
direct connection to Python's user base. It just happened that way due
to human nature. That forum is full of stuff that you or I don't care
about -- compiling on AIX, ADO programming on Windows, Perl idioms, LDAP
(oops, that's here!) etc, and this one is noise-free. I'm saying that we
could have a middle ground where we trade a little noise for a little
democracy -- if only in perception.

I think that perl-porters and linux-kernel are open lists? The dictators
and demigods just had to learn to filter a little. By keeping
"python-dev" for immediately important things and implementation
details, we will actually make it easier to get the day to day pumpkin
passing done.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From bwarsaw at python.org  Wed May 31 19:28:04 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 13:28:04 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
	<1252330400-4656058@hypernet.com>
Message-ID: <14645.19364.259837.684595@anthem.python.org>

>>>>> "Gordo" == Gordon McMillan <gmcm at hypernet.com> writes:

    Gordo> No, he said "planet-sized-brains", not "planet-sized-egos".

    Gordo> Just notice how long it takes Barry to figure out who I
    Gordo> meant....

Waaaaiitt a second....

I /do/ have a very large brain.  I keep it in a jar on the headboard
of my bed, surrounded by a candlelit homage to Geddy Lee.  How else do
you think I got so studly playing bass?



From bwarsaw at python.org  Wed May 31 19:35:36 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 13:35:36 -0400 (EDT)
Subject: [Python-Dev] What's that sound?
References: <39353CCD.1F3E9A0B@prescod.net>
	<14645.17611.318538.986772@anthem.python.org>
	<393547A2.30CB7113@prescod.net>
Message-ID: <14645.19816.256896.367440@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul at prescod.net> writes:

    PP> Look closer, big-egoed-four-stringed-guitar-playing-one. It
    PP> could just as easily be a J.

<squint> Could be!  The absolute value of my diopter is about as big
as my ego.

    PP> And you know what you get when you put P and J together?

A very tasty sammich!

-Barry



From paul at prescod.net  Wed May 31 19:45:30 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:45:30 -0500
Subject: [Python-Dev] SIG: python-lang
References: <3934A8EB.6608B0E1@prescod.net>
		<Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org> <14645.16679.139843.148933@anthem.python.org>
Message-ID: <39354FBA.E1DEFEFA@prescod.net>

"Barry A. Warsaw" wrote:
> 
> ...
>
> I agree.  I think anybody who'd be interested in python-lang is
> already going to be a member of python-dev 

Huh? What about Greg Ewing, Amit Patel, Martijn Faassen, William
Tanksley, Mike Fletcher, Neel Krishnaswami, the various stackless
groupies and a million others. This is just a short list of people who
have made reasonable language suggestions recently. Those suggestions
are going into the bit-bucket unless one of us happens to notice and
champion them here. But we're too busy thinking about 1.6 to think about
long-term ideas anyhow.

Plus, we hand down decisions (e.g. about string.join) and they have the
exact, parallel discussion over there. All the while, anyone from
PythonDev is telling them: "We've already been through this stuff. We've
already discussed this." which only (understandably) annoys them more.

> and any discussion will
> probably be crossposted to the point where it makes no difference.

I think that python-dev's role should change. I think that it would
handle day to day implementation stuff -- nothing long term. I mean if
the noise level on python-lang was too high then we could retreat to
python-dev again but I'd like to think we wouldn't have to. A couple of
sharp words from Guido or Tim could end a flamewar pretty quickly.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
            - Brooke Ross, Miss Canada International



From esr at thyrsus.com  Wed May 31 19:53:10 2000
From: esr at thyrsus.com (Eric S. Raymond)
Date: Wed, 31 May 2000 13:53:10 -0400
Subject: [Python-Dev] What's that sound?
In-Reply-To: <14645.19364.259837.684595@anthem.python.org>; from bwarsaw@python.org on Wed, May 31, 2000 at 01:28:04PM -0400
References: <39353CCD.1F3E9A0B@prescod.net> <1252330400-4656058@hypernet.com> <14645.19364.259837.684595@anthem.python.org>
Message-ID: <20000531135310.B29319@thyrsus.com>

Barry A. Warsaw <bwarsaw at python.org>:
> Waaaaiitt a second....
> 
> I /do/ have a very large brain.  I keep it in a jar on the headboard
> of my bed, surrounded by a candlelit homage to Geddy Lee.  How else do
> you think I got so studly playing bass?

Ah, yes.  We take you back now to that splendid year of 1978.  Cue a
certain high-voiced Canadian singing

	The trouble with the Perl guys
	is they're quite convinced they're right...

Duuude....
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

The world is filled with violence. Because criminals carry guns, we
decent law-abiding citizens should also have guns. Otherwise they will
win and the decent people will lose.
        -- James Earl Jones



From esr at snark.thyrsus.com  Wed May 31 20:05:33 2000
From: esr at snark.thyrsus.com (Eric S. Raymond)
Date: Wed, 31 May 2000 14:05:33 -0400
Subject: [Python-Dev] Constants
Message-ID: <200005311805.OAA29447@snark.thyrsus.com>

I just looked at Jeremy Hylton's warts posting
at <http://starship.python.net/crew/amk/python/writing/warts.html>

It reminded me that one feature I really, really want in Python 3000
is the ability to declare constants.  Assigning to a constant should 
raise an error.

Is this on the to-do list?
-- 
		<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

What, then is law [government]? It is the collective organization of
the individual right to lawful defense."
	-- Frederic Bastiat, "The Law"



From petrilli at amber.org  Wed May 31 20:17:57 2000
From: petrilli at amber.org (Christopher Petrilli)
Date: Wed, 31 May 2000 14:17:57 -0400
Subject: [Python-Dev] Constants
In-Reply-To: <200005311805.OAA29447@snark.thyrsus.com>; from esr@snark.thyrsus.com on Wed, May 31, 2000 at 02:05:33PM -0400
References: <200005311805.OAA29447@snark.thyrsus.com>
Message-ID: <20000531141757.E5766@trump.amber.org>

Eric S. Raymond [esr at snark.thyrsus.com] wrote:
> I just looked at Jeremy Hylton's warts posting
> at <http://starship.python.net/crew/amk/python/writing/warts.html>
> 
> It reminded me that one feature I really, really want in Python 3000
> is the ability to declare constants.  Assigning to a constant should 
> raise an error.
> 
> Is this on the to-do list?

I know this isn't "perfect", but what I do often is have a
Constants.py file that has all my constants in a class which has
__setattr__ overridden to raise an exception.  This has two benefits:

    1. Difficult to modify the attributes, at least accidentally
    2. Keeps the namespace less polluted by thousands of constants.

Just an idea; I then do this:

     constants = Constants()
     x = constants.foo

Seems clean (reasonably) to me.

I think I stole this from the timbot.
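
Something like this, roughly (my reconstruction, not the original):

     class Constants:
         # All constants live in the class body; instance attribute
         # assignment is blocked outright, so accidental rebinding
         # raises instead of silently succeeding.
         foo = 1
         bar = "spam"

         def __setattr__(self, name, value):
             raise AttributeError("constants are read-only: " + name)

     constants = Constants()
     x = constants.foo            # fine
     # constants.foo = 2          # would raise AttributeError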

Chris
-- 
| Christopher Petrilli
| petrilli at amber.org



From jeremy at beopen.com  Wed May 31 20:07:18 2000
From: jeremy at beopen.com (Jeremy Hylton)
Date: Wed, 31 May 2000 14:07:18 -0400 (EDT)
Subject: [Python-Dev] Constants
In-Reply-To: <200005311805.OAA29447@snark.thyrsus.com>
References: <200005311805.OAA29447@snark.thyrsus.com>
Message-ID: <14645.21718.365823.507322@localhost.localdomain>

Correction: It's Andrew Kuchling's list of language warts.  I
mentioned it in a post on slashdot, where I ventured a guess that the
most substantial changes most new users will see with Python 3000 are
the removal of these warts.

Jeremy



From akuchlin at cnri.reston.va.us  Wed May 31 20:21:04 2000
From: akuchlin at cnri.reston.va.us (Andrew M. Kuchling)
Date: Wed, 31 May 2000 14:21:04 -0400
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <14645.20637.864287.86178@localhost.localdomain>; from jeremy@beopen.com on Wed, May 31, 2000 at 01:49:17PM -0400
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org> <39353F4B.3E78E22E@prescod.net> <14645.20637.864287.86178@localhost.localdomain>
Message-ID: <20000531142104.A8989@amarok.cnri.reston.va.us>

On Wed, May 31, 2000 at 01:49:17PM -0400, Jeremy Hylton wrote:
>I'm actually more worried about the second.  It's been a while since I
>read c.l.py and I'm occasionally disappointed to miss out on
>seemingly interesting threads.  On the other hand, there is no way I
>could manage to read or even filter the volume on that list.

Really?  I read it through Usenet with GNUS, and it takes about a half
hour to go through everything. Skipping threads by subject usually
makes it easy to avoid uninteresting stuff.  

I'd rather see python-dev limited to very narrow, CVS-tree-related
material, such as: should we add this module?  is this change OK?
&c...  The long-winded language speculation threads are better left to
c.l.python, where more people offer opinions, it's more public, and
newsreaders are more suited to coping with the volume.  (Incidentally,
has any progress been made on reviving c.l.py.announce?)

OTOH, newbies have reported fear of posting in c.l.py, because they
feel the group is too advanced, what with everyone sitting around
talking about coroutines and SNOBOL string parsing.  But I think it's
a good thing if newbies see the high-flown chatter and get their minds
stretched. :)

--amk



From gstein at lyra.org  Wed May 31 20:37:32 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 11:37:32 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39354A34.88B8B6ED@prescod.net>
Message-ID: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Paul Prescod wrote:
> Greg Stein wrote:
> > 
> > Hehe... you make it sound like I'm a criminal on trial :-)
> 
> Sorry about that. But I'll bet you didn't expect this inquisition did
> you?

Well, of course not. Nobody expects the Spanish Inquisition!

Hmm. But you're not Spanish. Dang...

> > I share that concern, and raised it during the formation of python-dev. It
> > appears that the pipermail archive is truncated (nothing before April last
> > year). Honestly, though, I would have to say that I am/was more concerned
> > with the *perception* rather than actual result.
> 
> Right, that perception is making people in comp-lang-python get a little
> frustrated, paranoid, alienated and nasty. And relaying conversations
> from here to there and back puts Fredrik in a bad mood which isn't good
> for anyone.

Understood. I don't have a particular solution to the problem, but I also
believe that python-lang is not going to be a benefit/solution.

Hmm. How about this: you stated the premise is to generate proposals for
language features, extensions, additions, whatever. If that is the only
goal, then consider a web-based system: anybody can post a "feature" with
a description/spec/code/whatever; each feature has threaded comments
attached to it; the kicker: each feature has votes (+1/+0/-0/-1).

When you have a feature with a total vote of +73, then you know that it
needs to be looked at in more detail. All votes are open (not anonymous).
Features can be revised, in an effort to remedy issues raised by -1
voters (and thus turn them into +1 votes).
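
To make the vote-total idea concrete, here is a throwaway sketch of the
data model (nothing like this exists; all the names are invented):

    VOTE_VALUES = {"+1": 1, "+0": 0, "-0": 0, "-1": -1}

    class Feature:
        def __init__(self, title, spec, author):
            self.title = title
            self.spec = spec        # the author can revise this in place
            self.author = author
            self.comments = []      # threaded discussion hangs off here
            self.votes = {}         # voter name -> "+1" / "+0" / "-0" / "-1"

        def tally(self):
            # all votes are open (non-anonymous) and sum to one number
            total = 0
            for v in self.votes.values():
                total = total + VOTE_VALUES[v]
            return total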

People can review features and votes in a quick pass. If they prefer to
take more time, then they can also review comments.

Of course, this is only a suggestion. I've got so many other projects that
I'd like to code up right now, then I would not want to sign up for
something like this :-)

> > > did not include meta-sig because (or python-list) because my issue is
> > > really with the accidental elitism of the python-dev setup. If
> > 
> > I disagree with the term "accidental elitism." I would call it "purposeful
> > meritocracy." 
> 
> The reason I think that it is accidental is because I don't think that
> anyone expected so many of us to abandon comp.lang.python and thus our
> direct connection to Python's user base.

Good point.

I would still disagree with your "elitism" term, but the side-effect is
definitely accidental and unfortunate. It may even be arguable whether
python-dev *is* responsible for that. The SIGs had much more traffic
before python-dev, too. I might suggest that the SIGs were the previous
"low-noise" forum (in favor of c.l.py). python-dev yanked focus from the
SIGs, and only a little from c.l.py (I think c.l.py's burgeoning traffic
reduced readership on its own).

> It just happened that way due
> to human nature. That forum is full of stuff that you or I don't care
> about -- compiling on AIX, ADO programming on Windows, Perl idioms, LDAP
> (oops, that's here!) etc, and this one is noise-free. I'm saying that we
> could have a middle ground where we trade a little noise for a little
> democracy -- if only in perception.

Admirable, but I think it would be ineffectual. People would be confused
about where to post. Too many forums, with arbitrary/unclear lines about
which to use.

How do you like your new job at DataChannel? Rate it on 1-100. "83" you
say? Well, why not 82? What is the difference between 82 and 83?

"Why does this post belong on c.l.py, and not on python-lang?"

The result will be cross-posting because people will want to ensure they
reach the right people/forum.

Of course, people will also post to the "wrong" forum. Confusion, lack of
care, whatever.

> I think that perl-porters and linux-kernel are open lists? The dictators
> and demigods just had to learn to filter a little. By keeping
> "python-dev" for immediately important things and implementation
> details, we will actually make it easier to get the day to day pumpkin
> passing done.

Yes, they are. And Dick Hardt has expressed the opinion that perl-porters
is practically useless. He was literally dumbfounded when I told him that
python-dev is (near) zero-noise.

The Linux guys filter very well. I don't know enough of, say, Alan's or
Linus' other mailing subscriptions to know whether that is the only thing
they subscribe to, or just one of many. I could easily see keeping up with
linux-kernel if that was your only mailing list. I also suspect there is
plenty of out-of-band mail going on between Linus and his "lieutenants"
when they forward patches to him (and his inevitable replies, rejections,
etc).

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bwarsaw at python.org  Wed May 31 20:39:46 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 14:39:46 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <3934A8EB.6608B0E1@prescod.net>
	<Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
	<14645.16679.139843.148933@anthem.python.org>
	<39354FBA.E1DEFEFA@prescod.net>
Message-ID: <14645.23666.161619.557413@anthem.python.org>

>>>>> "PP" == Paul Prescod <paul at prescod.net> writes:

    PP> Plus, we hand down decisions about (e.g. string.join) and they
    PP> have the exact, parallel discussion over there. All the while,
    PP> anyone from PythonDev is telling them: "We've already been
    PP> through this stuff. We've already discussed this." which only
    PP> (understandably) annoys them more.

Good point.

    >> and any discussion will probably be crossposted to the point
    >> where it makes no difference.

    PP> I think that python-dev's role should change. I think that it
    PP> would handle day to day implementation stuff -- nothing long
    PP> term. I mean if the noise level on python-lang was too high
    PP> then we could retreat to python-dev again but I'd like to
    PP> think we wouldn't have to. A couple of sharp words from Guido
    PP> or Tim could end a flamewar pretty quickly.

Then I suggest to moderate python-lang.  Would you (and/or others) be
willing to serve as moderators?  I'd support an open subscription
policy in that case.

-Barry



From pingster at ilm.com  Wed May 31 20:41:13 2000
From: pingster at ilm.com (Ka-Ping Yee)
Date: Wed, 31 May 2000 11:41:13 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
Message-ID: <Pine.SGI.3.96.1000531113831.1049307r-100000@happy>

On Wed, 31 May 2000, Greg Stein wrote:
> Hmm. How about this: you stated the premise is to generate proposals for
> language features, extensions, additions, whatever. If that is the only
> goal, then consider a web-based system: anybody can post a "feature" with
> a description/spec/code/whatever; each feature has threaded comments
> attached to it; the kicker: each feature has votes (+1/+0/-0/-1).

Gee, this sounds familiar.  (Hint: starts with an R and has seven
letters.)  Why are we using Jitterbug again?  Does anybody even submit
things there, and still check the Jitterbug indexes regularly?

Okay, Roundup doesn't have voting, but it does already have priorities
and colour-coded statuses, and voting would be trivial to add.


-- ?!ng




From gstein at lyra.org  Wed May 31 21:04:34 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 12:04:34 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.SGI.3.96.1000531113831.1049307r-100000@happy>
Message-ID: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Ka-Ping Yee wrote:
> On Wed, 31 May 2000, Greg Stein wrote:
> > Hmm. How about this: you stated the premise is to generate proposals for
> > language features, extensions, additions, whatever. If that is the only
> > goal, then consider a web-based system: anybody can post a "feature" with
> > a description/spec/code/whatever; each feature has threaded comments
> > attached to it; the kicker: each feature has votes (+1/+0/-0/-1).
> 
> Gee, this sounds familiar.  (Hint: starts with an R and has seven
> letters.)  Why are we using Jitterbug again?  Does anybody even submit
> things there, and still check the Jitterbug indexes regularly?
> 
> Okay, Roundup doesn't have voting, but it does already have priorities
> and colour-coded statuses, and voting would be trivial to add.

Does Roundup have a web-based interface, where I can see all of the
features, their comments, and their votes? Can the person who posted the
original feature/spec update it? (or must they followup with a
modified proposal instead)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bwarsaw at python.org  Wed May 31 21:12:23 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 15:12:23 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005310802370.30220-100000@nebula.lyra.org>
	<39353F4B.3E78E22E@prescod.net>
	<14645.20637.864287.86178@localhost.localdomain>
	<20000531142104.A8989@amarok.cnri.reston.va.us>
Message-ID: <14645.25623.615735.836896@anthem.python.org>

>>>>> "AMK" == Andrew M Kuchling <akuchlin at cnri.reston.va.us> writes:

    AMK> more suited to coping with the volume.  (Incidentally, has
    AMK> any progress been made on reviving c.l.py.announce?)

Not that I'm aware of, sadly.

-Barry



From bwarsaw at python.org  Wed May 31 21:18:09 2000
From: bwarsaw at python.org (Barry A. Warsaw)
Date: Wed, 31 May 2000 15:18:09 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
	<Pine.SGI.3.96.1000531113831.1049307r-100000@happy>
Message-ID: <14645.25969.657083.55499@anthem.python.org>

>>>>> "KY" == Ka-Ping Yee <pingster at ilm.com> writes:

    KY> Gee, this sounds familiar.  (Hint: starts with an R and has
    KY> seven letters.)  Why are we using Jitterbug again?  Does
    KY> anybody even submit things there, and still check the
    KY> Jitterbug indexes regularly?

Jitterbug blows.

    KY> Okay, Roundup doesn't have voting, but it does already have
    KY> priorities and colour-coded statuses, and voting would be
    KY> trivial to add.

Roundup sounded just so cool when ?!ng described it at the
conference.  I gotta find some time to look at it! :)

-Barry



From pingster at ilm.com  Wed May 31 21:24:07 2000
From: pingster at ilm.com (Ka-Ping Yee)
Date: Wed, 31 May 2000 12:24:07 -0700 (PDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>
Message-ID: <Pine.SGI.3.96.1000531121936.1049307v-100000@happy>

On Wed, 31 May 2000, Greg Stein wrote:
> 
> Does Roundup have a web-based interface,

Yes.

> where I can see all of the
> features, their comments, and their votes?

At the moment, you see date of last activity, description,
priority, status, and fixer (i.e. person who has taken
responsibility for the item).  No votes, but as i said,
that would be really easy.

> Can the person who posted the original feature/spec update it?

Each item has a bunch of mail messages attached to it.
Anyone can edit the description, but that's a short one-line
summary; the only way to propose another design right now
is to send in another message.

Hey, i admit it's a bit primitive, but it seems significantly
better than nothing.  The software people at ILM have coped
with it fairly well for a year, and for the most part we like it.

Go play:  http://www.lfw.org/ping/roundup/roundup.cgi

Username: test  Password: test
Username: spam  Password: spam
Username: eggs  Password: eggs


-- ?!ng




From fdrake at acm.org  Wed May 31 21:58:13 2000
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 31 May 2000 15:58:13 -0400 (EDT)
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <Pine.SGI.3.96.1000531121936.1049307v-100000@happy>
References: <Pine.LNX.4.10.10005311203370.30220-100000@nebula.lyra.org>
	<Pine.SGI.3.96.1000531121936.1049307v-100000@happy>
Message-ID: <14645.28373.733094.942361@cj42289-a.reston1.va.home.com>

Ka-Ping Yee writes:
 > On Wed, 31 May 2000, Greg Stein wrote:
 > > Can the person who posted the original feature/spec update it?
 > 
 > Each item has a bunch of mail messages attached to it.
 > Anyone can edit the description, but that's a short one-line
 > summary; the only way to propose another design right now
 > is to send in another message.

  I thought the roundup interface was quite nice, esp. with the nosy
lists and such.  I'm sure there are a number of small issues, but
nothing Ping can't deal with in a matter of minutes.  ;)
  One thing that might need further consideration is that a feature
proposal may need a slightly different sort of support; it makes more
sense to include more than the one-liner summary, and that should be
modifiable as discussions show adjustments may be needed.  That might
be doable by adding a URL to an external document rather than
including the summary in the issues database.
  I'd love to get rid of the Jitterbug thing!


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at BeOpen.com




From paul at prescod.net  Wed May 31 22:52:38 2000
From: paul at prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 15:52:38 -0500
Subject: [Python-Dev] SIG: python-lang
References: <Pine.LNX.4.10.10005311119130.30220-100000@nebula.lyra.org>
Message-ID: <39357B96.E819537F@prescod.net>

Greg Stein wrote:
> 
> ...
>
> People can review features and votes in a quick pass. If they prefer to
> take more time, then they can also review comments.

I like this idea for its persistence but I'm not convinced that it
serves the same purpose as the give and take of a mailing list with many
subscribers.
 
> Admirable, but I think it would be ineffectual. People would be confused
> about where to post. Too many forums, with arbitrary/unclear lines about
> which to use.

To me, they are clear:

 * anything Python related can go to comp.lang.python, but many people
will not read it.

 * anything that belongs to a particular SIG goes to that sig.

 * any feature suggestions/debates that do not go in a particular SIG
(especially things related to the core language) go to python-lang

 * python-dev is for any message that has the words "CVS", "patch",
"memory leak", "reference count" etc. in it. It is for implementing the
design that Guido refines out of the rough and tumble of python-lang.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
At the same moment that the Justice Department and the Federal Trade 
Commission are trying to restrict the negative consequences of 
monopoly, the Commerce Department and the Congress are helping to 
define new intellectual property rights, rights that have a 
significant potential to create new monopolies. This is the policy 
equivalent of arm-wrestling with yourself.
	- http://www.salon.com/tech/feature/2000/04/07/greenspan/index.html



From gstein at lyra.org  Wed May 31 23:53:13 2000
From: gstein at lyra.org (Greg Stein)
Date: Wed, 31 May 2000 14:53:13 -0700 (PDT)
Subject: [Python-Dev] Adding LDAP to the Python core... ?!
In-Reply-To: <14645.17446.749848.895965@anthem.python.org>
Message-ID: <Pine.LNX.4.10.10005311452150.30220-100000@nebula.lyra.org>

On Wed, 31 May 2000, Barry A. Warsaw wrote:
> >>>>> "M" == M  <mal at lemburg.com> writes:
> 
>     M> Would there be interest in adding the python-ldap module
>     M> (http://sourceforge.net/project/?group_id=2072) to the
>     M> core distribution ?
> 
> I haven't looked at this stuff, but yes, I think a standard LDAP
> module would be quite useful.  It's a well enough established
> protocol, and it would be good to be able to count on it "being
> there".

My WebDAV module implements an established protocol (an RFC tends to do
that :-), but the API within the module is still in flux (IMO).

Is the LDAP module's API pretty solid? Is it changing?

And is this module a C extension, or a pure Python implementation?

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gmcm at hypernet.com  Wed May 31 23:58:04 2000
From: gmcm at hypernet.com (Gordon McMillan)
Date: Wed, 31 May 2000 17:58:04 -0400
Subject: [Python-Dev] SIG: python-lang
In-Reply-To: <39357B96.E819537F@prescod.net>
Message-ID: <1252312176-5752122@hypernet.com>

Paul Prescod  wrote:

> Greg Stein wrote:
> > Admirable, but I think it would be ineffectual. People would be
> > confused about where to post. Too many forums, with
> > arbitrary/unclear lines about which to use.
> 
> To me, they are clear:

Of course they are ;-). While something doesn't seem right 
about the current set up, and c.l.py is still remarkably 
civilized, the fact is that the hotheads who say "I'll never use 
Python again if you do something as brain-dead as [ case-
insensitivity | require (host, addr) tuples | ''.join(list) | ... ]" will 
post their sentiments to every available outlet.
 
I agree the shift of some of these syntax issues from python-
dev to c.l.py was ugly, but the truth is that:
 - no new arguments came from c.l.py
 - the c.l.py discussion was much more emotional
 - you can't keep out the riff-raff without inviting reasonable 
accusations of elitism
 - the vast majority of, erm, "grass-roots" syntax proposals are 
absolutely horrid.

(As you surely know, Paul, from your types-SIG tenure; 
proposing syntax changes without the slightest intention of 
putting any effort into them is a favorite activity of posters.)



- Gordon