From martin at v.loewis.de Sat Sep 1 00:11:36 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Sat, 01 Sep 2007 00:11:36 +0200
Subject: [Python-3000] Release Countdown
In-Reply-To:
References: <46D7FE7D.5020909@trueblade.com>
    <46D803FF.9000909@v.loewis.de>
    <46D806D8.4070905@trueblade.com>
    <797C63A8-888F-45CC-A780-CE9AD859BC1B@python.org>
Message-ID: <46D89218.5090005@v.loewis.de>

> (1) Allow bytes methods to take a literal string (which will
> obviously be in the source file's encoding).

To rephrase Guido's comment: do you have the slightest idea on how to
specify and implement that?

Regards,
Martin

From jimjjewett at gmail.com Sat Sep 1 00:17:57 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 31 Aug 2007 18:17:57 -0400
Subject: [Python-3000] Release Countdown
In-Reply-To:
References: <46D7FE7D.5020909@trueblade.com>
    <46D803FF.9000909@v.loewis.de>
    <46D806D8.4070905@trueblade.com>
    <797C63A8-888F-45CC-A780-CE9AD859BC1B@python.org>
Message-ID:

On 8/31/07, Guido van Rossum wrote:
> On 8/31/07, Jim Jewett wrote:

> > (1) Allow bytes methods to take a literal string (which will
> > obviously be in the source file's encoding).

> Yuck, yuck about the source file encoding part. Also, there is no way
> to tell that a particular argument was passed a literal.

There is when compiling to bytecode; it goes in co_consts.

> The very
> definition of "this was a literal" is iffy -- is x a literal when
> passed to f below?

> x = "abc"
> f(x)

No, it isn't. Though I suppose consistency with that sort of use
(particularly inside a function, where the compiler *could* know) is
the main argument against this.

> > (2) There really ought to be an immutable bytes type, and the literal
> > (or at least a literal, if capitalization matters) ought to be the
> > immutable.
> >     PLISTHEADER = b"""\
> > <?xml version="1.0" encoding="UTF-8"?>
> > <!DOCTYPE plist PUBLIC "-//Apple Computer//DTD
> > PLIST 1.0//EN" "http://www.apple.com/DTDs/
> > PropertyList-1.0.dtd">
> > """

> > If the value of PLISTHEADER does change during the run, it will almost
> > certainly be a bug. I could code defensively by only ever passing
> > copies, but that seems wasteful, and it could hide other bugs. If
> > something does try to modify (not replace, modify) it, then there was
> > probably a typo or API misunderstanding; I *want* an exception.

> Sounds like you're worrying too much. Do you have any indication that
> this is going to be a common problem?

> > http://svn.python.org/view/python/branches/py3k/Lib/plat-mac/plistlib.py?rev=57563&r1=57305&r2=57563

Let me reverse the question. In Py2, that variable holds a constant
string. What is the value in making that constant mutable?

-jJ

From martin at v.loewis.de Sat Sep 1 00:34:10 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Sat, 01 Sep 2007 00:34:10 +0200
Subject: [Python-3000] Release Countdown
In-Reply-To:
References: <46D7FE7D.5020909@trueblade.com>
    <46D803FF.9000909@v.loewis.de>
    <46D806D8.4070905@trueblade.com>
    <797C63A8-888F-45CC-A780-CE9AD859BC1B@python.org>
Message-ID: <46D89762.1000608@v.loewis.de>

>> Yuck, yuck about the source file encoding part. Also, there is no way
>> to tell that a particular argument was passed a literal.
>
> There is when compiling to bytecode; it goes in co_consts.
>
>> The very
>> definition of "this was a literal" is iffy -- is x a literal when
>> passed to f below?
>
>> x = "abc"
>> f(x)
>
> No, it isn't.

By that definition, bytes never receives a constant.

Regards,
Martin

From jimjjewett at gmail.com Sat Sep 1 01:18:12 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 31 Aug 2007 19:18:12 -0400
Subject: [Python-3000] Release Countdown
In-Reply-To: <46D89762.1000608@v.loewis.de>
References: <46D7FE7D.5020909@trueblade.com>
    <46D803FF.9000909@v.loewis.de>
    <46D806D8.4070905@trueblade.com>
    <797C63A8-888F-45CC-A780-CE9AD859BC1B@python.org>
    <46D89762.1000608@v.loewis.de>
Message-ID:

On 8/31/07, "Martin v. Löwis" wrote:
> >> Yuck, yuck about the source file encoding part. Also, there is no way
> >> to tell that a particular argument was passed a literal.

> > There is when compiling to bytecode; it goes in co_consts.

> >> The very
> >> definition of "this was a literal" is iffy -- is x a literal when
> >> passed to f below?

> >> x = "abc"
> >> f(x)

> > No, it isn't.

> By that definition, bytes never receives a constant.

To go back to the original motivation

    x.split(":")    # a constant, currently fails in Py3K
    x.split(b":")   # mechanical replacement for x.split(":")

    sep=":"
    x.split(sep)    # annoying but less important failure

I would prefer that x.split(":") work.

If that happens because bytes.split does the conversion for me (so
that x.split(sep) also works), then great. But I realize that would
require an assumption about the proper encoding.

If it works because the bytecode compiler changes x.split(":") into
the moral equivalent of

    try:
        x.split(":")
    except StrNotBytesError:
        x.split(b":")

that is good enough. And for constants which appear as string
literals in the code (token stringliteral), the proper encoding is
known.

-jJ

From lists at cheimes.de Sat Sep 1 01:28:00 2007
From: lists at cheimes.de (Christian Heimes)
Date: Sat, 01 Sep 2007 01:28:00 +0200
Subject: [Python-3000] Compiling Python 3.0 with MS Visual Studio 2005
In-Reply-To: <46D886DD.2070601@v.loewis.de>
References: <46D886DD.2070601@v.loewis.de>
Message-ID:

Martin v. Löwis wrote:
> Christian Heimes wrote:
>> I tried to compile Python 3.0 with MS Visual Studio 2005 on Windows XP
>> SP2 (German) and I ran into multiple problems with 3rd party modules.
>> The problem with time on German installations of Windows still exists.
>
> Not for me - it works fine here. Are you sure your source is up-to-date?

My sources were up to date but unfortunately the output wasn't. After I
did a cleanup and full recompile the error is gone.

Christian

From guido at python.org Sat Sep 1 01:32:20 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 31 Aug 2007 16:32:20 -0700
Subject: [Python-3000] Compiling Python 3.0 with MS Visual Studio 2005
In-Reply-To:
References: <46D886DD.2070601@v.loewis.de>
Message-ID:

On 8/31/07, Christian Heimes wrote:
> Martin v. Löwis wrote:
> > Christian Heimes wrote:
> >> I tried to compile Python 3.0 with MS Visual Studio 2005 on Windows XP
> >> SP2 (German) and I ran into multiple problems with 3rd party modules.
> >> The problem with time on German installations of Windows still exists.
> >
> > Not for me - it works fine here. Are you sure your source is up-to-date?
>
> My sources were up to date but unfortunately the output wasn't. After I
> did a cleanup and full recompile the error is gone.

Does this mean that all the problems you reported at the start of this
thread are gone? (If so, I need to remove the link to this thread from
the online release notes. :-)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From half.italian at gmail.com Sat Sep 1 02:14:09 2007
From: half.italian at gmail.com (Sean DiZazzo)
Date: Fri, 31 Aug 2007 17:14:09 -0700
Subject: [Python-3000] iterating over a dcitionary
Message-ID: <7baa94f60708311714l1423846eq38cd71e586ca87e7@mail.gmail.com>

How should we replace in our code:

for k,v in dict.iteritems():

with this ??

for k,v in zip(dict, dict.values()):

Sorry if this is the wrong forum for questions like this.

~Sean

From l.mastrodomenico at gmail.com Sat Sep 1 02:17:54 2007
From: l.mastrodomenico at gmail.com (Lino Mastrodomenico)
Date: Sat, 1 Sep 2007 02:17:54 +0200
Subject: [Python-3000] iterating over a dcitionary
In-Reply-To: <7baa94f60708311714l1423846eq38cd71e586ca87e7@mail.gmail.com>
References: <7baa94f60708311714l1423846eq38cd71e586ca87e7@mail.gmail.com>
Message-ID:

2007/9/1, Sean DiZazzo :
> How should we replace in our code:
>
> for k,v in dict.iteritems():

for k, v in dict.items():

--
Lino Mastrodomenico
E-mail: l.mastrodomenico at gmail.com

From dalcinl at gmail.com Sat Sep 1 02:32:49 2007
From: dalcinl at gmail.com (Lisandro Dalcin)
Date: Fri, 31 Aug 2007 21:32:49 -0300
Subject: [Python-3000] bug in py3k buffer object?
Message-ID:

Dear Travis, in my MPI wrappers, I use the MPI_Alloc_mem function to get
'special' MPI memory, and next I return it to Python using

    return PyBuffer_FromReadWriteMemory(ptr, len);

Well, getting back this rw-buffer in Python, I tried to do

    mem = MPI.Alloc_mem(10)
    mem[:] = str8('\0') * 8  # sort of memzero

but then I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: buffer is read-only

I noticed you use PyBUF_SIMPLE in
buffer_ass_item/buffer_ass_subscript... Is this OK? Perhaps
PyBUF_WRITEABLE is the right flag?

Not much more time to go deeper.

--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594

From lists at cheimes.de Sat Sep 1 02:47:00 2007
From: lists at cheimes.de (Christian Heimes)
Date: Sat, 01 Sep 2007 02:47:00 +0200
Subject: [Python-3000] Compiling Python 3.0 with MS Visual Studio 2005
In-Reply-To:
References: <46D886DD.2070601@v.loewis.de>
Message-ID: <46D8B684.1030904@cheimes.de>

Guido van Rossum wrote:
> Does this mean that all the problems you reported at the start of this
> thread are gone? (If so, I need to remove the link to this thread from
> the online release notes. :-)

Just the problem with the time module is gone. The problems with the 3rd
party modules still exist and so does the issue with os.stat on non
English Windows installations. I'm neither a Windows nor a MS VS 2005
expert - I'm mostly using Linux for development - but I could try to
tweak the project file if it is appreciated and wanted. Who is
responsible for PCbuild8?

Christian

From nnorwitz at gmail.com Sat Sep 1 02:58:28 2007
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Fri, 31 Aug 2007 17:58:28 -0700
Subject: [Python-3000] Compiling Python 3.0 with MS Visual Studio 2005
In-Reply-To: <46D8B684.1030904@cheimes.de>
References: <46D886DD.2070601@v.loewis.de>
    <46D8B684.1030904@cheimes.de>
Message-ID:

On 8/31/07, Christian Heimes wrote:
> Guido van Rossum wrote:
> > Does this mean that all the problems you reported at the start of this
> > thread are gone? (If so, I need to remove the link to this thread from
> > the online release notes. :-)
>
> Just the problem with the time module is gone. The problems with the 3rd
> party modules still exist and so does the issue with os.stat on non
> English Windows installations.
> I'm neither a Windows nor a MS VS 2005
> expert - I'm mostly using Linux for development - but I could try to
> tweak the project file if it is appreciated and wanted. Who is
> responsible for PCbuild8?

If you have to ask who's in control, that means you're it. :-)

There isn't really anyone. Kristján V. Jónsson has worked on it in
the trunk, but he hasn't been maintaining it in 3k IIRC. It would be
really great if you took on the responsibility, made sure things work,
and provided patches when they didn't. Bug reports are of course
helpful if you can't fix the problems.

Cheers,
n

From greg.ewing at canterbury.ac.nz Sat Sep 1 03:24:28 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 01 Sep 2007 13:24:28 +1200
Subject: [Python-3000] Release Countdown
In-Reply-To:
References: <46D7FE7D.5020909@trueblade.com>
    <46D803FF.9000909@v.loewis.de>
    <46D806D8.4070905@trueblade.com>
    <797C63A8-888F-45CC-A780-CE9AD859BC1B@python.org>
Message-ID: <46D8BF4C.7050508@canterbury.ac.nz>

Jim Jewett wrote:
> On 8/31/07, Guido van Rossum wrote:
>
> > x = "abc"
> > f(x)
>
> I suppose consistency with that sort of use ... is
> the main argument against this.

I'd be *very* upset if Python started behaving differently
depending on whether I wrote a literal directly inside a
function call or not.

--
Greg

From guido at python.org Sat Sep 1 04:18:24 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 31 Aug 2007 19:18:24 -0700
Subject: [Python-3000] Windows registry question from blog
Message-ID:

Someone added this comment to my blog
(http://www.artima.com/forums/flat.jsp?forum=106&thread=213583&start=0#278818):

"Only a question please, I have Python 2.5 installed in my windows XP
machine and I would like to install Python 3a1. I think I could have
troubles at the Windows Registry level. Did anybody tried to do so?"

Can someone help this person?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From unknown_kev_cat at hotmail.com Sat Sep 1 04:44:28 2007
From: unknown_kev_cat at hotmail.com (Joe Smith)
Date: Fri, 31 Aug 2007 22:44:28 -0400
Subject: [Python-3000] Windows registry question from blog
References:
Message-ID:

"Guido van Rossum" wrote in message
news:ca471dc20708311918k642b0d2elf67bd8bdba8830a1 at mail.gmail.com...
> Someone added this comment to my blog
> (http://www.artima.com/forums/flat.jsp?forum=106&thread=213583&start=0#278818):
>
> "Only a question please, I have Python 2.5 installed in my windows XP
> machine and I would like to install Python 3a1. I think I could have
> troubles at the Windows Registry level. Did anybody tried to do so?"
>
> Can someone help this person?
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)

A quick scan of the registry makes it look to me like the main issue is
that py3k would take over the .py, .pyo, .pyc extensions, which is not a
big deal. Reinstalling 2.5 (in place) would fix that.

The only other potential issue is the uninstall icon for 2.5
disappearing. However a quick test shows that the design of the
installer prevents this potential issue.

So everything looks fine to me. I have both installed at the moment, and
it looks fine to me.

From nick.bastin at gmail.com Sat Sep 1 06:48:58 2007
From: nick.bastin at gmail.com (Nicholas Bastin)
Date: Sat, 1 Sep 2007 00:48:58 -0400
Subject: [Python-3000] Windows registry question from blog
In-Reply-To:
References:
Message-ID: <66d0a6e10708312148w677ed8b5g223ebb4288c0c167@mail.gmail.com>

Is there no option in the installer to associate Python with .py, .pyc,
etc.? Obviously then the logical choice would be to unselect that (or
perhaps have it unselected by default for alpha installations).

--
Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070901/527bf49c/attachment.htm

From martin at v.loewis.de Sat Sep 1 07:31:04 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Sat, 01 Sep 2007 07:31:04 +0200
Subject: [Python-3000] Release Countdown
In-Reply-To:
References: <46D7FE7D.5020909@trueblade.com>
    <46D803FF.9000909@v.loewis.de>
    <46D806D8.4070905@trueblade.com>
    <797C63A8-888F-45CC-A780-CE9AD859BC1B@python.org>
    <46D89762.1000608@v.loewis.de>
Message-ID: <46D8F918.6090701@v.loewis.de>

> If it works because the bytecode compiler changes x.split(":") into
> the moral equivalent of
>
>     try:
>         x.split(":")
>     except StrNotBytesError:
>         x.split(b":")
>
> that is good enough.

And how do you propose to implement that?

Regards,
Martin

From greg.ewing at canterbury.ac.nz Sat Sep 1 03:35:10 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 01 Sep 2007 13:35:10 +1200
Subject: [Python-3000] Release Countdown
In-Reply-To:
References: <46D7FE7D.5020909@trueblade.com>
    <46D803FF.9000909@v.loewis.de>
    <46D806D8.4070905@trueblade.com>
    <797C63A8-888F-45CC-A780-CE9AD859BC1B@python.org>
    <46D89762.1000608@v.loewis.de>
Message-ID: <46D8C1CE.1020608@canterbury.ac.nz>

Jim Jewett wrote:
> I would prefer that x.split(":") work.
>
> If that happens because bytes.split does the conversion for me (so
> that x.split(sep) also works), then great. But I realize that would
> require an assumption about the proper encoding.

If you're going to do things like that, why stop at the
parameters to bytes methods? It's hard to argue that they
should be treated specially, rather than allowing strings
to be cast to bytes in any context that expects bytes.
And then the clear distinction between str and bytes that
we're trying to maintain breaks down.

You can't have it both ways. The type error you're
complaining about is just the sort of error that the
str/bytes distinction is meant to *catch*.
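
A minimal sketch of the type error under discussion, assuming a current
Python 3 interpreter rather than the 3.0a1 build used in this thread. The
names `data` and `sep` are illustrative only and do not come from the
messages above.

```python
# Assumption: run on a recent Python 3; `data` and `sep` are illustrative names.
data = b"key:value:extra"

# Passing a str separator to a bytes method is rejected with a TypeError
# instead of being silently encoded.
try:
    data.split(":")
    str_separator_accepted = True
except TypeError:
    str_separator_accepted = False

# The mechanical fix: use a bytes literal.
parts = data.split(b":")

# A str held in a variable has to be encoded explicitly, so the choice of
# encoding stays visible in the code.
sep = ":"
parts_from_variable = data.split(sep.encode("ascii"))
```

The explicit encode call is the point of the design: the caller, not the
bytes method, decides which encoding applies.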

--
Greg

From martin at v.loewis.de Sat Sep 1 08:03:48 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Sat, 01 Sep 2007 08:03:48 +0200
Subject: [Python-3000] Windows registry question from blog
In-Reply-To: <66d0a6e10708312148w677ed8b5g223ebb4288c0c167@mail.gmail.com>
References: <66d0a6e10708312148w677ed8b5g223ebb4288c0c167@mail.gmail.com>
Message-ID: <46D900C4.8050109@v.loewis.de>

Nicholas Bastin wrote:
> Is there no option in the installer to associate Python with .py, .pyc,
> etc.?

There certainly is.

> Obviously then the logical choice would be to unselect that (or
> perhaps have it unselected by default for alpha installations).

I'd rather have the user unselect it - people installing multiple
Python versions are familiar with the phenomenon and might get puzzled
if some installation suddenly behaved differently.

Regards,
Martin

From michele.simionato at gmail.com Sat Sep 1 10:33:32 2007
From: michele.simionato at gmail.com (Michele Simionato)
Date: Sat, 1 Sep 2007 10:33:32 +0200
Subject: [Python-3000] let's get rid of unbound super methods
Message-ID: <4edc17eb0709010133n5dd560e1i1956c1b4a395f96f@mail.gmail.com>

So Python 3000a1 is out! Kudos to everybody involved!
You did an incredible amount of work in a relatively short time! :-)

Having said that, let me go to the point. This morning I downloaded
the tarball and compiled everything without issues, then I
started playing around. One of the first things I looked at was the new
super, since it is a matter that made me scratch my head a lot in the
past. Basically I am happy with the implementation, especially about
the new magic name __class__ inside the methods which is something I
always wanted. So I am not here to ask for new features. I am actually
here to ask for fewer features: specifically, I would like the unbound
syntax for super to be removed.
I am talking about this:

>>> help(super)
Help on class super in module __builtin__:

class super(object)
 |  super() -> same as super(__class__, <first argument>)
 |  super(type) -> unbound super object
 |  super(type, obj) -> bound super object; requires isinstance(obj, type)
 |  super(type, type2) -> bound super object; requires issubclass(type2, type)


The single argument syntax 'super(type)' is what I call the unbound syntax.
I would like 'super(type)' to be removed from the valid signatures.
AFAIK, the only use case for it was the implementation of the autosuper
recipe in Guido's new style classes essay. That use case has disappeared
nowadays, and I cannot think of other situations where one may want to use
that feature (you may think differently, if so, please speak).
The other reason why I would like it to be removed (apart from the fact
that it looks unneeded to me) is that it is very difficult to explain to
beginners. For instance in the past I lectured on Python, and in order
to explain why unbound super objects can be useful I gave this example,
which is basically Guido's autosuper recipe implemented by hand:

class B(object):
    def __repr__(self):
        return '<instance of %s>' % self.__class__.__name__
    #@classmethod
    def cmeth(self):
        print("B.meth called from %s" % self)

class C(B):
    #@classmethod
    def cmeth(self):
        print("C.meth called from %s" % self)
        self.__super.cmeth()

C._C__super = super(C)

c = C()

c.cmeth()

Here everything works because the unbound super object is a descriptor
and self.__super calls super(C).__get__(self, C) which corresponds
to the bound method super(C, self) which is able to dispatch to .cmeth.
However, if you uncomment the classmethod decorator, self.__super (where
self is now the class C) will just return the unbound super object super(C)
which is unable to dispatch to .cmeth. Now, try to explain that to a beginner!
We can live just as well without unbound super methods, so let's take
the occasion of Python3k to remove this glitch.
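
The two cases described above can be sketched as follows. This is a
minimal illustration assuming a current Python 3 interpreter (where the
one-argument form super(C) still exists), not the 3.0a1 build discussed
here; the class and variable names are illustrative, and the methods
return strings instead of printing so the behaviour is easy to check.

```python
# Assumption: recent Python 3, where the one-argument form super(C) still exists.
class B:
    def meth(self):
        return "B.meth"

    @classmethod
    def cmeth(cls):
        return "B.cmeth"


class C(B):
    def meth(self):
        # self.__super is name-mangled to self._C__super; the unbound super
        # object stored on the class is a descriptor, so instance access
        # yields the bound equivalent of super(C, self).
        return "C.meth -> " + self.__super.meth()

    @classmethod
    def cmeth(cls):
        # Here the lookup happens on the class itself, so __get__ receives no
        # instance and hands back the unbound super object unchanged.
        return "C.cmeth -> " + cls.__super.cmeth()


C._C__super = super(C)

# Instance access works: the descriptor binds to the instance.
instance_result = C().meth()

# Class access fails: the unbound super object has no attribute 'cmeth'.
try:
    C.cmeth()
    classmethod_failed = False
except AttributeError:
    classmethod_failed = True
```

The asymmetry between the two lookups is the part that is hard to explain
to beginners.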

     Michele Simionato

From guido at python.org Sat Sep 1 17:01:58 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 1 Sep 2007 08:01:58 -0700
Subject: [Python-3000] let's get rid of unbound super methods
In-Reply-To: <4edc17eb0709010133n5dd560e1i1956c1b4a395f96f@mail.gmail.com>
References: <4edc17eb0709010133n5dd560e1i1956c1b4a395f96f@mail.gmail.com>
Message-ID:

Thanks for proposing this -- I've been scratching my head wondering
what the use of unbound super() would be. :-) I'm fine with killing it
-- perhaps someone can do a bit of research to try and find out if
there are any real-life uses (apart from various auto-super clones)?

--Guido

On 9/1/07, Michele Simionato wrote:
> So Python 3000a1 is out! Kudos to everybody involved!
> You did an incredible amount of work in a relatively short time! :-)
>
> Having said that, let me go to the point. This morning I downloaded
> the tarball and compiled everything without issues, then I
> started playing around. One of the first things I looked at was the new
> super, since it is a matter that made me scratch my head a lot in the
> past. Basically I am happy with the implementation, especially about
> the new magic name __class__ inside the methods which is something I
> always wanted. So I am not here to ask for new features. I am actually
> here to ask for fewer features: specifically, I would like the unbound
> syntax for super to be removed. I am talking about this:
>
> >>> help(super)
> Help on class super in module __builtin__:
>
> class super(object)
>  |  super() -> same as super(__class__, <first argument>)
>  |  super(type) -> unbound super object
>  |  super(type, obj) -> bound super object; requires isinstance(obj, type)
>  |  super(type, type2) -> bound super object; requires issubclass(type2, type)
>
>
> The single argument syntax 'super(type)' is what I call the unbound syntax.
> I would like 'super(type)' to be removed from the valid signatures.
> AFAIK, the only use case for it was the implementation of the autosuper
> recipe in Guido's new style classes essay. That use case has disappeared
> nowadays, and I cannot think of other situations where one may want to use
> that feature (you may think differently, if so, please speak).
> The other reason why I would like it to be removed (apart from the fact
> that it looks unneeded to me) is that it is very difficult to explain to
> beginners. For instance in the past I lectured on Python, and in order
> to explain why unbound super objects can be useful I gave this example,
> which is basically Guido's autosuper recipe implemented by hand:
>
> class B(object):
>     def __repr__(self):
>         return '<instance of %s>' % self.__class__.__name__
>     #@classmethod
>     def cmeth(self):
>         print("B.meth called from %s" % self)
>
> class C(B):
>     #@classmethod
>     def cmeth(self):
>         print("C.meth called from %s" % self)
>         self.__super.cmeth()
>
> C._C__super = super(C)
>
> c = C()
>
> c.cmeth()
>
> Here everything works because the unbound super object is a descriptor
> and self.__super calls super(C).__get__(self, C) which corresponds
> to the bound method super(C, self) which is able to dispatch to .cmeth.
> However, if you uncomment the classmethod decorator, self.__super (where
> self is now the class C) will just return the unbound super object super(C)
> which is unable to dispatch to .cmeth. Now, try to explain that to a beginner!
> We can live just as well without unbound super methods, so let's take
> the occasion of Python3k to remove this glitch.
>
>
>      Michele Simionato
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From nick.bastin at gmail.com Sun Sep 2 08:14:37 2007
From: nick.bastin at gmail.com (Nicholas Bastin)
Date: Sun, 2 Sep 2007 02:14:37 -0400
Subject: [Python-3000] Windows registry question from blog
In-Reply-To: <46D900C4.8050109@v.loewis.de>
References: <66d0a6e10708312148w677ed8b5g223ebb4288c0c167@mail.gmail.com>
    <46D900C4.8050109@v.loewis.de>
Message-ID: <66d0a6e10709012314g4142e74blefd2a4620e11e4@mail.gmail.com>

On 9/1/07, "Martin v. Löwis" wrote:
>
> > Obviously then the logical choice would be to unselect that (or
> > perhaps have it unselected by default for alpha installations).
>
> I'd rather have the user unselect it - people installing multiple
> Python versions are familiar with the phenomenon and might get puzzled
> if some installation suddenly behaved differently.
>

If this were an actual certified "release" of Python, I'd agree with that.
However, it's not - it's a specifically-incompatible alpha release, and I
would vote for it being unselected by default. (People familiar with
installing multiple Python versions will not be familiar with anything close
to this level of incompatibility in their .py files).

--
Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070902/b43cbdff/attachment.htm

From brett at python.org Sun Sep 2 09:17:07 2007
From: brett at python.org (Brett Cannon)
Date: Sun, 2 Sep 2007 00:17:07 -0700
Subject: [Python-3000] Ambiguity in PEP 3115 and the args to __prepare__
Message-ID:

PEP 3115 says a metaclass' __prepare__ takes two positional arguments,
name and bases.
But the example has it actually accept an arbitrary
number of arguments: name and then everything else is bound to bases.

Which happens to be true? I'm too tired to even fully trust that I am
reading the PEP correctly, so I am not about to try to write an
example to see which is correct and come up with a coherent rewording
if I am right about what is wrong. =)

-Brett

From ggpolo at gmail.com Sun Sep 2 15:42:35 2007
From: ggpolo at gmail.com (Guilherme Polo)
Date: Sun, 2 Sep 2007 10:42:35 -0300
Subject: [Python-3000] Ambiguity in PEP 3115 and the args to __prepare__
In-Reply-To:
References:
Message-ID:

2007/9/2, Brett Cannon :
> PEP 3115 says a metaclass' __prepare__ takes two positional arguments,
> name and bases. But the example has it actually accept an arbitrary
> number of arguments: name and then everything else is bound to bases.
>
> Which happens to be true?

I've played with it a bit and as far as I can see it only takes name and
bases. Maybe there is something secret ;p about using *args in
__prepare__ that I don't know yet.

> I'm too tired to even fully trust that I am
> reading the PEP correctly, so I am not about to try to write an
> example to see which is correct and come up with a coherent rewording
> if I am right about what is wrong. =)
>
> -Brett

--
-- Guilherme H. Polo Goncalves

From guido at python.org Sun Sep 2 17:07:55 2007
From: guido at python.org (Guido van Rossum)
Date: Sun, 2 Sep 2007 08:07:55 -0700
Subject: [Python-3000] Ambiguity in PEP 3115 and the args to __prepare__
In-Reply-To:
References:
Message-ID:

On 9/2/07, Brett Cannon wrote:
> PEP 3115 says a metaclass' __prepare__ takes two positional arguments,
> name and bases.
> But the example has it actually accept an arbitrary
> number of arguments: name and then everything else is bound to bases.
>
> Which happens to be true? I'm too tired to even fully trust that I am
> reading the PEP correctly, so I am not about to try to write an
> example to see which is correct and come up with a coherent rewording
> if I am right about what is wrong. =)

I think you're misreading what you think is an example. I'm assuming
you're referring to this code:

    def prepare_class(name, *bases, metaclass=None, **kwargs):
        if metaclass is None:
            metaclass = compute_default_metaclass(bases)
        prepare = getattr(metaclass, '__prepare__', None)
        if prepare is not None:
            return prepare(name, bases, **kwargs)
        else:
            return dict()

This indeed *defines* a function with a *bases argument, but it is not
called __prepare__! It *calls* __prepare__ passing it name and bases,
i.e. the 2nd argument to prepare is a tuple of bases.

The only example defining __prepare__ later in the PEP takes two
positional arguments (name and bases again).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From trentm at gmail.com Sun Sep 2 18:26:07 2007
From: trentm at gmail.com (Trent Mick)
Date: Sun, 2 Sep 2007 09:26:07 -0700
Subject: [Python-3000] Windows registry question from blog
In-Reply-To: <66d0a6e10709012314g4142e74blefd2a4620e11e4@mail.gmail.com>
References: <66d0a6e10708312148w677ed8b5g223ebb4288c0c167@mail.gmail.com>
    <46D900C4.8050109@v.loewis.de>
    <66d0a6e10709012314g4142e74blefd2a4620e11e4@mail.gmail.com>
Message-ID: <6db0ea510709020926p6745e419x1b02217016addcad@mail.gmail.com>

> > > Obviously then the logical choice would be to unselect that (or
> > > perhaps have it unselected by default for alpha installations).
> >
> > I'd rather have the user unselect it - people installing multiple
> > Python versions are familiar with the phenomenon and might get puzzled
> > if some installation suddenly behaved differently.
> >
>
> If this were an actual certified "release" of Python, I'd agree with that.
> However, it's not - it's a specifically-incompatible alpha release, and I
> would vote for it being unselected by default. (People familiar with
> installing multiple Python versions will not be familiar with anything close
> to this level of incompatibility in their .py files).

FWIW, this is what I do for the ActivePython (and Komodo) installers:
only do the PATHEXT, PATH and file association changes by default in
final releases and require the user to select that for alpha/beta
releases.

Trent

--
Trent Mick
trentm at gmail.com

From martin at v.loewis.de Sun Sep 2 19:24:25 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Sun, 02 Sep 2007 19:24:25 +0200
Subject: [Python-3000] Windows registry question from blog
In-Reply-To: <6db0ea510709020926p6745e419x1b02217016addcad@mail.gmail.com>
References: <66d0a6e10708312148w677ed8b5g223ebb4288c0c167@mail.gmail.com>
    <46D900C4.8050109@v.loewis.de>
    <66d0a6e10709012314g4142e74blefd2a4620e11e4@mail.gmail.com>
    <6db0ea510709020926p6745e419x1b02217016addcad@mail.gmail.com>
Message-ID: <46DAF1C9.6000009@v.loewis.de>

> FWIW, this is what I do for the ActivePython (and Komodo) installers:
> only do the PATHEXT, PATH and file association changes by default in
> final releases and require the user to select that for alpha/beta
> releases.

That's actually worth something; I'll see whether I can find the time
to change this for a2. I'd like to make that computed, so I don't have
to change the script for a release. Contributions are welcome.

Regards,
Martin

From brett at python.org Sun Sep 2 19:43:34 2007
From: brett at python.org (Brett Cannon)
Date: Sun, 2 Sep 2007 10:43:34 -0700
Subject: [Python-3000] Ambiguity in PEP 3115 and the args to __prepare__
In-Reply-To:
References:
Message-ID:

On 9/2/07, Guido van Rossum wrote:
> On 9/2/07, Brett Cannon wrote:
> > PEP 3115 says a metaclass' __prepare__ takes two positional arguments,
> > name and bases. But the example has it actually accept an arbitrary
> > number of arguments: name and then everything else is bound to bases.
> >
> > Which happens to be true? I'm too tired to even fully trust that I am
> > reading the PEP correctly, so I am not about to try to write an
> > example to see which is correct and come up with a coherent rewording
> > if I am right about what is wrong. =)
>
> I think you're misreading what you think is an example. I'm assuming
> you're referring to this code:
>
>     def prepare_class(name, *bases, metaclass=None, **kwargs):
>         if metaclass is None:
>             metaclass = compute_default_metaclass(bases)
>         prepare = getattr(metaclass, '__prepare__', None)
>         if prepare is not None:
>             return prepare(name, bases, **kwargs)
>         else:
>             return dict()
>
> This indeed *defines* a function with a *bases argument, but it is not
> called __prepare__! It *calls* __prepare__ passing it name and bases,
> i.e. the 2nd argument to prepare is a tuple of bases.

Ah, OK, that is the issue (that and type.__prepare__ takes any
arguments and just always returns a new dictionary). So it was the
lack of sleep. =)

-Brett

From robin at nibor.org Sun Sep 2 23:10:08 2007
From: robin at nibor.org (Robin Stocker)
Date: Sun, 02 Sep 2007 23:10:08 +0200
Subject: [Python-3000] Patch for Doc/tutorial
In-Reply-To:
References:
Message-ID: <46DB26B0.2090007@nibor.org>

Paul Dubois wrote:
> Attached is a patch for changes to the tutorial. I made it by doing:
>
> svn diff tutorial > tutorial.diff
>
> in the Doc directory.
I hope this is what is wanted; if not let me know > what to do. > > Unfortunately cygwin will not run Sphinx correctly even using 2.5, much > less 3.0. And running docutils by hand gets a lot of errors because > Sphinx has hidden a lot of the definitions used in the tutorial. So the > bottom line is I have only an imperfect idea if I have screwed up any > formatting. > > I would like to rewrite the classes.rst file in particular, and it is > the one that I did not check to be sure the examples worked, but first I > need to do something about getting me a real Linux so I don't have these > problems. So unless someone is hot to trot I'd like to remain 'owner' of > this issue on the spreadsheet. > > Whoever puts in these patches, I would appreciate being notified that it > is done. > > Paul I've had a look at the patch and here's another one against the current py3k, to be applied in the Doc directory. It mostly fixes some code formatting errors, like no space after a comma. Robin Stocker -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: tutorial-formatting-fixes.patch Url: http://mail.python.org/pipermail/python-3000/attachments/20070902/2fcab1da/attachment.txt From g.brandl at gmx.net Mon Sep 3 09:10:48 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Mon, 03 Sep 2007 09:10:48 +0200 Subject: [Python-3000] Patch for Doc/tutorial In-Reply-To: <46DB26B0.2090007@nibor.org> References: <46DB26B0.2090007@nibor.org> Message-ID: Robin Stocker schrieb: > Paul Dubois schrieb: >> Attached is a patch for changes to the tutorial. I made it by doing: >> >> svn diff tutorial > tutorial.diff >> >> in the Doc directory. I hope this is what is wanted; if not let me know >> what to do. >> >> Unfortunately cygwin will not run Sphinx correctly even using 2.5, much >> less 3.0. And running docutils by hand gets a lot of errors because >> Sphinx has hidden a lot of the definitions used in the tutorial. 
So the
>> bottom line is I have only an imperfect idea if I have screwed up any
>> formatting.
>>
>> I would like to rewrite the classes.rst file in particular, and it is
>> the one that I did not check to be sure the examples worked, but first I
>> need to do something about getting me a real Linux so I don't have these
>> problems. So unless someone is hot to trot I'd like to remain 'owner' of
>> this issue on the spreadsheet.
>>
>> Whoever puts in these patches, I would appreciate being notified that it
>> is done.
>>
>> Paul
>
> I've had a look at the patch and here's another one against the current
> py3k, to be applied in the Doc directory. It mostly fixes some code
> formatting errors, like no space after a comma.

Thanks very much (again), applied as rev. 57923.

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

From baranguren at gmail.com Mon Sep 3 16:59:33 2007
From: baranguren at gmail.com (Benjamin Aranguren)
Date: Mon, 3 Sep 2007 07:59:33 -0700
Subject: [Python-3000] backported ABC
In-Reply-To:
References:
Message-ID:

I am having a problem backporting collections.py/_abcoll.py and would
like to get your input.

There's one test in test_collections that fails.

class TestOneTrickPonyABCs(unittest.TestCase):

    def test_Hashable(self):
        # Check some non-hashables
        non_samples = [list(), set(), dict()]
        for x in non_samples:
            self.failIf(isinstance(x, Hashable), repr(x))
            self.failIf(issubclass(type(x), Hashable), repr(type(x)))

The problem is that list, set, and dict all have a __hash__ function,
so isinstance and issubclass return true even though none of list, set,
and dict was registered as a subclass of Hashable.

But calling x.__hash__() on these types results in a TypeError: list
objects are unhashable.

Thanks!
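
For comparison, here is a minimal sketch of the behaviour this test expects, written against a current Python 3 interpreter rather than the 2.6 backport under discussion. It rests on an assumption that is not stated anywhere in this thread: that the unhashable built-ins expose `__hash__` as `None`, which lets the `Hashable` ABC reject them without any `register()` call. The variable names are illustrative only.

```python
from collections.abc import Hashable

# The same "non-samples" used by test_Hashable in the message above.
non_samples = [list(), set(), dict()]

# Assumption: on Python 3 these types set __hash__ to None, so the
# Hashable ABC's structural check treats them as unhashable.
hash_slots = [type(x).__hash__ for x in non_samples]

# isinstance()/issubclass() against the ABC, with no register() involved.
instance_checks = [isinstance(x, Hashable) for x in non_samples]
subclass_checks = [issubclass(type(x), Hashable) for x in non_samples]

# A tuple, by contrast, defines a real __hash__ and passes the check.
tuple_is_hashable = isinstance((), Hashable)

print(hash_slots, instance_checks, subclass_checks, tuple_is_hashable)
```

A 2.x backport would need some equivalent way to tell "has a `__hash__` that always raises" apart from "has a usable `__hash__`"; the sketch only shows the end result, not how the backport should get there.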
On 8/26/07, Benjamin Aranguren wrote: > I got it now. both modules need to be backported as well. I'm on it. > > On 8/26/07, Benjamin Aranguren wrote: > > No problem. Created issue 1026 in tracker with a single patch file attached. > > > > I'm not aware of what changes need to be done with _abcoll.py and > > collections.py. If you can point me to the right direction, I would > > definitely like to work on it. > > > > On 8/26/07, Guido van Rossum wrote: > > > Thanks! > > > > > > Would it inconvenience you terribly to upload this all to the new > > > tracker (bugs.python.org)? Preferably as a single patch against the > > > svn trunk (to use svn diff, you have to svn add the new files first!) > > > > > > Also, are you planning to work on _abcoll.py and the changes to collections.py? > > > > > > --Guido > > > > > > On 8/26/07, Benjamin Aranguren wrote: > > > > We copied abc.py and test_abc.py from py3k svn and modified to work with 2.6. > > > > > > > > After making all the changes we ran all the tests to ensure that no > > > > other modules were affected. > > > > > > > > Attached are abc.py, test_abc.py, and their relevant patches from 3.0 to 2.6. > > > > > > > > On 8/25/07, Guido van Rossum wrote: > > > > > Um, that patch contains only the C code for overloading isinstance() > > > > > and issubclass(). > > > > > > > > > > Did you do anything about abc.py and _abcoll.py/collections.py and > > > > > their respective unit tests? Or what about the unit tests for > > > > > isinstance()/issubclass()? > > > > > > > > > > On 8/25/07, Benjamin Aranguren wrote: > > > > > > Worked with Alex Martelli at the Goolge Python Sprint. 
> > > > >
> > > > > --
> > > > > --Guido van Rossum (home page: http://www.python.org/~guido/)
> > >
> > > --
> > > --Guido van Rossum (home page: http://www.python.org/~guido/)

From eric+python-dev at trueblade.com Mon Sep 3 17:06:55 2007
From: eric+python-dev at trueblade.com (Eric Smith)
Date: Mon, 03 Sep 2007 11:06:55 -0400
Subject: [Python-3000] str.format vs. string.Formatter exceptions
Message-ID: <46DC230F.2040409@trueblade.com>

Ron Adam points out some differences in which exceptions are thrown by
str.format and string.Formatter. For example, on a missing positional
argument:

>>> "{0}".format()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Not enough positional arguments in format string

>>> Formatter().format("{0}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared/src/python/py3k/Lib/string.py", line 201, in format
    return self.vformat(format_string, args, kwargs)
  File "/shared/src/python/py3k/Lib/string.py", line 220, in vformat
    obj, arg_used = self.get_field(field_name, args, kwargs)
  File "/shared/src/python/py3k/Lib/string.py", line 278, in get_field
    obj = self.get_value(first, args, kwargs)
  File "/shared/src/python/py3k/Lib/string.py", line 235, in get_value
    return args[key]
IndexError: tuple index out of range

The PEP says: In general, exceptions generated by the formatter code
itself are of the "ValueError" variety -- there is an error in the
actual "value" of the format string.

I can easily change string.Formatter to make this a ValueError, and I
think that's probably the right thing to do. For example, if the string
comes from a translation module, then there might be an extra parameter
added by mistake, in which case ValueError seems right to me.

But I'd like to hear if anyone else thinks this should be an IndexError,
or maybe they both should be some other exception.
Similarly, '"{x}".format()' currently raises ValueError, but
'Formatter().format("{x}")' raises KeyError.

From nick.bastin at gmail.com Mon Sep 3 20:54:45 2007
From: nick.bastin at gmail.com (Nicholas Bastin)
Date: Mon, 3 Sep 2007 14:54:45 -0400
Subject: [Python-3000] Performance Notes
Message-ID: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com>

I've been doing some profiling of 3.0 vs. 2.6 release builds on
Windows XP for the purpose of hopefully closing the performance gap.
This data is very preliminary, but I thought I'd throw it out here in
case someone else also wanted to look into this. Also, possibly
useful for comparing against profiling data on other platforms. The
table below just lists functions and speed differentials in 3.0 vs.
2.6, ordered by the functions in which we spend the most total time.

NOTE: This data is time sampling, not call graph. Added time could
come from either more calls, or longer calls.

+ 11.5%  PyEval_EvalFrameEx
+ 40.2%  lookdict (replacing lookdict_string)
+312.9%  PyDict_GetItem
- 13.2%  call_function
+ 19.4%  fast_function

Other notes:
* PyLong_FitsInLong consumes about 2% of total pystone runtime.
* unicode_compare consumes the exact same time in 3.0 that
  string_richcompare consumed in 2.6. Either these functions share a
  similar CPU profile, or their call counts vary dramatically.

Top 5 functions in Python 2.6:

* PyEval_EvalFrameEx (48.66%)
* lookdict_string (5.76%)
* call_function (4.80%)
* frame_dealloc (2.80%)
* fast_function (2.48%)

Top 5 functions in Python 3.0:

* PyEval_EvalFrameEx (44.37%)
* lookdict (6.66%)
* PyDict_GetItem (4.63%)
* unicode_hash (3.51%)
* call_function (3.38%)

--
Nick

From thomas at python.org Tue Sep 4 01:33:33 2007
From: thomas at python.org (Thomas Wouters)
Date: Tue, 4 Sep 2007 01:33:33 +0200
Subject: [Python-3000] Merging between trunk and py3k?
In-Reply-To: References: Message-ID: <9e804ac0709031633w705f2c9fkb0cf3ef98a62840c@mail.gmail.com> On 8/31/07, Guido van Rossum wrote: > > I haven't heard yet that merging is impossible or useless; there's > still a lot of similarity between the trunk and the branch. Merging is sometimes hard, but always fun. Well, challenging. A Chinese kind of interesting time. It certainly forces the merger to keep up to date on changes in both branches :-) I'll happily keep on merging until at least 3.0final is released, quite possibly until 2.x is nailed to its perch. I wouldn't even mind doing that after the reindent of the py3k C source; everything would conflict, but 'diff -cbB' solves that nicely. -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070904/603706c9/attachment.htm From guido at python.org Tue Sep 4 04:25:17 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 3 Sep 2007 19:25:17 -0700 Subject: [Python-3000] Merging between trunk and py3k? In-Reply-To: <9e804ac0709031633w705f2c9fkb0cf3ef98a62840c@mail.gmail.com> References: <9e804ac0709031633w705f2c9fkb0cf3ef98a62840c@mail.gmail.com> Message-ID: Thanks for volunteering! Let me know when you're short on time and I'll take over (or appoint another volunteer :). --Guido On 9/3/07, Thomas Wouters wrote: > > > On 8/31/07, Guido van Rossum wrote: > > I haven't heard yet that merging is impossible or useless; there's > > still a lot of similarity between the trunk and the branch. > > Merging is sometimes hard, but always fun. Well, challenging. A Chinese kind > of interesting time. It certainly forces the merger to keep up to date on > changes in both branches :-) I'll happily keep on merging until at least > 3.0final is released, quite possibly until 2.x is nailed to its perch. 
I > wouldn't even mind doing that after the reindent of the py3k C source; > everything would conflict, but 'diff -cbB' solves that nicely. > > -- > Thomas Wouters > > Hi! I'm a .signature virus! copy me into your .signature file to help me > spread! -- --Guido van Rossum (home page: http://www.python.org/~guido/) From nick.bastin at gmail.com Tue Sep 4 04:33:37 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Mon, 3 Sep 2007 22:33:37 -0400 Subject: [Python-3000] Merging between trunk and py3k? In-Reply-To: <9e804ac0709031633w705f2c9fkb0cf3ef98a62840c@mail.gmail.com> References: <9e804ac0709031633w705f2c9fkb0cf3ef98a62840c@mail.gmail.com> Message-ID: <66d0a6e10709031933i3b8c0d88ma11429329a4b311d@mail.gmail.com> On 9/3/07, Thomas Wouters wrote: > > > On 8/31/07, Guido van Rossum wrote: > > I haven't heard yet that merging is impossible or useless; there's > > still a lot of similarity between the trunk and the branch. > > Merging is sometimes hard, but always fun. Well, challenging. A Chinese kind > of interesting time. Merging in SVN is hard and challenging. Merging in a reasonable SCM is not so bad. :-) (Unfortunately, in this context, read "reasonable" as "commercial") -- Nick From rrr at ronadam.com Tue Sep 4 04:38:22 2007 From: rrr at ronadam.com (Ron Adam) Date: Mon, 03 Sep 2007 21:38:22 -0500 Subject: [Python-3000] str.format vs. string.Formatter exceptions In-Reply-To: <46DC230F.2040409@trueblade.com> References: <46DC230F.2040409@trueblade.com> Message-ID: <46DCC51E.3050809@ronadam.com> Eric Smith wrote: > Ron Adam points out some differences in which exceptions are thrown by > str.format and string.Formatter. 
For example, on a missing positional > argument: > > >>> "{0}".format() > Traceback (most recent call last): > File "", line 1, in > ValueError: Not enough positional arguments in format string > > >>> Formatter().format("{0}") > Traceback (most recent call last): > File "", line 1, in > File "/shared/src/python/py3k/Lib/string.py", line 201, in format > return self.vformat(format_string, args, kwargs) > File "/shared/src/python/py3k/Lib/string.py", line 220, in vformat > obj, arg_used = self.get_field(field_name, args, kwargs) > File "/shared/src/python/py3k/Lib/string.py", line 278, in get_field > obj = self.get_value(first, args, kwargs) > File "/shared/src/python/py3k/Lib/string.py", line 235, in get_value > return args[key] > IndexError: tuple index out of range > > The PEP says: In general, exceptions generated by the formatter code > itself are of the "ValueError" variety -- there is an error in the > actual "value" of the format string. The PEP also says the following in regards to this... +---------------- Implementation note: The implementation of this proposal is not required to enforce the rule about a name being a valid Python identifier. Instead, it will rely on the getattr function of the underlying object to throw an exception if the identifier is not legal. The str.format() function will have a minimalist parser which only attempts to figure out when it is "done" with an identifier (by finding a '.' or a ']', or '}', etc.). +---------------- If these return ValueErrors, as I think it has been suggested in the earlier messages, then this will need to be updated as well. _RON > I can easily change string.Formatter to make this a ValueError, and I > think that's probably the right thing to do. For example, if the string > comes from a translation module, then there might be an extra parameter > added by mistake, in which case ValueError seems right to me. 
>
> But I'd like to hear if anyone else thinks this should be an IndexError,
> or maybe they both should be some other exception.
>
> Similarly, '"{x}".format()' currently raises ValueError, but
> 'Formatter().format("{x}")' raises KeyError.
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/rrr%40ronadam.com
>
>

From aahz at pythoncraft.com Tue Sep 4 05:09:24 2007
From: aahz at pythoncraft.com (Aahz)
Date: Mon, 3 Sep 2007 20:09:24 -0700
Subject: [Python-3000] Merging between trunk and py3k?
In-Reply-To: <9e804ac0709031633w705f2c9fkb0cf3ef98a62840c@mail.gmail.com>
References: <9e804ac0709031633w705f2c9fkb0cf3ef98a62840c@mail.gmail.com>
Message-ID: <20070904030923.GA18848@panix.com>

On Tue, Sep 04, 2007, Thomas Wouters wrote:
>
> Merging is sometimes hard, but always fun. Well, challenging. A
> Chinese kind of interesting time.

Not so Chinese, actually: http://www.noblenet.org/reference/inter.htm
--
Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer
http://www.lysator.liu.se/c/ten-commandments.html

From guido at python.org Tue Sep 4 05:09:28 2007
From: guido at python.org (Guido van Rossum)
Date: Mon, 3 Sep 2007 20:09:28 -0700
Subject: [Python-3000] str.format vs. string.Formatter exceptions
In-Reply-To: <46DC230F.2040409@trueblade.com>
References: <46DC230F.2040409@trueblade.com>
Message-ID:

Since IndexError and KeyError are conceptually like ValueError but in
a more narrowly defined context, I think IndexError and KeyError
actually make sense here (even though they don't inherit from
ValueError).
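
The inheritance point can be checked with a short sketch. It is run against a current Python 3 interpreter, so what it reports may not match the 2007 py3k branch quoted in this thread. `exception_name` is a hypothetical helper introduced only for this illustration.

```python
from string import Formatter


def exception_name(func):
    """Return the name of the exception raised by func(), or None."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# Missing positional argument and missing keyword argument.
formatter_positional = exception_name(lambda: Formatter().format("{0}"))
formatter_keyword = exception_name(lambda: Formatter().format("{x}"))

# IndexError and KeyError share LookupError as a base class, not ValueError.
inherits_lookup = (issubclass(IndexError, LookupError)
                   and issubclass(KeyError, LookupError))
inherits_value = (issubclass(IndexError, ValueError)
                  or issubclass(KeyError, ValueError))

print(formatter_positional, formatter_keyword, inherits_lookup, inherits_value)
```

In practice this means a caller who wants to catch every "bad format string or bad arguments" case has to name both `ValueError` and `LookupError` in the except clause.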
--Guido On 9/3/07, Eric Smith wrote: > Ron Adam points out some differences in which exceptions are thrown by > str.format and string.Formatter. For example, on a missing positional > argument: > > >>> "{0}".format() > Traceback (most recent call last): > File "", line 1, in > ValueError: Not enough positional arguments in format string > > >>> Formatter().format("{0}") > Traceback (most recent call last): > File "", line 1, in > File "/shared/src/python/py3k/Lib/string.py", line 201, in format > return self.vformat(format_string, args, kwargs) > File "/shared/src/python/py3k/Lib/string.py", line 220, in vformat > obj, arg_used = self.get_field(field_name, args, kwargs) > File "/shared/src/python/py3k/Lib/string.py", line 278, in get_field > obj = self.get_value(first, args, kwargs) > File "/shared/src/python/py3k/Lib/string.py", line 235, in get_value > return args[key] > IndexError: tuple index out of range > > The PEP says: In general, exceptions generated by the formatter code > itself are of the "ValueError" variety -- there is an error in the > actual "value" of the format string. > > I can easily change string.Formatter to make this a ValueError, and I > think that's probably the right thing to do. For example, if the string > comes from a translation module, then there might be an extra parameter > added by mistake, in which case ValueError seems right to me. > > But I'd like to hear if anyone else thinks this should be an IndexError, > or maybe they both should be some other exception. > > Similarly "{x}".format()' currently raises ValueError, but > 'Formatter().format("{x}")' raises KeyError. 
> _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 4 05:16:43 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 3 Sep 2007 20:16:43 -0700 Subject: [Python-3000] Performance Notes In-Reply-To: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com> References: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com> Message-ID: Interesting! Thanks for doing this. We'll need a lot of this over the coming year. I read in this that the increased cost is largely due to using unicode strings for all variable and attribute names. So the next step might be to optimize the snot out of unicode hashing and introduce the unicode equivalent of lookup_string (while retiring the 8-bit version). The unicode type has never received the same amount of love that the 8-bit str type received over the years (and from day zero). BTW this goes to show that int operations are *not* (yet) the biggest bottleneck -- though I'm sure they're bubbling under. PS It would be interesting to collect more "holistic" benchmarks (micro-benchmarks aren't particularly interesting in this stage, as we're trying to improve *overall* performance). --Guido On 9/3/07, Nicholas Bastin wrote: > I've been doing some profiling of 3.0 vs. 2.6 release builds on > Windows XP for the purpose of hopefully closing the performance gap. > This data is very preliminary, but I thought I'd throw it out here in > case someone else also wanted to look into this. Also, possibly > useful for comparing against profiling data on other platforms. The > table below just lists functions and speed differentials in 3.0 vs. > 2.6, ordered by the functions in which we spend the most total time. 
> > NOTE: This data is time sampling, not call graph. Added time could > come from either more calls, or longer calls. > > + 11.5% PyEval_EvalFrameEx > + 40.2% lookdict (replacing lookdict_string) > +312.9% PyDict_GetItem > - 13.2% call_function > + 19.4% fast_function > > Other notes: > * PyLong_FitsInLong consumes about 2% of total pystone runtime. > * unicode_compare consumes the exact same time in 3.0 that > string_richcompare consumed in 2.6. Either these functions share a > similar CPU profile, or their call counts vary dramatically. > > Top 5 functions in Python 2.6: > > * PyEval_EvalFrameEx (48.66%) > * lookdict_string (5.76%) > * call_function (4.80%) > * frame_dealloc (2.80%) > * fast_function (2.48%) > > Top 5 functions in Python 3.0: > > * PyEval_EvalFrameEx (44.37%) > * lookdict (6.66%) > * PyDict_GetItem (4.63%) > * unicode_hash (3.51%) > * call_function (3.38%) > > -- > Nick > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 4 05:30:41 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 3 Sep 2007 20:30:41 -0700 Subject: [Python-3000] backported ABC In-Reply-To: References: Message-ID: You're going to have to do some spelunking in the 3.0 source (because I don't have time right now :-), but I think 3.0 has some magic that solves this. I *think* it is done by not inheriting tp_hash unless tp_richcompare is also inherited. The details are probably in typeobject.c. Ask me again tomorrow if you can't figure it out. --Guido On 9/3/07, Benjamin Aranguren wrote: > I am having a problem backporting collections.py/_abcoll.py and would > like to get your input. > > There's one test in test_collections that fails. 
> > class TestOneTrickPonyABCs(unittest.TestCase): > > def test_Hashable(self): > # Check some non-hashables > non_samples = [list(), set(), dict()] > for x in non_samples: > self.failIf(isinstance(x, Hashable), repr(x)) > self.failIf(issubclass(type(x), Hashable), repr(type(x))) > > The problem is list, set, dict all has __hash__ function so isinstance > and issubclass returns true even though none of list, set, and dict > was registered as a subclass of Hashable. > > But, calling x.__hash__() on these types results to a TypeError: list > objects are unhashable. > > Thanks! > > On 8/26/07, Benjamin Aranguren wrote: > > I got it now. both modules need to be backported as well. I'm on it. > > > > On 8/26/07, Benjamin Aranguren wrote: > > > No problem. Created issue 1026 in tracker with a single patch file attached. > > > > > > I'm not aware of what changes need to be done with _abcoll.py and > > > collections.py. If you can point me to the right direction, I would > > > definitely like to work on it. > > > > > > On 8/26/07, Guido van Rossum wrote: > > > > Thanks! > > > > > > > > Would it inconvenience you terribly to upload this all to the new > > > > tracker (bugs.python.org)? Preferably as a single patch against the > > > > svn trunk (to use svn diff, you have to svn add the new files first!) > > > > > > > > Also, are you planning to work on _abcoll.py and the changes to collections.py? > > > > > > > > --Guido > > > > > > > > On 8/26/07, Benjamin Aranguren wrote: > > > > > We copied abc.py and test_abc.py from py3k svn and modified to work with 2.6. > > > > > > > > > > After making all the changes we ran all the tests to ensure that no > > > > > other modules were affected. > > > > > > > > > > Attached are abc.py, test_abc.py, and their relevant patches from 3.0 to 2.6. > > > > > > > > > > On 8/25/07, Guido van Rossum wrote: > > > > > > Um, that patch contains only the C code for overloading isinstance() > > > > > > and issubclass(). 
> > > > > > > > > > > > Did you do anything about abc.py and _abcoll.py/collections.py and > > > > > > their respective unit tests? Or what about the unit tests for > > > > > > isinstance()/issubclass()? > > > > > > > > > > > > On 8/25/07, Benjamin Aranguren wrote: > > > > > > > Worked with Alex Martelli at the Goolge Python Sprint. > > > > > > > > > > > > -- > > > > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > > > > > > > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From baranguren at gmail.com Tue Sep 4 05:37:00 2007 From: baranguren at gmail.com (Benjamin Aranguren) Date: Mon, 3 Sep 2007 20:37:00 -0700 Subject: [Python-3000] backported ABC In-Reply-To: References: Message-ID: Thanks! This helps. I was just not sure if I was on the right track or not. I did try disabling &list_nohash in listobject.c I think I have the right idea and just needed some reassurance. I'll give it another try. Thanks again. On 9/3/07, Guido van Rossum wrote: > You're going to have to do some spelunking in the 3.0 source (because > I don't have time right now :-), but I think 3.0 has some magic that > solves this. I *think* it is done by not inheriting tp_hash unless > tp_richcompare is also inherited. The details are probably in > typeobject.c. > > Ask me again tomorrow if you can't figure it out. > > --Guido > > On 9/3/07, Benjamin Aranguren wrote: > > I am having a problem backporting collections.py/_abcoll.py and would > > like to get your input. > > > > There's one test in test_collections that fails. 
> > > > class TestOneTrickPonyABCs(unittest.TestCase): > > > > def test_Hashable(self): > > # Check some non-hashables > > non_samples = [list(), set(), dict()] > > for x in non_samples: > > self.failIf(isinstance(x, Hashable), repr(x)) > > self.failIf(issubclass(type(x), Hashable), repr(type(x))) > > > > The problem is list, set, dict all has __hash__ function so isinstance > > and issubclass returns true even though none of list, set, and dict > > was registered as a subclass of Hashable. > > > > But, calling x.__hash__() on these types results to a TypeError: list > > objects are unhashable. > > > > Thanks! > > > > On 8/26/07, Benjamin Aranguren wrote: > > > I got it now. both modules need to be backported as well. I'm on it. > > > > > > On 8/26/07, Benjamin Aranguren wrote: > > > > No problem. Created issue 1026 in tracker with a single patch file attached. > > > > > > > > I'm not aware of what changes need to be done with _abcoll.py and > > > > collections.py. If you can point me to the right direction, I would > > > > definitely like to work on it. > > > > > > > > On 8/26/07, Guido van Rossum wrote: > > > > > Thanks! > > > > > > > > > > Would it inconvenience you terribly to upload this all to the new > > > > > tracker (bugs.python.org)? Preferably as a single patch against the > > > > > svn trunk (to use svn diff, you have to svn add the new files first!) > > > > > > > > > > Also, are you planning to work on _abcoll.py and the changes to collections.py? > > > > > > > > > > --Guido > > > > > > > > > > On 8/26/07, Benjamin Aranguren wrote: > > > > > > We copied abc.py and test_abc.py from py3k svn and modified to work with 2.6. > > > > > > > > > > > > After making all the changes we ran all the tests to ensure that no > > > > > > other modules were affected. > > > > > > > > > > > > Attached are abc.py, test_abc.py, and their relevant patches from 3.0 to 2.6. 
> > > > > > > > > > > > On 8/25/07, Guido van Rossum wrote: > > > > > > > Um, that patch contains only the C code for overloading isinstance() > > > > > > > and issubclass(). > > > > > > > > > > > > > > Did you do anything about abc.py and _abcoll.py/collections.py and > > > > > > > their respective unit tests? Or what about the unit tests for > > > > > > > isinstance()/issubclass()? > > > > > > > > > > > > > > On 8/25/07, Benjamin Aranguren wrote: > > > > > > > > Worked with Alex Martelli at the Goolge Python Sprint. > > > > > > > > > > > > > > -- > > > > > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > > > > > > > > > > > _______________________________________________ > > Python-3000 mailing list > > Python-3000 at python.org > > http://mail.python.org/mailman/listinfo/python-3000 > > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > > > > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > From rhamph at gmail.com Tue Sep 4 06:10:45 2007 From: rhamph at gmail.com (Adam Olsen) Date: Mon, 3 Sep 2007 22:10:45 -0600 Subject: [Python-3000] Performance Notes In-Reply-To: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com> References: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com> Message-ID: On 9/3/07, Nicholas Bastin wrote: > I've been doing some profiling of 3.0 vs. 2.6 release builds on > Windows XP for the purpose of hopefully closing the performance gap. > This data is very preliminary, but I thought I'd throw it out here in > case someone else also wanted to look into this. Also, possibly > useful for comparing against profiling data on other platforms. The > table below just lists functions and speed differentials in 3.0 vs. > 2.6, ordered by the functions in which we spend the most total time. 
> > NOTE: This data is time sampling, not call graph. Added time could > come from either more calls, or longer calls. > > + 11.5% PyEval_EvalFrameEx > + 40.2% lookdict (replacing lookdict_string) > +312.9% PyDict_GetItem > - 13.2% call_function > + 19.4% fast_function lookdict_string appears to still use the old string type, rather than unicode. This prevents it from being used. It's probably not too hard to fix. > Other notes: > * PyLong_FitsInLong consumes about 2% of total pystone runtime. > * unicode_compare consumes the exact same time in 3.0 that > string_richcompare consumed in 2.6. Either these functions share a > similar CPU profile, or their call counts vary dramatically. > > Top 5 functions in Python 2.6: > > * PyEval_EvalFrameEx (48.66%) > * lookdict_string (5.76%) > * call_function (4.80%) > * frame_dealloc (2.80%) > * fast_function (2.48%) > > Top 5 functions in Python 3.0: > > * PyEval_EvalFrameEx (44.37%) > * lookdict (6.66%) > * PyDict_GetItem (4.63%) > * unicode_hash (3.51%) > * call_function (3.38%) -- Adam Olsen, aka Rhamphoryncus From amk at amk.ca Mon Sep 3 18:53:47 2007 From: amk at amk.ca (A.M. Kuchling) Date: Mon, 3 Sep 2007 12:53:47 -0400 Subject: [Python-3000] [mark@qtrac.eu: Poss. clarification for What's New in Python 3] Message-ID: <20070903165347.GA24392@mac.local> Forwarded: a comment on the 3.0 What's New. --amk -------------- next part -------------- An embedded message was scrubbed... From: Mark Summerfield Subject: Poss. clarification for What's New in Python 3 Date: Sat, 1 Sep 2007 08:55:42 +0100 Size: 3400 Url: http://mail.python.org/pipermail/python-3000/attachments/20070903/787db536/attachment.eml From g.brandl at gmx.net Tue Sep 4 08:23:11 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 04 Sep 2007 08:23:11 +0200 Subject: [Python-3000] What about operator.*slice? Message-ID: Are they useful enough to keep? -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. 
Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From thomas at python.org Tue Sep 4 10:43:08 2007 From: thomas at python.org (Thomas Wouters) Date: Tue, 4 Sep 2007 10:43:08 +0200 Subject: [Python-3000] Merging between trunk and py3k? In-Reply-To: <66d0a6e10709031933i3b8c0d88ma11429329a4b311d@mail.gmail.com> References: <9e804ac0709031633w705f2c9fkb0cf3ef98a62840c@mail.gmail.com> <66d0a6e10709031933i3b8c0d88ma11429329a4b311d@mail.gmail.com> Message-ID: <9e804ac0709040143q23bcd22by78bf66e4138faa83@mail.gmail.com> On 9/4/07, Nicholas Bastin wrote: > > On 9/3/07, Thomas Wouters wrote: > > > > > > On 8/31/07, Guido van Rossum wrote: > > > I haven't heard yet that merging is impossible or useless; there's > > > still a lot of similarity between the trunk and the branch. > > > > Merging is sometimes hard, but always fun. Well, challenging. A Chinese > kind > > of interesting time. > > Merging in SVN is hard and challenging. Merging in a reasonable SCM > is not so bad. :-) Merging two direct sibling branches with svnmerge is actually quite doable. It's slightly more annoying than it would be in an SCM with proper branch merging, but not significantly so. The merges we're doing would be about as hard and challenging in any other SCM. I know, I actually did them. -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/python-3000/attachments/20070904/5f05bb83/attachment-0001.htm

From noamraph at gmail.com Tue Sep 4 10:49:53 2007
From: noamraph at gmail.com (Noam Raphael)
Date: Tue, 4 Sep 2007 11:49:53 +0300
Subject: [Python-3000] Default dict iterator should have been iteritems()
Message-ID:

Hello,

Just a thought that came to me after writing some code that deals quite
a lot with dicts:

The default dict iterator should in principle be iteritems(), and not
iterkeys().

This is probably just theoretical, since it will break a lot of code and
not gain a lot, but it may be remembered when someone decides to write
a new language...

The reasoning is simple: Iteration over an object usually gets all the
data it contains. A dict can be seen as an unordered collection of
tuples (key, value), indexed by key. So, iteration over a dict should
yield those tuples.

For this reason, I think that "for key, value in dict.iteritems()" is
more common than "for key in dict" - When iterating over a dict, you
are usually interested in both the key and the value.

Another point: if the default dict iterator were iteritems(), the dict
copy constructor would not have been a special case - dict(x) always
gets an iterable over tuples and produces a new dict. Currently, if
you want to produce a dict from a UserDict, for example, you must call
dict(userdict.iteritems()).

As I see it, the only reason for the current status is the desire to
make "x in dict" equivalent to "dict.has_key(x)", since has_key is a
common operation and "x in" is shorter. But actually "dict.has_key(x)"
explains exactly what's intended, while "x in dict" isn't really clear
(for newbies, that is): do you ask whether x is in dict.keys(), or in
dict.values(), or in dict.items()?

Of course, if dict's default iterator were iteritems(), "x in dict"
should have meant "x in dict.items()", which is very easy to
implement.

What do you think?
Noam From thomas at python.org Tue Sep 4 10:56:20 2007 From: thomas at python.org (Thomas Wouters) Date: Tue, 4 Sep 2007 10:56:20 +0200 Subject: [Python-3000] What about operator.*slice? In-Reply-To: References: Message-ID: <9e804ac0709040156x74a36892p1090d0d113f043f9@mail.gmail.com> On 9/4/07, Georg Brandl wrote: > > Are they useful enough to keep? operator.*slice? They're rather convenient when you don't want to bother with creating a slice object yourself, but I'm not worried either way. -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070904/cbcef254/attachment.htm From g.brandl at gmx.net Tue Sep 4 11:09:07 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 04 Sep 2007 11:09:07 +0200 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: References: Message-ID: Noam Raphael schrieb: > As I see it, the only reason for the current status is the desire to > make "x in dict" equivalent to "dict.has_key(x)", since has_key is a > common operation and "x in" is shorter. But actually "dict.has_key(x)" > explains exactly what's intended, while "x in dict" isn't really clear > (for newbies, that is): do you ask whether x is in dict.keys(), or in > dict.values(), or in dict.items()? Even if it's true that a loop over items is more common than a loop over keys, "x in keys" is much more common than "x in items". In every language there are things that must be learned and remembered. That dict.__iter__ yields keys is one of them. (You could present similar arguments that speak in favor of dict.__iter__ yielding values...) Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. 
Eight shalt thou not indent, nor either indent thou two, excepting that thou
then proceed to four. Tabs are right out.

From greg.ewing at canterbury.ac.nz Tue Sep 4 11:30:14 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 04 Sep 2007 21:30:14 +1200
Subject: [Python-3000] Default dict iterator should have been iteritems()
In-Reply-To:
References:
Message-ID: <46DD25A6.6070504@canterbury.ac.nz>

Noam Raphael wrote:
> The default dict iterator should in principle be iteritems(), and not
> iterkeys().

This was discussed at length back when "in" support was
added to dicts. There were reasons for choosing to do it
the way it's done, and I don't think it's likely to be
changed.

--
Greg

From theller at ctypes.org Tue Sep 4 11:34:46 2007
From: theller at ctypes.org (Thomas Heller)
Date: Tue, 04 Sep 2007 11:34:46 +0200
Subject: [Python-3000] Confused about getattr() and special methods
Message-ID:

I was looking into the Lib\test\test_uuid on Windows, which fails with this traceback:

test test_uuid failed -- Traceback (most recent call last):
  File "C:\buildbot\work\3.0.heller-windows\build\lib\test\test_uuid.py", line 323, in test_ipconfig_getnode
    node = uuid._ipconfig_getnode()
  File "C:\buildbot\work\3.0.heller-windows\build\lib\uuid.py", line 376, in _ipconfig_getnode
    for line in pipe:
TypeError: '_wrap_close' object is not iterable

The test can be fixed with this little patch:

Index: Lib/os.py
===================================================================
--- Lib/os.py (revision 57827)
+++ Lib/os.py (working copy)
@@ -664,6 +664,8 @@
         return self._proc.wait() << 8 # Shift left to match old behavior
     def __getattr__(self, name):
         return getattr(self._stream, name)
+    def __iter__(self):
+        return iter(self._stream)

 # Supply os.fdopen() (used by subprocess!)
 def fdopen(fd, mode="r", buffering=-1):

However, looking further into this I'm getting confused. Shouldn't the
__getattr__ implementation find the __iter__ method of the _stream
instance variable?
Consider this code:

##__metaclass__ = type

class X:
    def __str__(self):
        return "foo"
    def __len__(self):
        return 42
    def __iter__(self):
        return iter([1, 2, 3])

class proxy:
    def __init__(self):
        self.x = X()
    def __getattr__(self, name):
        return getattr(self.x, name)

p = proxy()

print(len(p))
print(str(p))
print(iter(p))

In Python2.5 and trunk, all the calls len(p), str(p), and iter(p) return the
attributes of the X class instance. Uncommenting the '__metaclass__ = type'
line makes the code fail.

IIUC, in py3k, classic classes do not exist any longer, so the __metaclass__ line
has no effect anyway. Is this behaviour intended?

Thomas

From thomas at python.org Tue Sep 4 12:00:16 2007
From: thomas at python.org (Thomas Wouters)
Date: Tue, 4 Sep 2007 12:00:16 +0200
Subject: [Python-3000] Confused about getattr() and special methods
In-Reply-To:
References:
Message-ID: <9e804ac0709040300l1a79d22bv7811b054a5245380@mail.gmail.com>

On 9/4/07, Thomas Heller wrote:

> Shouldn't the __getattr__ implementation find the __iter__ method
> of the _stream instance variable?

No. For new-style classes, the special methods (that are part of the PyType
C struct) are always looked up on the class, never the instance. The class's
__getattr__ is never called. It's the class that defines behaviour, and
__getattr__ and __getattribute__ just define how to handle *instance*
attribute access. This change is really the biggest difference between
classic classes and new-style classes, much bigger than the MRO change ;-)

--
Thomas Wouters

Hi! I'm a .signature virus! copy me into your .signature file to help me
spread!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070904/df911e34/attachment.htm From g.brandl at gmx.net Tue Sep 4 12:09:12 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 04 Sep 2007 12:09:12 +0200 Subject: [Python-3000] __special__ method lookup [was Re: Confused about getattr() and special methods] In-Reply-To: References: Message-ID: Thomas Heller schrieb: > IIUC, in py3k, classic classes do not exist any longer, so the __metaclass__ line > has no effect anyway. Is this behaviour intended? It is another incarnation of special methods being looked up on the class, not the instance. This was always the behavior with new-style classes, see the thread at http://mail.python.org/pipermail/python-3000/2007-March/006261.html for a previous discussion. I think we should tackle this issue now and make sure the decided resolution is consistently applied throughout Python. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From ncoghlan at gmail.com Tue Sep 4 12:27:49 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 04 Sep 2007 20:27:49 +1000 Subject: [Python-3000] __special__ method lookup [was Re: Confused about getattr() and special methods] In-Reply-To: References: Message-ID: <46DD3325.4060601@gmail.com> Georg Brandl wrote: > Thomas Heller schrieb: > >> IIUC, in py3k, classic classes do not exist any longer, so the __metaclass__ line >> has no effect anyway. Is this behaviour intended? > > It is another incarnation of special methods being looked up on the class, > not the instance. This was always the behavior with new-style classes, see > the thread at > > http://mail.python.org/pipermail/python-3000/2007-March/006261.html > > for a previous discussion. 
> > I think we should tackle this issue now and make sure the decided resolution > is consistently applied throughout Python. This issue came up when implementing PEP 343 as well - because the with statement is just syntactic sugar without any dedicated opcodes, __enter__/__exit__ are accessed via a conventional attribute lookup opcode. So unlike the special methods that use a C-level slot in the type object, these two operations *can* be affected by instance attributes and __getattr__. However, Guido did say at the time that he was OK with the effect of instance attributes on special method lookups being formally undefined and implementation dependent. I wasn't too worried either way - mucking with special methods outside the scope of 'provide this on your class to support operation X' has long been a pretty dubious exercise. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From noamraph at gmail.com Tue Sep 4 13:16:07 2007 From: noamraph at gmail.com (Noam Raphael) Date: Tue, 4 Sep 2007 14:16:07 +0300 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: <46DD25A6.6070504@canterbury.ac.nz> References: <46DD25A6.6070504@canterbury.ac.nz> Message-ID: On 9/4/07, Greg Ewing wrote: > Noam Raphael wrote: > > The default dict iterator should in principle be iteritems(), and not > > iterkeys(). > > This was discussed at length back when "in" support was > added to dicts. There were reasons for choosing to do it > the way it's done, and I don't think it's likely to be > changed. > Just out of curiousity - do you remember these reasons? I just have the feeling that back then, iterations were less common, since you couldn't iterate over dicts without creating new lists, and you didn't have list comprehensions and generators. 
You couldn't write an expression such as dict((x, y) for y, x in d) to quickly get the inverse permutation, so the relative ugliness of dict((x, y) for y, x in d.items()) was not considered. I don't think that it's likely to be changed too. Noam From g.brandl at gmx.net Tue Sep 4 13:24:20 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 04 Sep 2007 13:24:20 +0200 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: References: <46DD25A6.6070504@canterbury.ac.nz> Message-ID: Noam Raphael schrieb: > On 9/4/07, Greg Ewing wrote: >> Noam Raphael wrote: >> > The default dict iterator should in principle be iteritems(), and not >> > iterkeys(). >> >> This was discussed at length back when "in" support was >> added to dicts. There were reasons for choosing to do it >> the way it's done, and I don't think it's likely to be >> changed. >> > Just out of curiousity - do you remember these reasons? I just have > the feeling that back then, iterations were less common, since you > couldn't iterate over dicts without creating new lists, and you didn't > have list comprehensions and generators. You couldn't write an > expression such as > dict((x, y) for y, x in d) > to quickly get the inverse permutation, so the relative ugliness of > dict((x, y) for y, x in d.items()) > was not considered. Well, what about dict((x, d[x]) for x in d) ? Doesn't strike me as ugly... Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. 
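
The spellings compared in this exchange can be lined up side by side. The following is a minimal sketch, assuming Python 3 (where d.items() returns a view rather than a list); the dictionary d and the variable names are illustrative values, not taken from the thread.

```python
# Illustrative data only; any dict with hashable, unique values works.
d = {"a": 1, "b": 2, "c": 3}

# Georg's spelling: iterate over the keys and look each value up again.
copy_by_lookup = dict((x, d[x]) for x in d)

# Iterating over the items avoids the second lookup per key.
copy_by_items = dict((k, v) for k, v in d.items())

# Noam's "inverse permutation": swap key and value while iterating items.
inverse = dict((v, k) for k, v in d.items())
```

Both copies compare equal to d; which spelling reads better is the matter of taste being debated above.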
From nick.bastin at gmail.com Tue Sep 4 14:34:43 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Tue, 4 Sep 2007 08:34:43 -0400 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: References: <46DD25A6.6070504@canterbury.ac.nz> Message-ID: <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> On 9/4/07, Georg Brandl wrote: > Noam Raphael schrieb: > > Just out of curiousity - do you remember these reasons? I just have > > the feeling that back then, iterations were less common, since you > > couldn't iterate over dicts without creating new lists, and you didn't > > have list comprehensions and generators. You couldn't write an > > expression such as > > dict((x, y) for y, x in d) > > to quickly get the inverse permutation, so the relative ugliness of > > dict((x, y) for y, x in d.items()) > > was not considered. > > Well, what about dict((x, d[x]) for x in d) ? Doesn't strike me as ugly... It doesn't strike me as ugly, it just strikes me as slow. In C++, a std::map::iterator will give you std::pair, and I've often wanted such a construction in Python. Right now to get a similar thing, you pay something like O(n log n) (assuming d[x] is O(log n)) instead of O(n). Not to mention that we know that d[x] is pretty expensive these days on common lookups, since we're not dropping into the fast lookdict_string anymore. -- Nick From guido at python.org Tue Sep 4 16:23:43 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 07:23:43 -0700 Subject: [Python-3000] What about operator.*slice? In-Reply-To: <9e804ac0709040156x74a36892p1090d0d113f043f9@mail.gmail.com> References: <9e804ac0709040156x74a36892p1090d0d113f043f9@mail.gmail.com> Message-ID: Since x[a:b] is not basic syntax (like it once was) but simply the combination of operator.getitem and slice() I don't see the point of keeping operator.getitem. PS. 
I don't know how useful the operator module really is -- in all those years it's existed I haven't really used it myself, and I'm always baffled when I see code using it. --Guido On 9/4/07, Thomas Wouters wrote: > > > On 9/4/07, Georg Brandl wrote: > > Are they useful enough to keep? > > operator.*slice? They're rather convenient when you don't want to bother > with creating a slice object yourself, but I'm not worried either way. > > -- > Thomas Wouters > > Hi! I'm a .signature virus! copy me into your .signature file to help me > spread! > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: > http://mail.python.org/mailman/options/python-3000/guido%40python.org > > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 4 16:31:53 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 07:31:53 -0700 Subject: [Python-3000] __special__ method lookup [was Re: Confused about getattr() and special methods] In-Reply-To: <46DD3325.4060601@gmail.com> References: <46DD3325.4060601@gmail.com> Message-ID: I only care about getting this right when there is a reasonable chance that a class is being used as an object. For example, at the sprint we ran into this with the __format__ special method, when someone discovered that format(object, "") raised a weird error rather than returning str(object), which was due to the default __format__ method defined on the object class. It's important that you can format *anything*, so we fixed this right away. OTOH for the with-statement, the object passed to it is always specially constructed to work in this context, and passing something random like a type object just isn't a reasonable use case. As long as you get *some* kind of error (and you do, usually complaining about the arg count) I'm okay. 
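
The lookup rule discussed in this thread can be illustrated with a small sketch, assuming Python 3 (where every class is new-style). The class names Wrapped, Proxy and IterableProxy are hypothetical and only mirror the shape of the proxy example posted earlier; this is not the actual os.py code.

```python
class Wrapped:
    def __len__(self):
        return 42

    def __iter__(self):
        return iter([1, 2, 3])


class Proxy:
    """Delegates ordinary attribute access only (hypothetical example)."""

    def __init__(self):
        self._target = Wrapped()

    def __getattr__(self, name):
        # Called only when normal instance attribute lookup fails.
        return getattr(self._target, name)


class IterableProxy(Proxy):
    # Defining the special methods on the class makes len() and iter() work.
    def __len__(self):
        return len(self._target)

    def __iter__(self):
        return iter(self._target)


p = Proxy()
# Explicit attribute access goes through __getattr__ and is delegated.
explicit_len = p.__len__()

try:
    len(p)  # implicit lookup uses type(p), which defines no __len__
    implicit_len_failed = False
except TypeError:
    implicit_len_failed = True

q = IterableProxy()
```

The subclass takes the same approach as the __iter__ patch proposed for _wrap_close: the special method is spelled out on the class instead of being left to __getattr__.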
--Guido On 9/4/07, Nick Coghlan wrote: > Georg Brandl wrote: > > Thomas Heller schrieb: > > > >> IIUC, in py3k, classic classes do not exist any longer, so the __metaclass__ line > >> has no effect anyway. Is this behaviour intended? > > > > It is another incarnation of special methods being looked up on the class, > > not the instance. This was always the behavior with new-style classes, see > > the thread at > > > > http://mail.python.org/pipermail/python-3000/2007-March/006261.html > > > > for a previous discussion. > > > > I think we should tackle this issue now and make sure the decided resolution > > is consistently applied throughout Python. > > This issue came up when implementing PEP 343 as well - because the with > statement is just syntactic sugar without any dedicated opcodes, > __enter__/__exit__ are accessed via a conventional attribute lookup > opcode. So unlike the special methods that use a C-level slot in the > type object, these two operations *can* be affected by instance > attributes and __getattr__. > > However, Guido did say at the time that he was OK with the effect of > instance attributes on special method lookups being formally undefined > and implementation dependent. I wasn't too worried either way - mucking > with special methods outside the scope of 'provide this on your class to > support operation X' has long been a pretty dubious exercise. > > Cheers, > Nick. 
> > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > --------------------------------------------------------------- > http://www.boredomandlaziness.org > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 4 16:36:09 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 07:36:09 -0700 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: References: <46DD25A6.6070504@canterbury.ac.nz> Message-ID: On 9/4/07, Noam Raphael wrote: > On 9/4/07, Greg Ewing wrote: > > Noam Raphael wrote: > > > The default dict iterator should in principle be iteritems(), and not > > > iterkeys(). > > > > This was discussed at length back when "in" support was > > added to dicts. There were reasons for choosing to do it > > the way it's done, and I don't think it's likely to be > > changed. > > > Just out of curiousity - do you remember these reasons? Consistency with "k in d", where you'll agree with me that the only useful interpretation is checking for a key. It would be annoying if "for x in obj:" no longer rhymed with "if x in obj:". > I just have > the feeling that back then, iterations were less common, since you > couldn't iterate over dicts without creating new lists, and you didn't > have list comprehensions and generators. You couldn't write an > expression such as > dict((x, y) for y, x in d) > to quickly get the inverse permutation, so the relative ugliness of > dict((x, y) for y, x in d.items()) > was not considered. > > I don't think that it's likely to be changed too. I think it's even in PEP 3099 as something we *won't* change. I happen to be rather fond of it myself. 
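
The consistency argument can be checked directly. This is a minimal sketch, assuming Python 3; the dictionary and the variable names are illustrative only.

```python
d = {"a": 1, "b": 2}

# Iteration and containment both work on keys, so they always agree.
for k in d:
    assert k in d

keys_seen = list(d)

# A value is not "in" the dict, and neither is a (key, value) pair...
value_in_dict = 1 in d
pair_in_dict = ("a", 1) in d

# ...but the items view offers pair-based iteration and containment together.
pair_in_items = ("a", 1) in d.items()
pairs_seen = list(d.items())
```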
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Tue Sep 4 17:01:07 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 04 Sep 2007 17:01:07 +0200 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> References: <46DD25A6.6070504@canterbury.ac.nz> <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> Message-ID: <46DD7333.9060006@v.loewis.de> > (assuming d[x] is O(log n)) In Python, d[x] is typically considered to be O(1) (unlike in C++, where it is O(log n)). Of course, with Python using a hashtable, performance may decrease in the presence of collisions. In the normal case, dict((x, d[x]) for x in d) will be O(n) in Python. Regards, Martin From ncoghlan at gmail.com Tue Sep 4 17:09:12 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 05 Sep 2007 01:09:12 +1000 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: References: <46DD25A6.6070504@canterbury.ac.nz> Message-ID: <46DD7518.7070108@gmail.com> Guido van Rossum wrote: > On 9/4/07, Noam Raphael wrote: >> On 9/4/07, Greg Ewing wrote: >>> Noam Raphael wrote: >>>> The default dict iterator should in principle be iteritems(), and not >>>> iterkeys(). >>> This was discussed at length back when "in" support was >>> added to dicts. There were reasons for choosing to do it >>> the way it's done, and I don't think it's likely to be >>> changed. >>> >> Just out of curiousity - do you remember these reasons? > > Consistency with "k in d", where you'll agree with me that the only > useful interpretation is checking for a key. It would be annoying if > "for x in obj:" no longer rhymed with "if x in obj:". 
I would certainly be rather annoyed if the following code could blow up
with an assertion error in the absence of any threading foolishness:

    for k in d:
        assert k in d

Containment and iteration really do need to be kept consistent and
having the value matter when checking for dictionary containment would
be outright bizarre. Put the two together and it makes sense for
dictionary iteration and containment tests to both be based on keys.

Note that the other basic container types in the standard library
(lists, tuples, sets, strings, xrange) also obey the
iteration<->containment invariant above.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From nick.bastin at gmail.com Tue Sep 4 17:39:00 2007
From: nick.bastin at gmail.com (Nicholas Bastin)
Date: Tue, 4 Sep 2007 11:39:00 -0400
Subject: [Python-3000] Default dict iterator should have been iteritems()
In-Reply-To: <46DD7333.9060006@v.loewis.de>
References: <46DD25A6.6070504@canterbury.ac.nz> <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> <46DD7333.9060006@v.loewis.de>
Message-ID: <66d0a6e10709040839u465530bcw38ba21b4886bc4a4@mail.gmail.com>

On 9/4/07, "Martin v. Löwis" wrote:
> > (assuming d[x] is O(log n))
>
> In Python, d[x] is typically considered to be O(1) (unlike in C++,
> where it is O(log n)). Of course, with Python using a hashtable,
> performance may decrease in the presence of collisions. In the
> normal case, dict((x, d[x]) for x in d) will be O(n) in Python.

Even if we suppose that d[x] is O(1) (and I don't have real data to
say whether most uses of it actually conform to this, besides keyword
argument passing), that still makes:

    [(x, d[x]) for x in d]

O(2n), which is O(n), but only pedantically. In the real world, 2n is
still worse than n (and the hashtable means that it can devolve into
O(n**2) in the worst case).
However, all that said, you'd probably never write the above line of code,
and d.iteritems() will continue to suffice if there are concerns about
'for (k,v) in d' being materially different than 'if x in d'.

--
Nick

From g.brandl at gmx.net Tue Sep 4 17:45:38 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Tue, 04 Sep 2007 17:45:38 +0200
Subject: [Python-3000] abc docs
Message-ID:

I've added a basic skeleton of documentation for the "abc" module, but
it would be nice if somebody proofread it and added more from PEP 3119
if desired.

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

From guido at python.org Tue Sep 4 18:17:36 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 4 Sep 2007 09:17:36 -0700
Subject: [Python-3000] Default dict iterator should have been iteritems()
In-Reply-To: <66d0a6e10709040839u465530bcw38ba21b4886bc4a4@mail.gmail.com>
References: <46DD25A6.6070504@canterbury.ac.nz> <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> <46DD7333.9060006@v.loewis.de> <66d0a6e10709040839u465530bcw38ba21b4886bc4a4@mail.gmail.com>
Message-ID:

On 9/4/07, Nicholas Bastin wrote:
> On 9/4/07, "Martin v. Löwis" wrote:
> > > (assuming d[x] is O(log n))
> >
> > In Python, d[x] is typically considered to be O(1) (unlike in C++,
> > where it is O(log n)). Of course, with Python using a hashtable,
> > performance may decrease in the presence of collisions. In the
> > normal case, dict((x, d[x]) for x in d) will be O(n) in Python.
>
> Even if we suppose that d[x] is O(1) (and I don't have real data to
> say whether most uses of it actually conform to this, besides keyword
> argument passing), that still makes:
>
> [(x, d[x]) for x in d]
>
> O(2n), which is O(n), but only pedantically.
In the real world, 2n is > still worse than n (and the hashtable means that it can devolve into > O(n**2) in the worst case). You shouldn't be using words whose meaning you don't understand. > However, all that said, you'd probably > never write the above line of code, and d.iteritems() will continue to > suffice if there are concerns about 'for (k,v) in d' being materially > different than 'if x in d'. Since this is the python-3000 list, d.items() is what you're looking for. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 4 18:23:47 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 09:23:47 -0700 Subject: [Python-3000] [mark@qtrac.eu: Poss. clarification for What's New in Python 3] In-Reply-To: <20070903165347.GA24392@mac.local> References: <20070903165347.GA24392@mac.local> Message-ID: Thanks, Mark! Fixed by changing "B\n" into "B". :-) On 9/3/07, A.M. Kuchling wrote: > Forwarded: a comment on the 3.0 What's New. > > --amk > > > ---------- Forwarded message ---------- > From: Mark Summerfield > To: comments at amk.ca > Date: Sat, 1 Sep 2007 08:55:42 +0100 > Subject: Poss. clarification for What's New in Python 3 > Hi, > > In the What's New in Python 3 document you say > > For example, in Python 2.x, print "A\n", "B\n" would write "A\nB\n"; > but in Python 3.0, print("A\n", "B\n") writes "A\n B\n". > > > I would be tempted to change this to: > > For example, in Python 2.x, print "A\n", "B\n" would write "A\nB\n\n"; > but in Python 3.0, print("A\n", "B\n") writes "A\n B\n\n". > Python 3's print() has keyword arguments to control what's > output between items and what is output at the end, for example, > print("A\n", "B\n", sep="", end="") writes "A\nB\n". 
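
The separator and terminator behaviour described in this note can be reproduced with a short sketch, assuming Python 3. The helper name captured is hypothetical; it only redirects print() into a string buffer so that the exact output can be inspected.

```python
import io


def captured(*args, **kwargs):
    """Return exactly what print() writes for the given arguments."""
    buf = io.StringIO()
    print(*args, file=buf, **kwargs)
    return buf.getvalue()


# Default: items are joined by a space and a newline is appended.
default_output = captured("A\n", "B\n")

# sep and end control what goes between items and after the last one.
compact_output = captured("A\n", "B\n", sep="", end="")
```

Here default_output is "A\n B\n\n" and compact_output is "A\nB\n", which matches the wording suggested in the forwarded comment.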
>
> --
> Mark Summerfield, Qtrac Ltd., www.qtrac.eu
>
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>
>

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From nick.bastin at gmail.com Tue Sep 4 18:52:45 2007
From: nick.bastin at gmail.com (Nicholas Bastin)
Date: Tue, 4 Sep 2007 12:52:45 -0400
Subject: [Python-3000] Default dict iterator should have been iteritems()
In-Reply-To:
References: <46DD25A6.6070504@canterbury.ac.nz> <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> <46DD7333.9060006@v.loewis.de> <66d0a6e10709040839u465530bcw38ba21b4886bc4a4@mail.gmail.com>
Message-ID: <66d0a6e10709040952p472b1bb4q3dcd46b1ac5127ff@mail.gmail.com>

On 9/4/07, Guido van Rossum wrote:
> On 9/4/07, Nicholas Bastin wrote:
> > However, all that said, you'd probably
> > never write the above line of code, and d.iteritems() will continue to
> > suffice if there are concerns about 'for (k,v) in d' being materially
> > different than 'if x in d'.
>
> Since this is the python-3000 list, d.items() is what you're looking for.

My mistake, I had referred back to the 3.0 documentation, which still
claims that iteritems is a method.

--
Nick

From greg at krypto.org Tue Sep 4 19:11:14 2007
From: greg at krypto.org (Gregory P. Smith)
Date: Tue, 4 Sep 2007 11:11:14 -0600
Subject: [Python-3000] Default dict iterator should have been iteritems()
In-Reply-To: <46DD7333.9060006@v.loewis.de>
References: <46DD25A6.6070504@canterbury.ac.nz> <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> <46DD7333.9060006@v.loewis.de>
Message-ID: <52dc1c820709041011r64acd37et88cc664350e95e92@mail.gmail.com>

On 9/4/07, "Martin v. Löwis" wrote:
>
> > (assuming d[x] is O(log n))
>
> In Python, d[x] is typically considered to be O(1) (unlike in C++,
> where it is O(log n)).
Of course, with Python using a hashtable, > performance may decrease in the presence of collisions. In the > normal case, dict((x, d[x]) for x in d) will be O(n) in Python. And if the speed of d[x] were ever an issue that shows up on python performance profiles when used in a loop like that it would be pretty easy to optimize the common case internally by having the key iteration retain an optional (weak?) reference in the dict object to the most recently looked up key+value for a short circuit quickly returning its value. I do not expect that to ever to matter as code can just loop using the appropriate iterator instead. -gps -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070904/cbe5ee88/attachment.htm From g.brandl at gmx.net Tue Sep 4 19:15:37 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 04 Sep 2007 19:15:37 +0200 Subject: [Python-3000] dict view operations Message-ID: While looking at documenting the dict view changes, I came across an inconsistency in how the dict views' set-like operations are implemented: with sets/frozensets, the operator versions only work if the other operand is a set/frozenset, while the dict view operators allow any iterable. Do we care? Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From skip at pobox.com Tue Sep 4 19:27:17 2007 From: skip at pobox.com (skip at pobox.com) Date: Tue, 4 Sep 2007 12:27:17 -0500 Subject: [Python-3000] Should all iter(keys|items|values) be renamed? Message-ID: <18141.38261.320718.902982@montanaro.dyndns.org> After Nick's last message I went searching for "iteritems" in the docs. 
I fixed a couple places (not yet checked in), but eventually came across Mailbox.iteritems. Looking at the mailbox.py code, sure enough, it still exists: def iteritems(self): """Return an iterator over (key, message) tuples.""" for key in self.keys(): try: value = self[key] except KeyError: continue yield (key, value) def items(self): """Return a list of (key, message) tuples. Memory intensive.""" return list(self.iteritems()) Should it be renamed items and the second def'n deleted? Same for iterkeys, itervalues where they appear? Skip From fdrake at acm.org Tue Sep 4 19:35:05 2007 From: fdrake at acm.org (Fred Drake) Date: Tue, 4 Sep 2007 13:35:05 -0400 Subject: [Python-3000] Should all iter(keys|items|values) be renamed? In-Reply-To: <18141.38261.320718.902982@montanaro.dyndns.org> References: <18141.38261.320718.902982@montanaro.dyndns.org> Message-ID: <5D1955CD-4228-43D1-BC4D-C79FAA6832E0@acm.org> On Sep 4, 2007, at 1:27 PM, skip at pobox.com wrote: > After Nick's last message I went searching for "iteritems" in the > docs. I > fixed a couple places (not yet checked in), but eventually came across Timing is great! I checked in a bunch of doc changes on this exact topic. Watch for conflicts. My changes were mostly removals and minor updates; I'm sure there's more to be done. -Fred -- Fred Drake From g.brandl at gmx.net Tue Sep 4 19:37:26 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 04 Sep 2007 19:37:26 +0200 Subject: [Python-3000] dict view operations In-Reply-To: References: Message-ID: Georg Brandl schrieb: > While looking at documenting the dict view changes, I came across an > inconsistency in how the dict views' set-like operations are implemented: > with sets/frozensets, the operator versions only work if the other operand > is a set/frozenset, while the dict view operators allow any iterable. > > Do we care? 
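To make the inconsistency concrete, here is a minimal sketch, assuming a current CPython 3 interpreter (the 3.0 alphas under discussion may have behaved differently). The helper name accepts_operand is made up for illustration.

```python
def accepts_operand(op):
    """Return True if calling op() does not raise TypeError."""
    try:
        op()
    except TypeError:
        return False
    return True


d = {1: "a", 2: "b", 3: "c"}
other = [2, 3, 4]  # a plain list, not a set

# The set operator insists on a set/frozenset on the other side ...
set_operator_ok = accepts_operand(lambda: set(d) & other)

# ... while the dict view operator takes any iterable.
view_operator_ok = accepts_operand(lambda: d.keys() & other)
view_result = d.keys() & other

# The named set method is as lenient as the view operator.
method_result = set(d).intersection(other)

print(set_operator_ok, view_operator_ok, view_result, method_result)
```

The named methods on set (intersection, union, and so on) already take arbitrary iterables, so the difference is only in the operator spellings.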
Oh, and another thing: the items views can contain unhashable values, so d.items() & d.items() will fail for such dictionaries since the operands are converted to sets before doing the intersection. I suspect there's nothing that can easily be done about that though... Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From brett at python.org Tue Sep 4 20:08:53 2007 From: brett at python.org (Brett Cannon) Date: Tue, 4 Sep 2007 11:08:53 -0700 Subject: [Python-3000] What about operator.*slice? In-Reply-To: References: <9e804ac0709040156x74a36892p1090d0d113f043f9@mail.gmail.com> Message-ID: On 9/4/07, Guido van Rossum wrote: > Since x[a:b] is not basic syntax (like it once was) but simply the > combination of operator.getitem and slice() I don't see the point of > keeping operator.getitem. > > PS. I don't know how useful the operator module really is -- in all > those years it's existed I haven't really used it myself, and I'm > always baffled when I see code using it. > The only great use I have found for it myself is attrgetter and itemgetter, but those were added by Raymond in 2.5 (I think). Otherwise I never use it. -Brett From guido at python.org Tue Sep 4 20:17:34 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 11:17:34 -0700 Subject: [Python-3000] Should all iter(keys|items|values) be renamed? In-Reply-To: <18141.38261.320718.902982@montanaro.dyndns.org> References: <18141.38261.320718.902982@montanaro.dyndns.org> Message-ID: On 9/4/07, skip at pobox.com wrote: > After Nick's last message I went searching for "iteritems" in the docs. I > fixed a couple places (not yet checked in), but eventually came across > Mailbox.iteritems. 
Looking at the mailbox.py code, sure enough, it still > exists: > > def iteritems(self): > """Return an iterator over (key, message) tuples.""" > for key in self.keys(): > try: > value = self[key] > except KeyError: > continue > yield (key, value) > > def items(self): > """Return a list of (key, message) tuples. Memory intensive.""" > return list(self.iteritems()) > > Should it be renamed items and the second def'n deleted? Same for iterkeys, > itervalues where they appear? It is incorrect to replace items() with iteritems() though -- it should be replaced with a "view" like sketched in PEP 3106. I think this will be a fairly large project; ATM we don't even have a reusable implementation of dict views (the version in dictobject.c is explicitly restricted to dict instances). It would be a good idea to review the conformance of every stdlib API that tries to look like a mapping, and make them conform to the new mapping ABCs in PEP 3119. (Ditto for sequences and sets except there are so few of those.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 4 20:22:31 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 11:22:31 -0700 Subject: [Python-3000] dict view operations In-Reply-To: References: Message-ID: On 9/4/07, Georg Brandl wrote: > Georg Brandl schrieb: > > While looking at documenting the dict view changes, I came across an > > inconsistency in how the dict views' set-like operations are implemented: > > with sets/frozensets, the operator versions only work if the other operand > > is a set/frozenset, while the dict view operators allow any iterable. > > > > Do we care? The Set ABCs in PEP 3119 should be followed IMO. But they haven't received a lot of review so we may have to go back and discuss what that PEP should say (and perhaps it isn't giving enough detail). 
However, I don't see it as a violation if some of the types are more lenient in what they accept -- they just shouldn't be more restrictive. > Oh, and another thing: the items views can contain unhashable values, so > > d.items() & d.items() > > will fail for such dictionaries since the operands are converted to sets > before doing the intersection. > > I suspect there's nothing that can easily be done about that though... Indeed, since the result must be a new set (not a view) and the result cannot be represented as a set either (unless it's empty or happens to contain no unhashable values, which would be a rare piece of luck). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Tue Sep 4 20:35:12 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 04 Sep 2007 20:35:12 +0200 Subject: [Python-3000] dict view operations In-Reply-To: References: Message-ID: <46DDA560.8070301@v.loewis.de> > Oh, and another thing: the items views can contain unhashable values That, of course, could be fixed: if the key-value pairs would only hash by key (ignoring the value), they would remain hashable. Regards, Martin From guido at python.org Tue Sep 4 20:41:37 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 11:41:37 -0700 Subject: [Python-3000] dict view operations In-Reply-To: <46DDA560.8070301@v.loewis.de> References: <46DDA560.8070301@v.loewis.de> Message-ID: On 9/4/07, "Martin v. L?wis" wrote: > > Oh, and another thing: the items views can contain unhashable values > > That, of course, could be fixed: if the key-value pairs would only > hash by key (ignoring the value), they would remain hashable. How would that help? The key/value pairs are ordinary tuples, so you still wouldn't be able to look them up in another set, nor would you be able to represent d.items() & d.items() as a regular set or frozenset instance. What use case are you thinking of that this would address? 
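A minimal sketch of the d.items() & d.items() case, assuming a current CPython 3. The exact error text differs between versions, so the example only records the exception type.

```python
d = {2: [], 4: {}}  # both values are unhashable

# Intersecting items views has to build a set of (key, value) tuples,
# and hashing such a tuple hashes the value as well.
try:
    d.items() & d.items()
except TypeError as exc:
    items_error = type(exc).__name__
else:
    items_error = None

# The keys view has no such problem: keys are always hashable.
keys_result = d.keys() & d.keys()

print(items_error, keys_result)
```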
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From nick.bastin at gmail.com Tue Sep 4 20:44:47 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Tue, 4 Sep 2007 14:44:47 -0400 Subject: [Python-3000] dict view operations In-Reply-To: <46DDA560.8070301@v.loewis.de> References: <46DDA560.8070301@v.loewis.de> Message-ID: <66d0a6e10709041144k1402615q17182d820c99cdc9@mail.gmail.com> On 9/4/07, "Martin v. Löwis" wrote: > > Oh, and another thing: the items views can contain unhashable values > > That, of course, could be fixed: if the key-value pairs would only > hash by key (ignoring the value), they would remain hashable. I understand what you mean, but without changing tuples generically, how would you implement this? -- Nick From barry at python.org Tue Sep 4 20:51:49 2007 From: barry at python.org (Barry Warsaw) Date: Tue, 4 Sep 2007 14:51:49 -0400 Subject: [Python-3000] What about operator.*slice? In-Reply-To: References: <9e804ac0709040156x74a36892p1090d0d113f043f9@mail.gmail.com> Message-ID: <43F0FFDA-242C-4810-A534-164092EBA835@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 4, 2007, at 2:08 PM, Brett Cannon wrote: > On 9/4/07, Guido van Rossum wrote: >> Since x[a:b] is not basic syntax (like it once was) but simply the >> combination of operator.getitem and slice() I don't see the point of >> keeping operator.getitem. >> >> PS. I don't know how useful the operator module really is -- in all >> those years it's existed I haven't really used it myself, and I'm >> always baffled when I see code using it. >> > > The only great use I have found for it myself is attrgetter and > itemgetter, but those were added by Raymond in 2.5 (I think). > Otherwise I never use it. Same here, although very occasionally I use one or two others. I still think attrgetter could be made more useful by dereferencing dot-paths.
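A rough sketch of what dereferencing dot-paths could look like. dotted_getter is a hypothetical helper written for illustration, and the msg object is made-up sample data. As far as I know, operator.attrgetter in later releases accepts dotted names directly, so the last lines compare the two on a current CPython 3.

```python
from functools import reduce
from operator import attrgetter
from types import SimpleNamespace


def dotted_getter(path):
    """Return a callable that follows a dotted attribute path."""
    names = path.split(".")

    def getter(obj):
        # getattr(getattr(obj, "a"), "b") ... for each name in turn
        return reduce(getattr, names, obj)

    return getter


# Made-up nested object, purely for the example.
msg = SimpleNamespace(sender=SimpleNamespace(name="Barry", host="python.org"))

sketch_result = dotted_getter("sender.name")(msg)
builtin_result = attrgetter("sender.name")(msg)

print(sketch_result, builtin_result)
```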
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRt2pRXEjvBPtnXfVAQK59gQAn7KTJHk3R3JTLErEfljDKZ7B2H0WEZD3 ljpnDc7Kn5GNAfWdNueJNigMKGctKhK3ZEO9Gw8TNxTJonhOCjLhSPZPrCMlM3tV CeEieXw8VBFMPA0biDEtq3Ic6x/6yuX3xXmVPQTOOY1kAScfFmeb1bi17xPkhdsl 36FrPEsePig= =rR0Y -----END PGP SIGNATURE----- From hto at arcor.de Tue Sep 4 18:49:48 2007 From: hto at arcor.de (Thomas Hunger) Date: Tue, 4 Sep 2007 18:49:48 +0200 Subject: [Python-3000] Performance Notes In-Reply-To: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com> References: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com> Message-ID: <200709041849.48534.hto@arcor.de> > I've been doing some profiling of 3.0 vs. 2.6 release builds on > Windows XP for the purpose of hopefully closing the performance > gap. This data is very preliminary, but I thought I'd throw it out > here in case someone else also wanted to look into this. Also, > possibly useful for comparing against profiling data on other > platforms. The table below just lists functions and speed > differentials in 3.0 vs. 2.6, ordered by the functions in which we > spend the most total time. Hello, I don't know much about Python internals, so the following might be bogus: I replaced unicode_hash and string_hash with the hash function from here: http://www.azillionmonkeys.com/qed/hash.html. Then I ran the following micro-benchmark: $ time ./python bench.py where bench.py is: f = dict((line, nr) for nr, line in enumerate(open('/usr/share/dict/words', encoding='latin1').readlines())) Python3k original hash: real 0m2.210s new hash: real 0m1.842s So maybe this is an interesting hash function?
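For anyone wanting to repeat this kind of measurement without /usr/share/dict/words, here is a rough sketch using timeit on synthetic keys. The key format, the word count and the repeat counts are arbitrary choices for illustration, and the timings will not be comparable to the figures above.

```python
import timeit


def make_words(n):
    """Generate n distinct newline-terminated strings, like readlines() would."""
    return ["word%06d\n" % i for i in range(n)]


def build_index(lines):
    """Same dict construction as in bench.py, minus the file I/O."""
    return dict((line, nr) for nr, line in enumerate(lines))


words = make_words(50000)
index = build_index(words)

# Take the best of several runs to reduce noise from the rest of the system.
best = min(timeit.repeat(lambda: build_index(words), number=5, repeat=3))
print("best of 3: %.4fs for 5 builds" % best)
```

Timing only the dict construction keeps file reading and interpreter start-up out of the number, which "time ./python bench.py" includes.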
Tom From martin at v.loewis.de Tue Sep 4 20:55:14 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 04 Sep 2007 20:55:14 +0200 Subject: [Python-3000] dict view operations In-Reply-To: References: <46DDA560.8070301@v.loewis.de> Message-ID: <46DDAA12.5030707@v.loewis.de> Guido van Rossum wrote: > On 9/4/07, "Martin v. Löwis" wrote: >>> Oh, and another thing: the items views can contain unhashable values >> That, of course, could be fixed: if the key-value pairs would only >> hash by key (ignoring the value), they would remain hashable. > > How would that help? The key/value pairs are ordinary tuples They would have to stop being that: class Association(tuple): def __hash__(self): return hash(self[0]) > What use case are you thinking of that this would address? It would allow treating the items view as a proper set (which it still is). Regards, Martin From guido at python.org Tue Sep 4 21:14:14 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 12:14:14 -0700 Subject: [Python-3000] dict view operations In-Reply-To: <46DDAA12.5030707@v.loewis.de> References: <46DDA560.8070301@v.loewis.de> <46DDAA12.5030707@v.loewis.de> Message-ID: On 9/4/07, "Martin v. Löwis" wrote: > Guido van Rossum wrote: > > On 9/4/07, "Martin v. Löwis" wrote: > >>> Oh, and another thing: the items views can contain unhashable values > >> That, of course, could be fixed: if the key-value pairs would only > >> hash by key (ignoring the value), they would remain hashable. > > > > How would that help? The key/value pairs are ordinary tuples > > They would have to stop being that: > > class Association(tuple): > def __hash__(self): > return hash(self[0]) > > > What use case are you thinking of that this would address? > > It would allow treating the items view as a proper set (which > it still is). > > Can you give some examples?
I can too easily think of examples that fail with this approach: d = {1: 1, 2: 2} iv = set(d.items()) (1, 1) in iv The latter expression would be False, (while it currently is True), since (1,1) has a different hash value than Association((1, 1)). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From tjreedy at udel.edu Tue Sep 4 21:18:59 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 4 Sep 2007 15:18:59 -0400 Subject: [Python-3000] Default dict iterator should have been iteritems() References: Message-ID: "Noam Raphael" wrote in message news:b348a0850709040149i6d9d7183ped5d393d492d3824 at mail.gmail.com... | The reasoning is simple: Iteration over an object usually gets all the | data it contains. A dict can be seen as an unordered collection of | tuples (key, value), indexed by key. So, iteration over a dict should | yield those tuples. Given that viewpoint, yes. But a dict can also be seen as a set of objects that happen to have a value attached (like a graph with labelled nodes, which is still 'made up of' nodes rather than (node,label) pairs). From this viewpoint, yielding the objects is sensible. By itself, I think the decision was a toss-up. But consistency with 'in', which is not a toss-up, tips the balance. tjr From martin at v.loewis.de Tue Sep 4 21:22:48 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 04 Sep 2007 21:22:48 +0200 Subject: [Python-3000] dict view operations In-Reply-To: References: <46DDA560.8070301@v.loewis.de> <46DDAA12.5030707@v.loewis.de> Message-ID: <46DDB088.7010606@v.loewis.de> >>> What use case are you thinking of that this would address? >> It would allow to treat the items view as a proper set (which >> it still is). > > Can you give some examples? You mean, actual applications where people would want to perform set operations on .items()? No - I was just trying to give a solution to the theoretical problem that Georg brought up. 
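To make the trade-off concrete, here is a small sketch of the Association idea on a current CPython 3. It only models a set of such pairs built by hand; it is not how dict views are implemented.

```python
class Association(tuple):
    """A (key, value) pair that hashes by its key only."""

    def __hash__(self):
        return hash(self[0])


# Hashing by key makes pairs with unhashable values usable in a set.
d = {1: [], 2: {}}
pairs = {Association(item) for item in d.items()}
common = pairs & pairs

# The price: a plain tuple that compares equal hashes differently,
# so a set lookup with it does not find the Association.
iv = {Association(item) for item in {1: 1, 2: 2}.items()}
equal_to_tuple = Association((1, 1)) == (1, 1)
plain_found = (1, 1) in iv
wrapped_found = Association((1, 1)) in iv

print(len(common), equal_to_tuple, plain_found, wrapped_found)
```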
> I can too easily think of examples that > fail with this approach: > > d = {1: 1, 2: 2} > iv = set(d.items()) > (1, 1) in iv > > The latter expression would be False, (while it currently is True), > since (1,1) has a different hash value than Association((1, 1)). Right. Since the elements in the view/set would not be plain two-tuples, this would have to be spelled as Association((1,1)) in iv Of course, it violates the principle that things that compare equal should also hash equal; to restore that principle, one would have to make associations not compare equal to two-tuples (and then not make them a subtype anymore, either). Regards, Martin From eduardo.padoan at gmail.com Tue Sep 4 21:37:44 2007 From: eduardo.padoan at gmail.com (Eduardo O. Padoan) Date: Tue, 4 Sep 2007 16:37:44 -0300 Subject: [Python-3000] dict view operations In-Reply-To: References: Message-ID: On 9/4/07, Georg Brandl wrote: > Georg Brandl schrieb: > Oh, and another thing: the items views can contain unhashable values, so > > d.items() & d.items() > > will fail for such dictionaries since the operands are converted to sets > before doing the intersection. > > I suspect there's nothing that can easily be done about that though... Py3k-ish: >>> d = {2: [], 4: {}} >>> d.items() & d.items() ... TypeError: list objects are unhashable Must behave like Python 2.x-ish: >>> d = {2: [], 4: {}} >>> set(d.items()) & set(d.items()) ... TypeError: list objects are unhashable .. right? If so, IIUC, there is nothing to be done about that... > Georg -- http://www.advogato.org/person/eopadoan/ Bookmarks: http://del.icio.us/edcrypt From eric+python-dev at trueblade.com Tue Sep 4 22:05:53 2007 From: eric+python-dev at trueblade.com (Eric Smith) Date: Tue, 04 Sep 2007 16:05:53 -0400 Subject: [Python-3000] str.format vs. 
string.Formatter exceptions In-Reply-To: References: <46DC230F.2040409@trueblade.com> Message-ID: <46DDBAA1.7090406@trueblade.com> Guido van Rossum wrote: > Since IndexError and KeyError are conceptually like ValueError but in > a more narrowly defined context, I think IndexError and KeyError > actually make sense here (even though they don't inherit from > ValueError). > > --Guido Okay, I'll change these to IndexError and KeyError. Eric. > > On 9/3/07, Eric Smith wrote: >> Ron Adam points out some differences in which exceptions are thrown by >> str.format and string.Formatter. For example, on a missing positional >> argument: >> >> >>> "{0}".format() >> Traceback (most recent call last): >> File "", line 1, in >> ValueError: Not enough positional arguments in format string >> >> >>> Formatter().format("{0}") >> Traceback (most recent call last): >> File "", line 1, in >> File "/shared/src/python/py3k/Lib/string.py", line 201, in format >> return self.vformat(format_string, args, kwargs) >> File "/shared/src/python/py3k/Lib/string.py", line 220, in vformat >> obj, arg_used = self.get_field(field_name, args, kwargs) >> File "/shared/src/python/py3k/Lib/string.py", line 278, in get_field >> obj = self.get_value(first, args, kwargs) >> File "/shared/src/python/py3k/Lib/string.py", line 235, in get_value >> return args[key] >> IndexError: tuple index out of range >> >> The PEP says: In general, exceptions generated by the formatter code >> itself are of the "ValueError" variety -- there is an error in the >> actual "value" of the format string. >> >> I can easily change string.Formatter to make this a ValueError, and I >> think that's probably the right thing to do. For example, if the string >> comes from a translation module, then there might be an extra parameter >> added by mistake, in which case ValueError seems right to me. >> >> But I'd like to hear if anyone else thinks this should be an IndexError, >> or maybe they both should be some other exception. 
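A small sketch for comparing the two code paths side by side, assuming a current CPython 3. It reports only the exception type, since the messages differ between versions; exc_name is a helper made up for the example.

```python
from string import Formatter


def exc_name(func):
    """Call func() and return the name of the exception it raises, if any."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


results = {
    "str.format, missing positional": exc_name(lambda: "{0}".format()),
    "Formatter, missing positional": exc_name(lambda: Formatter().format("{0}")),
    "str.format, missing keyword": exc_name(lambda: "{x}".format()),
    "Formatter, missing keyword": exc_name(lambda: Formatter().format("{x}")),
}

for case, name in results.items():
    print("%-32s %s" % (case, name))
```

On the interpreters I am aware of, both spellings now agree: IndexError for a missing positional argument and KeyError for a missing keyword.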
>> >> Similarly "{x}".format()' currently raises ValueError, but >> 'Formatter().format("{x}")' raises KeyError. >> _______________________________________________ >> Python-3000 mailing list >> Python-3000 at python.org >> http://mail.python.org/mailman/listinfo/python-3000 >> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org >> > > From greg.ewing at canterbury.ac.nz Tue Sep 4 22:44:45 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 05 Sep 2007 08:44:45 +1200 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: References: <46DD25A6.6070504@canterbury.ac.nz> Message-ID: <46DDC3BD.8090505@canterbury.ac.nz> Noam Raphael wrote: > Just out of curiousity - do you remember these reasons? I don't remember the discussion in detail, but a couple of reasons that come to mind: * It would be confusing to have "x in d" and "for x in d" meaning subtly different things. * It's more efficient to iterate over just the keys, because a tuple has to be created for each item when iterating over (key, value) pairs. It's reasonable that if you want more done, you should have to write more to get it. -- Greg From greg.ewing at canterbury.ac.nz Tue Sep 4 22:52:20 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 05 Sep 2007 08:52:20 +1200 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> References: <46DD25A6.6070504@canterbury.ac.nz> <66d0a6e10709040534j616eda22va40647ca622ae989@mail.gmail.com> Message-ID: <46DDC584.6010905@canterbury.ac.nz> Nicholas Bastin wrote: > On 9/4/07, Georg Brandl wrote: > > > Well, what about dict((x, d[x]) for x in d) ? Doesn't strike me as ugly... > > It doesn't strike me as ugly, it just strikes me as slow. Are people forgetting that in 3.0 dict(d.items()) will do the same thing very efficiently? 
Of course, if you know you have a dict, d.copy() is even more efficient. -- Greg From greg.ewing at canterbury.ac.nz Tue Sep 4 23:01:03 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 05 Sep 2007 09:01:03 +1200 Subject: [Python-3000] What about operator.*slice? In-Reply-To: References: <9e804ac0709040156x74a36892p1090d0d113f043f9@mail.gmail.com> Message-ID: <46DDC78F.2090208@canterbury.ac.nz> Guido van Rossum wrote: > PS. I don't know how useful the operator module really is I think its main use is as a source of functions for passing to map(). Unless I'm mistaken, that's still going to be faster than a listcomp when a built-in function is used, isn't it? -- Greg From facundobatista at gmail.com Tue Sep 4 23:46:43 2007 From: facundobatista at gmail.com (Facundo Batista) Date: Tue, 4 Sep 2007 18:46:43 -0300 Subject: [Python-3000] What about operator.*slice? In-Reply-To: <46DDC78F.2090208@canterbury.ac.nz> References: <9e804ac0709040156x74a36892p1090d0d113f043f9@mail.gmail.com> <46DDC78F.2090208@canterbury.ac.nz> Message-ID: 2007/9/4, Greg Ewing : > I think its main use is as a source of functions for passing > to map(). Unless I'm mistaken, that's still going to be faster Or to sort: >>> import operator >>> l = [(1, 3), (2, 2)] >>> sorted(l, key=operator.itemgetter(1)) [(2, 2), (1, 3)] >>> Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ From lars at ibp.de Tue Sep 4 23:54:53 2007 From: lars at ibp.de (Lars Immisch) Date: Tue, 04 Sep 2007 23:54:53 +0200 Subject: [Python-3000] audio device support Message-ID: <46DDD42D.8090608@ibp.de> Hi, I recently worked on Python audio device support for Linux and OS X. Not so recently, I wrote a DirectSound module for win32. Python 2 has support for various audio devices, but they have no common interface and some are broken or obsolete. Python 3000 might be a chance to improve on this. 
The situation seems to be: Linux: ossaudiodev is becoming obsolete on Linux (because OSS is being replaced by ALSA). pyalsaaudio, http://sourceforge.net/projects/pyalsaaudio, is broken for multithreaded programs: it does not wrap blocking calls with Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS. A suitable, submitted patch has not been included by the maintainer in nearly two years. With this or a similar patch, it works fine, however. Windows: win32all has DirectSound support, but it's low-level and complicated. Other audio device wrappers may exist, but I don't know about them. OS X: The (undocumented) audiodev implementation does not work for me. There is a pyrex implementation for coreaudio support which I haven't tested, but I have written coreaudio wrappers in C (to be published). What I'd like to see: I like the idea of having audio device support for the major operating systems in the standard library. But I am even more interested in a common interface for simple operations. IMO, the API should support: - stereo playback - stereo recording - different sampling rates and formats (alaw, mulaw and PCM in signed integers in various widths and maybe PCM in floats/doubles). - device selection - volume control Overall, I think the level of abstraction in the OSS or ALSA APIs is about right; coreaudio on OS X and DirectSound on Windows are overkill outside of niche applications. I would volunteer sample implementations for Windows, OS X and Linux (ALSA). - Lars From brett at python.org Wed Sep 5 02:54:44 2007 From: brett at python.org (Brett Cannon) Date: Tue, 4 Sep 2007 17:54:44 -0700 Subject: [Python-3000] Questions about PEP 3121 Message-ID: I am prepping for a presentation on Python 3.0 that I am giving tonight and I had some questions about PEP 3121 that the example raises. First is whether the name of the function that returns the module-specific memory is PyModule_GetData() or PyModule_GetState()?
The former is listed by the PEP but the latter is used by the example. Second is how are the exception and type to be added to the module? Currently one uses PyModule_AddObject() to insert an object into the global namespace of a module. But the example leaves that out and I wanted to make sure there was not some magical new step left out (initializing Xxo_Type is also left out, but that does not directly deal with module initialization). Lastly, what is tp_reload to be used for? The PEP doesn't say but the PyModuleDef lists it. I assume it is to be called when a module is reloaded, but it is not specified in the PEP. -Brett From janssen at parc.com Wed Sep 5 05:31:25 2007 From: janssen at parc.com (Bill Janssen) Date: Tue, 4 Sep 2007 20:31:25 PDT Subject: [Python-3000] bytes C API in 2.6 for easy transition to 3.0? Message-ID: <07Sep4.203132pdt."57996"@synergy1.parc.xerox.com> According to PEP 358, "bytes" will be in both 2.6 and 3.0. It would be nice if the C API for "bytes" existed in the trunk, so that it could be used for new code that will port more easily to 3.0. Bill From guido at python.org Wed Sep 5 05:41:54 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 4 Sep 2007 20:41:54 -0700 Subject: [Python-3000] bytes C API in 2.6 for easy transition to 3.0? In-Reply-To: <-6760061404575982124@unknownmsgid> References: <-6760061404575982124@unknownmsgid> Message-ID: This is the plan. We're just short on cheap labor to implement it. I wish I could quote an email that you sent long, long ago (in ILU times) about having set up a drummer in the back of the room to entice the 50 coding slaves to more productivity. I believe there was a whip involved too. :-) On 9/4/07, Bill Janssen wrote: > According to PEP 358, "bytes" will be in both 2.6 and 3.0. It would > be nice if the C API for "bytes" existed in the trunk, so that it > could be used for new code that will port more easily to 3.0. 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg at krypto.org Wed Sep 5 07:53:45 2007 From: greg at krypto.org (Gregory P. Smith) Date: Tue, 4 Sep 2007 23:53:45 -0600 Subject: [Python-3000] bytes C API in 2.6 for easy transition to 3.0? In-Reply-To: <1636919686236946180@unknownmsgid> References: <1636919686236946180@unknownmsgid> Message-ID: <52dc1c820709042253j5c69e4e5l1d3f953526c051a4@mail.gmail.com> On 9/4/07, Bill Janssen wrote: > > According to PEP 358, "bytes" will be in both 2.6 and 3.0. It would > be nice if the C API for "bytes" existed in the trunk, so that it > could be used for new code that will port more easily to 3.0. > > Bill I assume this includes the new buffer api since we really seem to want C API users to use that rather than bytes objects directly? From nick.bastin at gmail.com Wed Sep 5 09:17:39 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Wed, 5 Sep 2007 03:17:39 -0400 Subject: [Python-3000] Solaris support in 3.0? Message-ID: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> This is a combination question-and-status-report email. The question would be, what does the "somewhat" tag mean on Solaris support in the release notes for 3.0a1, and does someone have a list of things that don't work, or does that just mean it hasn't been tested? I built 3.0a1 on Sparc Solaris (5.8), and except for those things that didn't build for lack of the required dependencies (_bsddb, _hashlib, _ssl, _tkinter, gdbm, ossaudiodev, readline, _curses, _curses_panel), everything claims to have built fine (with gcc 3.4.6).
Unit tests reveal the following failures: test_cookielib (no _md5) test_fileio test_nis test_pickletools test_pipes test_pty test_str test_unicode test_userstring test_uuid (no _md5) And the following unexpected (according to it) skips: test_hashlib (no _md5) test_hmac (no _md5) test_urllib2_localnet (no _md5) test_urllib2net (no _md5) test_urllib2 (no _md5) test_tcl (no tcl on my system) test_sundry (no _md5) test_ssl (no SSL in my configuration) test_tarfile (no _md5) test_unicodedata (no _md5) If anyone wants more data on any of these particular failures, let me know, otherwise I'm going to start working through the ones that fail in 3.0 that don't fail in 2.6. All of the _md5 failures are because of the lack of SSL, so I'm not sure that the tests should be 'failing' in this configuration. -- Nick From mark at qtrac.eu Wed Sep 5 10:43:04 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Wed, 5 Sep 2007 09:43:04 +0100 Subject: [Python-3000] abc docs In-Reply-To: References: Message-ID: <200709050943.04784.mark@qtrac.eu> On 2007-09-04, Georg Brandl wrote: > I've added a basic skeleton of documentation for the "abc" module, but it > would be nice if somebody proofread it and at add more from PEP 3119 if > desired. One strange point: the module correctly appears on the library/python.html page (Python Runtime Services), but does _not_ appear in library/index.html (The Python Standard Library), although all the other Python Runtime Services modules do. index.rst lists python.rst and python.rst lists abc.rst. I've done various changes to the text, with one semantic change (in the parenthesised phrase about __mro__). Also, I added a table to collections.rst listing collections.Container, collections.Hashable, and similar that you might want to check over. All in revision 57988. 
BTW When I tried a variation of one of the ABC examples from the PEP I got this: Python 3.0a1 (py3k, Sep 1 2007, 08:25:11) [GCC 4.1.2 20070626 (Red Hat 4.1.2-13)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import abc >>> class MyABC(abc.ABCMeta): pass ... >>> MyABC.register(tuple) Traceback (most recent call last): File "", line 1, in RuntimeError: maximum recursion depth exceeded in __instancecheck__ So then I tried it exactly as written, and it worked fine: Python 3.0a1 (py3k, Sep 1 2007, 08:25:11) [GCC 4.1.2 20070626 (Red Hat 4.1.2-13)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from abc import ABCMeta >>> class MyABC(metaclass=ABCMeta): pass ... >>> MyABC.register(tuple) >>> assert issubclass(tuple, MyABC) >>> assert isinstance((), MyABC) I hope that the first one is a bug rather than intended. -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From martin at v.loewis.de Wed Sep 5 13:00:32 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 05 Sep 2007 13:00:32 +0200 Subject: [Python-3000] Questions about PEP 3121 In-Reply-To: References: Message-ID: <46DE8C50.6090600@v.loewis.de> > First is whether the name of the function that returns the > module-specific memory is PyModule_GetData() or PyModule_GetState()? > The former is listed by the PEP but the latter is used by the example. I think I like _GetState more, so I have now adjusted the PEP. > Second is how are the exception and type to be added to the module? > Currently one uses PyModule_AddObject() to insert an object into the > global namespace of a module. But the example leaves that out and I > wanted to make sure there was not some magical new step left out > (initializing Xxo_Type is also left out, but that does not directly > deal with module initialization). No, this is just an omission. I'll fix it when I revise the PEP after the implementation. 
> Lastly, what is tp_reload to be used for? The PEP doesn't say but the
> PyModuleDef lists it. I assume it is to be called when a module is
> reloaded, but it is not specified in the PEP.

Yes; I'm not certain whether module reloading continues to be supported in Py3k or not. If not, it should be removed from the PEP, if yes, it should be specified.

A few other issues that you may want to know: I found that enhancing PyModule_New cannot really work, as Py_InitModule does a lot of other things that shouldn't be done in PyModule_New (which is also used to create Python modules). So I keep calling the function Py_InitModule.

I also found that passing two constant arguments to the function is pointless, so I moved the module name into struct PyModuleDef. I also add PyModuleDef_HEAD, similar to types. E.g. for array, the current diff looks like that:

+static PyModuleDef array_mod = {
+    PyModuleDef_HEAD,
+    "array",      /* name */
+    module_doc,   /* doc string */
+    a_methods,    /* methods */
+    0,            /* m_size */
+    NULL,         /* m_reload */
+    NULL,         /* m_traverse */
+    NULL,         /* m_clear */
+    NULL,         /* m_free */
+};
+
 PyMODINIT_FUNC
-initarray(void)
+PyInit_array(void)
 {
     PyObject *m;

     if (PyType_Ready(&Arraytype) < 0)
         return;
     Py_Type(&PyArrayIter_Type) = &PyType_Type;
-    m = Py_InitModule3("array", a_methods, module_doc);
+    m = Py_InitModule(&array_mod);
     if (m == NULL)
-        return;
+        return NULL;

     Py_INCREF((PyObject *)&Arraytype);
     PyModule_AddObject(m, "ArrayType", (PyObject *)&Arraytype);
     Py_INCREF((PyObject *)&Arraytype);
     PyModule_AddObject(m, "array", (PyObject *)&Arraytype);
     /* No need to check the error here, the caller will do that */
+    return m;
 }

This doesn't include putting the type into interpreter state, and I won't be able to fix all cases of global variables (also, some global variables are out of scope of the PEP, including most types, so some global variables will remain after I'm done).
Notice that I also kept the convention that the caller will check for errors, so you can return a module object even though an exception occurred. Making all these functions exception-safe is fairly tedious, and I'm not attempting that for the moment. Regards, Martin From martin at v.loewis.de Wed Sep 5 13:19:12 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 05 Sep 2007 13:19:12 +0200 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> Message-ID: <46DE90B0.4050905@v.loewis.de> > This is a combination question-and-status-report email. The question > would be, what does the "somewhat" tag mean on Solaris support in the > release notes for 3.0a1, and does someone have a list of things that > don't work, or does that just mean it hasn't been tested? Not sure what "somewhat" means, but you can take a look at the build failures in the Solaris buildbot - this is what is "officially" known not to work. As always with Solaris, there are several dimensions to be considered: - version (2.5,2.6,7,8,9,10,11); not sure what the oldest Solaris version is that we still want to support. - compiler: gcc vs. SunPRO/Forte - 32 vs. 64 bits - SPARC vs. x86 (not all combinations exist, but plenty) > If anyone wants more data on any of these particular failures, let me > know, otherwise I'm going to start working through the ones that fail > in 3.0 that don't fail in 2.6. All of the _md5 failures are because > of the lack of SSL, so I'm not sure that the tests should be 'failing' > in this configuration. I think that's a serious issue to consider. As so much code now depends on OpenSSL, setup.py should try harder to find it. E.g. on the build slave, it can be found in /usr/sfw - not sure whether that is normal on a Solaris 10 installation, and not sure whether there is a Sun-provided OpenSSL on Solaris 8. 
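As a rough illustration of what "trying harder" could look like, here is a minimal sketch that probes a list of installation prefixes for the OpenSSL headers. The function name find_openssl_prefix and the prefix list are assumptions made for illustration; of the listed paths only /usr/sfw comes from the message above, and this is not the logic setup.py actually uses.

```python
import os

# Hypothetical search order; only /usr/sfw is taken from the discussion,
# the other prefixes are assumed for the sake of the example.
CANDIDATE_PREFIXES = ["/usr", "/usr/local", "/usr/local/ssl", "/usr/sfw"]


def find_openssl_prefix(prefixes):
    """Return the first prefix that contains include/openssl/ssl.h, or None."""
    for prefix in prefixes:
        header = os.path.join(prefix, "include", "openssl", "ssl.h")
        if os.path.isfile(header):
            return prefix
    return None
```

A build script could call find_openssl_prefix(CANDIDATE_PREFIXES) and skip the OpenSSL-dependent extensions when it returns None.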
Notice that the tests don't 'fail', they are skipped. There are also failing test cases, something that is more worrisome than a skipped test case. Regards, Martin From mark at qtrac.eu Wed Sep 5 13:40:43 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Wed, 5 Sep 2007 12:40:43 +0100 Subject: [Python-3000] abc docs Message-ID: <200709051240.43292.mark@qtrac.eu> I may not be the first to mistakenly write class Foo(ABCMeta): when I meant to write class Foo(metaclass=ABCMeta): but I'm sure I won't be the last. Sorry for the mistake... Maybe attempting to register an ABCMeta subclass might lead to a more informative warning though? ---------- Forwarded Message ---------- Subject: Re: [Python-3000] abc docs Date: 2007-09-05 From: Mark Summerfield To: python-3000 at python.org [snip] BTW When I tried a variation of one of the ABC examples from the PEP I got this: Python 3.0a1 (py3k, Sep 1 2007, 08:25:11) [GCC 4.1.2 20070626 (Red Hat 4.1.2-13)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import abc >>> class MyABC(abc.ABCMeta): pass ... >>> MyABC.register(tuple) Traceback (most recent call last): File "", line 1, in RuntimeError: maximum recursion depth exceeded in __instancecheck__ [snip] I hope that the first one is a bug rather than intended. -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu ------------------------------------------------------- -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From guido at python.org Wed Sep 5 17:02:14 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 5 Sep 2007 08:02:14 -0700 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <46DE90B0.4050905@v.loewis.de> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> Message-ID: On 9/5/07, "Martin v. Löwis" wrote: > > This is a combination question-and-status-report email.
The question > > would be, what does the "somewhat" tag mean on Solaris support in the > > release notes for 3.0a1, and does someone have a list of things that > > don't work, or does that just mean it hasn't been tested? > > Not sure what "somewhat" means, but you can take a look at the build > failures in the Solaris buildbot - this is what is "officially" known > not to work. The "somewhat" was my word -- I meant that when I last looked at the Solaris buildbot, I saw a few failures; and also that I don't have access to Sun hardware. And also what Martin says below. I'd be happy though to replace "somewhat" with specific indications of h/w and s/w versions if you are willing to commit to supporting these throughout the 3.0 life cycle. > As always with Solaris, there are several dimensions to be considered: > - version (2.5,2.6,7,8,9,10,11); not sure what the oldest Solaris > version is that we still want to support. > - compiler: gcc vs. SunPRO/Forte > - 32 vs. 64 bits > - SPARC vs. x86 > > (not all combinations exist, but plenty) > > > If anyone wants more data on any of these particular failures, let me > > know, otherwise I'm going to start working through the ones that fail > > in 3.0 that don't fail in 2.6. All of the _md5 failures are because > > of the lack of SSL, so I'm not sure that the tests should be 'failing' > > in this configuration. > > I think that's a serious issue to consider. As so much code now depends > on OpenSSL, setup.py should try harder to find it. E.g. on the build > slave, it can be found in /usr/sfw - not sure whether that is normal > on a Solaris 10 installation, and not sure whether there is a > Sun-provided OpenSSL on Solaris 8. > > Notice that the tests don't 'fail', they are skipped. There are also > failing test cases, something that is more worrisome than a skipped > test case. Yes, this is a serious issue -- we are totally dependent on openssl for computing MD5 checksums. 
Several modules use MD5 checksums casually, and it's not good that these fail when openssl isn't available (or if it's too old, like what happened on an ancient Red Hat 7.3 system I have at home). I'm tempted to put the old RSA-copyrighted md5.c back in as a fallback, even though its license is impopular. Or perhaps we could make a copy of a small fraction of openssl and use that? I think MD5 is the only one that's popular enough to warrant this treatment; I think SHA1 is a distant second. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Wed Sep 5 17:04:13 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 5 Sep 2007 08:04:13 -0700 Subject: [Python-3000] Questions about PEP 3121 In-Reply-To: <46DE8C50.6090600@v.loewis.de> References: <46DE8C50.6090600@v.loewis.de> Message-ID: On 9/5/07, "Martin v. Löwis" wrote: > Yes; I'm not certain whether module reloading continues to be supported > in Py3k or not. If not, it should be removed from the PEP, if yes, it > should be specified. I'm already missing the reload() builtin, so I think it should be kept around in some form. I expect some form of reload functionality will remain available, perhaps somewhere in the imp module. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Wed Sep 5 17:08:02 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 5 Sep 2007 08:08:02 -0700 Subject: [Python-3000] bytes C API in 2.6 for easy transition to 3.0? In-Reply-To: <52dc1c820709042253j5c69e4e5l1d3f953526c051a4@mail.gmail.com> References: <1636919686236946180@unknownmsgid> <52dc1c820709042253j5c69e4e5l1d3f953526c051a4@mail.gmail.com> Message-ID: On 9/4/07, Gregory P. Smith wrote: > > On 9/4/07, Bill Janssen wrote: > > According to PEP 358, "bytes" will be in both 2.6 and 3.0. It would > > be nice if the C API for "bytes" existed in the trunk, so that it > > could be used for new code that will port more easily to 3.0 .
> > I assume this includes the new buffer api since we really seem to want C API > users to use that rather than bytes objects directly? Well, in a pinch the old buffer API would work (the 3.0 bytes object used that until recently :-) but Travis told me he is planning to backport PEP 3118 to 2.6, so eventually that will happen, yes. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg at krypto.org Wed Sep 5 17:36:38 2007 From: greg at krypto.org (Gregory P. Smith) Date: Wed, 5 Sep 2007 09:36:38 -0600 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> Message-ID: <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> > Yes, this is a serious issue -- we are totally dependent on openssl > for computing MD5 checksums. Several modules use MD5 checksums > casually, and it's not good that these fail when openssl isn't > available (or if it's too old, like what happened on an ancient Red > Hat 7.3 system I have at home). I'm tempted to put the old > RSA-copyrighted md5.c back in as a fallback, even though its license > is impopular. Or perhaps we could make a copy of a small fraction of > openssl and use that? I think MD5 is the only one that's popular > enough to warrant this treatment; I think SHA1 is a distant second. Every OS I use has openssl installed so i figured someone else had made the same decision and removed the non-openssl variants. Are there really non-linux/bsd/osx installations out there where anyone intends to build and install python that do -not- have openssl installed somewhere? That'd be sad but in that case we shouldn't abandon them. Modifying setup.py to find it installed in a different place should be easy if thats all it takes. 
Rather than resurrecting the old RSA-copyright md5.c I can easily make new ones out of the libtomcrypt md5 and sha1 sources the same way i created the non-openssl sha256 and sha512 modules. We should not limit ourselves to only md5 if we do that, lets guarantee that md5, sha1 - sha512 are available on all future python installs; its not difficult. I'll do the work if we need it. -gps From nick.bastin at gmail.com Wed Sep 5 17:51:30 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Wed, 5 Sep 2007 11:51:30 -0400 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <46DE90B0.4050905@v.loewis.de> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> Message-ID: <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> On 9/5/07, "Martin v. Löwis" wrote: > I think that's a serious issue to consider. As so much code now depends > on OpenSSL, setup.py should try harder to find it. E.g. on the build > slave, it can be found in /usr/sfw - not sure whether that is normal > on a Solaris 10 installation, and not sure whether there is a > Sun-provided OpenSSL on Solaris 8. There is not. I can put OpenSSL in my environment, but I do not usually build with it as I can't build with it on many non-US installations. If we really just need OpenSSL for hashing most of the time, we should probably try to implement that somewhere else. The 2.5 "What's new" documentation said that hashlib used OpenSSL when available, but it appears to be requiring OpenSSL? > Notice that the tests don't 'fail', they are skipped. There are also > failing test cases, something that is more worrisome than a skipped > test case. The tests that I marked as "fail" in my email are marked as "fail" by the unittest framework.
It is 'wrong' in some of these cases, because it should have skipped the tests, but it didn't. I also think that unittest shouldn't think that SSL-related skips are unexpected if I don't have SSL, but that's a bone to pick for another day. -- Nick From martin at v.loewis.de Wed Sep 5 17:54:39 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 05 Sep 2007 17:54:39 +0200 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> Message-ID: <46DED13F.6080705@v.loewis.de> > Every OS I use has openssl installed so i figured someone else had made > the same decision and removed the non-openssl variants. Are there > really non-linux/bsd/osx installations out there where anyone intends to > build and install python that do -not- have openssl installed > somewhere? Most certainly. Commercial Unix vendors have been very hesitant to include open source software in any form, as they are worried about having to maintain it without having control over it. Sun started recently, but I'm not sure whether you could get a Sun-packaged OpenSSL with Solaris 8 (say). I would expect it's worse for AIX and HP-UX, although IBM's recent open-source strategy may have made life easier for AIX users. > We should not limit ourselves to only md5 if we do that, lets guarantee > that md5, sha1 - sha512 are available on all future python installs; its > not difficult. I'll do the work if we need it. Ok - start with the buildbots. It's easy to see whether it works; if it doesn't, you can probably get accounts on the machines to see whether OpenSSL is included, or some guideline from people familiar with the systems. 
Regards, Martin From martin at v.loewis.de Wed Sep 5 18:09:36 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 05 Sep 2007 18:09:36 +0200 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> Message-ID: <46DED4C0.20406@v.loewis.de> > There is not. I can put OpenSSL in my environment What do you mean by "I can put"? You compile it yourself? Why not use the Sun-provided one? > The > 2.5 "What's new" documentation said that hashlib used OpenSSL when > available, but it appears to be requiring OpenSSL? That's for 2.5. In 3.0 (currently), hashlib requires OpenSSL. > The tests that I marked as "fail" in my email are marked as "fail" by > the unittest framework. Ah, ok. Regards, Martin From guido at python.org Wed Sep 5 18:21:36 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 5 Sep 2007 09:21:36 -0700 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> Message-ID: On 9/5/07, Gregory P. Smith wrote: [Guido] > > Yes, this is a serious issue -- we are totally dependent on openssl > > for computing MD5 checksums. Several modules use MD5 checksums > > casually, and it's not good that these fail when openssl isn't > > available (or if it's too old, like what happened on an ancient Red > > Hat 7.3 system I have at home). I'm tempted to put the old > > RSA-copyrighted md5.c back in as a fallback, even though its license > > is impopular. Or perhaps we could make a copy of a small fraction of > > openssl and use that?
I think MD5 is the only one that's popular > > enough to warrant this treatment; I think SHA1 is a distant second. > > Every OS I use has openssl installed so i figured someone else had made the > same decision and removed the non-openssl variants. Are there really > non-linux/bsd/osx installations out there where anyone intends to build and > install python that do -not- have openssl installed somewhere? That'd be > sad but in that case we shouldn't abandon them. Modifying setup.py to find > it installed in a different place should be easy if thats all it takes. > > Rather than resurrecting the old RSA-copyright md5.c I can easily make new > ones out of the libtomcrypt md5 and sha1 sources the same way i created the > non-openssl sha256 and sha512 modules. > > We should not limit ourselves to only md5 if we do that, lets guarantee that > md5, sha1 - sha512 are available on all future python installs; its not > difficult. I'll do the work if we need it. I'd appreciate that -- openssl is a fickle dependency. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From theller at ctypes.org Wed Sep 5 18:51:58 2007 From: theller at ctypes.org (Thomas Heller) Date: Wed, 05 Sep 2007 18:51:58 +0200 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> Message-ID: While we're at solaris, I would appreciate if some solaris expert(s) could take a look at http://bugs.python.org/issue1777530 Thanks, Thomas From nick.bastin at gmail.com Wed Sep 5 19:54:57 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Wed, 5 Sep 2007 13:54:57 -0400 Subject: [Python-3000] Solaris support in 3.0? 
In-Reply-To: References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> Message-ID: <66d0a6e10709051054v974178djb5dd589befaa384@mail.gmail.com> On 9/5/07, Guido van Rossum wrote: > On 9/5/07, "Martin v. Löwis" wrote: > > > This is a combination question-and-status-report email. The question > > > would be, what does the "somewhat" tag mean on Solaris support in the > > > release notes for 3.0a1, and does someone have a list of things that > > > don't work, or does that just mean it hasn't been tested? > > > > Not sure what "somewhat" means, but you can take a look at the build > > failures in the Solaris buildbot - this is what is "officially" known > > not to work. > > The "somewhat" was my word -- I meant that when I last looked at the > Solaris buildbot, I saw a few failures; and also that I don't have > access to Sun hardware. And also what Martin says below. I'd be happy > though to replace "somewhat" with specific indications of h/w and s/w > versions if you are willing to commit to supporting these throughout > the 3.0 life cycle. I have access to Solaris 8 and 9 on Sparc, and Solaris 10 on x86. My Solaris 10 x86 installation is currently in a VM, and it's unpleasant to work with (performance is terrible for some reason), but I can at least make a passing attempt to build and run unit tests in that environment. I have to have Python on Sparc for my application, so I'm going to continue to work on Python 3.0 on Solaris 8/9 for Sparc throughout the entire cycle to make sure that we have a usable product there. > > As always with Solaris, there are several dimensions to be considered: > > - version (2.5,2.6,7,8,9,10,11); not sure what the oldest Solaris > > version is that we still want to support. > > - compiler: gcc vs. SunPRO/Forte > > - 32 vs. 64 bits > > - SPARC vs. x86 I will at least build and test the following configurations.
I will also attempt to fix any platform specific bugs, but I suspect the Unicode failures are going to create some interesting discussions around here. :-) Solaris 8, 32-bit, 64-bit, Sparc, gcc and SunPro 11 Solaris 9, 32-bit, 64-bit, Sparc, gcc and SunPro 11 I will try to get to: Solaris 10, 32-bit, x86, gcc Because there's no reason not to since I have an x86 machine and VMWare. :-) > > > If anyone wants more data on any of these particular failures, let me > > > know, otherwise I'm going to start working through the ones that fail > > > in 3.0 that don't fail in 2.6. All of the _md5 failures are because > > > of the lack of SSL, so I'm not sure that the tests should be 'failing' > > > in this configuration. > > > > I think that's a serious issue to consider. As so much code now depends > > on OpenSSL, setup.py should try harder to find it. E.g. on the build > > slave, it can be found in /usr/sfw - not sure whether that is normal > > on a Solaris 10 installation, and not sure whether there is a > > Sun-provided OpenSSL on Solaris 8. > > > > Notice that the tests don't 'fail', they are skipped. There are also > > failing test cases, something that is more worrisome than a skipped > > test case. > > Yes, this is a serious issue -- we are totally dependent on openssl > for computing MD5 checksums. Several modules use MD5 checksums > casually, and it's not good that these fail when openssl isn't > available (or if it's too old, like what happened on an ancient Red > Hat 7.3 system I have at home). I'm tempted to put the old > RSA-copyrighted md5.c back in as a fallback, even though its license > is impopular. Or perhaps we could make a copy of a small fraction of > openssl and use that? I think MD5 is the only one that's popular > enough to warrant this treatment; I think SHA1 is a distant second. 
MD5 is defined in RFC 1321, there's no reason to have to use any particular code with a bad license - there's plenty of LGPL MD5 implementations out there (although you could probably argue that if they'd ever looked at 1321, which they almost certainly did, then they've been tainted by the RSA code). Also, the NIST SHA-1/256/384/512 code is freely available, there's also no reason to rely on OpenSSL for it (although it looks like the PKI reference implementation links that I can find are dead, so we might have to hunt a little bit). In either case, we could probably copy the relevant pieces out of OpenSSL. -- Nick From greg at krypto.org Wed Sep 5 21:12:37 2007 From: greg at krypto.org (Gregory P. Smith) Date: Wed, 5 Sep 2007 13:12:37 -0600 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <66d0a6e10709051054v974178djb5dd589befaa384@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709051054v974178djb5dd589befaa384@mail.gmail.com> Message-ID: <52dc1c820709051212r4a22a917k47dd6e69c15b591a@mail.gmail.com> > > Also, the NIST SHA-1/256/384/512 code is freely available, there's > also no reason to rely on OpenSSL for it (although it looks like the > PKI reference implementation links that I can find are dead, so we > might have to hunt a little bit). > > In either case, we could probably copy the relevant pieces out of OpenSSL. No. OpenSSL hashlib support was added for a good reason. Its implementations are *much* faster as it includes platform optimized versions of all hash algorithms that are continually being updated tweaked and tuned. OpenSSL itself also doesn't lend itself to cut and paste very well. libtomcrypt is the ideal completely unencumbered basic C implementation of all hash and crypto algorithms and is easy to cut from. We already use it for sha256/512 when needed, i'll do it for the non-openssl md5 and sha1 modules in the next week or so. 
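The arrangement described here (use the OpenSSL-backed implementation when it is present, otherwise fall back to a standalone C module) might be sketched roughly as follows. The helper first_available is hypothetical and is not taken from the Python source tree; it only shows the "try the preferred backend, then the fallback" pattern.

```python
import importlib


def first_available(module_names):
    """Import and return the first module in module_names that can be imported.

    Raises ImportError if none of the candidates is available.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %r could be imported" % (list(module_names),))
```

A caller would list the fast backend first and the plain fallback second, for example first_available(["fast_md5_backend", "plain_md5_backend"]), where both module names are made up for this example.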
Someone could also implement all these hash algorithms in python. Bad idea. Not what python is good at. :) -gps From brett at python.org Wed Sep 5 21:49:03 2007 From: brett at python.org (Brett Cannon) Date: Wed, 5 Sep 2007 12:49:03 -0700 Subject: [Python-3000] Questions about PEP 3121 In-Reply-To: References: <46DE8C50.6090600@v.loewis.de> Message-ID: On 9/5/07, Guido van Rossum wrote: > On 9/5/07, "Martin v. Löwis" wrote: > > Yes; I'm not certain whether module reloading continues to be supported > > in Py3k or not. If not, it should be removed from the PEP, if yes, it > > should be specified. > > I'm already missing the reload() builtin, so I think it should be kept > around in some form. I expect some form of reload functionality will > remain available, perhaps somewhere in the imp module. +1 on having imp.reload(). -Brett From oliphant.travis at ieee.org Wed Sep 5 22:45:55 2007 From: oliphant.travis at ieee.org (Travis Oliphant) Date: Wed, 05 Sep 2007 15:45:55 -0500 Subject: [Python-3000] bug in py3k buffer object? In-Reply-To: References: Message-ID: Lisandro Dalcin wrote: > Dear Travis, in my MPI wrappers, I use MPI_Alloc_mem function to get > 'special' MPI memory, and next I return it to Python using > > return PyBuffer_FromReadWriteMemory(ptr, len); > > Well, getting back this rw-buffer in python, I tried to do > > mem = MPI.Alloc_mem(10) > mem[:] = str8('\0') * 8 # sort of memzero > > but then I get this error: > > Traceback (most recent call last): > File "", line 1, in > TypeError: buffer is read-only > > > I noticed you use PyBuff_SIMPLE in > buffer_ass_item/buffer_ass_subscript... Is this OK? perhaps > PyBuf_WRITEABLE is the right flag? No much more time to go deeper. > yes, I see the problem.
The problem is with get_buf not setting view->readonly when the buffer object has a NULL base (i.e. its own memory). I'll fix this and check it in as soon as I get on a machine with check-in possibilities. -Travis From nick.bastin at gmail.com Wed Sep 5 22:44:22 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Wed, 5 Sep 2007 16:44:22 -0400 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <52dc1c820709051212r4a22a917k47dd6e69c15b591a@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709051054v974178djb5dd589befaa384@mail.gmail.com> <52dc1c820709051212r4a22a917k47dd6e69c15b591a@mail.gmail.com> Message-ID: <66d0a6e10709051344u261bc56bsd1e1369a0cadd5c0@mail.gmail.com> On 9/5/07, Gregory P. Smith wrote: > No. OpenSSL hashlib support was added for a good reason. Its > implementations are *much* faster as it includes platform optimized versions > of all hash algorithms that are continually being updated tweaked and tuned. > OpenSSL itself also doesn't lend itself to cut and paste very well. > libtomcrypt is the ideal completely unencumbered basic C implementation of > all hash and crypto algorithms and is easy to cut from. We already use it > for sha256/512 when needed, i'll do it for the non-openssl md5 and sha1 > modules in the next week or so. I don't care where you get them from.. :-) I would pull them from NIST myself for the SHA code, and just take the md5 code from the RFC (because I would argue that anyone who has implemented their own md5 algorithm is tainted by the RFC code anyhow), and play by the copyright notice. My interest would be in just maintaining the capability, and if you want it optimized, there's no reason for us to maintain that ourselves outside of the OpenSSL code base.
-- Nick From brett at python.org Thu Sep 6 01:38:06 2007 From: brett at python.org (Brett Cannon) Date: Wed, 5 Sep 2007 16:38:06 -0700 Subject: [Python-3000] Google spreadsheet to collaborate on backporting Py3K stuff to 2.6 Message-ID: Neal, Anthony, Thomas W., and I have a spreadsheet that was started to keep track of what needs to be done in 2.6 for Py3K transitioning: http://spreadsheets.google.com/pub?key=pCKY4oaXnT81FrGo3ShGHGg . I am opening the spreadsheet up to everyone so that others can help maintain it. There is a sheet in the Python 3000 Tasks spreadsheet that should be merged into this spreadsheet and then deleted. If anyone wants to help with that it would be great (once something has been moved from "Python 3000 Tasks" to "Python 2 -> 3 transition" just delete it from "Python 3000 Tasks"). Because Neal created this spreadsheet he is the only one who can open editing to everyone. If you would like to have edit abilities to the spreadsheet just reply to this email saying you want an invite and I will add you manually (and if you want a different address added just say so). -Brett From guido at python.org Thu Sep 6 07:06:06 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 5 Sep 2007 22:06:06 -0700 Subject: [Python-3000] test__locale failing on Red Hat 7.3 system for et_EE locale Message-ID: test__locale (that's two underscores, testing _locale.c) fails on my Red Hat 7.3 box. Further investigation shows that it's because the et_EE locale (Estonia(n)) defines the thousands separator as '\xa0' (no-break space U+00A0). Both localeconv() and nl_langinfo() use PyUnicode_FromString() which assumes UTF-8, and hence the decoding fails. On my OSX box, the thousands separator in the et_EE locale is a regular space. On a Red Hat 9 box I have access to at work it is '\xa0' as well (tested with Python2.4; I assume Python 3.0 would fail there too). On my Ubuntu box that locale is unsupported.
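The decoding failure is easy to reproduce in isolation. A minimal sketch, assuming the locale hands back the single byte 0xA0 as described above (the variable names are only for illustration):

```python
raw = b"\xa0"  # the thousands separator byte reported for et_EE

# 0xA0 on its own is a UTF-8 continuation byte, so strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to the code point of the same value.
as_latin1 = raw.decode("latin-1")

print(utf8_ok)           # False
print(ascii(as_latin1))  # '\xa0'
```

This only shows why the bytes cannot be treated as UTF-8; it says nothing about which encoding the C library actually intends for a given locale.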
I can "fix" it on that particular box by using latin-1 instead, but that sounds wrong. There's an XXX comment in the code for nl_langinfo() about possibly converting to wcs (wide character set?). Any ideas? Removing et_EE from the list of interesting locales in test__locale.py seems lame. I did a quick web search and the first few hits are all about an exchange whereby someone from Estonia asked Red Hat to change the locale to use 8859-15 and the Red Hat guy point blank refused, saying it was the Estonians own fault for having submitted incorrect locale info a few years before. (But in 8859-15 \xa0 is the same no-break space character as it is in Latin-1, so this may all be irrelevant.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From noamraph at gmail.com Thu Sep 6 09:15:31 2007 From: noamraph at gmail.com (Noam Raphael) Date: Thu, 6 Sep 2007 10:15:31 +0300 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: <46DD7518.7070108@gmail.com> References: <46DD25A6.6070504@canterbury.ac.nz> <46DD7518.7070108@gmail.com> Message-ID: (Sorry, it turns out that I posted this reply only to Nick and not to the list, so I post it again.) On 9/4/07, Nick Coghlan wrote: > Containment and iteration really do need to be kept consistent and > having the value matter when checking for dictionary containment would > be outright bizarre. Put the two together and it makes sense for > dictionary iteration and containment tests to both be based on keys. > I absolutely agree that containment and iteration should be kept consistent. I suggest (again, ignoring backwards compatibility completely), that "in" would behave according to the iteration, that is, check if the tuple (key, value) is in dict.items(). 
If you prefer code:

    class DreamDict(dict):
        def __iter__(self):
            return self.iteritems()
        def __contains__(self, (key, value)):
            try:
                myvalue = self[key]
            except KeyError:
                return False
            return value == myvalue

Indeed, the suggested "in" operator is not very useful, so you'll usually use has_key. But I actually think that "d.has_key(k)" is clearer than "k in d" - There's no "syntactic" reason why "k in d" should mean "k in d.keys()" and not "k in d.values()".*

Noam

From krstic at solarsail.hcs.harvard.edu Thu Sep 6 09:49:44 2007
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?Q?Ivan_Krsti=C4=87?=)
Date: Thu, 6 Sep 2007 03:49:44 -0400
Subject: [Python-3000] 3.0 crypto (was: Re: Solaris support in 3.0?)
In-Reply-To: <46DED4C0.20406@v.loewis.de>
References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de>
Message-ID: <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu>

On Sep 5, 2007, at 12:09 PM, Martin v. Löwis wrote:
> That's for 2.5. In 3.0 (currently), hashlib requires OpenSSL.

On the wider subject of crypto in Python, is there someone who actively takes care of this area and who could clarify any legal/export restrictions on what gets included with the source distribution?

There's good-quality, suitably licensed crypto code out there implementing most of the major ciphers, hashes, and asymmetric cryptosystems. I'd love it if we included a real set of crypto batteries with 3.0 that didn't depend on outside libraries, and provided more than just a hash or two. Doing the work isn't a problem. Is legalese?

--
Ivan Krstić
| http://radian.org From martin at v.loewis.de Thu Sep 6 10:09:26 2007 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 06 Sep 2007 10:09:26 +0200 Subject: [Python-3000] 3.0 crypto In-Reply-To: <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> Message-ID: <46DFB5B6.1020807@v.loewis.de> > On the wider subject of crypto in Python, is there someone who actively > takes care of this area and who could clarify any legal/export > restrictions on what gets included with the source distribution? The PSF does (more specifically, the PSF board, and even more specifically, Tim Peters). We have registered Python with the U.S. BXA (or whatever the name of this agency is), allowing export of Python from the U.S. to all countries (with a few exceptions, I believe). This is, of course, fairly immaterial, as both the Python source code and the Python releases are located on a server in the Netherlands, so downloading it from www.python.org is not an export from the U.S. There are more issues, of course: some countries restrict the use of cryptography. France is given as an example: you need to register your cryptography keys with the government (SCSSI) before you can use confidentiality-oriented algorithms, IIUC. > There's good-quality, suitably licensed crypto code out there > implementing most of the major ciphers, hashes, and asymmetric > cryptosystems. I'd love it if we included a real set of crypto batteries > with 3.0 that didn't depend on outside libraries, and provided more than > just a hash or two. Doing the work isn't a problem. Is legalese? Why do you say that doing the work is not a problem? I see it as a major problem. 
In addition, other people also see other problems, like size of the distribution, fear of cryptography in general, and so on. Regards, Martin From p.f.moore at gmail.com Thu Sep 6 10:29:22 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Thu, 6 Sep 2007 09:29:22 +0100 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> Message-ID: <79990c6b0709060129s458f6ce4t71e128a4a4f6e2dd@mail.gmail.com> On 05/09/07, Gregory P. Smith wrote: > Rather than resurrecting the old RSA-copyright md5.c I can easily make new > ones out of the libtomcrypt md5 and sha1 sources the same way i created the > non-openssl sha256 and sha512 modules. Which reminds me - when I build Python 3 (on an Ubuntu box) with openssl installed, I get a message about _sha256 and _sha512 not being built. Presumably this is intentional? (It looks a bit odd, and I spent a while trying to work out what dependencies I needed before realising it was probably OK). Paul. From krstic at solarsail.hcs.harvard.edu Thu Sep 6 12:03:45 2007 From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?Q?Ivan_Krsti=C4=87?=) Date: Thu, 6 Sep 2007 06:03:45 -0400 Subject: [Python-3000] 3.0 crypto In-Reply-To: <46DFB5B6.1020807@v.loewis.de> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> <46DFB5B6.1020807@v.loewis.de> Message-ID: <308CC895-A9EB-48F8-A7B7-80DC90A8D55A@solarsail.hcs.harvard.edu> On Sep 6, 2007, at 4:09 AM, Martin v. L?wis wrote: > There are more issues, of course: some countries restrict the use > of cryptography. 
France is given as an example: you need to register > your cryptography keys with the government (SCSSI) before you can > use confidentiality-oriented algorithms, IIUC. This gets at what most interests me -- namely, whether there's a strong legal barrier to including more crypto with Python than just the hashes we have at the moment. It sounds like the answer is 'yes', but what are the details? > Why do you say that doing the work is not a problem? I see it as > a major problem. I'm willing to either do the work myself, or have someone else from the secops team at OLPC do it. > In addition, other people also see other problems, like size of the > distribution, fear of cryptography in general, and so on. The distribution size issue can be mitigated by a reasonable choice of supported primitives. I don't think we need to ship the crypto kitchen sink with Python; we can disqualify known-broken algorithms that many libraries still ship, etc. -- Ivan Krsti? | http://radian.org From thomas at python.org Thu Sep 6 12:13:33 2007 From: thomas at python.org (Thomas Wouters) Date: Thu, 6 Sep 2007 12:13:33 +0200 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: References: <46DD25A6.6070504@canterbury.ac.nz> <46DD7518.7070108@gmail.com> Message-ID: <9e804ac0709060313x6b142672xa84f56cd54a3c5a2@mail.gmail.com> On 9/6/07, Noam Raphael wrote: > > (Sorry, it turns out that I posted this reply only to Nick and not to > the list, so I post it again.) > > On 9/4/07, Nick Coghlan wrote: > > Containment and iteration really do need to be kept consistent and > > having the value matter when checking for dictionary containment would > > be outright bizarre. Put the two together and it makes sense for > > dictionary iteration and containment tests to both be based on keys. > > > I absolutely agree that containment and iteration should be kept > consistent. 
>
> I suggest (again, ignoring backwards compatibility completely), that
> "in" would behave according to the iteration, that is, check if the
> tuple (key, value) is in dict.items(). If you prefer code:
>
> class DreamDict(dict):
>     def __iter__(self):
>         return self.iteritems()
>     def __contains__(self, (key, value)):
>         try:
>             myvalue = self[key]
>         except KeyError:
>             return False
>         return value == myvalue
>
> Indeed, the suggested "in" operator is not very useful, so you'll
> usually use has_key. But I actually think that "d.has_key(k)" is
> clearer than "k in d" - There's no "syntactic" reason why "k in d"
> should mean "k in d.keys()" and not "k in d.values()".*

None of what you're saying is new. It's all been said back when iteration and containment testing were added to the dict type. The choice was explicitly made for the useful containment test, and the conforming iteration behaviour. The iteration is not actually less useful, it's just different. The net result of 'more useful + just as useful' is 'more useful'. I don't believe the actual experience in the three major releases since it was added has convinced anyone that it's a bad idea (in fact, I had slight misgivings back then, but none whatsoever now). The mapping types simply don't act as containers of (key, value) pairs.

--
Thomas Wouters

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

From martin at v.loewis.de Thu Sep 6 12:18:54 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Thu, 06 Sep 2007 12:18:54 +0200
Subject: [Python-3000] 3.0 crypto
In-Reply-To: <308CC895-A9EB-48F8-A7B7-80DC90A8D55A@solarsail.hcs.harvard.edu>
References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> <46DFB5B6.1020807@v.loewis.de> <308CC895-A9EB-48F8-A7B7-80DC90A8D55A@solarsail.hcs.harvard.edu>
Message-ID: <46DFD40E.8010705@v.loewis.de>

> This gets at what most interests me -- namely, whether there's a strong
> legal barrier to including more crypto with Python than just the hashes
> we have at the moment. It sounds like the answer is 'yes', but what are
> the details?

The export permission allows for exporting "mass-market" software; anything you can come up with likely classifies. We need to report precisely what is included (i.e. what files contain the crypto code). So with any release that adds new crypto features, a new report to BXA would formally be necessary.

>> Why do you say that doing the work is not a problem? I see it as
>> a major problem.
>
> I'm willing to either do the work myself, or have someone else from the
> secops team at OLPC do it.

It's not something that a single person can well do. You will also need to design APIs, and that traditionally involves the community. If you create something ad-hoc, I would request that this first gets field-proven for a few years before being included in the standard distribution. Then, it would face competition from existing such solutions.

> The distribution size issue can be mitigated by a reasonable choice of
> supported primitives.
I don't think we need to ship the crypto kitchen > sink with Python; we can disqualify known-broken algorithms that many > libraries still ship, etc. Sounds like a PEP topic. Regards, Martin From ndbecker2 at gmail.com Thu Sep 6 14:33:09 2007 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 06 Sep 2007 08:33:09 -0400 Subject: [Python-3000] pep-0362? Message-ID: http://www.python.org/dev/peps/pep-0362/ This would be helpful for boost::python. Any thoughts on approving this for python-3k? From fdrake at acm.org Thu Sep 6 14:50:34 2007 From: fdrake at acm.org (Fred Drake) Date: Thu, 6 Sep 2007 08:50:34 -0400 Subject: [Python-3000] pep-0362? In-Reply-To: References: Message-ID: <3D0ACD2D-EDBC-40DC-93EB-A3B264567B85@acm.org> On Sep 6, 2007, at 8:33 AM, Neal Becker wrote: > http://www.python.org/dev/peps/pep-0362/ > > This would be helpful for boost::python. Any thoughts on approving > this for > python-3k? The var_args and var_kw_args definitions are a little weird. Why use the empty string instead of None when they aren't used in the signature? Also, the post-history is blank; perhaps this still needs to be presented to the community for review and discussion? Or perhaps the field in the PEP needs to be filled in. -Fred -- Fred Drake From skip at pobox.com Thu Sep 6 15:46:44 2007 From: skip at pobox.com (skip at pobox.com) Date: Thu, 6 Sep 2007 08:46:44 -0500 Subject: [Python-3000] pep-0362? In-Reply-To: References: Message-ID: <18144.1220.785244.174063@montanaro.dyndns.org> Neal> This would be helpful for boost::python. Any thoughts on Neal> approving this for python-3k? I haven't read it, but it seems very similar to the new annotations capability in py3k (pep 3107). Will that not suffice? Skip From skip at pobox.com Thu Sep 6 15:48:01 2007 From: skip at pobox.com (skip at pobox.com) Date: Thu, 6 Sep 2007 08:48:01 -0500 Subject: [Python-3000] pep-0362? 
In-Reply-To: References: Message-ID: <18144.1297.46506.699543@montanaro.dyndns.org> > I haven't read it, but it seems very similar to the new annotations > capability in py3k (pep 3107). Will that not suffice? Which I notice has a "Requires: 362" field. Perhaps you're good to go. ;-) Skip From guido at python.org Thu Sep 6 16:54:04 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 6 Sep 2007 07:54:04 -0700 Subject: [Python-3000] 3.0 crypto (was: Re: Solaris support in 3.0?) In-Reply-To: <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> Message-ID: [Adding Greg P Smith who owns the hashes, and Bill Janssen who has recently taken over our SSL support.] Traditionally this is something for which the core developers haven't had an inclination, so it's been left to 3rd party packages. The position of the US government on crypto export hasn't helped - at some point we felt the need to even ask for permission to include code in the source code that would link to 3rd party crypto libraries, even if we weren't distributing those libraries (e.g. openssl). I think this has calmed down some but I don't know if the requirement to register anything to do with crypto is completely gone; the PSF generally doesn't want to bother with such red tape. I'm not sure what you meant with "doing the work isn't a problem". Are you volunteering? I think we need someone who understands the red tape situation most of all. Hopefully I'm worried for nothing. --Guido On 9/6/07, Ivan Krsti? wrote: > On Sep 5, 2007, at 12:09 PM, Martin v. L?wis wrote: > > That's for 2.5. In 3.0 (currently), hashlib requires OpenSSL. 
>
> On the wider subject of crypto in Python, is there someone who
> actively takes care of this area and who could clarify any legal/
> export restrictions on what gets included with the source distribution?
>
> There's good-quality, suitably licensed crypto code out there
> implementing most of the major ciphers, hashes, and asymmetric
> cryptosystems. I'd love it if we included a real set of crypto
> batteries with 3.0 that didn't depend on outside libraries, and
> provided more than just a hash or two. Doing the work isn't a
> problem. Is legalese?
>
> --
> Ivan Krstić | http://radian.org
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From p.f.moore at gmail.com Thu Sep 6 16:54:58 2007
From: p.f.moore at gmail.com (Paul Moore)
Date: Thu, 6 Sep 2007 15:54:58 +0100
Subject: [Python-3000] pep-0362?
In-Reply-To: <18144.1297.46506.699543@montanaro.dyndns.org>
References: <18144.1297.46506.699543@montanaro.dyndns.org>
Message-ID: <79990c6b0709060754m3405cc23o77d014d2c59908ae@mail.gmail.com>

On 06/09/07, skip at pobox.com wrote:
>
> > I haven't read it, but it seems very similar to the new annotations
> > capability in py3k (pep 3107). Will that not suffice?
>
> Which I notice has a "Requires: 362" field. Perhaps you're good to go. ;-)

Apparently not (yet, at least).

>\Apps\Python30\python.exe
Python 3.0a1 (py3k:57844, Aug 31 2007, 16:54:27) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(): pass
...
>>> f.__signature__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute '__signature__'
>>> signature(f)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'signature' is not defined

Paul.

From brett at python.org Thu Sep 6 19:41:07 2007
From: brett at python.org (Brett Cannon)
Date: Thu, 6 Sep 2007 10:41:07 -0700
Subject: [Python-3000] pep-0362?
In-Reply-To: <3D0ACD2D-EDBC-40DC-93EB-A3B264567B85@acm.org>
References: <3D0ACD2D-EDBC-40DC-93EB-A3B264567B85@acm.org>
Message-ID:

On 9/6/07, Fred Drake wrote:
> On Sep 6, 2007, at 8:33 AM, Neal Becker wrote:
> > http://www.python.org/dev/peps/pep-0362/
> >
> > This would be helpful for boost::python. Any thoughts on approving
> > this for
> > python-3k?
>
> The var_args and var_kw_args definitions are a little weird. Why use
> the empty string instead of None when they aren't used in the signature?
>

I think because when it was designed there were discussions going on about not having different behavior based on types or something.

> Also, the post-history is blank; perhaps this still needs to be
> presented to the community for review and discussion? Or perhaps the
> field in the PEP needs to be filled in.

The open issues were brought up on python-dev but they were never resolved.

-Brett

From brett at python.org Thu Sep 6 19:43:04 2007
From: brett at python.org (Brett Cannon)
Date: Thu, 6 Sep 2007 10:43:04 -0700
Subject: [Python-3000] pep-0362?
In-Reply-To: <18144.1297.46506.699543@montanaro.dyndns.org>
References: <18144.1297.46506.699543@montanaro.dyndns.org>
Message-ID:

On 9/6/07, skip at pobox.com wrote:
>
> > I haven't read it, but it seems very similar to the new annotations
> > capability in py3k (pep 3107). Will that not suffice?
>
> Which I notice has a "Requires: 362" field. Perhaps you're good to go.
;-) I think that is there because an original version of PEP 3107 put all of the annotation information into the Signature object and not directly on to the function. -Brett From brett at python.org Thu Sep 6 19:44:10 2007 From: brett at python.org (Brett Cannon) Date: Thu, 6 Sep 2007 10:44:10 -0700 Subject: [Python-3000] pep-0362? In-Reply-To: <18144.1220.785244.174063@montanaro.dyndns.org> References: <18144.1220.785244.174063@montanaro.dyndns.org> Message-ID: On 9/6/07, skip at pobox.com wrote: > > Neal> This would be helpful for boost::python. Any thoughts on > Neal> approving this for python-3k? > > I haven't read it, but it seems very similar to the new annotations > capability in py3k (pep 3107). Will that not suffice? There are different ideas here. Signature objects are meant to collect all of the various pieces of information about parameters into a single place for easier introspection. Annotations are just a part of what is exposed for introspection. -Brett From qrczak at knm.org.pl Thu Sep 6 20:58:41 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Thu, 06 Sep 2007 20:58:41 +0200 Subject: [Python-3000] Default dict iterator should have been iteritems() In-Reply-To: References: Message-ID: <1189105122.15072.29.camel@qrnik> Dnia 04-09-2007, Wt o godzinie 11:09 +0200, Georg Brandl napisa?(a): > Even if it's true that a loop over items is more common than a loop over keys, > "x in keys" is much more common than "x in items". In my language iterating over dict yields (key,value) pairs, but the equivalent of "x in dict" checks whether a key is present. My Kogut<->Python binding is smart enough to convert these conventions (which needed some work anyway because tuples could not be converted implicitly between the languages). An ugly part of the conversion was distinguishing between Python dictionaries, sequences and sets by the presence of some methods. 
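A rough illustration of what "distinguishing by the presence of some methods" can look like on the Python side. This is a hypothetical sketch, not code from the Kogut binding; the function name classify and the particular methods it probes are assumptions chosen for the example. The last line also shows the Python convention under discussion: iterating a dict yields keys, "in" tests keys, and pairs have to be asked for through items().

```python
def classify(obj):
    """Guess a collection kind from the methods obj exposes (a heuristic only)."""
    # Mappings are indexable and also expose keys().
    if hasattr(obj, "keys") and hasattr(obj, "__getitem__"):
        return "mapping"
    # Sequences are indexable and sized, but have no keys().
    if hasattr(obj, "__getitem__") and hasattr(obj, "__len__"):
        return "sequence"
    # Sets support membership tests and iteration, but not indexing.
    if hasattr(obj, "__contains__") and hasattr(obj, "__iter__"):
        return "set-like"
    return "unknown"


d = {"a": 1}
print(classify(d), classify([1, 2]), classify({1, 2}), classify(42))
print(list(d), "a" in d, ("a", 1) in d.items())
```

A check like this is fragile by nature (a string also counts as a "sequence" here), which is presumably part of why the conversion is called ugly.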
For the curious, bits of the binding are at http://kokogut.cvs.sourceforge.net/kokogut/kokogut/lib/Python/Foreign/Python/Collection.ko?view=markup http://kokogut.cvs.sourceforge.net/kokogut/kokogut/lib/Python/Foreign/Python/KogutObject.ko?view=markup -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From jjb5 at cornell.edu Thu Sep 6 21:40:57 2007 From: jjb5 at cornell.edu (Joel Bender) Date: Thu, 06 Sep 2007 15:40:57 -0400 Subject: [Python-3000] pep-0362? In-Reply-To: References: Message-ID: <46E057C9.6000206@cornell.edu> > http://www.python.org/dev/peps/pep-0362/ > > This would be helpful for boost::python. Speaking of helpful... class X: def f(self): pass class Y(X): pass ...I would like a mechanism to indicate that Y.f is inherited, and I was hoping that perhaps that information could be found in its signature. I see that it's not, would it be another PEP to add it? (It was a bit of an eye opener when I first found out that Y.f.im_class wasn't X.) Joel From brett at python.org Thu Sep 6 22:43:02 2007 From: brett at python.org (Brett Cannon) Date: Thu, 6 Sep 2007 13:43:02 -0700 Subject: [Python-3000] pep-0362? In-Reply-To: <46E057C9.6000206@cornell.edu> References: <46E057C9.6000206@cornell.edu> Message-ID: On 9/6/07, Joel Bender wrote: > > http://www.python.org/dev/peps/pep-0362/ > > > > This would be helpful for boost::python. > > Speaking of helpful... > > class X: > def f(self): pass > > class Y(X): pass > > ...I would like a mechanism to indicate that Y.f is inherited, and I was > hoping that perhaps that information could be found in its signature. I > see that it's not, would it be another PEP to add it? (It was a bit of > an eye opener when I first found out that Y.f.im_class wasn't X.) 
Something like this could go into the 'inspect' module (didn't even worry about __slots__)::

    def find_def(meth):
        for cls in meth.im_class.mro():
            if meth.im_func.__name__ in cls.__dict__:
                return cls
        else:
            return None

For such a simple addition to inspect you just need a patch that has a good implementation, thorough unit tests, and a core developer who thinks it is worthwhile enough to add the functionality.

-Brett

From collinw at gmail.com Thu Sep 6 22:49:58 2007
From: collinw at gmail.com (Collin Winter)
Date: Thu, 6 Sep 2007 13:49:58 -0700
Subject: [Python-3000] pep-0362?
In-Reply-To:
References: <18144.1297.46506.699543@montanaro.dyndns.org>
Message-ID: <43aa6ff70709061349l4d5cb9e4ge2311efe3267e700@mail.gmail.com>

On 9/6/07, Brett Cannon wrote:
> On 9/6/07, skip at pobox.com wrote:
> >
> > > I haven't read it, but it seems very similar to the new annotations
> > > capability in py3k (pep 3107). Will that not suffice?
> >
> > Which I notice has a "Requires: 362" field. Perhaps you're good to go. ;-)
>
> I think that is there because an original version of PEP 3107 put all
> of the annotation information into the Signature object and not
> directly on to the function.

Correct. I'll remove the references to 362 from PEP 3107.

Collin Winter

From guido at python.org Thu Sep 6 23:10:44 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 6 Sep 2007 14:10:44 -0700
Subject: [Python-3000] [Python-Dev] Google spreadsheet to collaborate on backporting Py3K stuff to 2.6
In-Reply-To:
References:
Message-ID:

I've transferred everything from my spreadsheet to Neal's.

On 9/5/07, Brett Cannon wrote:
> Neal, Anthony, Thomas W., and I have a spreadsheet that was started to
> keep track of what needs to be done in 2.6
> for Py3K transitioning:
> http://spreadsheets.google.com/pub?key=pCKY4oaXnT81FrGo3ShGHGg . I am
> opening the spreadsheet up to everyone so that others can help
> maintain it.
> > There is a sheet in the Python 3000 Tasks spreadsheet that should be > merged into this spreadsheet and then deleted. If anyone wants to > help with that it would be great (once something has been moved from > "Python 3000 Tasks" to "Python 2 -> 3 transition" just delete it from > "Python 3000 Tasks"). > > Because Neal created this spreadsheet he is the only one who can open > editing to everyone. If you would like to have edit abilities to the > spreadsheet just reply to this email saying you want an invite and I > will add you manually (and if you want a different address added just > say so). > > -Brett > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From aahz at pythoncraft.com Thu Sep 6 23:49:42 2007 From: aahz at pythoncraft.com (Aahz) Date: Thu, 6 Sep 2007 14:49:42 -0700 Subject: [Python-3000] pep-0362? In-Reply-To: References: Message-ID: <20070906214942.GB439@panix.com> On Thu, Sep 06, 2007, Neal Becker wrote: > > http://www.python.org/dev/peps/pep-0362/ > > This would be helpful for boost::python. Any thoughts on approving this for > python-3k? What would be helpful IMO is using a Subject: line that doesn't require using a browser to find out what the thread is about. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "Many customs in this life persist because they ease friction and promote productivity as a result of universal agreement, and whether they are precisely the optimal choices is much less important." 
--Henry Spencer
http://www.lysator.liu.se/c/ten-commandments.html

From nick.bastin at gmail.com Fri Sep 7 18:29:44 2007
From: nick.bastin at gmail.com (Nicholas Bastin)
Date: Fri, 7 Sep 2007 12:29:44 -0400
Subject: [Python-3000] Performance Notes
In-Reply-To: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com>
References: <66d0a6e10709031154x6ea3d235ya894014ecdf546a2@mail.gmail.com>
Message-ID: <66d0a6e10709070929p6897f69cq940655b2fd46ac0b@mail.gmail.com>

On 9/3/07, Nicholas Bastin wrote:
> NOTE: This data is time sampling, not call graph. Added time could
> come from either more calls, or longer calls.
>
> +312.9% PyDict_GetItem

I've finally managed to get call graph data and it's fairly interesting for this call. I'll try to find some way to post all of the data at some point, but I thought some initial data might be useful.

Calls to PyDict_GetItem in

2.6 (pystone.py 10000):
  160839 - instance_getattr2
   30325 - class_lookup
    5545 - PyString_InternInPlace
    4808 - update_one_slot
    2290 - PyObject_GenericGetAttr
  ...
  Total: 208697

3.0 (pystone.py 10000):
  575093 - PyEval_EvalFrameEx
  416600 - PyObject_GenericGetAttr
  321447 - PyObject_GenericSetAttr
   25394 - update_one_slot
   10142 - lookup_maybe
    8925 - PyUnicode_InternInPlace
  ...
  Total: 1368114

Almost all (522631) of the extra calls in PyEval_EvalFrameEx are because in 2.6 we use the unrolled code in LOAD_GLOBAL, and in 3.0, LOAD_GLOBAL always falls through to PyDict_GetItem. I haven't investigated GenericGet/SetAttr yet.

--
Nick

From g.brandl at gmx.net Fri Sep 7 19:24:10 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 07 Sep 2007 19:24:10 +0200
Subject: [Python-3000] clean out the future?
Message-ID:

Should the __future__ be cleaned out for 3k, or should all future imports continue to work and do nothing?

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four.
Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From fdrake at acm.org Fri Sep 7 19:29:54 2007 From: fdrake at acm.org (Fred Drake) Date: Fri, 7 Sep 2007 13:29:54 -0400 Subject: [Python-3000] clean out the future? In-Reply-To: References: Message-ID: On Sep 7, 2007, at 1:24 PM, Georg Brandl wrote: > Should the __future__ be cleaned out for 3k, or should all future > imports > continue to work and do nothing? They should continue to work. One advantage of keeping the existing feature table in the __future__ module is that is makes it easier to avoid re-using a feature name; I think there's merit in that. -Fred -- Fred Drake From nick.bastin at gmail.com Fri Sep 7 20:30:33 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Fri, 7 Sep 2007 14:30:33 -0400 Subject: [Python-3000] Where is PyUnicodeObject->hash supposed to be set? Message-ID: <66d0a6e10709071130x600c3383if3718f6ec41395d5@mail.gmail.com> Before I do a bunch of searching around in the source, perhaps someone just knows the answer to this question. A quick trip through the debugger indicates that the reason PyDict_GetItem is being called 5 million times more often in PyEval_EvalFrameEx in 3.0 (in pystone 100000) is because while PyString_CheckExact was swapped out for PyUnicode_CheckExact in LOAD_GLOBAL, ((PyUnicodeObject*)w)->hash always evaluates to -1, which punts us down to the non-inline code. Presumably ((PyStringObject*)w)->ob_shash was already set at this point, which is why it worked in 2.6 and previous. Before I spend a lot of time trying to track down where this is supposed to be getting set (or, needs to be being set), does anyone know where this is supposed to happen? -- Nick From guido at python.org Fri Sep 7 20:46:28 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 7 Sep 2007 11:46:28 -0700 Subject: [Python-3000] Where is PyUnicodeObject->hash supposed to be set? 
In-Reply-To: <66d0a6e10709071130x600c3383if3718f6ec41395d5@mail.gmail.com> References: <66d0a6e10709071130x600c3383if3718f6ec41395d5@mail.gmail.com> Message-ID: It should be set in unicode_hash(). If you compare the trunk version of that function with the py3k branch version, you see that it's been refactored, and in the refactoring, setting ->hash was omitted. It should be trivial to put it back. On 9/7/07, Nicholas Bastin wrote: > Before I do a bunch of searching around in the source, perhaps someone > just knows the answer to this question. > > A quick trip through the debugger indicates that the reason > PyDict_GetItem is being called 5 million times more often in > PyEval_EvalFrameEx in 3.0 (in pystone 100000) is because while > PyString_CheckExact was swapped out for PyUnicode_CheckExact in > LOAD_GLOBAL, ((PyUnicodeObject*)w)->hash always evaluates to -1, which > punts us down to the non-inline code. Presumably > ((PyStringObject*)w)->ob_shash was already set at this point, which is > why it worked in 2.6 and previous. > > Before I spend a lot of time trying to track down where this is > supposed to be getting set (or, needs to be being set), does anyone > know where this is supposed to happen? > > -- > Nick > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg at krypto.org Fri Sep 7 20:48:18 2007 From: greg at krypto.org (Gregory P. 
Smith)
Date: Fri, 7 Sep 2007 11:48:18 -0700
Subject: [Python-3000] 3.0 crypto
In-Reply-To: <308CC895-A9EB-48F8-A7B7-80DC90A8D55A@solarsail.hcs.harvard.edu>
References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> <46DFB5B6.1020807@v.loewis.de> <308CC895-A9EB-48F8-A7B7-80DC90A8D55A@solarsail.hcs.harvard.edu>
Message-ID: <52dc1c820709071148l2c3061f9l14c929657ef7e397@mail.gmail.com>

On 9/6/07, Ivan Krstić wrote:
>
> On Sep 6, 2007, at 4:09 AM, Martin v. Löwis wrote:
> > There are more issues, of course: some countries restrict the use
> > of cryptography. France is given as an example: you need to register
> > your cryptography keys with the government (SCSSI) before you can
> > use confidentiality-oriented algorithms, IIUC.
>
> This gets at what most interests me -- namely, whether there's a
> strong legal barrier to including more crypto with Python than just
> the hashes we have at the moment. It sounds like the answer is 'yes',
> but what are the details?

fwiw hashes are not cryptography.

> The distribution size issue can be mitigated by a reasonable choice
> of supported primitives. I don't think we need to ship the crypto
> kitchen sink with Python; we can disqualify known-broken algorithms
> that many libraries still ship, etc.

I see nothing wrong with leaving pycrypto as an add-on library as most things don't need it. http://www.amk.ca/python/code/crypto. The pycrypto API is very nice. But if we were to consider it for the standard library I'd prefer it just link against OpenSSL rather than use its own C implementations and just leave platforms without ssl without any crypto.

Besides the chances are that most programmers seeing a crypto library will misuse it and gain a false sense of security on what they've done.
;) From greg at krypto.org Fri Sep 7 22:45:58 2007 From: greg at krypto.org (Gregory P. Smith) Date: Fri, 7 Sep 2007 13:45:58 -0700 Subject: [Python-3000] Performance Notes - new hash algorithm Message-ID: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> On 9/4/07, Thomas Hunger wrote: > > > Hello, > > I don't know much about python internals, so the following might be > bogus: > > I replaced unicode_hash and string_hash with the hash function from > here: http://www.azillionmonkeys.com/qed/hash.html. > > Then I ran the following micro-benchmark: > > $ time ./python bench.py > > where bench.py is: > > f = dict((line, nr) for nr, line > in enumerate(open('/usr/share/dict/words', > encoding='latin1').readlines())) > > Python3k original hash: real 0m2.210s > new hash: real 0m1.842s > > So maybe this is an interesting hash function? > > Tom Sounds like a great idea to me. Can you submit it as a patch? We should run some more realistic perf tests and profiles but I imagine the impact will only be good. -gps From guido at python.org Fri Sep 7 22:53:45 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 7 Sep 2007 13:53:45 -0700 Subject: [Python-3000] Performance Notes - new hash algorithm In-Reply-To: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> References: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> Message-ID: I'd like Tim Peters's input on this before we change it. I seem to recall that there's an aspect of non-randomness to the existing hash function that's important when you hash many closely related strings, e.g. "0001", "0002", "0003", etc., into a dictionary.
Though it's been so long that I may misremember this, and perhaps it was related to the dictionary implementation. In any case we need to see the code as a patch, of course. On 9/7/07, Gregory P. Smith wrote: > On 9/4/07, Thomas Hunger wrote: > > > > Hello, > > > > I don't know much about python internals, so the following might be > > bogus: > > > > I replaced unicode_hash and string_hash with the hash function from > > here: http://www.azillionmonkeys.com/qed/hash.html. > > > > Then I ran the following micro-benchmark: > > > > $ time ./python bench.py > > > > where bench.py is: > > > > f = dict((line, nr) for nr, line > > in enumerate(open('/usr/share/dict/words', > > encoding='latin1').readlines())) > > > > Python3k original hash: real 0m2.210s > > new hash: real 0m1.842s > > > > So maybe this is an interesting hash function? > > > > Tom > > Sounds like a great idea to me. Can you submit it as a patch? > > We should run some more realistic perf tests and profiles but I imagine the > impact will only be good. > > -gps > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From nick.bastin at gmail.com Fri Sep 7 23:13:31 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Fri, 7 Sep 2007 17:13:31 -0400 Subject: [Python-3000] Where is PyUnicodeObject->hash supposed to be set? In-Reply-To: References: <66d0a6e10709071130x600c3383if3718f6ec41395d5@mail.gmail.com> Message-ID: <66d0a6e10709071413q2d7532edh6d94f43a0e790b81@mail.gmail.com> On 9/7/07, Guido van Rossum wrote: > It should be set in unicode_hash(). If you compare the trunk version > of that function with the py3k branch version, you see that it's been > refactored, and in the refactoring, setting ->hash was omitted.
It > should be trivial to put it back. Putting it back nets an average 1.8% performance gain for pystone, but probably there were other cases that were extremely bad given this behaviour. We're still left with another 5 million 'extra' calls to PyDict_GetItem in 3.0 over 2.6 in a 100000 cycle pystone run, so I'll look around into those, but I suspect none of them will generate any larger performance gain. Someone with more experience than I in the 3.0 development cycle will be able to determine what macro-level optimizations / refactoring make sense, and what design decisions we're just going to have to pay for. At the moment (and probably for the foreseeable moments), I'm focusing on small improvements across the codebase. -- Nick From guido at python.org Fri Sep 7 23:20:30 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 7 Sep 2007 14:20:30 -0700 Subject: [Python-3000] Where is PyUnicodeObject->hash supposed to be set? In-Reply-To: <66d0a6e10709071413q2d7532edh6d94f43a0e790b81@mail.gmail.com> References: <66d0a6e10709071130x600c3383if3718f6ec41395d5@mail.gmail.com> <66d0a6e10709071413q2d7532edh6d94f43a0e790b81@mail.gmail.com> Message-ID: Can you post the full call graph after this fix (thanks Neil S!) somewhere, or attach it to an email here? --Guido On 9/7/07, Nicholas Bastin wrote: > On 9/7/07, Guido van Rossum wrote: > > It should be set in unicode_hash(). If you compare the trunk version > > of that function with the py3k branch version, you see that it's been > > refactored, and in the refactoring, setting ->hash was omitted. It > > should be trivial to put it back. > > Putting it back nets an average 1.8% performance gain for pystone, but > probably there were other cases that were extremely bad given this > behaviour. We're still left with another 5 million 'extra' calls to > PyDict_GetItem in 3.0 over 2.6 in a 100000 cycle pystone run, so I'll > look around into those, but I suspect none of them will generate any > larger performance gain.
> > Someone with more experience than I in the 3.0 development cycle will > be able to determine what macro-level optimizations / refactoring make > sense, and what design decisions we're just going to have to pay for. > At the moment (and probably for the foreseeable moments), I'm focusing > on small improvements across the codebase. > > -- > Nick > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From qrczak at knm.org.pl Sat Sep 8 13:59:01 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 08 Sep 2007 13:59:01 +0200 Subject: [Python-3000] Proposed new language for newline parameter to TextIOBase In-Reply-To: References: Message-ID: <1189252741.25695.1.camel@qrnik> On 14-08-2007, Tue at 21:56 -0700, Guido van Rossum wrote: > (2) newline='': input with untranslated universal newlines mode; lines > may end in \r, \n, or \r\n, and these are returned untranslated. > > (3) newline='\r', newline='\n', newline='\r\n': input lines must end > with the given character(s), and these are translated to \n. What is the difference between '' and '\n'?
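A small sketch of the distinction being asked about, assuming the io module as it eventually shipped in Python 3 (which may not match this 2007 draft in every detail, particularly for explicit values other than '\n'). RAW and read_lines are illustrative names, not part of any proposal:

```python
import io

# One CRLF-terminated line, one bare-CR line, one LF line.
RAW = b"a\r\nb\rc\n"


def read_lines(newline):
    """Decode RAW with the given newline mode and return the lines read."""
    with io.TextIOWrapper(io.BytesIO(RAW), encoding="ascii", newline=newline) as f:
        return f.readlines()


translated = read_lines(None)   # universal newlines, endings translated to "\n"
untranslated = read_lines("")   # universal newlines, endings returned as found
lf_only = read_lines("\n")      # only "\n" ends a line; "\r" is ordinary data

print(translated)
print(untranslated)
print(lf_only)
```

With newline='' every kind of line ending still splits the input, whereas with newline='\n' the bare "\r" stays inside a line.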
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From guido at python.org Sat Sep 8 16:27:17 2007 From: guido at python.org (Guido van Rossum) Date: Sat, 8 Sep 2007 07:27:17 -0700 Subject: [Python-3000] Proposed new language for newline parameter to TextIOBase In-Reply-To: <1189252741.25695.1.camel@qrnik> References: <1189252741.25695.1.camel@qrnik> Message-ID: On 9/8/07, Marcin 'Qrczak' Kowalczyk wrote: > On 14-08-2007, Tue at 21:56 -0700, Guido van Rossum wrote: > > > (2) newline='': input with untranslated universal newlines mode; lines > > may end in \r, \n, or \r\n, and these are returned untranslated. > > > > (3) newline='\r', newline='\n', newline='\r\n': input lines must end > > with the given character(s), and these are translated to \n. > > What is the difference between '' and '\n'? None on output. On input, "\n" disables universal newline mode altogether ("\r" doesn't end a line), while "" enables universal newlines for determining the line ending, but disables the *translation* part, meaning you will get lines ending in "\r", "\r\n", or "\n" depending on what's in the input. The default UN mode with translation is easier for most apps (since it guarantees that lines end in \n like most apps expect), but the UN mode without translation is handy if you want to copy the file faithfully (apart from specific edits). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From qrczak at knm.org.pl Sat Sep 8 18:45:20 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 08 Sep 2007 18:45:20 +0200 Subject: [Python-3000] python3.0-config uses python2 syntax Message-ID: <1189269920.25695.3.camel@qrnik> and fails on print.
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From g.brandl at gmx.net Sat Sep 8 18:52:54 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 08 Sep 2007 18:52:54 +0200 Subject: [Python-3000] python3.0-config uses python2 syntax In-Reply-To: <1189269920.25695.3.camel@qrnik> References: <1189269920.25695.3.camel@qrnik> Message-ID: Marcin 'Qrczak' Kowalczyk wrote: > and fails on print. Already fixed. :) Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From qrczak at knm.org.pl Sat Sep 8 19:00:39 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 08 Sep 2007 19:00:39 +0200 Subject: [Python-3000] C API for ints and strings Message-ID: <1189270839.25695.18.camel@qrnik> I see that PyInt_* functions are aliases for PyLong_*. Which ones should I use for the long term? There are no PyInt equivalents of PyLong_FromLongLong nor PyLong_AsLongLong. Should I continue to use PyUnicode_* functions for the new str? What is the status of the str8 type? Is it kept temporarily until the modules are updated to Python3 str, or it is an official immutable bytes type? Its repr uses s'...' syntax which is not supported by the parser. Why is _PyLong_FitsInLong private? In order to convert a Python3 int to another numeric representation, I would like to check if it fits in a C long, and convert via a string only if it does not. Should I use PyLong_AsLong + PyErr_Occurred?
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From guido at python.org Sat Sep 8 19:12:00 2007 From: guido at python.org (Guido van Rossum) Date: Sat, 8 Sep 2007 10:12:00 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <1189270839.25695.18.camel@qrnik> References: <1189270839.25695.18.camel@qrnik> Message-ID: On 9/8/07, Marcin 'Qrczak' Kowalczyk wrote: > I see that PyInt_* functions are aliases for PyLong_*. Which ones > should I use for the long term? There are no PyInt equivalents of > PyLong_FromLongLong nor PyLong_AsLongLong. Use PyLong for now. Eventually we may rename them all; then we'll provide a renaming tool or macros. > Should I continue to use PyUnicode_* functions for the new str? Correct. Again, eventually we may rename. > What is the status of the str8 type? Is it kept temporarily until the > modules are updated to Python3 str, or it is an official immutable bytes > type? Its repr uses s'...' syntax which is not supported by the parser. The problem with its repr() is a hint. ;-) it is a temporary hack until we don't need it any more. During and after the last sprint, Neal Norwitz did a lot of work towards getting rid of it, but more needs to be done. Help is welcome! > Why is _PyLong_FitsInLong private? I don't know; perhaps because it doesn't always give the best answer. > In order to convert a Python3 int to > another numeric representation, I would like to check if it fits in a C > long, and convert via a string only if it does not. Should I use > PyLong_AsLong + PyErr_Occurred? I think either is fine. _PyLong_FitsInLong() will only get better over time. 
:-) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Sat Sep 8 19:27:11 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 08 Sep 2007 19:27:11 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> Message-ID: <46E2DB6F.2080608@v.loewis.de> >> Why is _PyLong_FitsInLong private? > > I don't know; perhaps because it doesn't always give the best answer. Its sole purpose is to support PyInt_CheckExact. There is some code that relies that after PyInt_CheckExact succeeds, it is safe to do PyInt_AsLong. When I defined PyInt_CheckExact to PyLong_CheckExact, such code would break. Adding this "conservative" estimate allowed that code to work when the macro was true. As this occurs in some time-critical places, I did not want to waste time with computing a correct result. Regards, Martin From guido at python.org Sat Sep 8 19:29:09 2007 From: guido at python.org (Guido van Rossum) Date: Sat, 8 Sep 2007 10:29:09 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E2DB6F.2080608@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <46E2DB6F.2080608@v.loewis.de> Message-ID: Hm, then perhaps rangeobject.c shouldn't use it? On 9/8/07, "Martin v. Löwis" wrote: > >> Why is _PyLong_FitsInLong private? > > > > I don't know; perhaps because it doesn't always give the best answer. > > Its sole purpose is to support PyInt_CheckExact. There is some code > that relies that after PyInt_CheckExact succeeds, it is safe to do > PyInt_AsLong. When I defined PyInt_CheckExact to PyLong_CheckExact, > such code would break. Adding this "conservative" estimate allowed > that code to work when the macro was true. As this occurs in some > time-critical places, I did not want to waste time with computing a > correct result.
> > Regards, > Martin > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Sat Sep 8 19:38:48 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 08 Sep 2007 19:38:48 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <46E2DB6F.2080608@v.loewis.de> Message-ID: <46E2DE28.1040704@v.loewis.de> > Hm, then perhaps rangeobject.c shouldn't use it? That use is correct also; the int_range_iter is also an optimization. It does not matter that the result is not correct; if one bound is >2**30, it will create a longrangeiter, even though an int one would still be sufficient. Regards, Martin From nick.bastin at gmail.com Sat Sep 8 19:41:10 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sat, 8 Sep 2007 13:41:10 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> Message-ID: <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> On 9/8/07, Guido van Rossum wrote: > On 9/8/07, Marcin 'Qrczak' Kowalczyk wrote: > > I see that PyInt_* functions are aliases for PyLong_*. Which ones > > should I use for the long term? There are no PyInt equivalents of > > PyLong_FromLongLong nor PyLong_AsLongLong. > > Use PyLong for now. Eventually we may rename them all; then we'll > provide a renaming tool or macros. > > > Why is _PyLong_FitsInLong private? > > I don't know; perhaps because it doesn't always give the best answer. > > > In order to convert a Python3 int to > > another numeric representation, I would like to check if it fits in a C > > long, and convert via a string only if it does not. Should I use > > PyLong_AsLong + PyErr_Occurred? > > I think either is fine. _PyLong_FitsInLong() will only get better over time. 
:-) Speaking of PyLong, and its minor awkwardness to work with in C (you either have to convert to another multiple-precision type through a string, or use Python's arithmetic operators directly), was there any thought given to using something like GMP's mpz_t as the backing data type? -- Nick From martin at v.loewis.de Sat Sep 8 19:44:37 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 08 Sep 2007 19:44:37 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> Message-ID: <46E2DF85.4090005@v.loewis.de> > Speaking of PyLong, and its minor awkwardness to work with in C (you > either have to convert to another multiple-precision type through a > string, or use Python's arithmetic operators directly), was there any > thought given to using something like GMP's mpz_t as the backing data > type? I never did that. Regards, Martin From janssen at parc.com Sat Sep 8 21:39:25 2007 From: janssen at parc.com (Bill Janssen) Date: Sat, 8 Sep 2007 12:39:25 PDT Subject: [Python-3000] 3.0 crypto In-Reply-To: <46DFD40E.8010705@v.loewis.de> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> <46DFB5B6.1020807@v.loewis.de> <308CC895-A9EB-48F8-A7B7-80DC90A8D55A@solarsail.hcs.harvard.edu> <46DFD40E.8010705@v.loewis.de> Message-ID: <07Sep8.123933pdt."57996"@synergy1.parc.xerox.com> > >> Why do you say that doing the work is not a problem? I see it as > >> a major problem. > > > > I'm willing to either do the work myself, or have someone else from the > > secops team at OLPC do it. > > It's not something that a single person can well do.
You will also need > to design APIs, and that traditionally involves the community. If you > create something ad-hoc, I would request that this first gets > field-proven for a few years before being included in the standard > distribution. Then, it would face competition to existing such > solutions. We're already linking against the OpenSSL EVP libraries for hashlib (and against the OpenSSL SSL libraries for the SSL support). It wouldn't be hard to expose the EVP functions a bit more, essentially as hash functions that return long (and reversible) hashes: encryptor = opensslevp.encryptor("AES-256-CBC", ...maybe some options...) encryptor.update(...some plaintext...) ... ciphertext = encryptor.digest() ... decryptor = opensslevp.decryptor("AES-256-CBC", ...maybe some options...) decryptor.update(ciphertext) plaintext = decryptor.digest() Take a look at the docs for EVP_EncryptInit_ex. The crypto would stay in the OpenSSL library; this would just be more hashing on top of it. I'd sure like to have this so I could write a Python decryptor for my PalmOS password keeper (a program called Strip) which I could run on my iPhone. (The iPhone Python has SSL support.) Bill From nick.bastin at gmail.com Sat Sep 8 22:47:56 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sat, 8 Sep 2007 16:47:56 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E2DF85.4090005@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> Message-ID: <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> On 9/8/07, "Martin v. Löwis" wrote: > > Speaking of PyLong, and its minor awkwardness to work with in C (you > > either have to convert to another multiple-precision type through a > > string, or use Python's arithmetic operators directly), was there any > > thought given to using something like GMP's mpz_t as the backing data > > type? > > I never did that.
Would anyone be opposed to rehosting PyLong on top of GMP? I'm not necessarily volunteering to do the work (yet, anyhow), but just trying to get a read on the feelings of the community. PyLong has historically been a bit of a pain to deal with if you embedded or extended python, or otherwise had to deal with it at the C API level. With the distinction between int and long being removed at the user level, it will become more un-pythonic to refuse to accept long integers in some extensions. Additionally, something like GMP would likely provide improved performance, and would be a piece of code, perhaps out of the core domain knowledge of the core python developers, that we would not have to maintain. On the other hand, GMP would become a required library, not one simply built against if you had it (provided that the issues with the pervasiveness of the use of OpenSSL are resolved, no external library is currently required for 'normal' operation of the interpreter). Would we want to maintain parallel implementations? Does this provide a barrier to entry to some platform ports? (I think not, since it doesn't change the definition of the language, but it's worth asking). -- Nick From martin at v.loewis.de Sun Sep 9 00:18:10 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 09 Sep 2007 00:18:10 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> Message-ID: <46E31FA2.4060701@v.loewis.de> > Would anyone be opposed to rehosting PyLong on top of GMP? I would be opposed. It's LGPL'ed, so you would have to ship GMP sources with any Python binary that you distribute. 
Regards, Martin From greg at krypto.org Sun Sep 9 01:15:58 2007 From: greg at krypto.org (Gregory P. Smith) Date: Sat, 8 Sep 2007 16:15:58 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: References: <20070829234728.GV24059@electricrain.com> Message-ID: <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> A new version is attached; cleaned up and simplified based on your original comments. On 8/29/07, Guido van Rossum wrote: > > That's a huge patch to land so close before a release. I'm not sure I > like the immutability API -- it won't be useful unless we add a hash > method, and then we have all sorts of difficulties again -- the > distinction between a hashable and an unhashable object should be made > by type, not by value (tuples containing unhashable values > notwithstanding). ok i've removed the immutable support in the most recent patch. i still think it -might- be useful but isn't required and you're right that it could open a can of worms if people think it should also mean hashable. immutable bytes may be best implemented as a subclass if it's ever wanted. > I don't understand the comment about using PyBUF_WRITABLE in > _getbuffer() -- this is only used for data we're *reading* and I don't > think the GIL is even released while we're reading such things. that appears to be correct. the comment was wrong. fixed. -gps > If you think it's important to get this in the 3.0a1 release, we > should pair-program on it ASAP, preferably tomorrow morning. > Otherwise, let's do a review next week. > > --Guido > > On 8/29/07, Gregory P. Smith wrote: > > Attached is what I've come up with so far. Only a single field is > > added to the PyBytesObject struct. This adds support to the bytes > > object for PyBUF_LOCKDATA buffer API operation. bytes objects can be > > marked temporarily read-only for use while the buffer api has handed > > them off to something which may run without the GIL (think IO).
Any > > attempt to modify them during that time will raise an exception as I > > believe Martin suggested earlier. > > > > As an added bonus because its been discussed here, support for setting > > a bytes object immutable has been added since its pretty trivial once > > the read only export support was in place. Thats not required but was > > trivial to include. > > > > I'd appreciate any feedback. > > > > My TODO list for this patch: > > > > 0. Get feedback and make adjustments as necessary. > > > > 1. Deciding between PyBUF_SIMPLE and PyBUF_WRITEABLE for the internal > > uses of the _getbuffer() function. bytesobject.c contains both > readonly > > and read-write uses of the buffers, i'll add boolean parameter for > > that. > > > > 2. More testing: a few tests in the test suite fail after this but the > > number was low and I haven't had time to look at why or what the > > failures were. > > > > 3. Exporting methods suggested in the TODO at the top of the file. > > > > 4. Unit tests for all of the functionality this adds. > > > > NOTE: after these changes I had to make clean and rm -rf build before > > things would not segfault on import. I suspect some things (modules?) > > were not properly recompiled after the bytesobject.h struct change > > otherwise. > > > > -gps > > > > > > _______________________________________________ > > Python-3000 mailing list > > Python-3000 at python.org > > http://mail.python.org/mailman/listinfo/python-3000 > > Unsubscribe: > http://mail.python.org/mailman/options/python-3000/guido%40python.org > > > > > > > > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070908/e3621c4a/attachment-0001.htm -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: bytes-lockdata-gps02.patch.txt Url: http://mail.python.org/pipermail/python-3000/attachments/20070908/e3621c4a/attachment-0001.txt From nick.bastin at gmail.com Sun Sep 9 01:23:13 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sat, 8 Sep 2007 19:23:13 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E31FA2.4060701@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> Message-ID: <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> On 9/8/07, "Martin v. Löwis" wrote: > > Would anyone be opposed to rehosting PyLong on top of GMP? > > I would be opposed. It's LGPL'ed, so you would have to ship GMP sources > with any Python binary that you distribute. The LGPL has no requirement that you convey source for unmodified libraries. Linkage does not imply modification. -- Nick From tim.peters at gmail.com Sun Sep 9 03:48:56 2007 From: tim.peters at gmail.com (Tim Peters) Date: Sat, 8 Sep 2007 21:48:56 -0400 Subject: [Python-3000] Performance Notes - new hash algorithm In-Reply-To: References: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> Message-ID: <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com> [Guido] > I'd like Tim Peters's input on this before we change it. I seem to > recall that there's an aspect of non-randomness to the existing hash > function that's important when you hash many closely related strings, > e.g. "0001", "0002", "0003", etc., into a dictionary. Though it's been > so long that I may misremember this, and perhaps it was related to the > dictionary implementation. Not "important" so much as "possibly helpful" ;-) This is explained in comments in dictobject.c.
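The effect those dictobject.c comments describe can be reproduced with a rough emulation of the Python 2 style string hash on a box where sizeof(long) == 4. This is only a sketch: py2_string_hash is a hypothetical helper transcribed from memory of string_hash(), not the C code itself, and the 8-slot table is just an example size.

```python
def py2_string_hash(s, bits=32):
    """Emulate the Python 2 style string hash for a C long of `bits` bits."""
    data = s.encode("latin-1")
    if not data:
        return 0
    mask = (1 << bits) - 1
    x = (data[0] << 7) & mask
    for byte in data:
        # Multiply by the magic constant with C-style wraparound, then mix in the byte.
        x = ((1000003 * x) & mask) ^ byte
    x ^= len(data)
    if x >= 1 << (bits - 1):
        x -= 1 << bits          # reinterpret as a signed C long
    return -2 if x == -1 else x  # -1 is reserved as an error marker


names = ["namea", "nameb", "namec", "named"]
hashes = [py2_string_hash(n) for n in names]
slots = [h & 7 for h in hashes]  # initial slot in a table of 8 entries

print(hashes)
print(slots)
```

For these four keys the hashes differ only in their low three bits, so each key lands in its own slot of an 8-slot table, which is the "better than random" collision behaviour for closely related strings.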
As it notes there, hashing the strings "namea", "nameb", "namec", and "named" currently produces (on a sizeof(long) == 4 box): -1658398457 -1658398460 -1658398459 -1658398462 That the hash codes are very close but not identical is "a feature", since the dict implementation only looks at the last k bits (for various more-or-less small values of k): this gives "better than random" dict collision behavior for input strings very close together. The proposed hash produces instead: 1892683363 -970432008 51735791 1567337715 Obviously much closer to "random" behavior, but that's not necessarily a good thing for dicts. FYI, wrt http://www.azillionmonkeys.com/qed/hash.html Python's current string hash is very similar to (but developed independently of) the FNV hash. Things to look out for in the proposed hash: - There's no explanation of where all the magic shift constants and shift patterns come from. - It relies on potentially unaligned access to read 16-bit chunks at a time. This means #ifdef cruft to "turn that off" on platforms that don't support unaligned access, and means timing will vary on platforms that do (depending on whether input strings do or do not /happen/ to be 2-byte aligned). - It only delivers a 32-bit hash. But at least before Py3K, Python's hash codes are the native C "long" (32 or 64 bits on all current boxes). The current hash code couldn't care less what sizeof(long) is. It's not clear how to modify the proposed hash to deliver 64-bit hash codes, in large part because of the first point above. - It needs another conditional "at the bottom" to avoid returning a hash code of -1. That will affect timing too. >>> Python3k original hash: real 0m2.210s >>> new hash: real 0m1.842s That's actually a surprisingly small difference, given the much larger timing differences displayed on: http://www.azillionmonkeys.com/qed/hash.html compared to the FNV hash. 
OTOH, the figures there only looked at 256-byte strings, which is much larger (IMO) "than average" for strings. Better tests would time building and accessing string-keyed dicts with reasonable and unreasonable ;-) keys. From larry at hastings.org Sun Sep 9 04:24:47 2007 From: larry at hastings.org (Larry Hastings) Date: Sat, 08 Sep 2007 19:24:47 -0700 Subject: [Python-3000] Performance Notes - new hash algorithm In-Reply-To: <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com> References: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com> Message-ID: <46E3596F.3090606@hastings.org> If the Python community is just noticing the Hsieh hash, that implies that the Bob Jenkins hashes are probably unknown as well. Behold: http://burtleburtle.net/bob/hash/doobs.html To save you a little head-scratching, the functions you want to play with are hashlittle()/hashlittle2() in "lookup3.c": http://burtleburtle.net/bob/c/lookup3.c hashlittle() returns a 32-bit hash; hashlittle2() returns two 32-bit hashes on the same input (in effect a 64-bit hash). The "little" implies that the function is better on little-endian machines. (There is a hashbig(); no hashbig2(), it is left as an exercise for the reader.) In our testing (at Facebook, for memcached) hashlittle2 was faster than the Hsieh hash; that was done a year ago (and before I joined) so I don't have numbers for you. One goal of Jenkins's hashes is uniform distribution, so these functions presumably lack the serendipitous "similar inputs hash to similar values" behavior of Python's current hash function. But why is that a feature? (Not that I doubt Tim Peters!) Oh, and, all the Jenkins code is public domain. Cheers, /larry/ -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070908/a49272fd/attachment.htm From guido at python.org Sun Sep 9 07:19:54 2007 From: guido at python.org (Guido van Rossum) Date: Sat, 8 Sep 2007 22:19:54 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> Message-ID: On 9/8/07, Nicholas Bastin wrote: > On 9/8/07, "Martin v. Löwis" wrote: > > > Would anyone be opposed to rehosting PyLong on top of GMP? > > > > I would be opposed. It's LGPL'ed, so you would have to ship GMP sources > > with any Python binary that you distribute. > > The LGPL has no requirement that you convey source for unmodified > libraries. Linkage does not imply modification. Nevertheless I think it would be a bad idea to make it the default long implementation. There are bound to be *some* licensing issues with the LGPL (even if it's just more FUD we'd have to fight) and it'd be one more dependency. I believe there are already Python bindings for GMP somewhere, so it's not like there is no way to use it if you absolutely have to.
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Sun Sep 9 10:39:10 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 09 Sep 2007 10:39:10 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> Message-ID: <46E3B12E.1000703@v.loewis.de> > The LGPL has no requirement that you convey source for unmodified > libraries. Linkage does not imply modification. Why do you say that? LGPL 2.1, section 6a) (talking about "work that uses the Library"): a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.) So you must "accompany the work with complete source code for the Library". 
Regards, Martin From nick.bastin at gmail.com Sun Sep 9 11:06:37 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sun, 9 Sep 2007 05:06:37 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E3B12E.1000703@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> Message-ID: <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> On 9/9/07, "Martin v. Löwis" wrote: > > The LGPL has no requirement that you convey source for unmodified > > libraries. Linkage does not imply modification. > > Why do you say that? LGPL 2.1, section 6a) (talking about > "work that uses the Library"): > > a) Accompany the work with the complete corresponding machine-readable > source code for the Library including whatever changes were used in the > work (which must be distributed under Sections 1 and 2 above); and, if > the work is an executable linked with the Library, with the complete > machine-readable "work that uses the Library", as object code and/or > source code, so that the user can modify the Library and then relink to > produce a modified executable containing the modified Library. (It is > understood that the user who changes the contents of definitions files > in the Library will not necessarily be able to recompile the application > to use the modified definitions.) > > So you must "accompany the work with complete source code for the Library". You're being awfully selective in your reading. Section 6a is immediately preceded by a statement which says: "Also, you must do one of these things:" 6a is but one of 5 choices. Those choices are: "b) Use a suitable shared library mechanism for linking with the Library." 
"c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution." "d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place." "e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy." Pick any one of those options you like that doesn't involve shipping source code. Using standard shared libraries is a "suitable shared library mechanism". Also, the LGPLv3 in section 4d.1 specifies the same "Use a suitable shared library mechanism for linking with the Library." This is more relevant, since GMP is licensed under v3 and not v2.1. -- Nick From thomas at python.org Sun Sep 9 11:13:00 2007 From: thomas at python.org (Thomas Wouters) Date: Sun, 9 Sep 2007 11:13:00 +0200 Subject: [Python-3000] Performance Notes - new hash algorithm In-Reply-To: <46E3596F.3090606@hastings.org> References: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com> <46E3596F.3090606@hastings.org> Message-ID: <9e804ac0709090213q4c8f7431oa93037efb36e009e@mail.gmail.com> On 9/9/07, Larry Hastings wrote: > One goal of Jenkin's hashes is uniform distribution, so these functions > presumably lack the serendipitous "similar inputs hash to similar values" > behavior of Python's current hash function. But why is that a feature? > (Not that I doubt Tim Peters!) > Because (relatively) small dicts with (broadly speaking) similar keys are quite common in Python. 
Module and class and instance __dict__s, for instance ;) As Tim mentioned, the dict implementation only looks at part of the actual hash value (depending on the size of the dict) and having hash values close but not the same greatly decreases the chance of collisions in (relatively) small dicts. It's less of a problem for massive dicts with (almost) completely arbitrary keys, but it doesn't exactly hurt there, either. -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070909/abe62bce/attachment.htm From martin at v.loewis.de Sun Sep 9 11:24:55 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 09 Sep 2007 11:24:55 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> Message-ID: <46E3BBE7.4020800@v.loewis.de> > You're being awfully selective in your reading. On purpose. All alternatives can be ruled out quickly as unfeasible, or equivalent to "distribute the source code". > 6a is but one of 5 choices. So which of these would you recommend? > "b) Use a suitable shared library mechanism for linking with the Library." This is shortened. The full text reads b) Use a suitable shared library mechanism for linking with the Library. 
A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with. So this is only an option if "a copy of the library [is] already present on the user's computer system". This may work for Linux, but not for Windows, or Solaris (not sure about OSX). > "c) Accompany the work with a written offer, valid for at least three > years, to give the same user the materials specified in Subsection 6a, > above, for a charge no more than the cost of performing this > distribution." I find that equally unacceptable for Python. People distributing Python should not be required to include written offers. > "d) If distribution of the work is made by offering access to copy > from a designated place, offer equivalent access to copy the above > specified materials from the same place." This is the same as "distribute the source code". > "e) Verify that the user has already received a copy of these > materials or that you have already sent this user a copy." This may work for a limited number of copies, where you know all recipients personally, but won't work for Python. > Also, the LGPLv3 in section 4d.1 specifies the same "Use a suitable > shared library mechanism for linking with the Library." This is more > relevant, since GMP is licensed under v3 and not v2.1. And it has the same restriction: the shared library must already be present on the user's computer system. So again, this won't work for the Windows binaries that we distribute. We (python.org) could place the source code of GMP along with the MSI binary, but then people redistributing the MSI binary would break the LGPL, unless they also distribute the GMP sources. 
Regards, Martin From larry at hastings.org Sun Sep 9 14:04:44 2007 From: larry at hastings.org (Larry Hastings) Date: Sun, 09 Sep 2007 05:04:44 -0700 Subject: [Python-3000] Performance Notes - new hash algorithm In-Reply-To: <9e804ac0709090213q4c8f7431oa93037efb36e009e@mail.gmail.com> References: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com> <46E3596F.3090606@hastings.org> <9e804ac0709090213q4c8f7431oa93037efb36e009e@mail.gmail.com> Message-ID: <46E3E15C.8040801@hastings.org> Thomas Wouters wrote: > Because (relatively) small dicts with (broadly speaking) similar keys > are quite common in Python. Module and class and instance __dict__s, > for instance ;) As Tim mentioned, the dict implementation only looks > at part of the actual hash value (depending on the size of the dict) > and having hash values close but not the same greatly decreases the > chance of collisions in (relatively) small dicts. I see--it's avoiding the Birthday Paradox. Collisions are actually more likely if the numbers are totally random than if the numbers are, because of a feeble hash algorithm, relatively consecutive. ;) Got it, thanks, /larry/ -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070909/0cb9a27b/attachment.htm From qrczak at knm.org.pl Sun Sep 9 15:12:23 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sun, 09 Sep 2007 15:12:23 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <1189270839.25695.18.camel@qrnik> References: <1189270839.25695.18.camel@qrnik> Message-ID: <1189343544.4344.9.camel@qrnik> Since PyString_Format is deprecated, is there a better way to convert a Python3 int which doesn't fit in a C long to a hex representation in a C string, than PyUnicode_Format and iterating over characters, casting them from Unicode to bytes? 
I actually need to convert it to mpz_t, which is best done via text in a C string in a base which is a power of 2. Since PyUnicode_Format for Python3 int creates a byte string first, it's quite silly to let a byte string be converted to a Unicode string and then back. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From martin at v.loewis.de Sun Sep 9 15:24:37 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 09 Sep 2007 15:24:37 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <1189343544.4344.9.camel@qrnik> References: <1189270839.25695.18.camel@qrnik> <1189343544.4344.9.camel@qrnik> Message-ID: <46E3F415.9060707@v.loewis.de> > I actually need to convert it to mpz_t, which is best done via text > in a C string in a base which is a power of 2. Since PyUnicode_Format > for Python3 int creates a byte string first, it's quite silly to let > a byte string be converted to a Unicode string and then back. You could use _PyLong_AsByteArray. Regards, Martin From ncoghlan at gmail.com Sun Sep 9 16:10:19 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 10 Sep 2007 00:10:19 +1000 Subject: [Python-3000] clean out the future? In-Reply-To: References: Message-ID: <46E3FECB.2080404@gmail.com> Fred Drake wrote: > On Sep 7, 2007, at 1:24 PM, Georg Brandl wrote: >> Should the __future__ be cleaned out for 3k, or should all future >> imports >> continue to work and do nothing? > > They should continue to work. > > One advantage of keeping the existing feature table in the __future__ > module is that is makes it easier to avoid re-using a feature name; I > think there's merit in that. While I don't object to that (I agree keeping the history in the __future__ module is a good thing), 2to3 should probably strip them anyway, since they're now redundant. Cheers, Nick. 
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From qrczak at knm.org.pl Sun Sep 9 16:23:18 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sun, 09 Sep 2007 16:23:18 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E3F415.9060707@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <1189343544.4344.9.camel@qrnik> <46E3F415.9060707@v.loewis.de> Message-ID: <1189347799.4344.12.camel@qrnik> On 09-09-2007 (Sun) at 15:24 +0200, "Martin v. Löwis" wrote: > You could use _PyLong_AsByteArray. I'm scared by the underscore. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From martin at v.loewis.de Sun Sep 9 16:31:08 2007 From: martin at v.loewis.de (=?ISO-8859-2?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 09 Sep 2007 16:31:08 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <1189347799.4344.12.camel@qrnik> References: <1189270839.25695.18.camel@qrnik> <1189343544.4344.9.camel@qrnik> <46E3F415.9060707@v.loewis.de> <1189347799.4344.12.camel@qrnik> Message-ID: <46E403AC.3050508@v.loewis.de> >> You could use _PyLong_AsByteArray. > > I'm scared by the underscore. If that helps, feel free to submit a patch to remove the underscore, and document the function properly. Regards, Martin From fdrake at acm.org Sun Sep 9 17:47:56 2007 From: fdrake at acm.org (Fred Drake) Date: Sun, 9 Sep 2007 11:47:56 -0400 Subject: [Python-3000] clean out the future? In-Reply-To: <46E3FECB.2080404@gmail.com> References: <46E3FECB.2080404@gmail.com> Message-ID: On Sep 9, 2007, at 10:10 AM, Nick Coghlan wrote: > While I don't object to that (I agree keeping the history in the > __future__ module is a good thing), 2to3 should probably strip them > anyway, since they're now redundant. That would be good.
From a compatibility perspective, they should work, but they should be removed from source code (I've never *liked* the __future__ imports, though I understand their value). -Fred -- Fred Drake From nick.bastin at gmail.com Sun Sep 9 19:41:53 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sun, 9 Sep 2007 13:41:53 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E3BBE7.4020800@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> Message-ID: <66d0a6e10709091041u5fa1d7c2xfd16b45a91dab0d0@mail.gmail.com> On 9/9/07, "Martin v. Löwis" wrote: > > "d) If distribution of the work is made by offering access to copy > > from a designated place, offer equivalent access to copy the above > > specified materials from the same place." > > This is the same as "distribute the source code". Well, it's the same as "offer for distribution". There's no requirement that the user actually ever download it, only that you offer it for download. Certainly there's no requirement that you put the source in the installer package (the GPL FAQ covers this question - "Our requirements for redistributors are intended to make sure the users can get the source code, not to force users to download the source code even if they don't want it.") Also, if python.org agreed to continually make the GMP library source available, that would solve the problem for other binary distributors. From the GPL FAQ: "the GPL says you must offer access to copy the source code "from the same place"; that is, next to the binaries. 
However, if you make arrangements with another site to keep the necessary source code available, and put a link or cross-reference to the source code next to the binaries, we think that qualifies as "from the same place"." -- Nick From greg at krypto.org Sun Sep 9 21:02:15 2007 From: greg at krypto.org (Gregory P. Smith) Date: Sun, 9 Sep 2007 12:02:15 -0700 Subject: [Python-3000] Solaris support in 3.0? In-Reply-To: <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> Message-ID: <52dc1c820709091202p7fcb037j850e1750fdc736e3@mail.gmail.com> > Rather than resurrecting the old RSA-copyright md5.c I can easily make new > ones out of the libtomcrypt md5 and sha1 sources the same way i created the > non-openssl sha256 and sha512 modules. > > We should not limit ourselves to only md5 if we do that, lets guarantee > that md5, sha1 - sha512 are available on all future python installs; its not > difficult. I'll do the work if we need it. > > -gps > Done. Waiting on buildbots to confirm it fixes tru64 and solaris. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070909/c48a5771/attachment.htm From greg at krypto.org Sun Sep 9 21:09:23 2007 From: greg at krypto.org (Gregory P. Smith) Date: Sun, 9 Sep 2007 12:09:23 -0700 Subject: [Python-3000] audio device support In-Reply-To: <46DDD42D.8090608@ibp.de> References: <46DDD42D.8090608@ibp.de> Message-ID: <52dc1c820709091209v2f04a406q4f5cf4c8d5d38968@mail.gmail.com> > What I'd like to see: > > I like the idea of having audio device support for the major operating > systems in the standard library. > > But I am even more interested in a common interface for simple operations. 
> > IMO, the API should support: > > - stereo playback > - stereo recording > - different sampling rates and formats (alaw, mulaw and PCM in signed > integers in various widths and maybe PCM in floats/doubles). > - device selection > - volume control > > Overall, I think the level of abstraction in the OSS or ALSA APIs is > about right, coreaudio on OS X and DirectSound on Windows are overkill > outside of niche applications. > > I would volunteer sample implementations for Windows, OS X and Linux > (ALSA). > > - Lars That sounds like a nice basic simple interface. I suggest writing it up and submitting it as a patch or even making it a standalone module with its own distutils setup.py. It sounds like a good idea regardless of whether it's accepted into the standard library. (clearly what we have now for python audio is a mess :) -gps -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070909/a88ad6ea/attachment-0001.htm From lars at ibp.de Sun Sep 9 21:39:34 2007 From: lars at ibp.de (Lars Immisch) Date: Sun, 09 Sep 2007 21:39:34 +0200 Subject: [Python-3000] audio device support In-Reply-To: <52dc1c820709091209v2f04a406q4f5cf4c8d5d38968@mail.gmail.com> References: <46DDD42D.8090608@ibp.de> <52dc1c820709091209v2f04a406q4f5cf4c8d5d38968@mail.gmail.com> Message-ID: <46E44BF6.5090501@ibp.de> > That sounds like a nice basic simple interface. I suggest writing it up > and submitting it as a patch or even making it a standalone module with > its own distutils setup.py. It sounds like a good idea regardless of whether > it's accepted into the standard library. (clearly what we have now for > python audio is a mess :) Terry Reedy suggested looking into pygame; I like its explicit channel abstraction. A standalone module is probably the best start. I'll look into it. 
- Lars From jimjjewett at gmail.com Sun Sep 9 23:25:56 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Sun, 9 Sep 2007 17:25:56 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> Message-ID: On 9/8/07, Nicholas Bastin wrote: > On 9/8/07, "Martin v. Löwis" wrote: > > > Speaking of PyLong, and its minor awkwardness to work with in C (you > > > either have to convert to another multiple-precision type through a > > > string, or use Python's arithmetic operators directly), was there any > > > thought given to using something like GMP's mpz_t as the backing data > > > type? > Would anyone be opposed to rehosting PyLong on top of GMP? (1) If there are concerns about the RCA attribution license, I would expect much greater concerns about LGPL. (2) License aside, does it really solve the problem you had about needing to convert or use Python's arithmetic operations? At first glance, it looks like you would still have the same problem, except that you would need to use the GMP functions instead of the python functions. (3) Is it stable enough? I know it has been developed since 1991, but they seem to focus on high performance for truly huge numbers. I suspect the vast majority of python programs would perform fine if they were limited to C ints, and so the extra costs may not be worth it. According to http://gmplib.org/ """ IMPORTANT INFORMATION FOR ALL GMP USERS: GMP is very often miscompiled! We are seeing ever increasing problems with miscompilations of the GMP code. It has now come to the point where a compiler should be assumed to miscompile GMP. 
""" Later details of issues with the current release include: Garbage from some ternary ops (a=c+b*a) with the C++ wrappers (would that apply to C++ python extensions?) crash bugs It doesn't work on the Intel Macintoshes, and the workarounds are so ugly that they won't be applied to the trunk. -jJ From nick.bastin at gmail.com Mon Sep 10 00:14:45 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sun, 9 Sep 2007 18:14:45 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> Message-ID: <66d0a6e10709091514k15d81759h488c5b29ccd63bc7@mail.gmail.com> On 9/9/07, Jim Jewett wrote: > On 9/8/07, Nicholas Bastin wrote: > > On 9/8/07, "Martin v. L?wis" wrote: > > > > Speaking of PyLong, and its' minor awkwardness to work with in C (you > > > > either have to convert to another multiple-precision type through a > > > > string, or use Python's arithmetic operators directly), was there any > > > > thought given to using something like GPM's mpz_t as the backing data > > > > type? > > > Would anyone be opposed to rehosting PyLong on top of GMP? > > (1) If there are concerns about the RCA attribution license, I would > expect much greater concerns about LGPL. Maybe, but I'd rather have a technical discussion than a licensing discussion. If GMP doesn't stand up for technical reasons, then the licensing discussion was a waste of time without resolving whether it would be a good technical decision or not. > (2) License aside, does it really solve the problem you had about > needing to convert or use Python's arithmetic operations? At first > glance, it looks like you would still have the same problem, except > that you would need to use the GMP functions instead of the python > functions. 
Yes, but the GMP function set is much richer than the Python one, and more efficient. GMP is in fact the thing I most convert PyLong to (via a string, which is, as you might imagine, not that efficient). Obviously if we're going to support numbers larger than the host language (in this case, C) natively supports, there's going to be some other API involved. > (3) Is it stable enough? I know it has been developed since 1991, > but they seem to focus on high performance for truly huge numbers. I > suspect the vast majority of python programs would perform fine if > they were limited to C ints, and so the extra costs may not be worth > it. In a little test, integer math (not-long-requiring) in 3.0 is 2.3x slower than the same integer math in 2.6. Here is my test code:

inttest.py:

    def int_test(rounds):
        index = 0
        while index < rounds:
            foo = 0
            while foo < 10000000:
                foo += 1
                .... (above line repeated 100 times)
            index += 1

3.0:
    python Lib\timeit.py "import inttest; inttest.int_test (5)"
    10 loops, best of 3: 6.01 sec per loop

2.6:
    python Lib\timeit.py "import inttest; inttest.int_test (5)"
    10 loops, best of 3: 2.64 sec per loop

I welcome other benchmarks if people think there's something fundamentally wrong with my test. > It doesn't work on the Intel Macintoshes, and the workarounds are so > ugly that they won't be applied to the trunk. This is clearly a deal killer, thanks for pointing that out. I would however continue to ask the general question - do we really want to maintain our own arbitrary precision math library (which we now use exclusively)? Who is committing to optimizing the performance of PyLong? 
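For anyone wanting to rerun a benchmark along these lines, a self-contained variant is sketched below. It is not the original inttest.py: the inner increment appears once rather than 100 times, the loop bound is a parameter (`limit`, a name introduced here) so a run can be kept short, and it drives the standard `timeit` module from Python rather than from the command line. Absolute timings will depend on the interpreter and machine.

```python
import timeit


def int_test(rounds, limit=10_000_000):
    """Count from 0 up to `limit` with small-int additions, `rounds` times.

    Returns the sum of the final counters so the work has a checkable result.
    """
    total = 0
    for _ in range(rounds):
        foo = 0
        while foo < limit:
            foo += 1
        total += foo
    return total


if __name__ == "__main__":
    # A small limit keeps the sketch quick; raise it for real measurements.
    timings = timeit.repeat(lambda: int_test(5, limit=100_000), number=1, repeat=3)
    print(f"best of 3: {min(timings):.3f} sec per loop")
```

Running the same file under two interpreter versions and comparing the "best of 3" figures gives a rough ratio comparable to the one quoted above.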
-- Nick From greg.ewing at canterbury.ac.nz Mon Sep 10 01:01:01 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 10 Sep 2007 11:01:01 +1200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E3B12E.1000703@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> Message-ID: <46E47B2D.6020608@canterbury.ac.nz> Martin v. Löwis wrote: > a) Accompany the work with the complete corresponding machine-readable > source code for the Library But if it's like the regular GPL, you can just tell people where to get the source -- you don't have to physically provide it yourself. -- Greg From nick.bastin at gmail.com Mon Sep 10 01:38:28 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sun, 9 Sep 2007 19:38:28 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E47B2D.6020608@canterbury.ac.nz> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <46E47B2D.6020608@canterbury.ac.nz> Message-ID: <66d0a6e10709091638q762f010bu7605f1793236177a@mail.gmail.com> On 9/9/07, Greg Ewing wrote: > Martin v. Löwis wrote: > > a) Accompany the work with the complete corresponding machine-readable > > source code for the Library > > But if it's like the regular GPL, you can just tell people > where to get the source -- you don't have to physically > provide it yourself. You technically have to have a written agreement with the people who provide the source that they will continue to do so. 
This is why I suggested that python.org could just host the source and provide that agreement to other distributors of Python. We could ask the GMP folks for those assurances as well, but that point appears moot as there are technical issues with using the library (which is what I was really trying to get at in the first place). I still think we should investigate other arbitrary precision math libraries, or have someone commit to meeting certain performance goals for PyLong. -- Nick From greg.ewing at canterbury.ac.nz Mon Sep 10 01:46:57 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 10 Sep 2007 11:46:57 +1200 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> Message-ID: <46E485F1.6030503@canterbury.ac.nz> Jim Jewett wrote: > It has now come to the point where a > compiler should be assumed to miscompile GMP. > ... > It doesn't work on the Intel Macintoshes, and the workarounds are so > ugly that they won't be applied to the trunk. Sounds like it's been optimised for speed over portability in a really extreme way. I wouldn't go anywhere near code like that. 
-- Greg From greg.ewing at canterbury.ac.nz Mon Sep 10 02:13:04 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 10 Sep 2007 12:13:04 +1200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E3BBE7.4020800@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> Message-ID: <46E48C10.7010705@canterbury.ac.nz> Martin v. Löwis wrote: > b) Use a suitable shared library mechanism for linking with the Library. > A suitable mechanism is one that (1) uses at run time a copy of the > library already present on the user's computer system, rather than > copying library functions into the executable, and (2) will operate > properly with a modified version of the library, if the user installs > one, as long as the modified version is interface-compatible with the > version that the work was made with. > > So this is only an option if "a copy of the library [is] already > present on the user's computer system". This may work for Linux, > but not for Windows, or Solaris (not sure about OSX). I think it's just trying to say dynamic rather than static linking, not that the library has to be a pre-existing one. The important thing is that the library can be updated just by replacing a file, without having to re-link the executable. So Windows DLLs qualify, as far as I can see. 
-- Greg From jimjjewett at gmail.com Mon Sep 10 02:58:34 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Sun, 9 Sep 2007 20:58:34 -0400 Subject: [Python-3000] Performance Notes - new hash algorithm In-Reply-To: <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com> References: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com> Message-ID: On 9/8/07, Tim Peters wrote: > in comments in dictobject.c. As it notes there, hashing the strings > "namea", "nameb", "namec", and "named" currently produces (on a > sizeof(long) == 4 box): > -1658398457 > -1658398460 > -1658398459 > -1658398462 > That the hash codes are very close but not identical is "a feature", > since the dict implementation only looks at the last k bits (for > various more-or-less small values of k): this gives "better than > random" dict collision behavior for input strings very close together. > The proposed hash produces instead: > 1892683363 > -970432008 > 51735791 > 1567337715 > > Obviously much closer to "random" behavior, but that's not necessarily > a good thing for dicts. To spell this out a bit more: For cryptography, you want a "random" hash function. For hash tables, you just want one that spreads out your actual input. For strings, this tends to mean short strings that look like possible variable names. Because they often *are* variable names, they are sometimes sequential, like var_a, var_b, var_c. In the current CPython implementation, dicts start as a size-8 smalldict, and most dicts never grow beyond that. So the effective hash is really (hash%8) When adding four entries to an 8-slot table, a truly random hash would have at least one collision about 59% of the time (1 - (8*7*6*5)/8**4; the sum 0/8 + 1/8 + 2/8 + 3/8 = 3/4 is only an upper bound). As expected, the proposed hash does have a collision for those four values (the first and fourth). 
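The low-bit arithmetic above can be checked directly. The sketch below is only an illustration, not CPython's actual lookup code: it assumes an 8-slot table indexed by the low three bits (hash & 7), ignores probing after a collision, and reuses the eight hash values quoted from Tim's message. It also computes the exact birthday-style chance that four uniformly random hashes share a slot; a running sum such as 0/8 + 1/8 + 2/8 + 3/8 is an upper bound on that chance. The names `slots` and `collision_probability` are made up for this example.

```python
from fractions import Fraction

# Hash values quoted in the thread for "namea", "nameb", "namec", "named".
CURRENT_HASHES = [-1658398457, -1658398460, -1658398459, -1658398462]
PROPOSED_HASHES = [1892683363, -970432008, 51735791, 1567337715]


def slots(hashes, table_size=8):
    """Return the initial slot of each hash in a power-of-two table.

    Assumption: the slot is just the low bits of the hash (hash & mask).
    """
    mask = table_size - 1
    return [h & mask for h in hashes]


def collision_probability(entries, table_size=8):
    """Exact chance that `entries` uniformly random hashes share a slot."""
    no_collision = Fraction(1)
    for already_used in range(entries):
        no_collision *= Fraction(table_size - already_used, table_size)
    return 1 - no_collision


current_slots = slots(CURRENT_HASHES)
proposed_slots = slots(PROPOSED_HASHES)
print(current_slots, len(set(current_slots)))    # four distinct slots
print(proposed_slots, len(set(proposed_slots)))  # first and fourth coincide
print(float(collision_probability(4)))           # roughly 0.59
```

With these inputs the current hash lands the four names in four different slots, while the proposed hash puts the first and fourth in the same one.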
The current hash function does not collide for strings that change only one character to the "next" in ASCIIbetical order until the 9th string -- at which time you need to resize anyhow. For larger tables, having them close still doesn't cause a problem, and may even be useful if you do decide to sort the keys. (CPython lists use a "timsort" that takes advantage of partially sorted input, so if the iterator gets them close to sorted initially, that can help.) -jJ From jimjjewett at gmail.com Mon Sep 10 03:27:36 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Sun, 9 Sep 2007 21:27:36 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E48C10.7010705@canterbury.ac.nz> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> Message-ID: On 9/9/07, Greg Ewing wrote: > I think it's just trying to say dynamic rather than static ... > library can be updated just by replacing a file, ... > So Windows DLLs qualify, as far as I can see. How many external library calls would need to be resolved at runtime for the following code? for x in range(N): x = 0 while x < N: # Would this comparison be external? x +=1 # And this incf? If python handled small ints itself, and only farmed out the "large" ones, I think the situation would be worse than today, as extensions would still need to support two forms of integer, but they wouldn't even know which was going to be used for a given numeric value. (Unless GMP were modified to return the python version for small ones... in which case we have a fork.) 
And since we would still have the object headers of python, I suspect it still wouldn't be as simple as just using GMP routines. -jJ From tim.peters at gmail.com Mon Sep 10 03:32:16 2007 From: tim.peters at gmail.com (Tim Peters) Date: Sun, 9 Sep 2007 21:32:16 -0400 Subject: [Python-3000] Performance Notes - new hash algorithm In-Reply-To: <46E3E15C.8040801@hastings.org> References: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com> <46E3596F.3090606@hastings.org> <9e804ac0709090213q4c8f7431oa93037efb36e009e@mail.gmail.com> <46E3E15C.8040801@hastings.org> Message-ID: <1f7befae0709091832m3ff970a7v864757a0c138071f@mail.gmail.com> [Larry Hastings] > I see--it's avoiding the Birthday Paradox. It /tends/ to, yes. This wasn't a design goal of the string hash, it's just a property observed after it was adopted, and appreciated much later ;-) It's much clearer for Python's small-int hash, where hash(i) == i for i != -1. That is, nearly all "small enough" integers are their own "hash codes". That guarantees no collisions whatsoever in a dict keyed by a contiguous range of small integers (excluding -1), no matter how large the range. Read the comments in dictobject.c for more on this. The predictability of such hash schemes has both good & bad implications for dict performance, and Python's dict conflict-resolution strategies are fancier than most to mitigate the possible bad implications. > Collisions are actually more likely if the numbers are totally random than > if the numbers are, because of a feeble hash algorithm, relatively > consecutive. ;) Right. In a "good" (cryptographically speaking) hash function, a 1-bit change in the input "should" change about half the output bits (in the hash code), making collisions much more likely when the keys differ little in the low bits. 
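The small-int property is easy to observe. A minimal sketch, assuming CPython (other implementations need not hash ints this way) and a power-of-two table indexed by the low bits of the hash; `initial_slots` is a hypothetical helper, and real dicts add probing on top of this:

```python
def initial_slots(keys, table_size=8):
    """Hypothetical helper: low-bits slot for each key's hash."""
    return [hash(k) & (table_size - 1) for k in keys]


# In CPython, small ints are their own hash codes ...
small_ints_hash_to_self = all(hash(i) == i for i in range(1000))

# ... except -1, which CPython reserves as an error return in its C API.
hash_of_minus_one = hash(-1)

# A contiguous run of small ints fills distinct slots, wherever the run starts.
run = range(100, 108)
run_slots = initial_slots(run)

if __name__ == "__main__":
    print(small_ints_hash_to_self, hash_of_minus_one, run_slots)
```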
The important point is that the cost of building and accessing string-keyed dicts is more important (in Python) than the cost of just hashing strings, and collision resolution is a real expense. From nick.bastin at gmail.com Mon Sep 10 04:41:07 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sun, 9 Sep 2007 22:41:07 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> Message-ID: <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> On 9/9/07, Jim Jewett wrote: > On 9/9/07, Greg Ewing wrote: > > > I think it's just trying to say dynamic rather than static ... > > library can be updated just by replacing a file, ... > > > So Windows DLLs qualify, as far as I can see. > > How many external library calls would need to be resolved at runtime > for the following code? > > for x in range(N): > > x = 0 > while x < N: # Would this comparison be external? > x +=1 # And this incf? > > If python handled small ints itself, and only farmed out the "large" > ones, I think the situation would be worse than today, as extensions > would still need to support two forms of integer, but they wouldn't > even know which was going to be used for a given numeric value. For the current implementation in 3.0, for C API extension writers, this is practically already the case. The same type is used everywhere, but you have to test if it is out of range for C types, and then extract it as a string to put in some other long integer type, or work with it using the Python C API exclusively. 
I'm not suggesting that Python handle small ints itself and then farm out large integer computations, I'm suggesting that since we've already coalesced small ints into 'large' ones, we might want to review the performance implications of that decision, and possibly consider that other people have already solved this problem. Clearly GMP appears to fail on a technical level, but there might be other options worth investigating. -- Nick From guido at python.org Mon Sep 10 05:38:26 2007 From: guido at python.org (Guido van Rossum) Date: Sun, 9 Sep 2007 20:38:26 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> Message-ID: On 9/9/07, Nicholas Bastin wrote: > I'm not suggesting that Python handle small ints itself and then farm > out large integer computations, I'm suggesting that since we've > already coalesced small ints into 'large' ones, we might want to > review the performance implications of that decision, and possibly > consider that other people have already solved this problem. Clearly > GMP appears to fail on a technical level, but there might be other > options worth investigating. The performance problems that are affecting us most are for small-value ints. The old PyInt type has many custom optimizations to help. I think we could do worse than re-introducing some of the same tricks, retargeted to PyLong (which never got much attention for small-value performance). 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From nick.bastin at gmail.com Mon Sep 10 05:53:53 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sun, 9 Sep 2007 23:53:53 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> Message-ID: <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> On 9/9/07, Guido van Rossum wrote: > On 9/9/07, Nicholas Bastin wrote: > > I'm not suggesting that Python handle small ints itself and then farm > > out large integer computations, I'm suggesting that since we've > > already coalesced small ints into 'large' ones, we might want to > > review the performance implications of that decision, and possibly > > consider that other people have already solved this problem. Clearly > > GMP appears to fail on a technical level, but there might be other > > options worth investigating. > > The performance problems that are affecting us most are for > small-value ints. The old PyInt type has many custom optimizations to > help. I think we could do worse than re-introducing some of the same > tricks, retargeted to PyLong (which never got much attention for > small-value performance). I did redo my benchmark using 200 as the increment number instead of 1, to duck any impact from the interning of small value ints in 2.6, and it made no discernible difference in the results. 
-- Nick From martin at v.loewis.de Mon Sep 10 07:13:23 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 10 Sep 2007 07:13:23 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E48C10.7010705@canterbury.ac.nz> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> Message-ID: <46E4D273.9080300@v.loewis.de> > I think it's just trying to say dynamic rather than static > linking, not that the library has to be a pre-existing > one. The important thing is that the library can be > updated just by replacing a file, without having to > re-link the executable. > > So Windows DLLs qualify, as far as I can see. No no no no no. As with the GPL, the important point is that the user of the library has ready access to the source code. Every binary of the library must be accompanied by the source code, where "accompanied" means either "included in the installation media", "downloadable from the same source", or "promised in writing". The first right of the user is to get the source code easily, without having to beg for it. Only then it is also the user's right to modify it, and use the modified version in the application. So normally, the application's task would be to provide source code. However, if the application links with a shared library already on the system, it is the system vendor's task to provide source code - which is the common case on Linux. So in that case, the application vendor can be cleared of having to provide source code. 
Therefore, Windows DLLs would only qualify if Microsoft provided them, as then Microsoft would also have to provide the source code. Regards, Martin From qrczak at knm.org.pl Mon Sep 10 14:04:08 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Mon, 10 Sep 2007 14:04:08 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> Message-ID: <1189425848.7656.19.camel@qrnik> On 09-09-2007 (Sun), at 21:27 -0400, Jim Jewett wrote: > If python handled small ints itself, and only farmed out the "large" > ones, If GMP is used, it's definitely worth having a non-GMP representation for small integers, because GMP itself does not do it. A GMP integer is represented by a pointer to digits, the allocated size, and the used size multiplied by the sign; no special cases here. (The fact that GMP does not do it is good for people who want to make a super-compact representation themselves. GMP optimization for the same case would be wasted. It requires some work to implement overflow detection, but it yields a very good final result.) The major technical problem with GMP is that an out-of-memory condition during computation is a fatal error; GMP does not provide a way to recover from it.
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From eric+python-dev at trueblade.com Mon Sep 10 16:51:21 2007 From: eric+python-dev at trueblade.com (Eric Smith) Date: Mon, 10 Sep 2007 10:51:21 -0400 Subject: [Python-3000] __format__ and datetime Message-ID: <46E559E9.4090907@trueblade.com> I have a patch to add __format__ to datetime.time, .date, and .datetime. For non-empty format_spec's, I just pass on to .strftime. For empty format_spec's, it returns str(self). I think this is the only reasonable interpretation of format_spec's for datetime. Does anyone think otherwise? Eric. From martin at v.loewis.de Mon Sep 10 16:56:05 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 10 Sep 2007 16:56:05 +0200 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E559E9.4090907@trueblade.com> References: <46E559E9.4090907@trueblade.com> Message-ID: <46E55B05.3090701@v.loewis.de> > I have a patch to add __format__ to datetime.time, .date, and .datetime. > For non-empty format_spec's, I just pass on to .strftime. For empty > format_spec's, it returns str(self). > > I think this is the only reasonable interpretation of format_spec's for > datetime. Does anyone think otherwise? Can you please show an example of what it would look like? Regards, Martin From eric+python-dev at trueblade.com Mon Sep 10 17:16:36 2007 From: eric+python-dev at trueblade.com (Eric Smith) Date: Mon, 10 Sep 2007 11:16:36 -0400 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E55B05.3090701@v.loewis.de> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> Message-ID: <46E55FD4.9000807@trueblade.com> Martin v. Löwis wrote: >> I have a patch to add __format__ to datetime.time, .date, and .datetime. >> For non-empty format_spec's, I just pass on to .strftime. For empty >> format_spec's, it returns str(self).
>> >> I think this is the only reasonable interpretation of format_spec's for >> datetime. Does anyone think otherwise? > > Can you please show an example of what it would look like?
>>> import datetime
>>> format(datetime.datetime.now(), 'date: %Y-%m-%d time:%H:%M:%s')
'date: 2007-09-10 time:11:15:1189437339'
>>> format(datetime.datetime.now(), '')
'2007-09-10T11:15:51.329639'
From p.f.moore at gmail.com Mon Sep 10 17:29:56 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Mon, 10 Sep 2007 16:29:56 +0100 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E55FD4.9000807@trueblade.com> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> Message-ID: <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> On 10/09/2007, Eric Smith wrote: > Martin v. Löwis wrote: > >> I have a patch to add __format__ to datetime.time, .date, and .datetime. > >> For non-empty format_spec's, I just pass on to .strftime. For empty > >> format_spec's, it returns str(self). > >> > >> I think this is the only reasonable interpretation of format_spec's for > >> datetime. Does anyone think otherwise? > > > > Can you please show an example of what it would look like? >
> >>> import datetime
> >>> format(datetime.datetime.now(), 'date: %Y-%m-%d time:%H:%M:%s')
> 'date: 2007-09-10 time:11:15:1189437339'
> >>> format(datetime.datetime.now(), '')
> '2007-09-10T11:15:51.329639'
I'd like to see the default format specified (somewhere). I note that the default format for datetime values seems to differ for me (on 3.0a1 on Windows):
Python 3.0a1 (py3k:57844, Aug 31 2007, 16:54:27) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime
>>> str(datetime.datetime.now())
'2007-09-10 16:26:25.218000'
(Note lack of 'T').
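For reference, a minimal sketch of the behaviour under discussion as it can be observed in released Python 3 versions (the patch being discussed may have differed in detail). A fixed datetime is used instead of now() so the results are reproducible. It also shows one place a 'T' can come from: str() separates date and time with a space, while isoformat() uses 'T' by default.

```python
import datetime

# Fixed value (chosen for illustration) so the output does not depend on the clock.
d = datetime.datetime(2007, 9, 10, 11, 15, 51, 329639)

# Non-empty format spec: handled like strftime().
formatted = format(d, "date: %Y-%m-%d time:%H:%M:%S")

# Empty format spec: same as str(d), which uses a space separator.
default = format(d, "")

# isoformat() uses 'T' between date and time unless told otherwise.
iso = d.isoformat()

if __name__ == "__main__":
    print(formatted)
    print(default)
    print(iso)
```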
I'm not sure I like 6 decimal places of seconds to be the default format, either, but consistency (with str()) and accuracy (however extreme) may be more important here... The date and time defaults (which appear to be %Y-%m-%d and %H:%M:%S) seem perfectly acceptable, on the other hand. Paul. From eric+python-dev at trueblade.com Mon Sep 10 17:31:23 2007 From: eric+python-dev at trueblade.com (Eric Smith) Date: Mon, 10 Sep 2007 11:31:23 -0400 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E55FD4.9000807@trueblade.com> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> Message-ID: <46E5634B.4050405@trueblade.com> Eric Smith wrote: > Martin v. Löwis wrote: >>> I have a patch to add __format__ to datetime.time, .date, and .datetime. >>> For non-empty format_spec's, I just pass on to .strftime. For empty >>> format_spec's, it returns str(self). >>> >>> I think this is the only reasonable interpretation of format_spec's for >>> datetime. Does anyone think otherwise? >> Can you please show an example of what it would look like? >
> >>> import datetime
> >>> format(datetime.datetime.now(), 'date: %Y-%m-%d time:%H:%M:%s')
> 'date: 2007-09-10 time:11:15:1189437339'
> >>> format(datetime.datetime.now(), '')
> '2007-09-10T11:15:51.329639'
Oops, that should have been '%S':
>>> format(datetime.datetime.now(), 'date: %Y-%m-%d time:%H:%M:%S')
'date: 2007-09-10 time:11:28:12'
I'm not sure what strftime does with '%s'; I don't see it documented.
>>> datetime.datetime.now().strftime('%s')
'1189438155'
From thomas at python.org Mon Sep 10 17:33:57 2007 From: thomas at python.org (Thomas Wouters) Date: Mon, 10 Sep 2007 17:33:57 +0200 Subject: [Python-3000] [Python-3000-checkins] r58068 - in python/branches/py3k: Doc/library/exceptions.rst Doc/library/socket.rst Doc/whatsnew/2.6.rst Lib/test/test_urllib2net.py Lib/urllib2.py Modules/socketmodule.c In-Reply-To: <20070909235556.04BA71E400F@bag.python.org> References: <20070909235556.04BA71E400F@bag.python.org> Message-ID: <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> On 9/10/07, gregory.p.smith wrote: > > Author: gregory.p.smith > Date: Mon Sep 10 01:55:55 2007 > New Revision: 58068 > > Modified: > python/branches/py3k/Doc/library/exceptions.rst > python/branches/py3k/Doc/library/socket.rst > python/branches/py3k/Doc/whatsnew/2.6.rst > python/branches/py3k/Lib/test/test_urllib2net.py > python/branches/py3k/Lib/urllib2.py > python/branches/py3k/Modules/socketmodule.c > Log: > merge this from trunk: Please do these merges with svnmerge. Otherwise, the bookkeeping of what was merged or not gets all messed up, and the next person to use svnmerge will be in a world of hurt. (I know, I've been there.)
    py3k% svnmerge merge -r58067
    [ resolve conflicts, configure, make, make test ]
    py3k% svn commit -F svnmerge-commit-message.txt
svnmerge should come with svn, nowadays, or you can download it separately (as svnmerge.py, probably; it's just a Python script.) Alternatively, if you know what you're doing, you can edit the svnmerge-integrated property on the branch directly -- but don't mess it up :) -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070910/94ad3eb0/attachment.htm From janssen at parc.com Mon Sep 10 18:11:04 2007 From: janssen at parc.com (Bill Janssen) Date: Mon, 10 Sep 2007 09:11:04 PDT Subject: [Python-3000] [Python-3000-checkins] r58068 - in python/branches/py3k: Doc/library/exceptions.rst Doc/library/socket.rst Doc/whatsnew/2.6.rst Lib/test/test_urllib2net.py Lib/urllib2.py Modules/socketmodule.c In-Reply-To: <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> References: <20070909235556.04BA71E400F@bag.python.org> <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> Message-ID: <07Sep10.091110pdt."57996"@synergy1.parc.xerox.com> > svnmerge should come with svn, nowadays, or you can download it separately > (as svnmerge.py, probably; it's just a Python script.) It comes with version 3 of svn. Or http://svn.collab.net/repos/svn/trunk/contrib/client-side/svnmerge/svnmerge.py. Bill From janssen at parc.com Mon Sep 10 18:30:52 2007 From: janssen at parc.com (Bill Janssen) Date: Mon, 10 Sep 2007 09:30:52 PDT Subject: [Python-3000] [Python-3000-checkins] r58068 - in python/branches/py3k: Doc/library/exceptions.rst Doc/library/socket.rst Doc/whatsnew/2.6.rst Lib/test/test_urllib2net.py Lib/urllib2.py Modules/socketmodule.c In-Reply-To: <07Sep10.091110pdt."57996"@synergy1.parc.xerox.com> References: <20070909235556.04BA71E400F@bag.python.org> <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> <07Sep10.091110pdt."57996"@synergy1.parc.xerox.com> Message-ID: <07Sep10.093055pdt."57996"@synergy1.parc.xerox.com> > It comes with version 3 of svn. Sorry, that should be 1.3. But I see I've got version 1.4.4 installed, and no svnmerge. Of course, this is Apple's XCode version of svn. 
Bill From guido at python.org Mon Sep 10 18:38:40 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 10 Sep 2007 09:38:40 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> Message-ID: On 9/9/07, Nicholas Bastin wrote: > On 9/9/07, Guido van Rossum wrote: > > On 9/9/07, Nicholas Bastin wrote: > > > I'm not suggesting that Python handle small ints itself and then farm > > > out large integer computations, I'm suggesting that since we've > > > already coalesced small ints into 'large' ones, we might want to > > > review the performance implications of that decision, and possibly > > > consider that other people have already solved this problem. Clearly > > > GMP appears to fail on a technical level, but there might be other > > > options worth investigating. > > > > The performance problems that are affecting us most are for > > small-value ints. The old PyInt type has many custom optimizations to > > help. I think we could do worse than re-introducing some of the same > > tricks, retargeted to PyLong (which never got much attention for > > small-value performance). > > I did redo my benchmark using 200 as the increment number instead of > 1, to duck any impact from the interning of small value ints in 2.6, > and it made no discernible difference in the results. I'm sorry, I've lost context. I'm not at all clear at this point what benchmark you might have ran. 
Note that when I said "small values" I meant (in part) anything that fits in a Python long -- while there's a special cache in 2.x for ints < 100, there's also a special allocator that outperforms the obmalloc allocator. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg at krypto.org Mon Sep 10 18:41:30 2007 From: greg at krypto.org (Gregory P. Smith) Date: Mon, 10 Sep 2007 09:41:30 -0700 Subject: [Python-3000] [Python-3000-checkins] r58068 - in python/branches/py3k: Doc/library/exceptions.rst Doc/library/socket.rst Doc/whatsnew/2.6.rst Lib/test/test_urllib2net.py Lib/urllib2.py Modules/socketmodule.c In-Reply-To: <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> References: <20070909235556.04BA71E400F@bag.python.org> <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> Message-ID: <52dc1c820709100941m66d2a5b2v156ac9d0a471a87b@mail.gmail.com> On 9/10/07, Thomas Wouters wrote: > > > On 9/10/07, gregory.p.smith wrote: > > > > Author: gregory.p.smith > > Date: Mon Sep 10 01:55:55 2007 > > New Revision: 58068 > > > > Modified: > > python/branches/py3k/Doc/library/exceptions.rst > > python/branches/py3k/Doc/library/socket.rst > > python/branches/py3k/Doc/whatsnew/2.6.rst > > python/branches/py3k/Lib/test/test_urllib2net.py > > python/branches/py3k/Lib/urllib2.py > > python/branches/py3k/Modules/socketmodule.c > > Log: > > merge this from trunk: > > > Please do these merges with snvmerge. Otherwise, the bookkeeping of what > was merged or not gets all messed up, and the next person to use svnmerge > will be in a world of hurt. (I know, I've been there.) > > py3k% svnmerge merge -r58067 > [ resolve conflicts, configure, make, make test ] > py3k% svn commit -F svnmerge-commit-message.txt > > svnmerge should come with svn, nowadays, or you can download it separately > (as svnmerge.py, probably; it's just a Python script.) 
> > Alternatively, if you know what you're doing, you can edit the > svnmerge-integrated property on the branch directly -- but don't mess it up > :) Sorry about that & thanks for the pointers, I'll use svnmerge (instead of "svn merge" or "svn diff | patch" which i had been using) in the future. -gps -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070910/a3a5ccdc/attachment.htm From guido at python.org Mon Sep 10 18:42:05 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 10 Sep 2007 09:42:05 -0700 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E559E9.4090907@trueblade.com> References: <46E559E9.4090907@trueblade.com> Message-ID: On 9/10/07, Eric Smith wrote: > I have a patch to add __format__ to datetime.time, .date, and .datetime. > For non-empty format_spec's, I just pass on to .strftime. For empty > format_spec's, it returns str(self). > > I think this is the only reasonable interpretation of format_spec's for > datetime. Does anyone think otherwise? +1 -- --Guido van Rossum (home page: http://www.python.org/~guido/) From p.f.moore at gmail.com Mon Sep 10 18:55:54 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Mon, 10 Sep 2007 17:55:54 +0100 Subject: [Python-3000] [Python-3000-checkins] r58068 - in python/branches/py3k: Doc/library/exceptions.rst Doc/library/socket.rst Doc/whatsnew/2.6.rst Lib/test/test_urllib2net.py Lib/urllib2.py Modules/socketmodule.c In-Reply-To: <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> References: <20070909235556.04BA71E400F@bag.python.org> <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> Message-ID: <79990c6b0709100955i2cbca7dblbd6fd4ed32781ab2@mail.gmail.com> On 10/09/2007, Thomas Wouters wrote: > svnmerge should come with svn, nowadays, or you can download it separately > (as svnmerge.py, probably; it's just a Python script.) 
It's not part of the Win32 binary distribution for Subversion - but I found it at http://www.orcaware.com/svn/wiki/Svnmerge.py It doesn't seem to need the Subversion Python libraries. OTOH, I haven't tested it on Windows (but there seems to be Windows code in there, so I'm guessing it's meant to work :-)) Paul From mike.klaas at gmail.com Mon Sep 10 18:58:49 2007 From: mike.klaas at gmail.com (Mike Klaas) Date: Mon, 10 Sep 2007 09:58:49 -0700 Subject: [Python-3000] [Python-3000-checkins] r58068 - in python/branches/py3k: Doc/library/exceptions.rst Doc/library/socket.rst Doc/whatsnew/2.6.rst Lib/test/test_urllib2net.py Lib/urllib2.py Modules/socketmodule.c In-Reply-To: <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> References: <20070909235556.04BA71E400F@bag.python.org> <9e804ac0709100833t10461267l346a4ebfeabcaedf@mail.gmail.com> Message-ID: <1DF55068-6E1E-45E7-8CC6-4C10EF097A62@gmail.com> On 10-Sep-07, at 8:33 AM, Thomas Wouters wrote: > Alternatively, if you know what you're doing, you can edit the > svnmerge-integrated property on the branch directly -- but don't > mess it up :) > svnmerge also has a handy -M flag that marks a (set of) revisions as merged, but doesn't actually do any merging. 
-Mike From nick.bastin at gmail.com Mon Sep 10 19:58:47 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Mon, 10 Sep 2007 13:58:47 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> Message-ID: <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> On 9/10/07, Guido van Rossum wrote: > On 9/9/07, Nicholas Bastin wrote: > > On 9/9/07, Guido van Rossum wrote: > > > On 9/9/07, Nicholas Bastin wrote: > > > > I'm not suggesting that Python handle small ints itself and then farm > > > > out large integer computations, I'm suggesting that since we've > > > > already coalesced small ints into 'large' ones, we might want to > > > > review the performance implications of that decision, and possibly > > > > consider that other people have already solved this problem. Clearly > > > > GMP appears to fail on a technical level, but there might be other > > > > options worth investigating. > > > > > > The performance problems that are affecting us most are for > > > small-value ints. The old PyInt type has many custom optimizations to > > > help. I think we could do worse than re-introducing some of the same > > > tricks, retargeted to PyLong (which never got much attention for > > > small-value performance). > > > > I did redo my benchmark using 200 as the increment number instead of > > 1, to duck any impact from the interning of small value ints in 2.6, > > and it made no discernible difference in the results. > > I'm sorry, I've lost context. I'm not at all clear at this point what > benchmark you might have ran. 
I posted a tiny snippet of code earlier in the thread that was a sortof silly benchmark of integer math operations. > Note that when I said "small values" I meant (in part) anything that > fits in a Python long -- while there's a special cache in 2.x for ints > < 100, there's also a special allocator that outperforms the obmalloc > allocator. Yeah, my point was mostly an aside to anyone that might have questioned my earlier results of a 2.3x slowdown on integer-sized values because I used 1. A quick switch to 200 netted the exact same results, and a more extensive refactoring to get the same number of operations on a random set of larger numbers netted the same result as well. -- Nick From guido at python.org Mon Sep 10 20:16:43 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 10 Sep 2007 11:16:43 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> Message-ID: On 9/10/07, Nicholas Bastin wrote: > > > I did redo my benchmark using 200 as the increment number instead of > > > 1, to duck any impact from the interning of small value ints in 2.6, > > > and it made no discernible difference in the results. > > > > I'm sorry, I've lost context. I'm not at all clear at this point what > > benchmark you might have ran. > > I posted a tiny snippet of code earlier in the thread that was a > sortof silly benchmark of integer math operations. Can you report the exact code after all the changes you made, *and* the results that you are now comparing? 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From nick.bastin at gmail.com Mon Sep 10 21:24:26 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Mon, 10 Sep 2007 15:24:26 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> Message-ID: <66d0a6e10709101224j4cbe900dsb8aa52bd7259e66a@mail.gmail.com> On 9/10/07, Guido van Rossum wrote: > On 9/10/07, Nicholas Bastin wrote: > > > > I did redo my benchmark using 200 as the increment number instead of > > > > 1, to duck any impact from the interning of small value ints in 2.6, > > > > and it made no discernible difference in the results. > > > > > > I'm sorry, I've lost context. I'm not at all clear at this point what > > > benchmark you might have ran. > > > > I posted a tiny snippet of code earlier in the thread that was a > > sortof silly benchmark of integer math operations. > > Can you report the exact code after all the changes you made, *and* > the results that you are now comparing?

Simple example code:

inttest.py:

def int_test2(rounds):
    index = 0
    while index < rounds:
        foo = 0
        while foo < 200000000:
            foo += 200
            # .... above line repeated 99 more times
        index += 1

python timeit.py "import inttest; inttest.int_test2(5)"

3.0: 10 loops, best of 3: 6.76 sec per loop
2.6: 10 loops, best of 3: 2.61 sec per loop

The case of foo += 200 actually performs worse in 3.0 than foo += 1, although 2.6 is consistent using either value. This is on Windows XP Pro, Pentium D 3.00 ghz (dual core). Python was invoked with REALTIME process priority with thread affinity set to 1. Without thread affinity, 3.0 averaged 7.15 seconds per loop and 2.6 averaged 2.64 seconds per loop.
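A minimal, self-contained sketch of the kind of micro-benchmark described above, for anyone who wants to repeat the comparison on their own interpreters. It is not the exact script from the post: the function name int_test, the limit and step parameters, and the much smaller default loop bound are assumptions chosen so it finishes quickly, and the inner add is not unrolled 100 times.

```python
import timeit


def int_test(rounds, limit=200_000, step=200):
    """Repeatedly add a small int in a tight loop; return the number of adds."""
    total_adds = 0
    for _ in range(rounds):
        foo = 0
        while foo < limit:
            foo += step          # the integer add being measured
            total_adds += 1      # bookkeeping only, adds a little overhead
    return total_adds


if __name__ == "__main__":
    # Mirror the "10 loops, best of 3" reporting style of timeit's command line.
    best = min(timeit.repeat(lambda: int_test(5), number=10, repeat=3))
    print("10 loops, best of 3: %.4f sec per loop" % (best / 10))
```

Running the same file under each interpreter being compared, on the same machine and build settings, gives numbers that are at least comparable with one another, although the absolute values will not match the figures quoted in this thread.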
-- Nick From greg.ewing at canterbury.ac.nz Tue Sep 11 02:07:39 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 11 Sep 2007 12:07:39 +1200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E4D273.9080300@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> Message-ID: <46E5DC4B.6030304@canterbury.ac.nz> Martin v. Löwis wrote: > The first right of the user is to get the source code > easily, without having to beg for it. Only then it is also > the user's right to modify it, and use the modified version > in the application. Where does begging come into it? As long as the user is provided with information which allows them to easily obtain the source, there shouldn't be a problem. What does "from the same source" mean, anyway? On the same hard disk? On a disk connected to the same computer? On a server in the same room? Same building? Owned by the same person/company? If there's a link on the same web page that works when the user clicks on it, I don't think they're even going to notice the difference.
-- Greg From larry at hastings.org Tue Sep 11 02:17:22 2007 From: larry at hastings.org (Larry Hastings) Date: Mon, 10 Sep 2007 17:17:22 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E5DC4B.6030304@canterbury.ac.nz> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> Message-ID: <46E5DE92.8070808@hastings.org> Greg Ewing wrote: > If there's a link on the same web page that works > when the user clicks on it, I don't think they're > even going to notice the difference. They'll notice the difference when they want to redistribute Python, when they note the new licensing-based restrictions ("GMP must be in a user-replaceable shared library", "you must distribute the source to your GMP build"). I am opposed to using LGPL- or GPL-licensed code in Python. /larry/
From greg.ewing at canterbury.ac.nz Tue Sep 11 02:25:53 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 11 Sep 2007 12:25:53 +1200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <1189425848.7656.19.camel@qrnik> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <1189425848.7656.19.camel@qrnik> Message-ID: <46E5E091.5020405@canterbury.ac.nz> Marcin 'Qrczak' Kowalczyk wrote: > The major technical problem with GMP is that an out of memory condition > during computation is a fatal error, GMP does not provide a way to > recover from it. If using GMP itself is not feasible, then perhaps some algorithms could be extracted from it in areas where it does better than Python?
-- Greg From nick.bastin at gmail.com Tue Sep 11 02:48:22 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Mon, 10 Sep 2007 20:48:22 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E5DC4B.6030304@canterbury.ac.nz> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> Message-ID: <66d0a6e10709101748n2f4edf9di4dd073c5e7e7bd2f@mail.gmail.com> On 9/10/07, Greg Ewing wrote: > Martin v. Löwis wrote: > > > The first right of the user is to get the source code > > easily, without having to beg for it. Only then it is also > > the user's right to modify it, and use the modified version > > in the application. > > Where does begging come into it? As long as the user > is provided with information which allows them to > easily obtain the source, there shouldn't be a > problem. The FSF has clarified that this is all that it means. Technically you should have an agreement with whoever is providing the source that they will continue to do so, but it is probably sufficient to take that burden upon yourself if and only if they stop doing so.
-- Nick From nick.bastin at gmail.com Tue Sep 11 03:02:31 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Mon, 10 Sep 2007 21:02:31 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E5DE92.8070808@hastings.org> References: <1189270839.25695.18.camel@qrnik> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> Message-ID: <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> On 9/10/07, Larry Hastings wrote: > > Greg Ewing wrote: > If there's a link on the same web page that works > when the user clicks on it, I don't think they're > even going to notice the difference. > > They'll notice the difference when they want to redistribute Python, when > they note the new licensing-based restrictions ("GMP must be in a > user-replaceable shared library", "you must distribute the source to your > GMP build"). If python.org agreed to host the GMP source, that would suffice for all people distributing python binaries (they could then just refer to the GMP source download as a link). The FSF explicitly states that this kind of agreement satisfies that requirement of the license. As for the user-replaceable shared library part, that's up for considerable debate. It's unlikely that static linkage legally creates a derivative work (that would be pretty unreasonable in computer science terms), but it's never been tested in court, so static linking would probably be out for distributors without a legal department. 
-- Nick From eric+python-dev at trueblade.com Tue Sep 11 03:30:27 2007 From: eric+python-dev at trueblade.com (Eric Smith) Date: Mon, 10 Sep 2007 21:30:27 -0400 Subject: [Python-3000] __format__ and datetime In-Reply-To: <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> Message-ID: <46E5EFB3.7050809@trueblade.com> Paul Moore wrote: > I'd like to see the default format specified (somewhere). I note that > the default format for datetime values seems to differ for me (on > 3.0a1 on Windows) > > Python 3.0a1 (py3k:57844, Aug 31 2007, 16:54:27) [MSC v.1310 32 bit > (Intel)] on win32 > Type "help", "copyright", "credits" or "license" for more information. >>>> import datetime >>>> str(datetime.datetime.now()) > '2007-09-10 16:26:25.218000' > > (Note lack of 'T'). I'm not sure I like 6 decimal places of seconds to > be the default format, either, but consistency (with str()) and > accuracy (however extreme) may be more important here... This is my error. I caught it while adding tests, and I'll fix it before I check anything in. format(datetime.datetime.now(), '') will not have a 'T' in it, just as str(datetime.datetime.now()) doesn't. From skip at pobox.com Tue Sep 11 05:11:03 2007 From: skip at pobox.com (skip at pobox.com) Date: Mon, 10 Sep 2007 22:11:03 -0500 Subject: [Python-3000] __format__ and datetime In-Reply-To: <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> Message-ID: <18150.1863.436464.41503@montanaro.dyndns.org> Paul> The date and time defaults (which appear to be %Y-%m-%d and Paul> %H:%M:%s) seem perfectly acceptable, on the other hand. 
I would like to see an analog to %S which preserves fractions of a second as the default formatting for time and datetime objects does: >>> print(now) 2007-09-10 22:07:53.654774 >>> print(now.strftime("%H:%M:%S")) 22:07:53 >>> print(now.time()) 22:07:53.654774 Skip From guido at python.org Tue Sep 11 05:24:42 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 10 Sep 2007 20:24:42 -0700 Subject: [Python-3000] __format__ and datetime In-Reply-To: <18150.1863.436464.41503@montanaro.dyndns.org> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> Message-ID: Right. It's odd that there's nothing explicit that exactly produces the default. (Though floats have this issue too -- I wish it could be fixed there too.) On 9/10/07, skip at pobox.com wrote: > > Paul> The date and time defaults (which appear to be %Y-%m-%d and > Paul> %H:%M:%s) seem perfectly acceptable, on the other hand. 
> > I would like to see an analog to %S which preserves fractions of a second as > the default formatting for time and datetime objects does: > > >>> print(now) > 2007-09-10 22:07:53.654774 > >>> print(now.strftime("%H:%M:%S")) > 22:07:53 > >>> print(now.time()) > 22:07:53.654774 > > Skip > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 11 05:58:17 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 10 Sep 2007 20:58:17 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> Message-ID: I'd like to see Travis's response to this. It's setting a precedent regarding locking objects in read-only mode; I haven't found other examples of objects using LOCKDATA (the only mentions of it seem to be rejecting it :). I keep getting confused by the two separate lock counts (and I think in this version the comment is inconsistent with the code). So I'm hoping Travis has a particular way in mind of handling LOCKDATA that can be used as a template. Travis? --Guido On 9/8/07, Gregory P. Smith wrote: > A new version is attached; cleaned up and simplified based on your original > comments. > > On 8/29/07, Guido van Rossum < guido at python.org> wrote: > > That's a huge patch to land so close before a release. 
I'm not sure I > > like the immutability API -- it won't be useful unless we add a hash > > method, and then we have all sorts of difficulties again -- the > > distinction between a hashable and an unhashable object should be made > > by type, not by value (tuples containing unhashable values > > notwithstanding). > > ok i've removed the immutable support in the most recent patch. i still > think it -might- be useful but isn't required and you're right that it could > open a can of worms if people think it should also mean hashable. immutable > bytes may be best implemented as a subclass if its ever wanted. > > > I don't understand the comment about using PyBUF_WRITABLE in > > _getbuffer() -- this is only used for data we're *reading* and I don't > > think the GIL is even released while we're reading such things. > > that appears to be correct. the comment was wrong. fixed. > > -gps > > > > If you think it's important to get this in the 3.0a1 release, we > > should pair-program on it ASAP, preferable tomorrow morning. > > Otherwise, let's do a review next week. > > > > --Guido > > > > On 8/29/07, Gregory P. Smith < greg at krypto.org> wrote: > > > Attached is what I've come up with so far. Only a single field is > > > added to the PyBytesObject struct. This adds support to the bytes > > > object for PyBUF_LOCKDATA buffer API operation. bytes objects can be > > > marked temporarily read-only for use while the buffer api has handed > > > them off to something which may run without the GIL (think IO). Any > > > attempt to modify them during that time will raise an exception as I > > > believe Martin suggested earlier. > > > > > > As an added bonus because its been discussed here, support for setting > > > a bytes object immutable has been added since its pretty trivial once > > > the read only export support was in place. Thats not required but was > > > trivial to include. > > > > > > I'd appreciate any feedback. 
> > > > > > My TODO list for this patch: > > > > > > 0. Get feedback and make adjustments as necessary. > > > > > > 1. Deciding between PyBUF_SIMPLE and PyBUF_WRITEABLE for the internal > > > uses of the _getbuffer() function. bytesobject.c contains both > readonly > > > and read-write uses of the buffers, i'll add boolean parameter for > > > that. > > > > > > 2. More testing: a few tests in the test suite fail after this but the > > > number was low and I haven't had time to look at why or what the > > > failures were. > > > > > > 3. Exporting methods suggested in the TODO at the top of the file. > > > > > > 4. Unit tests for all of the functionality this adds. > > > > > > NOTE: after these changes I had to make clean and rm -rf build before > > > things would not segfault on import. I suspect some things (modules?) > > > were not properly recompiled after the bytesobject.h struct change > > > otherwise. > > > > > > -gps > > > > > -- > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From tjreedy at udel.edu Tue Sep 11 01:03:13 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 10 Sep 2007 19:03:13 -0400 Subject: [Python-3000] C API for ints and strings References: <1189270839.25695.18.camel@qrnik> <46E3B12E.1000703@v.loewis.de><66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com><46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz><66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com><66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> Message-ID: "Nicholas Bastin" wrote in
message news:66d0a6e10709101058n22b04bfakf67a15aea8e739f4 at mail.gmail.com... | Yeah, my point was mostly an aside to anyone that might have | questioned my earlier results of a 2.3x slowdown on integer-sized | values because I used 1. A quick switch to 200 netted the exact same | results, Currently, 200 is a small, cached int just as 1 is ([-10,256] or so is range). | and a more extensive refactoring to get the same number of | operations on a random set of larger numbers netted the same result as | well better test tjr From nick.bastin at gmail.com Tue Sep 11 06:50:57 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Tue, 11 Sep 2007 00:50:57 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> Message-ID: <66d0a6e10709102150k217adedblfc7cc7b57309f5a7@mail.gmail.com> On 9/10/07, Terry Reedy wrote: > > "Nicholas Bastin" wrote in message > news:66d0a6e10709101058n22b04bfakf67a15aea8e739f4 at mail.gmail.com... > > | Yeah, my point was mostly an aside to anyone that might have > | questioned my earlier results of a 2.3x slowdown on integer-sized > | values because I used 1. A quick switch to 200 netted the exact same > | results, > > Currently, 200 is a small, cached int just as 1 is ([-10,256] or so is > range). Interesting, I didn't look at the code (obviously), but my understanding was that it was only positive integers below 100. -- Nick From oliphant at enthought.com Tue Sep 11 07:10:48 2007 From: oliphant at enthought.com (Travis E. 
Oliphant) Date: Tue, 11 Sep 2007 00:10:48 -0500 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> Message-ID: <46E62358.3020404@enthought.com> Guido van Rossum wrote: > I'd like to see Travis's response to this. It's setting a precedent > regarding locking objects in read-only mode; I haven't found other > examples of objects using LOCKDATA (the only mentions of it seem to be > rejecting it :). I keep getting confused by the two separate lock > counts (and I think in this version the comment is inconsistent with > the code). So I'm hoping Travis has a particular way in mind of > handling LOCKDATA that can be used as a template. > > Travis? > The use case I had in mind comes about quite often in NumPy when you want to modify the data-area of an object which may have a non-contiguous chunk of memory, but the algorithm being used expects contiguous data. Imagine, for example, that the exporting object is an image whose rows are stored in different segments. The consumer of the buffer interface, however, may be an extension module that does fast image-processing operations and requires contiguous data. Because it wants to write the results back in to the memory area when it is done with the algorithm (which may be thread-safe and may release the GIL), it requests the object to lock its data to read-only so that other consumers do not try to get writeable buffers while it is processing. When the algorithm is done, it alone can write to the memory area and then when it releases the buffer, the original object will restore itself to being writeable. Of course, the exporting object must support this kind of operation and not all objects will. I expect the NumPy array object and the PIL to support it for example, and other media-centric objects. 
It would probably be useful if the bytes object supported it because then other objects could use it as the memory area. To do it correctly, the object exporting the interface must only allow locking if no other writeable interfaces have been exported (which it must keep track of) and then on release must check to see if the buffer that is being released is the one that locked its data. For a real-life example, NumPy has a flag called UPDATEIFCOPY that is a slightly different implementation of the concept. When this flag is set during conversion to an array, then if a copy must be made to satisfy the requirements, the original array is set as read-only and this special flag is set on the array. When the copy is deleted, its memory is automatically copied (and possibly casted, etc.) back into the original array. It is a nice abstraction of the concept of an output data area that was borrowed from Numarray and allows many things to be implemented very quickly in NumPy. One of the main things people use the NumPy C-API for is to get a contiguous chunk of memory from an array in order to do processing in another language (such as C or Fortran). It is nice to be able to specify that the result gets placed back into another chunk of memory (which may or may not be contiguous) in a unified fashion. NumPy handles all the copying for you. My thinking was that many people will want to be able to get contiguous chunks of memory, do processing, and then copy the result back into a segment of memory from a buffer-exporting object which is passed into the routine as an output object. I'm not sure if my explanations are helpful. Please let me know if I can explain further. 
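The bookkeeping described above (allow a lock only when no writable buffers are outstanding, and let only the locker release it) can be sketched in plain Python. This is a toy model under stated assumptions, not the C-level buffer API and not the patch under review: the class LockableBuffer and its methods get_writable, release_writable, lock, unlock and snapshot are hypothetical names invented for illustration.

```python
class LockableBuffer:
    """Toy exporter that can be locked read-only while one consumer works."""

    def __init__(self, data):
        self._data = bytearray(data)
        self._writable_exports = 0   # outstanding writable views
        self._lock_token = None      # identifies the consumer holding the lock

    def get_writable(self):
        # Refuse new writable views while the data is locked.
        if self._lock_token is not None:
            raise BufferError("data is locked read-only")
        self._writable_exports += 1
        return memoryview(self._data)

    def release_writable(self, view):
        view.release()
        self._writable_exports -= 1

    def lock(self):
        # Locking is only allowed when nobody else can write.
        if self._writable_exports or self._lock_token is not None:
            raise BufferError("cannot lock: writable exports or lock exist")
        self._lock_token = object()
        return self._lock_token

    def unlock(self, token, result=None):
        # Only the consumer that took the lock may release it.
        if token is not self._lock_token:
            raise BufferError("released by something other than the locker")
        if result is not None:
            if len(result) != len(self._data):
                raise ValueError("result must match the buffer length")
            self._data[:] = result   # write the processed data back
        self._lock_token = None

    def snapshot(self):
        return bytes(self._data)
```

A consumer would call lock(), work on a contiguous copy (possibly without the GIL in a real extension), then call unlock(token, result) to copy the result back, after which the object is writable again.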
-Travis From martin at v.loewis.de Tue Sep 11 07:22:37 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 11 Sep 2007 07:22:37 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E5DC4B.6030304@canterbury.ac.nz> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> Message-ID: <46E6261D.9010704@v.loewis.de> >> The first right of the user is to get the source code >> easily, without having to beg for it. Only then it is also >> the user's right to modify it, and use the modified version >> in the application. > > Where does begging come into it? As long as the user > is provided with information which allows them to > easily obtain the source, there shouldn't be a > problem. No. If the user got the software on a CD-ROM, he should not be required to use an internet connection to get the source. > What does "from the same source" mean, anyway? On > the same hard disk? On a disk connected to the same > computer? On a server in the same room? Same building? > Owned by the same person/company? Depends on how he gets the software. If the software was received by download, getting the source by download is fine. If the software was in a box he got by mail, the source should be in the same box (or a written offer to get the source in a box). > If there's a link on the same web page that works > when the user clicks on it, I don't think they're > even going to notice the difference. Certainly not. 
The "problem" is with copies you don't receive through download. E.g. if Python comes preinstalled in some device, that device should be accompanied directly with the source "on a medium customarily used for software interchange" (i.e. you should not just print out the source code in the handbook). Regards, Martin From martin at v.loewis.de Tue Sep 11 07:26:57 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 11 Sep 2007 07:26:57 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> Message-ID: <46E62721.4020009@v.loewis.de> > If python.org agreed to host the GMP source, that would suffice for > all people distributing python binaries (they could then just refer to > the GMP source download as a link). It would not if they don't distribute the binary through download. If they put it on some media, or preinstalled on a computer (which happens a lot), offering the source for download through the internet is not good enough. Option 6d) only applies if the binaries are distributed "by offering access to copy from a designated place". > The FSF explicitly states that > this kind of agreement satisfies that requirement of the license. Where do they do that? > As for the user-replaceable shared library part, that's up for > considerable debate. 
It's unlikely that static linkage legally > creates a derivative work (that would be pretty unreasonable in > computer science terms), but it's never been tested in court, so > static linking would probably be out for distributors without a legal > department. Perhaps. However, even if you link dynamically, you would *still* have to provide source code along with the binary. Regards, Martin From martin at v.loewis.de Tue Sep 11 07:32:14 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 11 Sep 2007 07:32:14 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709102150k217adedblfc7cc7b57309f5a7@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> <66d0a6e10709102150k217adedblfc7cc7b57309f5a7@mail.gmail.com> Message-ID: <46E6285E.7060901@v.loewis.de> > Interesting, I didn't look at the code (obviously), but my > understanding was that it was only positive integers below 100. See NSMALLPOSINTS and NSMALLNEGINTS. It's 257 positive ints since r42552, contributed through bugs.python.org/1436243. 
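The small-int cache mentioned above can be observed from Python itself. The sketch below is only an illustration and relies on CPython implementation details: the helper name shares_identity is hypothetical, and the identity result for values outside the cached range is deliberately not asserted because other implementations and versions may differ.

```python
def shares_identity(value):
    """Build the same int twice at run time and report whether both
    results are the very same object (i.e. came from a shared cache)."""
    a = int(str(value))
    b = int(str(value))
    return a is b


if __name__ == "__main__":
    for n in (1, 100, 200, 256, 257, 100000):
        print(n, shares_identity(n))
```

If 1 and 200 both report shared identity, a benchmark that switches its increment from 1 to 200 is still exercising the cached path, which is the point made earlier in the thread.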
Regards, Martin From larry at hastings.org Tue Sep 11 08:09:29 2007 From: larry at hastings.org (Larry Hastings) Date: Mon, 10 Sep 2007 23:09:29 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> Message-ID: <46E63119.2070502@hastings.org> Nicholas Bastin wrote: > As for the user-replaceable shared library part, that's up for > considerable debate. It's unlikely that static linkage legally > creates a derivative work (that would be pretty unreasonable in > computer science terms), but it's never been tested in court, so > static linking would probably be out for distributors without a legal > department. I guess anything is debatable, but the LGPL explicitly defines programs statically-linked with LGPL code as being "derivative works": *5.* A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License. However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables. 
I feel it's intellectually dishonest to ignore the LGPL's restrictions on the basis that its definitions haven't been tested in court. You seem to suggest that, were Python to incorporate LGPL code, organizations which redistribute a statically-linked Python should ignore the LGPL-induced restrictions--is that really what you mean? I for one am relatively happy with the existing Python license. I would be quite irritated if Python were to incur more restrictive licenses, whether or not they had been tested in court. /larry/ From martin at v.loewis.de Tue Sep 11 09:21:22 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 11 Sep 2007 09:21:22 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709101224j4cbe900dsb8aa52bd7259e66a@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> <66d0a6e10709101224j4cbe900dsb8aa52bd7259e66a@mail.gmail.com> Message-ID: <46E641F2.4020701@v.loewis.de> > 3.0: 10 loops, best of 3: 6.76 sec per loop > 2.6: 10 loops, best of 3: 2.61 sec per loop I can't quite reproduce these results. On a 3.2GHz Pentium 4, running Linux 2.6.21, gcc 4.1.3, I get 3.0: 10 loops, best of 3: 728 msec per loop 2.6: 10 loops, best of 3: 558 msec per loop So it's only 30% slower, not 260%. What puzzles me more is that on comparable machines, it runs 5 to 10 times as fast on Linux as it does on Windows. Have you turned off optimization by any chance in the compiler (what compiler did you use, anyway)?
Regards, Martin From krstic at solarsail.hcs.harvard.edu Tue Sep 11 09:21:20 2007 From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?Q?Ivan_Krsti=C4=87?=) Date: Tue, 11 Sep 2007 03:21:20 -0400 Subject: [Python-3000] 3.0 crypto In-Reply-To: <52dc1c820709071148l2c3061f9l14c929657ef7e397@mail.gmail.com> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> <46DFB5B6.1020807@v.loewis.de> <308CC895-A9EB-48F8-A7B7-80DC90A8D55A@solarsail.hcs.harvard.edu> <52dc1c820709071148l2c3061f9l14c929657ef7e397@mail.gmail.com> Message-ID: <1B544854-053A-45C9-869B-92F48D54CA45@solarsail.hcs.harvard.edu> On Sep 7, 2007, at 2:48 PM, Gregory P. Smith wrote: > fwiw hashes are not cryptography. I assume you mean legally? I was referring to the fact that we're specifically discussing cryptographic hashes. > I see nothing wrong with leaving pycrypto as an add-on library as > most things don't need it. http://www.amk.ca/python/code/crypto. Last I heard, AMK was no longer maintaining pycrypto, and a number of people have found weird issues with it and were generally uncertain of the correctness of the implemented crypto. > The pycrypto API is very nice. But if we were to consider it > for the standard library I'd prefer it just link against OpenSSL > rather than use its own C implementations and just leave platforms > without ssl without any crypto. That's one option, although there seems to be some FUD surrounding OpenSSL licensing and its interactions with the GPL: It's also a standalone library, and it strikes me as much nicer to just have Python provide the crypto functionality out of the box. So, if we built an API atop the (public domain) LibTomCrypt code that mimicked that of pycrypto, would anyone object to getting that kind of thing into the Python source distribution?
> Besides the chances are that most programmers seeing a crypto > library will misuse it and gain a false sense of security on what > they've done. ;) Consenting adults, etc. -- Ivan Krstić | http://radian.org From krstic at solarsail.hcs.harvard.edu Tue Sep 11 09:29:26 2007 From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?Q?Ivan_Krsti=C4=87?=) Date: Tue, 11 Sep 2007 03:29:26 -0400 Subject: [Python-3000] 3.0 crypto (was: Re: Solaris support in 3.0?) In-Reply-To: References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> Message-ID: <6EA91F68-7625-47FA-90BC-2F0E1455F1B9@solarsail.hcs.harvard.edu> On Sep 6, 2007, at 10:54 AM, Guido van Rossum wrote: > I'm not sure what you meant with "doing the work isn't a problem". Are > you volunteering? I think we need someone who understands the red tape > situation most of all. Hopefully I'm worried for nothing. I'm trying to feel out whether there's strong opposition to shipping a good set of built-in crypto operations with Python, and in a way that doesn't depend on external libraries. There are three reasons for opposition that I could imagine: - legal, in that there's uncertainty about what we can or can't ship. I can very likely get the appropriate assistance here to clarify the situation. - technical, in that no one has been willing to do the work of providing such a set of crypto ops, and/or of writing a PEP for them. - philosophical, in that folks think crypto shouldn't come bundled with the language. I'm volunteering to tackle the first two, assuming those are the actual problems. Are they? -- Ivan Krstić
| http://radian.org From nick.bastin at gmail.com Tue Sep 11 10:38:21 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Tue, 11 Sep 2007 04:38:21 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E62721.4020009@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> <46E62721.4020009@v.loewis.de> Message-ID: <66d0a6e10709110138w3fcb5f7bl87168db2328695d1@mail.gmail.com> On 9/11/07, "Martin v. Löwis" wrote: > > If python.org agreed to host the GMP source, that would suffice for > > all people distributing python binaries (they could then just refer to > > the GMP source download as a link). > > It would not if they don't distribute the binary through download. > If they put it on some media, or preinstalled on a computer (which > happens a lot), offering the source for download through the internet > is not good enough. Option 6d) only applies if the binaries are > distributed "by offering access to copy from a designated place". This is a good point. > > The FSF explicitly states that > > this kind of agreement satisfies that requirement of the license. > > Where do they do that? In the GPL FAQ (). Specifically: Can I put the binaries on my Internet server and put the source on a different Internet site? The GPL says you must offer access to copy the source code "from the same place"; that is, next to the binaries. However, if you make arrangements with another site to keep the necessary source code available, and put a link or cross-reference to the source code next to the binaries, we think that qualifies as "from the same place".
> > As for the user-replaceable shared library part, that's up for > > considerable debate. It's unlikely that static linkage legally > > creates a derivative work (that would be pretty unreasonable in > > computer science terms), but it's never been tested in court, so > > static linking would probably be out for distributors without a legal > > department. > > Perhaps. However, even if you link dynamically, you would *still* > have to provide source code along with the binary. No one is disputing that, just saying that the terms could be made less onerous for subsequent distributors of python by securing a written guarantee from python.org that python.org would continue to distribute the source code on the internet. Of course, as several people have now pointed out, non-internet distribution would still have to ship the source code on their own, since the FAQ also prefers that source distribution be done by the same method as binary distribution. However, that being said, I don't see it as particularly onerous to add a small source distribution to a CD, since there's only a marginal increase in effective cost. All of this being said, GMP has been shot down for plenty of good technical reasons, which is really the question that was asked in the first place. This legal discussion is bordering on the sublime at this point, given that no one is actually suggesting that we bind Python to any LGPL software (nor, by the way, was that actually ever suggested - the question was asked of what the community thought of a particular piece of software, and an idea in general, and instead of answering that question, most decided to explain what they thought of a particular license, ignoring the technical questions entirely). 
-- Nick From nick.bastin at gmail.com Tue Sep 11 10:59:32 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Tue, 11 Sep 2007 04:59:32 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E63119.2070502@hastings.org> References: <1189270839.25695.18.camel@qrnik> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> <46E63119.2070502@hastings.org> Message-ID: <66d0a6e10709110159w1861c488j15375a543a3502b4@mail.gmail.com> On 9/11/07, Larry Hastings wrote: > I guess anything is debatable, but the LGPL explicitly defines programs > statically-linked with LGPL code as being "derivative works": Where exactly does it do that? The GPL does that, but not the LGPL. In fact, the LGPL does not define nor reference "derivative works" in any way. Earlier revisions of the LGPL were potentially somewhat more restrictive, and certainly harder to parse, but the current version is reasonably clear on this topic. > 5. A program that contains no derivative of any portion of the Library, but > is designed to work with the Library by being compiled or linked with it, is > called a "work that uses the Library". Such a work, in isolation, is not a > derivative work of the Library, and therefore falls outside the scope of > this License. What version of the LGPL did you find this clause in? Section 5 of the current license says the following: 5. Combined Libraries. 
You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: * a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. * b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. >I feel it's intellectually dishonest to ignore the LGPL's restrictions on the basis that its >definitions haven't been tested in court. You seem to suggest that, were Python to >incorporate LGPL code, organizations which redistribute a statically-linked Python should >ignore the LGPL-induced restrictions--is that really what you mean? No, that's why I said that statically linking was out for distributions without their own legal department. That was supposed to be read as, "we don't supply legal advice, they have to make their own decisions". If they want to interpret it to mean that static linkage is fine, then that's their own decision. In my experience, lawyers don't view those kinds of decisions as "intellectually dishonest", but rather as "up for interpretation". I'll leave it as an exercise for the reader to determine what they think of that particular philosophy. 
-- Nick From nick.bastin at gmail.com Tue Sep 11 11:20:45 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Tue, 11 Sep 2007 05:20:45 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E641F2.4020701@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> <66d0a6e10709101224j4cbe900dsb8aa52bd7259e66a@mail.gmail.com> <46E641F2.4020701@v.loewis.de> Message-ID: <66d0a6e10709110220i2f415fcan9047e4cb40676488@mail.gmail.com> On 9/11/07, "Martin v. Löwis" wrote: > > 3.0: 10 loops, best of 3: 6.76 sec per loop > > 2.6: 10 loops, best of 3: 2.61 sec per loop > > I can't quite reproduce these results. On a 3.2GHz Pentium 4, > running Linux 2.6.21, gcc 4.1.3, I get > > 3.0: 10 loops, best of 3: 728 msec per loop > 2.6: 10 loops, best of 3: 558 msec per loop > > So it's only 30% slower, not 260%. It's certainly possible that other architecture/os/compiler combinations will generate different results, although I was able to produce similar scaling results on my Core Duo in my MacBook Pro under MacOS X 10.4.10 using gcc 4.0.1 (Apple build 5247). > What puzzles me more is that on comparable machines, it > runs 5 to 10 times as fast on Linux as it does on Windows. The machines actually aren't that comparable. The differences between the P4 and PD are vast. Depending on which P4 revision you have (and 3 GHz was available in more than one flavor - Northwood, Prescott, P4HT, Prescott 2M and Cedar Mill), your FSB is possibly up to 50% faster than mine, and you may have 2MB of L2 cache. Almost all available 3 GHz P4s had hyperthreading, and while I don't believe that would have any effect in this case, I don't know (I don't believe HT ever performed any "magic" on non-threaded code).
> Have you turned off optimization by any chance in the > compiler (what compiler did you use, anyway)? VC.NET 2005 Pro. I did not optimize beyond what is in the Python vcproj, but I ran both in release build configurations, which I presume have some optimizations enabled, anyhow (It appears to set /O2, but no more). -- Nick From martin at v.loewis.de Tue Sep 11 13:03:18 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 11 Sep 2007 13:03:18 +0200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709110138w3fcb5f7bl87168db2328695d1@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> <46E62721.4020009@v.loewis.de> <66d0a6e10709110138w3fcb5f7bl87168db2328695d1@mail.gmail.com> Message-ID: <46E675F6.8090604@v.loewis.de> > In the GPL FAQ (). Specifically: > > Can I put the binaries on my Internet server and put the source on a > different Internet site? Ok. As you say, this applies to downloading only. > Of course, as several people have now pointed out, non-internet > distribution would still have to ship the source code on their own, > since the FAQ also prefers that source distribution be done by the > same method as binary distribution. I'm glad we now agree that you have to ship GMP sources with any Python binary that you distribute. > However, that being said, I don't > see it as particularly onerous to add a small source distribution to a > CD, since there's only a marginal increase in effective cost. So the issue now is only whether that's acceptable. I think it is not; CPython should not rely on LGPL'ed code. 
> All of this being said, GMP has been shot down for plenty of good > technical reasons, which is really the question that was asked in the > first place. Hmm. You asked "Would anyone be opposed to rehosting PyLong on top of GMP?", which is a different question than the one you just said you asked. If you had agreed on the facts from the beginning, this entire discussion would not have taken place. > This legal discussion is bordering on the sublime at > this point, given that no one is actually suggesting that we bind > Python to any LGPL software (nor, by the way, was that actually ever > suggested - the question was asked of what the community thought of a > particular piece of software No, that was not the question, either. You asked "Would anyone be opposed to rehosting PyLong on top of GMP?", not "what do you think about GMP?". "rehosting PyLong on top of GMP" literally requires binding Python to GMP. > and an idea in general, and instead of > answering that question, most decided to explain what they thought of > a particular license, ignoring the technical questions entirely). I personally never said what I think of the LGPL. I was merely trying to explain what it actually says. FWIW, I quite like both the GPL, and the LGPL, and applaud the motivations behind it. That's why I prefer to follow it faithfully, and in its spirit, rather than trying to weasel-word out of it. 
Regards, Martin From p.f.moore at gmail.com Tue Sep 11 14:21:20 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 11 Sep 2007 13:21:20 +0100 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709110220i2f415fcan9047e4cb40676488@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709091941h749630fag9e3739fd24ab31fd@mail.gmail.com> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> <66d0a6e10709101224j4cbe900dsb8aa52bd7259e66a@mail.gmail.com> <46E641F2.4020701@v.loewis.de> <66d0a6e10709110220i2f415fcan9047e4cb40676488@mail.gmail.com> Message-ID: <79990c6b0709110521p10722897s6e4d03e5a558b457@mail.gmail.com> On 11/09/2007, Nicholas Bastin wrote: > On 9/11/07, "Martin v. Löwis" wrote: > > > 3.0: 10 loops, best of 3: 6.76 sec per loop > > > 2.6: 10 loops, best of 3: 2.61 sec per loop > > > > I can't quite reproduce these results. On a 3.2GHz Pentium 4, > > running Linux 2.6.21, gcc 4.1.3, I get > > > > 3.0: 10 loops, best of 3: 728 msec per loop > > 2.6: 10 loops, best of 3: 558 msec per loop > > > > So it's only 30% slower, not 260%. FWIW, I get >python -m timeit "import inttest; inttest.int_test2(5)" 10 loops, best of 3: 367 msec per loop >\Apps\Python30\python -m timeit "import inttest; inttest.int_test2(5)" 10 loops, best of 3: 810 msec per loop That's on Windows XP, distributed binaries of Python 2.5 and 3.0a1. Processor speed: 1.7 GHz Processor type: Intel(R) Pentium(R) M processor That's 120% slower (but against very different versions). I guess this proves nothing much, apart from the fact that the test is wildly variable and as such probably not very valid :-) Paul.
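As a rough check of the percentages quoted in this exchange, the sketch below recomputes them from the timings posted above. The helper name percent_slower is made up for illustration, and the inputs are the timeit figures from the thread, not new measurements.

```python
def percent_slower(baseline, candidate):
    """Return how much slower candidate is than baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# Timings as posted in the thread (2.x build first, 3.0 build second).
posted = {
    "Windows, VC.NET 2005 (sec)": (2.61, 6.76),
    "Linux, gcc 4.1.3 (msec)": (558, 728),
    "Windows XP, Pentium M (msec)": (367, 810),
}

for label, (old, new) in posted.items():
    # Report both the relative increase and the plain ratio, since the
    # thread mixes the two ways of stating a slowdown.
    print(f"{label}: {percent_slower(old, new):.0f}% slower, "
          f"{new / old:.2f}x the run time")
```

Read this way, the first pair is roughly 2.6 times the run time, which suggests the "260%" figure refers to the ratio rather than to the increase; the other two figures match the "30%" and "120%" quoted in the messages.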
From eric+python-dev at trueblade.com Tue Sep 11 14:47:12 2007 From: eric+python-dev at trueblade.com (Eric Smith) Date: Tue, 11 Sep 2007 08:47:12 -0400 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E559E9.4090907@trueblade.com> References: <46E559E9.4090907@trueblade.com> Message-ID: <46E68E50.8050101@trueblade.com> Eric Smith wrote: > I have a patch to add __format__ to datetime.time, .date, and .datetime. > For non-empty format_spec's, I just pass on to .strftime. For empty > format_spec's, it returns str(self). What's the best way to call str(self)? I'm currently doing: if (PyUnicode_GetSize(format) == 0) return PyObject_CallMethod((PyObject *)self, "__str__", NULL); Although this works, calling self.__str__ doesn't seem like the right thing to do. Thanks. From ncoghlan at gmail.com Tue Sep 11 15:35:33 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 11 Sep 2007 23:35:33 +1000 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E68E50.8050101@trueblade.com> References: <46E559E9.4090907@trueblade.com> <46E68E50.8050101@trueblade.com> Message-ID: <46E699A5.20307@gmail.com> Eric Smith wrote: > Eric Smith wrote: >> I have a patch to add __format__ to datetime.time, .date, and .datetime. >> For non-empty format_spec's, I just pass on to .strftime. For empty >> format_spec's, it returns str(self). > > What's the best way to call str(self)? > > I'm currently doing: > if (PyUnicode_GetSize(format) == 0) > return PyObject_CallMethod((PyObject *)self, "__str__", NULL); > > Although this works, calling self.__str__ doesn't seem like the right > thing to do. PyObject_Str is the C API equivalent of str, but I believe PyObject_Unicode is currently the right call for Py3k [1]. Cheers, Nick. 
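For readers following along at the Python level, the dispatch described in the patch can be sketched as below. This is only an illustration of the stated behaviour (an empty format_spec falls back to str, anything else goes to strftime); format_datetime is a hypothetical helper and not the C code from the patch.

```python
from datetime import datetime


def format_datetime(value, format_spec):
    """Mimic the described __format__ logic for date/time objects."""
    if format_spec == "":
        # Empty spec: same result as str(value).
        return str(value)
    # Non-empty spec: hand the spec to strftime unchanged.
    return value.strftime(format_spec)


now = datetime(2007, 9, 10, 22, 7, 53, 654774)
print(format_datetime(now, ""))          # same text as str(now)
print(format_datetime(now, "%Y-%m-%d"))  # date part only
```

In C, the str(self) branch corresponds to a single PyObject_Str call on the object, which is the simplification suggested in the reply above.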
[1] http://docs.python.org/api/object.html -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From ncoghlan at gmail.com Tue Sep 11 15:59:10 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 11 Sep 2007 23:59:10 +1000 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E675F6.8090604@v.loewis.de> References: <1189270839.25695.18.camel@qrnik> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> <46E62721.4020009@v.loewis.de> <66d0a6e10709110138w3fcb5f7bl87168db2328695d1@mail.gmail.com> <46E675F6.8090604@v.loewis.de> Message-ID: <46E69F2E.9080509@gmail.com> Martin v. Löwis wrote: > I personally never said what I think of the LGPL. I was merely trying > to explain what it actually says. FWIW, I quite like both the GPL, and > the LGPL, and applaud the motivations behind it. That's why I prefer > to follow it faithfully, and in its spirit, rather than trying to > weasel-word out of it. I have to agree with what Martin has said here - the PSF license used for the CPython interpreter is designed to give a lot of flexibility to embedders and developers using the engine. Preserving the freedom of end-users to access the interpreter source code isn't one of the aims of the license, so redistributors are free to use whatever license they like, and are also free to distribute the software purely in binary form. The LGPL and GPL have different aims from the PSF license, with a much greater focus on preserving freedom for the end-user, so code under those licenses doesn't fit in with the licensing model for the base CPython distribution.
Even though it would be possible for the PSF to do what was necessary to make the inclusion of LGPL code legal, the effect on the overall licensing model would be a major inconvenience for downstream embedders and developers. So rather than trying to skirt the letter of the licenses, it makes sense to just obey the spirit and accept that this may sometimes prevent us from using code that might otherwise be helpful. In at least one case where this mattered in the past (locale independent atoi/atof, if I recall correctly), the author of the relevant code was actually kind enough to grant the PSF direct permission to use the code under a Python contributor agreement. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From mark at qtrac.eu Tue Sep 11 16:06:32 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Tue, 11 Sep 2007 15:06:32 +0100 Subject: [Python-3000] ordered dict for p3k collections? Message-ID: <200709111506.32823.mark@qtrac.eu> Hi, Is there any chance that an ordered dict will be added to Python 3's library? I personally find such data structures v. useful in C++. I know that in Python the sort function is v. fast, but often I prefer never to sort but simply to use an ordered data structure in the first place. (I'm aware that for ordered lists I can use the bisect module, but I want an ordered key-value data structure.) I think other people must find such things useful. There are three implementations on the Python Cookbook site, and one on PyPI, all in pure Python (plus I have my own implementation, also pure Python). I would suppose that it would be better if it was implemented in C---for example, my own pure Python ordered dict takes about eight times as long to load in 18,000 items compared with loading the same into a dict. 
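To make the request concrete, here is a minimal pure-Python sketch of a key-sorted mapping built on the bisect module mentioned above. It assumes "ordered" means "iterates in sorted key order" (as with the C++ containers referred to); the class name SortedDict and its methods are illustrative only and do not correspond to any of the Cookbook or PyPI implementations.

```python
import bisect


class SortedDict:
    """Mapping that iterates in sorted key order (illustrative sketch)."""

    def __init__(self):
        self._keys = []   # keys, kept sorted at all times
        self._data = {}   # ordinary dict holding the values

    def __setitem__(self, key, value):
        if key not in self._data:
            # New key: insert it at its sorted position.
            bisect.insort(self._keys, key)
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]  # raises KeyError if the key is absent
        index = bisect.bisect_left(self._keys, key)
        del self._keys[index]

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)

    def items(self):
        return [(key, self._data[key]) for key in self._keys]


fruit = SortedDict()
fruit["pear"] = 1
fruit["apple"] = 2
fruit["fig"] = 3
print(list(fruit))  # keys come back in sorted order
```

Lookups stay as fast as a plain dict, but each new key costs a list insertion, which is one plausible reason a pure-Python version loads noticeably slower than a dict.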
-- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From eric+python-dev at trueblade.com Tue Sep 11 16:21:10 2007 From: eric+python-dev at trueblade.com (Eric Smith) Date: Tue, 11 Sep 2007 10:21:10 -0400 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E699A5.20307@gmail.com> References: <46E559E9.4090907@trueblade.com> <46E68E50.8050101@trueblade.com> <46E699A5.20307@gmail.com> Message-ID: <46E6A456.1020200@trueblade.com> Nick Coghlan wrote: > Eric Smith wrote: >> Eric Smith wrote: >>> I have a patch to add __format__ to datetime.time, .date, and >>> .datetime. For non-empty format_spec's, I just pass on to >>> .strftime. For empty format_spec's, it returns str(self). >> >> What's the best way to call str(self)? >> >> I'm currently doing: >> if (PyUnicode_GetSize(format) == 0) >> return PyObject_CallMethod((PyObject *)self, "__str__", NULL); >> >> Although this works, calling self.__str__ doesn't seem like the right >> thing to do. > > PyObject_Str is the C API equivalent of str, but I believe > PyObject_Unicode is currently the right call for Py3k [1]. Of course! Thanks for the help, I was trying to over-complicate it. Eric. From skip at pobox.com Tue Sep 11 16:33:03 2007 From: skip at pobox.com (skip at pobox.com) Date: Tue, 11 Sep 2007 09:33:03 -0500 Subject: [Python-3000] __format__ and datetime In-Reply-To: References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> Message-ID: <18150.42783.278892.121765@montanaro.dyndns.org> Skip> I would like to see an analog to %S which preserves fractions of a Skip> second as the default formatting for time and datetime objects Skip> does: Skip> >>> print(now) Skip> 2007-09-10 22:07:53.654774 Guido> Right. It's odd that there's nothing explicit that exactly Guido> produces the default. 
(Though floats have this issue too -- I Guido> wish it could be fixed there too.) Looking at the libref doc for time.strftime and the strftime(3) man pages on Solaris 10, Mac OS X and CentOS 4, I see that %f is unused ("f" is mnemonic for "fractions" of a second). Maybe after a little more investigation and not endless amounts of discussion this could be added to Python as the way to represent the fractions of seconds as an int representing microseconds. For example, the above example could be specified by %Y-%m-%d %H:%M:%S.%f Thinking about future advances in timekeeping, is microseconds too short? Maybe "%N" for "nanoseconds"? Skip From qrczak at knm.org.pl Tue Sep 11 17:38:58 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 11 Sep 2007 17:38:58 +0200 Subject: [Python-3000] help(pickle) fails: unorderable types: type() < type() Message-ID: <1189525138.14065.5.camel@qrnik> Python 3.0a1 (py3k, Sep 8 2007, 15:57:56) [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import pickle >>> help(pickle) Traceback (most recent call last): [...] File "/usr/local/lib/python3.0/pydoc.py", line 954, in repr1 return getattr(self, methodname)(x, level) File "/usr/local/lib/python3.0/repr.py", line 78, in repr_dict for key in islice(sorted(x), self.maxdict): TypeError: unorderable types: type() < type() BTW, is cPickle officially gone and should pickle be used instead? -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From janssen at parc.com Tue Sep 11 18:15:24 2007 From: janssen at parc.com (Bill Janssen) Date: Tue, 11 Sep 2007 09:15:24 PDT Subject: [Python-3000] 3.0 crypto (was: Re: Solaris support in 3.0?) 
In-Reply-To: <6EA91F68-7625-47FA-90BC-2F0E1455F1B9@solarsail.hcs.harvard.edu> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> <6EA91F68-7625-47FA-90BC-2F0E1455F1B9@solarsail.hcs.harvard.edu> Message-ID: <07Sep11.091532pdt."57996"@synergy1.parc.xerox.com> > I'm trying to feel out whether there's strong opposition to shipping > a good set of built-in crypto operations with Python, and in a way > that doesn't depend on external libraries. Could you say a bit more about what these "built-in crypto operations" would be? What's the scope of your ambition here? Bill From jimjjewett at gmail.com Tue Sep 11 18:56:06 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 11 Sep 2007 12:56:06 -0400 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E62358.3020404@enthought.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> Message-ID: On 9/11/07, Travis E. Oliphant wrote: > Guido van Rossum wrote: > > ... I'm hoping Travis has a particular way in mind of > > handling LOCKDATA that can be used as a template. > The use case I had in mind comes about quite often in NumPy when you > want to modify the data-area of an object which may have a > non-contiguous chunk of memory, but the algorithm being used expects > contiguous data. Imagine, for example, that the exporting object is an > image whose rows are stored in different segments. > The consumer of the buffer interface, however, may be an extension > module that does fast image-processing operations and requires > contiguous data.
Because it wants to write the results back in to the > memory area when it is done with the algorithm (which may be thread-safe > and may release the GIL), it requests the object to lock its data to > read-only so that other consumers do not try to get writeable buffers > while it is processing. Does it do its processing in the original buffer, causing it to be temporarily invalid? If so, no one else should even be reading it. Or does it just replace the original buffer with the new results once it is finished? If so, then why does it need the lock the whole time? Is someone getting known stale data (when you could tell them to wait) always OK, but overwriting someone else's change never is? -jJ From nick.bastin at gmail.com Tue Sep 11 19:03:00 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Tue, 11 Sep 2007 13:03:00 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <79990c6b0709110521p10722897s6e4d03e5a558b457@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> <66d0a6e10709101224j4cbe900dsb8aa52bd7259e66a@mail.gmail.com> <46E641F2.4020701@v.loewis.de> <66d0a6e10709110220i2f415fcan9047e4cb40676488@mail.gmail.com> <79990c6b0709110521p10722897s6e4d03e5a558b457@mail.gmail.com> Message-ID: <66d0a6e10709111003y4bc1e5acpfe7ce26841718a37@mail.gmail.com> On 9/11/07, Paul Moore wrote: > On 11/09/2007, Nicholas Bastin wrote: > > On 9/11/07, "Martin v. Löwis" wrote: > > > > 3.0: 10 loops, best of 3: 6.76 sec per loop > > > > 2.6: 10 loops, best of 3: 2.61 sec per loop > > > > > > I can't quite reproduce these results. On a 3.2GHz Pentium 4, > > > running Linux 2.6.21, gcc 4.1.3, I get > > > > > > 3.0: 10 loops, best of 3: 728 msec per loop > > > 2.6: 10 loops, best of 3: 558 msec per loop > > > > > > So it's only 30% slower, not 260%.
> > FWIW, I get > > >python -m timeit "import inttest; inttest.int_test2(5)" > 10 loops, best of 3: 367 msec per loop > > >\Apps\Python30\python -m timeit "import inttest; inttest.int_test2(5)" > 10 loops, best of 3: 810 msec per loop > > That's on Windows XP, distributed binaries of Python 2.5 and 3.0a1. > Processor speed: 1.7 GHz > Processor type: Intel(R) Pentium(R) M processor > > That's 120% slower (but against very different versions). > > I guess this proves nothing much, apart from the fact that the test is > wildly variable and as such probably not very valid :-) The Pentium M and Pentium D are much more alike, architecturally, than either is to the Pentium 4, although the per-clock performance of the Pentium M is much better than either the 4 or the D (although not *that* good compared to a D, I didn't think). In a test like this where the loop is reasonably tight (even given the trek through the python interpreter), processor architecture and differing compiler optimizations will likely have a pretty significant effect on the overall performance. Without looking into it at a much lower level, it's hard to tell, but the difference between a 1MB and 2MB L2 cache might make all the difference in 3.0 performance. -- Nick From guido at python.org Tue Sep 11 19:27:27 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 11 Sep 2007 10:27:27 -0700 Subject: [Python-3000] Which joker tried to remove me from the py3k list? Message-ID: ---------- Forwarded message ---------- From: python-3000-confirm+a02c328561e5ecf4a0373b3c0001cd33ec59ea4f at python.org Date: Sep 11, 2007 9:58 AM Subject: Your confirmation is required to leave the Python-3000 mailing list To: guido at python.org Mailing list removal confirmation notice for mailing list Python-3000 We have received a request for the removal of your email address, "guido at python.org" from the python-3000 at python.org mailing list.
To confirm that you want to be removed from this mailing list, simply reply to this message, keeping the Subject: header intact. Or visit this web page: http://mail.python.org/mailman/confirm/python-3000/a02c328561e5ecf4a0373b3c0001cd33ec59ea4f Or include the following line -- and only the following line -- in a message to python-3000-request at python.org: confirm a02c328561e5ecf4a0373b3c0001cd33ec59ea4f Note that simply sending a `reply' to this message should work from most mail readers, since that usually leaves the Subject: line in the right form (additional "Re:" text in the Subject: is okay). If you do not wish to be removed from this list, please simply disregard this message. If you think you are being maliciously removed from the list, or have any other questions, send them to python-3000-owner at python.org. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 11 19:46:12 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 11 Sep 2007 10:46:12 -0700 Subject: [Python-3000] __format__ and datetime In-Reply-To: <18150.42783.278892.121765@montanaro.dyndns.org> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> Message-ID: On 9/11/07, skip at pobox.com wrote: > > Skip> I would like to see an analog to %S which preserves fractions of a > Skip> second as the default formatting for time and datetime objects > Skip> does: > > Skip> >>> print(now) > Skip> 2007-09-10 22:07:53.654774 > > Guido> Right. It's odd that there's nothing explicit that exactly > Guido> produces the default. (Though floats have this issue too -- I > Guido> wish it could be fixed there too.) 
> > Looking at the libref doc for time.strftime and the strftime(3) man pages on > Solaris 10, Mac OS X and CentOS 4, I see that %f is unused ("f" is mnemonic > for "fractions" of a second). Maybe after a little more investigation and > not endless amounts of discussion this could be added to Python as the way > to represent the fractions of seconds as an int representing microseconds. > For example, the above example could be specified by > > %Y-%m-%d %H:%M:%S.%f > > Thinking about future advances in timekeeping, is microseconds too short? > Maybe "%N" for "nanoseconds"? No, the datetime module is explicitly defined to use microseconds. I don't expect there to be a practical use for nanoseconds (even microseconds are doubtful, but useful since one might want unique timestamps for more than 1000 events per second). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 11 19:52:19 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 11 Sep 2007 10:52:19 -0700 Subject: [Python-3000] help(pickle) fails: unorderable types: type() < type() In-Reply-To: <1189525138.14065.5.camel@qrnik> References: <1189525138.14065.5.camel@qrnik> Message-ID: On 9/11/07, Marcin 'Qrczak' Kowalczyk wrote: > Python 3.0a1 (py3k, Sep 8 2007, 15:57:56) > [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pickle > >>> help(pickle) > Traceback (most recent call last): > [...] > File "/usr/local/lib/python3.0/pydoc.py", line 954, in repr1 > return getattr(self, methodname)(x, level) > File "/usr/local/lib/python3.0/repr.py", line 78, in repr_dict > for key in islice(sorted(x), self.maxdict): > TypeError: unorderable types: type() < type() Mind reporting this on bugs.python.org? > BTW, is cPickle officially gone and should pickle be used instead? Yes. There will be a transparent accelerator written in C, but the public API will be called "pickle".
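FWIW, a minimal sketch of what "the public API will be called pickle" means for calling code: you import `pickle` and nothing else, whether or not a C accelerator sits behind it. The `record` data below is made up for illustration. As background not stated in this thread, released Python 3 versions keep the accelerator in a private `_pickle` module that `pickle` picks up when it is available.

```python
import pickle

# Made-up sample data, only for the round trip below.
record = {"name": "py3k", "tags": ["bytes", "buffer"], "count": 3}

# Serialize and restore through the public module name only; no cPickle import.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

print(type(blob).__name__)   # bytes
print(restored == record)    # True
```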
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 11 20:00:17 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 11 Sep 2007 11:00:17 -0700 Subject: [Python-3000] 3.0 crypto (was: Re: Solaris support in 3.0?) In-Reply-To: <6EA91F68-7625-47FA-90BC-2F0E1455F1B9@solarsail.hcs.harvard.edu> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> <6EA91F68-7625-47FA-90BC-2F0E1455F1B9@solarsail.hcs.harvard.edu> Message-ID: On 9/11/07, Ivan Krstić wrote: > On Sep 6, 2007, at 10:54 AM, Guido van Rossum wrote: > > I'm not sure what you meant with "doing the work isn't a problem". Are > > you volunteering? I think we need someone who understands the red tape > > situation most of all. Hopefully I'm worried for nothing. > > I'm trying to feel out whether there's strong opposition to shipping > a good set of built-in crypto operations with Python, and in a way > that doesn't depend on external libraries. > > There are three reasons for opposition that I could imagine: > > - legal, in that there's uncertainty about what we can or can't ship. > I can very likely get the appropriate assistance here to clarify the > situation. I think you will have to start here. > - technical, in that no one has been willing to do the work of > providing such a set of crypto ops, and/or of writing a PEP for them. Well, most people in need of crypto with Python can find what they want as 3rd party code (whether using openssl or not). That these haven't been integrated with Python is often more a matter of different project management styles than a philosophical disagreement. E.g. code that gets significant updates twice a year isn't ready for inclusion into Python, which only releases new features every 18-24 months.
> - philosophical, in that folks think crypto shouldn't come bundled > with the language. I don't think so, though the release managers might disagree. The PR disaster if a bug in the crypto code were to require shipment of updates could be significant. > I'm volunteering to tackle the first two, assuming those are the > actual problems. Are they? Why write something new instead of integrating existing code? What's wrong with openssl? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 11 21:02:41 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 11 Sep 2007 12:02:41 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E62358.3020404@enthought.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> Message-ID: On 9/10/07, Travis E. Oliphant wrote: > Guido van Rossum wrote: > > I'd like to see Travis's response to this. It's setting a precedent > > regarding locking objects in read-only mode; I haven't found other > > examples of objects using LOCKDATA (the only mentions of it seem to be > > rejecting it :). I keep getting confused by the two separate lock > > counts (and I think in this version the comment is inconsistent with > > the code). So I'm hoping Travis has a particular way in mind of > > handling LOCKDATA that can be used as a template. > > > > Travis? > > The use case I had in mind comes about quite often in NumPy when you > want to modify the data-area of an object which may have a > non-contiguous chunk of memory, but the algorithm being used expects > contiguous data. Imagine, for example, that the exporting object is an > image whose rows are stored in different segments. > > The consumer of the buffer interface, however, may be an extension > module that does fast image-processing operations and requires > contiguous data. 
Because it wants to write the results back in to the > memory area when it is done with the algorithm (which may be thread-safe > and may release the GIL), it requests the object to lock its data to > read-only so that other consumers do not try to get writeable buffers > while it is processing. > > When the algorithm is done, it alone can write to the memory area and > then when it releases the buffer, the original object will restore > itself to being writeable. Of course, the exporting object must support > this kind of operation and not all objects will. I expect the NumPy > array object and the PIL to support it for example, and other > media-centric objects.

Hm, so this is completely different from what I thought. It seems you are describing the following:

1. acquire the buffer with LOCKDATA
2. copy the data out of the buffer into a scratch area
3. work on the scratch area
4. copy the data from the scratch area back into the buffer
5. release the buffer

I would call this an exclusive write lock, which is quite different from the read lock interpretation implemented by Greg in his patch. Could you add some language to PEP 3118 to clarify this usage? Or is it already there? I admit to not having read it in full...

> It would probably be useful if the bytes object supported it because > then other objects could use it as the memory area. To do it > correctly, the object exporting the interface must only allow locking if > no other writeable interfaces have been exported (which it must keep > track of) and then on release must check to see if the buffer that is > being released is the one that locked its data.

Right. So it seems you would need a counter of outstanding non-data-locked buffer requests and a single bit indicating whether there's a data-locked request. (Rather than two counters like Greg's patch currently uses.)
The hacker in me is already exploring the possibility of making the count negative if there's a data-locked request; it sounds like the valid transitions are:

0 -> 1 -> 2 -> ... (SIMPLE or WRITABLE get)
... -> 2 -> 1 -> ... (SIMPLE or WRITABLE release)
0 -> -1 (LOCKDATA get)
-1 -> 0 (LOCKDATA release)

Have I got that right? I think that you should only be able to request LOCKDATA if there are no other readers *or* writers, but that SIMPLE and WRITABLE clients should be able to coexist (any mess that creates would be the requester's own fault). Any nonzero value here would indicate that the buffer can't be moved.

I note that the use case in the bsddb wrapper extension is a bit different -- Greg suspects that BerkeleyDB won't like the data changing while it is using it (e.g. it might violate its own invariant if the key changes between the time its hash is computed and the time it is written to disk). To ensure this, currently LOCKDATA is the only option; but a classic read lock would allow multiple concurrent readers (which is how Greg's patch to bytesobject.c interprets LOCKDATA).

I think this needs to be clarified. Perhaps we need to separate more clearly the type of access (read or write) and the amount of locking desired (can others read? can others write?).

(BTW The current implementation in bytesobject.c allows changing the size as long as it fits within the allocated size; I think this is probably too lenient, and begging for latent bugs.)

(Spelling alert: 'writeable' is apparently not an English word. I hope it's not too late to rename the flag to PyBUF_WRITABLE. I've opened http://bugs.python.org/issue1150 to track this.)

> For a real-life example, NumPy has a flag called UPDATEIFCOPY that is a > slightly different implementation of the concept. When this flag is > set during conversion to an array, then if a copy must be made to > satisfy the requirements, the original array is set as read-only and > this special flag is set on the array.
When the copy is deleted, its > memory is automatically copied (and possibly casted, etc.) back into the > original array. It is a nice abstraction of the concept of an output > data area that was borrowed from Numarray and allows many things to be > implemented very quickly in NumPy. So in terms of locks, this effectively sets read *and* write locks on the original object (since whatever you might read out of it may be invalidated when the modified copy is written back). But how to enforce that at the Python level? If we had something like this for the bytes object, any *use* of the bytes object from Python (e.g. iterating over it or indexing or slicing it) should be prohibited. Is this reasonable? > One of the main things people use the NumPy C-API for is to get a > contiguous chunk of memory from an array in order to do processing in > another language (such as C or Fortran). It is nice to be able to > specify that the result gets placed back into another chunk of memory > (which may or may not be contiguous) in a unified fashion. NumPy > handles all the copying for you. > > My thinking was that many people will want to be able to get contiguous > chunks of memory, do processing, and then copy the result back into a > segment of memory from a buffer-exporting object which is passed into > the routine as an output object. This is probably common for numpy; for the bytes object, I expect that it's all much simpler, since it's just a contiguous 1D array of bytes... > I'm not sure if my explanations are helpful. Please let me know if I > can explain further. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From oliphant at enthought.com Tue Sep 11 21:49:11 2007 From: oliphant at enthought.com (Travis E. 
Oliphant) Date: Tue, 11 Sep 2007 14:49:11 -0500 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> Message-ID: <46E6F137.2020001@enthought.com> Guido van Rossum wrote: > On 9/10/07, Travis E. Oliphant wrote: > >> > > Hm, so this is completely different from what I thought. It seems you > are describing the following: > > 1. acquire the buffer with LOCK_DATA > 2. copy the data out of the buffer into a scratch area > 3. work on the scratch area > 4. copy the data from the scratch area back into the buffer > 5. release the buffer > > i would call this an exclusive write lock, which is quite different > from the read lock interpretation implemented by Greg in his patch. > Could you add some language to PEP 3118 to clarify this usage? Or is > it already there? I admit to not having read it in full... > Yes, you have nailed the usage I was thinking of. I admit that there are other usage variants that I am not thinking of. These should be vetted. >> It would probably be useful if the bytes object supported it because >> then other objects could use it as the memory area. To do it >> correctly, the object exporting the interface must only allow locking if >> no other writeable interfaces have been exported (which it must keep >> track of) and then on release must check to see if the buffer that is >> being released is the one that locked its data. >> > > Right. So it seems you would need a counter of outstanding > non-data-locked buffer requests and a single bit indicating whether > there's a data-locked request. (Rather than two counters like Greg's > patch currently uses.) > > The hacker in me is already exploring the possibility of making the > count negative if there's a data-locked request; it sounds like the > valid transitions are: > > 0 -> 1 -> 2 -> ... 
(SIMPLE or WRITABLE get) > ... -> 2 -> 1 -> ... (SIMPLE or WRITABLE release) > 0 -> -1 (LOCKDATA get) > -1 -> 0 (LOCKDATA release) > > Have I got that right? I think that you should only be able to request > LOCKDATA if there are no other readers *or* writers, but that SIMPLE > and WRITABLE clients should be able to coexist (any mess that creates > would be the requester's own fault). Any nonzero value here would > indicate that the buffer can't be moved. > Your understanding looks fine to me. A comment I got at SciPy gave me the feeling that this has the look of an infrastructure that is necessary for shared-memory and thread-safe memory management. But, I do not admit to having thought through all of those issues. However, I would welcome any suggestions for improvement that would allow the buffer interface to be used to manage memory in thread-safe ways. > I note that the use case in the bsddb wrapper extension is a bit > different -- Greg suspects that BerkeleyDB won't like the data > changing while it is using it (e.g. it might violate its own invariant > if the key changes between the time its hash is computed and the time > it is written to disk). To ensure this, currently LOCKDATA is the only > option; but a classic read lock would allow multiple concurrent > readers (which is how Greg's patch to bytesobject.c interprets > LOCKDATA). > I'm not sure I understand the difference between a classic read lock and the exclusive write lock concept. Does the classic read-lock just prevent writing to the memory area. In my mind that is a read-only memory buffer and the buffer interface would complain if a writeable buffer was requested. > I think this needs to be clarified. Perhaps we need to separate > clearer the type of access (read or write) and the amount of locking > desired (can others read? can others write?). > Yes, I think the clarification is useful. 
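For what it's worth, the signed-counter idea quoted above (positive for SIMPLE/WRITABLE exports, -1 for a single LOCKDATA export, nonzero meaning the buffer must not move) can be modelled in a few lines of Python. This is only a toy sketch of the transitions as described in this thread: `ExportState`, `get`, `release` and `can_move` are made-up names, not the PEP 3118 C API and not what bytesobject.c actually does.

```python
class ExportState:
    """Toy model of a single signed export counter (hypothetical names).

    count > 0   : that many SIMPLE/WRITABLE exports are outstanding
    count == -1 : exactly one LOCKDATA export is outstanding
    count == 0  : no exports; only then may the buffer be moved or resized
    """

    def __init__(self):
        self.count = 0

    def get(self, lockdata=False):
        if lockdata:
            # LOCKDATA is only granted when there are no readers or writers.
            if self.count != 0:
                raise BufferError("LOCKDATA requires no other exports")
            self.count = -1
        else:
            # SIMPLE and WRITABLE exports may coexist with each other,
            # but not with an outstanding LOCKDATA export.
            if self.count < 0:
                raise BufferError("buffer is data-locked")
            self.count += 1

    def release(self, lockdata=False):
        if lockdata:
            if self.count != -1:
                raise BufferError("no LOCKDATA export to release")
            self.count = 0
        else:
            if self.count <= 0:
                raise BufferError("no SIMPLE/WRITABLE export to release")
            self.count -= 1

    def can_move(self):
        # Any nonzero value pins the buffer in place.
        return self.count == 0


state = ExportState()
state.get()                # 0 -> 1
state.get()                # 1 -> 2
try:
    state.get(lockdata=True)
except BufferError as exc:
    print("refused:", exc)
state.release()            # 2 -> 1
state.release()            # 1 -> 0
state.get(lockdata=True)   # 0 -> -1
print(state.can_move())    # False
state.release(lockdata=True)
print(state.can_move())    # True
```

One counter is enough here only because at most one LOCKDATA export can exist at a time; a scheme with shared read locks would need more state.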
> (BTW The current implementation in bytesobject.c allows changing the > size as long as it fits within the allocated size; I think this is > probably too lenient, and begging for latent bugs.) > > (Spelling alert: 'writeable' is apparently not an English word. I hope > it's not too late to rename the flag to PyBUF_WRITABLE. I've opened > http://bugs.python.org/issue1150 to track this.) > > Actually, writeable is an accepted variant of 'writable' (but it doesn't show up in many spell-check dictionaries). No, it is not too late to change it. Or just define WRITEABLE as WRITABLE. NumPy uses "WRITEABLE" simply because I like that spelling better. >> For a real-life example, NumPy has a flag called UPDATEIFCOPY that is a >> slightly different implementation of the concept. When this flag is >> set during conversion to an array, then if a copy must be made to >> satisfy the requirements, the original array is set as read-only and >> this special flag is set on the array. When the copy is deleted, its >> memory is automatically copied (and possibly casted, etc.) back into the >> original array. It is a nice abstraction of the concept of an output >> data area that was borrowed from Numarray and allows many things to be >> implemented very quickly in NumPy. >> > > So in terms of locks, this effectively sets read *and* write locks on > the original object (since whatever you might read out of it may be > invalidated when the modified copy is written back). Sort of, the object is set as read-only before the UPDATEIFCOPY version is made. Another python thread could technically read the data (but the flag would be set on it so that the user could know that another memory area was shadowing this one). Usually these kinds of object only show up as output arguments to functions and the programmer is left responsible to not try and rely on data that may be changing. Perhaps more fine-grained locks are needed. 
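To make the copy-and-write-back pattern from this thread a bit more concrete (lock the exporter read-only, work on a contiguous scratch copy, copy the result back, release), here is a rough pure-Python sketch. `SegmentedImage` and `locked_scratch` are made-up names for illustration, and the sketch assumes the consumer does not change the length of the scratch area; it is not NumPy's UPDATEIFCOPY implementation and not the buffer-interface C API.

```python
from contextlib import contextmanager


class SegmentedImage:
    """Hypothetical exporter whose rows live in separate bytearrays."""

    def __init__(self, rows):
        self.rows = [bytearray(r) for r in rows]
        self.writeable = True

    def set_pixel(self, row, col, value):
        # Writers are refused while a consumer holds the data lock.
        if not self.writeable:
            raise BufferError("data is locked read-only")
        self.rows[row][col] = value


@contextmanager
def locked_scratch(image):
    """Yield a contiguous scratch copy and write it back on normal exit."""
    if not image.writeable:
        raise BufferError("data is already locked")
    image.writeable = False                    # 1. acquire: lock to read-only
    scratch = bytearray(b"".join(image.rows))  # 2. copy out into one chunk
    try:
        yield scratch                          # 3. caller works on the scratch
        offset = 0
        for row in image.rows:                 # 4. copy the results back
            row[:] = scratch[offset:offset + len(row)]
            offset += len(row)
    finally:
        image.writeable = True                 # 5. release the lock


img = SegmentedImage([b"\x01\x02", b"\x03\x04"])
blocked = False
with locked_scratch(img) as buf:
    for i in range(len(buf)):
        buf[i] *= 2                            # contiguous processing
    try:
        img.set_pixel(0, 0, 9)                 # a competing writer is refused
    except BufferError:
        blocked = True

print(blocked, [list(r) for r in img.rows])
```

Note that readers are not blocked in this sketch, so they can observe data that is about to be overwritten, which is the stale-data question raised in the thread.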
> > This is probably common for numpy; for the bytes object, I expect that > it's all much simpler, since it's just a contiguous 1D array of > bytes... > Yes, indeed it is much simpler.... I'm anxious for feedback and help with the locking mechanism, because I do not have all use cases in mind. I have never thought about a lock that prevents reading. In my mind, this would be handled by the object itself. It could refuse buffer requests if its data had been locked or it could not. On the other hand, there could be two concepts of locking that a consumer could request from an object:

1) Lock so that no other reads or writes are possible until the lock is released.
2) Lock so that only reads are possible.

I had only thought of #2 for the current buffer interface. -Travis From oliphant at enthought.com Tue Sep 11 21:53:56 2007 From: oliphant at enthought.com (Travis E. Oliphant) Date: Tue, 11 Sep 2007 14:53:56 -0500 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> Message-ID: <46E6F254.9020501@enthought.com> Jim Jewett wrote: > On 9/11/07, Travis E. Oliphant wrote: > >> Guido van Rossum wrote: >> >>> ... I'm hoping Travis has a particular way in mind of >>> handling LOCKDATA that can be used as a template. >>> > > > Does it do its processing in the original buffer, causing it to be > temporarily invalid? If so, no one else should even be reading it. > No, the processing is done in a scratch area. But whether or not the object thinks anyone should be reading it is up to the object. If I've exported my memory as writeable and then somebody else wants to get access to the same memory, then it's up to the object to decide whether or not that will be allowed.
It is useful to at least allow other objects to get the pointer to the memory (perhaps they are just monitoring what is there or are just a pipeline or a view of the data). > Or does it just replace the original buffer with the new results once > it is finished? If so, then why does it need the lock the whole time? > Is someone getting known stale data (when you could tell them to > wait) always OK, but overwriting someone else's change never is? > There is no mechanism to "tell anybody" that the data is stale. Only read-able copies are allowed until the "shadow" object is done and copies its results back into the original data. Perhaps a mechanism to signal that the data is stale (i.e. has been locked) would be a useful addition. -Travis From amauryfa at gmail.com Tue Sep 11 21:57:33 2007 From: amauryfa at gmail.com (Amaury Forgeot d'Arc) Date: Tue, 11 Sep 2007 21:57:33 +0200 Subject: [Python-3000] Which joker tried to remove me from the py3k list? In-Reply-To: References: Message-ID: Hello, Guido van Rossum wrote: > ---------- Forwarded message ---------- > From: python-3000-confirm+a02c328561e5ecf4a0373b3c0001cd33ec59ea4f at python.org > > Date: Sep 11, 2007 9:58 AM > Subject: Your confirmation is required to leave the Python-3000 mailing list > To: guido at python.org > > > Mailing list removal confirmation notice for mailing list Python-3000 ... > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/amauryfa%40gmail.com > Mailman adds these links at the bottom of every message, after the signature. Depending on your mail client, they may be part of the reply (Thunderbird does remove the signature and everything that follows. Gmail seems to quote the entire message). See above, *my* unsubscribe link after *your* signature. 
It is even archived: http://mail.python.org/pipermail/python-3000/2007-September/010383.html Of course, this means that someone followed the link, then clicked the "unsubscribe" button. A robot? -- Amaury Forgeot d'Arc From greg at krypto.org Tue Sep 11 23:10:58 2007 From: greg at krypto.org (Gregory P. Smith) Date: Tue, 11 Sep 2007 14:10:58 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> Message-ID: <52dc1c820709111410tb37393fh3daae25eec5e6301@mail.gmail.com> On 9/11/07, Guido van Rossum wrote: > > On 9/10/07, Travis E. Oliphant wrote: > > Guido van Rossum wrote: > > > I'd like to see Travis's response to this. It's setting a precedent > > > regarding locking objects in read-only mode; I haven't found other > > > examples of objects using LOCKDATA (the only mentions of it seem to be > > > rejecting it :). I keep getting confused by the two separate lock > > > counts (and I think in this version the comment is inconsistent with > > > the code). So I'm hoping Travis has a particular way in mind of > > > handling LOCKDATA that can be used as a template. > > > > > > Travis? > > > > The use case I had in mind comes about quite often in NumPy when you > > want to modify the data-area of an object which may have a > > non-contiguous chunk of memory, but the algorithm being used expects > > contiguous data. Imagine, for example, that the exporting object is an > > image whose rows are stored in different segments. > > > > The consumer of the buffer interface, however, may be an extension > > module that does fast image-processing operations and requires > > contiguous data. 
Because it wants to write the results back in to the > > memory area when it is done with the algorithm (which may be thread-safe > > and may release the GIL), it requests the object to lock its data to > > read-only so that other consumers do not try to get writeable buffers > > while it is processing. > > > > When the algorithm is done, it alone can write to the memory area and > > then when it releases the buffer, the original object will restore > > itself to being writeable. Of course, the exporting object must support > > this kind of operation and not all objects will. I expect the NumPy > > array object and the PIL to support it for example, and other > > media-centric objects. > > Hm, so this is completely different from what I thought. It seems you > are describing the following: > > 1. acquire the buffer with LOCK_DATA > 2. copy the data out of the buffer into a scratch area > 3. work on the scratch area > 4. copy the data from the scratch area back into the buffer > 5. release the buffer > > i would call this an exclusive write lock, which is quite different > from the read lock interpretation implemented by Greg in his patch. > Could you add some language to PEP 3118 to clarify this usage? Or is > it already there? I admit to not having read it in full... Yes that is different from what I was using it for based on what the pep 3118 description said. Perhaps the existing description in PEP 3118 should be renamed from LOCKDATA to READONLY? > It would probably be useful if the bytes object supported it because > > then other objects could use it as the memory area. To do it > > correctly, the object exporting the interface must only allow locking if > > no other writeable interfaces have been exported (which it must keep > > track of) and then on release must check to see if the buffer that is > > being released is the one that locked its data. > > Right. 
So it seems you would need a counter of outstanding > non-data-locked buffer requests and a single bit indicating whether > there's a data-locked request. (Rather than two counters like Greg's > patch currently uses.) > > The hacker in me is already exploring the possibility of making the > count negative if there's a data-locked request; it sounds like the > valid transitions are: > > 0 -> 1 -> 2 -> ... (SIMPLE or WRITABLE get) > ... -> 2 -> 1 -> ... (SIMPLE or WRITABLE release) > 0 -> -1 (LOCKDATA get) > -1 -> 0 (LOCKDATA release) > > Have I got that right? I think that you should only be able to request > LOCKDATA if there are no other readers *or* writers, but that SIMPLE > and WRITABLE clients should be able to coexist (any mess that creates > would be the requester's own fault). Any nonzero value here would > indicate that the buffer can't be moved. > > I note that the use case in the bsddb wrapper extension is a bit > different -- Greg suspects that BerkeleyDB won't like the data > changing while it is using it (e.g. it might violate its own invariant > if the key changes between the time its hash is computed and the time > it is written to disk). To ensure this, currently LOCKDATA is the only > option; but a classic read lock would allow multiple concurrent > readers (which is how Greg's patch to bytesobject.c interprets > LOCKDATA). > > I think this needs to be clarified. Perhaps we need to separate > clearer the type of access (read or write) and the amount of locking > desired (can others read? can others write?). bsddb is not alone here but was just the code I was working on that made me think it necessary. I am hoping that -all- file/socket/whatever output operations using the buffer API will get properly read-locked views of the buffer so that they can release the GIL and not have the data change out from underneath them by other threads. 
(this avoids hard to debug issues which python has so far been pretty good at avoiding) (BTW The current implementation in bytesobject.c allows changing the > size as long as it fits within the allocated size; I think this is > probably too lenient, and begging for latent bugs.) > > (Spelling alert: 'writeable' is apparently not an English word. I hope > it's not too late to rename the flag to PyBUF_WRITABLE. I've opened > http://bugs.python.org/issue1150 to track this.) eek, yes please lets spell correctly. :) > For a real-life example, NumPy has a flag called UPDATEIFCOPY that is a > > slightly different implementation of the concept. When this flag is > > set during conversion to an array, then if a copy must be made to > > satisfy the requirements, the original array is set as read-only and > > this special flag is set on the array. When the copy is deleted, its > > memory is automatically copied (and possibly casted, etc.) back into the > > original array. It is a nice abstraction of the concept of an output > > data area that was borrowed from Numarray and allows many things to be > > implemented very quickly in NumPy. > > So in terms of locks, this effectively sets read *and* write locks on > the original object (since whatever you might read out of it may be > invalidated when the modified copy is written back). But how to > enforce that at the Python level? If we had something like this for > the bytes object, any *use* of the bytes object from Python (e.g. > iterating over it or indexing or slicing it) should be prohibited. Is > this reasonable? > > > One of the main things people use the NumPy C-API for is to get a > > contiguous chunk of memory from an array in order to do processing in > > another language (such as C or Fortran). It is nice to be able to > > specify that the result gets placed back into another chunk of memory > > (which may or may not be contiguous) in a unified fashion. NumPy > > handles all the copying for you. 
> > > > My thinking was that many people will want to be able to get contiguous > > chunks of memory, do processing, and then copy the result back into a > > segment of memory from a buffer-exporting object which is passed into > > the routine as an output object. > > This is probably common for numpy; for the bytes object, I expect that > it's all much simpler, since it's just a contiguous 1D array of > bytes... fwiw, in the bsddb and hashlib code I raise an error if the buffer returned is not a 1D array. > I'm not sure if my explanations are helpful. Please let me know if I > > can explain further. > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > From greg at krypto.org Tue Sep 11 23:38:07 2007 From: greg at krypto.org (Gregory P. Smith) Date: Tue, 11 Sep 2007 14:38:07 -0700 Subject: [Python-3000] 3.0 crypto In-Reply-To: <1B544854-053A-45C9-869B-92F48D54CA45@solarsail.hcs.harvard.edu> References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com> <46DE90B0.4050905@v.loewis.de> <66d0a6e10709050851g21bf8b5ct7486f41122487656@mail.gmail.com> <46DED4C0.20406@v.loewis.de> <5CAF4C40-5087-4BA8-B971-C3DA2A0DE679@solarsail.hcs.harvard.edu> <46DFB5B6.1020807@v.loewis.de> <308CC895-A9EB-48F8-A7B7-80DC90A8D55A@solarsail.hcs.harvard.edu> <52dc1c820709071148l2c3061f9l14c929657ef7e397@mail.gmail.com> <1B544854-053A-45C9-869B-92F48D54CA45@solarsail.hcs.harvard.edu> Message-ID: <52dc1c820709111438n22c45fc0ncf76212324669e4a@mail.gmail.com> > Last I heard, AMK was no longer maintaining pycrypto, and a number of > people have found weird issues with it and were generally uncertain > of the correctness of the implemented crypto. > > > The pycrypto API is very nice.
But if we were to consider it > > for the standard library I'd prefer it just link against OpenSSL > > rather than use its own C implementations and just leave platforms > > without ssl without any crypto. > > That's one option, although there seems to be some FUD surrounding > OpenSSL licensing and its interactions with the GPL: > > > > It's also a standalone library, and it strikes me as much nicer to > just have Python provide the crypto functionality out of the box. So, > if we built an API atop the (public domain) LibTomCrypt code that > mimicked that of pycrypto, would anyone object to getting that kind > of thing into the Python source distribution? I'm +1 for that. LibTomCrypt is a great place to start. -gps From guido at python.org Tue Sep 11 23:49:17 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 11 Sep 2007 14:49:17 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E6F137.2020001@enthought.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> Message-ID: On 9/11/07, Travis E. Oliphant wrote: > I'm not sure I understand the difference between a classic read lock and > the exclusive write lock concept. Does the classic read-lock just > prevent writing to the memory area? In my mind that is a read-only > memory buffer and the buffer interface would complain if a writeable > buffer was requested. There are different notions of reading and writing. Sometimes an object is naturally read-only (e.g. a PyString). In that case requesting SIMPLE access should pass but requesting WRITABLE or LOCKDATA access should fail. (I think the other flags are orthogonal to these, right?).
Any number of concurrent SIMPLE accesses can coexist since the clients promise they will only read. OTOH suppose we have an object that is naturally writable (e.g. a PyBytes). I understood that in this case any number of SIMPLE or WRITABLE requests would be allowed to be outstanding simultaneously, and any of these would simply prevent the buffer from moving (fixing the object's size). But this doesn't sound like it is how you meant it -- you seem to say that once any SIMPLE (readonly) requests are outstanding, WRITABLE requests should fail. And I suppose that only one WRITABLE request ought to be allowed at a time. But then I don't know what the difference between WRITABLE and LOCKDATA would be. I guess I would be inclined to propose separate flags for indicating the operation that the caller will attempt (read or write) and the level of locking (lock the buffer's address or also prevent anyone else from writing). Then a "classic read lock" would request read access while locking out writers (bsddb would use this); a "classic write lock" would request write access while locking out writers (your scratch area example would use this); others who don't really care if the data changes underneath them as long as it doesn't move (e.g. traditional I/O) could request read access without locking. I'm not sure if there's a use case to be made for write access without locking, but I wouldn't rule it out -- possibly when two threads share a memory area they might have their own protocol for locking it and might just both want to be able to write to (parts of) it. What do you think? Another way to look at this would be to consider these 4 cases: basic read access (I can read, others can read or write) locked read access (I can read, others can only read) basic write access (I can read and write, others can read or write) exclusive write access (I can read and write, no others can read or write) Except that accessing the object from Python (e.g.
iteration or indexing) never gets locked out. (Or perhaps it should be? That can also be done.) Also, it remains to be seen whether basic read access should be granted when someone has exclusive write access (see below). > Actually, writeable is an accepted variant of 'writable' (but it doesn't > show up in many spell-check dictionaries). No, it is not too late to > change it. Or just define WRITEABLE as WRITABLE. NumPy uses > "WRITEABLE" simply because I like that spelling better. Google found 1.4M occurrences of writeable vs. 3.9M occurrences of writable. I guess you represent a strong minority. :-) I'd still like to see it changed. We can leave WRITEABLE as an alias for WRITABLE for those who are used to seeing it that way in NumPy. > I'm anxious for feedback and help with the locking mechanism, because I > do not have all use cases in mind. I have never thought about a lock > that prevents reading. In my mind, this would be handled by the object > itself. It could refuse buffer requests if its data had been locked or > it could not. Well, the scratch area scenario you describe makes it iffy to read anything out of the original object since you wouldn't know whether you were reading before, during or after the write back from the scratch area to the object's buffer. The question is, do we really care? If we adopted my 4 access modes above, we could say that basic read access will still be granted when someone has exclusive write access if we don't care, OR we could say that basic reads are locked out by exclusive write access. (And then there's the separate issue of whether Python-level access counts as basic read access or doesn't count at all -- though the more I think about it, I think it should be treated the same as basic read access.) > On the other hand, there could be two concepts of locking that a > consumer could request from an object > > 1) Lock so that no other reads or writes are possible until the lock is > released.
> 2) Lock so that only reads are possible. > > I had only thought of #2 for the current buffer interface. #1 maps to locked read OR exclusive write access in the strict variant. #2 maps to locked read in my scheme. (Gotta go -- ttyl.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg.ewing at canterbury.ac.nz Wed Sep 12 00:38:59 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Sep 2007 10:38:59 +1200 Subject: [Python-3000] C API for ints and strings In-Reply-To: <46E69F2E.9080509@gmail.com> References: <1189270839.25695.18.camel@qrnik> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <66d0a6e10709101802t3a8f2475gcdeb180ceaaf3855@mail.gmail.com> <46E62721.4020009@v.loewis.de> <66d0a6e10709110138w3fcb5f7bl87168db2328695d1@mail.gmail.com> <46E675F6.8090604@v.loewis.de> <46E69F2E.9080509@gmail.com> Message-ID: <46E71903.8060903@canterbury.ac.nz> Nick Coghlan wrote: > The LGPL and GPL have different aims from the PSF license, with a much > greater focus on preserving freedom for the end-user, Seems to me they go somewhat beyond "preserving freedoms" and into other areas. It's one thing to *allow* people to use the source if they can get it; it's another thing to try to force people who have no interest in the source themselves to act as agents for hosting and distributing it. Still, it appears that this is what the LGPL requires, so I agree that it's not appropriate for Python. -- [L]GPL - just say no. 
Greg From greg.ewing at canterbury.ac.nz Wed Sep 12 00:47:22 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Sep 2007 10:47:22 +1200 Subject: [Python-3000] __format__ and datetime In-Reply-To: References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> Message-ID: <46E71AFA.9020903@canterbury.ac.nz> Guido van Rossum wrote: > I don't expect there to be a practical use for nanoseconds (even > microseconds are doubtful, but useful since one might want unique > timestamps for more than 1000 events per second). But... what if you want unique timestamps for more than 1000000 events per second? :-) -- Greg From greg.ewing at canterbury.ac.nz Wed Sep 12 00:52:32 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Sep 2007 10:52:32 +1200 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> Message-ID: <46E71C30.50409@canterbury.ac.nz> Guido van Rossum wrote: > 0 -> 1 -> 2 -> ... (SIMPLE or WRITABLE get) > ... -> 2 -> 1 -> ... (SIMPLE or WRITABLE release) > 0 -> -1 (LOCKDATA get) > -1 -> 0 (LOCKDATA release) And if this is the correct interpretation, the requests should be called something like READ_LOCK and WRITE_LOCK to make this clear. 
-- Greg From guido at python.org Wed Sep 12 00:58:11 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 11 Sep 2007 15:58:11 -0700 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E71AFA.9020903@canterbury.ac.nz> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> <46E71AFA.9020903@canterbury.ac.nz> Message-ID: On 9/11/07, Greg Ewing wrote: > Guido van Rossum wrote: > > I don't expect there to be a practical use for nanoseconds (even > > microseconds are doubtful, but useful since one might want unique > > timestamps for more than 1000 events per second). > > But... what if you want unique timestamps for more > than 1000000 events per second? :-) Then you can't use the datetime module, or you'll have to petition for an extension to it. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg.ewing at canterbury.ac.nz Wed Sep 12 01:12:14 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Sep 2007 11:12:14 +1200 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E6F137.2020001@enthought.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> Message-ID: <46E720CE.7030602@canterbury.ac.nz> Travis E. Oliphant wrote: > I'm not sure I understand the difference between a classic read lock and > the exclusive write lock concept. A read lock means that others can obtain read locks, and nobody can obtain a write lock. A write lock means that nobody else can obtain a lock of any kind. 
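[Illustration: a minimal sketch of the read-lock / write-lock rule stated above, using the counter transitions quoted earlier in this thread (0 -> 1 -> 2 for shared gets, 0 -> -1 for an exclusive get). The class and method names are hypothetical; they are not part of PEP 3118 or of the bytes patch. As with the buffer API requests under discussion, a request that cannot be granted fails immediately instead of blocking.]

```python
class ExportCounter:
    """Hypothetical non-blocking read/write lock counter (illustration only)."""

    def __init__(self):
        # 0 = unlocked, n > 0 = n read locks held, -1 = one write lock held
        self.count = 0

    def acquire_read(self):
        # Readers may share, but not while a writer holds the lock.
        if self.count < 0:
            raise BufferError("write lock held; read lock refused")
        self.count += 1

    def release_read(self):
        if self.count <= 0:
            raise RuntimeError("no read lock to release")
        self.count -= 1

    def acquire_write(self):
        # A writer needs the object to itself: no readers, no other writer.
        if self.count != 0:
            raise BufferError("already locked; write lock refused")
        self.count = -1

    def release_write(self):
        if self.count != -1:
            raise RuntimeError("no write lock to release")
        self.count = 0


counter = ExportCounter()
counter.acquire_read()
counter.acquire_read()          # a second reader is fine
try:
    counter.acquire_write()     # refused while readers are outstanding
    write_granted = True
except BufferError:
    write_granted = False
counter.release_read()
counter.release_read()
counter.acquire_write()         # granted once every reader has released
```

[Whether plain Python-level access (indexing, iteration) should count as a read lock is left open here, as it is in the thread.]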
I think strictly the 'e' should only be inserted if the preceding letter is one whose sound changes depending on whether it's followed by an 'e', such as 'c' or 'g'. "Writeable" does seem to be commonly used, though. In any case, it would be good to adopt a convention for these kinds of word used in source, to minimise confusion. -- Greg From greg.ewing at canterbury.ac.nz Wed Sep 12 01:15:47 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Sep 2007 11:15:47 +1200 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E6F254.9020501@enthought.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F254.9020501@enthought.com> Message-ID: <46E721A3.4060406@canterbury.ac.nz> Jim Jewett wrote: > > why does it need the lock the whole time? > Is someone getting known stale data (when you could tell them to > wait) always OK, but overwriting someone else's change never is? In a threaded environment, it shouldn't really be a problem as long as the view of the data is consistent. It's no different from what would have happened if the reading thread had got there just a moment sooner, before the writer got hold of it. If that's a problem, there should have been some higher-level synchronisation going on before getting to that point. -- Greg From greg.ewing at canterbury.ac.nz Wed Sep 12 01:17:20 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Sep 2007 11:17:20 +1200 Subject: [Python-3000] Which joker tried to remove me from the py3k list? In-Reply-To: References: Message-ID: <46E72200.7070408@canterbury.ac.nz> Amaury Forgeot d'Arc wrote: > Of course, this means that someone followed the link, then clicked the > "unsubscribe" button. A robot? It could just be someone trying to unsubscribe themselves, but hitting the wrong link and not noticing. 
-- Greg From greg.ewing at canterbury.ac.nz Wed Sep 12 01:56:08 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Sep 2007 11:56:08 +1200 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> Message-ID: <46E72B18.9060908@canterbury.ac.nz> Guido van Rossum wrote: > Any number of concurrent SIMPLE accesses can > coexist since the clients promise they will only read. As a general principle, using a word like SIMPLE in an API is a really bad idea imo, as it's far too vague. I'm finding it impossible to evaluate the truthfulness of statements like the above in this discussion, because of that. > basic read access (I can read, others can read or write) > locked read access (I can read, others can only read) > basic write access (I can read and write, others can read or write) > exclusive write access (I can read and write, no others can read or write) Should that last one perhaps be "I can read and write, others can only read"? Another thread wanting to read but get a stable view of the data will be using "I can read, others can only read", which will fail because the first one is writing. If the reading thread doesn't care about stability, the writing one shouldn't have to know. Then we have two orthogonal things: READ vs WRITE, and SHARED vs EXCLUSIVE (where 'exclusive' means that others are excluded from writing). > Except that accessing the object from Python (e.g. iteration or > indexing) never gets locked out. With the scheme I just proposed, the iterator could use a non-exclusive mode if it wanted, which would give this effect. -- Greg From greg at krypto.org Wed Sep 12 07:55:09 2007 From: greg at krypto.org (Gregory P. 
Smith) Date: Tue, 11 Sep 2007 22:55:09 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709111003y4bc1e5acpfe7ce26841718a37@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709092053r50cc23fcsb74cea71c9541797@mail.gmail.com> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> <66d0a6e10709101224j4cbe900dsb8aa52bd7259e66a@mail.gmail.com> <46E641F2.4020701@v.loewis.de> <66d0a6e10709110220i2f415fcan9047e4cb40676488@mail.gmail.com> <79990c6b0709110521p10722897s6e4d03e5a558b457@mail.gmail.com> <66d0a6e10709111003y4bc1e5acpfe7ce26841718a37@mail.gmail.com> Message-ID: <52dc1c820709112255j1709da88x7886faa431f2ed70@mail.gmail.com> > The Pentium M and Pentium D are much more alike, architecturally, than > either and the Pentium 4, [cpu rant] Off topic: not true. The Pentium D is the final Pentium 4 netburst architecture based design. It is not at all close to the Pentium M. The M is much more a derivative of the pentium pro,ii,iii, & iii-m before it as core and more distantly core2 are follow ons to the M. Yes the D (50xx) and Woodcrest core2s (51xx) shared the same socket and front side bus but internally they are unrelated. [/cpu rant] Regardless comparing between different cpus doesn't matter, only the difference between runs on the same cpu. for instance on a 1.4Ghz efficeon: python2.5: 10 loops, best of 3: 932 msec per loop python 3.0a1 svn trunk: 10 loops, best of 3: 1.54 sec per loop (both compiled with gcc 4.1.2 -O3) which falls right smack in the middle of the measurements others were reporting in this thread. ;) Without looking into it at a much lower level, > it's hard to tell, but the difference between a 1MB and 2MB L2 cache > might make all the difference in 3.0 performance. doubtful, python's ceval core and the data representing the code being executed are both tiny. -gps -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/python-3000/attachments/20070911/e2ff73f8/attachment.htm From nick.bastin at gmail.com Wed Sep 12 08:54:57 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Wed, 12 Sep 2007 02:54:57 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: <52dc1c820709112255j1709da88x7886faa431f2ed70@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709101058n22b04bfakf67a15aea8e739f4@mail.gmail.com> <66d0a6e10709101224j4cbe900dsb8aa52bd7259e66a@mail.gmail.com> <46E641F2.4020701@v.loewis.de> <66d0a6e10709110220i2f415fcan9047e4cb40676488@mail.gmail.com> <79990c6b0709110521p10722897s6e4d03e5a558b457@mail.gmail.com> <66d0a6e10709111003y4bc1e5acpfe7ce26841718a37@mail.gmail.com> <52dc1c820709112255j1709da88x7886faa431f2ed70@mail.gmail.com> Message-ID: <66d0a6e10709112354m76c3bedn28f9713038f137a5@mail.gmail.com> On 9/12/07, Gregory P. Smith wrote: > [cpu rant] > Off topic: not true. The Pentium D is the final Pentium 4 netburst > architecture based design. It is not at all close to the Pentium M. The M > is much more a derivative of the pentium pro,ii,iii, & iii-m before it as > core and more distantly core2 are follow ons to the M. Yes the D (50xx) and > Woodcrest core2s (51xx) shared the same socket and front side bus but > internally they are unrelated. > [/cpu rant] Yeah, my mistake, I misread intel's NetBurst page. I should have stuck with Wikipedia (who knew). > Regardless comparing between different cpus doesn't matter, only the > difference between runs on the same cpu. I agree. > for instance on a 1.4Ghz efficeon: > > python2.5: > 10 loops, best of 3: 932 msec per loop > python 3.0a1 svn trunk: > 10 loops, best of 3: 1.54 sec per loop > > (both compiled with gcc 4.1.2 -O3) > > which falls right smack in the middle of the measurements others were > reporting in this thread. ;) I should look at a comparison of 2.5 and 2.6 at some point, for better reference. 
> > Without looking into it at a much lower level, > > it's hard to tell, but the difference between a 1MB and 2MB L2 cache > > might make all the difference in 3.0 performance. > > doubtful, python's ceval core and the data representing the code being > executed are both tiny. Makes me miss the G4/5 version of Shark on MacOS X, which would show you the pipelining in the processor and cache utilization, so you could actually see what was going on - the x86 Shark doesn't seem to have this capability. It's suitably interesting on my windows xp machine to turn processor affinity off and see the performance go to hell in a handbasket. (Why, oh why, does windows insist on moving processes between CPUs all the time?) -- Nick From greg at krypto.org Wed Sep 12 09:44:56 2007 From: greg at krypto.org (Gregory P. Smith) Date: Wed, 12 Sep 2007 00:44:56 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E72B18.9060908@canterbury.ac.nz> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> <46E72B18.9060908@canterbury.ac.nz> Message-ID: <52dc1c820709120044h722605cekc86ea668a6a1b4bd@mail.gmail.com> On 9/11/07, Greg Ewing wrote: > > Guido van Rossum wrote: > > Any number of concurrent SIMPLE accesses can > > coexist since the clients promise they will only read. > > As a general principle, using a word like SIMPLE in an > API is a really bad idea imo, as it's far too vague. > I'm finding it impossible to evaluate the truthfulness > of statements like the above in this discussion, because > of that. +1 on that. SIMPLE is a bad name. Based on the pep3118 description, how about calling it 1D_CONTIGUOUS or just RAW or FLAT? 
I also like your suggestion of renaming PyBUF API flags to READ_LOCK and WRITE_LOCK as those are well-defined concepts in the classic multiple readers or one writer synchronization sense. What I implemented in my bytes patch should really be called PyBUF_READ_LOCK and what Travis describes as LOCKDATA in this email thread should become WRITE_LOCK. > basic read access (I can read, others can read or write) > > locked read access (I can read, others can only read) > > basic write access (I can read and write, others can read or write) > > exclusive write access (I can read and write, no others can read or > write) > > Should that last one perhaps be "I can read and write, > others can only read"? > > Another thread wanting to read but get a stable view > of the data will be using "I can read, others can only read", > which will fail because the first one is writing. If the > reading thread doesn't care about stability, the writing > one shouldn't have to know. > > Then we have two orthogonal things: READ vs WRITE, and > SHARED vs EXCLUSIVE (where 'exclusive' means that others > are excluded from writing). When I read the plain term EXCLUSIVE I read that to mean nobody else can read -or- write, i.e. not shared in any sense. Let's extend these base concepts to SHARED_READ, SHARED_WRITE, EXCLUSIVE_READ, EXCLUSIVE_WRITE and use them to define the others: EXCLUSIVE_WRITE - no others write to the buffer while this view is open (this does *not* imply that the requester wants to actually write, that's what the WRIT(E)ABLE flag is for) EXCLUSIVE_READ - no others can read this buffer while this view is open. (this is only useful in conjunction with exclusive write below to make a write_lock).
SHARED_READ - anyone can read this buffer SHARED_WRITE - anyone can write this buffer SIMPLE/FLAT/RAW = SHARED_WRITE | SHARED_READ READ_LOCK = EXCLUSIVE_WRITE | SHARED_READ WRITE_LOCK = EXCLUSIVE_WRITE | EXCLUSIVE_READ Just | any of the above with WRIT(E)ABLE if you intend to actually write to the buffer. -gps -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070912/aa420023/attachment.htm From skip at pobox.com Wed Sep 12 15:58:05 2007 From: skip at pobox.com (skip at pobox.com) Date: Wed, 12 Sep 2007 08:58:05 -0500 Subject: [Python-3000] __format__ and datetime In-Reply-To: References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> Message-ID: <18151.61549.956117.769166@montanaro.dyndns.org> Guido> No, the datetime module is explicitly defined to use Guido> microseconds. I don't expect there to be a practical use for Guido> nanoseconds (even microseconds are doubtful, but useful since one Guido> might want unique timestamps for more than 1000 events per Guido> second). I was just thinking about the folks at places like FermiLab and CERN. ;-) So, is '%f" okay to coopt? Is there some sort of future-proofing we can do so that if the libc folks decide later to use "%f" for something we're not (mildly) hosed? Maybe "%."? It appears that all strftime codes are one or two letters. 
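[Illustration: a small sketch of how "%f" behaves in the Python releases where datetime.strftime and datetime.strptime support it (2.6 and later). It expands to the microsecond field, zero-padded to six digits. The timestamp is made up for the example and is not taken from the thread.]

```python
from datetime import datetime

# Hypothetical timestamp with 42 microseconds.
stamp = datetime(2007, 9, 12, 8, 58, 5, 42)

# %f formats the microsecond field as six zero-padded digits.
text = stamp.strftime("%H:%M:%S.%f")
print(text)  # 08:58:05.000042

# The same code parses the field back when used with strptime.
parsed = datetime.strptime(text, "%H:%M:%S.%f")
print(parsed.microsecond)  # 42
```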
Skip From nas at arctrix.com Wed Sep 12 19:43:07 2007 From: nas at arctrix.com (Neil Schemenauer) Date: Wed, 12 Sep 2007 17:43:07 +0000 (UTC) Subject: [Python-3000] C API for ints and strings References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> Message-ID: Larry Hastings wrote: > I am opposed to using LGPL- or GPL-licensed code in Python. Me too. Also, I don't see the point. Python's current long integer performance is good enough for the large majority of Python users. For the few specialized users, an extension module should serve. Maybe I missed something but I thought the real concern was the performance of the PyLong type when representing relatively short integers. Is GMP a solution to that? 
Neil From barry at python.org Wed Sep 12 20:06:11 2007 From: barry at python.org (Barry Warsaw) Date: Wed, 12 Sep 2007 14:06:11 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709081041v4ea37ce8od75d8a688b52faae@mail.gmail.com> <46E2DF85.4090005@v.loewis.de> <66d0a6e10709081347k6873d581w869b9b483126a929@mail.gmail.com> <46E31FA2.4060701@v.loewis.de> <66d0a6e10709081623w59440ac2pf8dca78ae05dfd52@mail.gmail.com> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> Message-ID: <889D3A2E-3FE6-49C0-89E5-3EB6B885950D@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 12, 2007, at 1:43 PM, Neil Schemenauer wrote: > Larry Hastings wrote: >> I am opposed to using LGPL- or GPL-licensed code in Python. > > Me too. Also, I don't see the point. Python's current long integer > performance is good enough for the large majority of Python users. > For the few specialized users, an extension module should serve. > Maybe I missed something but I thought the real concern was the > performance of the PyLong type when representing relatively short > integers. Is GMP a solution to that? Back in the days of a previous employment, we used some homegrown extensions to give us GMP support in our embedded app. In a fit of rewrite-mania, we ditched it all and stuck with Python's own long integer support. Made our lives easier and we didn't feel we lost anything in terms of accuracy or functionality. We gained in performance but I can't attribute that solely to Python's implementation, since we also ditched a level of abstraction in the process. In any event, I'd agree that Python's current support is probably good enough for most people. - -1 on GMP in the core. 
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRugqlHEjvBPtnXfVAQLu5AP/TolPljxJuqOeEUDrJo1cT0c3FgpJY3RE WSCiIC9+5GW1DSkcZvbO5DzHJH6qYd7HL7z1n2D+AMSH7NFQU4G7yXIkTd4AAibW U3M7KSLEh/q75+lnx5nIoHrPB1A0lJU+c34Ly/kuusE5x4JIeuITkorQYKRDCcKs ZcGFOtGs4pE= =Ysmv -----END PGP SIGNATURE----- From guido at python.org Wed Sep 12 20:42:33 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 12 Sep 2007 11:42:33 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <889D3A2E-3FE6-49C0-89E5-3EB6B885950D@python.org> References: <1189270839.25695.18.camel@qrnik> <46E3B12E.1000703@v.loewis.de> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <889D3A2E-3FE6-49C0-89E5-3EB6B885950D@python.org> Message-ID: Can I just shortcut this discussion saying that we will *not* switch to use GMP? It's just not going to happen. Period. End of discussion. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Wed Sep 12 20:45:19 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 12 Sep 2007 11:45:19 -0700 Subject: [Python-3000] __format__ and datetime In-Reply-To: <18151.61549.956117.769166@montanaro.dyndns.org> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> <18151.61549.956117.769166@montanaro.dyndns.org> Message-ID: On 9/12/07, skip at pobox.com wrote: > So, is '%f" okay to coopt? Is there some sort of future-proofing we can do > so that if the libc folks decide later to use "%f" for something we're not > (mildly) hosed? Maybe "%."? It appears that all strftime codes are one or > two letters. 
Which ones are two letters? Given how long strftime has been around I think %f is fine. We may even influence the future of the C library. :-) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From rowen at cesmail.net Wed Sep 12 20:53:25 2007 From: rowen at cesmail.net (Russell E. Owen) Date: Wed, 12 Sep 2007 11:53:25 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> Message-ID: In article , "Guido van Rossum" wrote: > I guess I would be inclined to propose separate flags for indicating > the operation that the caller will attempt (read or write) and the > level of locking (lock the buffer's address or also prevent anyone > else from writing). Then a "classic read lock" would request read > access while locking out writers (bsddb would use this); a "classic > write lock" would request write access while locking out writers (your > scratch area example would use this); others who don't really care if > the data changes underneath them as long as it doesn't move (e.g. > traditional I/O) could request read access without locking. I'm not > sure if there's a use case to be made for write access without > locking, but I wouldn't rule it out -- possibly when two threads share > a memory area they might have their own protocol for locking it and > might just both want to be able to write to (parts of) it. > > What do you think? 
Another way to look at this would be to consider > these 4 cases: > > basic read access (I can read, others can read or write) > locked read access (I can read, others can only read) > basic write access (I can read and write, others can read or write) > exclusive write access (I can read and write, no others can read or write) Sounds much like the modes offered by an old operating system that had a very nice lock manager. The modes: - concurrent read (others can read or write) - protected read (others can read but not write) - concurrent write (others can read or concurrent write) - protected write (others can concurrent read) - exclusive (no other locks allowed) (as well as null to release the resource) Some of these modes were intended for resources that are locked at multiple levels (which I don't think applies to array buffers). For example one might get a concurrent lock for a group of resources, then a protected lock for one resource. But as you say, there are some situations where concurrent write might be useful. -- Russell From jjb5 at cornell.edu Wed Sep 12 23:10:11 2007 From: jjb5 at cornell.edu (Joel Bender) Date: Wed, 12 Sep 2007 17:10:11 -0400 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> Message-ID: <46E855B3.7040908@cornell.edu> > Sounds much like the modes offered by an old operating system that had a > very nice lock manager. Aw, VMS isn't THAT old, is it? :-) I have a wrapper around threading.Lock and threading.RLock that I've been using that does deadlock detection and have wished for these lock modes many times. I would hesitate to create a PEP to support this until I actually knew I could pull it off.
I would be happy to share the code that I have with anybody that might find it useful, and welcome some help in implementing these modes. There's one other very useful feature, and that was a callback from the lock manager when a lock that you held was blocking a request from some other process. Joel From nick.bastin at gmail.com Wed Sep 12 23:15:54 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Wed, 12 Sep 2007 17:15:54 -0400 Subject: [Python-3000] C API for ints and strings In-Reply-To: References: <1189270839.25695.18.camel@qrnik> <66d0a6e10709090206n27b8cbe3y5f6d13085aa74036@mail.gmail.com> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <889D3A2E-3FE6-49C0-89E5-3EB6B885950D@python.org> Message-ID: <66d0a6e10709121415s49db5a03g90d902dd3a613abf@mail.gmail.com> On 9/12/07, Guido van Rossum wrote: > Can I just shortcut this discussion saying that we will *not* switch > to use GMP? It's just not going to happen. Period. End of discussion. I figured that was assumed once it was pointed out that it didn't work on Intel macs... I'm pretty sure that's a platform we'd prefer to continue to support. 
-- Nick From guido at python.org Wed Sep 12 23:18:37 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 12 Sep 2007 14:18:37 -0700 Subject: [Python-3000] C API for ints and strings In-Reply-To: <66d0a6e10709121415s49db5a03g90d902dd3a613abf@mail.gmail.com> References: <1189270839.25695.18.camel@qrnik> <46E3BBE7.4020800@v.loewis.de> <46E48C10.7010705@canterbury.ac.nz> <46E4D273.9080300@v.loewis.de> <46E5DC4B.6030304@canterbury.ac.nz> <46E5DE92.8070808@hastings.org> <889D3A2E-3FE6-49C0-89E5-3EB6B885950D@python.org> <66d0a6e10709121415s49db5a03g90d902dd3a613abf@mail.gmail.com> Message-ID: On 9/12/07, Nicholas Bastin wrote: > On 9/12/07, Guido van Rossum wrote: > > Can I just shortcut this discussion saying that we will *not* switch > > to use GMP? It's just not going to happen. Period. End of discussion. > > I figured that was assumed once it was pointed out that it didn't work > on Intel macs... I'm pretty sure that's a platform we'd prefer to > continue to support. Then why are people (not you) still arguing about this? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Wed Sep 12 23:19:31 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 12 Sep 2007 14:19:31 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E855B3.7040908@cornell.edu> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> <46E855B3.7040908@cornell.edu> Message-ID: That's a different topic altogether. We're talking here about locking modes for the buffer API (PEP 3118). This does not involve actual locks -- the operations just fail if the requested lock cannot be obtained. On 9/12/07, Joel Bender wrote: > > Sounds much like the modes offered by an old operating system that had a > > very nice lock manager. > > Awe, VMS isn't THAT old, is it? 
:-)
>
> I have a wrapper around threading.Lock and threading.RLock that I've
> been using that does deadlock detection and have wished for these lock
> modes many times. I would hesitate to create a PEP to support this
> until I actually knew I could pull it off.
>
> I would be happy to share the code that I have with anybody that might
> find it useful, and welcome some help in implementing these modes.
>
> There's one other very useful feature, and that was a callback from the
> lock manager when a lock that you held was blocking a request from some
> other process.
>
>
> Joel

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From nicko at nicko.org Thu Sep 13 01:33:07 2007
From: nicko at nicko.org (Nicko van Someren)
Date: Thu, 13 Sep 2007 00:33:07 +0100
Subject: [Python-3000] Performance Notes - new hash algorithm
In-Reply-To:
References: <52dc1c820709071345m4f4fbe52i41921be5fcb116df@mail.gmail.com> <1f7befae0709081848m477422bdm11355e58920bf6c6@mail.gmail.com>
Message-ID:

On 10 Sep 2007, at 01:58, Jim Jewett wrote:
> To spell this out a bit more:
> ...
> When adding four entries to an 8-slot table, a truly random hash would
> have at least one collision (0/8 + 1/8 + 2/8 + 3/8 =) 3/4 of the
> time. As expected, the proposed hash does have a collision for those
> four values (the first and fourth).

While your over-all analysis is both informative and helpful, the pedant in me feels obliged to point out the flaw in your math. The probability of at least one collision is 1 minus the probability of no collision, which is in turn 8/8 * 7/8 * 6/8 * 5/8, so the correct figure is actually that you collide about 59% of the time, not 75%.
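
The 1-minus-no-collision figure can be checked numerically. This is only a minimal sketch, assuming an ideal, uniformly random hash; the name collision_probability is illustrative and does not come from the thread or from CPython.

```python
from fractions import Fraction


def collision_probability(items, slots):
    """Chance that at least two of `items` uniformly random hashes
    fall into the same one of `slots` table slots (illustrative helper)."""
    no_collision = Fraction(1)
    for already_placed in range(items):
        # The next item must avoid every slot that is already occupied.
        no_collision *= Fraction(slots - already_placed, slots)
    return 1 - no_collision


if __name__ == "__main__":
    # Four entries in an 8-slot table: 1 - (8/8 * 7/8 * 6/8 * 5/8)
    print(float(collision_probability(4, 8)))  # 0.58984375
    # Five entries: still below 1, unlike the summed 0/8 + ... + 4/8
    print(float(collision_probability(5, 8)))  # 0.794921875
```

Exact fractions are used so the result is not blurred by floating-point rounding; the value only reaches 1 once there are more items than slots.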
(If your math were correct then 5 items would collide 125% of the time, which is clearly wrong! :-) Cheers, Nicko From unknown_kev_cat at hotmail.com Thu Sep 13 02:12:06 2007 From: unknown_kev_cat at hotmail.com (Joe Smith) Date: Wed, 12 Sep 2007 20:12:06 -0400 Subject: [Python-3000] Solaris support in 3.0? References: <66d0a6e10709050017s7b354bd7tf418a0c168e181c9@mail.gmail.com><46DE90B0.4050905@v.loewis.de><52dc1c820709050836pba30e32me219a4c03627f223@mail.gmail.com> <79990c6b0709060129s458f6ce4t71e128a4a4f6e2dd@mail.gmail.com> Message-ID: "Paul Moore" wrote in message news:79990c6b0709060129s458f6ce4t71e128a4a4f6e2dd at mail.gmail.com... > On 05/09/07, Gregory P. Smith wrote: >> Rather than resurrecting the old RSA-copyright md5.c I can easily make >> new >> ones out of the libtomcrypt md5 and sha1 sources the same way i created >> the >> non-openssl sha256 and sha512 modules. > > Which reminds me - when I build Python 3 (on an Ubuntu box) with > openssl installed, I get a message about _sha256 and _sha512 not being > built. Presumably this is intentional? (It looks a bit odd, and I > spent a while trying to work out what dependencies I needed before > realising it was probably OK). Yep, perfectly normal. That just says that the code that shipped with python is not being used, because OpenSSL's implementations of those functions were being used instead. (At least that is my understanding.) 
From skip at pobox.com Thu Sep 13 02:56:57 2007
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 12 Sep 2007 19:56:57 -0500
Subject: [Python-3000] __format__ and datetime
In-Reply-To:
References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> <18151.61549.956117.769166@montanaro.dyndns.org>
Message-ID: <18152.35545.33753.630023@montanaro.dyndns.org>

    Guido> Which ones are two letters?

All the locale-specific stuff on Solaris 10. I guess technically the first letter of the pair is a modifier of the actual code, which comes next. From the man page:

  Modified Conversion Specifications

     Some conversion specifications can be modified by the E and O
     modifiers to indicate that an alternate format or specification
     should be used rather than the one normally used by the
     unmodified conversion specification. If the alternate format or
     specification does not exist in the current locale, the behavior
     will be as if the unmodified specification were used.

     %Ec    Locale's alternate appropriate date and time representation.
     %EC    Name of the base year (period) in the locale's alternate representation.
     %Eg    Offset from %EC of the week-based year in the locale's alternative representation.
     %EG    Full alternative representation of the week-based year.
     %Ex    Locale's alternate date representation.
     %EX    Locale's alternate time representation.
     %Ey    Offset from %EC (year only) in the locale's alternate representation.
     %EY    Full alternate year representation.
     %Od    Day of the month using the locale's alternate numeric symbols.
     %Oe    Same as %Od.
     %Og    Week-based year (offset from %C) in the locale's alternate representation and using the locale's alternate numeric symbols.
     %OH    Hour (24-hour clock) using the locale's alternate numeric symbols.
     %OI    Hour (12-hour clock) using the locale's alternate numeric symbols.
     %Om    Month using the locale's alternate numeric symbols.
     %OM    Minutes using the locale's alternate numeric symbols.
     %OS    Seconds using the locale's alternate numeric symbols.
     %Ou    Weekday as a number in the locale's alternate numeric symbols.
     %OU    Week number of the year (Sunday as the first day of the week) using the locale's alternate numeric symbols.
     %Ow    Number of the weekday (Sunday=0) using the locale's alternate numeric symbols.
     %OW    Week number of the year (Monday as the first day of the week) using the locale's alternate numeric symbols.
     %Oy    Year (offset from %C) in the locale's alternate representation and using the locale's alternate numeric symbols.

Skip

From skip.montanaro at gmail.com Thu Sep 13 04:29:36 2007
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Wed, 12 Sep 2007 21:29:36 -0500
Subject: [Python-3000] __format__ and datetime
In-Reply-To:
References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> <18151.61549.956117.769166@montanaro.dyndns.org>
Message-ID: <60bb7ceb0709121929x53c82180xac8a350fb1d2a422@mail.gmail.com>

> Given how long strftime has been around I think %f is fine. We may
> even influence the future of the C library. :-)

Patch for datetime (py3k only at this point, no tests either) here:

http://bugs.python.org/issue1158

Skip

From qrczak at knm.org.pl Thu Sep 13 18:22:12 2007
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 13 Sep 2007 18:22:12 +0200
Subject: [Python-3000] Unicode and OS strings
Message-ID: <1189700532.22693.40.camel@qrnik>

What should happen when a command line argument or an environment variable is not decodable using the system encoding (on Unix where from the OS point of view it is an array of bytes)?

This is an unfortunate side effect of switching to Unicode. It's unfortunate because often the data is only passed back to another function, and thus lack of round trip is a pure loss caused by choosing a Unicode string as the representation of such data. I opt for Unicode strings nevertheless, Python did a right step.

I once checked what other languages with Unicode strings do, and the results were not enlightening: inconsistency, weird errors, damaged or truncated data.

Python 3.0a1 mostly fails with weird errors, and fails a bit too early:

[qrczak ~]$ echo $LANG
pl_PL.UTF-8

[qrczak ~]$ python3.0 - $(printf '\x80')
Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Fatal Python error: no mem for sys.argv
zsh: abort python3.0 - $(printf '\x80')

[qrczak ~]$ FOO=$(printf '\x80') python3.0
Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
object : UnicodeDecodeError('utf8', b'\x80', 0, 1, 'unexpected code byte')
type : UnicodeDecodeError
refcount: 4
address : 0xb7a5142c
lost sys.stderr
>>>

[qrczak ~]$ mkdir $(printf '\x80')

[qrczak ~]$ cd $(printf '\x80')

[qrczak ~/\M-^@]$ python3.0
Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
object : UnicodeDecodeError('utf8', b'/home/users/qrczak/\x80', 19, 20, 'unexpected code byte')
type : UnicodeDecodeError
refcount: 4
address : 0xb7a1242c
lost sys.stderr
>>>

os.listdir returns undecodable filenames as str8.

I don't know what it should do. Choices:

1. Fail in a controlled way (without losing sys.stderr), and no earlier than necessary, i.e. fail when the given string is requested, not when a module is imported.

1a.
Guarantee that choosing a different encoding and retrying works, for a rare case when the programmer wishes to handle such strings by explicitly trying latin1. 2. Return undecodable information as bytes, and accept bytes when it is passed back to similar functions in the other direction. 3. Have an option to use a modified UTF-8 in these places, where undecodable bytes are e.g. escaped as U+0000 U+00xx. I will not advocate any choice other than 1, but perhaps someone has another idea. My language Kogut uses 1a (even for things like sys.argv which look like variables), experimentally with 3 as an option to be requested either by choosing such encoding by the program or with an environment variable. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From guido at python.org Thu Sep 13 18:48:47 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 13 Sep 2007 09:48:47 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1189700532.22693.40.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> Message-ID: Yes, I have noticed this too. Environment variables, command line arguments, locale properties, TZ names, and so on, are often given as 8-bit strings in who knows what encoding. I'm not sure what the solution is, but we need one. I'm guessing one thing we need to do is research how various systems decide what encoding to use. Even on OSX, I managed to create an environment variable containing non-ASCII non-UTF-8 bytes. I believe Tcl/Tk used to have some kind of heuristic where they would try UTF-8 first and if that failed used Latin-1 for the bytes that aren't valid UTF-8, but I'm not at all sure that that's the right solution in places where Latin-1 is not spoken. --Guido On 9/13/07, Marcin 'Qrczak' Kowalczyk wrote: > What should happen when a command line argument or an environment > variable is not decodable using the system encoding (on Unix where > from the OS point of view it is an array of bytes)? 
> > This is an unfortunate side effect of switching to Unicode. It's > unfortunate because often the data is only passed back to another > function, and thus lack of round trip is a pure loss caused by > choosing a Unicode string as the representation of such data. > I opt for Unicode strings nevertheless, Python did a right step. > > I once checked what other languages with Unicode strings do, and the > results were not enlightening: inconsistency, weird errors, damaged or > truncated data. > > Python 3.0a1 mostly fails with weird errors, and fails a bit too early: > > [qrczak ~]$ echo $LANG > pl_PL.UTF-8 > > [qrczak ~]$ python3.0 - $(printf '\x80') > Python 3.0a1 (py3k, Sep 8 2007, 15:57:56) > [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Fatal Python error: no mem for sys.argv > zsh: abort python3.0 - $(printf '\x80') > > [qrczak ~]$ FOO=$(printf '\x80') python3.0 > Python 3.0a1 (py3k, Sep 8 2007, 15:57:56) > [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import os > object : UnicodeDecodeError('utf8', b'\x80', 0, 1, 'unexpected code byte') > type : UnicodeDecodeError > refcount: 4 > address : 0xb7a5142c > lost sys.stderr > >>> > > [qrczak ~]$ mkdir $(printf '\x80') > > [qrczak ~]$ cd $(printf '\x80') > > [qrczak ~/\M-^@]$ python3.0 > Python 3.0a1 (py3k, Sep 8 2007, 15:57:56) > [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import os > object : UnicodeDecodeError('utf8', b'/home/users/qrczak/\x80', 19, 20, 'unexpected code byte') > type : UnicodeDecodeError > refcount: 4 > address : 0xb7a1242c > lost sys.stderr > >>> > > os.listdir returns undecodable filenames as str8. > > I don't know what it should do. Choices: > > 1. Fail in a controlled way (without losing sys.stderr), and no earlier > than necessary, i.e. 
fail when the given string is requested, not > when a module is imported. > > 1a. Guarantee that choosing a different encoding and retrying works, > for a rare case when the programmer wishes to handle such strings by > explicitly trying latin1. > > 2. Return undecodable information as bytes, and accept bytes when it is > passed back to similar functions in the other direction. > > 3. Have an option to use a modified UTF-8 in these places, where > undecodable bytes are e.g. escaped as U+0000 U+00xx. > > I will not advocate any choice other than 1, but perhaps someone has > another idea. > > My language Kogut uses 1a (even for things like sys.argv which look like > variables), experimentally with 3 as an option to be requested either by > choosing such encoding by the program or with an environment variable. > > -- > __("< Marcin Kowalczyk > \__/ qrczak at knm.org.pl > ^^ http://qrnik.knm.org.pl/~qrczak/ > > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Thu Sep 13 19:08:40 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 13 Sep 2007 19:08:40 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> Message-ID: <46E96E98.9080406@v.loewis.de> > Yes, I have noticed this too. Environment variables, command line > arguments, locale properties, TZ names, and so on, are often given as > 8-bit strings in who knows what encoding. I'm not sure what the > solution is, but we need one. One "universal" solution is to use Unicode private-use-area characters. 
We could come up with some error handler which replaces undecodable characters with a PUA character; on encoding, the same error handler encodes the PUA characters again as bytes. We would need a block of 256 PUA characters for that.

Of course, if the input data already contains PUA characters, there would be an ambiguity. We can rule this out for most codecs, as they don't support PUA characters. The major exception would be UTF-8, for which we would need to create a UTF-8-noPUA codec, which would then be used at all system interfaces that should use UTF-8 but might use arbitrary bytes.

We would make a list of all interfaces that use the PUA error handler: file names, environment variables, command line arguments.

> I'm guessing one thing we need to do is
> research how various systems decide what encoding to use. Even on OSX,
> I managed to create an environment variable containing non-ASCII
> non-UTF-8 bytes.

Unix-ish systems just don't decide. They pass that on to the application. On display, they display things like question marks. At API level, it's just null-terminated char*.

> I believe Tcl/Tk used to have some kind of heuristic where they would
> try UTF-8 first and if that failed used Latin-1 for the bytes that
> aren't valid UTF-8, but I'm not at all sure that that's the right
> solution in places where Latin-1 is not spoken.

Indeed not - here lies moji-bake.

Regards,
Martin

From stephen at xemacs.org Thu Sep 13 20:43:59 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 14 Sep 2007 03:43:59 +0900
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <46E96E98.9080406@v.loewis.de>
References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de>
Message-ID: <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp>

"Martin v. Löwis" writes:

 > One "universal" solution is to use Unicode private-use-area
 > characters.

+1

 > Of course, if the input data already contains PUA characters,
 > there would be an ambiguity.

That may be true in the implementation, but it shouldn't. What should happen internally is that all undecodable characters (which PUA characters are by definition for standard codecs) are mapped to unused codepoints in the PUA, chosen by Python. This map would be required to maintain some house-keeping information about where the character came from (specifically the original coded character set so that round-tripping would succeed).

One possible error-recovery strategy for broken encodings (as opposed to coding which is correct in format but contains a code point not in the table) would be to have a "pure code unit" block in the PUA. Note that since we're talking about code units throughout (there's no guarantee that the encoding in question is octet-oriented, although that's almost always the case in practice), 256 code points may not be enough.

 > We would make a list of all interfaces that use the PUA error
 > handler: file names, environment variables, command line
 > arguments.

In general, I don't consider this an error. It's reasonable to use exception handling internally to the codec -- such broken texts are rare except in interactive applications where the speed isn't an issue -- but for some applications it would be useful to accept entire broken strings and pass them to Python with the broken parts marked (i.e., by being assigned to the "code unit" block of the PUA) and the rest decoded.

Here's an example that comes up in Emacs (specifically AUCTeX). TeX error messages are octet-oriented and regularly slice multibyte encodings in the middle of characters or escape sequences. It turns out the basic codec algorithms often DTRT by (accidentally) resynchronizing on ASCII, and sometimes can even resynch on a multibyte character. So the display of the "broken" text is often useful. However, for reasons I'm not familiar with, the AUCTeX developers have asked that the strings be invertible (i.e., back to the octets that TeX spit out). This scheme would allow that.
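
The round trip discussed above (undecodable bytes mapped into a reserved PUA block on decode, and mapped back to the same bytes on encode) might be sketched as follows. This is only an illustration under stated assumptions: the block base PUA_BASE, the handler name "pua-escape-sketch" and the two helper functions are hypothetical and appear nowhere in the thread. Only codecs.register_error is a real standard-library call. The encoding direction is written out by hand because a codec error handler is never invoked for characters that UTF-8 can already encode.

```python
import codecs

# Assumption: a block of 256 private-use code points reserved for raw bytes.
PUA_BASE = 0xF700


def pua_escape(exc):
    """Decode-side error handler: map each undecodable byte to PUA_BASE + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-escape-sketch", pua_escape)


def decode_os_bytes(raw, encoding="utf-8"):
    """Decode bytes from the OS, keeping undecodable bytes as PUA characters."""
    return raw.decode(encoding, "pua-escape-sketch")


def encode_os_str(text, encoding="utf-8"):
    """Inverse of decode_os_bytes: PUA-escaped characters become raw bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp < PUA_BASE + 256:
            out.append(cp - PUA_BASE)      # restore the original byte
        else:
            out += ch.encode(encoding)     # ordinary character
    return bytes(out)


if __name__ == "__main__":
    raw = b"abc\x80def"                    # not valid UTF-8
    text = decode_os_bytes(raw)
    print(ascii(text))
    print(encode_os_str(text) == raw)
```

The ambiguity raised earlier in the thread is visible here too: input that legitimately contains a character from the reserved block would be turned into a single raw byte on the way back out, which is why a codec that rejects such PUA characters on input was suggested for UTF-8.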

From martin at v.loewis.de Thu Sep 13 21:18:05 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 13 Sep 2007 21:18:05 +0200
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <46E98CED.1010008@v.loewis.de>

> > We would make a list of all interfaces that use the PUA error
> > handler: file names, environment variables, command line
> > arguments.
>
> In general, I don't consider this an error.

I don't, either. However, given the current codec design, this is the least intrusive way to enhance "all" codecs with the feature of mapping unsupported code points to PUA characters. Otherwise, we would have to duplicate all codecs.

Regards,
Martin

From qrczak at knm.org.pl Thu Sep 13 21:26:15 2007
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 13 Sep 2007 21:26:15 +0200
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <46E96E98.9080406@v.loewis.de>
References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de>
Message-ID: <1189711575.22693.86.camel@qrnik>

On 13-09-2007, Thu at 19:08 +0200, "Martin v. Löwis" wrote:

> Of course, if the input data already contains PUA characters,
> there would be an ambiguity. We can rule this out for most codecs,
> as they don't support PUA characters. The major exception would
> be UTF-8,

Most codecs other than UTF-8 don't have this problem.

Unicode people are generally allergic to any non-standard variants of Unicode specifications, and feel that this is a heresy. I experimentally and optionally use U+0000 escaping, but I'm not convinced that anything like this is a good idea, and it should probably not be enabled by default.

Mono uses U+0000 escaping too; I'm not sure if all the details agree.

This escaping scheme has an advantage that it's compatible with real UTF-8 for strings which contain no \x00 = U+0000. Most of applicable contexts do guarantee to not contain NUL, so the interpretation of valid data in both directions is unchanged.

My encoder even rejects U+0000 prefixes for bytes which would form valid UTF-8 sequences, so you can't have two Unicode strings which encode to the same byte string. The side effect is that not all U+0000 occurrences can be encoded, but the contexts we are talking about don't allow U+0000 anyway.

> > I'm guessing one thing we need to do is
> > research how various systems decide what encoding to use.

This is the easy part; modern Unices have nl_langinfo(CODESET). The hard part is deciding what to do when decoding fails.

[I will be absent between Friday and Monday.]

Here is what other environments do. This was over 2 years ago, something might have changed. In particular Mono now uses some U+0000 escaping, I need to investigate it again.

I checked both directions, i.e. what do they do with unencodable filenames given by the program. Everything is on Linux. Some behaviors are obviously awful.

Java (Sun)
----------

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.

Java (GNU)
----------

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8. Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale. Bytes which cannot be converted are silently skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is accepted. On output characters above U+00FF are replaced by "?".

C# (mono)
---------

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS environment variable, with UTF-8 implicitly added at the end. These encodings are tried in order.

a) Interpreting. If a filename cannot be converted, it is skipped in a directory listing.

The documentation says that if a filename, a command line argument etc. looks like valid UTF-8, it is treated as such first, and MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases. The reality seems to not match this (mono-1.0.5).

b) Creating. If UTF-8 is used, U+0000 throws an exception (System.ArgumentException: Path contains invalid chars), paired surrogates are treated correctly, and an isolated surrogate causes an internal error:

** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an argument cannot be converted, the program dies at start:

[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated correctly, unpaired surrogates are converted to pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be converted are silently skipped.

Perl
----

Depending on the convention used by a particular function and on imported packages, a Perl string is treated either as Perl-modified Unicode (with character values up to 32 bits or 64 bits depending on the architecture) or as an unspecified locale encoding. It has two internal representations: ISO-8859-1 and Perl-modified UTF-8 (with an extended range).

If every Perl string is assumed to be a Unicode string, then filenames are effectively ISO-8859-1.

a) Interpreting. Characters up to U+00FF are used.

b) Creating. If the filename has no characters above 0xFF, it is converted to ISO-8859-1. Otherwise it is converted to Perl-modified UTF-8 (all characters, not just those above 0xFF).

Command line arguments and standard I/O are treated in the same way, i.e. ISO-8859-1 on input and a mixture of ISO-8859-1 and UTF-8 on output, depending on the contents.

This behavior is modifiable by importing various packages and using interpreter invocation flags. When Perl is told that command line arguments are UTF-8, the behavior for strings which cannot be converted is inconsistent: sometimes it's treated as ISO-8859-1, sometimes an error is signalled.

Haskell
-------

Haskell nominally uses Unicode. There is no conversion framework standardized or implemented yet though. Implementations which support more than 256 characters currently assume ISO-8859-1 for filenames, command line arguments and all I/O, taking the lowest 8 bits of a character code on output.

Common Lisp: CLISP
------------------

Common Lisp standard doesn't say anything about string encoding. In Clisp strings are UTF-32 (internally optimized as UCS-2 and ISO-8859-1 when possible). Any character code up to U+10FFFF is allowed, including isolated surrogates.

Filenames are assumed to be in the locale encoding.

a) Interpreting. If a byte cannot be converted, a condition is signaled.

b) Creating. If a character cannot be converted, a condition is signaled.

Kogut (my language)
-----

Strings are UTF-32 (internally optimized as ISO-8859-1 when possible). Any character code up to U+10FFFF is allowed, including isolated surrogates.

Filenames are assumed to be in the locale encoding; the encoding can be overridden by a Kogut-specific environment variable. A program can itself set the encoding to something else, perhaps locally during execution of some code. It can use a conversion which puts U+FFFD / "?" instead of throwing an exception on error, or which does something else.

a) Interpreting. If a byte cannot be converted, an exception is thrown.

b) Creating. If a character cannot be converted or if a name contains U+0000, an exception is thrown.

Command line arguments and standard I/O are treated in the same way.

There is an additional encoding which is a modified UTF-8 and can be explicitly used instead of true UTF-8: any byte string can be decoded, where normally undecodable bytes and \0 are escaped as U+0000 U+00xx.

GNOME
-----

GNOME uses UTF-8 internally, or sometimes byte strings in other encodings. I guess filenames are passed as byte strings. AFAIK sometimes filenames are expressed as URLs, even internally when it's invisible to the user, and then various unsafe bytes are escaped as two hex digits preceded by the percent sign.

From the programmer's point of view the original byte strings are generally used. Filename encoding matters for the display though, so here I describe the user's point of view.

If the environment variable G_FILENAME_ENCODING is present, it specifies the encoding of filenames, unless it is @locale which means the encoding of the locale. If it's not present but G_BROKEN_FILENAMES is present, filenames are assumed to be in the locale encoding. If neither variable is present, filenames are assumed to be in UTF-8.

a) Interpreting. If a filename cannot be converted from the selected encoding, all non-ASCII bytes are shown as octal numbers preceded by the backslash, as hex numbers preceded by the percent sign, or as question marks, depending on the situation (I can observe all three cases in gedit). What is physically stored is the byte string and the file is opened successfully.

b) Creating. If a character cannot be represented, the application refuses to save the file until a good filename is entered.

Mozilla
-------

I don't know how it handles filenames internally. From the user's point of view it matters how it presents a local directory listing.

Filenames are assumed to be in the locale encoding.
If a filename cannot be converted, it's skipped. If it can be converted but contains characters like 0x80-0x9F in ISO-8859-2, they are displayed as question marks and the file is inaccessible.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From oliphant at enthought.com Thu Sep 13 21:27:33 2007
From: oliphant at enthought.com (Travis E. Oliphant)
Date: Thu, 13 Sep 2007 14:27:33 -0500
Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support
In-Reply-To:
References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com>
Message-ID: <46E98F25.5010404@enthought.com>

Guido van Rossum wrote:
> On 9/11/07, Travis E. Oliphant wrote:
>
>> I'm not sure I understand the difference between a classic read lock and
>> the exclusive write lock concept. Does the classic read-lock just
>> prevent writing to the memory area? In my mind that is a read-only
>> memory buffer and the buffer interface would complain if a writeable
>> buffer was requested.
>>
>
> There are different notions of reading and writing. Sometimes an
> object is naturally read-only (e.g. a PyString). In that case
> requesting SIMPLE access should pass but requesting WRITABLE or
> LOCKDATA access should fail. (I think the other flags are orthogonal
> to these, right?). Any number of concurrent SIMPLE accesses can
> coexist since the clients promise they will only read.
>

Yes, the other flags are orthogonal to this concept.

> OTOH suppose we have an object that is naturally writable (e.g. a
> PyBytes). I understood that in this case any number of SIMPLE or
> WRITABLE requests would be allowed to be outstanding simultaneously,
> and any of these would simply prevent the buffer from moving (fixing
> the object's size).
But this doesn't sound like it is how you meant it > -- you seem to say that once any SIMPLE (readonly) requests are > outstanding, WRITABLE requests should fail. Wait a minute. I want to clarify that normally any number of SIMPLE or WRITEABLE requests would be possible for an object that is naturally writeable. That is my thinking. The purpose of LOCKDATA is to allow an object to request that the object not be writeable in the future while it holds a view to the object. I did not think that this would be the normal behavior, but exceptional. What seems to be needed is yet another flag that allows a buffer requester to insist that the object not allow any buffer accesses read or write until its view is done. So, you would have something like LOCK_FOR_WRITE LOCK_FOR_READ I would want to encourage people not to use the LOCK_FOR_READ unless there is an important benefit or need to use it. On the other hand, the argument about dma mechanisms (like moving memory to a video card for processing) needing to make the buffer unavailable temporarily sounds like a reasonable one to me. I can already see applications for it. > And I suppose that only > one WRITABLE request ought to be allowed at a time. But then I don't > know what the difference between WRITABLE and LOCKDATA would be. > > > I hope I've clarified the difference between these in my mind. > Then a "classic read lock" would request read > access while locking out writers (bsddb would use this); I did not separate this case in my mind, as I presumed that if something wanted to prevent other writers it would itself want to write. I can see what is wanted here now. > a "classic > write lock" would request write access while locking out writers (your > scratch area example would use this); others who don't really care if > the data changes underneath them as long as it doesn't move (e.g. > traditional I/O) could request read access without locking. 
I'm not > sure if there's a use case to be made for write access without > locking, but I wouldn't rule it out -- possibly when two threads share > a memory area they might have their own protocol for locking it and > might just both want to be able to write to (parts of) it. > Yes, I would not rule out write-access without locking either. NumPy actually uses that all the time internally where two or more objects share the same data and can both write to it (although the community warns people about doing this without knowing what you are doing). > What do you think? Another way to look at this would be to consider > these 4 cases: > I think I was leaving out the cases 1) requesting a read access with future write locking ('classic read lock') 2) requesting a read or write access with future read locking. Let me see how my thinking maps to your list below which at first glance looks pretty good. > basic read access (I can read, others can read or write) > locked read access (I can read, others can only read) > basic write access (I can read and write, others can read or write) > exclusive write access (I can read and write, no others can read or write) > > I guess my original LOCK_DATA concept (I can read and write, others can only read) is not even in this list as you discuss below. I'm actually wondering if another function should be added to handle the concept of locking. I can imagine that it will want to grow more fine-grained locking possibilities. > Except that accessing the object from Python (e.g. iteration or > indexing) never gets locked out. (Or perhaps it should be? That can > also be done.) > I think if it doesn't go through the buffer interface it is up to the object to decide (i.e. what does the object do with itself when buffers are exported --- that will depend on the object). All it must do is support the buffer interface in the correct way (i.e. 
not move the memory that buffers are relying on, and correctly support the access modes that it purports to export).

>> Actually, writeable is an accepted variant of 'writable' (but it doesn't
>> show up in many spell-check dictionaries). No, it is not too late to
>> change it. Or just define WRITEABLE as WRITABLE. NumPy uses
>> "WRITEABLE" simply because I like that spelling better.
>>
>
> Google found 1.4M occurrences of writeable vs. 3.9M occurrences of
> writable. I guess you represent a strong minority. :-) I'd still like
> to see it changed. We can leave WRITEABLE as an alias for WRITABLE for
> those who are used to seeing it that way in NumPy.
>

I'm fine with that.

>
> Well, the scratch area scenario you describe makes it iffy to read
> anything out of the original object since you wouldn't know whether
> you were reading before, during or after the write back from the
> scratch area to the object's buffer. The question is, do we really
> care. If we adopted my 4 access modes above, we could say that basic
> read access will still be granted when someone has exclusive write
> access if we don't care, OR we could say that basic reads are locked
> out by exclusive write access. (And then there's the separate issue of
> whether python-level access counts as basic read access or doesn't
> count at all -- though the more I think about it, I think it should be
> treated the same as basic read access.)
>
>
>> On the other hand, there could be two concepts of locking that a
>> consumer could request from an object
>>
>> 1) Lock so that no other reads or writes are possible until the lock is
>> released.
>> 2) Lock so that only reads are possible.
>>
>> I had only thought of #2 for the current buffer interface.
>>
>
> #1 maps to locked read OR exclusive write access in the strict variant.
> #2 maps to locked read in my scheme.
>
>

Let me think about adding a function for read-write locking that is separate from getting a view (which implements memory-location locking).
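
A minimal sketch of the four access modes discussed in this exchange (basic read, locked read, basic write, exclusive write), assuming the strict variant in which exclusive write also shuts out readers. The names `Access`, `blocks` and `compatible` are hypothetical and are not part of any buffer API; the sketch only models which pairs of outstanding requests could coexist.

```python
from enum import Flag, auto


class Access(Flag):
    """Hypothetical flag set; not taken from the actual buffer interface."""
    READ = auto()           # requester will read
    WRITE = auto()          # requester will write
    EXCLUDE_WRITE = auto()  # requester wants no other writers
    EXCLUDE_READ = auto()   # requester wants no other readers (strict variant)


BASIC_READ = Access.READ
LOCKED_READ = Access.READ | Access.EXCLUDE_WRITE
BASIC_WRITE = Access.READ | Access.WRITE
EXCLUSIVE_WRITE = (Access.READ | Access.WRITE
                   | Access.EXCLUDE_WRITE | Access.EXCLUDE_READ)


def blocks(holder, request):
    """Return True if an outstanding `holder` view forbids `request`."""
    if Access.EXCLUDE_WRITE in holder and Access.WRITE in request:
        return True
    if Access.EXCLUDE_READ in holder and Access.READ in request:
        return True
    return False


def compatible(a, b):
    """Two views can coexist only if neither one forbids the other."""
    return not blocks(a, b) and not blocks(b, a)
```

Under these assumptions two locked readers can coexist, a locked reader and a basic writer cannot, and nothing coexists with an exclusive writer. Dropping `EXCLUDE_READ` from `EXCLUSIVE_WRITE` gives the looser variant in which basic reads are still granted.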
I appreciate the discussion as it is helping me clarify my thinking. -Travis From stephen at xemacs.org Thu Sep 13 23:12:04 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 14 Sep 2007 06:12:04 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1189711575.22693.86.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <1189711575.22693.86.camel@qrnik> Message-ID: <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> "Marcin 'Qrczak' Kowalczyk" writes: >> Of course, if the input data already contains PUA characters, >> there would be an ambiguity. We can rule this out for most codecs, >> as they don't support PUA characters. The major exception would >> be UTF-8, > Most codecs other than UTF-8 don't have this problem. All Japanese codecs do. Corporate variants of JIS remain alive, and well. They're not limited to Microsoft and Apple, but also IBM, Fujitsu/Sun, Hitachi, and NEC software allow entry of characters not in the JIS sets. > Unicode people are generally allergic to any non-standard variants of > Unicode specifications, and feel that this is a heresy. I experimentally > and optionally use U+0000 escaping, but I'm not convinced that anything > like this is a good idea, and it should probably not be enabled by > default. -1 Heresy, no. That doesn't make it anything like a good idea. There are plenty of character sets, even those that are ISO 2022 compatible, with undefined code points. Such code points regularly do appear in text content where the coded character set is either incorrectly specified or ambiguous. This means that a way of handling such points is very useful, and as long as there's enough PUA space, the approach I suggested can handle all of these various issues. 
Any application where there won't be enough PUA space is very special, either demanding more than 2 planes worth of private space (planes 15 and 16), or demanding very high efficiency (needs to fit in the BMP private space). The approach I suggest has the advantage that applications with a small PUA usage (IIRC more than 4000 PUA code points are available in the BMP) will have string length == character count.

> the contexts we are talking about don't allow U+0000 anyway.

zsh at least allows you to type ^V^SPC to enter an ASCII NUL character on the command line, and to assign a string containing NULs to an environment variable.

From qrczak at knm.org.pl Fri Sep 14 00:31:36 2007
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 14 Sep 2007 00:31:36 +0200
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp>
References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <1189711575.22693.86.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <1189722696.30037.14.camel@qrnik>

On Fri, 14-09-2007 at 06:12 +0900, Stephen J. Turnbull wrote:

> This means that a way of handling such points
> is very useful, and as long as there's enough PUA space, the approach
> I suggested can handle all of these various issues.

PUA already has a representation in UTF-8, so this is more incompatible with UTF-8 than needed, and hijacks characters which might be used (for example I'm using some PUA ranges for encoding my script, they are being transported between processes, and I would be upset if some language had mangled them to something else).

While U+0000 is also representable in UTF-8, it cannot occur in filenames, program arguments, environment variables etc., and thus in many contexts it was free. It's not free mostly in file contents, including stdin/stdout/stderr.
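
A rough sketch of the U+0000 escaping idea as described in this thread: undecodable bytes and NUL bytes become U+0000 followed by U+00xx, and everything else is ordinary UTF-8. The function names are hypothetical and this is one reading of the scheme, not the Kogut implementation.

```python
def decode_nul_escaped(data: bytes) -> str:
    """Decode any byte string; bad bytes and NUL become U+0000 U+00xx."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            text = data[pos:].decode("utf-8")
            bad = len(data)                 # no undecodable byte remains
        except UnicodeDecodeError as err:
            bad = pos + err.start           # index of first undecodable byte
            text = data[pos:bad].decode("utf-8")
        # A real NUL byte is escaped too, so U+0000 always starts an escape.
        out.append(text.replace("\x00", "\x00\x00"))
        if bad < len(data):
            out.append("\x00" + chr(data[bad]))
            bad += 1
        pos = bad
    return "".join(out)


def encode_nul_escaped(text: str) -> bytes:
    """Inverse of decode_nul_escaped; restores the original bytes."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == "\x00":
            raw = next(chars, None)
            if raw is None or ord(raw) > 0xFF:
                raise ValueError("malformed U+0000 escape")
            out.append(ord(raw))            # the escaped raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

For example, the Latin-1 byte in `b"caf\xe9"` would decode to `"caf\x00\xe9"` and encode back to the same four bytes, while valid UTF-8 input without NUL decodes exactly as it does under plain UTF-8.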
Of course my escaping scheme can preserve \0 too, by escaping it to U+0000 U+0000, but here it's incompatible with the real UTF-8. > zsh at least allows you to type ^V^SPC to enter an ASCII NUL character > on the command line, and to assign a string containing NULs to an > environment variable. They may work for its internal commands and process-internal variables. But there can't be NULs in arguments of program invocation, or in environment variables which survive execve, because the Unix APIs and data structures - not just C functions - use NULs to delimit these strings. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From greg.ewing at canterbury.ac.nz Fri Sep 14 01:26:28 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 14 Sep 2007 11:26:28 +1200 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <52dc1c820709120044h722605cekc86ea668a6a1b4bd@mail.gmail.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> <46E72B18.9060908@canterbury.ac.nz> <52dc1c820709120044h722605cekc86ea668a6a1b4bd@mail.gmail.com> Message-ID: <46E9C724.9080808@canterbury.ac.nz> Gregory P. Smith wrote: > When I read the plain term EXCLUSIVE I read that to mean nobody else can > read -or- write, ie: not shared in any sense. You're right, it's not the best term. > Lets extend these base > concepts to SHARED_READ, SHARED_WRITE, EXCLUSIVE_READ, EXCLUSIVE_WRITE EXCLUDE_WRITE might be better, since EXCLUSIVE_WRITE seems to imply that one is writing oneself as well. > EXCLUSIVE_READ - no others can read this buffer while this view is > open. This is the one that I don't think is necessary. I don't see a need to ever prevent others from *reading* if they really want to and are prepared to deal with the consequences. 
Most of the time the other party will be using READ_LOCK which includes EXCLUDE_WRITE, so it will fail if you're already holding a write lock.

So we just have

   READ
   WRITE
   READ_LOCK = READ | EXCLUDE_WRITE
   WRITE_LOCK = WRITE | EXCLUDE_WRITE

--
Greg

From greg.ewing at canterbury.ac.nz Fri Sep 14 02:02:12 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 14 Sep 2007 12:02:12 +1200
Subject: [Python-3000] __format__ and datetime
In-Reply-To: <18151.61549.956117.769166@montanaro.dyndns.org>
References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> <18151.61549.956117.769166@montanaro.dyndns.org>
Message-ID: <46E9CF84.7060308@canterbury.ac.nz>

skip at pobox.com wrote:
> I was just thinking about the folks at places like FermiLab and CERN. ;-)

Those guys probably need picoseconds...

--
Greg

From foom at fuhm.net Fri Sep 14 05:41:12 2007
From: foom at fuhm.net (James Y Knight)
Date: Thu, 13 Sep 2007 23:41:12 -0400
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <1189700532.22693.40.camel@qrnik>
References: <1189700532.22693.40.camel@qrnik>
Message-ID: <28CDCC5D-E62C-4C9F-86FE-2DC31C6834B0@fuhm.net>

On Sep 13, 2007, at 12:22 PM, Marcin 'Qrczak' Kowalczyk wrote:

> What should happen when a command line argument or an environment
> variable is not decodable using the system encoding (on Unix where
> from the OS point of view it is an array of bytes)?

Here's a suggestion I made on the SBCL dev list a while back, in response to the same issues. I am responding to myself here, where my first suggestion was to keep all the environmental gunk in byte-arrays rather than strings. That is still a very nice and simple possibility. My second inclination was to use a variant of utf8 which can handle all bytestrings, instead of utf8 itself: utf-8b.
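
As an illustration only: later Python 3 releases include a codec error handler named `surrogateescape` that behaves much like the utf-8b idea, mapping each undecodable byte to a lone low surrogate in the range U+DC80..U+DCFF. The helper names below are hypothetical; the sketch shows the lossless round trip.

```python
def decode_os_bytes(raw: bytes) -> str:
    """Decode as UTF-8, turning undecodable bytes into lone low surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def encode_os_str(text: str) -> bytes:
    """Reverse the mapping, restoring the original bytes exactly."""
    return text.encode("utf-8", "surrogateescape")


raw = b"caf\xe9.txt"            # Latin-1 bytes, not valid UTF-8
name = decode_os_bytes(raw)     # never raises for any byte string
restored = encode_os_str(name)  # same bytes as `raw`
```

Valid UTF-8 input decodes to the same characters as with the plain `utf-8` codec, so only the malformed parts of a string carry surrogates.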
This obviously works best when the system encoding is actually utf8. > On Aug 2, 2007, at 4:55 PM, James Y Knight wrote: > >> Yeah -- it's pretty clear the environment isn't _actually_ in the >> default encoding. It's just binary junk which often but not always >> contains some text encoded in some arbitrary superset of ASCII. Just >> like command line arguments (and filenames on linux). >> >> The hard part is that users expect command line arguments, filenames, >> and environment values to be strings (because they normally do >> contain text-like things), when strictly they cannot be because there >> is no reliable encoding. >> > > A good alternative to this is for SBCL to use the UTF8b encoding to > decode unix environment gunk (filenames, env vars, command line > args) which are *probably* in utf8, but might not be. utf8b has the > nice property that any arbitrary bytestring can be decoded into > unicode, and then round-tripped back to the same bytes. Valid utf8 > sequences turns into the same unicode characters as with the utf8 > codec. Invalid utf8 sequences turn into invalid surrogate pair > sequences in the unicode string. > > Thus, SBCL can return strings, and never throw an error. If you > actually wanted the random binary, you can losslessly convert the > unicode string back to binary. Win win. > > Some references: > Original mail: > http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html > > Blog entry: > http://bsittler.livejournal.com/10381.html > > Python implementation: http://hyperreal.org/~est/libutf8b/ James From greg.ewing at canterbury.ac.nz Fri Sep 14 06:00:56 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 14 Sep 2007 16:00:56 +1200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46EA0778.3000502@canterbury.ac.nz> Stephen J. 
Turnbull wrote: > What should > happen internally is that all undecodable characters (which PUA > characters are by definition for standard codecs) are mapped to unused > codepoints in the PUA, chosen by Python. You mean chosen dynamically? What happens if these PUA characters get encoded some other way, written out, and read back into another session? The information mapping them back to their original meanings would no longer be correct. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Fri Sep 14 06:28:39 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 14 Sep 2007 16:28:39 +1200 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E98F25.5010404@enthought.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> <46E98F25.5010404@enthought.com> Message-ID: <46EA0DF7.2090706@canterbury.ac.nz> Travis E. Oliphant wrote: > I would want to encourage people not to use the LOCK_FOR_READ unless > there is an important benefit or need to use it. If you mean that LOCK_FOR_READ would unilaterally deny anyone else read access, my proposal avoids this by not having such a mode at all. So you can always get read access if you really want it. But I expect that most of the time you'll at least want to make sure nobody is writing while you're trying to read. In my terminology you spell that READ | EXCLUDE_WRITE. > Let me think about adding a function for read-write locking that is > separate from getting a view (which implements memory-location > locking). 
I'm not sure it needs to be a separate function, just a clearly separated set of options in the flags. Remember that clients are only supposed to be holding a buffer for as short a time as possible. It's most likely that the same read/write locking options are going to apply for the whole duration of a buffer operation, I think. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From stephen at xemacs.org Fri Sep 14 06:52:45 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 14 Sep 2007 13:52:45 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA0778.3000502@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> Message-ID: <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> Greg Ewing writes: > Stephen J. Turnbull wrote: > > What should happen internally is that all undecodable characters > > (which PUA characters are by definition for standard codecs) are > > mapped to unused codepoints in the PUA, chosen by Python. > > You mean chosen dynamically? Yes. > What happens if these PUA characters get encoded some other way, You can't win that, because Unicode is the only encoding that attempts to guarantee even the possibility of round-tripping. The only thing you can win is if it's the *same* character set (which might be used by multiple encodings), and then we record the character set and the code point. That's the best we can do in theory. The main problem with this scheme that I know of is that if you have a Python string that contains such a code point, you'll need to somehow include the information about the original encoding when pickling and the like. 
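
A minimal sketch of the dynamic PUA mapping described above, assuming a per-process table that records the source character set and code point for each private-use character it hands out. `PuaRegistry` and the charset label `"x-vendor-jis"` are hypothetical names used only for illustration.

```python
class PuaRegistry:
    """Hands out BMP private-use code points for unmapped characters."""

    FIRST, LAST = 0xE000, 0xF8FF   # BMP Private Use Area

    def __init__(self):
        self._next = self.FIRST
        self._to_pua = {}      # (charset, code) -> PUA character
        self._from_pua = {}    # PUA character -> (charset, code)

    def assign(self, charset, code):
        """Return the PUA character standing in for (charset, code)."""
        key = (charset, code)
        if key not in self._to_pua:
            if self._next > self.LAST:
                raise OverflowError("BMP private use area exhausted")
            ch = chr(self._next)
            self._next += 1
            self._to_pua[key] = ch
            self._from_pua[ch] = key
        return self._to_pua[key]

    def origin(self, ch):
        """Return the recorded (charset, code), or None if not assigned."""
        return self._from_pua.get(ch)


registry = PuaRegistry()
first = registry.assign("x-vendor-jis", 0x8740)
again = registry.assign("x-vendor-jis", 0x8740)   # same character as `first`
second = registry.assign("x-vendor-jis", 0x8741)
```

Because the table lives in one process, a fresh `PuaRegistry` in another session could assign the same code point to a different character, which is why the mapping would have to travel with the string when pickling.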
From greg.ewing at canterbury.ac.nz Fri Sep 14 07:08:04 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 14 Sep 2007 17:08:04 +1200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46EA1734.6020103@canterbury.ac.nz> Stephen J. Turnbull wrote: > You can't win that, because Unicode is the only encoding that attempts > to guarantee even the possibility of round-tripping. Rubbish -- I can do print [ord(c) for c in my_unicode_string] and get perfect round-trippability if I want. You can ask people to use pre-existing officially-sanctioned encodings for their unicode data, but you can't force them to. > The main problem with this scheme that I know of is that if you have a > Python string that contains such a code point, you'll need to somehow > include the information about the original encoding when pickling and > the like. That's exactly the sort of thing I'm talking about. It would be surprising if pickling worked reliably for all strings *except* ones that happened to come in as a command line argument. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From stephen at xemacs.org Fri Sep 14 08:02:56 2007 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Fri, 14 Sep 2007 15:02:56 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1189722696.30037.14.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <1189711575.22693.86.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> <1189722696.30037.14.camel@qrnik> Message-ID: <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> "Marcin 'Qrczak' Kowalczyk" writes: >> This means that a way of handling such points is very useful, and >> as long as there's enough PUA space, the approach I suggested can >> handle all of these various issues. > PUA already has a representation in UTF-8, so this is more incompatible > with UTF-8 than needed, Hm? It's not incompatible at all, and we're not interested in a representation in UTF-8, but rather in UTF-16 (ie, the Python internal encoding). And it *is* needed, because these characters by assumption are not present in Unicode at all. (More precisely, they may be present, but the tables we happen to have don't have mappings for them.) > and hijacks characters No, it doesn't. As I responded to Greg Ewing, there is an issue about things like pickling which use Python internal representations, but not for anything which normally communicates with Python through codecs. > which might be used (for example I'm using some PUA ranges for > encoding my script, they are being transported between processes, > and I would be upset if some language had mangled them to something > else). Your escaping proposal *guarantees* mangling because it turns characters into tuples of code units; it does not preserve character set information. It only works for you because you only have one private script you care about, so you know what those code units mean. If we don't have character set information, then of course that's the best you can do, and my proposal will do something equivalent. But if we *do* have character set information, then my proposal is far more powerful. 
It allows us to process PUA characters as characters (ie, put them in strings, slice and dice, merge and meld) with some hope of recovering the character's semantics after many transformations of the containing string. In any case, it would not be hard to create an API allowing a Python program to "reserve" a block in a PUA. You still have the issue of collision among multiple applications wanting the same block, of course. You may be able to guarantee that will never happen in your application, but there are examples of OSes that assigned characters in the PUA (Mac OS and Microsoft Windows both did so at one time or another, although they may not be doing it currently, I haven't checked). > While U+0000 is also representable in UTF-8, it cannot occur in > filenames, program arguments, environment variables etc., in many > contexts it was free. In your experience, and mine, but is it in POSIX? If not, I'd rather not add the restriction, no matter how harmless it seems in practice. (Of course practicality beats purity, but your proposal has many other defects, too.) I'm also very bothered by the fact that the interpretation of U+0000 differs in different contexts in your proposal. As I'm sure you know, the semantics of mixing codecs with different semantics (specifically, the treatment of particular code units) is very hairy. Once you get a string into Python, you normally no longer know where it came from, but now whether something came from the program argument or environment or from a stdio stream changes the semantics of U+0000. For me personally, that's a very good reason to object to your proposal. > Of course my escaping scheme can preserve \0 too, by escaping it to > U+0000 U+0000, but here it's incompatible with the real UTF-8. No. It's *never* compatible with UTF-8 because it assigns a different meaning to U+0000 from ASCII NUL. Your scheme also suffers from the practical problem that strings containing escapes are no longer arrays of characters. 
One effect of my scheme is to extend the "string is array" model to any application that doesn't need to treat more non-BMP characters than there is space available in the PUA. Once implemented, it could easily be adapted to handle characters in Planes 1-16, thus avoiding any use of surrogates in the vast majority of cases.

From qrczak at knm.org.pl Fri Sep 14 09:49:33 2007
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 14 Sep 2007 09:49:33 +0200
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp>
References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <1189711575.22693.86.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> <1189722696.30037.14.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <1189756174.32337.30.camel@qrnik>

On Fri, 14-09-2007 at 15:02 +0900, Stephen J. Turnbull wrote:

> > PUA already has a representation in UTF-8, so this is more incompatible
> > with UTF-8 than needed,
>
> Hm? It's not incompatible at all, and we're not interested in a
> representation in UTF-8, but rather in UTF-16

PUA is representable in both. When the command line contains a UTF-8 encoding of U+E650 (a PUA character), the script had better receive a UTF-16 or UTF-32 encoding of U+E650 in the appropriate place, otherwise we are corrupting user data.

> (ie, the Python internal encoding).

(Python also uses UTF-32 alternatively to UTF-16.)

> And it *is* needed, because these characters by assumption
> are not present in Unicode at all. (More precisely, they may be
> present, but the tables we happen to have don't have mappings for
> them.)

They are present! For UTF-8, UTF-16 and UTF-32 PUA is not special in any way. It's just a block of characters which will never be officially assigned by the Unicode Consortium, so they can be used privately among parties who agree about their meaning.
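
A small check of the point about U+E650, written against a current Python 3 rather than the interpreter under discussion: the standard codecs treat a private-use character like any other code point. The helper name `roundtrips` is hypothetical.

```python
import unicodedata


def roundtrips(ch: str) -> bool:
    """True if `ch` survives encode/decode in all three UTF forms."""
    return all(ch.encode(enc).decode(enc) == ch
               for enc in ("utf-8", "utf-16", "utf-32"))


pua = "\ue650"
utf8_bytes = pua.encode("utf-8")           # three bytes: EE 99 90
category = unicodedata.category(pua)       # "Co" means private use
```

A decoder that reused U+E650 for its own escapes could no longer pass this character through unchanged, which is the data corruption concern raised here.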
> Your escaping proposal *guarantees* mangling because it turns > characters into tuples of code units; it does not preserve character > set information. Huh? What do you mean by preserving character set information? It preserves the byte string contents, which is all that is needed. It has the same result as UTF-8 for all valid UTF-8 sequences not containing NUL. > > While U+0000 is also representable in UTF-8, it cannot occur in > > filenames, program arguments, environment variables etc., in many > > contexts it was free. > > In your experience, and mine, but is it in POSIX? Yes. Both as specified and in the reality (e.g. POSIX offers the second parameter of main() of type char ** as the only way to receive command line arguments, and they are NUL-terminated). > I'm also very bothered by the fact that the interpretation of U+0000 > differs in different contexts in your proposal. Well, for any scheme which attempts to modify UTF-8 by accepting arbitrary byte strings is used, *something* must be interpreted differently than in real UTF-8. > Once you get a > string into Python, you normally no longer know where it came from, > but now whether something came from the program argument or > environment or from a stdio stream changes the semantics of U+0000. > For me personally, that's a very good reason to object to your > proposal. This can be said about any modification of UTF-8. Of course you can use such encoding on a standard stream too. In this case only U+0000 cannot be used normally, and the resulting stream will contain whatever bytes were present in filenames and other strings being output to it. > > Of course my escaping scheme can preserve \0 too, by escaping it to > > U+0000 U+0000, but here it's incompatible with the real UTF-8. > > No. It's *never* compatible with UTF-8 because it assigns a different > meaning to U+0000 from ASCII NUL. 
It is compatible with UTF-8 except for U+0000, and a true U+0000 cannot occur anyway in these contexts, so this incompatibility is mostly harmless.

> Your scheme also suffers from the practical problem that strings
> containing escapes are no longer arrays of characters.

They are no less arrays of characters than strings containing combining marks.

[And now I'm gone for 4 days.]

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl Fri Sep 14 10:20:47 2007
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 14 Sep 2007 10:20:47 +0200
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <28CDCC5D-E62C-4C9F-86FE-2DC31C6834B0@fuhm.net>
References: <1189700532.22693.40.camel@qrnik> <28CDCC5D-E62C-4C9F-86FE-2DC31C6834B0@fuhm.net>
Message-ID: <1189758047.544.1.camel@qrnik>

On Thu, 13-09-2007 at 23:41 -0400, James Y Knight wrote:

> Here's a suggestion I made on the SBCL dev list a while back, in
> response to the same issues.

On second thought, this (escaping undecodable UTF-8 bytes by unpaired low surrogates) might be a good idea. (I don't remember why I once rejected this.)

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From stephen at xemacs.org Fri Sep 14 10:56:24 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 14 Sep 2007 17:56:24 +0900
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <46EA1734.6020103@canterbury.ac.nz>
References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz>
Message-ID: <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp>

Greg Ewing writes:

> Stephen J. Turnbull wrote:
> > You can't win that, because Unicode is the only encoding that attempts
> > to guarantee even the possibility of round-tripping.
> > Rubbish -- I can do print [ord(c) for c in my_unicode_string] > and get perfect round-trippability if I want. Speaking of rubbish. You chose the context of round-tripping *across encodings*, not me. Please stick with your context. > You can ask people to use pre-existing officially-sanctioned > encodings for their unicode data, but you can't force them to. A wide variety of encodings, some standard and some not, and not necessarily with a known injection into Unicode, is precisely what I'm trying to deal with. None of the other proposals, except maybe Martin's, do. James Knight's proposal as it stands assumes UTF-8 Unicode, while Marcin Kowalczyk's just punts to treating everything unknown as a sequence of code units AFAICS. > > The main problem with this scheme that I know of is that if you have a > > Python string that contains such a code point, you'll need to somehow > > include the information about the original encoding when pickling and > > the like. I was merely admitting that getting it to work *efficiently* and *backward-compatibly* for pickling will be tricky. But it's trivial to get it to work *reliably*. > That's exactly the sort of thing I'm talking about. It > would be surprising if pickling worked reliably for all > strings *except* ones that happened to come in as a > command line argument. Um, no, it's not what you're talking about. Pickling is not currently reliable for strings that come in as command line arguments because Python is not reliable. That's precisely what we're trying to fix. None of the proposals make things worse, since they only apply in cases where the codec would throw an exception or incorrectly decode the argument anyway. Yes, you could improve reliability in this sense by storing those strings as bytes, rather than trying to make better encoding guesses and storing "debugging info" about undecodable input. 
But surely using bytes objects is a non-starter; users are going to expect that command-line arguments are strings, not bytes, and ASCII-only users will raise hell if you ask them to explicitly invoke codecs to translate command-line arguments to strings so that they can be used. From hagenf at CoLi.Uni-SB.DE Fri Sep 14 11:15:00 2007 From: hagenf at CoLi.Uni-SB.DE (=?UTF-8?B?SGFnZW4gRsO8cnN0ZW5hdQ==?=) Date: Fri, 14 Sep 2007 11:15:00 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1189700532.22693.40.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> Message-ID: <46EA5114.9060200@coli.uni-saarland.de> Is it too unreasonable to keep the byte strings we get from the OS as byte strings in Python (since we're not sure about their encoding) and offer functions for getting strings? sys.argv could be of type bytes and sys.arguments (or whatever) could be a function taking an encoding parameter (which defaults to UTF-8) and returning strings. Of course that's backwards incompatible and I'm not sure if it's too late for something like this now. - Hagen From ncoghlan at gmail.com Fri Sep 14 12:07:21 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 14 Sep 2007 20:07:21 +1000 Subject: [Python-3000] __format__ and datetime In-Reply-To: <46E9CF84.7060308@canterbury.ac.nz> References: <46E559E9.4090907@trueblade.com> <46E55B05.3090701@v.loewis.de> <46E55FD4.9000807@trueblade.com> <79990c6b0709100829t6aa18653i5f67b7848c778587@mail.gmail.com> <18150.1863.436464.41503@montanaro.dyndns.org> <18150.42783.278892.121765@montanaro.dyndns.org> <18151.61549.956117.769166@montanaro.dyndns.org> <46E9CF84.7060308@canterbury.ac.nz> Message-ID: <46EA5D59.6050103@gmail.com> Greg Ewing wrote: > skip at pobox.com wrote: >> I was just thinking about the folks at places like FermiLab and CERN. ;-) > > Those guys probably need picoseconds... 
With the suggested %f format character and the mention of Fermilab and CERN, I started thinking about femtoseconds :) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From barry at python.org Fri Sep 14 13:30:10 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 14 Sep 2007 07:30:10 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA1734.6020103@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> Message-ID: <0618908E-E4A5-4062-BC92-1A0B83C69E7B@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 14, 2007, at 1:08 AM, Greg Ewing wrote: > Stephen J. Turnbull wrote: >> You can't win that, because Unicode is the only encoding that >> attempts >> to guarantee even the possibility of round-tripping. > > Rubbish -- I can do print [ord(c) for c in my_unicode_string] > and get perfect round-trippability if I want. I think my_unicode_string.encode('raw-unicode-escape') is equivalent.
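The two round-trip routes mentioned above can be compared with a short sketch. It is written for Python 3 (so chr() stands in for unichr()), the helper names are made up for illustration, and it only checks one sample string; it does not show that the two routes are equivalent for every possible input.

```python
def roundtrip_via_ordinals(text):
    """Dump the string as a list of code points, then rebuild it."""
    ordinals = [ord(c) for c in text]
    return "".join(chr(n) for n in ordinals)


def roundtrip_via_raw_unicode_escape(text):
    """Encode with the raw-unicode-escape codec, then decode again."""
    data = text.encode("raw-unicode-escape")
    return data.decode("raw-unicode-escape")


# Sample with Latin-1 characters and one character beyond Latin-1.
sample = "na\u00efve caf\u00e9 \u20ac"
assert roundtrip_via_ordinals(sample) == sample
assert roundtrip_via_raw_unicode_escape(sample) == sample
```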
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRupww3EjvBPtnXfVAQKBWAP/dU7eBsgvg704+beCPRbcKkFJvQuVd7br D0irSae0P4IxQDC36dlVE+nUFvKWQDx0UPBmFfWb7CYZnmGpS+Z1hBNLzKy+5POJ A4KSVV9nv1+YGKZBna1zgxuiP9EEHo7MqPm5PxKHmMHqpmcns3U6hZxutBCXN7Sw pics7Kb7s6s= =fiv7 -----END PGP SIGNATURE----- From barry at python.org Fri Sep 14 13:34:36 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 14 Sep 2007 07:34:36 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA5114.9060200@coli.uni-saarland.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> Message-ID: <200CD272-6015-4FE7-A004-5939E59316BE@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 14, 2007, at 5:15 AM, Hagen Fürstenau wrote: > Is it too unreasonable to keep the byte strings we get from the OS as > byte strings in Python (since we're not sure about their encoding) and > offer functions for getting strings? > > sys.argv could be of type bytes and sys.arguments (or whatever) > could be > a function taking an encoding parameter (which defaults to UTF-8) and > returning strings. > > Of course that's backwards incompatible and I'm not sure if it's too > late for something like this now. It might be reasonable and even necessary, but I suspect usability will suffer.
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRupxzHEjvBPtnXfVAQJ1owP+OBzC2UfeU4rio8nQJgYHl33xZfsAmHkQ Iv8188QzbCuypWQF/Zwr6i6yu+Kt64b0amDoYKI/VdnTceeC3u5ejSh66JocyP2X SmNJYrt6aikFJTgs5nqAgAKQhcXfPNZh45tg/ZVsnpOro6juZTSgs+XO3b3g16VD VSs//yDdL64= =nBLI -----END PGP SIGNATURE----- From martin at v.loewis.de Fri Sep 14 14:09:58 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 14 Sep 2007 14:09:58 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA5114.9060200@coli.uni-saarland.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> Message-ID: <46EA7A16.5010902@v.loewis.de> > Is it too unreasonable to keep the byte strings we get from the OS as > byte strings in Python (since we're not sure about their encoding) and > offer functions for getting strings? I think people will complain if command line arguments aren't strings, and they will complain even more so if file names are not strings. > Of course that's backwards incompatible and I'm not sure if it's too > late for something like this now. That is not a concern. However, it is fundamentally the wrong thing to do. Most people rightfully view command line arguments and file names as strings, as they use the keyboard to enter them, and the computer uses letters from a font to display them. They are not bytes conceptually - they are strings in a potentially unknown encoding. Regards, Martin From hagenf at CoLi.Uni-SB.DE Fri Sep 14 14:20:19 2007 From: hagenf at CoLi.Uni-SB.DE (=?UTF-8?B?SGFnZW4gRsO8cnN0ZW5hdQ==?=) Date: Fri, 14 Sep 2007 14:20:19 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA7A16.5010902@v.loewis.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EA7A16.5010902@v.loewis.de> Message-ID: <46EA7C83.6040507@coli.uni-saarland.de> > That is not a concern. However, it is fundamentally the wrong thing to > do. 
Most people rightfully view command line arguments and file names > as strings, as they use the keyboard to enter them, and the computer > uses letters from a font to display them. They are not bytes > conceptually - they are strings in a potentially unknown encoding. Are you sure that "strings in an unknown encoding" are conceptually strings and not rather bytes? And what if we skillfully conserve unknown bytes in a private use or surrogate area and the application author actually knows the encoding and wants correctly decoded strings? - Hagen -- http://www.coli.uni-saarland.de/~hagenf/ PGP fingerprint: C8EF 458E 5531 14AA 42BC AA1C 36AE D91D BA94 7D32 From martin at v.loewis.de Fri Sep 14 14:32:59 2007 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Fri, 14 Sep 2007 14:32:59 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA7C83.6040507@coli.uni-saarland.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EA7A16.5010902@v.loewis.de> <46EA7C83.6040507@coli.uni-saarland.de> Message-ID: <46EA7F7B.2060609@v.loewis.de> > Are you sure that "strings in an unknown encoding" are conceptually > strings and not rather bytes? For file names, most definitely. For command line arguments, I am fairly sure: the argc/argv calling convention does not allow for arbitrary bytes. > And what if we skillfully conserve unknown bytes in a private use or > surrogate area and the application author actually knows the encoding > and wants correctly decoded strings? They can easily roundtrip that then to the encoding that it should have: good_string = sys.argv[bad_string_index].\ encode(sys.argv_encoding, "pua-replace").decode(real_encoding) However, we are talking about borderline cases here - in most cases, Python will just do the right thing. Special cases aren't special enough to break the rules. 
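The round trip described above can be tried out with an error handler that Python 3 does ship, namely "surrogateescape" (PEP 383). This is a substitution for illustration only: sys.argv_encoding and the "pua-replace" handler are proposals from this thread and are not used here. The sample bytes and the two encodings are assumptions chosen for the example.

```python
# Bytes as they might arrive from the OS: really Shift JIS,
# but the program initially guesses UTF-8.
raw = "\u65e5\u672c\u8a9e".encode("shift_jis")
guessed_encoding = "utf-8"      # assumed (wrong) guess
real_encoding = "shift_jis"     # what the application author knows

# Undecodable bytes are carried inside the string as lone surrogates.
bad_string = raw.decode(guessed_encoding, "surrogateescape")

# Re-encoding with the same codec and handler restores the original bytes,
good_bytes = bad_string.encode(guessed_encoding, "surrogateescape")
# which can then be decoded with the encoding they really had.
good_string = good_bytes.decode(real_encoding)

assert good_bytes == raw
```

The point of the sketch is only that the wrongly decoded string still carries enough information to get back to the bytes, so a later, better-informed decode remains possible.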
Regards, Martin From hagenf at CoLi.Uni-SB.DE Fri Sep 14 14:46:34 2007 From: hagenf at CoLi.Uni-SB.DE (=?UTF-8?B?SGFnZW4gRsO8cnN0ZW5hdQ==?=) Date: Fri, 14 Sep 2007 14:46:34 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA7F7B.2060609@v.loewis.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EA7A16.5010902@v.loewis.de> <46EA7C83.6040507@coli.uni-saarland.de> <46EA7F7B.2060609@v.loewis.de> Message-ID: <46EA82AA.3070200@coli.uni-saarland.de> > They can easily roundtrip that then to the encoding that it should have: > > good_string = sys.argv[bad_string_index].\ > encode(sys.argv_encoding, "pua-replace").decode(real_encoding) To me this doesn't look easier than sys.arguments() in the standard case and sys.arguments(encoding="whatever") if you know the special encoding. Just my two cents... - Hagen From jimjjewett at gmail.com Fri Sep 14 15:39:31 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 14 Sep 2007 09:39:31 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA5114.9060200@coli.uni-saarland.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> Message-ID: On 9/14/07, Hagen Fürstenau wrote: > Is it too unreasonable to keep the byte strings we get from the OS as > byte strings in Python (since we're not sure about their encoding) and > offer functions for getting strings? > sys.argv could be of type bytes and sys.arguments (or whatever) could be > a function taking an encoding parameter (which defaults to UTF-8) and > returning strings. > Of course that's backwards incompatible and I'm not sure if it's too > late for something like this now. For that reason alone, it makes sense to do it the other way. sys.argv is the text string, and sys.arguments is a bytes object which can be decoded if you know the encoding.
sys.argv == sys.arguments(best_guess) -jJ From nicko at nicko.org Fri Sep 14 19:11:08 2007 From: nicko at nicko.org (Nicko van Someren) Date: Fri, 14 Sep 2007 18:11:08 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <200709111506.32823.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> Message-ID: On 11 Sep 2007, at 15:06, Mark Summerfield wrote: > Is there any chance that an ordered dict will be added to Python 3's > library? It would make sense, since one of the primary justifications for the new metaclass system (PEP 3115) is to allow the metaclass to provide order-preserving dictionaries to record the order in which members are defined. > I think other people must find such things useful. There are three > implementations on the Python Cookbook site, and one on PyPI, all in > pure Python (plus I have my own implementation, also pure Python). Is there much commonality between the interfaces for these? I'm sure there are various different opinions as to the exact nature of the API, particularly around any facilities for re-ordering, slicing etc. Cheers, Nicko From mark at qtrac.eu Fri Sep 14 19:36:11 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Fri, 14 Sep 2007 18:36:11 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> Message-ID: <200709141836.11481.mark@qtrac.eu> On 2007-09-14, Nicko van Someren wrote: > On 11 Sep 2007, at 15:06, Mark Summerfield wrote: > > Is there any chance that an ordered dict will be added to Python 3's > > library? > > It would make sense, since one of the primary justifications for the > new metaclass system (PEP 3115) is to allow the metaclass to provide > order-preserving dictionaries to record the order in which members > are defined. > > > I think other people must find such things useful. 
There are three > > implementations on the Python Cookbook site, and one on PyPI, all in > > pure Python (plus I have my own implementation, also pure Python). > > Is there much commonality between the interfaces for these? I'm sure > there are various different opinions as to the exact nature of the > API, particularly around any facilities for re-ordering, slicing etc. > Cheers, > Nicko After posting I realised that actually this isn't P3K-specific. I'd hope to see the collections module extended with more data structures in general. I put a similar post on the main python list but with no consensus so far... I put forward an API which is the same as dict (but any list or iterator returned "just happens" to work in key order) plus a few extra methods to exploit the ordering. I don't know how to refer to a usenet thread but this should get there: http://groups.google.co.uk/group/comp.lang.python/browse_frm/thread/b16c34f8dd09a8a0/62a9cd8f8b73cdac#62a9cd8f8b73cdac I also put an example implementation on PyPI since a respondent advised that I do that: http://pypi.python.org/pypi?:action=display&name=ordereddict&version=1.0.0 I certainly hope that Python will have one or more ordered data structures in the collections module since I think they are often useful. I don't expect mine to be used, I am just trying to get the _idea_ accepted that an ordered data structure is useful and worth putting in the standard library. I hope for example, that an AVL tree and/or a B*tree and/or a skiplist will be implemented. -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From rhamph at gmail.com Fri Sep 14 19:50:34 2007 From: rhamph at gmail.com (Adam Olsen) Date: Fri, 14 Sep 2007 11:50:34 -0600 Subject: [Python-3000] ordered dict for p3k collections? 
In-Reply-To: <200709141836.11481.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709141836.11481.mark@qtrac.eu> Message-ID: On 9/14/07, Mark Summerfield wrote: > On 2007-09-14, Nicko van Someren wrote: > > On 11 Sep 2007, at 15:06, Mark Summerfield wrote: > > > Is there any chance that an ordered dict will be added to Python 3's > > > library? > > > > It would make sense, since one of the primary justifications for the > > new metaclass system (PEP 3115) is to allow the metaclass to provide > > order-preserving dictionaries to record the order in which members > > are defined. > > > > > I think other people must find such things useful. There are three > > > implementations on the Python Cookbook site, and one on PyPI, all in > > > pure Python (plus I have my own implementation, also pure Python). > > > > Is there much commonality between the interfaces for these? I'm sure > > there are various different opinions as to the exact nature of the > > API, particularly around any facilities for re-ordering, slicing etc. > > Cheers, > > Nicko > > After posting I realised that actually this isn't P3K-specific. I'd hope > to see the collections module extended with more data structures in > general. > > I put a similar post on the main python list but with no consensus so > far... > > I put forward an API which is the same as dict (but any list or iterator > returned "just happens" to work in key order) plus a few extra methods > to exploit the ordering. I don't know how to refer to a usenet thread > but this should get there: That's a sorted dict. PEP 3115 wants an insertion-ordered dict. You're not the first to confuse them. ;) -- Adam Olsen, aka Rhamphoryncus From mark at qtrac.eu Fri Sep 14 21:52:23 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Fri, 14 Sep 2007 20:52:23 +0100 Subject: [Python-3000] ordered dict for p3k collections? 
In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709141836.11481.mark@qtrac.eu> Message-ID: <200709142052.23583.mark@qtrac.eu> On 2007-09-14, Adam Olsen wrote: > On 9/14/07, Mark Summerfield wrote: > > On 2007-09-14, Nicko van Someren wrote: > > > On 11 Sep 2007, at 15:06, Mark Summerfield wrote: > > > > Is there any chance that an ordered dict will be added to Python 3's > > > > library? > > > > > > It would make sense, since one of the primary justifications for the > > > new metaclass system (PEP 3115) is to allow the metaclass to provide > > > order-preserving dictionaries to record the order in which members > > > are defined. > > > > > > > I think other people must find such things useful. There are three > > > > implementations on the Python Cookbook site, and one on PyPI, all in > > > > pure Python (plus I have my own implementation, also pure Python). > > > > > > Is there much commonality between the interfaces for these? I'm sure > > > there are various different opinions as to the exact nature of the > > > API, particularly around any facilities for re-ordering, slicing etc. > > > Cheers, > > > Nicko > > > > After posting I realised that actually this isn't P3K-specific. I'd hope > > to see the collections module extended with more data structures in > > general. > > > > I put a similar post on the main python list but with no consensus so > > far... > > > > I put forward an API which is the same as dict (but any list or iterator > > returned "just happens" to work in key order) plus a few extra methods > > to exploit the ordering. I don't know how to refer to a usenet thread > > but this should get there: > > That's a sorted dict. PEP 3115 wants an insertion-ordered dict. > You're not the first to confuse them. ;) Hmmm, I'd not come across that terminology distinction before. I guess I'll have to rename mine then. BTW In my previous I said "I hope for example, that an AVL tree and/or a B*tree and/or a skiplist will be implemented." 
Actually, I don't care what data structures are used, I just think that Python lacks ordered data structures, specifically: sorteddict and sortedset. (Personally I've never needed an insertion-ordered dict.) -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From larry at hastings.org Fri Sep 14 22:34:01 2007 From: larry at hastings.org (Larry Hastings) Date: Fri, 14 Sep 2007 13:34:01 -0700 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <200709142052.23583.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709141836.11481.mark@qtrac.eu> <200709142052.23583.mark@qtrac.eu> Message-ID: <46EAF039.7020208@hastings.org> Mark Summerfield wrote: > (Personally I've never needed an insertion-ordered dict.) Then you've never programmed in PHP I take it. PHP's one-size-fits-all data structure is an insertion-ordered dict; PHP programmers use it everywhere a Python programmer might use a dict /or/ a list. I've had one or two people tell me having both behaviors in one object is "really useful every-so-often", though they didn't go into any more detail. Can't really see the advantage, myself. /larry/ From martin at v.loewis.de Fri Sep 14 23:01:37 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 14 Sep 2007 23:01:37 +0200 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <200709142052.23583.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709141836.11481.mark@qtrac.eu> <200709142052.23583.mark@qtrac.eu> Message-ID: <46EAF6B1.8000705@v.loewis.de> >> That's a sorted dict. PEP 3115 wants an insertion-ordered dict. >> You're not the first to confuse them. ;) > > Hmmm, I'd not come across that terminology distinction before. > I guess I'll have to rename mine then.
I think "insertion-ordered" is over-specification, just to make the distinction clear. Most of the time, people mean "ordered dictionary" to say "keys are in a fixed order" - typically insertion order. When they want to express that the keys ought to be sorted, they call it "sorted dictionary". Regards, Martin From greg.ewing at canterbury.ac.nz Sat Sep 15 00:40:00 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 15 Sep 2007 10:40:00 +1200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46EB0DC0.3050906@canterbury.ac.nz> Stephen J. Turnbull wrote: > You chose the context of round-tripping *across > encodings*, not me. Please stick with your context. Maybe we have different ideas of what the problem is. I thought the problem is to take arbitrary byte sequences coming in as command-line args and represent them as unicode strings in such a way that they can be losslessly converted back into the same byte strings. I was just pointing out that if you do this in a way that involves some sort of dynamically generated mapping, then it won't work if the round trip spans more than one Python session -- and that there are any number of ways that the data could get from one session to another, many of them not involving anything that one would recognise as a unicode encoding in the conventional sense.
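The concern about dynamically generated mappings can be made concrete with a toy model. The class below is entirely hypothetical (nothing like it is specified in this thread); it hands out private-use code points in order of first appearance, and two instances stand in for two separate Python sessions.

```python
class DynamicPuaMapper:
    """Hypothetical per-session table: unknown byte -> private-use char.

    Code points are assigned in order of first appearance, so what a
    given code point means depends on what this session has seen.
    """

    PUA_START = 0xE000

    def __init__(self):
        self._to_char = {}
        self._to_byte = {}

    def escape(self, byte):
        """Return the private-use character standing for this byte."""
        if byte not in self._to_char:
            char = chr(self.PUA_START + len(self._to_char))
            self._to_char[byte] = char
            self._to_byte[char] = byte
        return self._to_char[byte]

    def unescape(self, char):
        """Return the byte that this session assigned to the character."""
        return self._to_byte[char]


session_one = DynamicPuaMapper()
text = session_one.escape(0xFF)     # first unknown byte seen -> U+E000

session_two = DynamicPuaMapper()
session_two.escape(0x80)            # here U+E000 stands for another byte

# Within one session the round trip works.
assert session_one.unescape(text) == 0xFF
# In another session the same character inverts to a different byte.
assert session_two.unescape(text) == 0x80
```

A fixed mapping avoids this particular failure, at the cost of not being able to record anything session-specific about where the bytes came from.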
-- Greg From greg.ewing at canterbury.ac.nz Sat Sep 15 00:44:18 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 15 Sep 2007 10:44:18 +1200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA5114.9060200@coli.uni-saarland.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> Message-ID: <46EB0EC2.4030208@canterbury.ac.nz> Hagen Fürstenau wrote: > sys.argv could be of type bytes and sys.arguments (or whatever) could be > a function taking an encoding parameter (which defaults to UTF-8) and > returning strings. > > Of course that's backwards incompatible and I'm not sure if it's too > late for something like this now. It would be pretty disruptive to ask everyone to change their habit of thinking of sys.argv as a list of strings. I would suggest doing it the other way around -- have sys.argv be an object that automatically converts to unicode on access, and something else, such as sys.argbytes, for getting the raw bytes if that fails. -- Greg From guido at python.org Sat Sep 15 01:22:25 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 14 Sep 2007 16:22:25 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EB0EC2.4030208@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> Message-ID: On 9/14/07, Greg Ewing wrote: > It would be pretty disruptive to ask everyone to change > their habit of thinking of sys.argv as a list of strings. Indeed. > I would suggest doing it the other way around -- have > sys.argv be an object that automatically converts to > unicode on access, and something else, such as > sys.argbytes, for getting the raw bytes if that fails. Great idea, but sys.argv doesn't need to be magic for this approach to work. Of course os.environ would have to be treated similarly.
And things like the strings returned by the _locale module (I found at least one test failing on Red Hat platforms because the thousands separator is set to \xa0 in the Estonian locale). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg.ewing at canterbury.ac.nz Sat Sep 15 01:21:26 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 15 Sep 2007 11:21:26 +1200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> Message-ID: <46EB1776.6030006@canterbury.ac.nz> Guido van Rossum wrote: > Great idea, but sys.argv doesn't need to be magic for this approach to work. Are you sure? I thought part of the problem was that if an argv entry couldn't be decoded, you got an error too soon to do anything about it. Making sys.argv lazy would avoid that. -- Greg From guido at python.org Sat Sep 15 02:07:52 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 14 Sep 2007 17:07:52 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EB1776.6030006@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EB1776.6030006@canterbury.ac.nz> Message-ID: On 9/14/07, Greg Ewing wrote: > Guido van Rossum wrote: > > Great idea, but sys.argv doesn't need to be magic for this approach to work. > > Are you sure? I thought part of the problem was that > if an argv entry couldn't be decoded, you got an error > too soon to do anything about it. Making sys.argv lazy > would avoid that. I see. But you could also insert '?'s into the argv string. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From stephen at xemacs.org Sat Sep 15 02:13:31 2007 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Sat, 15 Sep 2007 09:13:31 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1189756174.32337.30.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <1189711575.22693.86.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> <1189722696.30037.14.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> <1189756174.32337.30.camel@qrnik> Message-ID: <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> "Marcin 'Qrczak' Kowalczyk" writes: >> And it *is* needed, because these characters by assumption >> are not present in Unicode at all. (More precisely, they may be >> present, but the tables we happen to have don't have mappings for >> them.) > They are present! For UTF-8, UTF-16 and UTF-32 PUA is not special in > any way. The characters I am referring to are the unstandardized so-called "corporate characters" that are very common in Japanese text. My solution handles your problem, slightly less efficiently than yours does, perhaps, but in a Unicode-conforming way. Yours doesn't help with mine at all. > It preserves the byte string contents, which is all that is needed. That is not true in any environment where the encoding is not known with certainty. > It has the same result as UTF-8 for all valid UTF-8 sequences not > containing NUL. Sorry, I'm talking about real Japanese and other situations where there is no corresponding Unicode character point, and a solution which not only handles that but also handles corrupt UTF-8. Valid UTF-8 is not a problem, it's the solution. But a robust language should handle text that is not valid UTF-8 in a way that allows the programmer or user to implement error correction at a finer-grained level than dumping core. >> I'm also very bothered by the fact that the interpretation of U+0000 >> differs in different contexts in your proposal. 
> Well, for any scheme which attempts to modify UTF-8 by accepting > arbitrary byte strings is used, *something* must be interpreted > differently than in real UTF-8. Wrong. In my scheme everything ends up in the PUA, on which real UTF-8 imposes no interpretation by definition. I haven't gone back to check yet, but it's possible that a "real UTF-8 conforming process" is required to stop processing and issue an error or something like that in the cases we're trying to handle. But your extension and James Knight's extension both fall afoul of any such provision, too. >> Once you get a string into Python, you normally no longer know >> where it came from, but now whether something came from the >> program argument or environment or from a stdio stream changes the >> semantics of U+0000. For me personally, that's a very good reason >> to object to your proposal. > This can be said about any modification of UTF-8. It's not true of James Knight's proposal, because the same modification can be used for both program arguments and file streams. And my proposal doesn't modify UTF-8 at all; it takes advantage of the farsighted wisdom of the designers of Unicode and puts all the non-standard "characters", including broken encoding, in the PUA. > Of course you can use such encoding on a standard stream too. In > this case only U+0000 cannot be used normally, and the resulting > stream will contain whatever bytes were present in filenames and > other strings being output to it. A programmer can use it, but his users will curse his name every time a binary stream gets corrupted because they forgot that little detail. >> > Of course my escaping scheme can preserve \0 too, by escaping it to >> > U+0000 U+0000, but here it's incompatible with the real UTF-8. >> No. It's *never* compatible with UTF-8 because it assigns a different >> meaning to U+0000 from ASCII NUL. 
> It is compatible with UTF-8 except for U+0000, and a true U+0000 cannot > occur anyway in these contexts, so this incompatibility is mostly > harmless. Forcing users to use codecs of subtly different semantics simply because they're getting I/O from different sources is a substantial harm. >> Your scheme also suffers from the practical problem that strings >> containing escapes are no longer arrays of characters. > They are no less arrays of characters than strings containing combining > marks. Those marks are characters in their own right. Your escapes are not, nor are surrogates. It's true that users will be surprised by the count of characters in many cases with unnormalized Unicode, but these can be reduced to a very few by normalizing to NFC. From stephen at xemacs.org Sat Sep 15 02:25:02 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 15 Sep 2007 09:25:02 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA7C83.6040507@coli.uni-saarland.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EA7A16.5010902@v.loewis.de> <46EA7C83.6040507@coli.uni-saarland.de> Message-ID: <87y7f83hcx.fsf@uwakimon.sk.tsukuba.ac.jp> Hagen Fürstenau writes: > And what if we skillfully conserve unknown bytes in a private use or > surrogate area and the application author actually knows the encoding > and wants correctly decoded strings? This is what my proposal would do, but my proposal would return a string, not bytes. From stephen at xemacs.org Sat Sep 15 05:44:05 2007 From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Sat, 15 Sep 2007 12:44:05 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EB0DC0.3050906@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> Message-ID: <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> Greg Ewing writes: > Stephen J. Turnbull wrote: > > You chose the context of round-tripping *across > > encodings*, not me. Please stick with your context. > > Maybe we have different ideas of what the problem is. I thought > the problem is to take arbitrary byte sequences coming in as > command-line args and represent them as unicode strings in such a > way that they can be losslessly converted back into the same byte > strings. That's a straw man if taken literally. Just use the ISO-8859-1 codec, and you're done. If you add the condition that the encoding is known with certainty and the source string is well-formed for that encoding, then you need to decode to meaningful Unicode. For that problem, James Knight's solution is good if it makes sense to assume that the sequence of bytes is encoded in UTF-8 Unicode. However, I don't think that is a reasonable assumption for a language that is heavily used in Europe and Japan, and for processing email. These are contexts where UTF-8 is making steady progress, but legacy encodings are still quite important. However, the general problem is to decode a sequence of bytes into a Unicode string and be able to recover the original sequence if you decide you got it wrong, even after you've sliced and concatenated the string with other strings, with no guarantee that all the source encodings were the same.
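The "just use the ISO-8859-1 codec" remark above can be checked directly. This is a minimal sketch written against present-day Python 3 (not the 2007 py3k branch under discussion); the helper name roundtrip_via_latin1 is made up for illustration.

```python
def roundtrip_via_latin1(raw: bytes) -> str:
    """Decode arbitrary bytes so that encoding back is lossless."""
    # ISO-8859-1 maps each byte 0x00..0xFF to the code point of the same
    # value, so decoding can never fail and never loses information.
    return raw.decode("iso-8859-1")


raw = bytes(range(256))
text = roundtrip_via_latin1(raw)
assert text.encode("iso-8859-1") == raw

# The property also survives slicing and concatenation.
piece = text[10:20] + text[200:210]
assert piece.encode("iso-8859-1") == raw[10:20] + raw[200:210]
```

The resulting string is only meaningful text if the source really was Latin-1, which is why the lossless round trip alone does not solve the problem described here.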
> I was just pointing out that if you do this in a way that involves > some sort of dynamically generated mapping, then it won't work if > the round trip spans more than one Python session -- and that there > are any number of ways that the data could get from one session to > another, many of them not involving anything that one would > recognise as a unicode encoding in the conventional sense. But it also won't work if you just pass around strings that are invertible to byte sequences, *because recipients don't know which byte sequence to invert them to*. Is that cruft corrupt EUC-JP or corrupt Shift JIS or corrupt UTF-8? Or maybe simply a valid character which is even a Unicode character, but not in the table for the source encoding (this happens in Japanese all the time)? You're likely to make different guesses about what was intended by a specific sequence of byte cruft for different original encodings. What I'm suggesting is to provide a way for processes to record and communicate that information without needing to provide a "source encoding" slot for strings, and which is able to handle strings containing unrecognized (including corrupt) characters from multiple source encodings. True, it will be up to the applications to communicate that information, but it is, anyway. Furthermore, the same algorithms can be used to "fold" any text that contains only BMP characters plus no more than 6400 distinct non-BMP characters into the BMP, which would be a nice feature for people wanting to avoid the UTF-16 surrogates for some reason. As Martin points out, it may not be possible to implement this without changing the codecs one by one (I have some hope that it can nevertheless be done, but haven't looked at the codec framework closely yet). I think it would be unfortunate if we're going to try to solve a small subset of these problems (as James and Marcin are doing) to overlook the possibility of a good solution to a whole bunch of related problems. 
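The ambiguity raised in this message (is the cruft corrupt EUC-JP, corrupt Shift JIS, or corrupt UTF-8?) can be seen by decoding one byte sequence under several candidate encodings. A small sketch in present-day Python 3; the CANDIDATES tuple and the guesses helper are illustrative names, and errors="replace" merely stands in for "unrecognized cruft".

```python
CANDIDATES = ("shift_jis", "euc_jp", "utf-8", "iso-8859-1")


def guesses(raw: bytes) -> dict:
    """Decode the same bytes under each candidate encoding."""
    # errors="replace" discards information; it is used here only to show
    # that every candidate yields *some* string, and that they disagree.
    return {name: raw.decode(name, errors="replace") for name in CANDIDATES}


raw = "日本語".encode("shift_jis")
for name, text in guesses(raw).items():
    print(f"{name:12} {text!r}")
```

Only the Shift JIS reading recovers the intended text; a recipient who is handed just the decoded string has no way of telling which reading the sender had in mind.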
From mark at qtrac.eu Sat Sep 15 07:04:18 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Sat, 15 Sep 2007 06:04:18 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <46EAF6B1.8000705@v.loewis.de> References: <200709111506.32823.mark@qtrac.eu> <200709142052.23583.mark@qtrac.eu> <46EAF6B1.8000705@v.loewis.de> Message-ID: <200709150604.18638.mark@qtrac.eu> On 2007-09-14, Martin v. L?wis wrote: > >> That's a sorted dict. PEP 3115 wants an insertion-ordered dict. > >> You're not the first to confuse them. ;) > > > > Hmmm, I'd not come across that terminology distinction before. > > I guess I'll have to rename mine then. > > I think "insertion-ordered" is over-specification, just to make > the distinction clear. Most of the time, people mean "ordered > dictionary" to say "keys are in a fixed order" - typically insertion > order. When they want to express that the keys ought to be > sorted, they call it "sorted dictionary". I got my terminology from C++ which has C++ map => Python sorteddict (missing!) C++ unordered_map => Python dict C++ set => Python sortedset (missing!) C++ unordered_set => Python set I've now renamed mine to sorteddict: http://pypi.python.org/pypi/sorteddict As for Adam's comment about the dict API, I find it okay, and I think some people would prefer a close match. Unfortunately, I don't think there will ever be consensus (so that's why there's a BDFL), but whatever their APIs, I hope that Python gets a sorteddict and a sortedset. But how does this happen? I've discussed it on comp.lang.python (having used ordereddict in the subject line to create unintentional confusion), but at some point a PEP has to be created. I'm happy to do that (at least for a sorteddict), but if someone who has done PEPs before did so, I'd be just as happy---I'll see what the feedback is (if any) when I get online again next week. 
PS And no, I've never programmed in PHP and never fancied doing so:-) -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From martin at v.loewis.de Sat Sep 15 07:33:21 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 15 Sep 2007 07:33:21 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46EB6EA1.5020104@v.loewis.de> > What I'm suggesting is to provide a way for processes to record and > communicate that information without needing to provide a "source > encoding" slot for strings, and which is able to handle strings > containing unrecognized (including corrupt) characters from multiple > source encodings. Can you please (re-)state how that way would precisely work? I could not find that in the archives. Regards, Martin From arvind1.singh at gmail.com Sat Sep 15 15:45:21 2007 From: arvind1.singh at gmail.com (Arvind Singh) Date: Sat, 15 Sep 2007 19:15:21 +0530 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <200709150604.18638.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709142052.23583.mark@qtrac.eu> <46EAF6B1.8000705@v.loewis.de> <200709150604.18638.mark@qtrac.eu> Message-ID: > > I hope that Python gets a sorteddict and a > sortedset. It doesn't make sense for Python to have sorteddict or sortedset. You see, dict can have keys which cannot be ordered (keys can be heterogeneous, in which case Py3K may raise TypeError; ordering doesn't make sense for the objects used as keys) and same goes for set elements. 
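The objection above, that unorderable keys would raise TypeError at insertion, is easy to picture with a toy sorted mapping. This is a hedged sketch in present-day Python 3 and not the sorteddict package mentioned in the thread; the class name SortedDictSketch and its methods are hypothetical.

```python
import bisect


class SortedDictSketch:
    """Toy mapping that keeps its keys in sorted order (illustration only)."""

    def __init__(self):
        self._keys = []   # keys, kept sorted
        self._data = {}   # ordinary hash table for the values

    def __setitem__(self, key, value):
        if key not in self._data:
            # insort compares the new key with existing ones, so a key that
            # cannot be ordered against them raises TypeError right here,
            # before the mapping is modified.
            bisect.insort(self._keys, key)
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def keys(self):
        return list(self._keys)


d = SortedDictSketch()
for k in (3, 1, 2):
    d[k] = str(k)
assert d.keys() == [1, 2, 3]

try:
    d["x"] = "mixed key types"
except TypeError:
    pass  # int and str cannot be ordered against each other in Python 3
assert d.keys() == [1, 2, 3]
```

With uniformly typed keys the exception never fires, so whether this is a burden or a useful diagnostic depends on the data being stored.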
Sorting makes sense only as a run-time operation, in which case, the programmer should be prepared to handle appropriate exceptions. Btw, would you like a dict or set for which you have to handle exceptions at every insertion? -- Regards, Arvind -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070915/417eef1a/attachment.htm From hfuerstenau at gmx.net Sat Sep 15 12:44:09 2007 From: hfuerstenau at gmx.net (=?ISO-8859-1?Q?Hagen_F=FCrstenau?=) Date: Sat, 15 Sep 2007 12:44:09 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EB0EC2.4030208@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> Message-ID: <46EBB779.6090605@gmx.net> >> sys.argv could be of type bytes and sys.arguments (or whatever) could be >> a function taking an encoding parameter (which defaults to UTF-8) and >> returning strings. >> > It would be pretty disruptive to ask everyone to change > their habit of thinking of sys.argv as a list of strings. The idea behind this was that it would preserve the non-decoding behaviour of the present sys.argv and put the new behaviour into a new function. Also "argv" sounds more low-level than something like "arguments". But of course, "argbytes" sounds even more low-level. :-) - Hagen From nevillegrech at gmail.com Sat Sep 15 18:36:51 2007 From: nevillegrech at gmail.com (Neville Grech Neville Grech) Date: Sat, 15 Sep 2007 18:36:51 +0200 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709142052.23583.mark@qtrac.eu> <46EAF6B1.8000705@v.loewis.de> <200709150604.18638.mark@qtrac.eu> Message-ID: From a Python user's point of view, a sorted dict/set/list was sometimes a requirement for me. Basically, a dictionary that had a BTree implementation instead of a hash table. Also,
having an explicit type error would then be a clear indication that you have something wrong in your implementation (and therefore a useful indication). Other languages have separate collection frameworks like C# has PowerCollections. Having these collections as part of the standard library is another issue though. On 9/15/07, Arvind Singh wrote: > > I hope that Python gets a sorteddict and a > > sortedset. > > > It doesn't make sense for Python to have sorteddict or sortedset. You see, > dict can have keys which cannot be ordered (keys can be heterogeneous, in > which case Py3K may raise TypeError; ordering doesn't make sense for the > objects used as keys) and same goes for set elements. > > Sorting makes sense only as a run-time operation, in which case, the > programmer should be prepared to handle appropriate exceptions. > > Btw, would you like a dict or set for which you have to handle exceptions > at every insertion? > > -- > Regards, > Arvind > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: > http://mail.python.org/mailman/options/python-3000/nevillegrech%40gmail.com > > -- Regards, Neville Grech -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070915/31b227bc/attachment.htm From g.brandl at gmx.net Sat Sep 15 18:48:18 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 15 Sep 2007 18:48:18 +0200 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709142052.23583.mark@qtrac.eu> <46EAF6B1.8000705@v.loewis.de> <200709150604.18638.mark@qtrac.eu> Message-ID: Arvind Singh wrote: > I hope that Python gets a sorteddict and a > sortedset. > > > It doesn't make sense for Python to have sorteddict or sortedset.
You > see, dict can have keys which cannot be ordered (keys can be > heterogeneous, in which case Py3K may raise TypeError; ordering doesn't > make sense for the objects used as keys) and same goes for set elements. > > Sorting makes sense only as a run-time operation, in which case, the > programmer should be prepared to handle appropriate exceptions. > > Btw, would you like a dict or set for which you have to handle > exceptions at every insertion? In the cases where you have to do that, you shouldn't be using a sorted dict. But why not better look at those other 95% of cases where the values are uniformly typed and perfectly sortable? Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From greg at krypto.org Sat Sep 15 19:56:46 2007 From: greg at krypto.org (Gregory P. Smith) Date: Sat, 15 Sep 2007 10:56:46 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E9C724.9080808@canterbury.ac.nz> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> <46E72B18.9060908@canterbury.ac.nz> <52dc1c820709120044h722605cekc86ea668a6a1b4bd@mail.gmail.com> <46E9C724.9080808@canterbury.ac.nz> Message-ID: <52dc1c820709151056t1b14bca2ld723524e542aa914@mail.gmail.com> On 9/13/07, Greg Ewing wrote: > Gregory P. Smith wrote: > > When I read the plain term EXCLUSIVE I read that to mean nobody else can > > read -or- write, ie: not shared in any sense. > > You're right, it's not the best term. 
> > > Lets extend these base > > concepts to SHARED_READ, SHARED_WRITE, EXCLUSIVE_READ, EXCLUSIVE_WRITE > > EXCLUDE_WRITE might be better, since EXCLUSIVE_WRITE seems > to imply that one is writing oneself as well. > > > EXCLUSIVE_READ - no others can read this buffer while this view is > > open. > > This is the one that I don't think is necessary. I don't > see a need to ever prevent others from *reading* if they > really want to and are prepared to deal with the > consequences. Most of the time the other party will be using > READ_LOCK which includes EXCLUDE_WRITE, so it will fail > if you're already holding a write lock. > > So we just have > > READ > WRITE > READ_LOCK = READ | EXCLUDE_WRITE > WRITE_LOCK = WRITE | EXCLUDE_WRITE I like your terminology. Also, agreed that EXCLUDE_READ is not likely to be necessary; I listed it for completeness sake. From greg at krypto.org Sat Sep 15 20:11:33 2007 From: greg at krypto.org (Gregory P. Smith) Date: Sat, 15 Sep 2007 11:11:33 -0700 Subject: [Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support In-Reply-To: <46E98F25.5010404@enthought.com> References: <20070829234728.GV24059@electricrain.com> <52dc1c820709081615m783ea9fctc562d113252fb7b1@mail.gmail.com> <46E62358.3020404@enthought.com> <46E6F137.2020001@enthought.com> <46E98F25.5010404@enthought.com> Message-ID: <52dc1c820709151111n111856f2k5d1fa80af7d3460b@mail.gmail.com> On 9/13/07, Travis E. Oliphant wrote: > I think if it doesn't go through the buffer interface it is up to the > object to decide (i.e. what does the object do with itself when buffers > are exported --- that will depend on the object). All it must do is > support the buffer interface in the correct way (i.e. not move the > memory buffers are relying on and support the access modes correctly > that it purports to export). Correct. This is what I have done in my Bytes object patch to support READ | EXCLUDE_WRITE (speaking in Greg Ewing's terms which I think we should adopt). 
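The flag vocabulary settled on above (READ, WRITE, EXCLUDE_WRITE and the two combined lock modes) can be modelled in a few lines. This is only a Python model of the terminology, assuming the conflict rule implied by the discussion; it is not the C-level buffer API, and the names BufferAccess and can_grant are hypothetical.

```python
from enum import IntFlag


class BufferAccess(IntFlag):
    """Access modes as named in the thread (numeric values are arbitrary)."""
    READ = 1
    WRITE = 2
    EXCLUDE_WRITE = 4
    READ_LOCK = READ | EXCLUDE_WRITE
    WRITE_LOCK = WRITE | EXCLUDE_WRITE


def can_grant(held, requested):
    """Return True if `requested` is compatible with every view in `held`.

    Assumed rule: a request to write conflicts with any holder that excludes
    writers, and a request to exclude writers conflicts with any holder that
    is writing. Plain READ is never refused.
    """
    for h in held:
        if requested & BufferAccess.WRITE and h & BufferAccess.EXCLUDE_WRITE:
            return False
        if requested & BufferAccess.EXCLUDE_WRITE and h & BufferAccess.WRITE:
            return False
    return True


# A READ_LOCK fails while someone holds a write lock; a plain READ does not.
assert not can_grant([BufferAccess.WRITE_LOCK], BufferAccess.READ_LOCK)
assert can_grant([BufferAccess.WRITE_LOCK], BufferAccess.READ)
```

Expressing the modes as flags on a single request is what lets one call either succeed or fail as a unit, which is the argument made against a separate locking function.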
> Let me think about adding a function for read-write locking that is > separate from getting a view (which implements memory-location > locking). I appreciate the discussion as it is helping me clarify my > thinking. > > -Travis I'm interested to see what you come up with but... As it is, I agree with Greg Ewing that a separate function is not necessary and that just a set of flags on the existing buffer interface is all that's needed. Otherwise code would need to make multiple calls (get, lock, unlock, release) and deal with errors when both do not succeed, which is complicated and error-prone to do in C when a single call could encapsulate it. -gps From greg at krypto.org Sat Sep 15 22:36:49 2007 From: greg at krypto.org (Gregory P. Smith) Date: Sat, 15 Sep 2007 13:36:49 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EB0EC2.4030208@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> Message-ID: <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> On 9/14/07, Greg Ewing wrote: > Hagen Fürstenau wrote: > > sys.argv could be of type bytes and sys.arguments (or whatever) could be > > a function taking an encoding parameter (which defaults to UTF-8) and > > returning strings. > > > > Of course that's backwards incompatible and I'm not sure if it's too > > late for something like this now. > > It would be pretty disruptive to ask everyone to change > their habit of thinking of sys.argv as a list of strings. Would it? We're already asking them to convert between bytes and unicode strings anywhere else I/O is done. I see the command line and environment as merely more forms of input. The only way to parse them into data structures automatically is to keep them as bytes. They are C concepts and can't imply an encoding. As it is, it's entirely possible to have -multiple- encodings on a command line at once as well as in environment variables.
They're all context sensitive. This isn't going to change. > I would suggest doing it the other way around -- have > sys.argv be an object that automatically converts to > unicode on access, and something else, such as > sys.argbytes, for getting the raw bytes if that fails. I'd leave sys.argv bytes and make sys.args/arguments/argstrs be some best effort parsing. argv is the C/C++ name for bytes, lets not confuse people. similarly for the environment. os.environ dict should be bytes object keys and values (or perhaps a bytes object subclass that refuses null bytes). the os.getenv and os.putenv functions should take care of any best effort decoding/encoding and have an optional getenv encoding= parameter to explicitly specify. -gps From p.f.moore at gmail.com Sat Sep 15 23:29:35 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 15 Sep 2007 22:29:35 +0100 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> Message-ID: <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> On 15/09/2007, Gregory P. Smith wrote: > similarly for the environment. os.environ dict > should be bytes object keys and values You can't have bytes as keys - the type isn't hashable... Paul From greg at krypto.org Sun Sep 16 00:22:54 2007 From: greg at krypto.org (Gregory P. 
Smith) Date: Sat, 15 Sep 2007 15:22:54 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> Message-ID: <52dc1c820709151522h5e2a4336qcefe55f820042d36@mail.gmail.com> On 9/15/07, Paul Moore wrote: > On 15/09/2007, Gregory P. Smith wrote: > > similarly for the environment. os.environ dict > > should be bytes object keys and values > > You can't have bytes as keys - the type isn't hashable... ugh, yeah. as much as i hate to suggest it given my preference for keeping any encoding out of automatic environment and argument parsing, just make os.environ keys be latin-1 encoding or make them a hashable subclass of bytes (yuck or yuck). someone on windows should check to see if it allows evil such as utf16 environment variable names first (i'd hope not, that'd break traditional C/C++ code). From aahz at pythoncraft.com Sun Sep 16 02:40:05 2007 From: aahz at pythoncraft.com (Aahz) Date: Sat, 15 Sep 2007 17:40:05 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EA7F7B.2060609@v.loewis.de> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EA7A16.5010902@v.loewis.de> <46EA7C83.6040507@coli.uni-saarland.de> <46EA7F7B.2060609@v.loewis.de> Message-ID: <20070916004005.GA12697@panix.com> On Fri, Sep 14, 2007, "Martin v. L??wis" wrote: >Hagen: >> >> And what if we skillfully conserve unknown bytes in a private use or >> surrogate area and the application author actually knows the encoding >> and wants correctly decoded strings? 
> > They can easily roundtrip that then to the encoding that it should have: > > good_string = sys.argv[bad_string_index].\ > encode(sys.argv_encoding, "pua-replace").decode(real_encoding) That doesn't count as "easily" in my book. What about a sys._argv_orig containing bytes objects? -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ The best way to get information on Usenet is not to ask a question, but to post the wrong information. From aahz at pythoncraft.com Sun Sep 16 02:44:52 2007 From: aahz at pythoncraft.com (Aahz) Date: Sat, 15 Sep 2007 17:44:52 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> Message-ID: <20070916004452.GB12697@panix.com> On Sat, Sep 15, 2007, Paul Moore wrote: > On 15/09/2007, Gregory P. Smith wrote: >> >> similarly for the environment. os.environ dict >> should be bytes object keys and values > > You can't have bytes as keys - the type isn't hashable... That's why people keep arguing for an immutable bytes types. I keep seeing long discussions that end up with a tortured mechanism for making the keys unicode. Why don't we just bite the bullet and make things easier and have the immutable bytes type? -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ The best way to get information on Usenet is not to ask a question, but to post the wrong information. From greg.ewing at canterbury.ac.nz Sun Sep 16 03:38:17 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 16 Sep 2007 13:38:17 +1200 Subject: [Python-3000] Move argv[0]? 
(Re: Unicode and OS strings) In-Reply-To: <46EBB779.6090605@gmx.net> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> Message-ID: <46EC8909.4050300@canterbury.ac.nz> > Also "argv" sounds more low-level than something like "arguments". While we're on the subject of argv, I've been wondering whether py3k might want to revisit the idea of having argv[0] be the program name. In my experience, one almost *never* wants to treat argv[0] the same way as the rest of the arguments. Putting the program name into argv[0] is a neat trick in C that's relatively harmless, because it's just as easy to start iterating from 1 than 0, but in Python it makes all argument-processing code more complicated than necessary. It also provides a nasty trap for the unwary, as I discovered one day when I wrote a program for deleting files that deleted itself the first time I ran it. :-) Changing the existing behaviour of argv would probably be too disruptive, so how about relegating argv to a low-level detail and providing something else for everyday use that omits argv[0]? sys.arguments would sound quite nice for that. -- Greg From nick.bastin at gmail.com Sun Sep 16 03:53:48 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sat, 15 Sep 2007 21:53:48 -0400 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709142052.23583.mark@qtrac.eu> <46EAF6B1.8000705@v.loewis.de> <200709150604.18638.mark@qtrac.eu> Message-ID: <66d0a6e10709151853w37b949a8i6b4ed2bcb709c064@mail.gmail.com> On 9/15/07, Arvind Singh wrote: > > > I hope that Python gets a sorteddict and a > > sortedset. > > It doesn't make sense for Python to have sorteddict or sortedset. 
You see, > dict can have keys which cannot be ordered (keys can be heterogeneous, in > which case Py3K may raise TypeError; ordering doesn't make sense for the > objects used as keys) and same goes for set elements. How do you get from "some keys can't be ordered" to "it doesn't make sense for Python to have sorteddict or sortedset"? If you want to use keys that can't be ordered, then feel free to continue to use dict. For situations in which ordering is important, that language should support that. When did this become an all or nothing proposition? There's plenty of space for both dict and sorteddict. > Btw, would you like a dict or set for which you have to handle exceptions at > every insertion? Yes, if that's what the situation calls for. -- Nick From nick.bastin at gmail.com Sun Sep 16 04:00:25 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Sat, 15 Sep 2007 22:00:25 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> Message-ID: <66d0a6e10709151900l3d89b71u2e8b9bcb4b62b9f4@mail.gmail.com> On 9/15/07, Paul Moore wrote: > On 15/09/2007, Gregory P. Smith wrote: > > similarly for the environment. os.environ dict > > should be bytes object keys and values > > You can't have bytes as keys - the type isn't hashable... Then lets stop beating around the bush and implement an immutable bytes type. Why put ourselves through contortions trying to jam a square peg into a round hole and not just decide to make a round peg? -- Nick From guido at python.org Sun Sep 16 04:01:00 2007 From: guido at python.org (Guido van Rossum) Date: Sat, 15 Sep 2007 19:01:00 -0700 Subject: [Python-3000] Move argv[0]? 
(Re: Unicode and OS strings) In-Reply-To: <46EC8909.4050300@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> Message-ID: This sounds awfully close to bikeshedding. Change too many details like this and you cause death by a 1000 pinpricks for existing apps. sys.argv[0] *does* get used (though arguably rarely in the same way as sys.argv[1:]). --Guido On 9/15/07, Greg Ewing wrote: > > Also "argv" sounds more low-level than something like "arguments". > > While we're on the subject of argv, I've been wondering > whether py3k might want to revisit the idea of having > argv[0] be the program name. In my experience, one almost > *never* wants to treat argv[0] the same way as the rest of > the arguments. > > Putting the program name into argv[0] is a neat > trick in C that's relatively harmless, because it's > just as easy to start iterating from 1 than 0, but > in Python it makes all argument-processing code > more complicated than necessary. > > It also provides a nasty trap for the unwary, as I > discovered one day when I wrote a program for deleting > files that deleted itself the first time I ran it. :-) > > Changing the existing behaviour of argv would probably > be too disruptive, so how about relegating argv to a > low-level detail and providing something else for > everyday use that omits argv[0]? > > sys.arguments would sound quite nice for that. 
> > -- > Greg > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From janssen at parc.com Sun Sep 16 04:28:53 2007 From: janssen at parc.com (Bill Janssen) Date: Sat, 15 Sep 2007 19:28:53 PDT Subject: [Python-3000] Unicode and OS strings In-Reply-To: <20070916004452.GB12697@panix.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> <20070916004452.GB12697@panix.com> Message-ID: <07Sep15.192901pdt."57996"@synergy1.parc.xerox.com> > > You can't have bytes as keys - the type isn't hashable... > > That's why people keep arguing for an immutable bytes types. I keep > seeing long discussions that end up with a tortured mechanism for making > the keys unicode. Why don't we just bite the bullet and make things > easier and have the immutable bytes type? It's a pretty horrible hole (IMO) that a sequence of bytes isn't hashable. If we need the immutable bytes type, or some attribute on the regular bytes type akin to the C "const", let's add it now before the insanity gets out of control. 
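The hashability complaint is about mutable byte sequences being unusable as dict keys. As a hedged illustration in present-day Python 3 (where bytes is immutable and bytearray is the mutable variant, unlike the 2007 py3k branch being discussed), the difference looks like this; the helper usable_as_key is a made-up name.

```python
def usable_as_key(obj) -> bool:
    """Report whether obj can serve as a dict key, i.e. whether it hashes."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


assert usable_as_key(b"PATH")                  # immutable bytes: hashable
assert not usable_as_key(bytearray(b"PATH"))   # mutable bytearray: not hashable

# An environment-like mapping keyed by raw bytes therefore needs the
# immutable type.
env = {b"PATH": b"/usr/bin"}
assert env[b"PATH"] == b"/usr/bin"
```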
Bill From fdrake at acm.org Sun Sep 16 04:46:19 2007 From: fdrake at acm.org (Fred Drake) Date: Sat, 15 Sep 2007 22:46:19 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <66d0a6e10709151900l3d89b71u2e8b9bcb4b62b9f4@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> <66d0a6e10709151900l3d89b71u2e8b9bcb4b62b9f4@mail.gmail.com> Message-ID: On Sep 15, 2007, at 10:00 PM, Nicholas Bastin wrote: > Then lets stop beating around the bush and implement an immutable > bytes type. Why put ourselves through contortions trying to jam a > square peg into a round hole and not just decide to make a round peg? +42 !!!! -Fred -- Fred Drake From stephen at xemacs.org Sun Sep 16 09:13:29 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 16 Sep 2007 16:13:29 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EB6EA1.5020104@v.loewis.de> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> Message-ID: <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > > What I'm suggesting is to provide a way for processes to record and > > communicate that information without needing to provide a "source > > encoding" slot for strings, and which is able to handle strings > > containing unrecognized (including corrupt) characters from multiple > > source encodings. > > Can you please (re-)state how that way would precisely work? I could > not find that in the archives. 
The basic idea is to allocate code points in private space as-needed. All points in private space would be initially "owned" by the Python process. When a codec encounters something it can't handle, whether it's a valid character in a legacy encoding, a private use character in a UTF, or an invalid sequence of code units, it throws an exception specifying the character or code unit and the current coded character set, and the handler either finds that tuple in the table, or assigns a private use character and enters it in the table with key being the charset-codepoint tuple, and the inverse assignment in an inverse mapping table. It may be that no charset can be assigned to the codepoint, in which case None would be assigned as the charset, and instead of mapping characters, the invalid codepoints would be individually mapped. On output, if the codec can output in the recorded character set, it does so, otherwise it throws an unencodable character exception. This definitely requires that the Unicode codecs be modified to do the right thing if they encounter private use characters in the input stream or output stream. Other codecs don't need to be modified, although ISO 2022-based codecs (at least) would benefit from it. Some codecs (like ISO-8859 codecs) will have implicit charsets (ASCII code points can't be errors for them, so only GR matters), and can use codec-specific handlers that know what the implicit charset is. (AIUI this would require that the handler-specifying protocol be changed from an enumeration of the available handlers to the ability to actually specify one.) The rest can use the None charset, so that code units will be preserved. Applications which wish to pass strings across process boundaries will have to pass the table too. If they don't, then in general they can't use this family of exception handlers. 
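The allocation scheme just described can be sketched with the existing codec error-handler hook. This is a rough sketch in present-day Python 3 under stated assumptions, not the proposal's implementation: PrivateUseTable, decode_preserving, encode_restoring and the handler name "pua-table-sketch" are all hypothetical, only the None-charset case (raw invalid bytes) is handled, and private use characters already present in valid input are not treated specially, although the proposal says the Unicode codecs would have to deal with them.

```python
import codecs


class PrivateUseTable:
    """Hypothetical table of (charset, unit) <-> private use character."""

    FIRST, LAST = 0xE000, 0xF8FF  # BMP private use area: 6400 code points

    def __init__(self):
        self._forward = {}   # (charset, unit) -> private use character
        self._inverse = {}   # private use character -> (charset, unit)
        self._next = self.FIRST

    def allocate(self, charset, unit):
        """Return the character for (charset, unit), allocating on demand."""
        key = (charset, unit)
        if key not in self._forward:
            if self._next > self.LAST:
                raise OverflowError("private use area exhausted")
            ch = chr(self._next)
            self._next += 1
            self._forward[key] = ch
            self._inverse[ch] = key
        return self._forward[key]

    def original(self, ch):
        """Return the (charset, unit) key for ch, or None if unallocated."""
        return self._inverse.get(ch)


def decode_preserving(data, encoding, table):
    """Decode bytes, mapping each undecodable byte through the table."""
    def handler(exc):
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        # No charset can be assigned to raw invalid bytes, so use None and
        # map the offending bytes individually.
        return "".join(table.allocate(None, b) for b in bad), exc.end

    codecs.register_error("pua-table-sketch", handler)
    return data.decode(encoding, errors="pua-table-sketch")


def encode_restoring(text, encoding, table):
    """Encode text, turning table-allocated characters back into bytes."""
    out = bytearray()
    for ch in text:
        key = table.original(ch)
        if key is not None and key[0] is None:
            out.append(key[1])           # restore the original raw byte
        else:
            out += ch.encode(encoding)   # ordinary character
    return bytes(out)


table = PrivateUseTable()
raw = b"caf\xe9 \xff"                    # Latin-1 style bytes, invalid UTF-8
text = decode_preserving(raw, "utf-8", table)
assert encode_restoring(text, "utf-8", table) == raw
```

Because the mapping lives in the table rather than in the string, a recipient without the table sees only valid private use characters, which matches the point that applications passing strings across process boundaries have to pass the table as well.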
To handle cases like Marcin's private encoding, and in general to allow
efficient IPC for processes that know they're going to get certain
private use characters in I/O, there should be an API to preallocate
specific code points. (Theoretically, dynamically allocated private
code points could be reallocated, but that would require translating
all existing strings, and I can't believe that would ever be worth it.)

What happens if a string "escapes" without the table?

1. The application uses the preallocation API. Then the characters it
   understands are handled normally, and dynamically allocated private
   use characters are errors, anyway. I don't see how this makes things
   worse.

2. The application doesn't use the preallocation API, but does know
   about some private use characters. Then it will get confused by the
   dynamic allocation, as Greg and Marcin point out, and users should
   be advised not to use the new handler.

3. The application doesn't know about any private use characters. Then
   dynamically allocated characters are exceptions anyway. I don't see
   how this makes things worse.

Advantages:

1. Almost all the "interesting" information about the original encoded
   source is preserved, including under string operations like slicing
   and concatenation with strings from other sources. (I can quantify
   "almost all" more precisely if necessary.)

2. 100% Unicode conformance in the sense that if the internal
   representation escapes, it's valid Unicode.

3. Efficient internal representation in the sense that applications
   need not worry about invalid Unicode when doing string operations.

4. In 16-bit environments, up to 6400 non-BMP characters can be mapped
   into the BMP private use area using the same algorithm, achieving a
   "string is character array" representation at the expense of slight
   overhead in I/O and one extra table reference in each character
   property lookup.
   As Marcin points out, given that not all composable characters have
   one-character NFC representations, we can't guarantee that the
   user's notion of string length will equal the number of characters
   in the string, but in practice I think that will almost invariably
   work out. And if we're doing normalization, the codec overhead
   becomes less important.

Disadvantages:

1. Unicode codecs will need to be modified, since they need to throw
   exceptions on private use characters.

2. Other codecs will need to be modified to take advantage of this
   handler, since AFAIK currently none of the available handlers can
   use charset information, so I can't imagine the codecs provide it.

3. More overhead in exception-handling than James Knight's or Marcin
   Kowalczyk's proposals.

4. Applications that know about some private use characters will need
   to be modified to preallocate those characters before they can take
   advantage of this handler.

In general, I don't think that the overhead should be weighted very
heavily against this proposal. Exception handlers impose a fair amount
of overhead anyway, AIUI. Furthermore, any application that cares
enough to keep track of the original code points will IMO be hungry for
any additional information that can help in exception handling. This
proposal provides as much as you can get, short of buffering all the
input.

HTH, From martin at v.loewis.de Sun Sep 16 09:25:26 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 16 Sep 2007 09:25:26 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46ECDA66.3040702@v.loewis.de> > The basic idea is to allocate code points in private space as-needed. Ok, thanks. Would you be interested in implementing that scheme? Regards, Martin From p.f.moore at gmail.com Sun Sep 16 13:42:35 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Sun, 16 Sep 2007 12:42:35 +0100 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> <66d0a6e10709151900l3d89b71u2e8b9bcb4b62b9f4@mail.gmail.com> Message-ID: <79990c6b0709160442g44100aean3da2890085eb643@mail.gmail.com> On 16/09/2007, Fred Drake wrote: > On Sep 15, 2007, at 10:00 PM, Nicholas Bastin wrote: > > Then lets stop beating around the bush and implement an immutable > > bytes type. Why put ourselves through contortions trying to jam a > > square peg into a round hole and not just decide to make a round peg? > > +42 !!!! I knew this would come up again when I made that comment - I deliberately didn't express an opinion then, as I didn't want to obscure the point. 
But I'll come off the fence and admit that I'm also in favour of an immutable bytes type (and for bytes literals to be of that type). Paul. From larry at hastings.org Sun Sep 16 14:01:52 2007 From: larry at hastings.org (Larry Hastings) Date: Sun, 16 Sep 2007 05:01:52 -0700 Subject: [Python-3000] Move argv[0]? (Re: Unicode and OS strings) In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> Message-ID: <46ED1B30.9030708@hastings.org> Guido van Rossum wrote: > On 9/15/07, Greg Ewing wrote: > >> Changing the existing behaviour of argv would probably >> be too disruptive, so how about relegating argv to a >> low-level detail and providing something else for >> everyday use that omits argv[0]? >> >> sys.arguments would sound quite nice for that. > This sounds awfully close to bikeshedding. Change too many details > like this and you cause death by a 1000 pinpricks for existing apps. > sys.argv[0] *does* get used (though arguably rarely in the same way as > sys.argv[1:]). +0.5 on Greg Ewing's proposal. argv[0] has little in common with argv[1:]; why should the user have to differentiate them? I see this as yet one more messy detail of the OS that Python could hide for me. Looking at it with fresh eyes, I think for a in sys.arguments: is a lot prettier than for a in sys.argv[1:]: After all: what's that 1: doing there? Why the magic number? Why does argv have the script name in [0], anyway? None of my other functions/members are forced to take themselves as their first argument. 
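
To make the comparison above concrete, here is a minimal sketch of how the split might look. It assumes nothing beyond a plain list of strings; `split_arguments` is a hypothetical helper, and neither `sys.arguments` nor `sys.script_path` exists in Python.

```python
def split_arguments(argv):
    """Split an argv-style list into (script_path, arguments).

    Hypothetical helper: it only illustrates the idea of keeping the
    script name apart from the real arguments.
    """
    if not argv:
        # Embedded interpreters may provide an empty argv.
        return "", []
    return argv[0], list(argv[1:])


script_path, arguments = split_arguments(["prog.py", "-v", "input.txt"])

# The loop no longer needs the [1:] slice.
for a in arguments:
    print(a)
```

The existing list is left untouched, so code that reads `sys.argv[0]` today would keep working alongside such a helper.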
Taking it to its logical conclusion, I further propose:

    sys.raw_argv -- the original bytes as they came in from the OS
    sys.argv -- raw_argv converted into (unicode) strings, not expected
        to be used by users
    sys.arguments -- sys.argv[1:]
    sys.script_path -- sys.argv[0]
    sys.split_argv -- callable that takes an argv-style array (strings,
        not bytes) and assigns it into argv, arguments, and script_path,
        slicing as appropriate

Yes, the format of argv has thirty years of history; yes I don't really
expect this discussion to get anywhere. But I hate having arbitrary
idioms in Python, and I wanted to cast my vote into the swirling void
before this idea totally died.

If nothing else, at least we could fix the proviso for argv[0]: "(it is
operating system dependent whether this is a full pathname or not)."
How about we always ensure it is an absolute path?

My "there's only one way to do it" reflex is fighting it out with my
"beautiful is better than ugly" reflex,

/larry/

From sasch.pe at gmx.de Sun Sep 16 15:34:24 2007
From: sasch.pe at gmx.de (Sascha Peilicke)
Date: Sun, 16 Sep 2007 15:34:24 +0200
Subject: [Python-3000] Stackless anyone ?
Message-ID: <1189949664.5502.3.camel@schlepp>

Hello,

is or has there been any discussion about stackless and py3k?

regards, Sascha Peilicke
--
http://saschashideout.de

From arvind1.singh at gmail.com Sun Sep 16 16:45:40 2007
From: arvind1.singh at gmail.com (Arvind Singh)
Date: Sun, 16 Sep 2007 20:15:40 +0530
Subject: [Python-3000] ordered dict for p3k collections?
In-Reply-To: <66d0a6e10709151853w37b949a8i6b4ed2bcb709c064@mail.gmail.com>
References: <200709111506.32823.mark@qtrac.eu> <200709142052.23583.mark@qtrac.eu> <46EAF6B1.8000705@v.loewis.de> <200709150604.18638.mark@qtrac.eu> <66d0a6e10709151853w37b949a8i6b4ed2bcb709c064@mail.gmail.com>
Message-ID:

> How do you get from "some keys can't be ordered" to "it doesn't make
> sense for Python to have sorteddict or sortedset"? If you want to use
> keys that can't be ordered, then feel free to continue to use dict.
> For situations in which ordering is important, that language should
> support that. When did this become an all or nothing proposition?
> There's plenty of space for both dict and sorteddict.

Sorry for the premature conclusions. All I wanted to do was remind
everyone of the potential problems with any "generic" implementation.

And I did say, when ordering is important, we are left with two choices:

1) Sort explicitly (whenever required) and be prepared to handle
   exceptions raised during the sort operation.

2) Have an implicitly "sorted" implementation and handle exceptions at
   every insertion.

I, personally, tend to prefer the former solution. The latter case is
useful when we have large objects and do a large number of insertions,
in which case per-insertion exception handling would be inefficient.
The former case, in turn, can be slightly confusing and a bit harder to
debug.

--
Regards, Arvind

From thomas at python.org Sun Sep 16 17:24:33 2007
From: thomas at python.org (Thomas Wouters)
Date: Sun, 16 Sep 2007 17:24:33 +0200
Subject: [Python-3000] Move argv[0]?
(Re: Unicode and OS strings) In-Reply-To: <46EC8909.4050300@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> Message-ID: <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> On 9/16/07, Greg Ewing wrote: > > > Also "argv" sounds more low-level than something like "arguments". > > While we're on the subject of argv, I've been wondering > whether py3k might want to revisit the idea of having > argv[0] be the program name. In my experience, one almost > *never* wants to treat argv[0] the same way as the rest of > the arguments. -1. If you want to put more meaning in the argv list, use an option parser. The _actual_ meaning of each element depends entirely on the program that's started. For Python-the-language, there isn't any difference between them. -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070916/7e01a3f5/attachment.htm From mathieu.fenniak at gmail.com Sun Sep 16 17:46:18 2007 From: mathieu.fenniak at gmail.com (Mathieu Fenniak) Date: Sun, 16 Sep 2007 09:46:18 -0600 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE Message-ID: Hi everyone, I'd like to be able to derive from the bytes type, but this currently isn't possible due to it missing the Py_TPFLAGS_BASETYPE. A comment next to the flags indicates that this class is "sealed / final". I tried to search this list for some information on this, but I couldn't find any relevant posts. Why is this type "sealed"? I've experimented by adding the basetype flag to the type (with a recent svn checkout). Python's test suite continues to run without any errors after this change. My own project's test suite works flawlessly with a bytes derived type, as well. 
I expected to encounter some error or difficulty that would explain why this type wasn't usable as a base type, but it seems to work great. Mathieu From guido at python.org Sun Sep 16 19:02:09 2007 From: guido at python.org (Guido van Rossum) Date: Sun, 16 Sep 2007 10:02:09 -0700 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: Message-ID: It is possible to compromise the integrity of a built-in type by subclassing it if the type wasn't carefully written to expect subclassing. The bytes type currently wasn't written to be careful about this. Why can't you use containment instead of subclassing? --Guido On 9/16/07, Mathieu Fenniak wrote: > Hi everyone, > > I'd like to be able to derive from the bytes type, but this currently > isn't possible due to it missing the Py_TPFLAGS_BASETYPE. A comment > next to the flags indicates that this class is "sealed / final". I > tried to search this list for some information on this, but I > couldn't find any relevant posts. Why is this type "sealed"? > > I've experimented by adding the basetype flag to the type (with a > recent svn checkout). Python's test suite continues to run without > any errors after this change. My own project's test suite works > flawlessly with a bytes derived type, as well. I expected to > encounter some error or difficulty that would explain why this type > wasn't usable as a base type, but it seems to work great. 
> > Mathieu > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From noamraph at gmail.com Sun Sep 16 19:06:29 2007 From: noamraph at gmail.com (Noam Raphael) Date: Sun, 16 Sep 2007 20:06:29 +0300 Subject: [Python-3000] The order of list comprehensions and generator expressions Message-ID: Hello, I had a thought about syntax I want to share with you. Say you want to get a list of all the phone numbers of your friends. You'll write something like this: telephones = [friend.telephone for friend in friends] Now suppose that, unfortunately, you have many friends, and they are grouped by city. Now, you'll probably write: telephones = [friend.telephone for friend in city.friends for city in cities] and you'll (hopefully) get an exception, and change your line to: telephones = [friend.telephone for city in cities for friend in city.friends] and say, "Ah, I should've remembered this from the last time it happened to me", and forget it until the next time it happens to you. The reason is that the code: for city in cities: for friend in city.friends: yield friend.telephone makes sense if you read it from the first line to the last line, and makes sense if you read it from the last line to the first line, but doesn't make a lot of sense if you start from the last line and then jump to the first line and read it from there. In other words, you can go from the general to the specific, and you can go from the specific to the general, but jumping from the most specific to the most general and back again up to the second-most specific is strange. All this is to say that I think that the "for" parts in list comprehensions and generator expressions should, in a perfect world, be evaluated in the other way round. 
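
For reference, the current rule is that the "for" clauses of a comprehension nest left to right, exactly like the equivalent statement form. A small self-contained sketch (the `Friend` and `City` tuples and the sample numbers are made up for illustration):

```python
from collections import namedtuple

# Hypothetical stand-ins for the objects in the example above.
Friend = namedtuple("Friend", "telephone")
City = namedtuple("City", "friends")

cities = [
    City(friends=[Friend("555-0100"), Friend("555-0101")]),
    City(friends=[Friend("555-0199")]),
]


def telephones_loop(cities):
    """Statement form: outer loop first, inner loop second."""
    result = []
    for city in cities:
        for friend in city.friends:
            result.append(friend.telephone)
    return result


# Comprehension form: the "for" clauses appear in the same order as the
# nested loops, with the result expression moved to the front.
telephones = [friend.telephone for city in cities for friend in city.friends]
```

Writing the clauses the other way round fails with a NameError (or silently picks up a stale `city` from an enclosing scope), because `city` is not yet bound when `city.friends` is evaluated.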
The question remains, what should be done with the "if" parts. A possible solution is this: only one "if" part will be allowed after each "for" part (you don't need more than that, since you can always use the "and" operator). So, if I want to limit the list, my line will look like this: telephones = [ friend.telephone for friend in city.friends if friend.is_really_good for city in cities if city.is_close_to_me ] What do you think? Noam (P.S. Please don't be annoyed at me. The answer "this will break too much code and isn't worth it" is, of course, very sensible. I just thought that such thoughts can be posted to this list without causing too much harm.) From steven.bethard at gmail.com Sun Sep 16 19:16:56 2007 From: steven.bethard at gmail.com (Steven Bethard) Date: Sun, 16 Sep 2007 11:16:56 -0600 Subject: [Python-3000] The order of list comprehensions and generator expressions In-Reply-To: References: Message-ID: On 9/16/07, Noam Raphael wrote: > All this is to say that I think that the "for" parts in list > comprehensions and generator expressions should, in a perfect world, > be evaluated in the other way round. This proposal is not really appropriate for the python-3000 list -- it's too late for any more core language changes in Python 3000. If the idea belongs anywhere, it belongs on the python-ideas list. STeVe -- I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a tiny blip on the distant coast of sanity. --- Bucky Katt, Get Fuzzy From guido at python.org Sun Sep 16 20:42:42 2007 From: guido at python.org (Guido van Rossum) Date: Sun, 16 Sep 2007 11:42:42 -0700 Subject: [Python-3000] The order of list comprehensions and generator expressions In-Reply-To: References: Message-ID: I think it's not so obvious that reversing the order is any better when you throw in some if clauses: [friend for city in cities if city.name != "Amsterdam" for friend in city.friends if friend.name != "Guido"] vs. 
[friend for friend in city.friends if friend.name != "Guido" for city in cities if city.name != "Amsterdam"] --Guido On 9/16/07, Noam Raphael wrote: > Hello, > > I had a thought about syntax I want to share with you. > > Say you want to get a list of all the phone numbers of your friends. > You'll write something like this: > telephones = [friend.telephone for friend in friends] > > Now suppose that, unfortunately, you have many friends, and they are > grouped by city. Now, you'll probably write: > telephones = [friend.telephone for friend in city.friends for city in cities] > > and you'll (hopefully) get an exception, and change your line to: > telephones = [friend.telephone for city in cities for friend in city.friends] > > and say, "Ah, I should've remembered this from the last time it > happened to me", and forget it until the next time it happens to you. > > The reason is that the code: > for city in cities: > for friend in city.friends: > yield friend.telephone > > makes sense if you read it from the first line to the last line, and > makes sense if you read it from the last line to the first line, but > doesn't make a lot of sense if you start from the last line and then > jump to the first line and read it from there. In other words, you can > go from the general to the specific, and you can go from the specific > to the general, but jumping from the most specific to the most general > and back again up to the second-most specific is strange. > > All this is to say that I think that the "for" parts in list > comprehensions and generator expressions should, in a perfect world, > be evaluated in the other way round. > > The question remains, what should be done with the "if" parts. A > possible solution is this: only one "if" part will be allowed after > each "for" part (you don't need more than that, since you can always > use the "and" operator). 
So, if I want to limit the list, my line will > look like this: > > telephones = [ > friend.telephone > for friend in city.friends if friend.is_really_good > for city in cities if city.is_close_to_me > ] > > What do you think? > Noam > > (P.S. Please don't be annoyed at me. The answer "this will break too > much code and isn't worth it" is, of course, very sensible. I just > thought that such thoughts can be posted to this list without causing > too much harm.) > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From noamraph at gmail.com Sun Sep 16 21:01:29 2007 From: noamraph at gmail.com (Noam Raphael) Date: Sun, 16 Sep 2007 21:01:29 +0200 Subject: [Python-3000] The order of list comprehensions and generator expressions In-Reply-To: References: Message-ID: On 9/16/07, Guido van Rossum wrote: > I think it's not so obvious that reversing the order is any better > when you throw in some if clauses: > > [friend for city in cities if city.name != "Amsterdam" for friend in > city.friends if friend.name != "Guido"] > > vs. > > [friend for friend in city.friends if friend.name != "Guido" for city > in cities if city.name != "Amsterdam"] > > --Guido > I think that it's still better, at least if you add some newlines: [friend (Ok, we are talking about a list of friends. From where do these friends come from?) for friend in city.friends if friend.name != "Guido" (Ah, they are all the friends in a city who aren't called Guido. What about the city?) for city in cities if city.name != "Amsterdam"] (Ah, the city is every city which isn't Amsterdam.) Versus: [friend (Ok, we are talking about a list of friends. From where do these friends come from?) 
for city in cities if city.name != "Amsterdam" (What do cities which aren't Amsterdam have to do with my friend?) for friend in city.friends if friend.name != "Guido"] (Ah, we're talking about all the friends in those cities who aren't called Guido. Let's have a look at the first line to remember what we do with them... ah, yes, we just return them...) Noam From tjreedy at udel.edu Sun Sep 16 21:02:56 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Sun, 16 Sep 2007 15:02:56 -0400 Subject: [Python-3000] Stackless anyone ? References: <1189949664.5502.3.camel@schlepp> Message-ID: "Sascha Peilicke" wrote in message news:1189949664.5502.3.camel at schlepp... | is or has there been any discussion about stackless and py3k? No. C. Tismer has focused his current efforts on PyPy. From greg at krypto.org Sun Sep 16 21:14:03 2007 From: greg at krypto.org (Gregory P. Smith) Date: Sun, 16 Sep 2007 12:14:03 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <79990c6b0709160442g44100aean3da2890085eb643@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> <66d0a6e10709151900l3d89b71u2e8b9bcb4b62b9f4@mail.gmail.com> <79990c6b0709160442g44100aean3da2890085eb643@mail.gmail.com> Message-ID: <52dc1c820709161214o23160d2au6c80ae61f3e11bcd@mail.gmail.com> On 9/16/07, Paul Moore wrote: > On 16/09/2007, Fred Drake wrote: > > On Sep 15, 2007, at 10:00 PM, Nicholas Bastin wrote: > > > Then lets stop beating around the bush and implement an immutable > > > bytes type. Why put ourselves through contortions trying to jam a > > > square peg into a round hole and not just decide to make a round peg? > > > > +42 !!!! > > I knew this would come up again when I made that comment - I > deliberately didn't express an opinion then, as I didn't want to > obscure the point. 
But I'll come off the fence and admit that I'm also > in favour of an immutable bytes type (and for bytes literals to be of > that type). > > Paul. FYI - my first patch in the bytes object support for PyBUF_LOCKDATA thread added support for immutable bytes. I didn't add a hash method yet but that should be trivial. Should the hash method raise an exception when set_immutable has not been called yet or should it call set_immutable? I'm in favor of the exception. side effects are bad. -gps From mathieu.fenniak at gmail.com Sun Sep 16 22:19:57 2007 From: mathieu.fenniak at gmail.com (Mathieu Fenniak) Date: Sun, 16 Sep 2007 14:19:57 -0600 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> Message-ID: <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> On 16-Sep-07, at 12:38 PM, Guido van Rossum wrote: > I'm not doubting that *your* subclass works well enough. The problem > is that it must robust in the light of *any* subclass, no matter how > crazy. I understand that, but I'm not sure what kind of problems can be created by crazy subclasses. But my imagination of "crazy subclass" is pretty limited. > I'd have to understand more about your app to see whether subclassing > truly makes sense. I didn't want to flood too many pointless details into the discussion, so here's the minimum that I think is relevant. The project is pyPdf, a library for reading and writing PDF files. I've been working on making the library support unicode text strings within PDF documents. In a PDF file, a "string" can either be a text string, or a byte string. A string is a text string if it starts with a UTF-16BE BOM marker, or if it can be decoded using an encoding called PDFDocEncoding (which is specified by the PDF reference, similar to Latin-1 but different just to make life difficult). pyPdf needs to be capable of reading and writing these string objects. 
Whether a string is a byte or a text string, writing out the raw bytes
is the same process after the text has been encoded. This lends itself
to a common StringObject base class:

class StringObject(PdfObject):
    # contains common behavior for both types of strings, such as
    # the ability to serialize out a byte array, encrypt/decrypt strings
    # for "secure" PDF files
    # also contains reading code that attempts to autodetect whether
    # the string is a byte or text string

class ByteStringObject(bytes, StringObject):
    # adds the byte array storage, and passes self back to
    # StringObject for serialization output

class TextStringObject(str, StringObject):
    # overrides the default output serialization to encode the
    # unicode string to match PDF's requirements,
    # passes the resulting byte array up for serialization.

(complete source code, if you're interested:
http://hg.pybrary.net/pyPdf-py3/file/fe0dc2014a1b/pyPdf/generic.py)

Deriving from the bytes type provides storage, and also direct & easy
access to the byte array content. I think in this case using bytes as a
base type makes sense, at least as much as using str as a base type.
pyPdf derives from list and dict for different PDF object types in a
similar manner as well.

Mathieu Fenniak

From greg.ewing at canterbury.ac.nz Mon Sep 17 01:01:33 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Mon, 17 Sep 2007 11:01:33 +1200
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com>
References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com>
Message-ID: <46EDB5CD.6020205@canterbury.ac.nz>

Gregory P. Smith wrote:
> argv is the C/C++ name for bytes, lets not
> confuse people.

C has never made a clear distinction between characters and bytes,
using the type 'char' for both.
It got away with it for the same reason that Python did until unicode came along. I'm pretty sure most people using argv in C thought of it as holding characters. Certainly I always did. As far as I know, most other places in Python are going to deal with the changes by keeping the existing text APIs as returning text, e.g. open() gives you a text mode I/O object by default with an assumed encoding, and to get bytes you need to do something explicit (e.g. opening the file in binary mode). I don't see why argv should be different. -- Greg From greg.ewing at canterbury.ac.nz Mon Sep 17 01:03:54 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 17 Sep 2007 11:03:54 +1200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> Message-ID: <46EDB65A.9040402@canterbury.ac.nz> Paul Moore wrote: > On 15/09/2007, Gregory P. Smith wrote: > >>similarly for the environment. os.environ dict >>should be bytes object keys and values > > You can't have bytes as keys - the type isn't hashable... Has there been any consensus reached yet on whether there will be a frozenbytes type? I can see the non-hashability of bytes leading to lots of annoyances like this. -- Greg From greg.ewing at canterbury.ac.nz Mon Sep 17 01:32:05 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 17 Sep 2007 11:32:05 +1200 Subject: [Python-3000] Move argv[0]? 
(Re: Unicode and OS strings) In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> Message-ID: <46EDBCF5.6090209@canterbury.ac.nz> Guido van Rossum wrote: > This sounds awfully close to bikeshedding. I don't agree with that assessment. This is something I've had in mind for quite a while. Python optimises this for the *least* frequent use case, which is just plain silly, as far as I can see. The only reason for it is because that's the way C does it. That might be called a foolish consistency. > Change too many details > like this and you cause death by a 1000 pinpricks for existing apps. That's why I'm not suggesting that argv itself be changed, but that something new be added for the more frequent use cases. -- Greg From guido at python.org Mon Sep 17 03:56:09 2007 From: guido at python.org (Guido van Rossum) Date: Sun, 16 Sep 2007 18:56:09 -0700 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> Message-ID: On 9/16/07, Mathieu Fenniak wrote: > On 16-Sep-07, at 12:38 PM, Guido van Rossum wrote: > > I'm not doubting that *your* subclass works well enough. The problem > > is that it must robust in the light of *any* subclass, no matter how > > crazy. > > I understand that, but I'm not sure what kind of problems can be > created by crazy subclasses. But my imagination of "crazy subclass" > is pretty limited. > > > I'd have to understand more about your app to see whether subclassing > > truly makes sense. > > I didn't want to flood too many pointless details into the > discussion, so here's the minimum that I think is relevant. The > project is pyPdf, a library for reading and writing PDF files. 
I've > been working on making the library support unicode text strings > within PDF documents. > > In a PDF file, a "string" can either be a text string, or a byte > string. A string is a text string if it starts with a UTF-16BE BOM > marker, or if it can be decoded using an encoding called > PDFDocEncoding (which is specified by the PDF reference, similar to > Latin-1 but different just to make life difficult). pyPdf needs to > be capable of reading and writing these string objects. Whether a > string is a byte or a text string, writing out the raw bytes is the > same process after the text has been encoded. This lends itself to a > common StringObject base class: > > class StringObject(PdfObject): > # contains common behavior for both types of strings, such as > the ability to serialize out a byte array, encrypt/decrypt strings > for "secure" PDF files > # also contains reading code that attempts to autodetect whether > the string is a byte or text string > > class ByteStringObject(bytes, StringObject): > # adds the byte array storage, and passes self back to > StringObject for serialization output > > class TextStringObject(str, StringObject): > # overrides the default output serialization to encode the > unicode string to match PDF's requirements, > # passes the resulting byte array up for serialization. > > (complete source code, if you're interested: http://hg.pybrary.net/ > pyPdf-py3/file/fe0dc2014a1b/pyPdf/generic.py) > > Deriving from the bytes type provides storage, and also direct & easy > access to the byte array content. I think in this case using bytes > as a base type makes sense, at least as much as using str as a base > type. pyPdf derives from list and dict for different PDF object > types in a similar manner as well. So suppose my answer was "no, bytes won't be subclassable". How much would you really lose by having to wrap a separate object around a bytes object, rather than being able to subclass? 
How much extra code do you think you would have to write? Another way to look at it-- how much of the bytes type's API do your objects really have to support? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From mathieu.fenniak at gmail.com Mon Sep 17 05:19:34 2007 From: mathieu.fenniak at gmail.com (Mathieu Fenniak) Date: Sun, 16 Sep 2007 21:19:34 -0600 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> Message-ID: On 16-Sep-07, at 7:56 PM, Guido van Rossum wrote: > So suppose my answer was "no, bytes won't be subclassable". How much > would you really lose by having to wrap a separate object around a > bytes object, rather than being able to subclass? How much extra code > do you think you would have to write? > > Another way to look at it-- how much of the bytes type's API do your > objects really have to support? Most often, I'd be concatenating and comparing with other bytes objects, iterating through the byte array, and passing the byte array into methods like stream.write. Iterating and comparing could be dealt with by some code in the containing class; for other needs, I would sprinkle ".data" property accesses throughout the code to access the bytes instance. I'm not too concerned about the programming I'd have to do, even though the end result wouldn't really be what I'd like to have. It's not the end of the world, it's just not ideal. I do think that subclassing bytes would probably be a request a few people would have, especially when porting Python 2 code that subclasses str. It seems especially unusual that bytes can't be subclassed, when builtin types like str, list, dict, and set can be. Mathieu From greg.ewing at canterbury.ac.nz Mon Sep 17 05:58:18 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 17 Sep 2007 15:58:18 +1200 Subject: [Python-3000] Move argv[0]? 
(Re: Unicode and OS strings) In-Reply-To: <46ED1B30.9030708@hastings.org> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <46ED1B30.9030708@hastings.org> Message-ID: <46EDFB5A.2010808@canterbury.ac.nz> Larry Hastings wrote: > If nothing else, at least we could fix the proviso for argv[0]: "(it is > operating system dependent whether this is a full pathname or not)." It's actually worse than that -- you're entirely at the mercy of whatever made the exec() call as to whether it's a meaningful path at all. Most programs are courteous enough to make sure it's at least a relative path to the executable being run, but you can't rely on that. I'm not sure munging argv[0] to an absolute path is the right thing to do, if it's to be regarded as a low-level thing. A program wanting low-level access to argv might want to know exactly what was passed to exec() for some reason. A separate sys.absolute_path_to_executable() or something might be better. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Mon Sep 17 06:04:11 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 17 Sep 2007 16:04:11 +1200 Subject: [Python-3000] Move argv[0]? 
(Re: Unicode and OS strings) In-Reply-To: <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> Message-ID: <46EDFCBB.8010306@canterbury.ac.nz> Thomas Wouters wrote: > If you want to put more meaning in the argv list, use an option > parser. I want to put *less* meaning in it, not more. :-) And using an argument parser is often overkill for simple programs. > The _actual_ meaning of each element depends entirely on the > program that's started. For Python-the-language, there isn't any > difference between them. So in your Python programs, you're quite happy to write for arg in sys.argv: process(arg) and not care about what this does with argv[0]? I hardly see how one can claim that there's "no difference" between argv[0] and the rest for practical purposes. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Mon Sep 17 06:08:31 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 17 Sep 2007 16:08:31 +1200 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: Message-ID: <46EDFDBF.7010307@canterbury.ac.nz> Guido van Rossum wrote: > It is possible to compromise the integrity of a built-in type by > subclassing it if the type wasn't carefully written to expect > subclassing. Disallowing subclassing in Python may make sense, but it seems unreasonable not to allow subclassing by consenting C code that is careful not to compromise any integrity. Maybe there should be two flags for this instead of just one? 
-- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From stephen at xemacs.org Mon Sep 17 07:55:54 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Mon, 17 Sep 2007 14:55:54 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46ECDA66.3040702@v.loewis.de> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <46ECDA66.3040702@v.loewis.de> Message-ID: <878x75ltsl.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. Löwis" writes: > > The basic idea is to allocate code points in private space as-needed. > > Ok, thanks. Would you be interested in implementing that scheme? Yes. I'm recovering from moving from Japan to California, and will be busy until the beginning of October; I'll get started on it then. For this kind of thing, what is the deadline for submission of a patch? Before the alpha, early beta? From guido at python.org Mon Sep 17 16:55:43 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 17 Sep 2007 07:55:43 -0700 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: <46EDFDBF.7010307@canterbury.ac.nz> References: <46EDFDBF.7010307@canterbury.ac.nz> Message-ID: On 9/16/07, Greg Ewing wrote: > Guido van Rossum wrote: > > It is possible to compromise the integrity of a built-in type by > > subclassing it if the type wasn't carefully written to expect > > subclassing.
> > Disallowing subclassing in Python may make sense, but > it seems unreasonable not to allow subclassing by > consenting C code that is careful not to compromise > any integrity. > > Maybe there should be two flags for this instead of > just one? AFAIK there's nothing stopping you from subclassing in C. I thought we were talking about Python though. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Mon Sep 17 17:00:20 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 17 Sep 2007 08:00:20 -0700 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> Message-ID: On 9/16/07, Mathieu Fenniak wrote: > On 16-Sep-07, at 7:56 PM, Guido van Rossum wrote: > > So suppose my answer was "no, bytes won't be subclassable". How much > > would you really lose by having to wrap a separate object around a > > bytes object, rather than being able to subclass? How much extra code > > do you think you would have to write? > > > > Another way to look at it-- how much of the bytes type's API do your > > objects really have to support? > > Most often, I'd be concatenating and comparing with other bytes > objects, iterating through the byte array, and passing the byte array > into methods like stream.write. Iterating and comparing could be > dealt with by some code in the containing class; for other needs, I > would sprinkle ".data" property accesses throughout the code to > access the bytes instance. OK, so it sounds like at least you are treating it as a bytes array. > I'm not too concerned about the programming I'd have to do, even > though the end result wouldn't really be what I'd like to have. It's > not the end of the world, it's just not ideal. > > I do think that subclassing bytes would probably be a request a few > people would have, especially when porting Python 2 code that > subclasses str. 
Well, due to bytes' mutability, they'll be in for a ton of surprises. > It seems especially unusual that bytes can't be > subclassed, when builtin types like str, list, dict, and set can be. Maybe I should apologize for pushing back so hard, but in my experience most people who subclass a built-in type do it because they can, not because they should -- the lamented "path" module being a prime example in my view. I'm still not convinced of the usefulness in your case -- what would you lose if you just passed a bytes instance around instead of an instance of the subclass you'd like to have? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From rrr at ronadam.com Mon Sep 17 17:51:18 2007 From: rrr at ronadam.com (Ron Adam) Date: Mon, 17 Sep 2007 10:51:18 -0500 Subject: [Python-3000] Move argv[0]? (Re: Unicode and OS strings) In-Reply-To: <46EDFCBB.8010306@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> <46EDFCBB.8010306@canterbury.ac.nz> Message-ID: <46EEA276.4040901@ronadam.com> Greg Ewing wrote: > Thomas Wouters wrote: >> If you want to put more meaning in the argv list, use an option >> parser. > > I want to put *less* meaning in it, not more. :-) > And using an argument parser is often overkill for > simple programs. Would it be possible to split out the (pre) parsing from optparse so that instead of returning a list, it returns a dictionary of attributes and values? This would only contain what was given in the command line as a first "lighter weight" step to parsing the command line. opts = opt_parser(argv) command_name = opts['argv0'] # better name for argv0? Or... 
    opts = opt_parser(argv)
    if "-h" in opts or "--h" in opts:
        print("Help on {argv0}: ...".format(**opts))

If the dictionary were predefined with defaults, it might be more like:

    opts = {'-h': False, '--h': False}
    opts.update(opt_parser(argv))
    if opts['-h'] or opts['--h']:
        print("Help on {argv0}: ...".format(**opts))

This avoids the loop for the simplest cases. A second dispatcher/validator object could then use this as input. Regards, Ron >> The _actual_ meaning of each element depends entirely on the >> program that's started. For Python-the-language, there isn't any >> difference between them. > > So in your Python programs, you're quite happy > to write > > for arg in sys.argv: > process(arg) > > and not care about what this does with argv[0]? > > I hardly see how one can claim that there's > "no difference" between argv[0] and the rest > for practical purposes. > From mathieu.fenniak at gmail.com Mon Sep 17 18:44:18 2007 From: mathieu.fenniak at gmail.com (Mathieu Fenniak) Date: Mon, 17 Sep 2007 10:44:18 -0600 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> Message-ID: <96487E1C-4BA3-4FEC-9080-2C09AD330197@gmail.com> On 17-Sep-07, at 9:00 AM, Guido van Rossum wrote: > Maybe I should apologize for pushing back so hard, but in my > experience most people who subclass a built-in type do it because they > can, not because they should -- the lamented "path" module being a > prime example in my view. > > I'm still not convinced of the usefulness in your case -- what would > you lose if you just passed a bytes instance around instead of an > instance of the subclass you'd like to have? The builtin type subclasses in pyPdf (including the would-be bytes subclass) add additional methods that every pdf object is expected to support.
All the PDF object types have two additional methods (writeToStream and getObject) that have varying behavior for each class: (relatively inconsequential PDF information follows) "writeToStream" method that serializes the object -- a byte string would write out <68656c6c6f>, a text string (hello), and so on for other more complex types (dictionaries, labels, arrays, PDF data streams). The type is also responsible for encrypting itself when applicable. PDF files also have an ability to reference objects elsewhere in the file. For example, the length of a content stream can be a simple "500 bytes", or it can be "read this length at offset X in the file". Since almost any object can be an indirect object reference, the library objects support a "getObject" method that returns self -- excluding PDF "indirect object reference" objects, which read an object from a table in a PDF file. If you decide that bytes should be subclassable, I've included with this e-mail a patch that adds the basetype bit, adds some unit tests for bytes subclasses, and includes __dict__ in the bytes_reduce method (for pickling subclass instances). I was going to upload this to the SF patch manager, but it appears to be closed to permit only project members access. Mathieu Fenniak -------------- next part -------------- A non-text attachment was scrubbed... Name: bytes-subclass-patch.diff Type: application/octet-stream Size: 4933 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070917/317ebd40/attachment.obj From tjreedy at udel.edu Mon Sep 17 19:27:42 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 17 Sep 2007 13:27:42 -0400 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com><8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> <96487E1C-4BA3-4FEC-9080-2C09AD330197@gmail.com> Message-ID: "Mathieu Fenniak" wrote in message news:96487E1C-4BA3-4FEC-9080-2C09AD330197 at gmail.com... 
| method (for pickling subclass instances). I was going to upload this | to the SF patch manager, but it appears to be closed to permit only | project members access. Because SF is only an archive now. We are now using bugs.python.org. Your SF account may have carried over. If not, sign up is easy. From guido at python.org Mon Sep 17 19:53:58 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 17 Sep 2007 10:53:58 -0700 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: <96487E1C-4BA3-4FEC-9080-2C09AD330197@gmail.com> References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> <96487E1C-4BA3-4FEC-9080-2C09AD330197@gmail.com> Message-ID: On 9/17/07, Mathieu Fenniak wrote: > On 17-Sep-07, at 9:00 AM, Guido van Rossum wrote: > > Maybe I should apologize for pushing back so hard, but in my > > experience most people who subclass a built-in type do it because they > > can, not because they should -- the lamented "path" module being a > > prime example in my view. > > > > I'm still not convinced of the usefulness in your case -- what would > > you lose if you just passed a bytes instance around instead of an > > instance of the subclass you'd like to have? > > The builtin type subclasses in pyPdf (including the would-be bytes > subclass) add additional methods that every pdf object is expected to > support. All the PDF object types have two additional methods > (writeToStream and getObject) that have varying behavior for each > class: (relatively inconsequential PDF information follows) > > "writeToStream" method that serializes the object -- a byte string > would write out <68656c6c6f>, a text string (hello), and so on for > other more complex types (dictionaries, labels, arrays, PDF data > streams). The type is also responsible for encrypting itself when > applicable. This sounds like a perfect application for generic functions instead of subclassing. 
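[Editor's sketch, not part of the original message.] One way to read "generic functions" in a current Python is functools.singledispatch, which did not exist when this thread was written. The names write_to_stream and serialize are hypothetical, and the output formats only mimic the <68656c6c6f> and (hello) examples quoted above; this is not pyPdf's API, and Latin-1 merely stands in for PDFDocEncoding.

```python
import io
from functools import singledispatch


@singledispatch
def write_to_stream(obj, stream):
    """Fallback: no serializer is known for this type (hypothetical helper)."""
    raise TypeError("no serializer registered for " + type(obj).__name__)


@write_to_stream.register(bytes)
def _write_bytes(obj, stream):
    # Byte strings are written as hex digits between angle brackets.
    stream.write(b"<" + obj.hex().encode("ascii") + b">")


@write_to_stream.register(str)
def _write_text(obj, stream):
    # Text strings are written between parentheses. Latin-1 is an assumption
    # standing in for PDFDocEncoding; escaping of parentheses is omitted.
    stream.write(b"(" + obj.encode("latin-1") + b")")


def serialize(obj):
    """Run the dispatching writer against an in-memory stream."""
    buf = io.BytesIO()
    write_to_stream(obj, buf)
    return buf.getvalue()
```

With this shape, plain bytes and str values can be passed around unchanged, and the per-type behaviour lives in the registered functions rather than in subclasses.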
> PDF files also have an ability to reference objects elsewhere in the > file. For example, the length of a content stream can be a simple > "500 bytes", or it can be "read this length at offset X in the > file". Since almost any object can be an indirect object reference, > the library objects support a "getObject" method that returns self -- > excluding PDF "indirect object reference" objects, which read an > object from a table in a PDF file. Similar. > If you decide that bytes should be subclassable, I've included with > this e-mail a patch that adds the basetype bit, adds some unit tests > for bytes subclasses, and includes __dict__ in the bytes_reduce > method (for pickling subclass instances). I was going to upload this > to the SF patch manager, but it appears to be closed to permit only > project members access. > > Mathieu Fenniak > > > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From mathieu.fenniak at gmail.com Mon Sep 17 20:33:18 2007 From: mathieu.fenniak at gmail.com (Mathieu Fenniak) Date: Mon, 17 Sep 2007 12:33:18 -0600 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> <96487E1C-4BA3-4FEC-9080-2C09AD330197@gmail.com> Message-ID: On 17-Sep-07, at 11:53 AM, Guido van Rossum wrote: >> "writeToStream" method that serializes the object -- a byte string >> would write out <68656c6c6f>, a text string (hello), and so on for >> other more complex types (dictionaries, labels, arrays, PDF data >> streams). The type is also responsible for encrypting itself when >> applicable. > > This sounds like a perfect application for generic functions instead > of subclassing. Sure, there are other options for writing and organizing this code. But, this is a valid application of subclassing the bytes type, and it is the method I would prefer to be able to implement. 
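[Editor's sketch, not part of the original message.] For comparison, here is roughly what the subclassing approach looks like in a current Python 3, where bytes is subclassable. The class below is a hypothetical stand-in for pyPdf's byte-string object, keeping only the two method names mentioned in the thread (writeToStream and getObject); encryption and everything else are left out.

```python
import io


class ByteStringObject(bytes):
    """Hypothetical stand-in: a bytes subclass that knows how to serialize itself."""

    def writeToStream(self, stream):
        # Hex digits between angle brackets, as in the <68656c6c6f> example.
        stream.write(b"<" + self.hex().encode("ascii") + b">")

    def getObject(self):
        # Direct objects simply return themselves.
        return self


obj = ByteStringObject(b"hello")
out = io.BytesIO()
obj.writeToStream(out)
written = out.getvalue()

# Ordinary bytes operations still work, but they hand back plain bytes,
# not the subclass.
joined = obj + b"!"
```

The last two lines show one cost of this design: results of inherited operations lose the subclass unless each operation is overridden.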
Mathieu From guido at python.org Mon Sep 17 20:44:33 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 17 Sep 2007 11:44:33 -0700 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> <96487E1C-4BA3-4FEC-9080-2C09AD330197@gmail.com> Message-ID: I understand. But bytes are still in flux (see the repeated requests for immutable bytes) and I don't want to commit to anything just yet. On 9/17/07, Mathieu Fenniak wrote: > On 17-Sep-07, at 11:53 AM, Guido van Rossum wrote: > >> "writeToStream" method that serializes the object -- a byte string > >> would write out <68656c6c6f>, a text string (hello), and so on for > >> other more complex types (dictionaries, labels, arrays, PDF data > >> streams). The type is also responsible for encrypting itself when > >> applicable. > > > > This sounds like a perfect application for generic functions instead > > of subclassing. > > Sure, there are other options for writing and organizing this code. > But, this is a valid application of subclassing the bytes type, and > it is the method I would prefer to be able to implement. > > Mathieu > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From qrczak at knm.org.pl Mon Sep 17 21:12:00 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Mon, 17 Sep 2007 21:12:00 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <1189711575.22693.86.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> <1189722696.30037.14.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> <1189756174.32337.30.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> Message-ID: <1190056321.14217.21.camel@qrnik> Dnia 15-09-2007, So o godzinie 09:13 +0900, Stephen J. 
Turnbull napisał(a): > > Well, for any scheme which attempts to modify UTF-8 by accepting > > arbitrary byte strings is used, *something* must be interpreted > > differently than in real UTF-8. > > Wrong. In my scheme everything ends up in the PUA, on which real > UTF-8 imposes no interpretation by definition. This is wrong: UTF-8 is specified for the PUA. The PUA is not special from the point of view of UTF-8. UTF-8 is defined for all Unicode scalar values, i.e. all code points in the ranges U+0000..U+D7FF and U+E000..U+10FFFF, i.e. all code points excluding surrogates. This includes the PUA. > I haven't gone back to check yet, but it's possible that a "real UTF-8 > conforming process" is required to stop processing and issue an error > or something like that in the cases we're trying to handle. "C10. When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters." -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From martin at v.loewis.de Mon Sep 17 23:45:50 2007 From: martin at v.loewis.de ("Martin v. Löwis") Date: Mon, 17 Sep 2007 23:45:50 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <878x75ltsl.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <46ECDA66.3040702@v.loewis.de> <878x75ltsl.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46EEF58E.5050809@v.loewis.de> > Yes.
I'm recovering from moving from Japan to California, and will be > busy until the beginning of October, I'll get started on it then. For > this kind of thing, what is the deadline for submission of a patch? > Before the alpha, early beta? Either would work fine, unless somebody else does it first. Regards, Martin From qrczak at knm.org.pl Tue Sep 18 01:06:54 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 18 Sep 2007 01:06:54 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <1190070414.20673.12.camel@qrnik> On 16-09-2007 (Sun) at 16:13 +0900, Stephen J. Turnbull wrote: > When a codec encounters something it can't handle, whether it's a > valid character in a legacy encoding, a private use character in a > UTF, or an invalid sequence of code units, it throws an exception > specifying the character or code unit and the current coded character > set, Does this mean that this: $ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650") would no longer print e650 in a UTF-8 locale, assuming a shell which understands the escape sequence in printf, and the script would have to make special arrangements to make the character available? U+E650 is a private use character. If so, I'm violently against this. > This definitely requires that the Unicode codecs be modified to do the > right thing if they encounter private use characters in the input > stream or output stream.
The right thing is to encode or decode private use characters according to regular codec rules, as all other transcoders of these codecs in all other languages do. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From mike.klaas at gmail.com Tue Sep 18 01:41:57 2007 From: mike.klaas at gmail.com (Mike Klaas) Date: Mon, 17 Sep 2007 16:41:57 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <46EDB65A.9040402@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <52dc1c820709151336y36f753ffw6359da92ff8dd2e@mail.gmail.com> <79990c6b0709151429q52f06744i86f4b9020d6ce639@mail.gmail.com> <46EDB65A.9040402@canterbury.ac.nz> Message-ID: <8861D3FA-ABD1-4C6B-BA5D-A897D3B36194@gmail.com> On 16-Sep-07, at 4:03 PM, Greg Ewing wrote: > Paul Moore wrote: >> On 15/09/2007, Gregory P. Smith wrote: >> >>> similarly for the environment. os.environ dict >>> should be bytes object keys and values >> >> You can't have bytes as keys - the type isn't hashable... > > Has there been any consensus reached yet on whether > there will be a frozenbytes type? I can see the > non-hashability of bytes leading to lots of > annoyances like this. Might it make things clearer to use something other than the X/frozenX nomenclature?

    bytes   -> b'HELO' -> immutable octet list
    bytebuf -> mutable octet buffer (current bytes() objects)

    buf = bytebuf()
    buf.append(read(1024))
    print buf
    bytebuf(b'HELO')

-Mike From greg.ewing at canterbury.ac.nz Tue Sep 18 02:07:51 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 18 Sep 2007 12:07:51 +1200 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: <46EDFDBF.7010307@canterbury.ac.nz> Message-ID: <46EF16D7.2010402@canterbury.ac.nz> Guido van Rossum wrote: > AFAIK there's nothing stopping you from subclassing in C.
That may be true -- I may have just incorrectly assumed that the flag would prevent subclassing from working properly in C as well. > I thought we were talking about Python though. That may be true as well. I think I got mixed up with a discussion about adding an immutable bytes type as a builtin. -- Greg From greg.ewing at canterbury.ac.nz Tue Sep 18 02:13:14 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 18 Sep 2007 12:13:14 +1200 Subject: [Python-3000] Move argv[0]? (Re: Unicode and OS strings) In-Reply-To: <46EEA276.4040901@ronadam.com> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> <46EDFCBB.8010306@canterbury.ac.nz> <46EEA276.4040901@ronadam.com> Message-ID: <46EF181A.5040907@canterbury.ac.nz> Ron Adam wrote: > Would it be possible to split out the (pre) parsing from optparse so > that instead of returning a list Whatever is done, anything putting itself forward as a light duty argument parser has to have a *very* simple API. Neither of the current ones fits my brain, and I have to go looking up the docs every time I want to use them. -- Greg From greg.ewing at canterbury.ac.nz Tue Sep 18 02:21:38 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 18 Sep 2007 12:21:38 +1200 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: References: <443E4C09-853C-474F-9150-A0EFA5418154@gmail.com> <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> <96487E1C-4BA3-4FEC-9080-2C09AD330197@gmail.com> Message-ID: <46EF1A12.4050608@canterbury.ac.nz> Guido van Rossum wrote: > I understand. But bytes are still in flux (see the repeated requests > for immutable bytes) Moreover, my feeling is that immutable byte should be the *default*, and if you want mutable bytes you should have to ask for it. 
This would make bytes more symmetrical with strings, where immutability is the default, and if you want mutability you use an array.array('c') or whatever the equivalent will be in py3k. It would also help to settle the question of whether b"xyz" should be mutable -- clearly not, for symmetry with strings. -- Greg From guido at python.org Tue Sep 18 02:45:23 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 17 Sep 2007 17:45:23 -0700 Subject: [Python-3000] bytes & Py_TPFLAGS_BASETYPE In-Reply-To: <46EF1A12.4050608@canterbury.ac.nz> References: <8D27D90B-EF8C-42FB-B104-FB98EC65DC1E@gmail.com> <96487E1C-4BA3-4FEC-9080-2C09AD330197@gmail.com> <46EF1A12.4050608@canterbury.ac.nz> Message-ID: On 9/17/07, Greg Ewing wrote: > Guido van Rossum wrote: > > I understand. But bytes are still in flux (see the repeated requests > > for immutable bytes) > > Moreover, my feeling is that immutable byte should be > the *default*, and if you want mutable bytes you > should have to ask for it. > > This would make bytes more symmetrical with strings, > where immutability is the default, and if you want > mutability you use an array.array('c') or whatever > the equivalent will be in py3k. > > It would also help to settle the question of > whether b"xyz" should be mutable -- clearly not, > for symmetry with strings. I'm considering the following option -- it would help if someone explored creating a patch to implement this, just to see the minimum amount of code that would need to change compared to 3.0a1: bytes are always immutable, and for the few places where a mutable bytes buffer would be handy, we use the array module. Then it would also make sense to make b[0] return a bytes array of length 1 instead of a small int -- bytes would be more similar to str in 2.x, albeit completely incompatible with str in terms of mixed operations. 
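[Editor's sketch, not part of the original message.] The option described above (immutable bytes, with the array module for the few places that need a mutable buffer) can be tried out on a current Python 3, where bytes is already immutable. The helper try_to_mutate and the sample values are illustrative only. A slice is used to get a length-1 bytes object, so the sketch does not depend on what plain indexing returns.

```python
from array import array

header = b"HELO"


def try_to_mutate(data):
    """Return True if item assignment on `data` succeeds, False otherwise."""
    try:
        data[0] = 0x4A  # ASCII 'J'
    except TypeError:
        return False
    return True


# A mutable buffer of unsigned bytes, initialised from the immutable value.
buf = array("B", header)
buf[0] = 0x4A
buf.frombytes(b" world")

# Convert back to an immutable bytes object once the edits are done.
frozen_again = buf.tobytes()

# Slicing yields a length-1 bytes object.
first = header[0:1]
```

The original header is untouched by the edits to buf, which is the property the immutable-bytes proposal is after.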
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 18 04:18:01 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 17 Sep 2007 19:18:01 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer Message-ID: This may have passed in a thread where no-one was listening, so I'm repeating it here. I'm considering the following option: bytes would always be immutable, and for the few places (mostly in io.py) where a mutable bytes buffer would be handy, we use the array module. Then it would also make sense to make b[0] return a bytes array of length 1 instead of a small int -- bytes would be more similar to str in 2.x, albeit completely incompatible with str in terms of mixed operations. It would help if someone explored creating a patch to implement this, just to see the minimum amount of code that would need to change compared to 3.0a1. (The challenge includes making all the tests pass again.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From talin at acm.org Tue Sep 18 04:32:47 2007 From: talin at acm.org (Talin) Date: Mon, 17 Sep 2007 19:32:47 -0700 Subject: [Python-3000] Stackless anyone ? In-Reply-To: References: <1189949664.5502.3.camel@schlepp> Message-ID: <46EF38CF.4020801@acm.org> Terry Reedy wrote: > "Sascha Peilicke" wrote in message > news:1189949664.5502.3.camel at schlepp... > | is or has there been any discussion about stackless and py3k? > > No. C. Tismer has focused his current efforts on PyPy. That seems like the right strategy to me. Rather than focusing on a specific implementation, it seems better to me to work on an abstract representation of the Python language which can be "rendered" into various implementations. 
I think for those people in that other thread about threads (which I
won't mention by name for fear of bringing that thread over here), that
the ultimate solution to Python concurrency won't be via patching
CPython, but to compile the meta-Python language to a back-end
representation that is inherently concurrent.

-- Talin

From talin at acm.org Tue Sep 18 04:42:04 2007
From: talin at acm.org (Talin)
Date: Mon, 17 Sep 2007 19:42:04 -0700
Subject: [Python-3000] Immutable bytes -- looking for volunteer
In-Reply-To:
References:
Message-ID: <46EF3AFC.7000904@acm.org>

Guido van Rossum wrote:
> This may have passed in a thread where no-one was listening, so I'm
> repeating it here.
>
> I'm considering the following option: bytes would always be immutable,
> and for the few places (mostly in io.py) where a mutable bytes buffer
> would be handy, we use the array module. Then it would also make sense
> to make b[0] return a bytes array of length 1 instead of a small int
> -- bytes would be more similar to str in 2.x, albeit completely
> incompatible with str in terms of mixed operations.
>
> It would help if someone explored creating a patch to implement this,
> just to see the minimum amount of code that would need to change
> compared to 3.0a1. (The challenge includes making all the tests pass
> again.)

I don't know if I mentioned this before, since (a) I didn't want to be a
distraction while you were busy trying to make mutable bytes work
everywhere, and (b) I didn't want to sound completely insane.

However - here is my vision of how things would look in an ideal world:

Data Type   AbstractSequence   Immutable   Mutable
=========   ================   =========   =======
byte        ByteSequence       bytes       buffer
character   CharSequence       str         strbuf

'buffer' could be an array.array, although if it's used frequently
enough an optimized special-case 'buffer' class might be better. And it
can have methods that array doesn't have.
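The immutable/mutable pairing in the table can be sketched as follows, assuming a released Python 3 in which the mutable byte type ended up being called bytearray. io.StringIO is used here only as a loose stand-in for the 'strbuf' column, since Python has no mutable str type; it is an assumption of this sketch, not something proposed in the thread.

```python
import io

# Byte row: start from the immutable type, edit a mutable copy, freeze again.
frozen = b"abc"
scratch = bytearray(frozen)   # independent mutable copy of the bytes
scratch[0] = ord("A")         # in-place item assignment works on bytearray
scratch.extend(b"def")        # a mutating method that bytes does not have
result = bytes(scratch)       # back to an immutable value

# Character row: accumulate text in a buffer, then take an immutable str.
text_buf = io.StringIO()
text_buf.write("abc")
text_buf.write("def")
text = text_buf.getvalue()
```

The original value stays untouched throughout, which is the property the immutable column is meant to guarantee.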
-- Talin From rrr at ronadam.com Tue Sep 18 05:28:28 2007 From: rrr at ronadam.com (Ron Adam) Date: Mon, 17 Sep 2007 22:28:28 -0500 Subject: [Python-3000] Move argv[0]? (Re: Unicode and OS strings) In-Reply-To: <46EF181A.5040907@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> <46EDFCBB.8010306@canterbury.ac.nz> <46EEA276.4040901@ronadam.com> <46EF181A.5040907@canterbury.ac.nz> Message-ID: <46EF45DC.50501@ronadam.com> Greg Ewing wrote: > Ron Adam wrote: >> Would it be possible to split out the (pre) parsing from optparse so >> that instead of returning a list > > Whatever is done, anything putting itself forward as a light > duty argument parser has to have a *very* simple API. Neither > of the current ones fits my brain, and I have to go looking > up the docs every time I want to use them. I agree. I like reusing dictionaries and lists when possible over special types because I don't have to look up how to use them. Ron From stephen at xemacs.org Tue Sep 18 06:08:29 2007 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Tue, 18 Sep 2007 13:08:29 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1190056321.14217.21.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <1189711575.22693.86.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> <1189722696.30037.14.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> <1189756174.32337.30.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> <1190056321.14217.21.camel@qrnik> Message-ID: <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp> >>>>> "Marcin 'Qrczak' Kowalczyk" writes: >> > Well, for any scheme which attempts to modify UTF-8 by accepting >> > arbitrary byte strings is used, *something* must be interpreted >> > differently than in real UTF-8. >> Wrong. In my scheme everything ends up in the PUA, on which real >> UTF-8 imposes no interpretation by definition. > This is wrong: UTF-8 is specified for PUA. PUA is no special from the > point of view of UTF-8. It is from the point of view of the Unicode standard, specifically v5. Please see section 16.5, especially about the "corporate use subarea". (No, I hadn't considered this stuff yet in my proposal, but it's not hard to accommodate.) > UTF-8 is defined for all Unicode scalar values, Sure, and what I propose is entirely compatible with the specification of UTF-8 as a UTF, unlike what you propose. Until you understand why that's true, we're at an impasse. >> I haven't gone back to check yet, but it's possible that a "real UTF-8 >> conforming process" is required to stop processing and issue an error >> or something like that in the cases we're trying to handle. > "C10. When a process interprets a code unit sequence which purports to > be in a Unicode character encoding form, it shall treat ill-formed code > unit sequences as an error condition and shall not interpret such > sequences as characters." Yeah, that's the one.
While I'm uncomfortable advocating the position that my proposal is entirely compatible with C10, it is true that it treats ill-formed sequences as an error, and it is arguable that "mapping code units to characters in private space" is not the same as "interpreting them as characters". For obvious reasons I'm uncomfortable with that, but I actually don't consider this non-conformance a huge loss in the context of this thread since both your proposal and James Knight's do equally non-conformant things. From stephen at xemacs.org Tue Sep 18 06:56:37 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 18 Sep 2007 13:56:37 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1190070414.20673.12.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <87veaejths.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA0778.3000502@canterbury.ac.nz> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> Message-ID: <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> >>>>> "Marcin 'Qrczak' Kowalczyk" writes: >> When a codec encounters something it can't handle, whether it's a >> valid character in a legacy encoding, a private use character in a >> UTF, or an invalid sequence of code units, it throws an exception >> specifying the character or code unit and the current coded character >> set, > Does this mean that this: > $ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650") > would no longer print e650 in a UTF-8 locale What do you mean "no longer"? 
Look:

chibi:MacPorts steve$ export LC_ALL=en_US.UTF-8
chibi:MacPorts steve$ python -c 'import sys; print("%s" % sys.argv[1])' $(printf "\ue650")
\ue650
chibi:MacPorts steve$ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650")
Traceback (most recent call last):
  File "<string>", line 1, in ?
TypeError: ord() expected a character, but string of length 6 found
chibi:MacPorts steve$

Note that some people are currently arguing that sys.argv should be an
array of bytes objects, and Guido has not yet said "no". In that case,
all of the current proposals should have exactly this result.

My position is that if you do something that depends on the internal
representation of implementation-dependent objects, you deserve whatever
results you get.

From steven.bethard at gmail.com Tue Sep 18 08:40:47 2007
From: steven.bethard at gmail.com (Steven Bethard)
Date: Tue, 18 Sep 2007 00:40:47 -0600
Subject: [Python-3000] Move argv[0]? (Re: Unicode and OS strings)
In-Reply-To: <46EEA276.4040901@ronadam.com>
References: <1189700532.22693.40.camel@qrnik>
 <46EA5114.9060200@coli.uni-saarland.de>
 <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net>
 <46EC8909.4050300@canterbury.ac.nz>
 <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com>
 <46EDFCBB.8010306@canterbury.ac.nz> <46EEA276.4040901@ronadam.com>
Message-ID:

On 9/17/07, Ron Adam wrote:
> Greg Ewing wrote:
> > Thomas Wouters wrote:
> >> If you want to put more meaning in the argv list, use an option
> >> parser.
> >
> > I want to put *less* meaning in it, not more. :-)
> > And using an argument parser is often overkill for
> > simple programs.
>
> Would it be possible to split out the (pre) parsing from optparse so that
> instead of returning a list, it returns a dictionary of attributes and values?
>
> This would only contain what was given in the command line as a first
> "lighter weight" step to parsing the command line.
>
>     opts = opt_parser(argv)
>     command_name = opts['argv0']    # better name for argv0?
You might look at argparse_ which allows you to treat positional
arguments just like optional ones. So you'd write::

    parser = argparse.ArgumentParser()
    parser.add_argument('command')   # positional argument
    parser.add_argument('--option')  # optional argument
    args = parser.parse_args()
    ... args.command ...
    ... args.option ...

If you're really insistent on a dict interface instead of an attribute
interface, the object returned by parse_args() is just a simple
namespace, so vars(args) will give you a dict.

.. _argparse: http://argparse.python-hosting.com/

STeVe
--
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a
tiny blip on the distant coast of sanity.
        --- Bucky Katt, Get Fuzzy

From qrczak at knm.org.pl Tue Sep 18 11:12:19 2007
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 18 Sep 2007 11:12:19 +0200
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp>
References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de>
 <1189711575.22693.86.camel@qrnik>
 <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp>
 <1189722696.30037.14.camel@qrnik>
 <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp>
 <1189756174.32337.30.camel@qrnik>
 <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp>
 <1190056321.14217.21.camel@qrnik>
 <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <1190106739.23701.17.camel@qrnik>

On 18-09-2007, Tue at 13:08 +0900, Stephen J. Turnbull wrote:
> > This is wrong: UTF-8 is specified for PUA. PUA is no special from the
> > point of view of UTF-8.
>
> It is from the point of view of the Unicode standard, specifically v5.
> Please see section 16.5, especially about the "corporate use subarea".

It is not. 16.5 doesn't say anything about UTF-8, and UTF-8 is already
specified for PUA.
> > UTF-8 is defined for all Unicode scalar values, > > Sure, and what I propose is entirely compatible with the specification > of UTF-8 as a UTF, It is not. In UTF-8 '\ue650' is b'\xEE\x99\x90', in your proposal it might be encoded as a single byte. > > "C10. When a process interprets a code unit sequence which purports to > > be in a Unicode character encoding form, it shall treat ill-formed code > > unit sequences as an error condition and shall not interpret such > > sequences as characters." > > Yeah, that's the one. > > While I'm uncomfortable advocating the position that my proposal is > entirely compatible with C10, It is not. Elements of PUA are characters. > it is arguable that "mapping code units to > characters in private space" is not the same as "interpreting them as > characters". It's not the same, but interpreting as characters in PUA is obviously interpreting as characters. > chibi:MacPorts steve$ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650") > Traceback (most recent call last): > File "", line 1, in ? > TypeError: ord() expected a character, but string of length 6 found I meant Python3 where sys.argv is a list of Unicode strings. It should work out of the box. Why length 6? "\ue650" encoded in UTF-8 has length 3. For an old discussion about using PUA to represent bytes undecodable as UTF-8, see http://www.mail-archive.com/unicode at unicode.org/ and subthreads with "roundtripping" in the subject. 
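The byte counts argued over above can be checked with a short sketch, assuming a released Python 3 where str is Unicode and the codec is named explicitly. It encodes the private-use character U+E650 as UTF-8 and also shows that the strict UTF-8 decoder rejects an ill-formed byte, which is the error condition C10 describes. The variable names are only illustrative.

```python
ch = "\ue650"                        # one character in the BMP Private Use Area
encoded = ch.encode("utf-8")         # explicit codec, no locale involved
roundtrip = encoded.decode("utf-8")  # decodes back to the same character

# A lone 0xFF byte can never appear in well-formed UTF-8, so the strict
# decoder raises instead of mapping it to some character.
try:
    b"\xff".decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
```

Naming the codec explicitly sidesteps the question of what the locale or the internal representation happens to be.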
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From guido at python.org Tue Sep 18 17:11:41 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 18 Sep 2007 08:11:41 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 9/17/07, Stephen J. Turnbull wrote: > Note that some people are currently arguing that sys.argv should be an > array of bytes objects, and Guido has not yet said "no". Then let me say "no" now. I'd be happy to support a lower-level API for getting at the actual bytes in the C-level argv and env (even taking into account modifications to these made by C code out of our control; and in Windows we should provide access to the command line text as well). But argv and environ should be strings. If they contain non-ASCII bytes I am currently in favor of doing a best-effort decoding using the default locale encoding, replacing errors with '?' rather than throwing an exception. Others have already explained why (they are typically text entered by a user). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 18 18:50:08 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 18 Sep 2007 09:50:08 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: Message-ID: No takers? What about those repeated +42 voters? Does anyone want immutable bytes enough to do a teensy bit of work?
--Guido On 9/17/07, Guido van Rossum wrote: > This may have passed in a thread where no-one was listening, so I'm > repeating it here. > > I'm considering the following option: bytes would always be immutable, > and for the few places (mostly in io.py) where a mutable bytes buffer > would be handy, we use the array module. Then it would also make sense > to make b[0] return a bytes array of length 1 instead of a small int > -- bytes would be more similar to str in 2.x, albeit completely > incompatible with str in terms of mixed operations. > > It would help if someone explored creating a patch to implement this, > just to see the minimum amount of code that would need to change > compared to 3.0a1. (The challenge includes making all the tests pass > again.) > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From jyasskin at gmail.com Tue Sep 18 19:19:51 2007 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Tue, 18 Sep 2007 10:19:51 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: Message-ID: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> I'll take it. I assume it's just a matter of removing the mutating methods and making the tests pass? I saw but didn't read a couple threads about the buffer API... how much has to change there? On 9/18/07, Guido van Rossum wrote: > No takers? What about those repeated +42 voters? Does anyone want > immutable bytes enough to do a teensy bit of work? > > --Guido > > On 9/17/07, Guido van Rossum wrote: > > This may have passed in a thread where no-one was listening, so I'm > > repeating it here. > > > > I'm considering the following option: bytes would always be immutable, > > and for the few places (mostly in io.py) where a mutable bytes buffer > > would be handy, we use the array module. 
Then it would also make sense > > to make b[0] return a bytes array of length 1 instead of a small int > > -- bytes would be more similar to str in 2.x, albeit completely > > incompatible with str in terms of mixed operations. > > > > It would help if someone explored creating a patch to implement this, > > just to see the minimum amount of code that would need to change > > compared to 3.0a1. (The challenge includes making all the tests pass > > again.) > > > > -- > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/jyasskin%40gmail.com > -- Namasté, Jeffrey Yasskin http://jeffrey.yasskin.info/ "Religion is an improper response to the Divine." -- "Skinny Legs and All", by Tom Robbins From fdrake at acm.org Tue Sep 18 19:28:26 2007 From: fdrake at acm.org (Fred Drake) Date: Tue, 18 Sep 2007 13:28:26 -0400 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: Message-ID: On Sep 18, 2007, at 12:50 PM, Guido van Rossum wrote: > No takers? What about those repeated +42 voters? Does anyone want > immutable bytes enough to do a teensy bit of work? Dang, Guido! I don't eat, sleep, or breathe any more; how quick do you expect me to jump? I'll take a look at it as soon as I can, but won't object if someone beats me to it.
-Fred -- Fred Drake From guido at python.org Tue Sep 18 19:30:35 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 18 Sep 2007 10:30:35 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> Message-ID: On 9/18/07, Jeffrey Yasskin wrote: > I'll take it. I assume it's just a matter of removing the mutating > methods and making the tests pass? And adding __hash__. And (but this could be a separate, later change) switch indexing to return 1-char bytes arrays instead of small ints. And similar changes to the constructor. Of course, the devil is in the "making the tests pass". > I saw but didn't read a couple > threads about the buffer API... how much has to change there? The bytes buffer API should refuse requests for writable buffers. Since you're so close, please do interrupt me over IM to review incomplete work or ideas! --Guido > On 9/18/07, Guido van Rossum wrote: > > No takers? What about those repeated +42 voters? Does anyone want > > immutable bytes enough to do a teensy bit of work? > > > > --Guido > > > > On 9/17/07, Guido van Rossum wrote: > > > This may have passed in a thread where no-one was listening, so I'm > > > repeating it here. > > > > > > I'm considering the following option: bytes would always be immutable, > > > and for the few places (mostly in io.py) where a mutable bytes buffer > > > would be handy, we use the array module. Then it would also make sense > > > to make b[0] return a bytes array of length 1 instead of a small int > > > -- bytes would be more similar to str in 2.x, albeit completely > > > incompatible with str in terms of mixed operations. > > > > > > It would help if someone explored creating a patch to implement this, > > > just to see the minimum amount of code that would need to change > > > compared to 3.0a1. 
(The challenge includes making all the tests pass > > > again.) > > > > > > -- > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > > > > > > -- > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > _______________________________________________ > > Python-3000 mailing list > > Python-3000 at python.org > > http://mail.python.org/mailman/listinfo/python-3000 > > Unsubscribe: http://mail.python.org/mailman/options/python-3000/jyasskin%40gmail.com > > > > > -- > Namasté, > Jeffrey Yasskin > http://jeffrey.yasskin.info/ > > "Religion is an improper response to the Divine." -- "Skinny Legs and > All", by Tom Robbins > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From stephen at xemacs.org Tue Sep 18 22:36:41 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 19 Sep 2007 05:36:41 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1190106739.23701.17.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> <46E96E98.9080406@v.loewis.de> <1189711575.22693.86.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> <1189722696.30037.14.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> <1189756174.32337.30.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> <1190056321.14217.21.camel@qrnik> <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp> <1190106739.23701.17.camel@qrnik> Message-ID: <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> >>>>> "Marcin 'Qrczak' Kowalczyk" writes: >> > This is wrong: UTF-8 is specified for PUA. PUA is no special from the >> > point of view of UTF-8. > >> It is from the point of view of the Unicode standard, specifically v5. >> Please see section 16.5, especially about the "corporate use subarea". > > It is not. 16.5 doesn't say anything about UTF-8, and UTF-8 is already > specified for PUA. There's no UTF-8 in Python's internal string encoding. What are you talking about?
>> Sure, and what I propose is entirely compatible with the specification >> of UTF-8 as a UTF, > > It is not. In UTF-8 '\ue650' is b'\xEE\x99\x90', in your proposal it > might be encoded as a single byte. Of course not; the point of the proposal is to ensure that all text can be round-tripped through Python's internal representation. Anything that comes in as a character through a codec using my exception handler will be the same character when output with that handler. Again, what are you talking about? >> While I'm uncomfortable advocating the position that my proposal is >> entirely compatible with C10, > > It is not. Elements of PUA are characters. Yes. Where did I say anything else? > It's not the same, but interpreting as characters in PUA is obviously > interpreting as characters. No. Internally mapping to characters in PUA is mapping. Unicode does not try to restrict internal processing, only behavior at process boundaries. Interpretation as characters happens only on output. I do not yet know how to prevent that (or even if I can, it may be practically impossible because of important cases where the internal representation is exchanged between processes). If it can't be prevented while maintaining efficiency, that is a major flaw (but not necessarily fatal, since I'm proposing an exception handler, not a required feature of Unicode codecs). > I meant Python3 where sys.argv is a list of Unicode strings. It should > work out of the box. I really don't think so. Exposing internal representations as you are doing here is your problem; it is not something that Python should attempt to guarantee will work. More troublesome from your point of view, Guido has stated that the internal representation used by Python strings is a sequence of Unicode code units, not characters. I don't think that's reached the status of "pronouncement" yet, but you will probably need a PEP to get the guarantees you want. > Why length 6? "\ue650" encoded in UTF-8 has length 3. 
MS UTF-8, I suppose. You see, you simply cannot depend on any particular Python string being translated to a particular Unicode representation unless you choose the codec explicitly. Since you have to specify that codec to be reliable anyway, I don't see much loss here except to lazy programmers willing to live dangerously. But that's not true of anybody in this thread! The whole point is to preserve even broken input for later forensic analysis. > For an old discussion about using PUA to represent bytes undecodable > as UTF-8, see http://www.mail-archive.com/unicode at unicode.org/ and > subthreads with "roundtripping" in the subject. Which (after a half hour of looking) are mostly irrelevant, because Mr. Kristan's proposal (I assume that's what you're talking about) as far as I can see involved standardizing such representations within Unicode. We're not talking about that here; we're talking about representations internal to Python, for the convenience of Python users. From rrr at ronadam.com Tue Sep 18 22:52:50 2007 From: rrr at ronadam.com (Ron Adam) Date: Tue, 18 Sep 2007 15:52:50 -0500 Subject: [Python-3000] Move argv[0]? (Re: Unicode and OS strings) In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> <46EDFCBB.8010306@canterbury.ac.nz> <46EEA276.4040901@ronadam.com> Message-ID: <46F03AA2.4090000@ronadam.com> Steven Bethard wrote: > On 9/17/07, Ron Adam wrote: >> Greg Ewing wrote: >>> Thomas Wouters wrote: >>>> If you want to put more meaning in the argv list, use an option >>>> parser. >>> I want to put *less* meaning in it, not more. :-) >>> And using an argument parser is often overkill for >>> simple programs. 
>> Would it be possible to split out the (pre) parsing from optparse so that >> instead of returning a list, it returns a dictionary of attributes and values? >> >> This would only contain what was given in the command line as a first >> "lighter weight" step to parsing the command line. >> >> opts = opt_parser(argv) >> command_name = opts['argv0'] # better name for argv0? > > You might look at argparse_ which allows you to treat positional > arguments just like optional ones. So you'd write:: > > parser = argparse.ArgumentParser() > parser.add_argument('command') # positional argument > parser.add_argument('--option') # optional argument > args = parser.parse_args() > ... args.command ... > ... args.option ... > > If you're really insistent on a dict interface instead of an attribute > interface, the object returned by parse_args() is just a simple > namespace, so vars(args) will give you a dict. > > .. _argparse: http://argparse.python-hosting.com/ I think a dict interface or even a (list, dict) interface is better in this case. It makes it much easier to use these in already existing functions and other objects. Once an objects data is stored in named attributes, it becomes a more specialized data structure and requires more specialized functions and objects to make use of it. In the above case the attribute names are not even consistent because they depend on the .add_argument() calls. I think this makes it harder to write reusable code. If the parser returned a list and dictionary pair, it might make it easy to use the (*args, **kwds) form to pass these values directly to functions or other objects. That also gives an easy and light weight way to validate command line arguments in the simplest cases without a lot of work. Just let the function receiving them validate its arguments at call time. 
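The (list, dict) idea above can be sketched like this. The pre-parser split_argv and the target function copy_files are hypothetical names invented for illustration; nothing with this interface exists in optparse or argparse, and the "--name=value" convention is an assumption of the sketch.

```python
def split_argv(argv):
    """Hypothetical pre-parser: split arguments into (positionals, options).

    Arguments of the form --name or --name=value become dict entries;
    everything else is kept, in order, in the list.
    """
    positionals, options = [], {}
    for item in argv:
        if item.startswith("--"):
            name, sep, value = item[2:].partition("=")
            # A bare --flag is recorded as True; dashes become underscores
            # so the key can be used as a keyword argument name.
            options[name.replace("-", "_")] = value if sep else True
        else:
            positionals.append(item)
    return positionals, options


def copy_files(src, dst, verbose=False, mode="fast"):
    """Stand-in for an existing function that receives the arguments."""
    return f"{src} -> {dst} (verbose={verbose}, mode={mode})"


args, kwds = split_argv(["a.txt", "b.txt", "--verbose", "--mode=safe"])
summary = copy_files(*args, **kwds)
```

An unknown option such as --bogus reaches copy_files as an unexpected keyword argument and raises TypeError there, which is the "validate at call time" behaviour described above.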
Regards, Ron From jimjjewett at gmail.com Tue Sep 18 23:19:46 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 18 Sep 2007 17:19:46 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> <1189722696.30037.14.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> <1189756174.32337.30.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> <1190056321.14217.21.camel@qrnik> <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp> <1190106739.23701.17.camel@qrnik> <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 9/18/07, Stephen J. Turnbull wrote: > There's no UTF-8 in Python's internal string encoding. What are you > talking about? (At least as of a few days ago) In Python 3 there is; strings are unicode. A PyUnicodeObject object has two encodings that you can grab from a pointer (which means they have to be there; you don't have time to generate them like you would with a function pointer). One of these (str) is the "internal encoding" which is chosen at compile time, and the other (defenc) is now hard-coded to UTF-8. Hashing is also based on the UTF-8 bytestring. -jJ From guido at python.org Tue Sep 18 23:26:09 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 18 Sep 2007 14:26:09 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <1189722696.30037.14.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> <1189756174.32337.30.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> <1190056321.14217.21.camel@qrnik> <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp> <1190106739.23701.17.camel@qrnik> <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 9/18/07, Jim Jewett wrote: > On 9/18/07, Stephen J. 
Turnbull wrote: > > > There's no UTF-8 in Python's internal string encoding. What are you > > talking about? > > (At least as of a few days ago) > > In Python 3 there is; strings are unicode. A PyUnicodeObject object > has two encodings that you can grab from a pointer (which means they > have to be there; you don't have time to generate them like you would > with a function pointer). Incorrect. The pointer can be NULL. The API for getting the UTF-8 encoding is a function (moreover a function whose name starts with _Py). > One of these (str) is the "internal encoding" which is chosen at > compile time, and the other (defenc) is now hard-coded to UTF-8. > > Hashing is also based on the UTF-8 bytestring. Not any more as of a few hours ago; the hashing based on UTF-8 was excessively expensive, and I rewrote it to directly use the code units(?) (or whatever they are called -- the Py_UNICODE values). For strings not using code units(?) > 2**16 this will give the same value on all platforms; if there are code units(?) >= 2**16 results vary since these will be represented as surrogates on 2-byte systems but not on 4-byte systems. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg.ewing at canterbury.ac.nz Tue Sep 18 23:24:24 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 19 Sep 2007 09:24:24 +1200 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <46EF3AFC.7000904@acm.org> References: <46EF3AFC.7000904@acm.org> Message-ID: <46F04208.8070704@canterbury.ac.nz> Talin wrote: > Data Type AbstractSequence Immutable Mutable > ========= ================ ========= ======= > byte ByteSequence bytes buffer > character CharSequence str strbuf > > 'buffer' could be an array.array, although if it's used frequently > enough an optimized special-case 'buffer' class might be better. I'd prefer to keep the term 'buffer' for an object that exposes the buffer interface of another object. 
I suggest calling it a 'bytearray' if you want a specialised type for it. -- Greg From guido at python.org Tue Sep 18 23:29:41 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 18 Sep 2007 14:29:41 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> References: <1189700532.22693.40.camel@qrnik> <18153.42916.640227.483752@uwakimon.sk.tsukuba.ac.jp> <1189722696.30037.14.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> <1189756174.32337.30.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> <1190056321.14217.21.camel@qrnik> <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp> <1190106739.23701.17.camel@qrnik> <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 9/18/07, Stephen J. Turnbull wrote: > Guido has stated that the > internal representation used by Python strings is a sequence of > Unicode code units, not characters. I don't think that's reached the > status of "pronouncement" yet, but you will probably need a PEP to get > the guarantees you want. I think of this as cast in stone; we can't reasonably guarantee more if we want to be compatible with the UTF-16 (*) Unicode representations used on Windows and in Java. How much more pronouncement do you want? (*) I'm not at all sure that it's called that -- you guys keep asking trick questions based on terminology that's only clear to people who have read the Unicode standard several times forwards and backwards. I mean the representation that uses 16-bit values, where characters >= 2**16 are represented as two 16-bit "surrogate" values. (I hope I at least have the 'surrogate' thing right this time.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg.ewing at canterbury.ac.nz Tue Sep 18 23:25:55 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 19 Sep 2007 09:25:55 +1200 Subject: [Python-3000] Stackless anyone ? 
In-Reply-To: <46EF38CF.4020801@acm.org> References: <1189949664.5502.3.camel@schlepp> <46EF38CF.4020801@acm.org> Message-ID: <46F04263.9000609@canterbury.ac.nz> Talin wrote: > the ultimate solution to Python concurrency won't be via patching > CPython, but to compile the meta-Python language to a back-end > representation that is inherently concurrent. You can't get something for nothing, though -- that "inherently concurrent" back-end representation will have to deal with all the same issues one way or another. -- Greg From jimjjewett at gmail.com Wed Sep 19 00:23:18 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 18 Sep 2007 18:23:18 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <18154.9232.740864.946506@uwakimon.sk.tsukuba.ac.jp> <1189756174.32337.30.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> <1190056321.14217.21.camel@qrnik> <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp> <1190106739.23701.17.camel@qrnik> <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 9/18/07, Guido van Rossum wrote: > On 9/18/07, Jim Jewett wrote: > > On 9/18/07, Stephen J. Turnbull wrote: > > > There's no UTF-8 in Python's internal string encoding. > > (At least as of a few days ago) > > In Python 3 there is; strings are unicode. A PyUnicodeObject object > > has two encodings that you can grab from a pointer (which means > > they have to be there; you don't have time to generate them like > > you would with a function pointer). > Incorrect. The pointer can be NULL. I had missed that comment, but I do see it now; thank you. > The API for getting the UTF-8 encoding is a function Thank you. But given that defenc is now always UTF-8, won't exposing it in the public typedef then just be an attractive nuisance? > (moreover a function whose name starts with _Py). That I still don't see. 
http://svn.python.org/view/python/branches/py3k/Include/unicodeobject.h?rev=57656&view=markup PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String( PyObject *unicode /* Unicode object */ ); PyAPI_FUNC(PyObject*) PyUnicode_EncodeUTF8( const Py_UNICODE *data, /* Unicode char buffer */ Py_ssize_t length, /* number of Py_UNICODE chars to encode */ const char *errors /* error handling */ ); Later, the same file shows me: /* --- Unicode Type ------------------------------------------------------- */ typedef struct { PyObject_HEAD Py_ssize_t length; /* Length of raw Unicode data in buffer */ Py_UNICODE *str; /* Raw Unicode buffer */ long hash; /* Hash value; -1 if not set */ int state; /* != 0 if interned. In this case the two * references from the dictionary to this object * are *not* counted in ob_refcnt. */ PyObject *defenc; /* (Default) Encoded version as Python string, or NULL; this is used for implementing the buffer protocol */ } PyUnicodeObject; I would be happier with: typedef struct { PyObject_VAR_HEAD /* Length in code points, not chars */ } PyUnicodeObject; And, in unicodeobject.c (*not* in a public header) typedef struct { PyUnicodeObject ob_unicodehead; Py_UNICODE *str; /* Raw Unicode buffer */ long hash; /* Hash value; -1 if not set */ int state; /* != 0 if interned. In this case the two * references from the dictionary to this object * are *not* counted in ob_refcnt. */ PyObject *defenc; /* (Default) Encoded version as Python string, or NULL; this is used for implementing the buffer protocol */ } _PyDefaultUnicodeObject; As this would allow 3rd parties to create implementations specialized for (and saving space on) smaller alphabets, without breaking C extensions that stick to the public header files. (Moving hash or even state to the public header might be OK too, but they seemed to get ignored for subclasses anyhow.) -jJ From pje at telecommunity.com Wed Sep 19 00:21:05 2007 From: pje at telecommunity.com (Phillip J. 
Eby) Date: Tue, 18 Sep 2007 18:21:05 -0400 Subject: [Python-3000] Stackless anyone ? In-Reply-To: <46F04263.9000609@canterbury.ac.nz> References: <1189949664.5502.3.camel@schlepp> <46EF38CF.4020801@acm.org> <46F04263.9000609@canterbury.ac.nz> Message-ID: <20070918222729.9E38F3A40AC@sparrow.telecommunity.com> At 09:25 AM 9/19/2007 +1200, Greg Ewing wrote: >Talin wrote: > > the ultimate solution to Python concurrency won't be via patching > > CPython, but to compile the meta-Python language to a back-end > > representation that is inherently concurrent. > >You can't get something for nothing, though -- that >"inherently concurrent" back-end representation will >have to deal with all the same issues one way or >another. Right, but since you can write PyPy "C" extensions in RPython, the part you actually get for "free" is that PyPy extensions don't need to be written so as to take concurrency into account. Those bits can be delegated to the "object space", in PyPy terms. From guido at python.org Wed Sep 19 00:29:24 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 18 Sep 2007 15:29:24 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <1189756174.32337.30.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> <1190056321.14217.21.camel@qrnik> <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp> <1190106739.23701.17.camel@qrnik> <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 9/18/07, Jim Jewett wrote: > On 9/18/07, Guido van Rossum wrote: > > On 9/18/07, Jim Jewett wrote: > > > On 9/18/07, Stephen J. Turnbull wrote: > > > > > There's no UTF-8 in Python's internal string encoding. > > > > (At least as of a few days ago) > > > > In Python 3 there is; strings are unicode. 
A PyUnicodeObject object > > > has two encodings that you can grab from a pointer (which means > > > they have to be there; you don't have time to generate them like > > > you would with a function pointer). > > > Incorrect. The pointer can be NULL. > > I had missed that comment, but I do see it now; thank you. > > > The API for getting the UTF-8 encoding is a function > > Thank you. But given that defenc is now always UTF-8, won't exposing > it in the public typedef then just be an attractive nuisance? *ALL* fields of the struct def are strictly internal. > > (moreover a function whose name starts with _Py). > > That I still don't see. I am talking about _PyUnicode_AsDefaultEncoding(). (Which you shouldn't be calling. :-) > http://svn.python.org/view/python/branches/py3k/Include/unicodeobject.h?rev=57656&view=markup > > PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String( > PyObject *unicode /* Unicode object */ > ); > > PyAPI_FUNC(PyObject*) PyUnicode_EncodeUTF8( > const Py_UNICODE *data, /* Unicode char buffer */ > Py_ssize_t length, /* number of Py_UNICODE chars to encode */ > const char *errors /* error handling */ > ); > > > Later, the same file shows me: > > /* --- Unicode Type ------------------------------------------------------- */ > > typedef struct { > PyObject_HEAD > Py_ssize_t length; /* Length of raw Unicode data in buffer */ > Py_UNICODE *str; /* Raw Unicode buffer */ > long hash; /* Hash value; -1 if not set */ > int state; /* != 0 if interned. In this case the two > * references from the dictionary to this object > * are *not* counted in ob_refcnt. 
*/ > PyObject *defenc; /* (Default) Encoded version as Python > string, or NULL; this is used for > implementing the buffer protocol */ > } PyUnicodeObject; > > > I would be happier with: > > typedef struct { > PyObject_VAR_HEAD /* Length in code points, not chars */ > } PyUnicodeObject; > > And, in unicodeobject.c (*not* in a public header) > > typedef struct { > PyUnicodeObject ob_unicodehead; > Py_UNICODE *str; /* Raw Unicode buffer */ > long hash; /* Hash value; -1 if not set */ > int state; /* != 0 if interned. In this case the two > * references from the dictionary to this object > * are *not* counted in ob_refcnt. */ > PyObject *defenc; /* (Default) Encoded version as Python > string, or NULL; this is used for > implementing the buffer protocol */ > } _PyDefaultUnicodeObject; > > As this would allow 3rd parties to create implementations specialized > for (and saving space on) smaller alphabets, without breaking C > extensions that stick to the public header files. (Moving hash or > even state to the public header might be OK too, but they seemed to > get ignored for subclasses anyhow.) That is not a supported use case. 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From foom at fuhm.net Wed Sep 19 00:52:18 2007 From: foom at fuhm.net (James Y Knight) Date: Tue, 18 Sep 2007 18:52:18 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> Message-ID: <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote: > If they contain > non-ASCII bytes I am currently in favor of doing a best-effort > decoding using the default locale encoding, replacing errors with '?' > rather than throwing an exception. One of the more common things to do with command line arguments is open them. So, it'd really be nice if: python -c 'import sys; open(sys.argv[1])' [some filename] would always work, regardless of the current system encoding and what characters make up the filename. Note that filenames are essentially random binary gunk in most Unix systems; the encoding is unspecified, and there can in fact be multiple encodings, even for different directories making up a single file's path. I'd like to propose that Python simply assume the external world is likely to be UTF-8, and always decode command-line arguments (and environment vars), and encode for filesystem operations using the roundtrip-able UTF-8b. Even if the system says its encoding is iso-2022 or some other abomination.
This has upsides (simple, doesn't trample on PUA codepoints, only needs one new codec, never throws an exception in the above example, and really is correct much of the time), and downsides (if the system locale is iso-2022, and all the filenames you're dealing with really are also properly encoded in iso-2022, it might be nice if they decoded into the sensible unicode string, instead of a non-sensical (but still round-trippable) one). I think the advantages outweigh the disadvantages, but in the world I live in, using anything other than UTF8 or ASCII is grounds for entry into an insane asylum. ;) James From guido at python.org Wed Sep 19 01:00:09 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 18 Sep 2007 16:00:09 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> References: <1189700532.22693.40.camel@qrnik> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> Message-ID: On 9/18/07, James Y Knight wrote: > > On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote: > > If they contain > > non-ASCII bytes I am currently in favor of doing a best-effort > > decoding using the default locale encoding, replacing errors with '?' > > rather than throwing an exception. > > One of the more common things to do with command line arguments is > open them. So, it'd really be nice if: > > python -c 'import sys; open(sys.argv[1])' [some filename] I'd like this too, but it isn't easy. > would always work, regardless of the current system encoding and what > characters make up the filename.
Note that filenames are essentially > random binary gunk in most Unix systems; the encoding is unspecified, > and there can in fact be multiple encodings, even for different > directories making up a single file's path. > > I'd like to propose that python simply assume the external world is > likely to be UTF-8, and always decode command-line arguments (and > environment vars), and encode for filesystem operations using the > roundtrip-able UTF-8b. Even if the system says its encoding is > iso-2022 or some other abomination. This has upsides (simple, doesn't > trample on PUA codepoints, only needs one new codec, never throws > exception in the above example, and really is correct much of the > time), and downsides (if the system locale is iso-2022, and all the > filenames you're dealing with really are also properly encoded in > iso-2022, it might be nice if they decoded into the sensible unicode > string, instead of a non-sensical (but still round-trippable) one. > > I think the advantages outweigh the disadvantages, but the world I > live in, using anything other than UTF8 or ASCII is grounds for entry > into an insane asylum. ;) You seem to be contradicting yourself. The world *isn't* using UTF-8(b) predominantly yet, so assuming UTF-8(b) everywhere will break your first requirement. Two encodings are more likely (though not guaranteed) to produce success: the locale encoding or the filesystem encoding. I'm thinking that the locale encoding is probably the one to use for argv and environ, since at least the user can change it in order to make things work. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From thomas at python.org Wed Sep 19 03:40:40 2007 From: thomas at python.org (Thomas Wouters) Date: Tue, 18 Sep 2007 18:40:40 -0700 Subject: [Python-3000] Move argv[0]? 
(Re: Unicode and OS strings) In-Reply-To: <46EDFCBB.8010306@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> <46EDFCBB.8010306@canterbury.ac.nz> Message-ID: <9e804ac0709181840gb33da5co5db610aa2cd6bbeb@mail.gmail.com> On 9/16/07, Greg Ewing wrote: > > Thomas Wouters wrote: > > If you want to put more meaning in the argv list, use an option > > parser. > > I want to put *less* meaning in it, not more. :-) Then why are you discriminating against argv[0]? It's just another member of the argv list the OS gives us. > And using an argument parser is often overkill for > simple programs. So is trying to "fix" this non-issue. > > The _actual_ meaning of each element depends entirely on the > > program that's started. For Python-the-language, there isn't any > > difference between them. > > So in your Python programs, you're quite happy > to write > > for arg in sys.argv: > process(arg) > > and not care about what this does with argv[0]? No. I'm quite happy to realize the argv list is what the shell executed. I'm also quite happy to use a proper option parser even for my simple programs. It adds useful defaults even if I didn't think I'd ever use them. > I hardly see how one can claim that there's > "no difference" between argv[0] and the rest > for practical purposes. The only meaning is by accident of position. For most programs, the very same thing goes for the rest of the arguments: 'mv foo bar' assigns a different meaning to 'foo' than it does to 'bar'. Notice how sys.argv matches what the user typed, including sys.argv[0]. -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
From stephen at xemacs.org Wed Sep 19 07:00:51 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 19 Sep 2007 14:00:51 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> References: <1189700532.22693.40.camel@qrnik> <87wsut7srm.fsf@uwakimon.sk.tsukuba.ac.jp> <46EA1734.6020103@canterbury.ac.nz> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> Message-ID: <873axbi70c.fsf@uwakimon.sk.tsukuba.ac.jp> James Y Knight writes: > iso-2022 or some other abomination. This has upsides (simple, doesn't > trample on PUA codepoints, only needs one new codec, never throws > exception in the above example, and really is correct much of the > time), and downsides (if the system locale is iso-2022, and all the > filenames you're dealing with really are also properly encoded in > iso-2022, it might be nice if they decoded into the sensible unicode > string, instead of a non-sensical (but still round-trippable) one. ISO 2022, like Unicode, is an extensible standard. Corporate character sets in Asia extend it, but are not easy to distinguish from each other though they often conflict. They're not proper in the sense that they abuse the registered final bytes of the national standards they're based on, but it's also not reasonable for those of us who live there to ignore them. > I think the advantages outweigh the disadvantages, but the world I > live in, using anything other than UTF8 or ASCII is grounds for entry > into an insane asylum. ;) You're very fortunate.
In the world I live in, Shift JIS, which isn't even ISO 2022 compatible, is mandated by a power higher even than the Borg of Redmond: the telephone company.

From victor.stinner at haypocalc.com Wed Sep 19 12:12:10 2007 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 19 Sep 2007 12:12:10 +0200 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: Message-ID: <200709191212.10502.victor.stinner@haypocalc.com>

Hi,

On Tuesday 18 September 2007 04:18:01 Guido van Rossum wrote:
> I'm considering the following option: bytes would always be immutable,
> (...) make b[0] return a bytes array of length 1 instead of a small int

Great idea! That will help migration from Python 2.x to Python 3.0. Choosing between byte and character string is already a difficult choice. So choosing between mutable (current bytes type) and immutable string (current str type) is a more difficult choice. And substring behaviour change (python 2.x => 3) was also strange for python programmers.

    >>> 'xyz'[0]
    'x'
    >>> b"xyz"[0]
    120

This result is not symmetric. I would prefer what Guido proposes:

    >>> 'xyz'[0]
    'x'
    >>> b"xyz"[0]
    b'x'

And so be able to write such tests:

    >>> b"xyz"[:2] == b'xy'
    True
    >>> b"xyz"[0:1] == b'x'
    True
    >>> b"xyz"[0] == b'x'
    True

Victor Stinner
http://hachoir.org/

From victor.stinner at haypocalc.com Wed Sep 19 12:40:33 2007 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 19 Sep 2007 12:40:33 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <1189700532.22693.40.camel@qrnik> References: <1189700532.22693.40.camel@qrnik> Message-ID: <200709191240.33698.victor.stinner@haypocalc.com>

Hi,

On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote:
> What should happen when a command line argument or an environment
> variable is not decodable using the system encoding (on Unix where
> from the OS point of view it is an array of bytes)?
On Linux, filenames are *byte* strings, not *character* strings. I always have this problem with Python 2.x. I converted the filename (argv[x]) to Unicode to be able to format error messages in full Unicode... but it's not possible. Linux allows invalid UTF-8 filenames even on a full UTF-8 installation (Ubuntu); see Marcin's examples.

So I propose to keep sys.argv as an array of byte strings. If you try to create unicode strings, you will be unable to write a program that converts a filesystem with "broken" filenames (see the convmv program for example) or opens a file with a "broken" filename (broken: invalid byte sequence for the UTF/JIS/Big5/... charset).

---

For Python 2.x, my solution is to keep byte strings for I/O and use unicode strings for error messages. Function to convert any byte string (filename string) to Unicode:

    # Python 2.x code; both helpers come from hachoir_core (see links below)
    from hachoir_core.i18n import getTerminalCharset
    from hachoir_core.tools import makePrintable

    def unicodeFilename(filename, charset=None):
        if not charset:
            charset = getTerminalCharset()
        try:
            return unicode(filename, charset)
        except UnicodeDecodeError:
            # Undecodable bytes are escaped instead of raising
            return makePrintable(filename, charset, to_unicode=True)

makePrintable() replaces an invalid byte sequence with an escape string, for example:

    >>> from hachoir_core.tools import makePrintable
    >>> makePrintable("a\x80", "utf8", to_unicode=True)
    u'a\\x80'
    >>> print makePrintable("a\x80", "utf8", to_unicode=True)
    a\x80

Source code of function makePrintable: http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/tools.py#L225
Source code of function getTerminalCharset(): http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/i18n.py#L23

Victor Stinner
http://hachoir.org/

From stephen at xemacs.org Wed Sep 19 18:12:51 2007 From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Thu, 20 Sep 2007 01:12:51 +0900 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <200709191240.33698.victor.stinner@haypocalc.com> References: <1189700532.22693.40.camel@qrnik> <200709191240.33698.victor.stinner@haypocalc.com> Message-ID: <878x72y6po.fsf@uwakimon.sk.tsukuba.ac.jp> Victor Stinner writes: > On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote: > > What should happen when a command line argument or an environment > > variable is not decodable using the system encoding (on Unix where > > from the OS point of view it is an array of bytes)? > > On Linux, filenames are *byte* string and not *character* string. I always > have his problem with Python 2.x. I converted filename (argv[x]) to Unicode > to be able to format error messages in full unicode... but it's not possible. > Linux allows invalid utf8 filename even on full utf8 installation (ubuntu), > see Marcin's examples. This should be solved by providing library facilities to handle these conditions. Users and programmers may "know" that file names are actually raw bytes obeying a set of restrictions unique to file names, but they expect to be able to *use* them as characters, and 99.44% of the time that just works.[1] Even for the Japanese, who have over 1500 years' experience in creating unusable writing systems. > So I propose to keep sys.argv as byte string array. If you try to create > unicode strings, you will be unable to write a program to convert filesystem > with "broken" filenames (see convmv program for example) or open file with > broken "filename" (broken: invalid byte sequence for UTF/JIS/Big5/... > charset). This is simply not true. Any of the proposals (Martin's, Marcin's, James's, mine) will make this *possible*. It's just less convenient for the programmer who wishes to deal with such situations. 
This inconvenience is IMO more than balanced by the convenience for the programmer who lives his life in ASCII or whose users just don't do stuff like that, or who's writing a one-off script and doesn't care. N.B. You don't need to go farther than your favorite rootkit to find broken filenames such as "^J" (linefeed). This doesn't cause problems specific to Unicode, of course, but it does demonstrate that a library designed to help with weird file names has broader applicability than just translation to Unicode strings. Footnotes: [1] "99.44%" is an expression of "very pure" derived from an advertising campaign for soap. Here it's an exaggeration, I guess, but nobody knows how much. From lists at cheimes.de Wed Sep 19 18:42:27 2007 From: lists at cheimes.de (Christian Heimes) Date: Wed, 19 Sep 2007 18:42:27 +0200 Subject: [Python-3000] New io system and binary data Message-ID: Today I stumbled over another problem that is related to the unicode and OS string topic. The new io system - or, to be more precise, the implicit conversion of input and output data to UTF-8 - makes it impossible to pipe binary data through Python 3.0. For example, a user wants to write a filter for binary data like images in Python.
With Python 2.5 the input and output data isn't implicitly converted:

    # stdredirect.py
    # simple stupid example
    import sys
    sys.stdout.write(sys.stdin.read())

    $ chmod 755 stdredict.py
    $ cat ./Mac/Demo/html.icons/python.gif | python2.5 stdredirect.py >out.gif
    $ diff ./Mac/Demo/html.icons/python.gif out.gif

But Python 3.0 is using TextIOWrapper for stdin, stdout and stderr:

    $ cat ./Mac/Demo/html.icons/python.gif | ./python stdredirect.py >out.gif
    Traceback (most recent call last):
      File "./stdredict.py", line 4, in
        sys.stdout.write(sys.stdin.read())
      File "/home/heimes/dev/python/py3k/Lib/io.py", line 1225, in read
        res += decoder.decode(self.buffer.read(), True)
      File "/home/heimes/dev/python/py3k/Lib/codecs.py", line 291, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 10-13: invalid data

An easy workaround for the problem is:

    sys.stdout = sys.stdout.buffer
    sys.stdin = sys.stdin.buffer

I recommend that the problem and fix get documented. Maybe stdin, stdout and stderr should get a method that disables the implicit conversion like setMode("b") / setMode("t").

Christian

From guido at python.org Wed Sep 19 19:19:13 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 19 Sep 2007 10:19:13 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: References: Message-ID:

Changing the mode between text and binary is not feasible (since it would have to change the class). But it is perfectly acceptable to use sys.std{in,out}.buffer if you need to write a binary transparent filter. Of course you'll be dealing with bytes at that point so the usual cautions apply. I wouldn't do the assignments you propose though, since that might surprise other code which expects text files.

--Guido

On 9/19/07, Christian Heimes wrote: > Today I stumbled over another problem that is related to the unicode and > OS string topic.
The new io system - or to be more precisely the > implicit converting of input and output data to UTF-8 makes it > impossible to pipe binary data through Python 3.0. > > For example an user wants to write a filter for binary data like images > in Python. With Python 2.5 the input and output data isn't implicitly > converted: > > # stdredirect.py > # simple stupid example > import sys > sys.stdout.write(sys.stdin.read()) > > $ chmod 755 stdredict.py > $ cat ./Mac/Demo/html.icons/python.gif | python2.5 stdredirect.py >out.gif > $ diff ./Mac/Demo/html.icons/python.gif out.gif > > But Python 3.0 is using TextIOWrapper for stdin, stdout and stderr: > > $ cat ./Mac/Demo/html.icons/python.gif | ./python stdredirect.py > >out.gifTraceback (most recent call last): > File "./stdredict.py", line 4, in > sys.stdout.write(sys.stdin.read()) > File "/home/heimes/dev/python/py3k/Lib/io.py", line 1225, in read > res += decoder.decode(self.buffer.read(), True) > File "/home/heimes/dev/python/py3k/Lib/codecs.py", line 291, in decode > (result, consumed) = self._buffer_decode(data, self.errors, final) > UnicodeDecodeError: 'utf8' codec can't decode bytes in position 10-13: > invalid data > > An easy workaround for the problem is: > > sys.stdout = sys.stdout.buffer > sys.stdin = sys.stdin.buffer > > I recommend that the problem and fix gets documented. Maybe stdin, > stdout and stderr should get a method that disables the implicit > conversion like setMode("b") / setMode("t"). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From janssen at parc.com Wed Sep 19 19:56:38 2007 From: janssen at parc.com (Bill Janssen) Date: Wed, 19 Sep 2007 10:56:38 PDT Subject: [Python-3000] New io system and binary data In-Reply-To: References: Message-ID: <07Sep19.105646pdt."57996"@synergy1.parc.xerox.com> GvR wrote: > I wouldn't do the assignments you propose > though, since that might surprise other code which expects text files. 
But presumably that code wouldn't be used in that same program. This really isn't a UTF-8 problem. It is the problem with file opens defaulting to "text" mode instead of "binary" mode rearing its ugly head again. Bill > Changing the mode between text and binary is not feasible (since it > would have to change the class). But it is perfectly acceptable to use > sys.std{in,out}.buffer if you need to write a binary transparent > filter. Of course you'll be dealing with bytes at that point so the > usual cautions apply. I wouldn't do the assignments you propose > though, since that might surprise other code which expects text files. > > --Guido > > On 9/19/07, Christian Heimes wrote: > > Today I stumbled over another problem that is related to the unicode and > > OS string topic. The new io system - or to be more precisely the > > implicit converting of input and output data to UTF-8 makes it > > impossible to pipe binary data through Python 3.0. > > > > For example an user wants to write a filter for binary data like images > > in Python. 
With Python 2.5 the input and output data isn't implicitly > > converted: > > > > # stdredirect.py > > # simple stupid example > > import sys > > sys.stdout.write(sys.stdin.read()) > > > > $ chmod 755 stdredict.py > > $ cat ./Mac/Demo/html.icons/python.gif | python2.5 stdredirect.py >out.gif > > $ diff ./Mac/Demo/html.icons/python.gif out.gif > > > > But Python 3.0 is using TextIOWrapper for stdin, stdout and stderr: > > > > $ cat ./Mac/Demo/html.icons/python.gif | ./python stdredirect.py > > >out.gifTraceback (most recent call last): > > File "./stdredict.py", line 4, in > > sys.stdout.write(sys.stdin.read()) > > File "/home/heimes/dev/python/py3k/Lib/io.py", line 1225, in read > > res += decoder.decode(self.buffer.read(), True) > > File "/home/heimes/dev/python/py3k/Lib/codecs.py", line 291, in decode > > (result, consumed) = self._buffer_decode(data, self.errors, final) > > UnicodeDecodeError: 'utf8' codec can't decode bytes in position 10-13: > > invalid data > > > > An easy workaround for the problem is: > > > > sys.stdout = sys.stdout.buffer > > sys.stdin = sys.stdin.buffer > > > > I recommend that the problem and fix gets documented. Maybe stdin, > > stdout and stderr should get a method that disables the implicit > > conversion like setMode("b") / setMode("t"). From brett at python.org Wed Sep 19 20:12:16 2007 From: brett at python.org (Brett Cannon) Date: Wed, 19 Sep 2007 11:12:16 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: Message-ID: On 9/17/07, Guido van Rossum wrote: > This may have passed in a thread where no-one was listening, so I'm > repeating it here. > > I'm considering the following option: bytes would always be immutable, > and for the few places (mostly in io.py) where a mutable bytes buffer > would be handy, we use the array module. 
Then it would also make sense > to make b[0] return a bytes array of length 1 instead of a small int > -- bytes would be more similar to str in 2.x, albeit completely > incompatible with str in terms of mixed operations. > How far do you want to push the similarity? For instance, would ord() start working on length 1 byte arrays or would int() be the only way to get the integer out of the byte? -Brett From guido at python.org Wed Sep 19 20:24:47 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 19 Sep 2007 11:24:47 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: Message-ID: I think ord() would be fine. On 9/19/07, Brett Cannon wrote: > On 9/17/07, Guido van Rossum wrote: > > This may have passed in a thread where no-one was listening, so I'm > > repeating it here. > > > > I'm considering the following option: bytes would always be immutable, > > and for the few places (mostly in io.py) where a mutable bytes buffer > > would be handy, we use the array module. Then it would also make sense > > to make b[0] return a bytes array of length 1 instead of a small int > > -- bytes would be more similar to str in 2.x, albeit completely > > incompatible with str in terms of mixed operations. > > > > How far do you want to push the similarity? For instance, would ord() > start working on length 1 byte arrays or would int() be the only way > to get the integer out of the byte? > > -Brett > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Wed Sep 19 20:26:03 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 19 Sep 2007 11:26:03 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: <-7804278669952876495@unknownmsgid> References: <-7804278669952876495@unknownmsgid> Message-ID: On 9/19/07, Bill Janssen wrote: > This really isn't a UTF-8 problem. 
It is the problem with file opens
> defaulting to "text" mode instead of "binary" mode rearing its ugly
> head again.

You can repeat that until you're blue in the face but it's not going
to change. Way more programs (especially simple ones) deal with text
than with binary data.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com Wed Sep 19 21:34:39 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 19 Sep 2007 12:34:39 PDT
Subject: [Python-3000] New io system and binary data
In-Reply-To:
References: <-7804278669952876495@unknownmsgid>
Message-ID: <07Sep19.123446pdt."57996"@synergy1.parc.xerox.com>

> You can repeat that until you're blue in the face but it's not going
> to change.

That happens to me a lot :-).

> Way more programs (especially simple ones) deal with text
> than with binary data.

I'd love to see stats on that, Guido. I'm sure it's true in your
immediate vicinity, given what you work on, but I don't believe it's
true in general. And even for "text" files, it raises several questions
about how the text is expressed in a file, which is always a binary
artifact, since files store bytes, not "text".

I'll shut up now...

Bill

From jason.orendorff at gmail.com Wed Sep 19 21:58:49 2007
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Wed, 19 Sep 2007 15:58:49 -0400
Subject: [Python-3000] New io system and binary data
In-Reply-To: <2296611449423656759@unknownmsgid>
References: <-7804278669952876495@unknownmsgid> <2296611449423656759@unknownmsgid>
Message-ID:

On 9/19/07, Bill Janssen wrote:
> > Way more programs (especially simple ones) deal with text
> > than with binary data.
>
> I'd love to see stats on that, Guido. I'm sure it's true in your
> immediate vicinity, given what you work on, but I don't believe it's
> true in general.

Given the context (stdin/stdout/stderr), I'd love to know what you're
thinking of here.
I can't name a program offhand that wants to
operate on binary data via a pipeline. There are a few that *can*,
like gzip, but my impression is that even those aren't often used that
way anymore.

-j

From skip at pobox.com Wed Sep 19 21:59:54 2007
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 19 Sep 2007 14:59:54 -0500
Subject: [Python-3000] New io system and binary data
In-Reply-To:
References: <-7804278669952876495@unknownmsgid>
Message-ID: <18161.32698.291402.642086@montanaro.dyndns.org>

    Guido> You can repeat that until you're blue in the face but it's not
    Guido> going to change. Way more programs (especially simple ones) deal
    Guido> with text than with binary data.

For us Unix-heads the notion that a file is anything other than a stream of
bytes is rather foreign. I understand that to a large degree if you made
the world right for us the tail would be wagging the dog.

Skip

From skip at pobox.com Wed Sep 19 22:06:03 2007
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 19 Sep 2007 15:06:03 -0500
Subject: [Python-3000] New io system and binary data
In-Reply-To:
References: <-7804278669952876495@unknownmsgid> <2296611449423656759@unknownmsgid>
Message-ID: <18161.33067.153099.859501@montanaro.dyndns.org>

    Jason> Given the context (stdin/stdout/stderr), I'd love to know what
    Jason> you're thinking of here. I can't name a program offhand that
    Jason> wants to operate on binary data via a pipeline.

You've obviously never used the netpbm (nee pbmplus, nee pbm) tools. I
still use this pipeline to capture a window fairly frequently:

    xwd | xwdtopnm | pnmtopng > window.png

I believe ImageMagick also operates by means of filter programs transforming
binary data.

    Jason> There are a few that *can*, like gzip, but my impression is that
    Jason> even those aren't often used that way anymore.

Only because the true believers have been overrun by the unwashed masses who
need to use a GUI as a crutch. I use g(un)?zip and b(un)?zip2 as filters
all the time.
It's that elegant Unix model of computing I grew up with. Lots of small tools do one thing well instead of a massively bloated tool that has a swiss army knife drawer full of options. Skip From fdrake at acm.org Wed Sep 19 22:46:45 2007 From: fdrake at acm.org (Fred Drake) Date: Wed, 19 Sep 2007 16:46:45 -0400 Subject: [Python-3000] New io system and binary data In-Reply-To: References: <-7804278669952876495@unknownmsgid> <2296611449423656759@unknownmsgid> Message-ID: <5ACD931A-2E92-4018-ABB2-22C626589961@acm.org> On Sep 19, 2007, at 3:58 PM, Jason Orendorff wrote: > Given the context (stdin/stdout/stderr), I'd love to know what you're > thinking of here. I can't name a program offhand that wants to > operate on binary data via a pipeline. There are a few that *can*, > like gzip, but my impression is that even those aren't often used that > way anymore. Huh. I use pipelines constructed in the shell for binary data regularly; I don't see any reason not to do that. I'd certainly rather see the stdio streams be available as binary data, possibly with convenient text-centric wrappers also available. But I'd be fine with constructing those myself. -Fred -- Fred Drake From guido at python.org Wed Sep 19 23:00:39 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 19 Sep 2007 14:00:39 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: <5ACD931A-2E92-4018-ABB2-22C626589961@acm.org> References: <-7804278669952876495@unknownmsgid> <2296611449423656759@unknownmsgid> <5ACD931A-2E92-4018-ABB2-22C626589961@acm.org> Message-ID: On 9/19/07, Fred Drake wrote: > On Sep 19, 2007, at 3:58 PM, Jason Orendorff wrote: > > Given the context (stdin/stdout/stderr), I'd love to know what you're > > thinking of here. I can't name a program offhand that wants to > > operate on binary data via a pipeline. There are a few that *can*, > > like gzip, but my impression is that even those aren't often used that > > way anymore. > > Huh. 
I use pipelines constructed in the shell for binary data
> regularly; I don't see any reason not to do that. I'd certainly
> rather see the stdio streams be available as binary data, possibly
> with convenient text-centric wrappers also available. But I'd be
> fine with constructing those myself.

I agree that binary pipelines are useful and should be possible. I
just don't think this should be the default behavior for stdin/stdout.

Since the binary stream underlying stdin is readily available as
sys.stdin.buffer (and ditto for stdout and even stderr) I don't think
any action needs to be taken. Note that the instance variable doesn't
start with an underscore. It's part of the public API for text files.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From brett at python.org Wed Sep 19 23:08:16 2007
From: brett at python.org (Brett Cannon)
Date: Wed, 19 Sep 2007 14:08:16 -0700
Subject: [Python-3000] New io system and binary data
In-Reply-To: <18161.32698.291402.642086@montanaro.dyndns.org>
References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org>
Message-ID:

On 9/19/07, skip at pobox.com wrote:
>
>     Guido> You can repeat that until you're blue in the face but it's not
>     Guido> going to change. Way more programs (especially simple ones) deal
>     Guido> with text than with binary data.
>
> For us Unix-heads the notion that a file is anything other than a stream of
> bytes is rather foreign. I understand that to a large degree if you made
> the world right for us the tail would be wagging the dog.

I think the key thing here is that Guido said "especially simple ones"
and the examples people are talking about are not overly simple (e.g.,
gzip, ImageMagick, etc.). That would suggest that if you want to read the
raw bytes from stdin or write them out to stdout, you probably know what
you are doing, and thus accessing a 'buffer' attribute is probably not
difficult for you.
=)

-Brett

From fdrake at acm.org Wed Sep 19 23:42:14 2007
From: fdrake at acm.org (Fred Drake)
Date: Wed, 19 Sep 2007 17:42:14 -0400
Subject: [Python-3000] New io system and binary data
In-Reply-To:
References: <-7804278669952876495@unknownmsgid> <2296611449423656759@unknownmsgid> <5ACD931A-2E92-4018-ABB2-22C626589961@acm.org>
Message-ID:

On Sep 19, 2007, at 5:00 PM, Guido van Rossum wrote:
> Since the binary stream underlying stdin is readily available as
> sys.stdin.buffer (and ditto for stdout and even stderr) I don't think
> any action needs to be taken. Note that the instance variable doesn't
> start with an underscore. It's part of the public API for text files.

Amazingly, that's good enough for me. ;-)

-Fred

--
Fred Drake

From skip at pobox.com Thu Sep 20 00:19:29 2007
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 19 Sep 2007 17:19:29 -0500
Subject: [Python-3000] New io system and binary data
In-Reply-To:
References: <-7804278669952876495@unknownmsgid> <2296611449423656759@unknownmsgid> <5ACD931A-2E92-4018-ABB2-22C626589961@acm.org>
Message-ID: <18161.41073.724795.594482@montanaro.dyndns.org>

    Guido> I agree that binary pipelines are useful and should be
    Guido> possible. I just don't think this should be the default behavior
    Guido> for stdin/stdout.

Binary has (like it or not) been the default behavior on all previous
Pythons running on Unix systems where text and binary were never different
(a view of the computing world which VMS ruined with its morass of file
types and which Windows NT lapped up like antifreeze). The only time I ever
open a file with the "b" attribute is when I expect that code to run on
Windows (thankfully a rare occurrence for me).
Python 3 will obviously be changing behavior in this regard for some of us,
though as I indicated before, satisfying those of us who hold this
perspective (apparently just Bill, Fred and me at this point) would probably
be counter to the needs of the Python community as a whole, fully 90% of
whom think a pipe is made out of PVC, not copper, and can't understand what
I'm telling them to type unless I say

    grep -i python VERTICAL BAR lpr

instead of

    grep -i python PIPE lpr

Unix folks of course know to type '|' when you say PIPE and if you say
VERTICAL BAR they type 'vertical bar'.

But enough wistful reminiscing. I will shut up after one more parting shot:

Dennis-Ritchie-had-it-right-ly yr's,

Skip

From weilawei at gmail.com Thu Sep 20 02:34:23 2007
From: weilawei at gmail.com (Rob Crowther)
Date: Wed, 19 Sep 2007 20:34:23 -0400
Subject: [Python-3000] Implementing Abstract Interface for Numbers
Message-ID:

This is the documentation for PyNumberMethods right now.

    PyNumberMethods *tp_as_number;
        XXX

I've managed to wrap GNU MP floats and add rich comparisons, but there's a
sore lack of documentation on how to implement the Number interface. Given a
few pointers on where to look, an alpha version of this extension will be
available tomorrow, most likely.

Thanks for the help.

Rob

From tjreedy at udel.edu Thu Sep 20 03:24:04 2007
From: tjreedy at udel.edu (Terry Reedy)
Date: Wed, 19 Sep 2007 21:24:04 -0400
Subject: [Python-3000] New io system and binary data
References:
Message-ID:

"Guido van Rossum" wrote in message
news:ca471dc20709191019k2f5e16e5j75767b25ddf90e30 at mail.gmail.com...
| Changing the mode between text and binary is not feasible (since it
| would have to change the class). But it is perfectly acceptable to use
| sys.std{in,out}.buffer if you need to write a binary transparent
| filter.

In PEP 3116, the Buffered I/O section has

    Additionally, the abstract base class provides one member variable:
      .raw
        A reference to the underlying RawIOBase object.

The Text I/O section does *not* have, but I presume should, similar lines
about member variable .buffer.

Perhaps a note could be added that stdin/out will be Text I/O and that the
bytes buffer is easily unwrapped via .buffer (and even via .raw).

While I sympathize with the initial surprise, I am willing to type .buffer
should I need to. The real problem is that 2to3.py cannot do so
automatically (and be always, and probably not even usually, correct).

tjr

From greg.ewing at canterbury.ac.nz Thu Sep 20 03:40:06 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 20 Sep 2007 13:40:06 +1200
Subject: [Python-3000] Move argv[0]? (Re: Unicode and OS strings)
In-Reply-To: <9e804ac0709181840gb33da5co5db610aa2cd6bbeb@mail.gmail.com>
References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> <46EDFCBB.8010306@canterbury.ac.nz> <9e804ac0709181840gb33da5co5db610aa2cd6bbeb@mail.gmail.com>
Message-ID: <46F1CF76.20004@canterbury.ac.nz>

Thomas Wouters wrote:
> The only meaning is by accident of position. For most programs, the very
> same thing goes for the rest of the arguments: 'mv foo bar' assigns a
> different meaning to 'foo' than it does to 'bar'. Notice how sys.argv
> matches what the user typed, including sys.argv[0].

But most users don't think of the 'mv' in 'mv foo bar'
as being an argument in any normal sense of the word.
It's the thing the arguments are passed *to*, not
an argument itself.

Also, most programs aren't interested in argv[0] at
all, and those that are treat it in a very different
way from the rest of argv.

I still think that argv[0] is in the "too clever by
half" category.
It has a kind of theoretical elegance from a certain point of view, but no practical benefit that I can see. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From guido at python.org Thu Sep 20 04:11:51 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 19 Sep 2007 19:11:51 -0700 Subject: [Python-3000] Move argv[0]? (Re: Unicode and OS strings) In-Reply-To: <46F1CF76.20004@canterbury.ac.nz> References: <1189700532.22693.40.camel@qrnik> <46EA5114.9060200@coli.uni-saarland.de> <46EB0EC2.4030208@canterbury.ac.nz> <46EBB779.6090605@gmx.net> <46EC8909.4050300@canterbury.ac.nz> <9e804ac0709160824m3634437dseb2f0183580a7674@mail.gmail.com> <46EDFCBB.8010306@canterbury.ac.nz> <9e804ac0709181840gb33da5co5db610aa2cd6bbeb@mail.gmail.com> <46F1CF76.20004@canterbury.ac.nz> Message-ID: On 9/19/07, Greg Ewing wrote: > I still think that argv[0] is in the "too clever by > half" category. It has a kind of theoretical elegance > from a certain point of view, but no practical > benefit that I can see. And I still think you're wasting your time on trivia. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Thu Sep 20 04:13:15 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 19 Sep 2007 19:13:15 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: References: Message-ID: Yeah, the PEP is pretty out of date (perhaps only surpassed by PEP 3135, super()). It's on my list to update it. This should definitely be added. On 9/19/07, Terry Reedy wrote: > > "Guido van Rossum" wrote in message > news:ca471dc20709191019k2f5e16e5j75767b25ddf90e30 at mail.gmail.com... > | Changing the mode between text and binary is not feasible (since it > | would have to change the class). 
But it is perfectly acceptable to use
> | sys.std{in,out}.buffer if you need to write a binary transparent
> | filter.
>
> In PEP 3116, the Buffered I/O section has
>     Additionally, the abstract base class provides one member variable:
>       .raw
>         A reference to the underlying RawIOBase object.
>
> The Text I/O section does *not* have, but I presume should, similar lines
> about member variable .buffer.
>
> Perhaps a note could be added that stdin/out will be Text I/O and that the
> bytes buffer is easily unwrapped via .buffer (and even via .raw).
>
> While I sympathize with the initial surprise, I am willing to type .buffer
> should I need to. The real problem is that 2to3.py cannot do so
> automatically (and be always, and probably not even usually, correct).
>
> tjr

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From greg.ewing at canterbury.ac.nz Thu Sep 20 05:19:20 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 20 Sep 2007 15:19:20 +1200
Subject: [Python-3000] New io system and binary data
In-Reply-To:
References:
Message-ID: <46F1E6B8.9080609@canterbury.ac.nz>

Christian Heimes wrote:
> With Python 2.5 the input and output data isn't implicitly
> converted

Are you sure that's always true? What about systems
where newlines aren't \n?

> I recommend that the problem and fix gets documented. Maybe stdin,
> stdout and stderr should get a method that disables the implicit
> conversion like setMode("b") / setMode("t").

Or maybe another set of objects called stdbin, stdbout, stdberr.

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiem!                 |
Christchurch, New Zealand          | (I'm not a morning person.)
| greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Thu Sep 20 05:38:06 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 20 Sep 2007 15:38:06 +1200 Subject: [Python-3000] New io system and binary data In-Reply-To: <18161.41073.724795.594482@montanaro.dyndns.org> References: <-7804278669952876495@unknownmsgid> <2296611449423656759@unknownmsgid> <5ACD931A-2E92-4018-ABB2-22C626589961@acm.org> <18161.41073.724795.594482@montanaro.dyndns.org> Message-ID: <46F1EB1E.1080402@canterbury.ac.nz> skip at pobox.com wrote: > Binary has (like it or not) been the default behavior on all previous > Pythons running on Unix systems where text and binary were never different Um, no, *text* has always been the default on all systems. It's just that on systems where text and binary are the same, you don't notice the difference. This has led some Unix programmers into bad habits. > The only time I ever > open a file with the "b" attribute is when I expect that code to run on > Windows A more defensive approach is to always open with "b" when you're dealing with binary data, then it will work even if someone does happen to run it on Windows. Programs following this philosophy won't have any problems with Py3k (at least not from that source). -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From weilawei at gmail.com Thu Sep 20 05:49:29 2007 From: weilawei at gmail.com (Rob Crowther) Date: Wed, 19 Sep 2007 23:49:29 -0400 Subject: [Python-3000] Extension: mpf for GNU MP floating point Message-ID: Okay, here's the barebones, scrapped together version. It's ugly. It's messy. It might eat your kids. On the other hand, it seems to work. 
http://umass.glexia.net/mpf.tar.bz2

It provides a module, mpf, with an MPF type and a bunch of methods.

You can directly set the value of an MPF type by setting the value
attribute to a string containing a number.

You can read the value of an MPF type by getting value. Note that it will
come back as a tuple in the form (base, sign, whole, decimal); whole and
decimal will be strings. This is to avoid losing precision. I couldn't think
of another way to easily work with values that can be (theoretically)
infinite. If the number is 0, value will be None. If there isn't a whole
part (or a decimal part), its place in the tuple will be set to None.

If you want to change the default precision (128 bits), use the keyword
prec in the constructor. It takes a Long.

The MPF type implements rich comparisons.

The MPF type does not (yet) implement the Number methods. It will.

The methods provided by mpf are...

(these methods take two MPF values and return an MPF value):
    mpf_add
    mpf_sub
    mpf_div
    mpf_mul

(these methods take one MPF value and return... gasp.. one MPF value):
    mpf_sqrt
    mpf_abs
    mpf_neg

(this method has a stub and does not exist yet):
    mpf_pow

Sharp Edges:

If you try to set value to something weird, like a dictionary, it will
segfault. PyString_Check wasn't working for me. It's in there, but defined
out.

If you divide by zero, GNU MP will crash it saying "Floating point
exception." Same goes for... well, I forgot already. I've been hacking on
this nearly 24 hours straight and I'm new to both the Python API and GNU MP.

Anything else? Good luck, hope someone likes it. I'll be continuing to work
on it, hopefully eliminate some redundancy in the code, clean it up, flesh
out the floating point support. After that, it's on to integers.

Rob
-------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070919/e65098a1/attachment.htm From jyasskin at gmail.com Thu Sep 20 08:07:19 2007 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Wed, 19 Sep 2007 23:07:19 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> Message-ID: <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> I've attached a very preliminary patch for this. It makes bytes immutable but doesn't do either of the other suggested changes. It's enough to make the tests run, but doesn't do anything to make them pass. The test results so far are: 270 tests OK. 28 tests failed: test_asynchat test_asyncore test_audioop test_base64 test_binascii test_bytes test_codecs test_ftplib test_httplib test_io test_logging test_mailbox test_marshal test_mhlib test_mmap test_old_mailbox test_poplib test_smtplib test_socket test_string test_tarfile test_telnetlib test_unicode test_univnewlines test_urllib2_localnet test_uuid test_xmlrpc test_zipimport 24 tests skipped: test_bsddb3 test_codecmaps_cn test_codecmaps_hk test_codecmaps_jp test_codecmaps_kr test_codecmaps_tw test_curses test_gdbm test_largefile test_locale test_normalization test_ossaudiodev test_pep277 test_socket_ssl test_socketserver test_ssl test_startfile test_timeout test_urllib2net test_urllibnet test_winreg test_winsound test_xmlrpc_net test_zipfile64 1 skip unexpected on darwin: test_ssl On 9/18/07, Guido van Rossum wrote: > On 9/18/07, Jeffrey Yasskin wrote: > > I'll take it. I assume it's just a matter of removing the mutating > > methods and making the tests pass? > > And adding __hash__. And (but this could be a separate, later change) > switch indexing to return 1-char bytes arrays instead of small ints. > And similar changes to the constructor. > > Of course, the devil is in the "making the tests pass". 
> > > I saw but didn't read a couple > > threads about the buffer API... how much has to change there? > > The bytes buffer API should refuse requests for writable buffers. > > Since you're so close, please do interrupt me over IM to review > incomplete work or ideas! > > --Guido > > > On 9/18/07, Guido van Rossum wrote: > > > No takers? What about those repeated +42 voters? Does anyone want > > > immutable bytes enough to do a teensy bit of work? > > > > > > --Guido > > > > > > On 9/17/07, Guido van Rossum wrote: > > > > This may have passed in a thread where no-one was listening, so I'm > > > > repeating it here. > > > > > > > > I'm considering the following option: bytes would always be immutable, > > > > and for the few places (mostly in io.py) where a mutable bytes buffer > > > > would be handy, we use the array module. Then it would also make sense > > > > to make b[0] return a bytes array of length 1 instead of a small int > > > > -- bytes would be more similar to str in 2.x, albeit completely > > > > incompatible with str in terms of mixed operations. > > > > > > > > It would help if someone explored creating a patch to implement this, > > > > just to see the minimum amount of code that would need to change > > > > compared to 3.0a1. (The challenge includes making all the tests pass > > > > again.) > > > > > > > > -- > > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > > > > > > > > > > -- > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > _______________________________________________ > > > Python-3000 mailing list > > > Python-3000 at python.org > > > http://mail.python.org/mailman/listinfo/python-3000 > > > Unsubscribe: http://mail.python.org/mailman/options/python-3000/jyasskin%40gmail.com > > > > > > > > > -- > > Namast?, > > Jeffrey Yasskin > > http://jeffrey.yasskin.info/ > > > > "Religion is an improper response to the Divine." ? 
"Skinny Legs and > > All", by Tom Robbins > > > > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > -- Namast?, Jeffrey Yasskin http://jeffrey.yasskin.info/ "Religion is an improper response to the Divine." ? "Skinny Legs and All", by Tom Robbins -------------- next part -------------- A non-text attachment was scrubbed... Name: preliminary_immutable_bytes.patch Type: application/octet-stream Size: 22657 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070919/c7b91d7e/attachment-0001.obj From lists at cheimes.de Thu Sep 20 12:12:48 2007 From: lists at cheimes.de (Christian Heimes) Date: Thu, 20 Sep 2007 12:12:48 +0200 Subject: [Python-3000] New io system and binary data In-Reply-To: <46F1E6B8.9080609@canterbury.ac.nz> References: <46F1E6B8.9080609@canterbury.ac.nz> Message-ID: Greg Ewing wrote: > Christian Heimes wrote: >> With Python 2.5 the input and output data isn't implicitly >> converted > > Are you sure that's always true? What about systems > where newlines aren't \n? Windows is a strange beast. As far as I can remember the OS converts the incoming and outgoing standard streams to Unix line endings \n. A true binary standard stream on Windows needs some effort - unfortunately. :( >> I recommend that the problem and fix gets documented. Maybe stdin, >> stdout and stderr should get a method that disables the implicit >> conversion like setMode("b") / setMode("t"). > > Or maybe another set of objects called stdbin, stdbout, stdberr. I have given some thoughts to it while I was writing the initial mail. I had the names stdinb, stdoutb and stderrb in mind but your names are better. The problem with the binary stream lies in the fine detail. We can't simply assign sys.stdout.buffer to sys.stdbout. 
I - as a Python user - would expect that stdbout will always use the
same backend as stdout:

Python sets
>>> sys.stdbout = sys.stdout.buffer

Now the user assigns a new file to stdout
>>> sys.stdout = open("myoutput", "w")

and blindly expects that
>>> sys.stdbout.write(b"data\ndata\n")

does the right thing. A proxy like the following (untested) class might do
the trick.

import sys

class StdBinaryFacade:
    """Forward attribute access to the buffer of the current sys.<name>."""

    def __init__(self, name):
        self._name = name

    def __getattr__(self, key):
        # Look up sys.<name> on every access so a reassigned stream is used.
        buffer = getattr(sys, self._name).buffer
        return getattr(buffer, key)

    def __repr__(self):
        # Use the class name; instances have no __name__ of their own.
        return "<%s for sys.%s at %i>" % (self.__class__.__name__,
                                          self._name, id(self))

>>> sys.stdbout = StdBinaryFacade("stdout")

Christian

From eric+python-dev at trueblade.com Thu Sep 20 12:58:31 2007
From: eric+python-dev at trueblade.com (Eric Smith)
Date: Thu, 20 Sep 2007 06:58:31 -0400
Subject: [Python-3000] New io system and binary data
In-Reply-To:
References: <46F1E6B8.9080609@canterbury.ac.nz>
Message-ID: <46F25257.2030509@trueblade.com>

Christian Heimes wrote:
> Greg Ewing wrote:
>> Christian Heimes wrote:
>>> With Python 2.5 the input and output data isn't implicitly
>>> converted
>> Are you sure that's always true? What about systems
>> where newlines aren't \n?
>
> Windows is a strange beast. As far as I can remember the OS converts the
> incoming and outgoing standard streams to Unix line endings \n. A true
> binary standard stream on Windows needs some effort - unfortunately. :(

To be precise, it's not the OS that does this, but rather the C runtime.

Eric.

From adam at hupp.org Thu Sep 20 13:50:08 2007
From: adam at hupp.org (Adam Hupp)
Date: Thu, 20 Sep 2007 07:50:08 -0400
Subject: [Python-3000] Extension: mpf for GNU MP floating point
In-Reply-To:
References:
Message-ID: <766a29bd0709200450p26588f71x8526cd0d0ec64206@mail.gmail.com>

On 9/19/07, Rob Crowther wrote:
> If you try to set value to something weird, like a dictionary, it will
> segfault. PyString_Check wasn't working for me.
It's in there, but defined > out. I think you'll need to use PyUnicode_Check for that. -- Adam Hupp | http://hupp.org/adam/ From adam at hupp.org Thu Sep 20 15:46:36 2007 From: adam at hupp.org (Adam Hupp) Date: Thu, 20 Sep 2007 09:46:36 -0400 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> Message-ID: <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> On 9/20/07, Jeffrey Yasskin wrote: > I've attached a very preliminary patch for this. It makes bytes > immutable but doesn't do either of the other suggested changes. It's > enough to make the tests run, but doesn't do anything to make them > pass. The test results so far are: I have fixes for the following: test_asynchat test_asyncore test_bytes test_string test_base64 test_binascii test_tarfile I'll post a patch later today. -- Adam Hupp | http://hupp.org/adam/ From martin at v.loewis.de Thu Sep 20 15:51:16 2007 From: martin at v.loewis.de (martin at v.loewis.de) Date: Thu, 20 Sep 2007 15:51:16 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <200709191240.33698.victor.stinner@haypocalc.com> References: <1189700532.22693.40.camel@qrnik> <200709191240.33698.victor.stinner@haypocalc.com> Message-ID: <20070920155116.wiziy13ig4kksoc8@webmail.df.eu> > On Linux, filenames are *byte* string and not *character* string. That's not true, although this is a wide-spread misunderstanding. The POSIX standard defines that the file names must be a superset of the portable character set, which includes things such as '/', which is the path separator. > I always > have his problem with Python 2.x. I converted filename (argv[x]) to Unicode > to be able to format error messages in full unicode... but it's not possible. 
> Linux allows invalid utf8 filename even on full utf8 installation (ubuntu), > see Marcin's examples. True. However, this does not mean that the file names are byte strings - they are character strings in an unspecified/undetermined encoding. Regards, Martin From janssen at parc.com Thu Sep 20 17:57:59 2007 From: janssen at parc.com (Bill Janssen) Date: Thu, 20 Sep 2007 08:57:59 PDT Subject: [Python-3000] New io system and binary data In-Reply-To: <46F1E6B8.9080609@canterbury.ac.nz> References: <46F1E6B8.9080609@canterbury.ac.nz> Message-ID: <07Sep20.085807pdt."57996"@synergy1.parc.xerox.com> Greg Ewing writes: > Christian Heimes writes: > > I recommend that the problem and fix gets documented. Maybe stdin, > > stdout and stderr should get a method that disables the implicit > > conversion like setMode("b") / setMode("t"). > > Or maybe another set of objects called stdbin, stdbout, stdberr. Nice idea, but it would have been a tad more true to the origin of the names if "stdin", "stderr", and "stdout" were binary (as the re-use of those fine names automatically implies to anyone who knows what they're doing), and "textin", "textout", and "texterr" were the bogus VMS/Windows corrupted versions of the dandy UNIX originals. Bill From guido at python.org Thu Sep 20 19:08:03 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 20 Sep 2007 10:08:03 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: <1462544809546634408@unknownmsgid> References: <46F1E6B8.9080609@canterbury.ac.nz> <1462544809546634408@unknownmsgid> Message-ID: On 9/20/07, Bill Janssen wrote: > Greg Ewing writes: > > Christian Heimes writes: > > > I recommend that the problem and fix gets documented. Maybe stdin, > > > stdout and stderr should get a method that disables the implicit > > > conversion like setMode("b") / setMode("t"). > > > > Or maybe another set of objects called stdbin, stdbout, stdberr. 
> > Nice idea, but it would have been a tad more true to the origin of the > names if "stdin", "stderr", and "stdout" were binary (as the re-use of > those fine names automatically implies to anyone who knows what > they're doing), and "textin", "textout", and "texterr" were the bogus > VMS/Windows corrupted versions of the dandy UNIX originals. Oh for chrissakes. Can we stop the bikeshedding on this topic already? Several people have already agreed that sys.stdin.buffer is good enough. Please stop while you're ahead. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From thomas at python.org Thu Sep 20 20:03:53 2007 From: thomas at python.org (Thomas Wouters) Date: Thu, 20 Sep 2007 11:03:53 -0700 Subject: [Python-3000] Implementing Abstract Interface for Numbers In-Reply-To: References: Message-ID: <9e804ac0709201103s2b0d3392jdf8c19196c33efa9@mail.gmail.com> On 9/19/07, Rob Crowther wrote: > > This is the documentation for PyNumberMethods right now. > > PyNumberMethods *tp_as_number; > XXX > > > I've managed to wrap GNU MP floats and add rich comparisons, but there's a > sore lack of documentation on how to implement the Number interface. Given a > bit of pointers on where to look, an alpha version of this extension will be > available tomorrow, most likely. I'm not sure where you saw that 'XXX' -- are you looking at Py3k docs? In that case, don't bother, the Numbers API has hardly changed, just use the Python 2.5 docs. Or the Python 2.0 docs, as there's little difference ;) But it's true there isn't all that much documentation on those parts. The PyNumberMethods struct is really straightforward, you should be able to guess what each function is supposed to do just by looking at the function signature and the name. But when in doubt, the best place to go is the Python source. Just look at Objects/intobject.c or Objects/longobject.c. -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! 
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070920/8ef58daa/attachment.htm From jyasskin at gmail.com Thu Sep 20 21:34:57 2007 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Thu, 20 Sep 2007 12:34:57 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> Message-ID: <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> On 9/20/07, Adam Hupp wrote: > On 9/20/07, Jeffrey Yasskin wrote: > > I've attached a very preliminary patch for this. It makes bytes > > immutable but doesn't do either of the other suggested changes. It's > > enough to make the tests run, but doesn't do anything to make them > > pass. The test results so far are: > > I have fixes for the following: > > test_asynchat > test_asyncore > test_bytes > test_string > test_base64 > test_binascii > test_tarfile > > I'll post a patch later today. Thanks for the help! This brings up a policy question: For patches like the one I've attached here, do we want to start submitting them now, or build up a mondo patch to fix them all at once? -- Namasté, Jeffrey Yasskin -------------- next part -------------- A non-text attachment was scrubbed...
Name: sample_test_changes.patch Type: application/octet-stream Size: 2727 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070920/0aae4034/attachment-0001.obj From tjreedy at udel.edu Thu Sep 20 22:09:51 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 20 Sep 2007 16:09:51 -0400 Subject: [Python-3000] Immutable bytes -- looking for volunteer References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com><5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com><766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> Message-ID: "Jeffrey Yasskin" wrote in message news:5d44f72f0709201234vec00c4w13d41bf5c4bea8d7 at mail.gmail.com... | On 9/20/07, Adam Hupp wrote: || > | > I have fixes for the following: ... | > I'll post a patch later today. | | Thanks for the help! This brings up a policy question: For patches | like the one I've attached here, do we want to start submitting them | now, or build up a mondo patch to fix them all at once? I think it OK to post patches to the tracker even if one does not intend for them to be immediately applied or even expect them to be combined with others. Then they do not get lost (or mangled by the mail system) and are available to anyone. 
From guido at python.org Thu Sep 20 22:18:00 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 20 Sep 2007 13:18:00 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> Message-ID: On 9/20/07, Jeffrey Yasskin wrote: > On 9/20/07, Adam Hupp wrote: > > On 9/20/07, Jeffrey Yasskin wrote: > > > I've attached a very preliminary patch for this. It makes bytes > > > immutable but doesn't do either of the other suggested changes. It's > > > enough to make the tests run, but doesn't do anything to make them > > > pass. The test results so far are: > > > > I have fixes for the following: > > > > test_asynchat > > test_asyncore > > test_bytes > > test_string > > test_base64 > > test_binascii > > test_tarfile > > > > I'll post a patch later today. > > Thanks for the help! This brings up a policy question: For patches > like the one I've attached here, do we want to start submitting them > now, or build up a mondo patch to fix them all at once? This is supposed to be an exploration of the possibilities. So either you create a branch, where you can submit to your heart's content, or you collect everything in one big jumbo patch (in which case I'd recommend reserving a tracker item). 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From adam at hupp.org Fri Sep 21 00:48:14 2007 From: adam at hupp.org (Adam Hupp) Date: Thu, 20 Sep 2007 18:48:14 -0400 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> Message-ID: <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> On 9/20/07, Jeffrey Yasskin wrote: > > Thanks for the help! This brings up a policy question: For patches > like the one I've attached here, do we want to start submitting them > now, or build up a mondo patch to fix them all at once? My changes are here: http://bugs.python.org/issue1184 With that patch there are only two issues remaining (6 test failures). -- Adam Hupp | http://hupp.org/adam/ From greg.ewing at canterbury.ac.nz Fri Sep 21 03:00:38 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 21 Sep 2007 13:00:38 +1200 Subject: [Python-3000] New io system and binary data In-Reply-To: <07Sep20.085807pdt.57996@synergy1.parc.xerox.com> References: <46F1E6B8.9080609@canterbury.ac.nz> <07Sep20.085807pdt.57996@synergy1.parc.xerox.com> Message-ID: <46F317B6.6070202@canterbury.ac.nz> Bill Janssen wrote: > Nice idea, but it would have been a tad more true to the origin of the > names if "stdin", "stderr", and "stdout" were binary (as the re-use of > those fine names automatically implies to anyone who knows what > they're doing) No, the names only imply that to Unix users who are ignorant of the correct way to use the C stdio library portably. Right from the beginning, binary mode was an option, and if you didn't ask for it, you got text mode. The same thing applies to stdin/out/err. 
Anyone using them to handle binary data is and was writing non-portable code. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From jimjjewett at gmail.com Fri Sep 21 16:00:38 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 21 Sep 2007 10:00:38 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> References: <1189700532.22693.40.camel@qrnik> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> Message-ID: On 9/18/07, James Y Knight wrote: > On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote: > One of the more common things to do with command line arguments is > open them. So, it'd really be nice if: > python -c 'import sys; open(sys.argv[1])' [some filename] > would always work, regardless of the current system encoding and what > characters make up the filename. (Outside ASCII), if you treat sys.argv as text, that is probably impossible without filesystem support. Before python even sees the data, the terminal itself is allowed to change between canonical equivalents, which have different binary representations. It does sound like we need a way to get to the original bytes, similar to sys.stdin.buffer. Is it reasonable to expose sys.argv.buffer? (Since this would be bytes rather than text, I assume this would be a single array, rather than a list of already separated arguments.) Similarly, could os.environ have a bytes mirror, where the keys and values are (immutable) bytes? 
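A minimal sketch of what a bytes view of the command line and the environment could look like. It relies on os.fsencode, os.environb and os.supports_bytes_environ, which are names from later Python 3 releases and not anything proposed in this thread; the helper names argv_as_bytes and environ_as_bytes are hypothetical. Note that it returns one bytes object per argument, not a single array.

```python
import os
import sys


def argv_as_bytes(argv=None):
    """Return the command-line arguments as a list of bytes objects.

    Uses os.fsencode (an assumption: available in later Python 3
    releases), which applies the filesystem encoding to each str.
    """
    args = sys.argv if argv is None else argv
    return [os.fsencode(arg) for arg in args]


def environ_as_bytes():
    """Return a bytes-to-bytes snapshot of the process environment."""
    if os.supports_bytes_environ:
        # Platforms whose native environment is bytes expose os.environb.
        return dict(os.environb)
    # Elsewhere, derive the bytes form from the text mapping.
    return {os.fsencode(k): os.fsencode(v) for k, v in os.environ.items()}


if __name__ == "__main__":
    for raw in argv_as_bytes()[1:]:
        print(raw)
```

The snapshot returned by environ_as_bytes is a plain dict, so changing it does not change the real environment.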
-jJ From p.f.moore at gmail.com Fri Sep 21 16:41:03 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 21 Sep 2007 15:41:03 +0100 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> Message-ID: <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> On 21/09/2007, Jim Jewett wrote: > (Outside ASCII), if you treat sys.argv as text, that is probably > impossible without filesystem support. Before python even sees the > data, the terminal itself is allowed to change between canonical > equivalents, which have different binary representations. Please note - this statement is Unix specific. The situation on Windows is entirely different (the fact that the CRT on Windows emulates some aspects of the Unix semantics is not relevant here - you need to understand the underlying OS model). If you want to redesign things (and I don't, personally, believe that is a good idea) then make sure you don't base your design solely on Unix semantics. Paul. From jimjjewett at gmail.com Fri Sep 21 16:45:51 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 21 Sep 2007 10:45:51 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <18155.9131.229187.756043@uwakimon.sk.tsukuba.ac.jp> <1190056321.14217.21.camel@qrnik> <18159.20285.252979.634446@uwakimon.sk.tsukuba.ac.jp> <1190106739.23701.17.camel@qrnik> <18160.14041.403941.778059@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 9/18/07, Guido van Rossum wrote: > On 9/18/07, Jim Jewett wrote: > > ... given that defenc is now always UTF-8, won't exposing > > it in the public typedef then just be an attractive nuisance? 
> *ALL* fields of the struct def are strictly internal. Is that policy documented somewhere? I didn't get that impression from the C API, the Extending and Embedding document, or from the header itself. In the header, it was above the "public API" line, but so were things like Py_UNICODE_REPLACEMENT_CHARACTER, and it does start with Py rather than _Py. Other declarations, such as _PyUnicode_AsDefaultEncodedString, were clearly marked as internal in both comments and name. > > [ My proposal to remove *str and *defenc from definition in > > the public .h file.) > > As this would allow 3rd parties to create implementations specialized > > for (and saving space on) smaller alphabets, without breaking C > > extensions that stick to the public header files. > That is not a supported use case. Why not? If it is just for lack of contributions, I'll shut up until I find time. But it sounds (and has sounded in the past) like a policy decision -- and I want to know the reasoning behind it. -jJ From exarkun at divmod.com Fri Sep 21 16:46:47 2007 From: exarkun at divmod.com (Jean-Paul Calderone) Date: Fri, 21 Sep 2007 10:46:47 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: Message-ID: <20070921144647.8162.742888147.divmod.quotient.12306@ohm> On Fri, 21 Sep 2007 10:00:38 -0400, Jim Jewett wrote: > [snip] > >It does sound like we need a way to get to the original bytes, similar >to sys.stdin.buffer. Is it reasonable to expose sys.argv.buffer? >(Since this would be bytes rather than text, I assume this would be a >single array, rather than a list of already separated arguments.) Without commenting on whether this is a good idea overall or not, it would not be a single array, rather than a list of already separated arguments, because it is given to the C main() function as an array of char*, not a single char*. 
On Windows it's more complicated, but the same argument can probably be applied (or it should also reflect the underlying system API on Windows, which means on Windows it will be a single bytes object instead of a list of them, but only on Windows. This goes beyond even the 2.x level of low-level detail exposure). Jean-Paul From jimjjewett at gmail.com Fri Sep 21 17:01:24 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 21 Sep 2007 11:01:24 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> Message-ID: On 9/21/07, Paul Moore wrote: > On 21/09/2007, Jim Jewett wrote: > > (Outside ASCII), if you treat sys.argv as text, that is probably > > impossible without filesystem support. Before python even sees the > > data, the terminal itself is allowed to change between canonical > > equivalents, which have different binary representations. > Please note - this statement is Unix specific. The situation on > Windows is entirely different (the fact that the CRT on Windows > emulates some aspects of the Unix semantics is not relevant here - you > need to understand the underlying OS model). No; it is a consequence of unicode. The command shell (or other program launcher) has the same freedom. If you are using text (as opposed to bytes), then À can be either U+00C0 or <U+0041, U+0300>. If the file system makes a distinction, then it is using bytes, and any program interacting with it needs* to use bytes too. * To be correct; in practice, the problems will occur rarely enough that most people won't notice.
-jJ From theller at ctypes.org Fri Sep 21 17:18:01 2007 From: theller at ctypes.org (Thomas Heller) Date: Fri, 21 Sep 2007 17:18:01 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <20070921144647.8162.742888147.divmod.quotient.12306@ohm> References: <20070921144647.8162.742888147.divmod.quotient.12306@ohm> Message-ID: Jean-Paul Calderone wrote: > On Fri, 21 Sep 2007 10:00:38 -0400, Jim Jewett wrote: >> [snip] >> >>It does sound like we need a way to get to the original bytes, similar >>to sys.stdin.buffer. Is it reasonable to expose sys.argv.buffer? >>(Since this would be bytes rather than text, I assume this would be a >>single array, rather than a list of already separated arguments.) > > Without commenting on whether this is a good idea overall or not, it > would not be a single array, rather than a list of already separated > arguments, because it is given to the C main() function as an array > of char*, not a single char*. > > On Windows it's more complicated, but the same argument can probably > be applied (or it should also reflect the underlying system API on > Windows, which means on Windows it will be a single bytes object > instead of a list of them, but only on Windows. This goes beyond > even the 2.x level of low-level detail exposure). I *hope* that on Windows, these objects will be unicode objects, not bytes objects - the wide Windows API should be used to get these values. No conversion needed.
Thomas From murman at gmail.com Fri Sep 21 17:22:29 2007 From: murman at gmail.com (Michael Urman) Date: Fri, 21 Sep 2007 10:22:29 -0500 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> Message-ID: On 9/21/07, Jim Jewett wrote: > (Outside ASCII), if you treat sys.argv as text, that is probably > impossible without filesystem support. Before python even sees the > data, the terminal itself is allowed to change between canonical > equivalents, which have different binary representations. > > It does sound like we need a way to get to the original bytes, similar > to sys.stdin.buffer. Is it reasonable to expose sys.argv.buffer? If there's not something straightforward to put in the ... below that would allow simple iteration and processing of all files passed on the command line, preferably interchangeably on both unix (where filenames cannot necessarily be converted to Unicode) and Windows NT and up (where filenames cannot necessarily be represented by bytestrings, and arguments don't necessarily come in as bytes), then I will be one of many disappointed people. >>> arguments = ... # something equivalent to (python 2.x on unix) sys.argv[1:] >>> for filename in arguments: ... archive.add(filename) # definitely - akin to open(file) ... print(filename, file=listing) # maybe - this makes too many assumptions Obviously simple things like replacing an un(de/en)codable character with '?' 
will fail - while they could be partially worked around by using glob (assuming a one to one replacement, as processed by the OS), that's just asking for an unwitting corner-case behavior when another file nearly matches the name of another with a replaced character. I don't have a preference between sys.argv[1:] doing this like it always has on unix, and tends to within a single locale on Windows; the introduction of a new sys.arguments (either [0:] or [1:]); or even some simple map(encode_step, sys.argv[1:]). Of course the problem with the encode_step is unless it is a no-op on Windows, it can break filenames as badly as decoding them will on unix, unless the common OS interfaces all reverse the process (in which case doing it manually is never necessary). Michael -- Michael Urman From p.f.moore at gmail.com Fri Sep 21 17:59:43 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 21 Sep 2007 16:59:43 +0100 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> Message-ID: <79990c6b0709210859h4116a7ecx1d1e024f16698483@mail.gmail.com> On 21/09/2007, Jim Jewett wrote: > If you are using text (as opposed to bytes), then À can be either > U+00C0 or <U+0041, U+0300>. If the file system makes a distinction, > then it is using bytes, and any program interacting with it needs* to > use bytes too. OK. I don't know enough about Unicode (or this low a level of the Windows API) to be sure. But it's certainly possible that under Windows, the file system (API) doesn't make a distinction. > * To be correct; in practice, the problems will occur rarely enough > that most people won't notice. Too right.
The only explicit case of an issue that I'm aware of is the one that started the thread, of a Unix system with incompatible terminal and filesystem encodings (or was it extremely obscure shell incantations? Whatever, it was well beyond my level of Unix knowledge). I'd say YAGNI except that someone seems to have demonstrated a genuine (if rare) need on Unix. I'll stick with YAGNI on Windows, though. (Where's uncle Tim to point out that Windows is the better platform when you need him? :-)) Paul. PS I'm now so far out of my depth on Unicode issues that I'll drop out of this thread at this point. From guido at python.org Fri Sep 21 18:52:52 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 21 Sep 2007 09:52:52 -0700 Subject: [Python-3000] Py3k Trivia :-) Message-ID: Would you believe there's a Curious George episode named "Curious George Vs The Turbo Python 3000"? """ George isn't tall enough to ride the greatest rollercoaster of all time, The Turbo Python 3000. He uses licorice whips to measure his height and determines that he is 7-whips tall, one short of the 8-whip minimum! """ http://pbskids.org/curiousgeorge/parentsteachers/program/ep_desc_3.html -- --Guido van Rossum (home page: http://www.python.org/~guido/) From lists at cheimes.de Wed Sep 19 22:28:08 2007 From: lists at cheimes.de (Christian Heimes) Date: Wed, 19 Sep 2007 22:28:08 +0200 Subject: [Python-3000] New io system and binary data In-Reply-To: References: <-7804278669952876495@unknownmsgid> Message-ID: <46F18658.40805@cheimes.de> Guido van Rossum wrote: > You can repeat that until you're blue in the face but it's not going > to change. Way more programs (especially simple ones) deal with text > than with binary data. I have to agree with Guido. The new behavior is much better than the default in Python 2.x. It seems that I'm the first user with a use case which requires a binary stdin and stdout. I can imagine two problems with the new way.
The problems should have a documented answer and best way to deal with them: * stdin or stdout are used in binary mode * stdin or stdout have to deal with data in a different encoding than UTF-8 Christian From arvind1.singh at gmail.com Fri Sep 21 19:31:10 2007 From: arvind1.singh at gmail.com (Arvind Singh) Date: Fri, 21 Sep 2007 23:01:10 +0530 Subject: [Python-3000] decorators for variable assignments? Message-ID: Hi, We have function and class decorators. Can we also have decorators for variable assignments? For example: @validate_proxy proxy = "http://user:passwd at host:port/" be a syntactical sugar for: proxy = validate_proxy("http://user:passwd at host:port/") Python is often used as a configuration language (small utility scripts, Django, etc.) and it makes more sense to have the validation of a user supplied configuration value at the time of assignment rather than leaving the burden of validation on every piece of code that uses it. Although both approaches can be used for it, a user will be more hesitant to edit the string in the latter form (the former one is also more readable IMHO). Arvind PS: I hope it's not something too radical to talk so late about. From guido at python.org Fri Sep 21 19:40:55 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 21 Sep 2007 10:40:55 -0700 Subject: [Python-3000] decorators for variable assignments? In-Reply-To: References: Message-ID: None of the arguments for function and class decorators apply here. On 9/21/07, Arvind Singh wrote: > Hi, > > We have function and class decorators. Can we also have decorators for > variable assignments? > > For example: > @validate_proxy > proxy = "http://user:passwd at host:port/" > > be a syntactical sugar for: > proxy = validate_proxy("http://user:passwd at host:port/") > > > Python is often used as a configuration language (small utility > scripts, Django, etc.) 
and it makes more sense to have the validation > of a user supplied configuration value at the time of assignment > rather than leaving the burden of validation on every piece of code > that uses it. Although both approaches can be used for it, a user will > be more hesitant to edit the string in the latter form (the former one > is also more readable IMHO). > > > Arvind > > PS: I hope it's not something too radical to talk so late about. > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From tjreedy at udel.edu Fri Sep 21 23:51:57 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 21 Sep 2007 17:51:57 -0400 Subject: [Python-3000] decorators for variable assignments? References: Message-ID: | @validate_proxy | proxy = "http://user:passwd at host:port/" | | be a syntactical sugar for: | proxy = validate_proxy("http://user:passwd at host:port/") Sorry, to me, this is syntactical pepper -- or worse ;-) tjr From tjreedy at udel.edu Sat Sep 22 00:17:13 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 21 Sep 2007 18:17:13 -0400 Subject: [Python-3000] Unicode and OS strings References: <1189700532.22693.40.camel@qrnik><46EB0DC0.3050906@canterbury.ac.nz><87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp><46EB6EA1.5020104@v.loewis.de><87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp><1190070414.20673.12.camel@qrnik><18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp><32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> Message-ID: "Michael Urman" wrote in message news:dcbbbb410709210822p354ef608o6cd01994a67710f1 at mail.gmail.com... | If there's not something straightforward to put in the ... 
below that | would allow simple iteration and processing of all files passed on the | command line, preferably interchangeably on both unix (where filenames | cannot necessarily be converted to Unicode) and Windows NT and up | (where filenames cannot necessarily be represented by bytestrings, and | arguments don't necessarily come in as bytes), then I will be one of | many disappointed people. Perhaps we need one or more library functions (generators) to hide the OS differences and corner-case details. From arvind1.singh at gmail.com Sat Sep 22 00:23:54 2007 From: arvind1.singh at gmail.com (Arvind Singh) Date: Sat, 22 Sep 2007 03:53:54 +0530 Subject: [Python-3000] decorators for variable assignments? In-Reply-To: References: Message-ID: > | @validate_proxy > | proxy = "http://user:passwd at host:port/" > | > | be a syntactical sugar for: > | proxy = validate_proxy("http://user:passwd at host:port/") > > Sorry, to me, this is syntactical pepper -- or worse ;-) "Poison" perhaps? Then, maybe we can have Poisonous Python! :-) -- Regards, Arvind From nicko at nicko.org Sat Sep 22 02:18:48 2007 From: nicko at nicko.org (Nicko van Someren) Date: Sat, 22 Sep 2007 01:18:48 +0100 Subject: [Python-3000] decorators for variable assignments? 
In-Reply-To: References: Message-ID: <2EE33BBD-C301-4CBF-BCCB-3BBE9204EF56@nicko.org> On 21 Sep 2007, at 22:51, Terry Reedy wrote: > | @validate_proxy > | proxy = "http://user:passwd at host:port/" > | > | be a syntactical sugar for: > | proxy = validate_proxy("http://user:passwd at host:port/") > > Sorry, to me, this is syntactical pepper -- or worse ;-) I'm thinking it tends towards "syntactic hákarl" :-) Nicko [*] http://en.wikipedia.org/wiki/Hakarl From greg.ewing at canterbury.ac.nz Sat Sep 22 03:05:53 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 22 Sep 2007 13:05:53 +1200 Subject: [Python-3000] Py3k Trivia :-) In-Reply-To: References: Message-ID: <46F46A71.1060409@canterbury.ac.nz> Guido van Rossum wrote: > """ > George isn't tall enough to ride the greatest rollercoaster of all > time, The Turbo Python 3000. He uses licorice whips to measure his > height and determines that he is 7-whips tall, one short of the 8-whip > minimum! > """ Fantastic! I vote that we hereby adopt the licorice whip as the standard unit for measuring the speed of Python 3.0 implementations, with the speed of 2.6 (whatever it turns out to be) defined as 7 whips. -- Greg From greg.ewing at canterbury.ac.nz Sat Sep 22 03:08:53 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 22 Sep 2007 13:08:53 +1200 Subject: [Python-3000] decorators for variable assignments? In-Reply-To: References: Message-ID: <46F46B25.6030907@canterbury.ac.nz> On 9/21/07, Arvind Singh wrote: > @validate_proxy > proxy = "http://user:passwd at host:port/" > > it makes more sense to have the validation > of a user supplied configuration value at the time of assignment > rather than leaving the burden of validation on every piece of code > that uses it. Why can't you just use a property that does the validation in its set method?
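The property approach can be sketched as follows. This is only an illustration under stated assumptions: the Config class and its validation rule are hypothetical, the URL check uses urllib.parse.urlsplit from the later Python 3 standard library, and the @proxy.setter spelling is the later syntax for what would be written property(get, set) at the time of this thread.

```python
from urllib.parse import urlsplit


class Config:
    """Hypothetical configuration holder that validates on assignment."""

    def __init__(self, proxy):
        # Goes through the property setter, so validation also runs here.
        self.proxy = proxy

    @property
    def proxy(self):
        return self._proxy

    @proxy.setter
    def proxy(self, value):
        parts = urlsplit(value)
        # Illustrative rule only: require an http(s) scheme and a host.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            raise ValueError(f"not a usable proxy URL: {value!r}")
        self._proxy = value
```

Every assignment to config.proxy is checked at the time it happens, so code that later reads the value does not need to validate it again.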
-- Greg From oliphant.travis at ieee.org Sat Sep 22 03:40:18 2007 From: oliphant.travis at ieee.org (Travis Oliphant) Date: Fri, 21 Sep 2007 20:40:18 -0500 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: Message-ID: Guido van Rossum wrote: > This may have passed in a thread where no-one was listening, so I'm > repeating it here. > > I'm considering the following option: bytes would always be immutable, > and for the few places (mostly in io.py) where a mutable bytes buffer > would be handy, we use the array module. Then it would also make sense > to make b[0] return a bytes array of length 1 instead of a small int > -- bytes would be more similar to str in 2.x, albeit completely > incompatible with str in terms of mixed operations. If it is decided to make bytes immutable (which sounds good to me), then I want to add my voice to those that clamor for an additional mutable object capable of allocating chunks of memory. This object should have a C-API and have its structure exposed to extension module writers (thus array.array does not fit the bill -- but might be a prototype if some of it is moved over to the Objects directory and given an API). -Travis Oliphant From guido at python.org Sat Sep 22 03:55:10 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 21 Sep 2007 18:55:10 -0700 Subject: [Python-3000] Py3k Trivia :-) In-Reply-To: <46F46A71.1060409@canterbury.ac.nz> References: <46F46A71.1060409@canterbury.ac.nz> Message-ID: On 9/21/07, Greg Ewing wrote: > Guido van Rossum wrote: > > """ > > George isn't tall enough to ride the greatest rollercoaster of all > > time, The Turbo Python 3000. He uses licorice whips to measure his > > height and determines that he is 7-whips tall, one short of the 8-whip > > minimum! > > """ > > Fantastic!
I vote that we hereby adopt the licorice whip > as the standard unit for measuring the speed of Python 3.0 > implementations, with the speed of 2.6 (whatever it turns > out to be) defined as 7 whips. Ah, but is 6 whips faster or slower than 7 whips? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Sat Sep 22 04:03:44 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 21 Sep 2007 19:03:44 -0700 Subject: [Python-3000] decorators for variable assignments? In-Reply-To: <2EE33BBD-C301-4CBF-BCCB-3BBE9204EF56@nicko.org> References: <2EE33BBD-C301-4CBF-BCCB-3BBE9204EF56@nicko.org> Message-ID: Can we stop this already? The idea is dead. No need to drag it through the mud around town for an extended period of time. On 9/21/07, Nicko van Someren wrote: > On 21 Sep 2007, at 22:51, Terry Reedy wrote: > > > | @validate_proxy > > | proxy = "http://user:passwd at host:port/" > > | > > | be a syntactical sugar for: > > | proxy = validate_proxy("http://user:passwd at host:port/") > > > > Sorry, to me, this is syntactical pepper -- or worse ;-) > > I'm thinking it tends towards "syntactic hákarl" :-) > > Nicko > > [*] http://en.wikipedia.org/wiki/Hakarl > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Sat Sep 22 07:48:40 2007 From: martin at v.loewis.de (martin at v.loewis.de) Date: Sat, 22 Sep 2007 07:48:40 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp>
<32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> Message-ID: <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> Quoting Jim Jewett: > On 9/21/07, Paul Moore wrote: >> On 21/09/2007, Jim Jewett wrote: >> > (Outside ASCII), if you treat sys.argv as text, that is probably >> > impossible without filesystem support. Before python even sees the >> > data, the terminal itself is allowed to change between canonical >> > equivalents, which have different binary representations. > >> Please note - this statement is Unix specific. The situation on >> Windows is entirely different (the fact that the CRT on Windows >> emulates some aspects of the Unix semantics is not relevant here - you >> need to understand the underlying OS model). > > No; it is a consequence of unicode. The command shell (or other > program launcher) have the same freedom. I'm not quite sure what you are talking about here (what "same" freedom?), but Paul is right: your statement *is* Unix-specific, and the situation on Windows *is* different. argc/argv does not exist on Windows (that you seem to see it anyway is an illusion), and if it did exist, it would be characters, not bytes. "Canonical equivalents" is not a property of bytes, but of Unicode characters (code points specifically). Also, I'm not quite sure why you think the file system has anything to do with sys.argv (unless your understanding of what a "filesystem" is differs from mine).
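The canonical-equivalence point in this exchange can be made concrete with a minimal sketch. It is not part of the original message; it assumes the character under discussion is U+00C0 and uses the standard unicodedata module. The two spellings below are the same text after normalization, yet they are different code point sequences and therefore encode to different bytes.

```python
import unicodedata

composed = "\u00c0"      # LATIN CAPITAL LETTER A WITH GRAVE, one code point
decomposed = "A\u0300"   # LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT

# The raw strings differ, but normalizing to NFC makes them compare equal.
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Equivalence is defined on code points; the UTF-8 bytes are not the same.
composed_bytes = composed.encode("utf-8")      # b'\xc3\x80'
decomposed_bytes = decomposed.encode("utf-8")  # b'A\xcc\x80'
```

A byte-oriented filesystem would treat these as two distinct names, which appears to be the failure mode the thread is arguing about.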
Regards, Martin From qrczak at knm.org.pl Sat Sep 22 10:18:34 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 22 Sep 2007 10:18:34 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <87tzpx7hhj.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB0DC0.3050906@canterbury.ac.nz> <87ps0kmw3e.fsf@uwakimon.sk.tsukuba.ac.jp> <46EB6EA1.5020104@v.loewis.de> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> Message-ID: <1190449114.30559.11.camel@qrnik> On Fri, 21-09-2007 at 10:00 -0400, Jim Jewett wrote: > Is it reasonable to expose sys.argv.buffer? > (Since this would be bytes rather than text, I assume this would be a > single array, rather than a list of already separated arguments.) On Unix the arguments are already separated on the OS level. It's the shell which usually separates them if they were previously written with spaces between (and understands quotes and other things). The execve() system call obtains them separated, and the program receives them separated. Each Unix argument is a null-terminated array of bytes, i.e. only 0 bytes are disallowed, and the OS does not mangle the contents. Of course people typically interpret these bytes as characters in a guessed encoding, and the encoding is always a superset of ASCII. On Windows the arguments are not separated; the whole command line is a single string with spaces and possible quotes left for the program to possibly interpret as separate arguments (unless something has changed in the last 10 years). I believe it's an array of 16-bit code units, typically meant to be interpreted as UTF-16, but without checking that it's a well-formed UTF-16 sequence. I suppose that any 16-bit word except 0 is allowed, but I'm not sure.
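For readers who want the per-argument byte view described above, here is a hedged sketch. It relies on os.fsencode, which was added to the standard library in Python 3.2, well after this thread, so it is not what the participants had available. The helper name argv_as_bytes is hypothetical.

```python
import os
import sys


def argv_as_bytes(argv=None):
    """Return each command-line argument as a bytes object.

    os.fsencode applies the filesystem encoding; on POSIX it uses the
    surrogateescape error handler, so undecodable argument bytes that
    Python mapped to lone surrogates are restored unchanged.
    """
    if argv is None:
        argv = sys.argv
    return [os.fsencode(arg) for arg in argv]


# Explicit list used here so the example does not depend on how it is run.
raw_args = argv_as_bytes(["prog", "data.bin"])
```

The result stays a list of separate arguments, matching the Unix model in which execve() already receives them separated.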
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From p.f.moore at gmail.com Sat Sep 22 14:05:30 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 22 Sep 2007 13:05:30 +0100 Subject: [Python-3000] decorators for variable assignments? In-Reply-To: References: <2EE33BBD-C301-4CBF-BCCB-3BBE9204EF56@nicko.org> Message-ID: <79990c6b0709220505k1c123289pb1471dc03f584cc8@mail.gmail.com> On 22/09/2007, Guido van Rossum wrote: > Can we stop this already? The idea is dead. No need to drag it through > the mud around town for an extended period of time. It's not dead, it's just pining for the fjords. Sorry, couldn't resist :-) Paul. From jimjjewett at gmail.com Sat Sep 22 21:11:34 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Sat, 22 Sep 2007 15:11:34 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> References: <1189700532.22693.40.camel@qrnik> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> Message-ID: On 9/22/07, martin at v.loewis.de wrote: > Zitat von Jim Jewett : > > > On 9/21/07, Paul Moore wrote: > >> On 21/09/2007, Jim Jewett wrote: [The original context, expressed with some detail by Michael Urman in http://mail.python.org/pipermail/python-3000/2007-September/010621.html was that it must be possible to treat command line arguments as filenames.] > >> > (Outside ASCII), if you treat sys.argv as text, that is probably > >> > impossible without filesystem support. Before python even sees the > >> > data, the terminal itself is allowed to change between canonical > >> > equivalents, which have different binary representations. > > No; it is a consequence of unicode. 
The command shell (or other > > program launcher) have the same freedom. > I'm not quite sure what you are talking about here (what "same" > freedom?), The same freedom to represent À as either U+00C0 or > argc/argv does not exist on Windows (that you seem to see it > anyway is an illusion), and if it did exist, it would be characters, > not bytes. "Canonical equivalents" is not a property of bytes, > but of Unicode characters (code points specifically). > Also, I'm not quite sure why you think the file system has > to do anything with sys.argv (unless your understanding of > what a "filesystem" is differs from mine). The filesystem is unrelated to sys.argv, except for the need to pass filenames through argv. If the filesystem is using bytes rather than characters, then sys.argv must offer the same option, or else certain scripts will (under some rare circumstances) fail. -jJ From martin at v.loewis.de Sat Sep 22 21:27:58 2007 From: martin at v.loewis.de ("Martin v. Löwis") Date: Sat, 22 Sep 2007 21:27:58 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: References: <1189700532.22693.40.camel@qrnik> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> Message-ID: <46F56CBE.5010702@v.loewis.de> > The filesystem is unrelated to sys.argv, except for the need to pass > filenames through argv. If the filesystem is using bytes rather than > characters, then sys.argv must offer the same option, or else certain > scripts will (under some rare circumstances) fail. The same holds for file names on Windows - they aren't byte strings, either.
Regards, Martin From charleshixsn at earthlink.net Sun Sep 23 19:24:24 2007 From: charleshixsn at earthlink.net (Charles D Hixson) Date: Sun, 23 Sep 2007 10:24:24 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: References: <-7804278669952876495@unknownmsgid> Message-ID: <46F6A148.6070107@earthlink.net> Guido van Rossum wrote: > On 9/19/07, Bill Janssen wrote: > >> This really isn't a UTF-8 problem. It is the problem with file opens >> defaulting to "text" mode instead of "binary" mode rearing its ugly >> head again. >> > > You can repeat that until you're blue in the face but it's not going > to change. Way more programs (especially simple ones) deal with txet > than with binary data. > > OTOH, almost all of that text is ASCII. Even if the system mode is set to utf-8, ascii is still ascii. Still, this won't affect me, much, as I rarely send anything complex via pipes. (I know, I should. It's more secure. But the fact is, I don't. I use files.) But this is the kind of thing that could make dealing with, say, xpm files a real hassle. (Probably won't, as ascii is still ascii, but it will introduce corner cases.) A lot of the time what I'm really dealing with is bytes rather than characters. I think of them as characters, and try to choose values that display nicely as characters, because that's the way that's been convenient for decades. But they ARE bytes, sometimes signed bytes. And this is going to mean that there are lots of cases where they don't map nicely to something that's trying to understand them as unicode. So there needs to be an easy and obvious way to deal with files whose records are arrays of byte valued data...that is commonly manipulated by an editor using ascii-8. 
From martin at v.loewis.de Sun Sep 23 19:36:53 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 23 Sep 2007 19:36:53 +0200 Subject: [Python-3000] New io system and binary data In-Reply-To: <46F6A148.6070107@earthlink.net> References: <-7804278669952876495@unknownmsgid> <46F6A148.6070107@earthlink.net> Message-ID: <46F6A435.10203@v.loewis.de> > So there needs to be an easy and obvious way to deal with files whose > records are arrays of byte valued data...that is commonly manipulated by > an editor using ascii-8. Did you follow the thread at all? There is an easy and obvious way to deal with such files. Regards, Martin From charleshixsn at earthlink.net Sun Sep 23 20:09:25 2007 From: charleshixsn at earthlink.net (Charles D Hixson) Date: Sun, 23 Sep 2007 11:09:25 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> Message-ID: <46F6ABD5.7010103@earthlink.net> Brett Cannon wrote: > On 9/19/07, skip at pobox.com wrote: > >> Guido> You can repeat that until you're blue in the face but it's not >> Guido> going to change. Way more programs (especially simple ones) deal >> Guido> with txet than with binary data. >> >> For us Unix-heads the notion that a file is anything other than a stream of >> bytes is rather foreign. I understand that to a large degree if you made >> the world right for us the tail would be wagging the dog. >> > > I think the key thing here is that Guido said "especially simple ones" > and the examples people are talking about are not overly simple (e.g, > gzip, ImageMagik, etc.). That would suggest that if you want the raw > bytes from stdin or write out to stdout that accessing the 'buffer' > attribute you probably know what you are doing and thus accessing a > 'buffer' attribute is probably not difficult for you. 
=) > > -Brett > The problem here seems to be that this isn't currently well documented. I've got no objection to using the buffer attribute...but I've searched the documentation and haven't found any references to it that don't merely refer back to a PEP. There's one reference about "the new buffer interface", but no further details. There's a comment in the tutorial that says to see the library reference for more information...but there doesn't appear to be anything in the library reference to justify that comment. Etc. P.S.: If opening files on Linux is now to be semantically meaningful, then the documentation on that section also needs to change. Currently it appears to mean that it's a meaningless specification that will be ignored unless you happen to be using the MSWindows platform. From martin at v.loewis.de Sun Sep 23 20:48:20 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 23 Sep 2007 20:48:20 +0200 Subject: [Python-3000] New io system and binary data In-Reply-To: <46F6ABD5.7010103@earthlink.net> References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> <46F6ABD5.7010103@earthlink.net> Message-ID: <46F6B4F4.2060307@v.loewis.de> > The problem here seems to be that this isn't currently well documented. > I've got no objection to using the buffer attribute... Ok, then it seems you missed the obvious way: Open the file in binary mode ('rb' or 'wb') if you want to read or write bytes. It has always been that way in Python; the only change now is that it matters on systems other than Windows. 
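The binary-mode advice above can be sketched as follows. This is only an illustration, not code from the thread; the file name records.bin and the sample byte values are made up for the example.

```python
import os
import tempfile

# Arbitrary byte values; 128 and 255 are not valid ASCII text.
payload = bytes([0, 65, 128, 255])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")

    # 'wb' and 'rb' bypass text decoding and newline translation entirely.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        data = f.read()

# In Python 3, iterating over a bytes object yields small integers.
values = list(data)
```

For the standard streams, the corresponding byte-level objects are reached through the 'buffer' attribute mentioned earlier in the thread, for example sys.stdin.buffer.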
Regards, Martin From skip.montanaro at gmail.com Sun Sep 23 23:21:54 2007 From: skip.montanaro at gmail.com (Skip Montanaro) Date: Sun, 23 Sep 2007 16:21:54 -0500 Subject: [Python-3000] New io system and binary data In-Reply-To: <46F6ABD5.7010103@earthlink.net> References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> <46F6ABD5.7010103@earthlink.net> Message-ID: <60bb7ceb0709231421v2adaa658m1999604047db527b@mail.gmail.com> > P.S.: If opening files on Linux is now to be semantically meaningful, > then the documentation on that section also needs to change. Currently > it appears to mean that it's a meaningless specification that will be > ignored unless you happen to be using the MSWindows platform. I just checked in a change to the documentation for the builtin open function. Please have a look at Doc/library/functions.rst and let me know if you think more needs to be done. Also, if there are other places in the documentation where it seems to imply that the distinction between text and binary modes is meaningless on Unix systems, drop me a note and I'll have a look. Skip From skip at pobox.com Sun Sep 23 23:07:02 2007 From: skip at pobox.com (skip at pobox.com) Date: Sun, 23 Sep 2007 16:07:02 -0500 Subject: [Python-3000] More uniform treatment of files' newlines attribute? Message-ID: <18166.54646.466616.838995@montanaro.dyndns.org> While editing the documentation of the builtin open function, I noticed that the newlines attributes can take on three different value types: None, strings or tuples of strings. It seems to me it would be better if was always a set containing the newline values seen so far. 
There's no testing necessary if you need to do something with the newlines you've seen, you just loop over them:

    for nl in f.newlines:
        print("%r" % nl)

With the current mixed types metaphor you have to do something like this:

    if f.newlines is not None:
        if type(f.newlines) is tuple:
            for nl in f.newlines:
                print("%r" % nl)
        else:
            print("%r" % f.newlines)

This, of course, assumes the file has been opened in text mode. If you have a binary mode file you also have to call hasattr(f, "newlines"). Presumably in most cases you'll know the file's mode without needing to check, but maybe binary files should also have a newlines attribute which is always the empty set. Skip From charleshixsn at earthlink.net Tue Sep 25 04:32:26 2007 From: charleshixsn at earthlink.net (Charles D Hixson) Date: Mon, 24 Sep 2007 19:32:26 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: <60bb7ceb0709231421v2adaa658m1999604047db527b@mail.gmail.com> References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> <46F6ABD5.7010103@earthlink.net> <60bb7ceb0709231421v2adaa658m1999604047db527b@mail.gmail.com> Message-ID: <46F8733A.2020908@earthlink.net> Skip Montanaro wrote: >> P.S.: If opening files on Linux is now to be semantically meaningful, >> then the documentation on that section also needs to change. Currently >> it appears to mean that it's a meaningless specification that will be >> ignored unless you happen to be using the MSWindows platform. >> > > I just checked in a change to the documentation for the builtin open function. > Please have a look at Doc/library/functions.rst and let me know if you > think more needs to be done. Also, if there are other places in the > documentation > where it seems to imply that the distinction between text and binary modes is > meaningless on Unix systems, drop me a note and I'll have a look. > > Skip > > Yes, that says what I feel it should say.
(Well, I looked it up at http://docs.python.org/dev/3.0/library/functions.html?highlight=builtin ). There's another place in the tutorial section http://docs.python.org/dev/3.0/tutorial/inputoutput.html?highlight=open and search for "On Windows and the Macintosh, 'b' appended to the mode opens the file in binary mode," From janssen at parc.com Tue Sep 25 05:42:16 2007 From: janssen at parc.com (Bill Janssen) Date: Mon, 24 Sep 2007 20:42:16 PDT Subject: [Python-3000] New io system and binary data In-Reply-To: <46F8733A.2020908@earthlink.net> References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> <46F6ABD5.7010103@earthlink.net> <60bb7ceb0709231421v2adaa658m1999604047db527b@mail.gmail.com> <46F8733A.2020908@earthlink.net> Message-ID: <07Sep24.204222pdt."57996"@synergy1.parc.xerox.com> > Also, if there are other places in the > > documentation > > where it seems to imply that the distinction between text and binary modes is > > meaningless on Unix systems, drop me a note and I'll have a look. That's certainly the prescribed behavior for the C stdio streams on POSIX-compliant systems. I think a lot of the original design of the Python I/O system was based on that C stdio system, including names like stdin, stdout, and stderr. Now that we've moved away from the C stdio model, and the distinction between text and binary streams is meaningful even on POSIX systems, perhaps we should also change those names to reflect that difference from C. Given that Py3K is a once-in-a-decade chance to break backwards compatibility, and all. Perhaps something like sys.io.input, sys.io.output, sys.io.err, or something similar. 
Bill From jyasskin at gmail.com Tue Sep 25 08:09:54 2007 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Mon, 24 Sep 2007 23:09:54 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> Message-ID: <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> On 9/20/07, Adam Hupp wrote: > On 9/20/07, Jeffrey Yasskin wrote: > > > > Thanks for the help! This brings up a policy question: For patches > > like the one I've attached here, do we want to start submitting them > > now, or build up a mondo patch to fix them all at once? > > My changes are here: > > http://bugs.python.org/issue1184 > > With that patch there are only two issues remaining (6 test failures). I've finally gotten around to tracking down the ParseTuple issue, which turned out to fix all 6 remaining tests, and posted the patch to the same issue. Thanks for the help! Guido, the patch isn't quite in a form I'd want to commit, but the tests pass. What do you think? -- Namasté, Jeffrey Yasskin From p.f.moore at gmail.com Tue Sep 25 09:39:24 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 25 Sep 2007 08:39:24 +0100 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: Message-ID: <79990c6b0709250039q3cf5b6a5j3a37797b84fe43d3@mail.gmail.com> On 22/09/2007, Travis Oliphant wrote: > If it is decided to make bytes immutable (which sounds good to me), > then I want to add my voice to those that clamor for an additional > mutable object capable of allocating chunks of memory.
> > This object should have a C-API and have it's structure exposed to > extension module writers (thus array.array does not fit the bill -- but > might be a prototype if some of it is moved over to the Objects > directory and given an API). Can you describe in a little more detail what you mean by "should have a C-API"? I don't often work at the C level these days, so I may be missing something obvious. The array module is built in, so it's written in C - what needs to be exposed to qualify as a "C API"? And why does the code need to move location to qualify? (In case it's not clear, I'm thinking of having a look, and seeing if I can help implement what you are after. No promises, given the amount of free time I have, but with some hints I'll see how far I can get!) Paul. From mark at qtrac.eu Tue Sep 25 10:58:03 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Tue, 25 Sep 2007 09:58:03 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <66d0a6e10709151853w37b949a8i6b4ed2bcb709c064@mail.gmail.com> Message-ID: <200709250958.03993.mark@qtrac.eu> On 2007-09-16, Arvind Singh wrote: > > How do you get from "some keys can't be ordered" to "it doesn't make > > sense for Python to have sorteddict or sortedset"? If you want to use > > keys that can't be ordered, then feel free to continue to use dict. > > For situations in which ordering is important, that language should > > support that. When did this become an all or nothing proposition? > > There's plenty of space for both dict and sorteddict. > > Sorry for premature conclusions. All I wanted to do was remind the > potential problems with any "generic" implementation. > > And I did say, when ordering is important, we are left with two choices: > 1) Sort explicitly (whenever required) and be prepared to handle exceptions > raised during sort operation. > 2) Have a implicitly "sorted" implementation and handle exceptions at every > insertion. 
> > I, personally, tend to prefer the former solution. Later case is useful > when we have large objects and we do large number of insertions, in which > case, per insertion exception handling would be inefficient. Former case, > in turn, can be slightly confusing and a bit to debug. I can understand your personal preference for dict, although mine is for sorteddict---but IMO Python should provide both since both are legitimate in appropriate contexts. To this end I've put a posting on comp.lang.python with subject: sorteddict PEP proposal [started off as orderedict] If there is a positive response I will submit it to the PEP editors. If there is not, I will just hope that someone else will pick up the idea, even if in another form or with a different API, because I'd really like to see some kind of sorted dictionary in Python's standard library. (I also think there's a similar case for a sorted set.) -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From weilawei at gmail.com Tue Sep 25 15:46:01 2007 From: weilawei at gmail.com (Rob Crowther) Date: Tue, 25 Sep 2007 09:46:01 -0400 Subject: [Python-3000] Extension: mpf for GNU MP floating point Message-ID: <20070925094601.c151245c.weilawei@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I've uploaded the latest code to http://umass.glexia.net/mpf.tar.bz2 It's been cleaned up, implements a little bit of the abstract number interface, many very repetitive function declarations were turned into macros making it far easier to maintain, and it now has a printable representation like you'd expect from a float. At this point, I'm able to use it as a stripped down drop in replacement for Decimal. It's also much, much faster. One question I was asked in IRC was if it was possible to change the precision. Currently, that's only implemented during initialization of an instance, by passing the prec keyword. It defaults to 128 bits, what looks to me to be about double the precision of a builtin float. 
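As a rough check of the "about double" estimate, the following sketch compares the quoted 128-bit default with the significand width of a builtin float. It assumes the mpf precision counts significand bits and that the platform float is an IEEE 754 double; neither assumption is confirmed by the message.

```python
import math
import sys

float_bits = sys.float_info.mant_dig  # significand bits of a builtin float
mpf_bits = 128                        # default precision quoted in the message

# How many times wider the mpf significand is than the float significand.
ratio = mpf_bits / float_bits

# Approximate number of decimal digits each significand can represent.
float_digits = float_bits * math.log10(2)
mpf_digits = mpf_bits * math.log10(2)
```

Under those assumptions the ratio comes out somewhat above two, so "about double" is a reasonable, slightly conservative description.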
Included in this is a copy of my git repository since I don't have it online. I'm going to be away for a while, and someone else may find it useful if they want to hack on it. (I have occasionally been known to screw up =P) Well, that's all for today's update. Hopefully, more to come soon. Rob -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFG+REZqR5p8HaX4oURAnJMAKCDxjO2YUnNrJVClujA0l8+wKSLkACeOc5F 097rKqJO6DoaLShfpA3oPsU= =lB+r -----END PGP SIGNATURE----- From uche at ogbuji.net Tue Sep 25 15:39:24 2007 From: uche at ogbuji.net (Uche Ogbuji) Date: Tue, 25 Sep 2007 07:39:24 -0600 Subject: [Python-3000] New io system and binary data In-Reply-To: <07Sep24.204222pdt.57996@synergy1.parc.xerox.com> References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> <46F6ABD5.7010103@earthlink.net> <60bb7ceb0709231421v2adaa658m1999604047db527b@mail.gmail.com> <46F8733A.2020908@earthlink.net> <07Sep24.204222pdt.57996@synergy1.parc.xerox.com> Message-ID: <46F90F8C.6000301@ogbuji.net> Bill Janssen wrote: > That's certainly the prescribed behavior for the C stdio streams on > POSIX-compliant systems. I think a lot of the original design of the > Python I/O system was based on that C stdio system, including names > like stdin, stdout, and stderr. > > Now that we've moved away from the C stdio model, and the distinction > between text and binary streams is meaningful even on POSIX systems, > perhaps we should also change those names to reflect that difference > from C. Given that Py3K is a once-in-a-decade chance to break > backwards compatibility, and all. Perhaps something like > sys.io.input, sys.io.output, sys.io.err, or something similar. > +1, except I'd say "sys.io.error"for the latter. 
-- Uche Ogbuji http://uche.ogbuji.net Founding Partner, Zepheira http://zepheira.com Linked-in profile: http://www.linkedin.com/in/ucheogbuji Articles: http://uche.ogbuji.net/tech/publications/ From facundobatista at gmail.com Tue Sep 25 18:06:40 2007 From: facundobatista at gmail.com (Facundo Batista) Date: Tue, 25 Sep 2007 13:06:40 -0300 Subject: [Python-3000] Extension: mpf for GNU MP floating point In-Reply-To: <20070925094601.c151245c.weilawei@gmail.com> References: <20070925094601.c151245c.weilawei@gmail.com> Message-ID: 2007/9/25, Rob Crowther : > a float. At this point, I'm able to use it as a stripped down drop in > replacement for Decimal. It's also much, much faster. Didn't understand this phrase. You're able to use it, after stripping it down, as a replacement of Decimal? Or you're able to use it as a replacement of a stripped down Decimal? For the record: I don't have the "not invented here" syndrome. If you find a replacement to Decimal that is faster than actual, it's great! Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ From guido at python.org Tue Sep 25 19:10:31 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 25 Sep 2007 10:10:31 -0700 Subject: [Python-3000] New io system and binary data In-Reply-To: <46F90F8C.6000301@ogbuji.net> References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> <46F6ABD5.7010103@earthlink.net> <60bb7ceb0709231421v2adaa658m1999604047db527b@mail.gmail.com> <46F8733A.2020908@earthlink.net> <07Sep24.204222pdt.57996@synergy1.parc.xerox.com> <46F90F8C.6000301@ogbuji.net> Message-ID: On 9/25/07, Uche Ogbuji wrote: > Bill Janssen wrote: > > That's certainly the prescribed behavior for the C stdio streams on > > POSIX-compliant systems. I think a lot of the original design of the > > Python I/O system was based on that C stdio system, including names > > like stdin, stdout, and stderr. 
> > > > Now that we've moved away from the C stdio model, and the distinction > > between text and binary streams is meaningful even on POSIX systems, > > perhaps we should also change those names to reflect that difference > > from C. Given that Py3K is a once-in-a-decade chance to break > > backwards compatibility, and all. Perhaps something like > > sys.io.input, sys.io.output, sys.io.err, or something similar. > > > > +1, except I'd say "sys.io.error"for the latter. -1. I could just say "the deadline for PEPs was last April" or "let's stop bikeshedding", but I'd rather explain why I would have been against this idea even if it was proposed with a proper PEP before the deadline. Maybe it helps stem similar proposals. In general the goal for Python 3000 is to change only things that are genuine language warts (things that would remain stumbling blocks forever if not fixed), and to leave everything else alone as much as possible. I don't think the naming of sys.stdin and friends in Python has ever confused anybody, regardless of whether they were amongst the authors of the C standard library, or had never seen a line of C in their life. There are literally thousands of names in the standard library that could be changed to conform to a better naming scheme, to be more intuitive, to divorce them from an irrelevant legacy, or for whatever other reason. Doing so would just cause endless annoyance for people used to Python 2.x, at no real benefit for future users. Python 3000 is boldly choosing to be backwards compatible, except in cases where a real benefit can be obtained by being incompatible. This is not such a case. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 25 19:18:50 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 25 Sep 2007 10:18:50 -0700 Subject: [Python-3000] ordered dict for p3k collections? 
In-Reply-To: <200709250958.03993.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <66d0a6e10709151853w37b949a8i6b4ed2bcb709c064@mail.gmail.com> <200709250958.03993.mark@qtrac.eu> Message-ID: On 9/25/07, Mark Summerfield wrote: > I can understand your personal preference for dict, although mine is for > sorteddict---but IMO Python should provide both since both are > legitimate in appropriate contexts. Careful what you wish for. One of Python's strengths is that there is *not* a lot of choice in data type implementations (unless you go to relatively obscure places like the collections module or 3rd party extensions). This saves programmers time because they don't have to decide what data type implementation to use in cases where it doesn't matter (and that's the majority of cases). This is not a rationalization after the fact: it has always been a specific design goal in Python to minimize the number of decisions that a programmer must make up front. This goal also minimizes the danger that the *wrong* decision is made, as the standard data types are pretty darn good for almost any purpose. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 25 19:20:04 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 25 Sep 2007 10:20:04 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> Message-ID: On 9/24/07, Jeffrey Yasskin wrote: > On 9/20/07, Adam Hupp wrote: > > On 9/20/07, Jeffrey Yasskin wrote: > > > > > > Thanks for the help! 
This brings up a policy question: For patches > > > like the one I've attached here, do we want to start submitting them > > > now, or build up a mondo patch to fix them all at once? > > > > My changes are here: > > > > http://bugs.python.org/issue1184 > > > > With that patch there are only two issues remaining (6 test failures). > > I've finally gotten around to tracking down the ParseTuple issue, > which turned out to fix all 6 remaining tests, and posted the patch to > the same issue. Thanks for the help! Guido, the patch isn't quite in a > form I'd want to commit, but the tests pass. What do you think? I'll have a good look at this today. Thanks for your efforts everyone! -- --Guido van Rossum (home page: http://www.python.org/~guido/) From weilawei at gmail.com Tue Sep 25 19:30:37 2007 From: weilawei at gmail.com (Rob Crowther) Date: Tue, 25 Sep 2007 13:30:37 -0400 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <66d0a6e10709151853w37b949a8i6b4ed2bcb709c064@mail.gmail.com> <200709250958.03993.mark@qtrac.eu> Message-ID: <20070925133037.9405a211.weilawei@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Tue, 25 Sep 2007 10:18:50 -0700 "Guido van Rossum" wrote: > On 9/25/07, Mark Summerfield wrote: > > I can understand your personal preference for dict, although mine is for > > sorteddict---but IMO Python should provide both since both are > > legitimate in appropriate contexts. > > This is not a rationalization after the fact: it has always been a > specific design goal in Python to minimize the number of decisions > that a programmer must make up front. This goal also minimizes the > danger that the *wrong* decision is made, as the standard data types > are pretty darn good for almost any purpose. I ran into the issue of wanting an ordered dict recently. I was rather upset at having to redesign my data structures--at first. 
After reworking them to fit within the confines of an unordered dict, I realized that it actually worked better. This isn't to say there should be no such thing, but it really doesn't need to be a part of the standard library, imo. -1 vote for ordered dicts. Rob -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFG+UW9qR5p8HaX4oURAtYmAKCX4xjNTyC7n2ksV/Jb6+ztrtd43ACglRF2 PGUqWUUviyMoWvg9cAO6otk= =umXa -----END PGP SIGNATURE----- From mark at qtrac.eu Tue Sep 25 19:43:12 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Tue, 25 Sep 2007 18:43:12 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709250958.03993.mark@qtrac.eu> Message-ID: <200709251843.12325.mark@qtrac.eu> On 2007-09-25, Guido van Rossum wrote: > On 9/25/07, Mark Summerfield wrote: > > I can understand your personal preference for dict, although mine is for > > sorteddict---but IMO Python should provide both since both are > > legitimate in appropriate contexts. > > Careful what you wish for. > > One of Python's strengths is that there is *not* a lot of choice in > data type implementations (unless you go to relatively obscure places > like the collections module or 3rd party extensions). This saves > programmers time because they don't have to decide what data type > implementation to use in cases where it doesn't matter (and that's the > majority of cases). > > This is not a rationalization after the fact: it has always been a > specific design goal in Python to minimize the number of decisions > that a programmer must make up front. This goal also minimizes the > danger that the *wrong* decision is made, as the standard data types > are pretty darn good for almost any purpose. My proposal was for the sorteddict to be put in the collections module, not as a builtin. One of the things I particularly like about Python is that the core language is small. 
However, I think that the collections module is rather thin, and as you say, it is "obscure" so won't get in the way of inexperienced or casual users if it is beefed up a bit, yet could be really useful to more demanding users. On comp.lang.python, a respondent called Paul Hankin suggested a somewhat different approach to mine: he proposed a sorteddict with the same API as a dict but with a constructor that is similar to the sorted() function: sorteddict((mapping | sequence | nothing), cmp=None, key=None, reverse=None) He points out that this has a problem with keyword argument dictionaries, but that one solution is sorteddict(dict(**kwargs), ...). From comments other people have made on this list and on comp.lang.python, it may be that Paul Hankin's approach is more popular and better than the one I proposed---the only downside being that he didn't give any hints as to an implementation. I am hoping that Python 2.6 (and 3.0) will have a sorted dictionary of some kind, and I get the impression that it would be welcomed (in the standard library). -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From guido at python.org Tue Sep 25 20:06:04 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 25 Sep 2007 11:06:04 -0700 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <200709251843.12325.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709250958.03993.mark@qtrac.eu> <200709251843.12325.mark@qtrac.eu> Message-ID: On 9/25/07, Mark Summerfield wrote: > My proposal was for the sorteddict to be put in the collections module, > not as a builtin. One of the things I particularly like about Python is > that the core language is small. > > However, I think that the collections module is rather thin, and as you > say, it is "obscure" so won't get in the way of inexperienced or casual > users if it is beefed up a bit, yet could be really useful to more > demanding users. 
> > On comp.lang.python, a respondent called Paul Hankin suggested a > somewhat different approach to mine: he proposed a sorteddict with the > same API as a dict but with a constructor that is similar to the > sorted() function: > > sorteddict((mapping | sequence | nothing), cmp=None, key=None, > reverse=None) > > He points out that this has a problem with keyword argument > dictionaries, but that one solution is sorteddict(dict(**kwargs), ...). Why would this be a problem? There is no requirement that sorteddict() support this feature. > From comments other people have made on this list and on > comp.lang.python, it may be that Paul Hankin's approach is more popular > and better than the one I proposed---the only downside being that he > didn't give any hints as to an implementation. > > I am hoping that Python 2.6 (and 3.0) will have a sorted dictionary of > some kind, and I get the impression that it would be welcomed (in the > standard library). For that to happen, someone has to write a production-quality implementation, release it as a separate 3rd party module for a while, show that it is sufficiently stable and popular to be incorporated in the standard library, and commit to maintaining it for a few years at least. (It doesn't have to be all the same someone.) Hoping and wishing doesn't cause working code to spring into existence. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From mark at qtrac.eu Tue Sep 25 23:01:43 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Tue, 25 Sep 2007 22:01:43 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709251843.12325.mark@qtrac.eu> Message-ID: <200709252201.43689.mark@qtrac.eu> On 2007-09-25, you wrote: > On 9/25/07, Mark Summerfield wrote: > > My proposal was for the sorteddict to be put in the collections module, > > not as a builtin. 
One of the things I particularly like about Python is > > that the core language is small. > > > > However, I think that the collections module is rather thin, and as you > > say, it is "obscure" so won't get in the way of inexperienced or casual > > users if it is beefed up a bit, yet could be really useful to more > > demanding users. > > > > On comp.lang.python, a respondent called Paul Hankin suggested a > > somewhat different approach to mine: he proposed a sorteddict with the > > same API as a dict but with a constructor that is similar to the > > sorted() function: > > > > sorteddict((mapping | sequence | nothing), cmp=None, key=None, > > reverse=None) > > > > He points out that this has a problem with keyword argument > > dictionaries, but that one solution is sorteddict(dict(**kwargs), ...). > > Why would this be a problem? There is no requirement that sorteddict() > support this feature. > > > From comments other people have made on this list and on > > comp.lang.python, it may be that Paul Hankin's approach is more popular > > and better than the one I proposed---the only downside being that he > > didn't give any hints as to an implementation. > > > > I am hoping that Python 2.6 (and 3.0) will have a sorted dictionary of > > some kind, and I get the impression that it would be welcomed (in the > > standard library). > > For that to happen, someone has to write a production-quality > implementation, release it as a separate 3rd party module for a while, > show that it is sufficiently stable and popular to be incorporated in > the standard library, and commit to maintaining it for a few years at > least. (It doesn't have to be all the same someone.) OK, I'm sure I or Paul Hankin or others will put up at least one version on PyPI and maybe get it in for Python 4:-) > Hoping and wishing doesn't cause working code to spring into existence. As a matter of fact it does... by the time I read this Paul Hankin had written a version based on his idea... 
and so have I. Neither is likely to be fast but they both provide the API described above in pure Python. -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From guido at python.org Tue Sep 25 23:26:40 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 25 Sep 2007 14:26:40 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> Message-ID: OK, Jeffrey's and Adam's patches were helpful; it looks like the damage done by making bytes immutable is pretty limited: plenty of modules are affected, but the changes are straightforward and localized. So now I have an idea that goes a little farther. It relates to Talin's response (second message in this thread if you're using gmail) and acknowledges that there are some good use cases for mutable bytes as well (as I've always maintained). How about we take the existing PyString implementation (Python 2's str, currently still present as str8 in py3k), remove the locale and unicode mixing support, and call it bytes. Then the PyBytes type can be renamed to buffer. It is well-documented that I don't care much about the existing buffer() builtin; it can be renamed to memview for all I care (that would be a more descriptive name anyway). This would provide a much better transitional path for 2.x code manipulating raw bytes using str instances: just change "..." into b"..." and str into bytes. (Of course, 2.x code that is confused about bytes vs. characters will fail hard in 3.0 as soon as a bytes and a str instance meet -- this is already the case in the current 3.0 code base and will remain unchanged.) 
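
[Illustrative sketch, not part of the original message.] The "fail hard" behaviour described in the paragraph above can be seen in any released Python 3 interpreter; that interpreter (rather than the 2007 py3k branch under discussion) is an assumption here, and the helper name concat_checked is hypothetical.

```python
def concat_checked(raw, text):
    """Return 'ok' if bytes + str worked, else the name of the exception raised."""
    try:
        raw + text  # bytes + str: there is no implicit conversion
    except TypeError as exc:
        return type(exc).__name__
    return "ok"


# 2.x-style byte handling ported by writing b"..." keeps working on bytes alone.
header = b"HTTP/1.1 " + b"200 OK"

# Where bytes and text meet, the conversion has to be spelled out.
decoded = header.decode("ascii") + " (text)"

# Mixing the two types without converting is rejected.
mixed = concat_checked(header, " (text)")
```

The explicit decode() call is the only place where an encoding is chosen, which is the point of keeping the two types apart.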
It would mean more fixes beyond what Jeffrey and Adam did, since iterating over a bytes instance would return a bytes instance of length 1 instead of a small int, and the bytes constructor would change accordingly (no more initializing a bytes object from a list of ints). The (new) buffer object would also have to change to be more compatible with the (new) bytes object -- bytes<-->buffer conversions should be 1-1, and iterating over a buffer instance would also have to return a length-1 buffer (or bytes???) instance. Thoughts? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Sep 25 23:32:05 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 25 Sep 2007 14:32:05 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> Message-ID: On 9/25/07, Guido van Rossum wrote: > OK, Jeffrey's and Adam's patches were helpful; it looks like the > damage done by making bytes immutable is pretty limited: plenty of > modules are affected, but the changes are straightforward and > localized. > > So now I have an idea that goes a little farther. It relates to > Talin's response (second message in this thread if you're using gmail) > and acknowledges that there are some good use cases for mutable bytes > as well (as I've always maintained). > > How about we take the existing PyString implementation (Python 2's > str, currently still present as str8 in py3k), remove the locale and > unicode mixing support, and call it bytes. Then the PyBytes type can > be renamed to buffer. 
It is well-documented that I don't care much > about the existing buffer() builtin; it can be renamed to memview for > all I care (that would be a more descriptive name anyway). D'oh. Travis already implemented a memoryview object that has most of the required properties. So let's use that instead of memview or the old buffer object. > This would provide a much better transitional path for 2.x code > manipulating raw bytes using str instances: just change "..." into > b"..." and str into bytes. (Of course, 2.x code that is confused about > bytes vs. characters will fail hard in 3.0 as soon as a bytes and a > str instance meet -- this is already the case in the current 3.0 code > base and will remain unchanged.) > > It would mean more fixes beyond what Jeffrey and Adam did, since > iterating over a bytes instance would return a bytes instance of > length 1 instead of a small int, and the bytes constructor would > change accordingly (no more initializing a bytes object from a list of > ints). > > The (new) buffer object would also have to change to be more > compatible with the (new) bytes object -- bytes<-->buffer conversions > should be 1-1, and iterating over a buffer instance would also have to > return a length-1 buffer (or bytes???) instance. > > Thoughts? 
--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com Wed Sep 26 00:14:19 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 25 Sep 2007 18:14:19 -0400
Subject: [Python-3000] Immutable bytes -- looking for volunteer
In-Reply-To:
References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com>
Message-ID:

> How about we take the existing PyString implementation (Python 2's
> str, currently still present as str8 in py3k), remove the locale and
> unicode mixing support, and call it bytes.

Is that just encode/decode? But isn't this one sensible way to store
an encoded str, so that decode (only) would still make sense?

I would have expected to drop text or character-oriented methods,
because they should really be done on the (decoded) unicode version.
Given bytes use in wire protocols, I could also understand saying that
these methods only work on ASCII, and either raise an exception or
return false for other byte values.

text-or-character-oriented methods:

    'capitalize', 'center', 'endswith', 'expandtabs', 'isalnum',
    'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper',
    'ljust', 'lower', 'lstrip', 'rjust', 'rstrip', 'splitlines',
    'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'

> It would mean more fixes beyond what Jeffrey and Adam did, since
> iterating over a bytes instance would return a bytes instance of
> length 1 instead of a small int,

makes sense

> and the bytes constructor would
> change accordingly (no more initializing a bytes object from a list of
> ints).

Why not?
I expect the literal b"ASCII string" to be the most common
constructor, but I don't see the problem with a sequence of ints (or
hex) as an alternative constructor.

> The (new) buffer object would also have to change to be more
> compatible with the (new) bytes object -- bytes<-->buffer conversions
> should be 1-1, and iterating over a buffer instance would also have to
> return a length-1 buffer (or bytes???) instance.

I would return a bytes instance. If you return a 1-char buffer, and
someone does modify that, it isn't clear whether the change should be
reflected in the original source buffer. If someone does want an
in-place filter, they can always use enumerate and slicing.

Can we assume that the two types are unequal, but that you can search
a buffer for a (constant) bytes?

    >>> mybytes = b"some data"
    >>> mybuffer = buffer(mybytes)
    >>> mybuffer == mybytes
    False
    >>> mybuffer.startswith(mybytes) and \
    ...     mybuffer.endswith(mybytes) and \
    ...     len(mybuffer) == len(mybytes)
    True

-jJ

From brett at python.org Wed Sep 26 02:03:29 2007
From: brett at python.org (Brett Cannon)
Date: Tue, 25 Sep 2007 17:03:29 -0700
Subject: [Python-3000] Immutable bytes -- looking for volunteer
In-Reply-To:
References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com>
Message-ID:

On 9/25/07, Guido van Rossum wrote:
> OK, Jeffrey's and Adam's patches were helpful; it looks like the
> damage done by making bytes immutable is pretty limited: plenty of
> modules are affected, but the changes are straightforward and
> localized.
>
> So now I have an idea that goes a little farther.
It relates to
> Talin's response (second message in this thread if you're using gmail)
> and acknowledges that there are some good use cases for mutable bytes
> as well (as I've always maintained).
>
> How about we take the existing PyString implementation (Python 2's
> str, currently still present as str8 in py3k), remove the locale and
> unicode mixing support, and call it bytes. Then the PyBytes type can
> be renamed to buffer. It is well-documented that I don't care much
> about the existing buffer() builtin; it can be renamed to memview for
> all I care (that would be a more descriptive name anyway).
>
> This would provide a much better transitional path for 2.x code
> manipulating raw bytes using str instances: just change "..." into
> b"..." and str into bytes. (Of course, 2.x code that is confused about
> bytes vs. characters will fail hard in 3.0 as soon as a bytes and a
> str instance meet -- this is already the case in the current 3.0 code
> base and will remain unchanged.)
>
> It would mean more fixes beyond what Jeffrey and Adam did, since
> iterating over a bytes instance would return a bytes instance of
> length 1 instead of a small int, and the bytes constructor would
> change accordingly (no more initializing a bytes object from a list of
> ints).
>

+0. While 2to3 would be able to help more, the methods that will be
ripped out will make the transition a lot less easy. Plus you can have
immutable bytes in a way by passing the current bytes to tuple.

> The (new) buffer object would also have to change to be more
> compatible with the (new) bytes object -- bytes<-->buffer conversions
> should be 1-1, and iterating over a buffer instance would also have to
> return a length-1 buffer (or bytes???) instance.

Return a byte. If you want a mutable length-1 thing you should have
to do a length 1 slice. Otherwise it's an index operation and you want
what is stored at the index, which is an immutable byte.
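
[Illustrative sketch, not part of the original message.] The two points above (tuple as a stand-in for immutable bytes, and indexing versus a length-1 slice) can be tried out with the types in released Python 3. This assumes that bytearray plays the role of the mutable type discussed here and that memoryview is the shared-view type; neither name was settled when this message was written. Note that in released Python 3 indexing yields a small int rather than a length-1 bytes object.

```python
data = bytearray(b"spam")

# Passing bytes to tuple() gives one Python int per byte, not a compact buffer.
as_tuple = tuple(b"x" * 100)

# Indexing yields what is stored at the index (an int in released Python 3).
item = data[0]

# A length-1 slice yields a new, independent bytearray.
piece = data[0:1]
piece[0] = ord("S")            # mutating the slice leaves `data` untouched
copy_unchanged = bytes(data)

# A memoryview slice, by contrast, is a shared view of the same memory.
view = memoryview(data)[0:1]
view[0] = ord("S")             # writes through to `data`
after_view = bytes(data)
view.release()
```

The difference between `piece` and `view` is the copy-versus-shared-view distinction: only the memoryview slice changes the original object.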
-Brett From guido at python.org Wed Sep 26 02:22:39 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 25 Sep 2007 17:22:39 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> Message-ID: On 9/25/07, Brett Cannon wrote: > On 9/25/07, Guido van Rossum wrote: > > OK, Jeffrey's and Adam's patches were helpful; it looks like the > > damage done by making bytes immutable is pretty limited: plenty of > > modules are affected, but the changes are straightforward and > > localized. > > > > So now I have an idea that goes a little farther. It relates to > > Talin's response (second message in this thread if you're using gmail) > > and acknowledges that there are some good use cases for mutable bytes > > as well (as I've always maintained). > > > > How about we take the existing PyString implementation (Python 2's > > str, currently still present as str8 in py3k), remove the locale and > > unicode mixing support, and call it bytes. Then the PyBytes type can > > be renamed to buffer. It is well-documented that I don't care much > > about the existing buffer() builtin; it can be renamed to memview for > > all I care (that would be a more descriptive name anyway). > > > > This would provide a much better transitional path for 2.x code > > manipulating raw bytes using str instances: just change "..." into > > b"..." and str into bytes. (Of course, 2.x code that is confused about > > bytes vs. characters will fail hard in 3.0 as soon as a bytes and a > > str instance meet -- this is already the case in the current 3.0 code > > base and will remain unchanged.) 
> > > > It would mean more fixes beyond what Jeffrey and Adam did, since > > iterating over a bytes instance would return a bytes instance of > > length 1 instead of a small int, and the bytes constructor would > > change accordingly (no more initializing a bytes object from a list of > > ints). > > > > +0. While 2to3 would be able to help more, the methods that will be > ripped out will make the ease in transition from this a lot less. Compared to what? The methods to be ripped out are already not available on bytes objects. > Plus you can have immutable bytes in a way by passing the current > bytes to tuple. At what cost? tuple(b"x"*100) is a tuple of length 100. > > The (new) buffer object would also have to change to be more > > compatible with the (new) bytes object -- bytes<-->buffer conversions > > should be 1-1, and iterating over a buffer instance would also have to > > return a length-1 buffer (or bytes???) instance. > > Return a byte. If you want a mutable length-1 thing you should have > to do a length 1 slice. Otherwise its an index operation and you want > what is stored at the index, which is an immutable byte. OK. Though it's questionable even whether a slice of a mutable bytes object should return a mutable bytes object (as it is not a shared view). But as that is what PyBytes currently do it is certainly the easiest... 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg.ewing at canterbury.ac.nz Wed Sep 26 02:43:05 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 26 Sep 2007 12:43:05 +1200 Subject: [Python-3000] New io system and binary data In-Reply-To: <07Sep24.204222pdt.57996@synergy1.parc.xerox.com> References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> <46F6ABD5.7010103@earthlink.net> <60bb7ceb0709231421v2adaa658m1999604047db527b@mail.gmail.com> <46F8733A.2020908@earthlink.net> <07Sep24.204222pdt.57996@synergy1.parc.xerox.com> Message-ID: <46F9AB19.7080404@canterbury.ac.nz> Bill Janssen wrote: > Now that we've moved away from the C stdio model, and the distinction > between text and binary streams is meaningful even on POSIX systems, > perhaps we should also change those names to reflect that difference > from C. I don't think anything would be gained by changing these well-established and widely-understood names just because of such an obscure and pedantic detail. -- Greg From brett at python.org Wed Sep 26 02:55:47 2007 From: brett at python.org (Brett Cannon) Date: Tue, 25 Sep 2007 17:55:47 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> Message-ID: On 9/25/07, Guido van Rossum wrote: > On 9/25/07, Brett Cannon wrote: > > On 9/25/07, Guido van Rossum wrote: > > > OK, Jeffrey's and Adam's patches were helpful; it looks like the > > > damage done by making bytes immutable is pretty limited: plenty of > > > modules are affected, but the changes are straightforward and > > > localized. > > > > > > So now I have an idea that goes a little farther. 
It relates to > > > Talin's response (second message in this thread if you're using gmail) > > > and acknowledges that there are some good use cases for mutable bytes > > > as well (as I've always maintained). > > > > > > How about we take the existing PyString implementation (Python 2's > > > str, currently still present as str8 in py3k), remove the locale and > > > unicode mixing support, and call it bytes. Then the PyBytes type can > > > be renamed to buffer. It is well-documented that I don't care much > > > about the existing buffer() builtin; it can be renamed to memview for > > > all I care (that would be a more descriptive name anyway). > > > > > > This would provide a much better transitional path for 2.x code > > > manipulating raw bytes using str instances: just change "..." into > > > b"..." and str into bytes. (Of course, 2.x code that is confused about > > > bytes vs. characters will fail hard in 3.0 as soon as a bytes and a > > > str instance meet -- this is already the case in the current 3.0 code > > > base and will remain unchanged.) > > > > > > It would mean more fixes beyond what Jeffrey and Adam did, since > > > iterating over a bytes instance would return a bytes instance of > > > length 1 instead of a small int, and the bytes constructor would > > > change accordingly (no more initializing a bytes object from a list of > > > ints). > > > > > > > +0. While 2to3 would be able to help more, the methods that will be > > ripped out will make the ease in transition from this a lot less. > > Compared to what? The methods to be ripped out are already not > available on bytes objects. > Right, but that doesn't mean we could put others back in or something to help others with their code transitions. > > Plus you can have immutable bytes in a way by passing the current > > bytes to tuple. > > At what cost? tuple(b"x"*100) is a tuple of length 100. > Right, but the question is how often people will need this. 
There is a reason that mutable bytes were chosen in the first place. > > > The (new) buffer object would also have to change to be more > > > compatible with the (new) bytes object -- bytes<-->buffer conversions > > > should be 1-1, and iterating over a buffer instance would also have to > > > return a length-1 buffer (or bytes???) instance. > > > > Return a byte. If you want a mutable length-1 thing you should have > > to do a length 1 slice. Otherwise its an index operation and you want > > what is stored at the index, which is an immutable byte. > > OK. Though it's questionable even whether a slice of a mutable bytes > object should return a mutable bytes object (as it is not a shared > view). But as that is what PyBytes currently do it is certainly the > easiest... -Brett From greg.ewing at canterbury.ac.nz Wed Sep 26 02:57:56 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 26 Sep 2007 12:57:56 +1200 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <79990c6b0709250039q3cf5b6a5j3a37797b84fe43d3@mail.gmail.com> References: <79990c6b0709250039q3cf5b6a5j3a37797b84fe43d3@mail.gmail.com> Message-ID: <46F9AE94.7010703@canterbury.ac.nz> Paul Moore wrote: > The array module is built in, so it's > written in C - what needs to be exposed to qualify as a "C API"? I think he's referring to the fact that there is no public array.h header file provided that lays out the C-level details. In fact, last time I looked I don't think there was any array.h file at all, it was all inside array.c. You can fake it by copying the relevant declarations into your own .h file, but then there's no assurance that you're not relying on implementation details that could change. A published interface would be much more reassuring. With the new buffer interface, probably just providing that would be sufficient, together with a C function for creating an array. The internals could still remain private if desired. 
-- Greg From skip at pobox.com Wed Sep 26 03:11:38 2007 From: skip at pobox.com (skip at pobox.com) Date: Tue, 25 Sep 2007 20:11:38 -0500 Subject: [Python-3000] New io system and binary data In-Reply-To: <46F8733A.2020908@earthlink.net> References: <-7804278669952876495@unknownmsgid> <18161.32698.291402.642086@montanaro.dyndns.org> <46F6ABD5.7010103@earthlink.net> <60bb7ceb0709231421v2adaa658m1999604047db527b@mail.gmail.com> <46F8733A.2020908@earthlink.net> Message-ID: <18169.45514.615900.396756@montanaro.dyndns.org> Charles> There's another place in the tutorial section Charles> http://docs.python.org/dev/3.0/tutorial/inputoutput.html and Charles> search for "On Windows and the Macintosh, 'b' appended to the Charles> mode opens the file in binary mode," I fixed that up as well. I mentioned the automatic encode/decode for text files there as well, though I'm not sure it needs to be mentioned in the tutorial. Skip From greg.ewing at canterbury.ac.nz Wed Sep 26 03:49:03 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 26 Sep 2007 13:49:03 +1200 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <5d44f72f0709181019n1eb7dfe4u81e0d7d5e67b2420@mail.gmail.com> <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> Message-ID: <46F9BA8F.80907@canterbury.ac.nz> Brett Cannon wrote: > Return a byte. If you want a mutable length-1 thing you should have > to do a length 1 slice. Otherwise its an index operation and you want > what is stored at the index, which is an immutable byte. Why shouldn't this argument apply to immutable bytes objects as well? Or should it? 
-- Greg From mike.klaas at gmail.com Wed Sep 26 05:09:06 2007 From: mike.klaas at gmail.com (Mike Klaas) Date: Tue, 25 Sep 2007 20:09:06 -0700 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <200709252201.43689.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709251843.12325.mark@qtrac.eu> <200709252201.43689.mark@qtrac.eu> Message-ID: On 25-Sep-07, at 2:01 PM, Mark Summerfield wrote: > On 2007-09-25, Guido wrote: >> >> For that to happen, someone has to write a production-quality >> implementation, release it as a separate 3rd party module for a >> while, >> show that it is sufficiently stable and popular to be incorporated in >> the standard library, and commit to maintaining it for a few years at >> least. (It doesn't have to be all the same someone.) > > OK, I'm sure I or Paul Hankin or others will put up at least one > version > on PyPI and maybe get it in for Python 4:-) Since this isn't backward-incompatible, it can be added any time: 2.X, 3.X, etc. -Mike From brett at python.org Wed Sep 26 07:31:34 2007 From: brett at python.org (Brett Cannon) Date: Tue, 25 Sep 2007 22:31:34 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <46F9BA8F.80907@canterbury.ac.nz> References: <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> <46F9BA8F.80907@canterbury.ac.nz> Message-ID: On 9/25/07, Greg Ewing wrote: > Brett Cannon wrote: > > Return a byte. If you want a mutable length-1 thing you should have > > to do a length 1 slice. Otherwise its an index operation and you want > > what is stored at the index, which is an immutable byte. > > Why shouldn't this argument apply to immutable bytes objects as > well? Or should it? Never said it shouldn't. 
But I don't view immutable bytes as a container like mutable bytes. -Brett From mark at qtrac.eu Wed Sep 26 09:02:44 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Wed, 26 Sep 2007 08:02:44 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709252201.43689.mark@qtrac.eu> Message-ID: <200709260802.44381.mark@qtrac.eu> On 2007-09-26, Mike Klaas wrote: > On 25-Sep-07, at 2:01 PM, Mark Summerfield wrote: > > On 2007-09-25, Guido wrote: > >> For that to happen, someone has to write a production-quality > >> implementation, release it as a separate 3rd party module for a > >> while, > >> show that it is sufficiently stable and popular to be incorporated in > >> the standard library, and commit to maintaining it for a few years at > >> least. (It doesn't have to be all the same someone.) > > > > OK, I'm sure I or Paul Hankin or others will put up at least one > > version > > on PyPI and maybe get it in for Python 4:-) > > Since this isn't backward-incompatible, it can be added any time: > 2.X, 3.X, etc. > > -Mike Yes of course, but I think GvR was really saying "no", at least not until a year or so has passed, and only then if lots of users ask for it. So I won't be submitting a PEP. I have put a new version (incorporating another implementation idea from Paul Hankin) on PyPI: http://pypi.python.org/pypi/sorteddict It does not have the all round (theoretically) good performance of my original version, but does have a much nicer API than my original idea. -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From skip at pobox.com Wed Sep 26 13:12:27 2007 From: skip at pobox.com (skip at pobox.com) Date: Wed, 26 Sep 2007 06:12:27 -0500 Subject: [Python-3000] ordered dict for p3k collections? 
In-Reply-To: <200709260802.44381.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709252201.43689.mark@qtrac.eu> <200709260802.44381.mark@qtrac.eu> Message-ID: <18170.16027.665491.815991@montanaro.dyndns.org> Mark> I have put a new version (incorporating another implementation Mark> idea from Paul Hankin) on PyPI: Mark> http://pypi.python.org/pypi/sorteddict From that: The main benefit of sorteddicts is that you never have to explicitly sort. Surely there must be something more than that. Wrapping sorted() around a keys() or values() call is a pretty trivial operation. I didn't see that the implementation saved anything. Skip From mark at qtrac.eu Wed Sep 26 13:33:57 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Wed, 26 Sep 2007 12:33:57 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <18170.16027.665491.815991@montanaro.dyndns.org> References: <200709111506.32823.mark@qtrac.eu> <200709260802.44381.mark@qtrac.eu> <18170.16027.665491.815991@montanaro.dyndns.org> Message-ID: <200709261233.57636.mark@qtrac.eu> On 2007-09-26, skip at pobox.com wrote: > Mark> I have put a new version (incorporating another implementation > Mark> idea from Paul Hankin) on PyPI: > > Mark> http://pypi.python.org/pypi/sorteddict > > From that: > > The main benefit of sorteddicts is that you never have to explicitly > sort. > > Surely there must be something more than that. Wrapping sorted() around a > keys() or values() call is a pretty trivial operation. I didn't see that > the implementation saved anything. Assuming you have a good sorteddict implementation (i.e., based on a balanced tree or a skiplist, not the one I've put up which is just showing the API) you can gain significant performance benefits. For example, if you have a large dataset that you need to traverse quite frequently in sorted order, calling sorted() each time will be expensive compared to simply traversing an intrinsically sorted data structure.
When I program in C++/Qt I use QMap (a sorteddict) very often; the STL equivalent is called map. Both the Qt and STL libraries have dict equivalents (QHash and unordered_map), but my impression is that the sorted data structures are used far more frequently than the unsorted versions. If you primarily program in Python, using dict + sorted() is very natural because they are built into the language. But using a sorted data structure and never sorting is a very common practice in other languages. -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From guido at python.org Wed Sep 26 16:25:13 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 26 Sep 2007 07:25:13 -0700 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <200709261233.57636.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709260802.44381.mark@qtrac.eu> <18170.16027.665491.815991@montanaro.dyndns.org> <200709261233.57636.mark@qtrac.eu> Message-ID: On 9/26/07, Mark Summerfield wrote: > On 2007-09-26, skip at pobox.com wrote: > > Mark> I have put a new version (incorporating another implementation > > Mark> idea from Paul Hankin) on PyPI: > > > > Mark> http://pypi.python.org/pypi/sorteddict > > > > From that: > > > > The main benefit of sorteddicts is that you never have to explicitly > > sort. > > > > Surely there must be something more than that. Wrapping sorted() around a > > keys() or values() call is a pretty trivial operation. I didn't see that > > the implementation saved anything. > > Assuming you have a good sorteddict implementation (i.e., based on a > balanced tree or a skiplist, not the one I've put up which is just > showing the API) you can gain significant performance benefits. That depends very much on the use case, and in general I strongly doubt it. 
I haven't looked this up in Knuth, but I believe that in a sorted dict implementation, the best performance you can get for random access and random insertions is O(log N), which is always beat by the O(1) of a hash table. This translates in O(N log N) for inserting N elements into a sorted dict, vs. O(N) in a hash table. Sorted traversal is O(N) for the sorted dict and O(N log N) for the hash table. So in order to gain a "significant performance benefit" you'd have to have one pass of insertions and two traversals with a small number of insertions or deletions in between (otherwise the sorted result from the hash table could just be cached). I don't believe that this pattern is common enough, but I don't know your application. > For example, if you have a large dataset that you need to traverse quite > frequently in sorted order, calling sorted() each time will be expensive > compared to simply traversing an intrinsically sorted data structure. > > When I program in C++/Qt I use QMap (a sorteddict) very often; the STL > equivalent is called map. Both the Qt and STL libraries have dict > equivalents (QHash and unordered_map), but my impression is that the > sorted data structures are used far more frequently than the unsorted > versions. Perhaps out of ignorance? Or perhaps the hash implementations have suboptimal implementations? Or perhaps because no equivalent to sorted() exists? > If you primarily program in Python, using dict + sorted() is very > natural because they are built into the language. But using a sorted > data structure and never sorting is a very common practice in other > languages. Ah, now the real reason you want this so badly is finally clear: simply because you're more familiar with it. :-) Is the number of elements in a typical use case large enough that the performance difference even matters? 
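A minimal sketch of the caching mentioned above, assuming a plain dict whose sorted key list is rebuilt only after the key set changes. The class name CachedSortedDict and its methods are hypothetical, invented for illustration; nothing like it is proposed in the thread.

```python
class CachedSortedDict:
    """Hypothetical helper: a plain dict plus a lazily cached sorted key list."""

    def __init__(self):
        self._data = {}
        self._sorted_keys = None  # None means "cache is stale"

    def __setitem__(self, key, value):
        if key not in self._data:
            # Only a new key changes the ordering; overwriting a value does not.
            self._sorted_keys = None
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]  # ordinary hash lookup

    def __delitem__(self, key):
        del self._data[key]
        self._sorted_keys = None

    def sorted_items(self):
        # Pay for the O(N log N) sort only when the key set has changed.
        if self._sorted_keys is None:
            self._sorted_keys = sorted(self._data)
        return [(k, self._data[k]) for k in self._sorted_keys]


d = CachedSortedDict()
d["pear"] = 3
d["apple"] = 1
first = d.sorted_items()   # sorts once
second = d.sorted_items()  # reuses the cached key order
```

With this arrangement repeated traversals between mutations cost O(N) each, which is the case the argument above says a sorted dict would otherwise have to win on.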
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From mark at qtrac.eu Wed Sep 26 17:27:25 2007 From: mark at qtrac.eu (Mark Summerfield) Date: Wed, 26 Sep 2007 16:27:25 +0100 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709261233.57636.mark@qtrac.eu> Message-ID: <200709261627.25593.mark@qtrac.eu> On 2007-09-26, Guido van Rossum wrote: > On 9/26/07, Mark Summerfield wrote: > > On 2007-09-26, skip at pobox.com wrote: > > > Mark> I have put a new version (incorporating another > > > implementation Mark> idea from Paul Hankin) on PyPI: > > > > > > Mark> http://pypi.python.org/pypi/sorteddict > > > > > > From that: > > > > > > The main benefit of sorteddicts is that you never have to > > > explicitly sort. > > > > > > Surely there must be something more than that. Wrapping sorted() > > > around a keys() or values() call is a pretty trivial operation. I > > > didn't see that the implementation saved anything. > > > > Assuming you have a good sorteddict implementation (i.e., based on a > > balanced tree or a skiplist, not the one I've put up which is just > > showing the API) you can gain significant performance benefits. > > That depends very much on the use case, and in general I strongly > doubt it. I haven't looked this up in Knuth, but I believe that in a > sorted dict implementation, the best performance you can get for > random access and random insertions is O(log N), which is always beat > by the O(1) of a hash table. This translates in O(N log N) for > inserting N elements into a sorted dict, vs. O(N) in a hash table. > Sorted traversal is O(N) for the sorted dict and O(N log N) for the > hash table. So in order to gain a "significant performance benefit" > you'd have to have one pass of insertions and two traversals with a > small number of insertions or deletions in between (otherwise the > sorted result from the hash table could just be cached). 
I'm sure your numbers are right. It seems to me that the trade off is this: with dict + sorted() you pay O(N log N) whenever you need to sort (okay, Python is optimised for sorting partially ordered data so probably is better than the theoretical best). With sorteddict you pay O(log N) for accessing, but you pay nothing for sorting. > I don't believe that this pattern is common enough, but I don't know > your application. > > For example, if you have a large dataset that you need to traverse quite > > frequently in sorted order, calling sorted() each time will be expensive > > compared to simply traversing an intrinsically sorted data structure. > > > > When I program in C++/Qt I use QMap (a sorteddict) very often; the STL > > equivalent is called map. Both the Qt and STL libraries have dict > > equivalents (QHash and unordered_map), but my impression is that the > > sorted data structures are used far more frequently than the unsorted > > versions. > > Perhaps out of ignorance? Or perhaps the hash implementations have > suboptimal implementations? Or perhaps because no equivalent to > sorted() exists? C++ provides sorting algorithms that can be applied to STL containers (or Qt containers which also has its own sorting algorithms), so these do exist. > > If you primarily program in Python, using dict + sorted() is very > > natural because they are built into the language. But using a sorted > > data structure and never sorting is a very common practice in other > > languages. > > Ah, now the real reason you want this so badly is finally clear: > simply because you're more familiar with it. :-) That is true! > Is the number of elements in a typical use case large enough that the > performance difference even matters? I don't know. In C++ I use QMap or map so my ordering is free and I never notice the extra cost of lookup compared with a hash. 
In Python, I can only compare theoretically, not empirically because I'd need a sorteddict that was as well implemented as dict is. I'll leave sorteddict on PyPI for those sad souls who want it, and I'll try to think "dict + sorted()" for Python. Of course this might be academic for Python 3, at least for strings (unless you implement some kind of string comparison normalisation method), since two strings that are the same to humans may be different byte sequences which rather makes "sorting" a moot point. -- Mark Summerfield, Qtrac Ltd., www.qtrac.eu From jimjjewett at gmail.com Wed Sep 26 17:35:15 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 26 Sep 2007 11:35:15 -0400 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709260802.44381.mark@qtrac.eu> <18170.16027.665491.815991@montanaro.dyndns.org> <200709261233.57636.mark@qtrac.eu> Message-ID: On 9/26/07, Guido van Rossum wrote: > On 9/26/07, Mark Summerfield wrote: > > Assuming you have a good sorteddict implementation ... > > you can gain significant performance benefits. > ... sorted dict implementation, the best performance you can get for > random access and random insertions is O(log N), which is always beat > by the O(1) of a hash table It is possible to keep two structures in parallel, so that lookup (using the hash) is still O(1) and traversal (using the tree) is still O(N); the penalty is that you pay for both methods when you do a mutation. (In big O notation, that doesn't matter, but the overhead may be important in practice.) 
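A rough sketch of the two-structures-in-parallel idea, under a simplifying assumption: a sorted list maintained with the standard bisect module stands in for the tree. That keeps lookup at O(1) and traversal at O(N), but inserting into a list is O(N) in the worst case, so a real balanced tree would be needed to match the bounds discussed here. The class name ParallelSortedDict is hypothetical.

```python
import bisect


class ParallelSortedDict:
    """Hypothetical illustration: hash table for lookup, sorted key list for order."""

    def __init__(self):
        self._data = {}   # key -> value, used for lookups
        self._keys = []   # the same keys, kept in sorted order

    def __setitem__(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)  # mutation pays for the second structure
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]
        index = bisect.bisect_left(self._keys, key)
        del self._keys[index]

    def __len__(self):
        return len(self._data)

    def items(self):
        # Sorted traversal without calling sorted().
        for key in self._keys:
            yield key, self._data[key]


p = ParallelSortedDict()
for number in [5, 1, 3]:
    p[number] = str(number)
ordered = list(p.items())
```

The extra work in __setitem__ and __delitem__ is the mutation penalty referred to above.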
-jJ From guido at python.org Wed Sep 26 18:34:16 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 26 Sep 2007 09:34:16 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> <46F9BA8F.80907@canterbury.ac.nz> Message-ID: Sounds like we need a PEP to sort out the details. I'll try to come up with something. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From skip at pobox.com Wed Sep 26 18:43:10 2007 From: skip at pobox.com (skip at pobox.com) Date: Wed, 26 Sep 2007 11:43:10 -0500 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <200709261627.25593.mark@qtrac.eu> References: <200709111506.32823.mark@qtrac.eu> <200709261233.57636.mark@qtrac.eu> <200709261627.25593.mark@qtrac.eu> Message-ID: <18170.35870.184920.53212@montanaro.dyndns.org> Mark> With sorteddict you pay O(log N) for accessing, but you pay Mark> nothing for sorting. Pay me now or pay me later, but maintaining a sorted sequence will always cost something. Skip From martin at v.loewis.de Wed Sep 26 20:06:55 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 26 Sep 2007 20:06:55 +0200 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709260802.44381.mark@qtrac.eu> <18170.16027.665491.815991@montanaro.dyndns.org> <200709261233.57636.mark@qtrac.eu> Message-ID: <46FA9FBF.8060909@v.loewis.de> >> When I program in C++/Qt I use QMap (a sorteddict) very often; the STL >> equivalent is called map. 
Both the Qt and STL libraries have dict >> equivalents (QHash and unordered_map), but my impression is that the >> sorted data structures are used far more frequently than the unsorted >> versions. > > Perhaps out of ignorance? Or perhaps the hash implementations have > suboptimal implementations? Or perhaps because no equivalent to > sorted() exists? I feel (without being able to prove it) that C++ (i.e. STL uses a red-black-tree instead of a hash table for two reasons): 1. it is theoretically better. Hash tables have not O(1), but O(n) as the worst case, whereas balanced trees can guarantee O(log n). Hash tables have O(1) in the average case only if the hash function is good, plus the costs for computing the hash are typically higher than the costs for comparison, unless the hash is cached. 2. it is often easier for applications to provide sorting. For most things you want to search for, coming up with a total order is straight-forward; defining a hash operation might not be that easy (of course, for identity lookups, hashing is easier). Regards, Martin From jason.orendorff at gmail.com Wed Sep 26 20:07:19 2007 From: jason.orendorff at gmail.com (Jason Orendorff) Date: Wed, 26 Sep 2007 14:07:19 -0400 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709260802.44381.mark@qtrac.eu> <18170.16027.665491.815991@montanaro.dyndns.org> <200709261233.57636.mark@qtrac.eu> Message-ID: One situation where a sorteddict would win is finding upper and lower bounds. This especially matters if you want to iterate over a specific range of keys: "show me all entries between 1 Jan 2007 and 1 Feb 2007" is O(N) in the number of entries in that range, not the entire data set. 
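A small sketch of such a range query, assuming the keys are already held in a sorted list next to an ordinary dict. The function name entries_between and the sample log entries are made up for illustration, and the bounds are treated as inclusive, which the message does not specify.

```python
import bisect
from datetime import date


def entries_between(sorted_keys, data, low, high):
    """Return (key, value) pairs with low <= key <= high, in key order."""
    start = bisect.bisect_left(sorted_keys, low)    # first key >= low
    stop = bisect.bisect_right(sorted_keys, high)   # one past the last key <= high
    return [(key, data[key]) for key in sorted_keys[start:stop]]


log = {
    date(2006, 12, 31): "year end",
    date(2007, 1, 1): "new year",
    date(2007, 1, 15): "mid January",
    date(2007, 2, 1): "February starts",
    date(2007, 3, 1): "March starts",
}
keys = sorted(log)
january = entries_between(keys, log, date(2007, 1, 1), date(2007, 2, 1))
```

Finding the two bounds takes O(log N) comparisons; the work after that is proportional to the number of entries in the requested range.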
I think people ask for things like this because they have a high-level need like "read 3 log files, jam all the data into a single data structure, and extract time slices from that" for which no particularly obvious combination of lists and dicts seems to jump out at you. Then they hit on an idea like sorteddict that looks like it might get them 60% of the way there and seems like a simple, obvious building block that belongs in the stdlib. That's my own experience, anyway. Is sorteddict really such a great building block? I dunno. It seems like that might or might not be true. These situations seem to come up pretty rarely in Python's problem domain, so it's hard to get a feel for it. I do know from recent personal experience that system-level code often wants custom data structures, and having spent a decade with Python lists and dictionaries, I'm out of shape. :) -j From nick.bastin at gmail.com Wed Sep 26 21:00:23 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Wed, 26 Sep 2007 15:00:23 -0400 Subject: [Python-3000] Py3k Trivia :-) In-Reply-To: References: <46F46A71.1060409@canterbury.ac.nz> Message-ID: <66d0a6e10709261200x5834898bof1a23f80cc0ddca5@mail.gmail.com> On 9/21/07, Guido van Rossum wrote: > On 9/21/07, Greg Ewing wrote: > > Guido van Rossum wrote: > > > """ > > > George isn't tall enough to ride the greatest rollercoaster of all > > > time, The Turbo Python 3000. He uses licorice whips to measure his > > > height and determines that he is 7-whips tall, one short of the 8-whip > > > minimum! > > > """ > > > > Fantastic! I vote that we hereby adopt the licorice whip > > as the standard unit for measuring the speed of Python 3.0 > > implementations, with the speed of 2.6 (whatever it turns > > out to be) defined as 7 whips. > > Ah, but is 6 whips faster or slower than 7 whips? Slower. If we get up to 8 we can go for an exciting ride!
:-) -- Nick From charleshixsn at earthlink.net Wed Sep 26 21:39:49 2007 From: charleshixsn at earthlink.net (Charles D Hixson) Date: Wed, 26 Sep 2007 12:39:49 -0700 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: <18170.35870.184920.53212@montanaro.dyndns.org> References: <200709111506.32823.mark@qtrac.eu> <200709261233.57636.mark@qtrac.eu> <200709261627.25593.mark@qtrac.eu> <18170.35870.184920.53212@montanaro.dyndns.org> Message-ID: <46FAB585.1000005@earthlink.net> skip at pobox.com wrote: > Mark> With sorteddict you pay O(log N) for accessing, but you pay > Mark> nothing for sorting. > > Pay me now or pay me later, but maintaining a sorted sequence will always > cost something. > > Skip > Very frequently, however, I want frequent sorted access to a container. I.e. I will want something like "what's the next key bigger than nnn" (I said nnn because it often isn't a string). In such cases a B+Tree or B*Tree is a much better answer than a hash table. I'll grant that for the most common cases hash tables are superior...but not, by any means, for all cases. There have been cases where I have maintained both a list and a dict for the same data. (Well, the list was an index into the dict, but you get the idea.) The dict was for fast access when I knew the key, and the list was for binary search when I knew things *about* the key. An important note here is that the key to the dict/list was generally NOT a string in these situations. If strings suffice, then I've generally found a hash table to work well enough, and frequently been superior. OTOH, if you don't need continual access while you are building the list, then I agree with you. The problem is that each time you sort a hash table you must pay for an entire sort, while adding a key or two to a B+Tree is relatively cheap. 
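A sketch of the list-plus-dict arrangement described in this message, assuming the list holds the dict's keys in sorted order so that a binary search can answer "what's the next key bigger than nnn". The helper name next_key_after and the sample data are hypothetical.

```python
import bisect


def next_key_after(sorted_keys, probe):
    """Return the smallest key strictly greater than probe, or None if there is none."""
    index = bisect.bisect_right(sorted_keys, probe)
    if index < len(sorted_keys):
        return sorted_keys[index]
    return None


records = {10: "ten", 20: "twenty", 30: "thirty"}  # fast access when the key is known
index_keys = sorted(records)                        # sorted index into the dict

after_fifteen = next_key_after(index_keys, 15)
after_twenty = next_key_after(index_keys, 20)
after_thirty = next_key_after(index_keys, 30)
```

The probe does not have to be a key that is actually present, which is what a plain hash lookup cannot offer.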
From qrczak at knm.org.pl Wed Sep 26 22:00:56 2007 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Wed, 26 Sep 2007 22:00:56 +0200 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <5d44f72f0709192307r2d0cec8am5a83b3c32812cd9b@mail.gmail.com> <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> Message-ID: <1190836856.16322.55.camel@qrnik> On Tue, 25-09-2007 at 17:22 -0700, Guido van Rossum wrote: > OK. Though it's questionable even whether a slice of a mutable bytes > object should return a mutable bytes object (as it is not a shared > view). But as that is what PyBytes currently do it is certainly the > easiest... A slice of a list is a list, as it always has been, so letting slicing return the same type as the whole sequence is at least consistent and easy to explain. It is hard to say, though, what the typical use cases are. OTOH I believe individual elements of mutable or immutable bytes should be ints. Here is why I think that the analogy between characters and bytes is not strong enough to let elements of bytes be bytes of length 1 just because strings do the same. Bytes are often computed, while characters are often only copied from place to place. Arithmetic is defined on ints, but not on byte sequences of length 1. This means that computing a bytes sequence from scratch requires explicit conversions between a byte represented by an int and a byte represented by bytes of length 1. There is also a philosophical reason. The division of a string into characters is quite arbitrary: considering UTF-16/UTF-32, combining characters, the encoding of Hangul, orthography peculiarities, proportional fonts, ligatures, variant selectors etc. -
all of these obscuring the concept of a character and of string length, and considering that a sequence of characters might have been decoded from or will be encoded into a sequence of bytes with a different length. This means that having atomic string components is more a technical convenience than a fundamental necessity, that the very concept of a character in a Unicode world is arbitrary, and the length of a string is more a technical detail of a representation than an inherent property of the text being represented. All this means that the concept of a string is more fundamental than a character. OTOH a byte count and byte offsets are usually important in protocols based on bytes (except text files when they encode human text). The individual bytes are in some sense delimited very sharply from each other, the amount of information stored in one byte is very well defined. A single byte is a more important concept in a bytes world than a character in a text world, it's not merely a sequence with length 1. Having characters different from strings would require creation of a new type, because the existing int type is not very appropriate for single characters, because many properties differ, e.g. the effect of writing to a text file. To avoid the burden of creating a new type for a concept which is rarely useful in isolation, strings of length 1 have been reused. OTOH the existing int type seems appropriate for elements of bytes. They can be easily thought of as just integers in the range 0..255, and Python does not use separate integer types for different potential ranges. If you really don't like ints there, I would prefer immutable bytes even as elements of mutable bytes. This is just a value isomorphic to an int, not an object with its own state. 
Moreover for atomic objects like individual bytes mutability is not helpful to obtain performance, which would be a reason to use a mutable type for non-atomic objects even when conceptually they are identityless values (mutability often helps in such case because an object can be constructed piece by piece). -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From guido at python.org Wed Sep 26 23:58:53 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 26 Sep 2007 14:58:53 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer Message-ID: Please comment. PEP: 3137 Title: Immutable Bytes and Mutable Buffer Version: $Revision: 58264 $ Last-Modified: $Date: 2007-09-26 14:58:29 -0700 (Wed, 26 Sep 2007) $ Author: Guido van Rossum Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 26-Sep-2007 Python-Version: 3.0 Post-History: 26-Sep-2007 Introduction ============ After releasing Python 3.0a1 with a mutable bytes type, pressure mounted to add a way to represent immutable bytes. Gregory P. Smith proposed a patch that would allow making a bytes object temporarily immutable by requesting that the data be locked using the new buffer API from PEP 3118. This did not seem the right approach to me. Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to make the bytes type immutable (by crudely removing all mutating APIs) and fix the fall-out in the test suite. This showed that there aren't all that many places that depend on the mutability of bytes, with the exception of code that builds up a return value from small pieces. Thinking through the consequences, and noticing that using the array module as an ersatz mutable bytes type is far from ideal, and recalling a proposal put forward earlier by Talin, I floated the suggestion to have both a mutable and an immutable bytes type. 
(This had been brought up before, but until seeing the evidence of Jeffrey's patch I wasn't open to the suggestion.) Moreover, a possible implementation strategy became clear: use the old PyString implementation, stripped down to remove locale support and implicit conversions to/from Unicode, for the immutable bytes type, and keep the new PyBytes implementation as the mutable bytes type. The ensuing discussion made it clear that the idea is welcome but needs to be specified more precisely. Hence this PEP. Advantages ========== One advantage of having an immutable bytes type is that code objects can use these. It also makes it possible to efficiently create hash tables using bytes for keys; this may be useful when parsing protocols like HTTP or SMTP which are based on bytes representing text. Porting code that manipulates binary data (or encoded text) in Python 2.x will be easier using the new design than using the original 3.0 design with mutable bytes; simply replace ``str`` with ``bytes`` and change '...' literals into b'...' literals. Naming ====== I propose the following type names at the Python level: - ``bytes`` is an immutable array of bytes (PyString) - ``buffer`` is a mutable array of bytes (PyBytes) - ``memoryview`` is a bytes view on another object (PyMemory) The old type named ``buffer`` is so similar to the new type ``memoryview``, introduce by PEP 3118, that it is redundant. The rest of this PEP doesn't discuss the functionality of ``memoryview``; it is just mentioned here to justify getting rid of the old ``buffer`` type so we can reuse its name for the mutable bytes type. While eventually it makes sense to change the C API names, this PEP maintains the old C API names, which should be familiar to all. Literal Notations ================= The b'...' notation introduced in Python 3.0a1 returns an immutable bytes object, whatever variation is used. To create a mutable bytes buffer object, use buffer(b'...') or buffer([...]). 
The latter may use a list of integers in range(256). Functionality ============= PEP 3118 Buffer API ------------------- Both bytes and buffer support the PEP 3118 buffer API. The bytes type only supports read-only requests; the buffer type allows writable and data-locked requests as well. The element data type is always 'B' (i.e. unsigned byte). Constructors ------------ There are five forms of constructors; all but the last are applicable to both bytes and buffer: - ``bytes(<bytes>)``, ``bytes(<buffer>)``, ``buffer(<bytes>)``, ``buffer(<buffer>)``: simple copying constructors, with the note that ``bytes(<bytes>)`` might return its (immutable) argument. - ``bytes(<str>, <encoding>[, <errors>])``, ``buffer(<str>, <encoding>[, <errors>])``: encode a text string. Note that the ``str.encode()`` method returns an *immutable* bytes object. The <encoding> argument is mandatory; <errors> is optional. - ``bytes(<memory view>)``, ``buffer(<memory view>)``: construct a bytes or buffer object from anything that supports the PEP 3118 buffer API. - ``bytes(<iterable of ints>)``, ``buffer(<iterable of ints>)``: construct an immutable bytes or mutable buffer object from a stream of integers in range(256). - ``buffer(<int>)``: construct a zero-initialized buffer of a given lenth. Comparisons ----------- The bytes and buffer types are comparable with each other and orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'. Comparing either type to a str object raises an exception. This turned out to be necessary to catch common mistakes. Slicing ------- Slicing a bytes object returns a bytes object. Slicing a buffer object returns a buffer object. Slice assignment to a mutable buffer object accepts anything that supports the PEP 3118 buffer API, or an iterable of integers in range(256). Indexing -------- **Open Issue:** I'm undecided on whether indexing bytes and buffer objects should return small ints (like the bytes type in 3.0a1, and like lists or array.array('B')), or bytes/buffer objects of length 1 (like the str type). The latter (str-like) approach will ease porting code from Python 2.x; but it makes it harder to extract values from a bytes array.
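For readers trying the two indexing options today, a short sketch using Python 3 as eventually released. Two assumptions differ from the draft text: the mutable type is spelled bytearray there rather than buffer, and the open issue was settled in favour of small ints. The variable names are illustrative only.

```python
data = b"abc"               # immutable bytes literal
buf = bytearray(data)       # mutable counterpart (called "buffer" in this draft)

same_contents = (data == buf)     # the two types compare equal by content
ordered = (buf < b"abd")          # and are orderable against each other

element = data[0]           # indexing gives a small int
one_byte = data[0:1]        # a length-1 slice gives a bytes object instead
piece = buf[0:2]            # slicing the mutable type gives the mutable type

# Because elements are ints, arithmetic needs no conversion step.
checksum = sum(data) % 256
shifted = bytes((value + 1) % 256 for value in data)

buf[1:3] = b"YZ"            # slice assignment from another bytes-like object
zeros = bytearray(3)        # zero-initialized, from an int length
from_ints = bytes([104, 105])
```

Under the str-like alternative, element would have been b"a" and the checksum line would have needed an explicit conversion for every byte.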
Assignment to an item of a mutable buffer object accepts an int in range(256); if we choose the str-like approach for indexing above, it also accepts an object implementing the PEP 3118 buffer API, if it has length 1. Str() and Repr() ---------------- The str() and repr() functions return the same thing for these objects. The repr() of a bytes object returns a b'...' style literal. The repr() of a buffer returns a string of the form "buffer(b'...')". Methods ------- The following methods are supported by bytes as well as buffer, with similar semantics. They accept anything that implements the PEP 3118 buffer API for bytes arguments, and return the same type as the object whose method is called ("self"):: .capitalize(), .center(), .count(), .decode(), .endswith(), .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(), .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(), .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(), .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(), .splitlines(), .startswith(), .strip(), .swapcase(), .title(), .translate(), .upper(), .zfill() This is exactly the set of methods present on the str type in Python 2.x, with the exclusion of .encode(). The signatures and semantics are the same too. However, whenever character classes like letter, whitespace, lower case are used, the ASCII definitions of these classes are used. (The Python 2.x str type uses the definitions from the current locale, settable through the locale module.) The .encode() method is left out because of the more strict definitions of encoding and decoding in Python 3000: encoding always takes a Unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a Unicode string. 
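A brief sketch of the method behaviour described in this section, run against Python 3 as eventually released, where the mutable type is named bytearray rather than buffer (an assumption relative to this draft). The header value is made up for illustration.

```python
header = b"content-type"
mutable_header = bytearray(header)

upper_bytes = header.upper()             # returns the same type as "self": bytes
upper_mutable = mutable_header.upper()   # returns a bytearray

# Character classes follow ASCII only, regardless of locale:
# byte 0xE9 (e-acute in Latin-1) is not treated as a lowercase letter.
latin1_word = b"\xe9t\xe9"
upper_latin1 = latin1_word.upper()

has_encode = hasattr(header, "encode")   # .encode() is deliberately absent on bytes
text = header.decode("ascii")            # bytes -> str always goes through decode
round_trip = text.encode("ascii")        # str -> bytes always goes through encode
```

Only the ASCII letter in latin1_word changes case; the two 0xE9 bytes pass through untouched.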
Bytes and the Str Type ---------------------- Like the bytes type in Python 3.0a1, and unlike the relationship between str and unicode in Python 2.x, any attempt to mix bytes (or buffer) objects and str objects without specifying an encoding will raise a TypeError exception. This is the case even for simply comparing a bytes or buffer object to a str object (even violating the general rule that comparing objects of different types for equality should just return False). Conversions between bytes or buffer objects and str objects must always be explicit, using an encoding. There are two equivalent APIs: ``str(b, <encoding>[, <errors>])`` is equivalent to ``b.decode(<encoding>[, <errors>])``, and ``bytes(s, <encoding>[, <errors>])`` is equivalent to ``s.encode(<encoding>[, <errors>])``. There is one exception: we can convert from bytes (or buffer) to str without specifying an encoding by writing ``str(b)``. This produces the same result as ``repr(b)``. This exception is necessary because of the general promise that *any* object can be printed, and printing is just a special case of conversion to str. There is however no promise that printing a bytes object interprets the individual bytes as characters (unlike in Python 2.x). The str type currently supports the PEP 3118 buffer API. While this is perhaps occasionally convenient, it is also potentially confusing, because the bytes accessed via the buffer API represent a platform-dependent encoding: depending on the platform byte order and a compile-time configuration option, the encoding could be UTF-16-BE, UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation of the str type might completely change the bytes representation, e.g. to UTF-8, or even make it impossible to access the data as a contiguous array of bytes at all. Therefore, support for the PEP 3118 buffer API will be removed from the str type. Copyright ========= This document has been placed in the public domain. ..
Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: -- --Guido van Rossum (home page: http://www.python.org/~guido/) From brett at python.org Thu Sep 27 00:57:47 2007 From: brett at python.org (Brett Cannon) Date: Wed, 26 Sep 2007 15:57:47 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: On 9/26/07, Guido van Rossum wrote: > Please comment. > > PEP: 3137 > Title: Immutable Bytes and Mutable Buffer > Version: $Revision: 58264 $ > Last-Modified: $Date: 2007-09-26 14:58:29 -0700 (Wed, 26 Sep 2007) $ > Author: Guido van Rossum > Status: Draft > Type: Standards Track > Content-Type: text/x-rst > Created: 26-Sep-2007 > Python-Version: 3.0 > Post-History: 26-Sep-2007 > > Introduction > ============ > > After releasing Python 3.0a1 with a mutable bytes type, pressure > mounted to add a way to represent immutable bytes. Gregory P. Smith > proposed a patch that would allow making a bytes object temporarily > immutable by requesting that the data be locked using the new buffer > API from PEP 3118. This did not seem the right approach to me. > > Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to > make the bytes type immutable (by crudely removing all mutating APIs) > and fix the fall-out in the test suite. This showed that there aren't > all that many places that depend on the mutability of bytes, with the > exception of code that builds up a return value from small pieces. > > Thinking through the consequences, and noticing that using the array > module as an ersatz mutable bytes type is far from ideal, and > recalling a proposal put forward earlier by Talin, I floated the > suggestion to have both a mutable and an immutable bytes type. (This > had been brought up before, but until seeing the evidence of Jeffrey's > patch I wasn't open to the suggestion.) 
> > Moreover, a possible implementation strategy became clear: use the old > PyString implementation, stripped down to remove locale support and > implicit conversions to/from Unicode, for the immutable bytes type, > and keep the new PyBytes implementation as the mutable bytes type. > > The ensuing discussion made it clear that the idea is welcome but > needs to be specified more precisely. Hence this PEP. > > Advantages > ========== > > One advantage of having an immutable bytes type is that code objects > can use these. Woohoo (from a security perspective)! > It also makes it possible to efficiently create hash > tables using bytes for keys; this may be useful when parsing protocols > like HTTP or SMTP which are based on bytes representing text. > > Porting code that manipulates binary data (or encoded text) in Python > 2.x will be easier using the new design than using the original 3.0 > design with mutable bytes; simply replace ``str`` with ``bytes`` and > change '...' literals into b'...' literals. > > Naming > ====== > > I propose the following type names at the Python level: > > - ``bytes`` is an immutable array of bytes (PyString) > > - ``buffer`` is a mutable array of bytes (PyBytes) > > - ``memoryview`` is a bytes view on another object (PyMemory) > > The old type named ``buffer`` is so similar to the new type > ``memoryview``, introduce by PEP 3118, that it is redundant. The rest > of this PEP doesn't discuss the functionality of ``memoryview``; it is > just mentioned here to justify getting rid of the old ``buffer`` type > so we can reuse its name for the mutable bytes type. > > While eventually it makes sense to change the C API names, this PEP > maintains the old C API names, which should be familiar to all. > > Literal Notations > ================= > > The b'...' notation introduced in Python 3.0a1 returns an immutable > bytes object, whatever variation is used. To create a mutable bytes > buffer object, use buffer(b'...') or buffer([...]). 
The latter may > use a list of integers in range(256). > > Functionality > ============= > > PEP 3118 Buffer API > ------------------- > > Both bytes and buffer support the PEP 3118 buffer API. The bytes type > only supports read-only requests; the buffer type allows writable and > data-locked requests as well. The element data type is always 'B' > (i.e. unsigned byte). > > Constructors > ------------ > > There are four forms of constructors, applicable to both bytes and > buffer: > > - ``bytes(<bytes>)``, ``bytes(<buffer>)``, ``buffer(<bytes>)``, > ``buffer(<buffer>)``: simple copying constructors, with the note > that ``bytes(<bytes>)`` might return its (immutable) argument. > > - ``bytes(<str>, <encoding>[, <errors>])``, ``buffer(<str>, > <encoding>[, <errors>])``: encode a text string. Note that the > ``str.encode()`` method returns an *immutable* bytes object. > The <encoding> argument is mandatory; <errors> is optional. > > - ``bytes(<memory view>)``, ``buffer(<memory view>)``: construct a > bytes or buffer object from anything that supports the PEP 3118 > buffer API. > > - ``bytes(<iterable of ints>)``, ``buffer(<iterable of ints>)``: > construct an immutable bytes or mutable buffer object from a > stream of integers in range(256). > > - ``buffer(<int>)``: construct a zero-initialized buffer of a given > lenth. Typo; went ahead and fixed it in svn. > > Comparisons > ----------- > > The bytes and buffer types are comparable with each other and > orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'. > > Comparing either type to a str object raises an exception. This > turned out to be necessary to catch common mistakes. > > Slicing > ------- > > Slicing a bytes object returns a bytes object. Slicing a buffer > object returns a buffer object. > > Slice assignment to a mutable buffer object accept anything that > supports the PEP 3118 buffer API, or an iterable of integers in > range(256).
> > Indexing > -------- > > **Open Issue:** I'm undecided on whether indexing bytes and buffer > objects should return small ints (like the bytes type in 3.0a1, and > like lists or array.array('B')), or bytes/buffer objects of length 1 > (like the str type). The latter (str-like) approach will ease porting > code from Python 2.x; but it makes it harder to extract values from a > bytes array. > How much do you care about making the 2 -> 3 transition easy? If you don't go the str way then comparisons like ``bytes_[0] == b"A"`` won't work unless you allow comparisons between ints and length 1 bytes/buffers. Extracting a single item is not horrendous if you pass it to int(). Personally I say go with the list-like semantics. Having the following code return false seems odd (but not ridiculous) to me:: stuff = bytes([0, 1]) stuff[1] = 42 stuff[1] == 42 So unless int comparisons are allowed I am -0 on the str-like semantics. > Assignment to an item of a mutable buffer object accepts an int in > range(256); if we choose the str-like approach for indexing above, it > also accepts an object implementing the PEP 3118 buffer API, if it has > length 1. > > Str() and Repr() > ---------------- > > The str() and repr() functions return the same thing for these > objects. The repr() of a bytes object returns a b'...' style literal. > The repr() of a buffer returns a string of the form "buffer(b'...')". > > Methods > ------- > > The following methods are supported by bytes as well as buffer, with > similar semantics. 
They accept anything that implements the PEP 3118 > buffer API for bytes arguments, and return the same type as the object > whose method is called ("self"):: > > .capitalize(), .center(), .count(), .decode(), .endswith(), > .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(), > .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(), > .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(), > .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(), > .splitlines(), .startswith(), .strip(), .swapcase(), .title(), > .translate(), .upper(), .zfill() > > This is exactly the set of methods present on the str type in Python > 2.x, with the exclusion of .encode(). The signatures and semantics > are the same too. However, whenever character classes like letter, > whitespace, lower case are used, the ASCII definitions of these > classes are used. (The Python 2.x str type uses the definitions from > the current locale, settable through the locale module.) The > .encode() method is left out because of the more strict definitions of > encoding and decoding in Python 3000: encoding always takes a Unicode > string and returns a bytes sequence, and decoding always takes a bytes > sequence and returns a Unicode string. > > Bytes and the Str Type > ---------------------- > > Like the bytes type in Python 3.0a1, and unlike the relationship > between str and unicode in Python 2.x, any attempt to mix bytes (or > buffer) objects and str objects without specifying an encoding will > raise a TypeError exception. This is the case even for simply > comparing a bytes or buffer object to a str object (even violating the > general rule that comparing objects of different types for equality > should just return False). > > Conversions between bytes or buffer objects and str objects must > always be explicit, using an encoding. 
There are two equivalent APIs: > ``str(b, <encoding>[, <errors>])`` is equivalent to > ``b.encode(<encoding>[, <errors>])``, and > ``bytes(s, <encoding>[, <errors>])`` is equivalent to > ``s.decode(<encoding>[, <errors>])``. > > There is one exception: we can convert from bytes (or buffer) to str > without specifying an encoding by writing ``str(b)``. This produces > the same result as ``repr(b)``. This exception is necessary because > of the general promise that *any* object can be printed, and printing > is just a special case of conversion to str. There is however no > promise that printing a bytes object interprets the individual bytes > as characters (unlike in Python 2.x). > > The str type current supports the PEP 3118 buffer API. While this is Fixed to "currently" in svn. > perhaps occasionally convenient, it is also potentially confusing, > because the bytes accessed via the buffer API represent a > platform-depending encoding: depending on the platform byte order and > a compile-time configuration option, the encoding could be UTF-16-BE, > UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation > of the str type might completely change the bytes representation, > e.g. to UTF-8, or even make it impossible to access the data as a > contiguous array of bytes at all. Therefore, support for the PEP 3118 > buffer API will be removed from the str type. > +1 from me regardless of how the length 1 discussion turns out as this will help with Py3K transitioning.
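A minimal sketch of the explicit-conversion rules quoted above, checked against Python 3 as it was eventually released. That is an assumption on top of this draft: details may differ from the text under discussion, and in released Python 3 the constructor form ``str(b, enc)`` pairs with ``b.decode(enc)``. The sample value is made up for illustration.

```python
data = b"caf\xc3\xa9"          # UTF-8 encoded bytes (illustrative value)
text = str(data, "utf-8")      # explicit decoding via the constructor

# In released Python 3 the constructor forms match the method forms.
same_decode = text == data.decode("utf-8")
same_encode = bytes(text, "utf-8") == text.encode("utf-8")

# Without an encoding, str(b) falls back to the repr-style rendering.
printable = str(data) == repr(data)

# Mixing the two types without an encoding is rejected.
try:
    data + text
    mixing_rejected = False
except TypeError:
    mixing_rejected = True

# str does not support the buffer API, so memoryview() refuses it.
try:
    memoryview(text)
    str_has_buffer_api = True
except TypeError:
    str_has_buffer_api = False
```

The only implicit path left is ``str(b)``, which renders the literal form rather than decoding anything.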
-Brett From guido at python.org Thu Sep 27 00:58:00 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 26 Sep 2007 15:58:00 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <1190836856.16322.55.camel@qrnik> References: <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> <1190836856.16322.55.camel@qrnik> Message-ID: I find this semi-convincing. It would be very convincing in a greenfield situation I think. However there's quite a bit of Python 2.x code around that manipulates *bytes* in the guise of 8-bit strings, and it uses tests like "if s[0] == 'x': ..." frequently. This can of course be rewritten using a slice, but not so easily when you're looping over bytes:

    for b in bb:
        if b == b'x': ...

This becomes the relatively ugly (because it uses a 1-char *string*):

    for b in bb:
        if b == ord('x'): ...

So I've left this as an open issue in PEP 3137. --Guido On 9/26/07, Marcin 'Qrczak' Kowalczyk wrote: > On 25-09-2007, Tue at 17:22 -0700, Guido van Rossum wrote: > > > OK. Though it's questionable even whether a slice of a mutable bytes > > object should return a mutable bytes object (as it is not a shared > > view). But as that is what PyBytes currently do it is certainly the > > easiest... > > A slice of a list is a list, as it always has been, so letting slicing > return the same type as the whole sequence is at least consistent and > easy to explain. Hard to say though what are typical use cases. > > OTOH I believe individual elements of mutable or immutable bytes should > be ints. Here is why I think that the analogy between characters and > bytes is not strong enough to let elements of bytes be bytes of length 1 > just because strings do the same.
> > Bytes are often computed, while characters are often only copied > from place to place. Arithmetic is defined on ints, but not on bytes > sequences of length 1. This means that computing a bytes sequence from > scratch requires explicit conversions between a byte represented by an > int and a byte represented by bytes of length 1. > > There is also a philosophical reason. The division of a string into > characters is quite arbitrary: considering UTF-16/UTF-32, combining > characters, the encoding of Hangul, orthography peculiarities, > proportional fonts, ligatures, variant selectors etc. -- all of these > obscuring the concept of a character and of string length, and > considering that a sequence of characters might have been decoded from > or will be encoded into a sequence of bytes with a different length. > This means that having atomic string components is more a technical > convenience than a fundamental necessity, that the very concept of a > character in a Unicode world is arbitrary, and the length of a string is > more a technical detail of a representation than an inherent property of > the text being represented. All this means that the concept of a string > is more fundamental than a character. > > OTOH a byte count and byte offsets are usually important in protocols > based on bytes (except text files when they encode human text). The > individual bytes are in some sense delimited very sharply from each > other, the amount of information stored in one byte is very well > defined. A single byte is a more important concept in a bytes world > than a character in a text world, it's not merely a sequence with > length 1. > > Having characters different from strings would require creation of a new > type, because the existing int type is not very appropriate for single > characters, because many properties differ, e.g. the effect of writing > to a text file.
To avoid the burden of creating a new type for a concept > which is rarely useful in isolation, strings of length 1 have been > reused. OTOH the existing int type seems appropriate for elements of > bytes. They can be easily thought of as just integers in the range > 0..255, and Python does not use separate integer types for different > potential ranges. > > If you really don't like ints there, I would prefer immutable bytes even > as elements of mutable bytes. This is just a value isomorphic to an int, > not an object with its own state. Moreover for atomic objects like > individual bytes mutability is not helpful to obtain performance, which > would be a reason to use a mutable type for non-atomic objects even when > conceptually they are identityless values (mutability often helps in > such case because an object can be constructed piece by piece). > > -- > __("< Marcin Kowalczyk > \__/ qrczak at knm.org.pl > ^^ http://qrnik.knm.org.pl/~qrczak/ > > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Thu Sep 27 01:03:12 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 26 Sep 2007 16:03:12 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: [PEP 3137] > > **Open Issue:** I'm undecided on whether indexing bytes and buffer > > objects should return small ints (like the bytes type in 3.0a1, and > > like lists or array.array('B')), or bytes/buffer objects of length 1 > > (like the str type). The latter (str-like) approach will ease porting > > code from Python 2.x; but it makes it harder to extract values from a > > bytes array. 
On 9/26/07, Brett Cannon wrote: > How much do you care about making the 2 -> 3 transition easy? If you > don't go the str way then comparisons like ``bytes_[0] == b"A"`` won't > work unless you allow comparisons between ints and length 1 > bytes/buffers. Extracting a single item is not horrendous if you pass > it to int(). > > Personally I say go with the list-like semantics. Having the > following code return false seems odd (but not ridiculous) to me:: > > stuff = bytes([0, 1]) > stuff[1] = 42 > stuff[1] == 42 > > So unless int comparisons are allowed I am -0 on the str-like semantics. int comparisons would stick out like a sore thumb, especially since they can only be reasonably made to work on 1-byte strings. I'm still undecided (despite Marcin's eloquent argument for ints as bytes) but I'm open for votes for this case. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From nick.bastin at gmail.com Thu Sep 27 04:15:58 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Wed, 26 Sep 2007 22:15:58 -0400 Subject: [Python-3000] ordered dict for p3k collections? In-Reply-To: References: <200709111506.32823.mark@qtrac.eu> <200709260802.44381.mark@qtrac.eu> <18170.16027.665491.815991@montanaro.dyndns.org> <200709261233.57636.mark@qtrac.eu> Message-ID: <66d0a6e10709261915j244b00d9s7f7369acb78e272a@mail.gmail.com> On 9/26/07, Jason Orendorff wrote: > One situation where a sorteddict would win is finding upper and lower > bounds. This especially matters if you want to iterate over a > specific range of keys: "show me all entries between 1 Jan 2007 and 1 > Feb 2007" is O(N) in the number of entries in that range, not the > entire data set. Yeah, we do this a lot. We frequently end up with dictionaries with hundreds of thousands of entries and a simple wrapper on std::map gives us about 120x the performance of python dict in our use case, almost entirely due to the fact that we search a LOT more than we insert. 
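The range query Jason describes can be approximated with a sorted list and the standard bisect module. This is only a sketch under stated assumptions: the `entries` data and the `entries_between` helper are hypothetical, and a real sorteddict would also keep its keys ordered as items are inserted.

```python
import bisect
from datetime import date

# Hypothetical data set: (date, label) pairs kept sorted by date.
entries = [
    (date(2006, 12, 30), "a"),
    (date(2007, 1, 1), "b"),
    (date(2007, 1, 15), "c"),
    (date(2007, 2, 1), "d"),
    (date(2007, 3, 9), "e"),
]
keys = [when for when, _ in entries]

def entries_between(lo, hi):
    """Return entries with lo <= date <= hi using two binary searches."""
    start = bisect.bisect_left(keys, lo)    # first index with key >= lo
    stop = bisect.bisect_right(keys, hi)    # first index with key > hi
    return entries[start:stop]

january = entries_between(date(2007, 1, 1), date(2007, 2, 1))
```

Locating the bounds takes two binary searches, so the work after that is proportional to the number of entries in the range rather than the size of the whole data set.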
-- Nick From alexandre at peadrop.com Thu Sep 27 04:36:08 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Wed, 26 Sep 2007 22:36:08 -0400 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: On 9/26/07, Guido van Rossum wrote: > > Constructors > ------------ > > There are four forms of constructors, applicable to both bytes and > buffer: > > - ``bytes(<bytes>)``, ``bytes(<buffer>)``, ``buffer(<bytes>)``, > ``buffer(<buffer>)``: simple copying constructors, with the note > that ``bytes(<bytes>)`` might return its (immutable) argument. > > - ``bytes(<str>, <encoding>[, <errors>])``, ``buffer(<str>, > <encoding>[, <errors>])``: encode a text string. Note that the > ``str.encode()`` method returns an *immutable* bytes object. > The <encoding> argument is mandatory; <errors> is optional. > > - ``bytes(<memory view>)``, ``buffer(<memory view>)``: construct a > bytes or buffer object from anything that supports the PEP 3118 > buffer API. > > - ``bytes(<iterable of ints>)``, ``buffer(<iterable of ints>)``: > construct an immutable bytes or mutable buffer object from a > stream of integers in range(256). > > - ``buffer(<int>)``: construct a zero-initialized buffer of a given > lenth. > I think this section could be better organized. I had to read a few time to fully understand it. Maybe a table would emphasize better the differences between the two constructors. > Indexing > -------- > > **Open Issue:** I'm undecided on whether indexing bytes and buffer > objects should return small ints (like the bytes type in 3.0a1, and > like lists or array.array('B')), or bytes/buffer objects of length 1 > (like the str type). The latter (str-like) approach will ease porting > code from Python 2.x; but it makes it harder to extract values from a > bytes array. I think indexing a bytes/buffer object should return an int. I find this behavior more natural, to me, than using an ord()-like function to extract values. In fact, I remarked that the use of ord() is good indicator that bytes should be used instead of str (look by yourself: grep -R --include='*.py' 'ord(' python25/Lib).
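For comparison, a short sketch of the int-returning behaviour argued for above, as it ended up in released Python 3 (an assumption relative to this draft; there the mutable type is spelled bytearray, not buffer). Indexing yields an int, slicing yields a sequence of the same type, and no ord() call is needed to do arithmetic on an element.

```python
data = b"abc"

first = data[0]          # indexing gives a small int
head = data[0:1]         # slicing keeps the bytes type
values = list(data)      # iteration yields ints too

# The mutable counterpart accepts ints in range(256) for item assignment.
buf = bytearray(data)
buf[0] = first + 1       # arithmetic works directly on the element
```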
> Str() and Repr() > ---------------- > > The str() and repr() functions return the same thing for these > objects. The repr() of a bytes object returns a b'...' style literal. > The repr() of a buffer returns a string of the form "buffer(b'...')". Does that mean calling str() on a bytes/buffer object -- e.g., str(b"abc") -- wouldn't decode the content of the object (like array objects)? > Bytes and the Str Type > ---------------------- > > Like the bytes type in Python 3.0a1, and unlike the relationship > between str and unicode in Python 2.x, any attempt to mix bytes (or > buffer) objects and str objects without specifying an encoding will > raise a TypeError exception. This is the case even for simply > comparing a bytes or buffer object to a str object (even violating the > general rule that comparing objects of different types for equality > should just return False). > > Conversions between bytes or buffer objects and str objects must > always be explicit, using an encoding. There are two equivalent APIs: > ``str(b, <encoding>[, <errors>])`` is equivalent to > ``b.encode(<encoding>[, <errors>])``, and > ``bytes(s, <encoding>[, <errors>])`` is equivalent to > ``s.decode(<encoding>[, <errors>])``. > > There is one exception: we can convert from bytes (or buffer) to str > without specifying an encoding by writing ``str(b)``. This produces > the same result as ``repr(b)``. This exception is necessary because > of the general promise that *any* object can be printed, and printing > is just a special case of conversion to str. There is however no > promise that printing a bytes object interprets the individual bytes > as characters (unlike in Python 2.x). Ah! That answers my last question.
:) -- Alexandre From greg.ewing at canterbury.ac.nz Thu Sep 27 04:38:13 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 27 Sep 2007 14:38:13 +1200 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: <46FB1795.5030404@canterbury.ac.nz> Guido van Rossum wrote: > I'm still undecided (despite Marcin's eloquent argument for ints as > bytes) but I'm open for votes for this case. Whatever is done, please don't do it *only* to make conversion from 2.x easy. There should be good independent reasons for whatever is chosen. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Thu Sep 27 04:35:10 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 27 Sep 2007 14:35:10 +1200 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> <1190836856.16322.55.camel@qrnik> Message-ID: <46FB16DE.7010109@canterbury.ac.nz> Guido van Rossum wrote: > However there's quite a bit of Python 2.x code around that manipulates > *bytes* in the guise of 8-bit strings, and it uses tests like "if s[0] > == 'x': ..." frequently. This can of course be rewritten using a > slice, but not so easily when you're looping over bytes: > > for b in bb: > if b == b'x': ... Would it make anything easier if there were a character literal? for b in bb: if b == c'x': ... where c'x' is another way of writing ord(b'x'). An advantage of this is that it would make Py3k compatible with Pyrex, which already has c'x' literals. 
:-) -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From jyasskin at gmail.com Thu Sep 27 05:44:16 2007 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Wed, 26 Sep 2007 20:44:16 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: <5d44f72f0709262044p6ca07f05o662a89ef4a262775@mail.gmail.com> On 9/26/07, Guido van Rossum wrote: > ... > Indexing > -------- > > **Open Issue:** I'm undecided on whether indexing bytes and buffer > objects should return small ints (like the bytes type in 3.0a1, and > like lists or array.array('B')), or bytes/buffer objects of length 1 > (like the str type). The latter (str-like) approach will ease porting > code from Python 2.x; but it makes it harder to extract values from a > bytes array. Marcin was far more eloquent than I could hope to be, but I too prefer indexing bytes to return a small int. My reasoning is a little more academic: All iterable types except for str get simpler when you iterate over them, so eventually you come to a type that isn't iterable. It would be a shame to extend this misbehavior to bytes if we have a chance to remove it. 
For example, the recursive flatten() function gets more complicated for each type that does this:

    >>> list(flatten.flatten([1, [2, [3, [4, 5]]]]))
    [1, 2, 3, 4, 5]
    >>> list(flatten.flatten([1, [2, [3, ["str", 5]]]]))
    [1, 2, 3, 's', 't', 'r', 5]

If all iterables iterated over a simpler type, we could use:

    def flatten(iterable):
        try:
            for elem in iterable:
                for elem in flatten(elem):
                    yield elem
        except TypeError:
            # Not iterable
            yield iterable

but with strings, you need

    def flatten(iterable):
        try:
            for elem in iterable:
                if isinstance(elem, str) and len(elem) == 1:
                    yield elem
                else:
                    for elem in flatten(elem):
                        yield elem
        except TypeError:
            # Not iterable
            yield iterable

and another special case for each similar type. Comparisons with literal bytes could be done with:

    for b in bb:
        if b == b'x'[0]: ...

or perhaps

        if b == int(b'x'): ...

but you're right that's not ideal. -- Namasté, Jeffrey Yasskin From greg at krypto.org Thu Sep 27 07:06:35 2007 From: greg at krypto.org (Gregory P. Smith) Date: Wed, 26 Sep 2007 22:06:35 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <46FB16DE.7010109@canterbury.ac.nz> References: <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> <1190836856.16322.55.camel@qrnik> <46FB16DE.7010109@canterbury.ac.nz> Message-ID: <52dc1c820709262206o33c0b792ib94156556d0b5bc5@mail.gmail.com> On 9/26/07, Greg Ewing wrote: > Guido van Rossum wrote: > > > However there's quite a bit of Python 2.x code around that manipulates > > *bytes* in the guise of 8-bit strings, and it uses tests like "if s[0] > > == 'x': ..." frequently. This can of course be rewritten using a > > slice, but not so easily when you're looping over bytes: > > > > for b in bb: > > if b == b'x': ... > > Would it make anything easier if there were a character > literal? > > for b in bb: > if b == c'x': ... > > where c'x' is another way of writing ord(b'x').
> > An advantage of this is that it would make Py3k compatible > with Pyrex, which already has c'x' literals. :-) My gut feeling on this is first "neat" but then "eew." There should not be multiple ways to write something so simple and letter'' syntax we already use for b'' s'' u'' r'' and such already annoys me as ugly. However that syntax is already established so maybe it's okay. Should it be i'x' instead of c'x' since the result is an int? i'x' might look odd in some fonts? Writing ord(b'x') is ugly. Would a special case in the b'x' comparison tests that knows how to compare a len==1 bytes (mutable or not) object to an integer be reasonable or just alternately confusing?

    b'x' == ord(b'x')
    b'x' > 65

Could that lead to people wanting to treat len==1 bytes objects like tiny ints and use them in math (do *not* allow that)? And if we did that what would a bytes len!=1 comparison to an integer do? return False as it currently does i'd hope. -gps From greg at krypto.org Thu Sep 27 07:16:16 2007 From: greg at krypto.org (Gregory P. Smith) Date: Wed, 26 Sep 2007 22:16:16 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: <52dc1c820709262216n223b37fak835523027c8877eb@mail.gmail.com> On 9/26/07, Guido van Rossum wrote: > [PEP 3137] > > > **Open Issue:** I'm undecided on whether indexing bytes and buffer > > > objects should return small ints (like the bytes type in 3.0a1, and > > > like lists or array.array('B')), or bytes/buffer objects of length 1 > > > (like the str type). The latter (str-like) approach will ease porting > > > code from Python 2.x; but it makes it harder to extract values from a > > > bytes array. > > On 9/26/07, Brett Cannon wrote: > > How much do you care about making the 2 -> 3 transition easy? If you > > don't go the str way then comparisons like ``bytes_[0] == b"A"`` won't > > work unless you allow comparisons between ints and length 1 > > bytes/buffers.
Extracting a single item is not horrendous if you pass > > it to int(). > > > > Personally I say go with the list-like semantics. Having the > > following code return false seems odd (but not ridiculous) to me:: > > > > stuff = bytes([0, 1]) > > stuff[1] = 42 > > stuff[1] == 42 > > > > So unless int comparisons are allowed I am -0 on the str-like semantics. > > int comparisons would stick out like a sore thumb, especially since > they can only be reasonably made to work on 1-byte strings. > > I'm still undecided (despite Marcin's eloquent argument for ints as > bytes) but I'm open for votes for this case. > looks like my response in the other thread suggesting allowing comparisons of len==1 to ints was already mentioned before me. yay. I'm +0.5 on the idea of allowing the len==1 to int comparison and returning ints for the bytes/buffer indices and iteration. glad to see this as a PEP, it feels more real. :) -gps From g.brandl at gmx.net Thu Sep 27 08:15:00 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Thu, 27 Sep 2007 08:15:00 +0200 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: Alexandre Vassalotti schrieb: >> Indexing >> -------- >> >> **Open Issue:** I'm undecided on whether indexing bytes and buffer >> objects should return small ints (like the bytes type in 3.0a1, and >> like lists or array.array('B')), or bytes/buffer objects of length 1 >> (like the str type). The latter (str-like) approach will ease porting >> code from Python 2.x; but it makes it harder to extract values from a >> bytes array. > > I think indexing a bytes/buffer object should return an int. I find > this behavior > more natural, to me, than using an ord()-like function to extract > values. In fact, I > remarked that the use of ord() is good indicator that bytes should be used > instead of str (look by yourself: grep -R --include='*.py' 'ord(' python25/Lib). 
If b[0] returns an int, you will have to use ord() to compare it to b"a". If it returns b"a", you won't. If you want to compare a byte by ordinal, you can still use b"\xAB", without a function call... Therefore I vote for returning not an int, but I wouldn't object to bytes of length 1 being comparable to ints. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From walter at livinglogic.de Thu Sep 27 09:34:48 2007 From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Thu, 27 Sep 2007 09:34:48 +0200 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: <46FB5D18.1000601@livinglogic.de> Guido van Rossum wrote: > Please comment. > [...] > Conversions between bytes or buffer objects and str objects must > always be explicit, using an encoding. There are two equivalent APIs: > ``str(b, <encoding>[, <errors>])`` is equivalent to > ``b.encode(<encoding>[, <errors>])``, and > ``bytes(s, <encoding>[, <errors>])`` is equivalent to > ``s.decode(<encoding>[, <errors>])``. This looks backwards to me. IMHO it should be: ``str(b, <encoding>[, <errors>])`` is equivalent to ``b.decode(<encoding>[, <errors>])``, and ``bytes(s, <encoding>[, <errors>])`` is equivalent to ``s.encode(<encoding>[, <errors>])``. Servus, Walter From talin at acm.org Thu Sep 27 10:20:14 2007 From: talin at acm.org (Talin) Date: Thu, 27 Sep 2007 01:20:14 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: <46FB67BE.4080502@acm.org> Guido van Rossum wrote: > Thinking through the consequences, and noticing that using the array > module as an ersatz mutable bytes type is far from ideal, and > recalling a proposal put forward earlier by Talin, I floated the > suggestion to have both a mutable and an immutable bytes type.
(This > had been brought up before, but until seeing the evidence of Jeffrey's > patch I wasn't open to the suggestion.) One thing that you may have missed from my proposal is that both 'bytes' and 'buffer' inherit from a common ABC. This ABC defines all of the operations which 'bytes' and 'buffer' have in common. My name for this ABC was 'ByteSequence', but I have no particular attachment to that name. -- Talin From jjb5 at cornell.edu Thu Sep 27 15:56:59 2007 From: jjb5 at cornell.edu (Joel Bender) Date: Thu, 27 Sep 2007 09:56:59 -0400 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: <46FBB6AB.4030509@cornell.edu> > **Open Issue:** I'm undecided on whether indexing bytes and buffer > objects should return small ints (like the bytes type in 3.0a1, and > like lists or array.array('B')), or bytes/buffer objects of length 1 > (like the str type). The latter (str-like) approach will ease porting > code from Python 2.x; but it makes it harder to extract values from a > bytes array. The protocol encoding and decoding world calls these "octet strings" and it makes encoding and decoding discussions a lot easier. ASN.1 calls them that and it's a good thing. In that frame of mind, the first element is an octet, and while Python would not add a new datatype, just like it doesn't have one for character, it would be an unsigned integer in range(256). > Methods > ------- > > The following methods are supported by bytes as well as buffer, with > similar semantics. 
They accept anything that implements the PEP 3118
> buffer API for bytes arguments, and return the same type as the object
> whose method is called ("self"):

First, please enforce that where these functions take a "string" parameter that they require an octet or octet string (I couldn't find what kinds of arguments these functions require in PEP 3118):

    >>> x = b'123*45'
    >>> x.find("*")
    TypeError: expected an octet string or int
    >>> x.find(b'*')
    3
    >>> x.find(42)
    3

Second, Please add slice operations and .append() to mutable octet strings:

    >>> x[:0] = b'>' # start of message
    >>> x.append(sum(x) % 256) # simple checksum

Joel

From alexandre at peadrop.com Thu Sep 27 17:13:38 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 27 Sep 2007 11:13:38 -0400
Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer
In-Reply-To:
References:
Message-ID:

On 9/26/07, Alexandre Vassalotti wrote:
> I think indexing a bytes/buffer object should return an int.
> I find this behavior more natural, to me, than using an
> ord()-like function to extract values.

I didn't know about the length-1 comparison issue when I wrote this. Personally, I wouldn't mind writing either this:

    for b in bytes:
        if b == b'a'[0]:
            pass

or this:

    for b in bytes:
        if b == b'a':
            pass

> In fact, I remarked that the use of ord() is good indicator
> that bytes should be used instead of str (look by yourself:
> grep -R --include='*.py' 'ord(' python25/Lib).

I don't think my argument is still valid. Compare the use of ord() in Python 2.x vs. Python 3.x with:

    % egrep -R --include='*.py' '\
References:
Message-ID: <46FBCE5A.6050503@gmail.com>

Guido van Rossum wrote:
> [PEP 3137]
>>> **Open Issue:** I'm undecided on whether indexing bytes and buffer
>>> objects should return small ints (like the bytes type in 3.0a1, and
>>> like lists or array.array('B')), or bytes/buffer objects of length 1
>>> (like the str type).
The latter (str-like) approach will ease porting >>> code from Python 2.x; but it makes it harder to extract values from a >>> bytes array. > > On 9/26/07, Brett Cannon wrote: >> How much do you care about making the 2 -> 3 transition easy? If you >> don't go the str way then comparisons like ``bytes_[0] == b"A"`` won't >> work unless you allow comparisons between ints and length 1 >> bytes/buffers. Extracting a single item is not horrendous if you pass >> it to int(). >> >> Personally I say go with the list-like semantics. Having the >> following code return false seems odd (but not ridiculous) to me:: >> >> stuff = bytes([0, 1]) >> stuff[1] = 42 >> stuff[1] == 42 >> >> So unless int comparisons are allowed I am -0 on the str-like semantics. > > int comparisons would stick out like a sore thumb, especially since > they can only be reasonably made to work on 1-byte strings. > > I'm still undecided (despite Marcin's eloquent argument for ints as > bytes) but I'm open for votes for this case. Making an iterator over an integer sequence acceptable in the constructor strongly suggests that a byte sequence contains integers between 0 and 255 inclusive, not length 1 byte sequences. And I think that's the cleanest conceptual model for them as well. A byte sequence doesn't contain length 1 byte sequences, it contains bytes (i.e. numbers between 0 and 255 inclusive). For direct comparison, a slice works fine: if data[0:1] == b'x': print "Starts with x!" The only problematic case is cases such as iterating over a byte sequence where we may have an integer and want to compare it to a length 1 byte string. With just the simple conceptual model, we would have to write one of: if val == b'x'[0]: if bytes([val]) == b'x': if val == ord(b'x'): I don't think it's worth breaking the conceptual model of the data type just to reduce the simplest spelling of that comparison by 3 characters. 
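
A minimal sketch of the spellings listed above, assuming the list-like model in which indexing or iterating a byte sequence yields ints in range(256) (this is also how bytes behaves in released Python 3). The variable names are illustrative only:

```python
# Sketch only: assumes indexing bytes yields an int in range(256).
data = b"xyz"

# Direct comparison through a length-1 slice keeps both sides as bytes.
starts_with_x = data[0:1] == b"x"

# Three spellings for comparing a single element (an int) with b"x".
first = data[0]
by_index = first == b"x"[0]        # index the literal to get its int value
by_wrap = bytes([first]) == b"x"   # wrap the int back into a length-1 bytes
by_ord = first == ord(b"x")        # ord() accepts a length-1 bytes object

print(starts_with_x, by_index, by_wrap, by_ord)
```

All four expressions test the same thing; the slice form is the only one that never leaves the bytes type.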
However, I do think it may be worth having an additional iterator on bytes and buffer objects: def fragments(self, size=1): # Could do with a better name for i in range(len(self)): yield self[i:i+size] Then the problematic example could be written: for val in data.fragments(): if val == b'x': print "Found an x!" Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From nas at arctrix.com Thu Sep 27 17:55:30 2007 From: nas at arctrix.com (Neil Schemenauer) Date: Thu, 27 Sep 2007 15:55:30 +0000 (UTC) Subject: [Python-3000] Immutable bytes -- looking for volunteer References: <766a29bd0709200646h1591715fib3344ba561d595cc@mail.gmail.com> <5d44f72f0709201234vec00c4w13d41bf5c4bea8d7@mail.gmail.com> <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> <1190836856.16322.55.camel@qrnik> Message-ID: Guido van Rossum wrote: > However there's quite a bit of Python 2.x code around that manipulates > *bytes* in the guise of 8-bit strings, and it uses tests like "if s[0] >== 'x': ..." frequently. I think it would be useful to do a survey and see how much code would be affected and the effect on readability. Neil From weilawei at gmail.com Thu Sep 27 19:02:05 2007 From: weilawei at gmail.com (Rob Crowther) Date: Thu, 27 Sep 2007 13:02:05 -0400 Subject: [Python-3000] Extension: mpf for GNU MP floating point In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I've uploaded the latest code to http://umass.glexia.net/mpf.tar.bz2 Here's a quick rundown of supported functions and operations. The MPF() constructor accepts a string and an optional keyword argument, prec, specifying precision (as a Long). 
Supported module functions: mpf_add Add two MPF objects mpf_sub Subtract two MPF objects mpf_div Divide two MPF objects mpf_mul Multiply two MPF objects mpf_sqrt Take the square root of an MPF object mpf_neg Get the negative of an MPF object mpf_abs Get the absolute value of an MPF object mpf_pow Raise an MPF object to a power mpf_ceil Round an MPF object to the next highest integer mpf_floor Round an MPF object to the next lower integer mpf_trunc Truncate the decimal portion of an MPF object Operations supported: (note that only MPF objects are supported atm) + - * / ** abs() and - (negative) Attributes: value A tuple of the form (base, sign, whole, decimal) Also, it supports a print() representation. No more finagling with value if you don't want to. Things to come: floor divide, support for other python numbers in the number interface Comments: This wasn't a case of NIH syndrome. I wrote this extension because Decimal simply was not fast enough and the builtin floats didn't provide enough precision for a project. The pre-existing modules were terrible, didn't compile, etc. Necessity, not NIH syndrome. Questions: What features would you like to see? -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFG++HqqR5p8HaX4oURAsROAKCfNMxxoa+i0lFWJZPDWH8/lguT5ACfSl7d eYrrkokoCIjuFmnxTW6f4y4= =cZ8M -----END PGP SIGNATURE----- From jjb5 at cornell.edu Thu Sep 27 19:14:53 2007 From: jjb5 at cornell.edu (Joel Bender) Date: Thu, 27 Sep 2007 13:14:53 -0400 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: <46FBCE5A.6050503@gmail.com> References: <46FBCE5A.6050503@gmail.com> Message-ID: <46FBE50D.303@cornell.edu> > Making an iterator over an integer sequence acceptable in the > constructor strongly suggests that a byte sequence contains integers > between 0 and 255 inclusive, not length 1 byte sequences. > > And I think that's the cleanest conceptual model for them as well. 
A
> byte sequence doesn't contain length 1 byte sequences, it contains bytes
> (i.e. numbers between 0 and 255 inclusive).

Using standards language, an octet string contains octets. Since Python blurs the distinction between characters and strings of length 1, shouldn't it also blur the distinction between octets and octet strings of length 1?

> The only problematic case is cases such as iterating over a byte
> sequence where we may have an integer and want to compare it to a length
> 1 byte string.

Why is it problematic? Why does a programmer have to jump through hoops to compare the two?

    >>> x, y = "abc", "a"
    >>> x[0] == y
    True

And the same should be true for octet strings:

    >>> x, y = b"abc", b"a"
    >>> x[0] == y
    True

> With just the simple conceptual model...

Python doesn't have a simple conceptual model, there is no distinction between strings of length 1 and characters. This makes it pretty clear that octet strings contain octets:

    >>> list(b"1234")
    [49, 50, 51, 52]

And you should be able to check for an octet in an octet string:

    >>> 51 in b"1234"
    True

And if I want to specify the same octet in ASCII, I do this:

    >>> b'3' in b"1234"
    True

> I don't think it's worth breaking the conceptual model of the data type
> just to reduce the simplest spelling of that comparison by 3 characters.

The programmer shouldn't have to go through any one of those gyrations, the only reason why saying chr(51) == '3' is necessary is because characters and integers are different types. But octets and "integers in the range(256)" are exactly the same thing.

    >>> b'3' == 51
    True

The fact that octets can be written as an octet string of length 1 is just a happy coincidence of Python, just like characters.

> for val in data.fragments():
>     if val == b'x':
>         print "Found an x!"

That's a hideous amount of work to just say:

    if b'x' in data:
        print "Found an x!"
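
For comparison, a small sketch of how the containment and element checks discussed above behave under int-returning indexing, which is what released Python 3 bytes objects do. This only illustrates that model and is not a statement of what the PEP draft under discussion specified; the variable names are made up for the example:

```python
# Sketch only: behaviour of bytes with int-returning indexing.
data = b"1234"

elements = list(data)            # the elements are ints, one per byte
int_member = 51 in data          # an int in range(256) can be tested directly
bytes_member = b"3" in data      # a bytes operand is treated as a subsequence
element_is_int = data[2] == 51   # indexing gives the int
slice_is_bytes = data[2:3] == b"3"  # slicing gives a length-1 bytes

print(elements, int_member, bytes_member, element_is_int, slice_is_bytes)
```

Under this model `in` accepts either an int or a bytes operand, so the one-line membership test works without any helper iterator.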
Joel From guido at python.org Thu Sep 27 19:29:24 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 27 Sep 2007 10:29:24 -0700 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: References: <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> <1190836856.16322.55.camel@qrnik> Message-ID: On 9/27/07, Neil Schemenauer wrote: > Guido van Rossum wrote: > > However there's quite a bit of Python 2.x code around that manipulates > > *bytes* in the guise of 8-bit strings, and it uses tests like "if s[0] > >== 'x': ..." frequently. > > I think it would be useful to do a survey and see how much code > would be affected and the effect on readability. Agreed. Anyone interested in researching this? (Though at this point I'm pretty much ready to resolve the issue by choosing ints.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Thu Sep 27 19:39:32 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 27 Sep 2007 10:39:32 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: <46FBCE5A.6050503@gmail.com> References: <46FBCE5A.6050503@gmail.com> Message-ID: I think I've been convinced that b[0] should return an int in range(256). To Joel Bender: octet is not, and never will be a technical term for Python. It is a silly standards body compromise. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Thu Sep 27 19:41:08 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 27 Sep 2007 10:41:08 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: <46FB67BE.4080502@acm.org> References: <46FB67BE.4080502@acm.org> Message-ID: I didn't miss it, and I don't disagree, I just don't think it has much bearing on the discussion (which is whether to go with this proposal at all). 
On 9/27/07, Talin wrote: > Guido van Rossum wrote: > > Thinking through the consequences, and noticing that using the array > > module as an ersatz mutable bytes type is far from ideal, and > > recalling a proposal put forward earlier by Talin, I floated the > > suggestion to have both a mutable and an immutable bytes type. (This > > had been brought up before, but until seeing the evidence of Jeffrey's > > patch I wasn't open to the suggestion.) > > One thing that you may have missed from my proposal is that both 'bytes' > and 'buffer' inherit from a common ABC. This ABC defines all of the > operations which 'bytes' and 'buffer' have in common. My name for this > ABC was 'ByteSequence', but I have no particular attachment to that name. > > -- Talin > > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Thu Sep 27 19:44:51 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 27 Sep 2007 10:44:51 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: <46FBB6AB.4030509@cornell.edu> References: <46FBB6AB.4030509@cornell.edu> Message-ID: On 9/27/07, Joel Bender wrote: > First, please enforce that where these functions take a "string" > parameter that they require an octet or octet string (I couldn't find > what kinds of arguments these functions require in PEP 3118): > > >>> x = b'123*45' > >>> x.find("*") > TypeError: expected an octet string or int > > >>> x.find(b'*') > 3 > >>> x.find(42) > 3 PEP 3118 has nothing to do with this, but one of the last paragraphs of PEP 3137 spells it out: """ The str type currently implements the PEP 3118 buffer API. While this is perhaps occasionally convenient, it is also potentially confusing, because the bytes accessed via the buffer API represent a platform-depending encoding: depending on the platform byte order and a compile-time configuration option, the encoding could be UTF-16-BE, UTF-16-LE, UTF-32-BE, or UTF-32-LE. 
Worse, a different implementation of the str type might completely change the bytes representation, e.g. to UTF-8, or even make it impossible to access the data as a contiguous array of bytes at all. Therefore, the PEP 3118 buffer API will be removed from the str type.
"""

> Second, Please add slice operations and .append() to mutable octet strings:
>
> >>> x[:0] = b'>' # start of message
> >>> x.append(sum(x) % 256) # simple checksum

Slice operations are already in the PEP, under "Slicing":

"""
Slice assignment to a mutable buffer object accept anything that implements the PEP 3118 buffer API, or an iterable of integers in range(256).
"""

I agree that append() and a few other list methods (insert(), extend()) should be added to the buffer type. The PyBytes implementation already has these so it's just a matter of updating the PEP.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com Fri Sep 28 00:52:05 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 27 Sep 2007 18:52:05 -0400
Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer
In-Reply-To:
References:
Message-ID:

On 9/26/07, Guido van Rossum wrote:

> Comparisons
> -----------

> The bytes and buffer types are comparable with each other and
> orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'.

I think bytes (regardless of length) should compare to integers, so that:

    b"" < -sys.maxint < 97 == b'a' < b'aa' < 98

(zero-length buffer < any integer; otherwise compare the number to the first byte, and in case of ties, a BytesSequence of length 2 or more is greater)

I'm not as sure about comparing to floats.

Should they be incomparable to integer sequences?

    (97, 98) != b'ab'
    not (97, 98) < b'ab'
    not (97, 98) > b'ab'

> Bytes and the Str Type
> ----------------------

> ... any attempt to mix bytes (or
> buffer) objects and str objects without specifying an encoding will
> raise a TypeError exception.
This is the case even for simply
> comparing a bytes or buffer object to a str object ...

Should a TypeError be raised as soon as you try to put a bytes and a string in the same dict, even if they don't happen to hash equal?

(I assume that buffer(b'abc') in {} will raise a TypeError, just as list("abc") in {} would.)

> Therefore, support for the PEP 3118
> buffer API will be removed from the str type.

Good; this may be the single biggest aid for separating characters from a particular (bytes) representation.

-jJ

From guido at python.org Fri Sep 28 01:03:03 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 27 Sep 2007 16:03:03 -0700
Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer
In-Reply-To:
References:
Message-ID:

On 9/27/07, Jim Jewett wrote:
> On 9/26/07, Guido van Rossum wrote:
>
> > Comparisons
> > -----------
>
> > The bytes and buffer types are comparable with each other and
> > orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'.
>
> I think bytes (regardless of length) should compare to integers, so that:
>
> b"" < -sys.maxint < 97 == b'a' < b'aa' < 98

Argh. Yuck. I'm not even asking for a use case. No. (Note, I've already decided that b[0] should produce an int, not a 1-size bytes object.)

> (zero-length buffer < any integer; otherwise compare the number to the
> first byte, and in case of ties, a BytesSequence of length 2 or more
> is greater)
>
> I'm not as sure about comparing to floats.
>
> Should they be incomparable to integer sequences?
>
> (97, 98) != b'ab'
> not (97, 98) < b'ab'
> not (97, 98) > b'ab'

No. There are no precedents for supporting sequence comparisons across type boundaries.

> > Bytes and the Str Type
> > ----------------------
>
> > ... any attempt to mix bytes (or
> > buffer) objects and str objects without specifying an encoding will
> > raise a TypeError exception. This is the case even for simply
> > comparing a bytes or buffer object to a str object ...
> > Should a TypeError be raised as soon as you try to put a bytes and a > string in the same dict, even if they don't happen to hash equal? Good idea, if you can figure out a way to implement this efficiently. > (I assume that buffer(b'abc') in {} will raise a TypeError, just as > list("abc") in {} would.) Indeed. It will fail to hash. > > Therefore, support for the PEP 3118 > > buffer API will be removed from the str type. > > Good; this may be the single biggest aid for separting characters from > a particular (bytes) representation. Right. Much better than the 3.0a1 approach of explicitly excluding PyUnicode/str where a sequence of bytes is accepted. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From nick.bastin at gmail.com Fri Sep 28 02:28:47 2007 From: nick.bastin at gmail.com (Nicholas Bastin) Date: Thu, 27 Sep 2007 20:28:47 -0400 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> References: <1189700532.22693.40.camel@qrnik> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> Message-ID: <66d0a6e10709271728i15b31a82s51541816d5c6a66f@mail.gmail.com> On 9/22/07, martin at v.loewis.de wrote: > argc/argv does not exist on Windows (that you seem to see it > anyway is an illusion), and if it did exist, it would be characters, > not bytes. Of course it exists on Windows. argc/argv are defined by the C standard, and say what you will about Windows, but it has a conforming implementation. argv exists on Windows exactly the way the C standard requires it - as an array of null terminated "strings". 
It's left as an exercise to people with more time than I to argue about the definition of the term 'string' in the C standard (since the standard itself is silent on the issue). For what it's worth, the *Python* documentation does NOT guarantee that the items in sys.argv will be strings. -- Nick From greg.ewing at canterbury.ac.nz Fri Sep 28 03:14:35 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 28 Sep 2007 13:14:35 +1200 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <52dc1c820709262206o33c0b792ib94156556d0b5bc5@mail.gmail.com> References: <766a29bd0709201548v77c4bfa5xdae9182c2f3083c3@mail.gmail.com> <5d44f72f0709242309m492cc238k1b81d860c11345ab@mail.gmail.com> <1190836856.16322.55.camel@qrnik> <46FB16DE.7010109@canterbury.ac.nz> <52dc1c820709262206o33c0b792ib94156556d0b5bc5@mail.gmail.com> Message-ID: <46FC557B.3020306@canterbury.ac.nz> Gregory P. Smith wrote: > Would a special case in the b'x' comparison tests that knows how to > compare a len==1 bytes (mutable or not) object to an integer be > reasonable or just alternately confusing? Comparison isn't the only thing you might want to do with bytes. Doing this just for comparison would be rather arbitrary. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From larry at hastings.org Fri Sep 28 03:32:29 2007 From: larry at hastings.org (Larry Hastings) Date: Thu, 27 Sep 2007 18:32:29 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: <46FBCE5A.6050503@gmail.com> Message-ID: <46FC59AD.1000403@hastings.org> Guido van Rossum wrote: > I think I've been convinced that b[0] should return an int in range(256). This made me feel funny. I stared at this for a while: b'a' != b'abcde'[0] ?!? b'a'[0] != b'a' ?!? 
Then I realized that making b[0] return an int simply makes bytes objects behave less like strings, and more like tuples of integers:

    ( 97, ) != ( 97, 98, 99, 100, 101 )
    ( 97, )[0] != ( 97, )

Strings have always been the odd man out; no other sequence type has this individual-elements-are-implicitly-sequences-too behavior. So now bytes are straddling the difference between strings and the other sequence types:

    tuple:
        to construct one with multiple elements: ( 97, 98, 99, 100, 101 )
        elements aren't implicitly sequences: ( 97, ) != ( 97, 98, 99 )[0]
    list:
        to construct one with multiple elements: [ 97, 98, 99, 100, 101 ]
        elements aren't implicitly sequences: [ 97, ] != [ 97, 98, 99 ][0]
    bytes:
        to construct one with multiple elements: b"abcde"
        elements aren't implicitly sequences: b"a" != b"abcde"[0]
    str:
        to construct one with multiple elements: "abcde"
        elements are implicitly sequences: "a" == "abcde"[0]

So what should the bytes constructor take? We all already know it should *not* take a string. (You must explicitly encode a string to get a bytes object.) Clearly it should take an int in the proper range:

    bytes(97) == b'a'

and a bytes object:

    bytes(b'a') == b'a'
    bytes(b'abcde') == b'abcde'

Like the tuple and list constructors, I think it should also attempt to cast iterables into its type. So if you pass in an iterable, and the iterable contains nothing but ints in the proper range, it should produce a bytes object:

    bytes( [ 97, 98, 99, 100, 101] ) == b'abcde'

Sorry if this is obvious to everybody; thinking through it helped me, at least.

/larry/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070927/050eed54/attachment.htm From greg.ewing at canterbury.ac.nz Fri Sep 28 03:37:50 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 28 Sep 2007 13:37:50 +1200 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: References: Message-ID: <46FC5AEE.1060101@canterbury.ac.nz> Alexandre Vassalotti wrote: > Personally, I wouldn't mind writing either this: > > for b in bytes: > if b == b'a'[0]: > pass Well, I would mind, because it's needlessly verbose and inefficient. I still think that c'x' is the least bad solution. As long as we're wanting to write arrays of integers by means of their corresponding ASCII characters, it makes sense to be able to do that for a single integer as well. So my current vote is: a) Indexing bytes or buffer gives an integer b) Have a c'x' notation for expressing a single integer -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Fri Sep 28 03:39:37 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 28 Sep 2007 13:39:37 +1200 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: <46FBCE5A.6050503@gmail.com> References: <46FBCE5A.6050503@gmail.com> Message-ID: <46FC5B59.9050807@canterbury.ac.nz> Nick Coghlan wrote: > However, I do think it may be worth having an additional iterator on > bytes and buffer objects: > > def fragments(self, size=1): # Could do with a better name I suggest dice(). :-) -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) 
| greg.ewing at canterbury.ac.nz +--------------------------------------+ From victor.stinner at haypocalc.com Fri Sep 28 04:29:39 2007 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Fri, 28 Sep 2007 04:29:39 +0200 Subject: [Python-3000] Python, int/long and GMP Message-ID: <200709280429.39396.victor.stinner@haypocalc.com> Hi, I read some days ago a discussion about GMP (license). I wanted to know if GMP is really better than current Python int/long implementation. So I wrote a patch for python 3000 subversion (rev. 58277). I changed long type structure with: struct _longobject { PyObject_HEAD mpz_t number; }; False is the number 0 and True is 1. marshal module is broken, my patch just makes gcc happy. The most important point is the pystone results: original python: 32573.3 pystones/second python with GMP: 26666.7 pystones/second So I can now say that GMP is much slower for Python pystone usage of integers. I use 32-bit CPU (Celeron M 420 at 1600 MHz on Ubuntu), so most integers are just one CPU word (and not a GMP complex structure). Victor Stinner http://hachoir.org/ -------------- next part -------------- A non-text attachment was scrubbed... Name: working-gmp.patch Type: text/x-diff Size: 103033 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070928/18a22c4e/attachment-0001.patch -------------- next part -------------- A non-text attachment was scrubbed... 
Name: longobject.c Type: text/x-csrc Size: 28488 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070928/18a22c4e/attachment-0001.c From greg.ewing at canterbury.ac.nz Fri Sep 28 04:56:12 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 28 Sep 2007 14:56:12 +1200 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: <46FC59AD.1000403@hastings.org> References: <46FBCE5A.6050503@gmail.com> <46FC59AD.1000403@hastings.org> Message-ID: <46FC6D4C.9080100@canterbury.ac.nz> Larry Hastings wrote: > So now bytes are straddling the difference between strings and the other > mapping types: I think the main reason it seems that way is that we're using a string-like notation for a bytes literal. With b[i] returning an int, it really behaves just like any other sequence. > So what should the bytes constructor take? ... Clearly it should > take an int in the proper range: > > bytes(97) == b'a' That should be bytes([97]) if it's to be consistent with other sequence constructors: >>> list(97) Traceback (most recent call last): File "", line 1, in ? TypeError: iteration over non-sequence -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) 
| greg.ewing at canterbury.ac.nz +--------------------------------------+ From martin at v.loewis.de Fri Sep 28 06:40:44 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 28 Sep 2007 06:40:44 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <66d0a6e10709271728i15b31a82s51541816d5c6a66f@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <87y7f7ozfq.fsf@uwakimon.sk.tsukuba.ac.jp> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> <66d0a6e10709271728i15b31a82s51541816d5c6a66f@mail.gmail.com> Message-ID: <46FC85CC.4030806@v.loewis.de> Nicholas Bastin schrieb: > On 9/22/07, martin at v.loewis.de wrote: >> argc/argv does not exist on Windows (that you seem to see it >> anyway is an illusion), and if it did exist, it would be characters, >> not bytes. > > Of course it exists on Windows. argc/argv are defined by the C > standard, and say what you will about Windows, but it has a conforming > implementation. It doesn't. Microsoft has a conforming implementation of C for Windows (Visual C), but Windows does not. 
Regards, Martin From apt.shansen at gmail.com Fri Sep 28 07:00:57 2007 From: apt.shansen at gmail.com (Stephen Hansen) Date: Thu, 27 Sep 2007 22:00:57 -0700 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <66d0a6e10709271728i15b31a82s51541816d5c6a66f@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <1190070414.20673.12.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> <66d0a6e10709271728i15b31a82s51541816d5c6a66f@mail.gmail.com> Message-ID: <7a9c25c20709272200i5856753ey8fb00c7a2d834057@mail.gmail.com> On 9/27/07, Nicholas Bastin wrote: > > On 9/22/07, martin at v.loewis.de wrote: > > argc/argv does not exist on Windows (that you seem to see it > > anyway is an illusion), and if it did exist, it would be characters, > > not bytes. > > Of course it exists on Windows. argc/argv are defined by the C > standard, and say what you will about Windows, but it has a conforming > implementation. argv exists on Windows exactly the way the C standard > requires it - as an array of null terminated "strings". It's left as > an exercise to people with more time than I to argue about the > definition of the term 'string' in the C standard (since the standard > itself is silent on the issue). The entry point of a Windows application is WinMain, not main; you can create a console-only standard C application if you'd like, but its not a Windows program. Python apps are Windows programs even if they have a console attached. And the WinMain function passes the entire command line as a single char* with no breaking or parsing of any kind. --S -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/python-3000/attachments/20070927/e9170a20/attachment.htm

From nick.bastin at gmail.com Fri Sep 28 08:21:18 2007
From: nick.bastin at gmail.com (Nicholas Bastin)
Date: Fri, 28 Sep 2007 02:21:18 -0400
Subject: [Python-3000] Unicode and OS strings
In-Reply-To: <46FC85CC.4030806@v.loewis.de>
References: <1189700532.22693.40.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> <66d0a6e10709271728i15b31a82s51541816d5c6a66f@mail.gmail.com> <46FC85CC.4030806@v.loewis.de>
Message-ID: <66d0a6e10709272321v52063cdcldeaac952c4ef4f28@mail.gmail.com>

On 9/28/07, "Martin v. Löwis" wrote:
> Nicholas Bastin schrieb:
> > On 9/22/07, martin at v.loewis.de wrote:
> >> argc/argv does not exist on Windows (that you seem to see it
> >> anyway is an illusion), and if it did exist, it would be characters,
> >> not bytes.
> >
> > Of course it exists on Windows. argc/argv are defined by the C
> > standard, and say what you will about Windows, but it has a conforming
> > implementation.
>
> It doesn't. Microsoft has a conforming implementation of C for Windows
> (Visual C), but Windows does not.

msvcrt ships with the operating system - I'd call that a conforming implementation. Programs running in the standard C runtime are just as much applications as programs using the Win32 API in advapi32. But we have drifted far from the topic at hand, since this is obviously a misunderstanding on whether Windows was used to refer to the OS or the API.

I still regard handling argv as anything other than the raw bytes that come from the host as bad. argv *means* something - regardless of whether WinMain provides it or not. If we're going to call something sys.argv, then presumably that was done because there was a conventionally accepted meaning to it, and I would argue that meaning comes from standard C.
If it were called sys.lpCmdLine, then I'd say you have a point, but it isn't, and to the degree that it isn't, I believe that we should emulate the standard argv behaviour (especially since lpCmdLine doesn't include the program name). Of course, on Win32 this entire issue is moot, given the availability of CommandLineToArgvW(), which would allow you to provide a nice convenient unicode argv. However, since not all supported platforms provide us this functionality, I would suggest we store the result of any effort to transform argv into unicode into some other well named member of sys (or make it a function call so it can be computed on demand if you don't want it in the first place). Changing the current meaning of argv will break applications which already handle this problem, and while I realize that that's not a showstopper for Python 3k, I don't see any particular benefit to introducing this inconsistency, rather than adding something more defined, like sys.arguments. -- Nick From foom at fuhm.net Fri Sep 28 09:53:28 2007 From: foom at fuhm.net (James Y Knight) Date: Fri, 28 Sep 2007 03:53:28 -0400 Subject: [Python-3000] Python, int/long and GMP In-Reply-To: <200709280429.39396.victor.stinner@haypocalc.com> References: <200709280429.39396.victor.stinner@haypocalc.com> Message-ID: <400ED549-B7C7-4A3D-9343-826B54E7B2BB@fuhm.net> On Sep 27, 2007, at 10:29 PM, Victor Stinner wrote: > Hi, > > I read some days ago a discussion about GMP (license). I wanted to > know if GMP > is really better than current Python int/long implementation. So I > wrote a > patch for python 3000 subversion (rev. 58277). > > I changed long type structure with: > > struct _longobject { > PyObject_HEAD > mpz_t number; > }; > So I can now say that GMP is much slower for Python pystone usage > of integers. > I use 32-bit CPU (Celeron M 420 at 1600 MHz on Ubuntu), so most > integers are > just one CPU word (and not a GMP complex structure). GMP doesn't have a concept of a non-complex structure. 
It always allocates memory. If you want to have a single CPU word integer, you have to provide that outside of GMP. GMP's API is really designed for allocating an integer object and reusing it for a number of operations. You can generally get away with not doing that without destroying performance, but certainly not on small integers.

Here's the init function, just for illustration:

    mpz_init (mpz_ptr x)
    {
        x->_mp_alloc = 1;
        x->_mp_d = (mp_ptr) (*__gmp_allocate_func) (BYTES_PER_MP_LIMB);
        x->_mp_size = 0;
    }

So replacing py3's integers with gmp as you did is not really fair. If you're going to use GMP in an immutable integer scenario, you really need to have a machine-word-int implementation as well.

So, if you want to actually give GMP a fair trial, I'd suggest trying to integrate it with python 2.X, replacing longobject, leaving intobject as is.

Also, removing python's caching of integers < 100 as you did in this patch is surely a *huge* killer of performance.

James

From jjb5 at cornell.edu Fri Sep 28 15:58:38 2007
From: jjb5 at cornell.edu (Joel Bender)
Date: Fri, 28 Sep 2007 09:58:38 -0400
Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer
In-Reply-To:
References:
Message-ID: <46FD088E.6050404@cornell.edu>

Should this PEP include changes to the struct module, or should it be a separate PEP?

I would like struct.pack() to return bytes and struct.unpack() to accept bytes or buffers but not strings. The 's' and 'p' format specifier should refer to bytes and not strings.

In protocol encoding and decoding, "unpack and strip off the front" and "pack and append" are very common operations. I would also like to have buffer.unpack(fmt) be the former and buffer.pack(fmt, v1, v2, ...) be the latter.
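The two operations named above ("unpack and strip off the front" and "pack and append") can be sketched as free functions over a mutable byte buffer. This is only an illustration under stated assumptions, not something proposed in this thread: the helper names pack_append and unpack_strip are hypothetical, and the mutable type used is bytearray, the name under which PEP 3137's mutable "buffer" type was eventually released.

```python
import struct


def pack_append(buf, fmt, *values):
    """Pack values according to fmt and append them to buf in place.

    buf is assumed to be a bytearray (hypothetical helper, not a real API).
    """
    buf.extend(struct.pack(fmt, *values))


def unpack_strip(buf, fmt):
    """Unpack fmt from the front of buf and remove the consumed bytes."""
    size = struct.calcsize(fmt)
    if len(buf) < size:
        raise ValueError("buffer holds fewer bytes than fmt requires")
    values = struct.unpack(fmt, bytes(buf[:size]))
    del buf[:size]  # strip the consumed prefix from the mutable buffer
    return values


message = bytearray()
pack_append(message, "!HB", 258, 7)   # network byte order: 2-byte int, 1 byte
header = unpack_strip(message, "!H")  # consumes the first two bytes only
```

Keeping these as plain functions leaves the byte type itself independent of the struct module.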
Joel From guido at python.org Fri Sep 28 16:47:45 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 28 Sep 2007 07:47:45 -0700 Subject: [Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer In-Reply-To: <46FD088E.6050404@cornell.edu> References: <46FD088E.6050404@cornell.edu> Message-ID: On 9/28/07, Joel Bender wrote: > Should this PEP include changes to the struct module, or should it be a > separate PEP? Neither. > I would like struct.pack() to return bytes and struct.unpack() to accept > bytes or buffers but not strings. This is already the case in 3.0a1. (Don't people try stuff out before posting?) > The 's' and 'p' format specifier should refer to bytes and not strings. They currently allow both, which I think is fine. > In protocol encoding and decoding, "unpack and strip off the front" and > "pack and append" are very common operations. I would also like to have > buffer.unpack(fmt) be the former and buffer.pack(fmt, v1, v2, ...) be > the latter. IMO that would tie the buffer type too close to the struct module. You could easily write a wrapper that does this though. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From weilawei at gmail.com Fri Sep 28 18:32:52 2007 From: weilawei at gmail.com (Rob Crowther) Date: Fri, 28 Sep 2007 12:32:52 -0400 Subject: [Python-3000] Extension: mpf for GNU MP floating point In-Reply-To: <20070928122915.798d00e1.weilawei@gmail.com> References: <20070925094601.c151245c.weilawei@gmail.com> <20070927125557.a5895341.weilawei@gmail.com> <20070928122915.798d00e1.weilawei@gmail.com> Message-ID: <20070928123252.5a0692b0.weilawei@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Another day, another update. Latest code: http://umass.glexia.net/mpf.tar.bz2 There's been a couple minor changes externally: a) MPF() now takes a float or integer argument because mpf_set_str is just wacky and I haven't gotten it working properly yet. 
This does somewhat limit the values you can pass to it, but strings will be added back later on. At that point, you'll have a choice of initializing it with a tuple (base, sign, whole, decimal), a string, a float, or an integer. b) As a side effect of this, roundtripping doesn't work. Not that it ever worked. But it's a bit further away right now. Externally, the MPF_get function was rewritten from scratch (for the fourth time). MPF_init was changed to use mpf_set_d instead of mpf_set_str because... well, mpf_set_str is too wacky and unpredictable at the moment. I'm sorting that out as we speak. If you really want to see lots of internal information, use the build_debug.sh script instead of setup.py. (Note that the directories already need to be in place to compile this way.) There's a test program which you can compile with the command: gcc -o test test.c -lgmp It's my scratchpad for working out new ideas before integrating them into the extension. Currently, it contains a barebones version of MPF_get and a slew of test cases, soon to be ported to Python. YES, there WILL be a test suite. Question -- Does anyone know of a decent place to host this project? I'm really lazy about updating project sites, so I'd like something simple offering storage space and a bug tracker. I don't need SVN. I use git on my development box, so that would be a bonus if someone knew of free project hosting with git. 
Rob -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFG/Sy0qR5p8HaX4oURAhznAJ9a8N6mgCHXcGph09KhjXu/kYPnFgCeOKLH ngznr86SynMbF0wQep3GDB0= =6Pun -----END PGP SIGNATURE----- From rhamph at gmail.com Fri Sep 28 18:44:43 2007 From: rhamph at gmail.com (Adam Olsen) Date: Fri, 28 Sep 2007 10:44:43 -0600 Subject: [Python-3000] Python, int/long and GMP In-Reply-To: <400ED549-B7C7-4A3D-9343-826B54E7B2BB@fuhm.net> References: <200709280429.39396.victor.stinner@haypocalc.com> <400ED549-B7C7-4A3D-9343-826B54E7B2BB@fuhm.net> Message-ID: On 9/28/07, James Y Knight wrote: > > On Sep 27, 2007, at 10:29 PM, Victor Stinner wrote: > > > Hi, > > > > I read some days ago a discussion about GMP (license). I wanted to > > know if GMP > > is really better than current Python int/long implementation. So I > > wrote a > > patch for python 3000 subversion (rev. 58277). > > > > I changed long type structure with: > > > > struct _longobject { > > PyObject_HEAD > > mpz_t number; > > }; > > > So I can now say that GMP is much slower for Python pystone usage > > of integers. > > I use 32-bit CPU (Celeron M 420 at 1600 MHz on Ubuntu), so most > > integers are > > just one CPU word (and not a GMP complex structure). > > GMP doesn't have a concept of a non-complex structure. It always > allocates memory. If you want to have a single CPU word integer, you > have to provide that outside of GMP. GMP's API is really designed for > allocating an integer object and reusing it for a number of > operations. You can generally get away with not doing that without > destroying performance, but certainly not on small integers. > > Here's the init function, just for illustration: > mpz_init (mpz_ptr x) > { > x->_mp_alloc = 1; > x->_mp_d = (mp_ptr) (*__gmp_allocate_func) (BYTES_PER_MP_LIMB); > x->_mp_size = 0; > } > > So replacing py3's integers with gmp as you did is not really fair. 
> If you're going to use GMP in an immutable integer scenario, you
> really need to have a machine-word-int implementation as well.
>
> So, if you want to actually give GMP a fair trial, I'd suggest trying
> to integrate it with python 2.X, replacing longobject, leaving
> intobject as is.
>
> Also, removing python's caching of integers < 100 as you did in this
> patch is surely a *huge* killer of performance.

I can vouch for that. Allocation can easily dominate performance. It invalidates the rest of the benchmark.

--
Adam Olsen, aka Rhamphoryncus

From victor.stinner at haypocalc.com Fri Sep 28 18:58:29 2007
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Fri, 28 Sep 2007 18:58:29 +0200
Subject: [Python-3000] Python, int/long and GMP
In-Reply-To:
References: <200709280429.39396.victor.stinner@haypocalc.com> <400ED549-B7C7-4A3D-9343-826B54E7B2BB@fuhm.net>
Message-ID: <200709281858.29705.victor.stinner@haypocalc.com>

On Friday 28 September 2007 18:44:43 you wrote:
> > GMP doesn't have a concept of a non-complex structure. It always
> > allocates memory. (...)

I don't know GMP internals. I thought that GMP used a hack for small integers.

> > Also, removing python's caching of integers < 100 as you did in this
> > patch is surely a *huge* killer of performance.

Oh yes, I removed the cache because I wanted to quickly get a working Python version. It took me two weeks to write the patch. It's not easy to get into CPython source code! And integer is one of the most important types!

> I can vouch for that. Allocation can easily dominate performance. It
> invalidates the rest of the benchmark.

I may also use Python's garbage collector for GMP memory allocations, since GMP allows me to use my own memory allocation functions.
GMP also has its own reference counter mechanism :-/ Victor -- Victor Stinner aka haypo http://www.haypocalc.com/blog/ From jimjjewett at gmail.com Fri Sep 28 19:23:40 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 28 Sep 2007 13:23:40 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer) Message-ID: On 9/27/07, Guido van Rossum wrote: > On 9/27/07, Jim Jewett wrote: > > Should a TypeError be raised as soon as you try to put a bytes and a > > string in the same dict, even if they don't happen to hash equal? > Good idea, if you can figure out a way to implement this efficiently. In news that may surprise no one, there were corner cases... (1) Does it have to raise the TypeError eagerly in all cases, or is it OK to do so only when its easy? For example, would it be OK to stop verifying once some keys have been deleted? (2) Is the restriction "sticky" for a dict, or based on current contents? Current contents makes sense, but ... If code clears an existing dict rather than creating a new one, then that specific dict is probably a communication channel, and the API should specify whether it takes bytes or characters. -jJ From guido at python.org Fri Sep 28 19:36:57 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 28 Sep 2007 10:36:57 -0700 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer) In-Reply-To: References: Message-ID: Well, maybe this is a good enough argument to give up. If the best we can say is that having a bytes and a str as keys *may* cause a TypeError on lookups, I'm not sure it is worth it to try to raise the probability that it'll actually be raised... --Guido On 9/28/07, Jim Jewett wrote: > On 9/27/07, Guido van Rossum wrote: > > On 9/27/07, Jim Jewett wrote: > > > > Should a TypeError be raised as soon as you try to put a bytes and a > > > string in the same dict, even if they don't happen to hash equal? 
> > > Good idea, if you can figure out a way to implement this efficiently. > > In news that may surprise no one, there were corner cases... > > (1) Does it have to raise the TypeError eagerly in all cases, or is > it OK to do so only when its easy? > > For example, would it be OK to stop verifying once some keys have been deleted? > > (2) Is the restriction "sticky" for a dict, or based on current contents? > > Current contents makes sense, but ... > > If code clears an existing dict rather than creating a new one, then > that specific dict is probably a communication channel, and the API > should specify whether it takes bytes or characters. > > -jJ > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From jimjjewett at gmail.com Fri Sep 28 20:33:04 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 28 Sep 2007 14:33:04 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer) In-Reply-To: References: Message-ID: On 9/28/07, Guido van Rossum wrote: > Well, maybe this is a good enough argument to give up. Not quite yet... I still see two potential solutions, depending on whether or not the exclusion is sticky. Details below. ========= If the exclusion is sticky, then add (implicit) flags saying "seen a string" and "seen a byte". Similar logic is already there, in that "seen a non-string" replaces the lookdict function. The most common case (exact unicode in an exact unicode-only dict) would stay the same as today, but the other cases would have some extra type-checking. ========= If the exclusion is based on current contents, then we can add a count; my concern is that keeping this efficient may be too hacky. It looks like there is room for exactly one more pointer (-sized count variable) before small dicts bleed to a third cacheline. Because of this guard, bytes and strings can never appear in the same dict, so at least one count is zero. 
Because dict entries are 3 pointers long, there can never be more than (Py_ssize_t / 2) entries, so the sign bit can be repurposed to indicate whether the count refers to strings or bytes. (count==0 means no bytes or strings; count==5 means 5 string keys; count==-32 means 32 bytes keys.) -jJ From adam at hupp.org Fri Sep 28 20:34:36 2007 From: adam at hupp.org (Adam Hupp) Date: Fri, 28 Sep 2007 14:34:36 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer) In-Reply-To: References: Message-ID: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> On 9/28/07, Guido van Rossum wrote: > Well, maybe this is a good enough argument to give up. If the best we > can say is that having a bytes and a str as keys *may* cause a > TypeError on lookups, I'm not sure it is worth it to try to raise the > probability that it'll actually be raised... Would it make sense to have dict ignore TypeError on lookups? Alternatively, the byte/str comparison could throw a specific subclass of TypeError that dict ignored e.g. IncompatibleComparisonError. -- Adam Hupp | http://hupp.org/adam/ From guido at python.org Fri Sep 28 20:40:40 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 28 Sep 2007 11:40:40 -0700 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer) In-Reply-To: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> Message-ID: On 9/28/07, Adam Hupp wrote: > On 9/28/07, Guido van Rossum wrote: > > Well, maybe this is a good enough argument to give up. If the best we > > can say is that having a bytes and a str as keys *may* cause a > > TypeError on lookups, I'm not sure it is worth it to try to raise the > > probability that it'll actually be raised... > > Would it make sense to have dict ignore TypeError on lookups? Certainly not. 
> Alternatively, the byte/str comparison could throw a specific subclass > of TypeError that dict ignored e.g. IncompatibleComparisonError. Well, if we wanted "x" and b"x" to compare unequal instead of raising an exception, we could just define it that way (it was that way until just before 3.0a1). But we're explicitly defining it to raise a TypeError so as to catch buggy code. I think trying to fix dict lookup so that it, and only it, treats this as unequal, would be adding too many quirks. We could choose to kill the TypeError altogether. If we keep it, we should consistently let it raise TypeError everywhere. The question is whether it's worth the effort to raise TypeError when the *potential* exists that a certain hash sequence *could* raise this TypeError. I'm less and less convinced -- after all, we're making the exception only for bytes/str, not for other types that might raise TypeError upon comparison. So, I think that after all this was a bad idea. Sorry. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From p.f.moore at gmail.com Fri Sep 28 20:59:56 2007 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 28 Sep 2007 19:59:56 +0100 Subject: [Python-3000] Immutable bytes -- looking for volunteer In-Reply-To: <46F9AE94.7010703@canterbury.ac.nz> References: <79990c6b0709250039q3cf5b6a5j3a37797b84fe43d3@mail.gmail.com> <46F9AE94.7010703@canterbury.ac.nz> Message-ID: <79990c6b0709281159u79a4aae1u844549d33358ac01@mail.gmail.com> On 26/09/2007, Greg Ewing wrote: > Paul Moore wrote: > > The array module is built in, so it's > > written in C - what needs to be exposed to qualify as a "C API"? > > I think he's referring to the fact that there is no > public array.h header file provided that lays out the > C-level details. In fact, last time I looked I don't > think there was any array.h file at all, it was all > inside array.c. Thanks. I see what you mean. 
Given the way the discussion is currently going, I think I'll hold off doing anything just yet, but I'll keep it in mind. Paul From jimjjewett at gmail.com Fri Sep 28 21:02:59 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 28 Sep 2007 15:02:59 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer) In-Reply-To: References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> Message-ID: On 9/28/07, Guido van Rossum wrote: > The question is whether it's worth the effort to raise TypeError when > the *potential* exists that a certain hash sequence *could* raise this > TypeError. Bugs depending on the hash sequence are exactly the sort of thing that doesn't get found by tests, and can't be easily reproduced. > I'm less and less convinced -- after all, we're making the > exception only for bytes/str, not for other types that might raise > TypeError upon comparison. What would those other types be? As you point out in the "Bytes and the Str Type" section, this exception violates the "general rule that comparing objects of different types for equality should just return False". In Py3, there are plenty of types that aren't orderable, but I still can't think of any[*] others that raise an exception when tested just for equality. [*] It is of course possible to write a malicious class, and it is possible to write a buggy class. Even then, most buggy classes fail when compared to anything from any other class, rather than just for specific banned comparisons. 
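The hash dependence described above can be made concrete with a stand-in key class. What follows is a minimal sketch, assuming the CPython dict behaviour of calling == only on keys whose hashes match. StrictKey and lookup_outcome are hypothetical names, and StrictKey merely imitates the proposed bytes/str TypeError with explicitly chosen hash values rather than using the real types.

```python
class StrictKey:
    """Hypothetical key whose == raises TypeError across different 'kinds'."""

    def __init__(self, kind, value, hash_value):
        self.kind = kind
        self.value = value
        self._hash = hash_value  # chosen by the caller to force or avoid collisions

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, StrictKey):
            return NotImplemented
        if self.kind != other.kind:
            raise TypeError("cannot compare %s key with %s key"
                            % (self.kind, other.kind))
        return self.value == other.value


def lookup_outcome(stored, probe):
    """Look probe up in a dict holding only stored and report what happened."""
    table = {stored: "payload"}
    try:
        return table.get(probe, "missing")
    except TypeError:
        return "TypeError"


stored = StrictKey("str", "x", hash_value=1)
# Different hash: the keys are never compared, so the lookup quietly misses.
quiet = lookup_outcome(stored, StrictKey("bytes", "x", hash_value=2))
# Same hash: the dict must call ==, and the TypeError surfaces.
loud = lookup_outcome(stored, StrictKey("bytes", "x", hash_value=1))
```

The same pair of keys thus either misses silently or raises, depending only on whether the hashes happen to coincide.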
-jJ From ntoronto at cs.byu.edu Fri Sep 28 21:46:30 2007 From: ntoronto at cs.byu.edu (Neil Toronto) Date: Fri, 28 Sep 2007 13:46:30 -0600 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer) In-Reply-To: References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> Message-ID: <46FD5A16.7030004@cs.byu.edu> Jim Jewett wrote: > On 9/28/07, Guido van Rossum wrote: > > >> The question is whether it's worth the effort to raise TypeError when >> the *potential* exists that a certain hash sequence *could* raise this >> TypeError. >> > > Bugs depending on the hash sequence are exactly the sort of thing that > doesn't get found by tests, and can't be easily reproduced. > Not that my opinion counts for much because I mostly just lurk, but I have to agree. A one-in-a-million Heisenbug (Mandelbug?) is exactly the sort of thing that breaks production systems but nobody can figure out how to fix, and causes management to lose faith in a language or in their developers. >> I'm less and less convinced -- after all, we're making the >> exception only for bytes/str, not for other types that might raise >> TypeError upon comparison. >> > > What would those other types be? > > As you point out in the "Bytes and the Str Type" section, this > exception violates the "general rule that comparing objects of > different types for equality > should just return False". > So there's a special case comparison that's intended to protect users from themselves - to keep them from comparing bytes and strings without specifying an encoding. Then there has to be another potentially performance-munching special case to save them from an essentially random exception that could occur because of this extra protection - and this special-casing can only be guaranteed for built-in types, not custom ones. It's too easy to forget to consider it. Is the only case they need to be saved from the 'if == ' case? 
Shouldn't it be perfectly fine for a dict to hold a str and a bytes? If I recall correctly, the decision to raise a TypeError on str/bytes comparison was made before bytes became immutable and could be put into dicts. Maybe the *extra protection* isn't worth the effort. How about a warning instead of a TypeError? Can the bytecode interpreter do something for simple '==' cases? Are there other alternatives? Neil From martin at v.loewis.de Fri Sep 28 23:00:29 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 28 Sep 2007 23:00:29 +0200 Subject: [Python-3000] Unicode and OS strings In-Reply-To: <66d0a6e10709272321v52063cdcldeaac952c4ef4f28@mail.gmail.com> References: <1189700532.22693.40.camel@qrnik> <18159.23173.178488.190409@uwakimon.sk.tsukuba.ac.jp> <32C3C54C-18CC-4171-8A59-06170B5CFCD6@fuhm.net> <79990c6b0709210741y465c016pbaefb04c2c2f3eee@mail.gmail.com> <20070922074840.pwm2kfr2dc4gcgwg@webmail.df.eu> <66d0a6e10709271728i15b31a82s51541816d5c6a66f@mail.gmail.com> <46FC85CC.4030806@v.loewis.de> <66d0a6e10709272321v52063cdcldeaac952c4ef4f28@mail.gmail.com> Message-ID: <46FD6B6D.6080905@v.loewis.de> > msvcrt ships with the operating system - I'd call that a conforming > implementation. Yes, but it's not part of the operating system interface; Microsoft documents it as "for future use only by system-level components". > I still regard handling argv as anything other the raw bytes that come > from the host as bad. The point is that you cannot use "raw bytes" in Win32, not without potential loss of data. If you pass arbitrary bytes to os.spawn*, they get converted to Unicode, and the resulting Unicode command line gets passed to the child process. So the *native* API is Unicode, not arbitrary bytes - there is also _wmain supported by the C library, if you want broken down command line arguments, but without character set conversions. 
> If we're going to call something
> sys.argv, then presumably that was done because there was a
> conventionally accepted meaning to it, and I would argue that meaning
> comes from standard C.

Yes, but also in C, the meaning is "characters", not "bytes". ISO C 99 5.1.2.2.1p2 specifies they are *strings* passed by the host environment, and elaborates that if the host environment is not capable of supplying mixed-case strings, it should convert them all into lower case. So the intention clearly is that argv[] is text, not bytes.

Regards,
Martin

From greg.ewing at canterbury.ac.nz Sat Sep 29 01:48:33 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 29 Sep 2007 11:48:33 +1200
Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer)
In-Reply-To:
References:
Message-ID: <46FD92D1.6020706@canterbury.ac.nz>

Jim Jewett wrote:
> If code clears an existing dict rather than creating a new one, then
> that specific dict is probably a communication channel, and the API
> should specify whether it takes bytes or characters.

This suggests it might be simpler to have normal dicts refuse to accept bytes at all, and have another type bytedict for that purpose.

--
Greg

From greg.ewing at canterbury.ac.nz Sat Sep 29 01:57:10 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 29 Sep 2007 11:57:10 +1200
Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer)
In-Reply-To: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com>
References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com>
Message-ID: <46FD94D6.6040103@canterbury.ac.nz>

Adam Hupp wrote:
> Would it make sense to have dict ignore TypeError on lookups?
> Alternatively, the byte/str comparison could throw a specific subclass
> of TypeError that dict ignored e.g. IncompatibleComparisonError.
Presumably the reason for making strings and bytes uncomparable in the first place is to catch errors due to unwittingly mixing strings and bytes. Having dicts ignore the exception would partly defeat that. I'm not all that comfortable with the idea of having things that can't even be compared for equality. Is this meant to be a permanent feature of the language, or just something to help people get over the transition? Could it be dropped once everyone has got over the shock of having strings and bytes being different things? -- Greg From tjreedy at udel.edu Sat Sep 29 04:27:29 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 28 Sep 2007 22:27:29 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> Message-ID: "Guido van Rossum" wrote in message news:ca471dc20709281140q2ef95c2ap8bbc7b7d3d46ebc0 at mail.gmail.com... | | Well, if we wanted "x" and b"x" to compare unequal instead of raising | an exception, we could just define it that way (it was that way until | just before 3.0a1). But we're explicitly defining it to raise a | TypeError so as to catch buggy code. I think trying to fix dict lookup | so that it, and only it, treats this as unequal, would be adding too | many quirks. | | We could choose to kill the TypeError altogether. If we keep it, we | should consistently let it raise TypeError everywhere. | | The question is whether it's worth the effort to raise TypeError when | the *potential* exists that a certain hash sequence *could* raise this | TypeError. I'm less and less convinced -- after all, we're making the | exception only for bytes/str, not for other types that might raise | TypeError upon comparison. | | So, I think that after all this was a bad idea. Sorry. If you mean making a special case exception for string/bytes equality test, I agree. 
Would a restricted key dict (say, rdict, in collections) solve the problem you are aiming at? import collections adict = rdict(str) bdict = rdict(bytes) Now any buggy insertions get caught. Terry J. Reedy From guido at python.org Sat Sep 29 05:08:06 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 28 Sep 2007 20:08:06 -0700 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> Message-ID: On 9/28/07, Terry Reedy wrote: > "Guido van Rossum" wrote in message > news:ca471dc20709281140q2ef95c2ap8bbc7b7d3d46ebc0 at mail.gmail.com... > | > | Well, if we wanted "x" and b"x" to compare unequal instead of raising > | an exception, we could just define it that way (it was that way until > | just before 3.0a1). But we're explicitly defining it to raise a > | TypeError so as to catch buggy code. I think trying to fix dict lookup > | so that it, and only it, treats this as unequal, would be adding too > | many quirks. > | > | We could choose to kill the TypeError altogether. If we keep it, we > | should consistently let it raise TypeError everywhere. > | > | The question is whether it's worth the effort to raise TypeError when > | the *potential* exists that a certain hash sequence *could* raise this > | TypeError. I'm less and less convinced -- after all, we're making the > | exception only for bytes/str, not for other types that might raise > | TypeError upon comparison. > | > | So, I think that after all this was a bad idea. Sorry. > > If you mean making a special case exception for string/bytes equality test, > I agree. Would a restricted key dict (say, rdict, in collections) solve > the problem you are aiming at? > > import collections > adict = rdict(str) > bdict = rdict(bytes) > > Now any buggy insertions get caught. That sounds like a completely different use case -- a typechecking dict. 
The use case we started with is to catch programmers who accidentally mix str and bytes as dict keys -- those programmers aren't likely to have thought much about their key type, so they're not likely to go out of their way to use the rdict you propose above. But here's a clever trick that might just do the job, without any extra effort: make it so that the hash() of a bytes string containing only ASCII bytes is the same as that of a text string containing only ASCII characters. Likely, programmers will attempt to look up keys that they know are in the dict -- and if they use the wrong type, because of the identical hash values, they will get the TypeError as soon as they compare it to the first object at the hashed location. Even better, in the proposal we'll be reusing the old PyString type for the new immutable bytes type, and its hash *already* is equal to that of a PyUnicode object if they both contain the same ASCII bytes only. (This used to be by design in 2.x, and I maintained this property when I made PyUnicode's hash a lot faster.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From pje at telecommunity.com Sat Sep 29 16:24:02 2007 From: pje at telecommunity.com (Phillip J. Eby) Date: Sat, 29 Sep 2007 10:24:02 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> Message-ID: <20070929142126.D61D23A4045@sparrow.telecommunity.com> At 08:08 PM 9/28/2007 -0700, Guido van Rossum wrote: >Likely, programmers will attempt to look up keys >that they know are in the dict -- and if they use the wrong type, >because of the identical hash values, they will get the TypeError as >soon as they compare it to the first object at the hashed location. I'm coming into this thread a little bit late, but if we don't want strings and bytes to be comparable, shouldn't we just make them *unequal*? 
I mean, under normal circumstances, == and != are available on all objects without causing errors, and the same TypeError would occur for things like list.remove(). This seems a lot like Oleg's question on Python-Dev the other day, about raising a TypeError from __nonzero__: i.e., changing a significant expectation about all "normal" objects. While it's true that it would be good to know when you've unintentionally mixed bytes and strings, surely there could be less fatal ways to find this, like perhaps a command-line option that causes byte/string comparisons to output a warning? From guido at python.org Sat Sep 29 16:33:01 2007 From: guido at python.org (Guido van Rossum) Date: Sat, 29 Sep 2007 07:33:01 -0700 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: <20070929142126.D61D23A4045@sparrow.telecommunity.com> References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> <20070929142126.D61D23A4045@sparrow.telecommunity.com> Message-ID: On 9/29/07, Phillip J. Eby wrote: > At 08:08 PM 9/28/2007 -0700, Guido van Rossum wrote: > >Likely, programmers will attempt to look up keys > >that they know are in the dict -- and if they use the wrong type, > >because of the identical hash values, they will get the TypeError as > >soon as they compare it to the first object at the hashed location. > > I'm coming into this thread a little bit late, but if we don't want > strings and bytes to be comparable, shouldn't we just make them > *unequal*? I mean, under normal circumstances, == and != are > available on all objects without causing errors, and the same > TypeError would occur for things like list.remove(). Until just before 3.0a1, they were unequal. We decided to raise TypeError because we noticed many bugs in code that was doing things like data = f.read(4096) if data == "": break where data was bytes and thus the break never taken. 
Similar with checks for certain magic strings (so it wasn't just empty strings). It is also in line with the policy to refuse things like b"abc".replace("a", "A") or "abc".replace(b"b", b"B"). > This seems a lot like Oleg's question on Python-Dev the other day, > about raising a TypeError from __nonzero__: i.e., changing a > significant expectation about all "normal" objects. > > While it's true that it would be good to know when you've > unintentionally mixed bytes and strings, surely there could be less > fatal ways to find this, like perhaps a command-line option that > causes byte/string comparisons to output a warning? I thought about using warning too, but since nobody wants warnings, that would be pretty much the same as raising TypeError except for the most dedicated individuals (and if I were really dedicated I'd just write my own eq() function anyway). And the warning would do nothing about the issue brought up by Jim Jewett, the unpredictable behavior of a dict with both bytes and strings as keys. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From pje at telecommunity.com Sat Sep 29 17:14:04 2007 From: pje at telecommunity.com (Phillip J. Eby) Date: Sat, 29 Sep 2007 11:14:04 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> <20070929142126.D61D23A4045@sparrow.telecommunity.com> Message-ID: <20070929151127.AE5203A4045@sparrow.telecommunity.com> At 07:33 AM 9/29/2007 -0700, Guido van Rossum wrote: >Until just before 3.0a1, they were unequal. We decided to raise >TypeError because we noticed many bugs in code that was doing things >like > > data = f.read(4096) > if data == "": break Thought experiment: what if read() always returned strings, and to read bytes, you had to use something like 'f.readinto(ob, 4096)', where 'ob' is a mutable bytes instance or memory view? 
In Python 2.x, there's only one read() method because (prior to unicode), there was only one type of reading to do. But as the above example makes clear, in 3.x you simply *can't* write code that works correctly with an arbitrary file that might be binary or text, at least not without typechecking the return value from read(). (In which case, you might as well inspect the file object.) So, the above problem could be fixed by having .read() raise an error (or simply not exist) on a binary file object. In this way, the problem is fixed at the point where it really occurs: i.e., at the point of not having decided whether the stream is bytes or text. This also seems to fit better (IMO) with the best practice of enforcing str/unicode/encoding distinctions at the point where data enters the program, rather than delaying the error to later. >I thought about using warning too, but since nobody wants warnings, >that would be pretty much the same as raising TypeError except for the >most dedicated individuals (and if I were really dedicated I'd just >write my own eq() function anyway). The use case I'm concerned about is code that's not type-specific getting a TypeError by comparing arbitrary objects. For example, if you write Python code to create a Python code object (e.g. the compiler package or my own BytecodeAssembler), you need to create a list of constants as you generate the code, and you need to be able to search the list for an equal constant. Since strings and bytes can both be constants, a simple list.index() test could now raise a TypeError, as could "item in list". So raising an error to make bad code fail sooner, will also take down unsuspecting code that isn't really broken, and *force* the writing of special comparison code -- which won't be usable with things like list.remove and the "in" operator. In comparison, forcing code to be bytes vs. text aware at the point of I/O directs attention to the place where you can best decide what to do about it. 
(After all, the comparison that raises the TypeError might occur deep in a library that's expecting to work with text.) >And the warning would do nothing >about the issue brought up by Jim Jewett, the unpredictable behavior >of a dict with both bytes and strings as keys. I've looked at all of Jim's messages for September, but I don't see this. I do see where raising TypeError for comparisons causes a problem with dictionaries, but I don't see how an unequal comparison creates "unpredictable" behavior (as opposed to predictable failure to match). From murman at gmail.com Sat Sep 29 17:12:06 2007 From: murman at gmail.com (Michael Urman) Date: Sat, 29 Sep 2007 10:12:06 -0500 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> <20070929142126.D61D23A4045@sparrow.telecommunity.com> Message-ID: On 9/29/07, Guido van Rossum wrote: > On 9/29/07, Phillip J. Eby wrote: > > I'm coming into this thread a little bit late, but if we don't want > > strings and bytes to be comparable, shouldn't we just make them > > *unequal*? I mean, under normal circumstances, == and != are > > available on all objects without causing errors, and the same > > TypeError would occur for things like list.remove(). > > Until just before 3.0a1, they were unequal. We decided to raise > TypeError because we noticed many bugs in code that was doing things > like > > data = f.read(4096) > if data == "": break I agree that it's nice to catch this sort of error early, but I'm wondering how to reconcile this decision with the discussion we had a year ago when dicts stopped suppressing comparison exceptions. http://mail.python.org/pipermail/python-dev/2006-August/068090.html is the beginning of the thread, and http://mail.python.org/pipermail/python-dev/2006-August/068112.html is a clear description of an __eq__ raising an exception as being buggy. 
If we're going to take a PBP approach to letting bytes() == str() raise an exception, is there a PBP factor to having dictionaries cover for this exception? The only unpredictable thing I see is if you're willy-nilly mixing bytes and strs and expecting to be able to look up one with the other. If you're instead trying to store both, much like you can store strs and tuples, this shouldn't cause a problem. Even if doing so is weird. The idea of if "" in somedict: pass raising a TypeError depending on the values in somedict is not pleasant. Just to throw another idea out there, would a variant of dict that suppresses these comparison exceptions, say collections.loosedict, sidestep the issue? -- Michael Urman From lists at cheimes.de Sat Sep 29 17:28:16 2007 From: lists at cheimes.de (Christian Heimes) Date: Sat, 29 Sep 2007 17:28:16 +0200 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer) In-Reply-To: References: Message-ID: Jim Jewett wrote: > On 9/27/07, Guido van Rossum wrote: >> On 9/27/07, Jim Jewett wrote: > >>> Should a TypeError be raised as soon as you try to put a bytes and a >>> string in the same dict, even if they don't happen to hash equal? > >> Good idea, if you can figure out a way to implement this efficiently. What do you think about using the class hierarchy for the job? Instead of raising a TypeError, a comparison between a string and a bytes object raises a StringBytesError that subclasses TypeError. The dict methods like lookdict() then reraise the StringBytesError explicitly. I know very little about the dict implementation and my idea could be totally wrong ... The idea just came to me and perhaps it helps to find the solution. Christian From pje at telecommunity.com Sat Sep 29 18:01:00 2007 From: pje at telecommunity.com (Phillip J.
Eby) Date: Sat, 29 Sep 2007 12:01:00 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> <20070929142126.D61D23A4045@sparrow.telecommunity.com> <20070929151127.AE5203A4045@sparrow.telecommunity.com> Message-ID: <20070929155823.C552B3A4045@sparrow.telecommunity.com> At 10:26 AM 9/29/2007 -0500, Michael Urman wrote: >[Sending direct because this is just a thanks and some idea fodder, >but feel free to return this to the list] > >On 9/29/07, Phillip J. Eby wrote: > > can both be constants, a simple list.index() test could now raise a > > TypeError, as could "item in list". > >Good point - I keep missing the forest for the trees. This isn't just >a matter of dicts; any collection type can be susceptible. Thanks for >this reminder. > >I'm torn on your idea of making a read vs readinto separation of >files. If this works by, e.g., raising IOError on attempt to use the >wrong one, the use case you proposed will be filtering out a ton of >expected exceptions, but it's easy to understand the behavior. > >If it works by removing the wrong method from the object, then we've >got two different file-like object types returned from the same >function based on the value of an argument (but a better LBYL check >available). Of course since we currently have two different types >returned from a method based on a value passed to its constructor, >this may be no worse. > >I'm not sure which way makes it easier to add new file-like-objects, >either; they'll have the same problems. They'll have the same problems *anyway*. In fact, having different methods will simply force people creating such objects to decide what they're really trying to do. 
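A minimal sketch of the read()/readinto() split discussed above, under stated assumptions: SplitReader is a hypothetical wrapper invented for illustration (it is not part of the io module or of any proposal text), and raising io.UnsupportedOperation for the wrong method is just one possible choice of error. It only shows how forcing the caller to pick a method moves the bytes-vs-text decision to the point of I/O.

```python
import io


class SplitReader:
    """Hypothetical wrapper, not a real io class.

    read() is allowed only on text streams and readinto() only on
    binary streams, so the bytes/text choice is made at the I/O call.
    """

    def __init__(self, stream):
        self._stream = stream
        self._is_text = isinstance(stream, io.TextIOBase)

    def read(self, size=-1):
        # Text-only in this sketch.
        if not self._is_text:
            raise io.UnsupportedOperation("binary stream: use readinto()")
        return self._stream.read(size)

    def readinto(self, buffer):
        # Binary-only in this sketch.
        if self._is_text:
            raise io.UnsupportedOperation("text stream: use read()")
        return self._stream.readinto(buffer)


text = SplitReader(io.StringIO("hello"))
binary = SplitReader(io.BytesIO(b"hello"))

chunk = text.read(4)            # a str
buf = bytearray(4)
count = binary.readinto(buf)    # number of bytes written into buf

try:
    binary.read(4)
    wrong_call = "allowed"
except io.UnsupportedOperation:
    wrong_call = "rejected"

print(repr(chunk), count, bytes(buf), wrong_call)
```

With this shape, code that never decided whether its stream is text or binary fails at the read call itself rather than at a later comparison.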
From jyasskin at gmail.com Sat Sep 29 20:10:07 2007 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Sat, 29 Sep 2007 11:10:07 -0700 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: <20070929151127.AE5203A4045@sparrow.telecommunity.com> References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> <20070929142126.D61D23A4045@sparrow.telecommunity.com> <20070929151127.AE5203A4045@sparrow.telecommunity.com> Message-ID: <5d44f72f0709291110g7e66f00icead0bd060f5ebf9@mail.gmail.com> On 9/29/07, Phillip J. Eby wrote: > At 07:33 AM 9/29/2007 -0700, Guido van Rossum wrote: > >Until just before 3.0a1, they were unequal. We decided to raise > >TypeError because we noticed many bugs in code that was doing things > >like > > > > data = f.read(4096) > > if data == "": break > > Thought experiment: what if read() always returned strings, and to > read bytes, you had to use something like 'f.readinto(ob, 4096)', > where 'ob' is a mutable bytes instance or memory view? > > In Python 2.x, there's only one read() method because (prior to > unicode), there was only one type of reading to do. > > But as the above example makes clear, in 3.x you simply *can't* write > code that works correctly with an arbitrary file that might be binary > or text, at least not without typechecking the return value from > read(). (In which case, you might as well inspect the file > object.) So, the above problem could be fixed by having .read() > raise an error (or simply not exist) on a binary file object. Perhaps write if len(data) == 0: break since that's what you really mean. Any other code that compares the result of read() to either a bytes or a str really is taking a text or binary file object specifically and not working on an arbitrary file. > In this way, the problem is fixed at the point where it really > occurs: i.e., at the point of not having decided whether the stream > is bytes or text. 
> > This also seems to fit better (IMO) with the best practice of > enforcing str/unicode/encoding distinctions at the point where data > enters the program, rather than delaying the error to later. > > > >I thought about using warning too, but since nobody wants warnings, > >that would be pretty much the same as raising TypeError except for the > >most dedicated individuals (and if I were really dedicated I'd just > >write my own eq() function anyway). > > The use case I'm concerned about is code that's not type-specific > getting a TypeError by comparing arbitrary objects. For example, if > you write Python code to create a Python code object (e.g. the > compiler package or my own BytecodeAssembler), you need to create a > list of constants as you generate the code, and you need to be able > to search the list for an equal constant. Since strings and bytes > can both be constants, a simple list.index() test could now raise a > TypeError, as could "item in list". > > So raising an error to make bad code fail sooner, will also take down > unsuspecting code that isn't really broken, and *force* the writing > of special comparison code -- which won't be usable with things like > list.remove and the "in" operator. > > In comparison, forcing code to be bytes vs. text aware at the point > of I/O directs attention to the place where you can best decide what > to do about it. (After all, the comparison that raises the TypeError > might occur deep in a library that's expecting to work with text.) > > > >And the warning would do nothing > >about the issue brought up by Jim Jewett, the unpredictable behavior > >of a dict with both bytes and strings as keys. > > I've looked at all of Jim's messages for September, but I don't see > this. I do see where raising TypeError for comparisons causes a > problem with dictionaries, but I don't see how an unequal comparison > creates "unpredictable" behavior (as opposed to predictable failure to match). 
> > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/jyasskin%40gmail.com > -- Namasté, Jeffrey Yasskin http://jeffrey.yasskin.info/ "Religion is an improper response to the Divine." -- "Skinny Legs and All", by Tom Robbins From greg at krypto.org Sat Sep 29 21:04:42 2007 From: greg at krypto.org (Gregory P. Smith) Date: Sat, 29 Sep 2007 12:04:42 -0700 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: <5d44f72f0709291110g7e66f00icead0bd060f5ebf9@mail.gmail.com> References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> <20070929142126.D61D23A4045@sparrow.telecommunity.com> <20070929151127.AE5203A4045@sparrow.telecommunity.com> <5d44f72f0709291110g7e66f00icead0bd060f5ebf9@mail.gmail.com> Message-ID: <52dc1c820709291204r214e3037w78aba5495894da7b@mail.gmail.com> On 9/29/07, Jeffrey Yasskin wrote: > > On 9/29/07, Phillip J. Eby wrote: > > At 07:33 AM 9/29/2007 -0700, Guido van Rossum wrote: > > >Until just before 3.0a1, they were unequal. We decided to raise > > >TypeError because we noticed many bugs in code that was doing things > > >like > > > > > > data = f.read(4096) > > > if data == "": break > > > > Thought experiment: what if read() always returned strings, and to > > read bytes, you had to use something like 'f.readinto(ob, 4096)', > > where 'ob' is a mutable bytes instance or memory view? > Using what encoding? read() should raise an exception on a file opened as binary in that case. And instead of readinto() how about readbytes() that just returns bytes and raises an exception on non-binary mode files.
(readinto for buffers is a good idea and i think we should have it but that idea could be taken further to allow for even more scattered IO into a mutable buffer; thats another discussion and should be a PEP of its own) > But as the above example makes clear, in 3.x you simply *can't* write > > code that works correctly with an arbitrary file that might be binary > > or text, at least not without typechecking the return value from > > read(). (In which case, you might as well inspect the file > > object.) So, the above problem could be fixed by having .read() > > raise an error (or simply not exist) on a binary file object. > > Perhaps write > if len(data) == 0: break > since that's what you really mean. data = f.read() if not data: break Is the preferred way to write that. Regardless, I agree. read() returning a different type based on the file open mode is going to cause problems. I do -NOT- like the idea of bytes vs string comparison raising an exception. read() and readbytes() methods that raise exceptions when used on the wrong mode of file would "solve" the problem in a more obvious way. Any other code that compares the result of read() to either a bytes or > a str really is taking a text or binary file object specifically and > not working on an arbitrary file. > > > In this way, the problem is fixed at the point where it really > > occurs: i.e., at the point of not having decided whether the stream > > is bytes or text. > > > > This also seems to fit better (IMO) with the best practice of > > enforcing str/unicode/encoding distinctions at the point where data > > enters the program, rather than delaying the error to later. > > > > > > >I thought about using warning too, but since nobody wants warnings, > > >that would be pretty much the same as raising TypeError except for the > > >most dedicated individuals (and if I were really dedicated I'd just > > >write my own eq() function anyway). 
From tjreedy at udel.edu Sat Sep 29 23:28:40 2007 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 29 Sep 2007 17:28:40 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: ImmutableBytesand Mutable Buffer) References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com><20070929142126.D61D23A4045@sparrow.telecommunity.com> Message-ID: "Guido van Rossum" wrote in message news:ca471dc20709290733i54f63ac3pb4501b94530db820 at mail.gmail.com... | Until just before 3.0a1, they were unequal. I think it valuable that in the language as delivered, 'o==p' (as well as 'bool(o)' )always return True or False. Both make reasoning about code easier since one does not have to learn and carry around in the back of one's mind niggling exceptions. I am -1 on the last minute change and for much the same reasons I have against building into the language Windows-specific suppression of \r output (see pydev post). | We decided to raise | TypeError because we noticed many bugs in code that was doing things | like | | data = f.read(4096) | if data == "": break | | where data was bytes and thus the break never taken. As G. Smith said, if a generic comparison is meant, then that should be if not data: break In any case, this seems like a old-code translation problem rather than a new-code writing problem. We already know that each existing str literal may have to be humanly checked to determine whether a 'b' should be prepended, as would appear to be the case above.
| Similar with checks for certain magic strings (so it wasn't just empty strings). If a generic comparison is wanted, then "if data in ('abc', b'abc')". If a specific comparison is wanted, then raising an exception complicates what should be simple. Consider

    def g(stuff):
        if stuff == 'abc':
            special_text()
        elif stuff == b'abc':
            special_bytes()
        else:
            general_stuff(stuff)

Breaking equality is not free. | It is also in line with the policy to refuse things like | b"abc".replace("a", "A") or "abc".replace(b"b", b"B"). I do not see the connection. I would expect either to raise TypeError, just as '123'.replace(1,4) does today, even though '1' == 1 returns False rather than raising an exception. Terry Jan Reedy From pje at telecommunity.com Sat Sep 29 23:47:39 2007 From: pje at telecommunity.com (Phillip J. Eby) Date: Sat, 29 Sep 2007 17:47:39 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: <52dc1c820709291204r214e3037w78aba5495894da7b@mail.gmail.com> References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> <20070929142126.D61D23A4045@sparrow.telecommunity.com> <20070929151127.AE5203A4045@sparrow.telecommunity.com> <5d44f72f0709291110g7e66f00icead0bd060f5ebf9@mail.gmail.com> <52dc1c820709291204r214e3037w78aba5495894da7b@mail.gmail.com> Message-ID: <20070929214503.E5B133A4045@sparrow.telecommunity.com> At 12:04 PM 9/29/2007 -0700, Gregory P. Smith wrote: >On 9/29/07, Jeffrey Yasskin ><jyasskin at gmail.com> wrote: >On 9/29/07, Phillip J. Eby ><pje at telecommunity.com> wrote: > > At 07:33 AM 9/29/2007 -0700, Guido van Rossum wrote: > > >Until just before 3.0a1, they were unequal.
We decided to raise > > >TypeError because we noticed many bugs in code that was doing things > > >like > > > > > > data = f.read(4096) > > > if data == "": break > > > > Thought experiment: what if read() always returned strings, and to > > read bytes, you had to use something like 'f.readinto(ob, 4096)', > > where 'ob' is a mutable bytes instance or memory view? > > >Using what encoding? read() should raise an exception on a file >opened as binary in that case. Yes, that's what I meant -- the availability of read() and readinto() would be mutually exclusive. > And instead of readinto() how about readbytes() that just returns > bytes and raises an exception on non-binary mode files. Sure. > (readinto for buffers is a good idea and i think we should have > it but that idea could be taken further to allow for even more > scattered IO into a mutable buffer; thats another discussion and > should be a PEP of its own) Fair enough, although readbytes() can be implemented in terms of readinto(), while the reverse isn't the case. From facundobatista at gmail.com Sun Sep 30 16:32:38 2007 From: facundobatista at gmail.com (Facundo Batista) Date: Sun, 30 Sep 2007 11:32:38 -0300 Subject: [Python-3000] Extension: mpf for GNU MP floating point In-Reply-To: <20070928123252.5a0692b0.weilawei@gmail.com> References: <20070925094601.c151245c.weilawei@gmail.com> <20070927125557.a5895341.weilawei@gmail.com> <20070928122915.798d00e1.weilawei@gmail.com> <20070928123252.5a0692b0.weilawei@gmail.com> Message-ID: 2007/9/28, Rob Crowther : > a) MPF() now takes a float or integer argument because mpf_set_str is just Rob, there has been a *lot* of discussion about this for Decimal (see the PEP and discussions in python-dev and python-list around the PEP date). 
The main issue here is what the user means when calling MPF(2.3): a) MPF("2.3") b) MPF("2.2999999999999998") The difficulty of the choice is that a) is maybe what she expects, while b) is the real value (so why not assume she expects the real value?) Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ From dickinsm at gmail.com Sun Sep 30 17:12:00 2007 From: dickinsm at gmail.com (Mark Dickinson) Date: Sun, 30 Sep 2007 11:12:00 -0400 Subject: [Python-3000] Extension: mpf for GNU MP floating point In-Reply-To: References: <20070925094601.c151245c.weilawei@gmail.com> <20070927125557.a5895341.weilawei@gmail.com> <20070928122915.798d00e1.weilawei@gmail.com> <20070928123252.5a0692b0.weilawei@gmail.com> Message-ID: <5c6f2a5d0709300812w56b024b2l2765cc35a07353f8@mail.gmail.com> On 9/30/07, Facundo Batista wrote: > > 2007/9/28, Rob Crowther : > > > a) MPF() now takes a float or integer argument because mpf_set_str is > just > > Rob, there has been a *lot* of discussion about this for Decimal (see > the PEP and discussions in python-dev and python-list around the PEP > date). But there's a major difference here: Decimal is *decimal* floating point, MPF and Python floats are *binary* floating point. So in the case of Decimal, conversion from a decimal string is a straightforward operation, while conversion from binary involves making choices about how to round, how many decimal digits to use, etc. But for MPF it's the other way around: conversion from a float is immediate (the GMP precision is always at least 53 bits, so any IEEE double can be represented as an MPF with no loss of information), while conversion from a string involves hard work and decisions about how to round (and GMP's approach to rounding seems pretty haphazard here...).
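The difference between converting a float and converting a decimal string can be checked with the standard library alone. This is only an illustration: fractions.Fraction and decimal.Decimal stand in for an exact binary value here, since the MPF type under discussion is not assumed to be available, and nothing below says anything about GMP's own rounding.

```python
from decimal import Decimal
from fractions import Fraction

x = 2.3  # the nearest IEEE double to the decimal 2.3

exact = Fraction(x)            # exact value of the double, no rounding involved
from_float = Decimal(x)        # the same exact value, written in decimal digits
from_string = Decimal("2.3")   # the decimal 2.3 itself

# A double is an integer times a power of two, so the denominator
# of its exact value is a power of two.
is_power_of_two = exact.denominator & (exact.denominator - 1) == 0

round_trips = float(exact) == x        # converting back loses nothing
differs = from_float != from_string    # option (a) and the float are different values
seventeen = format(x, ".17g")          # 17 significant digits of the double

print(is_power_of_two, round_trips, differs, seventeen)
```

The 17-digit string is option (b) from the question being quoted, which is why the two candidate meanings of MPF(2.3) name genuinely different numbers.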
So since there's really no ambiguity about what MPF(float) should be, and since it's a computationally trivial operation to initialize an MPF from a float, you certainly want to allow MPF's to be initialized from floats. Admittedly, for initialization from a float *literal* there are still going to be some surprises for the unwary: with MPF precision set to 128 bits, MPF(1.1) is going to give a binary number that's an accurate representation of the decimal 1.1 to only 53 bits, not 128 bits. > The main issue here is what means the user if he calls MPF(2.3): > > a) MPF("2.3") > > b) MPF("2.2999999999999998") All 3 of MPF(2.3), MPF("2.3") and MPF("2.29...998") should be different values. MPF(2.3) is the closest 53-bit binary floating point number to the decimal 2.3, padded out with zero bits to whatever the current MPF precision is. MPF("2.3") should ideally be the closest p-bit binary floating point number to the decimal 2.3, where p is the current precision. But in fact, with the way that GMP works it seems that all that can be said is that MPF("2.3") is a (p+some_extra_bits) binary floating point number that's close (but not necessarily closest) to the decimal 2.3. Similarly for MPF("2.29...998"). By the way, I'm wondering whether this discussion really belongs on comp.lang.python instead... Mark
From jimjjewett at gmail.com Sun Sep 30 18:31:23 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Sun, 30 Sep 2007 12:31:23 -0400 Subject: [Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytesand Mutable Buffer) In-Reply-To: <20070929155823.C552B3A4045@sparrow.telecommunity.com> References: <766a29bd0709281134m48c930b6ye5d03ed08b27f4d3@mail.gmail.com> <20070929142126.D61D23A4045@sparrow.telecommunity.com> <20070929151127.AE5203A4045@sparrow.telecommunity.com> <20070929155823.C552B3A4045@sparrow.telecommunity.com> Message-ID: At 10:26 AM 9/29/2007 -0500, Michael Urman wrote: > This isn't just a matter of dicts; any collection type can be susceptible. The reason that dicts (and sets) are even worse is that the comparison could be delayed. If b"bytes" in [...] raises an exception, it happens while b"bytes" is still in the traceback context. With a dictionary, the problem comparison could be delayed until the next resize. Even if the TypeError did tell you which dict and (pair of pre-existing) keys were a problem, you still wouldn't know how those keys got there. Example data flow:

    insert string1 with hash X
    insert string2 with hash X -- collision, so it moves to the next slot
    del string1
    insert bytes with hash X -- replaces the dummy entry, so nothing raised yet
    ...
    insert something utterly unrelated, such as an integer. This causes a resize, so that now string2 and bytes do collide and raise a TypeError complaining about strings and bytes -- even though the key you added is neither.

-jJ
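The mechanism this thread keeps returning to is a dict lookup whose key hashes equal to a stored key of the other kind. It can be sketched with a stand-in class. StrictKey is hypothetical and exists only for this illustration; it imitates the proposed behaviour (equal hashes for equal ASCII content, TypeError on a cross-kind ==) and is not how the real str and bytes types are implemented. The sketch covers only the immediate lookup case, not the delayed-resize flow described above.

```python
class StrictKey:
    """Hypothetical key: same hash for the same ASCII payload,
    TypeError when a text key is compared with a bytes key."""

    def __init__(self, kind, payload):
        self.kind = kind          # "text" or "bytes"
        self.payload = payload    # ASCII content, kept as a str

    def __hash__(self):
        # Both kinds hash on the payload alone, as in the equal-hash idea.
        return hash(self.payload)

    def __eq__(self, other):
        if not isinstance(other, StrictKey):
            return NotImplemented
        if self.kind != other.kind:
            raise TypeError("cannot compare text and bytes keys")
        return self.payload == other.payload


table = {StrictKey("text", "abc"): 1}

# Same payload, other kind: the hashes match, so the dict has to call
# __eq__, and the TypeError propagates out of the lookup.
try:
    table[StrictKey("bytes", "abc")]
    outcome = "no error"
except TypeError:
    outcome = "TypeError"
except KeyError:
    outcome = "KeyError"

# Different payload: the hashes differ (barring an unlikely full-hash
# collision), so __eq__ is never called and the lookup just misses.
other_lookup = StrictKey("bytes", "xyz") in table

print(outcome, other_lookup)
```

Whether the error appears therefore depends on which keys happen to share a hash, which is the property both sides of the argument are pointing at.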