From ncoghlan at gmail.com Wed Oct 1 00:02:29 2008 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 01 Oct 2008 08:02:29 +1000 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? In-Reply-To: References: <200809271404.25654.victor.stinner@haypocalc.com> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>

<48E29AB6.908@gmail.com> Message-ID: <48E2A1F5.5040009@gmail.com> Guido van Rossum wrote: > On Tue, Sep 30, 2008 at 2:31 PM, Nick Coghlan wrote: >> I'm also starting to wonder if allowing mixed types might be the way to >> go for these interfaces - leaving the bytes objects in place if the >> Unicode decode operation fails. > > No, no, nooooo! Yeah, I realised shortly after sending that message that this is exactly the problem this discussion is trying to get rid of. I saw at least one other post containing a similar comment though, so I didn't feel *too* foolish for writing it (although that didn't stop me wishing my email client had a "Retract stupid comment" button). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From guido at python.org Wed Oct 1 00:04:03 2008 From: guido at python.org (Guido van Rossum) Date: Tue, 30 Sep 2008 15:04:03 -0700 Subject: [Python-3000] [Python-Dev] Patch for an initial support of bytes filename in Python3 In-Reply-To: <20080930184751.31635.1484325691.divmod.xquotient.520@weber.divmod.com> References: <200809300247.20349.victor.stinner@haypocalc.com> <20080930132151.31635.132601277.divmod.xquotient.434@weber.divmod.com> <20080930175932.31635.989735053.divmod.xquotient.478@weber.divmod.com> <20080930184751.31635.1484325691.divmod.xquotient.520@weber.divmod.com> Message-ID: On Tue, Sep 30, 2008 at 11:47 AM, wrote: > > On 05:56 pm, guido at python.org wrote: >> >> On Tue, Sep 30, 2008 at 10:59 AM, wrote: >>> >>> On 02:32 pm, guido at python.org wrote: > >>> In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the >>> "benefit of the doubt" case? It could always be added to 2.7, and the >>> parity release of 2to3 could have a --2.7 switch that would modify the >>> behavior of this and other fixers. >> >> I'm not sure what you're proposing. 
*My* proposal is that 2to3 changes >> os.getcwdu() calls to os.getcwd() and leaves os.getcwd() calls alone >> -- there's no way to tell whether os.getcwdb() would be a better >> match, and for portable code, it won't be (since os.getcwdb() is a >> Unix-only thing). > > My proposal is simply to change getcwd to getcwdb, and getcwdu to getcwd. > This preserves whatever bytes/text behavior you are expecting from 2.6 into > 3.0. Granted, the fact that unicode is really always the right thing to do > on Windows complicates things. Plus, even on Linux Unicode is *usually* what you should be doing, unless you're writing a backup tool. > I already tend to avoid os.getcwd() though, and this is just one more reason > to avoid it. In the rare cases where I really do need it, it looks like > os.path.abspath(b".") / os.path.abspath(u".") will provide the clarity that > I want. Or os.path.expanduser('~') vs. os.path.expanduser(b'~'). :-) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Wed Oct 1 00:21:04 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 01 Oct 2008 00:21:04 +0200 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E28C31.6060606@v.loewis.de> Message-ID: <48E2A650.4000108@v.loewis.de> >> My concern still is that it brings the bytes type into the status of >> another character string type, which is really bad, and will require >> further modifications to Python for the lifetime of 3.x. > > I'd like to understand why this is "really bad". I though it was by > design that the str and bytes types behave pretty similarly. You can > use both as dict keys. If they have to behave pretty similarly, they have to be supported in all APIs that deal with text. 
For example, people will demand that printing bytes should just copy them onto the stream (rather than invoking repr()), and writing them onto a text stream should work the same way. GUI library should support them, the XML libraries, and so on. Where will you stop, and tell people that bytes are just not supposed to do this or that? >> This is because applications will then regularly use byte strings for >> file names on Unix, and regular strings on Windows, and then expect >> the program to work the same without further modifications. > > It seems that bytes arguments actually *do* work on Windows -- somehow > they get decoded. (Unless Terry's report was from 2.x.) To a limited degree - see my other message. Don't try to listdir a directory with characters outside CP_ACP (it will give you invalid file names). > Actually something like that may not be a bad idea. Ian Bicking's > webob supports similar double APIs for getting the request parameters > out of a request object; I believe request.GET['x'] is a text object > and request.GET_str['x'] is the corresponding uninterpreted bytes > sequence. I would prefer to have os.environb over os.environ[b"PATH"] > though. And would you keep them synchronized? > I assume at some point we can stop and have sufficiently low-level > interfaces that everyone can agree are in bytes only. Bytes aren't > going away. How does Java deal with this? Its File class doesn't seem > to deal in bytes at all. What would its listFiles() method do with > undecodable filenames? Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will fail with FileNotFoundException. IOW, Java hasn't solved the problem in the last 10 years. 
Marcin Kowalczyk did a more thorough analysis about a year ago in
http://mail.python.org/pipermail/python-3000/2007-September/010450.html

Regards,
Martin

From martin at v.loewis.de Wed Oct 1 00:28:22 2008
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 01 Oct 2008 00:28:22 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue
In-Reply-To: <83758335-97EA-441B-A783-05F16EBE6D7A@fuhm.net>
References: <200809291407.55291.victor.stinner@haypocalc.com> <48E29CB1.5010309@v.loewis.de> <83758335-97EA-441B-A783-05F16EBE6D7A@fuhm.net>
Message-ID: <48E2A806.6020607@v.loewis.de>

> Yes! If there is a byte-string access method for Windows, pretty please
> make it decode from UTF-8 internally and call the Unicode version of the
> Windows APIs. The non-unicode windows APIs are pretty much just broken
> -- Ideally, Python should never be calling those.

I don't think we will manage to release Python 3.0 this year if that
change is to be implemented. And then, I don't think the release manager
will agree to such a delay.

I disagree that the ANSI APIs are broken. For most users (and by that, I
mean much more than 99% of the world population with access to Windows
computers), they work just fine. You have to deliberately try to break
them, or work in an environment where you speak multiple languages (with
conflicting scripts) simultaneously.

Practicality beats purity, and I applaud Microsoft for such a
foresighted design (they are guilty of bad designs in other places, but
this one really gives a good tradeoff of all issues, all things
considered).
Regards,
Martin

From martin at v.loewis.de Wed Oct 1 00:32:03 2008
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 01 Oct 2008 00:32:03 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue
In-Reply-To:
References: <200809291407.55291.victor.stinner@haypocalc.com> <48E29D3B.5030900@v.loewis.de>
Message-ID: <48E2A8E3.3070805@v.loewis.de>

> How does windows (and Python on windows) handle NFC versus NFD issues?

That's left to the application.

> Can I have two files called "ümlaut.txt", one in NFD and one NFC form?

Yes, you can. It sounds confusing, but only in a theoretical way. You
never have combining characters on Windows (at least, I don't). The
keyboard input defaults to NFC, and users normally don't type file
names anyway, except when creating the files - later, they just use the
mouse to indicate what file they want to act on.

> And are both of those representable on the Python side (i.e. can they
> both be returned from listdir() and passed to open())?

Certainly!

> If I compare
> these two filenames, do they compare differently?

Certainly!

Regards,
Martin

From guido at python.org Wed Oct 1 00:33:50 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 Sep 2008 15:33:50 -0700
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue
In-Reply-To: <48E2A650.4000108@v.loewis.de>
References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E28C31.6060606@v.loewis.de> <48E2A650.4000108@v.loewis.de>
Message-ID:

On Tue, Sep 30, 2008 at 3:21 PM, "Martin v. Löwis" wrote:
>>> My concern still is that it brings the bytes type into the status of
>>> another character string type, which is really bad, and will require
>>> further modifications to Python for the lifetime of 3.x.
>>
>> I'd like to understand why this is "really bad".
I though it was by >> design that the str and bytes types behave pretty similarly. You can >> use both as dict keys. > > If they have to behave pretty similarly, they have to be supported in > all APIs that deal with text. I don't see how you get from "pretty similarly" to "all APIs". :-) > For example, people will demand that > printing bytes should just copy them onto the stream (rather than > invoking repr()), and writing them onto a text stream should work the > same way. GUI library should support them, the XML libraries, and so > on. > > Where will you stop, and tell people that bytes are just not supposed > to do this or that? Printing a bytes object already works, and displays its repr(), which is guaranteed to be pure ASCII (unlike the repr() of a unicode str object in Py3k). All the others you mention will cause breakage as they should -- these errors exist to force the programmer to think about encodings or conversions. I don't see that as a big burden because the only way there could be bytes here in the first place is when the user explicitly requested bytes. A program that only ever passes text strings to the os module is only ever going to get text strings back. >>> This is because applications will then regularly use byte strings for >>> file names on Unix, and regular strings on Windows, and then expect >>> the program to work the same without further modifications. >> >> It seems that bytes arguments actually *do* work on Windows -- somehow >> they get decoded. (Unless Terry's report was from 2.x.) > > To a limited degree - see my other message. Don't try to listdir a > directory with characters outside CP_ACP (it will give you invalid > file names). Understood. >> Actually something like that may not be a bad idea. 
Ian Bicking's >> webob supports similar double APIs for getting the request parameters >> out of a request object; I believe request.GET['x'] is a text object >> and request.GET_str['x'] is the corresponding uninterpreted bytes >> sequence. I would prefer to have os.environb over os.environ[b"PATH"] >> though. > > And would you keep them synchronized? Yes, the bytes versions would be the canonical version and the str version would wrap around that -- though updating the str version would also update the bytes version. Some keys would be missing from the str version (or perhaps they would raise exceptions or default to some other error handler, like ignore or replace). >> I assume at some point we can stop and have sufficiently low-level >> interfaces that everyone can agree are in bytes only. Bytes aren't >> going away. How does Java deal with this? Its File class doesn't seem >> to deal in bytes at all. What would its listFiles() method do with >> undecodable filenames? > > Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte > sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will > fail with FileNotFoundException. > > IOW, Java hasn't solved the problem in the last 10 years. Marcin > Kowalczyk did a more thorough analysis about a year ago in > > http://mail.python.org/pipermail/python-3000/2007-September/010450.html I can't say I like the Java solution. I would like to be able to write a robust backup tool in Python, even if the code needed to make it work everywhere isn't going to win any prizes (due to the need to use bytes on Unix, str on Windows). 
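
A minimal sketch of the backup-tool pattern described above, assuming the polymorphic behaviour proposed in this thread (bytes argument gives bytes results, str argument gives str results), which is how os.listdir() behaves in released Python 3. The helper name list_names is hypothetical, and os.fsencode()/os.fsdecode() are assumed to be available (they were added after this discussion):

```python
import os
import sys
import tempfile


def list_names(directory):
    """List entries using the platform's native filename type.

    Hypothetical helper: bytes on POSIX (no decoding, so nothing is lost),
    str on Windows (where the native API is Unicode).
    """
    if sys.platform == "win32":
        return os.listdir(os.fsdecode(directory))
    return os.listdir(os.fsencode(directory))


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.txt"), "w").close()

    str_names = os.listdir(tmp)                 # str argument -> str results
    bytes_names = os.listdir(os.fsencode(tmp))  # bytes argument -> bytes results
    native_names = list_names(tmp)              # whichever type the platform prefers
```

A program that only ever passes str never sees bytes come back, which is the point made above about the burden falling only on code that explicitly asks for bytes.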
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From foom at fuhm.net Wed Oct 1 00:36:23 2008 From: foom at fuhm.net (James Y Knight) Date: Tue, 30 Sep 2008 18:36:23 -0400 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <48E2A650.4000108@v.loewis.de> References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E28C31.6060606@v.loewis.de> <48E2A650.4000108@v.loewis.de> Message-ID: <0DBCA888-43DA-4DE9-952F-A377E96B286D@fuhm.net> On Sep 30, 2008, at 6:21 PM, Martin v. L?wis wrote: > IOW, Java hasn't solved the problem in the last 10 years. Java is already really bad at being a small little language to write cooperating tools in. I'd never even attempt to write a little pipeline filter in Java -- I've already pretty much learned to expect Java applications to be in their own world, so I'd hardly find it surprising if a Java app could only read files it wrote itself, nevermind files in odd encodings. Python, on the other hand, is an awesome tool for writing small little scripts that interact well with the surrounding environment, Just The Way It Is, without trying to layer so much abstraction upon it so that you lose functionality. Moving away from that would be unfortunate. James From victor.stinner at haypocalc.com Wed Oct 1 01:11:10 2008 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 1 Oct 2008 01:11:10 +0200 Subject: [Python-3000] Filename: unicode normalization Message-ID: <200810010111.10956.victor.stinner@haypocalc.com> Since it's hard to follow the filename thread on two mailing list, i'm starting a new thread only on python-3000 about unicode normalization of the filenames. Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC to create a file, you have to reuse NFC to open your file (and the same for NFD). 
Python2 example to create files in the different forms:

>>> name=u'xäx'
>>> from unicodedata import normalize
>>> open(u'NFD-' + normalize('NFD', name), 'w').close()
>>> open(u'NFC-' + normalize('NFC', name), 'w').close()
>>> open(u'NFKC-' + normalize('NFKC', name), 'w').close()
>>> open(u'NFKD-' + normalize('NFKD', name), 'w').close()
>>> import os
>>> os.listdir('.')
['NFD-xa\xcc\x88x', 'NFC-x\xc3\xa4x', 'NFKC-x\xc3\xa4x', 'NFKD-xa\xcc\x88x']
>>> os.listdir(u'.')
[u'NFD-xa\u0308x', u'NFC-x\xe4x', u'NFKC-x\xe4x', u'NFKD-xa\u0308x']

Directory listing using Python3:

>>> import os
>>> from unicodedata import normalize
>>> [ name.encode('utf-8') for name in os.listdir('.') ]
[b'NFD-xa\xcc\x88x', b'NFC-x\xc3\xa4x', b'NFKC-x\xc3\xa4x', b'NFKD-xa\xcc\x88x']
>>> os.listdir('.')
['NFD-xäx', 'NFC-xäx', 'NFKC-xäx', 'NFKD-xäx']

Same results, correct. Then try to open files:

>>> open(normalize('NFC', 'NFC-xäx')).close()
>>> open(normalize('NFD', 'NFC-xäx')).close()
IOError: [Errno 2] No such file or directory: 'NFC-xäx'
>>> open(normalize('NFD', 'NFD-xäx')).close()
>>> open(normalize('NFC', 'NFD-xäx')).close()
IOError: [Errno 2] No such file or directory: 'NFD-xäx'

If the user chooses a result from os.listdir(): no problem (if he has good
eyes and he's able to find the difference between 'xäx' (NFD) and 'xäx'
(NFC) :-D).

If the user enters the filename using the keyboard (on the command line or a
GUI dialog), you have to hope that the keyboard input uses the same
normalization form as the one the filename was created with...

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From guido at python.org Wed Oct 1 01:23:01 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 Sep 2008 16:23:01 -0700
Subject: [Python-3000] Filename: unicode normalization
In-Reply-To: <200810010111.10956.victor.stinner@haypocalc.com>
References: <200810010111.10956.victor.stinner@haypocalc.com>
Message-ID:

Martin answered a similar question from Jack Jansen in another thread.
OSX doesn't normalize either.
It's unlikely to confuse users in practice.

On Tue, Sep 30, 2008 at 4:11 PM, Victor Stinner wrote:
> Since it's hard to follow the filename thread on two mailing list, i'm
> starting a new thread only on python-3000 about unicode normalization of the
> filenames.
>
> Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC
> to create a file, you have to reuse NFC to open your file (and the same for
> NFD).
>
> Python2 example to create files in the different forms:
>>>> name=u'xäx'
>>>> from unicodedata import normalize
>>>> open(u'NFD-' + normalize('NFD', name), 'w').close()
>>>> open(u'NFC-' + normalize('NFC', name), 'w').close()
>>>> open(u'NFKC-' + normalize('NFKC', name), 'w').close()
>>>> open(u'NFKD-' + normalize('NFKD', name), 'w').close()
>>>> import os
>>>> os.listdir('.')
> ['NFD-xa\xcc\x88x', 'NFC-x\xc3\xa4x', 'NFKC-x\xc3\xa4x', 'NFKD-xa\xcc\x88x']
>>>> os.listdir(u'.')
> [u'NFD-xa\u0308x', u'NFC-x\xe4x', u'NFKC-x\xe4x', u'NFKD-xa\u0308x']
>
> Directory listing using Python3:
>>>> import os
>>>> [ name.encode('utf-8') for name in os.listdir('.') ]
> [b'NFD-xa\xcc\x88x', b'NFC-x\xc3\xa4x', b'NFKC-x\xc3\xa4x',
> b'NFKD-xa\xcc\x88x']
>>>> os.listdir('.')
> ['NFD-xäx', 'NFC-xäx', 'NFKC-xäx', 'NFKD-xäx']
>
> Same results, correct. Then try to open files:
>>>> open(normalize('NFC', 'NFC-xäx')).close()
>>>> open(normalize('NFD', 'NFC-xäx')).close()
> IOError: [Errno 2] No such file or directory: 'NFC-xäx'
>>>> open(normalize('NFD', 'NFD-xäx')).close()
>>>> open(normalize('NFC', 'NFD-xäx')).close()
> IOError: [Errno 2] No such file or directory: 'NFD-xäx'
>
> If the user chooses a result from os.listdir(): no problem (if he has good
> eyes and he's able to find the difference between 'xäx' (NFD) and 'xäx'
> (NFC) :-D).
>
> If the user enters the filename using the keyboard (on the command line or a
> GUI dialog), you have to hope that the keyboard is encoded in the same norm
> than the filename was encoded...
> > -- > Victor Stinner aka haypo > http://www.haypocalc.com/blog/ > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From victor.stinner at haypocalc.com Wed Oct 1 02:17:33 2008 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 1 Oct 2008 02:17:33 +0200 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <48E2A806.6020607@v.loewis.de> References: <200809291407.55291.victor.stinner@haypocalc.com> <83758335-97EA-441B-A783-05F16EBE6D7A@fuhm.net> <48E2A806.6020607@v.loewis.de> Message-ID: <200810010217.33570.victor.stinner@haypocalc.com> Le Wednesday 01 October 2008 00:28:22 Martin v. L?wis, vous avez ?crit?: > I don't think we will manage to release Python 3.0 this year if that > change is to be implemented. And then, I don't think the release manager > will agree to such a delay. The minimum change is to disallow bytes/str mix: - os.listdir(unicode)->unicode and ignore invalid files (current behaviour is to return unicode and bytes) - os.readlink(unicode)->unicode or raise an error (current behaviour is to return unicode or bytes) - remove os.getcwdu() (use its code -which is better- for getcwd) and fix the test_unicode_file.py listdir() change (ignore invalid filenames) is important to avoid strange bugs in os.path.*(), glob.*() or on displaying a filename. I can generate a specific patch for these issues. It's just a subset of my last patch. -- Victor Stinner aka haypo http://www.haypocalc.com/blog/ From foom at fuhm.net Wed Oct 1 02:38:45 2008 From: foom at fuhm.net (James Y Knight) Date: Tue, 30 Sep 2008 20:38:45 -0400 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? 
In-Reply-To: <48E29F56.7060206@v.loewis.de> References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>

<48E29AB6.908@gmail.com> <48E29F56.7060206@v.loewis.de> Message-ID: On Sep 30, 2008, at 5:51 PM, Martin v. L?wis wrote: > While I can sympathize with people having non-ASCII file names on > their > disks, I can't sympathize with this example. Normal users just don't > put \x90 into their command lines, and those who do deserve the error > message they get. That's just not true! One of the most common kind of thing to put on a command line is a filename. And you can't say that users wouldn't be able to type the odd bytesequences: tab completion and xargs will both allow input of those oddly-named files to the command line. James From greg.ewing at canterbury.ac.nz Wed Oct 1 03:05:48 2008 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 01 Oct 2008 13:05:48 +1200 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? In-Reply-To: <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> Message-ID: <48E2CCEC.9030709@canterbury.ac.nz> James Y Knight wrote: > Since from what I've tried, things seem to work, I'd really like to > know what precisely does fail from the opponents of utf-8b. Seems like what will fail is taking one of these utf-8b decoded names and passing it to some external library that uses it as a filename without knowing that it has to use utf-8b to encode it. Then the funny characters won't be encoded the way they were originally, and it won't compare equal to existing filenames that it should be equal to. 
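
The failure mode described above can be sketched as follows. This is only an illustration under an assumption: it uses the surrogateescape error handler that later Python 3 releases provide as a stand-in for the utf-8b scheme under discussion, and the filename is made up:

```python
raw = b"caf\x90.txt"  # hypothetical filename that is not valid UTF-8

# utf-8b-style decoding: the undecodable byte becomes a lone surrogate.
name = raw.decode("utf-8", "surrogateescape")

# The round trip only works if the same scheme is used to encode again.
round_trip = name.encode("utf-8", "surrogateescape")

# An external library that encodes with plain (strict) UTF-8 cannot
# reproduce the original bytes at all.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A library that substitutes on error silently produces a different name,
# which no longer compares equal to the file on disk.
replaced = name.encode("utf-8", "replace")
```

Valid filenames are unaffected; only names containing undecodable bytes go through the escaped code points.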
--
Greg

From rhamph at gmail.com Wed Oct 1 04:22:08 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Tue, 30 Sep 2008 20:22:08 -0600
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue
In-Reply-To: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E2865A.3010404@v.loewis.de> <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
Message-ID:

On Tue, Sep 30, 2008 at 8:06 PM, wrote:
> The proposal of using U+0000 seems like it would have been almost the same
> from such a wrapper's perspective, except (A) people using the filesystem
> APIs without the benefit of such a wrapper would have been even more
> screwed, and (B) there are a few nasty corner-cases when dealing with
> surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what
> it would have done with.

Surrogates in UTF-8 *should* be treated as errors, but current python
is far too lax. That actually leads to another problem: improving
validation will change what gets escaped and what doesn't.
http://bugs.python.org/issue3297 http://bugs.python.org/issue3672 -- Adam Olsen, aka Rhamphoryncus From foom at fuhm.net Wed Oct 1 05:32:04 2008 From: foom at fuhm.net (James Y Knight) Date: Tue, 30 Sep 2008 23:32:04 -0400 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E2865A.3010404@v.loewis.de> <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com> Message-ID: <22920D6A-8B70-4E6D-BE99-D7447D831B41@fuhm.net> On Sep 30, 2008, at 10:06 PM, glyph at divmod.com wrote: > However, Martin, I can promise you that I will _never_ ask for any > convenience functions related to bytes as a result of this > decision. I want bytes to come back from filesystem APIs because I > intend to have a wrapper layer which knows two things about the > file: the bytes (which are needed to talk to POSIX filesystem APIs) > and the characters (which are computed from those bytes, can be > safely renormalized, displayed to users, etc). On Windows this > filesystem wrapper will necessarily behave differently, and will not > use bytes for anything. Any formatting beyond joining path segments > together and possibly splitting extensions off will be done on > character strings, not byte strings. Can you clarify what proposal you are supporting for Python: 1) Two sets of APIs, one returning unicode strings, and one returning bytestrings. (subpoints: what does the unicode-returning API do when it cannot decode the bytestring into unicode? raise exception, pretend argument/envvar/file didn't exist/?) or 2) All APIs return bytestrings only. Converting to unicode is considered lossy, and would have to be done by applications for display purposes only. I really don't understand the reasoning for (1). 
It seems to me that most software (probably including all of the Python stdlib) would continue to use the unicode string API. Switching all of the Python stdlib to use the bytestring APIs instead would certainly be a large undertaking, and would have all sorts of ripple-on API changes (e.g. __file__). So I can only imagine that if you're proposing (1), you're doing so without the intention of suggesting that Python be converted to use it. And so, of course, that doesn't really fix things (such as getcwd failing if your cwd is a path that is undecodeable in the current locale, or well, currently, python refusing to even start). If you're proposing (2), it's at least as large an undertaking as (1) + converting Python to use the optional bytestring APIs. But at least it avoids exposing an API that people ought not use, and does make it obvious what still needs to be fixed: the unfixed code simply won't run at all. > The proposal of using U+0000 seems like it would have been almost > the same from such a wrapper's perspective, except (A) people using > the filesystem APIs without the benefit of such a wrapper would have > been even more screwed I'm not sure what your "more screwed" is comparing against: current py3k behavior? (aka: decoding to Unicode in locale's specified encoding)? I don't see how you can really be more screwed than that: not only can't you send your filename to display in a Gtk+ button, you can't access it at all, even staying within python. > and (B) there are a few nasty corner-cases when dealing with > surrogate (i.e. invalid, in UTF-8) code points which I'm not quite > sure what it would have done with. The lone-surrogate-pair proposal was a totally different proposal than the U+0000 one. 
James From tjreedy at udel.edu Wed Oct 1 06:39:31 2008 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 01 Oct 2008 00:39:31 -0400 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <48E28C31.6060606@v.loewis.de> References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E28C31.6060606@v.loewis.de> Message-ID: Martin v. L?wis wrote: > Guido van Rossum wrote: >> However >> the *proposed* behavior (returns bytes if the arg was bytes, and >> returns str when the arg was str) is IMO sane, and no different than >> the polymorphism found in len() or many builtin operations. > > My concern still is that it brings the bytes type into the status of > another character string type, which is really bad, and will require > further modifications to Python for the lifetime of 3.x. I am one of those who wanted bytes kept and bytearray added and once grumbled about strings becoming unicode. Now that I am using 3.0 (and can imagine future use of non-ascii chars), I appreciate having just one string type and a separation between normal text and small-int arrays. So I find my self, somewhat surprisingly to me, sharing Martin's concern about regression toward having two text types again. There once was a discussion about whether paths should be represented by strings or a separate path class (that would keep a tuple of strings for each component). This was rejected, as I remember, both because of the complication/benefit ratio and the anticipation that having just one string type would make string representation easier. Using just 3.0 strings seems not to be possible. So a different argument for a path class would be to encapsulate the implementation, which could depend on the OS, and hide the complications from the user, who just wants open to work. 
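
A rough sketch of what such an encapsulating path class could look like. The class name FsPath and its methods are hypothetical, and UTF-8 is assumed as the display encoding; the OS-level name is kept untouched for calls such as open(), while a possibly lossy text form is used only for display:

```python
class FsPath:
    """Hypothetical wrapper: keeps the OS-level name, exposes text for display."""

    def __init__(self, raw):
        # raw is whatever the OS API returned: bytes on Unix, str on Windows.
        self._raw = raw

    def for_os(self):
        """Return the name exactly as received, suitable for open()."""
        return self._raw

    def for_display(self):
        """Return text for the user; undecodable bytes become U+FFFD."""
        if isinstance(self._raw, bytes):
            return self._raw.decode("utf-8", "replace")
        return self._raw


good = FsPath(b"ok.txt")
bad = FsPath(b"bad\x90.txt")
```

The display form of a name like bad above cannot be turned back into the original bytes, so only for_os() should ever be passed back to the filesystem.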
Terry Jan Reedy From martin at v.loewis.de Wed Oct 1 07:27:47 2008 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Wed, 01 Oct 2008 07:27:47 +0200 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E2865A.3010404@v.loewis.de> <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com> Message-ID: <48E30A53.5040708@v.loewis.de> > However, Martin, I can promise you that I will _never_ ask for any > convenience functions related to bytes as a result of this decision. :-) Regards, Martin From martin at v.loewis.de Wed Oct 1 08:56:15 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 01 Oct 2008 08:56:15 +0200 Subject: [Python-3000] Filename: unicode normalization In-Reply-To: <200810010111.10956.victor.stinner@haypocalc.com> References: <200810010111.10956.victor.stinner@haypocalc.com> Message-ID: <48E31F0F.9080208@v.loewis.de> > Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC > to create a file, you have to reuse NFC to open your file (and the same for > NFD). That's not news to me. Of course it does: Unix is completely agnostic of encodings in file APIs. On the implementation level, it's just bytes. Even Windows, which does have the notion that file names are character strings, doesn't normalize. (for OS X, I believe it's slightly more complicated, depending on what API you use: the POSIX/BSD API probably lets through everything as-is, whereas the higher-layer Object-C based APIs do normalize, IIUC) As Guido says: it's no problem. 
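
Since normalization is left to the application, one way an application might match a typed name against a directory listing is to compare in a single normalization form and then open the file by the name the listing actually returned. A small sketch; the helper names same_name and find_on_disk are hypothetical, and the listing is simulated so that no filesystem behaviour is assumed:

```python
from unicodedata import normalize


def same_name(a, b):
    """Compare two filenames after bringing both to NFC."""
    return normalize("NFC", a) == normalize("NFC", b)


def find_on_disk(listing, typed):
    """Return the entry from the listing that matches what the user typed."""
    for entry in listing:
        if same_name(entry, typed):
            return entry
    return None


nfc = normalize("NFC", "x\u00e4x")  # 'x', U+00E4, 'x'
nfd = normalize("NFD", "x\u00e4x")  # 'x', 'a', U+0308, 'x'

listing = ["NFD-" + nfd]                      # what os.listdir() might return
match = find_on_disk(listing, "NFD-" + nfc)   # what a keyboard might produce
```

The two forms are different strings and encode to different bytes, which is why a plain open() with the other form fails on a filesystem that does not normalize.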
Regards,
Martin

From victor.stinner at haypocalc.com Wed Oct 1 10:43:25 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Wed, 1 Oct 2008 10:43:25 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue
In-Reply-To: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
References: <200809291407.55291.victor.stinner@haypocalc.com> <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
Message-ID: <200810011043.25662.victor.stinner@haypocalc.com>

On Wednesday 01 October 2008 04:06:25, glyph at divmod.com wrote:
> b = gtk.Button(u"\u0000/hello/world")
>
> which emits this message:
> TypeError: GtkButton.__init__() argument 1 must be string without
> null bytes or None, not unicode
>
> SQLite has a similar problem with NULLs, and I'm definitely sticking
> paths in there, too.

I think that you can say "all C libraries". Would it be possible to convert
the encoded string to bytes just before calling Gtk? (job done by some Python
internals, not as an explicit conversion)

I don't know if it would help the discussion, but Java uses its own modified
UTF-8 encoding:
 * NUL byte is encoded as 0xc0 0x80 instead of 0x00
 * Java doesn't support unicode > 0xFFFF (bouuuuh!)

http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From mal at egenix.com Wed Oct 1 11:32:30 2008
From: mal at egenix.com (M.-A. Lemburg)
Date: Wed, 01 Oct 2008 11:32:30 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue
In-Reply-To: <200810010954.47564.eckhardt@satorlaser.com>
References: <200809291407.55291.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com> <200810010954.47564.eckhardt@satorlaser.com>
Message-ID: <48E343AE.3080009@egenix.com>

On 2008-10-01 09:54, Ulrich Eckhardt wrote:
> On Tuesday 30 September 2008, M.-A.
Lemburg wrote: >> On 2008-09-30 08:00, Martin v. L?wis wrote: >>>> Change the default file system encoding to store bytes in Unicode is >>>> like introducing a new Python type: . >>> Exactly. Seems like the best solution to me, despite your polemics. >> Not a bad idea... have os.listdir() return Unicode subclasses that work >> like file handles, ie. they have an extra buffer that holds the original >> bytes value received from the underlying C API. > > Why does it have to be a Unicode subclass? In my eyes, a Unicode object > promises a few things, in particular that it contains a Unicode string. If it > now suddenly contains bytes without any further meaning, that would be bad. Please read my entire email. I was proposing to store the underlying non-decodeable byte string value in such a subclass. The Unicode value of the object would then be that underlying value decoded as e.g. Latin-1 in order to be able to work on it as text. Path operations would have to be made aware of such subclasses and operate on the underlying bytes value. However, like Guido mentioned, this only works if all components are indeed aware of such subclasses... and that's likely to fail for code outside the stdlib. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 01 2008) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ :::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. 
Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 From solipsis at pitrou.net Wed Oct 1 12:26:20 2008 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 1 Oct 2008 10:26:20 +0000 (UTC) Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> <48E2CCEC.9030709@canterbury.ac.nz> Message-ID: Greg Ewing canterbury.ac.nz> writes: > > Seems like what will fail is taking one of these utf-8b > decoded names and passing it to some external library > that uses it as a filename without knowing that it has > to use utf-8b to encode it. Then the funny characters > won't be encoded the way they were originally, But those funny characters only appear for invalid filenames. Passing filenames to a library will work for valid filenames. Sure, not all the problem is solved, but the most important part of it (have all filenames work with Python's IO functions) is. From stephen at xemacs.org Wed Oct 1 13:16:07 2008 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 01 Oct 2008 20:16:07 +0900 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? 
In-Reply-To: References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> <48E2CCEC.9030709@canterbury.ac.nz> Message-ID: <871vz0pnuw.fsf@xemacs.org> Antoine Pitrou writes: > But those funny characters only appear for invalid > filenames. What makes you think the filenames are invalid? The file*names* are probably perfectly valid in the intended encoding; they are simply invalid in the encoding that Python wants to apply. From solipsis at pitrou.net Wed Oct 1 13:15:34 2008 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 1 Oct 2008 11:15:34 +0000 (UTC) Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> <48E2CCEC.9030709@canterbury.ac.nz> <871vz0pnuw.fsf@xemacs.org> Message-ID: Stephen J. Turnbull xemacs.org> writes: > > What makes you think the filenames are invalid? The file*names* are > probably perfectly valid in the intended encoding; they are simply > invalid in the encoding that Python wants to apply. Those filenames don't work today with Python 3, the problem is to make them work. Whether they are valid or not in a hypothetical encoding is none of our business, if it's not the encoding we are expecting. 
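The disagreement in the exchange above is easier to follow with a concrete case. The following is a small sketch (added for illustration, with a made-up file name) of a name that is perfectly valid in the encoding its creator intended, Latin-1, but cannot be decoded with the encoding the program expects, UTF-8.

```python
# Bytes as they might be stored on disk by a Latin-1 system
# (the name itself is a made-up example).
raw = "r\u00e9sum\u00e9.txt".encode("latin-1")

# Decoding with the "expected" encoding fails...
try:
    raw.decode("utf-8")
    decodable_as_utf8 = True
except UnicodeDecodeError:
    decodable_as_utf8 = False

# ...while decoding with the intended encoding gives a sensible name.
as_latin1 = raw.decode("latin-1")

print(decodable_as_utf8)
print(as_latin1)
```

Nothing in the bytes themselves says which encoding was intended, which is why both positions in the thread can be defended.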
From ncoghlan at gmail.com Wed Oct 1 14:43:23 2008 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 01 Oct 2008 22:43:23 +1000 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <20081001051947.31635.1251804577.divmod.xquotient.807@weber.divmod.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E2865A.3010404@v.loewis.de> <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com> <22920D6A-8B70-4E6D-BE99-D7447D831B41@fuhm.net> <20081001051947.31635.1251804577.divmod.xquotient.807@weber.divmod.com> Message-ID: <48E3706B.9060308@gmail.com> glyph at divmod.com wrote: > The reasoning is that a lot of software doesn't care if it's wrong for > edge cases, it's really hard to come up with something that's correct > with respect to all of those edge cases (absurdly difficult, if you need > to stay in the straightjacket of string / bytes types, as well as > provide a useful library interface - which is why we're having this > discussion). But, it should be _possible_ to write software that's > correct in the face of those edge cases. I just wanted to highlight this as something to keep in mind during this discussion: we want to keep the easy things easy and make the difficult things possible. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From turnbull at sk.tsukuba.ac.jp Wed Oct 1 16:10:51 2008 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 01 Oct 2008 23:10:51 +0900 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? 
In-Reply-To: References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> <48E2CCEC.9030709@canterbury.ac.nz> <871vz0pnuw.fsf@xemacs.org> Message-ID: <87wsgso178.fsf@xemacs.org> Antoine Pitrou writes: > Stephen J. Turnbull xemacs.org> writes: > > > > What makes you think the filenames are invalid? The file*names* are > > probably perfectly valid in the intended encoding; they are simply > > invalid in the encoding that Python wants to apply. > > Those filenames don't work today with Python 3, the problem is to > make them work. Whether they are valid or not in a hypothetical > encoding is none of our business, if it's not the encoding we are > expecting. It's usually not "hypothetical"; often, the user knows what it is. Why not ask her? That's what web browsers do, in effect, by providing View as Charset commands. The problem with the strategies that are being proposed is that this is an application-level problem, not a Python-level problem. Good web browsers allow you to redisplay the document in a different encoding. Python should make it possible to do the same, *if* the application wants to. It should also be possible for apps to do other things, *if* they want to. That means IMO that Python should limit itself to caching the bytes (or equivalent hacky representation) somewhere that apps that want to do something robust (including "ask the user" or "automatically try a different guess" or "silently throw them away") can find them. Doing more than that is just asking for bug reports that can only be closed as "wontfix" or "pebkac". 
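As a rough sketch of the "automatically try a different guess" strategy mentioned above (an illustration only; the helper name `decode_with_candidates` and the candidate list are assumptions, not an API anyone proposed in the thread), an application that still has the original bytes can retry decoding with encodings supplied by the user or by configuration.

```python
def decode_with_candidates(raw, candidates):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes `raw` cleanly, else (None, None)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


# A name created on a Shift_JIS system (made-up example).
raw = "\u65e5\u672c\u8a9e.txt".encode("shift_jis")

# Order matters: a single-byte codec such as latin-1 accepts any input,
# so it only makes sense as a last resort.
text, used = decode_with_candidates(raw, ["utf-8", "shift_jis"])
print(used)
```

This only works because the raw bytes were kept; once a name has been decoded lossily there is nothing left to retry with.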
From solipsis at pitrou.net Wed Oct 1 16:36:35 2008 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 1 Oct 2008 14:36:35 +0000 (UTC) Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> <48E2CCEC.9030709@canterbury.ac.nz> <871vz0pnuw.fsf@xemacs.org> <87wsgso178.fsf@xemacs.org> Message-ID: Stephen J. Turnbull sk.tsukuba.ac.jp> writes: > > It's usually not "hypothetical"; often, the user knows what it is. > Why not ask her? That's what web browsers do, in effect, by providing > View as Charset commands. The average user does not even /know/ what a charset is. Web browsers provide lots of functions, not all of them are meant for average users (for example they give access to a "Javascript console" and let people choose whether they accept TLS v1.0). > The problem with the strategies that are being proposed is that this > is an application-level problem, not a Python-level problem. I don't understand why you think that. If a filename can't be exactly represented with a valid Unicode sequence, all applications wanting to access that file are impacted in the same way, and it is likely that the same solution or workaround can be applied to all applications. This sounds very much like a Python-level (or at least stdlib-level) problem to me. > Good web > browsers allow you to redisplay the document in a different encoding. Are you suggesting that the solution to the filename problem is to prompt the user and ask them for a different encoding? 
Not only does this solution place a burden on the user, relying on them to give technical information that they may not even understand (let alone be able to retrieve), but it also places a burden on the application developer to code the corresponding logic (prompt the user / provide an additional configure option / have a separate path with manual encoding/decoding of filenames). > Doing more that that is just asking for bug reports that can only be > closed as "wontfix" or "pebkac". There are always bug reports due to miscomprehension of an API or mismatching expectations. I don't think "we want to avoid bug reports" is a good criterion. What would be a good criterion is "we want to avoid legitimate dissatisfaction". From guido at python.org Wed Oct 1 16:53:29 2008 From: guido at python.org (Guido van Rossum) Date: Wed, 1 Oct 2008 07:53:29 -0700 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? In-Reply-To: References: <200809271404.25654.victor.stinner@haypocalc.com> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> <48E2CCEC.9030709@canterbury.ac.nz> <871vz0pnuw.fsf@xemacs.org> <87wsgso178.fsf@xemacs.org> Message-ID: On Wed, Oct 1, 2008 at 7:36 AM, Antoine Pitrou wrote: > The average user does not even /know/ what a charset is. Except those users who need the feature. They certainly have no trouble learning how to make the pages readable once someone explains it to them. 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From janssen at parc.com Wed Oct 1 17:54:15 2008 From: janssen at parc.com (Bill Janssen) Date: Wed, 1 Oct 2008 08:54:15 PDT Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <48E343AE.3080009@egenix.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com> <200810010954.47564.eckhardt@satorlaser.com> <48E343AE.3080009@egenix.com> Message-ID: <74342.1222876455@parc.com> M.-A. Lemburg wrote: > On 2008-10-01 09:54, Ulrich Eckhardt wrote: > > On Tuesday 30 September 2008, M.-A. Lemburg wrote: > >> On 2008-09-30 08:00, Martin v. L?wis wrote: > >>>> Change the default file system encoding to store bytes in Unicode is > >>>> like introducing a new Python type: . > >>> Exactly. Seems like the best solution to me, despite your polemics. > >> Not a bad idea... have os.listdir() return Unicode subclasses that work > >> like file handles, ie. they have an extra buffer that holds the original > >> bytes value received from the underlying C API. > > > > Why does it have to be a Unicode subclass? In my eyes, a Unicode object > > promises a few things, in particular that it contains a Unicode string. If it > > now suddenly contains bytes without any further meaning, that would be bad. > > Please read my entire email. I was proposing to store the underlying > non-decodeable byte string value in such a subclass. The Unicode value > of the object would then be that underlying value decoded as e.g. > Latin-1 in order to be able to work on it as text. I'm actually sort of liking this idea. A Pathname class, for convenience a subtype of String, but containing the underlying binary representation used by the OS. Even non-unicode pathnames could be represented. Bill From stephen at xemacs.org Wed Oct 1 18:58:14 2008 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Thu, 02 Oct 2008 01:58:14 +0900 Subject: [Python-3000] [Python-Dev] Filename as byte strin g in python 2.6 or 3.0? In-Reply-To: References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net> <48E2CCEC.9030709@canterbury.ac.nz> <871vz0pnuw.fsf@xemacs.org> <87wsgso178.fsf@xemacs.org> Message-ID: <87vdwcntg9.fsf@xemacs.org> Antoine Pitrou writes: > Stephen J. Turnbull sk.tsukuba.ac.jp> writes: > > > > It's usually not "hypothetical"; often, the user knows what it is. > > Why not ask her? That's what web browsers do, in effect, by providing > > View as Charset commands. > > The average user does not even /know/ what a charset is. Where I live they do -- there's a reason why "mojibake" is one of the few Japanese words to be borrowed into English rather than vice versa. > > The problem with the strategies that are being proposed is that this > > is an application-level problem, not a Python-level problem. > > I don't understand why you think that. If a filename can't be > exactly represented with a valid Unicode sequence, all applications > wanting to access that file are impacted in the same way, and it is > likely that the same solution or workaround can be applied to all > applications. That is not my experience in 10+ years of developing XEmacs/MULE. There are many solutions/workarounds, but all of them are vulnerable to the fundamental mismatch between the POSIX definition of a filename (or string, for that matter) as a slightly restricted sequence of octets, and the human being's insistence on interpreting that sequence of octets as the encoded representation of a textual string. 
True, some solutions are better than others, but there seems to be none that dominates across the board. Rather, each of the better ones is appropriate for some subset of users and applications. From janssen at parc.com Wed Oct 1 19:14:00 2008 From: janssen at parc.com (Bill Janssen) Date: Wed, 1 Oct 2008 10:14:00 PDT Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <20081001162006.31635.1753470290.divmod.xquotient.824@weber.divmod.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com> <200810010954.47564.eckhardt@satorlaser.com> <48E343AE.3080009@egenix.com> <74342.1222876455@parc.com> <20081001162006.31635.1753470290.divmod.xquotient.824@weber.divmod.com> Message-ID: <75388.1222881240@parc.com> glyph at divmod.com wrote: > > I'm actually sort of liking this idea. A Pathname class, for > > convenience > > a subtype of String, but containing the underlying binary > > representation > >used by the OS. Even non-unicode pathnames could be represented. > > On the one hand, I agree with you - except for the part where it's a > subtype of String, that doesn't work. In case I haven't mentioned it > enough times already: > > http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePath.html > > On the other hand, we've all been on this merry-go-round before: > > http://www.python.org/dev/peps/pep-0355/ > > Note especially the rejection notice: "Subclassing from str is a > particularly bad idea". Yes, the only real justification for it is to not break existing code (otherwise, calling str() is not that much of an ordeal). > On the other hand, we've all been on this merry-go-round before: > > http://www.python.org/dev/peps/pep-0355/ The very existence of os.path seems a good argument that something like this is useful. Perhaps PEP 355 just went too far. 
Bill From foom at fuhm.net Wed Oct 1 20:30:29 2008 From: foom at fuhm.net (James Y Knight) Date: Wed, 1 Oct 2008 14:30:29 -0400 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? In-Reply-To: <87od26e3an.fsf@xemacs.org> References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> Message-ID: <2E304D87-CBC7-4D43-AAF4-93D08DF826D5@fuhm.net> BTW, Windows will cheerfully let you create and access files with "garbage surrogates" in them. Try it yourself:

    import os
    open(u"\ud8fd", 'w').close()
    os.listdir(u'.')

IMO that pretty much blows out of the water any suggestion that encoding invalid UTF-8 sequences into lone surrogates is an evil and broken thing to do. So, I'm back to favoring the lone surrogate plan over the U+0000 plan. But either one seems better than the alternatives. James On Sep 29, 2008, at 11:11 PM, Stephen J. Turnbull wrote: > James Y Knight writes: >> On Sep 29, 2008, at 3:32 AM, Adam Olsen wrote: > >>> UTF-8b doesn't work as intended. It produces an invalid unicode >>> object (garbage surrogates) that cannot be used with external APIs >>> or >>> libraries that require unicode. >> >> I'd be interested to hear more detail on what you expect the >> practical >> ramifications of this to be. It doesn't sound likely to be a problem >> to me. > > That's because you have a specific use case in mind. Adam clearly has > in mind passing the filename on to a library which might proceed to > signal an error (to him, unexpected) on garbage surrogates. He > doesn't want to be surprised by that. 
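For readers following along with a later Python: the "UTF-8b" / lone-surrogate idea debated here was eventually standardized in Python 3.1 as the `surrogateescape` error handler (PEP 383). The sketch below is an illustration added for context, not code from the thread, and the file name is made up. It shows both sides of the argument: the bytes round-trip exactly, but the resulting string is rejected by a strict encoder, which is the surprise Adam Olsen was worried about.

```python
# A file name that is not valid UTF-8 (made-up example).
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 is smuggled through as the lone
# surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", "surrogateescape")

# A strict encoder, as an external library might use, refuses the string.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The design trade-off is the one visible in the thread: file system round-tripping works for every name, at the cost of strings that are not valid Unicode for interchange.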
From martin at v.loewis.de Wed Oct 1 21:08:50 2008 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Wed, 01 Oct 2008 21:08:50 +0200 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <200810011043.25662.victor.stinner@haypocalc.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com> <200810011043.25662.victor.stinner@haypocalc.com> Message-ID: <48E3CAC2.6010203@v.loewis.de> >> SQLite has a similar problem with NULLs, and I'm definitely sticking >> paths in there, too. > > I think that you can say "all C libraries". Just for the sake of nit-picking: the socket library, and the regular POSIX stream IO library (as well as C standard "unformatted" IO) deal just fine with embedded NULL characters. > * Java doesn't support unicode > 0xFFFF (bouuuuh!) I don't think that is true anymore. Regards, Martin From guido at python.org Wed Oct 1 22:29:39 2008 From: guido at python.org (Guido van Rossum) Date: Wed, 1 Oct 2008 13:29:39 -0700 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? In-Reply-To: <48E3CC12.1070207@g.nevcal.com> References: <200809271404.25654.victor.stinner@haypocalc.com> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <2E304D87-CBC7-4D43-AAF4-93D08DF826D5@fuhm.net> <48E3CC12.1070207@g.nevcal.com> Message-ID: On Wed, Oct 1, 2008 at 12:14 PM, Glenn Linderman wrote: > The original byte string must be preserved for use in actually opening > files. How it is displayed is another question. Doing something that > works for both Unicode display and access to the file is basically > impossible in all cases. 
Providing an encapsulation of the byte string > that has display methods, together with new methods to transform the > file path, and use parts of it to create other file paths, is the > solution I described earlier. Using the display string (what existing > programs are likely to do) for transformations instead of the new > methods will work for files with Unicode file names, and break for > others. As long as the solution of new transformation methods is made > available, there is a migration path for people that encounter > problems. I think handling files containing Unicode names properly and > compatibly, together with a migration path for file not in Unicode is > about the best that can be expected. The low-level solution(s) we'll be making available in 3.0 should enable you to implement this and many other higher-level approaches. -- --Guido van Rossum (home page: http://www.python.org/~guido/) 
From Jack.Jansen at cwi.nl Wed Oct 1 00:05:22 2008 From: Jack.Jansen at cwi.nl (Jack Jansen) Date: Wed, 1 Oct 2008 00:05:22 +0200 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <48E29D3B.5030900@v.loewis.de> References: <200809291407.55291.victor.stinner@haypocalc.com> <48E29D3B.5030900@v.loewis.de> Message-ID: On 30-Sep-2008, at 23:42 , Martin v. Löwis wrote: > It's the other way 'round: On Windows, Unicode file names are the > natural choice, and byte strings have limitations. In a sense, Windows > got it right - but then, they started later. Unix missed the > opportunity > of declaring that all file APIs are UTF-8 (except for Plan-9 and OS X, > neither being "true" Unix). How does windows (and Python on windows) handle NFC versus NFD issues? Can I have two files called "ümlaut.txt", one in NFD and one NFC form? And are both of those representable on the Python side (i.e. can they both be returned from listdir() and passed to open())? If I compare these two filenames, do they compare differently? -- Jack Jansen, , http://www.cwi.nl/~jack If I can't dance I don't want to be part of your revolution -- Emma Goldman From Jack.Jansen at cwi.nl Wed Oct 1 00:49:57 2008 From: Jack.Jansen at cwi.nl (Jack Jansen) Date: Wed, 1 Oct 2008 00:49:57 +0200 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <48E2A8E3.3070805@v.loewis.de> References: <200809291407.55291.victor.stinner@haypocalc.com> <48E29D3B.5030900@v.loewis.de> <48E2A8E3.3070805@v.loewis.de> Message-ID: <82D029DA-C218-4631-A68E-CE3DBB03494A@cwi.nl> On 1-Oct-2008, at 00:32 , Martin v. Löwis wrote: > >> How does windows (and Python on windows) handle NFC versus NFD >> issues? > > That's left to the application. 
> >> Can I have two files called "ümlaut.txt", one in NFD and one NFC >> form? > > Yes, you can. It sounds confusing, but only in a theoretical way. You > never have combining characters on Windows (at least, I don't). The > keyboard input defaults to NFC, and users normally don't type file > names, anyways, except when creating the files - later, they just use > the mouse to indicate what file they want to act on. > >> And are both of those representable on the Python side (i.e. can they >> both be returned from listdir() and passed to open())? > > Certainly! > >> If I compare >> these two filenames, do they compare differently? > > Certainly! Actually, that all sounds pretty non-confusing to me:-) So, normal users will always have the one form, and if by chance they get the other form they can still use the file. Also from Python, even when doing listdir() and then open(), everything will work just as expected. That there are two files that have a similar visual representation is not too bad, the same happens with ellipses versus dot-dot-dot and many other cases. Which means the only problem area left is unix filesystems (whether on Linux or mounted remotely on MacOS or whatever), where filenames are really byte strings with only / and nul illegal. -- Jack Jansen, , http://www.cwi.nl/~jack If I can't dance I don't want to be part of your revolution -- Emma Goldman 
From glyph at divmod.com Wed Oct 1 04:06:25 2008 From: glyph at divmod.com (glyph at divmod.com) Date: Wed, 01 Oct 2008 02:06:25 -0000 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E2865A.3010404@v.loewis.de> Message-ID: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com> On 30 Sep, 09:22 pm, guido at python.org wrote: >On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" >wrote: >>Guido van Rossum wrote: >>>On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" >>> wrote: >>>Martin, I don't understand why you are in favor of storing raw bytes >>>encoded as Latin-1 in Unicode string objects, which clearly gives >>>rise >>>to mojibake. This is my word of the day, by the way. Reading this whole thread was _totally_ worth it to learn about "mojibake". Obviously I'm familiar with the phenomenon but somehow I'd never heard this awesome term before. >I am also encouraged by Glyph's support for (a). He has a lot of >practical experience. Thanks for the vote of confidence. I hope for all our sakes that you're not over-valuing that experience ;-). For what it's worth, I can see MvL's point in that I think there is some danger in generating confusion by adding _too many_ string-like functions to the bytes type. I don't want my suggestion to contribute to the confusion between bytes and text. However, Martin, I can promise you that I will _never_ ask for any convenience functions related to bytes as a result of this decision. I want bytes to come back from filesystem APIs because I intend to have a wrapper layer which knows two things about the file: the bytes (which are needed to talk to POSIX filesystem APIs) and the characters (which are computed from those bytes, can be safely renormalized, displayed to users, etc). 
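A very rough sketch of the kind of wrapper described in the paragraph above might look like the following. It is an illustration under stated assumptions, not Twisted's `FilePath` and not anything proposed for the stdlib: the class name `PathName`, its methods, and the choice of UTF-8 with replacement characters for display are all hypothetical, and the sketch only models POSIX-style byte paths.

```python
class PathName:
    """Hypothetical wrapper that knows two things about a path:
    the exact bytes for the OS and a string suitable for display."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = bytes(raw)        # what POSIX file APIs need
        self.encoding = encoding     # assumed display encoding

    @property
    def text(self):
        # Display only: undecodable bytes become U+FFFD, so this
        # string must never be used to reopen the file.
        return self.raw.decode(self.encoding, "replace")

    def child(self, segment):
        # Path manipulation is done on the bytes, not on the text.
        if isinstance(segment, PathName):
            seg = segment.raw
        else:
            seg = segment.encode(self.encoding)
        return PathName(self.raw + b"/" + seg, self.encoding)


p = PathName(b"/tmp").child(PathName(b"caf\xe9"))
print(p.text)
```

The point of the design is that joining and splitting happen on `raw`, so a name that cannot be decoded still refers to the right file, while `text` is free to be lossy.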
On Windows this filesystem wrapper will necessarily behave differently, and will not use bytes for anything. Any formatting beyond joining path segments together and possibly splitting extensions off will be done on character strings, not byte strings. The proposal of using U+0000 seems like it would have been almost the same from such a wrapper's perspective, except (A) people using the filesystem APIs without the benefit of such a wrapper would have been even more screwed, and (B) there are a few nasty corner-cases when dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what it would have done with. Guido already mentioned "libraries" as a hypothetical issue, but here's a real-world problem that results from putting NULLs into filenames. Consider this program:

    import gtk
    w = gtk.Window()
    b = gtk.Button(u"\u0000/hello/world")
    w.add(b)
    w.show_all()
    gtk.main()

which emits this message:

    TypeError: OGtkButton.__init__() argument 1 must be string without null bytes or None, not unicode

SQLite has a similar problem with NULLs, and I'm definitely sticking paths in there, too. Eventually I'd like to propose such a path type for inclusion in the stdlib, but that will have to wait for issues like to be resolved. 
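The same restriction is visible without any GUI toolkit. As a small illustration (added here, not part of the original message), current Python 3 refuses to hand a string with an embedded U+0000 to the operating system's file APIs, since the underlying C calls treat NUL as the end of the string. The exact exception type has varied between Python versions, so the sketch accepts either of the two that have been used.

```python
import os

try:
    # The C-level stat() call cannot represent a NUL inside a path.
    os.stat("hello\u0000world")
    rejected = False
except (ValueError, TypeError):
    # Recent Python 3 versions raise ValueError; older ones raised
    # TypeError for the same condition.
    rejected = True

print(rejected)
```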
From glyph at divmod.com Wed Oct 1 07:19:47 2008 From: glyph at divmod.com (glyph at divmod.com) Date: Wed, 01 Oct 2008 05:19:47 -0000 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <22920D6A-8B70-4E6D-BE99-D7447D831B41@fuhm.net> References: <200809291407.55291.victor.stinner@haypocalc.com> <200809300202.38574.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E2865A.3010404@v.loewis.de> <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com> <22920D6A-8B70-4E6D-BE99-D7447D831B41@fuhm.net> Message-ID: <20081001051947.31635.1251804577.divmod.xquotient.807@weber.divmod.com> On 03:32 am, foom at fuhm.net wrote: >On Sep 30, 2008, at 10:06 PM, glyph at divmod.com wrote: >Can you clarify what proposal you are supporting for Python: Sure. Neither of your descriptions is terribly accurate, but I'll try to explain. >1) Two sets of APIs, one returning unicode strings, and one returning >bytestrings. (subpoints: what does the unicode-returning API do when >it cannot decode the bytestring into unicode? raise exception, pretend >argument/envvar/file didn't exist/?) The only API discussed so far which would actually provide two variants is 'getcwd', which would have a 'getcwdb' that gives back bytes instead. Pretty much every other API takes some kind of input. listdir(bytes) would give back bytes, while listdir(text) would give back text. listdir(text) would skip undecodable filenames. Similarly for all the other APIs in os and os.path that take pathnames for input. >2) All APIs return bytestrings only. Converting to unicode is >considered lossy, and would have to be done by applications for >display purposes only. This is a bad way to do things, because on Windows, filenames *really are* unicode. Converting to bytes is what's lossy. (See previous discussion of active codepages and CreateFileA/CreateFileW.) >I really don't understand the reasoning for (1). 
The reasoning is that a lot of software doesn't care if it's wrong for edge cases, it's really hard to come up with something that's correct with respect to all of those edge cases (absurdly difficult, if you need to stay in the straightjacket of string / bytes types, as well as provide a useful library interface - which is why we're having this discussion). But, it should be _possible_ to write software that's correct in the face of those edge cases. And - let's not forget this - the worlds of POSIX and Windows really are different and really do require subtly different inputs. Python can try to paper over this like Java does and make it impossible to write certain classes of application, or it can just provide an ugly, slightly inconsistent API that exposes the ugly, slightly inconsistent reality. Modulo the issues you've raised which I don't think the proposal totally covers yet (abspath with a non-decodable cwd) I think it strikes a nice balance; allow people to live in the delusion of unicode-on-POSIX and have software that mostly works, most of the time, or allow them to face the unpleasantness and spend the effort to get something really solid. I think the _right_ answer to all of this is to (A) make FilePath work completely correctly for every totally insane edge case ever, and (B) include it in the stdlib. One day I think we'll do that. But nobody has the time or energy to do even the first part of that *right now*, before 3.0 is released, so I'm just looking for something which it will be possible to build FilePath, or something like it, on top of, without breaking other people's applications who rely on the os module directly too badly. >It seems to me that most software (probably including all of the >Python stdlib) would continue to use the unicode string API. That's true. And that software wouldn't handle these edge cases completely correctly. As Guido put it, "it's a quality of implementation issue". 
>Switching all of the Python stdlib to use the bytestring APIs instead >would certainly be a large undertaking, and would have all sorts of >ripple-on API changes (e.g. __file__). I am not quite sure what to do about __file__. My preference would probably be to use unicode filename for consistency so it can always be displayed, but provide a second attribute (__open_file__?) that would be sometimes unicode, sometimes bytes, which would be guaranteed to work with open(). I suspect that most software which interacts with __file__ on a deep level would be of the variety which would deal with the edge cases. But where the Python stdlib wants a pathname it should be accepting either bytes or unicode, as all of the os.path functions want. This does kind of suck, but the alternatives are to encode crazy extra information in unicode path names that cannot be exchanged with other programs (or with users: NULL is potentially the worst bogus character from a UI perspective), or revert to bytes for everything (which is a non-solution, c.f. Windows above). >So I can only imagine that if you're proposing (1), you're doing so >without the intention of suggesting that Python be converted to use >it. Maybe updating the stdlib to be correct in the face of such changes is hard, but it doesn't seem intractible. Taken together, it looks like there are only about 100 calls in the stdlib to both getcwd and abspath together, and I suspect many of them are for purely aesthetic purposes and could just be eliminated, and many of them are redefinitions of the functions and don't need any changes. All the other path manipulation functions would continue to work as-is, although some of them might skip undecodable files. >And so, of course, that doesn't really fix things (such as getcwd >failing if your cwd is a path that is undecodeable in the current >locale, or well, currently, python refusing to even start). 
The proposal as I understand it so far doesn't address this specifically, so I'll try to. os.getcwd, os.path.abspath, and os.path.realpath (when called with unicode) will probably need to do something gross if they're called on a non-decodable directory. One thing that comes to mind is to create a temporary symbolic link and return u'/tmp/python-$YOURUID-undecodable/$GUID/something'. I hope someone else has a better idea, especially since that sort of defeats the purpose of realpath. On the other hand, even this strawman answer is correct for pretty much any sane purpose, and if you _really_ care, you need to learn that you have to use and ask for bytes, on POSIX, to deal with such corner cases. >If you're proposing (2), (...) Luckily I'm not. >>The proposal of using U+0000 seems like it would have been almost the >>same from such a wrapper's perspective, except (A) people using the >>filesystem APIs without the benefit of such a wrapper would have been >>even more screwed > >I'm not sure what your "more screwed" is comparing against: current >py3k behavior? (aka: decoding to Unicode in locale's specified >encoding)? I don't see how you can really be more screwed than that: >not only can't you send your filename to display in a Gtk+ button, you >can't access it at all, even staying within python. You're screwed if you're trying to access files in a portable way without worrying at all about encodings. There are files you won't be able to access, there are conditions you won't be able to deal with. Sorry, but POSIX sucks and that's life. You're _more_ screwed if you're trying to access those files in a portable way without worrying about encodings, and the API you're using is giving you back invalid, magic path names, with NULLs rather than being slightly lossy and dropping filenames you (obviously, by virtue of the way you requested those filenames) won't be able to deal with. 
So I was talking here about the default behavior in the case of a naive program that wants to pretend all paths are unicode. >>and (B) there are a few nasty corner-cases when dealing with >>surrogate (i.e. invalid, in UTF-8) code points which I'm not quite >>sure what it would have done with. > >The lone-surrogate-pair proposal was a totally different proposal than >the U+0000 one. I wasn't referring to the lone-surrogate-pair encoding trick, I was referring to the fact that some people are going to want to treat surrogate pairs as encoding errors (i.e. include the NULL byte) and some will want to treat them as valid. If you want them to be valid you have to normalize away the surrogates in order to talk to other software, but you can't do that because then you'll get different bytes when you re- encode them. There's probably a way around that but it would be subtle and controversial no matter how you did it. From eckhardt at satorlaser.com Wed Oct 1 09:54:47 2008 From: eckhardt at satorlaser.com (Ulrich Eckhardt) Date: Wed, 1 Oct 2008 09:54:47 +0200 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <48E20017.3020405@egenix.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com> Message-ID: <200810010954.47564.eckhardt@satorlaser.com> On Tuesday 30 September 2008, M.-A. Lemburg wrote: > On 2008-09-30 08:00, Martin v. L?wis wrote: > >> Change the default file system encoding to store bytes in Unicode is > >> like introducing a new Python type: . > > > > Exactly. Seems like the best solution to me, despite your polemics. > > Not a bad idea... have os.listdir() return Unicode subclasses that work > like file handles, ie. they have an extra buffer that holds the original > bytes value received from the underlying C API. Why does it have to be a Unicode subclass? 
In my eyes, a Unicode object promises a few things, in particular that it
contains a Unicode string. If it now suddenly contains bytes without any
further meaning, that would be bad.

What I wonder is what the requirements on path handling are. I'll try to list
the ones I can see:

1. A path received from the system should be preserved, so it can be given to
   the system later on. IOW, the internal representation should not lose any
   information compared to the one used by the OS.
2. Typical operations like joining two path segments or moving to the parent
   dir should be defined.
3. There must be a way to display the path to the user. IOW, there should be
   a way to turn the path into a string that the user can recognise,
   according to some encoding. Note that this is not always possible, so this
   can fail.
4. There must be a way to receive a path from the user. That means that there
   must be a way from a user-entered string to a path. Note that this, too,
   isn't always possible and can fail.
5. The conversion between a string and a path should be configurable, with
   defaults retrieved from the system. This is so that most operations will
   just work and do the thing that the user expects.
6. There should be a way to modify the path data itself. This of course
   requires knowledge about the internals but gives full power to the
   programmer.

For requirement 3, I would say a lossy conversion to a string would be
enough, i.e. try to convert the path to a Unicode string and use a question
mark or some escaping to mark parts that can't be decoded. It will allow
users to recognise the decodable parts of the path, with hopefully just a few
characters left undecoded.

For requirement 4, a failure to encode a string to a path must result in a
loud failure, i.e. an exception. This is because the user entered a path that
we can't use; any guessing at what the user might have wanted is futile.

Are there any points to add?
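The requirements listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed API: the class name `RawPath` and all of its methods are hypothetical, the sketch assumes a POSIX-style system where the OS representation is bytes, and it uses only `os.path` functions and codec error handlers that exist in Python 3.

```python
import os
import sys


class RawPath:
    """Hypothetical path object that keeps the exact bytes given by the OS."""

    def __init__(self, raw, encoding=None):
        if not isinstance(raw, bytes):
            raise TypeError("RawPath wraps bytes; use from_user() for text")
        # Requirements 1 and 6: the raw data is kept unchanged and accessible.
        self.raw = raw
        # Requirement 5: configurable, with the default taken from the system.
        self.encoding = encoding or sys.getfilesystemencoding()

    @classmethod
    def from_user(cls, text, encoding=None):
        """Requirement 4: strict conversion that fails loudly."""
        encoding = encoding or sys.getfilesystemencoding()
        # "strict" raises UnicodeEncodeError instead of guessing.
        return cls(text.encode(encoding, "strict"), encoding)

    def join(self, *segments):
        """Requirement 2: join further segments (bytes or RawPath)."""
        parts = [s.raw if isinstance(s, RawPath) else s for s in segments]
        return RawPath(os.path.join(self.raw, *parts), self.encoding)

    def parent(self):
        """Requirement 2: move to the parent directory."""
        return RawPath(os.path.dirname(self.raw), self.encoding)

    def display(self):
        """Requirement 3: lossy text for display only, never for reopening."""
        return self.raw.decode(self.encoding, "replace")


# Usage: the undecodable byte survives in .raw but is masked in display().
p = RawPath(b"/tmp/r\xffport", "utf-8")
print(repr(p.display()))
print(repr(p.join(b"x").raw))
```

Keeping `display()` separate from `raw` is the point of the design: the lossy string is never fed back to the system, so the lossiness allowed by requirement 3 cannot break requirement 1.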
Uli -- Sator Laser GmbH From glyph at divmod.com Wed Oct 1 18:20:06 2008 From: glyph at divmod.com (glyph at divmod.com) Date: Wed, 01 Oct 2008 16:20:06 -0000 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <74342.1222876455@parc.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com> <200810010954.47564.eckhardt@satorlaser.com> <48E343AE.3080009@egenix.com> <74342.1222876455@parc.com> Message-ID: <20081001162006.31635.1753470290.divmod.xquotient.824@weber.divmod.com> On 03:54 pm, janssen at parc.com wrote: >I'm actually sort of liking this idea. A Pathname class, for >convenience >a subtype of String, but containing the underlying binary >representation >used by the OS. Even non-unicode pathnames could be represented. On the one hand, I agree with you - except for the part where it's a subtype of String, that doesn't work. 
In case I haven't mentioned it enough times already: http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePath.html On the other hand, we've all been on this merry-go-round before: http://www.python.org/dev/peps/pep-0355/ Note especially the rejection notice: "Subclassing from str is a particularly bad idea". Again, one day I'd really like to add one of these to Python. Now is not the time. From ncoghlan at gmail.com Wed Oct 1 23:39:42 2008 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 02 Oct 2008 07:39:42 +1000 Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue In-Reply-To: <75388.1222881240@parc.com> References: <200809291407.55291.victor.stinner@haypocalc.com> <48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com> <200810010954.47564.eckhardt@satorlaser.com> <48E343AE.3080009@egenix.com> <74342.1222876455@parc.com> <20081001162006.31635.1753470290.divmod.xquotient.824@weber.divmod.com> <75388.1222881240@parc.com> Message-ID: <48E3EE1E.5000300@gmail.com> Bill Janssen wrote: > Perhaps PEP 355 just went too far. That was certainly one of the major objections to it. A filesystem path object which didn't try to combine a half-dozen different modules into methods on a single object, but instead focused on solving a few specific problems with using raw strings as file paths would have a far greater chance of acceptance. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From foom at fuhm.net Thu Oct 2 00:14:50 2008 From: foom at fuhm.net (James Y Knight) Date: Wed, 1 Oct 2008 18:14:50 -0400 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? 
In-Reply-To: <48E3C98A.1000906@nevcal.com> References: <200809271404.25654.victor.stinner@haypocalc.com> <48DE705E.6050405@v.loewis.de> <52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <2E304D87-CBC7-4D43-AAF4-93D08DF826D5@fuhm.net> <48E3C98A.1000906@nevcal.com> Message-ID: <5F040550-868B-40EC-A80B-460EE701B1A1@fuhm.net> On Oct 1, 2008, at 3:03 PM, Glenn Linderman wrote: > On approximately 10/1/2008 11:30 AM, came the following characters > from the keyboard of James Y Knight: >> BTW, Windows will cheerfully let you create and access files with >> "garbage surrogates" in it. >> Try it yourself: >> >> open(u"\ud8fd", 'w').close() >> os.listdir(u'.') > > But Windows doesn't have the problem of non-Unicode sequences > needing to be translated to something else in the first place. So > this is mostly irrelevant to the problem at hand. Well...either you consider lone surrogates as valid Unicode sequences, or else Windows *does* have the problem of non-Unicode sequences needing to be translated to something else. Currently, the answer is that lone surrogates are treated as valid Unicode, and allowed into Python via the windows file APIs. Thus, filename strings in Python are going to have lone surrogates, anyways, on Windows. Therefore, any external library which freaks out upon seeing a lone surrogate is already going to be broken for some filenames on Windows. So, it seems to me, converting invalid UTF-8 sequences into lone surrogates for Unix doesn't actually add any new form of brokenness. So why not just do that? >> So, I'm back to favoring the lone surrogate plan over the U+0000 >> plan. But either one seems better than the alternatives. > > The original byte string must be preserved for use in actually > opening files. 
Or reversibly transformed. > How it is displayed is another question. Doing something that works > for both Unicode display and access to the file is basically > impossible in all cases. Providing an encapsulation of the byte > string that has display methods, together with new methods to > transform the file path, and use parts of it to create other file > paths, is the solution I described earlier. This sounds like a fine solution. And it would work just as well with a UTF-8b base API as with a dual string/byte string base API. The only difference is what the default behavior for people who don't use your new fancy API is. In the UTF-8b case, most things would work, even with invalidly-encoded filenames. James From rhamph at gmail.com Thu Oct 2 00:41:32 2008 From: rhamph at gmail.com (Adam Olsen) Date: Wed, 1 Oct 2008 16:41:32 -0600 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? In-Reply-To: <5F040550-868B-40EC-A80B-460EE701B1A1@fuhm.net> References: <200809271404.25654.victor.stinner@haypocalc.com> <48DFF382.7020006@v.loewis.de> <52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com> <96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net> <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net> <87od26e3an.fsf@xemacs.org> <2E304D87-CBC7-4D43-AAF4-93D08DF826D5@fuhm.net> <48E3C98A.1000906@nevcal.com> <5F040550-868B-40EC-A80B-460EE701B1A1@fuhm.net> Message-ID: On Wed, Oct 1, 2008 at 4:14 PM, James Y Knight wrote: > On Oct 1, 2008, at 3:03 PM, Glenn Linderman wrote: >> On approximately 10/1/2008 11:30 AM, came the following characters from >> the keyboard of James Y Knight: >>> >>> BTW, Windows will cheerfully let you create and access files with >>> "garbage surrogates" in it. >>> Try it yourself: >>> >>> open(u"\ud8fd", 'w').close() >>> os.listdir(u'.') >> >> But Windows doesn't have the problem of non-Unicode sequences needing to >> be translated to something else in the first place. So this is mostly >> irrelevant to the problem at hand. 
> > > Well...either you consider lone surrogates as valid Unicode sequences, or > else Windows *does* have the problem of non-Unicode sequences needing to be > translated to something else. > > Currently, the answer is that lone surrogates are treated as valid Unicode, > and allowed into Python via the windows file APIs. Thus, filename strings in > Python are going to have lone surrogates, anyways, on Windows. We allow lone surrogates into our unicode objects, but they aren't valid Unicode. They'll fail for any APIs that expect only valid Unicode. > Therefore, any external library which freaks out upon seeing a lone > surrogate is already going to be broken for some filenames on Windows. So, > it seems to me, converting invalid UTF-8 sequences into lone surrogates for > Unix doesn't actually add any new form of brokenness. So why not just do > that? I see it the opposite: lone surrogates on windows should be rejected from unicode APIs, just as we want to do for invalid UTF-8 on linux. But since the same rationale for having a "raw" API applies, maybe the windows byte APIs should expose raw UTF-16, rather than letting it be translated? -- Adam Olsen, aka Rhamphoryncus From victor.stinner at haypocalc.com Thu Oct 2 13:50:49 2008 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Thu, 2 Oct 2008 13:50:49 +0200 Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError Message-ID: <200810021350.49292.victor.stinner@haypocalc.com> This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. It's a *draft*, don't hesitate to comment it. This document suppose that my patch to allow bytes filenames is accept which is not the case today. While I was writing this document I found poential problems in Python3. So here is a TODO list (things to be checked): FIXME: PyUnicode_DecodeFSDefaultAndSize(): errors="replace"! 
FIXME: import.c uses ASCII if default file system is unknown, whereas other functions uses UTF-8 FIXME: Write a function in Python3 to convert a bytes filename to a nice string FIXME: When bytearray is accepted or not? FIXME: Allow bytes/str mix for shutil.copy*()? The ignore callback will get bytes or unicode? FIXME: Use a shorter title for this PEP :-) Can anyone write a section about bytes encoding in Unicode using escape sequence? What is the best tool to work on a PEP? I hate email threads, and I would prefer SVN / Mercurial / anything else. --- Title: Python3 and UnicodeDecodeError for the command line, environment variables and filenames Introduction ============ Python3 does its best to give you texts encoded as a valid unicode characters strings. When it hits an invalid bytes sequence (according to the used charset), it has two choices: drops the value or raises an UnicodeDecodeError. This document present the behaviour of Python3 for the command line, environment variables and filenames. Example of an invalid bytes sequence: :: >>> str(b'\xff', 'utf8') UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...) whereas the same byte sequence is valid in another charset like ISO-8859-1: :: >>> str(b'\xff', 'iso-8859-1') '?' Default encoding ================ Python uses "UTF-8" as the default Unicode encoding. You can read the default charset using sys.getdefaultencoding(). The "default encoding" is used by PyUnicode_FromStringAndSize(). A function sys.setdefaultencoding() exists, but it raises a ValueError for charset different than UTF-8 since the charset is hardcoded in PyUnicode_FromStringAndSize(). Command line ============ Python creates a nice unicode table for sys.argv using mbstowcs(): :: $ ./python -c 'import sys; print(sys.argv)' 'Ho h? !' ['-c', 'Ho h? !'] On Linux, mbstowcs() uses LC_CTYPE environement variable to choose the encoding. On an invalid bytes sequence, Python quits directly with an exit code 1. 
Example with UTF-8 locale: ::

    $ python3.0 $(echo -e 'invalid:\xff')
    Could not convert argument 1 to string

Environment variables
=====================

On Windows, Python uses "_wenviron", which contains unicode (UTF-16-LE)
strings. On other OSes, it uses the "environ" variable and the UTF-8 charset.
It drops a variable if its key or value is not convertible to unicode.
Example: ::

    env -i HOME=/home/my PATH=$(echo -e "\xff") python
    >>> import os; list(os.environ.items())
    [('HOME', '/home/my')]

Both keys and values are unicode strings. An empty key and/or value is
allowed.

Filenames
=========

Introduction
------------

Python2 uses byte filenames everywhere, but it is also possible to use
unicode filenames. Examples:

 - os.getcwd() gives bytes whereas os.getcwdu() always returns unicode
 - os.listdir(unicode) creates bytes or unicode filenames (fallback to bytes
   on UnicodeDecodeError), os.readlink() has the same behaviour
 - glob.glob() converts the unicode pattern to bytes, and so creates bytes
   filenames
 - open() supports bytes and unicode

Since listdir() mixes bytes and unicode, you cannot easily manipulate
filenames: ::

    >>> path=u'.'
    >>> for name in os.listdir(path):
    ...  print repr(name)
    ...  print repr(os.path.join(path, name))
    ...
    u'valid'
    u'./valid'
    'invalid\xff'
    Traceback (most recent call last):
    ...
    File "/usr/lib/python2.5/posixpath.py", line 65, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff (...)

Python3 supports both types, bytes and unicode, but disallows mixing them. If
you ask for unicode, you will always get unicode or an exception is raised.

You should only use unicode filenames, unless you are writing a program that
fixes file system encodings or a backup tool, or your users are unable to fix
their broken system.

Windows
-------

Microsoft Windows since Windows 95 only uses Unicode (UTF-16-LE) filenames.
So you should only use unicode filenames.
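The "no mixing" rule from the Filenames introduction above can be checked with a short script. This is a sketch assuming a released CPython 3, whose `os.path.join()` and `os.listdir()` behave this way; the helper name `join_mixed` is made up for illustration. It only shows the typing rule. How undecodable names are treated is a separate question, and released Python 3 versions may not match this draft on that point.

```python
import os

# Same-type arguments work, and the result type follows the argument type.
text_joined = os.path.join(".", "name")      # str + str -> str
bytes_joined = os.path.join(b".", b"name")   # bytes + bytes -> bytes


def join_mixed():
    """Try to join a str directory with a bytes name (hypothetical helper)."""
    try:
        os.path.join(".", b"name")
    except TypeError:
        return "TypeError"
    return "no error"


# listdir() also returns names of the same type as its argument.
text_names = os.listdir(".")
bytes_names = os.listdir(b".")

print(type(text_joined).__name__, type(bytes_joined).__name__)
print(join_mixed())
print(all(isinstance(n, str) for n in text_names))
print(all(isinstance(n, bytes) for n in bytes_names))
```

A program that must handle every file on a POSIX system would therefore pick bytes once, at the top-level call, and stay in bytes throughout.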
Non Windows (POSIX)
-------------------

POSIX OSes like Linux use bytes for historical reasons. In the best case, all
filenames will be encoded as valid UTF-8 strings and Python creates valid
unicode strings. But since system calls use bytes, the file system may return
an invalid filename, or a program can create a file with an invalid filename.

An invalid filename is a string which cannot be decoded to unicode using the
default file system encoding (which is UTF-8 most of the time).

A robust program has to use only the bytes type to make sure that it will be
able to open / copy / remove any file or directory.

Filename encoding
-----------------

Python uses:

 * "mbcs" on Windows
 * or "utf-8" on Mac OS X
 * or nl_langinfo(CODESET) on OSes supporting this function
 * or UTF-8 by default

"mbcs" is not a valid charset name; it's an internal charset saying that
Python will use the function MultiByteToWideChar() to decode bytes to
unicode. This function uses the current codepage to decode byte strings.

You can read the charset using sys.getfilesystemencoding(). The function may
return None if Python is unable to determine the default encoding.

PyUnicode_DecodeFSDefaultAndSize() uses the default file system encoding, or
UTF-8 if it is not set.

On UNIX (and other operating systems), it's possible to mount different file
systems using different charsets. sys.getfilesystemencoding() will be the
same for the different file systems since this encoding is only used between
Python and the Linux kernel, not between the kernel and the file system,
which may use a different charset.

Display a filename
------------------

Example of a function formatting a bytes filename to display it to human
eyes: ::

    from sys import getfilesystemencoding

    def format_filename(filename):
        return str(filename, getfilesystemencoding(), 'replace')

Example: format_filename(b'r\xffport.doc') gives 'r\ufffdport.doc' (U+FFFD,
the replacement character, in place of the undecodable byte) with the UTF-8
encoding.
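A string produced this way is for display only. The following sketch reuses the format_filename() idea from the section above, with the encoding fixed to UTF-8 as an assumption so that the result does not depend on the locale, and shows that the displayed string cannot be encoded back to the original bytes.

```python
def format_filename(filename, encoding="utf-8"):
    """Decode a bytes filename for display; undecodable bytes become U+FFFD."""
    return str(filename, encoding, "replace")


raw = b"r\xffport.doc"               # invalid as UTF-8 because of the 0xff byte
shown = format_filename(raw)         # safe to print or put in a GUI label
round_trip = shown.encode("utf-8")   # what we would get by re-encoding it

print(repr(shown))
print(round_trip == raw)
```

The comparison comes out false, because U+FFFD encodes to three bytes that have nothing to do with the original 0xff. A program that needs to open the file again has to keep the bytes filename next to the display string.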
Functions producing filenames ----------------------------- Policy: for unicode arguments: drop invalid bytes filenames; for bytes arguments: return bytes - os.listdir() - glob.glob() Policy: for an unicode argument: raise an UnicodeDecodeError on invalid filename; for an bytes argument: return bytes - os.readlink() Policy: create unicode directory or raise an UnicodeDecodeError - os.getcwd() Policy: always returns bytes - os.getcwdb() Functions for filename manipulation ----------------------------------- Policy: raise TypeError on bytes/str mix - os.path.*(), eg. os.path.join() - fnmatch.*() Functions accessing files ------------------------- Policy: accept both bytes and str - io.open() - os.open() - os.chdir() - os.stat(), os.lstat() - os.rename() - os.unlink() - shutil.*() os.rename(), shutil.copy*(), shutil.move() allow to use bytes for an argment, and unicode for the other argument bytearray --------- In most cases, bytearray() can be used as bytes for a filename. Unicode normalisation ===================== Unicode characters can be normalized in 4 forms: NFC, NFD, NFKC or NFKD. Python does never normalize strings (nor filenames). No operating system does normalize filenames. So the users using different norms will be unable to retrieve their file. Don't panic! All users use the same norm. Use unicodedata.normalize() to normalize an unicode string. From mal at egenix.com Thu Oct 2 14:07:50 2008 From: mal at egenix.com (M.-A. Lemburg) Date: Thu, 02 Oct 2008 14:07:50 +0200 Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError In-Reply-To: <200810021350.49292.victor.stinner@haypocalc.com> References: <200810021350.49292.victor.stinner@haypocalc.com> Message-ID: <48E4B996.9030101@egenix.com> On 2008-10-02 13:50, Victor Stinner wrote: > This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. The PEP doesn't appear to address any potential changes. Wouldn't it be better to add such information to the Python3 documentation itself ?! 
> It's > a *draft*, don't hesitate to comment it. This document suppose that my patch > to allow bytes filenames is accept which is not the case today. > > While I was writing this document I found poential problems in Python3. So > here is a TODO list (things to be checked): > > FIXME: PyUnicode_DecodeFSDefaultAndSize(): errors="replace"! > FIXME: import.c uses ASCII if default file system is unknown, whereas other > functions uses UTF-8 > FIXME: Write a function in Python3 to convert a bytes filename to a nice > string > FIXME: When bytearray is accepted or not? > FIXME: Allow bytes/str mix for shutil.copy*()? The ignore callback will get > bytes or unicode? > FIXME: Use a shorter title for this PEP :-) > > Can anyone write a section about bytes encoding in Unicode using escape > sequence? > > What is the best tool to work on a PEP? I hate email threads, and I would > prefer SVN / Mercurial / anything else. > --- > > Title: Python3 and UnicodeDecodeError for the command line, > environment variables and filenames > > Introduction > ============ > > Python3 does its best to give you texts encoded as a valid unicode characters > strings. When it hits an invalid bytes sequence (according to the used > charset), it has two choices: drops the value or raises an UnicodeDecodeError. > This document present the behaviour of Python3 for the command line, > environment variables and filenames. > > Example of an invalid bytes sequence: :: > > >>> str(b'\xff', 'utf8') > UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...) > > whereas the same byte sequence is valid in another charset like ISO-8859-1: :: > > >>> str(b'\xff', 'iso-8859-1') > '?' You have left out all the options you have by using a different error handling mechanism (using a third parameter to str()), e.g. 'replace', 'ignore', etc. > Default encoding > ================ > > Python uses "UTF-8" as the default Unicode encoding. You can read the default > charset using sys.getdefaultencoding(). 
The "default encoding" is used by > PyUnicode_FromStringAndSize(). > > A function sys.setdefaultencoding() exists, but it raises a ValueError for > charset different than UTF-8 since the charset is hardcoded in > PyUnicode_FromStringAndSize(). Not only there: the C API makes various assumptions on the default encoding as well. We should probably drop the term "default encoding" altogether and replace it with "utf-8". sys.setdefaultencoding() should probably be dropped altogether from Python3. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 02 2008) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ :::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 From ncoghlan at gmail.com Thu Oct 2 14:31:06 2008 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 02 Oct 2008 22:31:06 +1000 Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError In-Reply-To: <48E4B996.9030101@egenix.com> References: <200810021350.49292.victor.stinner@haypocalc.com> <48E4B996.9030101@egenix.com> Message-ID: <48E4BF0A.9040604@gmail.com> M.-A. Lemburg wrote: > On 2008-10-02 13:50, Victor Stinner wrote: >> This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. > > The PEP doesn't appear to address any potential changes. Wouldn't > it be better to add such information to the Python3 documentation > itself ?! True, a simple wiki page would probably be adequate - once we agree on the details, it can be added to the main Python 3 docs. Victor - the Python wiki is also one of the easiest places to work on early PEP drafts. 
See http://wiki.python.org/moin/PythonEnhancementProposals. > Not only there: the C API makes various assumptions on the default > encoding as well. We should probably drop the term "default encoding" > altogether and replace it with "utf-8". > > sys.setdefaultencoding() should probably be dropped altogether from > Python3. Isn't that method still there to allow other implementations to be more permissive about allowing the default encoding to be changed? Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From victor.stinner at haypocalc.com Thu Oct 2 14:35:48 2008 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Thu, 2 Oct 2008 14:35:48 +0200 Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError In-Reply-To: <48E4B996.9030101@egenix.com> References: <200810021350.49292.victor.stinner@haypocalc.com> <48E4B996.9030101@egenix.com> Message-ID: <200810021435.48955.victor.stinner@haypocalc.com> Le Thursday 02 October 2008 14:07:50 M.-A. Lemburg, vous avez ?crit?: > On 2008-10-02 13:50, Victor Stinner wrote: > > This is a PEP (...) > > The PEP doesn't appear to address any potential changes. Wouldn't > it be better to add such information to the Python3 documentation > itself ?! I don't know the right name of this document. Yeah, it may move to Doc/ in Python3 source code. > > Example of an invalid bytes sequence: :: > > >>> str(b'\xff', 'utf8') > > UnicodeDecodeError > > > > >>> str(b'\xff', 'iso-8859-1') > > '?' > > You have left out all the options you have by using a different > error handling mechanism (using a third parameter to str()), e.g. > 'replace', 'ignore', etc. Yes, I can explain why replace and ignore can *not* be use in this case. If you use ignore or replace, filenames will be valid unicode strings, but you will be unable to open / copy / remove you file. 
> > Default encoding
> > ================
> >
> > Python uses "UTF-8" as the default Unicode encoding. You can read the
> > default charset using sys.getdefaultencoding(). The "default encoding" is
> > used by PyUnicode_FromStringAndSize().
>
> Not only there: the C API makes various assumptions on the default
> encoding as well. We should probably drop the term "default encoding"
> altogether and replace it with "utf-8".

The concept of "default encoding" is unclear in Python. Yes, we might remove
sys.getdefaultencoding() and write that PyUnicode_FromStringAndSize() uses
the UTF-8 charset.

> sys.setdefaultencoding() should probably be dropped altogether from
> Python3.

Yes.

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From victor.stinner at haypocalc.com Thu Oct 2 18:46:13 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 2 Oct 2008 18:46:13 +0200
Subject: [Python-3000] Issues about Python script encoding
Message-ID: <200810021846.13939.victor.stinner@haypocalc.com>

Python3 tracebacks have bugs that make debugging harder:

[Py3k] line number is wrong after encoding declaration
http://bugs.python.org/issue2384

PyTraceBack_Print() doesn't respect # coding: xxx header
http://bugs.python.org/issue3975

Both issues have a patch and a testcase.

--

About the coding header, IDLE doesn't read the #coding: header. Here is a fix
(use tokenize.detect_encoding):
http://bugs.python.org/issue4008

And finally, two more patches for the encoding detection in:
http://bugs.python.org/issue4016
 -> use tokenize.detect_encoding() in linecache (instead of duplicated, incomplete (e.g.
no UTF-8 BOM support) code to detect the encoding)
 -> reuse codecs.BOM_UTF8 in tokenize

That's all for today :)

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From victor.stinner at haypocalc.com Thu Oct 2 19:25:27 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 2 Oct 2008 19:25:27 +0200
Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError
In-Reply-To: <48E4BF0A.9040604@gmail.com>
References: <200810021350.49292.victor.stinner@haypocalc.com>
 <48E4B996.9030101@egenix.com>
 <48E4BF0A.9040604@gmail.com>
Message-ID: <200810021925.27369.victor.stinner@haypocalc.com>

On Thursday 02 October 2008 14:31:06, you wrote:
> Victor - the Python wiki is also one of the easiest places to work on
> early PEP drafts. See
> http://wiki.python.org/moin/PythonEnhancementProposals.

Ok, I converted the document to the wiki syntax:
http://wiki.python.org/moin/Python3UnicodeDecodeError

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From martin at v.loewis.de Thu Oct 2 22:32:43 2008
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Thu, 02 Oct 2008 22:32:43 +0200
Subject: [Python-3000] Issues about Python script encoding
In-Reply-To: <200810021846.13939.victor.stinner@haypocalc.com>
References: <200810021846.13939.victor.stinner@haypocalc.com>
Message-ID: <48E52FEB.5020307@v.loewis.de>

> About the coding header, IDLE doesn't read #coding: header. Here is a fix (use
> tokenize.detect_encoding):
> http://bugs.python.org/issue4008

Are you really sure about that? It did in the past.
Regards,
Martin

From martin at v.loewis.de Thu Oct 2 22:34:55 2008
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Thu, 02 Oct 2008 22:34:55 +0200
Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError
In-Reply-To: <48E4BF0A.9040604@gmail.com>
References: <200810021350.49292.victor.stinner@haypocalc.com>
 <48E4B996.9030101@egenix.com>
 <48E4BF0A.9040604@gmail.com>
Message-ID: <48E5306F.2070903@v.loewis.de>

>> sys.setdefaultencoding() should probably be dropped altogether from
>> Python3.
>
> Isn't that method still there to allow other implementations to be more
> permissive about allowing the default encoding to be changed?

That never was my understanding - although it's an interesting thought.

Is that opportunity actually used? I.e. is there a Python implementation
that does work correctly in the presence of setdefaultencoding? I find
that hard to believe.

Regards,
Martin

From victor.stinner at haypocalc.com Thu Oct 2 23:54:06 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 2 Oct 2008 23:54:06 +0200
Subject: [Python-3000] Issues about Python script encoding
In-Reply-To: <48E52FEB.5020307@v.loewis.de>
References: <200810021846.13939.victor.stinner@haypocalc.com>
 <48E52FEB.5020307@v.loewis.de>
Message-ID: <200810022354.06928.victor.stinner@haypocalc.com>

On Thursday 02 October 2008 22:32:43, Martin v. Löwis wrote:
> > About the coding header, IDLE doesn't read #coding: header. Here is a fix
> > (use tokenize.detect_encoding):
> > http://bugs.python.org/issue4008
>
> Are you really sure about that? It did in the past.

Try IDLE in an ASCII terminal:
   python Tools/scripts/idle idle-3.0rc1-quits-when-run.py
(the .py file is attached to the issue).

IDLE uses open(filename, 'r') without setting the encoding. The io module is not
aware of the #coding: header.
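
For illustration only, a small sketch of the approach behind the proposed fix: ask tokenize.detect_encoding() for the declared encoding before decoding the source, rather than relying on the locale default. The script bytes below are invented for the example and are not the file attached to the issue.

```python
import io
import tokenize

# Hypothetical script: a Latin-1 coding declaration plus one non-ASCII byte.
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

# detect_encoding() takes a readline callable that returns bytes and
# gives back the normalized encoding name plus the lines it had to read.
encoding, consumed_lines = tokenize.detect_encoding(io.BytesIO(source).readline)

# Decode with the declared encoding instead of whatever the terminal uses.
text = source.decode(encoding)
```

A plain open(filename, 'r') never looks at that header, which is why the result depends on the terminal locale.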
The issue may be related to the terminal locale, since IDLE uses a "locale
encoding" (import IOBinding; IOBinding.encoding) which is marked as
"deprecated" in the IDLE source code.

(We should use the bug tracker to discuss this issue)

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From jcollins37 at carolina.rr.com Fri Oct 3 00:50:14 2008
From: jcollins37 at carolina.rr.com (James E. Collins III)
Date: Thu, 2 Oct 2008 18:50:14 -0400 (Eastern Daylight Time)
Subject: [Python-3000] Automatic Reply: Sound (Python-3000 Digest, Vol 32, Issue 9)
Message-ID: <48E5500B.000001.05980@JCOLLINS37-PCA>

Silence is one of the hardest arguments to refute

Have a great day!

From jimjjewett at gmail.com Fri Oct 3 19:35:31 2008
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 3 Oct 2008 13:35:31 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To:
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
 <87od26e3an.fsf@xemacs.org>
 <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
 <48E2CCEC.9030709@canterbury.ac.nz>
 <871vz0pnuw.fsf@xemacs.org>
 <87wsgso178.fsf@xemacs.org>
Message-ID:

On Wed, Oct 1, 2008 at 10:36 AM, Antoine Pitrou wrote:
> The average user does not even /know/ what a charset is.

Because for the average user, there is no need.

Part of the HTML5 standard is how to guess at charsets, and when to
automatically use a superset instead of the declared encoding. For
most of the US and Europe, the guesses are good enough.

For the languages and countries where multiple charsets are in common
use, and the guesses are often wrong, browser vendors say that the
change charset commands are well-known and frequently used.

> If a filename can't be exactly
> represented with a valid Unicode sequence, all
> applications wanting to access
> that file are impacted in the same way,

Not really. Some utilities never really need to display the filename;
they just need to be able to manage the file.

Many applications need to display a file chooser, but may never need
to actually open problematic files, and may not need an accurate or
complete representation. (Consider "Progra~1" on Windows.)

> This sounds very much like a
> Python-level (or at least stdlib-level) problem to me.

The stdlib should provide a way of dealing with raw bytes. Beyond
that, the needs get too specialized. (And that way of dealing with
raw bytes *might* just be documenting the Latin-1 hack.)

> Are you suggesting that the solution to the filename
> problem is to prompt the
> user and ask them for a different encoding?

For some applications, yes.

-jJ

From foom at fuhm.net Fri Oct 3 21:53:27 2008
From: foom at fuhm.net (James Y Knight)
Date: Fri, 3 Oct 2008 15:53:27 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To: <48E67175.1030103@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
 <87od26e3an.fsf@xemacs.org>
 <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
 <48E2CCEC.9030709@canterbury.ac.nz>
 <871vz0pnuw.fsf@xemacs.org>
 <87wsgso178.fsf@xemacs.org>
 <48E67175.1030103@g.nevcal.com>
Message-ID: <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>

On Oct 3, 2008, at 3:24 PM, Glenn Linderman wrote:
> In order to work, the actual name must be preserved, or if
> translated, must be a reversible, 1-to-1 translation. A lot of
> discussion here has talked about reversible translations, but
> haven't noted the requirement that it be 1-to-1... and if the
> translation produces something that looks like it could be a file
> name, then the reverse translation is unlikely to be 1-to-1!
> Somewhere, you need to add a flag that indicates whether or not a
> reverse translation needs to be done, independently of the content
> of the translated name.

That's not true. Both the U+0000 and UTF-8b proposals are 1-to-1
transforms.

James

From qrczak at knm.org.pl Fri Oct 3 23:23:48 2008
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 3 Oct 2008 23:23:48 +0200
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To: <48E68911.6090403@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <871vz0pnuw.fsf@xemacs.org>
 <87wsgso178.fsf@xemacs.org>
 <48E67175.1030103@g.nevcal.com>
 <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
 <48E68911.6090403@g.nevcal.com>
Message-ID: <3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>

2008/10/3 Glenn Linderman :
> My understanding of the Posix file names is that any byte values are valid
> except "/" and null. Is this a correct understanding?

Yes (well, names "." and ".." are reserved, and there might be length
restrictions).
> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream. Call the original byte stream FOO. The
> transformation then produces FOOTR, a set of Unicode code points. Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8. How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ?

In the unpaired surrogate scheme: there is no FOOTRUTF8 because UTF-8
can encode only Unicode scalar values (which exclude surrogates).
Python strings can contain surrogates (in 4-byte builds) or unpaired
surrogates which are malformed UTF-16 (in 2-byte builds); in the
filename context they can't be represented in UTF-8, so they must mean
escaped bytes.

In the U+0000 scheme: FOOTRUTF8 contains a 0 byte, so the filename
must mean FOO.

> but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place...

What do you mean by "not translating"? If a piece of software
validates filenames while they are represented by Unicode strings,
then they must have been somehow translated from byte strings (on
POSIX) or UTF-16-assumed-but-not-guaranteed strings (on Windows).

--
Marcin Kowalczyk
qrczak at knm.org.pl
http://qrnik.knm.org.pl/~qrczak/

From rhamph at gmail.com Fri Oct 3 23:36:25 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Fri, 3 Oct 2008 15:36:25 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To: <48E68911.6090403@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <871vz0pnuw.fsf@xemacs.org>
 <87wsgso178.fsf@xemacs.org>
 <48E67175.1030103@g.nevcal.com>
 <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
 <48E68911.6090403@g.nevcal.com>
Message-ID:

On Fri, Oct 3, 2008 at 3:05 PM, Glenn Linderman wrote:
> On approximately 10/3/2008 12:53 PM, came the following characters from the
> keyboard of James Y Knight:
>>
>> On Oct 3, 2008, at 3:24 PM, Glenn Linderman wrote:
>>>
>>> In order to work, the actual name must be preserved, or if translated,
>>> must be a reversible, 1-to-1 translation. A lot of discussion here has
>>> talked about reversible translations, but haven't noted the requirement that
>>> it be 1-to-1... and if the translation produces something that looks like it
>>> could be a file name, then the reverse translation is unlikely to be 1-to-1!
>>> Somewhere, you need to add a flag that indicates whether or not a reverse
>>> translation needs to be done, independently of the content of the translated
>>> name.
>>
>> That's not true. Both the U+0000 and UTF-8b proposals are 1-to-1
>> transforms.
>>
>> James
>
> My understanding of the Posix file names is that any byte values are valid
> except "/" and null. Is this a correct understanding?
>
> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream. Call the original byte stream FOO. The
> transformation then produces FOOTR, a set of Unicode code points. Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8. How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ? And remember that the user might provide a
> Unicode character stream identical to FOOTR: should it be translated to FOO
> or FOOTRUTF8 when creating a new file according to the user-supplied name?

UTF-8b produces an *invalid* unicode sequence, via lone surrogates.
Any attempt to encode or decode using a validating UTF-8 (or
UTF-16/UTF-32) codec would reject them, which is why they can
unambiguously be used.

In other words, it's not unicode (despite a resemblance), so it's easy
to be 1-to-1.

> So the U+0000 transform may be 1-to-1 since it introduces null characters
> into the translated "file name", which are effectively producing names that
> are invalid according to the Posix file name standard ... but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place... deep
> analysis would have to be used to determine which problem is larger, or more
> significant. I've certainly been "guilty" of writing software that assumes
> that there are no null characters in a file name. I've even been "guilty"
> of writing software that assumes there are no space characters in a file
> name, although I've tried to break that habit in recent years...

Yup, U+0000 is unicode, but still can't be used with many external
APIs, as it's a transformation of the real file name. The only real
advantage is you can store it in certain external formats, but
wouldn't you know it, XML isn't one of them[1]. Can you think of any
common formats where it would work?

[1] http://www.w3.org/International/questions/qa-controls

--
Adam Olsen, aka Rhamphoryncus

From rhamph at gmail.com Sat Oct 4 01:54:06 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Fri, 3 Oct 2008 17:54:06 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To: <48E6A492.4090604@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <87wsgso178.fsf@xemacs.org>
 <48E67175.1030103@g.nevcal.com>
 <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
 <48E68911.6090403@g.nevcal.com>
 <3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>
 <48E6A492.4090604@g.nevcal.com>
Message-ID:

On Fri, Oct 3, 2008 at 5:02 PM, Glenn Linderman wrote:
> On approximately 10/3/2008 2:36 PM, came the following characters from the
> keyboard of Adam Olsen:
>>
>> UTF-8b produces an *invalid* unicode sequence, via lone surrogates. Any
>> attempt to encode or decode using a validating UTF-8 (or
>> UTF-16/UTF-32) codec would reject them, which is why they can
>> unambiguously be used.
>>
>> In other words, it's not unicode (despite a resemblance), so it's easy
>> to be 1-to-1.
>
> Sort of. There is no numerical reason they cannot be represented in a
> UTF-8-like numeric encoding scheme. It is only rules and regulations that
> prevent it. So FOOTRUTF8 can exist, just not legally. If the expectation
> is that an illegal UTF-16 code can be used, to permit the UTF-8b translation
> scheme to work at all, then it seems reasonable to expect that an illegal
> translation of it to UTF-8 might happen also, which means that the
> transformation isn't 1-to-1!

No, UTF-8b can't be translated to UTF-8. It's illegal.

> I think someone demonstrated the use of unpaired surrogates in the Windows
> filename context the other day. Whether that is a bug or not, it is the
> current state of affairs, someone might read a name from Windows and want to
> create it on Posix... what happens? If we implement UTF-8b, I know what
> would happen. But what would happen if we don't, today, on a Posix Python
> 3? Would it use FOOTRUTF8 or would it generate an error? I don't suppose
> it matters a lot, it is stupidity to use such names whether or not the
> prevention of it is enforced.

If python worked properly?
The illegal unicode object would get an
encoding error when you tried to translate to UTF-8 to send it over to
the Posix box. You'd have to alter all the software that touches it to use
your looks-like-but-isn't-quite-unicode, rather than using the real
unicode.

That's why I favour validating the windows API too, and making the raw
API be the raw UTF-16 (rather than letting it get encoded into a
single-byte encoding). The rawness is what bytes need, not ASCII
similarity.

> But if someone on Posix is creating non-Python software that uses illegal
> lone surrogates, illegally UTF-8 coding them to create the file, and then
> giving them to a Python program to manipulate the content, things could get
> confused, if UTF-8b translations happen under the Python covers... the
> Python program would attempt to open a different file than the non-Python
> software created.

No, they can't illegally use UTF-8. It's not UTF-8, period. It's just garbage.

> Seems like attempts to manipulate and transform names are doomed to failure;
> the approach of having a bytes level interface seems to be the correct one,
> glad that seems to be the approach that Victor is implementing and Guido is
> favoring, although it is a pity that it can't be fully encapsulated into an
> object in time for 3.0, leaving us with multiple APIs for file access, and a
> potential future translation to an encapsulated object approach.

the bytes object covers 90% of the raw usage. The other 10% is a
lossy encoding to unicode. I much prefer that to be explicit, so an
attribute may do.. say b.decode('UTF-8', 'replace')? Or do we need a
subtype of bytes, just to reduce that to 5-8 characters?

--
Adam Olsen, aka Rhamphoryncus

From rhamph at gmail.com Sat Oct 4 08:57:36 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Sat, 4 Oct 2008 00:57:36 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To: <48E6ED99.2050406@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <48E67175.1030103@g.nevcal.com>
 <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
 <48E68911.6090403@g.nevcal.com>
 <3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>
 <48E6A492.4090604@g.nevcal.com>
 <48E6ED99.2050406@g.nevcal.com>
Message-ID:

On Fri, Oct 3, 2008 at 10:14 PM, Glenn Linderman wrote:
> On approximately 10/3/2008 4:54 PM, came the following characters from the
> keyboard of Adam Olsen:
>> On Fri, Oct 3, 2008 at 5:02 PM, Glenn Linderman
>> wrote:
>
> OK, so UTF-8b is not Unicode, either. It's just garbage. You can't have it
> both ways.

I've always said UTF-8b wasn't valid.

>>> Seems like attempts to manipulate and transform names are doomed to
>>> failure;
>>> the approach of having a bytes level interface seems to be the correct
>>> one,
>>> glad that seems to be the approach that Victor is implementing and Guido
>>> is
>>> favoring, although it is a pity that it can't be fully encapsulated into
>>> an
>>> object in time for 3.0, leaving us with multiple APIs for file access,
>>> and a
>>> potential future translation to an encapsulated object approach.
>>>
>>
>> the bytes object covers 90% of the raw usage. The other 10% is a
>> lossy encoding to unicode. I much prefer that to be explicit, so an
>> attribute may do.. say b.decode('UTF-8', 'replace')? Or do we need a
>> subtype of bytes, just to reduce that to 5-8 characters?
>>
>
> I don't understand what you mean here... Victor/Guido's plan results in:
>
> Alternative 1: Windows only programs can use the Python Unicode file
> interfaces, Posix programs can take a chance, and also use them (one stab at
> semi-portability, if people don't need access to weirdly named files).

Windows programs using non-validating unicode APIs will be exposed to
random exceptions when they use a validating unicode API. Better to
validate everything early, where you can expect the failures.
Posix programs SHOULD take a chance. It's much easier to deal with
pure unicode, and some things can only be done that way (such as
getting file names from the user through a GUI).

> Alternative 2: Posix only programs can use the Python bytes file interfaces
> and get all the files, but can't necessarily display them, except in lossy
> Unicode or hex, or by pretending they are Latin-1, or whatever they want to
> do, but they can't assume UTF-8, unless it happens to work. Windows
> programs can use the bytes interface (another stab at semi-portability), if
> people don't need access to files named using Unicode characters not in the
> program's current code page.

Can't display them, can't export them. 'tis fun!

> Alternative 3: Portable programs use the Unicode file interfaces on Windows,
> and the bytes file interfaces on Posix, and deal with the differences, as
> described for Windows only in alternative 1 and Posix only in alternative 2.
>
> Alternative 4: Someone implements an object that does alternative 3 under
> the covers, and every one will wish Alternative 1 & 2 didn't even exist.
> The only reasons not to do this seem to be (a) Python 2.6 is already
> released and doesn't have it, (b) Python 3.0 would slip its schedule even
> more, (c) it's a significant chunk of code to implement and get right in a
> hurry.

Nope, not possible. The closest we can do is "bytes with implicit
conversion to unicode", but (a) implicit conversion is much less
maintainable (zen, etc), (b) it STILL doesn't work. You still can't
round-trip a bad file name through a unicode API. You have the file
system and the user/libraries, and never the twain shall meet.
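
For readers coming to this thread later: the lone-surrogate (UTF-8b) idea argued over above is, as far as I know, what CPython later shipped as the 'surrogateescape' error handler (PEP 383, Python 3.1), which postdates these messages. A minimal sketch, with a made-up filename, of both halves of the argument: the escaped form maps back to the original bytes, while a strict UTF-8 encoder refuses it. Whether such a string counts as "real" unicode is exactly the disagreement in this thread.

```python
# Hypothetical filename stored on disk as Latin-1 bytes (not valid UTF-8).
raw_name = b"caf\xe9.txt"

# Undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
escaped = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = escaped.encode("utf-8", "surrogateescape")

# A validating (strict) UTF-8 encoder rejects the lone surrogate.
try:
    escaped.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The strict failure is the point made above about validating codecs: anything that insists on well-formed UTF-8 (XML writers, many network protocols) will still choke on such a name.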
--
Adam Olsen, aka Rhamphoryncus

From brett at python.org Sat Oct 4 20:03:54 2008
From: brett at python.org (Brett Cannon)
Date: Sat, 4 Oct 2008 11:03:54 -0700
Subject: [Python-3000] [Python-Dev] 3.1 focus (was Re: for __future__ import planning)
In-Reply-To:
References: <1afaf6160810031426n21514e81ma213b084aff20648@mail.gmail.com>
 <3DDCFDD1-52DB-487D-AEB4-758CF868945D@python.org>
Message-ID:

On Sat, Oct 4, 2008 at 12:45 AM, Georg Brandl wrote:
> Barry Warsaw wrote:
>> On Oct 3, 2008, at 5:26 PM, Benjamin Peterson wrote:
>>
>>> So now that we've released 2.6 and are working hard on shepherding 3.0
>>> out the door, it's time to worry about the next set of releases. :)
>>
>>> I propose that we dramatically shorten our release cycle for 2.7/3.1
>>> to roughly a year and put a strong focus on stabilizing all the new
>>> goodies we included in the last release(s). In the 3.x branch, we
>>> should continue to solidify the new code and features that were
>>> introduced. One of 2.7's main objectives should be binding 3.x and 2.x
>>> ever closer.
>>
>> There are several things that I would like to see us concentrate on
>> after the 3.0 release. I agree that 3.1 should be primarily a
>> stabilizing release. I suspect that we will find a lot of things that
>> need tweaking only after 3.0 final has been out there for a while.
>>
>> I think 2.7 should continue along the path of convergence toward 3.x.
>> The vision some of us talked about at Pycon was that at some point
>> down the line, maybe there's no difference between "python2.9 -3" and
>> "python3.3 -2".
>
> Especially 3.1 should also be a release where we focus as much on the
> community as on the code. There are many people out there for whom
> Python 3, as an incompatible language, is not an easy step to make,
> especially those with huge 2.x codebases on their hands.
> They have two problems: The libraries they depend on aren't ported, and the
> KLOC of code they care about are hard and tedious work to port, not
> to mention that it typically isn't viewed as productive work by those
> who pay them.
>
> We need to make 2to3 and related tools reliable and do more showcases
> of porting, like Martin did with Django, so that people have real-world
> examples at their disposal, by which they can estimate their own
> porting needs. (Waiting for the extended community to deliver such
> examples may be a mistake.)
>
> We also need to commit to help people with porting. I propose a new
> mailing list (e.g. python3-porting), parallel to python-list,
> specifically for people going that way. I think it will help to
> focus the community effort of getting Python 3 off the ground.
>

This is a good idea; python-help for porting.

> Last but not least, there should be a *central* location on python.org where
> specifically all resources on 2->3 transition are collected. Talks,
> documents, links, and some crucial information many people seem to miss,
> such as how long the 2.x series will at least be maintained. They depend
> on this.

That seems reasonable if someone gets around to doing it. =)

-Brett

From martin at v.loewis.de Sat Oct 4 21:17:21 2008
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Sat, 04 Oct 2008 21:17:21 +0200
Subject: [Python-3000] [Python-Dev] 3.1 focus (was Re: for __future__ import planning)
In-Reply-To:
References: <1afaf6160810031426n21514e81ma213b084aff20648@mail.gmail.com>
 <3DDCFDD1-52DB-487D-AEB4-758CF868945D@python.org>
Message-ID: <48E7C141.8010903@v.loewis.de>

> Well, since for >95% of the (potential) Py3k users it is more important than
> e.g. the import rewrite in Python (no stab at you intended, Brett), it is
> something someone will have to get around to doing.
>
> I'm not excusing myself; in fact, I'd be happy to work on this, but overall
> the team "Python 3 advocacy and support" should consist of more than one
> person.

I think this has time. I'm (now) confident that people will port to
Python 3 sooner rather than later, just because it's there. In fact,
we have to be careful not to talk too many people into porting, since
there will be some glitches which need to be resolved, and may not get
resolved before 3.2 or so. So people with a natural wariness are advised
to trust this wariness, or else all their concerns become
self-fulfilling prophecies.

Regards,
Martin

From brett at python.org Sat Oct 4 21:36:17 2008
From: brett at python.org (Brett Cannon)
Date: Sat, 4 Oct 2008 12:36:17 -0700
Subject: [Python-3000] [Python-Dev] 3.1 focus (was Re: for __future__ import planning)
In-Reply-To: <48E7C141.8010903@v.loewis.de>
References: <1afaf6160810031426n21514e81ma213b084aff20648@mail.gmail.com>
 <3DDCFDD1-52DB-487D-AEB4-758CF868945D@python.org>
 <48E7C141.8010903@v.loewis.de>
Message-ID:

[replying to both Georg and Martin]

On Sat, Oct 4, 2008 at 12:17 PM, "Martin v. Löwis" wrote:
>> Well, since for >95% of the (potential) Py3k users it is more important than
>> e.g. the import rewrite in Python (no stab at you intended, Brett), it is
>> something someone will have to get around to doing.
>>

Don't worry, I realize my import work is approaching vaporware status
at this rate (still plugging away at it, though). But you are right:
helping people port to 3 will be the most important thing we can help
people with.

>> I'm not excusing myself; in fact, I'd be happy to work on this, but overall
>> the team "Python 3 advocacy and support" should consist of more than one
>> person.
>

I would definitely be willing to help.

So the mailing list is a good idea. Perhaps it should just be
python-porting so that it can also be used for people who have
problems with minor releases?
We could then have a /porting/ section
to the site where we can actually document after each release how to
port to the newest version.

And as for 2 -> 3 stuff, we should probably provide the expected steps to
port, tips for pure Python code (and how to write 2.6/3.0 compatible
code), extension modules, and make it clear what our overall plan is
(e.g. 3.2 probably being the truly stable release semantically).

> I think this has time. I'm (now) confident that people will port to
> Python 3 sooner rather than later, just because it's there. In fact,
> we have to be careful not to talk too many people into porting, since
> there will be some glitches which need to be resolved, and may not get
> resolved before 3.2 or so. So people with a natural wariness are advised
> to trust this wariness, or else all their concerns become
> self-fulfilling prophecies.

Yes, people should be warned that if they are not ready to make
changes after each Python release that are probably more than they are
used to between minor releases, they might want to hold off for 3.1 or 3.2.
But I don't want to be too discouraging as that might stifle any
forward momentum we might have and potentially leave 3 flat before it
even gets going.

-Brett

From facundobatista at gmail.com Sun Oct 5 01:19:31 2008
From: facundobatista at gmail.com (Facundo Batista)
Date: Sat, 4 Oct 2008 20:19:31 -0300
Subject: [Python-3000] [Python-Dev] 3.1 focus (was Re: for __future__ import planning)
In-Reply-To:
References: <1afaf6160810031426n21514e81ma213b084aff20648@mail.gmail.com>
 <3DDCFDD1-52DB-487D-AEB4-758CF868945D@python.org>
 <48E7C141.8010903@v.loewis.de>
Message-ID:

2008/10/4 Brett Cannon :
> So the mailing list is a good idea. Perhaps it should just be
> python-porting so that it can also be used for people who have
> problems with minor releases?

+1.

I'd try to help on that list, also.

--
.
Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/

From tjreedy at udel.edu Mon Oct 6 01:11:10 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Sun, 05 Oct 2008 19:11:10 -0400
Subject: [Python-3000] A plus for naked unbound methods
Message-ID:

I have seen a couple of objections to leaving unbound methods naked (as
functions) when retrieved in 3.0. Here is a plus.

A c.l.p poster reported that 2.6 broke his code because the addition of
default rich comparisons to object turned tests like
hasattr(ob, '__lt__') from False to True. The obvious fix
ob.__lt__ == object.__lt__
does not work because wrapping makes it always False, even when
conceptually true. In 3.0, that equality test works.

(I pointed him to 'object' in repr(ob.__lt__) as a workaround. Others
posted others.)

tjr

From wescpy at gmail.com Mon Oct 6 04:14:07 2008
From: wescpy at gmail.com (wesley chun)
Date: Sun, 5 Oct 2008 19:14:07 -0700
Subject: [Python-3000] Problem with grammar for 'except'?
In-Reply-To:
References:
Message-ID: <78b3a9580810051914v7a8995bax5f0f12d2a7934ad0@mail.gmail.com>

On Thu, Sep 4, 2008 at 12:36 PM, Guido van Rossum wrote:
> On Wed, Sep 3, 2008 at 9:25 PM, Raymond Hettinger wrote:
>> [Brett]
>>> I gave a talk last night at the Vancouver Python users group on
>>> 2.6/3.0, and I tried the following code and it failed during a live demo:
>>>
>>> >>> try: pass
>>> ... except Exception, Exception: pass
>>> File "<stdin>", line 2
>>> except Exception, Exception: pass
>>> ^
>>> SyntaxError: invalid syntax
>>>
>>> Now from what I can tell from PEP 3110, that should be legal in 3.0.
>>> Am I reading the PEP correctly?
>>
>> Don't think so.
>> The parens are necessary for a tuple of exceptions
>> lest it be confused with the old "except E, v" syntax
>> which meant "except E as e".
>>
>> Maybe in 3.1, the paren requirement can be dropped.
>
> I would wait longer -- until well after the 2.x line is dead and
> buried.
It will take some time for every Python user to train their > Python fingers not to type "except E, v:" and we don't want people who > are late in migrating inserting bugs like this in their first 3.x program. it's probably a good idea to leave the paren requirement in there, but i just reread the PEP myself, and it appears as though no parens is actually supported, specifically: "except AttributeError, os.error:" here: http://www.python.org/dev/peps/pep-3110/#grammar-changes also, and granted this is older info, Guido's 2006 talks seem to hint this as well: - change except clause syntax to except E1, E2, E3 as err: - this avoids the bug in except E1, E2: # meant except (E1, E2) from both of these: ACCU - Apr 2006 (slide 11) http://www.python.org/doc/essays/ppt/accu2006/Py3kACCU.ppt Vancouver Python Workshop - Aug 2006 (slide 13) http://www.vanpyz.org/conference/2006/proceedings/MarygX/Py3KVanPyz.ppt while we can't change the past, we can/should at least update the PEP as well as the current 2.6 and 3.0 docs to specifically state that the parens are required (for now) *and* give an example usage. cheers, -- wesley - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - "Python Web Development with Django", Addison Wesley, (c) 2008 http://withdjango.com wesley.j.chun :: wescpy-at-gmail.com python training and technical consulting cyberweb.consulting : silicon valley, ca http://cyberwebconsulting.com From guido at python.org Mon Oct 6 04:45:14 2008 From: guido at python.org (Guido van Rossum) Date: Sun, 5 Oct 2008 19:45:14 -0700 Subject: [Python-3000] Problem with grammar for 'except'? In-Reply-To: <78b3a9580810051914v7a8995bax5f0f12d2a7934ad0@mail.gmail.com> References: <78b3a9580810051914v7a8995bax5f0f12d2a7934ad0@mail.gmail.com> Message-ID: Someone please fix the PEP. 
There are very good reasons for *not* allowing "except X, Y:" to have a meaning -- if 2.x code somehow accidentally ended up in the 3.0 world without having been run through 2to3, it would silently perturb the meaning in the most confusing way. That's why the implementation got it right. --Guido On Sun, Oct 5, 2008 at 7:14 PM, wesley chun wrote: > On Thu, Sep 4, 2008 at 12:36 PM, Guido van Rossum wrote: >> On Wed, Sep 3, 2008 at 9:25 PM, Raymond Hettinger wrote: >>> [Brett] >>>> I gave a talk last night at the Vancouver Python users group on >>>> 2.6/3.0, and I tried the following code and it failed during a live demo: >>>> >>>> >>> try: pass >>>> ... except Exception, Exception: pass >>>> File "", line 2 >>>> except Exception, Exception: pass >>>> ^ >>>> SyntaxError: invalid syntax >>>> >>>> Now from what I can tell from PEP 3110, that should be legal in 3.0. >>>> Am I reading the PEP correctly? >>> >>> Don't think so. >>> The parens are necessary for a tuple of exceptions >>> lest it be confused with the old "except E, v" syntax >>> which meant "except E as e". >>> >>> Maybe in 3.1, the paren requirement can be dropped. >> >> I would wait longer -- until well after the 2.x line is dead and >> buried. It will take some time for every Python user to train their >> Python fingers not to type "except E, v:" and we don't want people who >> are late in migrating inserting bugs like this in their first 3.x program. 
> > > it's probably a good idea to leave the paren requirement in there, but > i just reread the PEP myself, and it appears as though no parens is > actually supported, specifically: "except AttributeError, os.error:" > here: > > http://www.python.org/dev/peps/pep-3110/#grammar-changes > > also, and granted this is older info, Guido's 2006 talks seem to hint > this as well: > > - change except clause syntax to except E1, E2, E3 as err: > - this avoids the bug in except E1, E2: # meant except (E1, E2) > > from both of these: > > ACCU - Apr 2006 (slide 11) > http://www.python.org/doc/essays/ppt/accu2006/Py3kACCU.ppt > > Vancouver Python Workshop - Aug 2006 (slide 13) > http://www.vanpyz.org/conference/2006/proceedings/MarygX/Py3KVanPyz.ppt > > while we can't change the past, we can/should at least update the PEP > as well as the current 2.6 and 3.0 docs to specifically state that the > parens are required (for now) *and* give an example usage. > > cheers, > -- wesley > > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > "Python Web Development with Django", Addison Wesley, (c) 2008 > http://withdjango.com > > wesley.j.chun :: wescpy-at-gmail.com > python training and technical consulting > cyberweb.consulting : silicon valley, ca > http://cyberwebconsulting.com > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From mrs at mythic-beasts.com Mon Oct 6 18:50:34 2008 From: mrs at mythic-beasts.com (Mark Seaborn) Date: Mon, 06 Oct 2008 17:50:34 +0100 (BST) Subject: [Python-3000] A plus for naked unbound methods In-Reply-To: References: Message-ID: <20081006.175034.343188282.mrs@localhost.localdomain> Terry Reedy wrote: > I have seen a couple of objections to leaving unbound methods naked (as > functions) 
when retrieved in 3.0. Here is a plus. > > A c.l.p poster reported that 2.6 broke his code because the addition of > default rich comparisons to object turned tests like hassattr(ob, > '__lt__') from False to True. For the record, the post is: http://mail.python.org/pipermail/python-list/2008-October/510540.html > The obvious fix ob.__lt__ == object.__lt__ does not work because > wrapping makes it always False, even when conceptually true. In > 3.0, that equality test works. (I pointed him to 'object' in > repr(ob.__lt__) as a workaround. Others posted others.) Assuming ob is an instance object, ob.__lt__ will give you a bound method (taking 1 argument) which you would never expect to compare as equal to object.__lt__ (taking 2 arguments). So the presence or absence of unbound methods makes no difference here. Mark From tjreedy at udel.edu Mon Oct 6 20:19:56 2008 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 06 Oct 2008 14:19:56 -0400 Subject: [Python-3000] A plus for naked unbound methods In-Reply-To: <20081006.175034.343188282.mrs@localhost.localdomain> References: <20081006.175034.343188282.mrs@localhost.localdomain> Message-ID: Mark Seaborn wrote: > Terry Reedy wrote: > >> I have seen a couple of objections to leaving unbound methods naked (as >> functions) when retrieved in 3.0. Here is a plus. >> >> A c.l.p poster reported that 2.6 broke his code because the addition of >> default rich comparisons to object turned tests like hassattr(ob, >> '__lt__') from False to True. > > For the record, the post is: > http://mail.python.org/pipermail/python-list/2008-October/510540.html > >> The obvious fix ob.__lt__ == object.__lt__ does not work because >> wrapping makes it always False, even when conceptually true. In >> 3.0, that equality test works. (I pointed him to 'object' in >> repr(ob.__lt__) as a workaround. Others posted others.) > > Assuming ob is an instance object, It was a class derived from object. I should have made that clearer. 
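The identity test Terry describes can be tried on a current Python 3 interpreter with a short sketch. The class names `Plain` and `Ordered` and the helper `overrides_lt` are hypothetical, introduced only for illustration, and the sketch assumes (as clarified above) that the thing being tested is a class derived from object.

```python
# Hypothetical classes for illustration; not taken from the c.l.p post.
class Plain:
    """Inherits every comparison method from object."""


class Ordered:
    """Defines its own ordering."""

    def __init__(self, key):
        self.key = key

    def __lt__(self, other):
        return self.key < other.key


def overrides_lt(cls):
    # In Python 3 a method looked up on a class comes back unwrapped,
    # so an inherited __lt__ is the very same object as object.__lt__.
    return cls.__lt__ is not object.__lt__


print(overrides_lt(Plain))
print(overrides_lt(Ordered))
```

The check only says whether some class in the hierarchy other than object supplies `__lt__`; it does not say which one.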
From mrs at mythic-beasts.com Mon Oct 6 22:20:59 2008 From: mrs at mythic-beasts.com (Mark Seaborn) Date: Mon, 06 Oct 2008 21:20:59 +0100 (BST) Subject: [Python-3000] A plus for naked unbound methods In-Reply-To: References: <20081006.175034.343188282.mrs@localhost.localdomain> Message-ID: <20081006.212059.465784769.mrs@localhost.localdomain> Terry Reedy wrote: > Mark Seaborn wrote: > > Terry Reedy wrote: > > > >> I have seen a couple of objections to leaving unbound methods naked (as > >> functions) when retrieved in 3.0. Here is a plus. > >> > >> A c.l.p poster reported that 2.6 broke his code because the addition of > >> default rich comparisons to object turned tests like hassattr(ob, > >> '__lt__') from False to True. > > > > For the record, the post is: > > http://mail.python.org/pipermail/python-list/2008-October/510540.html > > > >> The obvious fix ob.__lt__ == object.__lt__ does not work because > >> wrapping makes it always False, even when conceptually true. In > >> 3.0, that equality test works. (I pointed him to 'object' in > >> repr(ob.__lt__) as a workaround. Others posted others.) > > > > Assuming ob is an instance object, > > It was a class derived from object. I should have made that clearer. It appears that unbound methods do what you want in the general case in Python 2.5 and 2.6. It's just that __lt__ behaves unlike normal unbound methods. So this isn't an argument against unbound methods, it's an argument for __lt__ not to be a special case. >>> class C(object): ... def f(self): pass ... def g(self): pass ... >>> class D(C): ... def g(self): pass ... >>> C.f == D.f True >>> C.g == D.g False >>> C.__str__ == D.__str__ True >>> C.__str__ == object.__str__ True It is slightly odd that C.f and D.f compare as equal when they are not equivalent. It is not inconsistent with other cases where == returns True on non-equivalent objects (such as dicts with equal content but different identities), but it is odd for this to happen on a callable. 
Mark From barry at python.org Tue Oct 7 02:47:57 2008 From: barry at python.org (Barry Warsaw) Date: Mon, 6 Oct 2008 20:47:57 -0400 Subject: [Python-3000] Proposed Python 3.0 schedule Message-ID: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 So, we need to come up with a new release schedule for Python 3.0. My suggestion: 15-Oct-2008 3.0 beta 4 05-Nov-2008 3.0 rc 2 19-Nov-2008 3.0 rc 3 03-Dec-2008 3.0 final Given what still needs to be done, is this a reasonable schedule? Do we need two more betas? - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSOqxvnEjvBPtnXfVAQIR5QP/coSi2ltsZSpE2dyUg7Y35QcSk/+4ZbGK zF0AgLaOkGs+DFnxRH9vy9kN3JaEkp1MhEpDjkomE7kNpnJB7bWotTrHI67HD9ma ZDqqmaCc02IeUtLm7HuELvofjCgh+gryKWvRc71ErRHmn/YxMGr1OcEirPpx4nZ9 DeDV0OeUtTE= =RchU -----END PGP SIGNATURE----- From musiccomposition at gmail.com Tue Oct 7 02:52:54 2008 From: musiccomposition at gmail.com (Benjamin Peterson) Date: Mon, 6 Oct 2008 19:52:54 -0500 Subject: [Python-3000] Proposed Python 3.0 schedule In-Reply-To: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> Message-ID: <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> On Mon, Oct 6, 2008 at 7:47 PM, Barry Warsaw wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > So, we need to come up with a new release schedule for Python 3.0. My > suggestion: > > 15-Oct-2008 3.0 beta 4 > 05-Nov-2008 3.0 rc 2 > 19-Nov-2008 3.0 rc 3 > 03-Dec-2008 3.0 final > > Given what still needs to be done, is this a reasonable schedule? Do we > need two more betas? I'm not sure we do. Correct me if I'm wrong, but the "big ticket", issue bytes/unicode filepaths, has been resolved. And looking at the tracker, I only see 18 release blockers. -- Cheers, Benjamin Peterson "There's nothing quite as beautiful as an oboe... except a chicken stuck in a vacuum cleaner." 
From tjreedy at udel.edu Tue Oct 7 03:08:29 2008 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 06 Oct 2008 21:08:29 -0400 Subject: [Python-3000] A plus for naked unbound methods In-Reply-To: <20081006.212059.465784769.mrs@localhost.localdomain> References: <20081006.175034.343188282.mrs@localhost.localdomain> <20081006.212059.465784769.mrs@localhost.localdomain> Message-ID: Mark Seaborn wrote: > Terry Reedy wrote: > >> Mark Seaborn wrote: >>> Terry Reedy wrote: >>> >>>> I have seen a couple of objections to leaving unbound methods naked (as >>>> functions) when retrieved in 3.0. Here is a plus. >>>> >>>> A c.l.p poster reported that 2.6 broke his code because the addition of >>>> default rich comparisons to object turned tests like hassattr(ob, >>>> '__lt__') from False to True. >>> For the record, the post is: >>> http://mail.python.org/pipermail/python-list/2008-October/510540.html >>> >>>> The obvious fix ob.__lt__ == object.__lt__ does not work because >>>> wrapping makes it always False, even when conceptually true. In >>>> 3.0, that equality test works. (I pointed him to 'object' in >>>> repr(ob.__lt__) as a workaround. Others posted others.) >>> Assuming ob is an instance object, >> It was a class derived from object. I should have made that clearer. > > It appears that unbound methods do what you want in the general case > in Python 2.5 and 2.6. It's just that __lt__ behaves unlike normal > unbound methods. So this isn't an argument against unbound methods, > it's an argument for __lt__ not to be a special case. It is not a special case. >>> def C(object): pass ... >>> C.__hash__ == object.__hash__ False >>> C.__str__ == object.__str__ False I strongly suspect that the same is true of every method that a user class inherits from a builtin class. Still, the clp OP is specifically interested in object as the base of his inheritance networks. >>>> class C(object): > ... def f(self): pass > ... def g(self): pass > ... >>>> class D(C): > ... 
 def g(self): pass > ... >>>> C.f == D.f > True >>>> C.g == D.g > False > It is slightly odd that C.f and D.f compare as equal when they are not > equivalent. It is not inconsistent with other cases where == returns > True on non-equivalent objects (such as dicts with equal content but > different identities), but it is odd for this to happen on a callable. Interesting. MethodWrapper must have an over-riding equality method that compares im_func attributes for the specific case of comparing MethodWrappers. But not relevant to the specific need;-). So my point remains: leaving unbound methods unwrapped makes Python3 work better for at least one real use case. Terry Jan Reedy From python at rcn.com Tue Oct 7 03:48:18 2008 From: python at rcn.com (Raymond Hettinger) Date: Mon, 6 Oct 2008 18:48:18 -0700 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> Message-ID: <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> [Barry Warsaw] > So, we need to come up with a new release schedule for Python 3.0. My > suggestion: > > 15-Oct-2008 3.0 beta 4 > 05-Nov-2008 3.0 rc 2 > 19-Nov-2008 3.0 rc 3 > 03-Dec-2008 3.0 final > > Given what still needs to be done, is this a reasonable schedule? Do > we need two more betas? Yes to both questions. I'm seeing that people are just starting to download and play with 3.0. I expect that we'll start getting more feedback on conversion issues, the C API, screwy interactions with operating systems, bytes/text issues, unanticipated interactions with other tools, etc. Each user will stress it in new ways and perhaps reveal a bunch of little integration issues and documentation issues. Those little fixups may go a long way toward establishing a good first impression and reputation for 3.0 from the outset.
Raymond From barry at python.org Tue Oct 7 04:13:06 2008 From: barry at python.org (Barry Warsaw) Date: Mon, 6 Oct 2008 22:13:06 -0400 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> Message-ID: <67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Oct 6, 2008, at 9:48 PM, Raymond Hettinger wrote: > [Barry Warsaw] >> So, we need to come up with a new release schedule for Python 3.0. >> My suggestion: >> 15-Oct-2008 3.0 beta 4 >> 05-Nov-2008 3.0 rc 2 >> 19-Nov-2008 3.0 rc 3 >> 03-Dec-2008 3.0 final >> Given what still needs to be done, is this a reasonable schedule? >> Do we need two more betas? > > Yes to both questions. I think that's contradictory :). If we need two betas, then 05-Nov becomes beta 5, 19-Nov is rc 2. If we don't need another rc then we can still do a final release on 03-Dec, otherwise we probably go 2 weeks later. I don't want to go much later than that though because then we get into the holiday season. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSOrFs3EjvBPtnXfVAQJceQP/QJN7oLM4nG+iXmgdb0NmKzOzaE3J89sQ UWZnc/hp618QNH4JWC8v2bYApFu+iVg3pcv1Lnmhuql6mOuDhSuKKJVA5jTdR7U2 2enhAEY2DXtmav/29nn2Fy6PYcWJy9pE2xBsbBW8qXc6tYww0iEBsz9SU68jPzPk x5LFC5NqmXo= =Kyr4 -----END PGP SIGNATURE----- From foom at fuhm.net Tue Oct 7 05:22:09 2008 From: foom at fuhm.net (James Y Knight) Date: Mon, 6 Oct 2008 23:22:09 -0400 Subject: [Python-3000] Proposed Python 3.0 schedule In-Reply-To: <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> Message-ID: <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> On Oct 6, 2008, at 8:52 PM, Benjamin Peterson wrote: > I'm not sure we do. 
Correct me if I'm wrong, but the "big ticket", > issue bytes/unicode filepaths, has been resolved. And looking at the > tracker, I only see 18 release blockers. Well, if you mean that the resolution decided upon is to "simply" allow access to all system APIs using either byte or unicode strings, then it seems to me that there's a rather large amount of work left to do... Here's some I found from a few minutes of futzing around with r66821 of py3k on Linux. - Having os.getcwdb isn't much use when you can't even run python in the first place when the current directory has "bad" bytes in it. Currently Python outputs: Could not find platform independent libraries Could not find platform dependent libraries Consider setting $PYTHONHOME to [:] Fatal Python error: Py_Initialize: can't initialize sys standard streams ImportError: No module named encodings.utf_8 Aborted - I'd think "find . -type f -print0 | xargs -0 python -c 'pass'" ought to work (with files with "bad" bytes being returned by find), which means that Python shouldn't blow up and refuse to start when there's a non-properly-encoding argv ("Could not convert argument 1 to string" and exiting isn't appropriate behavior). - Of course, just being able to start the interpreter isn't quite enough: you'll want to be able to access that argument list too, somehow (add sys.argvb?). - And then, getopt and optparse modules should work on bytestring vectors, so that you can use sys.argvb without writing your own argument parser. They don't currently. - There's no os.environb for bytewise access to the environment. Seems important. - Isn't it a potential security issue that " 'WHATEVER' in os.environ" can return False if WHATEVER had some "bad" bytes in it, but spawning a subprocess actually will include WHATEVER in the subprocess's environment? Actually, even better: the behavior depends on whether you use subprocess.call('foo') or subprocess.call('foo', os.environ). 
The first passes through the "bad" environment variables, while the second does not. A bit surprising, perhaps. - Shouldn't this work? subprocess.call(b'/bin/echo') Currently raises an exception: AttributeError: 'int' object has no attribute 'rfind' - I suppose sys.path should handle bytestrings on the path, and should be populated using the bytes-version of os.environ so that PYTHONPATH gets read in properly. Which of course implies that all the importers need to handle byte filenames. - zipfile.ZipFile(b'whatever.zip') doesn't work. - zipfile decodes/encodes the filenames inside the zip file to unicode, so thus can only handle correctly encoded filenames. I'm sure there's even more APIs dealing with pathnames, command line arguments, or environment variables that ought to be able to handle both bytes and strings, that currently don't. James From rhamph at gmail.com Tue Oct 7 07:18:48 2008 From: rhamph at gmail.com (Adam Olsen) Date: Mon, 6 Oct 2008 23:18:48 -0600 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? In-Reply-To: <48EA9B71.3060109@nevcal.com> References: <200809271404.25654.victor.stinner@haypocalc.com> <48E67175.1030103@g.nevcal.com> <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net> <48E68911.6090403@g.nevcal.com> <3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com> <48E6A492.4090604@g.nevcal.com> <48E6ED99.2050406@g.nevcal.com> <48EA9B71.3060109@nevcal.com> Message-ID: On Mon, Oct 6, 2008 at 5:12 PM, Glenn Linderman wrote: > On approximately 10/3/2008 11:57 PM, came the following characters from the > keyboard of Adam Olsen: >> On Fri, Oct 3, 2008 at 10:14 PM, Glenn Linderman >> wrote: >>> Alternative 3: Portable programs use the Unicode file interfaces on >>> Windows, >>> and the bytes file interfaces on Posix, and deal with the differences, as >>> described for Windows only in alternative 1 and Posix only in alternative >>> 2. 
>>> >>> Alternative 4: Someone implements an object that does alternative 3 under >>> the covers, and every one will wish Alternative 1 & 2 didn't even exist. >>> The only reasons not to do this seem to be (a) Python 2.6 is already >>> released and doesn't have it, (b) Python 3.0 would slip its schedule even >>> more, (c) it's a significant chunk of code to implement and get right in >>> a >>> hurry. >>> >> >> Nope, not possible. The closest we can do is "bytes with implicit >> conversion to unicode", but (a) implicit conversion is much less >> maintainable (zen, etc), (b) it STILL doesn't work. You still can't >> round-trip a bad file name through a unicode API. >> > > Not clear if you meant Alternative 3, 4 or both were not possible. > > The object would provide methods for manipulating the path names, > particularly the ability to extract a path from one object and a file from > another and combine them, somehow. So programs wouldn't have to perform > these sorts of manipulations themselves, so they wouldn't care if they are > done on Posix and bytes and on Windows as Unicode. But "Unicode" on windows is invalid. It shares all the same problems UTF-8b does, but worse as a correct UTF-16 codec would forbid exporting it. We'd need to invent a UTF-16b to save it, or simulate one manually. If the binary APIs on windows emitted raw UTF-16 bytes then we merely need to add a os.sepb equal to os.sep.encode('UTF-16') and you've got your portable low-level API. You don't need a path object. -- Adam Olsen, aka Rhamphoryncus From rhamph at gmail.com Tue Oct 7 08:22:59 2008 From: rhamph at gmail.com (Adam Olsen) Date: Tue, 7 Oct 2008 00:22:59 -0600 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? 
In-Reply-To: <48EAF263.5080006@g.nevcal.com> References: <200809271404.25654.victor.stinner@haypocalc.com> <48E68911.6090403@g.nevcal.com> <3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com> <48E6A492.4090604@g.nevcal.com> <48E6ED99.2050406@g.nevcal.com> <48EA9B71.3060109@nevcal.com> <48EAF263.5080006@g.nevcal.com> Message-ID: On Mon, Oct 6, 2008 at 11:23 PM, Glenn Linderman wrote: > On approximately 10/6/2008 10:18 PM, came the following characters from the > keyboard of Adam Olsen: >> But "Unicode" on windows is invalid. It shares all the same problems >> UTF-8b does, but worse as a correct UTF-16 codec would forbid >> exporting it. We'd need to invent a UTF-16b to save it, or simulate >> one manually. >> >> If the binary APIs on windows emitted raw UTF-16 bytes >> >> They do, for some definition of UTF-16, yes. >> >> then we merely >> need to add a os.sepb equal to os.sep.encode('UTF-16') and you've got >> your portable low-level API. You don't need a path object. > > Except it isn't portable, because you can't do that on Posix. The posix version should hardcode it as b'/'; I only meant windows to use UTF-16. You could perhaps use sys.getfilesystemencoding(), but I'm unsure what it does if the encoding isn't an ascii superset (or even if that can actually happen.) -- Adam Olsen, aka Rhamphoryncus From martin at v.loewis.de Tue Oct 7 09:47:20 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 07 Oct 2008 09:47:20 +0200 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> Message-ID: <48EB1408.1030007@v.loewis.de> > Here's some I found from a few minutes of futzing around with r66821 of > py3k on Linux. 
> > - Having os.getcwdb isn't much use when you can't even run python in > the first place when the current directory has "bad" bytes in it. That's not true: it *is* of much use. Python will live in /usr/bin, which has a nicely-decodable path. > Currently Python outputs: > Could not find platform independent libraries > Could not find platform dependent libraries > Consider setting $PYTHONHOME to [:] > Fatal Python error: Py_Initialize: can't initialize sys standard streams > ImportError: No module named encodings.utf_8 > Aborted I can't reproduce that. This happens (for me) when Python lives in a directory that has an undecodable path - not when the current directory is undecodable. > - I'd think "find . -type f -print0 | xargs -0 python -c 'pass'" ought > to work (with files with "bad" bytes being returned by find), which > means that Python shouldn't blow up and refuse to start when there's a > non-properly-encoding argv ("Could not convert argument 1 to string" and > exiting isn't appropriate behavior). Contributions are welcome. *Of course* can you access these files with POSIX API. However, Python's path handling can't. See above why I don't consider this as a serious bug, on Unix. > - Of course, just being able to start the interpreter isn't quite > enough: you'll want to be able to access that argument list too, somehow > (add sys.argvb?). Perhaps. However, I don't see the need to be able to do so in Python 3.0. > - And then, getopt and optparse modules should work on bytestring > vectors, so that you can use sys.argvb without writing your own argument > parser. They don't currently. And I hope they never will. Using bytes to represent this stuff will just bring back the 2.x status, so some other solution must be found - for 3.1 (or 3.2). > - There's no os.environb for bytewise access to the environment. Seems > important. Not to me. I don't have environment variables with non-ASCII characters in them, and I think few other people do. 
> I'm sure there's even more APIs dealing with pathnames, command line > arguments, or environment variables that ought to be able to handle both > bytes and strings, that currently don't. Please, no. Regards, Martin From victor.stinner at haypocalc.com Tue Oct 7 11:30:35 2008 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Tue, 7 Oct 2008 11:30:35 +0200 Subject: [Python-3000] Proposed Python 3.0 schedule In-Reply-To: <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> Message-ID: <200810071130.35729.victor.stinner@haypocalc.com> Hi, First of all, please read my document: http://wiki.python.org/moin/Python3UnicodeDecodeError I moved the document to a public wiki to allow anyone to edit it! On Tuesday 07 October 2008 05:22:09, James Y Knight wrote: > On Oct 6, 2008, at 8:52 PM, Benjamin Peterson wrote: > > I'm not sure we do. Correct me if I'm wrong, but the "big ticket", > > issue bytes/unicode filepaths, has been resolved. Python3 now accepts bytes for os.listdir(), open() (io.open()), os.unlink(), os.path.*(), etc. But it's not enough to say that Python3 can use bytes everywhere. It would take months or *years* to fix all issues related to bytes and unicode. Remember, this task started in 2000 with Python *2.0* (creation of the unicode type). > Well, if you mean that the resolution decided upon is to "simply" > allow access to all system APIs using either byte or unicode strings, > then it seems to me that there's a rather large amount of work left to > do... If you know of a problem, open a ticket and propose a solution. It's not possible to list all new problems since we don't know them yet :-) > - Having os.getcwdb isn't much use when you can't even run python in > the first place when the current directory has "bad" bytes in it.
 My python3.0 works correctly in a directory with an invalid name. What is your OS / locale / Python version? Please create a ticket if needed. > - I'd think "find . -type f -print0 | xargs -0 python -c 'pass'" > ought to work (with files with "bad" bytes being returned by find), First, fix your home directory :-) There are good tools (convmv?) to fix invalid filenames. > which means that Python shouldn't blow up and refuse to start when > there's a non-properly-encoding argv ("Could not convert argument 1 to > string" and exiting isn't appropriate behavior) Why not? It's a good idea to break compatibility in order to refuse invalid byte sequences. You can still use the command line, an input file or a GUI to read raw byte sequences. > - Of course, just being able to start the interpreter isn't quite > enough: you'll want to be able to access that argument list too, > somehow (add sys.argvb?). If we create sys.argvb, what should be done if sys.argv creation failed? Would sys.argv be empty or unset? Or would some values be removed (so that argv[2] becomes argv[1])? I think that many programs assume that sys.argv exists and "is valid". If you introduce a special case (sometimes, sys.argv doesn't exist or is truncated !?), it will introduce new issues. > - There's no os.environb for bytewise access to the environment. > Seems important. It would be strange if you could put a bytes variable into os.environb while os.environ did not get the key. I know two major usages of the environment: (1) read a variable in Python (2) put a variable for a child process (1) can be done with os.getenv(), which returns None if the variable (key or value) is an invalid byte sequence. (2) can be done with subprocess.Popen(). subprocess doesn't support bytes yet but I wrote patches: #4035 and #4036.
> - Isn't it a potential security issue that " 'WHATEVER' in > os.environ" can return False if WHATEVER had some "bad" bytes in it, > but spawning a subprocess actually will include WHATEVER in the > subprocess's environment? Yes. Python should remove the key while creating os.environ. > - Shouldn't this work? subprocess.call(b'/bin/echo') Yes. Most programs (at least on Linux and Mac) support bytes, so you should be able to use bytes arguments in their command lines; see issues #4035 and #4036. > - I suppose sys.path should handle bytestrings on the path, and > should be populated using the bytes-version of os.environ so that > PYTHONPATH gets read in properly. Which of course implies that all the > importers need to handle byte filenames. If your file system is broken, rename your directory but don't introduce a special case for sys.path. > - zipfile.ZipFile(b'whatever.zip') doesn't work. Since zipfile uses bytes in its file structure, zipfile should accept bytes. But the right question is: should this issue block Python3 or can we wait for Python 3.1 (maybe 3.0.1)? -- People want to try the new Python version! Python3 introduces amazing new features like "keyword only arguments". The bytes/unicode problem is old and only affects broken systems. Windows (90% of the computers in the world?) only uses characters for the filenames, environment and command line. Mac and Linux use UTF-8 most of the time, and slowly everything speaks UTF-8! Python3 should not be delayed because of this problem. About Barry's initial question: why is Python3 delayed until December? Are there too many open issues?
-- Victor Stinner aka haypo http://www.haypocalc.com/blog/

From ncoghlan at gmail.com Tue Oct 7 12:10:19 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 07 Oct 2008 20:10:19 +1000
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To:
References: <20081006.175034.343188282.mrs@localhost.localdomain> <20081006.212059.465784769.mrs@localhost.localdomain>
Message-ID: <48EB358B.8020101@gmail.com>

(added Michael to the CC list)

It isn't object that has grown an __lt__ method, but type. The extra check Michael actually wants is a way to make sure that the method isn't coming from the object's metaclass, and the only reliable way to do that is the way collections.Hashable does it when looking for __hash__: iterate through the MRO looking for that method name in the class dictionaries.

E.g.

    def defines_method(obj, method_name):
        try:
            mro = obj.__mro__
        except AttributeError:
            return False # Not a type
        for cls in mro:
            if cls is object and obj is not object:
                break # Methods inherited from object don't count
            if method_name in cls.__dict__:
                return True
        return False # Didn't find it

    >>> class X(object):
    ...     def __repr__(self): print "My Repr"
    ...
    >>> class Y(X):
    ...     def __str__(self): print "My Str"
    ...
    >>> defines_method(object, "__repr__")
    True
    >>> defines_method(object, "__str__")
    True
    >>> defines_method(object, "__cmp__")
    False
    >>> defines_method(X, "__repr__")
    True
    >>> defines_method(X, "__str__")
    False
    >>> defines_method(X, "__cmp__")
    False
    >>> defines_method(Y, "__repr__")
    True
    >>> defines_method(Y, "__str__")
    True
    >>> defines_method(Y, "__cmp__")
    False

Terry Reedy wrote:
> I strongly suspect that the same is true of every method that a user
> class inherits from a builtin class. Still, the clp OP is specifically
> interested in object as the base of his inheritance networks.

Your suspicion would be incorrect.
What is actually happening is that the behaviour of the returned method
varies depending on whether or not the object returned comes from the
class itself (which will compare equal with itself even when retrieved
from a subclass), or a bound method from the metaclass (which will not
compare equal when retrieved from a subclass, since it is bound to a
different instance of the metaclass).

In the case of the comparison methods, they're being retrieved from type
rather than object. This difference is made clear when you attempt to
invoke the retrieved method:

>>> object.__cmp__(1, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected 1 arguments, got 2
>>> object.__cmp__(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: type.__cmp__(x,y) requires y to be a 'type', not a 'int'
>>> object.__cmp__(object)
0
>>> object.__hash__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor '__hash__' of 'object' object needs an argument
>>> object.__hash__(object)
135575008

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From solipsis at pitrou.net Tue Oct 7 13:45:30 2008
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 7 Oct 2008 11:45:30 +0000 (UTC)
Subject: [Python-3000] Proposed Python 3.0 schedule (bytes/unicde again)
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
Message-ID:

Hi,

James Y Knight fuhm.net> writes:
>
> - Having os.getcwdb isn't much use when you can't even run python in
> the first place when the current directory has "bad" bytes in it.

I don't agree it's a similar problem. Python should be installed in a
well-known place with a sensible path.
Of course, bonus points if Python can be launched from anywhere, but I don't think it's a severe problem. In other words, I'd flag this as "low priority". If you want a more important issue, there's the issue of importing modules with an unicode (non-ascii) path. Amaury has worked on this in the tracker. > Currently Python outputs: > Could not find platform independent libraries > Could not find platform dependent libraries > Consider setting $PYTHONHOME to [:] > Fatal Python error: Py_Initialize: can't initialize sys standard streams > ImportError: No module named encodings.utf_8 Ok, so the error message is quite cryptic and would perhaps deserve improving. Still, "low priority" IMHO. > - And then, getopt and optparse modules should work on bytestring > vectors, so that you can use sys.argvb without writing your own > argument parser. They don't currently. Then we will gradually start moving all modules even remotely related with IO and filesystem stuff to a dual bytes/unicode API? That's precisely the kind of confusion we want to end with Py3k (the confusion between bytes and unicode as similar data types which could be used almost interchangeably without giving any consideration to semantics). > - Isn't it a potential security issue that " 'WHATEVER' in > os.environ" can return False if WHATEVER had some "bad" bytes in it, > but spawning a subprocess actually will include WHATEVER in the > subprocess's environment? I do agree with that. Errors should certainly not pass silently, especially when they can have strong security implications. > - I suppose sys.path should handle bytestrings on the path, and > should be populated using the bytes-version of os.environ so that > PYTHONPATH gets read in properly. Well, except on Windows where unicode paths are the Right Thing to do. But then we have a glaring incompatibility between major platforms. Regards Antoine. 
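
The os.environ concern discussed above can be illustrated with a small
Python 3 sketch. It deliberately does not touch the real os.environ:
`split_environment` and the sample `raw` mapping are hypothetical names
made up for illustration, and UTF-8 is an assumed encoding. It only shows
how a strict decode can make a variable vanish from the text view while
the raw bytes would still be handed on to a child process.

```python
def split_environment(raw_env, encoding="utf-8"):
    """Split a bytes->bytes mapping into a decoded view and a list of
    keys whose name or value could not be decoded (hypothetical helper)."""
    decoded = {}
    undecodable = []
    for key, value in raw_env.items():
        try:
            decoded[key.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            # A strict decode fails, so the variable is missing from the
            # text view although it still exists at the OS level.
            undecodable.append(key)
    return decoded, undecodable


# Assumed sample data: b"caf\xe9" is Latin-1, not valid UTF-8.
raw = {b"HOME": b"/home/user", b"WHATEVER": b"caf\xe9"}
decoded, undecodable = split_environment(raw)

print("WHATEVER" in decoded)   # the membership test the thread worries about
print(undecodable)             # keys a security check would never see
```

Reporting the undecodable keys, rather than dropping them silently, is one
way to keep the "errors should not pass silently" property mentioned in
the message above.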
From facundobatista at gmail.com Tue Oct 7 14:20:23 2008 From: facundobatista at gmail.com (Facundo Batista) Date: Tue, 7 Oct 2008 09:20:23 -0300 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> Message-ID: 2008/10/6 Raymond Hettinger : >> 15-Oct-2008 3.0 beta 4 >> 05-Nov-2008 3.0 rc 2 >> 19-Nov-2008 3.0 rc 3 >> 03-Dec-2008 3.0 final >> >> Given what still needs to be done, is this a reasonable schedule? Do we >> need two more betas? > > Yes to both questions. I agree with you here. > I'm seeing that people are just starting to download and play with 3.0. > I expect that we'll start getting more feedback on conversion issues, > the C API, screwy interactions with operating systems, bytes/text issues, > unanticipated interactions with other tools, etc. Each user will stress > it in new ways and perhaps reveal a bunch of little integration issues > and documentation issues. Those little fixups way go a long way toward > establishing a good first impression and reputation for 3.0 from the outset. And maybe also here, but bounded. I don't want to keep deferring 3.0 months and months, I prefer to have a redesigned schedule now, and stick to it as much as possible, even if the 3.0 version is not as robust as we would want. Regards, -- . 
Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ From eric at trueblade.com Tue Oct 7 14:50:53 2008 From: eric at trueblade.com (Eric Smith) Date: Tue, 07 Oct 2008 08:50:53 -0400 Subject: [Python-3000] Proposed Python 3.0 schedule (bytes/unicde again) In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> Message-ID: <48EB5B2D.10200@trueblade.com> Antoine Pitrou wrote: > Hi, > > James Y Knight fuhm.net> writes: >> - Having os.getcwdb isn't much use when you can't even run python in >> the first place when the current directory has "bad" bytes in it. > > I don't agree it's a similar problem. Python should be installed in a well-known > place with a sensible path. Of course, bonus points if Python can be launched > from anywhere, but I don't think it's a severe problem. In other words, I'd flag > this as "low priority". What about the case when using something like py2exe to create a distributable executable? I haven't been following this conversation closely, so maybe this issue never applies to Windows. But I can see a py2exe executable not having a sensible path, and there might be similar issues on other platforms. Eric. From amauryfa at gmail.com Tue Oct 7 15:51:07 2008 From: amauryfa at gmail.com (Amaury Forgeot d'Arc) Date: Tue, 7 Oct 2008 15:51:07 +0200 Subject: [Python-3000] Accessing module state from extension types Message-ID: Hello, Extension modules have a new "md_state" member, I understand that it is designed to hold the "static" state of the module. IIUC, for example in _cpickle.c, the "PyObject *dispatch_table" variable is a good candidate for such module state. This would allow to play more nicely with multiple startups/shutdowns, reloading of the module, or with different sub-interpreters. This state is accessible through the PyModule_GetState() function. 
This is fine for module functions (the module object is passed as the first argument, even if we always name it "self"), but how does it work with classes or class methods? Classes do not contain a reference to their modules, they only have access to the __name__, which is not the same thing at all, specially in this case. This is unfortunate for extension modules which try to be object-oriented, and have very few functions (the _pickle module does not have any BTW) How is this supposed to work? -- Amaury Forgeot d'Arc From janssen at parc.com Tue Oct 7 17:24:08 2008 From: janssen at parc.com (Bill Janssen) Date: Tue, 7 Oct 2008 08:24:08 PDT Subject: [Python-3000] Proposed Python 3.0 schedule (bytes/unicde again) In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> Message-ID: <40733.1223393048@parc.com> Antoine Pitrou wrote: > > - And then, getopt and optparse modules should work on bytestring > > vectors, so that you can use sys.argvb without writing your own > > argument parser. They don't currently. > > Then we will gradually start moving all modules even remotely related with IO > and filesystem stuff to a dual bytes/unicode API? That's precisely the kind of > confusion we want to end with Py3k (the confusion between bytes and unicode as > similar data types which could be used almost interchangeably without giving any > consideration to semantics). I wouldn't mix "IO" and "filesystem" that way. "IO" is complicated. The problem is, as we've lately discovered, that things which "look toward" the machine and the OS, like file system APIs or os.getcwd() or os.environ, are really dealing in bit sequences of various kinds, not strings, though the designers of these low-level artifacts have made some effort to disguise that. Things which "look toward" the user, on the other hand, are really dealing in strings, not bytes. 
There's a conversion step in there, if you are trying to write a program
to print to stdout (that is, the user) all the files in a directory (the
OS). Now, we can provide an automatic converter which will work in lots
of cases, but we can't afford to just deny the cases in which it doesn't
work. We need bytes APIs to the OS and underlying machine and networking
and probably other things; we need string APIs to communicate with the
user.

Bill

From tjreedy at udel.edu Tue Oct 7 17:44:24 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Tue, 07 Oct 2008 11:44:24 -0400
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <48EB358B.8020101@gmail.com>
References: <20081006.175034.343188282.mrs@localhost.localdomain>
	<20081006.212059.465784769.mrs@localhost.localdomain>
	<48EB358B.8020101@gmail.com>
Message-ID:

Nick Coghlan wrote:
> (added Michael to the CC list)
>
> It isn't object that has grown an __lt__ method, but type. The extra
> check Michael actually wants is a way to make sure that the method isn't
> coming from the object's metaclass, and the only reliable way to do that
> is the way collections.Hashable does it when looking for __hash__:
> iterate through the MRO looking for that method name in the class
> dictionaries

Thank you for the explanation. I was aware that MRO traversal should be
the 'officially correct' procedure for the original, but did not
understand why (for 2.x, at least).

> In the case of the comparison methods, they're being retrieved from type
> rather than object. This difference is made clear when you attempt to
> invoke the retrieved method:
>
>>>> object.__cmp__(1, 2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: expected 1 arguments, got 2
>>>> object.__cmp__(2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: type.__cmp__(x,y) requires y to be a 'type', not a 'int'
>>>> object.__cmp__(object)
> 0

This surprises me, partly because the situation seems to be different in 3.0.
Using __le__ in place of the non-existent __cmp__,

>>> ole = object.__le__
>>> ole(1,2)
NotImplemented
>>> ole(1)
Traceback (most recent call last):
  File "", line 1, in
    ole(1)
TypeError: expected 1 arguments, got 0
>>> ole(object)
Traceback (most recent call last):
  File "", line 1, in
    ole(object)
TypeError: expected 1 arguments, got 0
>>> ole
>>> dir(ole)
['__call__', '__class__', '__delattr__', '__doc__', '__eq__',
'__format__', '__ge__', '__get__', '__getattribute__', '__gt__',
'__hash__', '__init__', '__le__', '__lt__', '__name__', '__ne__',
'__new__', '__objclass__', '__reduce__', '__reduce_ex__', '__repr__',
'__setattr__', '__sizeof__', '__str__', '__subclasshook__']
# no __self__ attribute
>>> class C(object): pass
>>> C.__le__ # same as for hash in 2.5

I interpret all this to mean that in 3.0, rich comparisons *are* defined
on and being retrieved from object. Correct?

I presume the change is because in 3.0, everything is an instance of
object, so all classes can inherit the common methods from object,
whereas that was *not* true in 2.x. I very much like the cleaner design.

Terry Jan Reedy

From foom at fuhm.net Tue Oct 7 17:51:19 2008
From: foom at fuhm.net (James Y Knight)
Date: Tue, 7 Oct 2008 11:51:19 -0400
Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule
In-Reply-To: <48EB1408.1030007@v.loewis.de>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<48EB1408.1030007@v.loewis.de>
Message-ID:

On Oct 7, 2008, at 3:47 AM, Martin v. Löwis wrote:
>> - Having os.getcwdb isn't much use when you can't even run python in
>> the first place when the current directory has "bad" bytes in it.
>
> That's not true: it *is* of much use. Python will live in /usr/bin,
> which has a nicely-decodable path.
>
>> Currently Python outputs:
>> Could not find platform independent libraries
>> Could not find platform dependent libraries
>> Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
>> Fatal Python error: Py_Initialize: can't initialize sys standard streams
>> ImportError: No module named encodings.utf_8
>> Aborted
>
> I can't reproduce that. This happens (for me) when Python lives in
> a directory that has an undecodable path - not when the current
> directory is undecodable.

Sorry about that: this test was indeed in error: I ran "../python" from
an undecodeable current directory, rather than "/full/path/to/python", or
putting python on the PATH and running it as "python". The first does not
work, but the other more common ways to start it do.

>>
>> I'm sure there's even more APIs dealing with pathnames, command line
>> arguments, or environment variables that ought to be able to handle both
>> bytes and strings, that currently don't.
>
> Please, no.

I completely and totally agree with your distaste; it's rather gross to
allow bytes-or-str for every API that touches anything like
filenames/argv/environ. That's why I was pushing for the reversible
conversion to str... But if bytes-or-str is the solution that's been
chosen for this issue, it ought to either be fully committed to and
implemented, or at least fully recognized and documented as a half-baked
solution.

Of course, if a reversible encoding into string solution is used instead,
none of these things would need special treatment: they would all work
already.

FWIW: Qt works fine with undecodeable filenames, and it too uses unicode
strings everywhere in its API. I looked into what it does, and found that
it uses your (Martin's) original idea for solving this: it stores
undecodeable bytes as characters from 0x10fe00 to 0x10feff (which is
valid private-use codespace).
While that might not be ideally correct, since you lose those 256 PUA characters, even that is IMO better than pushing out bytes to every API, or worse, giving up and just having python unable to access files, as it is now. See lines 3074: QString::toUtf8() and 3408: QString::fromUtf8()) of http://www.google.com/codesearch?q=+show:o7fNK6SzOYs:NO-Bv-AR2rI:toIOngLf1V8&cs_p=http://ie.archive.ubuntu.com/trolltech/pub/qt/snapshots/qt-x11-opensource-src-4.4.0-snapshot-20070402.tar.bz2&cs_f=qt-x11-opensource-src-4.4.0-snapshot-20070402/src/corelib/tools/qstring.cpp James From g.brandl at gmx.net Tue Oct 7 18:04:37 2008 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 07 Oct 2008 18:04:37 +0200 Subject: [Python-3000] Problem with grammar for 'except'? In-Reply-To: References: <78b3a9580810051914v7a8995bax5f0f12d2a7934ad0@mail.gmail.com> Message-ID: Guido van Rossum schrieb: > Someone please fix the PEP. There are very good reasons for *not* > allowing "except X, Y:" to have a meaning -- if 2.x code somehow > accidentally ended up in the 3.0 world without having been run through > 2to3, it would silently perturb the meaning in the most confusing way. > That's why the implementation got it right. Done. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. 
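
For the 'except' grammar point above, a minimal sketch of the spelling
the thread treats as the Python 3 form: several exception types are
grouped in a parenthesised tuple, and the caught instance is bound with
`as`. The helper name `parse_number` is hypothetical and exists only for
this illustration.

```python
def parse_number(text):
    """Return int(text), or a short description of why it was rejected."""
    try:
        return int(text)
    except (ValueError, TypeError) as exc:
        # The tuple catches either type; `as exc` binds the instance.
        return "rejected: " + type(exc).__name__


print(parse_number("12"))
print(parse_number("twelve"))
print(parse_number(None))
```

Writing the tuple and the `as` clause explicitly leaves no room for the
2.x reading in which the second name after the comma would be the target
the exception is bound to.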
From tjreedy at udel.edu Tue Oct 7 20:07:36 2008 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 07 Oct 2008 14:07:36 -0400 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <48EB1408.1030007@v.loewis.de> Message-ID: James Y Knight wrote: > FWIW: Qt works fine with undecodeable filenames, and it too uses unicode > strings everywhere in its API. I looked into what it does, and found > that it uses your (Martin)'s original idea for solving this: it stores > undecodeable bytes as characters from 0x10fe00 to 0x10feff (which is > valid private-use codespace). While that might not be ideally correct, > since you lose those 256 PUA characters, even that is IMO better than > pushing out bytes to every API, or worse, giving up and just having > python unable to access files, as it is now. If Python uses a bit of the PUA (but only for filenames), which I think it should be free to do, then the manual should document that fact and when and why. Then any Python app that needs to use the full PUA could do so as long as it either avoids mixing filenames with its strings or avoids working with invalid filenames. The referenced QT file is licenced GPL2. From martin at v.loewis.de Tue Oct 7 21:40:21 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 07 Oct 2008 21:40:21 +0200 Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <00e001c92881$68ba93c0$3a2fbb40$@com.au> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> <00e001c92881$68ba93c0$3a2fbb40$@com.au> Message-ID: <48EBBB25.70609@v.loewis.de> > More specifically, I think 2to3 is shaping up well. 
pywin32 is taking the > approach of "port where possible, but keep in py2x syntax and convert at > 'setup.py' time" and this is working out fairly well I can't say how glad I am that you say that. It supports lib2to3 being a proper library, despite the problems that this may cause in itself. > * Better support for 2to3 in distutils (specifically, the support in > build_py is stale, plus 'build_scripts' and 'install_data' should convert > .py files to py3k syntax.) Please do create a bug report for that. It sounds like it's easy to fix. > An 'example' project that uses py2k syntax and > "just works" on py3k using this strategy might be useful here. Perhaps pywin32 :-? I don't think a demo project would do much good, as it doesn't exercise all the issues that may occur. > * A standard 'helper script' that allows people to use py3k to execute a > py2x syntax script by auto-converting the code. I've a 10ish-line script > that uses lib2to3 plus exec() to achieve that result, but a helper in 2to3 > for this would be nice. For a concrete use-case, we want to keep our > distutils script in py2x syntax, but execute it via py3k. Its very possible > this already exists and I've just missed it... For the case of setup.py, I was hoping that it could be written in compatible syntax even without needing conversion. That worked fine for my Django port. Is that not the case for pywin32? This specific issue might be out of scope for 3.x, IMO. > Either way, I'm fairly confident a pywin32 build for py3k will be available > in the next month or 2 (but as a result, I'm not really in a position to > help with the above for that period...) But please do file bug reports, preferably along with any patches to distutils that you already have. 
Regards, Martin From martin at v.loewis.de Tue Oct 7 22:06:52 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 07 Oct 2008 22:06:52 +0200 Subject: [Python-3000] Python3UnicodeDecodeError (Was: Proposed Python 3.0 schedule) In-Reply-To: <200810071130.35729.victor.stinner@haypocalc.com> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <200810071130.35729.victor.stinner@haypocalc.com> Message-ID: <48EBC15C.1050305@v.loewis.de> > First of all, please read my document: > http://wiki.python.org/moin/Python3UnicodeDecodeError I have problems understanding that document. Is it supposed to be a PEP (i.e. a proposal to enhance Python), or is it a description of the status quo? If it is a PEP, it should clearly separate status quo, specification, and rationale (in any order that you find reasonable). It should also have an "open issues" section, explicitly listing the questions that haven't been resolved, and it should record objections to the proposal. I think I would object to the specification (perhaps to the degree of proposing a counter-PEP), but to do so, I first need a specification to object to. In terms of time-line, I think any such PEP is *clearly* out of scope for Python 3.0. All the remaining issues should deferred to 3.1. That the approach "we can use bytes in the file system API" was so rushed into the code base is already unfortunate, but I can understand the motivation - people want to write backup programs in Python. If I take the text as if it was a specification, here are some of my objections: - Default encoding: a) seems irrelevant for the PEP. The default encoding doesn't nearly have the role anymore that it had in 2.x, and shouldn't have any effect on how file names are treated. 
b) I would propose that the notion of a default encoding is entirely eliminated from Python, along with sys.(get|set)defaultencoding - argv and environ: are you suggesting that the behavior described in the PEP is desirable? I don't think it is (but I don't think it should change for 3.0, either, only for 3.1) Regards, Martin From martin at v.loewis.de Tue Oct 7 22:09:31 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 07 Oct 2008 22:09:31 +0200 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <48EB1408.1030007@v.loewis.de> Message-ID: <48EBC1FB.5090209@v.loewis.de> James Y Knight wrote: > or at least fully recognized and documented as a half-baked > solution. I would prefer that, leaving a full resolution to 3.1 (or perhaps 3.2). If we wait long enough, the issue will disappear (a strategy that Sun is apparently taking for Java :-) Regards, Martin From fdrake at acm.org Tue Oct 7 22:18:09 2008 From: fdrake at acm.org (Fred Drake) Date: Tue, 07 Oct 2008 16:18:09 -0400 Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError In-Reply-To: <48EBC15C.1050305@v.loewis.de> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <200810071130.35729.victor.stinner@haypocalc.com> <48EBC15C.1050305@v.loewis.de> Message-ID: On Oct 7, 2008, at 4:06 PM, Martin v. 
Löwis wrote:
> b) I would propose that the notion of a default encoding is entirely
> eliminated from Python, along with sys.(get|set)defaultencoding

+1

-Fred

-- 
Fred Drake

From guido at python.org Tue Oct 7 22:28:30 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 7 Oct 2008 13:28:30 -0700
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
Message-ID:

On Mon, Oct 6, 2008 at 5:47 PM, Barry Warsaw wrote:
> So, we need to come up with a new release schedule for Python 3.0. My
> suggestion:
>
> 15-Oct-2008 3.0 beta 4
> 05-Nov-2008 3.0 rc 2
> 19-Nov-2008 3.0 rc 3
> 03-Dec-2008 3.0 final
>
> Given what still needs to be done, is this a reasonable schedule? Do we
> need two more betas?

I know I'm contradicting what I said earlier, but perhaps we should
just forget going back to beta and stick to ever-more-perfect release
candidates? In other worlds release candidates often contain tons of
imperfections (I believe I've seen this both for Java and Windows) and
the label "release candidate" more clearly encourages people to
download and play with it, which is what we need at this point! Then
the schedule would be something like

15-Oct-2008 3.0 rc 2
05-Nov-2008 3.0 rc 3
19-Nov-2008 3.0 rc 4
03-Dec-2008 3.0 final

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org Tue Oct 7 22:29:43 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 7 Oct 2008 13:29:43 -0700
Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError
In-Reply-To:
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<200810071130.35729.victor.stinner@haypocalc.com>
	<48EBC15C.1050305@v.loewis.de>
Message-ID:

On Tue, Oct 7, 2008 at 1:18 PM, Fred Drake wrote:
> On Oct 7, 2008, at 4:06 PM, Martin v.
Löwis wrote:
>>
>> b) I would propose that the notion of a default encoding is entirely
>> eliminated from Python, along with sys.(get|set)defaultencoding
>
> +1

I expect that the only effect of this change would be that the
filesystem encoding would become the de-facto default encoding for
other contexts as well. Not that that is necessarily a bad thing...

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From tjreedy at udel.edu Tue Oct 7 22:44:16 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Tue, 07 Oct 2008 16:44:16 -0400
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To:
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
Message-ID:

Guido van Rossum wrote:
> On Mon, Oct 6, 2008 at 5:47 PM, Barry Warsaw wrote:
>> So, we need to come up with a new release schedule for Python 3.0. My
>> suggestion:
>>
>> 15-Oct-2008 3.0 beta 4
>> 05-Nov-2008 3.0 rc 2
>> 19-Nov-2008 3.0 rc 3
>> 03-Dec-2008 3.0 final
>>
>> Given what still needs to be done, is this a reasonable schedule? Do we
>> need two more betas?
>
> I know I'm contradicting what I said earlier, but perhaps we should
> just forget going back to beta and stick to ever-more-perfect release
> candidates? In other worlds release candidates often contain tons of
> imperfections (I believe I've seen this both for Java and Windows) and
> the label "release candidate" more clearly encourages people to
> download and play with it, which is what we need at this point! Then
> the schedule would be something like
>
> 15-Oct-2008 3.0 rc 2
> 05-Nov-2008 3.0 rc 3
> 19-Nov-2008 3.0 rc 4
> 03-Dec-2008 3.0 final

As a user, I agree, even if it does stretch the usual notion of rc.
Having a beta follow and be better than a gamma (rc) would be confusing.
Also, it was the rc designation that encouraged more people to download
and play with rc1. I think there has definitely been more attention on
3.0 on c.l.p lately.
From rhamph at gmail.com Tue Oct 7 22:45:11 2008 From: rhamph at gmail.com (Adam Olsen) Date: Tue, 7 Oct 2008 14:45:11 -0600 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <48EB1408.1030007@v.loewis.de> Message-ID: On Tue, Oct 7, 2008 at 9:51 AM, James Y Knight wrote: > On Oct 7, 2008, at 3:47 AM, Martin v. L?wis wrote: >>> >>> - Having os.getcwdb isn't much use when you can't even run python in >>> the first place when the current directory has "bad" bytes in it. >> >> That's not true: it *is* of much use. Python will live in /usr/bin, >> which has a nicely-decodable path. >> >>> Currently Python outputs: >>> Could not find platform independent libraries >>> Could not find platform dependent libraries >>> Consider setting $PYTHONHOME to [:] >>> Fatal Python error: Py_Initialize: can't initialize sys standard streams >>> ImportError: No module named encodings.utf_8 >>> Aborted >> >> I can't reproduce that. This happens (for me) when Python lives in >> a directory that has an undecodable path - not when the current >> directory is undecodable. > > Sorry about that: this test was indeed in error: I ran "../python" from an > undecodeable current directory, rather than "/full/path/to/python", or > putting python on the PATH and running it as "python". The first does not > work, but the other more common ways to start it do. > >>> >>> I'm sure there's even more APIs dealing with pathnames, command line >>> arguments, or environment variables that ought to be able to handle both >>> bytes and strings, that currently don't. >> >> Please, no. > > I completely and totally agree with your distate, it's rather gross to allow > bytes-or-str for every API that touches anything like > filenames/argv/environ. 
That's why I was pushing for the reversible > conversion to str...But if bytes-or-str is the solution that's been chosen > for this issue, it ought to either be fully committed to and implemented, or > at least fully recognized and documented as a half-baked solution. > > Of course, if an reversible encoding into string solution is used instead, > none of these things would need special treatment: they would all work > already. > > FWIW: Qt works fine with undecodeable filenames, and it too uses unicode > strings everywhere in its API. I looked into what it does, and found that it > uses your (Martin)'s original idea for solving this: it stores undecodeable > bytes as characters from 0x10fe00 to 0x10feff (which is valid private-use > codespace). While that might not be ideally correct, since you lose those > 256 PUA characters, even that is IMO better than pushing out bytes to every > API, or worse, giving up and just having python unable to access files, as > it is now. > > See lines 3074: QString::toUtf8() and 3408: QString::fromUtf8()) of > > http://www.google.com/codesearch?q=+show:o7fNK6SzOYs:NO-Bv-AR2rI:toIOngLf1V8&cs_p=http://ie.archive.ubuntu.com/trolltech/pub/qt/snapshots/qt-x11-opensource-src-4.4.0-snapshot-20070402.tar.bz2&cs_f=qt-x11-opensource-src-4.4.0-snapshot-20070402/src/corelib/tools/qstring.cpp So what does Qt do when given a file name already using those PUA? Looks like they get passed through untouched when decoded, but will get translated into invalid names upon encoding. So you still have file names you can't open, and you're incompatible with what other libraries do. The only thing going for Qt is that they seem specifically interested in latin-1, rather than arbitrary bad names. The latin-1 strings that would correspond to the UTF-8 PUA used would include at least one control character, as well as other unusual bits, so it's pretty unlikely to encounter a real latin-1 file name like that. 
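
The private-use mapping discussed above, and the collision it leaves
open, can be sketched in Python 3 as follows. This is not Qt's code and
not an existing Python API: `decode_filename`, `encode_filename`, the
error-handler name "pua-escape-sketch" and the sample byte strings are
all hypothetical, and UTF-8 is assumed as the filesystem encoding. Only
the 0x10fe00 to 0x10feff range is taken from the thread.

```python
import codecs

PUA_BASE = 0x10FE00  # start of the range mentioned in the thread


def _pua_escape(exc):
    """Decode error handler: map each undecodable byte to PUA_BASE + byte."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    raise exc


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("pua-escape-sketch", _pua_escape)


def decode_filename(raw):
    """bytes -> str, keeping undecodable bytes as private-use characters."""
    return raw.decode("utf-8", "pua-escape-sketch")


def encode_filename(name):
    """str -> bytes, turning the private-use characters back into raw bytes."""
    out = bytearray()
    for ch in name:
        code = ord(ch)
        if PUA_BASE <= code <= PUA_BASE + 0xFF:
            out.append(code - PUA_BASE)
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


# A Latin-1 name that is not valid UTF-8 survives the round trip.
raw = b"caf\xe9.txt"
name = decode_filename(raw)
print(encode_filename(name) == raw)

# A name that legitimately contains a character from the reserved range
# decodes untouched but encodes to a different byte string.
clash = chr(PUA_BASE + 0x41).encode("utf-8")
print(encode_filename(decode_filename(clash)) == clash)
```

The second case is the objection raised in this message: valid UTF-8
names that already use those 256 code points stop round-tripping, so the
scheme trades one class of unopenable files for another, presumably much
rarer, one.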
-- Adam Olsen, aka Rhamphoryncus From mal at egenix.com Tue Oct 7 22:52:04 2008 From: mal at egenix.com (M.-A. Lemburg) Date: Tue, 07 Oct 2008 22:52:04 +0200 Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <200810071130.35729.victor.stinner@haypocalc.com> <48EBC15C.1050305@v.loewis.de> Message-ID: <48EBCBF4.7080200@egenix.com> On 2008-10-07 22:18, Fred Drake wrote: > On Oct 7, 2008, at 4:06 PM, Martin v. L?wis wrote: >> b) I would propose that the notion of a default encoding is entirely >> eliminated from Python, along with sys.(get|set)defaultencoding > > +1 As already mentioned in my reply to Viktor: +1. It's not adjustable anymore, so we might as well get rid off the sys module APIs. The term "default encoding" itself still has some value in that it is associated with the C API char* encoding used for PyUnicode objects in Python 3.0. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 07 2008) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ :::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. 
Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 From solipsis at pitrou.net Tue Oct 7 23:31:42 2008 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 7 Oct 2008 21:31:42 +0000 (UTC) Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <200810071130.35729.victor.stinner@haypocalc.com> <48EBC15C.1050305@v.loewis.de> Message-ID: Guido van Rossum python.org> writes: > > I expect that the only effect of this change would be that the > filesystem encoding would become the de-facto default encoding for > other contexts as well. But there is no such thing as "the" filesystem encoding (except in Python's simplified heuristics). There is one distinct encoding for each mounted filesystem. Regards Antoine. From mrs at mythic-beasts.com Tue Oct 7 23:28:16 2008 From: mrs at mythic-beasts.com (Mark Seaborn) Date: Tue, 07 Oct 2008 22:28:16 +0100 (BST) Subject: [Python-3000] A plus for naked unbound methods In-Reply-To: References: <20081006.212059.465784769.mrs@localhost.localdomain> Message-ID: <20081007.222816.343187053.mrs@localhost.localdomain> Terry Reedy wrote: > Mark Seaborn wrote: > > It appears that unbound methods do what you want in the general case > > in Python 2.5 and 2.6. It's just that __lt__ behaves unlike normal > > unbound methods. So this isn't an argument against unbound methods, > > it's an argument for __lt__ not to be a special case. > > It is not a special case. > > >>> def C(object): pass > ... > > >>> C.__hash__ == object.__hash__ > False > > >>> C.__str__ == object.__str__ > False I assume you meant to use "class" instead of "def", in which case most of the attributes do compare the way you want: >>> class C(object): pass ... 
>>> C.__hash__ == object.__hash__ True >>> C.__str__ == object.__str__ True # But in Python 2.6: >>> C.__lt__ == object.__lt__ False Mark From ncoghlan at gmail.com Tue Oct 7 23:47:30 2008 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 08 Oct 2008 07:47:30 +1000 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> <67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org> Message-ID: <48EBD8F2.4090802@gmail.com> Barry Warsaw wrote: > On Oct 6, 2008, at 9:48 PM, Raymond Hettinger wrote: > >> [Barry Warsaw] >>> So, we need to come up with a new release schedule for Python 3.0. >>> My suggestion: >>> 15-Oct-2008 3.0 beta 4 >>> 05-Nov-2008 3.0 rc 2 >>> 19-Nov-2008 3.0 rc 3 >>> 03-Dec-2008 3.0 final >>> Given what still needs to be done, is this a reasonable schedule? >>> Do we need two more betas? > >> Yes to both questions. > > I think that's contradictory :). If we need two betas, then 05-Nov > becomes beta 5, 19-Nov is rc 2. If we don't need another rc then we can > still do a final release on 03-Dec, otherwise we probably go 2 weeks > later. I don't want to go much later than that though because then we > get into the holiday season. Do we need the full two weeks between rc's? Or is it too much of a pain to cut releases 3 weeks in a row? E.g. something like: 15-Oct-2008 3.0 beta 4 05-Nov-2008 3.0 beta 5 19-Nov-2008 3.0 rc 2 26-Nov-2008 3.0 rc 3 (if needed) 03-Dec-2008 3.0 final Cheers, Nick. 
_______________________________________________ Python-3000 mailing list Python-3000 at python.org http://mail.python.org/mailman/listinfo/python-3000 Unsubscribe: http://mail.python.org/mailman/options/python-3000/ncoghlan%40gmail.com -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From martin at v.loewis.de Tue Oct 7 23:50:48 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 07 Oct 2008 23:50:48 +0200 Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <48EBD8F2.4090802@gmail.com> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> <67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org> <48EBD8F2.4090802@gmail.com> Message-ID: <48EBD9B8.4040102@v.loewis.de> > Do we need the full two weeks between rc's? If they are just other names for betas, yes. If they are true release candidates (in the sense of "we really want to release this as-is unless somebody tells us why this is a really bad idea"), then no. > Or is it too much of a pain > to cut releases 3 weeks in a row? It's a lot of effort, yes. Also for users, who will have barely installed one release candidate when the next one comes out. Regards, Martin From barry at python.org Wed Oct 8 00:00:23 2008 From: barry at python.org (Barry Warsaw) Date: Tue, 7 Oct 2008 18:00:23 -0400 Subject: [Python-3000] Proposed Python 3.0 schedule In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Oct 7, 2008, at 4:28 PM, Guido van Rossum wrote: > On Mon, Oct 6, 2008 at 5:47 PM, Barry Warsaw wrote: >> So, we need to come up with a new release schedule for Python 3.0. 
>> My >> suggestion: >> >> 15-Oct-2008 3.0 beta 4 >> 05-Nov-2008 3.0 rc 2 >> 19-Nov-2008 3.0 rc 3 >> 03-Dec-2008 3.0 final >> >> Given what still needs to be done, is this a reasonable schedule? >> Do we >> need two more betas? > > I know I'm contradicting what I said earlier, but perhaps we should > just forget going back to beta and stick to ever-more-perfect release > candidates? In other worlds release candidates often contain tons of > imperfections (I believe I've seen this both for Java and Windows) and > the label "release candidate" more clearly encourages people to > download and play with it, which is what we need at this point! Then > the schedule would be something like > > 15-Oct-2008 3.0 rc 2 > 05-Nov-2008 3.0 rc 3 > 19-Nov-2008 3.0 rc 4 > 03-Dec-2008 3.0 final I'm okay with that too. It does seem odd to go back to beta then release another rc. What's in a name, anyway? . And it is good that more people are downloading it now that it's rc. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSOvb93EjvBPtnXfVAQJTQAP/cmNdzd/SRymxXvW85EnW2NTHUkh1Auw9 bGlbSC0BF2p9ArgbDLPh/X4uatB3UaqoNeq5LTWHL2f9iCnsI7lFMPuexGr+3t4l Xmld8qN77j4GpU6bXL8uUo3/vlhU4MiG5ETl0kMH30f47srOAAGEGZAqW9jAM92I YSkQPSgBdYo= =+s9t -----END PGP SIGNATURE----- From martin at v.loewis.de Wed Oct 8 00:00:49 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 08 Oct 2008 00:00:49 +0200 Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <200810071130.35729.victor.stinner@haypocalc.com> <48EBC15C.1050305@v.loewis.de>

Message-ID: <48EBDC11.2040806@v.loewis.de> Antoine Pitrou wrote: > Guido van Rossum python.org> writes: >> I expect that the only effect of this change would be that the >> filesystem encoding would become the de-facto default encoding for >> other contexts as well. > > But there is no such thing as "the" filesystem encoding (except in Python's > simplified heuristics). There is one distinct encoding for each mounted > filesystem. At best - for mounted joliet/vfat/ntfs partitions. For ext3/ufs/jfs slices, every directory might use its own encoding, different files in a single directory might use different encodings, and even a single file name might switch encodings within itself. However, this is completely unrelated to the issue at hand: remove the "default encoding". Guido was suggesting that then merely the "file system encoding" takes its place. These are both Python-only concepts (in fact, Mark Hammond originally called the latter one "file system default encoding"). I think the notion of "default encoding" is flawed (for what it was used), and so it should be removed. You seem to think that the notion of "file system encoding" is also flawed - but do you infer from that that it also should be removed? Regards, Martin From barry at python.org Wed Oct 8 00:01:39 2008 From: barry at python.org (Barry Warsaw) Date: Tue, 7 Oct 2008 18:01:39 -0400 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <48EBD8F2.4090802@gmail.com> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> <67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org> <48EBD8F2.4090802@gmail.com> Message-ID: <3F6B0210-EE87-4CE4-B487-DF4AAF733637@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Oct 7, 2008, at 5:47 PM, Nick Coghlan wrote: > Barry Warsaw wrote: >> On Oct 6, 2008, at 9:48 PM, Raymond Hettinger wrote: >> >>> [Barry Warsaw] >>>> So, we need to come up with a new release schedule for Python 3.0. 
>>>> My suggestion: >>>> 15-Oct-2008 3.0 beta 4 >>>> 05-Nov-2008 3.0 rc 2 >>>> 19-Nov-2008 3.0 rc 3 >>>> 03-Dec-2008 3.0 final >>>> Given what still needs to be done, is this a reasonable schedule? >>>> Do we need two more betas? >> >>> Yes to both questions. >> >> I think that's contradictory :). If we need two betas, then 05-Nov >> becomes beta 5, 19-Nov is rc 2. If we don't need another rc then >> we can >> still do a final release on 03-Dec, otherwise we probably go 2 weeks >> later. I don't want to go much later than that though because then >> we >> get into the holiday season. > > Do we need the full two weeks between rc's? Or is it too much of a > pain > to cut releases 3 weeks in a row? > > E.g. something like: > > 15-Oct-2008 3.0 beta 4 > 05-Nov-2008 3.0 beta 5 > 19-Nov-2008 3.0 rc 2 > 26-Nov-2008 3.0 rc 3 (if needed) > 03-Dec-2008 3.0 final I won't be able to cut another release between the 15th and 5th, so at least that one should be 2 weeks. If we don't need the additional rc, then we can release early, which would put us just before the US Thanksgiving holiday. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSOvcQ3EjvBPtnXfVAQK5mwP9GQfw3zNvGhJWiSkZ2gQ1LNr0rnmfVmpF WcDePkz3e5nsOjtkwiN0rlYHIQE9ySPfvtqqrInBW8y97y79mTjiM4S32XHLyAsd WEWRb0ClcLuZs+JveAb8KF5pO0RlDgX9Dd6puuPr8kGa5aN/rosfsnXra1GrYpj3 JQghQ89JNkE= =+Ymq -----END PGP SIGNATURE----- From martin at v.loewis.de Wed Oct 8 00:05:45 2008 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 08 Oct 2008 00:05:45 +0200 Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0? 
In-Reply-To: References: <200809271404.25654.victor.stinner@haypocalc.com> <48E68911.6090403@g.nevcal.com> <3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com> <48E6A492.4090604@g.nevcal.com> <48E6ED99.2050406@g.nevcal.com> <48EA9B71.3060109@nevcal.com> <48EAF263.5080006@g.nevcal.com> Message-ID: <48EBDD39.1030902@v.loewis.de> > The posix version should hardcode it as b'/'; I only meant windows to > use UTF-16. You could perhaps use sys.getfilesystemencoding(), but > I'm unsure what it does if the encoding isn't an ascii superset (or > even if that can actually happen.) POSIX has the notion of a "portable character set", which includes the ASCII letters, digits, forward slash, and a few others. It requires this set to be supported on any POSIX implementation. So the file system encoding should always be an ASCII superset, in the repertoire superset sense. I don't think POSIX assigns specific code points, so it doesn't have to be a superset in the coded character set sense. I'm sure those VMS users will tell us some day. Regards, Martin From solipsis at pitrou.net Wed Oct 8 00:07:24 2008 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 08 Oct 2008 00:07:24 +0200 Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError In-Reply-To: <48EBDC11.2040806@v.loewis.de> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <200810071130.35729.victor.stinner@haypocalc.com> <48EBC15C.1050305@v.loewis.de>

<48EBDC11.2040806@v.loewis.de> Message-ID: <1223417244.14619.2.camel@fsol> On Wednesday 08 October 2008 at 00:00 +0200, "Martin v. Löwis" wrote: > You seem to think that the notion of "file system encoding" > is also flawed - but do you infer from that that it also should be > removed? On the condition that we find something better, yes. Otherwise, let's keep the heuristic. From martin at v.loewis.de Wed Oct 8 00:10:47 2008 From: martin at v.loewis.de ("Martin v. Löwis") Date: Wed, 08 Oct 2008 00:10:47 +0200 Subject: [Python-3000] Accessing module state from extension types In-Reply-To: References: Message-ID: <48EBDE67.2090102@v.loewis.de> > How is this supposed to work? The design was that you use PyState_FindModule, as an efficient way for getting a module object if you have the module def. The implementation fills an index into the module def (which will stay constant across interpreters), so this should give you your module object anywhere, in constant time. If you have specific proposals on how to make this more convenient to use, please go ahead. (Also, if you think that this is somehow flawed: this would be the time to mention it.) Regards, Martin From solipsis at pitrou.net Wed Oct 8 00:12:24 2008 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 08 Oct 2008 00:12:24 +0200 Subject: [Python-3000] [python-committers] Proposed Python 3.0 schedule In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>

Message-ID: <1223417544.14619.4.camel@fsol> On Tuesday 07 October 2008 at 18:00 -0400, Barry Warsaw wrote: > On Oct 7, 2008, at 4:28 PM, Guido van Rossum wrote: > > 15-Oct-2008 3.0 rc 2 > > 05-Nov-2008 3.0 rc 3 > > 19-Nov-2008 3.0 rc 4 > > 03-Dec-2008 3.0 final > > I'm okay with that too. It does seem odd to go back to beta then > release another rc. What's in a name, anyway? And it is good > that more people are downloading it now that it's rc. I also think it's better to call them rcs and encourage people to play with them. From barry at python.org Wed Oct 8 00:15:31 2008 From: barry at python.org (Barry Warsaw) Date: Tue, 7 Oct 2008 18:15:31 -0400 Subject: [Python-3000] Proposed Python 3.0 schedule In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Oct 7, 2008, at 4:28 PM, Guido van Rossum wrote: > 15-Oct-2008 3.0 rc 2 > 05-Nov-2008 3.0 rc 3 > 19-Nov-2008 3.0 rc 4 > 03-Dec-2008 3.0 final I've updated PEP 361 and the Google calendar with this schedule, except that the PEP says that rc3 and rc4 are planned "if needed".
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSOvfg3EjvBPtnXfVAQKDfwP/Sz9Ioe1tIrKtvD7JPG2cg2F+wfDJrc+9 vqfh6/eMWiUIOeSKJu6+gye7oXRcHwQXAPivNza3993HesOu0TjudnwXfkAlfsdE m09Rh70AXQQiY7JX46etugRC4BwkuNeBo253cvmfo6hPK0ZhOHZSy3H1LkhvvLA6 Cq56CVqDUgs= =i/Km -----END PGP SIGNATURE----- From barry at python.org Wed Oct 8 00:16:56 2008 From: barry at python.org (Barry Warsaw) Date: Tue, 7 Oct 2008 18:16:56 -0400 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <3F6B0210-EE87-4CE4-B487-DF4AAF733637@python.org> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> <67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org> <48EBD8F2.4090802@gmail.com> <3F6B0210-EE87-4CE4-B487-DF4AAF733637@python.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Oct 7, 2008, at 6:01 PM, Barry Warsaw wrote: > I won't be able to cut another release between the 15th and 5th, so > at least that one should be 2 weeks. If we don't need the > additional rc, then we can release early, which would put us just > before the US Thanksgiving holiday. Er, /3/ weeks between rc2 and rc3. 
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iQCVAwUBSOvf2HEjvBPtnXfVAQJDsQP8DRL2gQDMf1eEvgmmijPtVdbfAypZ1XMY huNzPu91v6dpvrogIP5MJbmJnSnka5yk78JIlkbTU4ZHS0ADsQX+IApU5y/SlO9Y FDtIqb+NFoVRFj5xQaN/EEqO8kNpq3WPmaEQJ4HHeDUIzcrbsPxfCm+vbePgnGzI AwhQqCzmX1I= =aQnH -----END PGP SIGNATURE----- From foom at fuhm.net Wed Oct 8 00:22:13 2008 From: foom at fuhm.net (James Y Knight) Date: Tue, 7 Oct 2008 18:22:13 -0400 Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com> <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net> <48EB1408.1030007@v.loewis.de> Message-ID: On Oct 7, 2008, at 4:45 PM, Adam Olsen wrote: > So what does Qt do when given a file name already using those PUA? > Looks like they get passed through untouched when decoded, but will > get translated into invalid names upon encoding. Well, I'd say that looks like a bug. It should probably decode those PUA characters as if they were undecodeable sequences so that they too roundtrip properly. > So you still have > file names you can't open In practical terms, I suspect nobody has ever run into a file which has this problem. You certainly can't say that is the case for Python-3's current behavior; my suspicion is that anyone who uses any non-ascii filenames at all will run into issues with Python3's behavior at least once. > , and you're incompatible with what other > libraries do. I'm sure there's a situation where that matters, but, at least I can run kpdf /any/arbitrary/file.pdf and have it work. And use the KDE file chooser, and have it able to browse my files, and choose any file, no matter what random characters it has in it. If there is an issue with interfacing to another library, the string can be converted to whatever the other library expects at the interface point... 
People keep claiming that odd filenames are only going to be an issue for "backup tools", but I don't think that's true. I think it'll be an issue for most any program that reads user-specified files. Whether it be by running Python in an ASCII (e.g. "C") locale when there are files created with UTF-8 names, or by having copied/downloaded a file with an incorrectly encoded name, it's going to come up, and be an irritant when it does. That Qt felt the need to make this change rather strengthens that point IMO... > The only thing going for Qt is that they seem specifically interested > in latin-1, rather than arbitrary bad names. The latin-1 strings that > would correspond to the UTF-8 PUA used would include at least one > control character, as well as other unusual bits, so it's pretty > unlikely to encounter a real latin-1 file name like that. I'd say they're most concerned about files that their users are likely to run into, yes. James From amauryfa at gmail.com Wed Oct 8 00:46:12 2008 From: amauryfa at gmail.com (Amaury Forgeot d'Arc) Date: Wed, 8 Oct 2008 00:46:12 +0200 Subject: [Python-3000] Accessing module state from extension types In-Reply-To: <48EBDE67.2090102@v.loewis.de> References: <48EBDE67.2090102@v.loewis.de> Message-ID: 2008/10/8 "Martin v. Löwis" : >> How is this supposed to work? > > The design was that you use PyState_FindModule, as an efficient way for > getting a module object if you have the module def. The implementation > fills an index into the module def (which will stay constant across > interpreters), so this should give you your module object anywhere, > in constant time. This is exactly what I was looking for. Thanks! > If you have specific proposals on how to make this more convenient to > use, please go ahead.
(also, if you think that this is somehow flawed: > this would be the time to mention it) I suppose that common usage will do things like this:
((MyModuleState *)PyModule_GetState(PyState_FindModule(&myModuleDef)))->globalValue
If you want to check for errors, it becomes tedious and some kind of macro could be useful. But this can be added later. Why is the function called PyState_FindModule? It's the only one with this prefix (with _PyState_AddModule); other functions in the same module are called PyInterpreterState_*. I suggest renaming it now; otherwise there may be confusion between "module state" and "interpreter state"; see the example above: PyModule_GetState(PyState_FindModule(x)) seems to be a round-trip (or a no-op) to the casual reader. And unless you already planned to do so, I think I will start documenting the module API. -- Amaury Forgeot d'Arc From ncoghlan at gmail.com Wed Oct 8 11:44:50 2008 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 08 Oct 2008 19:44:50 +1000 Subject: [Python-3000] A plus for naked unbound methods In-Reply-To: References: <20081006.175034.343188282.mrs@localhost.localdomain> <20081006.212059.465784769.mrs@localhost.localdomain> <48EB358B.8020101@gmail.com> Message-ID: <48EC8112.9020404@gmail.com> Terry Reedy wrote: >> In the case of the comparison methods, they're being retrieved from type >> rather than object. This difference is made clear when you attempt to >> invoke the retrieved method: >> >>>>> object.__cmp__(1, 2) >> Traceback (most recent call last): >> File "<stdin>", line 1, in <module> >> TypeError: expected 1 arguments, got 2 >>>>> object.__cmp__(2) >> Traceback (most recent call last): >> File "<stdin>", line 1, in <module> >> TypeError: type.__cmp__(x,y) requires y to be a 'type', not a 'int' >>>>> object.__cmp__(object) >> 0 > > This surprises me, partly because the situation seems to be different in > 3.0.
That's because the default comparison of object() instances also changes in Py3k: equality and inequality checks will succeed (using identity based comparison), but ordering checks will fail with a TypeError. The rich comparisons on type() in 2.6 are actually there in order to issue a Py3k warning when -3 is defined and an ordering comparison is invoked on a type, but it appears no such warning is currently present for default object comparison. That lack of Py3k warnings is arguably a bug in 2.6, but we would want to think carefully about the backwards compatibility implications of defining rich comparisons on object before adding such warnings. As we've seen, even adding rich comparisons to type was enough to break some user code (admittedly it was code that made some unwarranted assumptions and hence was already potentially broken in the face of metaclasses other than type, but the change did in fact break that code for cases where it used to work). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From mhammond at skippinet.com.au Tue Oct 7 15:34:15 2008 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 8 Oct 2008 00:34:15 +1100 Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> Message-ID: <00e001c92881$68ba93c0$3a2fbb40$@com.au> [when 2 mailing lists are not enough... :-] > I'm seeing that people are just starting to download and play with 3.0. > I expect that we'll start getting more feedback on conversion issues +1 from this direction too. 
pywin32 has recently started looking seriously at py3k, and while things are in fairly good shape for those of us who are already "on the bandwagon", cleaning up a few rough edges would help people's first impressions - and as they say, you only get one chance at a good first impression... More specifically, I think 2to3 is shaping up well. pywin32 is taking the approach of "port where possible, but keep in py2x syntax and convert at 'setup.py' time" and this is working out fairly well (in fact, with just a couple of helpers in pywintypes, I think we can support python 2.3 upwards). I believe that many projects may well take a similar approach as it allows them to defer a full commitment to py3k, so doing all we can to support this might help with that first impression. My experience is that this could best be achieved by addressing the following issues before release:
* Almost all open 2to3 issues that aren't truly edge cases should be resolved - if 2to3 doesn't work for people, they may be forced to (even temporarily) "fork" their project, which will cause concern. I'll note that good recent progress is being made here, but it's still worth mentioning...
* Better support for 2to3 in distutils (specifically, the support in build_py is stale, plus 'build_scripts' and 'install_data' should convert .py files to py3k syntax.) An 'example' project that uses py2k syntax and "just works" on py3k using this strategy might be useful here.
* A standard 'helper script' that allows people to use py3k to execute a py2x syntax script by auto-converting the code. I've a 10ish-line script that uses lib2to3 plus exec() to achieve that result, but a helper in 2to3 for this would be nice. For a concrete use-case, we want to keep our distutils script in py2x syntax, but execute it via py3k. It's very possible this already exists and I've just missed it...
Either way, I'm fairly confident a pywin32 build for py3k will be available in the next month or 2 (but as a result, I'm not really in a position to help with the above for that period...) Hopefully-helpfully, Mark From mhammond at skippinet.com.au Wed Oct 8 03:04:36 2008 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 8 Oct 2008 12:04:36 +1100 Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <48EBBB25.70609@v.loewis.de> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> <00e001c92881$68ba93c0$3a2fbb40$@com.au> <48EBBB25.70609@v.loewis.de> Message-ID: <014001c928e1$dc13af40$943b0dc0$@com.au> > > * Better support for 2to3 in distutils (specifically, the support in > > build_py is stale, plus 'build_scripts' and 'install_data' should > > convert > > .py files to py3k syntax.) > > Please do create a bug report for that. It sounds like it's easy to > fix. Yeah, build_py is fairly easy to fix, but I also needed to extend the support to build_scripts and install_data. In addition, some already reported bugs in 2to3 mean that some files fail to convert, and this breaks the entire process - so as a result I ended up duplicating lib2to3's 'refactor_items()' but with exceptions being logged and ignored rather than aborting the process. Oh - and I deleted the .bak files (a copy of the sources is converted, not the sources themselves). Please see bugs 4072 and 4073 - but as mentioned below, the lack of a test case means I didn't supply a tested patch. > > An 'example' project that uses py2k syntax and > > "just works" on py3k using this strategy might be useful here. > > Perhaps pywin32 :-? > > I don't think a demo project would do much good, as it doesn't exercise > all the issues that may occur. My idea was that the demo project would simply demonstrate the 2to3 concepts that such a project could use.
pywin32 isn't a good example as it has a very non-trivial setup.py and a large set of C extensions (the demo I had in mind could avoid C extensions completely - C developers will already assume #ifdef will be their friend, but .py code is the unknown...) It would basically be a 'distutils demo', could have a single .py module and a single .py script. setup.py would support both 2.x and 3.x and would demonstrate how the source is converted to py3k syntax before it is installed into the py3k distribution. It would also provide a useful test case - eg, for the distutils bug above, I'm not sure how I can (a) demonstrate it is currently broken and (b) demonstrate a patch corrects the problem. > > * A standard 'helper script' that allows people to use py3k to > > execute a py2x syntax script by auto-converting the code. I've > > a 10ish-line script that uses lib2to3 plus exec() to achieve that > > result, but a helper in 2to3 > > for this would be nice. For a concrete use-case, we want to keep our > > distutils script in py2x syntax, but execute it via py3k. Its very > > possible this already exists and I've just missed it... > > For the case of setup.py, I was hoping that it could be written in > compatible syntax even without needing conversion. That worked fine for > my Django port. Is that not the case for pywin32? setup.py catches and examines some exceptions. Consider the more general case though - pywin32 has a number of tests all of which will also be maintained in py2x syntax. It is extremely convenient to be able to execute: % py3k run2.py my_test.py etc And have 'my_test.py' (which is 2.x syntax) be executed directly by py3k without doing a full 'setup.py install' or manually invoking 2to3 via a temp file, etc. As mentioned, 'run2.py' is quite short and just uses lib2to3+exec, but I'm not sure everyone will work out how to roll their own... 
Specifically, I believe that a script with similar capabilities could be installed with py3k in the "scripts" directory and it advertised as a reasonable way to directly execute your *scripts* which, although py3x compatible, are being maintained in py2x syntax. Below is my quick attempt at such a script, which I promptly stopped looking at as soon as it worked (ie, I'm not sure if all those options are needed, etc), but it does let me execute my tests using py3k directly from the source tree. Cheers, Mark
---
# This is a Python 3.x script to execute a python 2.x script by 2to3'ing it.
import sys
from lib2to3.refactor import RefactoringTool, get_fixers_from_package

fixers = get_fixers_from_package('lib2to3.fixes')
options = dict(doctests_only=False, fix=[], list_fixes=[],
               print_function=False, verbose=False,
               write=True)
r = RefactoringTool(fixers, options)
script = sys.argv[1]
data = open(script).read()
print("Converting...")
got = r.refactor_string(data, script)
print("Executing...")
# nuke ourselves from argv
del sys.argv[1]
exec(str(got))
---
From mhammond at skippinet.com.au Wed Oct 8 03:26:22 2008 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 8 Oct 2008 12:26:22 +1100 Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <014001c928e1$dc13af40$943b0dc0$@com.au> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> <00e001c92881$68ba93c0$3a2fbb40$@com.au> <48EBBB25.70609@v.loewis.de> <014001c928e1$dc13af40$943b0dc0$@com.au> Message-ID: <014201c928e4$e726bd20$b5743760$@com.au> > at such a script, which I promptly stopped looking at as soon as it > worked Which is quite obvious really given that: > # nuke ourselves from argv > del sys.argv[1] is removing the wrong value!
Mark From musiccomposition at gmail.com Wed Oct 8 20:59:38 2008 From: musiccomposition at gmail.com (Benjamin Peterson) Date: Wed, 8 Oct 2008 12:59:38 -0600 Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python 3.0 schedule In-Reply-To: <014001c928e1$dc13af40$943b0dc0$@com.au> References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org> <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1> <00e001c92881$68ba93c0$3a2fbb40$@com.au> <48EBBB25.70609@v.loewis.de> <014001c928e1$dc13af40$943b0dc0$@com.au> Message-ID: <1afaf6160810081159o18e64e68te95ab94f1198472c@mail.gmail.com> On 10/7/08, Mark Hammond wrote: > # This is a Python 3.x script to execute a python 2.x script by 2to3'ing it. > import sys > from lib2to3.refactor import RefactoringTool, get_fixers_from_package > > fixers = get_fixers_from_package('lib2to3.fixes') > options = dict(doctests_only=False, fix=[], list_fixes=[], > print_function=False, verbose=False, > write=True) Note that only the print_function option is used. > r = RefactoringTool(fixers, options) > script = sys.argv[1] > data = open(script).read() > print("Converting...") > got = r.refactor_string(data, script) > print("Executing...") > # nuke ourselves from argv > del sys.argv[1] > exec(str(got)) > --- > > _______________________________________________ > python-committers mailing list > python-committers at python.org > http://mail.python.org/mailman/listinfo/python-committers > -- Cheers, Benjamin Peterson "There's nothing quite as beautiful as an oboe... except a chicken stuck in a vacuum cleaner." 

From musiccomposition at gmail.com Wed Oct 8 22:43:22 2008
From: musiccomposition at gmail.com (Benjamin Peterson)
Date: Wed, 8 Oct 2008 15:43:22 -0500
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <48EC4A57.8030608@hlabs.spb.ru>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
 <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
 <48EC4A57.8030608@hlabs.spb.ru>
Message-ID: <1afaf6160810081343h62bf5ab3pad21d32e68a48313@mail.gmail.com>

On Wed, Oct 8, 2008 at 12:51 AM, Dmitry Vasiliev wrote:
>
> BTW, I think the following issues should be also marked as release blockers:

Agreed and done.

>
> - http://bugs.python.org/issue3714 (nntplib module broken by str to
>   unicode conversion)
> - http://bugs.python.org/issue3725 (telnetlib module broken by str to
>   unicode conversion)
> - http://bugs.python.org/issue3727 (poplib module broken by str to
>   unicode conversion)
>
> --
> Dmitry Vasiliev
> http://hlabs.spb.ru
>

--
Cheers,
Benjamin Peterson
"There's nothing quite as beautiful as an oboe... except a chicken
stuck in a vacuum cleaner."

From tjreedy at udel.edu Fri Oct 10 04:20:02 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Thu, 09 Oct 2008 22:20:02 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To: <48E68911.6090403@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
 <87od26e3an.fsf@xemacs.org>
 <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
 <48E2CCEC.9030709@canterbury.ac.nz>
 <871vz0pnuw.fsf@xemacs.org>
 <87wsgso178.fsf@xemacs.org>
 <48E67175.1030103@g.nevcal.com>
 <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
 <48E68911.6090403@g.nevcal.com>
Message-ID:

Glenn Linderman wrote:
> My understanding of the Posix file names is that any byte values are
> valid except "/" and null. Is this a correct understanding?
>
> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream. Call the original byte stream FOO. The
> transformation then produces FOOTR, a set of Unicode code points. Now
> FOOTR has a representation in UTF-8, which is a byte stream, call that
> byte stream FOOTRUTF8. How, by looking at FOOTR, do you know whether it
> represents the file name FOO or FOOTRUTF8 ? And remember that the user
> might provide a Unicode character stream identical to FOOTR: should it
> be translated to FOO or FOOTRUTF8 when creating a new file according to
> the user-supplied name?

If FOOTR is using PUA chars, then I believe that users should not be
providing such a stream as it would have no defined meaning coming from
them.

From stephen at xemacs.org Fri Oct 10 06:38:25 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 10 Oct 2008 13:38:25 +0900
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To:
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
 <87od26e3an.fsf@xemacs.org>
 <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
 <48E2CCEC.9030709@canterbury.ac.nz>
 <871vz0pnuw.fsf@xemacs.org>
 <87wsgso178.fsf@xemacs.org>
 <48E67175.1030103@g.nevcal.com>
 <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
 <48E68911.6090403@g.nevcal.com>
Message-ID: <87k5chxdxa.fsf@xemacs.org>

Terry Reedy writes:

> If FOOTR is using PUA chars, then I believe that users should not
> be providing such a stream as it would have no defined meaning
> coming from them.

But that's precisely what "private use" means: the users provide their
own definitions! The Unicode standard provides that if a process doesn't
know what those characters mean, it *must* pass them through *unchanged*,
on the assumption that they will eventually reach a user who knows what
they mean.
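
[Editorial sketch, not part of the original thread.] Glenn's FOO / FOOTR / FOOTRUTF8 question can be made concrete. The sketch below assumes the UTF-8b idea in the form Python later exposed it, the "surrogateescape" error handler (PEP 383), and contrasts it with a hypothetical PUA-based escape. The handler name "pua_escape_demo", the U+F700 offset, and the sample file name are illustrative assumptions, not anything proposed in these messages.

```python
import codecs


def _pua_escape(exc):
    # Hypothetical scheme: map each undecodable byte 0xNN to the
    # private-use code point U+F700 + 0xNN.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF700 + b) for b in bad), exc.end


codecs.register_error("pua_escape_demo", _pua_escape)

foo = b"caf\xe9.txt"  # FOO: Latin-1 bytes, not valid UTF-8

# UTF-8b style: the undecodable byte becomes a lone surrogate in FOOTR.
footr = foo.decode("utf-8", "surrogateescape")
round_trip = footr.encode("utf-8", "surrogateescape")

# A strict UTF-8 encoder refuses the lone surrogate.
try:
    footr.encode("utf-8")
    strict_accepts_footr = True
except UnicodeEncodeError:
    strict_accepts_footr = False

# PUA style: FOOTR is ordinary, valid Unicode, so it also has a
# perfectly legal UTF-8 form (FOOTRUTF8) that differs from FOO.
pua_footr = foo.decode("utf-8", "pua_escape_demo")
pua_footrutf8 = pua_footr.encode("utf-8")
```

With the surrogate form, FOOTR has no valid UTF-8 encoding, so a strict encoder rejects it and only FOO can be meant. With the PUA form, FOOTR encodes cleanly to a different, legal byte string, which is exactly the ambiguity Glenn asks about.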

So this means that (to conform to Unicode) every Python program must take
responsibility for ensuring that it tracks every filename to be sure that
no internal-use PUA characters make it to the "outside world" where they
will be propagated indefinitely by conforming processes. This is a
substantial burden.

This is precisely the advantage of UTF-8b: the first conforming process
that catches any escapees will scream bloody murder and turn them over to
the Spanish Inquisition, who will torture them on the rack until they
confess that Python did it.

From stephen at xemacs.org Fri Oct 10 08:55:56 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 10 Oct 2008 15:55:56 +0900
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To: <48EEDECA.8050107@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
 <87od26e3an.fsf@xemacs.org>
 <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
 <48E2CCEC.9030709@canterbury.ac.nz>
 <871vz0pnuw.fsf@xemacs.org>
 <87wsgso178.fsf@xemacs.org>
 <48E67175.1030103@g.nevcal.com>
 <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
 <48E68911.6090403@g.nevcal.com>
 <87k5chxdxa.fsf@xemacs.org>
 <48EEDECA.8050107@g.nevcal.com>
Message-ID: <87iqs1x7k3.fsf@xemacs.org>

Glenn Linderman writes:

> Define a conforming process.

For present purposes, one that promises not to emit invalid Unicode
strings as Unicode.

> If it is one that handles Unicode with full validation, all is
> wonderful, except on platforms that permit non-validated Unicode names
> or non-Unicode names. And these are precisely the platforms for which
> these various translation schemes have been proposed.

Those aren't the proposals I've been reading about. True, people have
suggested limiting the translation schemes with various coverage for
different platforms.
But AFAIK, all platforms supported by Python allow NFS mounts, not to
mention FAT filesystems on removable devices, so in practice all may
encounter arbitrary filenames in arbitrary encodings. Nor is it trivial
for Python to figure out what filesystems, let alone encodings, are being
used. So Python has to support whatever is decided, period, perhaps with
more or less complex heuristics to tune treatment to platforms.

> And so they will not enforce full validation on file names, even if they
> handle full validation on other strings.

Well, in practice that means conforming processes *will* validate at least
some file names, since I don't know of any systems that really treat file
names as anything but strings.

> And Python will not always be the culprit.

But if the defaults get screwed up here, it will remain one of the "usual
suspects" for a long time to come. It would be nice to provide a
foundation for doing better than that, but nothing proposed so far does.
That's not surprising, because they're designed to preserve, rather than
handle, apparently invalid data, in hopes that somebody else will clean up
the mess. The problem that all the proposals face is that they assume that
we know where the cleaning up will be done, and that we're in control of
the code that will have to do it.

From ncoghlan at gmail.com Fri Oct 10 10:25:26 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 10 Oct 2008 18:25:26 +1000
Subject: [Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
In-Reply-To: <48EF04CC.5080503@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
 <2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
 <87od26e3an.fsf@xemacs.org>
 <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
 <48E2CCEC.9030709@canterbury.ac.nz>
 <871vz0pnuw.fsf@xemacs.org>
 <87wsgso178.fsf@xemacs.org>
 <48E67175.1030103@g.nevcal.com>
 <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
 <48E68911.6090403@g.nevcal.com>