From ncoghlan at gmail.com  Wed Oct  1 00:02:29 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 01 Oct 2008 08:02:29 +1000
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
 2.6 or 3.0?
In-Reply-To: <ca471dc20809301438p4d6761c1m6937859f29bc677@mail.gmail.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>	
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>	
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>	
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>	
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>	
	<87od26e3an.fsf@xemacs.org>	
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>	
	<ca471dc20809300957o61b554e2n9101e0b1078b1647@mail.gmail.com>	
	<B11BFA0C-4238-4623-B040-73BC5358831F@fuhm.net>	
	<48E29AB6.908@gmail.com>
	<ca471dc20809301438p4d6761c1m6937859f29bc677@mail.gmail.com>
Message-ID: <48E2A1F5.5040009@gmail.com>

Guido van Rossum wrote:
> On Tue, Sep 30, 2008 at 2:31 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> I'm also starting to wonder if allowing mixed types might be the way to
>> go for these interfaces - leaving the bytes objects in place if the
>> Unicode decode operation fails.
> 
> No, no, nooooo!

Yeah, I realised shortly after sending that message that this is exactly
the problem this discussion is trying to get rid of. I saw at least one
other post containing a similar comment though, so I didn't feel *too*
foolish for writing it (although that didn't stop me wishing my email
client had a "Retract stupid comment" button).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From guido at python.org  Wed Oct  1 00:04:03 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 Sep 2008 15:04:03 -0700
Subject: [Python-3000] [Python-Dev] Patch for an initial support of
	bytes filename in Python3
In-Reply-To: <20080930184751.31635.1484325691.divmod.xquotient.520@weber.divmod.com>
References: <200809300247.20349.victor.stinner@haypocalc.com>
	<20080930132151.31635.132601277.divmod.xquotient.434@weber.divmod.com>
	<ca471dc20809300732r456678fcgb8caeb369a6cf349@mail.gmail.com>
	<20080930175932.31635.989735053.divmod.xquotient.478@weber.divmod.com>
	<ca471dc20809301056j6800b6e1nca9a9ec5a52e8445@mail.gmail.com>
	<20080930184751.31635.1484325691.divmod.xquotient.520@weber.divmod.com>
Message-ID: <ca471dc20809301504h2fc99567o34eb6c947d882c1e@mail.gmail.com>

On Tue, Sep 30, 2008 at 11:47 AM,  <glyph at divmod.com> wrote:
>
> On 05:56 pm, guido at python.org wrote:
>>
>> On Tue, Sep 30, 2008 at 10:59 AM,  <glyph at divmod.com> wrote:
>>>
>>> On 02:32 pm, guido at python.org wrote:
>
>>> In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the
>>> "benefit of the doubt" case?  It could always be added to 2.7, and the
>>> parity release of 2to3 could have a --2.7 switch that would modify the
>>> behavior of this and other fixers.
>>
>> I'm not sure what you're proposing. *My* proposal is that 2to3 changes
>> os.getcwdu() calls to os.getcwd() and leaves os.getcwd() calls alone
>> -- there's no way to tell whether os.getcwdb() would be a better
>> match, and for portable code, it won't be (since os.getcwdb() is a
>> Unix-only thing).
>
> My proposal is simply to change getcwd to getcwdb, and getcwdu to getcwd.
>  This preserves whatever bytes/text behavior you are expecting from 2.6 into
> 3.0.  Granted, the fact that unicode is really always the right thing to do
> on Windows complicates things.

Plus, even on Linux Unicode is *usually* what you should be doing,
unless you're writing a backup tool.

> I already tend to avoid os.getcwd() though, and this is just one more reason
> to avoid it.  In the rare cases where I really do need it, it looks like
> os.path.abspath(b".") / os.path.abspath(u".") will provide the clarity that
> I want.

Or os.path.expanduser('~') vs. os.path.expanduser(b'~'). :-)
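
As a rough illustration of the str/bytes pairing discussed above: the sketch below assumes a Python 3 release that provides os.getcwdb() under the name used in this thread. The os.path functions return the same type they are given.

```python
import os

# Text API: returns str.
cwd_text = os.getcwd()

# Bytes API: returns the same directory as raw bytes.
cwd_bytes = os.getcwdb()

# os.path functions are polymorphic: the result type follows the argument type.
abs_text = os.path.abspath(".")
abs_bytes = os.path.abspath(b".")
home_text = os.path.expanduser("~")
home_bytes = os.path.expanduser(b"~")

print(type(cwd_text).__name__, type(cwd_bytes).__name__)
print(type(abs_text).__name__, type(abs_bytes).__name__)
print(type(home_text).__name__, type(home_bytes).__name__)
```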

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Wed Oct  1 00:21:04 2008
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 01 Oct 2008 00:21:04 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
 filename issue
In-Reply-To: <ca471dc20809301434u6116391cje5778bcef5048cc9@mail.gmail.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbr0nv$iqu$1@ger.gmane.org>	<200809300202.38574.victor.stinner@haypocalc.com>	<gbsgk6$kc1$1@ger.gmane.org>	<ca471dc20809300659g608f8c14g29ba2b30def1be1f@mail.gmail.com>	<gbtnjo$quh$1@ger.gmane.org>	<ca471dc20809301045r59251402g3fe947dec3bc7f22@mail.gmail.com>	<48E28C31.6060606@v.loewis.de>
	<ca471dc20809301434u6116391cje5778bcef5048cc9@mail.gmail.com>
Message-ID: <48E2A650.4000108@v.loewis.de>

>> My concern still is that it brings the bytes type into the status of
>> another character string type, which is really bad, and will require
>> further modifications to Python for the lifetime of 3.x.
> 
> I'd like to understand why this is "really bad". I thought it was by
> design that the str and bytes types behave pretty similarly. You can
> use both as dict keys.

If they have to behave pretty similarly, they have to be supported in
all APIs that deal with text. For example, people will demand that
printing bytes should just copy them onto the stream (rather than
invoking repr()), and writing them onto a text stream should work the
same way. GUI libraries should support them, the XML libraries, and so
on.

Where will you stop, and tell people that bytes are just not supposed
to do this or that?

>> This is because applications will then regularly use byte strings for
>> file names on Unix, and regular strings on Windows, and then expect
>> the program to work the same without further modifications.
> 
> It seems that bytes arguments actually *do* work on Windows -- somehow
> they get decoded. (Unless Terry's report was from 2.x.)

To a limited degree - see my other message. Don't try to listdir a
directory with characters outside CP_ACP (it will give you invalid
file names).

> Actually something like that may not be a bad idea. Ian Bicking's
> webob supports similar double APIs for getting the request parameters
> out of a request object; I believe request.GET['x'] is a text object
> and request.GET_str['x'] is the corresponding uninterpreted bytes
> sequence. I would prefer to have os.environb over os.environ[b"PATH"]
> though.

And would you keep them synchronized?

> I assume at some point we can stop and have sufficiently low-level
> interfaces that everyone can agree are in bytes only. Bytes aren't
> going away. How does Java deal with this? Its File class doesn't seem
> to deal in bytes at all. What would its listFiles() method do with
> undecodable filenames?

Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte
sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will
fail with FileNotFoundException.

IOW, Java hasn't solved the problem in the last 10 years. Marcin
Kowalczyk did a more thorough analysis about a year ago in

http://mail.python.org/pipermail/python-3000/2007-September/010450.html
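
The U+FFFD behaviour described above can be imitated in Python to see why it is lossy. This is only a sketch: the two filenames are made up, and UTF-8 is assumed as the filesystem encoding.

```python
# Two distinct (hypothetical) on-disk names, neither valid UTF-8.
raw_a = b"report-\x90.txt"
raw_b = b"report-\x91.txt"

# Decoding with errors="replace" maps each bad byte to U+FFFD.
text_a = raw_a.decode("utf-8", errors="replace")
text_b = raw_b.decode("utf-8", errors="replace")

# After decoding, the two names can no longer be told apart.
print(text_a == text_b)

# Re-encoding the decoded name does not give back the original bytes,
# so opening the file by its decoded name cannot find it.
print(text_a.encode("utf-8") == raw_a)
```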

Regards,
Martin



From martin at v.loewis.de  Wed Oct  1 00:28:22 2008
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 01 Oct 2008 00:28:22 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
 filename issue
In-Reply-To: <83758335-97EA-441B-A783-05F16EBE6D7A@fuhm.net>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbtq8t$3dl$1@ger.gmane.org>
	<48E29CB1.5010309@v.loewis.de>
	<83758335-97EA-441B-A783-05F16EBE6D7A@fuhm.net>
Message-ID: <48E2A806.6020607@v.loewis.de>

> Yes! If there is a byte-string access method for Windows, pretty please
> make it decode from UTF-8 internally and call the Unicode version of the
> Windows APIs. The non-unicode windows APIs are pretty much just broken
> -- Ideally, Python should never be calling those.

I don't think we will manage to release Python 3.0 this year if that
change is to be implemented. And then, I don't think the release manager
will agree to such a delay.

I disagree that the ANSI APIs are broken. For most users (and by that,
I mean much more than 99% of the world population with access to
Windows computers), they work just fine. You have to deliberately try
to break them, or work in an environment where you speak multiple
languages (with conflicting scripts) simultaneously. Practicality
beats purity, and I applaud Microsoft for such a foresighted design
(they are guilty of bad designs in other places, but this one really
gives a good tradeoff of all issues, all things considered).

Regards,
Martin

From martin at v.loewis.de  Wed Oct  1 00:32:03 2008
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 01 Oct 2008 00:32:03 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
 filename issue
In-Reply-To: <EF183EA0-B073-4883-9362-C8B6C8E470D3@cwi.nl>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbtq8t$3dl$1@ger.gmane.org>	<ca471dc20809301120y5149d346s31b0027b7bdd529e@mail.gmail.com>	<gbtvd3$na4$1@ger.gmane.org>
	<48E29D3B.5030900@v.loewis.de>
	<EF183EA0-B073-4883-9362-C8B6C8E470D3@cwi.nl>
Message-ID: <48E2A8E3.3070805@v.loewis.de>


> How does windows (and Python on windows) handle NFC versus NFD issues?

That's left to the application.

> Can I have two files called "ümlaut.txt", one in NFD and one NFC form?

Yes, you can. It sounds confusing, but only in a theoretical way. You
never have combining characters on Windows (at least, I don't). The
keyboard input defaults to NFC, and users normally don't type file
names, anyways, except when creating the files - later, they just use
the mouse to indicate what file they want to act on.

> And are both of those representable on the Python side (i.e. can they
> both be returned from listdir() and passed to open())?

Certainly!

> If I compare
> these two filenames, do they compare differently? 

Certainly!
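
A small sketch of the comparison in question, using the standard unicodedata module; the filename is an arbitrary example, not one from the thread.

```python
import unicodedata

name = "\u00e4.txt"  # 'a' with diaeresis as one precomposed code point

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# The two forms render the same but are different code point sequences.
print(len(nfc), len(nfd))
print(nfc == nfd)

# Comparing after normalizing both sides to one form treats them as equal.
print(unicodedata.normalize("NFC", nfd) == nfc)
```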

Regards,
Martin

From guido at python.org  Wed Oct  1 00:33:50 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 Sep 2008 15:33:50 -0700
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <48E2A650.4000108@v.loewis.de>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<gbr0nv$iqu$1@ger.gmane.org>
	<200809300202.38574.victor.stinner@haypocalc.com>
	<gbsgk6$kc1$1@ger.gmane.org>
	<ca471dc20809300659g608f8c14g29ba2b30def1be1f@mail.gmail.com>
	<gbtnjo$quh$1@ger.gmane.org>
	<ca471dc20809301045r59251402g3fe947dec3bc7f22@mail.gmail.com>
	<48E28C31.6060606@v.loewis.de>
	<ca471dc20809301434u6116391cje5778bcef5048cc9@mail.gmail.com>
	<48E2A650.4000108@v.loewis.de>
Message-ID: <ca471dc20809301533y477c19dx699255c14290ae97@mail.gmail.com>

On Tue, Sep 30, 2008 at 3:21 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>>> My concern still is that it brings the bytes type into the status of
>>> another character string type, which is really bad, and will require
>>> further modifications to Python for the lifetime of 3.x.
>>
>> I'd like to understand why this is "really bad". I thought it was by
>> design that the str and bytes types behave pretty similarly. You can
>> use both as dict keys.
>
> If they have to behave pretty similarly, they have to be supported in
> all APIs that deal with text.

I don't see how you get from "pretty similarly" to "all APIs". :-)

> For example, people will demand that
> printing bytes should just copy them onto the stream (rather than
> invoking repr()), and writing them onto a text stream should work the
> same way. GUI libraries should support them, the XML libraries, and so
> on.
>
> Where will you stop, and tell people that bytes are just not supposed
> to do this or that?

Printing a bytes object already works, and displays its repr(), which
is guaranteed to be pure ASCII (unlike the repr() of a unicode str
object in Py3k). All the others you mention will cause breakage as
they should -- these errors exist to force the programmer to think
about encodings or conversions. I don't see that as a big burden
because the only way there could be bytes here in the first place is
when the user explicitly requested bytes. A program that only ever
passes text strings to the os module is only ever going to get text
strings back.
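
To make the behaviour described above concrete, here is a minimal sketch for Python 3; the sample bytes are an arbitrary UTF-8 string chosen for the example.

```python
data = b"caf\xc3\xa9"   # UTF-8 bytes for "cafe" with an accented e

# str() of a bytes object is its repr(), which contains only ASCII.
shown = str(data)
print(shown)
print(all(ord(ch) < 128 for ch in shown))

# Mixing bytes and str fails loudly instead of guessing an encoding.
try:
    "name: " + data
except TypeError as exc:
    print("TypeError:", exc)

# The explicit route: choose an encoding and decode.
text = data.decode("utf-8")
print(ascii("name: " + text))
```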

>>> This is because applications will then regularly use byte strings for
>>> file names on Unix, and regular strings on Windows, and then expect
>>> the program to work the same without further modifications.
>>
>> It seems that bytes arguments actually *do* work on Windows -- somehow
>> they get decoded. (Unless Terry's report was from 2.x.)
>
> To a limited degree - see my other message. Don't try to listdir a
> directory with characters outside CP_ACP (it will give you invalid
> file names).

Understood.

>> Actually something like that may not be a bad idea. Ian Bicking's
>> webob supports similar double APIs for getting the request parameters
>> out of a request object; I believe request.GET['x'] is a text object
>> and request.GET_str['x'] is the corresponding uninterpreted bytes
>> sequence. I would prefer to have os.environb over os.environ[b"PATH"]
>> though.
>
> And would you keep them synchronized?

Yes, the bytes versions would be the canonical version and the str
version would wrap around that -- though updating the str version
would also update the bytes version. Some keys would be missing from
the str version (or perhaps they would raise exceptions or default to
some other error handler, like ignore or replace).
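
The synchronization scheme described above could look roughly like this. It is only a sketch: the class name TextEnvironView, the UTF-8 default, and the sample mapping are all assumptions made for illustration, not an existing API.

```python
from collections.abc import MutableMapping


class TextEnvironView(MutableMapping):
    """Hypothetical str view over a canonical bytes-to-bytes mapping."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw            # canonical bytes -> bytes mapping
        self._encoding = encoding

    def _decode(self, value):
        return value.decode(self._encoding)   # strict: may raise

    def __getitem__(self, key):
        return self._decode(self._raw[key.encode(self._encoding)])

    def __setitem__(self, key, value):
        # Writes go through to the canonical bytes mapping.
        self._raw[key.encode(self._encoding)] = value.encode(self._encoding)

    def __delitem__(self, key):
        del self._raw[key.encode(self._encoding)]

    def __iter__(self):
        # Keys that cannot be decoded are simply absent from the str view.
        for raw_key in self._raw:
            try:
                yield self._decode(raw_key)
            except UnicodeDecodeError:
                continue

    def __len__(self):
        return sum(1 for _ in self)


raw_env = {b"PATH": b"/usr/bin", b"BAD\x90": b"x"}
env = TextEnvironView(raw_env)
env["LANG"] = "C"          # updates raw_env as well
print(sorted(env))
print(raw_env[b"LANG"])
```

Here undecodable keys are skipped and undecodable values raise; choosing "ignore" or "replace" instead would be a one-line change in _decode.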

>> I assume at some point we can stop and have sufficiently low-level
>> interfaces that everyone can agree are in bytes only. Bytes aren't
>> going away. How does Java deal with this? Its File class doesn't seem
>> to deal in bytes at all. What would its listFiles() method do with
>> undecodable filenames?
>
> Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte
> sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will
> fail with FileNotFoundException.
>
> IOW, Java hasn't solved the problem in the last 10 years. Marcin
> Kowalczyk did a more thorough analysis about a year ago in
>
> http://mail.python.org/pipermail/python-3000/2007-September/010450.html

I can't say I like the Java solution. I would like to be able to write
a robust backup tool in Python, even if the code needed to make it
work everywhere isn't going to win any prizes (due to the need to use
bytes on Unix, str on Windows).
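
A sketch of the kind of platform split such a tool would need. The helper name native_listdir is invented for this example, and os.fsencode() is assumed to be available.

```python
import os


def native_listdir(path="."):
    """List a directory using the type that round-trips on this platform.

    Follows the split described in the thread: bytes on POSIX systems,
    str elsewhere (for example on Windows).
    """
    if os.name == "posix":
        return os.listdir(os.fsencode(path))   # list of bytes
    return os.listdir(path)                     # list of str


for entry in native_listdir("."):
    # ascii() keeps the output printable whatever the name contains.
    print(ascii(entry))
```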

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From foom at fuhm.net  Wed Oct  1 00:36:23 2008
From: foom at fuhm.net (James Y Knight)
Date: Tue, 30 Sep 2008 18:36:23 -0400
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <48E2A650.4000108@v.loewis.de>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbr0nv$iqu$1@ger.gmane.org>	<200809300202.38574.victor.stinner@haypocalc.com>	<gbsgk6$kc1$1@ger.gmane.org>	<ca471dc20809300659g608f8c14g29ba2b30def1be1f@mail.gmail.com>	<gbtnjo$quh$1@ger.gmane.org>	<ca471dc20809301045r59251402g3fe947dec3bc7f22@mail.gmail.com>	<48E28C31.6060606@v.loewis.de>
	<ca471dc20809301434u6116391cje5778bcef5048cc9@mail.gmail.com>
	<48E2A650.4000108@v.loewis.de>
Message-ID: <0DBCA888-43DA-4DE9-952F-A377E96B286D@fuhm.net>

On Sep 30, 2008, at 6:21 PM, Martin v. Löwis wrote:
> IOW, Java hasn't solved the problem in the last 10 years.

Java is already really bad at being a small little language to write  
cooperating tools in. I'd never even attempt to write a little  
pipeline filter in Java -- I've already pretty much learned to expect  
Java applications to be in their own world, so I'd hardly find it  
surprising if a Java app could only read files it wrote itself,  
never mind files in odd encodings.

Python, on the other hand, is an awesome tool for writing small little  
scripts that interact well with the surrounding environment, Just The  
Way It Is, without trying to layer so much abstraction upon it so that  
you lose functionality. Moving away from that would be unfortunate.

James

From victor.stinner at haypocalc.com  Wed Oct  1 01:11:10 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Wed, 1 Oct 2008 01:11:10 +0200
Subject: [Python-3000] Filename: unicode normalization
Message-ID: <200810010111.10956.victor.stinner@haypocalc.com>

Since it's hard to follow the filename thread on two mailing lists, I'm 
starting a new thread only on python-3000 about unicode normalization of the 
filenames.

Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC 
to create a file, you have to reuse NFC to open your file (and the same for 
NFD).

Python2 example to create files in the different forms:
>>> name=u'xäx'
>>> from unicodedata import normalize
>>> open(u'NFD-' + normalize('NFD', name), 'w').close()
>>> open(u'NFC-' + normalize('NFC', name), 'w').close()
>>> open(u'NFKC-' + normalize('NFKC', name), 'w').close()
>>> open(u'NFKD-' + normalize('NFKD', name), 'w').close()
>>> import os
>>> os.listdir('.')
['NFD-xa\xcc\x88x', 'NFC-x\xc3\xa4x', 'NFKC-x\xc3\xa4x', 'NFKD-xa\xcc\x88x']
>>> os.listdir(u'.')
[u'NFD-xa\u0308x', u'NFC-x\xe4x', u'NFKC-x\xe4x', u'NFKD-xa\u0308x']

Directory listing using Python3:
>>> import os
>>> [ name.encode('utf-8') for name in  os.listdir('.') ]
[b'NFD-xa\xcc\x88x', b'NFC-x\xc3\xa4x', b'NFKC-x\xc3\xa4x', 
b'NFKD-xa\xcc\x88x']
>>> os.listdir('.')
['NFD-xäx', 'NFC-xäx', 'NFKC-xäx', 'NFKD-xäx']

Same results, correct. Then try to open files:
>>> open(normalize('NFC', 'NFC-xäx')).close()
>>> open(normalize('NFD', 'NFC-xäx')).close()
IOError: [Errno 2] No such file or directory: 'NFC-xäx'
>>> open(normalize('NFD', 'NFD-xäx')).close()
>>> open(normalize('NFC', 'NFD-xäx')).close()
IOError: [Errno 2] No such file or directory: 'NFD-xäx'

If the user chooses a result from os.listdir(): no problem (if he has good 
eyes and he's able to find the difference between 'xäx' (NFD) and 'xäx' 
(NFC) :-D).

If the user enters the filename using the keyboard (on the command line or a 
GUI dialog), you have to hope that the keyboard input uses the same 
normalization form as the filename...
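
One way an application could cope with typed names is to compare normalized forms against the directory listing and then open whichever real name matched. The sketch below assumes NFC as the comparison form; the helper name and sample listing are made up for the example.

```python
import unicodedata


def find_by_normalized_name(typed, candidates):
    """Return the candidates that equal `typed` once both are NFC-normalized."""
    wanted = unicodedata.normalize("NFC", typed)
    return [c for c in candidates
            if unicodedata.normalize("NFC", c) == wanted]


# Names as os.listdir() might return them: one NFD, one NFC.
listing = ["NFD-xa\u0308x", "NFC-x\u00e4x"]

# The user types the NFD file's name, but the keyboard produces NFC.
typed = "NFD-x\u00e4x"
matches = find_by_normalized_name(typed, listing)
print([ascii(m) for m in matches])
```

The name to pass to open() is then the matching entry from the listing, not the string the user typed.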

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From guido at python.org  Wed Oct  1 01:23:01 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 30 Sep 2008 16:23:01 -0700
Subject: [Python-3000] Filename: unicode normalization
In-Reply-To: <200810010111.10956.victor.stinner@haypocalc.com>
References: <200810010111.10956.victor.stinner@haypocalc.com>
Message-ID: <ca471dc20809301623u34bd7b28q9cd127a06779f19b@mail.gmail.com>

Martin answered a similar question from Jack Jansen in another thread.
OSX doesn't normalize either. It's unlikely to confuse users in
practice.

On Tue, Sep 30, 2008 at 4:11 PM, Victor Stinner
<victor.stinner at haypocalc.com> wrote:
> Since it's hard to follow the filename thread on two mailing lists, I'm
> starting a new thread only on python-3000 about unicode normalization of the
> filenames.
>
> Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC
> to create a file, you have to reuse NFC to open your file (and the same for
> NFD).
>
> Python2 example to create files in the different forms:
>>>> name=u'xäx'
>>>> from unicodedata import normalize
>>>> open(u'NFD-' + normalize('NFD', name), 'w').close()
>>>> open(u'NFC-' + normalize('NFC', name), 'w').close()
>>>> open(u'NFKC-' + normalize('NFKC', name), 'w').close()
>>>> open(u'NFKD-' + normalize('NFKD', name), 'w').close()
>>>> import os
>>>> os.listdir('.')
> ['NFD-xa\xcc\x88x', 'NFC-x\xc3\xa4x', 'NFKC-x\xc3\xa4x', 'NFKD-xa\xcc\x88x']
>>>> os.listdir(u'.')
> [u'NFD-xa\u0308x', u'NFC-x\xe4x', u'NFKC-x\xe4x', u'NFKD-xa\u0308x']
>
> Directory listing using Python3:
>>>> import os
>>>> [ name.encode('utf-8') for name in  os.listdir('.') ]
> [b'NFD-xa\xcc\x88x', b'NFC-x\xc3\xa4x', b'NFKC-x\xc3\xa4x',
> b'NFKD-xa\xcc\x88x']
>>>> os.listdir('.')
> ['NFD-xäx', 'NFC-xäx', 'NFKC-xäx', 'NFKD-xäx']
>
> Same results, correct. Then try to open files:
>>>> open(normalize('NFC', 'NFC-xäx')).close()
>>>> open(normalize('NFD', 'NFC-xäx')).close()
> IOError: [Errno 2] No such file or directory: 'NFC-xäx'
>>>> open(normalize('NFD', 'NFD-xäx')).close()
>>>> open(normalize('NFC', 'NFD-xäx')).close()
> IOError: [Errno 2] No such file or directory: 'NFD-xäx'
>
> If the user chooses a result from os.listdir(): no problem (if he has good
> eyes and he's able to find the difference between 'xäx' (NFD) and 'xäx'
> (NFC) :-D).
>
> If the user enters the filename using the keyboard (on the command line or a
> GUI dialog), you have to hope that the keyboard input uses the same
> normalization form as the filename...
>
> --
> Victor Stinner aka haypo
> http://www.haypocalc.com/blog/



-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From victor.stinner at haypocalc.com  Wed Oct  1 02:17:33 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Wed, 1 Oct 2008 02:17:33 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <48E2A806.6020607@v.loewis.de>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<83758335-97EA-441B-A783-05F16EBE6D7A@fuhm.net>
	<48E2A806.6020607@v.loewis.de>
Message-ID: <200810010217.33570.victor.stinner@haypocalc.com>

On Wednesday 01 October 2008 00:28:22, Martin v. Löwis wrote:
> I don't think we will manage to release Python 3.0 this year if that
> change is to be implemented. And then, I don't think the release manager
> will agree to such a delay.

The minimum change is to disallow bytes/str mix:
 - os.listdir(unicode)->unicode and ignore invalid files
   (current behaviour is to return unicode and bytes)
 - os.readlink(unicode)->unicode or raise an error
   (current behaviour is to return unicode or bytes)
 - remove os.getcwdu() (use its code -which is better- for getcwd) 
   and fix the test_unicode_file.py

listdir() change (ignore invalid filenames) is important to avoid strange bugs 
in os.path.*(), glob.*() or on displaying a filename.

I can generate a specific patch for these issues. It's just a subset of my 
last patch.
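
The "ignore invalid filenames" rule proposed above for listdir() amounts to something like the following sketch. It works on a list of raw names rather than a real directory, and both the function name and the UTF-8 default are assumptions made for this example.

```python
def decodable_names(raw_names, encoding="utf-8"):
    """Decode bytes filenames, silently skipping any that fail to decode."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            continue   # the "ignore invalid files" behaviour
    return result


# Hypothetical raw directory listing; the second name is not valid UTF-8.
raw_listing = [b"notes.txt", b"caf\x90.txt", b"x\xc3\xa4x"]
print([ascii(n) for n in decodable_names(raw_listing)])
```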

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From foom at fuhm.net  Wed Oct  1 02:38:45 2008
From: foom at fuhm.net (James Y Knight)
Date: Tue, 30 Sep 2008 20:38:45 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E29F56.7060206@v.loewis.de>
References: <200809271404.25654.victor.stinner@haypocalc.com>	<48DE705E.6050405@v.loewis.de>	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>	<48DFF382.7020006@v.loewis.de>	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>	<87od26e3an.fsf@xemacs.org>	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>	<ca471dc20809300957o61b554e2n9101e0b1078b1647@mail.gmail.com>	<B11BFA0C-4238-4623-B040-73BC5358831F@fuhm.net>
	<48E29AB6.908@gmail.com> <48E29F56.7060206@v.loewis.de>
Message-ID: <A0153037-6D19-4DFE-A288-D6327ECDC365@fuhm.net>


On Sep 30, 2008, at 5:51 PM, Martin v. Löwis wrote:
> While I can sympathize with people having non-ASCII file names on  
> their
> disks, I can't sympathize with this example. Normal users just don't
> put \x90 into their command lines, and those who do deserve the error
> message they get.

That's just not true! One of the most common kinds of thing to put on a  
command line is a filename.

And you can't say that users wouldn't be able to type the odd  
byte sequences: tab completion and xargs will both allow input of those  
oddly-named files to the command line.

James

From greg.ewing at canterbury.ac.nz  Wed Oct  1 03:05:48 2008
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 01 Oct 2008 13:05:48 +1200
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
 2.6 or 3.0?
In-Reply-To: <6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DE705E.6050405@v.loewis.de>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
Message-ID: <48E2CCEC.9030709@canterbury.ac.nz>

James Y Knight wrote:

> Since from what I've tried, things seem to work, I'd really like to  
> know what precisely does fail from the opponents of utf-8b.

Seems like what will fail is taking one of these utf-8b
decoded names and passing it to some external library
that uses it as a filename without knowing that it has
to use utf-8b to encode it. Then the funny characters
won't be encoded the way they were originally, and it
won't compare equal to existing filenames that it should
be equal to.
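
The failure mode described above can be sketched with Python's "surrogateescape" error handler, used here as a stand-in for utf-8b because it works along similar lines (undecodable bytes become lone surrogates). The filename is hypothetical, and the handler is assumed to be available in the Python version at hand.

```python
raw = b"caf\x90"   # hypothetical filename bytes, not valid UTF-8

# Decode so that the stray byte survives as a lone surrogate (U+DC90).
name = raw.decode("utf-8", errors="surrogateescape")

# Code that knows the convention recovers the original bytes exactly.
print(name.encode("utf-8", errors="surrogateescape") == raw)

# Code that encodes with plain strict UTF-8 cannot encode the name at all.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# Code that encodes the surrogate as-is produces different bytes,
# so the result no longer matches the file on disk.
other = name.encode("utf-8", errors="surrogatepass")
print(other == raw)
```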

-- 
Greg

From rhamph at gmail.com  Wed Oct  1 04:22:08 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Tue, 30 Sep 2008 20:22:08 -0600
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<gbr0nv$iqu$1@ger.gmane.org>
	<200809300202.38574.victor.stinner@haypocalc.com>
	<48E1C097.8030309@v.loewis.de>
	<ca471dc20809300653m4e79dcd7y818b624f9ecd8f5e@mail.gmail.com>
	<48E2865A.3010404@v.loewis.de>
	<ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>
	<20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
Message-ID: <aac2c7cb0809301922o60b4cb09n6132dee3587ddb08@mail.gmail.com>

On Tue, Sep 30, 2008 at 8:06 PM,  <glyph at divmod.com> wrote:
> The proposal of using U+0000 seems like it would have been almost the same
> from such a wrapper's perspective, except (A) people using the filesystem
> APIs without the benefit of such a wrapper would have been even more
> screwed, and (B) there are a few nasty corner-cases when dealing with
> surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what
> it would have done with.

Surrogates in UTF-8 *should* be treated as errors, but current python
is far too lax.  That actually leads to another problem: improving
validation will change what gets escaped and what doesn't.

http://bugs.python.org/issue3297
http://bugs.python.org/issue3672
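
For reference, a sketch of what strict treatment of surrogates looks like, assuming a Python 3 release whose UTF-8 decoder rejects them. Under an escaping scheme, every byte sequence the decoder newly rejects is one more sequence that gets escaped.

```python
# UTF-8-style encoding of the lone surrogate U+D800.
encoded_surrogate = b"\xed\xa0\x80"

# A strict decoder treats this as an error.
try:
    encoded_surrogate.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# The bytes are only accepted when that is asked for explicitly.
text = encoded_surrogate.decode("utf-8", errors="surrogatepass")
print(ascii(text))
```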



-- 
Adam Olsen, aka Rhamphoryncus

From foom at fuhm.net  Wed Oct  1 05:32:04 2008
From: foom at fuhm.net (James Y Knight)
Date: Tue, 30 Sep 2008 23:32:04 -0400
Subject: [Python-3000] [Python-Dev] New proposition for Python3
	bytes	filename issue
In-Reply-To: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<gbr0nv$iqu$1@ger.gmane.org>
	<200809300202.38574.victor.stinner@haypocalc.com>
	<48E1C097.8030309@v.loewis.de>
	<ca471dc20809300653m4e79dcd7y818b624f9ecd8f5e@mail.gmail.com>
	<48E2865A.3010404@v.loewis.de>
	<ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>
	<20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
Message-ID: <22920D6A-8B70-4E6D-BE99-D7447D831B41@fuhm.net>


On Sep 30, 2008, at 10:06 PM, glyph at divmod.com wrote:
> However, Martin, I can promise you that I will _never_ ask for any  
> convenience functions related to bytes as a result of this  
> decision.  I want bytes to come back from filesystem APIs because I  
> intend to have a wrapper layer which knows two things about the  
> file: the bytes (which are needed to talk to POSIX filesystem APIs)  
> and the characters (which are computed from those bytes, can be  
> safely renormalized, displayed to users, etc).  On Windows this  
> filesystem wrapper will necessarily behave differently, and will not  
> use bytes for anything.  Any formatting beyond joining path segments  
> together and possibly splitting extensions off will be done on  
> character strings, not byte strings.

Can you clarify what proposal you are supporting for Python:

1) Two sets of APIs, one returning unicode strings, and one returning  
bytestrings. (subpoints: what does the unicode-returning API do when  
it cannot decode the bytestring into unicode? raise exception, pretend  
argument/envvar/file didn't exist/?)

or

2) All APIs return bytestrings only. Converting to unicode is  
considered lossy, and would have to be done by applications for  
display purposes only.

I really don't understand the reasoning for (1). It seems to me that  
most software (probably including all of the Python stdlib) would  
continue to use the unicode string API. Switching all of the Python  
stdlib to use the bytestring APIs instead would certainly be a large  
undertaking, and would have all sorts of ripple-on API changes (e.g.  
__file__). So I can only imagine that if you're proposing (1), you're  
doing so without the intention of suggesting that Python be converted  
to use it.

And so, of course, that doesn't really fix things (such as getcwd  
failing if your cwd is a path that is undecodable in the current  
locale, or well, currently, python refusing to even start).

If you're proposing (2), it's at least as large an undertaking as (1)  
+ converting Python to use the optional bytestring APIs. But at least  
it avoids exposing an API that people ought not use, and does make it  
obvious what still needs to be fixed: the unfixed code simply won't  
run at all.

> The proposal of using U+0000 seems like it would have been almost  
> the same from such a wrapper's perspective, except (A) people using  
> the filesystem APIs without the benefit of such a wrapper would have  
> been even more screwed

I'm not sure what your "more screwed" is comparing against: current  
py3k behavior? (aka: decoding to Unicode in locale's specified  
encoding)? I don't see how you can really be more screwed than that:  
not only can't you send your filename to display in a Gtk+ button, you  
can't access it at all, even staying within python.

> and (B) there are a few nasty corner-cases when dealing with  
> surrogate (i.e. invalid, in UTF-8) code points which I'm not quite  
> sure what it would have done with.

The lone-surrogate-pair proposal was a totally different proposal than  
the U+0000 one.

James

From tjreedy at udel.edu  Wed Oct  1 06:39:31 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Wed, 01 Oct 2008 00:39:31 -0400
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <48E28C31.6060606@v.loewis.de>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbr0nv$iqu$1@ger.gmane.org>	<200809300202.38574.victor.stinner@haypocalc.com>	<gbsgk6$kc1$1@ger.gmane.org>	<ca471dc20809300659g608f8c14g29ba2b30def1be1f@mail.gmail.com>	<gbtnjo$quh$1@ger.gmane.org>	<ca471dc20809301045r59251402g3fe947dec3bc7f22@mail.gmail.com>
	<48E28C31.6060606@v.loewis.de>
Message-ID: <gbuuu6$97h$1@ger.gmane.org>

Martin v. Löwis wrote:
> Guido van Rossum wrote:
>> However
>> the *proposed* behavior (returns bytes if the arg was bytes, and
>> returns str when the arg was str) is IMO sane, and no different than
>> the polymorphism found in len() or many builtin operations.
> 
> My concern still is that it brings the bytes type into the status of
> another character string type, which is really bad, and will require
> further modifications to Python for the lifetime of 3.x.

I am one of those who wanted bytes kept and bytearray added and once 
grumbled about strings becoming unicode.  Now that I am using 3.0 (and 
can imagine future use of non-ascii chars), I appreciate having just one 
string type and a separation between normal text and small-int arrays. 
So I find myself, somewhat surprisingly to me, sharing Martin's concern 
about regression toward having two text types again.

There once was a discussion about whether paths should be represented by 
strings or a separate path class (that would keep a tuple of strings for 
each component).  This was rejected, as I remember, both because of the 
complication/benefit ratio and the anticipation that having just one 
string type would make string representation easier.

Using just 3.0 strings seems not to be possible.  So a different 
argument for a path class would be to encapsulate the implementation, 
which could depend on the OS, and hide the complications from the user, 
who just wants open to work.

Terry Jan Reedy


From martin at v.loewis.de  Wed Oct  1 07:27:47 2008
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Wed, 01 Oct 2008 07:27:47 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3	bytes
 filename issue
In-Reply-To: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbr0nv$iqu$1@ger.gmane.org>	<200809300202.38574.victor.stinner@haypocalc.com>	<48E1C097.8030309@v.loewis.de>	<ca471dc20809300653m4e79dcd7y818b624f9ecd8f5e@mail.gmail.com>	<48E2865A.3010404@v.loewis.de>	<ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>
	<20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
Message-ID: <48E30A53.5040708@v.loewis.de>

> However, Martin, I can promise you that I will _never_ ask for any
> convenience functions related to bytes as a result of this decision.

:-)

Regards,
Martin

From martin at v.loewis.de  Wed Oct  1 08:56:15 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 01 Oct 2008 08:56:15 +0200
Subject: [Python-3000] Filename: unicode normalization
In-Reply-To: <200810010111.10956.victor.stinner@haypocalc.com>
References: <200810010111.10956.victor.stinner@haypocalc.com>
Message-ID: <48E31F0F.9080208@v.loewis.de>

> Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC 
> to create a file, you have to reuse NFC to open your file (and the same for 
> NFD).

That's not news to me. Of course it doesn't: Unix is completely agnostic of
encodings in file APIs. On the implementation level, it's just bytes.

Even Windows, which does have the notion that file names are character
strings, doesn't normalize.
(For OS X, I believe it's slightly more complicated, depending on what
API you use: the POSIX/BSD API probably lets everything through as-is,
whereas the higher-layer Objective-C based APIs do normalize, IIUC.)
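
A minimal sketch of why the two forms are distinct names at the byte
level, assuming Python 3 and the standard unicodedata module; the
filename is only an illustration:

```python
import unicodedata

# "\u00fc" is the precomposed u-umlaut (NFC form).
nfc_name = unicodedata.normalize("NFC", "\u00fcmlaut.txt")
# NFD splits it into "u" followed by the combining diaeresis U+0308.
nfd_name = unicodedata.normalize("NFD", nfc_name)

# The two strings render identically but differ as code points, and
# therefore as the bytes a POSIX file API would receive.
same_text = nfc_name == nfd_name
same_bytes = nfc_name.encode("utf-8") == nfd_name.encode("utf-8")

# Normalizing both sides to one form before comparing makes them equal.
same_after_normalizing = unicodedata.normalize("NFC", nfd_name) == nfc_name
```

An application that wants the two spellings treated as one name has to
normalize them itself before comparing.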

As Guido says: it's no problem.

Regards,
Martin

From victor.stinner at haypocalc.com  Wed Oct  1 10:43:25 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Wed, 1 Oct 2008 10:43:25 +0200
Subject: [Python-3000]
	=?utf-8?q?=5BPython-Dev=5D__New_proposition_for_Pyt?=
	=?utf-8?q?hon3_bytes=09filename_issue?=
In-Reply-To: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>
	<20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
Message-ID: <200810011043.25662.victor.stinner@haypocalc.com>

On Wednesday 01 October 2008 04:06:25, glyph at divmod.com wrote:
>     b = gtk.Button(u"\u0000/hello/world")
>
> which emits this message:
>     TypeError: OGtkButton.__init__() argument 1 must be string without
> null bytes or None, not unicode
>
> SQLite has a similar problem with NULLs, and I'm definitely sticking
> paths in there, too.

I think that you can say "all C libraries".

Would it be possible to convert the encoded string to bytes just before
calling Gtk? (The job would be done by some Python internals, not as an
explicit conversion.)

I don't know if it would help the discussion, but Java uses its own modified 
UTF-8 encoding:
 * NUL byte is encoded as 0xc0 0x80 instead of 0x00
 * Java doesn't support unicode > 0xFFFF (bouuuuh!)
http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8
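
A rough sketch of that encoding, assuming Python 3.1+ for the
"surrogatepass" error handler. The function name is hypothetical, and
code points above 0xFFFF are written as a surrogate pair of two 3-byte
sequences, following the format described at the URL above:

```python
def encode_modified_utf8(text):
    """Encode text so that the output never contains a 0x00 byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL becomes the overlong two-byte form instead of 0x00.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # Ordinary BMP characters use standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary characters: split into a surrogate pair and
            # encode each half as its own 3-byte sequence.
            cp -= 0x10000
            high = chr(0xD800 + (cp >> 10))
            low = chr(0xDC00 + (cp & 0x3FF))
            out += high.encode("utf-8", "surrogatepass")
            out += low.encode("utf-8", "surrogatepass")
    return bytes(out)


encoded = encode_modified_utf8("\u0000/hello/world")
```

The result can be handed to a C API that expects a NUL-terminated
string, but only if the receiving side knows to decode 0xc0 0x80 back
to U+0000.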

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From mal at egenix.com  Wed Oct  1 11:32:30 2008
From: mal at egenix.com (M.-A. Lemburg)
Date: Wed, 01 Oct 2008 11:32:30 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
 filename issue
In-Reply-To: <200810010954.47564.eckhardt@satorlaser.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com>
	<200810010954.47564.eckhardt@satorlaser.com>
Message-ID: <48E343AE.3080009@egenix.com>

On 2008-10-01 09:54, Ulrich Eckhardt wrote:
> On Tuesday 30 September 2008, M.-A. Lemburg wrote:
>> On 2008-09-30 08:00, Martin v. Löwis wrote:
>>>> Change the default file system encoding to store bytes in Unicode is
>>>> like introducing a new Python type: <fake Unicode for filename hacks>.
>>> Exactly. Seems like the best solution to me, despite your polemics.
>> Not a bad idea... have os.listdir() return Unicode subclasses that work
>> like file handles, ie. they have an extra buffer that holds the original
>> bytes value received from the underlying C API.
> 
> Why does it have to be a Unicode subclass? In my eyes, a Unicode object 
> promises a few things, in particular that it contains a Unicode string. If it 
> now suddenly contains bytes without any further meaning, that would be bad.

Please read my entire email. I was proposing to store the underlying
non-decodeable byte string value in such a subclass. The Unicode value
of the object would then be that underlying value decoded as e.g.
Latin-1 in order to be able to work on it as text.

Path operations would have to be made aware of such subclasses and
operate on the underlying bytes value.

However, like Guido mentioned, this only works if all components are
indeed aware of such subclasses... and that's likely to fail for
code outside the stdlib.
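
A minimal sketch of such a subclass, in Python 3 syntax. The class name
and its attributes are hypothetical, and the Latin-1 fallback is just
the example decoding mentioned above:

```python
class RawFilename(str):
    """Hypothetical str subclass that remembers the undecoded bytes."""

    def __new__(cls, raw, encoding="utf-8"):
        try:
            text = raw.decode(encoding)
            decodable = True
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a code point, so this cannot fail.
            text = raw.decode("latin-1")
            decodable = False
        self = super().__new__(cls, text)
        self.raw = raw              # original bytes from the C API
        self.decodable = decodable  # False if the fallback was used
        return self


name = RawFilename(b"caf\xe9.txt")  # Latin-1 bytes, invalid as UTF-8
backup = name + ".bak"              # ordinary str operation
```

Here backup is a plain str with no raw attribute, which illustrates the
failure mode: any code that is not aware of the subclass silently drops
the original bytes.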

-- 
Marc-Andre Lemburg
eGenix.com


From solipsis at pitrou.net  Wed Oct  1 12:26:20 2008
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 1 Oct 2008 10:26:20 +0000 (UTC)
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DE705E.6050405@v.loewis.de>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
Message-ID: <loom.20081001T101918-594@post.gmane.org>

Greg Ewing <greg.ewing <at> canterbury.ac.nz> writes:
> 
> Seems like what will fail is taking one of these utf-8b
> decoded names and passing it to some external library
> that uses it as a filename without knowing that it has
> to use utf-8b to encode it. Then the funny characters
> won't be encoded the way they were originally,

But those funny characters only appear for invalid filenames. Passing filenames
to a library will work for valid filenames. Sure, not all the problem is solved,
but the most important part of it (have all filenames work with Python's IO
functions) is.




From stephen at xemacs.org  Wed Oct  1 13:16:07 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 01 Oct 2008 20:16:07 +0900
Subject: [Python-3000] [Python-Dev] Filename as byte string in
	python	2.6 or 3.0?
In-Reply-To: <loom.20081001T101918-594@post.gmane.org>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DE705E.6050405@v.loewis.de>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org>
Message-ID: <871vz0pnuw.fsf@xemacs.org>

Antoine Pitrou writes:

 > But those funny characters only appear for invalid
 > filenames.

What makes you think the filenames are invalid?  The file*names* are
probably perfectly valid in the intended encoding; they are simply
invalid in the encoding that Python wants to apply.


From solipsis at pitrou.net  Wed Oct  1 13:15:34 2008
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 1 Oct 2008 11:15:34 +0000 (UTC)
Subject: [Python-3000]
	=?utf-8?q?=5BPython-Dev=5D_Filename_as_byte_string_?=
	=?utf-8?b?aW4JcHl0aG9uCTIuNiBvciAzLjA/?=
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DE705E.6050405@v.loewis.de>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org>
	<871vz0pnuw.fsf@xemacs.org>
Message-ID: <loom.20081001T111216-867@post.gmane.org>

Stephen J. Turnbull <stephen <at> xemacs.org> writes:
> 
> What makes you think the filenames are invalid?  The file*names* are
> probably perfectly valid in the intended encoding; they are simply
> invalid in the encoding that Python wants to apply.

Those filenames don't work today with Python 3; the problem is to make them
work. Whether they are valid or not in a hypothetical encoding is none of our
business, if it's not the encoding we are expecting.



From ncoghlan at gmail.com  Wed Oct  1 14:43:23 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 01 Oct 2008 22:43:23 +1000
Subject: [Python-3000] [Python-Dev] New proposition for Python3	bytes
 filename issue
In-Reply-To: <20081001051947.31635.1251804577.divmod.xquotient.807@weber.divmod.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbr0nv$iqu$1@ger.gmane.org>	<200809300202.38574.victor.stinner@haypocalc.com>	<48E1C097.8030309@v.loewis.de>	<ca471dc20809300653m4e79dcd7y818b624f9ecd8f5e@mail.gmail.com>	<48E2865A.3010404@v.loewis.de>	<ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>	<20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>	<22920D6A-8B70-4E6D-BE99-D7447D831B41@fuhm.net>
	<20081001051947.31635.1251804577.divmod.xquotient.807@weber.divmod.com>
Message-ID: <48E3706B.9060308@gmail.com>

glyph at divmod.com wrote:
> The reasoning is that a lot of software doesn't care if it's wrong for
> edge cases, it's really hard to come up with something that's correct
> with respect to all of those edge cases (absurdly difficult, if you need
> to stay in the straightjacket of string / bytes types, as well as
> provide a useful library interface - which is why we're having this
> discussion).  But, it should be _possible_ to write software that's
> correct in the face of those edge cases.

I just wanted to highlight this as something to keep in mind during this
discussion: we want to keep the easy things easy and make the difficult
things possible.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From turnbull at sk.tsukuba.ac.jp  Wed Oct  1 16:10:51 2008
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Wed, 01 Oct 2008 23:10:51 +0900
Subject: [Python-3000] [Python-Dev] Filename as byte
	string	in	python	2.6 or 3.0?
In-Reply-To: <loom.20081001T111216-867@post.gmane.org>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DE705E.6050405@v.loewis.de>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org>
	<871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org>
Message-ID: <87wsgso178.fsf@xemacs.org>

Antoine Pitrou writes:
 > Stephen J. Turnbull <stephen <at> xemacs.org> writes:
 > > 
 > > What makes you think the filenames are invalid?  The file*names* are
 > > probably perfectly valid in the intended encoding; they are simply
 > > invalid in the encoding that Python wants to apply.
 > 
 > Those filenames don't work today with Python 3, the problem is to
 > make them work.  Whether they are valid or not in a hypothetical
 > encoding is none of our business, if it's not the encoding we are
 > expecting.

It's usually not "hypothetical"; often, the user knows what it is.
Why not ask her?  That's what web browsers do, in effect, by providing
View as Charset commands.

The problem with the strategies that are being proposed is that this
is an application-level problem, not a Python-level problem.  Good web
browsers allow you to redisplay the document in a different encoding.
Python should make it possible to do the same, *if* the application
wants to.  It should also be possible for apps to do other things,
*if* they want to.  That means IMO that Python should limit itself to
caching the bytes (or equivalent hacky representation) somewhere that
apps that want to do something robust (including "ask the user" or
"automatically try a different guess" or "silently throw them away")
can find them.
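
As a sketch of what an application could do with cached bytes, assuming
Python 3 and the standard codecs; the helper name and the candidate
list are made up for illustration:

```python
def decode_with_guesses(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing worked; the caller decides what to do with the raw bytes.
    return None, None


# A filename written by a Shift_JIS application on a UTF-8 system.
raw_name = "文字化け.txt".encode("shift_jis")
text, used = decode_with_guesses(raw_name, ["utf-8", "shift_jis"])
```

The candidate list could come from the user, a config option, or a
heuristic; the point is only that the choice stays with the app as long
as the bytes are still available.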

Doing more than that is just asking for bug reports that can only be
closed as "wontfix" or "pebkac".

From solipsis at pitrou.net  Wed Oct  1 16:36:35 2008
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 1 Oct 2008 14:36:35 +0000 (UTC)
Subject: [Python-3000]
	=?utf-8?q?=5BPython-Dev=5D_Filename_as_byte=09strin?=
	=?utf-8?b?ZwlpbglweXRob24JMi42IG9yIDMuMD8=?=
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DE705E.6050405@v.loewis.de>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org>
	<871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org>
	<87wsgso178.fsf@xemacs.org>
Message-ID: <loom.20081001T142457-236@post.gmane.org>

Stephen J. Turnbull <turnbull <at> sk.tsukuba.ac.jp> writes:
> 
> It's usually not "hypothetical"; often, the user knows what it is.
> Why not ask her?  That's what web browsers do, in effect, by providing
> View as Charset commands.

The average user does not even /know/ what a charset is.
Web browsers provide lots of functions, not all of them are meant for average
users (for example they give access to a "Javascript console" and let people 
choose whether they accept TLS v1.0).

> The problem with the strategies that are being proposed is that this
> is an application-level problem, not a Python-level problem.

I don't understand why you think that. If a filename can't be exactly
represented with a valid Unicode sequence, all applications wanting to access
that file are impacted in the same way, and it is likely that the same solution
or workaround can be applied to all applications. This sounds very much like a
Python-level (or at least stdlib-level) problem to me.

> Good web
> browsers allow you to redisplay the document in a different encoding.

Are you suggesting that the solution to the filename problem is to prompt the
user and ask them for a different encoding?

Not only does this solution place a burden on the user, relying on them to give
technical information that they may not even understand (let alone be able to
retrieve), but it also places a burden on the application developer to code the
corresponding logic (prompt the user / provide an additional configure option /
have a separate path with manual encoding/decoding of filenames).

> Doing more than that is just asking for bug reports that can only be
> closed as "wontfix" or "pebkac".

There are always bug reports due to miscomprehension of an API or mismatching
expectations. I don't think "we want to avoid bug reports" is a good criterion.
What would be a good criterion is "we want to avoid legitimate dissatisfaction".




From guido at python.org  Wed Oct  1 16:53:29 2008
From: guido at python.org (Guido van Rossum)
Date: Wed, 1 Oct 2008 07:53:29 -0700
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <loom.20081001T142457-236@post.gmane.org>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org> <871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org> <87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
Message-ID: <ca471dc20810010753j7d14fc75ud0cc356ebf2e37e0@mail.gmail.com>

On Wed, Oct 1, 2008 at 7:36 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> The average user does not even /know/ what a charset is.

Except those users who need the feature. They certainly have no
trouble learning how to make the pages readable once someone explains
it to them.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com  Wed Oct  1 17:54:15 2008
From: janssen at parc.com (Bill Janssen)
Date: Wed, 1 Oct 2008 08:54:15 PDT
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <48E343AE.3080009@egenix.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com>
	<200810010954.47564.eckhardt@satorlaser.com>
	<48E343AE.3080009@egenix.com>
Message-ID: <74342.1222876455@parc.com>

M.-A. Lemburg <mal at egenix.com> wrote:
> On 2008-10-01 09:54, Ulrich Eckhardt wrote:
> > On Tuesday 30 September 2008, M.-A. Lemburg wrote:
> >> On 2008-09-30 08:00, Martin v. Löwis wrote:
> >>>> Change the default file system encoding to store bytes in Unicode is
> >>>> like introducing a new Python type: <fake Unicode for filename hacks>.
> >>> Exactly. Seems like the best solution to me, despite your polemics.
> >> Not a bad idea... have os.listdir() return Unicode subclasses that work
> >> like file handles, ie. they have an extra buffer that holds the original
> >> bytes value received from the underlying C API.
> > 
> > Why does it have to be a Unicode subclass? In my eyes, a Unicode object 
> > promises a few things, in particular that it contains a Unicode string. If it 
> > now suddenly contains bytes without any further meaning, that would be bad.
> 
> Please read my entire email. I was proposing to store the underlying
> non-decodeable byte string value in such a subclass. The Unicode value
> of the object would then be that underlying value decoded as e.g.
> Latin-1 in order to be able to work on it as text.

I'm actually sort of liking this idea.  A Pathname class, for convenience
a subtype of String, but containing the underlying binary representation 
used by the OS.  Even non-unicode pathnames could be represented.

Bill

From stephen at xemacs.org  Wed Oct  1 18:58:14 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 02 Oct 2008 01:58:14 +0900
Subject: [Python-3000] [Python-Dev] Filename as
	byte	strin	g	in	python	2.6 or 3.0?
In-Reply-To: <loom.20081001T142457-236@post.gmane.org>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DE705E.6050405@v.loewis.de>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org>
	<871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org>
	<87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
Message-ID: <87vdwcntg9.fsf@xemacs.org>

Antoine Pitrou writes:
 > Stephen J. Turnbull <turnbull <at> sk.tsukuba.ac.jp> writes:
 > > 
 > > It's usually not "hypothetical"; often, the user knows what it is.
 > > Why not ask her?  That's what web browsers do, in effect, by providing
 > > View as Charset commands.
 > 
 > The average user does not even /know/ what a charset is.

Where I live they do -- there's a reason why "mojibake" is one of the
few Japanese words to be borrowed into English rather than vice versa.

 > > The problem with the strategies that are being proposed is that this
 > > is an application-level problem, not a Python-level problem.
 > 
 > I don't understand why you think that. If a filename can't be
 > exactly represented with a valid Unicode sequence, all applications
 > wanting to access that file are impacted in the same way, and it is
 > likely that the same solution or workaround can be applied to all
 > applications.

That is not my experience in 10+ years of developing XEmacs/MULE.

There are many solutions/workarounds, but all of them are vulnerable
to the fundamental mismatch between the POSIX definition of a filename
(or string, for that matter) as a slightly restricted sequence of
octets, and the human being's insistence on interpreting that sequence
of octets as the encoded representation of a textual string.

True, some solutions are better than others, but there seems to be
none that dominates across the board.  Rather, each of the better ones
is appropriate for some subset of users and applications.

From janssen at parc.com  Wed Oct  1 19:14:00 2008
From: janssen at parc.com (Bill Janssen)
Date: Wed, 1 Oct 2008 10:14:00 PDT
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <20081001162006.31635.1753470290.divmod.xquotient.824@weber.divmod.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com>
	<200810010954.47564.eckhardt@satorlaser.com>
	<48E343AE.3080009@egenix.com> <74342.1222876455@parc.com>
	<20081001162006.31635.1753470290.divmod.xquotient.824@weber.divmod.com>
Message-ID: <75388.1222881240@parc.com>

glyph at divmod.com wrote:

> > I'm actually sort of liking this idea.  A Pathname class, for convenience
> > a subtype of String, but containing the underlying binary representation
> > used by the OS.  Even non-unicode pathnames could be represented.
> 
> On the one hand, I agree with you - except for the part where it's a
> subtype of String, that doesn't work.  In case I haven't mentioned it
> enough times already:
> 
> http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePath.html
> 
> On the other hand, we've all been on this merry-go-round before:
> 
>    http://www.python.org/dev/peps/pep-0355/
> 
> Note especially the rejection notice: "Subclassing from str is a
> particularly bad idea".

Yes, the only real justification for it is to not break existing code
(otherwise, calling str() is not that much of an ordeal).

> On the other hand, we've all been on this merry-go-round before:
> 
>    http://www.python.org/dev/peps/pep-0355/

The very existence of os.path seems a good argument that something like
this is useful.  Perhaps PEP 355 just went too far.

Bill

From foom at fuhm.net  Wed Oct  1 20:30:29 2008
From: foom at fuhm.net (James Y Knight)
Date: Wed, 1 Oct 2008 14:30:29 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <87od26e3an.fsf@xemacs.org>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DE705E.6050405@v.loewis.de>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
Message-ID: <2E304D87-CBC7-4D43-AAF4-93D08DF826D5@fuhm.net>

BTW, Windows will cheerfully let you create and access files with
"garbage surrogates" in their names.
Try it yourself:

import os
open(u"\ud8fd", 'w').close()
os.listdir(u'.')

IMO that pretty much blows out of the water any suggestion that encoding
invalid UTF-8 sequences into lone surrogates is an evil and broken
thing to do.

So, I'm back to favoring the lone surrogate plan over the U+0000 plan.  
But either one seems better than the alternatives.
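
A minimal sketch of the lone-surrogate round trip, assuming a Python
that has the "surrogateescape" error handler (it exists in the standard
library from 3.1 on, which postdates this thread); the filename bytes
are only an example:

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = name.encode("utf-8", "surrogateescape")

# A strict encoder, as an external library might use, rejects the
# lone surrogate instead of passing it through.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The last step is the case Adam is worried about below: the string works
for file access within Python but fails once it reaches code that
insists on well-formed Unicode.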

James

On Sep 29, 2008, at 11:11 PM, Stephen J. Turnbull wrote:

> James Y Knight writes:
>> On Sep 29, 2008, at 3:32 AM, Adam Olsen wrote:
>
>>> UTF-8b doesn't work as intended.  It produces an invalid unicode
>>> object (garbage surrogates) that cannot be used with external APIs  
>>> or
>>> libraries that require unicode.
>>
>> I'd be interested to hear more detail on what you expect the  
>> practical
>> ramifications of this to be. It doesn't sound likely to be a problem
>> to me.
>
> That's because you have a specific use case in mind.  Adam clearly has
> in mind passing the filename on to a library which might proceed to
> signal an error (to him, unexpected) on garbage surrogates.  He
> doesn't want to be surprised by that.


From martin at v.loewis.de  Wed Oct  1 21:08:50 2008
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Wed, 01 Oct 2008 21:08:50 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
 filename issue
In-Reply-To: <200810011043.25662.victor.stinner@haypocalc.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>	<20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
	<200810011043.25662.victor.stinner@haypocalc.com>
Message-ID: <48E3CAC2.6010203@v.loewis.de>

>> SQLite has a similar problem with NULLs, and I'm definitely sticking
>> paths in there, too.
> 
> I think that you can say "all C libraries".

Just for the sake of nit-picking: the socket library, and the regular
POSIX stream IO library (as well as C standard "unformatted" IO) deal
just fine with embedded NULL characters.

>  * Java doesn't support unicode > 0xFFFF (bouuuuh!)

I don't think that is true anymore.

Regards,
Martin

From guido at python.org  Wed Oct  1 22:29:39 2008
From: guido at python.org (Guido van Rossum)
Date: Wed, 1 Oct 2008 13:29:39 -0700
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E3CC12.1070207@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<2E304D87-CBC7-4D43-AAF4-93D08DF826D5@fuhm.net>
	<48E3CC12.1070207@g.nevcal.com>
Message-ID: <ca471dc20810011329i51c63c1bt82308e3bc0d5bc92@mail.gmail.com>

On Wed, Oct 1, 2008 at 12:14 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> The original byte string must be preserved for use in actually opening
> files.  How it is displayed is another question.  Doing something that
> works for both Unicode display and access to the file is basically
> impossible in all cases.  Providing an encapsulation of the byte string
> that has display methods, together with new methods to transform the
> file path, and use parts of it to create other file paths, is the
> solution I described earlier.  Using the display string (what existing
> programs are likely to do) for transformations instead of the new
> methods will work for files with Unicode file names, and break for
> others.  As long as the solution of new transformation methods is made
> available, there is a migration path for people that encounter
> problems.  I think handling files containing Unicode names properly and
> compatibly, together with a migration path for files not in Unicode is
> about the best that can be expected.

The low-level solution(s) we'll be making available in 3.0 should
enable you to implement this and many other higher-level approaches.
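
As an illustration of the low-level behavior, here is a small sketch
assuming a current Python 3, where os.listdir() returns bytes for a
bytes argument and str for a str argument. os.fsencode and
tempfile.TemporaryDirectory are later additions than 3.0, and the
filename is arbitrary:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so there is something to list.
    open(os.path.join(tmp, "hello.txt"), "w").close()

    as_text = os.listdir(tmp)                 # str in, str out
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes in, bytes out
```

A higher-level wrapper like the one Glenn describes could be built on
the bytes form, keeping the raw name for file access and deriving a
display string separately.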

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jcollins37 at carolina.rr.com  Wed Oct  1 17:56:44 2008
From: jcollins37 at carolina.rr.com (James E. Collins III)
Date: Wed, 1 Oct 2008 11:56:44 -0400 (Eastern Daylight Time)
Subject: [Python-3000] Automatic Reply: Sound (Python-3000 Digest, Vol 32,
	Issue 4)
Message-ID: <489713D2.000001.05260@JCOLLINS37-PCA>

Silence is one of hardest arguments to refute

Have a great day!

From Jack.Jansen at cwi.nl  Wed Oct  1 00:05:22 2008
From: Jack.Jansen at cwi.nl (Jack Jansen)
Date: Wed, 1 Oct 2008 00:05:22 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <48E29D3B.5030900@v.loewis.de>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbtq8t$3dl$1@ger.gmane.org>	<ca471dc20809301120y5149d346s31b0027b7bdd529e@mail.gmail.com>
	<gbtvd3$na4$1@ger.gmane.org> <48E29D3B.5030900@v.loewis.de>
Message-ID: <EF183EA0-B073-4883-9362-C8B6C8E470D3@cwi.nl>


On  30-Sep-2008, at 23:42 , Martin v. Löwis wrote:
> It's the other way 'round: On Windows, Unicode file names are the
> natural choice, and byte strings have limitations. In a sense, Windows
> got it right - but then, they started later. Unix missed the  
> opportunity
> of declaring that all file APIs are UTF-8 (except for Plan-9 and OS X,
> neither being "true" Unix).


How does Windows (and Python on Windows) handle NFC versus NFD issues?
Can I have two files called "ümlaut.txt", one in NFD and one NFC form?
And are both of those representable on the Python side (i.e. can they
both be returned from listdir() and passed to open())? If I compare
these two filenames, do they compare differently?
--
Jack Jansen, <Jack.Jansen at cwi.nl>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma  
Goldman



From Jack.Jansen at cwi.nl  Wed Oct  1 00:49:57 2008
From: Jack.Jansen at cwi.nl (Jack Jansen)
Date: Wed, 1 Oct 2008 00:49:57 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
	filename issue
In-Reply-To: <48E2A8E3.3070805@v.loewis.de>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<gbtq8t$3dl$1@ger.gmane.org>	<ca471dc20809301120y5149d346s31b0027b7bdd529e@mail.gmail.com>	<gbtvd3$na4$1@ger.gmane.org>
	<48E29D3B.5030900@v.loewis.de>
	<EF183EA0-B073-4883-9362-C8B6C8E470D3@cwi.nl>
	<48E2A8E3.3070805@v.loewis.de>
Message-ID: <82D029DA-C218-4631-A68E-CE3DBB03494A@cwi.nl>


On  1-Oct-2008, at 00:32 , Martin v. Löwis wrote:

>
>> How does windows (and Python on windows) handle NFC versus NFD  
>> issues?
>
> That's left to the application.
>
>> Can I have two files called "ümlaut.txt", one in NFD and one NFC  
>> form?
>
> Yes, you can. It sounds confusing, but only in a theoretical way. You
> never have combining characters on Windows (at least, I don't). The
> keyboard input defaults to NFC, and users normally don't type file
> names, anyways, except when creating the files - later, they just use
> the mouse to indicate what file they want to act on.
>
>> And are both of those representable on the Python side (i.e. can they
>> both be returned from listdir() and passed to open())?
>
> Certainly!
>
>> If I compare
>> these two filenames, do they compare differently?
>
> Certainly!

Actually, that all sounds pretty non-confusing to me:-)

So, normal users will always have the one form, and if by chance they  
get the other form they can still use the file. Also from Python, even  
when doing listdir() and then open(), everything will work just as  
expected. That there are two files that have a similar visual  
representation is not too bad, the same happens with ellipses versus  
dot-dot-dot and many other cases.

Which means the only problem area left is unix filesystems (whether on  
Linux or mounted remotely on MacOS or whatever), where filenames are  
really byte strings with only / and nul illegal.
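
The NFC/NFD point can be checked from Python with the standard
unicodedata module. This is only a sketch: the name below (starting with
U+00FC) is a stand-in chosen for illustration, and nothing here touches a
real filesystem.

```python
import unicodedata

# Two spellings of the same visible name: precomposed (NFC) and decomposed (NFD).
nfc_name = unicodedata.normalize("NFC", "\u00fcmlaut.txt")
nfd_name = unicodedata.normalize("NFD", "\u00fcmlaut.txt")

# The code point sequences differ, so a plain comparison reports two distinct names.
names_equal = nfc_name == nfd_name
lengths = (len(nfc_name), len(nfd_name))

# Normalizing both sides to the same form makes them compare equal again.
equal_after_normalizing = (
    unicodedata.normalize("NFC", nfd_name) == unicodedata.normalize("NFC", nfc_name)
)

print(names_equal, lengths, equal_after_normalizing)
```

An application that wants the two forms treated as one name would have to
normalize before comparing; the filesystem call itself still needs the
exact form that listdir() returned.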



--
Jack Jansen, <Jack.Jansen at cwi.nl>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma  
Goldman



From glyph at divmod.com  Wed Oct  1 04:06:25 2008
From: glyph at divmod.com (glyph at divmod.com)
Date: Wed, 01 Oct 2008 02:06:25 -0000
Subject: [Python-3000] [Python-Dev] New proposition for Python3
	bytes	filename issue
In-Reply-To: <ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<gbr0nv$iqu$1@ger.gmane.org>
	<200809300202.38574.victor.stinner@haypocalc.com>
	<48E1C097.8030309@v.loewis.de>
	<ca471dc20809300653m4e79dcd7y818b624f9ecd8f5e@mail.gmail.com>
	<48E2865A.3010404@v.loewis.de>
	<ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>
Message-ID: <20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>

On 30 Sep, 09:22 pm, guido at python.org wrote:
>On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <martin at v.loewis.de> 
>wrote:
>>Guido van Rossum wrote:
>>>On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" 
>>><martin at v.loewis.de> wrote:

>>>Martin, I don't understand why you are in favor of storing raw bytes
>>>encoded as Latin-1 in Unicode string objects, which clearly gives 
>>>rise
>>>to mojibake.

This is my word of the day, by the way.  Reading this whole thread was 
_totally_ worth it to learn about "mojibake".  Obviously I'm familiar 
with the phenomenon but somehow I'd never heard this awesome term 
before.
>I am also encouraged by Glyph's support for (a). He has a lot of
>practical experience.

Thanks for the vote of confidence.  I hope for all our sakes that you're 
not over-valuing that experience ;-).

For what it's worth, I can see MvL's point in that I think there is some 
danger in generating confusion by adding _too many_ string-like 
functions to the bytes type.  I don't want my suggestion to contribute 
to the confusion between bytes and text.

However, Martin, I can promise you that I will _never_ ask for any 
convenience functions related to bytes as a result of this decision.  I 
want bytes to come back from filesystem APIs because I intend to have a 
wrapper layer which knows two things about the file: the bytes (which 
are needed to talk to POSIX filesystem APIs) and the characters (which 
are computed from those bytes, can be safely renormalized, displayed to 
users, etc).  On Windows this filesystem wrapper will necessarily behave 
differently, and will not use bytes for anything.  Any formatting beyond 
joining path segments together and possibly splitting extensions off 
will be done on character strings, not byte strings.
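
A rough sketch of what such a wrapper layer might look like on POSIX. The
class name PathSketch and its methods are hypothetical, invented here for
illustration; this is not FilePath and not any stdlib API.

```python
import posixpath


class PathSketch:
    """Hypothetical wrapper: keeps the raw bytes and derives display text from them."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self.raw = raw            # the form handed back to POSIX filesystem calls
        self.encoding = encoding  # assumption: the caller knows the likely encoding

    @property
    def text(self) -> str:
        # Display-only form: undecodable bytes become U+FFFD, so this is lossy.
        return self.raw.decode(self.encoding, errors="replace")

    def child(self, segment: bytes) -> "PathSketch":
        # Joining is done on bytes so that no information is lost along the way.
        return PathSketch(posixpath.join(self.raw, segment), self.encoding)


p = PathSketch(b"/tmp").child(b"invalid\xff.txt")
print(p.raw)   # bytes suitable for passing to open()
print(p.text)  # text suitable for showing to a user
```

The design choice in this sketch is that the bytes are the only source of
truth; the text is recomputed on demand and never converted back.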

The proposal of using U+0000 seems like it would have been almost the 
same from such a wrapper's perspective, except (A) people using the 
filesystem APIs without the benefit of such a wrapper would have been 
even more screwed, and (B) there are a few nasty corner-cases when 
dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm 
not quite sure what it would have done with.

Guido already mentioned "libraries" as a hypothetical issue, but here's 
a real-world problem that results from putting NULLs into filenames. 
Consider this program:

    import gtk
    w = gtk.Window()
    b = gtk.Button(u"\u0000/hello/world")
    w.add(b)
    w.show_all()
    gtk.main()

which emits this message:
    TypeError: GtkButton.__init__() argument 1 must be string without null bytes or None, not unicode

SQLite has a similar problem with NULLs, and I'm definitely sticking 
paths in there, too.

Eventually I'd like to propose such a path type for inclusion in the 
stdlib, but that will have to wait for issues like 
<http://twistedmatrix.com/trac/ticket/2366> to be resolved.

From glyph at divmod.com  Wed Oct  1 07:19:47 2008
From: glyph at divmod.com (glyph at divmod.com)
Date: Wed, 01 Oct 2008 05:19:47 -0000
Subject: [Python-3000] [Python-Dev] New proposition for Python3
	bytes	filename issue
In-Reply-To: <22920D6A-8B70-4E6D-BE99-D7447D831B41@fuhm.net>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<gbr0nv$iqu$1@ger.gmane.org>
	<200809300202.38574.victor.stinner@haypocalc.com>
	<48E1C097.8030309@v.loewis.de>
	<ca471dc20809300653m4e79dcd7y818b624f9ecd8f5e@mail.gmail.com>
	<48E2865A.3010404@v.loewis.de>
	<ca471dc20809301422u1e797dacm8a19fd9b4e3e74e6@mail.gmail.com>
	<20081001020625.31635.800517030.divmod.xquotient.681@weber.divmod.com>
	<22920D6A-8B70-4E6D-BE99-D7447D831B41@fuhm.net>
Message-ID: <20081001051947.31635.1251804577.divmod.xquotient.807@weber.divmod.com>

On 03:32 am, foom at fuhm.net wrote:
>On Sep 30, 2008, at 10:06 PM, glyph at divmod.com wrote:

>Can you clarify what proposal you are supporting for Python:

Sure.  Neither of your descriptions is terribly accurate, but I'll try 
to explain.
>1) Two sets of APIs, one returning unicode strings, and one returning 
>bytestrings. (subpoints: what does the unicode-returning API do when 
>it cannot decode the bytestring into unicode? raise exception, pretend 
>argument/envvar/file didn't exist/?)

The only API discussed so far which would actually provide two variants 
is 'getcwd', which would have a 'getcwdb' that gives back bytes instead.

Pretty much every other API takes some kind of input.  listdir(bytes) 
would give back bytes, while listdir(text) would give back text. 
listdir(text) would skip undecodable filenames.

Similarly for all the other APIs in os and os.path that take pathnames 
for input.
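
The "type in, same type out" rule can be seen with os.listdir() in a
current Python 3. This sketch only shows the type mirroring on a
temporary directory; it does not demonstrate the skipping of undecodable
names discussed here, and os.fsencode() is a later addition to the os
module.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create one file with a plain ASCII name.
    open(os.path.join(d, "valid.txt"), "w").close()

    as_text = os.listdir(d)                 # str argument -> list of str
    as_bytes = os.listdir(os.fsencode(d))   # bytes argument -> list of bytes

print(as_text, as_bytes)
```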
>2) All APIs return bytestrings only. Converting to unicode is 
>considered lossy, and would have to be done by applications for 
>display purposes only.

This is a bad way to do things, because on Windows, filenames *really 
are* unicode.  Converting to bytes is what's lossy.  (See previous 
discussion of active codepages and CreateFileA/CreateFileW.)
>I really don't understand the reasoning for (1).

The reasoning is that a lot of software doesn't care if it's wrong for 
edge cases, it's really hard to come up with something that's correct 
with respect to all of those edge cases (absurdly difficult, if you need 
to stay in the straightjacket of string / bytes types, as well as 
provide a useful library interface - which is why we're having this 
discussion).  But, it should be _possible_ to write software that's 
correct in the face of those edge cases.

And - let's not forget this - the worlds of POSIX and Windows really are 
different and really do require subtly different inputs.  Python can try 
to paper over this like Java does and make it impossible to write 
certain classes of application, or it can just provide an ugly, slightly 
inconsistent API that exposes the ugly, slightly inconsistent reality. 
Modulo the issues you've raised which I don't think the proposal totally 
covers yet (abspath with a non-decodable cwd) I think it strikes a nice 
balance; allow people to live in the delusion of unicode-on-POSIX and 
have software that mostly works, most of the time, or allow them to face 
the unpleasantness and spend the effort to get something really solid.

I think the _right_ answer to all of this is to (A) make FilePath work 
completely correctly for every totally insane edge case ever, and (B) 
include it in the stdlib.  One day I think we'll do that.  But nobody 
has the time or energy to do even the first part of that *right now*, 
before 3.0 is released, so I'm just looking for something which it will 
be possible to build FilePath, or something like it, on top of, without 
breaking other people's applications who rely on the os module directly 
too badly.
>It seems to me that  most software (probably including all of the 
>Python stdlib) would  continue to use the unicode string API.

That's true.  And that software wouldn't handle these edge cases 
completely correctly.  As Guido put it, "it's a quality of 
implementation issue".
>Switching all of the Python  stdlib to use the bytestring APIs instead 
>would certainly be a large  undertaking, and would have all sorts of 
>ripple-on API changes (e.g.  __file__).

I am not quite sure what to do about __file__.  My preference would 
probably be to use unicode filename for consistency so it can always be 
displayed, but provide a second attribute (__open_file__?) that would be 
sometimes unicode, sometimes bytes, which would be guaranteed to work 
with open().  I suspect that most software which interacts with __file__ 
on a deep level would be of the variety which would deal with the edge 
cases.

But where the Python stdlib wants a pathname it should be accepting 
either bytes or unicode, as all of the os.path functions want.  This 
does kind of suck, but the alternatives are to encode crazy extra 
information in unicode path names that cannot be exchanged with other 
programs (or with users: NULL is potentially the worst bogus character 
from a UI perspective), or revert to bytes for everything (which is a 
non-solution, c.f. Windows above).
>So I can only imagine that if you're proposing (1), you're  doing so 
>without the intention of suggesting that Python be converted  to use 
>it.

Maybe updating the stdlib to be correct in the face of such changes is 
hard, but it doesn't seem intractable.  Taken together, it looks like 
there are only about 100 calls in the stdlib to both getcwd and abspath 
together, and I suspect many of them are for purely aesthetic purposes 
and could just be eliminated, and many of them are redefinitions of the 
functions and don't need any changes.

All the other path manipulation functions would continue to work as-is, 
although some of them might skip undecodable files.
>And so, of course, that doesn't really fix things (such as getcwd 
>failing if your cwd is a path that is undecodeable in the current 
>locale, or well, currently, python refusing to even start).

The proposal as I understand it so far doesn't address this 
specifically, so I'll try to.  os.getcwd, os.path.abspath, and 
os.path.realpath (when called with unicode) will probably need to do 
something gross if they're called on a non-decodable directory.  One 
thing that comes to mind is to create a temporary symbolic link and 
return u'/tmp/python-$YOURUID-undecodable/$GUID/something'.  I hope 
someone else has a better idea, especially since that sort of defeats 
the purpose of realpath.

On the other hand, even this strawman answer is correct for pretty much 
any sane purpose, and if you _really_ care, you need to learn that you 
have to use and ask for bytes, on POSIX, to deal with such corner cases.
>If you're proposing (2),  (...)

Luckily I'm not.
>>The proposal of using U+0000 seems like it would have been almost  the 
>>same from such a wrapper's perspective, except (A) people using  the 
>>filesystem APIs without the benefit of such a wrapper would have  been 
>>even more screwed
>
>I'm not sure what your "more screwed" is comparing against: current 
>py3k behavior? (aka: decoding to Unicode in locale's specified 
>encoding)? I don't see how you can really be more screwed than that: 
>not only can't you send your filename to display in a Gtk+ button, you 
>can't access it at all, even staying within python.

You're screwed if you're trying to access files in a portable way 
without worrying at all about encodings.  There are files you won't be 
able to access, there are conditions you won't be able to deal with. 
Sorry, but POSIX sucks and that's life.

You're _more_ screwed if you're trying to access those files in a 
portable way without worrying about encodings, and the API you're using 
is giving you back invalid, magic path names, with NULLs rather than 
being slightly lossy and dropping filenames you (obviously, by virtue of 
the way you requested those filenames) won't be able to deal with.

So I was talking here about the default behavior in the case of a naive 
program that wants to pretend all paths are unicode.
>>and (B) there are a few nasty corner-cases when dealing with 
>>surrogate (i.e. invalid, in UTF-8) code points which I'm not quite 
>>sure what it would have done with.
>
>The lone-surrogate-pair proposal was a totally different proposal than 
>the U+0000 one.

I wasn't referring to the lone-surrogate-pair encoding trick, I was 
referring to the fact that some people are going to want to treat 
surrogate pairs as encoding errors (i.e. include the NULL byte) and some 
will want to treat them as valid.  If you want them to be valid you have 
to normalize away the surrogates in order to talk to other software, but 
you can't do that because then you'll get different bytes when you re- 
encode them.

There's probably a way around that but it would be subtle and 
controversial no matter how you did it.

From eckhardt at satorlaser.com  Wed Oct  1 09:54:47 2008
From: eckhardt at satorlaser.com (Ulrich Eckhardt)
Date: Wed, 1 Oct 2008 09:54:47 +0200
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
 filename issue
In-Reply-To: <48E20017.3020405@egenix.com>
References: <200809291407.55291.victor.stinner@haypocalc.com> 
	<48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com>
Message-ID: <200810010954.47564.eckhardt@satorlaser.com>

On Tuesday 30 September 2008, M.-A. Lemburg wrote:
> On 2008-09-30 08:00, Martin v. Löwis wrote:
> >> Change the default file system encoding to store bytes in Unicode is
> >> like introducing a new Python type: <fake Unicode for filename hacks>.
> >
> > Exactly. Seems like the best solution to me, despite your polemics.
>
> Not a bad idea... have os.listdir() return Unicode subclasses that work
> like file handles, ie. they have an extra buffer that holds the original
> bytes value received from the underlying C API.

Why does it have to be a Unicode subclass? In my eyes, a Unicode object 
promises a few things, in particular that it contains a Unicode string. If it 
now suddenly contains bytes without any further meaning, that would be bad.


What I wonder is what the requirements on path handling are. I'll try to list 
the ones I can see:

1. A path received from the system should be preserved, so it can be given to 
the system later on. IOW, the internal representation should not lose any 
information compared to the one used by the OS.

2. Typical operations like joining two path segments or moving to the parent 
dir should be defined.

3. There must be a way to display the path to the user. IOW, there should be a 
way to turn the path into a string that the user can recognise, according to 
some encoding. Note that this is not always possible, so this can fail.

4. There must be a way to receive a path from the user. That means that there 
must be a way from a user-entered string to a path. Note that this, too, 
isn't always possible and can fail.

5. The conversion between a string and a path should be configurable, defaults 
retrieved from the system. This is so that most operations will just work and 
do the thing that the user expects.

6. There should be a way to modify the path data itself. This of course 
requires knowledge about the internals but gives full power to the 
programmer.


For requirement 3, I would say a lossy conversion to a string would be enough, 
i.e. try to convert the path to a Unicode string and use a question mark or 
some escaping to mark parts that can't be decoded. It will allow users to 
recognise the decodable parts of the path, with hopefully just a few 
characters left undecoded.

For requirement 4, a failure to encode a string to a path must result in a 
loud failure, i.e. an exception. This is because the user entered a path that 
we can't use; any guessing at what the user might have wanted is futile.
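
Requirements 3 and 4 might be sketched like this in Python 3. The two
function names and the default encodings are assumptions made for
illustration; note that errors="replace" marks undecodable bytes with
U+FFFD rather than a literal question mark.

```python
def path_to_display(raw: bytes, encoding: str = "utf-8") -> str:
    # Requirement 3: lossy conversion for display; this never raises.
    return raw.decode(encoding, errors="replace")


def user_text_to_path(text: str, encoding: str = "ascii") -> bytes:
    # Requirement 4: strict conversion; raises UnicodeEncodeError if the
    # text cannot be represented in the target encoding.
    return text.encode(encoding, errors="strict")


print(path_to_display(b"report\xff.txt"))

try:
    user_text_to_path("na\u00efve.txt")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```

Keeping the two directions as separate functions makes the asymmetry
explicit: display is allowed to lose information, input is not.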


Are there any points to add?

Uli

-- 
Sator Laser GmbH
Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



From glyph at divmod.com  Wed Oct  1 18:20:06 2008
From: glyph at divmod.com (glyph at divmod.com)
Date: Wed, 01 Oct 2008 16:20:06 -0000
Subject: [Python-3000] [Python-Dev] New proposition for Python3
	bytes	filename issue
In-Reply-To: <74342.1222876455@parc.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>
	<48E1C097.8030309@v.loewis.de> <48E20017.3020405@egenix.com>
	<200810010954.47564.eckhardt@satorlaser.com>
	<48E343AE.3080009@egenix.com> <74342.1222876455@parc.com>
Message-ID: <20081001162006.31635.1753470290.divmod.xquotient.824@weber.divmod.com>


On 03:54 pm, janssen at parc.com wrote:
>I'm actually sort of liking this idea.  A Pathname class, for 
>convenience
>a subtype of String, but containing the underlying binary 
>representation
>used by the OS.  Even non-unicode pathnames could be represented.

On the one hand, I agree with you - except for the part where it's a 
subtype of String, that doesn't work.  In case I haven't mentioned it 
enough times already:

http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePath.html

On the other hand, we've all been on this merry-go-round before:

    http://www.python.org/dev/peps/pep-0355/

Note especially the rejection notice: "Subclassing from str is a 
particularly bad idea".

Again, one day I'd really like to add one of these to Python.  Now is 
not the time.

From ncoghlan at gmail.com  Wed Oct  1 23:39:42 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 02 Oct 2008 07:39:42 +1000
Subject: [Python-3000] [Python-Dev] New proposition for Python3 bytes
 filename issue
In-Reply-To: <75388.1222881240@parc.com>
References: <200809291407.55291.victor.stinner@haypocalc.com>	<48E1C097.8030309@v.loewis.de>
	<48E20017.3020405@egenix.com>	<200810010954.47564.eckhardt@satorlaser.com>	<48E343AE.3080009@egenix.com>
	<74342.1222876455@parc.com>	<20081001162006.31635.1753470290.divmod.xquotient.824@weber.divmod.com>
	<75388.1222881240@parc.com>
Message-ID: <48E3EE1E.5000300@gmail.com>

Bill Janssen wrote:
> Perhaps PEP 355 just went too far.

That was certainly one of the major objections to it. A filesystem path
object which didn't try to combine a half-dozen different modules into
methods on a single object, but instead focused on solving a few
specific problems with using raw strings as file paths would have a far
greater chance of acceptance.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From foom at fuhm.net  Thu Oct  2 00:14:50 2008
From: foom at fuhm.net (James Y Knight)
Date: Wed, 1 Oct 2008 18:14:50 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E3C98A.1000906@nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>	<48DE705E.6050405@v.loewis.de>	<52dc1c820809281334t36086001ie7b87f618b949bdb@mail.gmail.com>	<48DFF382.7020006@v.loewis.de>	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>	<87od26e3an.fsf@xemacs.org>
	<2E304D87-CBC7-4D43-AAF4-93D08DF826D5@fuhm.net>
	<48E3C98A.1000906@nevcal.com>
Message-ID: <5F040550-868B-40EC-A80B-460EE701B1A1@fuhm.net>


On Oct 1, 2008, at 3:03 PM, Glenn Linderman wrote:

> On approximately 10/1/2008 11:30 AM, came the following characters  
> from the keyboard of James Y Knight:
>> BTW, Windows will cheerfully let you create and access files with  
>> "garbage surrogates" in it.
>> Try it yourself:
>>
>> open(u"\ud8fd", 'w').close()
>> os.listdir(u'.')
>
> But Windows doesn't have the problem of non-Unicode sequences  
> needing to be translated to something else in the first place.  So  
> this is mostly irrelevant to the problem at hand.


Well...either you consider lone surrogates as valid Unicode sequences,  
or else Windows *does* have the problem of non-Unicode sequences  
needing to be translated to something else.

Currently, the answer is that lone surrogates are treated as valid  
Unicode, and allowed into Python via the windows file APIs. Thus,  
filename strings in Python are going to have lone surrogates, anyways,  
on Windows.

Therefore, any external library which freaks out upon seeing a lone  
surrogate is already going to be broken for some filenames on Windows.  
So, it seems to me, converting invalid UTF-8 sequences into lone  
surrogates for Unix doesn't actually add any new form of brokenness.  
So why not just do that?

>> So, I'm back to favoring the lone surrogate plan over the U+0000  
>> plan. But either one seems better than the alternatives.
>
> The original byte string must be preserved for use in actually  
> opening files.

Or reversibly transformed.

> How it is displayed is another question.  Doing something that works  
> for both Unicode display and access to the file is basically  
> impossible in all cases.  Providing an encapsulation of the byte  
> string that has display methods, together with new methods to  
> transform the file path, and use parts of it to create other file  
> paths, is the solution I described earlier.

This sounds like a fine solution. And it would work just as well with  
a UTF-8b base API as with a dual string/byte string base API. The only  
difference is what the default behavior for people who don't use your  
new fancy API is. In the UTF-8b case, most things would work, even  
with invalidly-encoded filenames.
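
For readers on a later Python 3: the codec machinery there has an error
handler named "surrogateescape" that works along the lines of the UTF-8b
idea, mapping each undecodable byte to a lone surrogate and back. The
sketch below assumes such an interpreter; it postdates this thread and is
not what the 3.0 patch under discussion does.

```python
raw = b"invalid\xff.txt"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, which is the breakage that
# external libraries expecting valid Unicode would run into.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(text), round_tripped == raw, strict_ok)
```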

James

From rhamph at gmail.com  Thu Oct  2 00:41:32 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 1 Oct 2008 16:41:32 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <5F040550-868B-40EC-A80B-460EE701B1A1@fuhm.net>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48DFF382.7020006@v.loewis.de>
	<52dc1c820809281621l3beb260ahec22988a05e74327@mail.gmail.com>
	<96AAA50A-8C20-4320-A3C7-58B4C33D091D@fuhm.net>
	<aac2c7cb0809290032x336951e3pd430e464607c4fb0@mail.gmail.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<2E304D87-CBC7-4D43-AAF4-93D08DF826D5@fuhm.net>
	<48E3C98A.1000906@nevcal.com>
	<5F040550-868B-40EC-A80B-460EE701B1A1@fuhm.net>
Message-ID: <aac2c7cb0810011541v42255012r7df276430b30f99e@mail.gmail.com>

On Wed, Oct 1, 2008 at 4:14 PM, James Y Knight <foom at fuhm.net> wrote:
> On Oct 1, 2008, at 3:03 PM, Glenn Linderman wrote:
>> On approximately 10/1/2008 11:30 AM, came the following characters from
>> the keyboard of James Y Knight:
>>>
>>> BTW, Windows will cheerfully let you create and access files with
>>> "garbage surrogates" in it.
>>> Try it yourself:
>>>
>>> open(u"\ud8fd", 'w').close()
>>> os.listdir(u'.')
>>
>> But Windows doesn't have the problem of non-Unicode sequences needing to
>> be translated to something else in the first place.  So this is mostly
>> irrelevant to the problem at hand.
>
>
> Well...either you consider lone surrogates as valid Unicode sequences, or
> else Windows *does* have the problem of non-Unicode sequences needing to be
> translated to something else.
>
> Currently, the answer is that lone surrogates are treated as valid Unicode,
> and allowed into Python via the windows file APIs. Thus, filename strings in
> Python are going to have lone surrogates, anyways, on Windows.

We allow lone surrogates into our unicode objects, but they aren't
valid Unicode.  They'll fail for any APIs that expect only valid
Unicode.


> Therefore, any external library which freaks out upon seeing a lone
> surrogate is already going to be broken for some filenames on Windows. So,
> it seems to me, converting invalid UTF-8 sequences into lone surrogates for
> Unix doesn't actually add any new form of brokenness. So why not just do
> that?

I see it the opposite: lone surrogates on windows should be rejected
from unicode APIs, just as we want to do for invalid UTF-8 on linux.

But since the same rationale for having a "raw" API applies, maybe the
windows byte APIs should expose raw UTF-16, rather than letting it be
translated?


-- 
Adam Olsen, aka Rhamphoryncus

From victor.stinner at haypocalc.com  Thu Oct  2 13:50:49 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 2 Oct 2008 13:50:49 +0200
Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError
Message-ID: <200810021350.49292.victor.stinner@haypocalc.com>

This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. It's 
a *draft*, don't hesitate to comment on it. This document supposes that my 
patch to allow bytes filenames is accepted, which is not the case today.

While I was writing this document I found potential problems in Python3. So 
here is a TODO list (things to be checked):

FIXME: PyUnicode_DecodeFSDefaultAndSize(): errors="replace"!
FIXME: import.c uses ASCII if the default file system encoding is unknown,
       whereas other functions use UTF-8
FIXME: Write a function in Python3 to convert a bytes filename to a nice
       string
FIXME: When is bytearray accepted, and when is it not?
FIXME: Allow bytes/str mix for shutil.copy*()? Will the ignore callback get
       bytes or unicode?
FIXME: Use a shorter title for this PEP :-)

Can anyone write a section about encoding bytes in Unicode using escape 
sequences?

What is the best tool to work on a PEP? I hate email threads, and I would 
prefer SVN / Mercurial / anything else.
---

Title: Python3 and UnicodeDecodeError for the command line, 
       environment variables and filenames

Introduction
============

Python3 does its best to give you text as valid unicode character strings.
When it hits an invalid byte sequence (according to the charset in use), it
has two choices: drop the value or raise a UnicodeDecodeError.
This document presents the behaviour of Python3 for the command line,
environment variables and filenames.

Example of an invalid byte sequence: ::

    >>> str(b'\xff', 'utf8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...)

whereas the same byte sequence is valid in another charset like ISO-8859-1: ::

    >>> str(b'\xff', 'iso-8859-1')
    'ÿ'


Default encoding
================

Python uses "UTF-8" as the default Unicode encoding. You can read the default
charset using sys.getdefaultencoding(). The "default encoding" is used by
PyUnicode_FromStringAndSize().

A function sys.setdefaultencoding() exists, but it raises a ValueError for
any charset other than UTF-8 since the charset is hardcoded in
PyUnicode_FromStringAndSize().


Command line
============

Python creates a nice unicode list for sys.argv using mbstowcs(): ::

    $ ./python -c 'import sys; print(sys.argv)' 'Ho hé !'
    ['-c', 'Ho hé !']

On Linux, mbstowcs() uses the LC_CTYPE environment variable to choose the
encoding. On an invalid byte sequence, Python quits directly with exit
code 1. Example with a UTF-8 locale: ::

    $ python3.0 $(echo -e 'invalid:\xff')
    Could not convert argument 1 to string


Environment variables
=====================

Python uses "_wenviron" on Windows, which contains unicode (UTF-16-LE)
strings.  On other OSes, it uses the "environ" variable and the UTF-8
charset. It drops a variable if its key or value is not convertible to
unicode. Example: ::

    env -i HOME=/home/my PATH=$(echo -e "\xff") python
    >>> import os; list(os.environ.items())
    [('HOME', '/home/my')]

Both keys and values are unicode strings. An empty key and/or value is allowed.


Filenames
=========

Introduction
------------

Python2 used byte filenames everywhere, but it was also possible to use
unicode filenames. Examples:
 - os.getcwd() gives bytes whereas os.getcwdu() always returns unicode
 - os.listdir(unicode) creates bytes or unicode filenames (fallback to bytes
   on UnicodeDecodeError), os.readlink() has the same behaviour
 - glob.glob() converts the unicode pattern to bytes, and so creates bytes
   filenames
 - open() supports bytes and unicode

Since listdir() mixes bytes and unicode, you cannot easily manipulate
filenames: ::

    >>> path=u'.'
    >>> for name in os.listdir(path):
    ...  print repr(name)
    ...  print repr(os.path.join(path, name))
    ...
    u'valid'
    u'./valid'
    'invalid\xff'
    Traceback (most recent call last):
      ...
      File "/usr/lib/python2.5/posixpath.py", line 65, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff (...)

Python3 supports both types, bytes and unicode, but disallows mixing them. If
you ask for unicode, you will always get unicode or an exception is raised.

You should only use unicode filenames, unless you are writing a program that
fixes file system encodings, a backup tool, or your users are unable to fix
their broken system.

Windows
-------

Microsoft Windows since Windows 95 only uses Unicode (UTF-16-LE) filenames.
So you should only use unicode filenames.

Non Windows (POSIX)
-------------------

POSIX OSes like Linux use bytes for historical reasons. In the best case, all
filenames will be encoded as valid UTF-8 strings and Python creates valid
unicode strings. But since system calls use bytes, the file system may
return an invalid filename, or a program can create a file with an invalid
filename.

An invalid filename is a string which cannot be decoded to unicode using the
default file system encoding (which is UTF-8 most of the time).

A robust program has to use only the bytes type to make sure that it will be
able to open / copy / remove any file or directory.

Filename encoding
-----------------

Python uses:
 * "mbcs" on Windows
 * or "utf-8" on Mac OS X
 * or nl_langinfo(CODESET) on OS supporting this function
 * or UTF-8 by default

"mbcs" is not a valid charset name, it's an internal charset saying that
Python will use the function MultiByteToWideChar() to decode bytes to unicode.
This function uses the current codepage to decode byte strings.

You can read the charset using sys.getfilesystemencoding(). The function may
return None if Python is unable to determine the default encoding.

PyUnicode_DecodeFSDefaultAndSize() uses the default file system encoding, or
UTF-8 if it is not set.

On UNIX (and other operating systems), it's possible to mount different file
systems using different charsets. sys.getfilesystemencoding() will be the same
for all of these file systems, since this encoding is only used between Python
and the Linux kernel, not between the kernel and the file system, which may
use a different charset.

Display a filename
------------------

Example of a function formatting a filename to display it to human eyes: ::

    from sys import getfilesystemencoding
    def format_filename(filename):
        return str(filename, getfilesystemencoding(), 'replace')

Example: format_filename(b'r\xffport.doc') gives 'r\ufffdport.doc' (the invalid
byte is shown as the U+FFFD replacement character) with the UTF-8 encoding.
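
A runnable variant of the helper above, as a sketch only: the optional
`encoding` parameter and the pass-through for str arguments are assumptions
added here so that the result does not depend on the machine's locale.

```python
import sys

def format_filename(filename, encoding=None):
    """Return a printable str for a bytes or str filename (lossy, display only)."""
    if isinstance(filename, str):
        return filename
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    # 'replace' turns each undecodable byte into U+FFFD
    return str(filename, encoding, 'replace')

shown = format_filename(b'r\xffport.doc', 'utf-8')
print(ascii(shown))
```

The result is only fit for display: it cannot be converted back to the
original bytes, so keep the bytes name for any file operation.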

Functions producing filenames
-----------------------------

Policy: for unicode arguments: drop invalid bytes filenames;
for bytes arguments: return bytes
 - os.listdir()
 - glob.glob()

Policy: for a unicode argument: raise a UnicodeDecodeError on an invalid
filename; for a bytes argument: return bytes
 - os.readlink()

Policy: return a unicode directory name or raise a UnicodeDecodeError
 - os.getcwd()

Policy: always return bytes
 - os.getcwdb()
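
The "type in, type out" part of these policies can be sketched as follows.
This only demonstrates the return types; it does not exercise the invalid
filename cases, and os.fsencode() is a helper from later Python 3 releases
used here for convenience.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name
    with open(os.path.join(tmp, 'valid'), 'w'):
        pass

    str_names = os.listdir(tmp)                  # str argument -> str names
    bytes_names = os.listdir(os.fsencode(tmp))   # bytes argument -> bytes names

cwd_text = os.getcwd()     # str
cwd_bytes = os.getcwdb()   # bytes
```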

Functions for filename manipulation
-----------------------------------

Policy: raise TypeError on bytes/str mix
 - os.path.*(), e.g. os.path.join()
 - fnmatch.*()
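
A minimal illustration of the "no mixing" policy with os.path.join() (the
variable names are only for the example):

```python
import os.path

# Mixing str and bytes components is refused with a TypeError
try:
    os.path.join('dir', b'name')
except TypeError as exc:
    mix_error = exc
else:
    mix_error = None

# Homogeneous arguments work for both types
joined_text = os.path.join('dir', 'name')
joined_bytes = os.path.join(b'dir', b'name')
```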

Functions accessing files
-------------------------

Policy: accept both bytes and str
 - io.open()
 - os.open()
 - os.chdir()
 - os.stat(), os.lstat()
 - os.rename()
 - os.unlink()
 - shutil.*()

os.rename(), shutil.copy*() and shutil.move() allow bytes for one argument
and unicode for the other.
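
As a sketch of what "accept both bytes and str" means in practice, the same
file can be reached through either form of its name (os.fsencode() comes from
later Python 3 releases and is used here only to build the bytes name):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, 'data.txt')
    bytes_path = os.fsencode(text_path)

    with open(bytes_path, 'w') as f:    # bytes filename
        f.write('hello')
    with open(text_path) as f:          # str filename, same file
        content = f.read()

    size_via_bytes = os.stat(bytes_path).st_size
    size_via_text = os.stat(text_path).st_size

    os.unlink(bytes_path)
    still_there = os.path.exists(text_path)
```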

bytearray
---------

In most cases, bytearray() can be used as bytes for a filename.


Unicode normalisation
=====================

Unicode strings can be normalized in 4 forms: NFC, NFD, NFKC or NFKD.
Python never normalizes strings (or filenames). Most operating systems do not
normalize filenames either (Mac OS X is an exception). So users using
different forms may be unable to find their files. Don't panic! In practice,
users almost always use the same form.

Use unicodedata.normalize() to normalize a unicode string.
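
For example, the same visible character can be stored in two ways, and the
two strings only compare equal after normalization (a small sketch):

```python
import unicodedata

composed = '\u00e9'      # "e with acute" as a single code point (NFC form)
decomposed = 'e\u0301'   # "e" followed by a combining acute accent (NFD form)

same_without_normalizing = (composed == decomposed)
same_after_nfc = (unicodedata.normalize('NFC', decomposed) == composed)
same_after_nfd = (unicodedata.normalize('NFD', composed) == decomposed)
```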


From mal at egenix.com  Thu Oct  2 14:07:50 2008
From: mal at egenix.com (M.-A. Lemburg)
Date: Thu, 02 Oct 2008 14:07:50 +0200
Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError
In-Reply-To: <200810021350.49292.victor.stinner@haypocalc.com>
References: <200810021350.49292.victor.stinner@haypocalc.com>
Message-ID: <48E4B996.9030101@egenix.com>

On 2008-10-02 13:50, Victor Stinner wrote:
> This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. 

The PEP doesn't appear to address any potential changes. Wouldn't
it be better to add such information to the Python3 documentation
itself ?!

> It's 
> a *draft*, don't hesitate to comment it. This document suppose that my patch 
> to allow bytes filenames is accept which is not the case today.
> 
> While I was writing this document I found poential problems in Python3. So 
> here is a TODO list (things to be checked):
> 
> FIXME: PyUnicode_DecodeFSDefaultAndSize(): errors="replace"!
> FIXME: import.c uses ASCII if default file system is unknown, whereas other
>        functions uses UTF-8
> FIXME: Write a function in Python3 to convert a bytes filename to a nice
>        string
> FIXME: When bytearray is accepted or not?
> FIXME: Allow bytes/str mix for shutil.copy*()? The ignore callback will get
>        bytes or unicode?
> FIXME: Use a shorter title for this PEP :-)
> 
> Can anyone write a section about bytes encoding in Unicode using escape 
> sequence?
> 
> What is the best tool to work on a PEP? I hate email threads, and I would 
> prefer SVN / Mercurial / anything else.
> ---
> 
> Title: Python3 and UnicodeDecodeError for the command line, 
>        environment variables and filenames
> 
> Introduction
> ============
> 
> Python3 does its best to give you texts encoded as a valid unicode characters
> strings. When it hits an invalid bytes sequence (according to the used
> charset), it has two choices: drops the value or raises an UnicodeDecodeError.
> This document present the behaviour of Python3 for the command line,
> environment variables and filenames.
> 
> Example of an invalid bytes sequence: ::
> 
>     >>> str(b'\xff', 'utf8')
>     UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...)
> 
> whereas the same byte sequence is valid in another charset like ISO-8859-1: ::
> 
>     >>> str(b'\xff', 'iso-8859-1')
>     'ÿ'

You have left out all the options you have by using a different
error handling mechanism (using a third parameter to str()), e.g.
'replace', 'ignore', etc.
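
For reference, a short sketch of what that third parameter does with the
same kind of invalid input (the byte string is just an example value):

```python
raw = b'invalid\xff'   # 0xff can never appear in valid UTF-8

try:
    str(raw, 'utf-8')              # default error handler is 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = str(raw, 'utf-8', 'replace')   # bad byte becomes U+FFFD
ignored = str(raw, 'utf-8', 'ignore')     # bad byte is silently dropped
```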

> Default encoding
> ================
> 
> Python uses "UTF-8" as the default Unicode encoding. You can read the default
> charset using sys.getdefaultencoding(). The "default encoding" is used by
> PyUnicode_FromStringAndSize().
> 
> A function sys.setdefaultencoding() exists, but it raises a ValueError for
> charset different than UTF-8 since the charset is hardcoded in
> PyUnicode_FromStringAndSize().

Not only there: the C API makes various assumptions on the default
encoding as well. We should probably drop the term "default encoding"
altogether and replace it with "utf-8".

sys.setdefaultencoding() should probably be dropped altogether from
Python3.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 02 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611

From ncoghlan at gmail.com  Thu Oct  2 14:31:06 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 02 Oct 2008 22:31:06 +1000
Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError
In-Reply-To: <48E4B996.9030101@egenix.com>
References: <200810021350.49292.victor.stinner@haypocalc.com>
	<48E4B996.9030101@egenix.com>
Message-ID: <48E4BF0A.9040604@gmail.com>

M.-A. Lemburg wrote:
> On 2008-10-02 13:50, Victor Stinner wrote:
>> This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. 
> 
> The PEP doesn't appear to address any potential changes. Wouldn't
> it be better to add such information to the Python3 documentation
> itself ?!

True, a simple wiki page would probably be adequate - once we agree on
the details, it can be added to the main Python 3 docs.

Victor - the Python wiki is also one of the easiest places to work on
early PEP drafts. See
http://wiki.python.org/moin/PythonEnhancementProposals.

> Not only there: the C API makes various assumptions on the default
> encoding as well. We should probably drop the term "default encoding"
> altogether and replace it with "utf-8".
> 
> sys.setdefaultencoding() should probably be dropped altogether from
> Python3.

Isn't that method still there to allow other implementations to be more
permissive about allowing the default encoding to be changed?

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From victor.stinner at haypocalc.com  Thu Oct  2 14:35:48 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 2 Oct 2008 14:35:48 +0200
Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError
In-Reply-To: <48E4B996.9030101@egenix.com>
References: <200810021350.49292.victor.stinner@haypocalc.com>
	<48E4B996.9030101@egenix.com>
Message-ID: <200810021435.48955.victor.stinner@haypocalc.com>

On Thursday 02 October 2008 14:07:50, M.-A. Lemburg wrote:
> On 2008-10-02 13:50, Victor Stinner wrote:
> > This is a PEP (...)
>
> The PEP doesn't appear to address any potential changes. Wouldn't
> it be better to add such information to the Python3 documentation
> itself ?!

I don't know the right name for this document. Yes, it could move to Doc/ in
the Python3 source tree.

> > Example of an invalid bytes sequence: ::
> >     >>> str(b'\xff', 'utf8')
> >     UnicodeDecodeError
> >
> >     >>> str(b'\xff', 'iso-8859-1')
> >     'ÿ'
>
> You have left out all the options you have by using a different
> error handling mechanism (using a third parameter to str()), e.g.
> 'replace', 'ignore', etc.

Yes, I can explain why replace and ignore can *not* be used in this case. If
you use ignore or replace, filenames will be valid unicode strings, but you
will be unable to open / copy / remove your file.
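
A small sketch of the problem: once decoded with either handler, the name
no longer encodes back to the bytes the file system knows (the byte string
is an example value):

```python
original = b'invalid\xff'   # a filename as returned by the kernel

round_trips = {}
for handler in ('replace', 'ignore'):
    text = original.decode('utf-8', handler)
    back = text.encode('utf-8')
    # The re-encoded name differs from the real one, so open() would miss the file
    round_trips[handler] = (back == original)

print(round_trips)
```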

> > Default encoding
> > ================
> >
> > Python uses "UTF-8" as the default Unicode encoding. You can read the
> > default charset using sys.getdefaultencoding(). The "default encoding" is
> > used by PyUnicode_FromStringAndSize().
>
> Not only there: the C API makes various assumptions on the default
> encoding as well. We should probably drop the term "default encoding"
> altogether and replace it with "utf-8".

The concept of "default encoding" is unclear in Python. Yes, we might remove 
sys.getdefaultencoding() and write that PyUnicode_FromStringAndSize() uses 
the UTF-8 charset.

> sys.setdefaultencoding() should probably be dropped altogether from
> Python3.

Yes.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From victor.stinner at haypocalc.com  Thu Oct  2 18:46:13 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 2 Oct 2008 18:46:13 +0200
Subject: [Python-3000] Issues about Python script encoding
Message-ID: <200810021846.13939.victor.stinner@haypocalc.com>

Python3 tracebacks have bugs that make debugging harder:

[Py3k] line number is wrong after encoding declaration
   http://bugs.python.org/issue2384

PyTraceBack_Print() doesn't respect # coding: xxx header
   http://bugs.python.org/issue3975

Both issues have a patch + testcase.

--

About the coding header, IDLE doesn't read the #coding: header. Here is a fix
(use tokenize.detect_encoding):
http://bugs.python.org/issue4008

And finally, two more patches for encoding detection in:
http://bugs.python.org/issue4016
 -> use tokenize.detect_encoding() in linecache (instead of duplicated,
    incomplete code (e.g. no UTF-8 BOM support) to detect the encoding)
 -> reuse codecs.BOM_UTF8 in tokenize
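
For readers who have not used it, a sketch of how tokenize.detect_encoding()
is called; the helper name source_encoding() is made up for this example:

```python
import codecs
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding that tokenize detects for some source code bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _first_lines = tokenize.detect_encoding(readline)
    return encoding

with_cookie = source_encoding(b'# -*- coding: latin-1 -*-\nx = 1\n')
with_bom = source_encoding(codecs.BOM_UTF8 + b'x = 1\n')
plain = source_encoding(b'x = 1\n')
```

It handles both the coding header and the UTF-8 BOM, which is the point of
reusing it instead of ad hoc detection code.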

That's all for today :)

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From victor.stinner at haypocalc.com  Thu Oct  2 19:25:27 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 2 Oct 2008 19:25:27 +0200
Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError
In-Reply-To: <48E4BF0A.9040604@gmail.com>
References: <200810021350.49292.victor.stinner@haypocalc.com>
	<48E4B996.9030101@egenix.com> <48E4BF0A.9040604@gmail.com>
Message-ID: <200810021925.27369.victor.stinner@haypocalc.com>

On Thursday 02 October 2008 14:31:06, you wrote:
> Victor - the Python wiki is also one of the easiest places to work on
> early PEP drafts. See
> http://wiki.python.org/moin/PythonEnhancementProposals.

Ok, I converted the document to the wiki syntax:
http://wiki.python.org/moin/Python3UnicodeDecodeError

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From martin at v.loewis.de  Thu Oct  2 22:32:43 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 02 Oct 2008 22:32:43 +0200
Subject: [Python-3000] Issues about Python script encoding
In-Reply-To: <200810021846.13939.victor.stinner@haypocalc.com>
References: <200810021846.13939.victor.stinner@haypocalc.com>
Message-ID: <48E52FEB.5020307@v.loewis.de>

> About the coding header, IDLE doesn't read #coding: header. Here is a fix (use 
> tokenize.detect_encoding):
> http://bugs.python.org/issue4008

Are you really sure about that? It did in the past.

Regards,
Martin

From martin at v.loewis.de  Thu Oct  2 22:34:55 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 02 Oct 2008 22:34:55 +0200
Subject: [Python-3000] PEP: Python3 and UnicodeDecodeError
In-Reply-To: <48E4BF0A.9040604@gmail.com>
References: <200810021350.49292.victor.stinner@haypocalc.com>	<48E4B996.9030101@egenix.com>
	<48E4BF0A.9040604@gmail.com>
Message-ID: <48E5306F.2070903@v.loewis.de>

>> sys.setdefaultencoding() should probably be dropped altogether from
>> Python3.
> 
> Isn't that method still there to allow other implementations to be more
> permissive about allowing the default encoding to be changed?

That never was my understanding - although it's an interesting thought.

Is that opportunity actually used? I.e. is there a Python implementation
that does work correctly in the presence of setdefaultencoding? I find
that hard to believe.

Regards,
Martin

From victor.stinner at haypocalc.com  Thu Oct  2 23:54:06 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 2 Oct 2008 23:54:06 +0200
Subject: [Python-3000] Issues about Python script encoding
In-Reply-To: <48E52FEB.5020307@v.loewis.de>
References: <200810021846.13939.victor.stinner@haypocalc.com>
	<48E52FEB.5020307@v.loewis.de>
Message-ID: <200810022354.06928.victor.stinner@haypocalc.com>

On Thursday 02 October 2008 22:32:43, Martin v. Löwis wrote:
> > About the coding header, IDLE doesn't read #coding: header. Here is a fix
> > (use tokenize.detect_encoding):
> > http://bugs.python.org/issue4008
>
> Are you really sure about that? It did in the past.

Try IDLE in an ASCII terminal:
   python Tools/scripts/idle idle-3.0rc1-quits-when-run.py

(the .py file is attached to the issue).

IDLE uses open(filename, 'r') without setting the encoding. The io module is
not aware of the #coding: header.

The issue may be related to the terminal locale, since IDLE uses a "locale
encoding" (import IOBinding; IOBinding.encoding) which is marked
as "deprecated" in the IDLE source code.

(We should use the bug tracker to discuss this issue)

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From jcollins37 at carolina.rr.com  Fri Oct  3 00:50:14 2008
From: jcollins37 at carolina.rr.com (James E. Collins III)
Date: Thu, 2 Oct 2008 18:50:14 -0400 (Eastern Daylight Time)
Subject: [Python-3000] Automatic Reply: Sound (Python-3000 Digest, Vol 32,
	Issue 9)
Message-ID: <48E5500B.000001.05980@JCOLLINS37-PCA>

Silence is one of hardest arguments to refute 
 
Have a great day!

From jimjjewett at gmail.com  Fri Oct  3 19:35:31 2008
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 3 Oct 2008 13:35:31 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <loom.20081001T142457-236@post.gmane.org>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org> <871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org> <87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
Message-ID: <fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>

On Wed, Oct 1, 2008 at 10:36 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:

> The average user does not even /know/ what a charset is.

Because for the average user, there is no need.

Part of the HTML5 standard is how to guess at charsets, and when to
automatically use a superset instead of the declared encoding.  For
most of the US and Europe, the guesses are good enough.

For the languages and countries where multiple charsets are in common
use, and the guesses are often wrong, browser vendors say that the
change charset commands are well-known and frequently used.

> If a filename can't be exactly
> represented with a valid Unicode sequence, all
> applications wanting to access
> that file are impacted in the same way,

Not really.

Some utilities never really need to display the filename; they just
need to be able to manage the file.

Many applications need to display a file chooser, but may never need
to actually open problematic files, and may not need an accurate or
complete representation.  (Consider "Progra~1" on windows.)


> This sounds very much like a
> Python-level (or at least stdlib-level) problem to me.

The stdlib should provide a way of dealing with raw bytes.  Beyond
that, the needs get too specialized.  (And that way of dealing with
raw bytes *might* just be documenting the Latin-1 hack.)
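
Reading "the Latin-1 hack" as decoding raw bytes with ISO-8859-1 (my
interpretation of the term), a sketch of why it is lossless: every byte value
maps to the code point with the same number, so the round trip always works,
even though the text may be meaningless to a human.

```python
all_bytes = bytes(range(256))

as_text = all_bytes.decode('latin-1')
round_trip_ok = (as_text.encode('latin-1') == all_bytes)

# Byte value N always becomes code point U+00NN
code_points_match = all(ord(ch) == b for ch, b in zip(as_text, all_bytes))
```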

> Are you suggesting that the solution to the filename
> problem is to prompt the
> user and ask them for a different encoding?

For some applications, yes.

-jJ

From foom at fuhm.net  Fri Oct  3 21:53:27 2008
From: foom at fuhm.net (James Y Knight)
Date: Fri, 3 Oct 2008 15:53:27 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E67175.1030103@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>	<87od26e3an.fsf@xemacs.org>	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>	<48E2CCEC.9030709@canterbury.ac.nz>	<loom.20081001T101918-594@post.gmane.org>
	<871vz0pnuw.fsf@xemacs.org>	<loom.20081001T111216-867@post.gmane.org>
	<87wsgso178.fsf@xemacs.org>	<loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
Message-ID: <66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>

On Oct 3, 2008, at 3:24 PM, Glenn Linderman wrote:
> In order to work, the actual name must be preserved, or if  
> translated, must be a reversible, 1-to-1 translation.  A lot of  
> discussion here has talked about reversible translations, but  
> haven't noted the requirement that it be 1-to-1... and if the  
> translation produces something that looks like it could be a file  
> name, then the reverse translation is unlikely to be 1-to-1!   
> Somewhere, you need to add a flag that indicates whether or not a  
> reverse translation needs to be done, independently of the content  
> of the translated name.

That's not true. Both the U+0000 and UTF-8b proposals are 1-to-1  
transforms.

James

From qrczak at knm.org.pl  Fri Oct  3 23:23:48 2008
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 3 Oct 2008 23:23:48 +0200
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E68911.6090403@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<loom.20081001T101918-594@post.gmane.org> <871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org> <87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com>
Message-ID: <3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>

2008/10/3 Glenn Linderman <v+python at g.nevcal.com>:

> My understanding of the Posix file names is that any byte values are valid
> except "/" and null.  Is this a correct understanding?

Yes (well, names "." and ".." are reserved, and there might be length
restrictions).

> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream.  Call the original byte stream FOO.  The
> transformation then produces FOOTR, a set of Unicode code points.  Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8.  How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ?

In the unpaired surrogate scheme: there is no FOOTRUTF8 because UTF-8
can encode only Unicode scalar values (which exclude surrogates).
Python strings can contain surrogates (in 4-byte builds) or unpaired
surrogates which are malformed UTF-16 (in 2-byte builds); in the
filename context they can't be represented in UTF-8 so they must mean
escaped bytes.
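
Later Python 3 releases ship a 'surrogateescape' error handler that works
along these lines (it postdates this thread, so take this as an illustration
of the idea rather than of anything proposed here):

```python
raw_name = b'invalid\xff'

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN
escaped = raw_name.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly
restored = escaped.encode('utf-8', 'surrogateescape')
```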

In the U+0000 scheme: FOOTRUTF8 contains a 0 byte, so the filename
must mean FOO.

> but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place...

What do you mean by "not translating"? If a piece of software
validates filenames while they are represented by Unicode strings,
then they must have been somehow translated from byte strings (on
POSIX) or UTF-16-assumed-but-not-guaranteed strings (on Windows).

-- 
Marcin Kowalczyk
qrczak at knm.org.pl
http://qrnik.knm.org.pl/~qrczak/

From rhamph at gmail.com  Fri Oct  3 23:36:25 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Fri, 3 Oct 2008 15:36:25 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E68911.6090403@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<loom.20081001T101918-594@post.gmane.org> <871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org> <87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com>
Message-ID: <aac2c7cb0810031436y47505899neef9f8717c13e104@mail.gmail.com>

On Fri, Oct 3, 2008 at 3:05 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/3/2008 12:53 PM, came the following characters from the
> keyboard of James Y Knight:
>>
>> On Oct 3, 2008, at 3:24 PM, Glenn Linderman wrote:
>>>
>>> In order to work, the actual name must be preserved, or if translated,
>>> must be a reversible, 1-to-1 translation.  A lot of discussion here has
>>> talked about reversible translations, but haven't noted the requirement that
>>> it be 1-to-1... and if the translation produces something that looks like it
>>> could be a file name, then the reverse translation is unlikely to be 1-to-1!
>>>  Somewhere, you need to add a flag that indicates whether or not a reverse
>>> translation needs to be done, independently of the content of the translated
>>> name.
>>
>> That's not true. Both the U+0000 and UTF-8b proposals are 1-to-1
>> transforms.
>>
>> James
>
> My understanding of the Posix file names is that any byte values are valid
> except "/" and null.  Is this a correct understanding?
>
> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream.  Call the original byte stream FOO.  The
> transformation then produces FOOTR, a set of Unicode code points.  Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8.  How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ?  And remember that the user might provide a
> Unicode character stream identical to FOOTR: should it be translated to FOO
> or FOOTRUTF8 when creating a new file according to the user-supplied name?

UTF-8b produces an *invalid* unicode sequence, via lone surrogates.  Any
attempt to encode or decode using a validating UTF-8 (or
UTF-16/UTF-32) codec would reject them, which is why they can
unambiguously be used.

In other words, it's not unicode (despite a resemblance), so it's easy
to be 1-to-1.
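
A sketch of that rejection, assuming a recent CPython 3 (where the strict
UTF-8, UTF-16 and UTF-32 encoders all refuse surrogate code points):

```python
escaped = 'invalid\udcff'   # a string containing one lone surrogate

rejected_by = []
for codec in ('utf-8', 'utf-16', 'utf-32'):
    try:
        escaped.encode(codec)
    except UnicodeEncodeError:
        # The validating encoder refuses the lone surrogate
        rejected_by.append(codec)

print(rejected_by)
```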


> So the U+0000 transform may be 1-to-1 since it introduces null characters
> into the translated "file name", which are effectively producing names that
> are invalid according to the Posix file name standard ... but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place... deep
> analysis would have to be used to determine which problem is larger, or more
> significant.  I've certainly been "guilty" of writing software that assumes
> that there are no null characters in a file name.  I've even been "guilty"
> of writing software that assumes there are no space characters in a file
> name, although I've tried to break that habit in recent years...

Yup, U+0000 is unicode, but still can't be used with many external
APIs, as it's a transformation of the real file name.  The only real
advantage is you can store it in certain external formats, but
wouldn't you know it, XML isn't one of them[1].  Can you think of any
common formats where it would work?


[1] http://www.w3.org/International/questions/qa-controls

-- 
Adam Olsen, aka Rhamphoryncus

From rhamph at gmail.com  Sat Oct  4 01:54:06 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Fri, 3 Oct 2008 17:54:06 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E6A492.4090604@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<loom.20081001T111216-867@post.gmane.org> <87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com>
	<3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>
	<48E6A492.4090604@g.nevcal.com>
Message-ID: <aac2c7cb0810031654x3e3c51aeh2e2c742b27597727@mail.gmail.com>

On Fri, Oct 3, 2008 at 5:02 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/3/2008 2:36 PM, came the following characters from the
> keyboard of Adam Olsen:
>>
>> UTF-8b produces an *invalid* unicode sequence, via lone scalars.  Any
>> attempt to encode or decode using a validating UTF-8 (or
>> UTF-16/UTF-32) codec would reject them, which is why they can
>> unambiguously be used.
>>
>> In other words, it's not unicode (despite a resemblence), so it's easy
>> to be 1-to-1.
>
> Sort of.  There is no numerical reason they cannot be represented in a
> UTF-8-like numeric encoding scheme.  It is only rules and regulations that
> prevent it.  So FOOTRUTF8 can exist, just not legally.  If the expectation
> is that an illegal UTF-16 code can be used, to permit the UTF-8b translation
> scheme to work at all, then it seems reasonable to expect than an illegal
> translation of it to UTF-8 might happen also, which means that the
> transformation isn't 1-to-1!

No, UTF-8b can't be translated to UTF-8.  It's illegal.


> I think someone demonstrated the use of unpaired surrogates in the Windows
> filename context the other day.  Whether that is a bug or not, it is the
> current state of affairs, someone might read a name from Windows and want to
> create it on Posix... what happens?  If we implement UTF-8b, I know what
> would happen.  But what would happen if we don't, today, on a Posix Python
> 3?  Would it use FOOTRUTF8 or would it generate an error?  I don't suppose
> it matters a lot, it is stupidity to use such names whether or not the
> prevention of it is enforced.

If python worked properly?  The illegal unicode object would get an
encoding error when you tried to translate to UTF-8 to send it over to
the Posix box.  You'd have to alter all the software that touches it to
use your looks-like-but-isn't-quite-unicode, rather than using the
real unicode.

That's why I favour validating the Windows API too, and making the raw
API be the raw UTF-16 (rather than letting it get encoded into a
single-byte encoding).  The rawness is what bytes need, not ASCII
similarity.


> But if someone on Posix is creating non-Python software that uses illegal
> lone surrogates, illegally UTF-8 coding them to create the file, and then
> giving them to a Python program to manipulate the content, things could get
> confused, if UTF-8b translations happen under the Python covers... the
> Python program would attempt to open a different file than the non-Python
> software created.

No, they can't illegally use UTF-8.  It's not UTF-8, period.  It's just garbage.


> Seems like attempts to manipulate and transform names are doomed to failure;
> the approach of having a bytes level interface seems to be the correct one,
> glad that seems to be the approach that Victor is implementing and Guido is
> favoring, although it is a pity that it can't be fully encapsulated into an
> object in time for 3.0, leaving us with multiple APIs for file access, and a
> potential future translation to an encapsulated object approach.

The bytes object covers 90% of the raw usage.  The other 10% is a
lossy decoding to unicode.  I much prefer that to be explicit, so an
attribute may do... say b.decode('UTF-8', 'replace')?  Or do we need a
subtype of bytes, just to reduce that to 5-8 characters?
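
A minimal sketch of why that explicit decode is lossy, assuming Python 3
bytes/str semantics.  The two file names are made up for illustration;
they are Latin-1 encoded and therefore not valid UTF-8:

```python
# Two distinct byte-string file names (hypothetical, Latin-1 encoded).
name_a = b"caf\xe9.txt"
name_b = b"caf\xe8.txt"

# The explicit, lossy conversion mentioned above: each undecodable byte
# is replaced by U+FFFD, so the original byte value is not recoverable.
shown_a = name_a.decode("utf-8", "replace")
shown_b = name_b.decode("utf-8", "replace")

print(name_a == name_b)    # the raw names differ
print(shown_a == shown_b)  # the displayed names collide
```

The decoded strings are fine for display, but they should not be passed
back to the file system as if they were the original names.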


-- 
Adam Olsen, aka Rhamphoryncus

From rhamph at gmail.com  Sat Oct  4 08:57:36 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Sat, 4 Oct 2008 00:57:36 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E6ED99.2050406@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com>
	<3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>
	<48E6A492.4090604@g.nevcal.com>
	<aac2c7cb0810031654x3e3c51aeh2e2c742b27597727@mail.gmail.com>
	<48E6ED99.2050406@g.nevcal.com>
Message-ID: <aac2c7cb0810032357n66d452by2391be079179a48d@mail.gmail.com>

On Fri, Oct 3, 2008 at 10:14 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/3/2008 4:54 PM, came the following characters from the
> keyboard of Adam Olsen:
>> On Fri, Oct 3, 2008 at 5:02 PM, Glenn Linderman <v+python at g.nevcal.com>
>> wrote:
>
> OK, so UTF-8b is not Unicode, either.  It's just garbage.  You can't have it
> both ways.

I've always said UTF-8b wasn't valid.


>>> Seems like attempts to manipulate and transform names are doomed to
>>> failure;
>>> the approach of having a bytes level interface seems to be the correct
>>> one,
>>> glad that seems to be the approach that Victor is implementing and Guido
>>> is
>>> favoring, although it is a pity that it can't be fully encapsulated into
>>> an
>>> object in time for 3.0, leaving us with multiple APIs for file access,
>>> and a
>>> potential future translation to an encapsulated object approach.
>>>
>>
>> the bytes object covers 90% of the raw usage.  The other 10% is a
>> lossy encoding to unicode.  I much prefer that to be explicit, so an
>> attribute may do.. say b.decode('UTF-8', 'replace')?  Or do we need a
>> subtype of bytes, just to reduce that to 5-8 characters?
>>
>
> I don't understand what you mean here... Victor/Guido's plan results in:
>
> Alternative 1:  Windows only programs can use the Python Unicode file
> interfaces, Posix programs can take a chance, and also use them (one stab at
> semi-portability, if people don't need access to weirdly named files).

Windows programs using non-validating unicode APIs will be exposed to
random exceptions when they use a validating unicode API.  Better to
validate everything early, where you can expect the failures.

Posix programs SHOULD take a chance.  It's much easier to deal with
pure unicode, and some things can only be done that way (such as
getting file names from the user through a GUI).


> Alternative 2: Posix only programs can use the Python bytes file interfaces
> and get all the files, but can't necessarily display them, except in lossy
> Unicode or hex, or by pretending they are Latin-1, or whatever they want to
> do, but they can't assume UTF-8, unless it happens to work.  Windows
> programs can use the bytes interface (another stab at semi-portability), if
> people don't need access to files named using Unicode characters not in the
> program's current code page.

Can't display them, can't export them.  'tis fun!


> Alternative 3: Portable programs use the Unicode file interfaces on Windows,
> and the bytes file interfaces on Posix, and deal with the differences, as
> described for Windows only in alternative 1 and Posix only in alternative 2.
>
> Alternative 4: Someone implements an object that does alternative 3 under
> the covers, and everyone will wish Alternatives 1 & 2 didn't even exist.
>  The only reasons not to do this seem to be (a) Python 2.6 is already
> released and doesn't have it, (b) Python 3.0 would slip its schedule even
> more, (c) it's a significant chunk of code to implement and get right in a
> hurry.

Nope, not possible.  The closest we can do is "bytes with implicit
conversion to unicode", but (a) implicit conversion is much less
maintainable (zen, etc), (b) it STILL doesn't work.  You still can't
round-trip a bad file name through a unicode API.
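
A small sketch of the round-trip problem, assuming Python 3 and a UTF-8
file system encoding.  The helper name round_trips and the sample names
are hypothetical, chosen only for illustration:

```python
def round_trips(raw, errors="strict"):
    """Return True if raw survives bytes -> str -> bytes unchanged."""
    try:
        text = raw.decode("utf-8", errors)
    except UnicodeDecodeError:
        # A validating decode rejects the name outright.
        return False
    return text.encode("utf-8") == raw

good = b"report.dat"
bad = b"report-\xff.dat"   # \xff can never appear in valid UTF-8

print(round_trips(good))            # valid UTF-8 survives
print(round_trips(bad))             # strict decoding fails
print(round_trips(bad, "replace"))  # lossy decoding changes the bytes
```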

You have the file system and the user/libraries, and never the twain shall meet.


-- 
Adam Olsen, aka Rhamphoryncus

From brett at python.org  Sat Oct  4 20:03:54 2008
From: brett at python.org (Brett Cannon)
Date: Sat, 4 Oct 2008 11:03:54 -0700
Subject: [Python-3000] [Python-Dev] 3.1 focus (was Re: for __future__
	import planning)
In-Reply-To: <gc76uv$rr9$1@ger.gmane.org>
References: <1afaf6160810031426n21514e81ma213b084aff20648@mail.gmail.com>
	<3DDCFDD1-52DB-487D-AEB4-758CF868945D@python.org>
	<gc76uv$rr9$1@ger.gmane.org>
Message-ID: <bbaeab100810041103j7502018fmdcd2b575f81371d3@mail.gmail.com>

On Sat, Oct 4, 2008 at 12:45 AM, Georg Brandl <g.brandl at gmx.net> wrote:
> Barry Warsaw schrieb:
>> On Oct 3, 2008, at 5:26 PM, Benjamin Peterson wrote:
>>
>>> So now that we've released 2.6 and are working hard on shepherding 3.0
>>> out the door, it's time to worry about the next set of releases. :)
>>
>>> I propose that we dramatically shorten our release cycle for 2.7/3.1
>>> to roughly a year and put a strong focus on stabilizing all the new
>>> goodies we included in the last release(s). In the 3.x branch, we
>>> should continue to solidify the new code and features that were
>>> introduced. One of 2.7's main objectives should be binding 3.x and 2.x
>>> ever closer.
>>
>> There are several things that I would like to see us concentrate on
>> after the 3.0 release.  I agree that 3.1 should be primarily a
>> stabilizing release.  I suspect that we will find a lot of things that
>> need tweaking only after 3.0 final has been out there for a while.
>>
>> I think 2.7 should continue along the path of convergence toward 3.x.
>> The vision some of us talked about at Pycon was that at some point
>> down the line, maybe there's no difference between "python2.9 -3" and
>> "python3.3 -2".
>
> Especially 3.1 should also be a release where we focus as much on the
> community as on the code. There are many people out there for whom
> Python 3, as an incompatible language, is not an easy step to make,
> especially those with huge 2.x codebases on their hands. They have
> two problems: The libraries they depend on aren't ported, and the
> KLOC of code they care about are hard and tedious work to port, not
> to mention that it typically isn't viewed as productive work by those
> who pay them.
>
> We need to make 2to3 and related tools reliable and do more showcases
> of porting, like Martin did with Django, so that people have real-world
> examples at their disposal, by which they can estimate their own
> porting needs. (Waiting for the extended community to deliver such
> examples may be a mistake.)
>
> We also need to commit to help people with porting. I propose a new
> mailing list (e.g. python3-porting), parallel to python-list,
> specifically for people going that way. I think it will help to
> focus the community effort of getting Python 3 off the ground.
>

This is a good idea; python-help for porting.

> Last but not least, there should be a *central* location on python.org where
> specifically all resources on 2->3 transition are collected. Talks,
> documents, links, and some crucial information many people seem to miss,
> such as how long the 2.x series will at least be maintained. They depend
> on this.

That seems reasonable if someone gets around to doing it. =)

-Brett

From martin at v.loewis.de  Sat Oct  4 21:17:21 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 04 Oct 2008 21:17:21 +0200
Subject: [Python-3000] [Python-Dev] 3.1 focus (was Re: for __future__
	import planning)
In-Reply-To: <gc8bk8$uqb$1@ger.gmane.org>
References: <1afaf6160810031426n21514e81ma213b084aff20648@mail.gmail.com>	<3DDCFDD1-52DB-487D-AEB4-758CF868945D@python.org>	<gc76uv$rr9$1@ger.gmane.org>	<bbaeab100810041103j7502018fmdcd2b575f81371d3@mail.gmail.com>
	<gc8bk8$uqb$1@ger.gmane.org>
Message-ID: <48E7C141.8010903@v.loewis.de>

> Well, since for >95% of the (potential) Py3k users it is more important than
> e.g. the import rewrite in Python (no stab at you intended, Brett), it is
> something someone will have to get around to doing.
> 
> I'm not excusing myself; in fact, I'd be happy to work on this, but overall
> the team "Python 3 advocacy and support" should consist of more than one
> person.

I think this has time. I'm (now) confident that people will port to
Python 3 sooner rather than later, just because it's there. In fact,
we have to be careful not to talk too many people into porting, since
there will be some glitches which need to be resolved, and may not get
resolved before 3.2 or so. So people with a natural wariness are advised
to trust this wariness, or else all their concerns become
self-fulfilling prophecies.

Regards,
Martin

From brett at python.org  Sat Oct  4 21:36:17 2008
From: brett at python.org (Brett Cannon)
Date: Sat, 4 Oct 2008 12:36:17 -0700
Subject: [Python-3000] [Python-Dev] 3.1 focus (was Re: for __future__
	import planning)
In-Reply-To: <48E7C141.8010903@v.loewis.de>
References: <1afaf6160810031426n21514e81ma213b084aff20648@mail.gmail.com>
	<3DDCFDD1-52DB-487D-AEB4-758CF868945D@python.org>
	<gc76uv$rr9$1@ger.gmane.org>
	<bbaeab100810041103j7502018fmdcd2b575f81371d3@mail.gmail.com>
	<gc8bk8$uqb$1@ger.gmane.org> <48E7C141.8010903@v.loewis.de>
Message-ID: <bbaeab100810041236m63ddb1cbr2f053a238c52049d@mail.gmail.com>

[replying to both Georg and Martin]

On Sat, Oct 4, 2008 at 12:17 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> Well, since for >95% of the (potential) Py3k users it is more important than
>> e.g. the import rewrite in Python (no stab at you intended, Brett), it is
>> something someone will have to get around to doing.
>>

Don't worry, I realize my import work is approaching vaporware status
at this rate (still plugging away at it, though).

But you are right: helping people port to 3 will be the most important
thing we can help people with.

>> I'm not excusing myself; in fact, I'd be happy to work on this, but overall
>> the team "Python 3 advocacy and support" should consist of more than one
>> person.
>

I would definitely be willing to help.

So the mailing list is a good idea. Perhaps it should just be
python-porting so that it can also be used for people who have
problems with minor releases?

We could then have a /porting/ section to the site where we can
actually document after each release how to port to the newest
version.

And as for 2 -> 3 stuff, we should probably provide the expected steps to
port, tips for pure Python code (and how to write 2.6/3.0 compatible
code), extension modules, and make it clear what our overall plan is
(e.g. 3.2 probably being the truly stable release semantically).

> I think this has time. I'm (now) confident that people will port to
> Python 3 sooner rather than later, just because it's there. In fact,
> we have to be careful not to talk too many people into porting, since
> there will be some glitches which need to be resolved, and may not get
> resolved before 3.2 or so. So people with a natural wariness are advised
> to trust this wariness, or else all their concerns become
> self-fulfilling prophecies.

Yes, people should be warned that if they are not ready to make
changes after each Python release that are probably more than they are
used to between minor releases, they might want to hold off for 3.1 or 3.2.
But I don't want to be too discouraging as that might stifle any
forward momentum we might have and potentially leave 3 flat before it
even gets going.

-Brett

From facundobatista at gmail.com  Sun Oct  5 01:19:31 2008
From: facundobatista at gmail.com (Facundo Batista)
Date: Sat, 4 Oct 2008 20:19:31 -0300
Subject: [Python-3000] [Python-Dev] 3.1 focus (was Re: for __future__
	import planning)
In-Reply-To: <bbaeab100810041236m63ddb1cbr2f053a238c52049d@mail.gmail.com>
References: <1afaf6160810031426n21514e81ma213b084aff20648@mail.gmail.com>
	<3DDCFDD1-52DB-487D-AEB4-758CF868945D@python.org>
	<gc76uv$rr9$1@ger.gmane.org>
	<bbaeab100810041103j7502018fmdcd2b575f81371d3@mail.gmail.com>
	<gc8bk8$uqb$1@ger.gmane.org> <48E7C141.8010903@v.loewis.de>
	<bbaeab100810041236m63ddb1cbr2f053a238c52049d@mail.gmail.com>
Message-ID: <e04bdf310810041619i7d57f3ct646ea1e48e5c716b@mail.gmail.com>

2008/10/4 Brett Cannon <brett at python.org>:

> So the mailing list is a good idea. Perhaps it should just be
> python-porting so that it can also be used for people who have
> problems with minor releases?

+1. I'd try to help on that list, also.

-- 
.    Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/

From tjreedy at udel.edu  Mon Oct  6 01:11:10 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Sun, 05 Oct 2008 19:11:10 -0400
Subject: [Python-3000] A plus for naked unbound methods
Message-ID: <gcbhid$trn$1@ger.gmane.org>

I have seen a couple of objections to leaving unbound methods naked (as 
functions) when retrieved in 3.0.  Here is a plus.

A c.l.p poster reported that 2.6 broke his code because the addition of 
default rich comparisons to object turned tests like hasattr(ob, 
'__lt__') from False to True.  The obvious fix ob.__lt__ == 
object.__lt__ does not work because wrapping makes it always False, even 
when conceptually true.  In 3.0, that equality test works.  (I pointed 
him to 'object' in repr(ob.__lt__) as a workaround.  Others posted others.)
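
A minimal sketch of that check under Python 3, where retrieving a method
from a class returns the underlying object unwrapped.  The helper name
overrides and the two sample classes are hypothetical:

```python
def overrides(cls, name):
    """Return True if cls gets `name` from somewhere other than object."""
    # In Python 3 an inherited method looked up on the class is the very
    # same object as the one on `object`, so an identity test suffices.
    return getattr(cls, name) is not getattr(object, name)

class Plain(object):
    pass

class Ordered(object):
    def __lt__(self, other):
        return True

print(overrides(Plain, "__lt__"))
print(overrides(Ordered, "__lt__"))
```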

tjr


From wescpy at gmail.com  Mon Oct  6 04:14:07 2008
From: wescpy at gmail.com (wesley chun)
Date: Sun, 5 Oct 2008 19:14:07 -0700
Subject: [Python-3000] Problem with grammar for 'except'?
In-Reply-To: <ca471dc20809041236l3955a60blcf8046a38adfd928@mail.gmail.com>
References: <bbaeab100809032110i58bdbcefpa66d5536ef02c7dc@mail.gmail.com>
	<A38FA5A4111844B8B3CF639A04795311@RaymondLaptop1>
	<ca471dc20809041236l3955a60blcf8046a38adfd928@mail.gmail.com>
Message-ID: <78b3a9580810051914v7a8995bax5f0f12d2a7934ad0@mail.gmail.com>

On Thu, Sep 4, 2008 at 12:36 PM, Guido van Rossum <guido at python.org> wrote:
> On Wed, Sep 3, 2008 at 9:25 PM, Raymond Hettinger <python at rcn.com> wrote:
>> [Brett]
>>> I gave a talk last night at the Vancouver Python users group on
>>> 2.6/3.0, and I tried the following code and it failed during a live demo:
>>>
>>>  >>> try: pass
>>>  ... except Exception, Exception: pass
>>>   File "<stdin>", line 2
>>>     except Exception, Exception: pass
>>>                                ^
>>>  SyntaxError: invalid syntax
>>>
>>> Now from what I can tell from PEP 3110, that should be legal in 3.0.
>>> Am I reading the PEP correctly?
>>
>> Don't think so.
>> The parens are necessary for a tuple of exceptions
>> lest it be confused with the old "except E, v" syntax
>> which meant "except E as e".
>>
>> Maybe in 3.1, the paren requirement can be dropped.
>
> I would wait longer -- until well after the 2.x line is dead and
> buried. It will take some time for every Python user to train their
> Python fingers not to type "except E, v:" and we don't want people who
> are late in migrating inserting bugs like this in their first 3.x program.


it's probably a good idea to leave the paren requirement in there, but
i just reread the PEP myself, and it appears as though no parens is
actually supported, specifically: "except AttributeError, os.error:"
here:

http://www.python.org/dev/peps/pep-3110/#grammar-changes

also, and granted this is older info, Guido's 2006 talks seem to hint
at this as well:

- change except clause syntax to except E1, E2, E3 as err:
    - this avoids the bug in except E1, E2: # meant except (E1, E2)

from both of these:

ACCU - Apr 2006 (slide 11)
http://www.python.org/doc/essays/ppt/accu2006/Py3kACCU.ppt

Vancouver Python Workshop - Aug 2006 (slide 13)
http://www.vanpyz.org/conference/2006/proceedings/MarygX/Py3KVanPyz.ppt

while we can't change the past, we can/should at least update the PEP
as well as the current 2.6 and 3.0 docs to specifically state that the
parens are required (for now) *and* give an example usage.
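
A sketch of the kind of usage example meant here, assuming Python 3
syntax.  The function name classify is hypothetical; OSError is used
because os.error is an alias for it in Python 3:

```python
def classify(exc):
    """Raise exc and report which handler caught it."""
    try:
        raise exc
    except (AttributeError, OSError) as err:
        # Parenthesized tuple: catches either exception type.
        return "tuple handler: " + type(err).__name__
    except Exception as err:
        return "generic handler: " + type(err).__name__

print(classify(AttributeError("no such attribute")))
print(classify(ValueError("something else")))
```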

cheers,
-- wesley

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
"Python Web Development with Django", Addison Wesley, (c) 2008
http://withdjango.com

wesley.j.chun :: wescpy-at-gmail.com
python training and technical consulting
cyberweb.consulting : silicon valley, ca
http://cyberwebconsulting.com

From guido at python.org  Mon Oct  6 04:45:14 2008
From: guido at python.org (Guido van Rossum)
Date: Sun, 5 Oct 2008 19:45:14 -0700
Subject: [Python-3000] Problem with grammar for 'except'?
In-Reply-To: <78b3a9580810051914v7a8995bax5f0f12d2a7934ad0@mail.gmail.com>
References: <bbaeab100809032110i58bdbcefpa66d5536ef02c7dc@mail.gmail.com>
	<A38FA5A4111844B8B3CF639A04795311@RaymondLaptop1>
	<ca471dc20809041236l3955a60blcf8046a38adfd928@mail.gmail.com>
	<78b3a9580810051914v7a8995bax5f0f12d2a7934ad0@mail.gmail.com>
Message-ID: <ca471dc20810051945x4ca037adt477e71eceb0e34fa@mail.gmail.com>

Someone please fix the PEP. There are very good reasons for *not*
allowing "except X, Y:" to have a meaning -- if 2.x code somehow
accidentally ended up in the 3.0 world without having been run through
2to3, it would silently perturb the meaning in the most confusing way.
That's why the implementation got it right.

--Guido

On Sun, Oct 5, 2008 at 7:14 PM, wesley chun <wescpy at gmail.com> wrote:
> On Thu, Sep 4, 2008 at 12:36 PM, Guido van Rossum <guido at python.org> wrote:
>> On Wed, Sep 3, 2008 at 9:25 PM, Raymond Hettinger <python at rcn.com> wrote:
>>> [Brett]
>>>> I gave a talk last night at the Vancouver Python users group on
>>>> 2.6/3.0, and I tried the following code and it failed during a live demo:
>>>>
>>>>  >>> try: pass
>>>>  ... except Exception, Exception: pass
>>>>   File "<stdin>", line 2
>>>>     except Exception, Exception: pass
>>>>                                ^
>>>>  SyntaxError: invalid syntax
>>>>
>>>> Now from what I can tell from PEP 3110, that should be legal in 3.0.
>>>> Am I reading the PEP correctly?
>>>
>>> Don't think so.
>>> The parens are necessary for a tuple of exceptions
>>> lest it be confused with the old "except E, v" syntax
>>> which meant "except E as e".
>>>
>>> Maybe in 3.1, the paren requirement can be dropped.
>>
>> I would wait longer -- until well after the 2.x line is dead and
>> buried. It will take some time for every Python user to train their
>> Python fingers not to type "except E, v:" and we don't want people who
>> are late in migrating inserting bugs like this in their first 3.x program.
>
>
> it's probably a good idea to leave the paren requirement in there, but
> i just reread the PEP myself, and it appears as though no parens is
> actually supported, specifically: "except AttributeError, os.error:"
> here:
>
> http://www.python.org/dev/peps/pep-3110/#grammar-changes
>
> also, and granted this is older info, Guido's 2006 talks seem to hint
> at this as well:
>
> - change except clause syntax to except E1, E2, E3 as err:
>    - this avoids the bug in except E1, E2: # meant except (E1, E2)
>
> from both of these:
>
> ACCU - Apr 2006 (slide 11)
> http://www.python.org/doc/essays/ppt/accu2006/Py3kACCU.ppt
>
> Vancouver Python Workshop - Aug 2006 (slide 13)
> http://www.vanpyz.org/conference/2006/proceedings/MarygX/Py3KVanPyz.ppt
>
> while we can't change the past, we can/should at least update the PEP
> as well as the current 2.6 and 3.0 docs to specifically state that the
> parens are required (for now) *and* give an example usage.
>
> cheers,
> -- wesley
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> "Python Web Development with Django", Addison Wesley, (c) 2008
> http://withdjango.com
>
> wesley.j.chun :: wescpy-at-gmail.com
> python training and technical consulting
> cyberweb.consulting : silicon valley, ca
> http://cyberwebconsulting.com



-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From mrs at mythic-beasts.com  Mon Oct  6 18:50:34 2008
From: mrs at mythic-beasts.com (Mark Seaborn)
Date: Mon, 06 Oct 2008 17:50:34 +0100 (BST)
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <gcbhid$trn$1@ger.gmane.org>
References: <gcbhid$trn$1@ger.gmane.org>
Message-ID: <20081006.175034.343188282.mrs@localhost.localdomain>

Terry Reedy <tjreedy at udel.edu> wrote:

> I have seen a couple of objections to leaving unbound methods naked (as 
> functions) when retrieved in 3.0.  Here is a plus.
>
> A c.l.p poster reported that 2.6 broke his code because the addition of 
> default rich comparisons to object turned tests like hasattr(ob, 
> '__lt__') from False to True.

For the record, the post is:
http://mail.python.org/pipermail/python-list/2008-October/510540.html

> The obvious fix ob.__lt__ == object.__lt__ does not work because
> wrapping makes it always False, even when conceptually true.  In
> 3.0, that equality test works.  (I pointed him to 'object' in
> repr(ob.__lt__) as a workaround.  Others posted others.)

Assuming ob is an instance object, ob.__lt__ will give you a bound
method (taking 1 argument) which you would never expect to compare as
equal to object.__lt__ (taking 2 arguments).  So the presence or
absence of unbound methods makes no difference here.

Mark

From tjreedy at udel.edu  Mon Oct  6 20:19:56 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Mon, 06 Oct 2008 14:19:56 -0400
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <20081006.175034.343188282.mrs@localhost.localdomain>
References: <gcbhid$trn$1@ger.gmane.org>
	<20081006.175034.343188282.mrs@localhost.localdomain>
Message-ID: <gcdksc$uo8$1@ger.gmane.org>

Mark Seaborn wrote:
> Terry Reedy <tjreedy at udel.edu> wrote:
> 
>> I have seen a couple of objections to leaving unbound methods naked (as 
>> functions) when retrieved in 3.0.  Here is a plus.
>>
>> A c.l.p poster reported that 2.6 broke his code because the addition of 
>> default rich comparisons to object turned tests like hasattr(ob, 
>> '__lt__') from False to True.
> 
> For the record, the post is:
> http://mail.python.org/pipermail/python-list/2008-October/510540.html
> 
>> The obvious fix ob.__lt__ == object.__lt__ does not work because
>> wrapping makes it always False, even when conceptually true.  In
>> 3.0, that equality test works.  (I pointed him to 'object' in
>> repr(ob.__lt__) as a workaround.  Others posted others.)
> 
> Assuming ob is an instance object,

It was a class derived from object.  I should have made that clearer.


From mrs at mythic-beasts.com  Mon Oct  6 22:20:59 2008
From: mrs at mythic-beasts.com (Mark Seaborn)
Date: Mon, 06 Oct 2008 21:20:59 +0100 (BST)
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <gcdksc$uo8$1@ger.gmane.org>
References: <gcbhid$trn$1@ger.gmane.org>
	<20081006.175034.343188282.mrs@localhost.localdomain>
	<gcdksc$uo8$1@ger.gmane.org>
Message-ID: <20081006.212059.465784769.mrs@localhost.localdomain>

Terry Reedy <tjreedy at udel.edu> wrote:

> Mark Seaborn wrote:
> > Terry Reedy <tjreedy at udel.edu> wrote:
> > 
> >> I have seen a couple of objections to leaving unbound methods naked (as 
> >> functions) when retrieved in 3.0.  Here is a plus.
> >>
> >> A c.l.p poster reported that 2.6 broke his code because the addition of 
> >> default rich comparisons to object turned tests like hassattr(ob, 
> >> '__lt__') from False to True.
> > 
> > For the record, the post is:
> > http://mail.python.org/pipermail/python-list/2008-October/510540.html
> > 
> >> The obvious fix ob.__lt__ == object.__lt__ does not work because
> >> wrapping makes it always False, even when conceptually true.  In
> >> 3.0, that equality test works.  (I pointed him to 'object' in
> >> repr(ob.__lt__) as a workaround.  Others posted others.)
> > 
> > Assuming ob is an instance object,
> 
> It was a class derived from object.  I should have made that clearer.

It appears that unbound methods do what you want in the general case
in Python 2.5 and 2.6.  It's just that __lt__ behaves unlike normal
unbound methods.  So this isn't an argument against unbound methods,
it's an argument for __lt__ not to be a special case.

>>> class C(object):
...     def f(self): pass
...     def g(self): pass
... 
>>> class D(C):
...     def g(self): pass
... 
>>> C.f == D.f
True
>>> C.g == D.g
False
>>> C.__str__ == D.__str__
True
>>> C.__str__ == object.__str__
True

It is slightly odd that C.f and D.f compare as equal when they are not
equivalent.  It is not inconsistent with other cases where == returns
True on non-equivalent objects (such as dicts with equal content but
different identities), but it is odd for this to happen on a callable.
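
For comparison, a sketch of the same checks under Python 3, where
looking a method up on a class returns the plain function with no
wrapper.  The classes mirror the session above; the same_* names are
introduced only for illustration:

```python
class C(object):
    def f(self): pass
    def g(self): pass

class D(C):
    def g(self): pass

# Inherited and not overridden: both lookups find the one function in C.
same_f = C.f is D.f
# Overridden in D: two different function objects.
same_g = C.g is D.g
# Inherited from object: the same slot wrapper is returned.
same_str = C.__str__ is object.__str__

print(same_f, same_g, same_str)
```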

Mark

From barry at python.org  Tue Oct  7 02:47:57 2008
From: barry at python.org (Barry Warsaw)
Date: Mon, 6 Oct 2008 20:47:57 -0400
Subject: [Python-3000] Proposed Python 3.0 schedule
Message-ID: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>

So, we need to come up with a new release schedule for Python 3.0.  My
suggestion:

15-Oct-2008 3.0 beta 4
05-Nov-2008 3.0 rc 2
19-Nov-2008 3.0 rc 3
03-Dec-2008 3.0 final

Given what still needs to be done, is this a reasonable schedule?  Do
we need two more betas?

-Barry

From musiccomposition at gmail.com  Tue Oct  7 02:52:54 2008
From: musiccomposition at gmail.com (Benjamin Peterson)
Date: Mon, 6 Oct 2008 19:52:54 -0500
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
Message-ID: <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>

On Mon, Oct 6, 2008 at 7:47 PM, Barry Warsaw <barry at python.org> wrote:
> So, we need to come up with a new release schedule for Python 3.0.  My
> suggestion:
>
> 15-Oct-2008 3.0 beta 4
> 05-Nov-2008 3.0 rc 2
> 19-Nov-2008 3.0 rc 3
> 03-Dec-2008 3.0 final
>
> Given what still needs to be done, is this a reasonable schedule?  Do we
> need two more betas?

I'm not sure we do. Correct me if I'm wrong, but the "big ticket"
issue, bytes/unicode filepaths, has been resolved. And looking at the
tracker, I only see 18 release blockers.



-- 
Cheers,
Benjamin Peterson
"There's nothing quite as beautiful as an oboe... except a chicken
stuck in a vacuum cleaner."

From tjreedy at udel.edu  Tue Oct  7 03:08:29 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Mon, 06 Oct 2008 21:08:29 -0400
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <20081006.212059.465784769.mrs@localhost.localdomain>
References: <gcbhid$trn$1@ger.gmane.org>	<20081006.175034.343188282.mrs@localhost.localdomain>	<gcdksc$uo8$1@ger.gmane.org>
	<20081006.212059.465784769.mrs@localhost.localdomain>
Message-ID: <gcecqd$cr9$1@ger.gmane.org>

Mark Seaborn wrote:
> Terry Reedy <tjreedy at udel.edu> wrote:
> 
>> Mark Seaborn wrote:
>>> Terry Reedy <tjreedy at udel.edu> wrote:
>>>
>>>> I have seen a couple of objections to leaving unbound methods naked (as 
>>>> functions) when retrieved in 3.0.  Here is a plus.
>>>>
>>>> A c.l.p poster reported that 2.6 broke his code because the addition of 
>>>> default rich comparisons to object turned tests like hasattr(ob, 
>>>> '__lt__') from False to True.
>>> For the record, the post is:
>>> http://mail.python.org/pipermail/python-list/2008-October/510540.html
>>>
>>>> The obvious fix ob.__lt__ == object.__lt__ does not work because
>>>> wrapping makes it always False, even when conceptually true.  In
>>>> 3.0, that equality test works.  (I pointed him to 'object' in
>>>> repr(ob.__lt__) as a workaround.  Others posted others.)
>>> Assuming ob is an instance object,
>> It was a class derived from object.  I should have made that clearer.
> 
> It appears that unbound methods do what you want in the general case
> in Python 2.5 and 2.6.  It's just that __lt__ behaves unlike normal
> unbound methods.  So this isn't an argument against unbound methods,
> it's an argument for __lt__ not to be a special case.

It is not a special case.

 >>> def C(object): pass
...

 >>> C.__hash__ == object.__hash__
False

 >>> C.__str__ == object.__str__
False

I strongly suspect that the same is true of every method that a user 
class inherits from a builtin class.  Still, the clp OP is specifically 
interested in object as the base of his inheritance networks.

> >>> class C(object):
> ...     def f(self): pass
> ...     def g(self): pass
> ... 
> >>> class D(C):
> ...     def g(self): pass
> ... 
> >>> C.f == D.f
> True
> >>> C.g == D.g
> False

> It is slightly odd that C.f and D.f compare as equal when they are not
> equivalent.  It is not inconsistent with other cases where == returns
> True on non-equivalent objects (such as dicts with equal content but
> different identities), but it is odd for this to happen on a callable.

Interesting.  MethodWrapper must have an overriding equality method 
that compares im_func attributes for the specific case of comparing 
MethodWrappers. But that is not relevant to the specific need ;-).

So my point remains: leaving unbound methods unwrapped makes Python3 
work better for at least one real use case.

Terry Jan Reedy


From python at rcn.com  Tue Oct  7 03:48:18 2008
From: python at rcn.com (Raymond Hettinger)
Date: Mon, 6 Oct 2008 18:48:18 -0700
Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
Message-ID: <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>

[Barry Warsaw]
> So, we need to come up with a new release schedule for Python 3.0.  My  
> suggestion:
> 
> 15-Oct-2008 3.0 beta 4
> 05-Nov-2008 3.0 rc 2
> 19-Nov-2008 3.0 rc 3
> 03-Dec-2008 3.0 final
> 
> Given what still needs to be done, is this a reasonable schedule?  Do  
> we need two more betas?

Yes to both questions.

I'm seeing that people are just starting to download and play with 3.0.
I expect that we'll start getting more feedback on conversion issues,
the C API, screwy interactions with operating systems, bytes/text issues,
unanticipated interactions with other tools, etc.  Each user will stress
it in new ways and perhaps reveal a bunch of little integration issues
and documentation issues.  Those little fixups may go a long way toward
establishing a good first impression and reputation for 3.0 from the outset.


Raymond



From barry at python.org  Tue Oct  7 04:13:06 2008
From: barry at python.org (Barry Warsaw)
Date: Mon, 6 Oct 2008 22:13:06 -0400
Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule
In-Reply-To: <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
Message-ID: <67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Oct 6, 2008, at 9:48 PM, Raymond Hettinger wrote:

> [Barry Warsaw]
>> So, we need to come up with a new release schedule for Python 3.0.   
>> My  suggestion:
>> 15-Oct-2008 3.0 beta 4
>> 05-Nov-2008 3.0 rc 2
>> 19-Nov-2008 3.0 rc 3
>> 03-Dec-2008 3.0 final
>> Given what still needs to be done, is this a reasonable schedule?   
>> Do  we need two more betas?
>
> Yes to both questions.

I think that's contradictory :).  If we need two betas, then 05-Nov  
becomes beta 5, 19-Nov is rc 2.  If we don't need another rc then we  
can still do a final release on 03-Dec, otherwise we probably go 2  
weeks later.  I don't want to go much later than that though because  
then we get into the holiday season.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSOrFs3EjvBPtnXfVAQJceQP/QJN7oLM4nG+iXmgdb0NmKzOzaE3J89sQ
UWZnc/hp618QNH4JWC8v2bYApFu+iVg3pcv1Lnmhuql6mOuDhSuKKJVA5jTdR7U2
2enhAEY2DXtmav/29nn2Fy6PYcWJy9pE2xBsbBW8qXc6tYww0iEBsz9SU68jPzPk
x5LFC5NqmXo=
=Kyr4
-----END PGP SIGNATURE-----

From foom at fuhm.net  Tue Oct  7 05:22:09 2008
From: foom at fuhm.net (James Y Knight)
Date: Mon, 6 Oct 2008 23:22:09 -0400
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
Message-ID: <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>

On Oct 6, 2008, at 8:52 PM, Benjamin Peterson wrote:
> I'm not sure we do. Correct me if I'm wrong, but the "big ticket",
> issue bytes/unicode filepaths, has been resolved. And looking at the
> tracker, I only see 18 release blockers.


Well, if you mean that the resolution decided upon is to "simply"  
allow access to all system APIs using either byte or unicode strings,  
then it seems to me that there's a rather large amount of work left to  
do...

Here's some I found from a few minutes of futzing around with r66821  
of py3k on Linux.

  - Having os.getcwdb isn't much use when you can't even run python in  
the first place when the current directory has "bad" bytes in it.

Currently Python outputs:
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
ImportError: No module named encodings.utf_8
Aborted

  - I'd think "find . -type f -print0 | xargs -0 python -c 'pass'"  
ought to work (with files with "bad" bytes being returned by find),  
which means that Python shouldn't blow up and refuse to start when  
there's a non-properly-encoded argv ("Could not convert argument 1 to  
string" and exiting isn't appropriate behavior).

  - Of course, just being able to start the interpreter isn't quite  
enough: you'll want to be able to access that argument list too,  
somehow (add sys.argvb?).

  - And then, getopt and optparse modules should work on bytestring  
vectors, so that you can use sys.argvb without writing your own  
argument parser. They don't currently.

  - There's no os.environb for bytewise access to the environment.  
Seems important.

  - Isn't it a potential security issue that " 'WHATEVER' in  
os.environ" can return False if WHATEVER had some "bad" bytes in it,  
but spawning a subprocess actually will include WHATEVER in the  
subprocess's environment? Actually, even better: the behavior depends  
on whether you use subprocess.call('foo') or subprocess.call('foo',  
os.environ). The first passes through the "bad" environment variables,  
while the second does not. A bit surprising, perhaps.

  - Shouldn't this work?
   subprocess.call(b'/bin/echo')
Currently raises an exception:
AttributeError: 'int' object has no attribute 'rfind'

  - I suppose sys.path should handle bytestrings on the path, and  
should be populated using the bytes-version of os.environ so that  
PYTHONPATH gets read in properly. Which of course implies that all the  
importers need to handle byte filenames.

  - zipfile.ZipFile(b'whatever.zip') doesn't work.

  - zipfile decodes/encodes the filenames inside the zip file to  
unicode, so it can only handle correctly encoded filenames.

I'm sure there's even more APIs dealing with pathnames, command line  
arguments, or environment variables that ought to be able to handle  
both bytes and strings, that currently don't.
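
A minimal sketch of the bytes round-trip that the points above rely on,
assuming a later Python 3 (os.fsencode and tempfile.TemporaryDirectory
did not exist in r66821). The helper name roundtrip_bytes_name is
hypothetical, chosen only for illustration:

```python
import os
import tempfile


def roundtrip_bytes_name(raw_name):
    """Create a file under a bytes name, then find and reopen it via bytes APIs."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)                # bytes form of the directory path
        with open(os.path.join(tmp_b, raw_name), "wb") as f:
            f.write(b"data")
        listed = os.listdir(tmp_b)              # bytes in, bytes out: no decoding step
        with open(os.path.join(tmp_b, listed[0]), "rb") as f:
            return listed, f.read()
```

On a typical Linux filesystem, calling roundtrip_bytes_name(b"bad\xff.txt")
should hand the undecodable name back unchanged; other platforms may
refuse to create such a file at all.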

James

From rhamph at gmail.com  Tue Oct  7 07:18:48 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 6 Oct 2008 23:18:48 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48EA9B71.3060109@nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com>
	<3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>
	<48E6A492.4090604@g.nevcal.com>
	<aac2c7cb0810031654x3e3c51aeh2e2c742b27597727@mail.gmail.com>
	<48E6ED99.2050406@g.nevcal.com>
	<aac2c7cb0810032357n66d452by2391be079179a48d@mail.gmail.com>
	<48EA9B71.3060109@nevcal.com>
Message-ID: <aac2c7cb0810062218t5e661ddbwd02c8d973e8dbe57@mail.gmail.com>

On Mon, Oct 6, 2008 at 5:12 PM, Glenn Linderman <glenn at nevcal.com> wrote:
> On approximately 10/3/2008 11:57 PM, came the following characters from the
> keyboard of Adam Olsen:
>> On Fri, Oct 3, 2008 at 10:14 PM, Glenn Linderman <v+python at g.nevcal.com>
>> wrote:
>>> Alternative 3: Portable programs use the Unicode file interfaces on
>>> Windows,
>>> and the bytes file interfaces on Posix, and deal with the differences, as
>>> described for Windows only in alternative 1 and Posix only in alternative
>>> 2.
>>>
>>> Alternative 4: Someone implements an object that does alternative 3 under
>>> the covers, and every one will wish Alternative 1 & 2 didn't even exist.
>>>  The only reasons not to do this seem to be (a) Python 2.6 is already
>>> released and doesn't have it, (b) Python 3.0 would slip its schedule even
>>> more, (c) it's a significant chunk of code to implement and get right in
>>> a
>>> hurry.
>>>
>>
>> Nope, not possible.  The closest we can do is "bytes with implicit
>> conversion to unicode", but (a) implicit conversion is much less
>> maintainable (zen, etc), (b) it STILL doesn't work.  You still can't
>> round-trip a bad file name through a unicode API.
>>
>
> Not clear if you meant Alternative 3, 4 or both were not possible.
>
> The object would provide methods for manipulating the path names,
> particularly the ability to extract a path from one object and a file from
> another and combine them, somehow.  So programs wouldn't have to perform
> these sorts of manipulations themselves, so they wouldn't care if they are
> done on Posix and bytes and on Windows as Unicode.

But "Unicode" on windows is invalid.  It shares all the same problems
UTF-8b does, but worse as a correct UTF-16 codec would forbid
exporting it.  We'd need to invent a UTF-16b to save it, or simulate
one manually.
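
A small sketch of that problem, assuming a later Python 3 where the
"surrogatepass" error handler is available for the UTF-16 codecs. Treating
surrogatepass as a stand-in for a "UTF-16b" is an assumption made for
illustration, not something specified in this thread:

```python
lone = "\ud800"  # an unpaired high surrogate, as an invalid Windows name could contain


def strict_utf16_rejects(text):
    """Return True if the strict UTF-16 codec refuses to encode the text."""
    try:
        text.encode("utf-16-le")
    except UnicodeEncodeError:
        return True
    return False


# A lenient handler lets the invalid code unit through in both directions.
raw = lone.encode("utf-16-le", "surrogatepass")
restored = raw.decode("utf-16-le", "surrogatepass")
```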

If the binary APIs on windows emitted raw UTF-16 bytes then we merely
need to add a os.sepb equal to os.sep.encode('UTF-16') and you've got
your portable low-level API.  You don't need a path object.
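
A rough sketch of what such a separator could look like. os.sepb is only a
proposal here, so the helper sepb_for below is hypothetical. One detail worth
noting: the plain 'UTF-16' codec prepends a byte order mark, so the sketch
assumes an explicit byte order ('utf-16-le') instead:

```python
def sepb_for(platform_name, sep):
    """Hypothetical bytes separator: UTF-16 code units on Windows, ASCII elsewhere."""
    if platform_name == "win32":
        return sep.encode("utf-16-le")   # explicit byte order, so no BOM is added
    return sep.encode("ascii")


# The generic codec adds a BOM in the machine's native byte order.
with_bom = "\\".encode("utf-16")
```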


-- 
Adam Olsen, aka Rhamphoryncus

From rhamph at gmail.com  Tue Oct  7 08:22:59 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Tue, 7 Oct 2008 00:22:59 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48EAF263.5080006@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<48E68911.6090403@g.nevcal.com>
	<3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>
	<48E6A492.4090604@g.nevcal.com>
	<aac2c7cb0810031654x3e3c51aeh2e2c742b27597727@mail.gmail.com>
	<48E6ED99.2050406@g.nevcal.com>
	<aac2c7cb0810032357n66d452by2391be079179a48d@mail.gmail.com>
	<48EA9B71.3060109@nevcal.com>
	<aac2c7cb0810062218t5e661ddbwd02c8d973e8dbe57@mail.gmail.com>
	<48EAF263.5080006@g.nevcal.com>
Message-ID: <aac2c7cb0810062322v5af3c361o5a53af81824f799a@mail.gmail.com>

On Mon, Oct 6, 2008 at 11:23 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/6/2008 10:18 PM, came the following characters from the
> keyboard of Adam Olsen:
>> But "Unicode" on windows is invalid.  It shares all the same problems
>> UTF-8b does, but worse as a correct UTF-16 codec would forbid
>> exporting it.  We'd need to invent a UTF-16b to save it, or simulate
>> one manually.
>>
>> If the binary APIs on windows emitted raw UTF-16 bytes
>>
>> They do, for some definition of UTF-16, yes.
>>
>> then we merely
>> need to add a os.sepb equal to os.sep.encode('UTF-16') and you've got
>> your portable low-level API.  You don't need a path object.
>
> Except it isn't portable, because you can't do that on Posix.

The posix version should hardcode it as b'/'; I only meant windows to
use UTF-16.  You could perhaps use sys.getfilesystemencoding(), but
I'm unsure what it does if the encoding isn't an ascii superset (or
even if that can actually happen.)


-- 
Adam Olsen, aka Rhamphoryncus

From martin at v.loewis.de  Tue Oct  7 09:47:20 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 07 Oct 2008 09:47:20 +0200
Subject: [Python-3000] [Python-Dev]  Proposed Python 3.0 schedule
In-Reply-To: <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
Message-ID: <48EB1408.1030007@v.loewis.de>

> Here's some I found from a few minutes of futzing around with r66821 of
> py3k on Linux.
> 
>  - Having os.getcwdb isn't much use when you can't even run python in
> the first place when the current directory has "bad" bytes in it.

That's not true: it *is* of much use. Python will live in /usr/bin,
which has a nicely-decodable path.

> Currently Python outputs:
> Could not find platform independent libraries <prefix>
> Could not find platform dependent libraries <exec_prefix>
> Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
> Fatal Python error: Py_Initialize: can't initialize sys standard streams
> ImportError: No module named encodings.utf_8
> Aborted

I can't reproduce that. This happens (for me) when Python lives in
a directory that has an undecodable path - not when the current
directory is undecodable.

>  - I'd think "find . -type f -print0 | xargs -0 python -c 'pass'" ought
> to work (with files with "bad" bytes being returned by find), which
> means that Python shouldn't blow up and refuse to start when there's a
> non-properly-encoded argv ("Could not convert argument 1 to string" and
> exiting isn't appropriate behavior).

Contributions are welcome. *Of course* you can access these files with
the POSIX API. However, Python's path handling can't.

See above for why I don't consider this a serious bug on Unix.

>  - Of course, just being able to start the interpreter isn't quite
> enough: you'll want to be able to access that argument list too, somehow
> (add sys.argvb?).

Perhaps. However, I don't see the need to be able to do so in Python
3.0.

>  - And then, getopt and optparse modules should work on bytestring
> vectors, so that you can use sys.argvb without writing your own argument
> parser. They don't currently.

And I hope they never will. Using bytes to represent this stuff will
just bring back the 2.x status, so some other solution must be found -
for 3.1 (or 3.2).

>  - There's no os.environb for bytewise access to the environment. Seems
> important.

Not to me. I don't have environment variables with non-ASCII characters
in them, and I think few other people do.

> I'm sure there's even more APIs dealing with pathnames, command line
> arguments, or environment variables that ought to be able to handle both
> bytes and strings, that currently don't.

Please, no.

Regards,
Martin

From victor.stinner at haypocalc.com  Tue Oct  7 11:30:35 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Tue, 7 Oct 2008 11:30:35 +0200
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
Message-ID: <200810071130.35729.victor.stinner@haypocalc.com>

Hi,

First of all, please read my document:
http://wiki.python.org/moin/Python3UnicodeDecodeError

I moved the document to a public wiki to allow anyone to edit it!

On Tuesday 07 October 2008 05:22:09, James Y Knight wrote:
> On Oct 6, 2008, at 8:52 PM, Benjamin Peterson wrote:
> > I'm not sure we do. Correct me if I'm wrong, but the "big ticket",
> > issue bytes/unicode filepaths, has been resolved.

Python3 now accepts bytes for os.listdir(), open() (io.open()), os.unlink(), 
os.path.*(), etc. But it's not enough to say that Python3 can use bytes 
everywhere. It would take months or *years* to fix all issues related to 
bytes and unicode. Remember, this task started in 2000 with Python *2.0* 
(creation of the unicode type).

> Well, if you mean that the resolution decided upon is to "simply"
> allow access to all system APIs using either byte or unicode strings,
> then it seems to me that there's a rather large amount of work left to
> do...

If you know of a problem, open a ticket and propose a solution. It's not 
possible to list all new problems since we don't know them yet :-)

>   - Having os.getcwdb isn't much use when you can't even run python in
> the first place when the current directory has "bad" bytes in it.

My python3.0 works correctly in a directory with an invalid name. What is your 
OS / locale / Python version? Please create a ticket if needed.

>   - I'd think "find . -type f -print0 | xargs -0 python -c 'pass'"
> ought to work (with files with "bad" bytes being returned by find),

First, fix your home directory :-) There are good tools (convmv?) to fix 
invalid filenames.

> which means that Python shouldn't blow up and refuse to start when
> there's a non-properly-encoded argv ("Could not convert argument 1 to
> string" and exiting isn't appropriate behavior)

Why not? It's a good idea to break compatibility in order to refuse invalid 
byte sequences. You can still use the command line, an input file or a GUI to 
read raw byte sequences.

>   - Of course, just being able to start the interpreter isn't quite
> enough: you'll want to be able to access that argument list too,
> somehow (add sys.argvb?).

If we create sys.argvb, what should be done if sys.argv creation fails? 
Would sys.argv be empty or unset? Or would some values be removed (so that 
argv[2] becomes argv[1])? I think that many programs assume that 
sys.argv exists and "is valid". If you introduce a special case (sometimes 
sys.argv doesn't exist or is truncated!?), it will introduce new issues.

>   - There's no os.environb for bytewise access to the environment.
> Seems important.

It would be strange if you could put a bytes variable into os.environb while 
os.environ would not get the key. I know of two major uses of the environment:
 (1) read a variable in Python
 (2) set a variable for a child process 

(1) can be done with os.getenv(), which returns None if the variable (key or 
value) is an invalid byte sequence.

(2) can be done with subprocess.Popen(). subprocess doesn't support bytes yet 
but I wrote patches: #4035 and #4036.

>   - Isn't it a potential security issue that " 'WHATEVER' in
> os.environ" can return False if WHATEVER had some "bad" bytes in it,
> but spawning a subprocess actually will include WHATEVER in the
> subprocess's environment?

Yes. Python should remove the key while creating os.environ.
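
A sketch of the difference being discussed, assuming a later Python 3
(subprocess.run with capture_output needs 3.7+). Leaving env unset lets the
child inherit the whole process environment, while passing an explicit
mapping sends only the keys Python knows about. The helper name
child_sees_whatever is hypothetical:

```python
import os
import subprocess
import sys

CHECK = "import os; print('WHATEVER' in os.environ)"


def child_sees_whatever(env=None):
    """Run a child interpreter and report whether WHATEVER reached its environment."""
    result = subprocess.run([sys.executable, "-c", CHECK], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip() == "True"


os.environ["WHATEVER"] = "1"
# An explicit mapping: the child only gets what is listed here.
filtered = {k: v for k, v in os.environ.items() if k != "WHATEVER"}
```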

> - Shouldn't this work? subprocess.call(b'/bin/echo')

Yes. Most programs (at least on Linux and Mac) support bytes, so you 
should be able to use bytes arguments in their command lines; see issues #4035 
and #4036.

>   - I suppose sys.path should handle bytestrings on the path, and
> should be populated using the bytes-version of os.environ so that
> PYTHONPATH gets read in properly. Which of course implies that all the
> importers need to handle byte filenames.

If your file system is broken, rename your directory but don't introduce a 
special case for sys.path. 

>   - zipfile.ZipFile(b'whatever.zip') doesn't work.

Since zipfile uses bytes in its file structure, zipfile should accept bytes. 
But the right question is: should this issue block Python3 or can we wait for 
Python 3.1 (maybe 3.0.1)?

--

People want to try the new Python version! Python3 introduces amazing new 
features like "keyword-only arguments". The bytes/unicode problem is old and 
only affects broken systems.

Windows (90% of the computers in the world?) only uses characters for the 
filenames, environment and command line. Mac and Linux use UTF-8 most of the 
time, and slowly everything speaks UTF-8! Python3 should not be delayed 
because of this problem.

About Barry's initial question: why is Python3 delayed until December? 
Are there too many open issues?

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From ncoghlan at gmail.com  Tue Oct  7 12:10:19 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 07 Oct 2008 20:10:19 +1000
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <gcecqd$cr9$1@ger.gmane.org>
References: <gcbhid$trn$1@ger.gmane.org>	<20081006.175034.343188282.mrs@localhost.localdomain>	<gcdksc$uo8$1@ger.gmane.org>	<20081006.212059.465784769.mrs@localhost.localdomain>
	<gcecqd$cr9$1@ger.gmane.org>
Message-ID: <48EB358B.8020101@gmail.com>

(added Michael to the CC list)

It isn't object that has grown an __lt__ method, but type. The extra
check Michael actually wants is a way to make sure that the method isn't
coming from the object's metaclass, and the only reliable way to do that
is the way collections.Hashable does it when looking for __hash__:
iterate through the MRO looking for that method name in the class
dictionaries.

E.g.

def defines_method(obj, method_name):
  try:
    mro = obj.__mro__
  except AttributeError:
    return False # Not a type
  for cls in mro:
    if cls is object and obj is not object:
      break # Methods inherited from object don't count
    if method_name in cls.__dict__:
      return True
  return False # Didn't find it

>>> class X(object):
...   def __repr__(self): print "My Repr"
...
>>> class Y(X):
...   def __str__(self): print "My Str"
...
>>> defines_method(object, "__repr__")
True
>>> defines_method(object, "__str__")
True
>>> defines_method(object, "__cmp__")
False
>>> defines_method(X, "__repr__")
True
>>> defines_method(X, "__str__")
False
>>> defines_method(X, "__cmp__")
False
>>> defines_method(Y, "__repr__")
True
>>> defines_method(Y, "__str__")
True
>>> defines_method(Y, "__cmp__")
False

Terry Reedy wrote:
> I strongly suspect that the same is true of every method that a user
> class inherits from a builtin class.  Still, the clp OP is specifically
> interested in object as the base of his inheritance networks.

Your suspicion would be incorrect. What is actually happening is that
the behaviour of the returned method varies depending on whether or not
the object returned comes from the class itself (which will compare
equal with itself even when retrieved from a subclass), or a bound
method from the metaclass (which will not compare equal when retrieved
from a subclass, since it is bound to a different instance of the
metaclass).
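
A minimal sketch of that distinction, written in Python 3 syntax. The names
Meta, Base, Derived, describe and method are invented for illustration:

```python
class Meta(type):
    def describe(cls):
        # Lives on the metaclass, so classes reach it as a bound method.
        return cls.__name__


class Base(metaclass=Meta):
    def method(self):
        pass


class Derived(Base):
    pass


# Found in Base.__dict__ through the MRO: the same object either way.
same_from_class = Derived.method is Base.method
# Found on the metaclass: bound to a different class each time.
same_from_metaclass = Base.describe == Derived.describe
```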

In the case of the comparison methods, they're being retrieved from type
rather than object. This difference is made clear when you attempt to
invoke the retrieved method:

>>> object.__cmp__(1, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected 1 arguments, got 2
>>> object.__cmp__(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: type.__cmp__(x,y) requires y to be a 'type', not a 'int'
>>> object.__cmp__(object)
0
>>> object.__hash__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor '__hash__' of 'object' object needs an argument
>>> object.__hash__(object)
135575008

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From solipsis at pitrou.net  Tue Oct  7 13:45:30 2008
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 7 Oct 2008 11:45:30 +0000 (UTC)
Subject: [Python-3000] Proposed Python 3.0 schedule (bytes/unicde again)
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
Message-ID: <loom.20081007T113750-12@post.gmane.org>


Hi,

James Y Knight <foom <at> fuhm.net> writes:
> 
>   - Having os.getcwdb isn't much use when you can't even run python in  
> the first place when the current directory has "bad" bytes in it.

I don't agree it's a similar problem. Python should be installed in a well-known
place with a sensible path. Of course, bonus points if Python can be launched
from anywhere, but I don't think it's a severe problem. In other words, I'd flag
this as "low priority".

If you want a more important issue, there's the issue of importing modules with
a unicode (non-ascii) path. Amaury has worked on this in the tracker.

> Currently Python outputs:
> Could not find platform independent libraries <prefix>
> Could not find platform dependent libraries <exec_prefix>
> Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
> Fatal Python error: Py_Initialize: can't initialize sys standard streams
> ImportError: No module named encodings.utf_8

Ok, so the error message is quite cryptic and would perhaps deserve improving.
Still, "low priority" IMHO.

>   - And then, getopt and optparse modules should work on bytestring  
> vectors, so that you can use sys.argvb without writing your own  
> argument parser. They don't currently.

Then we will gradually start moving all modules even remotely related with IO
and filesystem stuff to a dual bytes/unicode API? That's precisely the kind of
confusion we want to end with Py3k (the confusion between bytes and unicode as
similar data types which could be used almost interchangeably without giving any
consideration to semantics).

>   - Isn't it a potential security issue that " 'WHATEVER' in  
> os.environ" can return False if WHATEVER had some "bad" bytes in it,  
> but spawning a subprocess actually will include WHATEVER in the  
> subprocess's environment?

I do agree with that. Errors should certainly not pass silently, especially when
they can have strong security implications.

>   - I suppose sys.path should handle bytestrings on the path, and  
> should be populated using the bytes-version of os.environ so that  
> PYTHONPATH gets read in properly.

Well, except on Windows where unicode paths are the Right Thing to do. But then
we have a glaring incompatibility between major platforms.

Regards

Antoine.



From facundobatista at gmail.com  Tue Oct  7 14:20:23 2008
From: facundobatista at gmail.com (Facundo Batista)
Date: Tue, 7 Oct 2008 09:20:23 -0300
Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule
In-Reply-To: <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
Message-ID: <e04bdf310810070520m7d3d3c49v6aae5ee81d61a59a@mail.gmail.com>

2008/10/6 Raymond Hettinger <python at rcn.com>:

>> 15-Oct-2008 3.0 beta 4
>> 05-Nov-2008 3.0 rc 2
>> 19-Nov-2008 3.0 rc 3
>> 03-Dec-2008 3.0 final
>>
>> Given what still needs to be done, is this a reasonable schedule?  Do  we
>> need two more betas?
>
> Yes to both questions.

I agree with you here.


> I'm seeing that people are just starting to download and play with 3.0.
> I expect that we'll start getting more feedback on conversion issues,
> the C API, screwy interactions with operating systems, bytes/text issues,
> unanticipated interactions with other tools, etc.  Each user will stress
> it in new ways and perhaps reveal a bunch of little integration issues
> and documentation issues.  Those little fixups may go a long way toward
> establishing a good first impression and reputation for 3.0 from the outset.

And maybe also here, but bounded.

I don't want to keep deferring 3.0 months and months, I prefer to have
a redesigned schedule now, and stick to it as much as possible, even
if the 3.0 version is not as robust as we would want.

Regards,

-- 
.    Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/

From eric at trueblade.com  Tue Oct  7 14:50:53 2008
From: eric at trueblade.com (Eric Smith)
Date: Tue, 07 Oct 2008 08:50:53 -0400
Subject: [Python-3000] Proposed Python 3.0 schedule (bytes/unicde again)
In-Reply-To: <loom.20081007T113750-12@post.gmane.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<loom.20081007T113750-12@post.gmane.org>
Message-ID: <48EB5B2D.10200@trueblade.com>

Antoine Pitrou wrote:
> Hi,
> 
> James Y Knight <foom <at> fuhm.net> writes:
>>   - Having os.getcwdb isn't much use when you can't even run python in  
>> the first place when the current directory has "bad" bytes in it.
> 
> I don't agree it's a similar problem. Python should be installed in a well-known
> place with a sensible path. Of course, bonus points if Python can be launched
> from anywhere, but I don't think it's a severe problem. In other words, I'd flag
> this as "low priority".

What about the case when using something like py2exe to create a 
distributable executable? I haven't been following this conversation 
closely, so maybe this issue never applies to Windows. But I can see a 
py2exe executable not having a sensible path, and there might be similar 
issues on other platforms.

Eric.


From amauryfa at gmail.com  Tue Oct  7 15:51:07 2008
From: amauryfa at gmail.com (Amaury Forgeot d'Arc)
Date: Tue, 7 Oct 2008 15:51:07 +0200
Subject: [Python-3000] Accessing module state from extension types
Message-ID: <e27efe130810070651w39312ae8lb6261f8622b07f07@mail.gmail.com>

Hello,

Extension modules have a new "md_state" member; I understand that it
is designed to hold the "static" state of the module.
IIUC, for example in _cpickle.c, the "PyObject *dispatch_table"
variable is a good candidate for such module state.
This would allow it to play more nicely with multiple startups/shutdowns,
reloading of the module, or with different sub-interpreters.

This state is accessible through the PyModule_GetState() function.
This is fine for module functions (the module object is passed as the
first argument, even if we always name it "self"), but how does it
work with classes or class methods?
Classes do not contain a reference to their modules; they only have
access to the __name__, which is not the same thing at all, especially
in this case.

This is unfortunate for extension modules which try to be
object-oriented, and have very few functions (the _pickle module does
not have any, BTW).
How is this supposed to work?

-- 
Amaury Forgeot d'Arc

From janssen at parc.com  Tue Oct  7 17:24:08 2008
From: janssen at parc.com (Bill Janssen)
Date: Tue, 7 Oct 2008 08:24:08 PDT
Subject: [Python-3000] Proposed Python 3.0 schedule (bytes/unicde again)
In-Reply-To: <loom.20081007T113750-12@post.gmane.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<loom.20081007T113750-12@post.gmane.org>
Message-ID: <40733.1223393048@parc.com>

Antoine Pitrou <solipsis at pitrou.net> wrote:

> >   - And then, getopt and optparse modules should work on bytestring  
> > vectors, so that you can use sys.argvb without writing your own  
> > argument parser. They don't currently.
> 
> Then we will gradually start moving all modules even remotely related with IO
> and filesystem stuff to a dual bytes/unicode API? That's precisely the kind of
> confusion we want to end with Py3k (the confusion between bytes and unicode as
> similar data types which could be used almost interchangeably without giving any
> consideration to semantics).

I wouldn't mix "IO" and "filesystem" that way.  "IO" is complicated.

The problem is, as we've lately discovered, that things which "look
toward" the machine and the OS, like file system APIs or os.getcwd() or
os.environ, are really dealing in bit sequences of various kinds, not
strings, though the designers of these low-level artifacts have made
some effort to disguise that.  Things which "look toward" the user, on
the other hand, are really dealing in strings, not bytes.  There's a
conversion step in there, if you are trying to write a program to print
to stdout (that is, the user) all the files in a directory (the OS).
Now, we can provide an automatic converter which will work in lots of
cases, but we can't afford to just deny the cases in which it doesn't
work.  We need bytes APIs to the OS and underlying machine and
networking and probably other things; we need string APIs to communicate
with the user.
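
A sketch of that conversion step, assuming names arrive from the OS as bytes
and are decoded only when shown to the user. The helper display_name is
hypothetical, and the "replace" error handler is one possible policy rather
than the one the thread settled on:

```python
import sys


def display_name(raw, encoding=None):
    """Decode an OS-level bytes name for display; keep the bytes for real file access."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    # Undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(encoding, "replace")
```

The decoded text is lossy for bad names, so a program would keep the
original bytes around for anything that goes back to the OS.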

Bill

From tjreedy at udel.edu  Tue Oct  7 17:44:24 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Tue, 07 Oct 2008 11:44:24 -0400
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <48EB358B.8020101@gmail.com>
References: <gcbhid$trn$1@ger.gmane.org>	<20081006.175034.343188282.mrs@localhost.localdomain>	<gcdksc$uo8$1@ger.gmane.org>	<20081006.212059.465784769.mrs@localhost.localdomain>	<gcecqd$cr9$1@ger.gmane.org>
	<48EB358B.8020101@gmail.com>
Message-ID: <gcg04o$5bh$1@ger.gmane.org>

Nick Coghlan wrote:
> (added Michael to the CC list)
> 
> It isn't object that has grown an __lt__ method, but type. The extra
> check Michael actually wants is a way to make sure that the method isn't
> coming from the object's metaclass, and the only reliable way to do that
> is the way collections.Hashable does it when looking for __hash__:
> iterate through the MRO looking for that method name in the class
> dictionaries

Thank you for the explanation.  I was aware that MRO traversal should be 
the 'officially correct' procedure for the original, but did not 
understand why (for 2.x, at least).

> In the case of the comparison methods, they're being retrieved from type
> rather than object. This difference is made clear when you attempt to
> invoke the retrieved method:
> 
>>>> object.__cmp__(1, 2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: expected 1 arguments, got 2
>>>> object.__cmp__(2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: type.__cmp__(x,y) requires y to be a 'type', not a 'int'
>>>> object.__cmp__(object)
> 0

This surprises me, partly because the situation seems to be different in 
3.0.  Using __le__ in place of the non-existent __cmp__,

 >>> ole = object.__le__
 >>> ole(1,2)
NotImplemented
 >>> ole(1)
Traceback (most recent call last):
   File "<pyshell#21>", line 1, in <module>
     ole(1)
TypeError: expected 1 arguments, got 0
 >>> ole(object)
Traceback (most recent call last):
   File "<pyshell#22>", line 1, in <module>
     ole(object)
TypeError: expected 1 arguments, got 0
 >>> ole
<slot wrapper '__le__' of 'object' objects>
 >>> dir(ole)
['__call__', '__class__', '__delattr__', '__doc__', '__eq__', 
'__format__', '__ge__', '__get__', '__getattribute__', '__gt__', 
'__hash__', '__init__', '__le__', '__lt__', '__name__', '__ne__', 
'__new__', '__objclass__', '__reduce__', '__reduce_ex__', '__repr__', 
'__setattr__', '__sizeof__', '__str__', '__subclasshook__']
# no __self__ attribute

 >>> class C(object): pass

 >>> C.__le__
<slot wrapper '__le__' of 'object' objects>
# same as for hash in 2.5

I interpret all this to mean that in 3.0, rich comparisons *are* defined 
on and being retrieved from object.  Correct?

I presume the change is because in 3.0, everything is an instance of 
object, so all classes can inherit the common methods from object, 
whereas that was *not* true in 2.x.  I very much like the cleaner design.
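
The same observation can be checked without reading reprs. This is a small
sketch assuming a Python 3 interpreter; the variable names are only
illustrative:

```python
class C(object):
    pass


# The slot wrapper lives in object's own class dictionary...
defined_on_object = "__le__" in vars(object)
# ...and a subclass that does not override it gets the very same object back.
inherited_unchanged = C.__le__ is object.__le__
# Called with two arguments, the default implementation declines to compare.
result = object.__le__(1, 2)
```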

Terry Jan Reedy


From foom at fuhm.net  Tue Oct  7 17:51:19 2008
From: foom at fuhm.net (James Y Knight)
Date: Tue, 7 Oct 2008 11:51:19 -0400
Subject: [Python-3000] [Python-Dev]  Proposed Python 3.0 schedule
In-Reply-To: <48EB1408.1030007@v.loewis.de>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<48EB1408.1030007@v.loewis.de>
Message-ID: <BDDCC54F-78D9-40DA-8F63-04DE4DCA0CEE@fuhm.net>

On Oct 7, 2008, at 3:47 AM, Martin v. Löwis wrote:
>> - Having os.getcwdb isn't much use when you can't even run python in
>> the first place when the current directory has "bad" bytes in it.
>
> That's not true: it *is* of much use. Python will live in /usr/bin,
> which has a nicely-decodable path.
>
>> Currently Python outputs:
>> Could not find platform independent libraries <prefix>
>> Could not find platform dependent libraries <exec_prefix>
>> Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
>> Fatal Python error: Py_Initialize: can't initialize sys standard  
>> streams
>> ImportError: No module named encodings.utf_8
>> Aborted
>
> I can't reproduce that. This happens (for me) when Python lives in
> a directory that has an undecodable path - not when the current
> directory is undecodable.

Sorry about that: this test was indeed in error. I ran "../python"
from an undecodable current directory, rather than running
"/full/path/to/python", or putting python on the PATH and running it as
"python". The first does not work, but the other, more common ways to
start it do.

>>
>> I'm sure there's even more APIs dealing with pathnames, command line
>> arguments, or environment variables that ought to be able to handle  
>> both
>> bytes and strings, that currently don't.
>
> Please, no.

I completely and totally agree with your distaste: it's rather gross to  
allow bytes-or-str for every API that touches anything like filenames/ 
argv/environ. That's why I was pushing for the reversible conversion  
to str... But if bytes-or-str is the solution that's been chosen for  
this issue, it ought to either be fully committed to and implemented,  
or at least fully recognized and documented as a half-baked solution.

Of course, if a reversible encoding-into-string solution is used  
instead, none of these things would need special treatment: they would  
all work already.

FWIW: Qt works fine with undecodable filenames, and it too uses
Unicode strings everywhere in its API. I looked into what it does, and
found that it uses your (Martin's) original idea for solving this: it
stores undecodable bytes as characters from 0x10fe00 to 0x10feff
(which is valid private-use codespace).  While that might not be
ideally correct, since you lose those 256 PUA characters, even that is
IMO better than pushing out bytes to every API, or worse, giving up
and just having python unable to access files, as it is now.

See lines 3074 (QString::toUtf8()) and 3408 (QString::fromUtf8()) of

http://www.google.com/codesearch?q=+show:o7fNK6SzOYs:NO-Bv-AR2rI:toIOngLf1V8&cs_p=http://ie.archive.ubuntu.com/trolltech/pub/qt/snapshots/qt-x11-opensource-src-4.4.0-snapshot-20070402.tar.bz2&cs_f=qt-x11-opensource-src-4.4.0-snapshot-20070402/src/corelib/tools/qstring.cpp
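
To make the idea concrete, here is a rough sketch of such a reversible
mapping in Python. It is not Qt's code and not anything Python ships: the
error-handler name "pua-escape-sketch" and the functions decode_filename
and encode_filename are hypothetical, and UTF-8 is assumed as the
filename encoding. Each undecodable byte b becomes the character
0x10fe00 + b, and encoding turns those characters back into raw bytes.

```python
import codecs

PUA_BASE = 0x10FE00  # start of the 256-character range described above


def _pua_escape(exc):
    """Decode-error handler: map each bad byte b to chr(PUA_BASE + b)."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    raise exc


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("pua-escape-sketch", _pua_escape)


def decode_filename(raw):
    """bytes -> str; never fails, undecodable bytes land in the PUA range."""
    return raw.decode("utf-8", "pua-escape-sketch")


def encode_filename(name):
    """str -> bytes; PUA-range characters turn back into single raw bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9.txt"            # Latin-1 bytes, not valid UTF-8
name = decode_filename(raw)     # a str that any str-only API can carry
round_tripped = encode_filename(name)
```

With a mapping like this, a str-only API can hand the name back to the
OS unchanged, which is the property being argued for here.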

James

From g.brandl at gmx.net  Tue Oct  7 18:04:37 2008
From: g.brandl at gmx.net (Georg Brandl)
Date: Tue, 07 Oct 2008 18:04:37 +0200
Subject: [Python-3000] Problem with grammar for 'except'?
In-Reply-To: <ca471dc20810051945x4ca037adt477e71eceb0e34fa@mail.gmail.com>
References: <bbaeab100809032110i58bdbcefpa66d5536ef02c7dc@mail.gmail.com>	<A38FA5A4111844B8B3CF639A04795311@RaymondLaptop1>	<ca471dc20809041236l3955a60blcf8046a38adfd928@mail.gmail.com>	<78b3a9580810051914v7a8995bax5f0f12d2a7934ad0@mail.gmail.com>
	<ca471dc20810051945x4ca037adt477e71eceb0e34fa@mail.gmail.com>
Message-ID: <gcg1b3$aaj$1@ger.gmane.org>

Guido van Rossum schrieb:
> Someone please fix the PEP. There are very good reasons for *not*
> allowing "except X, Y:" to have a meaning -- if 2.x code somehow
> accidentally ended up in the 3.0 world without having been run through
> 2to3, it would silently perturb the meaning in the most confusing way.
> That's why the implementation got it right.

Done.
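
For reference, the two unambiguous 3.0 spellings that replace the 2.x
"except X, Y:" form look like this. It is a small sketch; parse_int is a
made-up example function:

```python
def parse_int(text):
    """Return int(text), or a short message naming the exception type."""
    try:
        return int(text)
    # Several exception types are caught with a parenthesized tuple,
    # and the instance is bound with "as" rather than a comma.
    except (ValueError, TypeError) as exc:
        return "rejected: " + type(exc).__name__


ok = parse_int("12")
bad_value = parse_int("twelve")
bad_type = parse_int(None)
```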

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


From tjreedy at udel.edu  Tue Oct  7 20:07:36 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Tue, 07 Oct 2008 14:07:36 -0400
Subject: [Python-3000] [Python-Dev]  Proposed Python 3.0 schedule
In-Reply-To: <BDDCC54F-78D9-40DA-8F63-04DE4DCA0CEE@fuhm.net>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>	<48EB1408.1030007@v.loewis.de>
	<BDDCC54F-78D9-40DA-8F63-04DE4DCA0CEE@fuhm.net>
Message-ID: <gcg8h8$drd$1@ger.gmane.org>

James Y Knight wrote:

> FWIW: Qt works fine with undecodeable filenames, and it too uses unicode 
> strings everywhere in its API. I looked into what it does, and found 
> that it uses your (Martin)'s original idea for solving this: it stores 
> undecodeable bytes as characters from 0x10fe00 to 0x10feff (which is 
> valid private-use codespace).  While that might not be ideally correct, 
> since you lose those 256 PUA characters, even that is IMO better than 
> pushing out bytes to every API, or worse, giving up and just having 
> python unable to access files, as it is now.

If Python uses a bit of the PUA (but only for filenames), which I think 
it should be free to do, then the manual should document that fact and 
when and why.  Then any Python app that needs to use the full PUA could 
do so as long as it either avoids mixing filenames with its strings or 
avoids working with invalid filenames.

The referenced Qt file is licensed under GPL2.


From martin at v.loewis.de  Tue Oct  7 21:40:21 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 07 Oct 2008 21:40:21 +0200
Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python
	3.0 schedule
In-Reply-To: <00e001c92881$68ba93c0$3a2fbb40$@com.au>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
	<00e001c92881$68ba93c0$3a2fbb40$@com.au>
Message-ID: <48EBBB25.70609@v.loewis.de>

> More specifically, I think 2to3 is shaping up well.  pywin32 is taking the
> approach of "port where possible, but keep in py2x syntax and convert at
> 'setup.py' time" and this is working out fairly well

I can't say how glad I am that you say that. It supports lib2to3 being a
proper library, despite the problems that this may cause in itself.

> * Better support for 2to3 in distutils (specifically, the support in
> build_py is stale, plus 'build_scripts' and 'install_data' should convert
> .py files to py3k syntax.)

Please do create a bug report for that. It sounds like it's easy to fix.

> An 'example' project that uses py2k syntax and
> "just works" on py3k using this strategy might be useful here.

Perhaps pywin32 :-?

I don't think a demo project would do much good, as it doesn't exercise
all the issues that may occur.

> * A standard 'helper script' that allows people to use py3k to execute a
> py2x syntax script by auto-converting the code.  I've a 10ish-line script
> that uses lib2to3 plus exec() to achieve that result, but a helper in 2to3
> for this would be nice.  For a concrete use-case, we want to keep our
> distutils script in py2x syntax, but execute it via py3k.  It's very possible
> this already exists and I've just missed it...

For the case of setup.py, I was hoping that it could be written in
compatible syntax even without needing conversion. That worked fine for
my Django port. Is that not the case for pywin32?

This specific issue might be out of scope for 3.x, IMO.

> Either way, I'm fairly confident a pywin32 build for py3k will be available
> in the next month or 2 (but as a result, I'm not really in a position to
> help with the above for that period...)

But please do file bug reports, preferably along with any patches to
distutils that you already have.

Regards,
Martin

From martin at v.loewis.de  Tue Oct  7 22:06:52 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 07 Oct 2008 22:06:52 +0200
Subject: [Python-3000] Python3UnicodeDecodeError (Was: Proposed Python 3.0
	schedule)
In-Reply-To: <200810071130.35729.victor.stinner@haypocalc.com>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<200810071130.35729.victor.stinner@haypocalc.com>
Message-ID: <48EBC15C.1050305@v.loewis.de>

> First of all, please read my document:
> http://wiki.python.org/moin/Python3UnicodeDecodeError

I have problems understanding that document. Is it supposed to
be a PEP (i.e. a proposal to enhance Python), or is it a description
of the status quo?

If it is a PEP, it should clearly separate status quo, specification,
and rationale (in any order that you find reasonable). It should also
have an "open issues" section, explicitly listing the questions that
haven't been resolved, and it should record objections to the proposal.

I think I would object to the specification (perhaps to the degree
of proposing a counter-PEP), but to do so, I first need a specification
to object to.

In terms of time-line, I think any such PEP is *clearly* out of scope
for Python 3.0. All the remaining issues should be deferred to 3.1.

That the approach "we can use bytes in the file system API" was so
rushed into the code base is already unfortunate, but I can understand
the motivation - people want to write backup programs in Python.

If I take the text as if it was a specification, here are some of my
objections:

- Default encoding:
  a) seems irrelevant for the PEP. The default encoding doesn't nearly
     have the role anymore that it had in 2.x, and shouldn't have any
     effect on how file names are treated.
  b) I would propose that the notion of a default encoding is entirely
     eliminated from Python, along with sys.(get|set)defaultencoding
- argv and environ: are you suggesting that the behavior described
  in the PEP is desirable? I don't think it is (but I don't think it
  should change for 3.0, either, only for 3.1)

Regards,
Martin

From martin at v.loewis.de  Tue Oct  7 22:09:31 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 07 Oct 2008 22:09:31 +0200
Subject: [Python-3000] [Python-Dev]  Proposed Python 3.0 schedule
In-Reply-To: <BDDCC54F-78D9-40DA-8F63-04DE4DCA0CEE@fuhm.net>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>	<48EB1408.1030007@v.loewis.de>
	<BDDCC54F-78D9-40DA-8F63-04DE4DCA0CEE@fuhm.net>
Message-ID: <48EBC1FB.5090209@v.loewis.de>

James Y Knight wrote:
> or at least fully recognized and documented as a half-baked
> solution.

I would prefer that, leaving a full resolution to 3.1 (or perhaps 3.2).
If we wait long enough, the issue will disappear (a strategy that Sun
is apparently taking for Java :-)

Regards,
Martin

From fdrake at acm.org  Tue Oct  7 22:18:09 2008
From: fdrake at acm.org (Fred Drake)
Date: Tue, 07 Oct 2008 16:18:09 -0400
Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError
In-Reply-To: <48EBC15C.1050305@v.loewis.de>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<200810071130.35729.victor.stinner@haypocalc.com>
	<48EBC15C.1050305@v.loewis.de>
Message-ID: <FCBFAB07-525D-4061-86C1-AE09BC426D11@acm.org>

On Oct 7, 2008, at 4:06 PM, Martin v. Löwis wrote:
>  b) I would propose that the notion of a default encoding is entirely
>     eliminated from Python, along with sys.(get|set)defaultencoding

+1


   -Fred

-- 
Fred Drake   <fdrake at acm.org>


From guido at python.org  Tue Oct  7 22:28:30 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 7 Oct 2008 13:28:30 -0700
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
Message-ID: <ca471dc20810071328t65ee07faw5e0672be8bdc3fea@mail.gmail.com>

On Mon, Oct 6, 2008 at 5:47 PM, Barry Warsaw <barry at python.org> wrote:
> So, we need to come up with a new release schedule for Python 3.0.  My
> suggestion:
>
> 15-Oct-2008 3.0 beta 4
> 05-Nov-2008 3.0 rc 2
> 19-Nov-2008 3.0 rc 3
> 03-Dec-2008 3.0 final
>
> Given what still needs to be done, is this a reasonable schedule?  Do we
> need two more betas?

I know I'm contradicting what I said earlier, but perhaps we should
just forget going back to beta and stick to ever-more-perfect release
candidates? In other worlds release candidates often contain tons of
imperfections (I believe I've seen this both for Java and Windows) and
the label "release candidate" more clearly encourages people to
download and play with it, which is what we need at this point! Then
the schedule would be something like

  15-Oct-2008 3.0 rc 2
  05-Nov-2008 3.0 rc 3
  19-Nov-2008 3.0 rc 4
  03-Dec-2008 3.0 final

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Tue Oct  7 22:29:43 2008
From: guido at python.org (Guido van Rossum)
Date: Tue, 7 Oct 2008 13:29:43 -0700
Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError
In-Reply-To: <FCBFAB07-525D-4061-86C1-AE09BC426D11@acm.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<200810071130.35729.victor.stinner@haypocalc.com>
	<48EBC15C.1050305@v.loewis.de>
	<FCBFAB07-525D-4061-86C1-AE09BC426D11@acm.org>
Message-ID: <ca471dc20810071329i7ccd5651hcff7ebff344e1f30@mail.gmail.com>

On Tue, Oct 7, 2008 at 1:18 PM, Fred Drake <fdrake at acm.org> wrote:
> On Oct 7, 2008, at 4:06 PM, Martin v. Löwis wrote:
>>
>>  b) I would propose that the notion of a default encoding is entirely
>>    eliminated from Python, along with sys.(get|set)defaultencoding
>
> +1

I expect that the only effect of this change would be that the
filesystem encoding would become the de-facto default encoding for
other contexts as well.

Not that that is necessarily a bad thing...

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From tjreedy at udel.edu  Tue Oct  7 22:44:16 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Tue, 07 Oct 2008 16:44:16 -0400
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <ca471dc20810071328t65ee07faw5e0672be8bdc3fea@mail.gmail.com>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<ca471dc20810071328t65ee07faw5e0672be8bdc3fea@mail.gmail.com>
Message-ID: <gcghn0$g7v$1@ger.gmane.org>

Guido van Rossum wrote:
> On Mon, Oct 6, 2008 at 5:47 PM, Barry Warsaw <barry at python.org> wrote:
>> So, we need to come up with a new release schedule for Python 3.0.  My
>> suggestion:
>>
>> 15-Oct-2008 3.0 beta 4
>> 05-Nov-2008 3.0 rc 2
>> 19-Nov-2008 3.0 rc 3
>> 03-Dec-2008 3.0 final
>>
>> Given what still needs to be done, is this a reasonable schedule?  Do we
>> need two more betas?
> 
> I know I'm contradicting what I said earlier, but perhaps we should
> just forget going back to beta and stick to ever-more-perfect release
> candidates? In other worlds release candidates often contain tons of
> imperfections (I believe I've seen this both for Java and Windows) and
> the label "release candidate" more clearly encourages people to
> download and play with it, which is what we need at this point! Then
> the schedule would be something like
> 
>   15-Oct-2008 3.0 rc 2
>   05-Nov-2008 3.0 rc 3
>   19-Nov-2008 3.0 rc 4
>   03-Dec-2008 3.0 final

As a user, I agree, even if it does stretch the usual notion of rc. 
Having a beta follow and be better than a gamma (rc) would be confusing.
Also, it was the rc designation that encouraged more people to download 
and play with rc1.  I think there has definitely been more attention on 
3.0 on c.l.p lately.


From rhamph at gmail.com  Tue Oct  7 22:45:11 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Tue, 7 Oct 2008 14:45:11 -0600
Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule
In-Reply-To: <BDDCC54F-78D9-40DA-8F63-04DE4DCA0CEE@fuhm.net>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<48EB1408.1030007@v.loewis.de>
	<BDDCC54F-78D9-40DA-8F63-04DE4DCA0CEE@fuhm.net>
Message-ID: <aac2c7cb0810071345o2c20d3d1v42b45e79f3833c17@mail.gmail.com>

On Tue, Oct 7, 2008 at 9:51 AM, James Y Knight <foom at fuhm.net> wrote:
> On Oct 7, 2008, at 3:47 AM, Martin v. Löwis wrote:
>>>
>>> - Having os.getcwdb isn't much use when you can't even run python in
>>> the first place when the current directory has "bad" bytes in it.
>>
>> That's not true: it *is* of much use. Python will live in /usr/bin,
>> which has a nicely-decodable path.
>>
>>> Currently Python outputs:
>>> Could not find platform independent libraries <prefix>
>>> Could not find platform dependent libraries <exec_prefix>
>>> Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
>>> Fatal Python error: Py_Initialize: can't initialize sys standard streams
>>> ImportError: No module named encodings.utf_8
>>> Aborted
>>
>> I can't reproduce that. This happens (for me) when Python lives in
>> a directory that has an undecodable path - not when the current
>> directory is undecodable.
>
> Sorry about that: this test was indeed in error: I ran "../python" from an
> undecodeable current directory, rather than "/full/path/to/python", or
> putting python on the PATH and running it as "python". The first does not
> work, but the other more common ways to start it do.
>
>>>
>>> I'm sure there's even more APIs dealing with pathnames, command line
>>> arguments, or environment variables that ought to be able to handle both
>>> bytes and strings, that currently don't.
>>
>> Please, no.
>
> I completely and totally agree with your distaste: it's rather gross to allow
> bytes-or-str for every API that touches anything like
> filenames/argv/environ. That's why I was pushing for the reversible
> conversion to str...But if bytes-or-str is the solution that's been chosen
> for this issue, it ought to either be fully committed to and implemented, or
> at least fully recognized and documented as a half-baked solution.
>
> Of course, if a reversible encoding-into-string solution is used instead,
> none of these things would need special treatment: they would all work
> already.
>
> FWIW: Qt works fine with undecodeable filenames, and it too uses unicode
> strings everywhere in its API. I looked into what it does, and found that it
> uses your (Martin)'s original idea for solving this: it stores undecodeable
> bytes as characters from 0x10fe00 to 0x10feff (which is valid private-use
> codespace).  While that might not be ideally correct, since you lose those
> 256 PUA characters, even that is IMO better than pushing out bytes to every
> API, or worse, giving up and just having python unable to access files, as
> it is now.
>
> See lines 3074 (QString::toUtf8()) and 3408 (QString::fromUtf8()) of
>
> http://www.google.com/codesearch?q=+show:o7fNK6SzOYs:NO-Bv-AR2rI:toIOngLf1V8&cs_p=http://ie.archive.ubuntu.com/trolltech/pub/qt/snapshots/qt-x11-opensource-src-4.4.0-snapshot-20070402.tar.bz2&cs_f=qt-x11-opensource-src-4.4.0-snapshot-20070402/src/corelib/tools/qstring.cpp

So what does Qt do when given a file name already using those PUA characters?
Looks like they get passed through untouched when decoded, but will
get translated into invalid names upon encoding.  So you still have
file names you can't open, and you're incompatible with what other
libraries do.
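
A small sketch of that failure mode, assuming the scheme works as
described (bad bytes mapped to 0x10fe00 + byte on decode, and mapped back
on encode). The function name encode_with_pua_unescape is hypothetical
and this is not Qt's actual code:

```python
PUA_BASE = 0x10FE00


def encode_with_pua_unescape(name):
    """str -> bytes, turning PUA_BASE..PUA_BASE+0xFF back into raw bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# A file whose name legitimately contains U+10FEE9, stored as valid UTF-8.
on_disk = "x\U0010FEE9".encode("utf-8")
# Valid UTF-8 decodes without error, so the character passes through as-is.
decoded = on_disk.decode("utf-8")
# Encoding then mistakes it for an escaped byte and emits a different name.
re_encoded = encode_with_pua_unescape(decoded)
names_match = (re_encoded == on_disk)
```

The round trip changes the bytes, so the file can no longer be opened
under the name the application was given.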

The only thing going for Qt is that they seem specifically interested
in latin-1, rather than arbitrary bad names.  The latin-1 strings that
would correspond to the UTF-8 PUA used would include at least one
control character, as well as other unusual bits, so it's pretty
unlikely to encounter a real latin-1 file name like that.


-- 
Adam Olsen, aka Rhamphoryncus

From mal at egenix.com  Tue Oct  7 22:52:04 2008
From: mal at egenix.com (M.-A. Lemburg)
Date: Tue, 07 Oct 2008 22:52:04 +0200
Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError
In-Reply-To: <FCBFAB07-525D-4061-86C1-AE09BC426D11@acm.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>	<200810071130.35729.victor.stinner@haypocalc.com>	<48EBC15C.1050305@v.loewis.de>
	<FCBFAB07-525D-4061-86C1-AE09BC426D11@acm.org>
Message-ID: <48EBCBF4.7080200@egenix.com>

On 2008-10-07 22:18, Fred Drake wrote:
> On Oct 7, 2008, at 4:06 PM, Martin v. Löwis wrote:
>>  b) I would propose that the notion of a default encoding is entirely
>>     eliminated from Python, along with sys.(get|set)defaultencoding
> 
> +1

As already mentioned in my reply to Victor: +1. It's not adjustable
anymore, so we might as well get rid of the sys module APIs.

The term "default encoding" itself still has some value in that it
is associated with the C API char* encoding used for PyUnicode
objects in Python 3.0.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 07 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611

From solipsis at pitrou.net  Tue Oct  7 23:31:42 2008
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 7 Oct 2008 21:31:42 +0000 (UTC)
Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<200810071130.35729.victor.stinner@haypocalc.com>
	<48EBC15C.1050305@v.loewis.de>
	<FCBFAB07-525D-4061-86C1-AE09BC426D11@acm.org>
	<ca471dc20810071329i7ccd5651hcff7ebff344e1f30@mail.gmail.com>
Message-ID: <loom.20081007T213020-298@post.gmane.org>

Guido van Rossum <guido <at> python.org> writes:
> 
> I expect that the only effect of this change would be that the
> filesystem encoding would become the de-facto default encoding for
> other contexts as well.

But there is no such thing as "the" filesystem encoding (except in Python's
simplified heuristics). There is one distinct encoding for each mounted 
filesystem.

Regards

Antoine.





From mrs at mythic-beasts.com  Tue Oct  7 23:28:16 2008
From: mrs at mythic-beasts.com (Mark Seaborn)
Date: Tue, 07 Oct 2008 22:28:16 +0100 (BST)
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <gcecqd$cr9$1@ger.gmane.org>
References: <gcdksc$uo8$1@ger.gmane.org>
	<20081006.212059.465784769.mrs@localhost.localdomain>
	<gcecqd$cr9$1@ger.gmane.org>
Message-ID: <20081007.222816.343187053.mrs@localhost.localdomain>

Terry Reedy <tjreedy at udel.edu> wrote:

> Mark Seaborn wrote:
> > It appears that unbound methods do what you want in the general case
> > in Python 2.5 and 2.6.  It's just that __lt__ behaves unlike normal
> > unbound methods.  So this isn't an argument against unbound methods,
> > it's an argument for __lt__ not to be a special case.
> 
> It is not a special case.
> 
>  >>> def C(object): pass
> ...
> 
>  >>> C.__hash__ == object.__hash__
> False
> 
>  >>> C.__str__ == object.__str__
> False

I assume you meant to use "class" instead of "def", in which case most
of the attributes do compare the way you want:

>>> class C(object): pass
... 
>>> C.__hash__ == object.__hash__
True
>>> C.__str__ == object.__str__
True

# But in Python 2.6:
>>> C.__lt__ == object.__lt__
False

Mark

From ncoghlan at gmail.com  Tue Oct  7 23:47:30 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 08 Oct 2008 07:47:30 +1000
Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule
In-Reply-To: <67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
	<67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org>
Message-ID: <48EBD8F2.4090802@gmail.com>

Barry Warsaw wrote:
> On Oct 6, 2008, at 9:48 PM, Raymond Hettinger wrote:
> 
>> [Barry Warsaw]
>>> So, we need to come up with a new release schedule for Python 3.0. 
>>> My  suggestion:
>>> 15-Oct-2008 3.0 beta 4
>>> 05-Nov-2008 3.0 rc 2
>>> 19-Nov-2008 3.0 rc 3
>>> 03-Dec-2008 3.0 final
>>> Given what still needs to be done, is this a reasonable schedule? 
>>> Do  we need two more betas?
> 
>> Yes to both questions.
> 
> I think that's contradictory :).  If we need two betas, then 05-Nov
> becomes beta 5, 19-Nov is rc 2.  If we don't need another rc then we can
> still do a final release on 03-Dec, otherwise we probably go 2 weeks
> later.  I don't want to go much later than that though because then we
> get into the holiday season.

Do we need the full two weeks between rc's? Or is it too much of a pain
to cut releases 3 weeks in a row?

E.g. something like:

15-Oct-2008 3.0 beta 4
05-Nov-2008 3.0 beta 5
19-Nov-2008 3.0 rc 2
26-Nov-2008 3.0 rc 3 (if needed)
03-Dec-2008 3.0 final

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From martin at v.loewis.de  Tue Oct  7 23:50:48 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 07 Oct 2008 23:50:48 +0200
Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python
 3.0 schedule
In-Reply-To: <48EBD8F2.4090802@gmail.com>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>	<67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org>
	<48EBD8F2.4090802@gmail.com>
Message-ID: <48EBD9B8.4040102@v.loewis.de>

> Do we need the full two weeks between rc's?

If they are just other names for betas, yes. If they are true
release candidates (in the sense of "we really want to release this
as-is unless somebody tells us why this is a really bad idea"), then
no.

> Or is it too much of a pain
> to cut releases 3 weeks in a row?

It's a lot of effort, yes. Also for users, who will have barely
installed one release candidate when the next one comes out.

Regards,
Martin

From barry at python.org  Wed Oct  8 00:00:23 2008
From: barry at python.org (Barry Warsaw)
Date: Tue, 7 Oct 2008 18:00:23 -0400
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <ca471dc20810071328t65ee07faw5e0672be8bdc3fea@mail.gmail.com>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<ca471dc20810071328t65ee07faw5e0672be8bdc3fea@mail.gmail.com>
Message-ID: <E16A1C28-F7C9-4CC1-836C-E254AFF5AE7E@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Oct 7, 2008, at 4:28 PM, Guido van Rossum wrote:

> On Mon, Oct 6, 2008 at 5:47 PM, Barry Warsaw <barry at python.org> wrote:
>> So, we need to come up with a new release schedule for Python 3.0.   
>> My
>> suggestion:
>>
>> 15-Oct-2008 3.0 beta 4
>> 05-Nov-2008 3.0 rc 2
>> 19-Nov-2008 3.0 rc 3
>> 03-Dec-2008 3.0 final
>>
>> Given what still needs to be done, is this a reasonable schedule?   
>> Do we
>> need two more betas?
>
> I know I'm contradicting what I said earlier, but perhaps we should
> just forget going back to beta and stick to ever-more-perfect release
> candidates? In other worlds release candidates often contain tons of
> imperfections (I believe I've seen this both for Java and Windows) and
> the label "release candidate" more clearly encourages people to
> download and play with it, which is what we need at this point! Then
> the schedule would be something like
>
>  15-Oct-2008 3.0 rc 2
>  05-Nov-2008 3.0 rc 3
>  19-Nov-2008 3.0 rc 4
>  03-Dec-2008 3.0 final

I'm okay with that too.  It does seem odd to go back to beta then  
release another rc.  What's in a name, anyway? <wink>.  And it is good  
that more people are downloading it now that it's rc.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSOvb93EjvBPtnXfVAQJTQAP/cmNdzd/SRymxXvW85EnW2NTHUkh1Auw9
bGlbSC0BF2p9ArgbDLPh/X4uatB3UaqoNeq5LTWHL2f9iCnsI7lFMPuexGr+3t4l
Xmld8qN77j4GpU6bXL8uUo3/vlhU4MiG5ETl0kMH30f47srOAAGEGZAqW9jAM92I
YSkQPSgBdYo=
=+s9t
-----END PGP SIGNATURE-----

From martin at v.loewis.de  Wed Oct  8 00:00:49 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 08 Oct 2008 00:00:49 +0200
Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError
In-Reply-To: <loom.20081007T213020-298@post.gmane.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>	<200810071130.35729.victor.stinner@haypocalc.com>	<48EBC15C.1050305@v.loewis.de>	<FCBFAB07-525D-4061-86C1-AE09BC426D11@acm.org>	<ca471dc20810071329i7ccd5651hcff7ebff344e1f30@mail.gmail.com>
	<loom.20081007T213020-298@post.gmane.org>
Message-ID: <48EBDC11.2040806@v.loewis.de>

Antoine Pitrou wrote:
> Guido van Rossum <guido <at> python.org> writes:
>> I expect that the only effect of this change would be that the
>> filesystem encoding would become the de-facto default encoding for
>> other contexts as well.
> 
> But there is no such thing as "the" filesystem encoding (except in Python's
> simplified heuristics). There is one distinct encoding for each mounted 
> filesystem.

At best - for mounted joliet/vfat/ntfs partitions. For ext3/ufs/jfs
slices, every directory might use its own encoding, different files
in a single directory might use different encodings, and even a single
file name might switch encodings within itself.
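
As an illustration of the mixed-encoding case, here is a sketch that sorts
raw directory entries by whether they decode under one assumed encoding.
The helper name classify and the sample names are made up; in practice the
raw names could come from os.listdir() called with a bytes path, which
returns bytes entries:

```python
def classify(raw_names, encoding="utf-8"):
    """Split bytes filenames into (decoded names, still-raw leftovers)."""
    decodable, undecodable = [], []
    for raw in raw_names:
        try:
            decodable.append(raw.decode(encoding))
        except UnicodeDecodeError:
            undecodable.append(raw)
    return decodable, undecodable


# Two entries from the same hypothetical directory: one written by a
# UTF-8 program, one by a Latin-1 program.
sample = [b"caf\xc3\xa9.txt", b"caf\xe9.txt"]
decoded, leftover = classify(sample)
```

No single choice of encoding decodes both entries to the same intended
name, which is why a per-filesystem setting cannot settle the question on
such file systems.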

However, this is completely unrelated to the issue at hand: remove
the "default encoding". Guido was suggesting that then merely the
"file system encoding" takes its place. These are both Python-only
concepts (in fact, Mark Hammond originally called the latter one
"file system default encoding"). I think the notion of "default
encoding" is flawed (for what it was used), and so it should be
removed. You seem to think that the notion of "file system encoding"
is also flawed - but do you infer from that that it also should be
removed?

Regards,
Martin

From barry at python.org  Wed Oct  8 00:01:39 2008
From: barry at python.org (Barry Warsaw)
Date: Tue, 7 Oct 2008 18:01:39 -0400
Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule
In-Reply-To: <48EBD8F2.4090802@gmail.com>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
	<67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org>
	<48EBD8F2.4090802@gmail.com>
Message-ID: <3F6B0210-EE87-4CE4-B487-DF4AAF733637@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Oct 7, 2008, at 5:47 PM, Nick Coghlan wrote:

> Barry Warsaw wrote:
>> On Oct 6, 2008, at 9:48 PM, Raymond Hettinger wrote:
>>
>>> [Barry Warsaw]
>>>> So, we need to come up with a new release schedule for Python 3.0.
>>>> My  suggestion:
>>>> 15-Oct-2008 3.0 beta 4
>>>> 05-Nov-2008 3.0 rc 2
>>>> 19-Nov-2008 3.0 rc 3
>>>> 03-Dec-2008 3.0 final
>>>> Given what still needs to be done, is this a reasonable schedule?
>>>> Do  we need two more betas?
>>
>>> Yes to both questions.
>>
>> I think that's contradictory :).  If we need two betas, then 05-Nov
>> becomes beta 5, 19-Nov is rc 2.  If we don't need another rc then we
>> can still do a final release on 03-Dec, otherwise we probably go 2
>> weeks later.  I don't want to go much later than that though because
>> then we get into the holiday season.
>
> Do we need the full two weeks between rc's? Or is it too much of a
> pain to cut releases 3 weeks in a row?
>
> E.g. something like:
>
> 15-Oct-2008 3.0 beta 4
> 05-Nov-2008 3.0 beta 5
> 19-Nov-2008 3.0 rc 2
> 26-Nov-2008 3.0 rc 3 (if needed)
> 03-Dec-2008 3.0 final

I won't be able to cut another release between the 15th and 5th, so at  
least that one should be 2 weeks.  If we don't need the additional rc,  
then we can release early, which would put us just before the US  
Thanksgiving holiday.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSOvcQ3EjvBPtnXfVAQK5mwP9GQfw3zNvGhJWiSkZ2gQ1LNr0rnmfVmpF
WcDePkz3e5nsOjtkwiN0rlYHIQE9ySPfvtqqrInBW8y97y79mTjiM4S32XHLyAsd
WEWRb0ClcLuZs+JveAb8KF5pO0RlDgX9Dd6puuPr8kGa5aN/rosfsnXra1GrYpj3
JQghQ89JNkE=
=+Ymq
-----END PGP SIGNATURE-----

From martin at v.loewis.de  Wed Oct  8 00:05:45 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 08 Oct 2008 00:05:45 +0200
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
 2.6 or 3.0?
In-Reply-To: <aac2c7cb0810062322v5af3c361o5a53af81824f799a@mail.gmail.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>	<48E68911.6090403@g.nevcal.com>	<3f4107910810031423w75f6dee6sf2f08add6922e5fa@mail.gmail.com>	<48E6A492.4090604@g.nevcal.com>	<aac2c7cb0810031654x3e3c51aeh2e2c742b27597727@mail.gmail.com>	<48E6ED99.2050406@g.nevcal.com>	<aac2c7cb0810032357n66d452by2391be079179a48d@mail.gmail.com>	<48EA9B71.3060109@nevcal.com>	<aac2c7cb0810062218t5e661ddbwd02c8d973e8dbe57@mail.gmail.com>	<48EAF263.5080006@g.nevcal.com>
	<aac2c7cb0810062322v5af3c361o5a53af81824f799a@mail.gmail.com>
Message-ID: <48EBDD39.1030902@v.loewis.de>

> The posix version should hardcode it as b'/'; I only meant windows to
> use UTF-16.  You could perhaps use sys.getfilesystemencoding(), but
> I'm unsure what it does if the encoding isn't an ascii superset (or
> even if that can actually happen.)

POSIX has the notion of a "portable character set", which includes
the ASCII letters, digits, forward slash, and a few others. It requires
this set to be supported on any POSIX implementation.

So the file system encoding should always be an ASCII superset, in
the repertoire superset sense. I don't think POSIX assigns specific
code points, so it doesn't have to be a superset in the coded character
set sense.
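
The difference between the two senses of "superset" can be sketched in
Python 3. This is only an illustration: the helper name is made up for
this example, and cp037 (an EBCDIC code page) merely stands in for a
hypothetical platform encoding that is not ASCII-based.

```python
def is_ascii_superset(encoding):
    """Return True if `encoding` maps every ASCII character to its ASCII byte.

    Hypothetical helper: this checks the "coded character set" sense,
    not merely whether the characters are representable.
    """
    for code_point in range(128):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            # The character is not even in the repertoire.
            return False
        if encoded != bytes([code_point]):
            return False
    return True


# '/' is in the repertoire of cp037, but not at the ASCII code point,
# so hardcoding b'/' as the separator would be wrong for such an encoding.
slash_in_ebcdic = "/".encode("cp037")
```

With a check like this, "utf-8" and "latin-1" pass while "cp037" does
not, even though all three can represent the portable character set.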

I'm sure those VMS users will tell us some day.

Regards,
Martin

From solipsis at pitrou.net  Wed Oct  8 00:07:24 2008
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 08 Oct 2008 00:07:24 +0200
Subject: [Python-3000] [Python-Dev] Python3UnicodeDecodeError
In-Reply-To: <48EBDC11.2040806@v.loewis.de>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<200810071130.35729.victor.stinner@haypocalc.com>
	<48EBC15C.1050305@v.loewis.de>
	<FCBFAB07-525D-4061-86C1-AE09BC426D11@acm.org>
	<ca471dc20810071329i7ccd5651hcff7ebff344e1f30@mail.gmail.com>
	<loom.20081007T213020-298@post.gmane.org>
	<48EBDC11.2040806@v.loewis.de>
Message-ID: <1223417244.14619.2.camel@fsol>

On Wednesday 08 October 2008 at 00:00 +0200, "Martin v. Löwis" wrote:
> You seem to think that the notion of "file system encoding"
> is also flawed - but do you infer from that that it also should be
> removed?

Under the condition we find something better, yes.
Otherwise, let's keep the heuristic.



From martin at v.loewis.de  Wed Oct  8 00:10:47 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 08 Oct 2008 00:10:47 +0200
Subject: [Python-3000] Accessing module state from extension types
In-Reply-To: <e27efe130810070651w39312ae8lb6261f8622b07f07@mail.gmail.com>
References: <e27efe130810070651w39312ae8lb6261f8622b07f07@mail.gmail.com>
Message-ID: <48EBDE67.2090102@v.loewis.de>

> How is this supposed to work?

The design was that you use PyState_FindModule, as an efficient way for
getting a module object if you have the module def. The implementation
fills an index into the module def (which will stay constant across
interpreters), so this should give you your module object anywhere,
in constant time.

If you have specific proposals on how to make this more convenient to
use, please go ahead. (Also, if you think that this is somehow flawed:
this would be the time to mention it)

Regards,
Martin

From solipsis at pitrou.net  Wed Oct  8 00:12:24 2008
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 08 Oct 2008 00:12:24 +0200
Subject: [Python-3000] [python-committers] Proposed Python 3.0 schedule
In-Reply-To: <E16A1C28-F7C9-4CC1-836C-E254AFF5AE7E@python.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<ca471dc20810071328t65ee07faw5e0672be8bdc3fea@mail.gmail.com>
	<E16A1C28-F7C9-4CC1-836C-E254AFF5AE7E@python.org>
Message-ID: <1223417544.14619.4.camel@fsol>

On Tuesday 07 October 2008 at 18:00 -0400, Barry Warsaw wrote:
> On Oct 7, 2008, at 4:28 PM, Guido van Rossum wrote:
> >  15-Oct-2008 3.0 rc 2
> >  05-Nov-2008 3.0 rc 3
> >  19-Nov-2008 3.0 rc 4
> >  03-Dec-2008 3.0 final
> 
> I'm okay with that too.  It does seem odd to go back to beta then  
> release another rc.  What's in a name, anyway? <wink>.  And it is good  
> that more people are downloading it now that it's rc.

I also think it's better to call them rcs and encourage people to play
with them.



From barry at python.org  Wed Oct  8 00:15:31 2008
From: barry at python.org (Barry Warsaw)
Date: Tue, 7 Oct 2008 18:15:31 -0400
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <ca471dc20810071328t65ee07faw5e0672be8bdc3fea@mail.gmail.com>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<ca471dc20810071328t65ee07faw5e0672be8bdc3fea@mail.gmail.com>
Message-ID: <E5E34408-545F-4D34-9663-15F9B65698F9@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Oct 7, 2008, at 4:28 PM, Guido van Rossum wrote:

>  15-Oct-2008 3.0 rc 2
>  05-Nov-2008 3.0 rc 3
>  19-Nov-2008 3.0 rc 4
>  03-Dec-2008 3.0 final

I've updated PEP 361 and the Google calendar with this schedule,  
except that the PEP says that rc3 and rc4 are planned "if needed".

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSOvfg3EjvBPtnXfVAQKDfwP/Sz9Ioe1tIrKtvD7JPG2cg2F+wfDJrc+9
vqfh6/eMWiUIOeSKJu6+gye7oXRcHwQXAPivNza3993HesOu0TjudnwXfkAlfsdE
m09Rh70AXQQiY7JX46etugRC4BwkuNeBo253cvmfo6hPK0ZhOHZSy3H1LkhvvLA6
Cq56CVqDUgs=
=i/Km
-----END PGP SIGNATURE-----

From barry at python.org  Wed Oct  8 00:16:56 2008
From: barry at python.org (Barry Warsaw)
Date: Tue, 7 Oct 2008 18:16:56 -0400
Subject: [Python-3000] [Python-Dev]   Proposed Python 3.0 schedule
In-Reply-To: <3F6B0210-EE87-4CE4-B487-DF4AAF733637@python.org>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
	<67693931-29C5-4FD7-8E83-D97F892F3BE3@python.org>
	<48EBD8F2.4090802@gmail.com>
	<3F6B0210-EE87-4CE4-B487-DF4AAF733637@python.org>
Message-ID: <AE0C94C8-10F3-4000-A36A-815C5A1CCB40@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Oct 7, 2008, at 6:01 PM, Barry Warsaw wrote:

> I won't be able to cut another release between the 15th and 5th, so  
> at least that one should be 2 weeks.  If we don't need the  
> additional rc, then we can release early, which would put us just  
> before the US Thanksgiving holiday.

Er, /3/ weeks between rc2 and rc3.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSOvf2HEjvBPtnXfVAQJDsQP8DRL2gQDMf1eEvgmmijPtVdbfAypZ1XMY
huNzPu91v6dpvrogIP5MJbmJnSnka5yk78JIlkbTU4ZHS0ADsQX+IApU5y/SlO9Y
FDtIqb+NFoVRFj5xQaN/EEqO8kNpq3WPmaEQJ4HHeDUIzcrbsPxfCm+vbePgnGzI
AwhQqCzmX1I=
=aQnH
-----END PGP SIGNATURE-----

From foom at fuhm.net  Wed Oct  8 00:22:13 2008
From: foom at fuhm.net (James Y Knight)
Date: Tue, 7 Oct 2008 18:22:13 -0400
Subject: [Python-3000] [Python-Dev] Proposed Python 3.0 schedule
In-Reply-To: <aac2c7cb0810071345o2c20d3d1v42b45e79f3833c17@mail.gmail.com>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<9212E57A-2292-43DC-9307-F05C0DD91CDA@fuhm.net>
	<48EB1408.1030007@v.loewis.de>
	<BDDCC54F-78D9-40DA-8F63-04DE4DCA0CEE@fuhm.net>
	<aac2c7cb0810071345o2c20d3d1v42b45e79f3833c17@mail.gmail.com>
Message-ID: <AA10B68C-B2D7-47B9-B38B-A8679A14CE8D@fuhm.net>

On Oct 7, 2008, at 4:45 PM, Adam Olsen wrote:
> So what does Qt do when given a file name already using those PUA?
> Looks like they get passed through untouched when decoded, but will
> get translated into invalid names upon encoding.

Well, I'd say that looks like a bug. It should probably decode those
PUA characters as if they were undecodable sequences so that they too
roundtrip properly.

> So you still have
> file names you can't open

In practical terms, I suspect nobody has ever run into a file which  
has this problem. You certainly can't say that is the case for  
Python-3's current behavior; my suspicion is that anyone who uses any  
non-ascii filenames at all will run into issues with Python3's  
behavior at least once.

> , and you're incompatible with what other
> libraries do.

I'm sure there's a situation where that matters, but, at least I can  
run kpdf /any/arbitrary/file.pdf and have it work. And use the KDE  
file chooser, and have it able to browse my files, and choose any  
file, no matter what random characters it has in it. If there is an  
issue with interfacing to another library, the string can be converted  
to whatever the other library expects at the interface point...

People keep claiming that odd filenames are only going to be an issue  
for "backup tools", but I don't think that's true. I think it'll be an  
issue for most any program that reads user-specified files. Whether it  
be by running Python in an ASCII (e.g. "C") locale when there are  
files created with UTF-8 names, or by having copied/downloaded a file  
with an incorrectly encoded name, it's going to come up, and be an  
irritant when it does.
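
A small sketch of both failure modes in Python 3 terms. The byte
strings are made-up file names, the helper is hypothetical, and the
"ascii" codec stands in for what a "C" locale would typically select:

```python
def try_decode(raw_name, encoding):
    """Return the decoded name, or None if the bytes are invalid in `encoding`."""
    try:
        return raw_name.decode(encoding)
    except UnicodeDecodeError:
        return None


# The same visible name, stored as bytes under two different encodings.
utf8_name = "r\u00e9sum\u00e9.pdf".encode("utf-8")
latin1_name = "r\u00e9sum\u00e9.pdf".encode("latin-1")

results = {
    "utf8 name, ascii locale": try_decode(utf8_name, "ascii"),
    "latin1 name, utf8 locale": try_decode(latin1_name, "utf-8"),
    "utf8 name, utf8 locale": try_decode(utf8_name, "utf-8"),
}
```

Only the last combination yields a usable string; the other two are
perfectly ordinary files that a strict decoder cannot name.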

That Qt felt the need to make this change rather strengthens that  
point IMO...

> The only thing going for Qt is that they seem specifically interested
> in latin-1, rather than arbitrary bad names.  The latin-1 strings that
> would correspond to the UTF-8 PUA used would include at least one
> control character, as well as other unusual bits, so it's pretty
> unlikely to encounter a real latin-1 file name like that.


I'd say they're most concerned about files that their users are likely  
to run into, yes.

James

From amauryfa at gmail.com  Wed Oct  8 00:46:12 2008
From: amauryfa at gmail.com (Amaury Forgeot d'Arc)
Date: Wed, 8 Oct 2008 00:46:12 +0200
Subject: [Python-3000] Accessing module state from extension types
In-Reply-To: <48EBDE67.2090102@v.loewis.de>
References: <e27efe130810070651w39312ae8lb6261f8622b07f07@mail.gmail.com>
	<48EBDE67.2090102@v.loewis.de>
Message-ID: <e27efe130810071546k423184fflab5b0caa11f20f0e@mail.gmail.com>

2008/10/8 "Martin v. Löwis" <martin at v.loewis.de>:
>> How is this supposed to work?
>
> The design was that you use PyState_FindModule, as an efficient way for
> getting a module object if you have the module def. The implementation
> fills an index into the module def (which will stay constant across
> interpreters), so this should give you your module object anywhere,
> in constant time.

This is exactly what I was looking for. Thanks!

> If you have specific proposals on how to make this more convenient to
> use, please go ahead. (also, if you think that this somehow flawed:
> this would be the time to mention it)

I suppose that common usage will do things like this:
    ((MyModuleState *)PyModule_GetState(PyState_FindModule(&myModuleDef)))->globalValue
If you want to check for errors, it becomes tedious and some kind of
macro could be useful.
But this can be added later.

Why is the function called PyState_FindModule? It's the only one with
this prefix (with _PyState_AddModule); other functions in the same
module are called PyInterpreterState_*. I suggest renaming it now;
otherwise there may be confusion between "module state" and
"interpreter state"; see the example above:
PyModule_GetState(PyState_FindModule(x)) seems to be a round-trip (or
a no-op) to the casual reader.

And unless you already planned to do so, I think I will start to
document the module API.

-- 
Amaury Forgeot d'Arc

From ncoghlan at gmail.com  Wed Oct  8 11:44:50 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 08 Oct 2008 19:44:50 +1000
Subject: [Python-3000] A plus for naked unbound methods
In-Reply-To: <gcg04o$5bh$1@ger.gmane.org>
References: <gcbhid$trn$1@ger.gmane.org>	<20081006.175034.343188282.mrs@localhost.localdomain>	<gcdksc$uo8$1@ger.gmane.org>	<20081006.212059.465784769.mrs@localhost.localdomain>	<gcecqd$cr9$1@ger.gmane.org>	<48EB358B.8020101@gmail.com>
	<gcg04o$5bh$1@ger.gmane.org>
Message-ID: <48EC8112.9020404@gmail.com>

Terry Reedy wrote:
>> In the case of the comparison methods, they're being retrieved from type
>> rather than object. This difference is made clear when you attempt to
>> invoke the retrieved method:
>>
>>>>> object.__cmp__(1, 2)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: expected 1 arguments, got 2
>>>>> object.__cmp__(2)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: type.__cmp__(x,y) requires y to be a 'type', not a 'int'
>>>>> object.__cmp__(object)
>> 0
> 
> This surprises me, partly because the situation seems to be different in
> 3.0.

That's because the default comparison of object() instances also changes
in Py3k: equality and inequality checks will succeed (using identity
based comparison), but ordering checks will fail with a TypeError.
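
A quick illustration of the Py3k behaviour described above, as a sketch
to be run under Python 3 (the variable names are only for this example):

```python
a = object()
b = object()

# Equality and inequality fall back to identity.
equal_to_self = (a == a)
equal_to_other = (a == b)
differs = (a != b)

# Ordering is not defined for plain object() instances.
try:
    a < b
    ordering_error = None
except TypeError as exc:
    ordering_error = exc
```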

The rich comparisons on type() in 2.6 are actually there in order to
issue a Py3k warning when -3 is defined and an ordering comparison is
invoked on a type, but it appears no such warning is currently present
for default object comparison.

That lack of Py3k warnings is arguably a bug in 2.6, but we would want
to think carefully about the backwards compatibility implications of
defining rich comparisons on object before adding such warnings. As
we've seen, even adding rich comparisons to type was enough to break
some user code (admittedly it was code that made some unwarranted
assumptions and hence was already potentially broken in the face of
metaclasses other than type, but the change did in fact break that code
for cases where it used to work).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From mhammond at skippinet.com.au  Tue Oct  7 15:34:15 2008
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 8 Oct 2008 00:34:15 +1100
Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python
	3.0 schedule
In-Reply-To: <040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
Message-ID: <00e001c92881$68ba93c0$3a2fbb40$@com.au>

[when 2 mailing lists are not enough... :-]

> I'm seeing that people are just starting to download and play with 3.0.
> I expect that we'll start getting more feedback on conversion issues

+1 from this direction too.  pywin32 has recently started looking seriously
at py3k, and while things are in fairly good shape for us who are already
"on the bandwagon", cleaning up a few rough edges would help people's first
impressions - and as they say, you only get one chance at a good first
impression...

More specifically, I think 2to3 is shaping up well.  pywin32 is taking the
approach of "port where possible, but keep in py2x syntax and convert at
'setup.py' time" and this is working out fairly well (in fact, with just a
couple of helpers in pywintypes, I think we can support python 2.3 upwards).
I believe that many projects may well take a similar approach as it allows
them to defer a full commitment to py3k, so doing all we can to support this
might help with that first impression.  My experience is that this could
best be achieved by addressing the following issues before release:
 
* Almost all open 2to3 issues that aren't truly edge cases should be
resolved - if 2to3 doesn't work for people, they may be forced to (even
temporarily) "fork" their project, which will cause concern.  I'll note that
good recent progress is being made here, but it's still worth mentioning...

* Better support for 2to3 in distutils (specifically, the support in
build_py is stale, plus 'build_scripts' and 'install_data' should convert
.py files to py3k syntax.)  An 'example' project that uses py2k syntax and
"just works" on py3k using this strategy might be useful here.

* A standard 'helper script' that allows people to use py3k to execute a
py2x syntax script by auto-converting the code.  I've a 10ish-line script
that uses lib2to3 plus exec() to achieve that result, but a helper in 2to3
for this would be nice.  For a concrete use-case, we want to keep our
distutils script in py2x syntax, but execute it via py3k.  It's very possible
this already exists and I've just missed it...

Either way, I'm fairly confident a pywin32 build for py3k will be available
in the next month or 2 (but as a result, I'm not really in a position to
help with the above for that period...)

Hopefully-helpfully,

Mark



From mhammond at skippinet.com.au  Wed Oct  8 03:04:36 2008
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 8 Oct 2008 12:04:36 +1100
Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python
	3.0 schedule
In-Reply-To: <48EBBB25.70609@v.loewis.de>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
	<00e001c92881$68ba93c0$3a2fbb40$@com.au>
	<48EBBB25.70609@v.loewis.de>
Message-ID: <014001c928e1$dc13af40$943b0dc0$@com.au>

> > * Better support for 2to3 in distutils (specifically, the support in
> > build_py is stale, plus 'build_scripts' and 'install_data' should
> > convert
> > .py files to py3k syntax.)
> 
> Please do create a bug report for that. It sounds like it's easy to
> fix.

Yeah, build_py is fairly easy to fix, but I also needed to extend the
support to build_scripts and install_data.  In addition, some already
reported bugs in 2to3 mean that some files fail to convert, and this breaks
the entire process - so as a result I ended up duplicating lib2to3's
'refactor_items()' but with exceptions being logged and ignored rather than
aborting the process.  Oh - and I deleted the .bak files (a copy of the
sources is converted, not the sources themselves).
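
The log-and-continue loop mentioned above looks roughly like the sketch
below. The names convert_all and convert_one are placeholders for this
illustration, not lib2to3 APIs; convert_one stands in for whatever
performs the actual 2to3 conversion of a single file.

```python
import logging

log = logging.getLogger("convert")


def convert_all(paths, convert_one):
    """Convert every path, logging failures instead of aborting.

    Returns the list of paths that could not be converted.
    """
    failed = []
    for path in paths:
        try:
            convert_one(path)
        except Exception:
            # One bad file should not stop the rest of the build.
            log.exception("failed to convert %s; skipping", path)
            failed.append(path)
    return failed
```

Returning the failures lets the caller decide whether a partial
conversion is acceptable or whether the build should stop after all.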

Please see bugs 4072 and 4073  - but as mentioned below, the lack of a test
case means I didn't supply a tested patch.

> > An 'example' project that uses py2k syntax and
> > "just works" on py3k using this strategy might be useful here.
> 
> Perhaps pywin32 :-?
> 
> I don't think a demo project would do much good, as it doesn't exercise
> all the issues that may occur.

My idea was that the demo project would simply demonstrate the 2to3 concepts
that such a project could use.  pywin32 isn't a good example as it has a
very non-trivial setup.py and a large set of C extensions (the demo I had in
mind could avoid C extensions completely - C developers will already assume
#ifdef will be their friend, but .py code is the unknown...)

It would basically be a 'distutils demo', could have a single .py module and
a single .py script.  setup.py would support both 2.x and 3.x and would
demonstrate how the source is converted to py3k syntax before it is
installed into the py3k distribution.

It would also provide a useful test case - eg, for the distutils bug above,
I'm not sure how I can (a) demonstrate it is currently broken and (b)
demonstrate a patch corrects the problem.

> > * A standard 'helper script' that allows people to use py3k to
> > execute a py2x syntax script by auto-converting the code.  I've 
> > a 10ish-line script that uses lib2to3 plus exec() to achieve that 
> > result, but a helper in 2to3
> > for this would be nice.  For a concrete use-case, we want to keep our
> > distutils script in py2x syntax, but execute it via py3k.  It's very
> > possible this already exists and I've just missed it...
> 
> For the case of setup.py, I was hoping that it could be written in
> compatible syntax even without needing conversion. That worked fine for
> my Django port. Is that not the case for pywin32?

setup.py catches and examines some exceptions.  Consider the more general
case though - pywin32 has a number of tests all of which will also be
maintained in py2x syntax.  It is extremely convenient to be able to
execute:

% py3k run2.py my_test.py etc

And have 'my_test.py' (which is 2.x syntax) be executed directly by py3k
without doing a full 'setup.py install' or manually invoking 2to3 via a temp
file, etc.  As mentioned, 'run2.py' is quite short and just uses
lib2to3+exec, but I'm not sure everyone will work out how to roll their
own...

Specifically, I believe that a script with similar capabilities could be
installed with py3k in the "scripts" directory and advertised as a
reasonable way to directly execute your *scripts* which, although py3x
compatible, are being maintained in py2x syntax.  Below is my quick attempt
at such a script, which I promptly stopped looking at as soon as it worked
(ie, I'm not sure if all those options are needed, etc), but it does let me
execute my tests using py3k directly from the source tree.
 
Cheers,

Mark

---
# This is a Python 3.x script to execute a python 2.x script by 2to3'ing it.
import sys
from lib2to3.refactor import RefactoringTool, get_fixers_from_package

fixers = get_fixers_from_package('lib2to3.fixes')
options = dict(doctests_only=False, fix=[], list_fixes=[], 
               print_function=False, verbose=False,
               write=True)
r = RefactoringTool(fixers, options)
script = sys.argv[1]
data = open(script).read()
print("Converting...")
got = r.refactor_string(data, script)
print("Executing...")
# nuke ourselves from argv
del sys.argv[1]
exec(str(got))
---


From mhammond at skippinet.com.au  Wed Oct  8 03:26:22 2008
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 8 Oct 2008 12:26:22 +1100
Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python
	3.0 schedule
In-Reply-To: <014001c928e1$dc13af40$943b0dc0$@com.au>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>	<00e001c92881$68ba93c0$3a2fbb40$@com.au>	<48EBBB25.70609@v.loewis.de>
	<014001c928e1$dc13af40$943b0dc0$@com.au>
Message-ID: <014201c928e4$e726bd20$b5743760$@com.au>

> at such a script, which I promptly stopped looking at as soon as it
> worked

Which is quite obvious really given that:

> # nuke ourselves from argv
> del sys.argv[1]

is removing the wrong value!
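
What was intended is to drop the helper's own name so that the target
script sees itself as sys.argv[0]. A minimal sketch of that, with the
lib2to3 step replaced by a hypothetical pass-through hook (run_script
and convert are names invented for this example):

```python
import sys


def run_script(argv, convert=lambda source, filename: source):
    """Execute argv[1] so that it sees itself as sys.argv[0].

    `convert` is a hypothetical hook standing in for the lib2to3 step;
    by default the source is executed unchanged.
    """
    script = argv[1]
    with open(script) as f:
        source = f.read()
    # Drop the helper's own name (argv[0]), not the script's (argv[1]).
    sys.argv = argv[1:]
    namespace = {"__name__": "__main__", "__file__": script}
    exec(compile(convert(source, script), script, "exec"), namespace)
    return namespace
```

Called as run_script(sys.argv) from the helper, a command line such as
"run2.py my_test.py etc" leaves my_test.py with ['my_test.py', 'etc'].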

Mark


From musiccomposition at gmail.com  Wed Oct  8 20:59:38 2008
From: musiccomposition at gmail.com (Benjamin Peterson)
Date: Wed, 8 Oct 2008 12:59:38 -0600
Subject: [Python-3000] [python-committers] [Python-Dev] Proposed Python
	3.0 schedule
In-Reply-To: <014001c928e1$dc13af40$943b0dc0$@com.au>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<040FDB9B68C549AE848AC35C0231DD70@RaymondLaptop1>
	<00e001c92881$68ba93c0$3a2fbb40$@com.au> <48EBBB25.70609@v.loewis.de>
	<014001c928e1$dc13af40$943b0dc0$@com.au>
Message-ID: <1afaf6160810081159o18e64e68te95ab94f1198472c@mail.gmail.com>

On 10/7/08, Mark Hammond <mhammond at skippinet.com.au> wrote:
> # This is a Python 3.x script to execute a python 2.x script by 2to3'ing it.
> import sys
> from lib2to3.refactor import RefactoringTool, get_fixers_from_package
>
> fixers = get_fixers_from_package('lib2to3.fixes')
> options = dict(doctests_only=False, fix=[], list_fixes=[],
>                print_function=False, verbose=False,
>                write=True)

Note that only the print_function option is used.

> r = RefactoringTool(fixers, options)
> script = sys.argv[1]
> data = open(script).read()
> print("Converting...")
> got = r.refactor_string(data, script)
> print("Executing...")
> # nuke ourselves from argv
> del sys.argv[1]
> exec(str(got))
> ---
>


-- 
Cheers,
Benjamin Peterson
"There's nothing quite as beautiful as an oboe... except a chicken
stuck in a vacuum cleaner."

From musiccomposition at gmail.com  Wed Oct  8 22:43:22 2008
From: musiccomposition at gmail.com (Benjamin Peterson)
Date: Wed, 8 Oct 2008 15:43:22 -0500
Subject: [Python-3000] Proposed Python 3.0 schedule
In-Reply-To: <48EC4A57.8030608@hlabs.spb.ru>
References: <2A0F6C99-481A-473B-B1A5-7B9FB5A47D6F@python.org>
	<1afaf6160810061752w390ba174m66baf6646175105d@mail.gmail.com>
	<48EC4A57.8030608@hlabs.spb.ru>
Message-ID: <1afaf6160810081343h62bf5ab3pad21d32e68a48313@mail.gmail.com>

On Wed, Oct 8, 2008 at 12:51 AM, Dmitry Vasiliev <dima at hlabs.spb.ru> wrote:
>
> BTW, I think the following issues should be also marked as release blockers:

Agreed and done.

>
> - http://bugs.python.org/issue3714 (nntplib module broken by str to
> unicode conversion)
> - http://bugs.python.org/issue3725 (telnetlib module broken by str to
> unicode conversion)
> - http://bugs.python.org/issue3727 (poplib module broken by str to
> unicode conversion)
>
> --
> Dmitry Vasiliev <dima at hlabs.spb.ru>
> http://hlabs.spb.ru
>



-- 
Cheers,
Benjamin Peterson
"There's nothing quite as beautiful as an oboe... except a chicken
stuck in a vacuum cleaner."

From tjreedy at udel.edu  Fri Oct 10 04:20:02 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Thu, 09 Oct 2008 22:20:02 -0400
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48E68911.6090403@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>	<87od26e3an.fsf@xemacs.org>	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>	<48E2CCEC.9030709@canterbury.ac.nz>	<loom.20081001T101918-594@post.gmane.org>	<871vz0pnuw.fsf@xemacs.org>	<loom.20081001T111216-867@post.gmane.org>	<87wsgso178.fsf@xemacs.org>	<loom.20081001T142457-236@post.gmane.org>	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>	<48E67175.1030103@g.nevcal.com>	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com>
Message-ID: <gcme4f$215$1@ger.gmane.org>

Glenn Linderman wrote:

> My understanding of the Posix file names is that any byte values are 
> valid except "/" and null.  Is this a correct understanding?
> 
> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a 
> Unicode character stream.  Call the original byte stream FOO.  The 
> transformation then produces FOOTR, a set of Unicode code points.  Now 
> FOOTR has a representation in UTF-8, which is a byte stream, call that 
> byte stream FOOTRUTF8.  How, by looking at FOOTR, do you know whether it 
> represents the file name FOO or FOOTRUTF8 ?  And remember that the user 
> might provide a Unicode character stream identical to FOOTR: should it 
> be translated to FOO or FOOTRUTF8 when creating a new file according to 
> the user-supplied name?

If FOOTR is using PUA chars, then I believe that users should not be 
providing such a stream as it would have no defined meaning coming from 
them.


From stephen at xemacs.org  Fri Oct 10 06:38:25 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 10 Oct 2008 13:38:25 +0900
Subject: [Python-3000] [Python-Dev] Filename as byte string in
	python	2.6 or 3.0?
In-Reply-To: <gcme4f$215$1@ger.gmane.org>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org>
	<871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org>
	<87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com> <gcme4f$215$1@ger.gmane.org>
Message-ID: <87k5chxdxa.fsf@xemacs.org>

Terry Reedy writes:

 > If FOOTR is using PUA chars, then I believe that users should not
 > be providing such a stream as it would have no defined meaning
 > coming from them.

But that's precisely what "private use" means: the users provide their
own definitions!  The Unicode standard provides that if a process
doesn't know what those characters mean, it *must* pass them through
*unchanged*, on the assumption that they will eventually reach a user
who knows what they mean.

So this means that (to conform to Unicode) every Python program must
take responsibility for ensuring that it tracks every filename to be
sure that no internal-use PUA characters make it to the "outside
world" where they will be propagated indefinitely by conforming
processes.  This is a substantial burden.

This is precisely the advantage of UTF-8b: the first conforming
process that catches any escapees will scream bloody murder and turn
them over to the Spanish Inquisition, who will torture them on the
rack until they confess that Python did it.<wink>
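
For illustration, the behaviour can be sketched with the
"surrogateescape" error handler found in later Python 3 releases. That
is an assumption of this example: the handler postdates this thread
and is not part of 3.0. The byte string is a made-up Latin-1 file name.

```python
raw = b"caf\xe9.txt"   # Latin-1 bytes, invalid as UTF-8

# Undecodable bytes become lone surrogates (0xE9 -> U+DCE9).
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", "surrogateescape")

# A strict encoder refuses the escaped string rather than passing it on.
try:
    name.encode("utf-8")
    leaked = True
except UnicodeEncodeError:
    leaked = False
```

Because lone surrogates are not valid in UTF-8, the escaped name cannot
be confused with a genuine user-supplied string, and it cannot slip out
through a strict encoder unnoticed.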


From stephen at xemacs.org  Fri Oct 10 08:55:56 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 10 Oct 2008 15:55:56 +0900
Subject: [Python-3000] [Python-Dev] Filename as byte string in	python
 2.6 or 3.0?
In-Reply-To: <48EEDECA.8050107@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org>
	<871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org>
	<87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com> <gcme4f$215$1@ger.gmane.org>
	<87k5chxdxa.fsf@xemacs.org> <48EEDECA.8050107@g.nevcal.com>
Message-ID: <87iqs1x7k3.fsf@xemacs.org>

Glenn Linderman writes:

 > Define a conforming process.

For present purposes, one that promises not to emit invalid Unicode
strings as Unicode.

 > If it is one that handles Unicode with full validation, all is 
 > wonderful, except on platforms that permit non-validated Unicode names 
 > or non-Unicode names.  And these are precisely the platforms for which 
 > these various translation schemes have been proposed.

Those aren't the proposals I've been reading about.  True, people have
suggested limiting the translation schemes with various coverage for
different platforms.  But AFAIK, all platforms supported by Python
allow NFS mounts, not to mention FAT filesystems on removable devices,
so in practice all may encounter arbitrary filenames in arbitrary
encodings.  Nor is it trivial for Python to figure out what
filesystems, let alone encodings, are being used.  So Python has to
support whatever is decided, period, perhaps with more or less complex
heuristics to tune treatment to platforms.

 > And so they will not enforce full validation on file names, even if they 
 > handle full validation on other strings.

Well, in practice that means conforming processes *will* validate at
least some file names, since I don't know of any systems that really
treat file names as anything but strings.

 > And Python will not always be the culprit.

But if the defaults get screwed up here, it will remain one of the
"usual suspects" for a long time to come.  It would be nice to provide
a foundation for doing better than that, but nothing proposed so far
does.  That's not surprising, because they're designed to preserve,
rather than handle, apparently invalid data, in hopes that somebody
else will clean up the mess.

The problem that all the proposals face is that they assume that we
know where the cleaning up will be done, and that we're in control of
the code that will have to do it.

From ncoghlan at gmail.com  Fri Oct 10 10:25:26 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 10 Oct 2008 18:25:26 +1000
Subject: [Python-3000] [Python-Dev] Filename as byte string in	python
 2.6 or 3.0?
In-Reply-To: <48EF04CC.5080503@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>	<87od26e3an.fsf@xemacs.org>	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>	<48E2CCEC.9030709@canterbury.ac.nz>	<loom.20081001T101918-594@post.gmane.org>	<871vz0pnuw.fsf@xemacs.org>	<loom.20081001T111216-867@post.gmane.org>	<87wsgso178.fsf@xemacs.org>	<loom.20081001T142457-236@post.gmane.org>	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>	<48E67175.1030103@g.nevcal.com>	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>	<48E68911.6090403@g.nevcal.com>	<gcme4f$215$1@ger.gmane.org>	<87k5chxdxa.fsf@xemacs.org>	<48EEDECA.8050107@g.nevcal.com>	<87iqs1x7k3.fsf@xemacs.org>
	<48EF04CC.5080503@g.nevcal.com>
Message-ID: <48EF1176.70101@gmail.com>

Glenn Linderman wrote:
> BDFL has chosen scheme
> 2, it seems, unless he changes his mind.  It has the advantages that few
> or no code changes are necessary to handle files that have Unicode
> names, and applications that want to handle files with non-Unicode names
> can, but have to work harder.

More accurately, I would say that Guido has chosen scheme 2
(predominantly Unicode APIs, with partial binary APIs) *for now*, with a
view to reviewing the situation for Python 3.1 or 3.2. By that time some
consensus will hopefully have emerged on how best to deal with invalid
binary filenames while interacting with Unicode-only APIs.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From stephen at xemacs.org  Fri Oct 10 10:39:57 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 10 Oct 2008 17:39:57 +0900
Subject: [Python-3000] [Python-Dev] Filename as byte string in	python
 2.6 or 3.0?
In-Reply-To: <48EF04CC.5080503@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<2040788B-98C7-4AA2-94AD-2E85E7DF07E8@fuhm.net>
	<87od26e3an.fsf@xemacs.org>
	<6C26CFCA-21E0-4F6B-A314-57358EC08D55@fuhm.net>
	<48E2CCEC.9030709@canterbury.ac.nz>
	<loom.20081001T101918-594@post.gmane.org>
	<871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org>
	<87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com> <gcme4f$215$1@ger.gmane.org>
	<87k5chxdxa.fsf@xemacs.org> <48EEDECA.8050107@g.nevcal.com>
	<87iqs1x7k3.fsf@xemacs.org> <48EF04CC.5080503@g.nevcal.com>
Message-ID: <87ej2oyhb6.fsf@xemacs.org>

Glenn Linderman writes:

 > OK, but file names are not (always) Unicode strings.  So it is possible 
 > to have a conforming process that still manipulates non-Unicode 
 > filenames, as long as it doesn't emit them in places where Unicode 
 > strings are required.

Sure.  My point is that "emission elimination" is very hard to arrange
under any of the not-quite-Unicode schemes that have been discussed,
and the effort to achieve even "emission reduction" will almost never
be made if we default to imperfect but pretty good sanitization.

It will encourage the same kind of coding that we're familiar with
already: things almost always work for most programmers if they assume
that it's OK to ignore the problem.

 > So when a foreign file system is mounted, the driver for that file
 > system gets "first crack" at defining legal names and how to
 > handle files that don't have legal names.  For example, on Windows,
 > I wouldn't be surprised to see NFS drivers that suppress
 > non-Unicode names at the 16-bit API level.  At that point, Python
 > would not be responsible for

... much of anything.  Nice work if we can get it; I won't be holding
my breath.  For starters, USB memory sticks are still all FAT format
by default.

 > There are lots of kinds of strings.

Not in Python, though.  There are Unicode strings and there are bytes
objects.  If the latter were acceptable for this application, we'd
have no problem.


From rhamph at gmail.com  Fri Oct 10 14:16:21 2008
From: rhamph at gmail.com (Adam Olsen)
Date: Fri, 10 Oct 2008 06:16:21 -0600
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48EF04CC.5080503@g.nevcal.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com> <gcme4f$215$1@ger.gmane.org>
	<87k5chxdxa.fsf@xemacs.org> <48EEDECA.8050107@g.nevcal.com>
	<87iqs1x7k3.fsf@xemacs.org> <48EF04CC.5080503@g.nevcal.com>
Message-ID: <aac2c7cb0810100516n24323eedg7c3f1a2f97869ce4@mail.gmail.com>

On Fri, Oct 10, 2008 at 1:31 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/9/2008 11:55 PM, came the following characters from the
> keyboard of Stephen J. Turnbull:
>> The problem that all the proposals face is that they assume that we
>> know where the cleaning up will be done, and that we're in control of
>> the code that will have to do it.
>
>
> I think this is your expression of "Applications that do XXX may need
> modification to handle all files" :)
>
> The object wrapper gives us the right control, but likely forces more
> changes to applications than the other schemes.  BDFL has chosen scheme 2,
> it seems, unless he changes his mind.  It has the advantages that few or no
> code changes are necessary to handle files that have Unicode names, and
> applications that want to handle files with non-Unicode names can, but have
> to work harder.  If Python had come with a file path manipulation object
> from the beginning, (3) might be a better scheme, but, as much as I like and
> wish for scheme (3), scheme (2) has a better migration story, and scheme (1)
> basically only solves some of the problems some of the times, and can cause
> other problems due to data puns (although the chances of doing so are
> somewhat low, and approach zero in my environment, and likely in many
> environments... but then in my environment, and likely in many environments,
> they also don't actually solve any problems either, so I'd be just as well
> off without it).

There's a spectrum of choices, depending on how soon you want the API to fail:
* bytes/unicode distinct APIs.  unicode never fails, but does skip.
* bytes/unicode automatic.  return bytes for invalid names; fails when
  concatenated to unicode strings
* invalid unicode.  Works internally, but fails when exposed to external APIs
* FilePath object.  I can't see a difference from invalid unicode?
* transformed unicode.  Works internally, can be round-tripped through
  external APIs, but fails if those external APIs touch the filesystem.
  Also breaks valid file names.

Since none of the options eliminate failure (and none can, short of
universally redefining UTF-8 or making the filesystem validate the
encoding), we instead pick the lesser evil.  Although the first option
does skip file names, it turns out to be the least surprising and
least magical.  Indeed, it's the only option that never fails while
listing directory contents!
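
The first two options can be sketched with `os.listdir`, whose result type
follows its argument type in Python 3. This is only an illustration under
assumptions: it uses a temporary directory and a made-up, perfectly valid
file name, and `os.fsencode` requires Python 3.2 or newer. It shows the two
distinct views of the same directory, plus the failure you get when a bytes
name is concatenated to a unicode string.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Create one file with a hypothetical, valid name.
    open(os.path.join(directory, "example.txt"), "w").close()

    # str argument: names come back as str.
    text_names = os.listdir(directory)
    # bytes argument: names come back as bytes, untouched by any decoding.
    byte_names = os.listdir(os.fsencode(directory))

# Mixing the two types fails immediately rather than producing garbage.
try:
    "prefix-" + byte_names[0]
    mixing_failed = False
except TypeError:
    mixing_failed = True

print(text_names, byte_names, mixing_failed)
```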


-- 
Adam Olsen, aka Rhamphoryncus

From qrczak at knm.org.pl  Fri Oct 10 15:31:22 2008
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 10 Oct 2008 15:31:22 +0200
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <gcme4f$215$1@ger.gmane.org>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<871vz0pnuw.fsf@xemacs.org> <loom.20081001T111216-867@post.gmane.org>
	<87wsgso178.fsf@xemacs.org> <loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com> <gcme4f$215$1@ger.gmane.org>
Message-ID: <3f4107910810100631o3b006f5dne5d43d9dd6b3fa8@mail.gmail.com>

2008/10/10 Terry Reedy <tjreedy at udel.edu>:

> If FOOTR is using PUA chars, then I believe that users should not be
> providing such a stream as it would have no defined meaning coming from
> them.

PUA characters already have a UTF-8 representation, so PUA is the worst
choice compared with UTF-8b and U+0000, which do preserve the encoding of
all existing UTF-8 filenames. I am using bits of PUA (although not for
filenames for now) and I would be annoyed if Python mangled them.
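
To make the collision concrete, here is a small sketch (my own illustration,
with U+E000 picked arbitrarily as a sample private-use code point). A PUA
character passes through strict UTF-8 like any other character, so a name
that legitimately contains it cannot be told apart from one where it was
used as an escape. A lone surrogate, by contrast, is rejected by a strict
encoder.

```python
import unicodedata

# Arbitrary sample from the BMP Private Use Area (U+E000..U+F8FF).
pua_char = "\ue000"
category = unicodedata.category(pua_char)   # "Co" means private use

# Strict UTF-8 accepts it and round-trips it unchanged.
encoded = pua_char.encode("utf-8")
decoded = encoded.decode("utf-8")

# A lone surrogate has no strict UTF-8 form, so misuse is detectable.
try:
    "\udc80".encode("utf-8")
    surrogate_is_valid = True
except UnicodeEncodeError:
    surrogate_is_valid = False

print(category, encoded, decoded == pua_char, surrogate_is_valid)
```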

-- 
Marcin Kowalczyk
qrczak at knm.org.pl
http://qrnik.knm.org.pl/~qrczak/

From martin at v.loewis.de  Sun Oct 12 21:13:56 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 12 Oct 2008 21:13:56 +0200
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
 2.6 or 3.0?
In-Reply-To: <3f4107910810100631o3b006f5dne5d43d9dd6b3fa8@mail.gmail.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>	<871vz0pnuw.fsf@xemacs.org>
	<loom.20081001T111216-867@post.gmane.org>	<87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>	<48E67175.1030103@g.nevcal.com>	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>	<48E68911.6090403@g.nevcal.com>
	<gcme4f$215$1@ger.gmane.org>
	<3f4107910810100631o3b006f5dne5d43d9dd6b3fa8@mail.gmail.com>
Message-ID: <48F24C74.80407@v.loewis.de>

> I am using bits of PUA

Which bits specifically?

Regards,
Martin

From qrczak at knm.org.pl  Sun Oct 12 21:25:08 2008
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 12 Oct 2008 21:25:08 +0200
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
	2.6 or 3.0?
In-Reply-To: <48F24C74.80407@v.loewis.de>
References: <200809271404.25654.victor.stinner@haypocalc.com>
	<87wsgso178.fsf@xemacs.org> <loom.20081001T142457-236@post.gmane.org>
	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>
	<48E67175.1030103@g.nevcal.com>
	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>
	<48E68911.6090403@g.nevcal.com> <gcme4f$215$1@ger.gmane.org>
	<3f4107910810100631o3b006f5dne5d43d9dd6b3fa8@mail.gmail.com>
	<48F24C74.80407@v.loewis.de>
Message-ID: <3f4107910810121225n499365d4k49feae757c7949c5@mail.gmail.com>

2008/10/12 "Martin v. Löwis" <martin at v.loewis.de>:

>> I am using bits of PUA
>
> Which bits specifically?

Well, not bits in the technical sense, just a small range U+E650...U+E677.

-- 
Marcin Kowalczyk
qrczak at knm.org.pl
http://qrnik.knm.org.pl/~qrczak/

From martin at v.loewis.de  Sun Oct 12 21:49:28 2008
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sun, 12 Oct 2008 21:49:28 +0200
Subject: [Python-3000] [Python-Dev] Filename as byte string in python
 2.6 or 3.0?
In-Reply-To: <3f4107910810121225n499365d4k49feae757c7949c5@mail.gmail.com>
References: <200809271404.25654.victor.stinner@haypocalc.com>	<87wsgso178.fsf@xemacs.org>
	<loom.20081001T142457-236@post.gmane.org>	<fb6fbf560810031035j35c18695m9bb9a487571157c9@mail.gmail.com>	<48E67175.1030103@g.nevcal.com>	<66746C74-D922-4501-8CEF-FDC6D2BBBB87@fuhm.net>	<48E68911.6090403@g.nevcal.com>
	<gcme4f$215$1@ger.gmane.org>	<3f4107910810100631o3b006f5dne5d43d9dd6b3fa8@mail.gmail.com>	<48F24C74.80407@v.loewis.de>
	<3f4107910810121225n499365d4k49feae757c7949c5@mail.gmail.com>
Message-ID: <48F254C8.80205@v.loewis.de>

>>> I am using bits of PUA
>> Which bits specifically?
> 
> Well, not bits in the technical sense, just a small range U+E650...U+E677.

:-) That's exactly what I wanted to know. If Python ever uses PUA
characters, there shouldn't be any collisions with that range.

Regards,
Martin


From skip.montanaro at gmail.com  Thu Oct 16 18:01:08 2008
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Thu, 16 Oct 2008 11:01:08 -0500
Subject: [Python-3000] Backporting multiprocessing?
Message-ID: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>

I'd like to try backporting the multiprocessing module to Python 2.4.  My first
problem appears to be the reliance on a complete(?) rewrite of the buffer stuff.

Any clues about transforming this code would be much appreciated.

(Note: I'm backporting because the Python 2.6 version appears to be much more
robust than the 0.52 third-party release.)

Thanks,

Skip

From jnoller at gmail.com  Thu Oct 16 18:34:28 2008
From: jnoller at gmail.com (Jesse Noller)
Date: Thu, 16 Oct 2008 12:34:28 -0400
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
Message-ID: <4222a8490810160934g53f4372aw582f864bb4db1230@mail.gmail.com>

Hi Skip,

I had been approached to do the exact same thing. Are you trying to
backport the trunk version (2.6) or py3000?

On Thu, Oct 16, 2008 at 12:01 PM, Skip Montanaro
<skip.montanaro at gmail.com> wrote:
> I'd like to try backporting the multiprocessing module to Python 2.4.  My first
> problem appears to be the reliance on a complete(?) rewrite of the buffer stuff.
>
> Any clues about transforming this code would be much appreciated.
>
> (Note: I'm backporting because the Python 2.6 version appears to be much more
> robust than the 0.52 third-party release.)
>
> Thanks,
>
> Skip
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/jnoller%40gmail.com
>

From skip.montanaro at gmail.com  Thu Oct 16 18:36:54 2008
From: skip.montanaro at gmail.com (Skip Montanaro)
Date: Thu, 16 Oct 2008 11:36:54 -0500
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <4222a8490810160934g53f4372aw582f864bb4db1230@mail.gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<4222a8490810160934g53f4372aw582f864bb4db1230@mail.gmail.com>
Message-ID: <60bb7ceb0810160936s63c7bc08p1411fd83f7550439@mail.gmail.com>

>  I had been approached to do the exact same thing, are you trying to
>  back port the trunk version (2.6) or py3000?

I'm trying to backport from 2.6.  It appears that the buffer stuff is
completely new though (backported from Python 3.0).

S

From jnoller at gmail.com  Thu Oct 16 18:37:26 2008
From: jnoller at gmail.com (Jesse Noller)
Date: Thu, 16 Oct 2008 12:37:26 -0400
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <4222a8490810160934g53f4372aw582f864bb4db1230@mail.gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<4222a8490810160934g53f4372aw582f864bb4db1230@mail.gmail.com>
Message-ID: <4222a8490810160937r35bf0b28m585f608930156553@mail.gmail.com>

Also note, for python 2.4/2.5 you are going to *need* the patch to bug
http://bugs.python.org/issue874900

On Thu, Oct 16, 2008 at 12:34 PM, Jesse Noller <jnoller at gmail.com> wrote:
> Hi Skip,
>
> I had been approached to do the exact same thing, are you trying to
> back port the trunk version (2.6) or py3000?
>
> On Thu, Oct 16, 2008 at 12:01 PM, Skip Montanaro
> <skip.montanaro at gmail.com> wrote:
>> I'd like to try backporting the multiprocessing module to Python 2.4.  My first
>> problem appears to be the reliance on a complete(?) rewrite of the buffer stuff.
>>
>> Any clues about transforming this code would be much appreciated.
>>
>> (Note: I'm backporting because the Python 2.6 version appears to be much more
>> robust than the 0.52 third-party release.)
>>
>> Thanks,
>>
>> Skip
>

From lists at cheimes.de  Thu Oct 16 21:28:31 2008
From: lists at cheimes.de (Christian Heimes)
Date: Thu, 16 Oct 2008 21:28:31 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
Message-ID: <gd84l0$v7$1@ger.gmane.org>

Skip Montanaro wrote:
> I'd like to try backporting the multiprocessing module to Python 2.4.  My first
> problem appears to be the reliance on a complete(?) rewrite of the buffer stuff.
> 
> Any clues about transforming this code would be much appreciated.
> 
> (Note: I'm backporting because the Python 2.6 version appears to be much more
> robust than the 0.52 third-party release.)

Good timing, Skip! I was planning to do a backport to 2.5, too. I've some
experience with both the old and the new buffer protocol. I might be of
some assistance to you.

I'd like to make as much of the trunk code compatible with 2.5 and 2.4 as
possible. Let's see how far we can get with a bunch of macros and
#ifdefs.

Christian


From jnoller at gmail.com  Thu Oct 16 21:30:56 2008
From: jnoller at gmail.com (Jesse Noller)
Date: Thu, 16 Oct 2008 15:30:56 -0400
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <gd84l0$v7$1@ger.gmane.org>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<gd84l0$v7$1@ger.gmane.org>
Message-ID: <4222a8490810161230xca234c6y15c8e0733fb62b2e@mail.gmail.com>

Do we want to start a google code project for this given all three of
us are interested in this? :)

On Thu, Oct 16, 2008 at 3:28 PM, Christian Heimes <lists at cheimes.de> wrote:
> Skip Montanaro wrote:
>>
>> I'd like to try backporting the multiprocessing module to Python 2.4.  My
>> first
>> problem appears to be the reliance on a complete(?) rewrite of the buffer
>> stuff.
>>
>> Any clues about transforming this code would be much appreciated.
>>
>> (Note: I'm backporting because the Python 2.6 version appears to be much
>> more
>> robust than the 0.52 third-party release.)
>
> Good timing, Skip! I was planning to do a backport to 2.5, too. I've some
> experience with both the old and the new buffer protocol. I might be of some
> assistance to you.
>
> I like to make as much code of the trunk version compatible with 2.5 and 2.4
> as possible. Let's see how far we can get with a bunch of macros and
> #ifdefs.
>
> Christian
>

From lists at cheimes.de  Thu Oct 16 21:38:27 2008
From: lists at cheimes.de (Christian Heimes)
Date: Thu, 16 Oct 2008 21:38:27 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <4222a8490810161230xca234c6y15c8e0733fb62b2e@mail.gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>	
	<gd84l0$v7$1@ger.gmane.org>
	<4222a8490810161230xca234c6y15c8e0733fb62b2e@mail.gmail.com>
Message-ID: <48F79833.9000301@cheimes.de>

Jesse Noller wrote:
> Do we want to start a google code project for this given all three of
> us are interested in this? :)

Do we need (yet) another Google code project? Isn't svn.python.org 
sufficient for our needs? I'm -0 on a Google code project but I'll give 
you my gmail account if you insist on one.

Christian

From jnoller at gmail.com  Thu Oct 16 22:03:14 2008
From: jnoller at gmail.com (Jesse Noller)
Date: Thu, 16 Oct 2008 16:03:14 -0400
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <48F79833.9000301@cheimes.de>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<gd84l0$v7$1@ger.gmane.org>
	<4222a8490810161230xca234c6y15c8e0733fb62b2e@mail.gmail.com>
	<48F79833.9000301@cheimes.de>
Message-ID: <4222a8490810161303t2a5edd14i934e0c7bc7f3f39e@mail.gmail.com>

On Thu, Oct 16, 2008 at 3:38 PM, Christian Heimes <lists at cheimes.de> wrote:
> Jesse Noller wrote:
>>
>> Do we want to start a google code project for this given all three of
>> us are interested in this? :)
>
> Do we need (yet) another Google code project? Isn't svn.python.org
> sufficient for our needs? I'm -0 on a Google code project but I'll give you
> my gmail account if you insist on one.
>
> Christian
>

I've not used svn.python.org for personal side projects. Also,
ideally the backport would be stand-alone and installable from the
package index.

From ncoghlan at gmail.com  Fri Oct 17 00:06:30 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 17 Oct 2008 08:06:30 +1000
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
Message-ID: <48F7BAE6.4070108@gmail.com>

Skip Montanaro wrote:
> (Note: I'm backporting because the Python 2.6 version appears to be much more
> robust than the 0.52 third-party release.)

As Jesse points out, some of that robustness comes from long-standing
bugs in the core getting fixed as a result of the addition of the
multiprocessing unit tests to the standard library test suite.

Not trying to discourage the project, just pointing out that it may not
be as effective as hoped without patching the older versions of the
interpreter.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------

From lists at cheimes.de  Fri Oct 17 00:21:39 2008
From: lists at cheimes.de (Christian Heimes)
Date: Fri, 17 Oct 2008 00:21:39 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <48F7BAE6.4070108@gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<48F7BAE6.4070108@gmail.com>
Message-ID: <48F7BE73.3010303@cheimes.de>

Nick Coghlan wrote:
> As Jesse points out, some of that robustness comes from long-standing
> bugs in the core getting fixed as a result of the addition of the
> multiprocessing unit tests to the standard library test suite.
> 
> Not trying to discourage the project, just pointing out that it may not
> be as effective as hoped without patching the older versions of the
> interpreter.

Oh h...
Are you able to recall a list of the most important bug fixes? Maybe we 
can get the bug fixes into 2.5.3 before it's too late.

Christian

From barry at python.org  Fri Oct 17 04:06:31 2008
From: barry at python.org (Barry Warsaw)
Date: Thu, 16 Oct 2008 22:06:31 -0400
Subject: [Python-3000] No rc2 tonight
Message-ID: <0EACC1A0-EA85-4EC3-BC80-4BA6CDFD3556@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I was supposed to release 3.0rc2 last night, but events caught up with  
me.  In going through the release blockers tonight, I do not think we  
are ready to release.  Here are the issues that need addressing:

Showstoppers:

3775 Update RELNOTES file
   - Don't worry about this one
3626 python3.0 interpreter on Cygwin ignores all arguments
   - This one appears to have an approved patch, but I do not have Cygwin to
     verify.  I'm happy if someone who does can verify, and then Amaury should
     be free to apply the patch.
3723 Py_NewInterpreter does not work
   - This one seems serious and in need of attention.  Can someone please
     take a look at this issue?
3799 Byte/string inconsistencies between different dbm modules
   - This one also seems serious, and Guido bumped this to a release blocker
     so that it would be looked at before rc2.  Well, here we are at rc2!
1210 imaplib does not run under Python 3
3727 poplib module broken by str to unicode conversion
   - These both have patches that need review
3574 compile() cannot decode Latin-1 source encodings
   - Brett is approved to land this one

Deferred:

I deferred these but I would really like to get them fixed before rc2.

3664 Pickler.dump from a badly initialized Pickler segfaults
   - This one needs a proper patch with a test
3714 nntplib module broken by str to unicode conversion
   - This issue seems pretty far from resolution

If these issues can be resolved or deferred, I will try again to make  
a release tomorrow (Friday) night.  Otherwise, rc2 may have to wait  
until after November 1st.

Cheers,
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSPfzJ3EjvBPtnXfVAQLEsQQAhALzGwpK/Eu5BmnasibGbsIzdYW7CSJQ
1uvrYbdGCY1nR4pl5WoB+xt6mqqVgUDZjEQmY2TatGKmWk8B7T/2UjZtmmpnNFom
9EFYffP5pm55wW4bzerGsfJJo1Xfsb2Q9pYcYj99TozCiE62bJkL7CTrmheutqft
7MMlTXRJJLE=
=cvx7
-----END PGP SIGNATURE-----

From alexandre at peadrop.com  Fri Oct 17 07:14:05 2008
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Fri, 17 Oct 2008 01:14:05 -0400
Subject: [Python-3000] No rc2 tonight
In-Reply-To: <0EACC1A0-EA85-4EC3-BC80-4BA6CDFD3556@python.org>
References: <0EACC1A0-EA85-4EC3-BC80-4BA6CDFD3556@python.org>
Message-ID: <acd65fa20810162214p506d1704y7136758a23950d14@mail.gmail.com>

On Thu, Oct 16, 2008 at 10:06 PM, Barry Warsaw <barry at python.org> wrote:
> I deferred these but I would really like to get them fixed before rc2.
>
> 3664 Pickler.dump from a badly initialized Pickler segfaults
>  - This one needs a proper patch with a test

I posted the patch for that one. Please review.

Thank you,
-- Alexandre

From victor.stinner at haypocalc.com  Fri Oct 17 10:36:38 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Fri, 17 Oct 2008 10:36:38 +0200
Subject: [Python-3000] No rc2 tonight
In-Reply-To: <0EACC1A0-EA85-4EC3-BC80-4BA6CDFD3556@python.org>
References: <0EACC1A0-EA85-4EC3-BC80-4BA6CDFD3556@python.org>
Message-ID: <200810171036.38341.victor.stinner@haypocalc.com>

> 1210 imaplib does not run under Python 3
> 3727 poplib module broken by str to unicode conversion
>    - These both have patches that need review
> 3714 nntplib module broken by str to unicode conversion
>    - This issue seems pretty far from resolution

I worked on these modules. First I tried to use unicode everywhere, but then I
realized that each email can use a different encoding. Using a fixed charset
is meaningless, which is why I wrote new patches (for poplib and imaplib) to
return emails (and other status messages) as byte strings.
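
A minimal sketch of why bytes are the right thing for the transport layer to
return, assuming the standard `email` package is used afterwards (this is an
illustration, not part of the poplib or imaplib patches). The two raw
messages and the `body_text` helper are hypothetical; each message declares
its own charset, so decoding can only happen per message, after parsing.

```python
from email import message_from_bytes

# Two hypothetical messages, as a mail server might deliver them: same text,
# different encodings, each declared in its own Content-Type header.
raw_messages = [
    b"Content-Type: text/plain; charset=iso-8859-1\n\ncaf\xe9\n",
    b"Content-Type: text/plain; charset=utf-8\n\ncaf\xc3\xa9\n",
]


def body_text(raw):
    """Decode one message body using the charset that message declares."""
    message = message_from_bytes(raw)
    charset = message.get_content_charset() or "ascii"
    payload = message.get_payload(decode=True)  # bytes for non-multipart
    return payload.decode(charset).strip()


bodies = [body_text(raw) for raw in raw_messages]
print([ascii(body) for body in bodies])
```

Decoding both messages with any single fixed charset would corrupt at least
one of them, which is the problem described above.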

Since nntplib also transports emails, I think that my current patch
(nntplib_unicode.patch) is invalid and I should write another one using 
bytes. If I don't have time to fix it quickly, please leave 3714 at 
state "deferred blocker".

Barry: you closed issue #4125, but the specified revision number is the
commit fixing issue #3988. runtests.sh has to use the -bb flag!
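
For readers unfamiliar with the flag, a small sketch of what -bb changes
(my illustration, not taken from runtests.sh; it assumes Python 3.7+ for
`capture_output`). Comparing bytes with str is normally just False; with
-bb the interpreter raises BytesWarning as an error, which is how the test
suite catches leftover str/bytes confusion.

```python
import subprocess
import sys

# A comparison that mixes bytes and str.
snippet = "print(b'abc' == 'abc')"

# Without the flag the comparison quietly evaluates to False.
plain = subprocess.run(
    [sys.executable, "-c", snippet], capture_output=True, text=True
)

# With -bb the same comparison raises BytesWarning and the run fails.
strict = subprocess.run(
    [sys.executable, "-bb", "-c", snippet], capture_output=True, text=True
)

print(plain.returncode, plain.stdout.strip())
print(strict.returncode, "BytesWarning" in strict.stderr)
```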

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From ncoghlan at gmail.com  Fri Oct 17 11:06:13 2008
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 17 Oct 2008 19:06:13 +1000
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <48F7BE73.3010303@cheimes.de>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<48F7BAE6.4070108@gmail.com> <48F7BE73.3010303@cheimes.de>
Message-ID: <48F85585.8090201@gmail.com>

Christian Heimes wrote:
> Nick Coghlan wrote:
>> As Jesse points out, some of that robustness comes from long-standing
>> bugs in the core getting fixed as a result of the addition of the
>> multiprocessing unit tests to the standard library test suite.
>>
>> Not trying to discourage the project, just pointing out that it may not
>> be as effective as hoped without patching the older versions of the
>> interpreter.
> 
> Oh h...
> Are you able to recall a list of the most important bug fixes? Maybe we
> can get the bug fixes into 2.5.3 before it's too late.

The one Jesse linked in his python-dev post was the one that blocked it
the longest:
http://bugs.python.org/issue874900

However, if I'm reading the discussion in the tracker correctly, the fix
was applied to all 3 branches (2.5, trunk, 3k). So it is only people
using versions <= 2.5.2 that will suffer that particular problem.

I think there were a couple of others as well, but it would take a trawl
through the py3k mailing list archives to figure out what they were (I'm
pretty sure Jesse posted a list of the issues that needed to be fixed to
get the multiprocessing unit tests passing reliably).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------

From barry at python.org  Fri Oct 17 16:00:51 2008
From: barry at python.org (Barry Warsaw)
Date: Fri, 17 Oct 2008 10:00:51 -0400
Subject: [Python-3000] No rc2 tonight
In-Reply-To: <200810171036.38341.victor.stinner@haypocalc.com>
References: <0EACC1A0-EA85-4EC3-BC80-4BA6CDFD3556@python.org>
	<200810171036.38341.victor.stinner@haypocalc.com>
Message-ID: <0984C612-FB9D-4D07-8B83-635504D6BFBE@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Oct 17, 2008, at 4:36 AM, Victor Stinner wrote:

>> 1210 imaplib does not run under Python 3
>> 3727 poplib module broken by str to unicode conversion
>>   - These both have patches that need review
>> 3714 nntplib module broken by str to unicode conversion
>>   - This issue seems pretty far from resolution
>
> I worked on these modules. First I tried to use unicode everywhere but
> then I realized that each email can use a different encoding. Using a
> fixed charset is meaningless, that's why I wrote new patches (for poplib
> and imaplib) to return emails (and other status messages) as bytes strings.
>
> Since nntplib also transports emails, I think that my current patch
> (nntplib_unicode.patch) is invalid and I should write another one using
> bytes. If I don't have time to fix it quickly, please leave 3714 at
> state "deferred blocker".

Ok.

> Barry: you closed the issue #4125 but the specified revision number is the
> commit fixing issue #3988. runtests.sh has to use the -bb flag!

Yeah, finger fart there.  It's now committed.

-Barry

From skip at pobox.com  Fri Oct 17 16:44:27 2008
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 17 Oct 2008 09:44:27 -0500
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <gd84l0$v7$1@ger.gmane.org>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<gd84l0$v7$1@ger.gmane.org>
Message-ID: <18680.42187.731857.59208@montanaro-dyndns-org.local>


    Christian> I like to make as much code of the trunk version compatible
    Christian> with 2.5 and 2.4 as possible. Let's see how far we can get
    Christian> with a bunch of macros and #ifdefs.

I'll follow your lead. ;-)

Skip

From skip at pobox.com  Fri Oct 17 16:45:37 2008
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 17 Oct 2008 09:45:37 -0500
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <4222a8490810161230xca234c6y15c8e0733fb62b2e@mail.gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<gd84l0$v7$1@ger.gmane.org>
	<4222a8490810161230xca234c6y15c8e0733fb62b2e@mail.gmail.com>
Message-ID: <18680.42257.84013.537080@montanaro-dyndns-org.local>


    Jesse> Do we want to start a google code project for this given all
    Jesse> three of us are interested in this? :)

Maybe the svn repo could grow a backports sibling of sandbox.

Skip

From skip at pobox.com  Fri Oct 17 16:47:10 2008
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 17 Oct 2008 09:47:10 -0500
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <4222a8490810161303t2a5edd14i934e0c7bc7f3f39e@mail.gmail.com>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<gd84l0$v7$1@ger.gmane.org>
	<4222a8490810161230xca234c6y15c8e0733fb62b2e@mail.gmail.com>
	<48F79833.9000301@cheimes.de>
	<4222a8490810161303t2a5edd14i934e0c7bc7f3f39e@mail.gmail.com>
Message-ID: <18680.42350.962999.456461@montanaro-dyndns-org.local>


    Jesse> I've not used svn.python.org for personal side/projects - also,
    Jesse> ideally the back port would be stand-alone and package-index
    Jesse> installable

I wouldn't call this really a personal/side project.  OTOH, firing up a
Google Code project means you can admit project developers without giving
them the keys to the kingdom so-to-speak.

Skip

From skip at pobox.com  Fri Oct 17 16:50:34 2008
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 17 Oct 2008 09:50:34 -0500
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <48F7BE73.3010303@cheimes.de>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<48F7BAE6.4070108@gmail.com> <48F7BE73.3010303@cheimes.de>
Message-ID: <18680.42554.791704.321274@montanaro-dyndns-org.local>


    Christian> Oh h...  Are you able to recall a list of the most important
    Christian> bug fixes? Maybe we can get the bug fixes into 2.5.3 before
    Christian> it's too late.

Maybe doing the modest amount of translation required of the 2.6 unit tests
so they run under 0.52 would help.  See what fails and then see what fixes
correspond to fixing those failing tests.

Skip


From jnoller at gmail.com  Fri Oct 17 16:55:22 2008
From: jnoller at gmail.com (Jesse Noller)
Date: Fri, 17 Oct 2008 10:55:22 -0400
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <18680.42350.962999.456461@montanaro-dyndns-org.local>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<gd84l0$v7$1@ger.gmane.org>
	<4222a8490810161230xca234c6y15c8e0733fb62b2e@mail.gmail.com>
	<48F79833.9000301@cheimes.de>
	<4222a8490810161303t2a5edd14i934e0c7bc7f3f39e@mail.gmail.com>
	<18680.42350.962999.456461@montanaro-dyndns-org.local>
Message-ID: <4222a8490810170755v15e18b4er8bfecefcc55975e1@mail.gmail.com>

On Fri, Oct 17, 2008 at 10:47 AM,  <skip at pobox.com> wrote:
>
>    Jesse> I've not used svn.python.org for personal side/projects - also,
>    Jesse> ideally the back port would be stand-alone and package-index
>    Jesse> installable
>
> I wouldn't call this really a personal/side project.  OTOH, firing up a
> Google Code project means you can admit project developers without giving
> them the keys to the kingdom so-to-speak.
>
> Skip
>

Fair enough :)

I fired up http://code.google.com/p/python-multiprocessing/ last
night, and added you and Christian - anyone else wanting in on this
can ping me.

Skip - I know you had some work you already had on the bench for
pulling out-repackaging the MP stuff, do you want to commit that and
then we can work from there?

We should also probably take this off the dev list

-jesse

From lists at cheimes.de  Fri Oct 17 20:56:58 2008
From: lists at cheimes.de (Christian Heimes)
Date: Fri, 17 Oct 2008 20:56:58 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <18680.42554.791704.321274@montanaro-dyndns-org.local>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<48F7BAE6.4070108@gmail.com> <48F7BE73.3010303@cheimes.de>
	<18680.42554.791704.321274@montanaro-dyndns-org.local>
Message-ID: <48F8DFFA.9090707@cheimes.de>

skip at pobox.com wrote:
>     Christian> Oh h...  Are you able to recall a list of the most important
>     Christian> bug fixes? Maybe we can get the bug fixes into 2.5.3 before
>     Christian> it's too late.
> 
> Maybe doing the modest amount of translation required of the 2.6 unit tests
> so they run under 0.52 would help.  See what fails and then see what fixes
> correspond to fixing those failing tests.

Sounds like a good plan. Let's get started! Are you going to commit your 
work to the Google Code repository anytime soon?

Christian


From skip at pobox.com  Sun Oct 19 02:39:35 2008
From: skip at pobox.com (skip at pobox.com)
Date: Sat, 18 Oct 2008 19:39:35 -0500
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <48F8DFFA.9090707@cheimes.de>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<48F7BAE6.4070108@gmail.com> <48F7BE73.3010303@cheimes.de>
	<18680.42554.791704.321274@montanaro-dyndns-org.local>
	<48F8DFFA.9090707@cheimes.de>
Message-ID: <18682.33223.833779.771437@montanaro-dyndns-org.local>


    >> Maybe doing the modest amount of translation required of the 2.6 unit
    >> tests so they run under 0.52 would help.  See what fails and then see
    >> what fixes correspond to fixing those failing tests.

    Christian> Sounds like a good plan. Let's get started! Are you going to
    Christian> commit your work to the Google Code repository anytime soon?

Folks,

My apologies.  I have been essentially off-net for the past couple of days.
Reason one: we are in the midst of moving.  Reason two: our first grandchild
(Carmine Michael Montanaro) was born early Friday morning.  (yay!)  Between
visiting Carmine and moving/packing I haven't really been close to a
computer since Thursday mid-afternoon.  (I'm writing this reply off-net at
the moment.  Who knows when I'll get back within range of a wireless
signal.)

I will try to get close enough to the net for a small amount of time Sunday
and upload what I have to Google Code.  It ain't much, so if you're
impatient, you can pretty much replicate what I did:

    find . -name '*processing*' | egrep -v framework\|build\|PC | xargs tar --create --verbose --file=$HOME/tmp/multiprocessing.tar --exclude=.svn --exclude='*.pyc'

Skip


From skip at pobox.com  Mon Oct 20 18:01:30 2008
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 20 Oct 2008 11:01:30 -0500
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <48F8DFFA.9090707@cheimes.de>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<48F7BAE6.4070108@gmail.com> <48F7BE73.3010303@cheimes.de>
	<18680.42554.791704.321274@montanaro-dyndns-org.local>
	<48F8DFFA.9090707@cheimes.de>
Message-ID: <18684.43866.322328.195968@montanaro-dyndns-org.local>


    >> Maybe doing the modest amount of translation required of the 2.6 unit
    >> tests so they run under 0.52 would help.  See what fails and then see
    >> what fixes correspond to fixing those failing tests.

    Christian> Sounds like a good plan. Let's get started! Are you going to
    Christian> commit your work to the Google Code repository anytime soon?

I checked in the contents of my multiprocessing.tar file and opened issues
#1 and #2.

Skip


From lists at cheimes.de  Wed Oct 22 15:02:45 2008
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 22 Oct 2008 15:02:45 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <18684.43866.322328.195968@montanaro-dyndns-org.local>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<48F7BAE6.4070108@gmail.com> <48F7BE73.3010303@cheimes.de>
	<18680.42554.791704.321274@montanaro-dyndns-org.local>
	<48F8DFFA.9090707@cheimes.de>
	<18684.43866.322328.195968@montanaro-dyndns-org.local>
Message-ID: <48FF2475.8090006@cheimes.de>

skip at pobox.com wrote:
> I checked in the contents of my multiprocessing.tar file and opened issues
> #1 and #2.

I added a setup.py, disabled recv_bytes_into for now and fixed lots of
naming issues. The multiprocessing code is using the new names of the
threading module (current_thread, is_alive etc.) but Python 2.5 only
has the old names (currentThread, isAlive).
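
For illustration, a minimal sketch of the kind of name shim this implies. It
is not the code from the backport repository; pick_attr and the module-level
names below are hypothetical helpers, and the only assumption is that one of
the two spellings exists on the running interpreter.

```python
import threading


def pick_attr(obj, new_name, old_name):
    """Return obj.<new_name> when it exists, otherwise obj.<old_name>."""
    try:
        return getattr(obj, new_name)
    except AttributeError:
        return getattr(obj, old_name)


# threading.current_thread on 2.6 and later, threading.currentThread before.
current_thread = pick_attr(threading, "current_thread", "currentThread")


def is_alive(thread):
    """Call thread.is_alive() where available, else thread.isAlive()."""
    return pick_attr(thread, "is_alive", "isAlive")()


worker = threading.Thread(target=lambda: None)
alive_before_start = is_alive(worker)   # thread has not been started yet
worker.start()
worker.join()
alive_after_join = is_alive(worker)     # thread has already finished
```

Looking the name up lazily keeps the shim from touching a spelling that the
running interpreter does not provide.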

$ python2.5 setup.py build_ext -i
$ PYTHONPATH=Lib python2.5 Lib/test/test_multiprocessing.py

======================================================================
ERROR: test_connection (__main__.WithProcessesTestConnection)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "Lib/test/test_multiprocessing.py", line 1220, in test_connection
     self.assertEqual(conn.recv_bytes_into(buffer),
AttributeError: '_multiprocessing.Connection' object has no attribute
'recv_bytes_into'

----------------------------------------------------------------------
Ran 123 tests in 12.309s

FAILED (errors=1)

:)

Christian


From jnoller at gmail.com  Wed Oct 22 15:05:16 2008
From: jnoller at gmail.com (jnoller at gmail.com)
Date: Wed, 22 Oct 2008 06:05:16 -0700
Subject: [Python-3000] Backporting multiprocessing?
Message-ID: <00151757357c13daba0459d7331b@google.com>

Maybe we should backport those handy PEP 8 threading names... OK, maybe not.

On Oct 22, 2008 9:02am, Christian Heimes <lists at cheimes.de> wrote:
> skip at pobox.com wrote:
>> I checked in the contents of my multiprocessing.tar file and opened issues
>> #1 and #2.
>
> I added a setup.py, disabled recv_bytes_into for now and fixed lots of
> naming issues. The multiprocessing code is using the new names of the
> threading module (current_thread, is_alive etc.) but Python 2.5 only
> has the old names (currentThread, isAlive).
>
> $ python2.5 setup.py build_ext -i
> $ PYTHONPATH=Lib python2.5 Lib/test/test_multiprocessing.py
>
> ======================================================================
> ERROR: test_connection (__main__.WithProcessesTestConnection)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>    File "Lib/test/test_multiprocessing.py", line 1220, in test_connection
>      self.assertEqual(conn.recv_bytes_into(buffer),
> AttributeError: '_multiprocessing.Connection' object has no attribute
> 'recv_bytes_into'
>
> ----------------------------------------------------------------------
> Ran 123 tests in 12.309s
>
> FAILED (errors=1)
>
> :)
>
> Christian

From lists at cheimes.de  Wed Oct 22 16:12:22 2008
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 22 Oct 2008 16:12:22 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <48FF2475.8090006@cheimes.de>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>	<48F7BAE6.4070108@gmail.com>
	<48F7BE73.3010303@cheimes.de>	<18680.42554.791704.321274@montanaro-dyndns-org.local>	<48F8DFFA.9090707@cheimes.de>	<18684.43866.322328.195968@montanaro-dyndns-org.local>
	<48FF2475.8090006@cheimes.de>
Message-ID: <gdncc5$hmt$1@ger.gmane.org>

Update:

I just implemented the recv_bytes_into function with the old buffer 
protocol. All tests are passing on my Linux box (Ubuntu 8.04 with gcc 
4.2, AMD64 processor).
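
For readers unfamiliar with the call, a small sketch of what recv_bytes_into
does, written against the multiprocessing API as it exists in current Python
releases (bytearray is not available on 2.4/2.5, so this is not the backport's
own test). The variable names are illustrative only.

```python
from multiprocessing import Pipe

sender, receiver = Pipe()
sender.send_bytes(b"hello")

# recv_bytes_into() fills a caller-supplied writable buffer and returns
# the number of bytes in the received message.
buf = bytearray(16)
nbytes = receiver.recv_bytes_into(buf)
payload = bytes(buf[:nbytes])

sender.close()
receiver.close()
```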

svn check it out https://python-multiprocessing.googlecode.com/svn/trunk

Christian


From skip at pobox.com  Wed Oct 22 16:52:27 2008
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 22 Oct 2008 09:52:27 -0500
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <gdncc5$hmt$1@ger.gmane.org>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>
	<48F7BAE6.4070108@gmail.com> <48F7BE73.3010303@cheimes.de>
	<18680.42554.791704.321274@montanaro-dyndns-org.local>
	<48F8DFFA.9090707@cheimes.de>
	<18684.43866.322328.195968@montanaro-dyndns-org.local>
	<48FF2475.8090006@cheimes.de> <gdncc5$hmt$1@ger.gmane.org>
Message-ID: <18687.15915.98565.882018@montanaro-dyndns-org.local>


    Christian> I just implemented the recv_bytes_into function with the old
    Christian> buffer protocol. All tests are passing on my Linux box
    Christian> (Ubuntu 8.04 with gcc 4.2, AMD64 processor).

Using Python v < 2.6?  So I don't need to horse around making
test_multiprocessing.py API compatible with processing 0.52?

Skip

From lists at cheimes.de  Wed Oct 22 17:07:02 2008
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 22 Oct 2008 17:07:02 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <18687.15915.98565.882018@montanaro-dyndns-org.local>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>	<48F7BAE6.4070108@gmail.com>
	<48F7BE73.3010303@cheimes.de>	<18680.42554.791704.321274@montanaro-dyndns-org.local>	<48F8DFFA.9090707@cheimes.de>	<18684.43866.322328.195968@montanaro-dyndns-org.local>	<48FF2475.8090006@cheimes.de>
	<gdncc5$hmt$1@ger.gmane.org>
	<18687.15915.98565.882018@montanaro-dyndns-org.local>
Message-ID: <48FF4196.2020204@cheimes.de>

skip at pobox.com wrote:
> Using Python v < 2.6?  So I don't need to horse around making
> test_multiprocessing.py API compatible with processing 0.52?

With Python 2.5.2 and 2.6.0 all tests are passing without any error. With 
Python 2.4.5 seven tests are failing because 2.4 doesn't support mmap 
with a negative file number.

File ".../python-multiprocessing/Lib/multiprocessing/heap.py", line 56, 
in __init__
     self.buffer = mmap.mmap(-1, size)
EnvironmentError: [Errno 9] Bad file descriptor
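
For context, a minimal sketch of the call that fails above. On interpreters
that support it (2.5 and later, per the message above), a file number of -1
asks mmap for anonymous memory with no backing file. The variable names are
illustrative.

```python
import mmap

size = 4096
buf = mmap.mmap(-1, size)   # -1: anonymous mapping, not tied to any file
buf[0:5] = b"hello"         # the mapping behaves like a mutable byte buffer
first = bytes(buf[0:5])
length = len(buf)
buf.close()
```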

Christian

From lists at cheimes.de  Wed Oct 22 20:33:00 2008
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 22 Oct 2008 20:33:00 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <18687.15915.98565.882018@montanaro-dyndns-org.local>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>	<48F7BAE6.4070108@gmail.com>
	<48F7BE73.3010303@cheimes.de>	<18680.42554.791704.321274@montanaro-dyndns-org.local>	<48F8DFFA.9090707@cheimes.de>	<18684.43866.322328.195968@montanaro-dyndns-org.local>	<48FF2475.8090006@cheimes.de>
	<gdncc5$hmt$1@ger.gmane.org>
	<18687.15915.98565.882018@montanaro-dyndns-org.local>
Message-ID: <48FF71DC.8010904@cheimes.de>

skip at pobox.com wrote:
> Using Python v < 2.6?  So I don't need to horse around making
> test_multiprocessing.py API compatible with processing 0.52?

I've backported the Python 2.5 svn version of mmap to 2.4 and added it 
as multiprocessing._mmap25. The port is just a proof of concept and most 
likely contains issues with ssize_t -> long transitions. But it's working.

With the latest svn checkout all tests are passing for 2.4.5, 2.5.2 and 
2.6.0 on my 64bit Ubuntu box. Somebody needs to test it on Windows, 
32bit Linux and BSD.

Christian

From lists at cheimes.de  Thu Oct 23 02:49:48 2008
From: lists at cheimes.de (Christian Heimes)
Date: Thu, 23 Oct 2008 02:49:48 +0200
Subject: [Python-3000] Backporting multiprocessing?
In-Reply-To: <18687.15915.98565.882018@montanaro-dyndns-org.local>
References: <60bb7ceb0810160901n367ce5f6r11f384e4661a56dc@mail.gmail.com>	<48F7BAE6.4070108@gmail.com>
	<48F7BE73.3010303@cheimes.de>	<18680.42554.791704.321274@montanaro-dyndns-org.local>	<48F8DFFA.9090707@cheimes.de>	<18684.43866.322328.195968@montanaro-dyndns-org.local>	<48FF2475.8090006@cheimes.de>
	<gdncc5$hmt$1@ger.gmane.org>
	<18687.15915.98565.882018@montanaro-dyndns-org.local>
Message-ID: <48FFCA2C.3090001@cheimes.de>

The latest svn version is now working with Python 2.4.4, Python 2.5.2 
and Python 2.6.0 on Linux (Ubuntu AMD64, Debian i386) and Windows XP. On 
Windows the multiprocessing module requires ctypes and pywin32 under 
Python 2.4.4.

Some of the examples aren't working correctly under 2.4 and 2.5. Jesse 
is looking into it.

Christian

From victor.stinner at haypocalc.com  Tue Oct 28 16:12:54 2008
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Tue, 28 Oct 2008 16:12:54 +0100
Subject: [Python-3000] email libraries: use byte or unicode strings?
Message-ID: <200810281612.54570.victor.stinner@haypocalc.com>

Hi,

I worked on poplib, imaplib and nntplib to fix them in Python3. First I tried 
to use unicode everywhere because I love unicode and I don't want to care 
about the charset. So I used a default charset (ISO-8859-1), but it doesn't 
work because each email can use a different charset. The charset is written 
in the email header but I don't want to hack the libraries to parse the 
headers: poplib should only support the POP3 protocol, email parsing is 
complex and should be done by another module (later, after fetching the 
email).
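
A sketch of the split described above: the fetching library hands back raw
byte lines, and a separate step reads the charset from the headers before
decoding. It uses email.message_from_bytes from later Python 3 releases
(which did not exist when this was written), and the sample message is made
up for illustration.

```python
import email

# Byte lines as a POP3 client might return them, without line endings.
raw_lines = [
    b"Subject: test",
    b"MIME-Version: 1.0",
    b"Content-Type: text/plain; charset=iso-8859-15",
    b"Content-Transfer-Encoding: 8bit",
    b"",
    b"Prix: 10 \xa4",
]

msg = email.message_from_bytes(b"\r\n".join(raw_lines))
charset = msg.get_content_charset() or "ascii"
body = msg.get_payload(decode=True)     # still bytes at this point

text = body.decode(charset)             # charset taken from the header
wrong_text = body.decode("iso-8859-1")  # fixed default: wrong character
```

Byte 0xA4 is the euro sign in ISO-8859-15 but the generic currency sign in
ISO-8859-1, which is why a fixed default silently produces the wrong text.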

Current status: poplib, imaplib and nntplib are broken

--

I wrote patches for poplib and imaplib to use only byte strings. 
I "backported" poplib tests from python trunk and I used different POP3 and 
IMAP servers to test the libraries.

Can anyone review my patches? Issues #1210 and #3727.

--

I don't know the NNTP protocol and so I'm unable to test it. But nntplib 
should also use byte strings only.

Note: imaplib and nntplib have no test :-(

--

What about smtplib or smtpd?

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

From janssen at parc.com  Tue Oct 28 18:19:41 2008
From: janssen at parc.com (Bill Janssen)
Date: Tue, 28 Oct 2008 10:19:41 PDT
Subject: [Python-3000] email libraries: use byte or unicode strings?
In-Reply-To: <200810281612.54570.victor.stinner@haypocalc.com>
References: <200810281612.54570.victor.stinner@haypocalc.com>
Message-ID: <98133.1225214381@parc.com>

Victor Stinner <victor.stinner at haypocalc.com> wrote:

> Note: imaplib and nntplib have no test :-(

I'm concerned about the lack of test suites for these modules.  They
basically go untested unless someone sets up a server, and then runs a
series of unscripted ad-hoc tests against that server.  Is there either
a way we could set up IMAP and NNTP servers on python.org to test
against, or perhaps find some effectively permanent services to test
against?
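
One alternative to a permanent external service is a throwaway server on the
loopback interface. The sketch below is only an assumption about how such a
test could look: GreetingHandler is hypothetical, it speaks just enough of an
NNTP-like dialogue to exercise a client's read/write path, and it is not a
real NNTP or IMAP implementation.

```python
import socket
import socketserver
import threading


class GreetingHandler(socketserver.StreamRequestHandler):
    """Send a greeting, then answer a single QUIT command."""

    def handle(self):
        self.wfile.write(b"200 fake.example ready\r\n")
        line = self.rfile.readline()
        if line.strip().upper() == b"QUIT":
            self.wfile.write(b"205 bye\r\n")


# Port 0 lets the operating system pick a free port.
server = socketserver.TCPServer(("127.0.0.1", 0), GreetingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

with socket.create_connection((host, port), timeout=5) as sock:
    stream = sock.makefile("rwb")
    welcome = stream.readline()
    stream.write(b"QUIT\r\n")
    stream.flush()
    goodbye = stream.readline()

server.shutdown()
server.server_close()
```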

Bill

From barry at python.org  Wed Oct 29 10:12:59 2008
From: barry at python.org (Barry Warsaw)
Date: Wed, 29 Oct 2008 09:12:59 +0000
Subject: [Python-3000] email libraries: use byte or unicode strings?
In-Reply-To: <200810281612.54570.victor.stinner@haypocalc.com>
References: <200810281612.54570.victor.stinner@haypocalc.com>
Message-ID: <20081029091259.7153ec82@resist.wooz.org>

On Oct 28, 2008, at 04:12 PM, Victor Stinner wrote:

>What about smtplib or smtpd?

Yes, they should use bytes, as should the email package.  The latter doesn't
though, and it needs a lot of work (we tried and failed at pycon).

-Barry

From tjreedy at udel.edu  Wed Oct 29 18:55:07 2008
From: tjreedy at udel.edu (Terry Reedy)
Date: Wed, 29 Oct 2008 13:55:07 -0400
Subject: [Python-3000] email libraries: use byte or unicode strings?
In-Reply-To: <98133.1225214381@parc.com>
References: <200810281612.54570.victor.stinner@haypocalc.com>
	<98133.1225214381@parc.com>
Message-ID: <gea81p$gd5$1@ger.gmane.org>

Bill Janssen wrote:
> Victor Stinner <victor.stinner at haypocalc.com> wrote:
> 
>> Note: imaplib and nntplib have no test :-(

The examples in the manual should serve as a basis for a doctest.

> I'm concerned about the lack of test suites for these modules.  They
> basically go untested unless someone sets up a server, and then runs a
> series of unscripted ad-hoc tests against that server.  Is there either
> a way we could set up IMAP and NNTP servers on python.org to test
> against, or perhaps find some effectively permanent services to test
> against?

news.gmane.org requires no logon and has been up pretty reliably for 
years.  I presume it uses standard nntp server software.  It certainly 
works fine with both OutlookExpress and Thunderbird. 
gmane.comp.python.devel (or python-3000.devel) would be an appropriate 
newsgroup to test retrieval.  gmane.test is obviously used by numerous 
people for testing posting and replies.

However...
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]
win32
Type "help", "copyright", "credits" or "license" for more information.
 >>> import nntplib as N
 >>> s=N.NNTP('news.gmane.org')

3.0rc1 fails here with a string vs. bytes message.

 >>> resp, count, first, last, name = s.group('gmane.comp.python.devel')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Program Files\Python25\lib\nntplib.py", line 346, in group
     resp = self.shortcmd('GROUP ' + name)
   File "C:\Program Files\Python25\lib\nntplib.py", line 260, in shortcmd
     return self.getresp()
   File "C:\Program Files\Python25\lib\nntplib.py", line 215, in getresp
     resp = self.getline()
   File "C:\Program Files\Python25\lib\nntplib.py", line 207, in getline
     if not line: raise EOFError
EOFError
 >>> s.group('gmane.comp.python.general')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Program Files\Python25\lib\nntplib.py", line 346, in group
     resp = self.shortcmd('GROUP ' + name)
   File "C:\Program Files\Python25\lib\nntplib.py", line 259, in shortcmd
     self.putcmd(line)
   File "C:\Program Files\Python25\lib\nntplib.py", line 199, in putcmd
     self.putline(line)
   File "C:\Program Files\Python25\lib\nntplib.py", line 194, in putline
     self.sock.sendall(line)
   File "<string>", line 1, in sendall
socket.error: (10053, 'Software caused connection abort')

 >>> s.getwelcome()
'200 news.gmane.org InterNetNews NNRP server INN 2.4.1 ready (posting ok).'

 >>> s.set_debuglevel(2)
 >>> s.newgroups('081029','000000')
*cmd* 'NEWGROUPS 081029 000000'
*put* 'NEWGROUPS 081029 000000\r\n'
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Program Files\Python25\lib\nntplib.py", line 275, in newgroups
     return self.longcmd('NEWGROUPS ' + date + ' ' + time, file)
   File "C:\Program Files\Python25\lib\nntplib.py", line 264, in longcmd
     self.putcmd(line)
   File "C:\Program Files\Python25\lib\nntplib.py", line 199, in putcmd
     self.putline(line)
   File "C:\Program Files\Python25\lib\nntplib.py", line 194, in putline
     self.sock.sendall(line)
   File "<string>", line 1, in sendall
socket.error: (10053, 'Software caused connection abort')

I get same with 'nntp.aioe.org', so on my WinXP machine, 2.5 nntplib 
seems not to be working either.

Terry


From andrewm at object-craft.com.au  Thu Oct 30 23:17:25 2008
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 31 Oct 2008 09:17:25 +1100
Subject: [Python-3000] email libraries: use byte or unicode strings?
In-Reply-To: <20081029091259.7153ec82@resist.wooz.org> 
References: <200810281612.54570.victor.stinner@haypocalc.com>
	<20081029091259.7153ec82@resist.wooz.org>
Message-ID: <20081030221726.0A0636007DF@longblack.object-craft.com.au>

>>What about smtplib or smtpd?
>
>Yes, they should use bytes, 

I agree. imaplib, poplib and smtplib implement wire protocols, and should be
8-bit clean (SMTP in particular). The APIs are little more than the wire
protocol, so I think it's appropriate they present bytes to their users
also (and there is nothing in their respective RFCs about encoding).

>as should the email package.  

That's a trickier case, but I think it should use bytes internally. One of
the early goals of email was that it be able to cope with malformed MIME -
this includes incorrectly encoded messages. So I think it must keep a
bytes representation internally.

However - charset encoding is part of the MIME spec, so users have a
reasonable expectation that the mime lib will present them with unicode.
So the API needs to be unicode.

>The latter doesn't though, and it needs a lot of work (we tried and failed
>at pycon).

Yes, it's hard. I think we're going to have to break the API.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From stephen at xemacs.org  Fri Oct 31 07:23:14 2008
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 31 Oct 2008 15:23:14 +0900
Subject: [Python-3000] email libraries: use byte or unicode strings?
In-Reply-To: <20081030221726.0A0636007DF@longblack.object-craft.com.au>
References: <200810281612.54570.victor.stinner@haypocalc.com>
	<20081029091259.7153ec82@resist.wooz.org>
	<20081030221726.0A0636007DF@longblack.object-craft.com.au>
Message-ID: <87tzatuvu5.fsf@uwakimon.sk.tsukuba.ac.jp>

Andrew McNamara writes:

 > However - charset encoding is part of the MIME spec, so users have a
 > reasonable expectation that the mime lib will present them with unicode.
 > So the API needs to be unicode.

It needs to /include/ unicode functionality.  However, this might very
well be lazy (a function which automatically resends a message may not
need to decode the MIME parts, for example).

So I think there should be three layers: one corresponding more or less
to raw SMTP---all bytes; one which handles mail as text---all unicode;
and one which handles the transitions---which needs phasers set to
"kill" any data in incorrect format.

I also suggest that these three levels of functionality are
intertwingled enough (at the RFC level) that it does not make sense to
separate them into more than one module.
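
A rough sketch of how the lazy, layered design could look. LazyPart and its
method names are hypothetical and not taken from the email package; the point
is only that raw bytes can be forwarded untouched while decoding is strict
and happens on request.

```python
class LazyPart:
    """Hold a body part as raw bytes and decode it only when asked."""

    def __init__(self, raw, charset):
        self.raw = raw          # bytes layer: exactly what came off the wire
        self.charset = charset  # charset declared in the part's headers

    def as_bytes(self):
        """Bytes layer: forward or store the part without decoding it."""
        return self.raw

    def as_text(self):
        """Transition layer: strict decode, so bad data raises an error."""
        return self.raw.decode(self.charset, "strict")


good = LazyPart("na\u00efve".encode("utf-8"), "utf-8")
bad = LazyPart(b"na\xefve", "utf-8")    # Latin-1 bytes mislabelled as UTF-8

forwarded = bad.as_bytes()              # resending needs no decoding at all
text = good.as_text()
try:
    bad.as_text()
    rejected = False
except UnicodeDecodeError:
    rejected = True
```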


From lists at cheimes.de  Fri Oct 31 22:50:39 2008
From: lists at cheimes.de (Christian Heimes)
Date: Fri, 31 Oct 2008 22:50:39 +0100
Subject: [Python-3000] close() on open(fd, closefd=False)
Message-ID: <gefujf$vi1$1@ger.gmane.org>

Amaury has found an issue with open and closefd, 
http://bugs.python.org/issue4233

The additional warnings aren't critical. But in retrospect I think 
that I made a small error during the design of the closefd feature.
With a file descriptor number as first argument and closefd set to 
false, the file descriptor isn't closed when the file object is 
deallocated. It's also impossible to close the fd with close(). Right 
now close() doesn't do anything and you can still write or read after 
close(). This behavior is surprising to the user. I'd like to change 
close() to set the internal fd attribute to -1 (meaning closed) but keep 
the fd open.

Maybe the warning could be dropped altogether, too.
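
A small sketch of the proposed semantics: close() marks the file object as
closed while the underlying descriptor stays usable. This matches how current
Python 3 releases behave as far as I know; it is not the 3.0rc1 behavior
described above, where close() did nothing.

```python
import os

read_fd, write_fd = os.pipe()

f = open(write_fd, "wb", buffering=0, closefd=False)
f.write(b"x")
f.close()                       # closes the file object, not the descriptor
object_closed = f.closed

try:
    f.write(b"!")               # the file object itself is no longer usable
    write_after_close = True
except ValueError:
    write_after_close = False

os.write(write_fd, b"y")        # the descriptor is still open
data = os.read(read_fd, 2)

os.close(read_fd)
os.close(write_fd)
```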

Christian


From musiccomposition at gmail.com  Fri Oct 31 23:02:47 2008
From: musiccomposition at gmail.com (Benjamin Peterson)
Date: Fri, 31 Oct 2008 17:02:47 -0500
Subject: [Python-3000] close() on open(fd, closefd=False)
In-Reply-To: <gefujf$vi1$1@ger.gmane.org>
References: <gefujf$vi1$1@ger.gmane.org>
Message-ID: <1afaf6160810311502y54888a55kc642d4772885ab5a@mail.gmail.com>

On Fri, Oct 31, 2008 at 4:50 PM, Christian Heimes <lists at cheimes.de> wrote:
> Amaury has found an issue with open and closefd,
> http://bugs.python.org/issue4233
>
> The additional warnings aren't critical. But in retrospect I think that I
> made a small error during the design of the closefd feature.
> With a file descriptor number as first argument and closefd set to false,
> the file descriptor isn't closed when the file object is deallocated. It's
> also impossible to close the fd with close(). Right now close() doesn't do
> anything and you can still write or read after close(). This behavior is
> surprising to the user. I'd like to change close() to set the internal fd
> attribute to -1 (meaning closed) but keep the fd open.

Isn't it too late to make semantic changes like this? Not only is 3.0
in rc phase, but we've released 2.6's io backport with this
(mis)feature.

>
> Maybe the warning could be dropped altogether, too.

That may be the best course of action.



-- 
Cheers,
Benjamin Peterson
"There's nothing quite as beautiful as an oboe... except a chicken
stuck in a vacuum cleaner."