From hi at  Fri Aug  1 02:59:41 2014
From: hi at (Shiz)
Date: Fri, 1 Aug 2014 02:59:41 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
Message-ID: <>

Hi folks,

I'm working on porting CPython to the Android platform, and while making decent progress, I'm currently stuck at a higher-level issue than adding #ifdefs for __ANDROID__ to C extension modules.

The idea is, not only do CPython extension modules have some assumptions that don't seem to fit Android's mold, some default Python-written modules do as well. However, whereas CPython extensions can trivially check if we're building for Android by checking the __ANDROID__ compiler macro, Python modules can do no such check, and are left wondering how to figure out if the platform they are currently running on is an Android one. To my knowledge there is no reliable way to detect if one is using Android as a vehicle for their journey using any other way.

Now, the main question is: what would be the best way to "expose" the indication that Android is being run on to Python-living modules? My own thought was to add sys.getlinuxuserland(), or platform.linux_userland(), in a similar vein to sys.getwindowsversion() and platform.linux_distribution(), which could return information about the userland of the running CPython instance, instead of knowing merely the kernel and the distribution.

This way, code could trivially check if it ran on the GNU(+associates) userland, or under a BSD-ish userland, or Android... and adjust its behaviour accordingly.
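
A minimal sketch of what such a check might look like from Python code. Everything here is an assumption for illustration: linux_userland() and the Userland tuple are hypothetical names, not an existing CPython API, and the ANDROID_ROOT and /system/build.prop heuristics are guesses rather than a reliable detection method. Only platform.libc_ver() is an existing stdlib call.

```python
import os
import platform
from collections import namedtuple

# Hypothetical return type; neither the name nor the fields come from CPython.
Userland = namedtuple("Userland", ["name", "libc", "libc_version"])


def linux_userland(environ=os.environ, path_exists=os.path.exists):
    """Best-effort guess at the userland; a heuristic, not a reliable check."""
    # Assumption: Android sets ANDROID_ROOT and ships /system/build.prop.
    if "ANDROID_ROOT" in environ or path_exists("/system/build.prop"):
        return Userland("android", "bionic", "")
    # libc_ver() returns a (lib, version) pair; both may be empty strings.
    libc, version = platform.libc_ver()
    if libc == "glibc":
        return Userland("gnu", libc, version)
    return Userland("unknown", libc, version)
```

Calling code could then branch on linux_userland().name instead of guessing from the kernel name alone.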

I would be delighted to hear comments on this proposal, or better yet, alternative solutions. :)

Kind regards,

P.S.: I am well aware that Android might as well never be officially supported in CPython. In that case, consider this a thought experiment of how it /would/ be handled. :)

From v+python at  Fri Aug  1 03:54:53 2014
From: v+python at (Glenn Linderman)
Date: Thu, 31 Jul 2014 18:54:53 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
Message-ID: <>

On 7/31/2014 5:59 PM, Shiz wrote:
> Hi folks,
> I'm working on porting CPython to the Android platform, and while making decent progress, I'm currently stuck at a higher-level issue than adding #ifdefs for __ANDROID__ to C extension modules.
> The idea is, not only do CPython extension modules have some assumptions that don't seem to fit Android's mold, some default Python-written modules do as well. However, whereas CPython extensions can trivially check if we're building for Android by checking the __ANDROID__ compiler macro, Python modules can do no such check, and are left wondering how to figure out if the platform they are currently running on is an Android one. To my knowledge there is no reliable way to detect if one is using Android as a vehicle for their journey using any other way.
> Now, the main question is: what would be the best way to "expose" the indication that Android is being run on to Python-living modules? My own thought was to add sys.getlinuxuserland(), or platform.linux_userland(), in a similar vein to sys.getwindowsversion() and platform.linux_distribution(), which could return information about the userland of the running CPython instance, instead of knowing merely the kernel and the distribution.

I've no idea what you mean by "userland" in your suggestions above or 
below, but doesn't the Android environment qualify as a 
(multi-versioned) platform independently of its host OS? Seems I've read 
about an Android reimplementation for Windows, for example. As long as 
all the services expected by Android are faithfully produced, the host 
OS may be irrelevant to an Android application... in which case, I would 
think/propose/suggest the platform name should change from win32 or 
linux to Android (and the Android version be reflected in version parts).

> This way, code could trivially check if it ran on the GNU(+associates) userland, or under a BSD-ish userland, or Android... and adjust its behaviour accordingly.
> I would be delighted to hear comments on this proposal, or better yet, alternative solutions. :)
> Kind regards,
> Shiz
> P.S.: I am well aware that Android might as well never be officially supported in CPython. In that case, consider this a thought experiment of how it /would/ be handled. :)

Is your P.S. suggestive that you would not be willing to support your 
port for use by others?  Of course, until it is somewhat complete, it is 
hard to know how complete and compatible it can be.

From p.f.moore at  Fri Aug  1 08:46:21 2014
From: p.f.moore at (Paul Moore)
Date: Fri, 1 Aug 2014 07:46:21 +0100
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
Message-ID: <>

On 1 August 2014 02:54, Glenn Linderman <v+python at> wrote:
> I've no idea what you mean by "userland" in your suggestions above or below,
> but doesn't the Android environment qualify as a (multi-versioned) platform
> independently of its host OS? Seems I've read about an Android
> reimplementation for Windows, for example. As long as all the services
> expected by Android are faithfully produced, the host OS may be irrelevant
> to an Android application... in which case, I would think/propose/suggest
> the platform name should change from win32 or linux to Android (and the
> Android version be reflected in version parts).

Alternatively, if having sys.platform be "linux" makes portability
easier because code that does a platform check generally gets the
right answer if Android reports as "linux", then why not make
platform.linux_distribution() report "android"?

To put it briefly, either android is the platform, or android is a
specific distribution of the linux platform.
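
A minimal sketch of how the two alternatives would look to calling code. The function names are hypothetical, and the platform and distribution strings are passed in as plain arguments because no CPython build is assumed to report these values:

```python
def android_as_platform(sys_platform):
    """Option 1: Android is its own platform, so sys.platform names it."""
    return sys_platform.lower().startswith("android")


def android_as_distribution(sys_platform, distro_name):
    """Option 2: Android is a Linux distribution; sys.platform stays 'linux'."""
    return sys_platform.startswith("linux") and distro_name.lower() == "android"
```

Under option 2, existing checks such as sys.platform.startswith("linux") keep passing on Android, which is the portability argument made above.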


From hi at  Fri Aug  1 14:23:17 2014
From: hi at (Shiz)
Date: Fri, 1 Aug 2014 14:23:17 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
Message-ID: <>

On 01 Aug 2014, at 03:54, Glenn Linderman <v+python at> wrote:
> I've no idea what you mean by "userland" in your suggestions above or below, but doesn't the Android environment qualify as a (multi-versioned) platform independently of its host OS? Seems I've read about an Android reimplementation for Windows, for example. As long as all the services expected by Android are faithfully produced, the host OS may be irrelevant to an Android application... in which case, I would think/propose/suggest the platform name should change from win32 or linux to Android (and the Android version be reflected in version parts).

That might be a way to look at it. So far I assumed that the Android environment would be largely Linux-based, since the Android NDK (Native Development Kit, the SDK used for creating C/C++-level applications) is used for my patch which gives a GNU-ish toolchain with a Linux/Unixy environment. I know an implementation exists that claims to run Android on top of an NT kernel, but I honestly have little idea of how it works. Given how a fair amount of things "already work" with the platform set to linux, I'm not sure if changing sys.platform would be a good idea... but that's from my NDK perspective.

> Is your P.S. suggestive that you would not be willing to support your port for use by others?  Of course, until it is somewhat complete, it is hard to know how complete and compatible it can be.

Oh, no, nothing like that. It's just that I'm not sure, as goes for anything, that it would be accepted into mainline CPython. Better safe than sorry in that aspect: maybe the maintainers don't want to support Android in the first place. :)

Kind regards,

From mark at  Fri Aug  1 14:32:48 2014
From: mark at (Shiz)
Date: Fri, 1 Aug 2014 14:32:48 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
Message-ID: <>

> On 1 August 2014 02:54, Glenn Linderman <v+python at> wrote:
> Alternatively, if having sys.platform be "linux" makes portability
> easier because code that does a platform check generally gets the
> right answer if Android reports as "linux", then why not make
> platform.linux_distribution() report "android"?
> To put it briefly, either android is the platform, or android is a
> specific distribution of the linux platform.
> Paul

That might work better. I was assuming a userland perspective because I've been honestly mostly wrestling with Bionic, Android's libc,
but putting that into perspective to consider Android as a whole (after all, the SDK and NDK are what make Android for a lot of developers)
might be a valid other approach as well.

Kind regards,

From status at  Fri Aug  1 18:08:08 2014
From: status at (Python tracker)
Date: Fri,  1 Aug 2014 18:08:08 +0200 (CEST)
Subject: [Python-Dev] Summary of Python tracker Issues
Message-ID: <>

ACTIVITY SUMMARY (2014-07-25 - 2014-08-01)
Python tracker at

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4592 ( +1)
  closed 29297 (+49)
  total  33889 (+50)

Open issues with patches: 2163 

Issues opened (34)

#11271: doesn't batch fun  reopened by pitrou

#22063: asyncio: sock_xxx() methods of event loops should check ath so  reopened by haypo

#22069: TextIOWrapper(newline="\n", line_buffering=True) mistakenly tr  opened by akira

#22070: Use the _functools module to speed up functools.total_ordering  opened by ncoghlan

#22071: Remove long-time deprecated attributes from smtpd  opened by zvyn

#22077: Improve the error message for various sequences  opened by Claudiu.Popa

#22079: Ensure in PyType_Ready() that base class of static type is sta  opened by serhiy.storchaka

#22080: Add windows_helper module helper  opened by Claudiu.Popa

#22083: Refactor PyShell's breakpoint related methods  opened by sahutd

#22086: Tab indent no longer works in interpreter  opened by Azendale

#22087: _UnixDefaultEventLoopPolicy should either create a new loop or  opened by dan.oreilly

#22088: base64 module still ignores non-alphabet characters  opened by Julian

#22090: Decimal and float formatting treat '%' differently for infinit  opened by mark.dickinson

#22091: __debug__ in compile(optimize=1)  opened by arigo

#22092: Executing some tests inside Lib/unittest/test individually thr  opened by vajrasky

#22093: Compiling python on OS X gives warning about compact unwind  opened by vajrasky

#22094: oss_audio_device.write(data) produces short writes  opened by akira

#22095: Use of set_tunnel with default port results in incorrect post  opened by demian.brecht

#22097: Linked list API for ordereddict  opened by pitrou

#22098: Behavior of Structure inconsistent with BigEndianStructure whe  opened by Florian.Dold

#22100: Use $HOSTPYTHON when determining candidate interpreter for $PY  opened by shiz

#22102: Zipfile generates Zipfile error in zip with 0 total number of  opened by Guillaume.Carre

#22103: bdist_wininst does not run install script  opened by mb_

#22104: test_asyncio unstable in refleak mode  opened by pitrou

#22105: Hang during File "Save As"  opened by Joe

#22107: tempfile module misinterprets access denied error on Windows  opened by rupole

#22110: enable extra compilation warnings  opened by neologix

#22112: '_UnixSelectorEventLoop' object has no attribute 'create_task'  opened by pydanny

#22113: memoryview and struct.pack_into  opened by stangelandcl

#22114: You cannot call communicate() safely after receiving an except  opened by amrith

#22115: Add new methods to trace Tkinter variables  opened by serhiy.storchaka

#22116: Weak reference support for C function objects  opened by pitrou

#22117: Rewrite pytime.h to work on nanoseconds  opened by haypo

#22118: urljoin fails with messy relative URLs  opened by Mike.Lissner

Most recent 15 issues with no replies (15)

#22116: Weak reference support for C function objects

#22115: Add new methods to trace Tkinter variables

#22107: tempfile module misinterprets access denied error on Windows

#22105: Hang during File "Save As"

#22103: bdist_wininst does not run install script

#22102: Zipfile generates Zipfile error in zip with 0 total number of

#22098: Behavior of Structure inconsistent with BigEndianStructure whe

#22095: Use of set_tunnel with default port results in incorrect post

#22092: Executing some tests inside Lib/unittest/test individually thr

#22088: base64 module still ignores non-alphabet characters

#22086: Tab indent no longer works in interpreter

#22083: Refactor PyShell's breakpoint related methods

#22080: Add windows_helper module helper

#22077: Improve the error message for various sequences

#22071: Remove long-time deprecated attributes from smtpd

Most recent 15 issues waiting for review (15)

#22117: Rewrite pytime.h to work on nanoseconds

#22115: Add new methods to trace Tkinter variables

#22110: enable extra compilation warnings

#22104: test_asyncio unstable in refleak mode

#22100: Use $HOSTPYTHON when determining candidate interpreter for $PY

#22097: Linked list API for ordereddict

#22095: Use of set_tunnel with default port results in incorrect post

#22092: Executing some tests inside Lib/unittest/test individually thr

#22087: _UnixDefaultEventLoopPolicy should either create a new loop or

#22083: Refactor PyShell's breakpoint related methods

#22080: Add windows_helper module helper

#22077: Improve the error message for various sequences

#22071: Remove long-time deprecated attributes from smtpd

#22068: tkinter: avoid reference loops with Variables and Fonts

#22065: Update turtledemo menu creation

Top 10 most discussed issues (10)

#21308: PEP 466: backport ssl changes  13 msgs

#22097: Linked list API for ordereddict  13 msgs

#22114: You cannot call communicate() safely after receiving an except   9 msgs

#9529: Make re match object iterable   8 msgs

#15986: memoryview: expose 'buf' attribute   8 msgs

#20170: Derby #1: Convert 137 sites to Argument Clinic in Modules/posi   8 msgs

#21933: Allow the user to change font sizes with the text pane of turt   8 msgs

#22087: _UnixDefaultEventLoopPolicy should either create a new loop or   8 msgs

#17620: Python interactive console doesn't use sys.stdin for input   7 msgs

#18174: Make regrtest with --huntrleaks check for fd leaks   7 msgs

Issues closed (49)

#11969: Can't launch multiproccessing.Process on methods  closed by pitrou

#11990: redirected output - stdout writes newline as \n in windows  closed by haypo

#15152: test_subprocess failures on awfully slow builtbots  closed by neologix

#15398: intermittence on UnicodeFileTests.test_rename at test_pep277 o  closed by ned.deily

#16005: smtplib.SMTP().sendmail() and rset()  closed by r.david.murray

#16383: Python 3.3 Permission Error with User Library on Windows  closed by zach.ware

#17172: Add turtledemo to IDLE menu  closed by terry.reedy

#17371: Mismatch between Python 3.3 build environment and distutils co  closed by loewis

#17634: Win32: shutil.copy leaks file handles to child processes  closed by haypo

#18395: Make _Py_char2wchar() and _Py_wchar2char() public  closed by haypo

#19612: test_subprocess: sporadic failure of test_communicate_epipe()  closed by haypo

#19875: test_getsockaddrarg occasional failure  closed by neologix

#19923: OSError: [Errno 512] Unknown error 512 in test_multiprocessing  closed by neologix

#20093: Wrong OSError message from os.rename() when dst is a non-empty  closed by doko

#20466: Example in Doc/extending/embedding.rst fails to compile cleanl  closed by zach.ware

#21580: PhotoImage(data=...) apparently has to be UTF-8 or Base-64 enc  closed by serhiy.storchaka

#21591: "exec(a, b, c)" not the same as "exec a in b, c" in nested fun  closed by djc

#21704: _multiprocessing module builds incorrectly when POSIX semaphor  closed by Arfrever

#21867: Turtle returns TypeError when undobuffer is set to 0 (aka no u  closed by berker.peksag

#21958: Allow python 2.7 to compile with Visual Studio 2013  closed by zach.ware

#21990: saxutils defines an inner class where a normal one would do  closed by rhettinger

#22003: BytesIO copy-on-write  closed by pitrou

#22018: signal.set_wakeup_fd() should accept sockets on Windows  closed by haypo

#22023: PyUnicode_FromFormat is broken on python 2  closed by haypo

#22033: Subclass friendly reprs  closed by serhiy.storchaka

#22041: http POST request with python 3.3 through web proxy  closed by ned.deily

#22044: Premature Py_DECREF while generating a TypeError in call_tzinf  closed by rhettinger

#22054: Add os.get_blocking() and os.set_blocking() functions  closed by haypo

#22058: datetime.datetime() should accept a as init para  closed by rhettinger

#22066: subprocess.communicate() does not receive full output from the  closed by ezio.melotti

#22072: Fix typos in SSL's documentation  closed by python-dev

#22073: Reference links in PEP466 are broken  closed by ned.deily

#22074: Lib/test/ fails with NameError  closed by pitrou

#22075: Lambda, Enumerate and List comprehensions crash  closed by ned.deily

#22076: csv module bad grammar in exception message  closed by berker.peksag

#22078: io.BufferedReader hides ResourceWarnings when garbage collecte  closed by serhiy.storchaka

#22081: Backport repr(socket.socket) from Python 3.5 to Python 2.7  closed by haypo

#22082: Clear interned strings listed in slotdefs  closed by loewis

#22084: Mutating while iterating  closed by ncoghlan

#22085: Drop support of Tk 8.3  closed by serhiy.storchaka

#22089: collections.MutableSet does not provide update method  closed by rhettinger

#22096: Argument Clinic: add ability to specify an existing impl funct  closed by zach.ware

#22099: Two "Save As" Windows  closed by ned.deily

#22101: doesn't provide copy() method  closed by rhettinger

#22106: Python 2 docs 'control flow/pass' section contains bad example  closed by rhettinger

#22108: python c api wchar_t*/char* passing contradiction  closed by loewis

#22109: Python failing in markupsafe module when running ansible  closed by r.david.murray

#22111: Improve imaplib testsuite.  closed by pitrou

#1508864: threading.Timer/timeouts break on change of win32 local time  closed by haypo

From cf.natali at  Fri Aug  1 19:49:52 2014
From: cf.natali at (Charles-François Natali)
Date: Fri, 1 Aug 2014 18:49:52 +0100
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
Message-ID: <>

2014-08-01 13:23 GMT+01:00 Shiz <hi at>:
>> Is your P.S. suggestive that you would not be willing to support your port for use by others?  Of course, until it is somewhat complete, it is hard to know how complete and compatible it can be.
> Oh, no, nothing like that. It's just that I'm not sure, as goes for anything, that it would be accepted into mainline CPython. Better safe than sorry in that aspect: maybe the maintainers don't want to support Android in the first place. :)

Well, Android is so popular that supporting it would definitely be interesting.
There are a couple questions however (I'm not familiar at all with
Android, I don't have a smartphone ;-):
- Do you have an idea of the amount of work/patch size required? Do
you have an example of a patch (even if it's a work-in-progress)?
- Is there really a common Android platform? I've heard a lot about
fragmentation, so would we have to support several Android flavours
(like #ifdef __ANDROID_VENDOR_A__, #elif defined

From hi at  Fri Aug  1 20:09:30 2014
From: hi at (Shiz)
Date: Fri, 01 Aug 2014 20:09:30 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
Message-ID: <>

Charles-François Natali wrote:
> Well, Android is so popular that supporting it would definitely be
> interesting. There are a couple questions however (I'm not familiar
> at all with Android, I don't have a smartphone ;-): - Do you have an
> idea of the amount of work/patch size required? Do you have an
> example of a patch (even if it's a work-in-progress)? - Is there
> really a common Android platform? I've heard a lot about 
> fragmentation, so would we have to support several Android flavours 
> (like #ifdef __ANDROID_VENDOR_A__, #elif defined 

Absolutely! I maintain a public repository of patches against CPython
v3.3.3 at [1].

They are divided into three large patches: one fixes some issues I
encountered with CPython's build system for cross-compilation, one fixes
Android/Bionic's numerous locale issues (locale.h/langinfo.h support in
Android is basically a set of stub functions that return NULL), and the
last one is a set of 'misc' fixes for things that affect Android, mainly
smaller things like missing fields in struct passwd and the like.

With those patches, CPython 3.3.3 will cross-compile to and run on at
least my own Android device, a Moto G running Android 4.4.2. What's left
is to fix the numerous regression test failures and their causes. I
documented some of my findings at [2]. :)

As far as Android fragmentation goes, to my knowledge that mainly refers
to fragmentation at two levels: the Android versions that devices
run tend to differ greatly, and the screen sizes, resolutions and
aspect ratios vary greatly. Obviously the latter is a problem beyond the
scope of CPython, but the former could lead to some issues.

Luckily however, the NDK[3], the SDK of choice to use for C/C++-level
applications, is fairly unified and expected to be used for pretty much
all Android devices. Essentially there should be only the NDK for
CPython to target, with a variety of NDK versions to support depending
on which versions of Android CPython chooses to support. So far I've
been only testing against NDK r9c, so I'm honestly not all that familiar
with the changes different NDK versions bring, but from what I heard, the
changes up until NDK r10 were mostly toolchain updates and header
additions for new Android versions.

I'd dare say that the vast, vast majority of Android devices out there
are running on the same base, namely AOSP[4] with numerous vendor
fixes/drivers/additions, and that custom Android distributions would try
not to break NDK compatibility.

Kind regards,


From agriff at  Fri Aug  1 23:48:37 2014
From: agriff at (Andrea Griffini)
Date: Fri, 1 Aug 2014 23:48:37 +0200
Subject: [Python-Dev] sum(...) limitation
Message-ID: <>

help(sum) clearly says that it should be used to sum numbers and not
strings, and with strings it actually fails.

However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
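
The behaviour described here can be reproduced with a short script. The variable names are only for illustration, and the exact wording of the error message is deliberately not shown:

```python
lists = [[1, 2, 3], [4], [], [5, 6]]

# With an empty list as the start value, sum() concatenates the lists.
flattened = sum(lists, [])
print(flattened)

# With strings, sum() refuses and raises TypeError.
try:
    sum(["a", "b"], "")
except TypeError as exc:
    string_error = str(exc)
else:
    string_error = None
print(string_error)
```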

Is this to be considered a bug?


From guido at  Fri Aug  1 23:51:54 2014
From: guido at (Guido van Rossum)
Date: Fri, 1 Aug 2014 14:51:54 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
Message-ID: <>

No. We just can't put all possible use cases in the docstring. :-)

On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff at> wrote:

> help(sum) tells clearly that it should be used to sum numbers and not
> strings, and with strings actually fails.
> However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
> Is this to be considered a bug?
> Andrea

--Guido van Rossum

From 4kir4.1i at  Sat Aug  2 03:53:45 2014
From: 4kir4.1i at (Akira Li)
Date: Sat, 02 Aug 2014 05:53:45 +0400
Subject: [Python-Dev] Exposing the Android platform existence to Python
References: <>
Message-ID: <>

Shiz <hi at> writes:

> Hi folks,
> I'm working on porting CPython to the Android platform, and while
> making decent progress, I'm currently stuck at a higher-level issue
> than adding #ifdefs for __ANDROID__ to C extension modules.
> The idea is, not only do CPython extension modules have some assumptions
> that don't seem to fit Android's mold, some default Python-written
> modules do as well. However, whereas CPython extensions can trivially
> check if we're building for Android by checking the __ANDROID__
> compiler macro, Python modules can do no such check, and are left
> wondering how to figure out if the platform they are currently running
> on is an Android one. To my knowledge there is no reliable way to
> detect if one is using Android as a vehicle for their journey using
> any other way.
> Now, the main question is: what would be the best way to "expose" the
> indication that Android is being run on to Python-living modules? My
> own thought was to add sys.getlinuxuserland(), or
> platform.linux_userland(), in a similar vein to sys.getwindowsversion()
> and platform.linux_distribution(), which could return information
> about the userland of the running CPython instance, instead of knowing
> merely the kernel and the distribution.
> This way, code could trivially check if it ran on the GNU(+associates)
> userland, or under a BSD-ish userland, or Android... and adjust its
> behaviour accordingly.
> I would be delighted to hear comments on this proposal, or better yet,
> alternative solutions. :)
> Kind regards,
> Shiz
> P.S.: I am well aware that Android might as well never be officially
> supported in CPython. In that case, consider this a thought experiment
> of how it /would/ be handled. :)

Python uses os.name, sys.platform, and various functions from the
`platform` module to provide version info:

- coarse: os.name is 'posix', 'nt', 'ce', 'java' [1]. It is defined by
          availability of some builtin modules ('posix', 'nt' in
          particular) at import time.

- finer: sys.platform may start with freebsd, linux, win, cygwin, darwin
         (`uname -s`). It is defined at python build time.

- detailed: `platform` module. It provides as much info as possible
            e.g., platform.uname(), platform.platform().
            It may use runtime commands to get it.
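
The three levels can be printed side by side on any interpreter. This sketch assumes the coarse level refers to os.name; platform_report is just an illustrative helper name:

```python
import os
import platform
import sys


def platform_report():
    """Collect the coarse, finer and detailed platform identifiers."""
    return {
        "coarse (os.name)": os.name,
        "finer (sys.platform)": sys.platform,
        "detailed (platform.system())": platform.system(),
        "detailed (platform.platform())": platform.platform(),
    }


for level, value in platform_report().items():
    print(f"{level}: {value}")
```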

If Android is posixy enough (would the `posix` module work on Android?)
then os.name could be left 'posix'.

You could set sys.platform to 'android' (like sys.platform may be
'cygwin' on Windows) if Android is not like *any other* Linux
distribution (from the point of view of writing working Python code on
it), i.e., if Android is further from other Linux distributions than
freebsd, linux and darwin are from each other, then it might deserve
its own sys.platform slot.

If sys.platform is left 'linux' (like sys.platform is 'darwin' on iOS)
then the platform module could be used to detect Android, e.g., via
platform.linux_distribution(), though it might be removed in Python 3.6
and it is unpredictable [2] unless you fix it in your Python distribution.
Here's the output on my machine:

  >>> import platform
  >>> platform.linux_distribution()
  ('Ubuntu', '14.04', 'trusty')

For example:

  is_android = (platform.linux_distribution()[0] == 'Android')

You could also define platform.android_version() that can provide Android
specific version details as much as you need:

  is_android = bool(platform.android_version().release)

You could provide an alias android_ver (like existing java_ver, libc_ver,
mac_ver, win32_ver).
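
A hedged sketch that combines both checks behind one helper. platform.android_version() is the hypothetical function proposed above, not an existing stdlib call, and platform.linux_distribution() is absent from newer interpreters (it was removed in Python 3.8), so both are looked up with getattr. The fake module and its version string are made up for the demonstration:

```python
import platform
from types import SimpleNamespace


def is_android(platform_module=platform):
    """Return True if the given platform module reports Android."""
    # Hypothetical API from this thread; not assumed to exist in the stdlib.
    android_version = getattr(platform_module, "android_version", None)
    if android_version is not None:
        return bool(android_version().release)
    # linux_distribution() may be missing, so fall back gracefully.
    linux_distribution = getattr(platform_module, "linux_distribution", None)
    if linux_distribution is not None:
        return linux_distribution()[0] == "Android"
    return False


# Stand-in for a patched platform module on an Android build (made-up values).
fake_android = SimpleNamespace(
    android_version=lambda: SimpleNamespace(release="4.4.2"),
)
print(is_android(fake_android))
print(is_android(SimpleNamespace()))
```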

See also, "When to use os.name, sys.platform, or platform.system?" [3]

Unrelated, TIL [4]:

  Android is a Linux distribution according to the Linux Foundation


btw, would it help to add an os.get_shell_executable() [5] function, to
avoid hacking the subprocess module, so that os.confstr('CS_PATH') or
os.defpath on Android could be defined to include /system/bin instead?



From steve at  Sat Aug  2 05:06:34 2014
From: steve at (Steven D'Aprano)
Date: Sat, 2 Aug 2014 13:06:34 +1000
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
Message-ID: <20140802030634.GH4525@ando>

On Sat, Aug 02, 2014 at 05:53:45AM +0400, Akira Li wrote:

> Python uses os.name, sys.platform, and various functions from `platform`
> module to provide version info:
> If Android is posixy enough (would `posix` module work on Android?)
> then os.name could be left 'posix'.

Does anyone know what kivy does when running under Android?


From guido at  Sat Aug  2 05:34:32 2014
From: guido at (Guido van Rossum)
Date: Fri, 1 Aug 2014 20:34:32 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <20140802030634.GH4525@ando>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

Or SL4A?

On Fri, Aug 1, 2014 at 8:06 PM, Steven D'Aprano <steve at> wrote:

> On Sat, Aug 02, 2014 at 05:53:45AM +0400, Akira Li wrote:
> > Python uses os.name, sys.platform, and various functions from `platform`
> > module to provide version info:
> [...]
> > If Android is posixy enough (would `posix` module work on Android?)
> > then os.name could be left 'posix'.
> Does anyone know what kivy does when running under Android?
> --
> Steven

--Guido van Rossum

From cyberdupo56 at  Sat Aug  2 07:57:38 2014
From: cyberdupo56 at (Allen Li)
Date: Fri, 1 Aug 2014 22:57:38 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
Message-ID: <20140802055738.GA6053@gensokyo>

On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
> No. We just can't put all possible use cases in the docstring. :-)
> On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff at> wrote:
>     help(sum) tells clearly that it should be used to sum numbers and not
>     strings, and with strings actually fails.
>     However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
>     Is this to be considered a bug?

Can you explain the rationale behind this design decision?  It seems
terribly inconsistent.  Why are only strings explicitly restricted from
being sum()ed?  sum() should either ban everything except numbers or
accept everything that implements addition (duck typing).

From tjreedy at  Sat Aug  2 08:35:32 2014
From: tjreedy at (Terry Reedy)
Date: Sat, 02 Aug 2014 02:35:32 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140802055738.GA6053@gensokyo>
References: <>
Message-ID: <lri0rs$em0$>

On 8/2/2014 1:57 AM, Allen Li wrote:
> On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
>> No. We just can't put all possible use cases in the docstring. :-)
>> On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff at> wrote:
>>      help(sum) tells clearly that it should be used to sum numbers and not
>>      strings, and with strings actually fails.
>>      However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
>>      Is this to be considered a bug?
> Can you explain the rationale behind this design decision?  It seems
> terribly inconsistent.  Why are only strings explicitly restricted from
> being sum()ed?  sum() should either ban everything except numbers or
> accept everything that implements addition (duck typing).

O(n**2) behavior, ''.join(strings) alternative.
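
Spelled out: adding sequences with + copies the accumulated result at every step, which is where the quadratic cost comes from, and strings already have ''.join(). A minimal sketch of the linear alternatives; using itertools.chain.from_iterable for lists is a suggestion added here, not something stated in this message:

```python
from itertools import chain

strings = ["a", "b", "c"]
lists = [[1, 2, 3], [4], [], [5, 6]]

# Strings: join builds the result in a single pass.
joined = "".join(strings)

# Lists: chain.from_iterable walks each element once.
flattened = list(chain.from_iterable(lists))

# Same result as sum(lists, []), which builds a new list at every step.
quadratic = sum(lists, [])

print(joined, flattened, flattened == quadratic)
```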

Terry Jan Reedy

From phil at  Sat Aug  2 09:53:35 2014
From: phil at (Phil Thompson)
Date: Sat, 02 Aug 2014 08:53:35 +0100
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

On 02/08/2014 4:34 am, Guido van Rossum wrote:
> Or SL4A? (
> On Fri, Aug 1, 2014 at 8:06 PM, Steven D'Aprano <steve at> 
> wrote:
>> On Sat, Aug 02, 2014 at 05:53:45AM +0400, Akira Li wrote:
>> > Python uses os.name, sys.platform, and various functions from `platform`
>> > module to provide version info:
>> [...]
>> > If Android is posixy enough (would `posix` module work on Android?)
>> > then os.name could be left 'posix'.
>> Does anyone know what kivy does when running under Android?

I don't think either do anything.

As the OP said, porting Python to Android is mainly about dealing with a 
C stdlib that is limited in places. Therefore there might be the odd 
missing function or attribute in the Python stdlib - just the same as 
can happen with other platforms.

To me the issue is whether, for a particular value of sys.platform, the 
programmer can expect a particular Python stdlib API. If so then Android 
needs a different value for sys.platform.

On the other hand if the programmer should not expect to make such an 
assumption, and should instead allow for the absence of certain 
functions (but which ones?), then the existing value of 'linux' should 
be fine.

Another option I don't think I've seen suggested, given the recommended 
way of testing for Linux is to use sys.platform.startswith('linux'), is 
to use a value of 'linux-android'.
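
A minimal sketch of how that prefix check would behave, assuming the
'linux-android' value proposed here (it is only a suggestion in this
thread, not something CPython reports); the helper names are made up
for illustration:

```python
import sys


def is_linux(platform_name=sys.platform):
    # The recommended test: matches 'linux', 'linux2', 'linux-android', ...
    return platform_name.startswith('linux')


def is_android(platform_name=sys.platform):
    # Hypothetical: relies on the 'linux-android' value proposed in this thread.
    return platform_name == 'linux-android'
```

Code that already uses startswith('linux') would keep working, while code
that compares against the exact string 'linux' would not.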


From steve at  Sat Aug  2 09:39:12 2014
From: steve at (Steven D'Aprano)
Date: Sat, 2 Aug 2014 17:39:12 +1000
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140802055738.GA6053@gensokyo>
References: <>
Message-ID: <20140802073912.GI4525@ando>

On Fri, Aug 01, 2014 at 10:57:38PM -0700, Allen Li wrote:
> On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
> > No. We just can't put all possible use cases in the docstring. :-)
> > 
> > 
> > On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff at> wrote:
> > 
> >     help(sum) tells clearly that it should be used to sum numbers and not
> >     strings, and with strings actually fails.
> > 
> >     However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
> > 
> >     Is this to be considered a bug?
> Can you explain the rationale behind this design decision?  It seems
> terribly inconsistent.  Why are only strings explicitly restricted from
> being sum()ed?  sum() should either ban everything except numbers or
> accept everything that implements addition (duck typing).

Repeated list and str concatenation both have quadratic O(N**2) 
performance, but people frequently build up strings with + and rarely do 
the same for lists. String concatenation with + is an attractive 
nuisance for many people, including some who actually know better but 
nevertheless do it. Also, for reasons I don't understand, many people 
dislike or cannot remember to use ''.join.

Whatever the reason, repeated string concatenation is common whereas 
repeated list concatenation is much, much rarer (and repeated tuple 
concatenation even rarer), so sum(strings) is likely to be a land mine 
buried in your code while sum(lists) is not. Hence the decision that 
beginners in particular need to be protected from the mistake of using 
sum(strings) but bothering to check for sum(lists) is a waste of time.
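
For reference, a small sketch of the behaviour under discussion, as seen
on current CPython (variable names are only for illustration):

```python
# Lists: repeated concatenation through sum() is accepted.
lists = [[1, 2, 3], [4], [], [5, 6]]
flat = sum(lists, [])

# Strings: sum() refuses a str start value and raises TypeError.
strings = ['a', 'b', 'c']
error = None
try:
    sum(strings, '')
except TypeError as exc:
    error = exc

# The suggested alternative for strings.
joined = ''.join(strings)
```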

Personally, I wish that sum would raise a warning rather than an exception.

As for prohibiting anything except numbers with sum(), that in my 
opinion would be a bad idea. sum(vectors), sum(numeric_arrays), 
sum(angles) etc. should all be allowed. The general sum() built-in 
should accept any type that allows + (unless explicitly black-listed), 
while specialist numeric-only sums could go into modules (like 


From jtaylor.debian at  Sat Aug  2 12:11:54 2014
From: jtaylor.debian at (Julian Taylor)
Date: Sat, 02 Aug 2014 12:11:54 +0200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <lri0rs$em0$>
References: <>
 <20140802055738.GA6053@gensokyo> <lri0rs$em0$>
Message-ID: <>

On 02.08.2014 08:35, Terry Reedy wrote:
> On 8/2/2014 1:57 AM, Allen Li wrote:
>> On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
>>> No. We just can't put all possible use cases in the docstring. :-)
>>> On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff at> wrote:
>>>      help(sum) tells clearly that it should be used to sum numbers
>>> and not
>>>      strings, and with strings actually fails.
>>>      However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
>>>      Is this to be considered a bug?
>> Can you explain the rationale behind this design decision?  It seems
>> terribly inconsistent.  Why are only strings explicitly restricted from
>> being sum()ed?  sum() should either ban everything except numbers or
>> accept everything that implements addition (duck typing).
> O(n**2) behavior, ''.join(strings) alternative.

hm could this be a pure python case that would profit from temporary
elision [0]?

lists could declare the tp_can_elide slot and call list.extend on the
temporary during its tp_add slot instead of creating a new temporary.
extend/realloc can avoid the copy if there is free memory available
after the block.


From stefan_ml at  Sat Aug  2 12:56:53 2014
From: stefan_ml at (Stefan Behnel)
Date: Sat, 02 Aug 2014 12:56:53 +0200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <lri0rs$em0$>
Message-ID: <lrig5n$ivq$>

Julian Taylor schrieb am 02.08.2014 um 12:11:
> On 02.08.2014 08:35, Terry Reedy wrote:
>> On 8/2/2014 1:57 AM, Allen Li wrote:
>>> On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
>>>> No. We just can't put all possible use cases in the docstring. :-)
>>>> On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff at> wrote:
>>>>      help(sum) tells clearly that it should be used to sum numbers
>>>> and not
>>>>      strings, and with strings actually fails.
>>>>      However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
>>>>      Is this to be considered a bug?
>>> Can you explain the rationale behind this design decision?  It seems
>>> terribly inconsistent.  Why are only strings explicitly restricted from
>>> being sum()ed?  sum() should either ban everything except numbers or
>>> accept everything that implements addition (duck typing).
>> O(n**2) behavior, ''.join(strings) alternative.
> lists could declare the tp_can_elide slot and call list.extend on the
> temporary during its tp_add slot instead of creating a new temporary.
> extend/realloc can avoid the copy if there is free memory available
> after the block.

Yes, i.e. only sometimes. Better not rely on it in your code.


From alexander.belopolsky at  Sat Aug  2 16:52:07 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Sat, 2 Aug 2014 10:52:07 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140802073912.GI4525@ando>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano <steve at> wrote:

> String concatenation with + is an attractive
> nuisance for many people, including some who actually know better but
> nevertheless do it. Also, for reasons I don't understand, many people
> dislike or cannot remember to use ''.join.

Since sum() already treats strings as a special case, why can't it simply
call (an equivalent of) ''.join itself instead of telling the user to do
it?  It does not matter why "many people dislike or cannot remember to use
''.join" - if this is a fact - it should be considered by language
implementors.

From stefan_ml at  Sat Aug  2 17:06:10 2014
From: stefan_ml at (Stefan Behnel)
Date: Sat, 02 Aug 2014 17:06:10 +0200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <lriup3$ase$>

Alexander Belopolsky schrieb am 02.08.2014 um 16:52:
> On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano wrote:
>> String concatenation with + is an attractive
>> nuisance for many people, including some who actually know better but
>> nevertheless do it. Also, for reasons I don't understand, many people
>> dislike or cannot remember to use ''.join.
> Since sum() already treats strings as a special case, why can't it simply
> call (an equivalent of) ''.join itself instead of telling the user to do
> it?  It does not matter why "many people dislike or cannot remember to use
> ''.join" - if this is a fact - it should be considered by language
> implementors.

I don't think sum(strings) is beautiful enough to merit special cased
support. Special cased rejection sounds like a much better way to ask
people "think again - what's a sum of strings anyway?".


From steve at  Sat Aug  2 17:27:56 2014
From: steve at (Steven D'Aprano)
Date: Sun, 3 Aug 2014 01:27:56 +1000
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <20140802152756.GJ4525@ando>

On Sat, Aug 02, 2014 at 10:52:07AM -0400, Alexander Belopolsky wrote:
> On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano <steve at> wrote:
> > String concatenation with + is an attractive
> > nuisance for many people, including some who actually know better but
> > nevertheless do it. Also, for reasons I don't understand, many people
> > dislike or cannot remember to use ''.join.
> >
> Since sum() already treats strings as a special case, why can't it simply
> call (an equivalent of) ''.join itself instead of telling the user to do
> it?  It does not matter why "many people dislike or cannot remember to use
> ''.join" - if this is a fact - it should be considered by language
> implementors.

It could, of course, but there is virtue in keeping sum simple, 
rather than special-casing who knows how many different types. If sum() 
tries to handle strings, should it do the same for lists? bytearrays? 
array.array? tuple? Where do we stop?

Ultimately it comes down to personal taste. Some people are going to 
wish sum() tried harder to do the clever thing with more types, some 
people are going to wish it was simpler and didn't try to be clever at
all.

Another argument against excessive cleverness is that it ties sum() to 
one particular idiom or implementation. Today, the idiomatic and 
efficient way to concatenate a lot of strings is with ''.join, but 
tomorrow there might be a new str.concat() method. Who knows? sum() 
shouldn't have to care about these details, since they are secondary to 
sum()'s purpose, which is to add numbers. Anything else is a 
bonus (or perhaps a nuisance).

So, I would argue that when faced with something that is not a number, 
there are two reasonable approaches for sum() to take:

- refuse to handle the type at all; or
- fall back on simple-minded repeated addition.

By the way, I think this whole argument would have been easily 
side-stepped if + was only used for addition, and & used for 
concatenation. Then there would be no question about what sum() should 
do for lists and tuples and strings: raise TypeError.
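
The two approaches listed above could be sketched roughly as follows.
Both function names are hypothetical, and neither is how the built-in
sum() is actually implemented:

```python
from numbers import Number


def strict_sum(iterable, start=0):
    # Approach 1: refuse anything that is not a number.
    total = start
    for item in iterable:
        if not isinstance(item, Number):
            raise TypeError('strict_sum() only accepts numbers')
        total = total + item
    return total


def naive_sum(iterable, start=0):
    # Approach 2: simple-minded repeated addition for any type with +.
    total = start
    for item in iterable:
        total = total + item
    return total
```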


From hi at  Sat Aug  2 14:00:04 2014
From: hi at (Shiz)
Date: Sat, 02 Aug 2014 14:00:04 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
Message-ID: <>

Akira Li wrote:
> Python uses os.name, sys.platform, and various functions from
> `platform` module to provide version info:
> - coarse: os.name is 'posix', 'nt', 'ce', 'java' [1]. It is defined
> by availability of some builtin modules ('posix', 'nt' in particular)
> at import time.
> - finer: sys.platform may start with freebsd, linux, win, cygwin,
> darwin (`uname -s`). It is defined at python build time.
> - detailed: `platform` module. It provides as much info as possible 
> e.g., platform.uname(), platform.platform(). It may use runtime
> commands to get it.
> If Android is posixy enough (would `posix` module work on Android?) 
> then os.name could be left 'posix'.
> You could set sys.platform to 'android' (like sys.platform may be 
> 'cygwin' on Windows) if Android is not like *any other* Linux 
> distribution (from the point of view of writing a working Python code
> on it) i.e., if Android is further from other Linux distribution
> than freebsd, linux, darwin from each other then it might deserve 
> sys.platform slot.
> If sys.platform is left 'linux' (like sys.platform is 'darwin' on
> iOS) then platform module could be used to detect Android e.g., 
> platform.linux_distribution() though (it might be removed in Python
> 3.6) it is unpredictable [2] unless you fix it on your python
> distribution, e.g., here's an output on my machine:
> >>> import platform
> >>> platform.linux_distribution()
> ('Ubuntu', '14.04', 'trusty')
> For example:
> is_android = (platform.linux_distribution()[0] == 'Android')
> You could also define platform.android_version() that can provide
> Android specific version details as much as you need:
> is_android = bool(platform.android_version().release)
> You could provide an alias android_ver (like existing java_ver,
> libc_ver, mac_ver, win32_ver).
> See also, "When to use os.name, sys.platform, or platform.system?"
> [3]
> Unrelated, TIL [4]:
> Android is a Linux distribution according to the Linux Foundation
> [1] [2]
> [3] 
> btw, does it help adding os.get_shell_executable() [5] function, to 
> avoid hacking subprocess module, so that os.confstr('CS_PATH') or 
> os.defpath on Android could be defined to include /system/bin
> instead?
> [5]

Thanks for the detailed information!

I would consider Android at least POSIX-y enough for os.name to be
considered 'posix'. It doesn't implement a few POSIX-mandated things
like POSIX semaphores, but aside from that I would largely consider it
'compatible enough'.

I guess what is left is deciding whether to add a platform slot for
Android, or to stuff the detection in platform.linux_distribution(). I
feel like it would be a bit hacky for standard modules to rely on a
platform.linux_distribution() return value though, it seems mostly
useful for display purposes.

Phil Thompson's idea of setting sys.platform to 'linux-android' also
occurred to me. Under the premise that we can get users to use
sys.platform.startswith('linux'), this seems like the best solution in
my eyes: it both allows for existing code to continue the assumption
that they are running on a Linux platform, which I believe to be correct
in a lot of places, and Python modules to use a solid value to check if
they need to behave differently when running on Android.

On a sidenote, Kivy and SL4A/Py4A do not address this, no. From what
I've seen from their patches they are mostly there to get Python
compiling and running in the first place, not necessarily about fixing
every compatibility issue. :)

As for the os.get_shell_executable(), that seems like a good solution
for the issue that occurs in the subprocess module indeed. I'd
personally prefer it to manual checking within the module.

Kind regards,


From python at  Sat Aug  2 17:50:32 2014
From: python at (MRAB)
Date: Sat, 02 Aug 2014 16:50:32 +0100
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140802152756.GJ4525@ando>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 2014-08-02 16:27, Steven D'Aprano wrote:
> On Sat, Aug 02, 2014 at 10:52:07AM -0400, Alexander Belopolsky wrote:
>> On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano <steve at> wrote:
>> > String concatenation with + is an attractive
>> > nuisance for many people, including some who actually know better but
>> > nevertheless do it. Also, for reasons I don't understand, many people
>> > dislike or cannot remember to use ''.join.
>> >
>> Since sum() already treats strings as a special case, why can't it simply
>> call (an equivalent of) ''.join itself instead of telling the user to do
>> it?  It does not matter why "many people dislike or cannot remember to use
>> ''.join" - if this is a fact - it should be considered by language
>> implementors.
> It could, of course, but there is virtue in keeping sum simple,
> rather than special-casing who knows how many different types. If sum()
> tries to handle strings, should it do the same for lists? bytearrays?
> array.array? tuple? Where do we stop?
We could leave any special-casing to the classes themselves:

def sum(iterable, start=0):
     # Look up an optional per-type hook; None means no special case.
     sum_func = getattr(type(start), '__sum__', None)

     if sum_func is None:
         result = start

         for item in iterable:
             result = result + item
     else:
         result = sum_func(start, iterable)

     return result
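
To illustrate how a class might opt in, here is a self-contained sketch.
The '__sum__' hook is only the idea floated above and does not exist in
Python; 'dispatching_sum' and 'Text' are made-up names:

```python
def dispatching_sum(iterable, start=0):
    # Same dispatch idea as above, under a different name so the built-in
    # sum() is left untouched.
    sum_func = getattr(type(start), '__sum__', None)
    if sum_func is None:
        result = start
        for item in iterable:
            result = result + item
    else:
        result = sum_func(start, iterable)
    return result


class Text(str):
    # Hypothetical str subclass that supplies its own efficient summation.
    def __sum__(self, iterable):
        return Text(self + ''.join(iterable))
```

With this, dispatching_sum(['a', 'b'], Text('>')) would go through
''.join, while plain numbers still take the repeated-addition path.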

> Ultimately it comes down to personal taste. Some people are going to
> wish sum() tried harder to do the clever thing with more types, some
> people are going to wish it was simpler and didn't try to be clever at
> all.
> Another argument against excessive cleverness is that it ties sum() to
> one particular idiom or implementation. Today, the idiomatic and
> efficient way to concatenate a lot of strings is with ''.join, but
> tomorrow there might be a new str.concat() method. Who knows? sum()
> shouldn't have to care about these details, since they are secondary to
> sum()'s purpose, which is to add numbers. Anything else is a
> bonus (or perhaps a nuisance).
> So, I would argue that when faced with something that is not a number,
> there are two reasonable approaches for sum() to take:
> - refuse to handle the type at all; or
> - fall back on simple-minded repeated addition.
> By the way, I think this whole argument would have been easily
> side-stepped if + was only used for addition, and & used for
> concatenation. Then there would be no question about what sum() should
> do for lists and tuples and strings: raise TypeError.

From alexander.belopolsky at  Sat Aug  2 20:15:34 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Sat, 2 Aug 2014 14:15:34 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <lriup3$ase$>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Sat, Aug 2, 2014 at 11:06 AM, Stefan Behnel <stefan_ml at> wrote:

> I don't think sum(strings) is beautiful enough

sum(strings) is more beautiful than ''.join(strings) in my view, but
unfortunately it does not work even for lists because the initial value
defaults to 0.

sum(strings, '') and ''.join(strings) are equally ugly and non-obvious
because they require an empty string.  Empty containers are an advanced
concept and it is unfortunate that a simple job of concatenating a list of
(non-empty!) strings exposes the user to it.

From guido at  Sat Aug  2 20:36:29 2014
From: guido at (Guido van Rossum)
Date: Sat, 2 Aug 2014 11:36:29 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

On Sat, Aug 2, 2014 at 12:53 AM, Phil Thompson <phil at> wrote:

> To me the issue is whether, for a particular value of sys.platform, the
> programmer can expect a particular Python stdlib API. If so then Android
> needs a different value for sys.platform.

sys.platform is for a broad indication of the OS kernel. It can be used to
distinguish Windows, Mac and Linux (and BSD, Solaris etc.). Since Android
is Linux it should have the same sys.platform as other Linux systems
('linux2'). If you want to know whether a specific syscall is there, check
for the presence of the method in the os module.
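
A minimal sketch of that kind of feature test, assuming a helper name
('has_os_function') invented here for illustration:

```python
import os


def has_os_function(name):
    # True only if the os module on this platform exposes a callable `name`.
    return callable(getattr(os, name, None))
```

For example, has_os_function('fork') can be checked before calling
os.fork(), instead of branching on sys.platform.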

The platform module is suitable for additional vendor-specific info about
the platform, and I'd hope that there's something there that indicates
Android. Again, what values does the platform module return on SL4A or
Kivy, which have already ported Python to Android? In particular, I'd
expect platform.linux_distribution() to return a clue that it's Android.
There should also be clues in /etc/lsb-release (assuming Android supports
it :-).

--Guido van Rossum (

From hi at  Sat Aug  2 21:14:30 2014
From: hi at (Shiz)
Date: Sat, 02 Aug 2014 21:14:30 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

Guido van Rossum wrote:
> sys.platform is for a broad indication of the OS kernel. It can be
> used to distinguish Windows, Mac and Linux (and BSD, Solaris etc.).
> Since Android is Linux it should have the same sys.platform as other
> Linux systems ('linux2'). If you want to know whether a specific
> syscall is there, check for the presence of the method in the os
> module.
> The platform module is suitable for additional vendor-specific info 
> about the platform, and I'd hope that there's something there that 
> indicates Android. Again, what values does the platform module return
> on SL4A or Kivy, which have already ported Python to Android? In 
> particular, I'd expect platform.linux_distribution() to return a
> clue that it's Android. There should also be clues in
> /etc/lsb-release (assuming Android supports it :-).
> -- --Guido van Rossum ( <>)

To the best of my knowledge, Kivy and Py4A/SL4A don't modify that code
at all, so it just returns 'linux2'. In addition, they don't modify
platform.py either, so platform.linux_distribution() returns empty values.

My patchset[1] currently contains patches that both set sys.platform to
'linux-android' and modifies platform.linux_distribution() to parse and
return a proper value for Android systems:

>>> import sys, platform
>>> sys.platform
'linux-android'
>>> platform.linux_distribution()
('Android', '4.4.2', 'Blur_Version.174.44.9.falcon_umts.EURetail.en.EU')

The sys.platform thing was mainly done out of curiosity on its
possibility after Phil bringing it up. My main issue with leaving
Android detection to checking platform.linux_distribution() is that it
feels like a bit of a wonky thing for core Python modules to rely on to
change behaviour where needed on Android (as well as introducing a
dependency cycle between subprocess and platform right now).

I'd also like to note that I wouldn't agree with following too many of
Kivy/Py4A/SL4A's design decisions on this, as they seem mostly absent.
From what I've read, their patches mostly seem geared towards getting
Python to run on Android, not necessarily integrating it well or fixing
all inconsistencies. This also leads to things like subprocess.Popen()
indeed breaking with shell=True[2].

Kind regards,



From dw+python-dev at  Sat Aug  2 22:35:13 2014
From: dw+python-dev at (David Wilson)
Date: Sat, 2 Aug 2014 20:35:13 +0000
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140802073912.GI4525@ando>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <20140802203513.GA10447@k2>

On Sat, Aug 02, 2014 at 05:39:12PM +1000, Steven D'Aprano wrote:

> Repeated list and str concatenation both have quadratic O(N**2)
> performance, but people frequently build up strings with + and rarely
> do the same for lists. String concatenation with + is an attractive
> nuisance for many people, including some who actually know better but
> nevertheless do it. Also, for reasons I don't understand, many people
> dislike or cannot remember to use ''.join.

join() isn't preferable in cases where it damages readability while
simultaneously providing zero or negative performance benefit, such as
when concatenating a few short strings, e.g. while adding a prefix to a

Although it's true that join() is automatically the safer option, and
especially when dealing with user supplied data, the net harm caused by
teaching rote and ceremony seems far less desirable compared to fixing a
trivial slowdown in a script, if that slowdown ever became apparent.

Another (twisted) interpretation is that since the quadratic behaviour
is a CPython implementation detail, and there are alternatives where
__add__ is constant time, encouraging users to code against
implementation details becomes undesirable. In our twisty world, __add__
becomes *preferable* since the resulting programs more closely resemble

    $ cat t.py
    a = 'this '
    b = 'is a string'
    c = 'as we can tell'

    def x():
        return a + b + c

    def y():
        return ''.join([a, b, c])

    $ python -m timeit -s 'import t' 't.x()'
    1000000 loops, best of 3: 0.477 usec per loop

    $ python -m timeit -s 'import t' 't.y()'
    1000000 loops, best of 3: 0.695 usec per loop


From phil at  Sat Aug  2 22:38:37 2014
From: phil at (Phil Thompson)
Date: Sat, 02 Aug 2014 21:38:37 +0100
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

On 02/08/2014 7:36 pm, Guido van Rossum wrote:
> On Sat, Aug 2, 2014 at 12:53 AM, Phil Thompson 
> <phil at>
> wrote:
>> To me the issue is whether, for a particular value of sys.platform, 
>> the
>> programmer can expect a particular Python stdlib API. If so then 
>> Android
>> needs a different value for sys.platform.
> sys.platform is for a broad indication of the OS kernel. It can be used 
> to
> distinguish Windows, Mac and Linux (and BSD, Solaris etc.). Since 
> Android
> is Linux it should have the same sys.platform as other Linux systems
> ('linux2'). If you want to know whether a specific syscall is there, 
> check
> for the presence of the method in the os module.

It's not just the os module - other modules contain code that would be 
affected, but there are plenty of other parts of the Python stdlib that 
aren't implemented on every platform. Using the approach you prefer then 
all that's needed is to update the documentation to say that certain 
things are not implemented on Android.


From guido at  Sat Aug  2 22:40:38 2014
From: guido at (Guido van Rossum)
Date: Sat, 2 Aug 2014 13:40:38 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>


On Saturday, August 2, 2014, Phil Thompson <phil at>

> On 02/08/2014 7:36 pm, Guido van Rossum wrote:
>> On Sat, Aug 2, 2014 at 12:53 AM, Phil Thompson <
>> phil at>
>> wrote:
>>  To me the issue is whether, for a particular value of sys.platform, the
>>> programmer can expect a particular Python stdlib API. If so then Android
>>> needs a different value for sys.platform.
>> sys.platform is for a broad indication of the OS kernel. It can be used to
>> distinguish Windows, Mac and Linux (and BSD, Solaris etc.). Since Android
>> is Linux it should have the same sys.platform as other Linux systems
>> ('linux2'). If you want to know whether a specific syscall is there, check
>> for the presence of the method in the os module.
> It's not just the os module - other modules contain code that would be
> affected, but there are plenty of other parts of the Python stdlib that
> aren't implemented on every platform. Using the approach you prefer then
> all that's needed is to update the documentation to say that certain things
> are not implemented on Android.
> Phil

--Guido van Rossum (on iPad)

From guido at  Sat Aug  2 22:35:01 2014
From: guido at (Guido van Rossum)
Date: Sat, 2 Aug 2014 13:35:01 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

On Sat, Aug 2, 2014 at 12:14 PM, Shiz <hi at> wrote:

> Guido van Rossum wrote:
> > sys.platform is for a broad indication of the OS kernel. It can be
> > used to distinguish Windows, Mac and Linux (and BSD, Solaris etc.).
> > Since Android is Linux it should have the same sys.platform as other
> > Linux systems ('linux2'). If you want to know whether a specific
> > syscall is there, check for the presence of the method in the os
> > module.
> >
> > The platform module is suitable for additional vendor-specific info
> > about the platform, and I'd hope that there's something there that
> > indicates Android. Again, what values does the platform module return
> > on SL4A or Kivy, which have already ported Python to Android? In
> > particular, I'd expect platform.linux_distribution() to return a
> > clue that it's Android. There should also be clues in
> > /etc/lsb-release (assuming Android supports it :-).
> >
> > -- --Guido van Rossum ( <>)
> To the best of my knowledge, Kivy and Py4A/SL4A don't modify that code
> at all, so it just returns 'linux2'. In addition, they don't modify
> platform.py either, so platform.linux_distribution() returns empty values.

OK, so personally I'd leave sys.platform but improve on

> My patchset[1] currently contains patches that both set sys.platform to
> 'linux-android' and modifies platform.linux_distribution() to parse and
> return a proper value for Android systems:
> >>> import sys, platform
> >>> sys.platform
> 'linux-android'
> >>> platform.linux_distribution()
> ('Android', '4.4.2', 'Blur_Version.174.44.9.falcon_umts.EURetail.en.EU')
> The sys.platform thing was mainly done out of curiosity on its
> possibility after Phil bringing it up.

Can you give a few examples of where you'd need to differentiate Android
from other Linux platforms in otherwise portable code, and where testing
for the presence or absence of the specific function that you'd like to
call isn't possible? I know I pretty much never test for the difference
between OSX and other UNIX variants (including Linux) -- the only platform
distinction that regularly comes up in my own code is Windows vs. the rest.
And even there, often the right thing to test for is something more
specific like os.sep.

> My main issue with leaving
> Android detection to checking platform.linux_distribution() is that it
> feels like a bit of a wonky thing for core Python modules to rely on to
> change behaviour where needed on Android (as well as introducing a
> dependency cycle between subprocess and platform right now).

What's the specific change in stdlib behavior that you're proposing for
Android?

> I'd also like to note that I wouldn't agree with following too many of
> Kivy/Py4A/SL4A's design decisions on this, as they seem mostly absent.
> From what I've read, their patches mostly seem geared towards getting
> Python to run on Android, not necessarily integrating it well or fixing
> all inconsistencies. This also leads to things like subprocess.Popen()
> indeed breaking with shell=True[2].

I'm all for fixing subprocess.Popen(), though I'm not sure what the best
way is to determine this particular choice (why is it in the first place
that /bin/sh doesn't work?). However, since it's a stdlib module you could
easily rely on a private API to detect Android, so this doesn't really
force the sys.platform issue. (Or you could propose a fix that will work
for Kivy and SL4A as well, e.g. checking for some system file that is
documented as unique to Android.)

> Kind regards,
> Shiz
> [1]:
> [2]:

--Guido van Rossum (

From dw+python-dev at  Sun Aug  3 00:13:39 2014
From: dw+python-dev at (David Wilson)
Date: Sat, 2 Aug 2014 22:13:39 +0000
Subject: [Python-Dev] [Python-checkins] cpython: Issue #22003: When
 initialized from a bytes object, io.BytesIO() now
In-Reply-To: <>
References: <>
Message-ID: <20140802221339.GA12662@k2>

Thanks for spotting,

There is a new patch in to fix the


From hi at  Sun Aug  3 00:49:00 2014
From: hi at (Shiz)
Date: Sun, 03 Aug 2014 00:49:00 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>


Guido van Rossum wrote:
> Can you give a few examples of where you'd need to differentiate
> Android from other Linux platforms in otherwise portable code, and
> where testing for the presence or absence of the specific function
> that you'd like to call isn't possible? I know I pretty much never
> test for the difference between OSX and other UNIX variants
> (including Linux) -- the only platform distinction that regularly
> comes up in my own code is Windows vs. the rest. And even there,
> often the right thing to test for is something more specific like
> os.sep.

> What's the specific change in stdlib behavior that you're proposing
> for Android?

The most obvious change would be to subprocess.Popen(). The reason a
generic approach there won't work is also the reason I expect more
changes might be needed: the Android file system doesn't abide by any
POSIX file system standards. Its shell isn't located at /bin/sh, but at
/system/bin/sh. The only directories it provides that are POSIX-standard
are /dev and /etc, to my knowledge. You could check to see if
/system/bin/sh exists and use that first, but that would break the
preferred shell on POSIX systems that happen to have /system for some
reason or another. In short: the preferred shell on POSIX systems is
/bin/sh, but on Android it's /system/bin/sh. Simple existence checking
might break the preferred shell on either. For more specific stdlib
examples I'd have to check the test suite again.

I can see the point of a sys.platform change not necessarily being
needed, but it would be nice for user code too to have a sort-of trivial
way to figure out if it's running on Android. While core CPython might
in general care far less, for user applications it's a bigger deal since
they have to draw GUIs and use system services in a way that *is*
usually very different on Android. Again, platform.linux_distribution()
seems more for display purposes than for applications to check their
core logic against.
In addition, apparently platform.linux_distribution() is getting
deprecated in 3.5 and removed in 3.6[1].

I agree that the above issue should in fact be solved by the
os.get_preferred_shell() approach linked earlier, however.

> However, since it's a stdlib module you could easily rely on a
> private API to detect Android, so this doesn't really force the
> sys.platform issue. (Or you could propose a fix that will work for
> Kivy and SL4A as well, e.g. checking for some system file that is
> documented as unique to Android.)

After checking most of the entire Android file system, I'm not sure if
such a file exists. Sure, a lot of the Android file system hierarchy
isn't really used anywhere else, but I'm not sure a check to see if e.g.
/system exists is really enough to conclude Python is running on Android
on its own. The thing that gets closest (which is the thing my patch
checks for) is several Android-specific environment
variables being defined (ANDROID_ROOT, ANDROID_DATA,
ANDROID_PROPERTY_WORKSPACE...). Wouldn't it be better to put this in the
standard Python library and expose it somehow, though? It *is* fragile
code, so it seems better if applications could 'just rely' on Python to
figure it out, since it's not a trivial check.
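
A minimal sketch of the environment-variable check described above. The
variable names come from this message; the helper name and the choice to
require all three variables are assumptions made for illustration, not
documented Android behaviour and not the code in the patch.

```python
import os

# Names taken from the message above. Requiring all of them is an
# assumption; a real check may need a different set or rule.
ANDROID_ENV_VARS = ("ANDROID_ROOT", "ANDROID_DATA", "ANDROID_PROPERTY_WORKSPACE")


def looks_like_android(environ=None):
    """Return True if every Android-specific variable is set in environ."""
    if environ is None:
        environ = os.environ
    return all(name in environ for name in ANDROID_ENV_VARS)
```

Passing a mapping explicitly makes the helper testable without touching
the real process environment. The check remains a heuristic: any process
can set these variables.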

Kind regards,



From greg.ewing at  Sun Aug  3 02:27:40 2014
From: greg.ewing at (Greg Ewing)
Date: Sun, 03 Aug 2014 12:27:40 +1200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

Shiz wrote:
> I'm not sure a check to see if e.g.
> /system exists is really enough to conclude Python is running on Android
> on its own.

Since MacOSX has /System and typically a case-insensitive
file system, it certainly wouldn't. :-)


From guido at  Sun Aug  3 06:41:42 2014
From: guido at (Guido van Rossum)
Date: Sat, 2 Aug 2014 21:41:42 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

Well, it really does look like checking for the presence of those ANDROID_*
environment variables is the best way to recognize the Android platform.
Anyone can do that without waiting for a ruling on whether Android is Linux
or not (which would be necessary because the docs for sys.platform are
quite clear about its value on Linux systems). Googling terms like "is
Android Linux" suggests that there is considerable controversy about the
issue, so I suggest you don't wait. :-)

On Sat, Aug 2, 2014 at 3:49 PM, Shiz <hi at> wrote:

> Guido van Rossum wrote:
> > Can you give a few examples of where you'd need to differentiate
> > Android from other Linux platforms in otherwise portable code, and
> > where testing for the presence or absence of the specific function
> > that you'd like to call isn't possible? I know I pretty much never
> > test for the difference between OSX and other UNIX variants
> > (including Linux) -- the only platform distinction that regularly
> > comes up in my own code is Windows vs. the rest. And even there,
> > often the right thing to test for is something more specific like
> > os.sep.
> > What's the specific change in stdlib behavior that you're proposing
> > for Android?
> The most obvious change would be to subprocess.Popen(). The reason a
> generic approach there won't work is also the reason I expect more
> changes might be needed: the Android file system doesn't abide by any
> POSIX file system standards. Its shell isn't located at /bin/sh, but at
> /system/bin/sh. The only directories it provides that are POSIX-standard
> are /dev and /etc, to my knowledge. You could check to see if
> /system/bin/sh exists and use that first, but that would break the
> preferred shell on POSIX systems that happen to have /system for some
> reason or another. In short: the preferred shell on POSIX systems is
> /bin/sh, but on Android it's /system/bin/sh. Simple existence checking
> might break the preferred shell on either. For more specific stdlib
> examples I'd have to check the test suite again.
> I can see the point of a sys.platform change not necessarily being
> needed, but it would be nice for user code too to have a sort-of trivial
> way to figure out if it's running on Android. While core CPython might
> in general care far less, for user applications it's a bigger deal since
> they have to draw GUIs and use system services in a way that *is*
> usually very different on Android. Again, platform.linux_distribution()
> seems more for display purposes than for applications to check their
> core logic against.
> In addition, apparently platform.linux_distribution() is getting
> deprecated in 3.5 and removed in 3.6[1].
> I agree that above issue should in fact be solved by the earlier-linked
> to os.get_preferred_shell() approach, however.
> > However, since it's a stdlib module you could easily rely on a
> > private API to detect Android, so this doesn't really force the
> > sys.platform issue. (Or you could propose a fix that will work for
> > Kivy and SL4A as well, e.g. checking for some system file that is
> > documented as unique to Android.)
> After checking most of the entire Android file system, I'm not sure if
> such a file exists. Sure, a lot of the Android file system hierarchy
> isn't really used anywhere else, but I'm not sure a check to see if e.g.
> /system exists is really enough to conclude Python is running on Android
> on its own. The thing that gets closest (which is the thing my
> patch checks for) is several Android-specific environment
> variables being defined (ANDROID_ROOT, ANDROID_DATA,
> ANDROID_PROPERTY_WORKSPACE...). Wouldn't it be better to put this in the
> standard Python library and expose it somehow, though? It *is* fragile
> code, it seems better if applications could 'just rely' on Python to
> figure it out, since it's not a trivial check.
> Kind regards,
> Shiz
> [1]:

--Guido van Rossum (

From hi at  Sun Aug  3 07:18:01 2014
From: hi at (Shiz)
Date: Sun, 03 Aug 2014 07:18:01 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>


Guido van Rossum wrote:
> Well, it really does look like checking for the presence of those 
> ANDROID_* environment variables is the best way to recognize the
> Android platform. Anyone can do that without waiting for a ruling on
> whether Android is Linux or not (which would be necessary because the
> docs for sys.platform are quite clear about its value on Linux
> systems). Googling terms like "is Android Linux" suggests that there
> is considerable controversy about the issue, so I suggest you don't
> wait. :-)

Right, which brings us back to the original point I was trying to make:
any chance we could move logic like that into a sys.getandroidversion()
or platform.android_version() so user code (and standard library code
alike) doesn't have to perform those relatively nasty checks themselves?
It seems like a fair thing to do if CPython would support Android as an
official target.

Kind regards,


From 4kir4.1i at  Sun Aug  3 12:45:30 2014
From: 4kir4.1i at (Akira Li)
Date: Sun, 03 Aug 2014 14:45:30 +0400
Subject: [Python-Dev] Exposing the Android platform existence to Python
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

Shiz <hi at> writes:

> The most obvious change would be to subprocess.Popen(). The reason a
> generic approach there won't work is also the reason I expect more
> changes might be needed: the Android file system doesn't abide by any
> POSIX file system standards. Its shell isn't located at /bin/sh, but at
> /system/bin/sh. The only directories it provides that are POSIX-standard
> are /dev and /etc, to my knowledge. You could check to see if
> /system/bin/sh exists and use that first, but that would break the
> preferred shell on POSIX systems that happen to have /system for some
> reason or another. In short: the preferred shell on POSIX systems is
> /bin/sh, but on Android it's /system/bin/sh. Simple existence checking
> might break the preferred shell on either. For more specific stdlib
> examples I'd have to check the test suite again.

FYI, /bin/sh is not POSIX, see


From 4kir4.1i at  Sun Aug  3 13:31:06 2014
From: 4kir4.1i at (Akira Li)
Date: Sun, 03 Aug 2014 15:31:06 +0400
Subject: [Python-Dev] Exposing the Android platform existence to Python
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

Guido van Rossum <guido at> writes:

> Well, it really does look like checking for the presence of those ANDROID_*
> environment variables is the best way to recognize the Android platform.
> Anyone can do that without waiting for a ruling on whether Android is Linux
> or not (which would be necessary because the docs for sys.platform are
> quite clear about its value on Linux systems). Googling terms like "is
> Android Linux" suggests that there is considerable controversy about the
> issue, so I suggest you don't wait. :-)

I don't see sysconfig mentioned in the discussion (maybe for a
reason). It might provide build-time information e.g.,

  built_for_android = 'android' in sysconfig.get_config_var('MULTIARCH')

assuming the complete value is something like 'arm-linux-android'.  It
says that the python binary is built for android (the current platform
may or may not be Android).
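
A slightly more defensive sketch of the same idea, assuming only that
sysconfig.get_config_var() returns None for a variable that is not
defined. The helper name is hypothetical, and whether MULTIARCH actually
contains 'android' on a given build is not established here.

```python
import sysconfig


def built_for_android(config_var="MULTIARCH"):
    """Return True if the named build-time variable mentions 'android'."""
    value = sysconfig.get_config_var(config_var)
    # get_config_var() returns None for undefined variables, so a bare
    # "'android' in value" would raise TypeError in that case.
    return isinstance(value, str) and "android" in value.lower()
```

As with the one-liner, this only describes how the binary was built, not
the platform it is currently running on.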


From guido at  Sun Aug  3 17:58:11 2014
From: guido at (Guido van Rossum)
Date: Sun, 3 Aug 2014 08:58:11 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

But *are* we going to support Android officially? What's the point? Do you
have a plan for getting Python apps to first-class status in the App Store
(um, Google Play)?

Regardless, I recommend that you add a new method to the platform module
(careful people can test for the presence of the new method before calling
it) and leave poor sys.platform alone.

On Sat, Aug 2, 2014 at 10:18 PM, Shiz <hi at> wrote:

> Guido van Rossum wrote:
> > Well, it really does look like checking for the presence of those
> > ANDROID_* environment variables is the best way to recognize the
> > Android platform. Anyone can do that without waiting for a ruling on
> > whether Android is Linux or not (which would be necessary because the
> > docs for sys.platform are quite clear about its value on Linux
> > systems). Googling terms like "is Android Linux" suggests that there
> > is considerable controversy about the issue, so I suggest you don't
> > wait. :-)
> Right, which brings us back to the original point I was trying to make:
> any chance we could move logic like that into a sys.getandroidversion()
> or platform.android_version() so user code (and standard library code
> alike) doesn't have to perform those relatively nasty checks themselves?
> It seems like a fair thing to do if CPython would support Android as an
> official target.
> Kind regards,
> Shiz

--Guido van Rossum (

From hi at  Sun Aug  3 18:00:28 2014
From: hi at (Shiz)
Date: Sun, 03 Aug 2014 18:00:28 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
 <> <>
Message-ID: <>


Akira Li wrote:
> FYI, /bin/sh is not POSIX, see 

Ah right, my apologies. Android doesn't seem to have getconf(1) either,
but sh /is/ on $PATH. Anyway, even if it weren't, os.defpath could be
tweaked on Android.
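
Since sh is on $PATH, a lookup along these lines might be enough. This
is only a sketch: find_shell is a hypothetical helper, not how
subprocess chooses its shell, and the fallback to os.defpath is an
assumption about what a sensible default would be.

```python
import os
import shutil


def find_shell(name="sh"):
    """Locate a shell by name via $PATH, falling back to os.defpath."""
    # shutil.which() searches the PATH environment variable by default;
    # the second call searches os.defpath explicitly.
    return shutil.which(name) or shutil.which(name, path=os.defpath)
```

The function returns None when no matching executable is found, so
callers still need a policy for that case.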

> I don't see sysconfig mentioned in the discussion (maybe for a 
> reason). It might provide build-time information e.g.,
> built_for_android = 'android' in 
> sysconfig.get_config_var('MULTIARCH')
> assuming the complete value is something like 'arm-linux-android'.
> It says that the python binary is built for android (the current
> platform may or may not be Android).

MULTIARCH is empty in my sysconfig. You
could possibly match HOST_GNU_TYPE against 'androideabi', even though it
still seems a bit fragile. Please ignore MACHDEP/PLATDIR, those are set
as a result of me fiddling with sys.platform.

Kind regards,


From hi at  Sun Aug  3 18:04:50 2014
From: hi at (Shiz)
Date: Sun, 03 Aug 2014 18:04:50 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>


Guido van Rossum wrote:
> But *are* we going to support Android officially? What's the point?
> Do you have a plan for getting Python apps to first-class status in
> the App Store (um, Google Play)?
> Regardless, I recommend that you add a new method to the platform
> module (careful people can test for the presence of the new method
> before calling it) and leave poor sys.platform alone.

Well, that is the idea, at least empowering people to write proper
Android apps in Python. The first step of that would be making CPython
run on Android, the second step would be adding libraries that allow
Python users to interface with the Android API.

As I said, even if the CPython maintainers are not willing to support
Android in the end, I'd at least like my patchset to be done according
to CPython development guidelines/principles as closely as possible.

Adding android_version() to the platform module it is, then.
hasattr(platform, 'android_version') is probably an easy enough check
for Python users.
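
A sketch of what that check could look like in user code. The name
android_version is the one proposed in this thread and is assumed here;
it may not exist in any released Python, and its return value is left
unspecified.

```python
import platform


def android_version_or_none():
    """Return platform.android_version() if it exists, else None."""
    # 'android_version' is the hypothetical function proposed above.
    getter = getattr(platform, "android_version", None)
    return getter() if callable(getter) else None
```

Using getattr() with a default folds the hasattr() test and the lookup
into one step, so the code runs unchanged on interpreters without the
new function.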

Kind regards,


From phil at  Sun Aug  3 19:16:53 2014
From: phil at (Phil Thompson)
Date: Sun, 03 Aug 2014 18:16:53 +0100
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

On 03/08/2014 4:58 pm, Guido van Rossum wrote:
> But *are* we going to support Android officially? What's the point? Do 
> you
> have a plan for getting Python apps to first-class status in the App 
> Store
> (um, Google Play)?

I do...


From guido at  Sun Aug  3 20:17:03 2014
From: guido at (Guido van Rossum)
Date: Sun, 3 Aug 2014 11:17:03 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

On Sun, Aug 3, 2014 at 10:16 AM, Phil Thompson <phil at> wrote:

> On 03/08/2014 4:58 pm, Guido van Rossum wrote:
>> But *are* we going to support Android officially? What's the point? Do you
>> have a plan for getting Python apps to first-class status in the App Store
>> (um, Google Play)?
> I do...
> Phil

Oooh, that's pretty cool!

--Guido van Rossum (

From ncoghlan at  Mon Aug  4 02:01:14 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 4 Aug 2014 10:01:14 +1000
Subject: [Python-Dev] Exposing the Android platform existence to Python
In-Reply-To: <>
References: <>
 <> <20140802030634.GH4525@ando>
Message-ID: <>

On 4 Aug 2014 03:18, "Phil Thompson" <phil at> wrote:
> On 03/08/2014 4:58 pm, Guido van Rossum wrote:
>> But *are* we going to support Android officially? What's the point? Do you
>> have a plan for getting Python apps to first-class status in the App Store
>> (um, Google Play)?
> I do...


I've only been skimming this thread, but +1 for Android mostly reading as
Linux, but with an extra method in the platform module that gives more

For those interested in mobile app development, Russell Keith-Magee also
announced the release of "toga" [1] here at PyCon AU. That's a Python
specific GUI library that maps directly to native widgets (rather than
using theming as Kivy does). I mention it as one of the things Russell is
specifically looking for is more participation from folks that know the
Android side of things :)



> Phil

From larry at  Mon Aug  4 09:12:47 2014
From: larry at (Larry Hastings)
Date: Mon, 04 Aug 2014 17:12:47 +1000
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
Message-ID: <>

Argument Clinic "converters" specify how to convert an individual 
argument to the function you're defining.  Although a converter could 
theoretically represent any sort of conversion, most of the time they 
directly represent types like "int" or "double" or "str".

Because there's such variety in argument parsing, the converters are 
customizable with parameters.  Many of these are common enough that 
Argument Clinic suggests some standard names.  Examples: "zeroes=True" 
for strings and buffers means "permit internal \0 characters", and 
"bitwise=True" for unsigned integers means "copy the bits over, even if 
there's overflow/underflow, and even if the original is negative".

A third example is "nullable=True", which means "also accept None for 
this parameter".  This was originally intended for use with strings 
(compare the "s" and "z" format units for PyArg_ParseTuple), however it 
looks like we'll have a use for "nullable ints" in the ongoing Argument 
Clinic conversion work.

Several people have said they found the name "nullable" surprising, 
suggesting I use another name like "allow_none" or "noneable".  I, in 
turn, find their surprise surprising; "nullable" is a term long 
associated with exactly this concept.  It's used in C# and SQL, and the 
term even has its own Wikipedia page:

Most amusingly, Vala *used* to have an annotation called "(allow-none)", 
but they've broken it out into two annotations, "(nullable)" and 

Before you say "the term 'nullable' will confuse end users", let me 
remind you: this is not user-facing.  This is a parameter for an 
Argument Clinic converter, and will only ever be seen by CPython core 
developers.  A group which I hope is not so easily confused.

It's my contention that "nullable" is the correct name.  But I've been 
asked to bring up the topic for discussion, to see if a consensus forms 
around this or around some other name.

Let the bike-shedding begin,


From me+python at  Mon Aug  4 09:35:39 2014
From: me+python at (Stephen Hansen)
Date: Mon, 4 Aug 2014 00:35:39 -0700
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>

On Mon, Aug 4, 2014 at 12:12 AM, Larry Hastings <larry at> wrote:

> Several people have said they found the name "nullable" surprising,
> suggesting I use another name like "allow_none" or "noneable".  I, in turn,
> find their surprise surprising; "nullable" is a term long associated with
> exactly this concept.  It's used in C# and SQL, and the term even has its
> own Wikipedia page:

The thing is, "null" in these languages is not the same thing. If you look
to the various database wrappers there's a lot of controversy about just
how to map the SQL NULL to Python: simply mapping it to Python's None
becomes strange because the semantics of a SQL NULL or NULL pointer and
Python None don't exactly match. Not all that long ago someone was making
an argument on this list to add a SQLNULL type object to better map SQL
NULL semantics (with regard to sorting, as I recall -- but it's been a while).

Python has None. Its definition and understanding in a Python context is
clear. Why introduce some other concept? In Python it's very common to pass
None instead of another argument.

> Before you say "the term 'nullable' will confuse end users", let me remind
> you: this is not user-facing.  This is a parameter for an Argument Clinic
> converter, and will only ever be seen by CPython core developers.  A group
> which I hope is not so easily confused.

Yet, my lurking observation of argument clinic is it is all about clearly
defining the C-side of how things are done in Python API's. It may not
confuse 'end users', but it may confuse possible contributors, and simply
add a lack of clarity to the situation.

Passing None in place of another argument is a very Pythonic thing to do;
why confuse that by using other words which imply other semantics? None is
a Python thing with clear semantics in Python; allow_none quite accurately
describes the Pythonic thing described here, while 'nullable' expects
domain knowledge beyond Python and makes assumptions of semantics.



From v+python at  Mon Aug  4 09:46:25 2014
From: v+python at (Glenn Linderman)
Date: Mon, 04 Aug 2014 00:46:25 -0700
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>

On 8/4/2014 12:35 AM, Stephen Hansen wrote:
> On Mon, Aug 4, 2014 at 12:12 AM, Larry Hastings <larry at> wrote:
>     Several people have said they found the name "nullable"
>     surprising, suggesting I use another name like "allow_none" or
>     "noneable".  I, in turn, find their surprise surprising;
>     "nullable" is a term long associated with exactly this concept. 
>     It's used in C# and SQL, and the term even has its own Wikipedia page:
> The thing is, "null" in these languages are not the same thing. If you 
> look to the various database wrappers there's a lot of controversy 
> about just how to map the SQL NULL to Python: simply mapping it to 
> Python's None becomes strange because the semantics of a SQL NULL or 
> NULL pointer and Python None don't exactly match. Not all that long 
> ago someone was making an argument on this list to add a SQLNULL type 
> object to better map SQL NULL semantics (regards to sorting, as I 
> recall -- but its been awhile)
> Python has None. Its definition and understanding in a Python context 
> is clear. Why introduce some other concept? In Python its very common 
> you pass None instead of an other argument.
>     Before you say "the term 'nullable' will confuse end users", let
>     me remind you: this is not user-facing.  This is a parameter for
>     an Argument Clinic converter, and will only ever be seen by
>     CPython core developers.  A group which I hope is not so easily
>     confused
> Yet, my lurking observation of argument clinic is it is all about 
> clearly defining the C-side of how things are done in Python API's. It 
> may not confuse 'end users', but it may confuse possible contributors, 
> and simply add a lack of clarity to the situation.
> Passing None in place of another argument is a very Pythonic thing to 
> do; why confuse that by using other words which imply other semantics? 
> None is a Python thing with clear semantics in Python; allow_none 
> quite accurately describes the Pythonic thing described here, while 
> 'nullable' expects for domain knowledge beyond Python and makes 
> assumptions of semantics.
> /re-lurk
> --S

Thanks, Stephen.  +1 to all you wrote.

There remains, of course, one potential justification for using
"nullable" that you didn't make 100% clear. Because argument clinic "is
all about clearly defining the C-side of how things are done in Python
API's", it matters that C uses NULL (though only as a convention, not a
language feature) for missing reference parameters on occasion. But I
think it is much clearer that if C NULL gets mapped to Python None, and
we are talking about Python parameters, then a NULLable C parameter
should map to an "allow_none" Python parameter.

The concepts of C NULL, C# NULL, SQL NULL, and Python None are all 
slightly different; even the brilliant people on python-dev could better 
spend their energies on new features and bug fixes rather than being 
slowed by the need to remember yet another unclear and inconsistent 
terminology issue, of which there are already too many.


From phd at  Mon Aug  4 09:39:36 2014
From: phd at (Oleg Broytman)
Date: Mon, 4 Aug 2014 09:39:36 +0200
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>


On Mon, Aug 04, 2014 at 05:12:47PM +1000, Larry Hastings <larry at> wrote:
> "nullable=True", which means "also accept None
> for this parameter".  This was originally intended for use with
> strings (compare the "s" and "z" format units for PyArg_ParseTuple),
> however it looks like we'll have a use for "nullable ints" in the
> ongoing Argument Clinic conversion work.
> Several people have said they found the name "nullable" surprising,
> suggesting I use another name like "allow_none" or "noneable".  I,
> in turn, find their surprise surprising; "nullable" is a term long
> associated with exactly this concept.  It's used in C# and SQL, and
> the term even has its own Wikipedia page:

   In my very humble opinion, "nullable" is ok, but "allow_none" is
better.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From ncoghlan at  Mon Aug  4 14:22:17 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 4 Aug 2014 22:22:17 +1000
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <> <>
Message-ID: <>

On 4 Aug 2014 18:16, "Oleg Broytman" <phd at> wrote:
> Hi!
> On Mon, Aug 04, 2014 at 05:12:47PM +1000, Larry Hastings <larry at> wrote:
> > "nullable=True", which means "also accept None
> > for this parameter".  This was originally intended for use with
> > strings (compare the "s" and "z" format units for PyArg_ParseTuple),
> > however it looks like we'll have a use for "nullable ints" in the
> > ongoing Argument Clinic conversion work.
> >
> > Several people have said they found the name "nullable" surprising,
> > suggesting I use another name like "allow_none" or "noneable".  I,
> > in turn, find their surprise surprising; "nullable" is a term long
> > associated with exactly this concept.  It's used in C# and SQL, and
> > the term even has its own Wikipedia page:
> >
> >
>    In my very humble opinion, "nullable" is ok, but "allow_none" is
> better.

Yup, this is where I stand as well. The main concern I have with nullable
is that we *are* writing C code when dealing with Argument Clinic, and
"nullable" may make me think of a C NULL rather than Python's None.


> Oleg.
> --
>      Oleg Broytman              phd at
>            Programmers don't die, they just GOSUB without RETURN.

From antoine at  Mon Aug  4 15:06:36 2014
From: antoine at (Antoine Pitrou)
Date: Mon, 04 Aug 2014 09:06:36 -0400
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <lro0gs$akm$>

Le 04/08/2014 03:35, Stephen Hansen a écrit :
>     Before you say "the term 'nullable' will confuse end users", let me
>     remind you: this is not user-facing.  This is a parameter for an
>     Argument Clinic converter, and will only ever be seen by CPython
>     core developers.  A group which I hope is not so easily confused
> Yet, my lurking observation of argument clinic is it is all about
> clearly defining the C-side of how things are done in Python API's. It
> may not confuse 'end users', but it may confuse possible contributors,
> and simply add a lack of clarity to the situation.

That's a rather good point, and I agree with Stephen here. Even core 
contributors can deserve clarity and the occasional non-confusing 
notation :-)



From njs at  Mon Aug  4 12:19:38 2014
From: njs at (Nathaniel Smith)
Date: Mon, 4 Aug 2014 11:19:38 +0100
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>

I admit I spent the first half of the email scratching my head and trying
to figure out what NULL had to do with argument clinic specs. (Maybe it
would mean that if the argument is "not given" in some appropriate way then
we set the corresponding C variable to NULL?) Finding out you were talking
about None came as a surprising twist.

On 4 Aug 2014 08:13, "Larry Hastings" <larry at> wrote:

> Argument Clinic "converters" specify how to convert an individual argument
> to the function you're defining.  Although a converter could theoretically
> represent any sort of conversion, most of the time they directly represent
> types like "int" or "double" or "str".
> Because there's such variety in argument parsing, the converters are
> customizable with parameters.  Many of these are common enough that
> Argument Clinic suggests some standard names.  Examples: "zeroes=True" for
> strings and buffers means "permit internal \0 characters", and
> "bitwise=True" for unsigned integers means "copy the bits over, even if
> there's overflow/underflow, and even if the original is negative".
> A third example is "nullable=True", which means "also accept None for this
> parameter".  This was originally intended for use with strings (compare the
> "s" and "z" format units for PyArg_ParseTuple), however it looks like we'll
> have a use for "nullable ints" in the ongoing Argument Clinic conversion
> work.
> Several people have said they found the name "nullable" surprising,
> suggesting I use another name like "allow_none" or "noneable".  I, in turn,
> find their surprise surprising; "nullable" is a term long associated with
> exactly this concept.  It's used in C# and SQL, and the term even has its
> own Wikipedia page:
> Most amusingly, Vala *used* to have an annotation called "(allow-none)",
> but they've broken it out into two annotations, "(nullable)" and
> "(optional)".
> Before you say "the term 'nullable' will confuse end users", let me remind
> you: this is not user-facing.  This is a parameter for an Argument Clinic
> converter, and will only ever be seen by CPython core developers.  A group
> which I hope is not so easily confused.
> It's my contention that "nullable" is the correct name.  But I've been
> asked to bring up the topic for discussion, to see if a consensus forms
> around this or around some other name.
> Let the bike-shedding begin,
> */arry*

From chris.barker at  Mon Aug  4 18:25:12 2014
From: chris.barker at (Chris Barker)
Date: Mon, 4 Aug 2014 09:25:12 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140802203513.GA10447@k2>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Sat, Aug 2, 2014 at 1:35 PM, David Wilson <dw+python-dev at> wrote:

> > Repeated list and str concatenation both have quadratic O(N**2)
> > performance, but people frequently build up strings with +

> join() isn't preferable in cases where it damages readability while
> simultaneously providing zero or negative performance benefit, such as
> when concatenating a few short strings, e.g. while adding a prefix to a
> filename.

Good point -- I was trying to make the point about .join() vs + for strings
in an intro python class last year, and made the mistake of having the
students test the performance.

You need to concatenate a LOT of strings to see any difference at all --  I
know that O() of algorithms is unavoidable, but between efficient python
optimizations and an apparently good memory allocator, it's really a
practical non-issue.

> Although it's true that join() is automatically the safer option, and
> especially when dealing with user supplied data, the net harm caused by
> teaching rote and ceremony seems far less desirable compared to fixing a
> trivial slowdown in a script, if that slowdown ever became apparent.

and it rarely would.

Blocking sum( some_strings) because it _might_ have poor performance seems
awfully pedantic.

As a long-time numpy user, I think sum(a_long_list_of_numbers) has
pathetically bad performance, but I wouldn't block it!



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From larry at  Mon Aug  4 18:56:36 2014
From: larry at (Larry Hastings)
Date: Tue, 05 Aug 2014 02:56:36 +1000
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/04/2014 05:46 PM, Glenn Linderman wrote:
> There remains, of course, one potential justification for using 
> "nullable", that you didn't make 100% clear. Because "argument clinic 
> is it is all about clearly defining the C-side of how things are done 
> in Python API's." and that is that C uses NULL (but it is only a 
> convention, not a language feature) for missing reference parameters 
> on occasion. But I think it is much more clear that if C NULL gets 
> mapped to Python None, and we are talking about Python parameters, 
> then a NULLable C parameter should map to an "allow_none" Python 
> parameter.

Argument Clinic defines *both* sides of how things are done in builtins, 
both C and Python.  So it's a bit messier than that. Currently the 
"nullable" flag is only applicable to certain converters which output 
pointer types in C, so if it gets a None for that argument it does 
provide a NULL as the C equivalent.  But in the "nullable int" patch 
obviously I can't do that.  Instead you get a structure containing 
either an int or a flag specifying "you got a None", currently named 
"is_null".  So I don't think your proposed additional justification helps.

Of course, in my opinion I don't need this additional justification.  
Python's "None" is its null object.  And we already have the concept of 
"nullable types" in computer science, for exactly, *exactly!*, this 
concept.  As the Zen says, "special cases aren't special enough to break 
the rules".  Just because Python is silly enough to name its null object 
"None" doesn't mean we have to warp all our other names around it.


From ethan at  Mon Aug  4 18:57:03 2014
From: ethan at (Ethan Furman)
Date: Mon, 04 Aug 2014 09:57:03 -0700
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/04/2014 12:12 AM, Larry Hastings wrote:
> It's my contention that "nullable" is the correct name.  But I've been asked to bring up the topic for discussion, to
> see if a consensus forms around this or around some other name.
> Let the bike-shedding begin,

I think the original name is okay, but 'allow_none' is definitely clearer.


From alexander.belopolsky at  Mon Aug  4 19:36:39 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Mon, 4 Aug 2014 13:36:39 -0400
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>

On Mon, Aug 4, 2014 at 12:57 PM, Ethan Furman <ethan at> wrote:

> 'allow_none' is definitely clearer.

I disagree. Unlike "nullable", "allow_none" does not tell me what happens
on the C side when I pass in None.  If the receiving type is PyObject*,
either NULL or Py_None is a valid choice.

From antoine at  Mon Aug  4 19:53:19 2014
From: antoine at (Antoine Pitrou)
Date: Mon, 04 Aug 2014 13:53:19 -0400
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>	<>
Message-ID: <lrohaf$9l6$>

Le 04/08/2014 13:36, Alexander Belopolsky a écrit :
> On Mon, Aug 4, 2014 at 12:57 PM, Ethan Furman <ethan at> wrote:
>     'allow_none' is definitely clearer.
> I disagree. Unlike "nullable", "allow_none" does not tell me what
> happens on the C side when I pass in None.  If the receiving type is
> PyObject*, either NULL or Py_None is a valid choice.

But here the receiving type can be an int.



From alexander.belopolsky at  Mon Aug  4 20:04:05 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Mon, 4 Aug 2014 14:04:05 -0400
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <lrohaf$9l6$>
References: <> <>
Message-ID: <>

On Mon, Aug 4, 2014 at 1:53 PM, Antoine Pitrou <antoine at> wrote:

>> I disagree. Unlike "nullable", "allow_none" does not tell me what
>> happens on the C side when I pass in None.  If the receiving type is
>> PyObject*, either NULL or Py_None is a valid choice.
> But here the receiving type can be an int.

We cannot "allow None" when the receiving type is C int.  In this case, we
need a way to implement "nullable int" type in C.  We can use int * or a
pair of int and _Bool or anything else.  Whatever the implementation, the
concept that is implemented is "nullable int."  The advantage of using the
term "nullable" is that it is language and implementation neutral.

From steve at  Mon Aug  4 20:10:18 2014
From: steve at (Steven D'Aprano)
Date: Tue, 5 Aug 2014 04:10:18 +1000
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <20140804181013.GO4525@ando>

On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:

> Good point -- I was trying to make the point about .join() vs + for strings
> in an intro python class last year, and made the mistake of having the
> students test the performance.
> You need to concatenate a LOT of strings to see any difference at all --  I
> know that O() of algorithms is unavoidable, but between efficient python
> optimizations and an apparently good memory allocator, it's really a
> practical non-issue.

If only that were the case, but it isn't. Here's a cautionary tale for 
how using string concatenation can blow up in your face:

Chris Withers asks for help debugging HTTP slowness:

and publishes some times:

(notice that Python was SIX HUNDRED times slower than wget or IE)

and Simon Cross identified the problem:

leading Guido to describe the offending code as an embarrassment.

It shouldn't be hard to demonstrate the difference between repeated
string concatenation and join; all you need to do is defeat sum()'s
prohibition against strings. Run this bit of code, and you'll see a
significant difference in performance, even with CPython's optimized
concatenation:
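# --- cut ---
from timeit import Timer

class Faker:
    # Stand-in start value: Faker() + s returns s, which sidesteps
    # sum()'s refusal to accept a str as its start argument.
    def __add__(self, other):
        return other

x = Faker()
strings = list("Hello World!")  # twelve one-character substrings
assert ''.join(strings) == sum(strings, x)

setup = "from __main__ import x, strings"
t1 = Timer("''.join(strings)", setup)
t2 = Timer("sum(strings, x)", setup)

# Best of the repeated timing runs for each approach.
print(min(t1.repeat()))
print(min(t2.repeat()))
# --- cut ---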

On my computer, using Python 2.7, I find the version using sum is nearly 
4.5 times slower, and with 3.3 about 4.2 times slower. That's with a 
mere twelve substrings, hardly "a lot". I tried running it on IronPython 
with a slightly larger list of substrings, but I got sick of waiting for 
it to finish.

If you want to argue that microbenchmarks aren't important, well, I 
might agree with you in general, but in the specific case of string 
concatenation there's that pesky factor of 600 slowdown in real world 
code to argue with.
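
As a rough illustration of where the quadratic cost comes from, here is a
small cost model. It is a sketch, not a benchmark: it assumes that every
"acc = acc + s" builds a brand-new string (that is, that no in-place
optimization kicks in), and the function names are made up for this example.

```python
def chars_copied_by_repeated_concat(lengths):
    """Characters written if each `acc = acc + s` allocates a fresh string."""
    copied = 0
    acc_len = 0
    for n in lengths:
        acc_len += n        # length of the new accumulated string
        copied += acc_len   # the whole new string is written out again
    return copied


def chars_copied_by_join(lengths):
    """Characters written if the result is sized once and filled once."""
    return sum(lengths)


for count in (12, 1000):
    lengths = [1] * count   # `count` one-character substrings
    print(count,
          chars_copied_by_repeated_concat(lengths),
          chars_copied_by_join(lengths))
```

Under that assumption, twelve one-character substrings cost 78 copied
characters against 12, and a thousand cost 500500 against 1000.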

> Blocking sum( some_strings) because it _might_ have poor performance seems
> awfully pedantic.

The rationale for explicitly prohibiting strings while merely implicitly 
discouraging other non-numeric types is that beginners, who are least 
likely to understand why their code occasionally and unpredictably 
becomes catastrophically slow, are far more likely to sum strings than 
sum tuples or lists.

(I don't entirely agree with this rationale, I'd prefer a warning rather 
than an exception.)


From larry at  Mon Aug  4 20:18:44 2014
From: larry at (Larry Hastings)
Date: Tue, 05 Aug 2014 04:18:44 +1000
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <lrohaf$9l6$>
References: <>	<>
Message-ID: <>

On 08/05/2014 03:53 AM, Antoine Pitrou wrote:
> Le 04/08/2014 13:36, Alexander Belopolsky a écrit :
>> If the receiving type is PyObject*, either NULL or Py_None is a valid 
>> choice.
> But here the receiving type can be an int.

Just to be precise: in the case where the receiving type *would* have 
been an int, and "nullable=True", the receiving type is actually a 
structure containing an int and a "you got a None" flag. I can't stick a 
magic value in the int and say "that represents you getting a None" 
because any integer value may be valid.
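
To make the "int plus a flag" idea concrete, here is a rough Python model of
it. This is only a sketch: the real thing is a C structure generated for the
converter, and apart from the "is_null" flag name mentioned in this thread,
the names NullableInt and convert_nullable_int are hypothetical.

```python
import operator
from dataclasses import dataclass


@dataclass
class NullableInt:
    """Hypothetical stand-in for the C structure: a flag plus an int."""
    is_null: bool
    value: int = 0   # only meaningful when is_null is False


def convert_nullable_int(arg):
    """Accept an int or None; anything else raises TypeError."""
    if arg is None:
        return NullableInt(is_null=True)
    # operator.index() rejects floats, strings, etc. with TypeError.
    return NullableInt(is_null=False, value=operator.index(arg))


print(convert_nullable_int(None))
print(convert_nullable_int(-1))
```

Because the flag is carried separately, every integer (including 0 and -1)
stays available as an ordinary value.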

Also, I'm pretty sure there are places in builtin argument parsing that 
accept either NULL or Py_None, and I *think* maybe in one or two of them 
they actually mean different things.  What fun!

For small values of "fun",


From antoine at  Mon Aug  4 20:37:54 2014
From: antoine at (Antoine Pitrou)
Date: Mon, 04 Aug 2014 14:37:54 -0400
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>	<>
 <lrohaf$9l6$> <>
Message-ID: <lroju2$cnf$>

Le 04/08/2014 14:18, Larry Hastings a écrit :
> On 08/05/2014 03:53 AM, Antoine Pitrou wrote:
>> Le 04/08/2014 13:36, Alexander Belopolsky a écrit :
>>> If the receiving type is PyObject*, either NULL or Py_None is a valid
>>> choice.
>> But here the receiving type can be an int.
> Just to be precise: in the case where the receiving type *would* have
> been an int, and "nullable=True", the receiving type is actually a
> structure containing an int and a "you got a None" flag. I can't stick a
> magic value in the int and say "that represents you getting a None"
> because any integer value may be valid.
> Also, I'm pretty sure there are places in builtin argument parsing that
> accept either NULL or Py_None, and I *think* maybe in one or two of them
> they actually mean different things.  What fun!
> For small values of "fun",

Is -909 too large a value to be fun?



From stefan_ml at  Mon Aug  4 21:14:49 2014
From: stefan_ml at (Stefan Behnel)
Date: Mon, 04 Aug 2014 21:14:49 +0200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140804181013.GO4525@ando>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <lrom3b$9s4$>

Steven D'Aprano schrieb am 04.08.2014 um 20:10:
> On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:
>> Good point -- I was trying to make the point about .join() vs + for strings
>> in an intro python class last year, and made the mistake of having the
>> students test the performance.
>> You need to concatenate a LOT of strings to see any difference at all --  I
>> know that O() of algorithms is unavoidable, but between efficient python
>> optimizations and an apparently good memory allocator, it's really a
>> practical non-issue.
> If only that were the case, but it isn't. Here's a cautionary tale for 
> how using string concatenation can blow up in your face:
> Chris Withers asks for help debugging HTTP slowness:
> and publishes some times:
> (notice that Python was SIX HUNDRED times slower than wget or IE)
> and Simon Cross identified the problem:
> leading Guido to describe the offending code as an embarrassment.

Thanks for digging up that story.

>> Blocking sum( some_strings) because it _might_ have poor performance seems
>> awfully pedantic.
> The rationale for explicitly prohibiting strings while merely implicitly 
> discouraging other non-numeric types is that beginners, who are least 
> likely to understand why their code occasionally and unpredictably 
> becomes catastrophically slow, are far more likely to sum strings than 
> sum tuples or lists.

Well, the obvious difference between strings and lists (not tuples) is that
strings are immutable, so it would seem more obvious at first sight to
concatenate strings than to do the same thing with lists, which can easily
be extended (they are clearly designed for that). This rationale may not
apply as much to beginners as to more experienced programmers, but it
should still explain why this is so often discussed in the context of
string concatenation and pretty much never for lists.

As for tuples, their most common use case is to represent a fixed length
sequence of semantically different values. That renders their concatenation
a sufficiently uncommon use case to make no-one ask loudly for "large
scale" sum(tuples) support.

Basically, extending lists is an obvious thing, but getting multiple
strings joined without using "+"-concatenating them isn't.


From jimjjewett at  Mon Aug  4 22:22:27 2014
From: jimjjewett at (Jim J. Jewett)
Date: Mon, 04 Aug 2014 13:22:27 -0700 (PDT)
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
Message-ID: <>

Sat Aug 2 12:11:54 CEST 2014, Julian Taylor wrote:

> Andrea Griffini <agriff at> wrote:

>>    However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
> hm could this be a pure python case that would profit from temporary
> elision [ ]?

> lists could declare the tp_can_elide slot and call list.extend on the
> temporary during its tp_add slot instead of creating a new temporary.
> extend/realloc can avoid the copy if there is free memory available
> after the block.

Yes, with all the same problems.

When dealing with a complex object, how can you be sure that __add__
won't need access to the original values during the entire computation?
It works with matrix addition, but not with matrix multiplication.
Depending on the details of the implementation, it could even fail for
a sort of sliding-neighbor addition similar to the original justification.

Of course, then those tricky implementations should not define an
_eliding_add_, but maybe the builtin objects still should?  After all,
a plain old list is OK to re-use.  Unless the first evaluation to create
it ends up evaluating an item that has side effects...

In the end, it looks like a lot of machinery (and extra checks that may
slow down the normal small-object case) for something that won't be used
all that often.

Though it is really tempting to consider a compilation mode that assumes
objects and builtins will be "normal", and lets you replace the entire
above expression with compile-time [1, 2, 3, 4, 5, 6].  Would writing
objects to that stricter standard and encouraging its use (and maybe
offering a few AST transforms to auto-generate the out-parameters?) work
as well for those who do need the speed?
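
For reference, these are the ways of flattening that example that already
exist today, without any elision machinery. This is a plain illustration
using only builtins and itertools; the variable names are arbitrary.

```python
from itertools import chain

lists = [[1, 2, 3], [4], [], [5, 6]]

# sum() builds a new intermediate list for every "+".
via_sum = sum(lists, [])

# Extending one accumulator in place avoids the intermediate lists.
via_extend = []
for sub in lists:
    via_extend.extend(sub)

# chain.from_iterable() walks the sublists lazily; list() builds the result once.
via_chain = list(chain.from_iterable(lists))

assert via_sum == via_extend == via_chain == [1, 2, 3, 4, 5, 6]
```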



If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them.  -jJ

From taleinat at  Tue Aug  5 12:08:05 2014
From: taleinat at (Tal Einat)
Date: Tue, 5 Aug 2014 13:08:05 +0300
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>

On Mon, Aug 4, 2014 at 10:12 AM, Larry Hastings <larry at> wrote:
> It's my contention that "nullable" is the correct name.  But I've been asked
> to bring up the topic for discussion, to see if a consensus forms around
> this or around some other name.
> Let the bike-shedding begin,
> /arry

+1 for some form of "allow None" rather than "nullable".

- Tal Einat

From martin at  Tue Aug  5 17:13:12 2014
From: martin at ("Martin v. Löwis")
Date: Tue, 05 Aug 2014 17:13:12 +0200
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <>
Message-ID: <>

Am 04.08.14 09:12, schrieb Larry Hastings:
> It's my contention that "nullable" is the correct name.  But I've been
> asked to bring up the topic for discussion, to see if a consensus forms
> around this or around some other name.

I have personally no problems with calling a type "nullable" even in
Python, and, as a type *adjective* this seems to be the right choice
(i.e. I wouldn't say "noneable int" or "allow_none int"; the former is
no established or intuitive term, the latter is not an adjective).

As a type *flag*, flexibility in naming is greater. zeroes=True formally
creates a subtype (of string), and it doesn't hurt that it is not an
adjective. "allow_zeroes" might be more descriptive. bitwise=True
doesn't really create a subtype of int. For the feature in question,
I find both "allow_none" and "nullable" acceptable; "noneable" is not.


From ischwabacher at  Thu Aug  7 00:36:37 2014
From: ischwabacher at (Isaac Schwabacher)
Date: Wed, 06 Aug 2014 17:36:37 -0500
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <>
References: <>
Message-ID: <>

pathlib.Path currently strips trailing slashes from pathnames, but this behavior contradicts POSIX, which specifies that the resolution of the pathname of a symbolic link to a directory in the context of a function that operates on symbolic links shall depend on whether the pathname has a trailing slash:

> 4.12 Pathname Resolution
> ========================
> [...]
> A pathname that contains at least one non- <slash> character and that ends with one or more trailing <slash> characters shall not be resolved successfully unless the last pathname component before the trailing <slash> characters names an existing directory or a directory entry that is to be created for a directory immediately after the pathname is resolved. Interfaces using pathname resolution may specify additional constraints[1] when a pathname that does not name an existing directory contains at least one non- <slash> character and contains one or more trailing <slash> characters.
> If a symbolic link is encountered during pathname resolution, the behavior shall depend on whether the pathname component is at the end of the pathname and on the function being performed. If all of the following are true, then pathname resolution is complete:
> 1. This is the last pathname component of the pathname.
> 2. The pathname has no trailing <slash>.
> 3. The function is required to act on the symbolic link itself, or certain arguments direct that the function act on the symbolic link itself.
> In all other cases, the system shall prefix the remaining pathname, if any, with the contents of the symbolic link. [...]

The following sentence appeared in an earlier version of POSIX but has since been removed:

> A pathname that contains at least one non-slash character and that ends with one or more trailing slashes shall be resolved as if a single dot character ( '.' ) were appended to the pathname.

Is this important enough to preserve trailing slashes?

- Isaac Schwabacher

From antoine at  Thu Aug  7 02:11:36 2014
From: antoine at (Antoine Pitrou)
Date: Wed, 06 Aug 2014 20:11:36 -0400
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <>
References: <>
Message-ID: <lrug7p$32h$>

Le 06/08/2014 18:36, Isaac Schwabacher a écrit :
>> If a symbolic link is encountered during pathname resolution, the
>> behavior shall depend on whether the pathname component is at the
>> end of the pathname and on the function being performed. If all of
>> the following are true, then pathname resolution is complete:
>> 1. This is the last pathname component of the pathname. 2. The
>> pathname has no trailing <slash>. 3. The function is required to
>> act on the symbolic link itself, or certain arguments direct that
>> the function act on the symbolic link itself.
>> In all other cases, the system shall prefix the remaining pathname,
>> if any, with the contents of the symbolic link. [...]

So the only case where this would make a difference is when calling a 
"function acting on the symbolic link itself" (such as lstat() or 
unlink()) on a path with a trailing slash:

 >>> os.lstat('foo')
os.stat_result(st_mode=41471, st_ino=1981954, st_dev=2050, st_nlink=1, 
st_uid=1000, st_gid=1000, st_size=4, st_atime=1407370025, 
st_mtime=1407370025, st_ctime=1407370025)
 >>> os.lstat('foo/')
os.stat_result(st_mode=17407, st_ino=917505, st_dev=2050, st_nlink=7, 
st_uid=0, st_gid=0, st_size=4096, st_atime=1407367916, 
st_mtime=1407369857, st_ctime=1407369857)

 >>> pathlib.Path('foo').lstat()
os.stat_result(st_mode=41471, st_ino=1981954, st_dev=2050, st_nlink=1, 
st_uid=1000, st_gid=1000, st_size=4, st_atime=1407370037, 
st_mtime=1407370025, st_ctime=1407370025)
 >>> pathlib.Path('foo/').lstat()
os.stat_result(st_mode=41471, st_ino=1981954, st_dev=2050, st_nlink=1, 
st_uid=1000, st_gid=1000, st_size=4, st_atime=1407370037, 
st_mtime=1407370025, st_ctime=1407370025)

But you can also call resolve() explicitly if you want to act on the 
link target rather than the link itself:

 >>> pathlib.Path('foo/').resolve().lstat()
os.stat_result(st_mode=17407, st_ino=917505, st_dev=2050, st_nlink=7, 
st_uid=0, st_gid=0, st_size=4096, st_atime=1407367916, 
st_mtime=1407369857, st_ctime=1407369857)

Am I overlooking other cases?
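
The session above can be reproduced with a throwaway script along these
lines. It is a sketch that assumes a POSIX system where symlinks can be
created; the directory and variable names are made up, and "foo" plays the
same role as in the session.

```python
import os
import pathlib
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "target_dir")
    link = os.path.join(tmp, "foo")
    os.mkdir(target)
    os.symlink(target, link)   # "foo" is a symlink to a directory

    # Plain string paths: the trailing slash makes lstat() follow the link.
    os_no_slash_is_link = stat.S_ISLNK(os.lstat(link).st_mode)
    os_slash_is_dir = stat.S_ISDIR(os.lstat(link + "/").st_mode)

    # pathlib drops the trailing slash, so this still hits the link itself.
    path_slash_is_link = stat.S_ISLNK(pathlib.Path(link + "/").lstat().st_mode)

    # resolve() follows the link explicitly.
    resolved_is_dir = stat.S_ISDIR(pathlib.Path(link).resolve().lstat().st_mode)

print(os_no_slash_is_link, os_slash_is_dir, path_slash_is_link, resolved_is_dir)
```

On a system that behaves like the one in the session, all four flags should
come out true.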



From alexander.belopolsky at  Thu Aug  7 02:50:14 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Wed, 6 Aug 2014 20:50:14 -0400
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <lrug7p$32h$>
References: <>
Message-ID: <>

On Wed, Aug 6, 2014 at 8:11 PM, Antoine Pitrou <antoine at> wrote:

> Am I overlooking other cases?

There are many interfaces where trailing slash is significant.  For
example, rsync uses trailing slash on the target directory to avoid
creating an additional directory level at the destination.  Losing it when
passing path strings through pathlib.Path() may be a source of bugs.

From antoine at  Thu Aug  7 03:55:14 2014
From: antoine at (Antoine Pitrou)
Date: Wed, 06 Aug 2014 21:55:14 -0400
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <>
References: <>
 <> <lrug7p$32h$>
Message-ID: <lruma1$k2$>

Le 06/08/2014 20:50, Alexander Belopolsky a écrit :
> On Wed, Aug 6, 2014 at 8:11 PM, Antoine Pitrou <antoine at> wrote:
>     Am I overlooking other cases?
> There are many interfaces where trailing slash is significant.  For
> example, rsync uses trailing slash on the target directory to avoid
> creating an additional directory level at the destination.  Losing it
> when passing path strings through pathlib.Path() may be a source of bugs.

pathlib is generally concerned with filesystem operations written in 
Python, not arbitrary third-party tools. Also it is probably easy to 
append the trailing slash in your command-line invocation, if so desired.
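
A minimal sketch of what appending the slash at the call site could look
like. The helper with_trailing_slash is hypothetical (it is not part of
pathlib), the paths are invented, and the argument list is only built here,
never executed.

```python
import pathlib


def with_trailing_slash(path):
    """Render a path as a string that ends in a slash."""
    text = str(path)
    return text if text.endswith("/") else text + "/"


src = pathlib.PurePosixPath("photos/")
print(str(src))                  # the slash is gone after normalization
print(with_trailing_slash(src))  # put back only where the tool needs it

# Hypothetical argument list for an external tool such as rsync.
argv = ["rsync", "-a", with_trailing_slash(src), "backup"]
```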



From ben+python at  Thu Aug  7 04:12:30 2014
From: ben+python at (Ben Finney)
Date: Thu, 07 Aug 2014 12:12:30 +1000
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
References: <>
Message-ID: <>

Antoine Pitrou <antoine at> writes:

> Le 06/08/2014 20:50, Alexander Belopolsky a écrit :
> > There are many interfaces where trailing slash is significant. [...]
> > Losing it when passing path strings through pathlib.Path() may be a
> > source of bugs.
> pathlib is generally concerned with filesystem operations written in
> Python, not arbitrary third-party tools.

The operating system shell is more than an "arbitrary third-party tool",
though; it preserves paths, and handles invoking commands.

You seem to be saying that "pathlib" is not intended to be helpful for
constructing a shell command. Will its documentation warn that is so?

> Also it is probably easy to append the trailing slash in your
> command-line invocation, if so desired.

The trouble is that one can desire it, and construct a path knowing that
the presence or absence of a trailing slash has semantic significance;
and then have it unaccountably altered by the pathlib.Path code. This is
worse than preserving the semantic value.

 \       "But Marge, what if we chose the wrong religion? Each week we |
  `\         just make God madder and madder." --Homer, _The Simpsons_ |
_o__)                                                                  |
Ben Finney

From antoine at  Thu Aug  7 04:30:52 2014
From: antoine at (Antoine Pitrou)
Date: Wed, 06 Aug 2014 22:30:52 -0400
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <>
References: <>
 <> <lrug7p$32h$>
 <lruma1$k2$> <>
Message-ID: <lruocs$svc$>

Le 06/08/2014 22:12, Ben Finney a écrit :
> You seem to be saying that "pathlib" is not intended to be helpful for
> constructing a shell command.

pathlib lets you do operations on paths. It also gives you a string 
representation of the path that's expected to designate that path when 
talking to operating system APIs. It doesn't give you the possibility to 
store other semantic variations ("whether a new directory level must be 
created"); that's up to you to add those.

(similarly, it doesn't have separate classes to represent "a file", "a 
directory", "a non-existing file", etc.)



From bcannon at  Thu Aug  7 16:04:04 2014
From: bcannon at (Brett Cannon)
Date: Thu, 07 Aug 2014 14:04:04 +0000
Subject: [Python-Dev] [Python-checkins] Daily reference leaks
	(09f56fdcacf1): sum=21004
References: <>
Message-ID: <>

test_codecs is not happy. Looking at the subject lines of commit emails
from the past day I don't see any obvious cause.

On Thu Aug 07 2014 at 4:35:05 AM <solipsis at> wrote:

> results for 09f56fdcacf1 on branch "default"
> --------------------------------------------
> test_codecs leaked [5825, 5825, 5825] references, sum=17475
> test_codecs leaked [1172, 1174, 1174] memory blocks, sum=3520
> test_collections leaked [0, 2, 0] references, sum=2
> test_functools leaked [0, 0, 3] memory blocks, sum=3
> test_site leaked [0, 2, 0] references, sum=2
> test_site leaked [0, 2, 0] memory blocks, sum=2
> Command line was: ['./python', '-m', 'test.regrtest', '-uall', '-R',
> '3:3:/home/antoine/cpython/refleaks/reflogdA4OO6', '-x']

From guido at  Thu Aug  7 17:05:46 2014
From: guido at (Guido van Rossum)
Date: Thu, 7 Aug 2014 08:05:46 -0700
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <lruocs$svc$>
References: <>
 <lruma1$k2$> <>
Message-ID: <>

Hm. I personally consider a trailing slash significant. It feels
semantically different (and in some cases it is) so I don't think it should
be normalized. The behavior of os.path.split() here feels right.
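
A minimal sketch of the difference under discussion, assuming POSIX-style paths and the pathlib behavior reported in issue #21039. The posixpath and PurePosixPath variants are used only so that the result does not depend on the host platform:

```python
import posixpath
from pathlib import PurePosixPath

# os.path.split() keeps the distinction: a trailing slash yields an empty tail.
with_slash = posixpath.split("foo/bar/")
without_slash = posixpath.split("foo/bar")

# pathlib normalizes the trailing slash away, so both spellings look the same.
normalized = str(PurePosixPath("foo/bar/"))

print(with_slash)      # ('foo/bar', '')
print(without_slash)   # ('foo', 'bar')
print(normalized)      # foo/bar
```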

On Wed, Aug 6, 2014 at 7:30 PM, Antoine Pitrou <antoine at> wrote:

> On 06/08/2014 22:12, Ben Finney wrote:
>> You seem to be saying that "pathlib" is not intended to be helpful for
>> constructing a shell command.
> pathlib lets you do operations on paths. It also gives you a string
> representation of the path that's expected to designate that path when
> talking to operating system APIs. It doesn't give you the possibility to
> store other semantic variations ("whether a new directory level must be
> created"); that's up to you to add those.
> (similarly, it doesn't have separate classes to represent "a file", "a
> directory", "a non-existing file", etc.)
> Regards
> Antoine.

--Guido van Rossum

From zachary.ware+pydev at  Thu Aug  7 19:16:03 2014
From: zachary.ware+pydev at (Zachary Ware)
Date: Thu, 7 Aug 2014 12:16:03 -0500
Subject: [Python-Dev] [Python-checkins] Daily reference leaks
	(09f56fdcacf1): sum=21004
In-Reply-To: <>
References: <>
Message-ID: <>

On Thu, Aug 7, 2014 at 9:04 AM, Brett Cannon <bcannon at> wrote:
> test_codecs is not happy. Looking at the subject lines of commit emails from
> the past day I don't see any obvious cause.

Looks like this was caused by the change I made to regrtest in [1] to
fix refleak testing in test_asyncio [2].  I'm looking into it, but
haven't found any kind of reason for it yet.



From zachary.ware+pydev at  Thu Aug  7 22:51:24 2014
From: zachary.ware+pydev at (Zachary Ware)
Date: Thu, 7 Aug 2014 15:51:24 -0500
Subject: [Python-Dev] [Python-checkins] Daily reference leaks
	(09f56fdcacf1): sum=21004
In-Reply-To: <>
References: <>
Message-ID: <>

On Thu, Aug 7, 2014 at 12:16 PM, Zachary Ware
<zachary.ware+pydev at> wrote:
> On Thu, Aug 7, 2014 at 9:04 AM, Brett Cannon <bcannon at> wrote:
>> test_codecs is not happy. Looking at the subject lines of commit emails from
>> the past day I don't see any obvious cause.
> Looks like this was caused by the change I made to regrtest in [1] to
> fix refleak testing in test_asyncio [2].  I'm looking into it, but
> haven't found any kind of reason for it yet.

I've created an issue to keep track of this
and report my findings thus far.


From chris.barker at  Fri Aug  8 00:06:18 2014
From: chris.barker at (Chris Barker)
Date: Thu, 7 Aug 2014 15:06:18 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140804181013.GO4525@ando>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Mon, Aug 4, 2014 at 11:10 AM, Steven D'Aprano <steve at> wrote:

> On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:
> > Good point -- I was trying to make the point about .join() vs + for strings
> > in an intro python class last year, and made the mistake of having the
> > students test the performance.
> >
> > You need to concatenate a LOT of strings to see any difference at all

> If only that were the case, but it isn't. Here's a cautionary tale for
> how using string concatenation can blow up in your face:
> Chris Withers asks for help debugging HTTP slowness:

Thanks for that -- interesting story. Note that that was not using sum() in
that case though, which is really the issue at hand.

> It shouldn't be hard to demonstrate the difference between repeated
> string concatenation and join, all you need do is defeat sum()'s
> prohibition against strings. Run this bit of code, and you'll see a
> significant difference in performance, even with CPython's optimized
> concatenation:

Well, that does look compelling, but what it shows is that
sum(a_list_of_strings) is slow compared to ''.join(a_list_of_strings). That
doesn't surprise me a bit -- this is really similar to why:


is going to be a lot faster than:


and why I'll tell everyone that is working with lots of numbers to use
numpy. ndarray.sum knows what data type it's dealing with, and can do the
loop in C. Similarly with ''.join() (though not as optimized).

But I'm not sure we're seeing the big O difference here at all -- but
rather the extra calls through each element in the list's __add__ method.

In the case where you already HAVE a big list of strings, then yes, ''.join
is the clear winner.

But I think the case we're often talking about, and I've tested with
students, is when you are building up a long string on the fly out of
little strings. In that case, you need to profile the full "append to list,
then call join()", not just the join() call:

# continued adding of strings ( O(n^2)? )
In [6]: def add_strings(l):
   ...:     s = ''
   ...:     for i in l:
   ...:         s+=i
   ...:     return s

# Using append and then join ( O(n)? )
In [14]: def join_strings(list_of_strings):
   ....:     l = []
   ....:     for i in list_of_strings:
   ....:         l.append(i)
   ....:     return ''.join(l)

In [23]: timeit add_strings(strings)
1000000 loops, best of 3: 831 ns per loop

In [24]: timeit join_strings(strings)
100000 loops, best of 3: 1.87 µs per loop

## hmm -- concatenating is faster for a small list of tiny strings....

In [31]: strings = list('Hello World')* 1000

strings *= 1000
In [26]: timeit add_strings(strings)
1000 loops, best of 3: 932 µs per loop

In [27]: timeit join_strings(strings)
1000 loops, best of 3: 967 µs per loop

## now about the same.

In [31]: strings = list('Hello World')* 10000

In [29]: timeit add_strings(strings)
100 loops, best of 3: 9.44 ms per loop

In [30]: timeit join_strings(strings)
100 loops, best of 3: 10.1 ms per loop

still about the same...

In [31]: strings = list('Hello World')* 1000000

In [32]: timeit add_strings(strings)
1 loops, best of 3: 1.27 s per loop

In [33]: timeit join_strings(strings)
1 loops, best of 3: 1.05 s per loop

there we go -- slight advantage to joining.....

So this is why we've said that the common wisdom about string concatenating
isn't really a practical issue.
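
For anyone wanting to repeat the comparison outside IPython, here is a minimal sketch using the standard timeit module. The two functions follow the ones above; the compare() helper and its parameters are assumptions added for illustration, and the numbers it prints will vary by machine and interpreter version:

```python
import timeit


def add_strings(parts):
    """Build the result by repeated += on a str."""
    s = ''
    for part in parts:
        s += part
    return s


def join_strings(parts):
    """Build the result by appending to a list, then joining once."""
    pieces = []
    for part in parts:
        pieces.append(part)
    return ''.join(pieces)


def compare(repeat_count, number=5):
    """Return best-of-3 per-call times (add, join) in seconds."""
    strings = list('Hello World') * repeat_count
    add_time = min(timeit.repeat(lambda: add_strings(strings),
                                 number=number, repeat=3)) / number
    join_time = min(timeit.repeat(lambda: join_strings(strings),
                                  number=number, repeat=3)) / number
    return add_time, join_time


if __name__ == '__main__':
    for repeat_count in (1, 1000, 10000):
        add_time, join_time = compare(repeat_count)
        print('%8d items  add: %.6f s  join: %.6f s'
              % (repeat_count * 11, add_time, join_time))
```

Timing the whole "append, then join" function rather than the join() call alone is the point: it measures what code building a string on the fly actually pays.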

But if you already have the strings all in a list, then yes, join() is a
major win over sum()

In fact, I tried the above with sum() -- and it was really, really slow. So
slow I didn't have the patience to wait for it.

Here is a smaller example:

In [22]: strings = list('Hello World')* 10000

In [23]: timeit add_strings(strings)
100 loops, best of 3: 9.61 ms per loop

In [24]: timeit sum( strings, Faker() )
1 loops, best of 3: 246 ms per loop
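
Faker is not defined in this message. One way to get past sum()'s string check, assumed here rather than taken from the thread, is a start object whose __add__ simply returns the other operand:

```python
class Faker:
    """Hypothetical start value: adding anything to it returns the other operand."""

    def __add__(self, other):
        return other


strings = list('Hello World') * 3

# sum() refuses a str start value ...
try:
    sum(strings, '')
except TypeError as exc:
    print('rejected:', exc)

# ... but accepts an arbitrary object, after which str.__add__ does the work.
result = sum(strings, Faker())
print(result == ''.join(strings))   # True
```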

So why is sum() so darn slow with strings compared to a simple loop with +=?

(and if I try it with a list 10 times as long it takes "forever")

Perhaps the http issue cited was before some nifty optimizations in current



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From ethan at  Fri Aug  8 01:01:49 2014
From: ethan at (Ethan Furman)
Date: Thu, 07 Aug 2014 16:01:49 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 08/07/2014 03:06 PM, Chris Barker wrote:

[snip timings, etc.]

I don't remember where, but I believe that cPython has an optimization built in for repeated string concatenation, which 
is probably why you aren't seeing big differences between the + and the sum().

A little testing shows how to defeat that optimization:

   blah = ''
   for string in ['booyah'] * 100000:
       blah = string + blah

Note the reversed order of the addition.

--> timeit.Timer("for string in ['booya'] * 100000: blah = blah + string", "blah = ''").repeat(3, 1)
[0.021117210388183594, 0.013692855834960938, 0.00768280029296875]

--> timeit.Timer("for string in ['booya'] * 100000: blah = string + blah", "blah = ''").repeat(3, 1)
[15.301048994064331, 15.343288898468018, 15.268463850021362]
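
The same comparison as a small script, for anyone who wants to rerun it. This is a sketch: the function names are made up for illustration, and the speed-up for the first form is commonly attributed to CPython being able to extend the left operand in place when nothing else refers to it, which is an implementation detail and may not hold on other interpreters:

```python
import timeit


def append_concat(parts):
    """blah = blah + part: the form CPython can often extend in place."""
    blah = ''
    for part in parts:
        blah = blah + part
    return blah


def prepend_concat(parts):
    """blah = part + blah: every step has to build a new string."""
    blah = ''
    for part in parts:
        blah = part + blah
    return blah


if __name__ == '__main__':
    parts = ['booya'] * 20000   # smaller than the 100000 above, to keep the run short
    for func in (append_concat, prepend_concat):
        seconds = timeit.timeit(lambda: func(parts), number=1)
        print('%-15s %.4f s' % (func.__name__, seconds))
```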


From ethan at  Fri Aug  8 01:05:50 2014
From: ethan at (Ethan Furman)
Date: Thu, 07 Aug 2014 16:05:50 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 08/07/2014 04:01 PM, Ethan Furman wrote:
> On 08/07/2014 03:06 PM, Chris Barker wrote:
>  the + and the sum().

Yeah, that 'sum' should be 'join'  :/


From ethan at  Fri Aug  8 01:08:14 2014
From: ethan at (Ethan Furman)
Date: Thu, 07 Aug 2014 16:08:14 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 08/07/2014 04:01 PM, Ethan Furman wrote:
> On 08/07/2014 03:06 PM, Chris Barker wrote:
> --> timeit.Timer("for string in ['booya'] * 100000: blah = blah + string", "blah = ''").repeat(3, 1)
> [0.021117210388183594, 0.013692855834960938, 0.00768280029296875]
> --> timeit.Timer("for string in ['booya'] * 100000: blah = string + blah", "blah = ''").repeat(3, 1)
> [15.301048994064331, 15.343288898468018, 15.268463850021362]

Oh, and the join() timings:

--> timeit.Timer("blah = ''.join(['booya'] * 100000)", "blah = ''").repeat(3, 1)
[0.0014629364013671875, 0.0014190673828125, 0.0011930465698242188]

So, + is three orders of magnitude slower than join.


From larry at  Fri Aug  8 06:41:13 2014
From: larry at (Larry Hastings)
Date: Thu, 07 Aug 2014 21:41:13 -0700
Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <>
References: <> <>
Message-ID: <>

On 08/05/2014 08:13 AM, "Martin v. Löwis" wrote:
> For the feature in question,
> I find both "allow_none" and "nullable" acceptable; "noneable" is not.

Well!  It's rare that the core dev community is so consistent in its 
opinion.  I still think "nullable" is totally appropriate, but I'll 
change it to "allow_none".


From p.f.moore at  Fri Aug  8 14:27:28 2014
From: p.f.moore at (Paul Moore)
Date: Fri, 8 Aug 2014 13:27:28 +0100
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <lruma1$k2$>
References: <>
Message-ID: <>

On 7 August 2014 02:55, Antoine Pitrou <antoine at> wrote:
> pathlib is generally concerned with filesystem operations written in Python,
> not arbitrary third-party tools. Also it is probably easy to append the
> trailing slash in your command-line invocation, if so desired.

I had a use case where I wanted to allow a config file to contain
"path: foo" to create a file called foo, and "path: foo/" to create a
directory. It was a shortcut for specifying an explicit "directory:
true" parameter as well.

The fact that pathlib stripped the slash made coding this mildly
tricky (especially as I wanted to cater for Windows users writing
"foo\\"...) It's not a showstopper, but I agree that semantically,
being able to distinguish whether an input had a trailing slash is
sometimes useful.
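
A minimal sketch of the kind of workaround this needs, assuming the check is done on the raw config string before it becomes a path object. The helper name parse_path_entry is hypothetical, and PurePosixPath is used only to keep the example platform-independent:

```python
from pathlib import PurePosixPath


def parse_path_entry(raw):
    """Return (path, wants_directory) for a raw config value.

    Hypothetical helper: the trailing separator is inspected on the raw
    string, because it is gone once the value becomes a path object.
    """
    wants_directory = raw.endswith(('/', '\\'))
    stripped = raw.rstrip('/\\') if wants_directory else raw
    return PurePosixPath(stripped), wants_directory


print(parse_path_entry('foo'))     # (PurePosixPath('foo'), False)
print(parse_path_entry('foo/'))    # (PurePosixPath('foo'), True)
print(parse_path_entry('foo\\'))   # (PurePosixPath('foo'), True)
```

The sketch does not handle a bare "/" or validate the rest of the name; it only shows where the directory flag has to be captured.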


From alexander.belopolsky at  Fri Aug  8 15:39:43 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Fri, 8 Aug 2014 09:39:43 -0400
Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <>
References: <>
Message-ID: <>

On Fri, Aug 8, 2014 at 8:27 AM, Paul Moore <p.f.moore at> wrote:

> I had a use case where I wanted to allow a config file to contain
> "path: foo" to create a file called foo, and "path: foo/" to create a
> directory. It was a shortcut for specifying an explicit "directory:
> true" parameter as well.

Here is my use case: I have a database application that can save a table in
a variety of formats based on the supplied file name.  For example,
save('t.csv', t) saves in CSV text format while save('t', t)  saves in the
default binary format.  In addition, it supports "splayed" format where a
table is saved in multiple files across a directory - one file per column.
 The native database save function chooses this format when file name ends
with a slash: save('t/', t).   I would like to make the save() function in
Python that works like this, but takes pathlib.Path instances instead of
str, but in the current version, I cannot supply 't/' as a Path instance.
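
To make the problem concrete, here is a sketch of the name-based dispatch. The choose_format function and its format labels are hypothetical stand-ins for the database's own rules, and PurePosixPath is used to keep the example platform-independent:

```python
from pathlib import PurePosixPath


def choose_format(name):
    """Pick a storage format from a file name given as str (hypothetical rules)."""
    if name.endswith('/'):
        return 'splayed'      # one file per column, inside a directory
    if name.endswith('.csv'):
        return 'csv'
    return 'binary'


for name in ('t.csv', 't', 't/'):
    # str(PurePosixPath(name)) shows what survives a round trip through pathlib.
    round_tripped = str(PurePosixPath(name))
    print(repr(name), choose_format(name), choose_format(round_tripped))
```

For 't/' the two answers differ: the str name selects the splayed format, while the round-tripped name falls through to the binary default.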

From status at  Fri Aug  8 18:08:08 2014
From: status at (Python tracker)
Date: Fri,  8 Aug 2014 18:08:08 +0200 (CEST)
Subject: [Python-Dev] Summary of Python tracker Issues
Message-ID: <>

ACTIVITY SUMMARY (2014-08-01 - 2014-08-08)
Python tracker at

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4602 (+10)
  closed 29340 (+43)
  total  33942 (+53)

Open issues with patches: 2177 

Issues opened (39)

#21039: pathlib strips trailing slash  reopened by pitrou

#21591: "exec(a, b, c)" not the same as "exec a in b, c" in nested fun  reopened by Arfrever

#22121: IDLE should start with HOME as the initial working directory  opened by mark

#22123: Provide a direct function for types.SimpleNamespace()  opened by mark

#22125: Cure signedness warnings introduced by #22003  opened by dw

#22126: mc68881 fpcr inline asm breaks clang -flto build  opened by ivank

#22128: patch: steer people away from  opened by Frank.van.Dijk

#22131: uuid.bytes optimization  opened by kevinlondon

#22133: IDLE: Set correct WM_CLASS on X11  opened by sahutd

#22135: allow to break into pdb with Ctrl-C for all the commands that  opened by xdegaye

#22137: Test imaplib API on all methods specified in RFC 3501  opened by zvyn

#22138: patch.object doesn't restore function defaults  opened by chepner

#22139: python windows 2.7.8 64-bit wrong binary version  opened by Andreas.Richter

#22140: "python-config --includes" returns a wrong path (double prefix  opened by Michael.Dussere

#22141: rlcompleter.Completer matches too much  opened by donlorenzo

#22143: rlcompleter.Completer has duplicate matches  opened by donlorenzo

#22144: ellipsis needs better display in lexer documentation  opened by François-René.Rideau

#22145: <> in parser spec but not lexer spec  opened by François-René.Rideau

#22147: PosixPath() constructor should not accept strings with embedde  opened by ischwabacher

#22148: frozen.c should #include <importlib.h> instead of "importlib.h  opened by jbeck

#22149: the frame of a suspended generator should not have a local tra  opened by xdegaye

#22150: deprecated-removed directive is broken in Sphinx 1.2.2  opened by berker.peksag

#22153: There is no standard TestCase.runTest implementation  opened by vadmium

#22154: context manager support  opened by Ralph.Broenink

#22155: Out of date code example for tkinter's createfilehandler  opened by vadmium

#22156: Fix compiler warnings  opened by haypo

#22157: FAIL: test_with_pip (test.test_venv.EnsurePipTest)  opened by snehal

#22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy  opened by zvyn

#22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec  opened by zvyn

#22160: Windows installers need to be updated following OpenSSL securi  opened by alex

#22161: Remove unsupported code from ctypes  opened by serhiy.storchaka

#22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul  opened by edulix

#22164: cell object cleared too early?  opened by pitrou

#22165: Empty response from http.server when directory listing contain  opened by jleedev

#22166: test_codecs "leaking" references  opened by zach.ware

#22167: iglob() has misleading documentation (does indeed store names  opened by roysmith

#22168: Turtle Graphics RawTurtle problem  opened by Kent.D..Lee

#22171: stack smash when using ctypes/libffi to access union  opened by wes.kerfoot

#22173: Update lib2to3.tests and test_lib2to3 to use test discovery  opened by zach.ware

Most recent 15 issues with no replies (15)

#22173: Update lib2to3.tests and test_lib2to3 to use test discovery

#22171: stack smash when using ctypes/libffi to access union

#22166: test_codecs "leaking" references

#22164: cell object cleared too early?

#22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul

#22161: Remove unsupported code from ctypes

#22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec

#22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy

#22155: Out of date code example for tkinter's createfilehandler

#22153: There is no standard TestCase.runTest implementation

#22149: the frame of a suspended generator should not have a local tra

#22143: rlcompleter.Completer has duplicate matches

#22140: "python-config --includes" returns a wrong path (double prefix

#22135: allow to break into pdb with Ctrl-C for all the commands that

#22115: Add new methods to trace Tkinter variables

Most recent 15 issues waiting for review (15)

#22173: Update lib2to3.tests and test_lib2to3 to use test discovery

#22165: Empty response from http.server when directory listing contain

#22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul

#22161: Remove unsupported code from ctypes

#22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec

#22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy

#22156: Fix compiler warnings

#22150: deprecated-removed directive is broken in Sphinx 1.2.2

#22149: the frame of a suspended generator should not have a local tra

#22148: frozen.c should #include <importlib.h> instead of "importlib.h

#22143: rlcompleter.Completer has duplicate matches

#22141: rlcompleter.Completer matches too much

#22138: patch.object doesn't restore function defaults

#22137: Test imaplib API on all methods specified in RFC 3501

#22133: IDLE: Set correct WM_CLASS on X11

Top 10 most discussed issues (10)

#19838: test.test_pathlib.PosixPathTest.test_touch_common fails on Fre  23 msgs

#21448: Email Parser use 100% CPU  14 msgs

#22123: Provide a direct function for types.SimpleNamespace()  11 msgs

#21965: Add support for Memory BIO to _ssl  10 msgs

#14910: argparse: disable abbreviation   9 msgs

#21308: PEP 466: backport ssl changes   9 msgs

#22046: should mention that it might throw NotImplement   9 msgs

#21091: EmailMessage.is_attachment should be a method   8 msgs

#22118: urljoin fails with messy relative URLs   8 msgs

#22160: Windows installers need to be updated following OpenSSL securi   8 msgs

Issues closed (43)

#5411: Add xz support to shutil  closed by serhiy.storchaka

#11763: assertEqual memory issues with large text inputs  closed by ezio.melotti

#13540: Document the Action API in argparse  closed by jason.coombs

#15114: Deprecate strict mode of HTMLParser  closed by ezio.melotti

#15826: Increased test coverage of  closed by ezio.melotti

#15974: Optional compact and colored output for regrest  closed by pitrou

#17665: convert test_wsgiref to idiomatic unittest code  closed by ezio.melotti

#18034: Last two entries in the programming FAQ are out of date (impor  closed by ezio.melotti

#18142: Tests fail on Mageia Linux Cauldron x86-64 with some configure  closed by ned.deily

#18588: timeit examples should be consistent  closed by ezio.melotti

#19055: Regular expressions: * does not match as many repetitions as p  closed by ezio.melotti

#20056: Got deprecation warning when running on Windows  closed by serhiy.storchaka

#20170: Derby #1: Convert 137 sites to Argument Clinic in Modules/posi  closed by larry

#20402: List comprehensions should be noted in for loop documentation  closed by rhettinger

#20977: pyflakes: undefined "ctype" in 2 except blocks in the email mo  closed by ezio.melotti

#21047: html.parser.HTMLParser: convert_charrefs should become True by  closed by berker.peksag

#21539: pathlib's Path.mkdir() should allow for "mkdir -p" functionali  closed by barry

#21972: Bugs in the lexer and parser documentation  closed by loewis

#21975: Using pickled/unpickled sqlite3.Row results in segfault rather  closed by serhiy.storchaka

#22077: Improve the error message for various sequences  closed by terry.reedy

#22092: Executing some tests inside Lib/unittest/test individually thr  closed by ezio.melotti

#22097: Linked list API for ordereddict  closed by rhettinger

#22104: test_asyncio unstable in refleak mode  closed by python-dev

#22105: Idle: Hang during File "Save As"  closed by terry.reedy

#22110: enable extra compilation warnings  closed by neologix

#22114: You cannot call communicate() safely after receiving an except  closed by amrith

#22116: Weak reference support for C function objects  closed by pitrou

#22119: Some input chars (i.e. '++') break re.match  closed by ezio.melotti

#22120: Return converter code generated by Argument Clinic has a warni  closed by larry

#22122: turtle module examples should all begin "from turtle import *"  closed by mark

#22124: Rotating items of list to left  closed by zach.ware

#22127: performance regression in socket getsockaddrarg()  closed by loewis

#22129: Please add an equivalent to QString::simplified() to Python st  closed by serhiy.storchaka

#22130: Logging fileConfig behavior does not match documentation  closed by python-dev

#22132: Cannot copy the same directory structure to the same destinati  closed by eric.araujo

#22134: string formatting float rounding errors  closed by ned.deily

#22136: Fix _tkinter compiler warnings on MSVC  closed by python-dev

#22142: PEP 465 operators not described in lexical_analysis  closed by python-dev

#22146: Error message for __build_class__ contains typo  closed by python-dev

#22162: Activating a venv - Dash doesn't understand source  closed by vinay.sajip

#22169: sys.tracebacklimit = 0 does not work as documented in 3.x  closed by ned.deily

#22170: Typo in iterator doc  closed by ezio.melotti

#22172: Local files shadow system modules, even from system modules  closed by ncoghlan

From chris.barker at  Fri Aug  8 17:23:51 2014
From: chris.barker at (Chris Barker)
Date: Fri, 8 Aug 2014 08:23:51 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Thu, Aug 7, 2014 at 4:01 PM, Ethan Furman <ethan at> wrote:

> I don't remember where, but I believe that cPython has an optimization
> built in for repeated string concatenation, which is probably why you
> aren't seeing big differences between the + and the sum().

Indeed -- clearly so.

> A little testing shows how to defeat that optimization:
>
>   blah = ''
>   for string in ['booyah'] * 100000:
>       blah = string + blah
> Note the reversed order of the addition.

thanks -- cool trick.

> Oh, and the join() timings:
> --> timeit.Timer("blah = ''.join(['booya'] * 100000)", "blah =
> ''").repeat(3, 1)
> [0.0014629364013671875, 0.0014190673828125, 0.0011930465698242188]
> So, + is three orders of magnitude slower than join.

only one if you use the optimized form of +, and not even that if you
need to build up the list first, which is the common use-case.

So my final question is this:

repeated string concatenation is not the "recommended" way to do this --
but nevertheless, cPython has an optimization that makes it fast and
efficient, to the point that there is no practical performance reason to
prefer appending to a list and calling join() afterward.

So why not apply a similar optimization to sum() for strings?



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From ethan at  Fri Aug  8 20:09:45 2014
From: ethan at (Ethan Furman)
Date: Fri, 08 Aug 2014 11:09:45 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 08/08/2014 08:23 AM, Chris Barker wrote:
> So my final question is this:
> repeated string concatenation is not the "recommended" way to do this -- but nevertheless, cPython has an optimization
> that makes it fast and efficient, to the point that there is no practical performance reason to prefer appending to a
> list and calling join() afterward.
> So why not apply a similar optimization to sum() for strings?

That I cannot answer -- I find the current situation with sum highly irritating.


From raymond.hettinger at  Sat Aug  9 02:34:34 2014
From: raymond.hettinger at (Raymond Hettinger)
Date: Fri, 8 Aug 2014 17:34:34 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Aug 8, 2014, at 11:09 AM, Ethan Furman <ethan at> wrote:

>> So why not apply a similar optimization to sum() for strings?
> That I cannot answer -- I find the current situation with sum highly irritating.

It is only irritating if you are misusing sum().

The str.__add__ optimization was put in because
it was common for people to accidentally incur
the performance penalty.

With sum(), we don't seem to have that problem
(I don't see people using it to add lists except
just to show that could be done).



From ethan at  Sat Aug  9 02:56:24 2014
From: ethan at (Ethan Furman)
Date: Fri, 08 Aug 2014 17:56:24 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 08/08/2014 05:34 PM, Raymond Hettinger wrote:
> On Aug 8, 2014, at 11:09 AM, Ethan Furman <ethan at> wrote:
>>> So why not apply a similar optimization to sum() for strings?
>> That I cannot answer -- I find the current situation with sum highly irritating.
> It is only irritating if you are misusing sum().

Actually, I have an advanced degree in irritability -- perhaps you've noticed in the past?

I don't use sum at all, or at least very rarely, and it still irritates me.  It feels like I'm being told I'm too dumb 
to figure out when I can safely use sum and when I can't.


From alexander.belopolsky at  Sat Aug  9 04:20:37 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Fri, 8 Aug 2014 22:20:37 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman <ethan at> wrote:

> I don't use sum at all, or at least very rarely, and it still irritates me.

You are not alone.  When I see sum([a, b, c]), I think it is a + b + c, but
in Python it is 0 + a + b + c.  If we had a "join" operator for strings
that is different from + - then sure, I would not try to use sum to join
strings, but we don't.  I have always thought that sum(x) is just a
shorthand for reduce(operator.add, x), but again it is not so in Python.
 While "sum should only be used for numbers,"  it turns out it is not a
good choice for floats - use math.fsum.  While "strings are blocked because
sum is slow," numpy arrays with millions of elements are not.  And try to
explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine.
 Why have builtin sum at all if its use comes with so many caveats?
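
A short sketch of the start-value behavior described above, using only the builtin sum() and functools.reduce. The exact wording of the error messages may differ between Python versions, so they are printed rather than assumed:

```python
import operator
from functools import reduce

# With numbers the implicit start value of 0 is harmless.
print(sum([1, 2, 3]))                          # 6
print(reduce(operator.add, [1, 2, 3]))         # 6

# With other types the difference shows: reduce() has no implicit start value.
print(reduce(operator.add, ['a', 'b', 'c']))   # abc
print(sum([[1], [2], [3]], []))                # [1, 2, 3]

try:
    sum(['a', 'b', 'c'])                       # really 0 + 'a' + 'b' + 'c'
except TypeError as exc:
    print('TypeError:', exc)

try:
    sum(['a', 'b', 'c'], '')                   # a str start value is refused outright
except TypeError as exc:
    print('TypeError:', exc)
```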

From steve at  Sat Aug  9 07:08:45 2014
From: steve at (Steven D'Aprano)
Date: Sat, 9 Aug 2014 15:08:45 +1000
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <20140802203513.GA10447@k2>
Message-ID: <20140809050845.GZ4525@ando>

On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote:
> On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman <ethan at> wrote:
> > I don't use sum at all, or at least very rarely, and it still irritates me.
> You are not alone.  When I see sum([a, b, c]), I think it is a + b + c, but
> in Python it is 0 + a + b + c.  If we had a "join" operator for strings
> that is different from + - then sure, I would not try to use sum to join
> strings, but we don't.

I've long believed that + is the wrong operator for concatenating 
strings, and that & makes a much better operator. We wouldn't be having 
these interminable arguments about using sum() to concatenate strings 
(and lists, and tuples) if the & operator was used for concatenation and 
+ was only used for numeric addition.

> I have always thought that sum(x) is just a
> shorthand for reduce(operator.add, x), but again it is not so in Python.

The signature of reduce is:

    reduce(function, sequence[, initial]) -> value

so sum() is (at least conceptually) a shorthand for reduce:

def sum(values, initial=0):
    return reduce(operator.add, values, initial)

but that's an implementation detail, not a language promise, and sum() 
is free to differ from that simple version. Indeed, even the public 
interface is different, since sum() prohibits using a string as the 
initial value and only promises to work with numbers. The fact that it 
happens to work with lists and tuples is somewhat of an accident of 

> While "sum should only be used for numbers,"  it turns out it is not a
> good choice for floats - use math.fsum.

Correct. And if you (generic you, not you personally) do not understand 
why simple-minded addition of floats is troublesome, then you're going 
to have a world of trouble. Anyone who is disturbed by the question of 
"should I use sum or math.fsum?" probably shouldn't be writing serious 
floating point code at all. Floating point computations are hard, and 
there is simply no escaping this fact.
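
A small illustration of why the choice matters. The naive total is written as an explicit loop so that the result does not depend on how a particular interpreter implements sum():

```python
import math

values = [0.1] * 10

# Naive left-to-right addition, written out explicitly.
naive = 0.0
for value in values:
    naive += value

print(naive)              # 0.9999999999999999
print(math.fsum(values))  # 1.0
print(naive == 1.0)       # False
```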

> While "strings are blocked because
> sum is slow," numpy arrays with millions of elements are not.

That's not a good example. Strings are potentially O(N**2), which means 
not just "slow" but *agonisingly* slow, as in taking a week -- no 
exaggeration -- to concat a million strings. If it takes a microsecond to 
concat two strings, then 1e6**2 such concatenations could take over 
eleven days. Slowness of such magnitude might as well be "the process 
has locked up".
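
The arithmetic behind that estimate can be checked directly. The sketch below treats the worst case as N**2 unit steps, as the paragraph above does. At one microsecond per step a million strings comes to about eleven and a half days, while at one nanosecond per step it would be closer to seventeen minutes. The helper name is made up for illustration:

```python
def worst_case_seconds(n, seconds_per_step):
    """Time for n**2 unit steps, the worst-case model used above."""
    return n ** 2 * seconds_per_step


SECONDS_PER_DAY = 24 * 60 * 60

for label, per_step in (('1 ns', 1e-9), ('1 us', 1e-6)):
    seconds = worst_case_seconds(10 ** 6, per_step)
    print('%s per step: %.0f s = %.2f days'
          % (label, seconds, seconds / SECONDS_PER_DAY))
```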

In comparison, summing a numpy array with a million entries is not 
really slow in that sense. The time taken is proportional to the number 
of entries, and differs from summing a list only by a constant factor.

Besides, in the case of strings it is quite simple to decide "is the 
initial value a string?", whereas with lists or numpy arrays it's quite 
hard to decide "is the list or array so huge that the user will consider 
this too slow?". What counts as "too slow" depends on the machine it is 
running on, what other processes are running, and the user's mood, and 
leads to the silly result that summing an array of N items succeeds but 
N+1 items doesn't. So in the case of strings, it is easy to make a
blanket prohibition, but in the case of lists or arrays, there is no 
reasonable place to draw the line.

> And try to
> explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine.

I think that's because sum() has to box up each and every element in the 
array into an object, which is wasteful, while abs() can delegate to a 
specialist array.__abs__ method. Although that's not something beginners 
should be expected to understand, no serious Python programmer should be 
confused by this. As programmers, we should expect to have some 
understanding of our tools, how they work, their limitations, and when 
to use a different tool. That's why numpy has its own version of sum 
which is designed to work specifically on numpy arrays. Use a specialist 
tool for a specialist job:

py> with Stopwatch():
...     sum(carray)  # carray is a numpy array of 75000000 floats.
time taken: 52.659770 seconds
py> with Stopwatch():
...     numpy.sum(carray)
time taken: 0.161263 seconds
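
Stopwatch is not defined in this message. A minimal sketch of such a helper, assuming only that it reports elapsed wall-clock time for the enclosed block, could look like this. A plain list stands in for the numpy array so the example has no third-party dependency:

```python
import time


class Stopwatch:
    """Hypothetical stand-in for the helper used above: times a with-block."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print('time taken: %f seconds' % self.elapsed)
        return False   # do not swallow exceptions


data = [0.5] * 1000000   # a plain list here; the original used a numpy array

with Stopwatch() as stopwatch:
    total = sum(data)
```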

>  Why have builtin sum at all if its use comes with so many caveats?

Because sum() is a perfectly reasonable general purpose tool for adding 
up small amounts of numbers where high floating point precision is not 
required. It has been included as a built-in because Python comes with 
"batteries included", and a basic function for adding up a few numbers 
is an obvious, simple battery. But serious programmers should be 
comfortable with the idea that you use the right tool for the right job.

If you visit a hardware store, you will find that even something as 
simple as the hammer exists in many specialist varieties. There are tack 
hammers, claw hammers, framing hammers, lump hammers, rubber and wooden 
mallets, "brass" non-sparking hammers, carpet hammers, brick hammers, 
ball-peen and cross-peen hammers, and even more specialist versions like 
geologist's hammers. Bashing an object with something hard is remarkably 
complicated, and there are literally dozens of types and sizes of "the 
hammer".  Why should it be a surprise that there are a handful of 
different ways to sum items?


From greg.ewing at  Sat Aug  9 07:36:11 2014
From: greg.ewing at (Greg Ewing)
Date: Sat, 09 Aug 2014 17:36:11 +1200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140809050845.GZ4525@ando>
References: <20140802203513.GA10447@k2>
Message-ID: <>

Steven D'Aprano wrote:
> I've long believed that + is the wrong operator for concatenating 
> strings, and that & makes a much better operator.

Do you have a reason for preferring '&' in particular, or
do you just want something different from '+'?

Personally I can't see why "bitwise and" on strings should
be a better metaphor for concatenation than "addition". :-)


From antoine at  Sat Aug  9 07:39:16 2014
From: antoine at (Antoine Pitrou)
Date: Sat, 09 Aug 2014 01:39:16 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140809050845.GZ4525@ando>
References: <20140802203513.GA10447@k2>
Message-ID: <ls4c68$pd8$>

On 09/08/2014 01:08, Steven D'Aprano wrote:
> On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote:
>> On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman <ethan at> wrote:
>>> I don't use sum at all, or at least very rarely, and it still irritates me.
>> You are not alone.  When I see sum([a, b, c]), I think it is a + b + c, but
>> in Python it is 0 + a + b + c.  If we had a "join" operator for strings
>> that is different from + - then sure, I would not try to use sum to join
>> strings, but we don't.
> I've long believed that + is the wrong operator for concatenating
> strings, and that & makes a much better operator. We wouldn't be having
> these interminable arguments about using sum() to concatenate strings
> (and lists, and tuples) if the & operator was used for concatenation and
> + was only used for numeric addition.

Come on. These arguments are interminable because many people (including 
you) love feeding interminable arguments. No need to blame Python for that.

And for that matter, this interminable discussion should probably have 
taken place on python-ideas or even python-list.



From stephen at  Sat Aug  9 09:08:41 2014
From: stephen at (Stephen J. Turnbull)
Date: Sat, 09 Aug 2014 16:08:41 +0900
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

Alexander Belopolsky writes:

 > Why have builtin sum at all if its use comes with so many caveats?

Because we already have it.  If the caveats had been known when it was
introduced, maybe it wouldn't have been.  The question is whether you
can convince python-dev that it's worth changing the definition of
sum().  IMO that's going to be very hard to do.  All the suggestions
I've seen so far are (IMHO, YMMV) just as ugly as the present
situation.

From p.f.moore at  Sat Aug  9 10:36:31 2014
From: p.f.moore at (Paul Moore)
Date: Sat, 9 Aug 2014 09:36:31 +0100
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140809050845.GZ4525@ando>
References: <20140802203513.GA10447@k2>
Message-ID: <>

On 9 August 2014 06:08, Steven D'Aprano <steve at> wrote:
> py> with Stopwatch():
> ...     sum(carray)  # carray is a numpy array of 75000000 floats.
> ...
> 112500000.0
> time taken: 52.659770 seconds
> py> with Stopwatch():
> ...     numpy.sum(carray)
> ...
> 112500000.0
> time taken: 0.161263 seconds
>>  Why have builtin sum at all if its use comes with so many caveats?
> Because sum() is a perfectly reasonable general purpose tool for adding
> up small amounts of numbers where high floating point precision is not
> required. It has been included as a built-in because Python comes with
> "batteries included", and a basic function for adding up a few numbers
> is an obvious, simple battery. But serious programmers should be
> comfortable with the idea that you use the right tool for the right job.

Changing the subject a little, but the Stopwatch function you used up
there is "an obvious, simple battery" for timing a chunk of code at
the interactive prompt. I'm amazed there's nothing like it in the
timeit module...
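
A helper of that kind takes only a few lines. The following is a minimal sketch of a Stopwatch-style context manager; it is an assumption about the general shape of such a tool, not the actual Stopwatch used in the quoted session, and the elapsed attribute and output format are guesses modelled on the "time taken" lines:

```python
import time


class Stopwatch:
    """Time the body of a ``with`` block (hypothetical sketch)."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print("time taken: %f seconds" % self.elapsed)
        return False  # do not swallow exceptions raised in the block


with Stopwatch() as sw:
    total = sum(range(1000000))
```

At the interactive prompt the binding with "as sw" can be dropped; it is kept here only so the measured time can be inspected afterwards.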


From benhoyt at  Sat Aug  9 18:43:01 2014
From: benhoyt at (Ben Hoyt)
Date: Sat, 9 Aug 2014 12:43:01 -0400
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
Message-ID: <>

Just thought I'd share some of my excitement about how fast the all-C
version [1] of os.scandir() is turning out to be.

Below are the results of my scandir / walk benchmark run with three
different versions. I'm using an SSD, which seems to make scandir's
advantage over listdir / walk especially large. Note that benchmark results can
vary a lot, depending on operating system, file system, hard drive
type, and the OS's caching state.

Anyway, os.walk() can be FIFTY times as fast using os.scandir().

# Old ctypes implementation of scandir in
C:\work\scandir>\work\python\cpython\python -r
Using slower ctypes version of scandir
os.walk took 1.144s, scandir.walk took 0.060s -- 19.2x as fast

# Existing "half C" implementation of scandir in _scandir.c:
C:\work\scandir>\Python34-x86\python.exe -r
Using fast C version of scandir
os.walk took 1.160s, scandir.walk took 0.042s -- 27.6x as fast

# New "all C" os.scandir implementation in posixmodule.c:
C:\work\scandir>\work\python\cpython\python -r
Using Python 3.5's builtin os.scandir()
os.walk took 1.141s, scandir.walk took 0.022s -- 53.0x as fast

[1] Work in progress implementation as part of Python 3.5's
posixmodule.c available here:


From alexander.belopolsky at  Sat Aug  9 20:02:58 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Sat, 9 Aug 2014 14:02:58 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Sat, Aug 9, 2014 at 3:08 AM, Stephen J. Turnbull <stephen at>
wrote:

> All the suggestions
> I've seen so far are (IMHO, YMMV) just as ugly as the present
> situation.

What is ugly about allowing strings?  CPython certainly has a way to
make sum(x, '') at least as efficient as y='';for in in x; y+= x is now.
 What is ugly about making sum([a, b, ..]) be equivalent to a + b + .. so
that non-empty lists of arbitrary types can be "summed"?  What is ugly
about harmonizing sum(x) and reduce(operator.add, x) behaviors?
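
The difference between the two behaviours can be seen in a short sketch. It reflects how current CPython 3 behaves; the variable names are illustrative only:

```python
import operator
from functools import reduce

words = ["a", "b", "c"]

# reduce() with operator.add computes a + b + c with no implicit start value.
joined = reduce(operator.add, words)

# sum() starts from 0 by default, so it evaluates 0 + "a" and fails.
try:
    sum(words)
    default_start_error = None
except TypeError as exc:
    default_start_error = exc

# Passing '' as the start value is explicitly rejected for strings.
try:
    sum(words, "")
    str_start_error = None
except TypeError as exc:
    str_start_error = exc

# Lists are accepted when an explicit start value is given.
flattened = sum([[1], [2, 3]], [])
```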

From alexander.belopolsky at  Sat Aug  9 20:04:00 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Sat, 9 Aug 2014 14:04:00 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Sat, Aug 9, 2014 at 2:02 PM, Alexander Belopolsky <
alexander.belopolsky at> wrote:

> y='';for in in x; y+= x

Should have been

for i in x: y += i

From gokoproject at  Sat Aug  9 20:44:10 2014
From: gokoproject at (John Yeuk Hon Wong)
Date: Sat, 09 Aug 2014 14:44:10 -0400
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly
 explained in python 2 and 3 doc
Message-ID: <>


Referring to my discussion on [1] and then on #python this afternoon.

A little background would help people to understand where this was 
coming from.

1. I write Python 2 code and have done zero Python-3 specific code.
2. I have always used class Foo(object), so I did not know the new 
style is no longer required in Python 3. I feel "stupid" and "wrong" for 
thinking (object) is still a convention in Python 3.
3. Many Python 2 tutorials do not use object as the base class, whether 
for historical reasons or lack of information/education, and this can cause 
confusion for newcomers searching for answers when they consult the 
official documentation.

While Python 3 code no longer requires object be the base class for the 
new-style class definition, I believe (object) is still required if one 
has to write 2-3 compatible code. But this was not explained or warned about 
anywhere in Python 2 and Python 3 code, AFAIK. (if I am wrong, please 
correct me)
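
To make the distinction concrete, here is a minimal sketch that runs on Python 3. The class names are illustrative. On Python 2 the first class would instead be a classic class, with an empty __bases__ tuple and a type of classobj:

```python
class Implicit:          # no explicit base
    pass


class Explicit(object):  # explicit base, also new-style on Python 2
    pass


# On Python 3 both spellings produce the same kind of class.
same_bases = Implicit.__bases__ == Explicit.__bases__ == (object,)
same_metaclass = type(Implicit) is type(Explicit) is type
```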

I propose the following:

* It is desirable to state boldly to users that (object) is no longer 
needed in Python-3 **only** code and warn users to revert to (object) 
style if the code needs to be 2 and 3 compatible.

* In addition, Python 2 doc [2] should be fixed by introducing the 
new-style classes. This problem was noted a long long time ago according 
to [4].

* I would like to see the warnings from the first item above added to [2] 
and [3], the Python 2 and 3 documentation.

Possible objection(s):

* We are pushing toward Python 3; some years from now we won't need to 
maintain both Python 2 and 3 code. And many people, especially 
newcomers, will probably have no need to maintain Python 2 and 3 
compatible code.

My answer to that is we need to be careful with marketing. First, it is 
a little embarrassing to assume and to find out the assumption is not 
entirely accurate. Secondly, Python 2 will not go away any time soon and 
most tutorials available on the Internet today are still written for 
Python 2. Furthermore, this CAN be a "gotcha" for new developers knowing 
only Python 3 writing Python 2 & 3 compatible code.

* Books can do a better job

I haven't actually reviewed or read any Python 3 books, since most of my 
code works without my having to worry about Python 2/3 incompatibility yet.
So I don't have an accurate answer, but from a very quick glance over a 
popular Python 3 book (I am not sure whether naming it is ethical, so I 
am going to grey it out here), the book just writes class Foo: and 
doesn't note the difference between 2 and 3 with classes. It is not wrong, 
since the book is about programming in Python 3, NOT writing 2 and 3, 
but this is where the communication breaks. Docs and books don't give 
all the answers needed.

P.S. Sorry if I should have asked on #python-dev first or made a 
ticket but I've decided to send to mailing list before making a bug ticket.
First time!


Yeuk Hon





From alexander.belopolsky at  Sat Aug  9 21:20:42 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Sat, 9 Aug 2014 15:20:42 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140809050845.GZ4525@ando>
References: <20140802203513.GA10447@k2>
Message-ID: <>

On Sat, Aug 9, 2014 at 1:08 AM, Steven D'Aprano <steve at> wrote:

> We wouldn't be having
> these interminable arguments about using sum() to concatenate strings
> (and lists, and tuples) if the & operator was used for concatenation and
> + was only used for numeric addition.

But we would probably have a similar discussion about all(). :-)

Use of + is consistent with the use of * for repetition.  What would you
use for repetition if you use & instead?

Compare, for example

s + ' ' * (n - len(s))


s & ' ' * (n - len(s))

Which one is clearer?

It is sum() that needs to be fixed, not +.  Not having sum([a, b])
equivalent to a + b for any a, b pair is hard to justify.

From tjreedy at  Sat Aug  9 22:46:56 2014
From: tjreedy at (Terry Reedy)
Date: Sat, 09 Aug 2014 16:46:56 -0400
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly
 explained in python 2 and 3 doc
In-Reply-To: <>
References: <>
Message-ID: <ls61ch$2b5$>

On 8/9/2014 2:44 PM, John Yeuk Hon Wong wrote:
> Hi.
> Referring to my discussion on [1] and then on #python this afternoon.
> A little background would help people to understand where this was
> coming from.
> 1. I write Python 2 code and have done zero Python-3 specific code.
> 2. I have always used class Foo(object), so I did not know the new
> style is no longer required in Python 3. I feel "stupid" and "wrong" for
> thinking (object) is still a convention in Python 3.

If someone else tried to make you feel that way, they are Code of 
Conduct violators who should be ignored. If you are beating yourself on 
the head, stop.

> 3. Many Python 2 tutorials do not use object as the base class, whether
> for historical reasons or lack of information/education,

Probably both. Either way, the result is a disservice to readers.

> and can cause confusion for newcomers searching for answers
> when they consult the official documentation.

I and some other people STRONGLY recommend that newcomers start with 
Python 3 and Python 3 docs and completely ignore Python 2 unless they 

> While Python 3 code no longer requires object be the base class for the
> new-style class definition, I believe (object) is still required if one
> has to write 2-3 compatible code. But this was not explained or warned about
> anywhere in Python 2 and Python 3 code, AFAIK. (if I am wrong, please
> correct me)
> I propose the following:
> * It is desirable to state boldly to users that (object) is no longer
> needed in Python-3 **only** code and warn users to revert to (object)
> style if the code needs to be 2 and 3 compatible.

I think 'boldly' and 'warn' are a bit overstated.

> * In addition, Python 2 doc [2] should be fixed by introducing the
> new-style classes.

Definitely. The 2.x tutorial starts with class x: and continues that way 
half way through the chapter.  I think it should start with class 
x(object): and at the end of the first half, briefly mention that class 
x: in 2.x gets something slightly different that beginners can mostly 
ignore, while class x: in 3.x == class x(object): and that the latter 
works the same for both.

The 3.x tutorial, in the same place, could *briefly* mention that class 
x: == class x(object): and that the latter is usually only used in code 
that also runs on 2.x or has been converted without removing the extra 
code.  The 3.x tutorial should *not* mention old style classes.

 > This problem was noted a long long time ago according to [4].

The opening statement "Unfortunately, new-style classes have not yet 
been integrated into Python's standard documentation." is perhaps a decade 
out of date.  That page should not have been included in the new site 
design without being modified.

> [1]:
> [2]:
> [3]:
> [4]:

Terry Jan Reedy

From jeanpierreda at  Sat Aug  9 23:07:58 2014
From: jeanpierreda at (Devin Jeanpierre)
Date: Sat, 9 Aug 2014 14:07:58 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <20140802203513.GA10447@k2>
Message-ID: <>

On Sat, Aug 9, 2014 at 12:20 PM, Alexander Belopolsky
<alexander.belopolsky at> wrote:
> On Sat, Aug 9, 2014 at 1:08 AM, Steven D'Aprano <steve at> wrote:
>> We wouldn't be having
>> these interminable arguments about using sum() to concatenate strings
>> (and lists, and tuples) if the & operator was used for concatenation and
>> + was only used for numeric addition.
> But we would probably have a similar discussion about all(). :-)
> Use of + is consistent with the use of * for repetition.  What would you use
> for repetition if you use & instead?

If the only goal is to not be tempted to use sum() for string
concatenation, how about using *? This is more consistent with
mathematics terminology, where a * b is not necessarily the same as b
* a (unlike +, which is commutative). As an example, consider matrix
multiplication. Then, to answer your question, repetition would have
been s ** n.

(In fact, this is the notation for concatenation and repetition used
in formal language theory.)

(If we really super wanted to add this to Python, obviously we'd use
the @ and @@ operators. But it's a bit late for that.)

-- Devin

From steve at  Sun Aug 10 02:44:52 2014
From: steve at (Steven D'Aprano)
Date: Sun, 10 Aug 2014 10:44:52 +1000
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly
	explained in python 2 and 3 doc
In-Reply-To: <>
References: <>
Message-ID: <20140810004452.GB4525@ando>

On Sat, Aug 09, 2014 at 02:44:10PM -0400, John Yeuk Hon Wong wrote:
> Hi.
> Referring to my discussion on [1] and then on #python this afternoon.
> A little background would help people to understand where this was 
> coming from.
> 1. I write Python 2 code and have done zero Python-3 specific code.
> 2. I have always used class Foo(object), so I did not know the new 
> style is no longer required in Python 3. I feel "stupid" and "wrong" for 
> thinking (object) is still a convention in Python 3.

But object is still a convention in Python 3.

It is certainly required when writing code that will behave the same in 
version 2 and 3, and it's optional in 3-only code, but certainly not 
frowned upon or discouraged. There's nothing wrong with explicitly 
inheriting from object in Python 3, and with the Zen of Python "Explicit 
is better than implicit" I would argue that *leaving it out* should be 
very slightly discouraged.

class Spam:  # okay, but a bit lazy
class Spam(object):  # better

Perhaps PEP 8 should make a recommendation, but if so, I think it should 
be a very weak one. In Python 3, it really doesn't matter which you 
write. My own personal practice is to explicitly inherit from object 
when the class is "important" or more than half a dozen lines, and leave 
it out if the class is a stub or tiny.

> 3. Many Python 2 tutorials do not use object as the base class, whether 
> for historical reasons or lack of information/education, and this can cause 
> confusion for newcomers searching for answers when they consult the 
> official documentation.

We can't do anything about third party tutorials :-(

> While Python 3 code no longer requires object be the base class for the 
> new-style class definition, I believe (object) is still required if one 
> has to write 2-3 compatible code. But this was not explained or warned about 
> anywhere in Python 2 and Python 3 code, AFAIK. (if I am wrong, please 
> correct me)

It's not *always* required, only if you use features which require 
new-style classes, e.g. super, or properties.

> I propose the following:
> * It is desirable to state boldly to users that (object) is no longer 
> needed in Python-3 **only** code 

I'm against that. Stating this boldly will be understood by some readers 
that object should not be used, and I'm strongly against that. I believe 
explicitly inheriting from object should be mildly preferred, not 
strongly discouraged.

> and warn users to revert to (object) 
> style if the code needs to be 2 and 3 compatible.

I don't think that should be necessary, but have no objections to it 
being mentioned. I think it should be obvious: if you need new-style 
behaviour in Python 2, then obviously you have to inherit from object 
otherwise you have a classic class. That requirement doesn't go away 
just because your code will sometimes run under Python 3.

Looking at your comment here:

> [1]:

there is a reply from zeckalpha, who says:

   "Actually, leaving out `object` is the preferred convention for 
    Python 3, as they are semantically equivalent."

How does (s)he justify this claim?

   "Explicit is better than implicit."

which is not logical. If you leave out `object`, that's implicit, not 
explicit.


From rosuav at  Sun Aug 10 03:01:17 2014
From: rosuav at (Chris Angelico)
Date: Sun, 10 Aug 2014 11:01:17 +1000
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly
 explained in python 2 and 3 doc
In-Reply-To: <20140810004452.GB4525@ando>
References: <>
Message-ID: <>

On Sun, Aug 10, 2014 at 10:44 AM, Steven D'Aprano <steve at> wrote:
> Looking at your comment here:
>> [1]:
> there is a reply from zeckalpha, who says:
>    "Actually, leaving out `object` is the preferred convention for
>     Python 3, as they are semantically equivalent."
> How does (s)he justify this claim?
>    "Explicit is better than implicit."
> which is not logical. If you leave out `object`, that's implicit, not
> explicit.

The justification is illogical. However, I personally believe
boilerplate should be omitted where possible; that's why we have a
whole lot of things that "just work". Why does Python not have
explicit boolification for if/while checks? REXX does (if you try to
use anything else, you get a run-time error "Logical value not 0 or
1"), and that's more explicit - Python could require you to write "if
bool(x)" for the case where you actually want the truthiness magic, to
distinguish from "if x is not None" etc. But that's unnecessary
boilerplate. Python could have required explicit nonlocal declarations
for all names used in closures, but that's unhelpful too. Python
strives to eliminate that kind of thing.
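
The implicit truth test described above can be compared with its explicit spellings in a short sketch. The sample values are arbitrary:

```python
values = [0, "", [], None, 1, "a", [0]]

# `if x:` / `x if cond else y` apply the truthiness rules implicitly ...
implicit = [True if x else False for x in values]

# ... and give the same answer as spelling out bool(x).
explicit = [bool(x) for x in values]

# `x is not None` asks a different, narrower question.
not_none = [x is not None for x in values]
```

Only None fails the "is not None" check, while every empty or zero value fails the truth test, which is why the two spellings are not interchangeable.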

So, my view would be: Py3-only tutorials can and probably should omit
it, for the same reason that we don't advise piles of __future__
directives. You can always add stuff later for coping with Py2+Py3
execution; chances are any non-trivial code will have much bigger
issues than accidentally making an old-style class.


From antoine at  Sun Aug 10 05:20:27 2014
From: antoine at (Antoine Pitrou)
Date: Sat, 09 Aug 2014 23:20:27 -0400
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: <>
References: <>
Message-ID: <ls6odr$rq8$>

On 09/08/2014 12:43, Ben Hoyt wrote:
> Just thought I'd share some of my excitement about how fast the all-C
> version [1] of os.scandir() is turning out to be.
> Below are the results of my scandir / walk benchmark run with three
> different versions. I'm using an SSD, which seems to make it
> especially faster than listdir / walk. Note that benchmark results can
> vary a lot, depending on operating system, file system, hard drive
> type, and the OS's caching state.
> Anyway, os.walk() can be FIFTY times as fast using os.scandir().

Very nice results, thank you :-)



From ncoghlan at  Sun Aug 10 05:57:36 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 10 Aug 2014 13:57:36 +1000
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: <ls6odr$rq8$>
References: <>
Message-ID: <>

On 10 August 2014 13:20, Antoine Pitrou <antoine at> wrote:
> On 09/08/2014 12:43, Ben Hoyt wrote:
>> Just thought I'd share some of my excitement about how fast the all-C
>> version [1] of os.scandir() is turning out to be.
>> Below are the results of my scandir / walk benchmark run with three
>> different versions. I'm using an SSD, which seems to make it
>> especially faster than listdir / walk. Note that benchmark results can
>> vary a lot, depending on operating system, file system, hard drive
>> type, and the OS's caching state.
>> Anyway, os.walk() can be FIFTY times as fast using os.scandir().
> Very nice results, thank you :-)


This may actually motivate me to start working on a redesign of
walkdir at some point, with scandir and DirEntry objects as the basis.
My original approach was just too slow to be useful in practice (at
least when working with trees on the scale of a full Fedora or RHEL
build hosted on an NFS share).
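
For illustration, a walk built on scandir and DirEntry objects might look roughly like the sketch below. It assumes Python 3.6 or later, where os.scandir() is in the standard library and usable as a context manager. The name scandir_walk is hypothetical, and this is not the walkdir redesign mentioned above:

```python
import os


def scandir_walk(top):
    """Yield (dirpath, dirnames, filenames) like os.walk(), top-down.

    Simplification: symlinks to directories are listed as files and are
    never followed.
    """
    dirnames, filenames = [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # is_dir() can usually be answered from the directory listing
            # itself, without an extra stat() call per entry.
            if entry.is_dir(follow_symlinks=False):
                dirnames.append(entry.name)
            else:
                filenames.append(entry.name)
    yield top, dirnames, filenames
    for name in dirnames:
        yield from scandir_walk(os.path.join(top, name))
```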


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From robertc at  Sun Aug 10 07:40:47 2014
From: robertc at (Robert Collins)
Date: Sun, 10 Aug 2014 17:40:47 +1200
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: <>
References: <>
Message-ID: <>

A small tip from my bzr days - cd into the directory before scanning
it - especially if you'll end up statting more than a fraction of the
files, or are recursing - otherwise the VFS does a traversal for each
path you directly stat / recurse into. This can become a dominating
factor in some workloads (I shaved several hundred milliseconds off of
bzr stat on kernel trees doing this).
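
One way to get a comparable effect without changing the process-wide working directory is to stat names relative to an open directory descriptor. The sketch below assumes a platform where Python's os module supports dir_fd, and falls back to full paths otherwise. stat_children is a hypothetical helper, and the sketch does not demonstrate whether this recovers the speedup described above:

```python
import os


def stat_children(directory):
    """Return {name: stat_result} for the entries of `directory`."""
    use_dir_fd = (os.stat in os.supports_dir_fd
                  and os.listdir in os.supports_fd)
    if not use_dir_fd:
        # Portable fallback: every call passes the full path again.
        return {name: os.lstat(os.path.join(directory, name))
                for name in os.listdir(directory)}
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        # Each name is resolved relative to the already-open directory,
        # so the current working directory is left untouched.
        return {name: os.stat(name, dir_fd=dir_fd, follow_symlinks=False)
                for name in os.listdir(dir_fd)}
    finally:
        os.close(dir_fd)
```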


On 10 August 2014 15:57, Nick Coghlan <ncoghlan at> wrote:
> On 10 August 2014 13:20, Antoine Pitrou <antoine at> wrote:
>> On 09/08/2014 12:43, Ben Hoyt wrote:
>>> Just thought I'd share some of my excitement about how fast the all-C
>>> version [1] of os.scandir() is turning out to be.
>>> Below are the results of my scandir / walk benchmark run with three
>>> different versions. I'm using an SSD, which seems to make it
>>> especially faster than listdir / walk. Note that benchmark results can
>>> vary a lot, depending on operating system, file system, hard drive
>>> type, and the OS's caching state.
>>> Anyway, os.walk() can be FIFTY times as fast using os.scandir().
>> Very nice results, thank you :-)
> Indeed!
> This may actually motivate me to start working on a redesign of
> walkdir at some point, with scandir and DirEntry objects as the basis.
> My original approach was just too slow to be useful in practice (at
> least when working with trees on the scale of a full Fedora or RHEL
> build hosted on an NFS share).
> Cheers,
> Nick.
> --
> Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

Robert Collins <rbtcollins at>
Distinguished Technologist
HP Converged Cloud

From larry at  Sun Aug 10 08:11:41 2014
From: larry at (Larry Hastings)
Date: Sat, 09 Aug 2014 23:11:41 -0700
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/09/2014 10:40 PM, Robert Collins wrote:
> A small tip from my bzr days - cd into the directory before scanning it

I doubt that's permissible for a library function like os.scandir().


From stephen at  Sun Aug 10 10:24:32 2014
From: stephen at (Stephen J. Turnbull)
Date: Sun, 10 Aug 2014 17:24:32 +0900
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

Alexander Belopolsky writes:
 > On Sat, Aug 9, 2014 at 3:08 AM, Stephen J. Turnbull <stephen at>
 > wrote:
 > > All the suggestions
 > > I've seen so far are (IMHO, YMMV) just as ugly as the present
 > > situation.
 > >
 > What is ugly about allowing strings?  CPython certainly has a way to
 > make sum(x, '')

sum(it, '') itself is ugly.  As I say, YMMV, but in general last I
heard arguments that are usually constants drawn from a small set of
constants are considered un-Pythonic; a separate function to express
that case is preferred.  I like the separate function style.

And that's the current situation, except that in the case of strings
it turns out to be useful to allow for "sums" that have "glue" at the
joints, so it's spelled as a string method rather than a builtin: eg,
", ".join(paramlist).

Actually ... if I were a fan of the "".join() idiom, I'd seriously
propose 0.sum(numeric_iterable) as the RightThang[tm].  Then we could
deprecate "".join(string_iterable) in favor of "".sum(string_iterable)
(with the same efficient semantics).
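
The "glue" behaviour of the existing join method mentioned above can be shown in a couple of lines. This is a minimal sketch with made-up sample data:

```python
paramlist = ["alpha", "beta", "gamma"]

# The string the method is called on is the "glue" placed at each joint.
with_glue = ", ".join(paramlist)

# An empty separator gives plain concatenation.
no_glue = "".join(paramlist)
```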

From stephen at  Sun Aug 10 11:13:51 2014
From: stephen at (Stephen J. Turnbull)
Date: Sun, 10 Aug 2014 18:13:51 +0900
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly
 explained in python 2 and 3 doc
In-Reply-To: <>
References: <> <20140810004452.GB4525@ando>
Message-ID: <>

Chris Angelico writes:

 > The justification is illogical. However, I personally believe
 > boilerplate should be omitted where possible;

But it mostly can't be omitted.  I wrote 22 classes (all trivial)
yesterday for a Python 3 program.  Not one derived directly from
object.  That's a bit unusual, but in the three longish scripts I have
to hand, not one had more than 30% "new" classes derived from object.

As a matter of personal style, I don't use optional positional
arguments (with a few "traditional" exceptions); if I omit one most of
the time, when I need it I use a keyword.  That's not an argument,
it's just an observation that's consistent with support for using
an explicit parent class of object "most of the time".

 > that's why we have a whole lot of things that "just work". Why does
 > Python not have explicit boolification for if/while checks?

Because it does have explicit boolification (signaled by the control
structure syntax itself).  No?  I don't think this is less explicit
than REXX, because it doesn't happen elsewhere (10 + False == 10 --
not True, and even bool(10) + False is not True).

 > So, my view would be: Py3-only tutorials can and probably should omit
 > it,

But this doesn't make things simpler.  It means that there are two
syntaxes to define some classes, and you want to make one of them
TOOWTDI for classes derived directly from object, and the other
TOOWTDI for non-trivial subclasses.  I'll grant that in some sense
it's no more complex, either, of course.

Note that taken to extremes, your argument could be construed as "we
should define defaults for all arguments and omit them where possible".

Of course for typing in quick programs, and for trivial classes,
omitting the derivation from object is a useful convenience.  But I
don't think it's something that should be encouraged in tutorials.


From arigo at  Sun Aug 10 12:28:25 2014
From: arigo at (Armin Rigo)
Date: Sun, 10 Aug 2014 12:28:25 +0200
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: <>
References: <>
Message-ID: <>

Hi Larry,

On 10 August 2014 08:11, Larry Hastings <larry at> wrote:
>> A small tip from my bzr days - cd into the directory before scanning it
> I doubt that's permissible for a library function like os.scandir().

Indeed, chdir() is notably not compatible with multithreading.  There
would be a non-portable but clean way to do that: the functions
openat() and fstatat().  They only exist on relatively modern Linuxes,

See you soon,


From rdmurray at  Sun Aug 10 15:55:40 2014
From: rdmurray at (R. David Murray)
Date: Sun, 10 Aug 2014 09:55:40 -0400
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: <>
References: <>
Message-ID: <>

On Sun, 10 Aug 2014 13:57:36 +1000, Nick Coghlan <ncoghlan at> wrote:
> On 10 August 2014 13:20, Antoine Pitrou <antoine at> wrote:
> > On 09/08/2014 12:43, Ben Hoyt wrote:
> >
> >> Just thought I'd share some of my excitement about how fast the all-C
> >> version [1] of os.scandir() is turning out to be.
> >>
> >> Below are the results of my scandir / walk benchmark run with three
> >> different versions. I'm using an SSD, which seems to make it
> >> especially faster than listdir / walk. Note that benchmark results can
> >> vary a lot, depending on operating system, file system, hard drive
> >> type, and the OS's caching state.
> >>
> >> Anyway, os.walk() can be FIFTY times as fast using os.scandir().
> >
> >
> > Very nice results, thank you :-)
> Indeed!
> This may actually motivate me to start working on a redesign of
> walkdir at some point, with scandir and DirEntry objects as the basis.
> My original approach was just too slow to be useful in practice (at
> least when working with trees on the scale of a full Fedora or RHEL
> build hosted on an NFS share).

There is another potentially good place in the stdlib to apply scandir:
iglob.  See issue 22167.


From barry at  Sun Aug 10 16:39:10 2014
From: barry at (Barry Warsaw)
Date: Sun, 10 Aug 2014 10:39:10 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <20140810103910.2c8b9079@anarchist.localdomain>

On Aug 10, 2014, at 05:24 PM, Stephen J. Turnbull wrote:

>Actually ... if I were a fan of the "".join() idiom, I'd seriously
>propose 0.sum(numeric_iterable) as the RightThang{tm].  Then we could
>deprecate "".join(string_iterable) in favor of "".sum(string_iterable)
>(with the same efficient semantics).

Ever since ''.join was added, there has been vague talk about adding a join()
built-in.  If the semantics and argument syntax can be worked out, I'd still
be in favor of that.  Probably deserves a PEP and a millithread community
bikeshed paintdown.


From alexander.belopolsky at  Sun Aug 10 17:51:51 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Sun, 10 Aug 2014 11:51:51 -0400
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly
 explained in python 2 and 3 doc
In-Reply-To: <20140810004452.GB4525@ando>
References: <>
Message-ID: <>

On Sat, Aug 9, 2014 at 8:44 PM, Steven D'Aprano <steve at> wrote:

> It is certainly required when writing code that will behave the same in
> version 2 and 3

This is not true.  An alternative is to put

__metaclass__ = type

at the top of your module to make all classes in your module new-style in
python2.

From barry at  Sun Aug 10 18:26:39 2014
From: barry at (Barry Warsaw)
Date: Sun, 10 Aug 2014 12:26:39 -0400
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly
 explained in python 2 and 3 doc
In-Reply-To: <>
References: <> <20140810004452.GB4525@ando>
Message-ID: <20140810122639.364756bf@anarchist.localdomain>

On Aug 10, 2014, at 11:51 AM, Alexander Belopolsky wrote:

>This is not true.  An alternative is to put
>__metaclass__ = type
>at the top of your module to make all classes in your module new-style in
>python2.

I like this much better, and it's what I do in my own bilingual code.  It
makes it much easier to remove the unnecessary cruft when you drop the Python
2 support.
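
A minimal sketch of the idiom discussed above, assuming a module that must
run unchanged on Python 2 and 3.  The class name Foo and the is_new_style
variable are only illustrative.  On Python 3 the assignment is an ordinary,
unused module global, so it can simply be deleted once Python 2 support is
dropped.

```python
# Module-level default metaclass: on Python 2 this makes every class
# defined below new-style; on Python 3 it is just an unused global.
__metaclass__ = type


class Foo:  # note: no explicit (object) base
    pass


# New-style classes are instances of type and inherit from object.
is_new_style = isinstance(Foo, type) and issubclass(Foo, object)
print(is_new_style)
```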


From steve at  Sun Aug 10 19:21:46 2014
From: steve at (Steven D'Aprano)
Date: Mon, 11 Aug 2014 03:21:46 +1000
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly
	explained in python 2 and 3 doc
In-Reply-To: <>
References: <> <20140810004452.GB4525@ando>
Message-ID: <20140810172146.GE4525@ando>

On Sun, Aug 10, 2014 at 11:51:51AM -0400, Alexander Belopolsky wrote:
> On Sat, Aug 9, 2014 at 8:44 PM, Steven D'Aprano <steve at> wrote:
> > It is certainly required when writing code that will behave the same in
> > version 2 and 3
> >
> This is not true.  An alternative is to put
> __metaclass__ = type
> at the top of your module to make all classes in your module new-style in
> python2.

So it is. I forgot about that, thank you for the correction.


From v+python at  Sun Aug 10 22:12:26 2014
From: v+python at (Glenn Linderman)
Date: Sun, 10 Aug 2014 13:12:26 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote:
> Actually ... if I were a fan of the "".join() idiom, I'd seriously
> propose 0.sum(numeric_iterable) as the RightThang{tm].  Then we could
> deprecate "".join(string_iterable) in favor of "".sum(string_iterable)
> (with the same efficient semantics).
Actually, there is no need to wait for 0.sum() to propose "".sum... but 
it is only a spelling change, so no real benefit.

Thinking about this more, maybe it should be a class function, so that 
it wouldn't require an instance:

str.sum( iterable_containing_strings )

[ or  str.join( iterable_containing_strings ) ]

From rdmurray at  Sun Aug 10 22:27:25 2014
From: rdmurray at (R. David Murray)
Date: Sun, 10 Aug 2014 16:27:25 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
 <> <>
Message-ID: <>

On Sun, 10 Aug 2014 13:12:26 -0700, Glenn Linderman <v+python at> wrote:
> On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote:
> > Actually ... if I were a fan of the "".join() idiom, I'd seriously
> > propose 0.sum(numeric_iterable) as the RightThang{tm].  Then we could
> > deprecate "".join(string_iterable) in favor of "".sum(string_iterable)
> > (with the same efficient semantics).
> Actually, there is no need to wait for 0.sum() to propose "".sum... but 
> it is only a spelling change, so no real benefit.
> Thinking about this more, maybe it should be a class function, so that 
> it wouldn't require an instance:
> str.sum( iterable_containing_strings )
> [ or  str.join( iterable_containing_strings ) ]

That's how it used to be spelled in python2.


From rdmurray at  Sun Aug 10 22:29:38 2014
From: rdmurray at (R. David Murray)
Date: Sun, 10 Aug 2014 16:29:38 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
 <> <>
Message-ID: <>

On Sun, 10 Aug 2014 13:12:26 -0700, Glenn Linderman <v+python at> wrote:
> On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote:
> > Actually ... if I were a fan of the "".join() idiom, I'd seriously
> > propose 0.sum(numeric_iterable) as the RightThang{tm].  Then we could
> > deprecate "".join(string_iterable) in favor of "".sum(string_iterable)
> > (with the same efficient semantics).
> Actually, there is no need to wait for 0.sum() to propose "".sum... but 
> it is only a spelling change, so no real benefit.
> Thinking about this more, maybe it should be a class function, so that 
> it wouldn't require an instance:
> str.sum( iterable_containing_strings )
> [ or  str.join( iterable_containing_strings ) ]

Sorry, I mean 'string.join' is how it used to be spelled.  Making it a
class method is indeed slightly different.


From stephen at  Mon Aug 11 01:57:36 2014
From: stephen at (Stephen J. Turnbull)
Date: Mon, 11 Aug 2014 08:57:36 +0900
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

Glenn Linderman writes:

 > On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote:
 > > Actually ... if I were a fan of the "".join() idiom, I'd seriously
 > > propose 0.sum(numeric_iterable) as the RightThang{tm].  Then we could
 > > deprecate "".join(string_iterable) in favor of "".sum(string_iterable)
 > > (with the same efficient semantics).

 > Actually, there is no need to wait for 0.sum() to propose "".sum... but 
 > it is only a spelling change, so no real benefit.

IMO it's worse than merely a spelling change, because (1) "join" is a
more evocative term for concatenating strings than "sum" and (2) I
don't know of any other sums that allow "glue".

I'm overall -1 on trying to change the current situation (except for
adding a join() builtin or str.join class method).  We could probably
fix everything in a static-typed language (because that would allow
picking an initial object of the appropriate type), but without that
we need to pick a default of some particular type, and 0 makes the
most sense.

I can understand the desire of people who want to use the same syntax
for summing an iterable of numbers and for concatenating an iterable
of strings, but to me they're really not even formally the same in
practical use.  I'm very sympathetic to Steven's explanation that "we
wouldn't be having this discussion if we used a different operator for
string concatenation".  Although that's not the whole story: in
practice even numerical sums get split into multiple functions because
floating point addition isn't associative, and so needs careful
treatment to preserve accuracy.  At that point I'm strongly +1 on
abandoning attempts to "rationalize" summation.
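
A small illustration of the non-associativity point, using only the
standard library.  The variable names are illustrative; math.fsum is the
stdlib function that exists precisely because naive accumulation loses
accuracy.

```python
import math

# Regrouping the same three terms changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

values = [0.1] * 10

# Plain left-to-right accumulation lets rounding error build up.
running = 0.0
for v in values:
    running += v

# math.fsum tracks the lost low-order bits and rounds once at the end.
print(running == 1.0, math.fsum(values) == 1.0)  # False True
```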

I'm not sure how I'd feel about raising an exception if you try to sum
any iterable containing misbehaved types like float.  But not only
would that be a Python 4 effort due to backward incompatibility, but
it sorta contradicts the main argument of proponents ("any type
implementing __add__ should be sum()-able").

From uwe.schmitt at  Mon Aug 11 11:10:53 2014
From: uwe.schmitt at (Schmitt  Uwe (ID SIS))
Date: Mon, 11 Aug 2014 09:10:53 +0000
Subject: [Python-Dev] python2.7 infinite recursion when loading pickled
Message-ID: <>

Dear all,

I discovered a problem using cPickle.loads from CPython 2.7.6.

The last line in the following code raises an infinite recursion

    class T(object):

        def __init__(self):
            self.item = list()

        def __getattr__(self, name):
            return getattr(self.item, name)

    import cPickle

    t = T()

    l = cPickle.dumps(t)

    cPickle.loads(l)

loads triggers T.__getattr__ using "getattr(inst, "__setstate__", None)" for looking up a "__setstate__" method,
which is not implemented for T. As the item attribute is missing at this time, the infinite recursion starts.

The infinite recursion disappears if I attach a default implementation for __setstate__ to T:

    def __setstate__(self, dd):
        self.__dict__ = dd

This could be fixed by using "hasattr" in pickle before trying to call "getattr".

Is this a bug or did I miss something ?

Kind Regards,

From tjreedy at  Mon Aug 11 13:28:44 2014
From: tjreedy at (Terry Reedy)
Date: Mon, 11 Aug 2014 07:28:44 -0400
Subject: [Python-Dev] python2.7 infinite recursion when loading pickled
In-Reply-To: <>
References: <>
Message-ID: <lsa9e0$tr1$>

On 8/11/2014 5:10 AM, Schmitt Uwe (ID SIS) wrote:

Python usage questions should be directed to python-list, for instance.

> I discovered a problem using cPickle.loads from CPython 2.7.6.

The problem is your code having infinite recursion. You only discovered 
it with pickle.

> The last line in the following code raises an infinite recursion
>      class T(object):
>          def __init__(self):
>              self.item = list()
>          def __getattr__(self, name):
>              return getattr(self.item, name)

This is a (common) bug in your program.  __getattr__ should call 
self.__dict__(name) to avoid the recursion.

Terry Jan Reedy

From __peter__ at  Mon Aug 11 13:40:13 2014
From: __peter__ at (Peter Otten)
Date: Mon, 11 Aug 2014 13:40:13 +0200
Subject: [Python-Dev] python2.7 infinite recursion when loading pickled
References: <> <lsa9e0$tr1$>
Message-ID: <lsaa2u$2t7$>

Terry Reedy wrote:

> On 8/11/2014 5:10 AM, Schmitt Uwe (ID SIS) wrote:
> Python usage questions should be directed to python-list, for instance.
>> I discovered a problem using cPickle.loads from CPython 2.7.6.
> The problem is your code having infinite recursion. You only discovered
> it with pickle.
>> The last line in the following code raises an infinite recursion
>>      class T(object):
>>          def __init__(self):
>>              self.item = list()
>>          def __getattr__(self, name):
>>              return getattr(self.item, name)
> This is a (common) bug in your program.  __getattr__ should call
> self.__dict__(name) to avoid the recursion.

Read again. The OP tries to delegate attribute lookup to an (existing) 
attribute.

IMO the root cause of the problem is that pickle looks up __dunder__ methods 
in the instance rather than the class.

From rosuav at  Mon Aug 11 13:43:00 2014
From: rosuav at (Chris Angelico)
Date: Mon, 11 Aug 2014 21:43:00 +1000
Subject: [Python-Dev] python2.7 infinite recursion when loading pickled
In-Reply-To: <lsaa2u$2t7$>
References: <> <lsa9e0$tr1$>
Message-ID: <>

On Mon, Aug 11, 2014 at 9:40 PM, Peter Otten <__peter__ at> wrote:
> Read again. The OP tries to delegate attribute lookup to an (existing)
> attribute.
> IMO the root cause of the problem is that pickle looks up __dunder__ methods
> in the instance rather than the class.

The recursion comes from the attempted lookup of self.item, when
__init__ hasn't been called.


From rdmurray at  Mon Aug 11 14:10:30 2014
From: rdmurray at (R. David Murray)
Date: Mon, 11 Aug 2014 08:10:30 -0400
Subject: [Python-Dev] python2.7 infinite recursion when loading pickled
In-Reply-To: <>
References: <>
 <lsa9e0$tr1$> <lsaa2u$2t7$>
Message-ID: <>

On Mon, 11 Aug 2014 21:43:00 +1000, Chris Angelico <rosuav at> wrote:
> On Mon, Aug 11, 2014 at 9:40 PM, Peter Otten <__peter__ at> wrote:
> > Read again. The OP tries to delegate attribute lookup to an (existing)
> > attribute.
> >
> > IMO the root cause of the problem is that pickle looks up __dunder__ methods
> > in the instance rather than the class.
> The recursion comes from the attempted lookup of self.item, when
> __init__ hasn't been called.

Indeed, and this is what the OP missed.  With a class like this, it is
necessary to *make* it pickleable, since the pickle protocol doesn't
call __init__.
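
One possible way to make such a delegating class survive unpickling,
sketched for Python 3 (pickle rather than cPickle).  The class name T comes
from the original report; the guard through self.__dict__ is an assumption
about how one might do it, not something prescribed in this thread.

```python
import pickle


class T:
    def __init__(self):
        self.item = []

    def __getattr__(self, name):
        # __getattr__ only runs after normal lookup has failed.  While
        # unpickling, __init__ has not run, so 'item' may not exist yet.
        # Reading __dict__ directly avoids re-entering __getattr__.
        try:
            item = self.__dict__['item']
        except KeyError:
            raise AttributeError(name)
        return getattr(item, name)


t = T()
t.append(1)                       # delegated to the wrapped list
clone = pickle.loads(pickle.dumps(t))
print(clone.item)
```

With the guard in place, the lookup of __setstate__ on the half-built
instance raises AttributeError instead of recursing, and pickle falls back
to restoring the instance __dict__.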


From __peter__ at  Mon Aug 11 14:25:01 2014
From: __peter__ at (Peter Otten)
Date: Mon, 11 Aug 2014 14:25:01 +0200
Subject: [Python-Dev] python2.7 infinite recursion when loading pickled
References: <>
 <lsa9e0$tr1$> <lsaa2u$2t7$>
Message-ID: <lsacmu$6n5$>

Chris Angelico wrote:

> On Mon, Aug 11, 2014 at 9:40 PM, Peter Otten <__peter__ at> wrote:
>> Read again. The OP tries to delegate attribute lookup to an (existing)
>> attribute.
>> IMO the root cause of the problem is that pickle looks up __dunder__
>> methods in the instance rather than the class.
> The recursion comes from the attempted lookup of self.item, when
> __init__ hasn't been called.

You are right. Sorry for the confusion.

From benhoyt at  Mon Aug 11 14:26:47 2014
From: benhoyt at (Ben Hoyt)
Date: Mon, 11 Aug 2014 08:26:47 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

It seems to me this is something of a pointless discussion -- I highly
doubt the current situation is going to change, and it works very well.
Even if not perfect, sum() is for numbers, sep.join() for strings. However,
I will add one comment:

> I'm overall -1 on trying to change the current situation (except for
> adding a join() builtin or str.join class method).

Did you know there actually is a str.join "class method"? I've never
actually seen it used this way, but for people who just can't stand
sep.join(seq), you can always call str.join(sep, seq) -- works in Python 2
and 3:

>>> str.join('.', ['abc', 'def', 'ghi'])
'abc.def.ghi'

This works as a side effect of the fact that you can call methods as
cls.method(instance, args).
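
The equivalence can be checked directly.  A short sketch; the variable
names are illustrative only.

```python
words = ['abc', 'def', 'ghi']

bound = '.'.join(words)          # the usual spelling
unbound = str.join('.', words)   # same method, instance passed explicitly
print(bound == unbound)

# The same pattern applies to any instance method, for example str.upper.
print(str.upper('abc') == 'abc'.upper())
```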


From 4kir4.1i at  Mon Aug 11 15:01:31 2014
From: 4kir4.1i at (Akira Li)
Date: Mon, 11 Aug 2014 17:01:31 +0400
Subject: [Python-Dev] python2.7 infinite recursion when loading pickled
References: <>
Message-ID: <>

"Schmitt  Uwe (ID SIS)" <uwe.schmitt at> writes:

> I discovered a problem using cPickle.loads from CPython 2.7.6.
> The last line in the following code raises an infinite recursion
>     class T(object):
>         def __init__(self):
>             self.item = list()
>         def __getattr__(self, name):
>             return getattr(self.item, name)
>     import cPickle
>     t = T()
>     l = cPickle.dumps(t)
>     cPickle.loads(l)
> Is this a bug or did I miss something ?

The issue is that your __getattr__ raises RuntimeError (due to infinite
recursion) for non-existing attributes instead of AttributeError. To fix
it, you could use object.__getattribute__:

  class C:
    def __init__(self):
        self.item = []
    def __getattr__(self, name):
        return getattr(object.__getattribute__(self, 'item'), name)

There were issues in the past due to {get,has}attr silencing
non-AttributeError exceptions; therefore it is good that pickle breaks
when it gets RuntimeError instead of AttributeError.


From 4kir4.1i at  Mon Aug 11 17:26:29 2014
From: 4kir4.1i at (Akira Li)
Date: Mon, 11 Aug 2014 19:26:29 +0400
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
References: <>
Message-ID: <>

Armin Rigo <arigo at> writes:

> On 10 August 2014 08:11, Larry Hastings <larry at> wrote:
>>> A small tip from my bzr days - cd into the directory before scanning it
>> I doubt that's permissible for a library function like os.scandir().
> Indeed, chdir() is notably not compatible with multithreading.  There
> would be a non-portable but clean way to do that: the functions
> openat() and fstatat().  They only exist on relatively modern Linuxes,
> though.

There is os.fwalk() that could be both safer and faster than
os.walk(). It yields rootdir fd that can be used by functions that
support dir_fd parameter, see os.supports_dir_fd set. They use *at()
functions under the hood.

os.fwalk() could be implemented in terms of os.scandir() if the latter
would support fd parameter like os.listdir() does (be in os.supports_fd
set (note: it is different from os.supports_dir_fd)).
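
A sketch of the os.fwalk() plus dir_fd pattern described above.  It assumes
a Unix platform where os.fwalk() exists and os.stat is in
os.supports_dir_fd; the helper name tree_size and the file names are
hypothetical.

```python
import os
import tempfile


def tree_size(top):
    """Sum file sizes under top, stat-ing relative to each directory fd."""
    total = 0
    for root, dirs, files, rootfd in os.fwalk(top):
        for name in files:
            # dir_fd makes the lookup relative to the open directory,
            # so no full-path re-resolution and no chdir() is needed.
            st = os.stat(name, dir_fd=rootfd, follow_symlinks=False)
            total += st.st_size
    return total


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    samples = (("a.txt", b"12345"), (os.path.join("sub", "b.txt"), b"678"))
    for rel, payload in samples:
        with open(os.path.join(tmp, rel), "wb") as f:
            f.write(payload)
    size = tree_size(tmp)

print(size)
```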

Victor Stinner suggested [1] to allow scandir(fd) but I don't see it
being mentioned in the pep 471 [2]: it neither supports nor rejects the
idea.



From benhoyt at  Mon Aug 11 17:51:26 2014
From: benhoyt at (Ben Hoyt)
Date: Mon, 11 Aug 2014 11:51:26 -0400
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: <>
References: <>
Message-ID: <>

> Victor Stinner suggested [1] to allow scandir(fd) but I don't see it
> being mentioned in the pep 471 [2]: it neither supports nor rejects the
> idea.
> [1]
> [2]

Yes, listdir() supports fd, and I think scandir() probably will too to
parallel that, if not for v1.0 then soon after. Victor and I want to
focus on getting the PEP 471 (string path only) version working first.
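
For reference, a sketch of the string-path form as it later shipped in the
standard library (os.scandir() in Python 3.5; use as a context manager
needs 3.6 or later).  The helper count_tree and the file names are
hypothetical.

```python
import os
import tempfile


def count_tree(top):
    """Return (files, dirs) found under top using os.scandir()."""
    files = dirs = 0
    with os.scandir(top) as entries:
        for entry in entries:
            # is_dir() can usually be answered from the directory
            # listing itself, without an extra stat() call per entry.
            if entry.is_dir(follow_symlinks=False):
                sub_files, sub_dirs = count_tree(entry.path)
                files += sub_files
                dirs += sub_dirs + 1
            else:
                files += 1
    return files, dirs


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x")
    counts = count_tree(tmp)

print(counts)
```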


From chris.barker at  Mon Aug 11 17:07:39 2014
From: chris.barker at (Chris Barker - NOAA Federal)
Date: Mon, 11 Aug 2014 08:07:39 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
 <> <>
Message-ID: <-2448384566377912251@unknownmsgid>

> I'm very sympathetic to Steven's explanation that "we
> wouldn't be having this discussion if we used a different operator for
> string concatenation".

Sure -- but just imagine the conversations we could be having instead
: what does bitwise and of a string mean? A bytes object? I could see
it as a character-wise and, for instance  ;-)

My confusion is still this:

Repeated summation of strings has been optimized in cpython even
though it's not the recommended way to solve that problem.

So why not special-case optimize sum() for strings? We already
special-case strings to raise an exception.

It seems pretty pedantic to say: we could make this work well, but we'd
rather chide you for not knowing the "proper" way to do it.

Practicality beats purity?


> Although that's not the whole story: in
> practice even numerical sums get split into multiple functions because
> floating point addition isn't associative, and so needs careful
> treatment to preserve accuracy.  At that point I'm strongly +1 on
> abandoning attempts to "rationalize" summation.
> I'm not sure how I'd feel about raising an exception if you try to sum
> any iterable containing misbehaved types like float.  But not only
> would that be a Python 4 effort due to backward incompatibility, but
> it sorta contradicts the main argument of proponents ("any type
> implementing __add__ should be sum()-able").

From jtaylor.debian at  Mon Aug 11 19:20:42 2014
From: jtaylor.debian at (Julian Taylor)
Date: Mon, 11 Aug 2014 19:20:42 +0200
Subject: [Python-Dev] sum(...) limitation - temporary elision take 2
In-Reply-To: <>
References: <>
Message-ID: <>

On 04.08.2014 22:22, Jim J. Jewett wrote:
> Sat Aug 2 12:11:54 CEST 2014, Julian Taylor wrote (in
> ) wrote:
>> Andrea Griffini <agriff at> wrote:
>>>    However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
>> hm could this be a pure python case that would profit from temporary
>> elision [ ]?
>> lists could declare the tp_can_elide slot and call list.extend on the
>> temporary during its tp_add slot instead of creating a new temporary.
>> extend/realloc can avoid the copy if there is free memory available
>> after the block.
> Yes, with all the same problems.
> When dealing with a complex object, how can you be sure that __add__
> won't need access to the original values during the entire computation?
> It works with matrix addition, but not with matrix multiplication.
> Depending on the details of the implementation, it could even fail for
> a sort of sliding-neighbor addition similar to the original justification.

The c-extension object knows what its add slot does. An object that
cannot elide would simply always return 0 indicating to python to not
call the inplace variant.
E.g. the numpy __matmul__ operator would never tell python that it can
work inplace, but __add__ would (if the arguments allow it).

We may have found a way to do it without the direct help of
Python, but it involves reading and storing the current instruction of
the frame object to figure out if it is called directly from the
interpreter.  See the unfinished patch to numpy, in particular the
can_elide_temp function:
Probably not the best way as this is hardly intended Python C-API but
assuming there is no overlooked issue with this approach it could be a
good workaround for known good Python versions.

From matsjoyce at  Mon Aug 11 19:42:19 2014
From: matsjoyce at (matsjoyce)
Date: Mon, 11 Aug 2014 17:42:19 +0000 (UTC)
Subject: [Python-Dev] Reviving restricted mode?
References: <>
Message-ID: <>

Yup, I read that post. However, those specific issues do not exist in my 
module, as there is a module whitelist, and a method whitelist. Builtins are 
now proxied, and all types going in to functions are checked for 
modification. There may be some holes in my approach, but I can't find them.

From breamoreboy at  Mon Aug 11 19:55:07 2014
From: breamoreboy at (Mark Lawrence)
Date: Mon, 11 Aug 2014 18:55:07 +0100
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <lsb01r$vq3$>

On 11/08/2014 18:42, matsjoyce wrote:
> Yup, I read that post. However, those specific issues do not exist in my
> module, as there is a module whitelist, and a method whitelist. Builtins are
> now proxied, and all types going in to functions are checked for
> modification. There may be some holes in my approach, but I can't find them.

Any chance of giving us some context, or do I have to retrieve my 
crystal ball from the menders?

My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

From skip at  Mon Aug 11 21:00:32 2014
From: skip at (Skip Montanaro)
Date: Mon, 11 Aug 2014 14:00:32 -0500
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <>

On Mon, Aug 11, 2014 at 12:42 PM, matsjoyce <matsjoyce at> wrote:
> There may be some holes in my approach, but I can't find them.

There's the rub. Given time, I suspect someone will discover a hole or two.


From tjreedy at  Mon Aug 11 22:29:03 2014
From: tjreedy at (Terry Reedy)
Date: Mon, 11 Aug 2014 16:29:03 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
 <> <>
Message-ID: <lsb935$hkg$>

On 8/11/2014 8:26 AM, Ben Hoyt wrote:
> It seems to me this is something of a pointless discussion -- I highly
> doubt the current situation is going to change, and it works very well.
> Even if not perfect, sum() is for numbers, sep.join() for strings.
> However, I will add one comment:
>     I'm overall -1 on trying to change the current situation (except for
>     adding a join() builtin or str.join class method).
> Did you know there actually is a str.join "class method"?

A 'method' is a function accessed as an attribute of a class.
An 'instance method' is a method whose first parameter is an instance of 
the class. str.join is an instance method.  A 'class method', wrapped as 
such with classmethod(), usually by decorating it with @classmethod, 
would take the class as a parameter.
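
The distinction can be sketched with a toy class.  Sep, join and concat are
hypothetical names chosen for illustration; nothing like them exists in the
stdlib.

```python
class Sep:
    def __init__(self, text):
        self.text = text

    def join(self, items):
        # Instance method: the first parameter is an instance.
        return self.text.join(items)

    @classmethod
    def concat(cls, items):
        # Class method: the first parameter is the class itself.
        return cls('').join(items)


dot = Sep('.')
# An instance method can be called through the class by passing the
# instance explicitly, just like str.join(sep, seq).
print(dot.join(['a', 'b']) == Sep.join(dot, ['a', 'b']))
# A class method needs no instance at all.
print(Sep.concat(['a', 'b']))
```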

> I've never
> actually seen it used this way, but for people who just can't stand
> sep.join(seq), you can always call str.join(sep, seq) -- works in Python
> 2 and 3:
>  >>> str.join('.', ['abc', 'def', 'ghi'])
> 'abc.def.ghi'

One could even put 'join = str.join' at the top of a file.

All this is true of *every* instance method.  For instance
>>> int.__add__(1, 2) == 1 .__add__(2) == 1 + 2
True

However, your point is valid: people who cannot stand the abbreviation 
*could* use the full form that is being abbreviated.

In ancient Python, when strings did not have methods, the current string 
methods were functions in the string module. The functions were removed 
in 3.0.  Their continued use in 2.x code is bad for 3.x compatibility, 
so I would not encourage it.

 >>> help(string.join)  # 2.7.8
Help on function join in module string:

join(words, sep=' ')
     join(list [,sep]) -> string

     Return a string composed of the words in list, with
     intervening occurrences of sep.  The default separator is a
     single space.

'List' is obsolete.  Since sometime before 2.7, 'words' meant an 
iterable of strings.

 >>> def digits():
	for i in range(10):
		yield str(i)
 >>> string.join(digits(), '')
'0123456789'

Of the string functions, I believe the conversion of join (and its 
synonym 'joinfields') to a method has been the most contentious.

Terry Jan Reedy

From ischwabacher at  Mon Aug 11 20:36:48 2014
From: ischwabacher at (Isaac Schwabacher)
Date: Mon, 11 Aug 2014 13:36:48 -0500
Subject: [Python-Dev]  pathlib handling of trailing slash (Issue #21039)
In-Reply-To: <>
References: <>
Message-ID: <>

I see this as a parallel to the question of `pathlib.PurePath.resolve()`, about which `pathlib` is (rightly!) very opinionated. Just as `foo/../bar` shouldn't resolve to `bar`, `foo/` shouldn't be truncated to `foo`. And if `PurePath` doesn't do this, `Path` shouldn't either, because the difference between a `Path` and a `PurePath` is the availability of filesystem operations, not the identities of the objects involved.

On another level, I think that this is a simple decision: `PosixPath` claims right there in the name to implement POSIX behavior, and POSIX specifies that `foo` and `foo/` refer (in some cases) to different directory entries. Therefore, `foo` and `foo/` can't be the same path. Moreover, `PosixPath` implements several methods that have the same name as syscalls that POSIX specifies to depend on whether their path arguments end in trailing slashes. (Even `stat` [], which explicitly follows symbolic links regardless of the presence of a trailing slash, fails with ENOTDIR if given "path/to/existing/file/".) It feels pathological for `pathlib.PosixPath` to be so almost-compliant.
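
The behaviour in question can be observed directly.  A sketch assuming a
POSIX system for the stat() part; the helper stat_rejects_trailing_slash is
a hypothetical name.

```python
import os
import tempfile
from pathlib import PurePosixPath

# pathlib drops the trailing slash when it parses the path.
same = PurePosixPath('foo/') == PurePosixPath('foo')
text = str(PurePosixPath('foo/'))


def stat_rejects_trailing_slash():
    """True if stat() on 'regular-file/' fails with ENOTDIR."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'file')
        open(path, 'w').close()
        try:
            os.stat(path + '/')
        except NotADirectoryError:
            return True
        return False


print(same, text, stat_rejects_trailing_slash())
```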


From victor.stinner at  Mon Aug 11 23:42:41 2014
From: victor.stinner at (Victor Stinner)
Date: Mon, 11 Aug 2014 23:42:41 +0200
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <>

2014-08-11 19:42 GMT+02:00 matsjoyce <matsjoyce at>:
> Yup, I read that post. However, those specific issues do not exist in my
> module, as there is a module whitelist, and a method whitelist. Builtins are
> now proxied, and all types going in to functions are checked for
> modification. There may be some holes in my approach, but I can't find them.

I took a look at your code and it looks like almost everything is blocked.

Right now, I'm not sure that your sandbox is useful. For example, for
a simple IRC bot, it would help to have access to some modules like
math, time or random. The problem is to provide a way to allow these
modules and ensure that the policy doesn't introduce a new hole.
Allowing more functions increases the risk of new holes.

Even if your sandbox is strong, CPython contains a lot of code written
in C (50% of CPython is written in C), and the C code usually takes
shortcuts which ignore your sandbox. CPython source code is huge
(more than 210k lines of C just for the core). Bugs are common, and your sandbox is
vulnerable to all these bugs. See for example the Lib/test/crashers/
directory of CPython.

For my pysandbox project, I wrote some proxies and many
vulnerabilities were found in these proxies. They can be explained by
the nature of Python, you can introspect everything, modify
everything, etc. It's very hard to design such a proxy in Python.
Implementing such a proxy in C helps a little bit.

The rule is always the same: your sandbox is as strong as its weakest
function. A very minor bug is enough to break the whole sandbox. See
the history of pysandbox for examples of such bugs (called
"vulnerabilities" in the case of a sandbox).

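A well-known illustration of why introspection makes this hard, given as a
generic sketch.  It says nothing about matsjoyce's module or pysandbox
specifically; the names restricted, expr and reachable are illustrative.

```python
# Evaluate an expression with no builtins available at all.
restricted = {"__builtins__": {}}

# A tuple literal needs no names, yet its class chain leads to object,
# and object can list every class derived from it in the interpreter.
expr = "().__class__.__base__.__subclasses__()"
reachable = eval(expr, restricted)

print(len(reachable) > 0)
```

Removing builtins alone therefore does not stop code from reaching
arbitrary classes, which is why attribute access has to be proxied or
filtered as well.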

From cyberdupo56 at  Tue Aug 12 01:08:00 2014
From: cyberdupo56 at (Allen Li)
Date: Mon, 11 Aug 2014 16:08:00 -0700
Subject: [Python-Dev] Multiline with statement line continuation
Message-ID: <20140811230800.GA12210@gensokyo>

This is a problem I sometimes run into when working with a lot of files
simultaneously, where I need three or more `with` statements:

    with open('foo') as foo:
        with open('bar') as bar:
            with open('baz') as baz:
                pass

Thankfully, support for multiple items was added in 3.1:

    with open('foo') as foo, open('bar') as bar, open('baz') as baz:
        pass

However, this begs the need for a multiline form, especially when
working with three or more items:

    with open('foo') as foo, \
         open('bar') as bar, \
         open('baz') as baz, \
         open('spam') as spam, \
         open('eggs') as eggs:
        pass

Currently, this works with explicit line continuation, but as all style
guides favor implicit line continuation over explicit, it would be nice
if you could do the following:

    with (open('foo') as foo,
          open('bar') as bar,
          open('baz') as baz,
          open('spam') as spam,
          open('eggs') as eggs):
        pass

Currently, this is a syntax error, since the language specification for
`with` is

    with_stmt ::=  "with" with_item ("," with_item)* ":" suite
    with_item ::=  expression ["as" target]

as opposed to something like

    with_stmt ::=  "with" with_expr ":" suite
    with_expr ::=  with_item ("," with_item)*
              |    '(' with_item ("," with_item)* ')'

This is really just a style issue, furthermore a style issue that
requires a change to the language grammar (probably, someone who knows
for sure please confirm), so at first I thought it wasn't worth
mentioning, but I'd like to hear what everyone else thinks.

From ben+python at  Tue Aug 12 01:27:57 2014
From: ben+python at (Ben Finney)
Date: Tue, 12 Aug 2014 09:27:57 +1000
Subject: [Python-Dev]
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

Allen Li <cyberdupo56 at> writes:

> Currently, this works with explicit line continuation, but as all
> style guides favor implicit line continuation over explicit, it would
> be nice if you could do the following:
>     with (open('foo') as foo,
>           open('bar') as bar,
>           open('baz') as baz,
>           open('spam') as spam,
>           open('eggs') as eggs):
>         pass
> Currently, this is a syntax error

Even if it weren't a syntax error, the syntax would be ambiguous. How
will you discern the meaning of::

    with (

Is that three separate context managers? Or is it one tuple with three

I am definitely sympathetic to the desire for a good solution to
multi-line "with" statements, but I also don't want to see a special
case to make it even more difficult to understand when a tuple literal
is being specified in code. I admit I don't have a good answer to
satisfy both those simultaneously.
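
One way to sidestep both the backslashes and the tuple ambiguity with
existing tools is contextlib.ExitStack (Python 3.3 and later).  This is a
different technique from the grammar change proposed above, sketched here
with hypothetical file names in a temporary directory.

```python
import contextlib
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, name) for name in ("foo", "bar", "baz")]
    for path in paths:
        with open(path, "w") as f:
            f.write(os.path.basename(path))

    # Each file is registered with the stack as it is opened; all of
    # them are closed, in reverse order, when the block exits.
    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(path)) for path in paths]
        contents = [f.read() for f in files]

    all_closed = all(f.closed for f in files)

print(contents, all_closed)
```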

 \          "We have met the enemy and he is us." --Walt Kelly, _Pogo_ |
  `\                                                        1971-04-22 |
_o__)                                                                  |
Ben Finney

From ncoghlan at  Tue Aug 12 02:19:06 2014
From: ncoghlan at (Nick Coghlan)
Date: Tue, 12 Aug 2014 10:19:06 +1000
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <-2448384566377912251@unknownmsgid>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 12 Aug 2014 03:03, "Chris Barker - NOAA Federal" <chris.barker at> wrote:
> My confusion is still this:
> Repeated summation of strings has been optimized in cpython even
> though it's not the recommended way to solve that problem.

The quadratic behaviour of repeated str summation is a subtle, silent
error. It *is* controversial that CPython silently optimises some cases of
it away, since it can cause problems when porting affected code to other
interpreters that don't use refcounting and thus have a harder time
implementing such a trick.

It's considered worth the cost, since it dramatically improves the
performance of common naive code in a way that doesn't alter the semantics.

> So why not special case optimize sum() for strings? We already
> special-case strings to raise an exception.
> It seems pretty pedantic to say: we could make this work well, but we'd
> rather chide you for not knowing the "proper" way to do it.

Yes, that's exactly what this is - a nudge towards the right way to
concatenate strings without incurring quadratic behaviour. We *want* people
to learn that distinction, not sweep it under the rug. That's the other
reason the implicit optimisation is controversial - it hides an important
difference in algorithmic complexity from users.
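
A minimal sketch of the nudge described above, assuming a small list of
words invented for illustration. It shows sum() refusing strings,
str.join as the linear-time spelling, and the explicit loop that CPython
may optimise but other interpreters need not:

```python
# Illustrative sketch; the list and variable names are made up.
words = ["spam", "eggs", "ham"]

# sum() special-cases str and refuses, pointing at str.join instead.
try:
    sum(words, "")
except TypeError as exc:
    sum_error = str(exc)
else:
    sum_error = None

# The linear-time spelling that the error message nudges towards.
joined = "".join(words)

# The explicit loop still works. CPython may optimise it in place,
# but other interpreters are not required to.
looped = ""
for word in words:
    looped += word
```

The loop and the join produce the same string; only the cost model
differs.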

> Practicality beats purity?

Teaching users the difference between linear time operations and quadratic
ones isn't about purity, it's about passing along a fundamental principle
of algorithm scalability.

We do it specifically for strings because they *do* have an optimised
algorithm available that we can point users towards, and concatenating
multiple strings is common.

Other containers don't tend to be concatenated like that in the first
place, so there's no such check pushing other iterables towards


> -Chris
> > Although that's not the whole story: in
> > practice even numerical sums get split into multiple functions because
> > floating point addition isn't associative, and so needs careful
> > treatment to preserve accuracy.  At that point I'm strongly +1 on
> > abandoning attempts to "rationalize" summation.
> >
> > I'm not sure how I'd feel about raising an exception if you try to sum
> > any iterable containing misbehaved types like float.  But not only
> > would that be a Python 4 effort due to backward incompatibility, but
> > it sorta contradicts the main argument of proponents ("any type
> > implementing __add__ should be sum()-able").
> >

From ncoghlan at  Tue Aug 12 02:28:14 2014
From: ncoghlan at (Nick Coghlan)
Date: Tue, 12 Aug 2014 10:28:14 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140811230800.GA12210@gensokyo>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

On 12 Aug 2014 09:09, "Allen Li" <cyberdupo56 at> wrote:
> This is a problem I sometimes run into when working with a lot of files
> simultaneously, where I need three or more `with` statements:
>     with open('foo') as foo:
>         with open('bar') as bar:
>             with open('baz') as baz:
>                 pass
> Thankfully, support for multiple items was added in 3.1:
>     with open('foo') as foo, open('bar') as bar, open('baz') as baz:
>         pass
> However, this begs the need for a multiline form, especially when
> working with three or more items:
>     with open('foo') as foo, \
>          open('bar') as bar, \
>          open('baz') as baz, \
>          open('spam') as spam, \
>          open('eggs') as eggs:
>         pass

I generally see this kind of construct as a sign that refactoring is
needed. For example, contextlib.ExitStack offers a number of ways to manage
multiple context managers dynamically rather than statically.
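
As a rough sketch of the dynamic approach (the temporary directory and
the file names below are assumptions made only so the example is
self-contained), ExitStack can enter any number of context managers and
close them all on exit:

```python
import os
import tempfile
from contextlib import ExitStack

# Hypothetical setup: create five small files in a scratch directory.
with tempfile.TemporaryDirectory() as tmpdir:
    names = ["foo", "bar", "baz", "spam", "eggs"]
    paths = [os.path.join(tmpdir, name) for name in names]
    for path in paths:
        with open(path, "w") as f:
            f.write(os.path.basename(path))

    # Every file entered on the stack is closed when the block exits,
    # even if one of the later open() calls raises.
    with ExitStack() as stack:
        files = [stack.enter_context(open(path)) for path in paths]
        contents = [f.read() for f in files]

    all_closed = all(f.closed for f in files)
```

This trades the static list of names in the `with` header for a list
built at run time.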


From benhoyt at  Tue Aug 12 02:29:51 2014
From: benhoyt at (Ben Hoyt)
Date: Mon, 11 Aug 2014 20:29:51 -0400
Subject: [Python-Dev] Multiline 'with' statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo> <>
Message-ID: <>

> Even if it weren't a syntax error, the syntax would be ambiguous. How
> will you discern the meaning of::
>     with (
>             foo,
>             bar,
>             baz):
>         pass
> Is that three separate context managers? Or is it one tuple with three
> items?

Is it meaningful to use "with" with a tuple, though? Because a tuple
isn't a context manager with __enter__ and __exit__ methods. For
example:

>>> with (1,2,3): pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: __exit__

So -- although I'm not arguing for it here -- you'd be turning an error
(a runtime AttributeError) into valid syntax.
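
The same check can be sketched as a script. The tuple is bound to a name
first so that the result does not depend on how the parser treats the
parentheses. The exception type is caught loosely on the assumption that
some interpreter versions raise TypeError here rather than the
AttributeError shown in the session above:

```python
# A tuple has neither __enter__ nor __exit__, so it cannot be used
# as a context manager.
candidate = (1, 2, 3)
has_protocol = hasattr(candidate, "__enter__") and hasattr(candidate, "__exit__")

try:
    with candidate:
        pass
except (AttributeError, TypeError) as exc:
    # AttributeError in the session quoted above; assumed to be
    # TypeError on some other interpreter versions.
    error_type = type(exc).__name__
else:
    error_type = None
```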


From alexander.belopolsky at  Tue Aug 12 02:50:28 2014
From: alexander.belopolsky at (Alexander Belopolsky)
Date: Mon, 11 Aug 2014 20:50:28 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On Mon, Aug 11, 2014 at 8:19 PM, Nick Coghlan <ncoghlan at> wrote:

> Teaching users the difference between linear time operations and quadratic
> ones isn't about purity, it's about passing along a fundamental principle
> of algorithm scalability.

I would understand if this was done in reduce(operator.add, ..) which
indeed spells out the choice of an algorithm, but why should sum() be O(N)
for numbers and O(N**2) for containers?  Would a python implementation
that, for example, optimizes away 0's in sum(list_of_numbers) be
non-compliant with some fundamental principle?

From chris.barker at  Tue Aug 12 03:21:15 2014
From: chris.barker at (Chris Barker - NOAA Federal)
Date: Mon, 11 Aug 2014 18:21:15 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
 <> <>
 <> <-2448384566377912251@unknownmsgid>
Message-ID: <2076096455819154683@unknownmsgid>

Sorry for the bike shedding here, but:

> The quadratic behaviour of repeated str summation is a subtle, silent error.

OK, fair enough. I suppose it would be hard and ugly to catch those
instances and raise an exception pointing users to "".join.

> It *is* controversial that CPython silently optimises some cases of it away,
> since it can cause problems when porting affected code to other
> interpreters that don't use refcounting and thus have a harder time
> implementing such a trick.

Is there anything in the language spec that says string concatenation is
O(n^2)? Or for that matter any of the performance characteristics of built-in
types? Those strike me as implementation details that SHOULD be particular to
the implementation.

Should we cripple the performance of some operation in CPython so that it
won't work better than Jython? That seems an odd choice. Then how dare PyPy
make scalar computation faster? People might switch to CPython and not know
they should have been using numpy all along...

> It's considered worth the cost, since it dramatically improves the
> performance of common naive code in a way that doesn't alter the semantics.

Seems the same argument could be made for sum(list_of_strings).

>> It seems pretty pedantic to say: we could make this work well, but we'd
>> rather chide you for not knowing the "proper" way to do it.
>
> Yes, that's exactly what this is - a nudge towards the right way to
> concatenate strings without incurring quadratic behaviour.

But if it were optimized, it wouldn't incur quadratic behavior.

> We *want* people to learn that distinction, not sweep it under the rug.

But sum() is not inherently quadratic -- that's a limitation of the
implementation. I agree that disallowing it is a good idea given that
behavior, but if it were optimized, there would be no reason to steer
people away.

"".join _could_ be naively written with the same poor performance -- why
should users need to understand why one was optimized and one was not?

> That's the other reason the implicit optimisation is controversial - it
> hides an important difference in algorithmic complexity from users.

It doesn't hide it -- it eliminates it. I suppose it's good for folks to
understand the implications of string immutability for when they write
their own algorithms, but this wouldn't be considered a good argument for a
poorly performing sort() for instance.

>> Practicality beats purity?
>
> Teaching users the difference between linear time operations and quadratic
> ones isn't about purity, it's about passing along a fundamental principle
> of algorithm scalability.

That is a very important lesson to learn, sure, but python is not only a
teaching language. People will need to learn those lessons at some point,
this one feature makes little difference.

> We do it specifically for strings because they *do* have an optimised
> algorithm available that we can point users towards, and concatenating
> multiple strings is common.

Sure, but I think all that does is teach people about a cpython specific
implementation -- and I doubt naive users get any closer to understanding
algorithmic complexity -- all they learn is you should use string.join().

Oh well, not really that big a deal.


From ben+python at  Tue Aug 12 05:41:20 2014
From: ben+python at (Ben Finney)
Date: Tue, 12 Aug 2014 13:41:20 +1000
Subject: [Python-Dev] Multiline 'with' statement line continuation
References: <20140811230800.GA12210@gensokyo> <>
Message-ID: <>

Ben Hoyt <benhoyt at> writes:

> So -- although I'm not arguing for it here -- you'd be turning an error
> (a runtime AttributeError) into valid syntax.

Exactly what I'd want to avoid, especially because it *looks* like a
tuple. There are IMO too many pieces of code that look confusingly
similar to tuples but actually mean something else.

 \    "I have an answering machine in my car. It says, 'I'm home now. |
  `\ But leave a message and I'll call when I'm out.'" --Steven Wright |
_o__)                                                                  |
Ben Finney

From stephen at  Tue Aug 12 05:50:21 2014
From: stephen at (Stephen J. Turnbull)
Date: Tue, 12 Aug 2014 12:50:21 +0900
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <2076096455819154683@unknownmsgid>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

Chris Barker - NOAA Federal writes:

 > Is there anything in the language spec that says string concatenation is
 > O(n^2)? Or for that matter any of the performance characteristics of built-in
 > types? Those strike me as implementation details that SHOULD be particular to
 > the implementation.

Container concatenation isn't quadratic in Python at all.  The naive
implementation of sum() as a loop repeatedly calling __add__ is
quadratic for them.  Strings (and immutable containers in general) are
particularly horrible, as they don't have __iadd__.
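
A small sketch of that point, with `naive_sum` and
`count_copied_elements` as hypothetical helpers written only for
illustration (they are not how CPython implements sum()). The loop
rebuilds the running total on every step, so the number of copied
elements grows quadratically with the number of chunks:

```python
from itertools import chain


def naive_sum(iterable, start):
    """Hypothetical stand-in for sum(): a plain loop over __add__."""
    total = start
    for item in iterable:
        total = total + item  # builds a brand-new container every time
    return total


def count_copied_elements(chunks):
    """Count how many elements the naive loop copies in total."""
    copied = 0
    running_length = 0
    for chunk in chunks:
        running_length += len(chunk)
        copied += running_length  # total + item copies both operands
    return copied


chunks = [[i] for i in range(100)]
flat_naive = naive_sum(chunks, [])
flat_chain = list(chain.from_iterable(chunks))  # linear alternative
copies = count_copied_elements(chunks)
```

For 100 one-element lists the loop copies 1 + 2 + ... + 100 elements,
whereas chaining touches each element once.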

You could argue that sum() being a function of an iterable isn't just
a calling convention for a loop encapsulated in a function, but rather
a completely different kind of function that doesn't imply anything
about the implementation, and therefore that it should dispatch on
type(it).  But explicitly dispatching on type(x) is yucky (what if
somebody wants to sum a different type not currently recognized by the
sum() builtin?) so, obviously, we should define a standard __sum__
dunder!  IMO we'd also want a homogeneous_iterable ABC, and a concrete
homogeneous_iterable_of_TYPE for each sum()-able TYPE to help users
catch bugs injecting the wrong type into an iterable_of_TYPE.

But this still sucks.  Why?  Because obviously we'd want the
attractive nuisance of "if you have __add__, there's a default
definition of __sum__" (AIUI, this is what bothers Alexander most
about the current situation, at least of the things he's mentioned, I
can really sympathize with his dislike).  And new Pythonistas and lazy
programmers who only intend to use sum() on "small enough" iterables
will use the default, and their programs will appear to hang on
somewhat larger iterables, or a realtime requirement will go
unsatisfied when least expected, or ....  If we *don't* have that
property for sum(), ugh!  Yuck!  Same old same old!  (IMHO, YMMV of
course.)

It's possible that Python could provide some kind of feature that
would allow an optimized sum function for every type that has __add__,
but I think this will take a lot of thinking.  *Somebody* will do it
(I don't think anybody is +1 on restricting sum() to a subset of types
with __add__).  I just think we should wait until that somebody appears.

 > Should we cripple the performance of some operation in CPython so that it
 > won't work better than Jython?

Nobody is crippling operations.  We're prohibiting use of a *name* for
an operation that is associated (strongly so, in my mind) with an
inefficient algorithm in favor of the *same operation* by a different
name (which has no existing implementation, and therefore Python
implementers are responsible for implementing it efficiently).  Note:
the "inefficient" algorithm isn't inefficient for integers, and it
isn't inefficient for numbers in general (although it's inaccurate for
some classes of numbers).
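
To make the "inaccurate for some classes of numbers" remark concrete,
here is a short sketch using floats. The explicit loop stands in for a
plain left-to-right summation (an assumption for illustration, since an
implementation's built-in sum() is free to do something more careful),
and math.fsum is the standard-library function designed for accurate
float totals:

```python
import math

values = [0.1] * 10

# Plain left-to-right accumulation, written out as a loop.
loop_total = 0.0
for value in values:
    loop_total += value

# math.fsum tracks the lost low-order bits and rounds once at the end.
accurate_total = math.fsum(values)

# Float addition is not associative: grouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
```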

 > Seems the same argument [that Python language doesn't prohibit
 > optimizations in particular implementations just because they
 > aren't made in others] could be made for sum(list_of_strings).

It could.  But then we have to consider special-casing every builtin
type that provides __add__, and we impose an unobvious burden on user
types that provide __add__.

 > > It seems pretty pedantic to say: we could make this work well,
 > > but we'd rather chide you for not knowing the "proper" way to do
 > > it.

Nobody disagrees.  But backward compatibility gets in the way.

 > But sum() is not inherently quadratic -- that's a limitation of the
 > implementation.

But the faulty implementation is the canonical implementation, the
only one that can be defined directly in terms of __add__, and it is
efficient for non-container types.[1]

 > "".join _could_ be naively written with the same poor performance
 > -- why should users need to understand why one was optimized and
 > one was not?

Good question.  They shouldn't -- thus the prohibition on sum()ing
strings.

 > That is a very important lesson to learn, sure, but python is not
 > only a teaching language. People will need to learn those lessons
 > at some point, this one feature makes little difference.

No, it makes a big difference.  If you can do something, then it's OK
to do it, is something Python tries to implement.  If sum() works for
everything with an __add__, given current Python language features
some people are going to end up with very inefficient code and it will
bite some of them (and not necessarily the authors!) at some time.

If it doesn't work for every type with __add__, why not?  You'll end
up playing whack-a-mole with type prohibitions.  Ugh.

 > Sure, but I think all that does is teach people about a cpython specific
 > implementation -- and I doubt naive users get any closer to understanding
 > algorithmic complexity -- all they learn is you should use string.join().
 > Oh well, not really that big a deal.

Not to Python.  Maybe not to you.  But I've learned a lot about
Pythonic ways of doing things trying to channel the folks who
implemented this restriction.  (I don't claim to have gotten it right!
Just that it's been fun and educational. :-)


[1]  This isn't quite true.  One can imagine a "token" or "symbol"
type that is str without __len__, but does have __add__.  But that
seems silly enough to not be a problem in practice.

From ethan at  Tue Aug 12 06:02:17 2014
From: ethan at (Ethan Furman)
Date: Mon, 11 Aug 2014 21:02:17 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <> <>
 <2076096455819154683@unknownmsgid> <>
Message-ID: <>

On 08/11/2014 08:50 PM, Stephen J. Turnbull wrote:
> Chris Barker - NOAA Federal writes:
>> It seems pretty pedantic to say: we could make this work well,
>> but we'd rather chide you for not knowing the "proper" way to do
>> it.
> Nobody disagrees.  But backward compatibility gets in the way.

Something that currently doesn't work, starts to.  How is that a backward compatibility problem?


From Nikolaus at  Tue Aug 12 06:39:11 2014
From: Nikolaus at (Nikolaus Rath)
Date: Mon, 11 Aug 2014 21:39:11 -0700
Subject: [Python-Dev] Commit-ready patches in need of review
Message-ID: <>


The following commit-ready patches have been waiting for review since
May and earlier. It'd be great if someone could find the time to take a
look. I'll be happy to incorporate feedback as necessary:

* (filecmp.dircmp does exact match

* (gzip, bz2, lzma: add option to
  limit output size)

* (Derby #8: Convert 28 sites to
  Argument Clinic across 2 files)

  I only wrote the patch for one file because I'd like to have feedback
  before tackling the second. However, the patches are independent so
  unless there are other problems this is ready for commit.


GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             "Time flies like an arrow, fruit flies like a Banana."

From stephen at  Tue Aug 12 08:07:29 2014
From: stephen at (Stephen J. Turnbull)
Date: Tue, 12 Aug 2014 15:07:29 +0900
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
Message-ID: <>

Ethan Furman writes:
 > On 08/11/2014 08:50 PM, Stephen J. Turnbull wrote:
 > > Chris Barker - NOAA Federal writes:
 > >
 > >> It seems pretty pedantic to say: we could make this work well,
 > >> but we'd rather chide you for not knowing the "proper" way to do
 > >> it.
 > >
 > > Nobody disagrees.  But backward compatibility gets in the way.
 > Something that currently doesn't work, starts to.  How is that a
 > backward compatibility problem?

I'm referring to removing the unnecessary information that there's a
better way to do it, and simply raising an error (as in Python 3.2,
say) which is all a RealProgrammer[tm] should ever need!

That would be a regression and backward incompatible.

From ncoghlan at  Tue Aug 12 09:30:22 2014
From: ncoghlan at (Nick Coghlan)
Date: Tue, 12 Aug 2014 17:30:22 +1000
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <2076096455819154683@unknownmsgid>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
Message-ID: <>

On 12 Aug 2014 11:21, "Chris Barker - NOAA Federal" <chris.barker at> wrote:
> Sorry for the bike shedding here, but:
>> The quadratic behaviour of repeated str summation is a subtle, silent
>> error.
> OK, fair enough. I suppose it would be hard and ugly to catch those
> instances and raise an exception pointing users to "".join.
>> It *is* controversial that CPython silently optimises some cases of it
>> away, since it can cause problems when porting affected code to other
>> interpreters that don't use refcounting and thus have a harder time
>> implementing such a trick.
> Is there anything in the language spec that says string concatenation is
> O(n^2)? Or for that matter any of the performance characteristics of built-in
> types? Those strike me as implementation details that SHOULD be particular to
> the implementation.

If you implement strings so they have multiple data segments internally (as
is the case for StringIO these days), yes, you can avoid quadratic time
concatenation behaviour. Doing so makes it harder to meet other complexity
expectations (like O(1) access to arbitrary code points), and isn't going
to happen in CPython regardless due to C API backwards compatibility
constraints.

For the explicit loop with repeated concatenation, we can't say "this is
slow, don't do it". People do it anyway, so we've opted for the "fine, make
it as fast as we can" option as being preferable to an obscure and
relatively hard to debug performance problem.

For sum(), we have the option of being more direct and just telling people
Python's answer to the string concatenation problem (i.e. str.join). That
is decidedly *not* the series of operations described in sum's
documentation as "Sums start and the items of an iterable from left to
right and returns the total."


From arigo at  Tue Aug 12 10:02:00 2014
From: arigo at (Armin Rigo)
Date: Tue, 12 Aug 2014 10:02:00 +0200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
 <> <>
 <> <-2448384566377912251@unknownmsgid>
Message-ID: <>

Hi all,

The core of the matter is that if we repeatedly __add__ strings from a
long list, we get O(n**2) behavior.  From one point of view, the
reason is that the additions proceed in left-to-right order.  Indeed,
sum() could proceed in a more balanced tree-like order: from [x0, x1,
x2, x3, ...], reduce the list to [x0+x1, x2+x3, ...]; then repeat
until there is only one item in the final list.  This order ensures
that sum(list_of_strings) is at worst O(n log n).  It might be in
practice close enough to linear to not matter.  It also greatly
improves the precision of sum(list_of_floats) (though not reaching the
same precision levels as math.fsum()).
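
The tree-like order described above can be sketched as follows.
`pairwise_sum` is a hypothetical helper name, and requiring at least one
item is an assumption made here to sidestep the question of a start
value:

```python
def pairwise_sum(items):
    """Sum by repeatedly combining neighbours: [x0+x1, x2+x3, ...]."""
    items = list(items)
    if not items:
        raise ValueError("pairwise_sum() needs at least one item")
    while len(items) > 1:
        # Combine adjacent pairs, keeping the original left-to-right order.
        merged = [items[i] + items[i + 1] for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            merged.append(items[-1])  # odd item out is carried to the next round
        items = merged
    return items[0]


text = pairwise_sum(["a", "b", "c", "d", "e"])
numbers = pairwise_sum([1, 2, 3])
floats = pairwise_sum([0.1] * 1000)
```

Each round halves the list, so there are about log2(n) rounds, and each
round copies every character at most once.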

Just a thought,


From jeanpierreda at  Tue Aug 12 12:43:07 2014
From: jeanpierreda at (Devin Jeanpierre)
Date: Tue, 12 Aug 2014 03:43:07 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140811230800.GA12210@gensokyo>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

I think this thread is probably Python-Ideas territory...

On Mon, Aug 11, 2014 at 4:08 PM, Allen Li <cyberdupo56 at> wrote:
> Currently, this works with explicit line continuation, but as all style
> guides favor implicit line continuation over explicit, it would be nice
> if you could do the following:
>     with (open('foo') as foo,
>           open('bar') as bar,
>           open('baz') as baz,
>           open('spam') as spam,
>           open('eggs') as eggs):
>         pass

The parentheses seem unnecessary/redundant/weird. Why not allow
newlines in-between "with" and the terminating ":"?

with open('foo') as foo,
       open('bar') as bar,
       open('baz') as baz:
    pass

-- Devin

From steve at  Tue Aug 12 14:15:41 2014
From: steve at (Steven D'Aprano)
Date: Tue, 12 Aug 2014 22:15:41 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <20140812121541.GG4525@ando>

On Tue, Aug 12, 2014 at 10:28:14AM +1000, Nick Coghlan wrote:
> On 12 Aug 2014 09:09, "Allen Li" <cyberdupo56 at> wrote:
> >
> > This is a problem I sometimes run into when working with a lot of files
> > simultaneously, where I need three or more `with` statements:
> >
> >     with open('foo') as foo:
> >         with open('bar') as bar:
> >             with open('baz') as baz:
> >                 pass
> >
> > Thankfully, support for multiple items was added in 3.1:
> >
> >     with open('foo') as foo, open('bar') as bar, open('baz') as baz:
> >         pass
> >
> > However, this begs the need for a multiline form, especially when
> > working with three or more items:
> >
> >     with open('foo') as foo, \
> >          open('bar') as bar, \
> >          open('baz') as baz, \
> >          open('spam') as spam, \
> >          open('eggs') as eggs:
> >         pass
> I generally see this kind of construct as a sign that refactoring is
> needed. For example, contextlib.ExitStack offers a number of ways to manage
> multiple context managers dynamically rather than statically.

I don't think that ExitStack is the right solution for when you have a 
small number of context managers known at edit-time. The extra effort of 
writing your code, and reading it, in a dynamic manner is not justified. 
Compare the natural way of writing this:

with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese:
    # do stuff with spam, eggs, cheese

versus the dynamic way:

with ExitStack() as stack:
    spam, eggs = [stack.enter_context(open(fname, mode)) for fname, mode in
                  zip(("spam", "eggs"), ("r", "w"))]
    cheese = stack.enter_context(frobulate("cheese"))
    # do stuff with spam, eggs, cheese

I prefer the first, even with the long line.


From graffatcolmingov at  Tue Aug 12 15:04:35 2014
From: graffatcolmingov at (Ian Cordasco)
Date: Tue, 12 Aug 2014 08:04:35 -0500
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140812121541.GG4525@ando>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

On Tue, Aug 12, 2014 at 7:15 AM, Steven D'Aprano <steve at> wrote:
> On Tue, Aug 12, 2014 at 10:28:14AM +1000, Nick Coghlan wrote:
>> On 12 Aug 2014 09:09, "Allen Li" <cyberdupo56 at> wrote:
>> >
>> > This is a problem I sometimes run into when working with a lot of files
>> > simultaneously, where I need three or more `with` statements:
>> >
>> >     with open('foo') as foo:
>> >         with open('bar') as bar:
>> >             with open('baz') as baz:
>> >                 pass
>> >
>> > Thankfully, support for multiple items was added in 3.1:
>> >
>> >     with open('foo') as foo, open('bar') as bar, open('baz') as baz:
>> >         pass
>> >
>> > However, this begs the need for a multiline form, especially when
>> > working with three or more items:
>> >
>> >     with open('foo') as foo, \
>> >          open('bar') as bar, \
>> >          open('baz') as baz, \
>> >          open('spam') as spam, \
>> >          open('eggs') as eggs:
>> >         pass
>> I generally see this kind of construct as a sign that refactoring is
>> needed. For example, contextlib.ExitStack offers a number of ways to manage
>> multiple context managers dynamically rather than statically.
> I don't think that ExitStack is the right solution for when you have a
> small number of context managers known at edit-time. The extra effort of
> writing your code, and reading it, in a dynamic manner is not justified.
> Compare the natural way of writing this:
> with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese:
>     # do stuff with spam, eggs, cheese
> versus the dynamic way:
> with ExitStack() as stack:
>     spam, eggs = [stack.enter_context(open(fname, mode)) for fname, mode in
>                   zip(("spam", "eggs"), ("r", "w"))]
>     cheese = stack.enter_context(frobulate("cheese"))
>     # do stuff with spam, eggs, cheese
> I prefer the first, even with the long line.

I agree with Steven for *small* numbers of context managers. Once they
become too long though, either refactoring is severely needed or the
user should use ExitStack.

To quote Ben Hoyt:

> Is it meaningful to use "with" with a tuple, though? Because a tuple
> isn't a context manager with __enter__ and __exit__ methods. For
> example:
> >>> with (1,2,3): pass
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: __exit__
> So -- although I'm not arguing for it here -- you'd be turning an error
> (a runtime AttributeError) into valid syntax.

I think by introducing parentheses we are going to risk seriously
confusing users who may then try to write an assignment like

a = (open('spam') as spam, open('eggs') as eggs)

Because it looks like a tuple but isn't and I think the extra
complexity this would add to the language would not be worth the
benefit. If we simply look at Ruby for what happens when you have an
overloaded syntax that means two different things, you can see why I'm
against modifying this syntax. In Ruby, parentheses for method calls
are optional and curly braces (i.e., {}) are used for blocks and hash
literals. With a method on a class that takes a parameter and a block,
you get some confusing errors, take for example:

class Spam
  def eggs(ham)
    puts ham
    yield if block_given?
  end
end

s = Spam.new
s.eggs {monty: 'python'}
SyntaxError: ...

But

s.eggs({monty: 'python'})

Will print out the hash. The interpreter isn't intelligent enough to
know if you're attempting to pass a hash as a parameter or a block to
be executed. This may seem like a stretch to apply to Python, but the
concept of muddling the meaning of something already very well defined
seems like a bad idea.

From guido at  Tue Aug 12 17:12:45 2014
From: guido at (Guido van Rossum)
Date: Tue, 12 Aug 2014 08:12:45 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

On Tue, Aug 12, 2014 at 3:43 AM, Devin Jeanpierre <jeanpierreda at> wrote:

> I think this thread is probably Python-Ideas territory...
> On Mon, Aug 11, 2014 at 4:08 PM, Allen Li <cyberdupo56 at> wrote:
> > Currently, this works with explicit line continuation, but as all style
> > guides favor implicit line continuation over explicit, it would be nice
> > if you could do the following:
> >
> >     with (open('foo') as foo,
> >           open('bar') as bar,
> >           open('baz') as baz,
> >           open('spam') as spam,
> >           open('eggs') as eggs):
> >         pass
> The parentheses seem unnecessary/redundant/weird. Why not allow
> newlines in-between "with" and the terminating ":"?
> with open('foo') as foo,
>        open('bar') as bar,
>        open('baz') as baz:
>     pass

That way lies Coffeescript. Too much guessing.

--Guido van Rossum (

From arigo at  Tue Aug 12 18:57:39 2014
From: arigo at (Armin Rigo)
Date: Tue, 12 Aug 2014 18:57:39 +0200
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140811230800.GA12210@gensokyo>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>


On 12 August 2014 01:08, Allen Li <cyberdupo56 at> wrote:
>     with (open('foo') as foo,
>           open('bar') as bar,
>           open('baz') as baz,
>           open('spam') as spam,
>           open('eggs') as eggs):
>         pass

+1.  It's exactly the same grammar extension as for "from import"
statements, for the same reason.


From g.brandl at  Tue Aug 12 20:52:44 2014
From: g.brandl at (Georg Brandl)
Date: Tue, 12 Aug 2014 20:52:44 +0200
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <lsdnps$fvd$>

On 08/12/2014 06:57 PM, Armin Rigo wrote:
> Hi,
> On 12 August 2014 01:08, Allen Li <cyberdupo56 at> wrote:
>>     with (open('foo') as foo,
>>           open('bar') as bar,
>>           open('baz') as baz,
>>           open('spam') as spam,
>>           open('eggs') as eggs):
>>         pass
> +1.  It's exactly the same grammar extension as for "from import"
> statements, for the same reason.

Not the same: in import statements it unambiguously replaces a list
of (optionally as-renamed) identifiers.  Here, it would replace an
arbitrary expression, which I think would mean that we couldn't
differentiate between e.g.

   with (expr).meth():        # a line break in "expr"
                              # would make the parens useful

and

   with (expr1, expr2):
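
The two readings above can be inspected with the standard ast module. This is only a sketch: the helper name with_items is made up for illustration, and the result for the second statement depends on which grammar the running interpreter implements, so it is printed rather than assumed.

```python
import ast


def with_items(source):
    """Return the AST node type of each context expression in a with statement."""
    stmt = ast.parse(source).body[0]
    return [type(item.context_expr).__name__ for item in stmt.items]


# The parentheses only group "expr"; the context manager is the call result.
grouped = with_items("with (expr).meth(): pass")
print(grouped)

# Interpreter-dependent: either one Tuple expression, or two separate
# context managers where the parenthesized form is accepted.
listed = with_items("with (expr1, expr2): pass")
print(listed)
```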


From chris.barker at  Tue Aug 12 21:11:35 2014
From: chris.barker at (Chris Barker)
Date: Tue, 12 Aug 2014 12:11:35 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <> <>
 <> <-2448384566377912251@unknownmsgid>
 <2076096455819154683@unknownmsgid> <>
 <> <>
Message-ID: <>

On Mon, Aug 11, 2014 at 11:07 PM, Stephen J. Turnbull <stephen at>
wrote:

> I'm referring to removing the unnecessary information that there's a
>  better way to do it, and simply raising an error (as in Python 3.2,
> say) which is all a RealProgrammer[tm] should ever need!

I can't imagine anyone is suggesting that -- disallow it, but don't tell
anyone why?

The only thing that is remotely on the table here is:

1) remove the special case for strings -- buyer beware -- but consistent
and less "ugly"

2) add a special case for strings that is fast and efficient -- may be as
simple as calling "".join() under the hood --no more code than the
exception check.

And I doubt anyone really is pushing for anything but (2)

Stephen Turnbull wrote:

>   IMO we'd also want a homogeneous_iterable ABC

Actually, I've thought for years that that would open the door to a lot of
optimizations -- but that's a much broader question than sum(). I even
brought it up probably over ten years ago -- but no one was the least bit
interested -- nor are they now -- I know this was a rhetorical suggestion
to make the point about what not to do....

> Because obviously we'd want the
> attractive nuisance of "if you have __add__, there's a default
> definition of __sum__"

now I'm confused -- isn't that exactly what we have now?

> It's possible that Python could provide some kind of feature that
> would allow an optimized sum function for every type that has __add__,
> but I think this will take a lot of thinking.

does it need to be every type? As it is the common ones work fine already
except for strings -- so if we add an optimized string sum() then we're
done.

> *Somebody* will do it
> (I don't think anybody is +1 on restricting sum() to a subset of types
> with __add__).

uhm, that's exactly what we have now -- you can use sum() with anything
that has an __add__, except strings. And by that logic, if we thought there
were other inefficient use cases, we'd restrict those too.

But users can always define their own classes that have a __sum__ and are
really inefficient -- so unless sum() becomes just for a certain subset of
built-in types -- does anyone want that? Then we are back to the current

sum() can be used for any type that has an __add__ defined.

But naive users are likely to try it with strings, and that's bad, so we
want to prevent that, and have a special case check for strings.

What I fail to see is why it's better to raise an exception and point users
to a better way, than to simply provide an optimization so that it's a moot
issue.

The only justification offered here is that it will teach people that summing
strings (and some other objects?) is order(N^2) and a bad idea. But:

a) Python's primary purpose is practical, not pedagogical (not that it
isn't great for that)

b) I doubt any naive users learn anything other than "I can't use sum() for
strings, I should use "".join()". Will they make the leap to "I shouldn't
use string concatenation in a loop, either"? Oh, wait, you can use string
concatenation in a loop -- that's been optimized. So will they learn: "some
types of objects have poor performance with repeated concatenation and
shouldn't be used with sum(). So if I write such a class, and want to sum
them up, I'll need to write an optimized version of that code"?

I submit that no naive user is going to get any closer to a proper
understanding of algorithmic Order behavior from this small hint. Which
leaves no reason to prefer an Exception to an optimization.

One other point: perhaps this will lead a naive user into thinking --
"sum() raises an exception if I try to use it inefficiently, so it must be
OK to use for anything that doesn't raise an exception" -- that would be a
bad lesson to mis-learn....
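
For readers who have not hit the special case under discussion, here is a minimal sketch of what happens today. The variable names are illustrative only, and the exact wording of the error message is a CPython detail that may vary.

```python
words = ["spam", "eggs", "cheese"]

# sum() refuses a str start value instead of concatenating.
error = None
try:
    sum(words, "")
except TypeError as exc:
    error = exc  # CPython's message suggests ''.join(seq) instead

# The recommended spelling: a single pass over the strings.
joined = "".join(words)
print(joined)
```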


Armin Rigo wrote:

> It also improves a
> lot the precision of sum(list_of_floats) (though not reaching the same
> precision levels of math.fsum()).

while we are at it, having the default sum() for floats be fsum() would be
nice -- I'd rather the default was better accuracy, lower performance. Folks
that really care about performance could call math.fastsum(), or really,
use numpy...
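
The accuracy gap being referred to can be seen with a short sketch. The explicit loop stands in for plain left-to-right float addition; whether the built-in sum() shows the same rounding error may depend on the interpreter version, so it is not used for the comparison here.

```python
import math

values = [0.1] * 10

# Plain left-to-right addition accumulates rounding error.
running = 0.0
for value in values:
    running += value

# math.fsum tracks the lost low-order bits and rounds once at the end.
accurate = math.fsum(values)

print(running, accurate)
```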

This does turn sum() into a function that does type-based dispatch, but
isn't python full of those already? do something special for the types you
know about, call the generic dunder method for the rest.


Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From Stefan.Richthofer at  Tue Aug 12 21:48:01 2014
From: Stefan.Richthofer at (Stefan Richthofer)
Date: Tue, 12 Aug 2014 21:48:01 +0200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <> <>
 <> <>
Message-ID: <trinity-2ac7d211-1b97-448d-8ce4-0f88dc136ccc-1407872881388@3capp-gmx-bs03>

An HTML attachment was scrubbed...
URL: <>

From jeanpierreda at  Wed Aug 13 02:41:32 2014
From: jeanpierreda at (Devin Jeanpierre)
Date: Tue, 12 Aug 2014 17:41:32 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

On Tue, Aug 12, 2014 at 8:12 AM, Guido van Rossum <guido at> wrote:
> On Tue, Aug 12, 2014 at 3:43 AM, Devin Jeanpierre <jeanpierreda at>
> wrote:
>> The parentheses seem unnecessary/redundant/weird. Why not allow
>> newlines in-between "with" and the terminating ":"?
>> with open('foo') as foo,
>>        open('bar') as bar,
>>        open('baz') as baz:
>>     pass
> That way lies Coffeescript. Too much guessing.

There's no syntactic ambiguity, so what guessing are you talking about?

What *really* requires guessing, is figuring out where in Python's
syntax parentheses are allowed vs not allowed ;). For example, "from
foo import (bar, baz)" is legal, but "import (bar, baz)" is not.
Sometimes it feels like Python is slowly and organically evolving into
a parenthesis-delimited language.
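
The import asymmetry mentioned above is easy to confirm by compiling both forms without running them. The helper name is_valid is only for illustration, and os stands in for the placeholder module names.

```python
def is_valid(source):
    """Report whether the source compiles, without executing it."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


from_import_ok = is_valid("from os import (path, sep)")
plain_import_ok = is_valid("import (os, sys)")
print(from_import_ok, plain_import_ok)
```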

-- Devin

From Nikolaus at  Wed Aug 13 04:48:34 2014
From: Nikolaus at (Nikolaus Rath)
Date: Tue, 12 Aug 2014 19:48:34 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
 (Chris Barker's message of "Tue, 12 Aug 2014 12:11:35 -0700")
References: <>
Message-ID: <>

Chris Barker <chris.barker at> writes:
> What I fail to see is why it's better to raise an exception and point users
> to a better way, than to simply provide an optimization so that it's a moot
> issue.
> The only justification offered here is that it will teach people that summing
> strings (and some other objects?) is order(N^2) and a bad idea. But:
> a) Python's primary purpose is practical, not pedagogical (not that it
> isn't great for that)
> b) I doubt any naive users learn anything other than "I can't use sum() for
> strings, I should use "".join()". Will they make the leap to "I shouldn't
> use string concatenation in a loop, either"? Oh, wait, you can use string
> concatenation in a loop -- that's been optimized. So will they learn: "some
> types of objects have poor performance with repeated concatenation and
> shouldn't be used with sum(). So if I write such a class, and want to sum
> them up, I'll need to write an optimized version of that code"?
> I submit that no naive user is going to get any closer to a proper
> understanding of algorithmic Order behavior from this small hint. Which
> leaves no reason to prefer an Exception to an optimization.
> One other point: perhaps this will lead a naive user into thinking --
> "sum() raises an exception if I try to use it inefficiently, so it must be
> OK to use for anything that doesn't raise an exception" -- that would be a
> bad lesson to mis-learn....

AOL to that.


GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             "Time flies like an arrow, fruit flies like a Banana."

From steve at  Wed Aug 13 05:38:55 2014
From: steve at (Steven D'Aprano)
Date: Wed, 13 Aug 2014 13:38:55 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <20140813033855.GH4525@ando>

On Tue, Aug 12, 2014 at 08:04:35AM -0500, Ian Cordasco wrote:

> I think by introducing parentheses we are going to risk seriously
> confusing users who may then try to write an assignment like
> a = (open('spam') as spam, open('eggs') as eggs)


If they try it, they will get a syntax error. Now, admittedly Python's 
syntax error messages tend to be terse and cryptic, but it's still 
enough to show that you can't do that.

py> a = (open('spam') as spam, open('eggs') as eggs)
  File "<stdin>", line 1
    a = (open('spam') as spam, open('eggs') as eggs)
SyntaxError: invalid syntax

I don't see this as a problem. There's no limit to the things that 
people *might* do if they don't understand Python semantics:

for module in sys, math, os, 
    import module

(and yes, I once tried this as a beginner) but they try it once, realise 
it doesn't work, and never do it again.

> Because it looks like a tuple but isn't and I think the extra
> complexity this would add to the language would not be worth the
> benefit. 

Do we have a problem with people thinking that, since tuples are 
normally interchangeable with lists, they can write this?

from module import [fe, fi, fo, fum,
                    spam, eggs, cheese]

and then being "seriously confused" by the syntax error they receive? Or 
writing this?

from (module import fe, fi, fo, fum,
                    spam, eggs, cheese)

It's not sufficient that people might try it, see it fails, and move on. 
Your claim is that it will cause serious confusion. I just don't see 
that happening.

> If we simply look at Ruby for what happens when you have an
> overloaded syntax that means two different things, you can see why I'm
> against modifying this syntax. 

That ship has sailed in Python, oh, 20+ years ago. Parens are used for 
grouping, for tuples[1], for function calls, for parameter lists, class 
base-classes, generator expressions and line continuations. I cannot 
think of any examples where these multiple uses for parens have caused 
meaningful confusion, and I don't think this one will either.

[1] Technically not, since it's the comma, not the ( ), which makes a 
tuple, but a lot of people don't know that and treat it as if the 
parens were compulsory.
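
The footnote's point can be checked directly. A small sketch with throwaway names:

```python
pair = 1, 2        # the commas build the tuple; no parentheses needed
grouped = (1)      # parentheses alone only group: this is the int 1
single = 1,        # a trailing comma makes a one-element tuple
empty = ()         # the empty tuple is the one case written with parens only

print(type(pair), type(grouped), type(single), type(empty))
```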


From stephen at  Wed Aug 13 08:21:42 2014
From: stephen at (Stephen J. Turnbull)
Date: Wed, 13 Aug 2014 15:21:42 +0900
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
Message-ID: <>

Redirecting to python-ideas, so trimming less than I might.

Chris Barker writes:
 > On Mon, Aug 11, 2014 at 11:07 PM, Stephen J. Turnbull <stephen at>
 > wrote:
 > > I'm referring to removing the unnecessary information that there's a
 > >  better way to do it, and simply raising an error (as in Python 3.2,
 > > say) which is all a RealProgrammer[tm] should ever need!
 > >
 > I can't imagine anyone is suggesting that -- disallow it, but don't tell
 > anyone why?

As I said, it's a regression.  That's exactly the behavior in Python 3.2.

 > The only thing that is remotely on the table here is:
 > 1) remove the special case for strings -- buyer beware -- but consistent
 > and less "ugly"

It's only consistent if you believe that Python has strict rules for
use of various operators.  It doesn't, except as far as they are
constrained by precedence.  For example, I have an application where I
add bytestrings bytewise modulo N <= 256, and concatenate them.  In
fact I use function call syntax, but the obvious operator syntax is
'+' for the bytewise addition, and '*' for the concatenation.

It's not in the Zen, but I believe in the maxim "If it's worth doing,
it's worth doing well."  So for me, 1) is out anyway.

 > 2) add a special case for strings that is fast and efficient -- may be as
 > simple as calling "".join() under the hood --no more code than the
 > exception check.

Sure, but what about all the other immutable containers with __add__
methods?  What about mappings with key-wise __add__ methods whose
values might be immutable but have __add__ methods?  Where do you stop
with the special-casing?  I consider this far more complex and ugly
than the simple "sum() is for numbers" rule (and even that is way too
complex considering accuracy of summing floats).

 > And I doubt anyone really is pushing for anything but (2)

I know that, but I think it's the wrong solution to the problem (which
is genuine IMO).  The right solution is something generic, possibly a
__sum__ method.  The question is whether that leads to too much work
to be worth it (eg, "homogeneous_iterable").

 > > Because obviously we'd want the attractive nuisance of "if you
 > > have __add__, there's a default definition of __sum__"
 > now I'm confused -- isn't that exactly what we have now?

Yes and my feeling (backed up by arguments that I admit may persuade
nobody but myself) is that what we have now kinda sucks[tm].  It
seemed like a good idea when I first saw it, but then, my apps don't
scale to where the pain starts in my own usage.

 > > It's possible that Python could provide some kind of feature that
 > > would allow an optimized sum function for every type that has
 > > __add__, but I think this will take a lot of thinking.
 > does it need to be every type? As it is the common ones work fine already
 > except for strings -- so if we add an optimized string sum() then we're
 > done.

I didn't say provide an optimized sum(), I said provide a feature
enabling people who want to optimize sum() to do so.  So yes, it needs
to be every type (the optional __sum__ method is a proof of concept,
modulo it actually being implementable ;-).

 > > *Somebody* will do it (I don't think anybody is +1 on restricting
 > > sum() to a subset of types with __add__).
 > uhm, that's exactly what we have now

Exactly.  Who's arguing that the sum() we have now is a ticket to
Paradise?  I'm just saying that there's probably somebody out there
negative enough on the current situation to come up with an answer
that I think is general enough (and I suspect that python-dev
consensus is that demanding, too).

 > sum() can be used for any type that has an __add__ defined.

I'd like to see that be mutable types with __iadd__.

 > What I fail to see is why it's better to raise an exception and
 > point users to a better way, than to simply provide an optimization
 > so that it's a moot issue.

Because inefficient sum() is an attractive nuisance, easy to overlook,
and likely to bite users other than the author.

 > The only justification offered here is that it will teach people that summing
 > strings (and some other objects?)

Summing tuples works (with appropriate start=tuple()).  Haven't
benchmarked, but I bet that's O(N^2).
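
A quick sketch of the tuple case, with illustrative data. Without a start value the default 0 cannot be added to a tuple; with an empty tuple as start it works, building a new tuple at every step. itertools.chain is shown as one linear alternative, which is not something proposed in this thread.

```python
from itertools import chain

rows = [(1, 2), (3,), (4, 5)]

# The default start of 0 cannot be added to a tuple.
failed_without_start = False
try:
    sum(rows)
except TypeError:
    failed_without_start = True

# Works, but each addition copies everything accumulated so far.
flat = sum(rows, ())

# One linear-time alternative for comparison.
flat_linear = tuple(chain.from_iterable(rows))
print(flat, flat_linear)
```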

 > is order(N^2) and a bad idea. But:
 > a) Python's primary purpose is practical, not pedagogical (not that it
 > isn't great for that)

My argument is that in practical use sum() is a bad idea, period,
until you book up on the types and applications where it *does* work.
N.B. It doesn't even work properly for numbers (inaccurate for floats).

 > b) I doubt any naive users learn anything other than "I can't use sum() for
 > strings, I should use "".join()".

For people who think that special-casing strings is a good idea, I
think this is about as much benefit as you can expect.  Why go
farther?<0.5 wink/>

 > I submit that no naive user is going to get any closer to a proper
 > understanding of algorithmic Order behavior from this small hint. Which
 > leaves no reason to prefer an Exception to an optimization.

TOOWTDI.  str.join is in pretty much every code base by now, and
tutorials and FAQs recommending its use and severely deprecating sum
for strings are legion.

 > One other point: perhaps this will lead a naive user into thinking --
 > "sum() raises an exception if I try to use it inefficiently, so it must be
 > OK to use for anything that doesn't raise an exception" -- that would be a
 > bad lesson to mis-learn....

That assumes they know about the start argument.  I think most naive
users will just try to sum a bunch of tuples, and get the "can't add
0, tuple" Exception and write a loop.  I suspect that many of the
users who get the "use str.join" warning along with the Exception are
unaware of the start argument, too.  They expect sum(iter_of_str) to
magically add the strings.  Ie, when in 3.2 they got the
uninformative "can't add 0, str" message, they did not immediately go
"d'oh" and insert ", start=''" in the call to sum, they wrote a loop.

 > while we are at it, having the default sum() for floats be fsum()
 > would be nice

How do you propose to implement that, given math.fsum is perfectly
happy to sum integers?  You can't just check one or a few leading
elements for floatiness.  I think you have to dispatch on type(start),
but then sum(iter_of_floats) DTWT.  So I would suggest changing the
signature to sum(it, start=0.0).  This would probably be acceptable to
most users with iterables of ints, but does imply some performance hit.

 > This does turn sum() into a function that does type-based dispatch,
 > but isn't python full of those already? do something special for
 > the types you know about, call the generic dunder method for the
 > rest.

AFAIK Python is moving in the opposite direction: if there's a common
need for dispatching to type-specific implementations of a method,
define a standard (not "generic") dunder for the purpose, and have the
builtin (or operator, or whatever) look up (not "call") the
appropriate instance in the usual way, then call it.  If there's a
useful generic implementation, define an ABC to inherit from that
provides that generic implementation.
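
To make the dunder-lookup idea concrete, here is a rough sketch. Everything in it is hypothetical: Python has no __sum__ protocol, generic_sum and Text are invented names, and dispatching on type(start) is only one of the options mentioned in this thread.

```python
def generic_sum(iterable, start=0):
    """sum() variant that defers to a hypothetical __sum__ hook on type(start)."""
    # Look the hook up on the type, the way special methods are normally found.
    hook = getattr(type(start), "__sum__", None)
    if hook is not None:
        return hook(start, iterable)
    # Fallback: the generic repeated-__add__ behaviour.
    total = start
    for item in iterable:
        total = total + item
    return total


class Text(str):
    """A str subclass opting in to an efficient, join-based sum."""

    def __sum__(self, iterable):
        return Text(self + "".join(iterable))


print(generic_sum(["a", "b", "c"], Text("")))
print(generic_sum([1, 2, 3]))
```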

From ncoghlan at  Wed Aug 13 10:34:58 2014
From: ncoghlan at (Nick Coghlan)
Date: Wed, 13 Aug 2014 18:34:58 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140812121541.GG4525@ando>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

On 12 August 2014 22:15, Steven D'Aprano <steve at> wrote:
> Compare the natural way of writing this:
> with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese:
>     # do stuff with spam, eggs, cheese
> versus the dynamic way:
> with ExitStack() as stack:
>     spam, eggs = [stack.enter_context(open(fname, mode)) for fname, mode in
>                   zip(("spam", "eggs"), ("r", "w"))]
>     cheese = stack.enter_context(frobulate("cheese"))
>     # do stuff with spam, eggs, cheese

You wouldn't necessarily switch at three. At only three, you have lots
of options, including multiple nested with statements:

    with open("spam") as spam:
        with open("eggs", "w") as eggs:
            with frobulate("cheese") as cheese:
                # do stuff with spam, eggs, cheese

The "multiple context managers in one with statement" form is there
*solely* to save indentation levels, and overuse can often be a sign
that you may have a custom context manager trying to get out:

    @contextlib.contextmanager
    def dish(spam_file, egg_file, topping):
        with open(spam_file), open(egg_file, 'w'), frobulate(topping):
            yield

    with dish("spam", "eggs", "cheese") as spam, eggs, cheese:
        # do stuff with spam, eggs & cheese

ExitStack is mostly useful as a tool for writing flexible custom
context managers, and for dealing with context managers in cases where
lexical scoping doesn't necessarily work, rather than being something
you'd regularly use for inline code.
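
As written, the dish() example does not hand the three resources back to the caller. A runnable sketch of the same idea follows, under some assumptions: frobulate() is not defined anywhere in this thread, so contextlib.nullcontext stands in for it, and the temporary files exist only to make the example self-contained.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def dish(spam_file, egg_file, topping):
    # ExitStack closes everything entered so far, even if a later open fails.
    with contextlib.ExitStack() as stack:
        spam = stack.enter_context(open(spam_file))
        eggs = stack.enter_context(open(egg_file, "w"))
        # Stand-in for the undefined frobulate(topping).
        cheese = stack.enter_context(contextlib.nullcontext(topping))
        yield spam, eggs, cheese


with tempfile.TemporaryDirectory() as tmp:
    spam_path = os.path.join(tmp, "spam")
    egg_path = os.path.join(tmp, "eggs")
    with open(spam_path, "w") as f:
        f.write("spam contents")

    # One indentation level, three resources.
    with dish(spam_path, egg_path, "cheese") as (spam, eggs, cheese):
        eggs.write(spam.read())

    with open(egg_path) as f:
        copied = f.read()
```

The single tuple target after "as" needs its parentheses; without them only the first name would be bound.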

"Why do I have so many contexts open at once in this function?" is a
question developers should ask themselves in the same way it's worth
asking "why do I have so many local variables in this function?"


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From ijmorlan at  Wed Aug 13 15:11:15 2014
From: ijmorlan at (Isaac Morland)
Date: Wed, 13 Aug 2014 09:11:15 -0400 (EDT)
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <>

On Mon, 11 Aug 2014, Skip Montanaro wrote:

> On Mon, Aug 11, 2014 at 12:42 PM, matsjoyce <matsjoyce at> wrote:
>> There may be some holes in my approach, but I can't find them.
> There's the rub. Given time, I suspect someone will discover a hole or two.

Schneier's Law:

 	Any person can invent a security system so clever that she or he can't
 	think of how to break it.

While I would not claim a Python sandbox is utterly impossible, I'm 
suspicious that the whole "consenting adults" approach in Python is 
incompatible with a sandbox.  The whole idea of a sandbox is to absolutely 
prevent people from doing things even if they really want to and know what 
they are doing.

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist

From 4kir4.1i at  Wed Aug 13 17:47:18 2014
From: 4kir4.1i at (Akira Li)
Date: Wed, 13 Aug 2014 19:47:18 +0400
Subject: [Python-Dev] Multiline with statement line continuation
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

Nick Coghlan <ncoghlan at> writes:

> On 12 August 2014 22:15, Steven D'Aprano <steve at> wrote:
>> Compare the natural way of writing this:
>> with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese:
>>     # do stuff with spam, eggs, cheese
>> versus the dynamic way:
>> with ExitStack() as stack:
>>     spam, eggs = [stack.enter_context(open(fname, mode)) for fname, mode in
>>                   zip(("spam", "eggs"), ("r", "w"))]
>>     cheese = stack.enter_context(frobulate("cheese"))
>>     # do stuff with spam, eggs, cheese
> You wouldn't necessarily switch at three. At only three, you have lots
> of options, including multiple nested with statements:
>     with open("spam") as spam:
>         with open("eggs", "w") as eggs:
>             with frobulate("cheese") as cheese:
>                 # do stuff with spam, eggs, cheese
> The "multiple context managers in one with statement" form is there
> *solely* to save indentation levels, and overuse can often be a sign
> that you may have a custom context manager trying to get out:
>     @contextlib.contextmanager
>     def dish(spam_file, egg_file, topping):
>         with open(spam_file), open(egg_file, 'w'), frobulate(topping):
>             yield
>     with dish("spam", "eggs", "cheese") as spam, eggs, cheese:
>         # do stuff with spam, eggs & cheese
> ExitStack is mostly useful as a tool for writing flexible custom
> context managers, and for dealing with context managers in cases where
> lexical scoping doesn't necessarily work, rather than being something
> you'd regularly use for inline code.
> "Why do I have so many contexts open at once in this function?" is a
> question developers should ask themselves in the same way it's worth
> asking "why do I have so many local variables in this function?"

Multiline with-statement can be useful even with *two* context
managers. Two is not many.

Saving indentation levels alone is a worthy goal. It can affect
readability and the perceived complexity of the code.

Here's how I'd like the code to look:

  with (open('input filename') as input_file,
        open('output filename', 'w') as output_file):
      # code with list comprehensions to transform input file into output file

Even one additional unnecessary indentation level may force one to split
list comprehensions into several lines (less readable) and/or use
shorter names (less readable). Or it may force one to move the inline code
into a separate named function prematurely, solely to preserve the
indentation level (which may also be less readable), i.e.,

  with ... as input_file:
      with ... as output_file:
          ... #XXX indentation level is lost for no reason

  with ... as infile, ... as outfile: #XXX shorter names

  with ... as input_file:
      with ... as output_file:
          transform(input_file, output_file) #XXX unnecessary function

And (nested() can be implemented using ExitStack):

  with nested(open(..),
              open(..)) as (input_file, output_file):
      ... #XXX less readable

Here's an example where nested() won't help:

  def get_integers(filename):
      with (open(filename, 'rb', 0) as file,
            mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file):
          for match in re.finditer(br'\d+', mmapped_file):
              yield int(
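
For reference, a version of this generator that runs with the existing grammar, using a backslash continuation. It assumes the truncated last line was meant to yield each run of digits as an integer, and the temporary file is only there to exercise it.

```python
import mmap
import os
import re
import tempfile


def get_integers(filename):
    # The second manager depends on the first, so both must be entered in order.
    with open(filename, 'rb', 0) as file, \
         mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
        for match in re.finditer(br'\d+', mmapped_file):
            # Assumed completion: convert the matched digits to an int.
            yield int(match.group())


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "wb") as f:
        f.write(b"12 apples, 34 pears and 5 plums")
    found = list(get_integers(path))

print(found)
```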

Here's another:

  with (open('log'+'some expression that generates filename', 'a') as logfile,


From matsjoyce at  Wed Aug 13 18:19:14 2014
From: matsjoyce at (matsjoyce)
Date: Wed, 13 Aug 2014 16:19:14 +0000 (UTC)
Subject: [Python-Dev] Reviving restricted mode?
References: <>
Message-ID: <>

Unless you remove all the things labelled "keep away from children". I wrote 
this sandbox to allow python to be used as a "mods"/"add-ons" language for a 
game I'm writing, hence the perhaps too strict nature.

About the crashers: as this is for games, it's "fine" for the game to crash,
as long as the sandbox is not broken while crashing.

time and math can probably be allowed, but random imports a lot of 
undesirable modules.

My sandbox doesn't use proxies, due to the introspection and complexity that 
it involves. Instead it completely isolates the sandboxed globals, and checks 
all arguments and globals for irregularities before passing control to non-
sandboxed functions.

From rosuav at  Wed Aug 13 18:26:29 2014
From: rosuav at (Chris Angelico)
Date: Thu, 14 Aug 2014 02:26:29 +1000
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <>

On Wed, Aug 13, 2014 at 11:11 PM, Isaac Morland <ijmorlan at> wrote:
> While I would not claim a Python sandbox is utterly impossible, I'm
> suspicious that the whole "consenting adults" approach in Python is
> incompatible with a sandbox.  The whole idea of a sandbox is to absolutely
> prevent people from doing things even if they really want to and know what
> they are doing.

It's certainly not *fundamentally* impossible to sandbox Python.
However, the question becomes one of how much effort you're going to
go to and how much you're going to restrict the code. I think I
remember reading about something that's like ast.literal_eval, but
allows name references; with that, plus some tiny features of
assignment, you could make a fairly straight-forward evaluator that
lets you work comfortably with numbers, strings, lists, dicts, etc.
That could be pretty useful - but it wouldn't so much be "Python in a
sandbox" as "an expression evaluator that uses a severely restricted
set of Python syntax".
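
A rough sketch of such an expression evaluator, built on the ast module with a whitelist of node types. It is not the tool being half-remembered above, and it is not a vetted sandbox: the function name evaluate, the set of allowed operators, and the error type are all choices made for illustration.

```python
import ast
import operator

# Only these binary operators are allowed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def evaluate(source, names):
    """Evaluate literals, containers, known names and + - * only."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            if node.id in names:
                return names[node.id]
            raise ValueError("unknown name: " + node.id)
        if isinstance(node, ast.List):
            return [visit(element) for element in node.elts]
        if isinstance(node, ast.Tuple):
            return tuple(visit(element) for element in node.elts)
        if isinstance(node, ast.Dict):
            return {visit(k): visit(v) for k, v in zip(node.keys, node.values)}
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        # Calls, attribute access, comprehensions, etc. all end up here.
        raise ValueError("disallowed syntax: " + type(node).__name__)

    return visit(ast.parse(source, mode="eval"))


print(evaluate("[x + 1, 'a' * n, {'k': (x, n)}]", {"x": 2, "n": 3}))
```

Starting from nothing and adding node types one at a time keeps the allowed surface explicit, but even this much can be made to consume arbitrary memory or time with large operands.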

If you start with all of Python and then start cutting out the
dangerous bits, you're doomed to miss something, and your sandbox is
broken. If you start with nothing and then start adding functionality,
you're looking at a gigantic job before it becomes anything that you
could call an applications language. So while it's theoretically
possible (I think - certainly I can't say for sure that it's
impossible), it's fairly impractical. I've had my own try at it, and
failed quite badly (fortunately noisily and at a sufficiently early
stage of development to shift).


From matsjoyce at  Wed Aug 13 18:17:13 2014
From: matsjoyce at (matsjoyce)
Date: Wed, 13 Aug 2014 17:17:13 +0100
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <>

Unless you remove all the things labelled "keep away from children". I
wrote this sandbox to allow python to be used as a "mods"/"add-ons"
language for a game I'm writing, hence the perhaps too strict nature.

About the crashers: as this is for games, it's "fine" for the game to crash,
as long as the sandbox is not broken while crashing.

time and math can probably be allowed, but random imports a lot of
undesirable modules.

My sandbox doesn't use proxies, due to the introspection and complexity
that it involves. Instead it completely isolates the sandboxed globals, and
checks all arguments and globals for irregularities before passing control
to non-sandboxed functions.

On 13 August 2014 14:11, Isaac Morland <ijmorlan at> wrote:

> On Mon, 11 Aug 2014, Skip Montanaro wrote:
>  On Mon, Aug 11, 2014 at 12:42 PM, matsjoyce <matsjoyce at> wrote:
>>> There may be some holes in my approach, but I can't find them.
>> There's the rub. Given time, I suspect someone will discover a hole or two.
> Schneier's Law:
>         Any person can invent a security system so clever that she or he can't
>         think of how to break it.
> While I would not claim a Python sandbox is utterly impossible, I'm
> suspicious that the whole "consenting adults" approach in Python is
> incompatible with a sandbox.  The whole idea of a sandbox is to absolutely
> prevent people from doing things even if they really want to and know what
> they are doing.
> Isaac Morland                   CSCF Web Guru
> DC 2554C, x36650                WWW Software Specialist

From ronaldoussoren at  Wed Aug 13 16:32:13 2014
From: ronaldoussoren at (Ronald Oussoren)
Date: Wed, 13 Aug 2014 16:32:13 +0200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando>
 <> <>
 <> <-2448384566377912251@unknownmsgid>
Message-ID: <>

On 12 Aug 2014, at 10:02, Armin Rigo <arigo at> wrote:

> Hi all,
> The core of the matter is that if we repeatedly __add__ strings from a
> long list, we get O(n**2) behavior.  From one point of view, the
> reason is that the additions proceed in left-to-right order.  Indeed,
> sum() could proceed in a more balanced tree-like order: from [x0, x1,
> x2, x3, ...], reduce the list to [x0+x1, x2+x3, ...]; then repeat
> until there is only one item in the final list.  This order ensures
> that sum(list_of_strings) is at worst O(n log n).  It might be in
> practice close enough to linear to not matter.  It also improves a
> lot the precision of sum(list_of_floats) (though not reaching the same
> precision levels of math.fsum()).
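
The tree-like order described in the quoted text can be sketched in a few lines. The name pairwise_sum is illustrative, and unlike the built-in this version requires at least one item rather than taking a start value.

```python
def pairwise_sum(items):
    """Add neighbours pairwise, level by level, until one value remains."""
    items = list(items)
    if not items:
        raise ValueError("pairwise_sum() needs at least one item")
    while len(items) > 1:
        # [x0, x1, x2, x3, ...] becomes [x0+x1, x2+x3, ...]
        paired = [items[i] + items[i + 1] for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # an odd leftover is carried to the next level
        items = paired
    return items[0]


print(pairwise_sum(list("abcde")))
print(pairwise_sum([1, 2, 3, 4]))
```

Operands keep their left-to-right order, so the result matches plain concatenation for strings; only the grouping of the additions changes.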

I wonder why nobody has mentioned last year's discussion of the same issue yet:

Maybe someone can write a PEP about this that can be pointed to when the question is discussed again next summer ;-)


From steve at  Wed Aug 13 18:58:39 2014
From: steve at (Steven D'Aprano)
Date: Thu, 14 Aug 2014 02:58:39 +1000
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <20140813165839.GJ4525@ando>

On Thu, Aug 14, 2014 at 02:26:29AM +1000, Chris Angelico wrote:
> On Wed, Aug 13, 2014 at 11:11 PM, Isaac Morland <ijmorlan at> wrote:
> > While I would not claim a Python sandbox is utterly impossible, I'm
> > suspicious that the whole "consenting adults" approach in Python is
> > incompatible with a sandbox.  The whole idea of a sandbox is to absolutely
> > prevent people from doing things even if they really want to and know what
> > they are doing.

The point of a sandbox is that I, the consenting adult writing the 
application in the first place, may want to allow *untrusted others* to 
call Python code without giving them control of the entire application. 
The consenting adults rule applies to me, the application writer, not 
them, the end-users, even if they happen to be writing Python code. If 
they want unrestricted access to the Python interpreter, they can run 
their code on their own machine, not mine.

> It's certainly not *fundamentally* impossible to sandbox Python.
> However, the question becomes one of how much effort you're going to
> go to and how much you're going to restrict the code.

I believe that PyPy has an effective sandbox, but to what degree of 
effectiveness I don't know.

I've had rogue Javascript crash my browser or make my entire computer 
effectively unusable often enough that I am skeptical about claims that 
Javascript in the browser is effectively sandboxed, so I'm doubly 
cautious about Python.


From rosuav at  Wed Aug 13 19:06:01 2014
From: rosuav at (Chris Angelico)
Date: Thu, 14 Aug 2014 03:06:01 +1000
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <20140813165839.GJ4525@ando>
References: <>
Message-ID: <>

On Thu, Aug 14, 2014 at 2:58 AM, Steven D'Aprano <steve at> wrote:
>> It's certainly not *fundamentally* impossible to sandbox Python.
>> However, the question becomes one of how much effort you're going to
>> go to and how much you're going to restrict the code.
> I believe that PyPy has an effective sandbox, but to what degree of
> effectiveness I don't know.

A potential attacker can have arbitrary code run in the subprocess,
but cannot actually do any input/output not controlled by the outer
process. Additional barriers are put to limit the amount of RAM and
CPU time used.

Note that this is very different from sandboxing at the Python
language level, i.e. placing restrictions on what kind of Python code
the attacker is allowed to run (why? read about pysandbox).

That's quite useful, but isn't the same thing as a Python-in-Python
sandbox (or even what I was doing, Python-in-C++).


From yoavglazner at  Wed Aug 13 19:08:51 2014
From: yoavglazner at (yoav glazner)
Date: Wed, 13 Aug 2014 20:08:51 +0300
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

On Aug 13, 2014 7:04 PM, "Akira Li" <4kir4.1i at> wrote:
> Nick Coghlan <ncoghlan at> writes:
> > On 12 August 2014 22:15, Steven D'Aprano <steve at> wrote:
> >> Compare the natural way of writing this:
> >>
> >> with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese:
> >>     # do stuff with spam, eggs, cheese
> >>
> >> versus the dynamic way:
> >>
> >> with ExitStack() as stack:
> >>     spam, eggs = [stack.enter_context(open(fname, mode)) for fname, mode in
> >>                   zip(("spam", "eggs"), ("r", "w"))]
> >>     cheese = stack.enter_context(frobulate("cheese"))
> >>     # do stuff with spam, eggs, cheese
> >
> > You wouldn't necessarily switch at three. At only three, you have lots
> > of options, including multiple nested with statements:
> >
> >     with open("spam") as spam:
> >         with open("eggs", "w") as eggs:
> >             with frobulate("cheese") as cheese:
> >                 # do stuff with spam, eggs, cheese
> >
> > The "multiple context managers in one with statement" form is there
> > *solely* to save indentation levels, and overuse can often be a sign
> > that you may have a custom context manager trying to get out:
> >
> >     @contextlib.contextmanager
> >     def dish(spam_file, egg_file, topping):
> >         with open(spam_file), open(egg_file, 'w'), frobulate(topping):
> >             yield
> >
> >     with dish("spam", "eggs", "cheese") as spam, eggs, cheese:
> >         # do stuff with spam, eggs & cheese
> >
> > ExitStack is mostly useful as a tool for writing flexible custom
> > context managers, and for dealing with context managers in cases where
> > lexical scoping doesn't necessarily work, rather than being something
> > you'd regularly use for inline code.
> >
> > "Why do I have so many contexts open at once in this function?" is a
> > question developers should ask themselves in the same way it's worth
> > asking "why do I have so many local variables in this function?"
> Multiline with-statement can be useful even with *two* context
> managers. Two is not many.
> Saving indentation levels alone is a worthy goal. It can affect
> readability and the perceived complexity of the code.
> Here's how I'd like the code to look:
>   with (open('input filename') as input_file,
>         open('output filename', 'w') as output_file):
>       # code with list comprehensions to transform input file into output
> Even one additional unnecessary indentation level may force to split
> list comprehensions into several lines (less readable) and/or use
> shorter names (less readable). Or it may force to move the inline code
> into a separate named function prematurely, solely to preserve the
> indentation level (also may be less readable) i.e.,
>   with ... as input_file:
>       with ... as output_file:
>           ... #XXX indentation level is lost for no reason
>   with ... as infile, ... as outfile: #XXX shorter names
>       ...
>   with ... as input_file:
>       with ... as output_file:
>           transform(input_file, output_file) #XXX unnecessary function
> And (nested() can be implemented using ExitStack):
>   with nested(open(..),
>               open(..)) as (input_file, output_file):
>       ... #XXX less readable
> Here's an example where nested() won't help:
>   def get_integers(filename):
>       with (open(filename, 'rb', 0) as file,
>             mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as
>           for match in re.finditer(br'\d+', mmapped_file):
>               yield int(
> Here's another:
>   with (open('log'+'some expression that generates filename', 'a') as
>         redirect_stdout(logfile)):
>       ...
Just a thought, would it be a bit weird that:
with (a as b, c as d): "works"
with (a, c): "boom"
with(a as b, c): ?

> --
> Akira
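
The options discussed in this thread can be compared in a runnable sketch. It assumes a hypothetical resource() context manager standing in for open() and frobulate(), and the dish() helper here yields the entered objects as a tuple so that they can be unpacked in the with statement.

```python
from contextlib import ExitStack, contextmanager

events = []


@contextmanager
def resource(name):
    """Hypothetical stand-in for open() or frobulate(); logs enter and exit."""
    events.append("enter " + name)
    try:
        yield name
    finally:
        events.append("exit " + name)


@contextmanager
def dish(*names):
    """Bundle several managers into one and yield what they produced."""
    with ExitStack() as stack:
        yield tuple([stack.enter_context(resource(n)) for n in names])


# Static form: managers are entered left to right, exited right to left.
with resource("spam") as spam, resource("eggs") as eggs:
    events.append("body")

# Bundled form: one indentation level however many managers there are.
with dish("spam", "eggs", "cheese") as (spam, eggs, cheese):
    events.append("body " + cheese)

print(events)
```

ExitStack unwinds in reverse order of entry, so the bundled form keeps the same cleanup order as the equivalent nested with statements.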

From ijmorlan at  Wed Aug 13 19:11:23 2014
From: ijmorlan at (Isaac Morland)
Date: Wed, 13 Aug 2014 13:11:23 -0400 (EDT)
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <20140813165839.GJ4525@ando>
References: <>
Message-ID: <>

On Thu, 14 Aug 2014, Steven D'Aprano wrote:

> On Thu, Aug 14, 2014 at 02:26:29AM +1000, Chris Angelico wrote:
>> On Wed, Aug 13, 2014 at 11:11 PM, Isaac Morland <ijmorlan at> wrote:
>>> While I would not claim a Python sandbox is utterly impossible, I'm
>>> suspicious that the whole "consenting adults" approach in Python is
>>> incompatible with a sandbox.  The whole idea of a sandbox is to absolutely
>>> prevent people from doing things even if they really want to and know what
>>> they are doing.
> The point of a sandbox is that I, the consenting adult writing the
> application in the first place, may want to allow *untrusted others* to
> call Python code without giving them control of the entire application.
> The consenting adults rule applies to me, the application writer, not
> them, the end-users, even if they happen to be writing Python code. If
> they want unrestricted access to the Python interpreter, they can run
> their code on their own machine, not mine.

Yes, absolutely, and I didn't mean to contradict what you are saying. 
What I am suggesting is that the basic design of Python isn't a good 
starting point for imposing mandatory restrictions on what code can do. 
By contrast, take something like Safe Haskell.  I'm not absolutely certain 
that it really is safe as promised, but it's starting from a very 
different language in which the compiler performs extremely sophisticated 
type checking and simply won't compile programs that don't work within the 
type system.

This isn't a knock on Python (which I love using, by the way), just being 
realistic about what the existing language is likely to be able to 
support.  Having said that, I'll be very interested if somebody does come 
up with a restricted mode Python that is widely accepted as being secure - 
that would be a real achievement.

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist

From steve at  Wed Aug 13 19:32:26 2014
From: steve at (Steven D'Aprano)
Date: Thu, 14 Aug 2014 03:32:26 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <20140813173225.GL4525@ando>

On Wed, Aug 13, 2014 at 08:08:51PM +0300, yoav glazner wrote:
> Just a thought, would it be a bit weird that:
> with (a as b, c as d): "works"
> with (a, c): "boom"
> with(a as b, c): ?

If this proposal is accepted, there is no need for the "boom". The 
syntax should allow:

# Without parens, limited to a single line.
with a [as name], b [as name], c [as name], ...:

# With parens, not limited to a single line.
with (a [as name],
      b [as name],
      c [as name],
      ...):

where the "as name" part is always optional. In both these cases, 
whether there are parens or not, it will be interpreted as a series of 
context managers and never as a single tuple.

Note two things:

(1) this means that even in the unlikely event that tuples become 
context managers in the future, you won't be able to use a tuple literal:

    with (1, 2, 3):  # won't work as expected

    t = (1, 2, 3)
    with t:  # will work as expected

But I cannot imagine any circumstances where tuples will become context
managers.

(2) Also note that *this is already the case*, since tuples are made by 
the commas, not the parentheses. E.g. this succeeds:

# Not a tuple, actually two context managers.
with open("/tmp/foo"), open("/tmp/bar", "w"):
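
Point (2) can be checked with a small sketch. The cm() helper is hypothetical and only records when it is entered; the exact exception type raised for a tuple is assumed to vary between Python versions, so both candidates are caught.

```python
from contextlib import contextmanager

entered = []


@contextmanager
def cm(name):
    """Hypothetical context manager that records when it is entered."""
    entered.append(name)
    yield name


# Commas in a with statement separate context managers; no tuple is built.
with cm("a"), cm("b") as b:
    pass

# A tuple object has no __enter__/__exit__, so it cannot be used directly.
t = (cm("c"), cm("d"))
try:
    with t:
        pass
except (AttributeError, TypeError) as exc:
    tuple_error = type(exc).__name__

print(entered)
print(tuple_error)
```

Only "a" and "b" are recorded, because the managers stored in the tuple are created but never entered.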


From tjreedy at  Wed Aug 13 20:11:07 2014
From: tjreedy at (Terry Reedy)
Date: Wed, 13 Aug 2014 14:11:07 -0400
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <lsg9oj$4u0$>

On 8/13/2014 12:19 PM, matsjoyce wrote:
> Unless you remove all the things labelled "keep away from children". I wrote
> this sandbox to allow python to be used as a "mods"/"add-ons" language for a
> game I'm writing, hence the perhaps too strict nature.
> About the crashers: as this is for games, its "fine" for the game to crash,
> as long as the sandbox is not broken while crashing.
> time and math can probably be allowed, but random imports a lot of
> undesirable modules.
> My sandbox doesn't use proxies, due to the introspection and complexity that
> it involves. Instead it completely isolates the sandboxed globals, and checks
> all arguments and globals for irregularities before passing control to non-
> sandboxed functions.

pydev is mainly for discussion of maintaining current versions and 
development of the next, and for discussion of PEPs which might apply to 
the one after next.

This discussion should be on python-list or perhaps python-ideas if 
there is a semi-concrete proposal for a future python.

Terry Jan Reedy

From victor.stinner at  Wed Aug 13 23:25:43 2014
From: victor.stinner at (Victor Stinner)
Date: Wed, 13 Aug 2014 23:25:43 +0200
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <20140813165839.GJ4525@ando>
References: <>
Message-ID: <>


I heard that PyPy sandbox cannot be used out of the box. You have to write
a policy to allow syscalls. The complexity is moved to this policy which is
very hard to write, especially if you only use whitelists.

Correct me if I'm wrong. To be honest, I have never taken a look at this
sandbox.


On Wednesday, 13 August 2014, Steven D'Aprano <steve at> wrote:

> On Thu, Aug 14, 2014 at 02:26:29AM +1000, Chris Angelico wrote:
> > On Wed, Aug 13, 2014 at 11:11 PM, Isaac Morland <ijmorlan at> wrote:
> > > While I would not claim a Python sandbox is utterly impossible, I'm
> > > suspicious that the whole "consenting adults" approach in Python is
> > > incompatible with a sandbox.  The whole idea of a sandbox is to
> > > absolutely prevent people from doing things even if they really want
> > > to and know what they are doing.
> The point of a sandbox is that I, the consenting adult writing the
> application in the first place, may want to allow *untrusted others* to
> call Python code without giving them control of the entire application.
> The consenting adults rule applies to me, the application writer, not
> them, the end-users, even if they happen to be writing Python code. If
> they want unrestricted access to the Python interpreter, they can run
> their code on their own machine, not mine.
> > It's certainly not *fundamentally* impossible to sandbox Python.
> > However, the question becomes one of how much effort you're going to
> > go to and how much you're going to restrict the code.
> I believe that PyPy has an effective sandbox, but to what degree of
> effectiveness I don't know.
> I've had rogue Javascript crash my browser or make my entire computer
> effectively unusable often enough that I am skeptical about claims that
> Javascript in the browser is effectively sandboxed, so I'm doubly
> cautious about Python.
> --
> Steven

From chris.barker at  Thu Aug 14 02:10:34 2014
From: chris.barker at (Chris Barker)
Date: Wed, 13 Aug 2014 17:10:34 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <>
References: <>
 <> <>
 <> <-2448384566377912251@unknownmsgid>
 <2076096455819154683@unknownmsgid> <>
 <> <>
Message-ID: <>

On Tue, Aug 12, 2014 at 11:21 PM, Stephen J. Turnbull <stephen at> wrote:

> Redirecting to python-ideas, so trimming less than I might.

reasonable enough -- you are introducing some more significant ideas for

I've said all I have to say about this -- I don't seem to see anything
encouraging from core devs, so I guess that's it.

Thanks for the fun bike-shedding...


> Chris Barker writes:
>  > On Mon, Aug 11, 2014 at 11:07 PM, Stephen J. Turnbull <stephen at>
>  > wrote:
>  >
>  > > I'm referring to removing the unnecessary information that there's a
>  > >  better way to do it, and simply raising an error (as in Python 3.2,
>  > > say) which is all a RealProgrammer[tm] should ever need!
>  > >
>  >
>  > I can't imagine anyone is suggesting that -- disallow it, but don't tell
>  > anyone why?
> As I said, it's a regression.  That's exactly the behavior in Python 3.2.
>  > The only thing that is remotely on the table here is:
>  >
>  > 1) remove the special case for strings -- buyer beware -- but consistent
>  > and less "ugly"
> It's only consistent if you believe that Python has strict rules for
> use of various operators.  It doesn't, except as far as they are
> constrained by precedence.  For example, I have an application where I
> add bytestrings bytewise modulo N <= 256, and concatenate them.  In
> fact I use function call syntax, but the obvious operator syntax is
> '+' for the bytewise addition, and '*' for the concatenation.
> It's not in the Zen, but I believe in the maxim "If it's worth doing,
> it's worth doing well."  So for me, 1) is out anyway.
>  > 2) add a special case for strings that is fast and efficient -- may be
>  > as simple as calling "".join() under the hood -- no more code than the
>  > exception check.
> Sure, but what about all the other immutable containers with __add__
> methods?  What about mappings with key-wise __add__ methods whose
> values might be immutable but have __add__ methods?  Where do you stop
> with the special-casing?  I consider this far more complex and ugly
> than the simple "sum() is for numbers" rule (and even that is way too
> complex considering accuracy of summing floats).
>  > And I doubt anyone really is pushing for anything but (2)
> I know that, but I think it's the wrong solution to the problem (which
> is genuine IMO).  The right solution is something generic, possibly a
> __sum__ method.  The question is whether that leads to too much work
> to be worth it (eg, "homogeneous_iterable").
>  > > Because obviously we'd want the attractive nuisance of "if you
>  > > have __add__, there's a default definition of __sum__"
>  >
>  > now I'm confused -- isn't that exactly what we have now?
> Yes and my feeling (backed up by arguments that I admit may persuade
> nobody but myself) is that what we have now kinda sucks[tm].  It
> seemed like a good idea when I first saw it, but then, my apps don't
> scale to where the pain starts in my own usage.
>  > > It's possible that Python could provide some kind of feature that
>  > > would allow an optimized sum function for every type that has
>  > > __add__, but I think this will take a lot of thinking.
>  >
>  > does it need to be every type? As it is the common ones work fine
>  > already except for strings -- so if we add an optimized string sum()
>  > then we're done.
> I didn't say provide an optimized sum(), I said provide a feature
> enabling people who want to optimize sum() to do so.  So yes, it needs
> to be every type (the optional __sum__ method is a proof of concept,
> modulo it actually being implementable ;-).
>  > > *Somebody* will do it (I don't think anybody is +1 on restricting
>  > > sum() to a subset of types with __add__).
>  >
>  > uhm, that's exactly what we have now
> Exactly.  Who's arguing that the sum() we have now is a ticket to
> Paradise?  I'm just saying that there's probably somebody out there
> negative enough on the current situation to come up with an answer
> that I think is general enough (and I suspect that python-dev
> consensus is that demanding, too).
>  > sum() can be used for any type that has an __add__ defined.
> I'd like to see that be mutable types with __iadd__.
>  > What I fail to see is why it's better to raise an exception and
>  > point users to a better way, than to simply provide an optimization
>  > so that it's a moot issue.
> Because inefficient sum() is an attractive nuisance, easy to overlook,
> and likely to bite users other than the author.
>  > The only justification offered here is that will teach people that
>  > summing strings (and some other objects?)
> Summing tuples works (with appropriate start=tuple()).  Haven't
> benchmarked, but I bet that's O(N^2).
>  > is order(N^2) and a bad idea. But:
>  >
>  > a) Python's primary purpose is practical, not pedagogical (not that it
>  > isn't great for that)
> My argument is that in practical use sum() is a bad idea, period,
> until you book up on the types and applications where it *does* work.
> N.B. It doesn't even work properly for numbers (inaccurate for floats).
>  > b) I doubt any naive users learn anything other than "I can't use sum()
>  > for strings, I should use "".join()".
> For people who think that special-casing strings is a good idea, I
> think this is about as much benefit as you can expect.  Why go
> farther?<0.5 wink/>
>  > I submit that no naive user is going to get any closer to a proper
>  > understanding of algorithmic Order behavior from this small hint. Which
>  > leaves no reason to prefer an Exception to an optimization.
> TOOWTDI.  str.join is in pretty much every code base by now, and
> tutorials and FAQs recommending its use and severely deprecating sum
> for strings are legion.
>  > One other point: perhaps this will lead a naive user into thinking --
>  > "sum() raises an exception if I try to use it inefficiently, so it must
>  > be OK to use for anything that doesn't raise an exception" -- that would
>  > be a bad lesson to mis-learn....
> That assumes they know about the start argument.  I think most naive
> users will just try to sum a bunch of tuples, and get the "can't add
> 0, tuple" Exception and write a loop.  I suspect that many of the
> users who get the "use str.join" warning along with the Exception are
> unaware of the start argument, too.  They expect sum(iter_of_str) to
> magically add the strings.  Ie, when in 3.2 they got the
> uninformative "can't add 0, str" message, they did not immediately go
> "d'oh" and insert ", start=''" in the call to sum, they wrote a loop.
>  > while we are at it, having the default sum() for floats be fsum()
>  > would be nice
> How do you propose to implement that, given math.fsum is perfectly
> happy to sum integers?  You can't just check one or a few leading
> elements for floatiness.  I think you have to dispatch on type(start),
> but then sum(iter_of_floats) DTWT.  So I would suggest changing the
> signature to sum(it, start=0.0).  This would probably be acceptable to
> most users with iterables of ints, but does imply some performance hit.
>  > This does turn sum() into a function that does type-based dispatch,
>  > but isn't python full of those already? do something special for
>  > the types you know about, call the generic dunder method for the
>  > rest.
> AFAIK Python is moving in the opposite direction: if there's a common
> need for dispatching to type-specific implementations of a method,
> define a standard (not "generic") dunder for the purpose, and have the
> builtin (or operator, or whatever) look up (not "call") the
> appropriate instance in the usual way, then call it.  If there's a
> useful generic implementation, define an ABC to inherit from that
> provides that generic implementation.
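
The behaviours argued over in this exchange can be observed directly. The sketch below only exercises the existing builtin sum(), str.join and math.fsum; the exact wording of the error messages is an assumption that may differ between versions, so only their general content is relied on.

```python
import math

# sum() refuses a str start value and points at ''.join() instead.
try:
    sum(["a", "b", "c"], "")
except TypeError as exc:
    str_message = str(exc)

# Without a start value the implicit 0 cannot be added to a tuple.
try:
    sum([(1,), (2,)])
except TypeError as exc:
    tuple_message = str(exc)

# With an explicit start, tuples are summed by repeated concatenation.
tuples = sum([(1,), (2,), (3,)], ())

# math.fsum tracks partial sums to avoid accumulated rounding error.
exact = math.fsum([0.1] * 10)

print(str_message)
print(tuple_message)
print(tuples, exact)
```

The tuple case is the one that silently works but copies the growing result on every addition, which is the quadratic behaviour the thread calls an attractive nuisance.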


Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From storchaka at  Thu Aug 14 07:46:50 2014
From: storchaka at (Serhiy Storchaka)
Date: Thu, 14 Aug 2014 08:46:50 +0300
Subject: [Python-Dev] Documenting enum types
Message-ID: <lshig1$k8v$>

Should new enum types added recently to collect module constants be 
documented at all? For example AddressFamily is absent in socket.__all__ 


From ncoghlan at  Thu Aug 14 09:48:58 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 14 Aug 2014 17:48:58 +1000
Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: <>
References: <>
Message-ID: <>

On 14 August 2014 07:25, Victor Stinner <victor.stinner at> wrote:
> Hi,
> I heard that PyPy sandbox cannot be used out of the box. You have to write a
> policy to allow syscalls. The complexity is moved to this policy which is
> very hard to write, especially if you only use whitelists.
> Correct me if I'm wrong. To be honest, I never take a look at this sandbox.

By default, the PyPy sandbox requires all system access to be proxied
through the host application (which is running in a separate process).
Similarly, using "sandbox" on Fedora (et al) will get you a default
deny OS level sandbox, where you have to provide selective access to
things outside the box.

The effective decision taken when rexec and Bastion were removed from
the standard library was "sandboxing is hard enough for operating
systems to get right, we're not going to try to tackle the even harder
problem of an in-process sandbox".

"Deny all" sandboxes are relatively easy, but also relatively useless.
It's "allow these activities, but no others" that's difficult, since
any kind of access can often be leveraged into greater access than was


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From victor.stinner at  Thu Aug 14 11:25:06 2014
From: victor.stinner at (Victor Stinner)
Date: Thu, 14 Aug 2014 11:25:06 +0200
Subject: [Python-Dev] Documenting enum types
In-Reply-To: <lshig1$k8v$>
References: <lshig1$k8v$>
Message-ID: <>


IMO we should not document enum types because Python implementations other
than CPython may want to implement them differently (e.g. not all Python
implementations have an enum module currently). In my experience, exposing
too many things in the public API becomes a problem later when you want to
modify the code.

On 14 Aug 2014 07:47, "Serhiy Storchaka" <storchaka at> wrote:

> Should new enum types added recently to collect module constants be
> documented at all? For example AddressFamily is absent in socket.__all__
> [1].
> [1]

From ncoghlan at  Thu Aug 14 13:52:57 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 14 Aug 2014 21:52:57 +1000
Subject: [Python-Dev] Documenting enum types
In-Reply-To: <>
References: <lshig1$k8v$>
Message-ID: <>

On 14 August 2014 19:25, Victor Stinner <victor.stinner at> wrote:
> Hi,
> IMO we should not document enum types because Python implementations other
> than CPython may want to implement them differently (ex: not all Python
> implementations have an enum module currently). By experience, exposing too
> many things in the public API becomes a problem later when you want to
> modify the code.

Implementations claiming conformance with Python 3.4 will have to have
an enum module - there just aren't any of those other than CPython at
this point (I expect PyPy3 will catch up before too long, since the
changes between 3.2 and 3.4 shouldn't be too dramatic from an
implementation perspective).

In this particular case, though, I think the relevant question is "Why
are they enums?" and the answer is "for the better representations".
I'm not clear on the use case for exposing and documenting the enum
types themselves (although I don't have any real objection either).


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia
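
What users actually encounter can be seen by inspecting one of the converted constants. This sketch assumes Python 3.4 or later, where socket.AF_INET is a member of socket.AddressFamily; whether that type is listed in socket.__all__ is printed rather than assumed, since it depends on the version.

```python
import enum
import socket

family = socket.AF_INET

# The constant is a member of an IntEnum, so it still behaves as an int.
print(type(family) is socket.AddressFamily)
print(isinstance(family, enum.IntEnum), isinstance(family, int))

# The enum supplies a symbolic name alongside the numeric value.
print(family.name, int(family))

# Whether the type is listed in __all__ depends on the Python version.
print("AddressFamily" in socket.__all__)
```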

From guido at  Thu Aug 14 17:42:00 2014
From: guido at (Guido van Rossum)
Date: Thu, 14 Aug 2014 08:42:00 -0700
Subject: [Python-Dev] Documenting enum types
In-Reply-To: <>
References: <lshig1$k8v$>
Message-ID: <>

The enemy must be documented and exported, since users will encounter them.
On Aug 14, 2014 4:54 AM, "Nick Coghlan" <ncoghlan at> wrote:

> On 14 August 2014 19:25, Victor Stinner <victor.stinner at> wrote:
> > Hi,
> >
> > IMO we should not document enum types because Python implementations
> > other than CPython may want to implement them differently (ex: not all
> > Python implementations have an enum module currently). By experience,
> > exposing too many things in the public API becomes a problem later when
> > you want to modify the code.
> Implementations claiming conformance with Python 3.4 will have to have
> an enum module - there just aren't any of those other than CPython at
> this point (I expect PyPy3 will catch up before too long, since the
> changes between 3.2 and 3.4 shouldn't be too dramatic from an
> implementation perspective).
> In this particular case, though, I think the relevant question is "Why
> are they enums?" and the answer is "for the better representations".
> I'm not clear on the use case for exposing and documenting the enum
> types themselves (although I don't have any real objection either).
> Regards,
> Nick.
> --
> Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From benhoyt at  Thu Aug 14 17:51:59 2014
From: benhoyt at (Ben Hoyt)
Date: Thu, 14 Aug 2014 11:51:59 -0400
Subject: [Python-Dev] Documenting enum types
In-Reply-To: <>
References: <lshig1$k8v$>
Message-ID: <>

> The enemy must be documented and exported, since users will encounter them.

enum == enemy? Is that you, Raymond? ;-)


From ethan at  Thu Aug 14 18:14:38 2014
From: ethan at (Ethan Furman)
Date: Thu, 14 Aug 2014 09:14:38 -0700
Subject: [Python-Dev] Documenting enum types
In-Reply-To: <>
References: <lshig1$k8v$>
Message-ID: <>

On 08/14/2014 08:51 AM, Ben Hoyt wrote:
>> The enemy must be documented and exported, since users will encounter them.
> enum == enemy? Is that you, Raymond? ;-)

ROFL!  Thanks, I needed that!



From breamoreboy at  Thu Aug 14 19:24:45 2014
From: breamoreboy at (Mark Lawrence)
Date: Thu, 14 Aug 2014 18:24:45 +0100
Subject: [Python-Dev] Documenting enum types
In-Reply-To: <>
References: <lshig1$k8v$>
Message-ID: <lsircv$bq2$>

On 14/08/2014 17:14, Ethan Furman wrote:
> On 08/14/2014 08:51 AM, Ben Hoyt wrote:

The BDFL actually wrote:-

>>> The enemy must be documented and exported, since users will encounter
>>> them.


>> enum == enemy? Is that you, Raymond? ;-)
> ROFL!  Thanks, I needed that!
> :D
> --
> ~Ethan~

I'll be seeing the PSF in court, on the grounds that I've just bust a 
gut laughing :)

My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

From ncoghlan at  Fri Aug 15 07:50:25 2014
From: ncoghlan at (Nick Coghlan)
Date: Fri, 15 Aug 2014 15:50:25 +1000
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray
Message-ID: <>

I just posted an updated version of PEP 467 after recently finishing
the updates to the Python 3.4+ binary sequence docs to decouple them
from the str docs.

Key points in the proposal:

* deprecate passing integers to bytes() and bytearray()
* add bytes.zeros() and bytearray.zeros() as a replacement
* add bytes.byte() and bytearray.byte() as counterparts to ord() for binary data
* add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes()

As far as I am aware, that last item poses the only open question,
with the alternative being to add an "iterbytes" builtin with a
definition along the lines of the following:

    def iterbytes(data):
        try:
            getiter = type(data).__iterbytes__
        except AttributeError:
            iter = map(bytes.byte, data)
        else:
            iter = getiter(data)
        return iter
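
Since neither bytes.byte() nor __iterbytes__ exists in current Python, the definition above cannot be run as written. A runnable approximation, assuming a hypothetical module-level byte() helper in place of the proposed bytes.byte(), might look like this:

```python
def byte(value):
    """Stand-in for the proposed bytes.byte(): int in range(256) -> bytes."""
    return bytes([value])


def iterbytes(data):
    """Iterate over data as length 1 bytes objects."""
    try:
        # __iterbytes__ is the hypothetical hook from the proposal; no
        # current type defines it, so the fallback is what normally runs.
        getiter = type(data).__iterbytes__
    except AttributeError:
        return map(byte, data)
    else:
        return getiter(data)


print(list(iterbytes(b"abc")))
print(list(iterbytes(memoryview(bytearray(b"\x00\xff")))))
```

Looking the hook up on the type rather than the instance mirrors how other special methods are resolved.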



Full PEP text:
PEP: 467
Title: Minor API improvements for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan at>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30 2014-08-15


During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series.

This PEP proposes a number of small adjustments to the APIs of the ``bytes``
and ``bytearray`` types to make it easier to operate entirely in the binary
domain.


To simplify the task of writing the Python 3 documentation, the ``bytes``
and ``bytearray`` types were documented primarily in terms of the way they
differed from the Unicode based Python 3 ``str`` type. Even when I
`heavily revised the sequence documentation
<>`__ in 2012, I retained that
simplifying shortcut.

However, it turns out that this approach to the documentation of these types
had a problem: it doesn't adequately introduce users to their hybrid nature,
where they can be manipulated *either* as a "sequence of integers" type,
*or* as ``str``-like types that assume ASCII compatible data.

That oversight has now been corrected, with the binary sequence types now
being documented entirely independently of the ``str`` documentation in
`Python 3.4+ <>`__

The confusion isn't just a documentation issue, however, as there are also
some lingering design quirks from an earlier pre-release design where there
was *no* separate ``bytearray`` type, and instead the core ``bytes`` type
was mutable (with no immutable counterpart).

Finally, additional experience with using the existing Python 3 binary
sequence types in real world applications has suggested it would be
beneficial to make it easier to convert integers to length 1 bytes objects.


As a "consistency improvement" proposal, this PEP is actually about a few
smaller micro-proposals, each aimed at improving the usability of the binary
data model in Python 3. Proposals are motivated by one of two main factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* allowing users to easily convert integer values to a length 1 ``bytes``
  object

Alternate Constructors

The ``bytes`` and ``bytearray`` constructors currently accept an integer
argument, but interpret it to mean a zero-filled object of the given length.
This is a legacy of the original design of ``bytes`` as a mutable type,
rather than a particularly intuitive behaviour for users. It has become
especially confusing now that some other ``bytes`` interfaces treat integers
and the corresponding length 1 bytes instances as equivalent input.

Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True

    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the bytes and
bytearray constructors be deprecated in Python 3.5 and targeted for
removal in Python 3.7, being replaced by two more explicit alternate
constructors provided as class methods. The initial python-ideas thread
[ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating
this constructor behaviour.

Firstly, a ``byte`` constructor is proposed that converts integers
in the range 0 to 255 (inclusive) to a ``bytes`` object::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytearray.byte(3)
    bytearray(b'\x03')
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)

One specific use case for this alternate constructor is to easily convert
the result of indexing operations on ``bytes`` and other binary sequences
from an integer to a ``bytes`` object. The documentation for this API
should note that its counterpart for the reverse conversion is ``ord()``.
The ``ord()`` documentation will also be updated to note that while
``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and
``bytearray.byte`` are the counterparts for binary input.
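
The indexing use case can be illustrated with behaviour that already exists
today. This is only a sketch: ``bytes([value])`` is the current spelling of
what the proposed ``bytes.byte(value)`` would do, and the variable names are
purely illustrative.

```python
data = b"\x01\x02\x03"

# Indexing a bytes object produces an integer, not a length 1 bytes object.
value = data[2]

# Current spelling of the conversion that the proposed bytes.byte(value)
# would perform.
single = bytes([value])

# ord() accepts a length 1 bytes object and reverses the conversion.
round_trip = ord(single)
```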

Secondly, a ``zeros`` constructor is proposed that serves as a direct
replacement for the current constructor behaviour, rather than having to use
sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytes.zeros(3)
    b'\x00\x00\x00'
    >>> bytearray.zeros(3)
    bytearray(b'\x00\x00\x00')

The chosen name here is taken from the corresponding initialisation function
in NumPy (although, as these are sequence types rather than N-dimensional
matrices, the constructors take a length as input rather than a shape tuple).

While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more
useful duo amongst the new constructors, ``bytes.zeros`` and
``bytearray.byte`` are provided in order to maintain API consistency between
the two types.


While iteration over ``bytes`` objects and other binary sequences produces
integers, it is sometimes desirable to iterate over length 1 bytes objects
instead.

To handle this situation more obviously (and more efficiently) than would be
the case with the ``map(bytes.byte, data)`` construct enabled by the above
constructor changes, this PEP proposes the addition of a new ``iterbytes``
method to ``bytes``, ``bytearray`` and ``memoryview``::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

Third party types and arbitrary containers of integers that lack the new
method can still be handled by combining ``map`` with the new
``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive
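
For comparison, a rough sketch of how the same results can be obtained today
without either proposed API. The variable names are illustrative, and the
slicing idiom only works for sequences that support slicing, not for
arbitrary iterables.

```python
data = b"abc"

# Plain iteration yields integers.
as_ints = list(data)

# Slicing one item at a time yields length 1 bytes objects instead.
as_bytes = [data[i:i + 1] for i in range(len(data))]

# For arbitrary containers of integers, wrap each item in a list first.
from_ints = [bytes([x]) for x in [97, 98, 99]]
```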

Open questions

* The fallback case above suggests that this could perhaps be better handled
  as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()``
  if defined, but otherwise fell back to ``map(bytes.byte, data)``::

    for x in iterbytes(data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive


.. [ideas-thread1]
.. [empty-buffer-issue]
.. [GvR-initial-feedback]


This document has been placed in the public domain.

Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From status at  Fri Aug 15 18:07:43 2014
From: status at (Python tracker)
Date: Fri, 15 Aug 2014 18:07:43 +0200 (CEST)
Subject: [Python-Dev] Summary of Python tracker Issues
Message-ID: <>

ACTIVITY SUMMARY (2014-08-08 - 2014-08-15)
Python tracker at

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4602 ( +0)
  closed 29371 (+31)
  total  33973 (+31)

Open issues with patches: 2175 

Issues opened (23)

#21166: Bus error in pybuilddir.txt 'python -m sysconfigure --generate  reopened by ned.deily

#22176: update internal libffi copy to 3.1, introducing AArch64 and PO  opened by doko

#22177: Incorrect version reported after downgrade  opened by jpe5605

#22179: Focus stays on Search Dialog when text found in editor  opened by BreamoreBoy

#22181: os.urandom() should use Linux 3.17 getrandom() syscall  opened by haypo

#22182: distutils.file_util.move_file unpacks wrongly an exception  opened by Claudiu.Popa

#22185: Occasional RuntimeError from Condition.notify  opened by dougz

#22186: Typos in .py files  opened by iwontbecreative

#22187: commands.mkarg() buggy in East Asian locales  opened by jwilk

#22188: test_gdb fails on invalid gdbinit  opened by lekensteyn

#22189: collections.UserString missing some str methods  opened by ncoghlan

#22191: warnings.__all__ incomplete  opened by pitrou

#22192: dict_values objects are hashable  opened by roippi

#22193: Add _PySys_GetSizeOf()  opened by serhiy.storchaka

#22194: access to cdecimal / libmpdec API  opened by pitrou

#22195: Make it easy to replace print() calls with logging calls  opened by pitrou

#22196: namedtuple documentation could/should mention the new Enum typ  opened by lelit

#22197: Allow better verbosity / output control in test cases  opened by pitrou

#22198: Odd floor-division corner case  opened by mark.dickinson

#22199: 2.7 sysconfig._get_makefile_filename should be sysconfig.get_m  opened by jamercee

#22200: Remove distutils checks for Python version  opened by takluyver

#22201: python -mzipfile fails to unzip files with folders created by  opened by Antony.Lee

#22203: inspect.getargspec() returns wrong spec for builtins  opened by suor

Most recent 15 issues with no replies (15)

#22201: python -mzipfile fails to unzip files with folders created by

#22200: Remove distutils checks for Python version

#22197: Allow better verbosity / output control in test cases

#22196: namedtuple documentation could/should mention the new Enum typ

#22194: access to cdecimal / libmpdec API

#22189: collections.UserString missing some str methods

#22188: test_gdb fails on invalid gdbinit

#22181: os.urandom() should use Linux 3.17 getrandom() syscall

#22179: Focus stays on Search Dialog when text found in editor

#22173: Update lib2to3.tests and test_lib2to3 to use test discovery

#22164: cell object cleared too early?

#22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul

#22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec

#22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy

#22153: There is no standard TestCase.runTest implementation

Most recent 15 issues waiting for review (15)

#22200: Remove distutils checks for Python version

#22199: 2.7 sysconfig._get_makefile_filename should be sysconfig.get_m

#22193: Add _PySys_GetSizeOf()

#22186: Typos in .py files

#22185: Occasional RuntimeError from Condition.notify

#22182: distutils.file_util.move_file unpacks wrongly an exception

#22173: Update lib2to3.tests and test_lib2to3 to use test discovery

#22166: test_codecs "leaking" references

#22165: Empty response from http.server when directory listing contain

#22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul

#22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec

#22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy

#22156: Fix compiler warnings

#22150: deprecated-removed directive is broken in Sphinx 1.2.2

#22149: the frame of a suspended generator should not have a local tra

Top 10 most discussed issues (10)

#19494: urllib2.HTTPBasicAuthHandler (or urllib.request.HTTPBasicAuthH  15 msgs

#15381: Optimize BytesIO to do  less reallocations when written, simil  10 msgs

#22193: Add _PySys_GetSizeOf()   7 msgs

#22118: urljoin fails with messy relative URLs   6 msgs

#12954: Multiprocessing logging under Windows   5 msgs

#18844: allow weights in random.choice   5 msgs

#21448: Email Parser use 100% CPU   5 msgs

#22177: Incorrect version reported after downgrade   5 msgs

#22191: warnings.__all__ incomplete   5 msgs

#22198: Odd floor-division corner case   5 msgs

Issues closed (28)

#14105: Breakpoints in debug lost if line is inserted; IDLE  closed by terry.reedy

#16773: int() half-accepts UserString  closed by serhiy.storchaka

#17923: test glob with trailing slash fail on AIX 6.1  closed by serhiy.storchaka

#18004: test_list.test_overflow crashes Win64  closed by serhiy.storchaka

#19743: test_gdb failures  closed by pitrou

#20101: Determine correct behavior for time functions on Windows  closed by haypo

#20729: mailbox.Mailbox does odd hasattr() check  closed by serhiy.storchaka

#20746: test_pdb fails in refleak mode  closed by pitrou

#21121: -Werror=declaration-after-statement is added even for extensio  closed by python-dev

#21412: Solaris/Oracle Studio: Fatal Python error: PyThreadState_Get w  closed by ned.deily

#21445: Some asserts in test_filecmp have the wrong messages  closed by berker.peksag

#21725: RFC 6531 (SMTPUTF8) support in smtpd  closed by r.david.murray

#21777: Separate out documentation of binary sequence methods  closed by ncoghlan

#22060: Clean up ctypes.test, use unittest test discovery  closed by python-dev

#22065: Update turtledemo menu creation  closed by terry.reedy

#22112: '_UnixSelectorEventLoop' object has no attribute 'create_task'  closed by haypo

#22139: python windows 2.7.8 64-bit did not install  closed by loewis

#22145: <> in parser spec but not lexer spec  closed by rhettinger

#22161: Remove unsupported code from ctypes  closed by serhiy.storchaka

#22174: property doc fixes  closed by rhettinger

#22175: improve test_faulthandler readability with dedent  closed by python-dev

#22178: _winreg.QueryInfoKey Last Modified Time Value Incorrect or Exp  closed by python-dev

#22180: operator.setitem example no longer works in Python 3 due to la  closed by rhettinger

#22183: datetime.timezone methods require datetime object  closed by belopolsky

#22184: lrucache should reject maxsize as a function  closed by rhettinger

#22190: Integrate tracemalloc into regrtest refleak hunting  closed by ncoghlan

#22202: Function Bug?  closed by steven.daprano

#22204: spam  closed by ezio.melotti

From guido at  Fri Aug 15 19:48:58 2014
From: guido at (Guido van Rossum)
Date: Fri, 15 Aug 2014 10:48:58 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

This feels chatty. I'd like the PEP to call out the specific proposals and
put the more verbose motivation later. It took me a long time to realize
that you don't want to deprecate bytes([1, 2, 3]), but only bytes(3). Also
your mention of bytes.byte() as the counterpart to ord() confused me -- I
think it's more similar to chr(). I don't like iterbytes as a builtin,
let's keep it as a method on affected types.

On Thu, Aug 14, 2014 at 10:50 PM, Nick Coghlan <ncoghlan at> wrote:

> I just posted an updated version of PEP 467 after recently finishing
> the updates to the Python 3.4+ binary sequence docs to decouple them
> from the str docs.
> Key points in the proposal:
> * deprecate passing integers to bytes() and bytearray()
> * add bytes.zeros() and bytearray.zeros() as a replacement
> * add bytes.byte() and bytearray.byte() as counterparts to ord() for
> binary data
> * add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes()
> As far as I am aware, that last item poses the only open question,
> with the alternative being to add an "iterbytes" builtin with a
> definition along the lines of the following:
>     def iterbytes(data):
>         try:
>             getiter = type(data).__iterbytes__
>         except AttributeError:
>             iter = map(bytes.byte, data)
>         else:
>             iter = getiter(data)
>         return iter
> Regards,
> Nick.
> Full PEP text:
> =============================
> PEP: 467
> Title: Minor API improvements for bytes and bytearray
> Version: $Revision$
> Last-Modified: $Date$
> Author: Nick Coghlan <ncoghlan at>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 2014-03-30
> Python-Version: 3.5
> Post-History: 2014-03-30 2014-08-15
> Abstract
> ========
> During the initial development of the Python 3 language specification, the
> core ``bytes`` type for arbitrary binary data started as the mutable type
> that is now referred to as ``bytearray``. Other aspects of operating in
> the binary domain in Python have also evolved over the course of the Python
> 3 series.
> This PEP proposes a number of small adjustments to the APIs of the
> ``bytes``
> and ``bytearray`` types to make it easier to operate entirely in the binary
> domain.
> Background
> ==========
> To simplify the task of writing the Python 3 documentation, the ``bytes``
> and ``bytearray`` types were documented primarily in terms of the way they
> differed from the Unicode based Python 3 ``str`` type. Even when I
> `heavily revised the sequence documentation
> <>`__ in 2012, I retained
> that
> simplifying shortcut.
> However, it turns out that this approach to the documentation of these
> types
> had a problem: it doesn't adequately introduce users to their hybrid
> nature,
> where they can be manipulated *either* as a "sequence of integers" type,
> *or* as ``str``-like types that assume ASCII compatible data.
> That oversight has now been corrected, with the binary sequence types now
> being documented entirely independently of the ``str`` documentation in
> `Python 3.4+ <
> >`__
> The confusion isn't just a documentation issue, however, as there are also
> some lingering design quirks from an earlier pre-release design where there
> was *no* separate ``bytearray`` type, and instead the core ``bytes`` type
> was mutable (with no immutable counterpart).
> Finally, additional experience with using the existing Python 3 binary
> sequence types in real world applications has suggested it would be
> beneficial to make it easier to convert integers to length 1 bytes objects.
> Proposals
> =========
> As a "consistency improvement" proposal, this PEP is actually about a few
> smaller micro-proposals, each aimed at improving the usability of the
> binary
> data model in Python 3. Proposals are motivated by one of two main factors:
> * removing remnants of the original design of ``bytes`` as a mutable type
> * allowing users to easily convert integer values to a length 1 ``bytes``
>   object
> Alternate Constructors
> ----------------------
> The ``bytes`` and ``bytearray`` constructors currently accept an integer
> argument, but interpret it to mean a zero-filled object of the given
> length.
> This is a legacy of the original design of ``bytes`` as a mutable type,
> rather than a particularly intuitive behaviour for users. It has become
> especially confusing now that some other ``bytes`` interfaces treat
> integers
> and the corresponding length 1 bytes instances as equivalent input.
> Compare::
>     >>> b"\x03" in bytes([1, 2, 3])
>     True
>     >>> 3 in bytes([1, 2, 3])
>     True
>     >>> bytes(b"\x03")
>     b'\x03'
>     >>> bytes(3)
>     b'\x00\x00\x00'
> This PEP proposes that the current handling of integers in the bytes and
> bytearray constructors be deprecated in Python 3.5 and targeted for
> removal in Python 3.7, being replaced by two more explicit alternate
> constructors provided as class methods. The initial python-ideas thread
> [ideas-thread1]_ that spawned this PEP was specifically aimed at
> deprecating
> this constructor behaviour.
> Firstly, a ``byte`` constructor is proposed that converts integers
> in the range 0 to 255 (inclusive) to a ``bytes`` object::
>     >>> bytes.byte(3)
>     b'\x03'
>     >>> bytearray.byte(3)
>     bytearray(b'\x03')
>     >>> bytes.byte(512)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     ValueError: bytes must be in range(0, 256)
> One specific use case for this alternate constructor is to easily convert
> the result of indexing operations on ``bytes`` and other binary sequences
> from an integer to a ``bytes`` object. The documentation for this API
> should note that its counterpart for the reverse conversion is ``ord()``.
> The ``ord()`` documentation will also be updated to note that while
> ``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and
> ``bytearray.byte`` are the counterparts for binary input.
> Secondly, a ``zeros`` constructor is proposed that serves as a direct
> replacement for the current constructor behaviour, rather than having to
> use
> sequence repetition to achieve the same effect in a less intuitive way::
>     >>> bytes.zeros(3)
>     b'\x00\x00\x00'
>     >>> bytearray.zeros(3)
>     bytearray(b'\x00\x00\x00')
> The chosen name here is taken from the corresponding initialisation
> function
> in NumPy (although, as these are sequence types rather than N-dimensional
> matrices, the constructors take a length as input rather than a shape
> tuple)
> While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more
> useful duo amongst the new constructors, ``bytes.zeros`` and
> ``bytearray.byte`` are provided in order to maintain API consistency between
> the two types.
> Iteration
> ---------
> While iteration over ``bytes`` objects and other binary sequences produces
> integers, it is sometimes desirable to iterate over length 1 bytes objects
> instead.
> To handle this situation more obviously (and more efficiently) than would
> be
> the case with the ``map(bytes.byte, data)`` construct enabled by the above
> constructor changes, this PEP proposes the addition of a new ``iterbytes``
> method to ``bytes``, ``bytearray`` and ``memoryview``::
>     for x in data.iterbytes():
>         # x is a length 1 ``bytes`` object, rather than an integer
> Third party types and arbitrary containers of integers that lack the new
> method can still be handled by combining ``map`` with the new
> ``bytes.byte()`` alternate constructor proposed above::
>     for x in map(bytes.byte, data):
>         # x is a length 1 ``bytes`` object, rather than an integer
>         # This works with *any* container of integers in the range
>         # 0 to 255 inclusive
> Open questions
> ^^^^^^^^^^^^^^
> * The fallback case above suggests that this could perhaps be better
> handled
>   as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()``
>   if defined, but otherwise fell back to ``map(bytes.byte, data)``::
>     for x in iterbytes(data):
>         # x is a length 1 ``bytes`` object, rather than an integer
>         # This works with *any* container of integers in the range
>         # 0 to 255 inclusive
> References
> ==========
> .. [ideas-thread1]
> .. [empty-buffer-issue]
> .. [GvR-initial-feedback]
> Copyright
> =========
> This document has been placed in the public domain.
> --
> Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

--Guido van Rossum (

From storchaka at  Fri Aug 15 21:54:22 2014
From: storchaka at (Serhiy Storchaka)
Date: Fri, 15 Aug 2014 22:54:22 +0300
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <lsloh8$eac$>

15.08.14 08:50, Nick Coghlan wrote:
> * add bytes.zeros() and bytearray.zeros() as a replacement

b'\0' * n and bytearray(b'\0') * n look like good replacements to me. No need
to learn a new method. And it works right now.

> * add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes()

What are the use cases for this? I suppose the main use case may be writing
code compatible with both 2.7 and 3.x. But in that case you need a
wrapper (because these types in 2.7 have no iterbytes() method). And
how large would the advantage of this method be over
``map(bytes.byte, data)``?

From victor.stinner at  Fri Aug 15 21:59:40 2014
From: victor.stinner at (Victor Stinner)
Date: Fri, 15 Aug 2014 21:59:40 +0200
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <lsloh8$eac$>
References: <>
Message-ID: <>

2014-08-15 21:54 GMT+02:00 Serhiy Storchaka <storchaka at>:
> 15.08.14 08:50, Nick Coghlan wrote:
>> * add bytes.zeros() and bytearray.zeros() as a replacement
> b'\0' * n and bytearray(b'\0') * n look good replacements to me. No need to
> learn new method. And it works right now.

FYI there is a pending patch for bytearray(int) to use calloc()
instead of malloc(). It's faster for buffers larger than 1 MB:

I'm not sure that the optimization is really useful.


From victor.stinner at  Fri Aug 15 21:55:46 2014
From: victor.stinner at (Victor Stinner)
Date: Fri, 15 Aug 2014 21:55:46 +0200
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

2014-08-15 7:50 GMT+02:00 Nick Coghlan <ncoghlan at>:
> As far as I am aware, that last item poses the only open question,
> with the alternative being to add an "iterbytes" builtin (...)

Do you have examples of use cases for a builtin function? I only found
5 usages of bytes((byte,)) constructor in the standard library:

$ grep -E 'bytes\(\([^)]+, *\)\)' $(find -name "*.py")
./Lib/            c = bytes((c,))
./Lib/        c = bytes((c,))
./Lib/        b32tab = [bytes((i,)) for i in _b32alphabet]
./Lib/        _a85chars = [bytes((i,)) for i in range(33, 118)]
./Lib/        _b85chars = [bytes((i,)) for i in _b85alphabet]

bytes.iterbytes() can be used in 4 of the 5 cases. Adding a new builtin
for a single line in the whole standard library doesn't look right.


From ethan at  Fri Aug 15 23:03:40 2014
From: ethan at (Ethan Furman)
Date: Fri, 15 Aug 2014 14:03:40 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140813033855.GH4525@ando>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

On 08/12/2014 08:38 PM, Steven D'Aprano wrote:
> [1] Technically not, since it's the comma, not the ( ), which makes a
> tuple, but a lot of people don't know that and treat it as if the
> parens were compulsory.

It might as well be, because if there is a non-tuple way to interpret the comma, that way takes precedence, and then
the parens /are/ required to disambiguate and get the tuple you wanted.


From ethan at  Fri Aug 15 23:08:42 2014
From: ethan at (Ethan Furman)
Date: Fri, 15 Aug 2014 14:08:42 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140813173225.GL4525@ando>
References: <20140811230800.GA12210@gensokyo>
Message-ID: <>

On 08/13/2014 10:32 AM, Steven D'Aprano wrote:
> (2) Also note that *this is already the case*, since tuples are made by
> the commas, not the parentheses. E.g. this succeeds:
> # Not a tuple, actually two context managers.
> with open("/tmp/foo"), open("/tmp/bar", "w"):
>     pass

Thanks for proving my point!  A comma, and yet we did *not* get a tuple from it.


From g.brandl at  Fri Aug 15 23:34:32 2014
From: g.brandl at (Georg Brandl)
Date: Fri, 15 Aug 2014 23:34:32 +0200
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
 <20140813173225.GL4525@ando> <>
Message-ID: <lslud8$kc1$>

On 08/15/2014 11:08 PM, Ethan Furman wrote:
> On 08/13/2014 10:32 AM, Steven D'Aprano wrote:
>> (2) Also note that *this is already the case*, since tuples are made by
>> the commas, not the parentheses. E.g. this succeeds:
>> # Not a tuple, actually two context managers.
>> with open("/tmp/foo"), open("/tmp/bar", "w"):
>>     pass
> Thanks for proving my point!  A comma, and yet we did *not* get a tuple from it.

Clearly the rule is that the comma makes the tuple, except when it doesn't :)


From steve at  Sat Aug 16 05:08:48 2014
From: steve at (Steven D'Aprano)
Date: Sat, 16 Aug 2014 13:08:48 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
 <20140813173225.GL4525@ando> <>
Message-ID: <20140816030847.GD4525@ando>

On Fri, Aug 15, 2014 at 02:08:42PM -0700, Ethan Furman wrote:
> On 08/13/2014 10:32 AM, Steven D'Aprano wrote:
> >
> >(2) Also note that *this is already the case*, since tuples are made by
> >the commas, not the parentheses. E.g. this succeeds:
> >
> ># Not a tuple, actually two context managers.
> >with open("/tmp/foo"), open("/tmp/bar", "w"):
> >    pass
> Thanks for proving my point!  A comma, and yet we did *not* get a tuple 
> from it.

Um, sorry, I don't quite get you. Are you agreeing or disagreeing with 
me? I spent half of yesterday reading the static typing thread over on 
Python-ideas and it's possible my brain has melted down *wink* but I'm 
confused by your response.

Normally when people say "Thanks for proving my point", the implication 
is that the person being thanked (in this case me) has inadvertently 
undercut their own argument. I don't think I have. I'm suggesting that 
the argument *against* the proposal:

    "Multi-line with statements should not be allowed, because:

    with (spam,

    is syntactically a tuple"

is a poor argument (that is, I'm disagreeing with it), since *single* 
line parens-free with statements are already syntactically a tuple:

    with spam, eggs, cheese:  # Commas make a tuple, not parens.

I think the OP's suggestion is a sound one, and while Nick's point that 
bulky with-statements *may* be a sign that some re-factoring is needed, 
there are many things that are a sign that re-factoring is needed and 
I don't think this particular one warrants rejecting what is otherwise 
an obvious and clear way of using multiple context managers.


From ethan at  Sat Aug 16 05:29:09 2014
From: ethan at (Ethan Furman)
Date: Fri, 15 Aug 2014 20:29:09 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140816030847.GD4525@ando>
References: <20140811230800.GA12210@gensokyo>
 <20140813173225.GL4525@ando> <>
Message-ID: <>

On 08/15/2014 08:08 PM, Steven D'Aprano wrote:
> On Fri, Aug 15, 2014 at 02:08:42PM -0700, Ethan Furman wrote:
>> On 08/13/2014 10:32 AM, Steven D'Aprano wrote:
>>> (2) Also note that *this is already the case*, since tuples are made by
>>> the commas, not the parentheses. E.g. this succeeds:
>>> # Not a tuple, actually two context managers.
>>> with open("/tmp/foo"), open("/tmp/bar", "w"):
>>>     pass
>> Thanks for proving my point!  A comma, and yet we did *not* get a tuple
>> from it.
> Um, sorry, I don't quite get you. Are you agreeing or disagreeing with
> me? I spent half of yesterday reading the static typing thread over on
> Python-ideas and it's possible my brain has melted down *wink* but I'm
> confused by your response.

My point is that commas don't always make a tuple, and your example above is a case in point:  we have a comma 
separating two context managers, but we do not have a tuple, and your comment even says so.

> is a poor argument (that is, I'm disagreeing with it), since *single*
> line parens-free with statements are already syntactically a tuple:
>      with spam, eggs, cheese:  # Commas make a tuple, not parens.

This point I do not understand -- commas /can/ create a tuple, but don't /necessarily/ create a tuple.  So, 
semantically: no tuple.  Syntactically: I don't think there's a tuple there this way either.  I suppose one of us should 
look it up in the lexer.  ;)
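
One way to check without reading the grammar file is to ask the standard
``ast`` module how the statement is parsed. This is just a quick sketch, and
the variable names are illustrative:

```python
import ast

# A with statement with two comma-separated context managers.
with_node = ast.parse("with a, b:\n    pass").body[0]

# The statement holds a list of separate items, none of which is a tuple.
item_count = len(with_node.items)
any_tuple = any(isinstance(item.context_expr, ast.Tuple)
                for item in with_node.items)

# By contrast, a bare comma in an assignment does build a tuple.
assign_node = ast.parse("x = a, b").body[0]
is_tuple = isinstance(assign_node.value, ast.Tuple)
```

Here the with statement carries two separate items, while the assignment
value is a single tuple node.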


From ncoghlan at  Sat Aug 16 07:17:35 2014
From: ncoghlan at (Nick Coghlan)
Date: Sat, 16 Aug 2014 15:17:35 +1000
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 16 August 2014 03:48, Guido van Rossum <guido at> wrote:
> This feels chatty. I'd like the PEP to call out the specific proposals and
> put the more verbose motivation later.

I realised that some of that history was actually completely
irrelevant now, so I culled a fair bit of it entirely.

> It took me a long time to realize
> that you don't want to deprecate bytes([1, 2, 3]), but only bytes(3).

I've split out the four subproposals into their own sections, so
hopefully this is clearer now.

> Also
> your mention of bytes.byte() as the counterpart to ord() confused me -- I
> think it's more similar to chr().

This was just a case of me using the wrong word - I meant "inverse"
rather than "counterpart".

> I don't like iterbytes as a builtin, let's
> keep it as a method on affected types.

Done. I also added an explanation of the benefits it offers over the
more generic "map(bytes.byte, data)", as well as more precise
semantics for how it will work with memoryview objects.

New draft is live at, as well
as being included inline below.



PEP: 467
Title: Minor API improvements for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan at>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30 2014-08-15 2014-08-16


During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series.

This PEP proposes four small adjustments to the APIs of the ``bytes``,
``bytearray`` and ``memoryview`` types to make it easier to operate entirely
in the binary domain:

* Deprecate passing single integer values to ``bytes`` and ``bytearray``
* Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors
* Add ``bytes.byte`` and ``bytearray.byte`` alternative constructors
* Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and
  ``memoryview.iterbytes`` alternative iterators


Deprecation of current "zero-initialised sequence" behaviour

Currently, the ``bytes`` and ``bytearray`` constructors accept an integer
argument and interpret it as meaning to create a zero-initialised sequence
of the given size::

    >>> bytes(3)
    b'\x00\x00\x00'
    >>> bytearray(3)
    bytearray(b'\x00\x00\x00')

This PEP proposes to deprecate that behaviour in Python 3.5, and remove it
entirely in Python 3.6.

No other changes are proposed to the existing constructors.

Addition of explicit "zero-initialised sequence" constructors

To replace the deprecated behaviour, this PEP proposes the addition of an
explicit ``zeros`` alternative constructor as a class method on both
``bytes`` and ``bytearray``::

    >>> bytes.zeros(3)
    b'\x00\x00\x00'
    >>> bytearray.zeros(3)
    bytearray(b'\x00\x00\x00')

It will behave just as the current constructors behave when passed a single
integer.

The specific choice of ``zeros`` as the alternative constructor name is taken
from the corresponding initialisation function in NumPy (although, as these
are 1-dimensional sequence types rather than N-dimensional matrices, the
constructors take a length as input rather than a shape tuple).
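
As a rough illustration of the intended semantics, the proposed class method
could be approximated in pure Python as below. The ``zeros`` helper is a
hypothetical stand-in written as a plain function, not the proposed
implementation, and the negative-length check is an assumption made to mirror
what the current integer constructor does.

```python
def zeros(cls, length):
    """Hypothetical stand-in for the proposed cls.zeros(length)."""
    if length < 0:
        # Assumption: mirror the current constructor, which rejects
        # negative sizes instead of silently returning an empty object.
        raise ValueError("negative count")
    # Sequence repetition avoids the integer constructor being deprecated.
    return cls(b"\x00") * length


zero_bytes = zeros(bytes, 3)
zero_array = zeros(bytearray, 3)
```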

Addition of explicit "single byte" constructors

As binary counterparts to the text ``chr`` function, this PEP proposes the
addition of an explicit ``byte`` alternative constructor as a class method
on both ``bytes`` and ``bytearray``::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytearray.byte(3)
    bytearray(b'\x03')

These methods will only accept integers in the range 0 to 255 (inclusive)::

    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)

    >>> bytes.byte(1.0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'float' object cannot be interpreted as an integer

The documentation of the ``ord`` builtin will be updated to explicitly note
that ``bytes.byte`` is the inverse operation for binary data, while ``chr``
is the inverse operation for text data.

Behaviourally, ``bytes.byte(x)`` will be equivalent to the current
``bytes([x])`` (and similarly for ``bytearray``). The new spelling is
expected to be easier to discover and easier to read (especially when used
in conjunction with indexing operations on binary sequence types).

As a separate method, the new spelling will also work better with higher
order functions like ``map``.
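
As a rough illustration of the equivalence described above, the sketch below
defines a hypothetical module-level ``byte`` helper (an assumption for this
example, not the proposed API itself) on top of the existing ``bytes([x])``
spelling, and uses it with indexing and with ``map``.

```python
def byte(value):
    """Hypothetical stand-in for the proposed ``bytes.byte(value)``."""
    # bytes([value]) already enforces range(0, 256) and rejects floats.
    return bytes([value])


data = b"abc"

# Indexing a bytes object yields an integer; byte() turns it back
# into a length 1 bytes object.
first = byte(data[0])

# As a plain callable, it composes with map().
pieces = list(map(byte, data))

print(first, pieces)
```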

Addition of optimised iterator methods that produce ``bytes`` objects

This PEP proposes that ``bytes``, ``bytearray`` and ``memoryview`` gain an
optimised ``iterbytes`` method that produces length 1 ``bytes`` objects
rather than integers::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

The method can be used with arbitrary buffer exporting objects by wrapping
them in a ``memoryview`` instance first::

    for x in memoryview(data).iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

For ``memoryview``, the semantics of ``iterbytes()`` are defined such that::

    memview.tobytes() == b''.join(memview.iterbytes())

This allows the raw bytes of the memory view to be iterated over without
needing to make a copy, regardless of the defined shape and format.

The main advantage this method offers over the ``map(bytes.byte, data)``
approach is that it is guaranteed *not* to fail midstream with a
``ValueError`` or ``TypeError``. By contrast, when using the ``map`` based
approach, the type and value of the individual items in the iterable are
only checked as they are retrieved and passed through the ``bytes.byte``
constructor.
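
The difference can be sketched on current Python versions. Here
``iterbytes`` is a hypothetical generator function standing in for the
proposed method (an unoptimised emulation, assuming a C-contiguous buffer),
and the ``map`` example uses ``bytes([x])`` in place of ``bytes.byte``.

```python
def iterbytes(data):
    """Hypothetical stand-in for the proposed ``data.iterbytes()``."""
    # Casting to format 'B' gives a flat view of the raw bytes, whatever
    # the original shape and format; the buffer itself is not copied.
    view = memoryview(data).cast("B")
    for i in range(len(view)):
        # Slicing keeps each item a length 1 sequence; bytes() copies
        # just that one byte.
        yield bytes(view[i:i + 1])


chunks = list(iterbytes(b"abc"))

# The map-based approach only validates items as they are consumed,
# so it can fail partway through the iteration.
mixed = [65, 66, 512, 67]
seen = []
try:
    for item in map(lambda x: bytes([x]), mixed):
        seen.append(item)
except ValueError:
    pass  # raised only once the out-of-range value 512 is reached

print(chunks, seen)
```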

Design discussion

Why not rely on sequence repetition to create zero-initialised sequences?

Zero-initialised sequences can be created via sequence repetition::

    >>> b'\x00' * 3
    b'\x00\x00\x00'
    >>> bytearray(b'\x00') * 3
    bytearray(b'\x00\x00\x00')

However, this was also the case when the ``bytearray`` type was originally
designed, and the decision was made to add explicit support for it in the
type constructor. The immutable ``bytes`` type then inherited that feature
when it was introduced in PEP 3137.

This PEP isn't revisiting that original design decision, just changing the
spelling as users sometimes find the current behaviour of the binary sequence
constructors surprising. In particular, there's a reasonable case to be made
that ``bytes(x)`` (where ``x`` is an integer) should behave like the
``bytes.byte(x)`` proposal in this PEP. Providing both behaviours as separate
class methods avoids that ambiguity.


.. [1] Initial March 2014 discussion thread on python-ideas
.. [2] Guido's initial feedback in that thread
.. [3] Issue proposing moving zero-initialised sequences to a dedicated API
.. [4] Issue proposing to use calloc() for zero-initialised binary sequences
.. [5] August 2014 discussion thread on python-dev

Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From steve at  Sat Aug 16 07:41:47 2014
From: steve at (Steven D'Aprano)
Date: Sat, 16 Aug 2014 15:41:47 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
 <20140813173225.GL4525@ando> <>
 <20140816030847.GD4525@ando> <>
Message-ID: <20140816054147.GG4525@ando>

On Fri, Aug 15, 2014 at 08:29:09PM -0700, Ethan Furman wrote:
> On 08/15/2014 08:08 PM, Steven D'Aprano wrote:

> >is a poor argument (that is, I'm disagreeing with it), since *single*
> >line parens-free with statements are already syntactically a tuple:
> >
> >     with spam, eggs, cheese:  # Commas make a tuple, not parens.
> This point I do not understand -- commas /can/ create a tuple, but don't 
> /necessarily/ create a tuple.  So, semantically: no tuple.

Right! I think we are in agreement. It's not that with statements 
actually generate a tuple, but that they *look* like they include a 
tuple. That's what I meant by "syntactically a tuple", sorry if that was 
confusing. I didn't mean to suggest that Python necessarily builds a 
tuple of context managers.

If people were going to be prone to mistake

    with (a, b, c): ...

as including a tuple, they would have already mistaken:

    with a, b, c: ...

the same way. But they haven't.


From ben+python at  Sat Aug 16 09:25:33 2014
From: ben+python at (Ben Finney)
Date: Sat, 16 Aug 2014 17:25:33 +1000
Subject: [Python-Dev] Multiline with statement line continuation
References: <20140811230800.GA12210@gensokyo>
 <20140813173225.GL4525@ando> <>
 <20140816030847.GD4525@ando> <>
Message-ID: <>

Steven D'Aprano <steve at> writes:

> If people were going to be prone to mistake
>     with (a, b, c): ...
> as including a tuple

… because the parens are a strong signal “this is an expression to be
evaluated, resulting in a single value to use in the statement”.

> they would have already mistaken:
>     with a, b, c: ...
> the same way. But they haven't.

Right. The presence or absence of parens make a big semantic difference.

 \      “The process by which banks create money is so simple that the |
  `\     mind is repelled.” -John Kenneth Galbraith, _Money: Whence It |
_o__)                                       Came, Where It Went_, 1975 |
Ben Finney

From jeanpierreda at  Sat Aug 16 10:04:13 2014
From: jeanpierreda at (Devin Jeanpierre)
Date: Sat, 16 Aug 2014 01:04:13 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140811230800.GA12210@gensokyo>
 <20140813173225.GL4525@ando> <>
 <20140816030847.GD4525@ando> <>
 <20140816054147.GG4525@ando> <>
Message-ID: <>

On Sat, Aug 16, 2014 at 12:25 AM, Ben Finney <ben+python at> wrote:
> Steven D'Aprano <steve at> writes:
>> If people were going to be prone to mistake
>>     with (a, b, c): ...
>> as including a tuple
> … because the parens are a strong signal “this is an expression to be
> evaluated, resulting in a single value to use in the statement”.
>> they would have already mistaken:
>>     with a, b, c: ...
>> the same way. But they haven't.
> Right. The presence or absence of parens make a big semantic difference.

At least historically so, since "except a, b:" and "except (a, b):"
used to be different things (only the latter constructs a tuple in
2.x). OTOH, consider "from .. import (..., ..., ...)".

Pretty sure at this point parens can be used for non-expressions quite
reasonably -- although I'd still prefer just allowing newlines without
requiring extra syntax.

-- Devin

From senthil at  Sat Aug 16 12:40:02 2014
From: senthil at (Senthil Kumaran)
Date: Sat, 16 Aug 2014 16:10:02 +0530
Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix Issue #8797:
 Raise HTTPError on failed Basic Authentication immediately.
In-Reply-To: <>
References: <>
Message-ID: <>

I added some extra coverage for basic auth in the tests, and I notice that
some of the buildbots are now failing with "error: [Errno 32] Broken pipe".

I am looking into this and will fix it.


On Sat, Aug 16, 2014 at 2:19 PM, senthil.kumaran <python-checkins at
> wrote:

> changeset:   92111:e0510a3bdf8f
> branch:      2.7
> parent:      92097:6d41f139709b
> user:        Senthil Kumaran <senthil at>
> date:        Sat Aug 16 14:16:14 2014 +0530
> summary:
>   Fix Issue #8797: Raise HTTPError on failed Basic Authentication
> immediately. Initial patch by Sam Bull.
> files:
>   Lib/test/ |  86 ++++++++++++++++++-
>   Lib/                    |  19 +---
>   Misc/NEWS                         |   3 +
>   3 files changed, 90 insertions(+), 18 deletions(-)
> diff --git a/Lib/test/
> b/Lib/test/
> --- a/Lib/test/
> +++ b/Lib/test/
> @@ -1,6 +1,8 @@
> +import base64
>  import urlparse
>  import urllib2
>  import BaseHTTPServer
> +import SimpleHTTPServer
>  import unittest
>  import hashlib
> @@ -66,6 +68,48 @@
>  # Authentication infrastructure
> +
> +class BasicAuthHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
> +    """Handler for performing Basic Authentication."""
> +    # Server side values
> +    USER = "testUser"
> +    PASSWD = "testPass"
> +    REALM = "Test"
> +    USER_PASSWD = "%s:%s" % (USER, PASSWD)
> +    ENCODED_AUTH = base64.b64encode(USER_PASSWD)
> +
> +    def __init__(self, *args, **kwargs):
> +        SimpleHTTPServer.SimpleHTTPRequestHandler.__init__(self, *args,
> +                                                           **kwargs)
> +
> +    def log_message(self, format, *args):
> +        # Supress the HTTP Console log output
> +        pass
> +
> +    def do_HEAD(self):
> +        self.send_response(200)
> +        self.send_header("Content-type", "text/html")
> +        self.end_headers()
> +
> +    def do_AUTHHEAD(self):
> +        self.send_response(401)
> +        self.send_header("WWW-Authenticate", "Basic realm=\"%s\"" %
> self.REALM)
> +        self.send_header("Content-type", "text/html")
> +        self.end_headers()
> +
> +    def do_GET(self):
> +        if self.headers.getheader("Authorization") == None:
> +            self.do_AUTHHEAD()
> +            self.wfile.write("No Auth Header Received")
> +        elif self.headers.getheader(
> +                "Authorization") == "Basic " + self.ENCODED_AUTH:
> +            SimpleHTTPServer.SimpleHTTPRequestHandler.do_GET(self)
> +        else:
> +            self.do_AUTHHEAD()
> +            self.wfile.write(self.headers.getheader("Authorization"))
> +            self.wfile.write("Not Authenticated")
> +
> +
>  class DigestAuthHandler:
>      """Handler for performing digest authentication."""
> @@ -228,6 +272,45 @@
>          test_support.threading_cleanup(*self._threads)
> +class BasicAuthTests(BaseTestCase):
> +    USER = "testUser"
> +    PASSWD = "testPass"
> +    INCORRECT_PASSWD = "Incorrect"
> +    REALM = "Test"
> +
> +    def setUp(self):
> +        super(BasicAuthTests, self).setUp()
> +        # With Basic Authentication
> +        def http_server_with_basic_auth_handler(*args, **kwargs):
> +            return BasicAuthHandler(*args, **kwargs)
> +        self.server =
> LoopbackHttpServerThread(http_server_with_basic_auth_handler)
> +        self.server_url = '' % self.server.port
> +        self.server.start()
> +        self.server.ready.wait()
> +
> +    def tearDown(self):
> +        self.server.stop()
> +        super(BasicAuthTests, self).tearDown()
> +
> +    def test_basic_auth_success(self):
> +        ah = urllib2.HTTPBasicAuthHandler()
> +        ah.add_password(self.REALM, self.server_url, self.USER,
> self.PASSWD)
> +        urllib2.install_opener(urllib2.build_opener(ah))
> +        try:
> +            self.assertTrue(urllib2.urlopen(self.server_url))
> +        except urllib2.HTTPError:
> +  "Basic Auth Failed for url: %s" % self.server_url)
> +        except Exception as e:
> +            raise e
> +
> +    def test_basic_auth_httperror(self):
> +        ah = urllib2.HTTPBasicAuthHandler()
> +        ah.add_password(self.REALM, self.server_url, self.USER,
> +                        self.INCORRECT_PASSWD)
> +        urllib2.install_opener(urllib2.build_opener(ah))
> +        self.assertRaises(urllib2.HTTPError, urllib2.urlopen,
> self.server_url)
> +
> +
>  class ProxyAuthTests(BaseTestCase):
>      URL = "http://localhost"
> @@ -240,6 +323,7 @@
>          self.digest_auth_handler = DigestAuthHandler()
>          self.digest_auth_handler.set_users({self.USER: self.PASSWD})
>          self.digest_auth_handler.set_realm(self.REALM)
> +        # With Digest Authentication
>          def create_fake_proxy_handler(*args, **kwargs):
>              return FakeProxyHandler(self.digest_auth_handler, *args,
> **kwargs)
> @@ -544,7 +628,7 @@
>      # the next line.
>      #test_support.requires("network")
> -    test_support.run_unittest(ProxyAuthTests, TestUrlopen)
> +    test_support.run_unittest(BasicAuthTests, ProxyAuthTests, TestUrlopen)
>  if __name__ == "__main__":
>      test_main()
> diff --git a/Lib/ b/Lib/
> --- a/Lib/
> +++ b/Lib/
> @@ -843,10 +843,7 @@
>              password_mgr = HTTPPasswordMgr()
>          self.passwd = password_mgr
>          self.add_password = self.passwd.add_password
> -        self.retried = 0
> -    def reset_retry_count(self):
> -        self.retried = 0
>      def http_error_auth_reqed(self, authreq, host, req, headers):
>          # host may be an authority (without userinfo) or a URL with an
> @@ -854,13 +851,6 @@
>          # XXX could be multiple headers
>          authreq = headers.get(authreq, None)
> -        if self.retried > 5:
> -            # retry sending the username:password 5 times before failing.
> -            raise HTTPError(req.get_full_url(), 401, "basic auth failed",
> -                            headers, None)
> -        else:
> -            self.retried += 1
> -
>          if authreq:
>              mo =
>              if mo:
> @@ -869,17 +859,14 @@
>                      warnings.warn("Basic Auth Realm was unquoted",
>                                    UserWarning, 2)
>                  if scheme.lower() == 'basic':
> -                    response = self.retry_http_basic_auth(host, req,
> realm)
> -                    if response and response.code != 401:
> -                        self.retried = 0
> -                    return response
> +                    return self.retry_http_basic_auth(host, req, realm)
>      def retry_http_basic_auth(self, host, req, realm):
>          user, pw = self.passwd.find_user_password(realm, host)
>          if pw is not None:
>              raw = "%s:%s" % (user, pw)
>              auth = 'Basic %s' % base64.b64encode(raw).strip()
> -            if req.headers.get(self.auth_header, None) == auth:
> +            if req.get_header(self.auth_header, None) == auth:
>                  return None
>              req.add_unredirected_header(self.auth_header, auth)
>              return, timeout=req.timeout)
> @@ -895,7 +882,6 @@
>          url = req.get_full_url()
>          response = self.http_error_auth_reqed('www-authenticate',
>                                                url, req, headers)
> -        self.reset_retry_count()
>          return response
> @@ -911,7 +897,6 @@
>          authority = req.get_host()
>          response = self.http_error_auth_reqed('proxy-authenticate',
>                                            authority, req, headers)
> -        self.reset_retry_count()
>          return response
> diff --git a/Misc/NEWS b/Misc/NEWS
> --- a/Misc/NEWS
> +++ b/Misc/NEWS
> @@ -19,6 +19,9 @@
>  Library
>  -------
> +- Issue #8797: Raise HTTPError on failed Basic Authentication immediately.
> +  Initial patch by Sam Bull.
> +
>  - Issue #21448: Changed FeedParser feed() to avoid O(N**2) behavior when
>    parsing long line.  Original patch by Raymond Hettinger.
> --
> Repository URL:
> _______________________________________________
> Python-checkins mailing list
> Python-checkins at

From steve at  Sat Aug 16 13:16:52 2014
From: steve at (Steven D'Aprano)
Date: Sat, 16 Aug 2014 21:16:52 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140812121541.GG4525@ando>
 <20140813173225.GL4525@ando> <>
 <20140816030847.GD4525@ando> <>
 <20140816054147.GG4525@ando> <>
Message-ID: <20140816111652.GI4525@ando>

On Sat, Aug 16, 2014 at 05:25:33PM +1000, Ben Finney wrote:
> > they would have already mistaken:
> >
> >     with a, b, c: ...
> >
> > the same way. But they haven't.
> Right. The presence or absence of parens make a big semantic difference.

from silly.mistakes.programmers.make import (
     hands, up, anyone, who, thinks, this, is_, a, tuple)

def function(how, about, this, one): ...

But quite frankly, even if there is some person somewhere who gets 
confused and tries to write:

context_managers = (open("a"), open("b", "w"), open("c", "w"))
with context_managers as things:
    text = things[0].read()

I simply don't care. They will try it, discover that tuples are not 
context managers, fix their code, and move on. (I've made sillier 
mistakes, and became a better programmer from it.)

We cannot paralyse ourselves out of fear that somebody, somewhere, will 
make a silly mistake. You can try that "with tuple" code right now, and 
you will get nice runtime exception. I admit that the error message is 
not the most descriptive I've ever seen, but I've seen worse, and any 
half-decent programmer can do what they do for any other unexpected 
exception: read the Fine Manual, or ask for help, or otherwise debug the 
problem. Why should this specific exception be treated as so harmful 
that we have to forgo a useful piece of functionality to avoid it?

Some designs are bug-magnets, like the infamous "except A,B" syntax, 
which fails silently, doing the wrong thing. Unless someone has a 
convincing rationale for how and why this multi-line with will likewise 
be a bug-magnet, I don't think that some vague similarity between it and 
tuples is justification for rejecting the proposal.
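
A self-contained version of that experiment, as a sketch: in-memory
``io.StringIO`` objects stand in for the real files (an assumption made so
the snippet has no side effects), and both ``AttributeError`` and
``TypeError`` are caught because the exact exception type depends on the
Python version in use.

```python
import io

context_managers = (io.StringIO("a"), io.StringIO("b"))

try:
    with context_managers as things:
        text = things[0].read()
except (AttributeError, TypeError) as exc:
    # A tuple has no __enter__/__exit__, so the with statement fails
    # immediately instead of silently doing the wrong thing.
    failure = exc

print(type(failure).__name__, failure)
```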


From marko at  Sat Aug 16 14:47:06 2014
From: marko at (Marko Rauhamaa)
Date: Sat, 16 Aug 2014 15:47:06 +0300
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140816111652.GI4525@ando> (Steven D'Aprano's message of "Sat, 
 16 Aug 2014 21:16:52 +1000")
References: <20140812121541.GG4525@ando>
 <20140813173225.GL4525@ando> <>
 <20140816030847.GD4525@ando> <>
 <20140816054147.GG4525@ando> <>
Message-ID: <>

Steven D'Aprano <steve at>:

> I simply don't care. They will try it, discover that tuples are not 
> context managers, fix their code, and move on.

*Could* tuples (and lists and sequences) be context managers?

*Should* tuples (and lists and sequences) be context managers?

> I don't think that some vague similarity between it and tuples is
> justification for rejecting the proposal.

You might be able to have it both ways. You could have:

   with (open(name) for name in os.listdir("config")) as files:


From rosuav at  Sat Aug 16 23:42:25 2014
From: rosuav at (Chris Angelico)
Date: Sun, 17 Aug 2014 07:42:25 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140812121541.GG4525@ando>
 <20140813173225.GL4525@ando> <>
 <20140816030847.GD4525@ando> <>
 <20140816054147.GG4525@ando> <>
 <20140816111652.GI4525@ando> <>
Message-ID: <>

On Sat, Aug 16, 2014 at 10:47 PM, Marko Rauhamaa <marko at> wrote:
> You might be able to have it both ways. You could have:
>    with (open(name) for name in os.listdir("config")) as files:

But that's not a tuple, it's a generator. Should generators be context
managers? Is anyone seriously suggesting this? I don't think so. Is
this a solution looking for a problem?


From ncoghlan at  Sun Aug 17 03:10:00 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 17 Aug 2014 11:10:00 +1000
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <>
References: <20140812121541.GG4525@ando>
 <20140813173225.GL4525@ando> <>
 <20140816030847.GD4525@ando> <>
 <20140816054147.GG4525@ando> <>
 <20140816111652.GI4525@ando> <>
Message-ID: <>

On 17 August 2014 07:42, Chris Angelico <rosuav at> wrote:
> On Sat, Aug 16, 2014 at 10:47 PM, Marko Rauhamaa <marko at> wrote:
>> You might be able to have it both ways. You could have:
>>    with (open(name) for name in os.listdir("config")) as files:
> But that's not a tuple, it's a generator. Should generators be context
> managers? Is anyone seriously suggesting this? I don't think so. Is
> this a solution looking for a problem?

Yes. We have a whole programming language to play with; when "X is
hard to read" becomes a problem, it may be time to reach for a better
tool. If the context manager line is getting unwieldy, it's often a
sign it's time to factor it out to a dedicated helper, or break it up
into multiple with statements :)
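
One possible shape for such a helper, as a sketch only: ``open_all`` is a
hypothetical name invented for this example, and it leans on
``contextlib.ExitStack`` to manage an arbitrary number of files behind a
single short ``with`` line.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def open_all(paths, mode="r"):
    """Hypothetical helper: open every path, close them all on exit."""
    with contextlib.ExitStack() as stack:
        # enter_context() registers each file so that it is closed when
        # the stack unwinds, even if a later open() fails.
        yield [stack.enter_context(open(path, mode)) for path in paths]


# Usage with throwaway files, so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, name) for name in ("a", "b", "c")]
    for path in paths:
        with open(path, "w") as f:
            f.write(os.path.basename(path))

    with open_all(paths) as files:
        contents = [f.read() for f in files]
    all_closed = all(f.closed for f in files)

print(contents, all_closed)
```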


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From ncoghlan at  Sun Aug 17 03:28:48 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 17 Aug 2014 11:28:48 +1000
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
Message-ID: <>

I've seen a few people on python-ideas express the assumption that
there will be another Py3k style compatibility break for Python 4.0.

I've also had people express the concern that "you broke compatibility
in a major way once, how do we know you won't do it again?".

Both of those contrast strongly with Guido's stated position that he
never wants to go through a transition like the 2->3 one again.

Barry wrote PEP 404 to make it completely explicit that python-dev had
no plans to create a Python 2.8 release. Would it be worth writing a
similarly explicit "not an option" PEP explaining that the regular
deprecation and removal process (roughly documented in PEP 387) is the
*only* deprecation and removal process? It could also point to the
fact that we now have PEP 411 (provisional APIs) to help reduce our
chances of being locked indefinitely into design decisions we aren't
happy with.

If folks (most significantly, Guido) are amenable to the idea, it
shouldn't take long to put such a PEP together, and I think it could
help reduce some of the confusions around the expectations for Python
4.0 and the evolution of 3.x in general.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From steve at  Sun Aug 17 04:39:02 2014
From: steve at (Steven D'Aprano)
Date: Sun, 17 Aug 2014 12:39:02 +1000
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
	Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <20140817023902.GM4525@ando>

On Sun, Aug 17, 2014 at 11:28:48AM +1000, Nick Coghlan wrote:
> I've seen a few people on python-ideas express the assumption that
> there will be another Py3k style compatibility break for Python 4.0.

I used to refer to Python 4000 as the hypothetical compatibility break 
version. Now I refer to Python 5000.

> I've also had people express the concern that "you broke compatibility
> in a major way once, how do we know you won't do it again?".

Even languages with ISO standards behind them and release schedules 
measured in decades make backward-incompatible changes. For example, I 
see that Fortran 95 (despite being classified as a minor revision) 
deleted at least six language features. To expect Python to never break 
compatibility again is asking too much.

But I think it is fair to promise that Python won't make *so 
many* backwards incompatible changes all at once again, and has no 
concrete plans to make backwards incompatible changes to syntax in the 
foreseeable future. (That is, not before Python 5000 :-)

> If folks (most significantly, Guido) are amenable to the idea, it
> shouldn't take long to put such a PEP together, and I think it could
> help reduce some of the confusions around the expectations for Python
> 4.0 and the evolution of 3.x in general.

I think it's a good idea, so long as there's no implied or explicit 
promise that the Python language is now set in stone, never to change.


From guido at  Sun Aug 17 04:43:39 2014
From: guido at (Guido van Rossum)
Date: Sat, 16 Aug 2014 19:43:39 -0700
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On Sat, Aug 16, 2014 at 6:28 PM, Nick Coghlan <ncoghlan at> wrote:

> I've seen a few people on python-ideas express the assumption that
> there will be another Py3k style compatibility break for Python 4.0.

There used to be only joking references to 4.0 or py4k -- how things have
changed! I've seen nothing that a gentle correction on the list couldn't
fix though.

> I've also had people express the concern that "you broke compatibility
> in a major way once, how do we know you won't do it again?".

Well, they won't, really. You can't predict the future. But really, that's
a pretty poor way to say "please don't do it again."

I'm not sure why, but I hate when someone starts a suggestion or a question
with "why doesn't Python ..." and I have to fight the urge to reply in a
flippant way without answering the real question. (And just now I did it
again.)

I suppose this phrasing may actually be meant as a form of politeness, but
to me it often sounds passive-aggressive, pretend-polite. (Could it be a
matter of cultural difference? The internet is full of broken English, my
own often included.)

> Both of those contrast strongly with Guido's stated position that he
> never wants to go through a transition like the 2->3 one again.

Right. What's more, when I say that, I don't mean that you should wait
until I retire -- I think it's genuinely a bad idea.

I also don't expect that it'll be necessary -- in fact, I am counting on
tools (e.g. static analysis!) to improve to the point where there won't be
a reason for such a transition.

(Don't understand this to mean that we should never deprecate things.
Deprecations will happen, they are necessary for the evolution of any
programming language. But they won't ever hurt in the way that Python 3
hurt.)

> Barry wrote PEP 404 to make it completely explicit that python-dev had
> no plans to create a Python 2.8 release. Would it be worth writing a
> similarly explicit "not an option" PEP explaining that the regular
> deprecation and removal process (roughly documented in PEP 387) is the
> *only* deprecation and removal process? It could also point to the
> fact that we now have PEP 411 (provisional APIs) to help reduce our
> chances of being locked indefinitely into design decisions we aren't
> happy with.
> If folks (most significantly, Guido) are amenable to the idea, it
> shouldn't take long to put such a PEP together, and I think it could
> help reduce some of the confusions around the expectations for Python
> 4.0 and the evolution of 3.x in general.

But what should it say? It's easy to say there won't be a 2.8 because we
already have 3.0 (and 3.1, and 3.2, and ...). But can we really say there
won't be a 4.0? Never? Why not? Who is to say that at some point some folks
won't be going off on their own to design a whole new language and name it
Python 4, following Larry Wall's Perl 6 example?

I think it makes sense to occasionally remind the more eager contributors
that we want the future to come gently (that's not to say in our sleep :-).
But I'm not sure a PEP is the best form for such a reminder. Even the Pope
has a Twitter account. :-)

--Guido van Rossum (

From lukasz at  Sun Aug 17 04:46:45 2014
From: lukasz at (=?utf-8?Q?=C5=81ukasz_Langa?=)
Date: Sat, 16 Aug 2014 19:46:45 -0700
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
	Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 16, 2014, at 6:28 PM, Nick Coghlan <ncoghlan at> wrote:

> I've seen a few people on python-ideas express the assumption that
> there will be another Py3k style compatibility break for Python 4.0.

Whenever I mention Python 4 or PEP 4000, it's always a joke. However, saying upfront that we will never break compatibility is a bold statement. Technically even introducing new syntax breaks compatibility. Not to mention fixing long-lasting bugs. So you'd need to split hairs just defining what we mean by a "major compatibility break".

Worse, if we ever did a change that we feel is within the bounds of the contract, you'd have someone pointing at that PEP saying that they feel we broke the contract. Splitting hairs again.

PEP 404 was necessary for some people/organizations to move on. I fail to see how PEP 4000 (or rather PEP 4004? ;-)) would be useful in that context.

Best regards,
Łukasz Langa

Twitter: @llanga
IRC: ambv on #python-dev

From lukasz at  Sun Aug 17 04:49:18 2014
From: lukasz at (=?utf-8?Q?=C5=81ukasz_Langa?=)
Date: Sat, 16 Aug 2014 19:49:18 -0700
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
	Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 16, 2014, at 7:43 PM, Guido van Rossum <guido at> wrote:

> But can we really say there won't be a 4.0? Never? Why not? Who is to say that at some point some folks won't be going off on their own to design a whole new language and name it Python 4, following Larry Wall's Perl 6 example?

If they ever do, please make them not follow the Perl 6 example!

Best regards,
Łukasz Langa

Twitter: @llanga
IRC: ambv on #python-dev

From ncoghlan at  Sun Aug 17 05:48:41 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 17 Aug 2014 13:48:41 +1000
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On 17 August 2014 12:43, Guido van Rossum <guido at> wrote:
> On Sat, Aug 16, 2014 at 6:28 PM, Nick Coghlan <ncoghlan at> wrote:
>> I've also had people express the concern that "you broke compatibility
>> in a major way once, how do we know you won't do it again?".
> Well, they won't, really. You can't predict the future. But really, that's a
> pretty poor way to say "please don't do it again."
> I'm not sure why, but I hate when someone starts a suggestion or a question
> with "why doesn't Python ..." and I have to fight the urge to reply in a
> flippant way without answering the real question. (And just now I did it
> again.)
> I suppose this phrasing may actually be meant as a form of politeness, but
> to me it often sounds passive-aggressive, pretend-polite. (Could it be a
> matter of cultural difference? The internet is full of broken English, my
> own often included.)

I don't mind it if the typical answers are accepted as valid:

*  "because it has these downsides, and those are considered to
outweigh the benefits"
*  "because it's difficult, and it never bothered anyone enough for
them to put in the work to do something about it"

Those aren't always obvious, especially to folks that don't have a lot
of experience with long lived software projects (I had only just
started high school when Python was first released!), so I don't mind
explaining them when I have time.

>> Both of those contrast strongly with Guido's stated position that he
>> never wants to go through a transition like the 2->3 one again.
> Right. What's more, when I say that, I don't mean that you should wait until
> I retire -- I think it's genuinely a bad idea.

Absolutely agreed - I think the Unicode change was worthwhile (even
with the impact proving to be higher than expected), but there isn't
any such fundamental change to the data model lurking for Python 3.

> I also don't expect that it'll be necessary -- in fact, I am counting on
> tools (e.g. static analysis!) to improve to the point where there won't be a
> reason for such a transition.

The fact that things like Hylang and MacroPy can already run on the
CPython VM also shows that other features (like import hooks and the
AST compiler) have evolved to the point where the Python data model
and runtime semantics can be more effectively decoupled from syntactic

> (Don't understand this to mean that we should never deprecate things.
> Deprecations will happen, they are necessary for the evolution of any
> programming language. But they won't ever hurt in the way that Python 3
> hurt.)

Right. I think Python 2 has been stable for so long that I sometimes
wonder if folks forget (or never knew?) we used to deprecate things
within the Python 2 series as well, such that code that ran on Python
2.x wasn't necessarily guaranteed to run on Python 2.(x+2). "Never
deprecate anything" is a recipe for unbounded growth in complexity.

Benjamin has made a decent start on documenting that normal
deprecation process in PEP 387, so I'd also suggest refining that a
bit and getting it to "Accepted" as part of any explicit "Python 4.x
won't be as disruptive as 3.x" clarification.
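
As a minimal sketch of what that normal deprecation process looks like in
code (old_api and new_api are hypothetical names, and this only illustrates
the general warnings-based pattern, not anything PEP 387 spells out
verbatim):

```python
import warnings


def new_api(value):
    """Hypothetical replacement API."""
    return value * 2


def old_api(value):
    """Hypothetical legacy API, kept working while users migrate."""
    warnings.warn(
        "old_api() is deprecated; use new_api() instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to old_api
    )
    return new_api(value)


# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api(21)
```

The old name keeps working for the length of the deprecation period, and
callers get a warning they can act on before anything is removed.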

>> no plans to create a Python 2.8 release. Would it be worth writing a
>> similarly explicit "not an option" PEP explaining that the regular
>> deprecation and removal process (roughly documented in PEP 387) is the
>> *only* deprecation and removal process? It could also point to the
>> fact that we now have PEP 411 (provisional APIs) to help reduce our
>> chances of being locked indefinitely into design decisions we aren't
>> happy with.
>> If folks (most significantly, Guido) are amenable to the idea, it
>> shouldn't take long to put such a PEP together, and I think it could
>> help reduce some of the confusions around the expectations for Python
>> 4.0 and the evolution of 3.x in general.
> But what should it say?

The specific things I was thinking we could point out were:

- PEP 387, documenting the normal deprecation process that existed
even in Python 2
- highlighting the increased preference for "documented deprecation
only" in cases where maintaining something isn't actively causing
problems, there are just better alternatives now available
- PEP 411, the (still relatively new) provisional API concept
- PEP 405, adding pyvenv as a standard part of Python
- PEP 453, better integrating PyPI into the recommended way of working
with the language

Those all help change the way the language evolves, as they reduce the
pressure to rush things into the standard library before they're
ready, while at the same time giving us a chance to publish "not quite
ready to be locked down" features for very broad feedback.

I'd also point out that the "variable encodings" to "Unicode"
transition for text handling is an industry wide issue, one which even
operating systems are still struggling with in some cases. POSIX-only
software that only needs to run on modern platforms can assume UTF-8,
while modern Windows and Java only software can largely assume
UTF-16-LE, but anyone trying to integrate with both is going to have a
far more interesting time of things (as we've discovered the hard
way). That transition is the core thing that sometimes makes migrating
from Python 2 to Python 3 non-trivial - even the changes to dict are
relatively simple to address by comparison.
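
A small sketch of why integrating the two worlds gets interesting (the
sample string is an arbitrary choice for illustration): the same text has
different byte representations under the two encodings, so code has to know
which one it is dealing with at every boundary.

```python
text = "caf\u00e9"

# The encoding typically assumed on modern POSIX systems.
utf8_bytes = text.encode("utf-8")

# The encoding largely assumed by Windows and Java APIs.
utf16_bytes = text.encode("utf-16-le")

# Both decode back to the same text, but only with the matching codec.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_utf16 = utf16_bytes.decode("utf-16-le")
```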

> It's easy to say there won't be a 2.8 because we
> already have 3.0 (and 3.1, and 3.2, and ...). But can we really say there
> won't be a 4.0? Never? Why not?

I'm assuming there *will* be a 4.0 - I'd just like to see it be "the
release after Python 3.9", rather than being spectacularly different
from the preceding 3.x releases. That's similar to the way that the
Linux kernel shifted to the 3.x series not because of any particular
milestone, but just due to the sheer weight of accumulated changes
relative to the early 2.x releases.

> Who is to say that at some point some folks
> won't be going off on their own to design a whole new language and name it
> Python 4, following Larry Wall's Perl 6 example?

Based on the examples of both Python 3 and Perl 6, I'd personally
strongly advocate for such a project to be a new language with a
different name, even if it was created and maintained by python-dev :)

> I think it makes sense to occasionally remind the more eager contributors
> that we want the future to come gently (that's not to say in our sleep :-).
> But I'm not sure a PEP is the best form for such a reminder. Even the Pope
> has a Twitter account. :-)

Yeah, I'm not sure a PEP is the right way either. However, it seemed
to get the point across for both PEP 404 ("no Python 2.8") and PEP 394
("POSIX platforms: don't make /usr/bin/python refer to Python 3, you
break things when you do that"), so I figured I'd at least raise the
suggestion on this topic as well.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From guido at  Sun Aug 17 07:08:37 2014
From: guido at (Guido van Rossum)
Date: Sat, 16 Aug 2014 22:08:37 -0700
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

I think this would be a great topic for a blog post. Once you've written it
I can even bless it by Tweeting about it. :-)

PS. Why isn't PEP 387 accepted yet?

On Sat, Aug 16, 2014 at 8:48 PM, Nick Coghlan <ncoghlan at> wrote:

> On 17 August 2014 12:43, Guido van Rossum <guido at> wrote:
> > On Sat, Aug 16, 2014 at 6:28 PM, Nick Coghlan <ncoghlan at>
> wrote:
> >> I've also had people express the concern that "you broke compatibility
> >> in a major way once, how do we know you won't do it again?".
> >
> >
> > Well, they won't, really. You can't predict the future. But really,
> that's a
> > pretty poor way to say "please don't do it again."
> >
> > I'm not sure why, but I hate when someone starts a suggestion or a
> question
> > with "why doesn't Python ..." and I have to fight the urge to reply in a
> > flippant way without answering the real question. (And just now I did it
> > again.)
> >
> > I suppose this phrasing may actually be meant as a form of politeness,
> but
> > to me it often sounds passive-aggressive, pretend-polite. (Could it be a
> > matter of cultural difference? The internet is full of broken English, my
> > own often included.)
> I don't mind it if the typical answers are accepted as valid:
> *  "because it has these downsides, and those are considered to
> outweigh the benefits"
> *  "because it's difficult, and it never bothered anyone enough for
> them to put in the work to do something about it"
> Those aren't always obvious, especially to folks that don't have a lot
> of experience with long lived software projects (I had only just
> started high school when Python was first released!), so I don't mind
> explaining them when I have time.
> >> Both of those contrast strongly with Guido's stated position that he
> >> never wants to go through a transition like the 2->3 one again.
> >
> > Right. What's more, when I say that, I don't mean that you should wait
> until
> > I retire -- I think it's genuinely a bad idea.
> Absolutely agreed - I think the Unicode change was worthwhile (even
> with the impact proving to be higher than expected), but there isn't
> any such fundamental change to the data model lurking for Python 3.
> > I also don't expect that it'll be necessary -- in fact, I am counting on
> > tools (e.g. static analysis!) to improve to the point where there won't
> be a
> > reason for such a transition.
> The fact that things like Hylang and MacroPy can already run on the
> CPython VM also shows that other features (like import hooks and the
> AST compiler) have evolved to the point where the Python data model
> and runtime semantics can be more effectively decoupled from syntactic
> details.
> > (Don't understand this to mean that we should never deprecate things.
> > Deprecations will happen, they are necessary for the evolution of any
> > programming language. But they won't ever hurt in the way that Python 3
> > hurt.)
> Right. I think Python 2 has been stable for so long that I sometimes
> wonder if folks forget (or never knew?) we used to deprecate things
> within the Python 2 series as well, such that code that ran on Python
> 2.x wasn't necessarily guaranteed to run on Python 2.(x+2). "Never
> deprecate anything" is a recipe for unbounded growth in complexity.
> Benjamin has made a decent start on documenting that normal
> deprecation process in PEP 387, so I'd also suggest refining that a
> bit and getting it to "Accepted" as part of any explicit "Python 4.x
> won't be as disruptive as 3.x" clarification.
> >> no plans to create a Python 2.8 release. Would it be worth writing a
> >> similarly explicit "not an option" PEP explaining that the regular
> >> deprecation and removal process (roughly documented in PEP 387) is the
> >> *only* deprecation and removal process? It could also point to the
> >> fact that we now have PEP 411 (provisional APIs) to help reduce our
> >> chances of being locked indefinitely into design decisions we aren't
> >> happy with.
> >>
> >> If folks (most significantly, Guido) are amenable to the idea, it
> >>
> >> shouldn't take long to put such a PEP together, and I think it could
> >> help reduce some of the confusions around the expectations for Python
> >> 4.0 and the evolution of 3.x in general.
> >
> > But what should it say?
> The specific things I was thinking we could point out were:
> - PEP 387, documenting the normal deprecation process that existed
> even in Python 2
> - highlighting the increased preference for "documented deprecation
> only" in cases where maintaining something isn't actively causing
> problems, there are just better alternatives now available
> - PEP 411, the (still relatively new) provisional API concept
> - PEP 405, adding pyvenv as a standard part of Python
> - PEP 453, better integrating PyPI into the recommended way of working
> with the language
> Those all help change the way the language evolves, as they reduce the
> pressure to rush things into the standard library before they're
> ready, while at the same time giving us a chance to publish "not quite
> ready to be locked down" features for very broad feedback.
> I'd also point out that the "variable encodings" to "Unicode"
> transition for text handling is an industry wide issue, one which even
> operating systems are still struggling with in some cases. POSIX-only
> software that only needs to run on modern platforms can assume UTF-8,
> while modern Windows and Java only software can largely assume
> UTF-16-LE, but anyone trying to integrate with both is going to have a
> far more interesting time of things (as we've discovered the hard
> way). That transition is the core thing that sometimes makes migrating
> from Python 2 to Python 3 non-trivial - even the changes to dict are
> relatively simple to address by comparison.
> > It's easy to say there won't be a 2.8 because we
> > already have 3.0 (and 3.1, and 3.2, and ...). But can we really say there
> > won't be a 4.0? Never? Why not?
> I'm assuming there *will* be a 4.0 - I'd just like to see it be "the
> release after Python 3.9", rather than being spectacularly different
> from the preceding 3.x releases. That's similar to the way that the
> Linux kernel shifted to the 3.x series not because of any particular
> milestone, but just due to the sheer weight of accumulated changes
> relative to the early 2.x releases.
> > Who is to say that at some point some folks
> > won't be going off on their own to design a whole new language and name
> it
> > Python 4, following Larry Wall's Perl 6 example?
> Based on the examples of both Python 3 and Perl 6, I'd personally
> strongly advocate for such a project to be a new language with a
> different name, even if it was created and maintained by python-dev :)
> > I think it makes sense to occasionally remind the more eager contributors
> > that we want the future to come gently (that's not to say in our sleep
> :-).
> > But I'm not sure a PEP is the best form for such a reminder. Even the
> Pope
> > has a Twitter account. :-)
> Yeah, I'm not sure a PEP is the right way either. However, it seemed
> to get the point across for both PEP 404 ("no Python 2.8") and PEP 394
> ("POSIX platforms: don't make /usr/bin/python refer to Python 3, you
> break things when you do that"), so I figured I'd at least raise the
> suggestion on this topic as well.
> Cheers,
> Nick.
> --
> Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

--Guido van Rossum (

From ncoghlan at  Sun Aug 17 07:34:16 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 17 Aug 2014 15:34:16 +1000
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On 17 August 2014 15:08, Guido van Rossum <guido at> wrote:
> I think this would be a great topic for a blog post. Once you've written it
> I can even bless it by Tweeting about it. :-)

Sounds like a plan - I'll try to put together something coherent this week :)

> PS. Why isn't PEP 387 accepted yet?

Not sure - it mostly looks correct to me. I suspect it just fell off
the radar since it's a "describe what we're already doing anyway" kind
of document.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From raymond.hettinger at  Sun Aug 17 10:13:39 2014
From: raymond.hettinger at (Raymond Hettinger)
Date: Sun, 17 Aug 2014 01:13:39 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 14, 2014, at 10:50 PM, Nick Coghlan <ncoghlan at> wrote:

> Key points in the proposal:
> * deprecate passing integers to bytes() and bytearray()

I'm opposed to removing this part of the API.  It has proven useful
and the alternative isn't very nice.   Declaring the size of fixed length
arrays is not a new concept and is widely adopted in other languages.
One principal use case for the bytearray is creating and manipulating
binary data.  Initializing to zero is a common operation and should remain
part of the core API (consider why we now have list.copy() even though
copying with a slice remains possible and efficient).

I and my clients have taken advantage of this feature and it reads nicely.
The proposed deprecation would break our code and not actually make
anything better.

Another thought is that the core devs should be very reluctant to deprecate
anything we don't have to while the 2 to 3 transition is still in progress.   
Every new deprecation of APIs that existed in Python 2.7 just adds another
obstacle to converting code.  Individually, the differences are trivial.  
Collectively, they present a good reason to never migrate code to Python 3.




From ncoghlan at  Sun Aug 17 10:28:17 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 17 Aug 2014 18:28:17 +1000
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On 17 August 2014 15:34, Nick Coghlan <ncoghlan at> wrote:
> On 17 August 2014 15:08, Guido van Rossum <guido at> wrote:
>> I think this would be a great topic for a blog post. Once you've written it
>> I can even bless it by Tweeting about it. :-)
> Sounds like a plan - I'll try to put together something coherent this week :)

OK, make that "this afternoon": :)


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From ncoghlan at  Sun Aug 17 10:41:05 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 17 Aug 2014 18:41:05 +1000
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 17 August 2014 18:13, Raymond Hettinger <raymond.hettinger at> wrote:
> On Aug 14, 2014, at 10:50 PM, Nick Coghlan <ncoghlan at> wrote:
> Key points in the proposal:
> * deprecate passing integers to bytes() and bytearray()
> I'm opposed to removing this part of the API.  It has proven useful
> and the alternative isn't very nice.   Declaring the size of fixed length
> arrays is not a new concept and is widely adopted in other languages.
> One principal use case for the bytearray is creating and manipulating
> binary data.  Initializing to zero is common operation and should remain
> part of the core API (consider why we now have list.copy() even though
> copying with a slice remains possible and efficient).

That's why the PEP proposes adding a "zeros" method, based on the name
of the corresponding NumPy construct.

The status quo has some very ugly failure modes when an integer is
passed unexpectedly, and tries to create a large buffer, rather than
throwing a type error.

> I and my clients have taken advantage of this feature and it reads nicely.

If I see "bytearray(10)" there is nothing there that suggests "this
creates an array of length 10 and initialises it to zero" to me. I'd
be more inclined to guess it would be equivalent to "bytearray([10])".

"bytearray.zeros(10)", on the other hand, is relatively clear,
independently of user expectations.
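
To make the difference concrete, here is a minimal sketch of the current
Python 3 behaviour. The zeros() function is a hypothetical stand-in for the
proposed method, not an existing bytearray API:

```python
# Passing an integer creates that many zero bytes...
by_length = bytearray(10)

# ...while passing an iterable of integers creates one byte per item.
by_value = bytearray([10])


def zeros(count):
    """Hypothetical helper mirroring the proposed bytearray.zeros()."""
    return bytearray(count)


# The same ambiguity applies to the immutable bytes type.
three_zeros = bytes(3)
one_byte = bytes([3])
```

With the explicit name, a reader does not need to know which of the two
meanings the bare constructor picks for an integer argument.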

> The proposed deprecation would break our code and not actually make
> anything better.
> Another thought is that the core devs should be very reluctant to deprecate
> anything we don't have to while the 2 to 3 transition is still in progress.
> Every new deprecation of APIs that existed in Python 2.7 just adds another
> obstacle to converting code.  Individually, the differences are trivial.
> Collectively, they present a good reason to never migrate code to Python 3.

This is actually one of the inconsistencies between the Python 2 and 3
binary APIs:

Python 2.7.5 (default, Jun 25 2014, 10:19:55)
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> bytes(10)
'10'
>>> bytearray(10)
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

Users wanting well-behaved binary sequences in Python 2.7 would be
well advised to use the "future" module to get a full backport of the
actual Python 3 bytes type, rather than the approximation that is the
8-bit str in Python 2. And once they do that, they'll be able to track
the evolution of the Python 3 binary sequence behaviour without any
further trouble.

That said, I don't really mind how long the deprecation cycle is. I'd
be fine with fully supporting both in 3.5 (2015), deprecating the main
constructor in favour of the explicit zeros() method in 3.6 (2017) and
dropping the legacy behaviour in 3.7 (2018).


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From senthil at  Sun Aug 17 11:37:44 2014
From: senthil at (Senthil Kumaran)
Date: Sun, 17 Aug 2014 15:07:44 +0530
Subject: [Python-Dev] [Python-checkins] cpython (merge 3.4 -> default):
 Issue #22165: Fixed test_undecodable_filename on non-UTF-8 locales.
In-Reply-To: <>
References: <>
Message-ID: <>

This change is okay and not harmful, but I think it might still not fix
the encoding issue that we encountered on Mac.

[localhost cpython]$ hg log -l 1
changeset:   92128:7cdc941d5180
tag:         tip
parent:      92126:3153a400b739
parent:      92127:a894b629bbea
user:        Serhiy Storchaka <storchaka at>
date:        Sun Aug 17 12:21:06 2014 +0300
Issue #22165: Fixed test_undecodable_filename on non-UTF-8 locales.

[localhost cpython]$ ./python.exe -m test.regrtest test_httpservers
[1/1] test_httpservers
test test_httpservers failed -- Traceback (most recent call last):
  File "/Users/skumaran/python/cpython/Lib/test/", line
283, in test_undecodable_filename
    .encode(enc, 'surrogateescape'), body)
AssertionError: b'href="%40test_5809_tmp%ED%B3%A7w%ED%B3%B0.txt"' not found
in b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "">\n<html>\n<head>\n<meta
http-equiv="Content-Type" content="text/html;
charset=utf-8">\n<title>Directory listing for
tmpj54lc8m1/</title>\n</head>\n<body>\n<h1>Directory listing for

1 test failed:

The underlying problem seems to be a difference in how os.listdir(), which
uses the C API, and os.fsdecode() represent the decoded characters. Ref:
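
As a rough illustration of how the two error handlers can disagree (this is
a sketch, not a diagnosis of the failure; the byte 0xE7 is an assumption
chosen to match the first escaped character in the output above):

```python
from urllib.parse import quote

# An undecodable byte becomes a lone surrogate under surrogateescape.
raw = b"\xe7"
decoded = raw.decode("utf-8", "surrogateescape")

# surrogatepass encodes the surrogate code point itself as three bytes...
quoted_pass = quote(decoded, errors="surrogatepass")

# ...while surrogateescape maps it back to the original single byte.
quoted_escape = quote(decoded, errors="surrogateescape")
```

The same filename therefore ends up percent-encoded in two different ways,
depending on which handler produced the bytes.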

On Sun, Aug 17, 2014 at 2:52 PM, serhiy.storchaka <
python-checkins at> wrote:

> changeset:   92128:7cdc941d5180
> parent:      92126:3153a400b739
> parent:      92127:a894b629bbea
> user:        Serhiy Storchaka <storchaka at>
> date:        Sun Aug 17 12:21:06 2014 +0300
> summary:
>   Issue #22165: Fixed test_undecodable_filename on non-UTF-8 locales.
> files:
>   Lib/test/ |  5 +++--
>   1 files changed, 3 insertions(+), 2 deletions(-)
> diff --git a/Lib/test/ b/Lib/test/
> --- a/Lib/test/
> +++ b/Lib/test/
> @@ -272,6 +272,7 @@
>      @unittest.skipUnless(support.TESTFN_UNDECODABLE,
>                           'need support.TESTFN_UNDECODABLE')
>      def test_undecodable_filename(self):
> +        enc = sys.getfilesystemencoding()
>          filename = os.fsdecode(support.TESTFN_UNDECODABLE) + '.txt'
>          with open(os.path.join(self.tempdir, filename), 'wb') as f:
>              f.write(support.TESTFN_UNDECODABLE)
> @@ -279,9 +280,9 @@
>          body = self.check_status_and_reason(response, 200)
>          quotedname = urllib.parse.quote(filename, errors='surrogatepass')
>          self.assertIn(('href="%s"' % quotedname)
> -                      .encode('utf-8', 'surrogateescape'), body)
> +                      .encode(enc, 'surrogateescape'), body)
>          self.assertIn(('>%s<' % html.escape(filename))
> -                      .encode('utf-8', 'surrogateescape'), body)
> +                      .encode(enc, 'surrogateescape'), body)
>          response = self.request(self.tempdir_name + '/' + quotedname)
>          self.check_status_and_reason(response, 200,
>                                       data=support.TESTFN_UNDECODABLE)
> --
> Repository URL:
> _______________________________________________
> Python-checkins mailing list
> Python-checkins at

From francismb at  Sun Aug 17 11:50:36 2014
From: francismb at (francis)
Date: Sun, 17 Aug 2014 11:50:36 +0200
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/17/2014 03:28 AM, Nick Coghlan wrote:
> I've seen a few people on python-ideas express the assumption that
> there will be another Py3k style compatibility break for Python 4.0.
> I've also had people express the concern that "you broke compatibility
> in a major way once, how do we know you won't do it again?".

Why not just allow those changes that can be automatically changed by
a tool/script applied on the code (a la go, 2to3, 3.Ato3.B, ...)?

From barry at  Sun Aug 17 15:29:19 2014
From: barry at (Barry Warsaw)
Date: Sun, 17 Aug 2014 09:29:19 -0400
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 16, 2014, at 07:43 PM, Guido van Rossum wrote:

>(Don't understand this to mean that we should never deprecate things.
>Deprecations will happen, they are necessary for the evolution of any
>programming language. But they won't ever hurt in the way that Python 3
>hurt.)

It would be useful to explore what causes the most pain in the 2->3
transition.  IMHO, it's not the deprecations or changes such as print ->
print().  It's the bytes/str split - a fundamental change to core and common
data types.  The question then is whether you foresee any similar looming
pervasive change? [*]


[*] I was going to add a joke about mandatory static type checking, but
sometimes jokes are blown up into apocalyptic prophesy around here. ;)

From storchaka at  Sun Aug 17 16:47:57 2014
From: storchaka at (Serhiy Storchaka)
Date: Sun, 17 Aug 2014 17:47:57 +0300
Subject: [Python-Dev] "embedded NUL character" exceptions
Message-ID: <lsqg2d$82k$>

Currently, most functions that accept a string argument which is then passed
to a C function as a NUL-terminated string reject strings with an embedded
NUL character and raise TypeError. ValueError looks more appropriate here,
because the argument type is correct (str); only its value is wrong. But
this is a backward-incompatible change.

I think that we should get rid of this legacy inconsistency sooner or
later. Why not fix it right now? I have opened an issue on the tracker
[1], but this issue requires a broader discussion.
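
A small sketch of the behaviour in question, assuming os.stat() as a
representative function that hands its argument to C as a NUL-terminated
string. The helper name is hypothetical, and both exception types are caught
because which one is raised depends on the Python version:

```python
import os


def nul_rejection_type(path):
    """Return the name of the exception raised for a path, or None."""
    try:
        os.stat(path)
    except (TypeError, ValueError) as exc:
        # The argument is a str in both cases; only its value is invalid.
        return type(exc).__name__
    except OSError:
        # A well-formed path that simply does not exist.
        return None
    return None


rejected_as = nul_rejection_type("name\0with_nul")
```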


From guido at  Sun Aug 17 17:13:52 2014
From: guido at (Guido van Rossum)
Date: Sun, 17 Aug 2014 08:13:52 -0700
Subject: [Python-Dev] "embedded NUL character" exceptions
In-Reply-To: <lsqg2d$82k$>
References: <lsqg2d$82k$>
Message-ID: <>

Sounds good to me.

On Sun, Aug 17, 2014 at 7:47 AM, Serhiy Storchaka <storchaka at> wrote:

> Currently most functions which accepts string argument which then passed
> to C function as NUL-terminated string, reject strings with embedded NUL
> character and raise TypeError. ValueError looks more appropriate here,
> because argument type is correct (str), only its value is wrong. But this
> is backward incompatible change.
> I think that we should get rid of this legacy inconsistency sooner or
> later. Why not fix it right now? I have opened an issue on the tracker [1],
> but this issue requires more broad discussion.
> [1]
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at
> Unsubscribe:

--Guido van Rossum (

From raymond.hettinger at  Sun Aug 17 19:07:09 2014
From: raymond.hettinger at (Raymond Hettinger)
Date: Sun, 17 Aug 2014 10:07:09 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 17, 2014, at 1:41 AM, Nick Coghlan <ncoghlan at> wrote:

> If I see "bytearray(10)" there is nothing there that suggests "this
> creates an array of length 10 and initialises it to zero" to me. I'd
> be more inclined to guess it would be equivalent to "bytearray([10])".
> "bytearray.zeros(10)", on the other hand, is relatively clear,
> independently of user expectations.

Zeros would have been great but that should have been done originally.
The time to get API design right is at inception.
Now, you're just breaking code and invalidating any published examples.

>> Another thought is that the core devs should be very reluctant to deprecate
>> anything we don't have to while the 2 to 3 transition is still in progress.
>> Every new deprecation of APIs that existed in Python 2.7 just adds another
>> obstacle to converting code.  Individually, the differences are trivial.
>> Collectively, they present a good reason to never migrate code to Python 3.
> This is actually one of the inconsistencies between the Python 2 and 3
> binary APIs:

However, bytearray(n) is the same in both Python 2 and Python 3.
Changing it in Python 3 increases the gulf between the two.

The further we let Python 3 diverge from Python 2, the less likely that
people will convert their code and the harder you make it to write code
that runs under both.

FWIW, I've been teaching Python full time for three years.  I cover the
use of bytearray(n) in my classes and not a single person out of 3000+
engineers has had a problem with it.   I seriously question the PEP's
assertion that there is a real problem to be solved (i.e. that people
are baffled by bytearray(bufsiz)) and that the problem is sufficiently
painful to warrant the headaches that go along with API changes.

The other proposal to add bytearray.byte(3) should probably be named
bytearray.from_byte(3) for clarity.  That said, I question whether there is
actually a use case for this.   I have never seen code that has a
need to create a byte array of length one from a single integer.
For the most part, the API will be easiest to learn if it matches what
we do for lists and for array.array.

Sorry Nick, but I think you're making the API worse instead of better.
This API isn't perfect but it isn't flat-out broken either.   There is some
unfortunate asymmetry between bytes() and bytearray() in Python 2,
but that ship has sailed.  The current API for Python 3 is pretty good
(though there is still a tension between wanting to be like lists and like
strings both at the same time).


P.S.  The most important problem in the Python world now is getting
Python 2 users to adopt Python 3.  The core devs need to develop
a strong distaste for anything that makes that problem harder.


From donald at  Sun Aug 17 19:16:31 2014
From: donald at (Donald Stufft)
Date: Sun, 17 Aug 2014 13:16:31 -0400
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

> On Aug 17, 2014, at 1:07 PM, Raymond Hettinger <raymond.hettinger at> wrote:
> On Aug 17, 2014, at 1:41 AM, Nick Coghlan <ncoghlan at> wrote:
>> If I see "bytearray(10)" there is nothing there that suggests "this
>> creates an array of length 10 and initialises it to zero" to me. I'd
>> be more inclined to guess it would be equivalent to "bytearray([10])".
>> "bytearray.zeros(10)", on the other hand, is relatively clear,
>> independently of user expectations.
> Zeros would have been great but that should have been done originally.
> The time to get API design right is at inception.
> Now, you're just breaking code and invalidating any published examples.
>>> Another thought is that the core devs should be very reluctant to deprecate
>>> anything we don't have to while the 2 to 3 transition is still in progress.
>>> Every new deprecation of APIs that existed in Python 2.7 just adds another
>>> obstacle to converting code.  Individually, the differences are trivial.
>>> Collectively, they present a good reason to never migrate code to Python 3.
>> This is actually one of the inconsistencies between the Python 2 and 3
>> binary APIs:
> However, bytearray(n) is the same in both Python 2 and Python 3.
> Changing it in Python 3 increases the gulf between the two.
> The further we let Python 3 diverge from Python 2, the less likely that
> people will convert their code and the harder you make it to write code
> that runs under both.
> FWIW, I've been teaching Python full time for three years.  I cover the
> use of bytearray(n) in my classes and not a single person out of 3000+
> engineers have had a problem with it.   I seriously question the PEP's
> assertion that there is a real problem to be solved (i.e. that people
> are baffled by bytearray(bufsiz)) and that the problem is sufficiently
> painful to warrant the headaches that go along with API changes.
> The other proposal to add bytearray.byte(3) should probably be named
> bytearray.from_byte(3) for clarity.  That said, I question whether there is
> actually a use case for this.   I have never seen code that has a
> need to create a byte array of length one from a single integer.
> For the most part, the API will be easiest to learn if it matches what
> we do for lists and for array.array.
> Sorry Nick, but I think you're making the API worse instead of better.
> This API isn't perfect but it isn't flat-out broken either.   There is some
> unfortunate asymmetry between bytes() and bytearray() in Python 2,
> but that ship has sailed.  The current API for Python 3 is pretty good
> (though there is still a tension between wanting to be like lists and like
> strings both at the same time).
> Raymond
> P.S.  The most important problem in the Python world now is getting
> Python 2 users to adopt Python 3.  The core devs need to develop
> a strong distaste for anything that makes that problem harder.

For the record I've had all of the problems that Nick states and I'm
+1 on this change.

Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


From ethan at  Sun Aug 17 20:33:52 2014
From: ethan at (Ethan Furman)
Date: Sun, 17 Aug 2014 11:33:52 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/17/2014 10:16 AM, Donald Stufft wrote:
> For the record I've had all of the problems that Nick states and I'm
> +1 on this change.

I've had many of the problems Nick states and I'm also +1.


From graffatcolmingov at  Sun Aug 17 20:40:34 2014
From: graffatcolmingov at (Ian Cordasco)
Date: Sun, 17 Aug 2014 13:40:34 -0500
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 17, 2014 12:17 PM, "Donald Stufft" <donald at> wrote:
>> On Aug 17, 2014, at 1:07 PM, Raymond Hettinger <raymond.hettinger at> wrote:
>> On Aug 17, 2014, at 1:41 AM, Nick Coghlan <ncoghlan at> wrote:
>>> If I see "bytearray(10)" there is nothing there that suggests "this
>>> creates an array of length 10 and initialises it to zero" to me. I'd
>>> be more inclined to guess it would be equivalent to "bytearray([10])".
>>> "bytearray.zeros(10)", on the other hand, is relatively clear,
>>> independently of user expectations.
>> Zeros would have been great but that should have been done originally.
>> The time to get API design right is at inception.
>> Now, you're just breaking code and invalidating any published examples.
>>>> Another thought is that the core devs should be very reluctant to deprecate
>>>> anything we don't have to while the 2 to 3 transition is still in progress.
>>>> Every new deprecation of APIs that existed in Python 2.7 just adds another
>>>> obstacle to converting code.  Individually, the differences are trivial.
>>>> Collectively, they present a good reason to never migrate code to Python 3.
>>> This is actually one of the inconsistencies between the Python 2 and 3
>>> binary APIs:
>> However, bytearray(n) is the same in both Python 2 and Python 3.
>> Changing it in Python 3 increases the gulf between the two.
>> The further we let Python 3 diverge from Python 2, the less likely that
>> people will convert their code and the harder you make it to write code
>> that runs under both.
>> FWIW, I've been teaching Python full time for three years.  I cover the
>> use of bytearray(n) in my classes and not a single person out of 3000+
>> engineers have had a problem with it.   I seriously question the PEP's
>> assertion that there is a real problem to be solved (i.e. that people
>> are baffled by bytearray(bufsiz)) and that the problem is sufficiently
>> painful to warrant the headaches that go along with API changes.
>> The other proposal to add bytearray.byte(3) should probably be named
>> bytearray.from_byte(3) for clarity.  That said, I question whether there is
>> actually a use case for this.   I have never seen code that has a
>> need to create a byte array of length one from a single integer.
>> For the most part, the API will be easiest to learn if it matches what
>> we do for lists and for array.array.
>> Sorry Nick, but I think you're making the API worse instead of better.
>> This API isn't perfect but it isn't flat-out broken either.   There is some
>> unfortunate asymmetry between bytes() and bytearray() in Python 2,
>> but that ship has sailed.  The current API for Python 3 is pretty good
>> (though there is still a tension between wanting to be like lists and like
>> strings both at the same time).
>> Raymond
>> P.S.  The most important problem in the Python world now is getting
>> Python 2 users to adopt Python 3.  The core devs need to develop
>> a strong distaste for anything that makes that problem harder.
> For the record I've had all of the problems that Nick states and I'm
> +1 on this change.

I've run into these problems as well, but I'm swayed by Raymond's
argument regarding bytearray's constructor. I wouldn't be averse to
adding zeros (for some parity between bytes and bytearray) to that,
but I'm not sure deprecating the behaviour of bytearray's constructor
is necessary.

(Whilst on my phone I only replied to Donald, so I'm forwarding this
to the list.)

From raymond.hettinger at  Sun Aug 17 23:19:17 2014
From: raymond.hettinger at (Raymond Hettinger)
Date: Sun, 17 Aug 2014 14:19:17 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 17, 2014, at 11:33 AM, Ethan Furman <ethan at> wrote:

> I've had many of the problems Nick states and I'm also +1.

There are two code snippets below which were taken from the standard library.
Are you saying that:
1) you don't understand the code (as the pep suggests)
2) you are willing to break that code and everything like it
3) and it would be more elegantly expressed as:
        charmap = bytearray.zeros(256)
    and
        mapping = bytearray.zeros(256)

At work, I have network engineers creating IPv4 headers and other structures
with bytearrays initialized to zeros.  Do you really want to break all their code?
Nowhere else in Python do we create buffers that way.  Code like
"msg, who = s.recvfrom(256)" is the norm.

Also, it is unclear if you're saying that you have an actual use case for this
part of the proposal?

   ba = bytearray.byte(65)

And that the code would be better, clearer, and faster than the currently working form?

   ba = bytearray([65])

Does there really need to be a special case for constructing a single byte?
To me, that is akin to proposing "list.from_int(65)" as an important special
case to replace "[65]".

If you must muck with the ever changing bytes() API, then please 
leave the bytearray() API alone.  I think we should show some respect
for code that is currently working and is cleanly expressible in both
Python 2 and Python 3.  We aren't winning users with API churn.

FWIW, I'm guessing that the differing viewpoints in the thread stem
mainly from the proponents' experiences with bytes() rather than
from experience with bytearray() which doesn't seem to have any
usage problems in the wild.  I've never seen a developer say they
didn't understand what "buf = bytearray(1024)" means.   That is
not an actual problem that needs solving (or breaking).

What may be an actual problem is code like "char = bytes(1024)"
though I'm unclear what a user might have actually been trying
to do with code like that.
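
For readers following along, here is a minimal sketch of the constructor behaviour being debated, assuming a Python 3 interpreter. It uses only the existing constructors; the proposed bytearray.zeros() does not appear because it is not an existing API.

```python
# Single-integer constructor forms as they behave in Python 3 today.
buf = bytearray(4)            # four zero bytes, not one byte of value 4
assert buf == bytearray(b"\x00\x00\x00\x00")
assert len(buf) == 4

one = bytearray([65])         # a one-element iterable gives a single byte
assert one == bytearray(b"A")

zeros = bytes(4)              # bytes() follows the same rule, but is immutable
assert zeros == b"\x00\x00\x00\x00"

buf[0] = 0x45                 # a preallocated bytearray can be filled in place
assert buf == bytearray(b"E\x00\x00\x00")
```

The preallocate-then-fill pattern in the last two lines is the buffer use case described above; the immutable bytes(4) result has no equivalent use.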


----------- excerpts from Lib/ ---------------

    charmap = bytearray(256)
    for op, av in charset:
        while True:
                if op is LITERAL:
                    charmap[fixup(av)] = 1
                elif op is RANGE:
                    for i in range(fixup(av[0]), fixup(av[1])+1):
                        charmap[i] = 1
                elif op is NEGATE:
                    out.append((op, av))
                    tail.append((op, av))


    charmap = bytes(charmap) # should be hashable                                                                                 
    comps = {}
    mapping = bytearray(256)
    block = 0
    data = bytearray()
    for i in range(0, 65536, 256):
        chunk = charmap[i: i + 256]
        if chunk in comps:
            mapping[i // 256] = comps[chunk]
        else:
            mapping[i // 256] = comps[chunk] = block
            block += 1
            data += chunk
    data = _mk_bitmap(data)
    data[0:0] = [block] + _bytes_to_codes(mapping)
    out.append((BIGCHARSET, data))
    out += tail
    return out                    

From barry at  Sun Aug 17 23:41:10 2014
From: barry at (Barry Warsaw)
Date: Sun, 17 Aug 2014 17:41:10 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

I think the biggest API "problem" is that default iteration returns integers
instead of bytes.  That's a real pain.

I'm not sure .iterbytes() is the best name for spelling iteration over bytes
instead of integers though.  Given that we can't change __iter__(), I
personally would perhaps prefer a simple .bytes property over which if you
iterated you would receive bytes, e.g.

>>> data = bytes([1, 2, 3])
>>> for i in data:
...     print(i)
>>> for b in data.bytes:
...     print(b)

There are no backward compatibility issues with this of course.
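
A small sketch of the status quo that motivates this, assuming Python 3. The .bytes property above is only a proposal, so the example falls back on length-1 slicing, which already yields bytes objects.

```python
data = bytes([1, 2, 3])

# Default iteration (and indexing) yields integers.
as_ints = [i for i in data]

# Length-1 slices yield bytes objects, so slicing is one way to get
# bytes-wise iteration without any new API.
as_bytes = [data[i:i + 1] for i in range(len(data))]

print(as_ints)    # [1, 2, 3]
print(as_bytes)   # [b'\x01', b'\x02', b'\x03']
```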

As for the single-int-ctor forms, they're inconvenient and arguably "wrong",
but I think we can live with it.  OTOH, I don't see any harm by adding the
.zeros() alternative constructor.  I'd probably want to spell the .byte()
alternative constructor .from_int() but I also don't think the status quo (or
.byte()) is that much of a usability problem.

The API churn problem comes about when you start wanting to deprecate the
single-int-ctor form.  *If* that part gets adopted, it should have a really
long deprecation cycle, IMO.


From donald at  Sun Aug 17 23:55:45 2014
From: donald at (Donald Stufft)
Date: Sun, 17 Aug 2014 17:55:45 -0400
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

> On Aug 17, 2014, at 5:19 PM, Raymond Hettinger <raymond.hettinger at> wrote:
> On Aug 17, 2014, at 11:33 AM, Ethan Furman <ethan at> wrote:
>> I've had many of the problems Nick states and I'm also +1.
> There are two code snippets below which were taken from the standard library.
> Are you saying that:
> 1) you don't understand the code (as the pep suggests)
> 2) you are willing to break that code and everything like it
> 3) and it would be more elegantly expressed as:  
>         charmap = bytearray.zeros(256)
>     and
>         mapping = bytearray.zeros(256)
> At work, I have network engineers creating IPv4 headers and other structures
> with bytearrays initialized to zeros.  Do you really want to break all their code?
> Nowhere else in Python do we create buffers that way.  Code like
> "msg, who = s.recvfrom(256)" is the norm.
> Also, it is unclear if you're saying that you have an actual use case for this
> part of the proposal?
>    ba = bytearray.byte(65)
> And that the code would be better, clearer, and faster than the currently working form?
>    ba = bytearray([65])
> Does there really need to be a special case for constructing a single byte?
> To me, that is akin to proposing "list.from_int(65)" as an important special
> case to replace "[65]".
> If you must muck with the ever changing bytes() API, then please 
> leave the bytearray() API alone.  I think we should show some respect
> for code that is currently working and is cleanly expressible in both
> Python 2 and Python 3.  We aren't winning users with API churn.
> FWIW, I'm guessing that the differing viewpoints in the thread stem
> mainly from the proponents' experiences with bytes() rather than
> from experience with bytearray() which doesn't seem to have any
> usage problems in the wild.  I've never seen a developer say they
> didn't understand what "buf = bytearray(1024)" means.   That is
> not an actual problem that needs solving (or breaking).
> What may be an actual problem is code like "char = bytes(1024)"
> though I'm unclear what a user might have actually been trying
> to do with code like that.

I think this is probably correct. I generally don't think that bytes(1024)
makes much sense at all, especially not as a default constructor. Most likely
it exists to be similar to bytearray().

I don't have a specific problem with bytearray(1024), though I do think it's
more elegantly and clearly described as bytearray.zeros(1024), but not by much.

I find bytes.byte()/bytearray.byte() to be needed as long as there isn't a
simple way to iterate over a bytes or bytearray in a way that yields bytes or
bytearrays instead of integers. To be honest I can't think of a time when I'd
actually *want* to iterate over a bytes/bytearray as integers, although I
realize there is unlikely to be a reasonable method to change that now. If
iterbytes is added I'm not sure where I'd personally use either bytes.byte() or
bytearray.byte().

In general though I think that overloading a single constructor method to do
something conceptually different based on the type of the parameter leads to
these kinds of confusing scenarios and that having differently named constructors
for the different concepts is far clearer.

So given all that, I am:

* +10000 for some method of iterating over both types as bytes instead of
  integers.
* +1 on adding .zeros to both types as an alternative and preferred method of
  creating a zero filled instance and deprecating the original method[1].
* -0 on adding .byte to both types as an alternative method of creating a
  single byte instance.
* -1 on changing the meaning of bytearray(1024).
* +/-0 on changing the meaning of bytes(1024), I think that bytes(1024) is
  likely to *not* be what someone wants and that what they really want is
  bytes([N]). I also think that the number one reason for someone to be doing
  bytes(N) is because they were attempting to iterate over a bytes or bytearray
  object and they got an integer. I also think that it's bad that this changes
  from 2.x to 3.x and I wish it hadn't. However I can't decide if it's worth
  reverting this at this time or not.

[1] By deprecating I mean, raise a deprecation warning, or something but my
    thoughts on actually removing the other methods are listed explicitly.

Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


From markus at  Sun Aug 17 23:55:47 2014
From: markus at (Markus Unterwaditzer)
Date: Sun, 17 Aug 2014 23:55:47 +0200
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <20140817215547.GA9919@chromebot.unti>

On Sun, Aug 17, 2014 at 05:41:10PM -0400, Barry Warsaw wrote:
> I think the biggest API "problem" is that default iteration returns integers
> instead of bytes.  That's a real pain.

I agree, this behavior required some helper functions while porting Werkzeug to
Python 3 AFAIK.

> I'm not sure .iterbytes() is the best name for spelling iteration over bytes
> instead of integers though.  Given that we can't change __iter__(), I
> personally would perhaps prefer a simple .bytes property over which if you
> iterated you would receive bytes, e.g.

I'd rather be for a .bytes() method, to match the .values(), and .keys()
methods on dictionaries.

-- Markus

From antoine at  Mon Aug 18 00:33:01 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 17 Aug 2014 18:33:01 -0400
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <lsraiu$hld$>

On 17/08/2014 13:07, Raymond Hettinger wrote:
> FWIW, I've been teaching Python full time for three years.  I cover the
> use of bytearray(n) in my classes and not a single person out of 3000+
> engineers have had a problem with it.

This is less about bytearray() than bytes(), IMO. bytearray() is 
sufficiently specialized that only experienced people will encounter it.

And while preallocating a bytearray of a certain size makes sense, it's 
completely pointless for a bytes object.



From ncoghlan at  Mon Aug 18 00:48:08 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 18 Aug 2014 08:48:08 +1000
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <20140817215547.GA9919@chromebot.unti>
References: <>
Message-ID: <>

On 18 Aug 2014 08:04, "Markus Unterwaditzer" <markus at> wrote:
> On Sun, Aug 17, 2014 at 05:41:10PM -0400, Barry Warsaw wrote:
> > I think the biggest API "problem" is that default iteration returns integers
> > instead of bytes.  That's a real pain.
> I agree, this behavior required some helper functions while porting Werkzeug to
> Python 3 AFAIK.
> > I'm not sure .iterbytes() is the best name for spelling iteration over bytes
> > instead of integers though.  Given that we can't change __iter__(), I
> > personally would perhaps prefer a simple .bytes property over which if you
> > iterated you would receive bytes, e.g.
> I'd rather be for a .bytes() method, to match the .values(), and .keys()
> methods on dictionaries.

Calling it bytes is too confusing:

    for x in bytes(data):

    for x in bytes(data).bytes()

When referring to bytes, which bytes do you mean, the builtin or the method?

iterbytes() isn't especially attractive as a method name, but it's far more
explicit about its purpose.


> -- Markus
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at
> Unsubscribe:

From barry at  Mon Aug 18 00:52:36 2014
From: barry at (Barry Warsaw)
Date: Sun, 17 Aug 2014 18:52:36 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 18, 2014, at 08:48 AM, Nick Coghlan wrote:

>Calling it bytes is too confusing:
>    for x in bytes(data):
>       ...
>    for x in bytes(data).bytes()
>When referring to bytes, which bytes do you mean, the builtin or the method?
>iterbytes() isn't especially attractive as a method name, but it's far more
>explicit about its purpose.

I don't know.  How often do you really instantiate the bytes object there in
the for loop?



From ncoghlan at  Mon Aug 18 01:08:09 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 18 Aug 2014 09:08:09 +1000
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 18 Aug 2014 03:07, "Raymond Hettinger" <raymond.hettinger at> wrote:
> On Aug 17, 2014, at 1:41 AM, Nick Coghlan <ncoghlan at> wrote:
>> If I see "bytearray(10)" there is nothing there that suggests "this
>> creates an array of length 10 and initialises it to zero" to me. I'd
>> be more inclined to guess it would be equivalent to "bytearray([10])".
>> "bytearray.zeros(10)", on the other hand, is relatively clear,
>> independently of user expectations.
> Zeros would have been great but that should have been done originally.
> The time to get API design right is at inception.
> Now, you're just breaking code and invalidating any published examples.

I'm fine with postponing the deprecation elements indefinitely (or just
deprecating bytes(int) and leaving bytearray(int) alone).

>>> Another thought is that the core devs should be very reluctant to deprecate
>>> anything we don't have to while the 2 to 3 transition is still in progress.
>>> Every new deprecation of APIs that existed in Python 2.7 just adds another
>>> obstacle to converting code.  Individually, the differences are trivial.
>>> Collectively, they present a good reason to never migrate code to Python 3.
>> This is actually one of the inconsistencies between the Python 2 and 3
>> binary APIs:
> However, bytearray(n) is the same in both Python 2 and Python 3.
> Changing it in Python 3 increases the gulf between the two.
> The further we let Python 3 diverge from Python 2, the less likely that
> people will convert their code and the harder you make it to write code
> that runs under both.
> FWIW, I've been teaching Python full time for three years.  I cover the
> use of bytearray(n) in my classes and not a single person out of 3000+
> engineers have had a problem with it.   I seriously question the PEP's
> assertion that there is a real problem to be solved (i.e. that people
> are baffled by bytearray(bufsiz)) and that the problem is sufficiently
> painful to warrant the headaches that go along with API changes.

Yes, I'd expect engineers and networking folks to be fine with it. It isn't
how this mode of the constructor *works* that worries me, it's how it
*fails* (i.e. silently producing unexpected data rather than a type error).

Purely deprecating the bytes case and leaving bytearray alone would likely
address my concerns.
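
The failure mode described above can be sketched as follows, assuming Python 3. The name `header` and its contents are made up for illustration.

```python
header = b"\x03abc"        # hypothetical data: a length byte, then a payload

first = header[0]          # indexing gives the int 3, not b"\x03"
wrong = bytes(first)       # no error: silently yields three NUL bytes
right = bytes([first])     # the intended single byte

assert wrong == b"\x00\x00\x00"
assert right == b"\x03"
assert header.startswith(right) and not header.startswith(wrong)
```

Nothing raises here, which is the point: the mistake only shows up later as wrong data.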

> The other proposal to add bytearray.byte(3) should probably be named
> bytearray.from_byte(3) for clarity.  That said, I question whether there is
> actually a use case for this.   I have never seen code that has a
> need to create a byte array of length one from a single integer.
> For the most part, the API will be easiest to learn if it matches what
> we do for lists and for array.array.

This part of the proposal came from a few things:

* many of the bytes and bytearray methods only accept bytes-like objects,
but iteration and indexing produce integers
* to mitigate the impact of the above, some (but not all) bytes and
bytearray methods now accept integers in addition to bytes-like objects
* ord() in Python 3 is only documented as accepting length 1 strings, but
also accepts length 1 bytes-like objects

Adding bytes.byte() makes it practical to document the binary half of ord's
behaviour, and eliminates any temptation to expand the "also accepts
integers" behaviour out to more types.

bytes.byte() thus becomes the binary equivalent of chr(), just as Python 2
had both chr() and unichr().

I don't recall ever needing chr() in a real program either, but I still
consider it an important part of clearly articulating the data model.
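
A short sketch of the asymmetries listed above, assuming Python 3.3 or later. It uses only existing calls; bytes.byte() is the proposal under discussion and is deliberately absent.

```python
# Iteration and indexing produce integers...
assert b"abc"[1] == 98

# ...while methods generally take bytes-like arguments. Some also accept ints.
assert b"abc".find(b"b") == 1
assert b"abc".find(98) == 1      # integer argument accepted since Python 3.3
assert 98 in b"abc"

# Others do not: startswith() rejects an integer.
try:
    b"abc".startswith(98)
except TypeError:
    startswith_takes_int = False
else:
    startswith_takes_int = True
assert not startswith_takes_int

# ord() accepts a length-1 bytes object as well as a length-1 str.
assert ord(b"b") == 98 == ord("b")

# chr() maps int -> str; bytes([n]) is today's spelling of int -> bytes.
assert chr(98) == "b"
assert bytes([98]) == b"b"
```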

> Sorry Nick, but I think you're making the API worse instead of better.
> This API isn't perfect but it isn't flat-out broken either.   There is some
> unfortunate asymmetry between bytes() and bytearray() in Python 2,
> but that ship has sailed.  The current API for Python 3 is pretty good
> (though there is still a tension between wanting to be like lists and like
> strings both at the same time).

Yes. It didn't help that the docs previously expected readers to infer the
behaviour of the binary sequence methods from the string documentation -
while the new docs could still use some refinement, I've at least addressed
that part of the problem.


From ncoghlan at  Mon Aug 18 01:12:39 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 18 Aug 2014 09:12:39 +1000
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 18 Aug 2014 08:55, "Barry Warsaw" <barry at> wrote:
> On Aug 18, 2014, at 08:48 AM, Nick Coghlan wrote:
> >Calling it bytes is too confusing:
> >
> >    for x in bytes(data):
> >       ...
> >
> >    for x in bytes(data).bytes()
> >
> >When referring to bytes, which bytes do you mean, the builtin or the method?
> >
> >iterbytes() isn't especially attractive as a method name, but it's far more
> >explicit about its purpose.
> I don't know.  How often do you really instantiate the bytes object there in
> the for loop?

I'm talking more generally - do you *really* want to be explaining that
"bytes" behaves like a tuple of integers, while "bytes.bytes" behaves like
a tuple of bytes?

Namespaces are great and all, but using the same name for two different
concepts is still inherently confusing.


> -Barry

From antoine at  Mon Aug 18 01:23:00 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 17 Aug 2014 19:23:00 -0400
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <lsrdgl$g11$>

On 16/08/2014 01:17, Nick Coghlan wrote:
> * Deprecate passing single integer values to ``bytes`` and ``bytearray``

I'm neutral. Ideally we wouldn't have made that mistake at the beginning.

> * Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors
> * Add ``bytes.byte`` and ``bytearray.byte`` alternative constructors
> * Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and
>    ``memoryview.iterbytes`` alternative iterators

+0.5. "iterbytes" isn't really great as a name.



From raymond.hettinger at  Mon Aug 18 01:41:38 2014
From: raymond.hettinger at (Raymond Hettinger)
Date: Sun, 17 Aug 2014 16:41:38 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 17, 2014, at 4:08 PM, Nick Coghlan <ncoghlan at> wrote:

> Purely deprecating the bytes case and leaving bytearray alone would likely address my concerns.

That is good progress.  Thanks :-)

Would a warning for the bytes case suffice, or do you need an actual deprecation?

> bytes.byte() thus becomes the binary equivalent of chr(), just as Python 2 had both chr() and unichr().
> I don't recall ever needing chr() in a real program either, but I still consider it an important part of clearly articulating the data model.

"I don't recall having ever needed this"  greatly weakens the premise that this is needed :-)

The APIs have been around since 2.6 and AFAICT there has been zero demonstrated
need for a special case for a single byte.  We already have a perfectly good spelling:
   NUL = bytes([0])

The Zen tells us we really don't need a second way to do it (actually a third since you
can also write b'\x00') and it suggests that this special case isn't special enough.

I encourage restraint against adding an unneeded class method that has no parallel
elsewhere.  Right now, the learning curve is mitigated because bytes is very str-like
and because bytearray is list-like (i.e. the method names have been used elsewhere
and likely already learned before encountering bytes() or bytearray()).  Putting in a new,
rarely used, funky method adds to the learning burden.

If you do press forward with adding it (and I don't see why), then as an alternate 
constructor, the name should be from_int() or some such to avoid ambiguity
and to make clear that it is a class method.

> iterbytes() isn't especially attractive as a method name, but it's far more
> explicit about its purpose.

I concur.  In this case, explicitness matters.



From ncoghlan at  Mon Aug 18 01:51:40 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 18 Aug 2014 09:51:40 +1000
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 18 Aug 2014 09:41, "Raymond Hettinger" <raymond.hettinger at> wrote:
> I encourage restraint against adding an unneeded class method that has no parallel
> elsewhere.  Right now, the learning curve is mitigated because bytes is very str-like
> and because bytearray is list-like (i.e. the method names have been used elsewhere
> and likely already learned before encountering bytes() or bytearray()).  Putting in new,
> rarely used funky method adds to the learning burden.
> If you do press forward with adding it (and I don't see why), then as an alternate
> constructor, the name should be from_int() or some such to avoid ambiguity
> and to make clear that it is a class method.

If I remember the sequence of events correctly, I thought of
map(bytes.byte, data) first, and then Guido suggested a dedicated
iterbytes() method later.

The step I hadn't taken (until now) was realising that the new
memoryview(data).iterbytes() capability actually combines with the existing
(bytes([b]) for b in data) to make the original bytes.byte idea unnecessary.
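
As a rough sketch of what already works without bytes.byte(), assuming Python 3.3 or later: the generator expression above, plus a memoryview cast to the 'c' format as a stand-in, since iterbytes() itself is still only proposed.

```python
data = b"abc"

# The existing spelling: wrap each integer in a one-element list.
via_genexp = list(bytes([b]) for b in data)

# A memoryview cast to the 'c' format exposes each element as a
# length-1 bytes object instead of an integer.
via_view = memoryview(data).cast("c").tolist()

assert via_genexp == [b"a", b"b", b"c"]
assert via_view == [b"a", b"b", b"c"]
```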


From barry at  Mon Aug 18 01:55:02 2014
From: barry at (Barry Warsaw)
Date: Sun, 17 Aug 2014 19:55:02 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote:

>I'm talking more generally - do you *really* want to be explaining that
>"bytes" behaves like a tuple of integers, while "bytes.bytes" behaves like
>a tuple of bytes?

I would explain it differently though, using concrete examples.

    data = bytes(...)
    for i in data: # iterate over data as integers
    for i in data.bytes: # iterate over data as bytes

But whatever.  I just wish there was something better than iterbytes.


From ncoghlan at  Mon Aug 18 02:08:24 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 18 Aug 2014 10:08:24 +1000
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 18 Aug 2014 09:57, "Barry Warsaw" <barry at> wrote:
> On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote:
> >I'm talking more generally - do you *really* want to be explaining that
> >"bytes" behaves like a tuple of integers, while "bytes.bytes" behaves like
> >a tuple of bytes?
> I would explain it differently though, using concrete examples.
>     data = bytes(...)
>     for i in data: # iterate over data as integers
>     for i in data.bytes: # iterate over data as bytes
> But whatever.  I just wish there was something better than iterbytes.

There's actually another aspect to your idea, independent of the naming:
exposing a view rather than just an iterator. I'm going to have to look at
the implications for memoryview, but it may be a good way to go (and would
align with the iterator -> view changes in dict).


> -Barry

From barry at  Mon Aug 18 02:22:07 2014
From: barry at (Barry Warsaw)
Date: Sun, 17 Aug 2014 20:22:07 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 18, 2014, at 10:08 AM, Nick Coghlan wrote:

>There's actually another aspect to your idea, independent of the naming:
>exposing a view rather than just an iterator. I'm going to have to look at
>the implications for memoryview, but it may be a good way to go (and would
>align with the iterator -> view changes in dict).

Yep!  Maybe that will inspire a better spelling. :)


From guido at  Mon Aug 18 02:45:32 2014
From: guido at (Guido van Rossum)
Date: Sun, 17 Aug 2014 17:45:32 -0700
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Sun, Aug 17, 2014 at 5:22 PM, Barry Warsaw <barry at> wrote:

> On Aug 18, 2014, at 10:08 AM, Nick Coghlan wrote:
> >There's actually another aspect to your idea, independent of the naming:
> >exposing a view rather than just an iterator. I'm going to have to look at
> >the implications for memoryview, but it may be a good way to go (and would
> >align with the iterator -> view changes in dict).
> Yep!  Maybe that will inspire a better spelling. :)

+1. It's just as much about b[i] as it is about "for c in b", so a view
sounds right. (The view would have to be mutable for bytearrays and for
writable memoryviews.)
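
To make the view idea concrete, here is a purely hypothetical sketch in Python. The class name BytesView and its behaviour are assumptions for illustration only, not a proposed or existing API, and it handles integer indexes only.

```python
class BytesView:
    """Hypothetical view exposing a binary sequence as length-1 bytes."""

    def __init__(self, data):
        self._data = data  # bytes, bytearray, or a byte-format memoryview

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # Indexing the underlying object gives an int; wrap it back up.
        return bytes([self._data[index]])

    def __setitem__(self, index, value):
        if len(value) != 1:
            raise ValueError("expected a length-1 bytes-like object")
        # Raises TypeError if the underlying object is immutable (bytes).
        self._data[index] = value[0]

    def __iter__(self):
        return (bytes([b]) for b in self._data)


buf = bytearray(b"abc")
view = BytesView(buf)
assert view[0] == b"a"                      # b[i] yields bytes
assert list(view) == [b"a", b"b", b"c"]     # so does "for c in b"
view[1] = b"X"                              # writes go through to the bytearray
assert buf == bytearray(b"aXc")
```

Because the view holds a reference rather than a copy, the write is visible through `buf`, which is the mutability requirement mentioned above.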

On the rest, it's sounding more and more as if we will just need to live
with both bytes(1000) and bytearray(1000). A warning sounds worse than a
deprecation to me.

bytes.zeros(n) sounds fine to me; I value similar interfaces for bytes and
bytearray pretty highly.

I'm lukewarm on bytes.byte(c); but bytes([c]) does bother me because a size
one list is (or at least feels) more expensive to allocate than a size one
bytes object. So, okay.

--Guido van Rossum (

From guido at  Mon Aug 18 03:02:18 2014
From: guido at (Guido van Rossum)
Date: Sun, 17 Aug 2014 18:02:18 -0700
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On Sun, Aug 17, 2014 at 6:29 AM, Barry Warsaw <barry at> wrote:

> On Aug 16, 2014, at 07:43 PM, Guido van Rossum wrote:
> >(Don't understand this to mean that we should never deprecate things.
> >Deprecations will happen, they are necessary for the evolution of any
> >programming language. But they won't ever hurt in the way that Python 3
> >hurt.)
> It would be useful to explore what causes the most pain in the 2->3
> transition?  IMHO, it's not the deprecations or changes such as print ->
> print().  It's the bytes/str split - a fundamental change to core and common
> data types.  The question then is whether you foresee any similar looming
> pervasive change? [*]

I'm unsure about what's the single biggest pain moving to Python 3. In the
past I would have said that it's for sure the bytes/str split (which is both
the biggest pain and the biggest payoff).

But if I look carefully into the soul of teams that are still on 2.7 (I
know a few... :-), I think the real reason is that Python 3 changes so many
different things, you have to actually understand your code to port it
(unlike with minor version transitions, where the changes usually spike in
one specific area, and you can leave the rest to normal attrition and
periodic maintenance).

> [*] I was going to add a joke about mandatory static type checking, but
> sometimes jokes are blown up into apocalyptic prophesy around here. ;)

Heh. :-)

--Guido van Rossum (

From donald at  Mon Aug 18 03:14:46 2014
From: donald at (Donald Stufft)
Date: Sun, 17 Aug 2014 21:14:46 -0400
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On Sun, Aug 17, 2014, at 09:02 PM, Guido van Rossum wrote:
> On Sun, Aug 17, 2014 at 6:29 AM, Barry Warsaw <barry at> wrote:
>> On Aug 16, 2014, at 07:43 PM, Guido van Rossum wrote:
>>> (Don't understand this to mean that we should never deprecate things.
>>> Deprecations will happen, they are necessary for the evolution of any
>>> programming language. But they won't ever hurt in the way that Python 3
>>> hurt.)
>> It would be useful to explore what causes the most pain in the 2->3
>> transition?  IMHO, it's not the deprecations or changes such as print ->
>> print().  It's the bytes/str split - a fundamental change to core and
>> common data types.  The question then is whether you foresee any similar
>> looming pervasive change? [*]
> I'm unsure about what's the single biggest pain moving to Python 3. In the past I would have said that it's for sure the bytes/str split (which is both the biggest pain and the biggest payoff).
> But if I look carefully into the soul of teams that are still on 2.7 (I know a few... :-), I think the real reason is that Python 3 changes so many different things, you have to actually understand your code to port it (unlike with minor version transitions, where the changes usually spike in one specific area, and you can leave the rest to normal attrition and periodic maintenance).

In my experience bytes/str is the single biggest change that causes the
most problems. Most of the other changes can be mechanically transformed
and/or papered over using helpers like six. The bytes/str change is the
main one that requires understanding code and where it requires a
serious untangling of things in code bases where str/bytes are freely
used interchangeably. Often times this requires making a decision about
what *should* be bytes or str as well which requires having some deep
knowledge about the APIs in question too.

From antoine at  Mon Aug 18 03:39:31 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 17 Aug 2014 21:39:31 -0400
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <lsrlgk$102$>

On 17/08/2014 19:41, Raymond Hettinger wrote:
> The APIs have been around since 2.6 and AFAICT there have been zero
> demonstrated
> need for a special case for a single byte.  We already have a perfectly
> good spelling:
>     NUL = bytes([0])

That is actually a very cumbersome spelling. Why should I first create a 
one-element list in order to create a one-byte bytes object?

> The Zen tells us we really don't need a second way to do it (actually a
> third since you
> can also write b'\x00') and it suggests that this special case isn't
> special enough.

b'\x00' is obviously the right way to do it in this case, but we're 
concerned about the non-constant case.

The reason to instantiate bytes from a non-constant integer comes from the 
unfortunate indexing and iteration behaviour of bytes objects.
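
A minimal sketch of the non-constant case, assuming Python 3. The names `n` and `data` are placeholders chosen for illustration; they do not come from the thread.

```python
# `n` stands in for any integer that is only known at runtime.
n = 65

# Current spelling: wrap the integer in a one-element list.
single = bytes([n])

# Indexing and iterating a bytes object both yield ints, which is where
# such integers usually come from in the first place.
data = b"ABC"
first = data[0]            # an int, not a length-1 bytes object
items = list(data)         # a list of ints

# Slicing keeps the bytes type, so it is a common way to avoid the int.
first_as_bytes = data[0:1]
```

The slice form only helps when the original bytes object is still at hand; once only the integer is left, bytes([n]) is the spelling under discussion.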



From ethan at  Mon Aug 18 03:44:23 2014
From: ethan at (Ethan Furman)
Date: Sun, 17 Aug 2014 18:44:23 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/17/2014 02:19 PM, Raymond Hettinger wrote:
> On Aug 17, 2014, at 11:33 AM, Ethan Furman wrote:
>> I've had many of the problems Nick states and I'm also +1.
> There are two code snippets below which were taken from the standard library.


My issues are with 'bytes', not 'bytearray'.  'bytearray(10)' actually makes sense.  I certainly have no problem with 
bytearray and bytes not being exactly the same.

My primary issues with bytes are not being able to do b'abc'[2] == b'c', and not being able to do x = b'abc'[2]; y = 
bytes(x); assert y == b'c'.
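
For reference, a short sketch of what those two expressions do today, assuming Python 3. The variable names are illustrative only.

```python
# Indexing a bytes object returns an int.
x = b"abc"[2]
equal_to_c = (x == b"c")      # an int never compares equal to bytes

# bytes(int) does not convert the value; it allocates that many NUL bytes.
y = bytes(x)

# The spelling that does give back the single byte.
workaround = bytes([x])
```

So the assert in the message above would fail: y is 99 zero bytes rather than b'c'.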

And because of the backwards compatibility issues I would deprecate, because we have a new 'better' way, but not remove, 
the current functionality.

I pretty much agree exactly with what Donald Stufft said about it.


From antoine at  Mon Aug 18 03:40:50 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 17 Aug 2014 21:40:50 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <lsrlj2$102$>

On 17/08/2014 20:08, Nick Coghlan wrote:
> On 18 Aug 2014 09:57, "Barry Warsaw" <barry at
> <mailto:barry at>> wrote:
>  >
>  > On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote:
>  >
>  > >I'm talking more generally - do you *really* want to be explaining that
>  > >"bytes" behaves like a tuple of integers, while "bytes.bytes"
> behaves like
>  > >a tuple of bytes?
>  >
>  > I would explain it differently though, using concrete examples.
>  >
>  >     data = bytes(...)
>  >     for i in data: # iterate over data as integers
>  >     for i in data.bytes: # iterate over data as bytes
>  >
>  > But whatever.  I just wish there was something better than iterbytes.
> There's actually another aspect to your idea, independent of the naming:
> exposing a view rather than just an iterator.

So that view would actually be the bytes object done right? Funny :-)
Will it have lazy slicing?



From donald at  Mon Aug 18 03:48:21 2014
From: donald at (Donald Stufft)
Date: Sun, 17 Aug 2014 21:48:21 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <lsrlj2$102$>
References: <>
Message-ID: <>

from __future__ import bytesdoneright? :D

  Donald Stufft
  donald at

On Sun, Aug 17, 2014, at 09:40 PM, Antoine Pitrou wrote:
> On 17/08/2014 20:08, Nick Coghlan wrote:
> >
> > On 18 Aug 2014 09:57, "Barry Warsaw" <barry at
> > <mailto:barry at>> wrote:
> >  >
> >  > On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote:
> >  >
> >  > >I'm talking more generally - do you *really* want to be explaining that
> >  > >"bytes" behaves like a tuple of integers, while "bytes.bytes"
> > behaves like
> >  > >a tuple of bytes?
> >  >
> >  > I would explain it differently though, using concrete examples.
> >  >
> >  >     data = bytes(...)
> >  >     for i in data: # iterate over data as integers
> >  >     for i in data.bytes: # iterate over data as bytes
> >  >
> >  > But whatever.  I just wish there was something better than iterbytes.
> >
> > There's actually another aspect to your idea, independent of the naming:
> > exposing a view rather than just an iterator.
> So that view would actually be the bytes object done right? Funny :-)
> Will it have lazy slicing?
> Regards
> Antoine.
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at
> Unsubscribe:

From ethan at  Mon Aug 18 03:52:10 2014
From: ethan at (Ethan Furman)
Date: Sun, 17 Aug 2014 18:52:10 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/17/2014 04:08 PM, Nick Coghlan wrote:
> I'm fine with postponing the deprecation elements indefinitely (or just deprecating bytes(int) and leaving
> bytearray(int) alone).

+1 on both pieces.


From graffatcolmingov at  Mon Aug 18 04:02:52 2014
From: graffatcolmingov at (Ian Cordasco)
Date: Sun, 17 Aug 2014 21:02:52 -0500
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Sun, Aug 17, 2014 at 8:52 PM, Ethan Furman <ethan at> wrote:
> On 08/17/2014 04:08 PM, Nick Coghlan wrote:
>> I'm fine with postponing the deprecation elements indefinitely (or just
>> deprecating bytes(int) and leaving
>> bytearray(int) alone).
> +1 on both pieces.

Perhaps postpone the deprecation to Python 4000 ;)

From alex.gaynor at  Mon Aug 18 04:14:01 2014
From: alex.gaynor at (Alex Gaynor)
Date: Mon, 18 Aug 2014 02:14:01 +0000 (UTC)
Subject: [Python-Dev]
References: <>
Message-ID: <>

Donald Stufft <donald <at>> writes:

> For the record I've had all of the problems that Nick states and I'm
> +1 on this change.
> ---
> Donald Stufft
> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

I've hit basically every problem everyone here has stated, and in no uncertain
terms am I completely opposed to deprecating anything. The Python 2 to 3
migration is already hard enough, and already proceeding far too slowly for
many of our tastes. Making that migration even more complex would drive me to
the point of giving up.


From chrism at  Mon Aug 18 04:51:26 2014
From: chrism at (Chris McDonough)
Date: Sun, 17 Aug 2014 22:51:26 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <lsrlj2$102$>
References: <>
Message-ID: <>

On 08/17/2014 09:40 PM, Antoine Pitrou wrote:
> On 17/08/2014 20:08, Nick Coghlan wrote:
>> On 18 Aug 2014 09:57, "Barry Warsaw" <barry at
>> <mailto:barry at>> wrote:
>>  >
>>  > On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote:
>>  >
>>  > >I'm talking more generally - do you *really* want to be explaining
>> that
>>  > >"bytes" behaves like a tuple of integers, while "bytes.bytes"
>> behaves like
>>  > >a tuple of bytes?
>>  >
>>  > I would explain it differently though, using concrete examples.
>>  >
>>  >     data = bytes(...)
>>  >     for i in data: # iterate over data as integers
>>  >     for i in data.bytes: # iterate over data as bytes
>>  >
>>  > But whatever.  I just wish there was something better than iterbytes.
>> There's actually another aspect to your idea, independent of the naming:
>> exposing a view rather than just an iterator.
> So that view would actually be the bytes object done right? Funny :-)
> Will it have lazy slicing?

bytes.sorry()? ;-)

- C

From jeanpierreda at  Mon Aug 18 05:50:40 2014
From: jeanpierreda at (Devin Jeanpierre)
Date: Sun, 17 Aug 2014 20:50:40 -0700
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Sun, Aug 17, 2014 at 7:14 PM, Alex Gaynor <alex.gaynor at> wrote:
> I've hit basically every problem everyone here has stated, and in no uncertain
> terms am I completely opposed to deprecating anything. The Python 2 to 3
> migration is already hard enough, and already proceeding far too slowly for
> many of our tastes. Making that migration even more complex would drive me to
> the point of giving up.

Could you elaborate what problems you are thinking this will cause for you?

It seems to me that avoiding a bug-prone API is not particularly
complex, and moving it back to its 2.x semantics or making it not work
entirely, rather than making it work differently, would make porting
applications easier. If, during porting to 3.x, you find a deprecation
warning for bytes(n), then rather than being annoying code churny
extra changes, this is actually a bug that's been identified. So it's
helpful even during the deprecation period.

-- Devin

From ncoghlan at  Mon Aug 18 08:45:27 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 18 Aug 2014 16:45:27 +1000
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On 18 August 2014 11:14, Donald Stufft <donald at> wrote:
> On Sun, Aug 17, 2014, at 09:02 PM, Guido van Rossum wrote:
>> I'm unsure about what's the single biggest pain moving to Python 3. In the past I would have said that it's for sure the bytes/str split (which is both the biggest pain and the biggest payoff).
>> But if I look carefully into the soul of teams that are still on 2.7 (I know a few... :-), I think the real reason is that Python 3 changes so many different things, you have to actually understand your code to port it (unlike with minor version transitions, where the changes usually spike in one specific area, and you can leave the rest to normal attrition and periodic maintenance).
> In my experience bytes/str is the single biggest change that causes the
> most problems. Most of the other changes can be mechanically transformed
> and/or papered over using helpers like six. The bytes/str change is the
> main one that requires understanding code and where it requires a
> serious untangling of things in code bases where str/bytes are freely
> used interchangeably. Often times this requires making a decision about
> what *should* be bytes or str as well which requires having some deep
> knowledge about the APIs in question too.

It's certainly the one that has caused the most churn in CPython and
the standard library - the ripples still haven't entirely settled on
that front :)

I think Guido's right that there's also a "death of a thousand cuts"
aspect for large existing code bases, though, especially those that
are lacking comprehensive test suites. By definition, existing large
Python 2 applications are OK with the restrictions imposed by Python
2, and we're deliberately not forcing the issue by halting Python 2
maintenance. That's where Steve Dower's idea of being able to
progressively declare a code base "Python 3 compatible" on a file by
file basis and have some means of programmatically enforcing that is
interesting - it opens the door to "opportunistic and incremental"
porting, where modules are progressively updated to run on both, until
an application reaches a point where it can switch to Python 3 and
leave Python 2 behind.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From barry at  Mon Aug 18 15:50:00 2014
From: barry at (Barry Warsaw)
Date: Mon, 18 Aug 2014 09:50:00 -0400
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes &
In-Reply-To: <lsrlgk$102$>
References: <>
Message-ID: <>

On Aug 17, 2014, at 09:39 PM, Antoine Pitrou wrote:

>> need for a special case for a single byte.  We already have a perfectly
>> good spelling:
>>     NUL = bytes([0])
>That is actually a very cumbersome spelling. Why should I first create a
>one-element list in order to create a one-byte bytes object?

I feel the same way every time I have to write `set(['foo'])`.


From barry at  Mon Aug 18 16:12:59 2014
From: barry at (Barry Warsaw)
Date: Mon, 18 Aug 2014 10:12:59 -0400
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 17, 2014, at 06:02 PM, Guido van Rossum wrote:

>I'm unsure about what's the single biggest pain moving to Python 3. In the
>past I would have said that it's for sure the bytes/str split (which is both
>the biggest pain and the biggest payoff).
>But if I look carefully into the soul of teams that are still on 2.7 (I
>know a few... :-), I think the real reason is that Python 3 changes so many
>different things, you have to actually understand your code to port it
>(unlike with minor version transitions, where the changes usually spike in
>one specific area, and you can leave the rest to normal attrition and
>periodic maintenance).

The latter is a good point, and sometimes it's a huge challenge to understand
the code being ported.  A good test suite (and dare I say, doctests :) help a
lot with this.

I've ported a ton of stuff, and failed at a few.  I think all the little
changes are mostly tractable, and we've assembled a pretty good stack of
documents to help[*].

Sometimes a seemingly easy and mechanical port will produce odd failures,
where more domain expertise needs to be brought to bear to get just the right
bilingual invocation.  But if the underlying code does not itself have a clear
bytes/str distinction, then you're doomed.  One of my failures was a Python
binding for a large C++ project that deeply conflated data and text.  Another
was a pure Python library that essentially did the same.  In both cases, I
ended up in a situation where some core types could be neither str nor bytes
without some part of the test suite failing miserably.  Those are the types of
projects that largely get left unported since it's much harder to justify the
costs vs. benefits.



From barry at  Mon Aug 18 16:17:23 2014
From: barry at (Barry Warsaw)
Date: Mon, 18 Aug 2014 10:17:23 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Aug 17, 2014, at 09:48 PM, Donald Stufft wrote:

>from __future__ import bytesdoneright? :D

Synonymous to:

bytes = bytesdoneright

or maybe

from bytesdoneright import bytes



From andreas.r.maier at  Mon Aug 18 13:34:32 2014
From: andreas.r.maier at (Andreas Maier)
Date: Mon, 18 Aug 2014 13:34:32 +0200
Subject: [Python-Dev] Review needed for patch for issue #12067
Message-ID: <>

A patch for issue #12067 (targeting Py 3.5) has been available for a few
weeks and is ready for review. From my perspective, it is ready for commit.

Could the community please review the patch?


From dickinsm at  Mon Aug 18 19:22:26 2014
From: dickinsm at (Mark Dickinson)
Date: Mon, 18 Aug 2014 18:22:26 +0100
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <20140817023902.GM4525@ando>
References: <>
Message-ID: <>

[Moderately off-topic]

On Sun, Aug 17, 2014 at 3:39 AM, Steven D'Aprano <steve at> wrote:

> I used to refer to Python 4000 as the hypothetical compatibility break
> version. Now I refer to Python 5000.

I personally think it should be Python 5000000, or Py5M.  When we come to
create the mercurial branch, that should of course, following tradition, be
called p5ym.


From antoine at  Mon Aug 18 19:49:06 2014
From: antoine at (Antoine Pitrou)
Date: Mon, 18 Aug 2014 13:49:06 -0400
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <lsteai$4kl$>

On 18/08/2014 13:22, Mark Dickinson wrote:
> [Moderately off-topic]
> On Sun, Aug 17, 2014 at 3:39 AM, Steven D'Aprano <steve at
> <mailto:steve at>> wrote:
>     I used to refer to Python 4000 as the hypothetical compatibility break
>     version. Now I refer to Python 5000.
> I personally think it should be Python 5000000, or Py5M.  When we come
> to create the mercurial branch, that should of course, following
> tradition, be called p5ym.

I would suggest "NaV", for "not-a-version". It would compare greater 
than all other version numbers (in the spirit of Numpy's "not-a-time", 
slightly tweaked).



From chris.barker at  Mon Aug 18 18:04:06 2014
From: chris.barker at (Chris Barker)
Date: Mon, 18 Aug 2014 09:04:06 -0700
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Sun, Aug 17, 2014 at 2:41 PM, Barry Warsaw <barry at> wrote:

> I think the biggest API "problem" is that default iteration returns
> integers
> instead of bytes.  That's a real pain.

what is really needed for this NOT to be a pain is a byte scalar.

numpy has a scalar type for every type it supports -- this is a GOOD THING

In [53]: a = np.array((3,4,5), dtype=np.uint8)

In [54]: a
Out[54]: array([3, 4, 5], dtype=uint8)

In [55]: a[1]
Out[55]: 4

In [56]: type(a[1])
Out[56]: numpy.uint8

In [57]: a[1].shape
Out[57]: ()

The lack of a  character type is a major source of "type errors" in python
(the whole list of strings vs a single string problem -- both return a
sequence when you index into them or iterate over them)

Anyway, the character ship has long since sailed, but maybe a byte scalar
would be a good idea?

And FWIW, I think the proposal does make for a better, cleaner API.

Whether that's worth the deprecation is not clear to me, though as someone
who's been on the verge of making the leap to 3.* for ages, this isn't
going to make any difference.



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From tjreedy at  Mon Aug 18 22:06:06 2014
From: tjreedy at (Terry Reedy)
Date: Mon, 18 Aug 2014 16:06:06 -0400
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <lstmce$scm$>

On 8/18/2014 12:04 PM, Chris Barker wrote:
> On Sun, Aug 17, 2014 at 2:41 PM, Barry Warsaw <barry at
> <mailto:barry at>> wrote:
>     I think the biggest API "problem" is that default iteration returns
>     integers
>     instead of bytes.  That's a real pain.
> what is really needed for this NOT to be a pain is a byte scalar.

The byte scalar is an int in range(256). Bytes is an array of such.
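
A small sketch of that statement, assuming Python 3: the elements of a bytes object are plain ints, and the constructor rejects anything outside range(256). The flag name is made up for the example.

```python
# Elements of a bytes object are ordinary ints in range(256).
values = list(b"\x00\x7f\xff")

# The range restriction is enforced when the bytes object is built,
# not by the type of the individual elements.
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```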

> numpy has a scalar type for every type it supports -- this is a GOOD
> THING (tm):
> In [53]: a = np.array((3,4,5), dtype=np.uint8)
> In [54]: a
> Out[54]: array([3, 4, 5], dtype=uint8)
> In [55]: a[1]
> Out[55]: 4
> In [56]: type(a[1])
> Out[56]: numpy.uint8
> In [57]: a[1].shape
> Out[57]: ()
> The lack of a  character type is a major source of "type errors" in
> python (the whole list of strings vs a single string problem -- both
> return a sequence when you index into them or iterate over them)

This is exactly what iterbytes would do  -- yields bytes of size 1.

> Anyway, the character ship has long since sailed, but maybe a byte
> scalar would be a good idea?
> And FWIW, I think the proposal does make for a better, cleaner API.

Terry Jan Reedy

From tjreedy at  Mon Aug 18 22:12:22 2014
From: tjreedy at (Terry Reedy)
Date: Mon, 18 Aug 2014 16:12:22 -0400
Subject: [Python-Dev] -- Untrusted Connection (Firefox)
Message-ID: <lstmo7$1sh$>

Firefox does not want to connect to Plain works fine. Has the certificate expired?

Terry Jan Reedy

From phd at  Mon Aug 18 22:19:48 2014
From: phd at (Oleg Broytman)
Date: Mon, 18 Aug 2014 22:19:48 +0200
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <lstmo7$1sh$>
References: <lstmo7$1sh$>
Message-ID: <>

On Mon, Aug 18, 2014 at 04:12:22PM -0400, Terry Reedy <tjreedy at> wrote:
> Firefox does not want to connect to

   Works for me (FF 31).

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From benjamin at  Mon Aug 18 22:22:01 2014
From: benjamin at (Benjamin Peterson)
Date: Mon, 18 Aug 2014 13:22:01 -0700
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <lstmo7$1sh$>
References: <lstmo7$1sh$>
Message-ID: <>

It uses a CACert certificate, which your system probably doesn't trust.

On Mon, Aug 18, 2014, at 13:12, Terry Reedy wrote:
> Firefox does not want to connect to Plain 
> works fine. Has the certificate expired?
> -- 
> Terry Jan Reedy

From graffatcolmingov at  Mon Aug 18 22:26:48 2014
From: graffatcolmingov at (Ian Cordasco)
Date: Mon, 18 Aug 2014 15:26:48 -0500
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <>

On Mon, Aug 18, 2014 at 3:22 PM, Benjamin Peterson <benjamin at> wrote:
> It uses a CACert certificate, which your system probably doesn't trust.
> On Mon, Aug 18, 2014, at 13:12, Terry Reedy wrote:
>> Firefox does not want to connect to Plain
>> works fine. Has the certificate expired?
>> --
>> Terry Jan Reedy

Benjamin that looks accurate. I see the same thing as Terry (on
Firefox 31) and the reason is: uses an invalid security certificate. The certificate
is not trusted because no issuer chain was provided. (Error code:

From phd at  Mon Aug 18 22:30:43 2014
From: phd at (Oleg Broytman)
Date: Mon, 18 Aug 2014 22:30:43 +0200
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <>

On Mon, Aug 18, 2014 at 03:26:48PM -0500, Ian Cordasco <graffatcolmingov at> wrote:
> On Mon, Aug 18, 2014 at 3:22 PM, Benjamin Peterson <benjamin at> wrote:
> > It uses a CACert certificate, which your system probably doesn't trust.
> >
> > On Mon, Aug 18, 2014, at 13:12, Terry Reedy wrote:
> >> Firefox does not want to connect to Plain
> >> works fine. Has the certificate expired?
> Benjamin that looks accurate. I see the same thing as Terry (on
> Firefox 31) and the reason is:
> uses an invalid security certificate. The certificate
> is not trusted because no issuer chain was provided. (Error code:
> sec_error_unknown_issuer)

   Aha, I see now -- the signing certificate is CAcert, which I've
installed manually.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From robertc at  Tue Aug 19 02:07:59 2014
From: robertc at (Robert Collins)
Date: Tue, 19 Aug 2014 12:07:59 +1200
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: <>
References: <>
Message-ID: <>

Indeed - my suggestion is applicable to people using the library

On 10 Aug 2014 18:21, "Larry Hastings" <larry at> wrote:

>  On 08/09/2014 10:40 PM, Robert Collins wrote:
> A small tip from my bzr days - cd into the directory before scanning it
> I doubt that's permissible for a library function like os.scandir().
> */arry*

From chris.barker at  Mon Aug 18 22:37:32 2014
From: chris.barker at (Chris Barker)
Date: Mon, 18 Aug 2014 13:37:32 -0700
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <lstmce$scm$>
References: <>
Message-ID: <>

On Mon, Aug 18, 2014 at 1:06 PM, Terry Reedy <tjreedy at> wrote:

> The byte scalar is an int in range(256). Bytes is an array of such.

then why the complaint about iterating over bytes producing ints? Yes, a
byte would be pretty much the same as an int, but it would have
restrictions - useful ones.

>> numpy has a scalar type for every type it supports -- this is a GOOD
>> THING (tm):
>> In [56]: type(a[1])
>> Out[56]: numpy.uint8
>> In [57]: a[1].shape
>> Out[57]: ()
>> The lack of a  character type is a major source of "type errors" in
>> python (the whole list of strings vs a single string problem -- both
>> return a sequence when you index into them or iterate over them)
> This is exactly what iterbytes would do  -- yields bytes of size 1.

as I understand it, it would yield a bytes object of length one -- that is
a sequence that _happens_ to only have one item in it -- not the same thing.

Note above. In numpy, when you index out of a 1-d array you get a scalar --
with shape == ()  -- not a 1-d array of length 1. And this is useful, as it
provides a clear termination point when you drill down through multiple

I often wish I could do that with nested lists with strings at the bottom.

[1,2,3] is a sequence of numbers

"this" is a sequence of characters -- oops, no it's not, it's a sequence
of sequences of sequences of ...
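
To make the point concrete, here is a rough sketch (assuming Python 3) of why code that drills down through nested lists needs a special case for strings. The flatten() helper is hypothetical and written only for this illustration.

```python
def flatten(obj):
    """Naively flatten nested sequences into a flat list of leaves."""
    if isinstance(obj, str):
        # Without this guard the recursion never bottoms out: indexing a
        # one-character str returns another one-character str.
        return [obj]
    try:
        items = list(obj)
    except TypeError:
        return [obj]          # not iterable, so treat it as a scalar
    result = []
    for item in items:
        result.extend(flatten(item))
    return result


nested = [["ab", "c"], ["d"]]
flat = flatten(nested)

# A str has no scalar element type: every index gives back a str.
s = "this"
same = s[0][0][0]
```

Numbers terminate the recursion on their own because they are not iterable; strings only terminate because of the explicit isinstance check.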

I think it would be cleaner if bytes was a sequence of a scalar byte object.

This is a bigger deal for numpy, what with its n-dimensional arrays and
many reducing operations, but the same principles apply.



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From storchaka at  Tue Aug 19 10:37:00 2014
From: storchaka at (Serhiy Storchaka)
Date: Tue, 19 Aug 2014 11:37:00 +0300
Subject: [Python-Dev] Bytes path support
Message-ID: <lsv2ba$hj7$>

The builtin open(), the io classes, os and os.path functions, and some other
functions in the stdlib support bytes paths as well as str paths. But
many functions don't. There are requests to add this support
([1], [2]) in some modules. It is easy (just call os.fsdecode() on the
argument), but I'm not sure it is worth doing. Pathlib doesn't support
bytes paths, and that looks intentional. What is the general policy on
supporting bytes paths in the stdlib?
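
A minimal sketch of the "just call os.fsdecode()" approach, assuming Python 3.2 or later. The helper name as_str_path() is hypothetical; it is not an existing stdlib function.

```python
import os


def as_str_path(path):
    """Hypothetical helper: accept a str or bytes path, return str."""
    # os.fsdecode() returns str unchanged and decodes bytes using the
    # filesystem encoding.
    return os.fsdecode(path)


p1 = as_str_path("spam.txt")
p2 = as_str_path(b"spam.txt")

# os.fsencode() is the inverse, for code that must hand bytes back.
round_trip = os.fsencode(p2)
```

A function that accepts bytes this way would still have to decide whether to return bytes when given bytes, which is part of the policy question raised above.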


From ncoghlan at  Tue Aug 19 14:25:48 2014
From: ncoghlan at (Nick Coghlan)
Date: Tue, 19 Aug 2014 22:25:48 +1000
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On 18 August 2014 10:45, Guido van Rossum <guido at> wrote:
> On Sun, Aug 17, 2014 at 5:22 PM, Barry Warsaw <barry at> wrote:
>> On Aug 18, 2014, at 10:08 AM, Nick Coghlan wrote:
>> >There's actually another aspect to your idea, independent of the naming:
>> >exposing a view rather than just an iterator. I'm going to have to look
>> > at
>> >the implications for memoryview, but it may be a good way to go (and
>> > would
>> >align with the iterator -> view changes in dict).
>> Yep!  Maybe that will inspire a better spelling. :)
> +1. It's just as much about b[i] as it is about "for c in b", so a view
> sounds right. (The view would have to be mutable for bytearrays and for
> writable memoryviews.)
> On the rest, it's sounding more and more as if we will just need to live
> with both bytes(1000) and bytearray(1000). A warning sounds worse than a
> deprecation to me.

I'm fine with keeping bytearray(1000), since that works the same way
in both Python 2 & 3, and doesn't seem likely to be invoked

I'd still like to deprecate "bytes(1000)", since that does different
things in Python 2 & 3, while "b'\x00' * 1000" does the same thing in

$ python -c 'print("{!r}\n{!r}".format(bytes(10), b"\x00" * 10))'
$ python3 -c 'print("{!r}\n{!r}".format(bytes(10), b"\x00" * 10))'

Hitting the deprecation warning in single-source code would seem to be
a strong hint that you have a bug in one version or the other rather
than being intended behaviour.
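
The difference can be sketched in single-source form as follows. This is an illustration only; the variable names are not from the thread, and the Python 2 branch relies on bytes being an alias for str there.

```python
import sys

from_int = bytes(10)
from_repeat = b"\x00" * 10

if sys.version_info[0] == 2:
    # Python 2: bytes is str, so bytes(10) is the two-character text "10".
    expected_from_int = b"10"
else:
    # Python 3: bytes(10) is ten NUL bytes.
    expected_from_int = b"\x00" * 10
```

Only the repetition spelling gives the same ten NUL bytes on both versions.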

> bytes.zeros(n) sounds fine to me; I value similar interfaces for bytes and
> bytearray pretty highly.

With "bytearray(1000)" sticking around indefinitely, I'm less
concerned about adding a "zeros" constructor.

> I'm lukewarm on bytes.byte(c); but bytes([c]) does bother me because a size
> one list is (or at least feels) more expensive to allocate than a size one
> bytes object. So, okay.

So, here's an interesting thing I hadn't previously registered: we
actually already have a fairly capable "bytesview" option, and have
done since Stefan implemented "memoryview.cast" in 3.3. The trick lies
in the 'c' format character for the struct module, which is parsed as
a length 1 bytes object rather than as an integer:

>>> data = bytearray(b"Hello world")
>>> bytesview = memoryview(data).cast('c')
>>> list(bytesview)
[b'H', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd']
>>> b''.join(bytesview)
b'Hello world'
>>> bytesview[0:5] = memoryview(b"olleH").cast('c')
>>> list(bytesview)
[b'o', b'l', b'l', b'e', b'H', b' ', b'w', b'o', b'r', b'l', b'd']
>>> b''.join(bytesview)
b'olleH world'

For the read-only case, it covers everything (iteration, indexing,
slicing), for the writable view case, it doesn't cover changing the
shape of the target array, and it doesn't cover assigning arbitrary
buffer objects (you need to wrap them in a similar cast for memoryview
to allow the assignment).
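
The assignment restriction can be sketched as follows. This is only an
illustration under Python 3.3+; the variable names are made up, and the
exact exception type raised on the mismatched assignment is treated as
an implementation detail here:

```python
data = bytearray(b"Hello world")
bytesview = memoryview(data).cast('c')

# Indexing a 'c' view yields a length-1 bytes object, not an integer.
first = bytesview[0]

# A plain bytes object exports its buffer with the default 'B' format,
# so assigning it directly into the 'c' view is rejected.
try:
    bytesview[0:5] = b"olleH"
    direct_assignment_ok = True
except (TypeError, ValueError):
    direct_assignment_ok = False

# Wrapping the source in the same cast makes the formats match.
bytesview[0:5] = memoryview(b"olleH").cast('c')
result = bytes(data)
```

After this runs, result holds the modified contents of the underlying
bytearray, since the view writes through to it.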

It's hardly the most *intuitive* spelling though - I was one of the
reviewers for Stefan's memoryview rewrite back in 3.3, and I only made
the connection today when looking to see how a view object like the
one we were discussing elsewhere in the thread might be implemented as
a facade over arbitrary memory buffers, rather than being specific to
bytes and bytearray.

If we went down the "bytesview" path, then a single new facade would
cover not only the 3 builtins (bytes, bytearray, memoryview) but also
any *other* buffer exporting type. If we so chose (at some point in
the future, not as part of this PEP), such a type could allow
additional bytes operations (like "count", "startswith" or "index") to
be applied to arbitrary regions of memory without making a copy. We
can't add those other operations to memoryview, since they don't make
sense for an n-dimensional array.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From guido at  Tue Aug 19 18:46:24 2014
From: guido at (Guido van Rossum)
Date: Tue, 19 Aug 2014 09:46:24 -0700
Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes &
In-Reply-To: <>
References: <>
Message-ID: <>

On Tue, Aug 19, 2014 at 5:25 AM, Nick Coghlan <ncoghlan at> wrote:

> On 18 August 2014 10:45, Guido van Rossum <guido at> wrote:
> > On Sun, Aug 17, 2014 at 5:22 PM, Barry Warsaw <barry at> wrote:
> >>
> >> On Aug 18, 2014, at 10:08 AM, Nick Coghlan wrote:
> >>
> >> >There's actually another aspect to your idea, independent of the naming:
> >> >exposing a view rather than just an iterator. I'm going to have to look at
> >> >the implications for memoryview, but it may be a good way to go (and would
> >> >align with the iterator -> view changes in dict).
> >>
> >> Yep!  Maybe that will inspire a better spelling. :)
> >
> >
> > +1. It's just as much about b[i] as it is about "for c in b", so a view
> > sounds right. (The view would have to be mutable for bytearrays and for
> > writable memoryviews.)
> >
> > On the rest, it's sounding more and more as if we will just need to live
> > with both bytes(1000) and bytearray(1000). A warning sounds worse than a
> > deprecation to me.
> I'm fine with keeping bytearray(1000), since that works the same way
> in both Python 2 & 3, and doesn't seem likely to be invoked
> inadvertently.
> I'd still like to deprecate "bytes(1000)", since that does different
> things in Python 2 & 3, while "b'\x00' * 1000" does the same thing in
> both.

I think any argument based on what "bytes" does in Python 2 is pretty weak,
since Python 2's bytes is just an alias for str, so it has tons of behaviors
that differ -- why single this one out?

In Python 3, I really like bytes and bytearray to be as similar as
possible, and that includes the constructor.

> $ python -c 'print("{!r}\n{!r}".format(bytes(10), b"\x00" * 10))'
> '10'
> '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> $ python3 -c 'print("{!r}\n{!r}".format(bytes(10), b"\x00" * 10))'
> b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> Hitting the deprecation warning in single-source code would seem to be
> a strong hint that you have a bug in one version or the other rather
> than being intended behaviour.
> > bytes.zeros(n) sounds fine to me; I value similar interfaces for bytes and
> > bytearray pretty highly.
> With "bytearray(1000)" sticking around indefinitely, I'm less
> concerned about adding a "zeros" constructor.

That's fine.

> > I'm lukewarm on bytes.byte(c); but bytes([c]) does bother me because a size
> > one list is (or at least feels) more expensive to allocate than a size one
> > bytes object. So, okay.
> So, here's an interesting thing I hadn't previously registered: we
> actually already have a fairly capable "bytesview" option, and have
> done since Stefan implemented "memoryview.cast" in 3.3. The trick lies
> in the 'c' format character for the struct module, which is parsed as
> a length 1 bytes object rather than as an integer:
> >>> data = bytearray(b"Hello world")
> >>> bytesview = memoryview(data).cast('c')
> >>> list(bytesview)
> [b'H', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd']
> >>> b''.join(bytesview)
> b'Hello world'
> >>> bytesview[0:5] = memoryview(b"olleH").cast('c')
> >>> list(bytesview)
> [b'o', b'l', b'l', b'e', b'H', b' ', b'w', b'o', b'r', b'l', b'd']
> >>> b''.join(bytesview)
> b'olleH world'
> For the read-only case, it covers everything (iteration, indexing,
> slicing), for the writable view case, it doesn't cover changing the
> shape of the target array, and it doesn't cover assigning arbitrary
> buffer objects (you need to wrap them in a similar cast for memoryview
> to allow the assignment).
> It's hardly the most *intuitive* spelling though - I was one of the
> reviewers for Stefan's memoryview rewrite back in 3.3, and I only made
> the connection today when looking to see how a view object like the
> one we were discussing elsewhere in the thread might be implemented as
> a facade over arbitrary memory buffers, rather than being specific to
> bytes and bytearray.

Maybe the 'future' package can offer an iterbytes or bytesview implemented
this way?

> If we went down the "bytesview" path, then a single new facade would
> cover not only the 3 builtins (bytes, bytearray, memoryview) but also
> any *other* buffer exporting type. If we so chose (at some point in
> the future, not as part of this PEP), such a type could allow
> additional bytes operations (like "count", "startswith" or "index") to
> be applied to arbitrary regions of memory without making a copy.

Why call out "without making a copy" for operations that naturally don't
have to copy anything?

> We
> can't add those other operations to memoryview, since they don't make
> sense for an n-dimensional array.

I'm sorry for your efforts, but I'm getting more and more lukewarm about
the entire PEP.

--Guido van Rossum

From guido at  Tue Aug 19 19:02:32 2014
From: guido at (Guido van Rossum)
Date: Tue, 19 Aug 2014 10:02:32 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <lsv2ba$hj7$>
References: <lsv2ba$hj7$>
Message-ID: <>

The official policy is that we want them to go away, but reality so far has
not budged. We will continue to hold our breath though. :-)

On Tue, Aug 19, 2014 at 1:37 AM, Serhiy Storchaka <storchaka at> wrote:

> Built-in open(), io classes, os and os.path functions and some other
> functions in the stdlib support bytes paths as well as str paths. But many
> functions don't. There are requests to add this support ([1], [2])
> to some modules. It is easy (just call os.fsdecode() on the argument) but
> I'm not sure it is worth doing. Pathlib doesn't support bytes paths, and
> that looks intentional. What is the general policy on support for bytes
> paths in the stdlib?
> [1]
> [2]

--Guido van Rossum

From benhoyt at  Tue Aug 19 19:31:54 2014
From: benhoyt at (Ben Hoyt)
Date: Tue, 19 Aug 2014 13:31:54 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

> The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)

Does that mean that new APIs should explicitly not support bytes? I'm
thinking of os.scandir() (PEP 471), which I'm implementing at the
moment. I was originally going to make it support bytes so it was
compatible with listdir, but maybe that's a bad idea. Bytes paths are
essentially broken on Windows.
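
For reference, the listdir behaviour in question can be sketched like
this: the type of the argument decides the type of the returned names.
The temporary directory and file name are made up for illustration, and
this assumes a reasonably recent Python 3:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so the directory has something to list.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # A str path gives str names back ...
    str_names = os.listdir(tmp)

    # ... while a bytes path gives bytes names back.
    bytes_names = os.listdir(os.fsencode(tmp))
```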


> On Tue, Aug 19, 2014 at 1:37 AM, Serhiy Storchaka <storchaka at> wrote:
>> Built-in open(), io classes, os and os.path functions and some other functions in the stdlib support bytes paths as well as str paths. But many functions don't. There are requests to add this support ([1], [2]) to some modules. It is easy (just call os.fsdecode() on the argument) but I'm not sure it is worth doing. Pathlib doesn't support bytes paths, and that looks intentional. What is the general policy on support for bytes paths in the stdlib?
>> [1]
>> [2]

From storchaka at  Tue Aug 19 19:34:03 2014
From: storchaka at (Serhiy Storchaka)
Date: Tue, 19 Aug 2014 20:34:03 +0300
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <lt020p$qmk$>

19.08.14 20:02, Guido van Rossum wrote:
> The official policy is that we want them to go away, but reality so far
> has not budged. We will continue to hold our breath though. :-)

Does it mean that we should reject all proposals to add bytes
path support to existing functions (in particular issue19997 (imghdr)
and issue20797 (zipfile))?

From benjamin at  Tue Aug 19 19:40:29 2014
From: benjamin at (Benjamin Peterson)
Date: Tue, 19 Aug 2014 10:40:29 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

On Tue, Aug 19, 2014, at 10:31, Ben Hoyt wrote:
> > The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
> Does that mean that new APIs should explicitly not support bytes? I'm
> thinking of os.scandir() (PEP 471), which I'm implementing at the
> moment. I was originally going to make it support bytes so it was
> compatible with listdir, but maybe that's a bad idea. Bytes paths are
> essentially broken on Windows.

Bytes paths are "essential" on Unix, though, so I don't think we should
create new low-level APIs that don't support bytes.

From benhoyt at  Tue Aug 19 19:43:07 2014
From: benhoyt at (Ben Hoyt)
Date: Tue, 19 Aug 2014 13:43:07 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

>> > The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
>> Does that mean that new APIs should explicitly not support bytes? I'm
>> thinking of os.scandir() (PEP 471), which I'm implementing at the
>> moment. I was originally going to make it support bytes so it was
>> compatible with listdir, but maybe that's a bad idea. Bytes paths are
>> essentially broken on Windows.
> Bytes paths are "essential" on Unix, though, so I don't think we should
> create new low-level APIs that don't support bytes.

Fair enough. I don't quite understand, though -- why is the "official
policy" to kill something that's "essential" on *nix?


From tseaver at  Tue Aug 19 19:56:16 2014
From: tseaver at (Tres Seaver)
Date: Tue, 19 Aug 2014 13:56:16 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <lt0341$7re$>

On 08/19/2014 01:43 PM, Ben Hoyt wrote:
>>>> The official policy is that we want them [support for bytes
>>>> paths in stdlib functions] to go away, but reality so far has
>>>> not budged. We will continue to hold our breath though. :-)
>>> Does that mean that new APIs should explicitly not support bytes?
>>> I'm thinking of os.scandir() (PEP 471), which I'm implementing at
>>> the moment. I was originally going to make it support bytes so it
>>> was compatible with listdir, but maybe that's a bad idea. Bytes
>>> paths are essentially broken on Windows.
>> Bytes paths are "essential" on Unix, though, so I don't think we
>> should create new low-level APIs that don't support bytes.
> Fair enough. I don't quite understand, though -- why is the "official 
> policy" to kill something that's "essential" on *nix?

ISTM that the policy is based on a fantasy that "it looks like text to me
in my use cases, so therefore it must be text for everyone."

-- 
Tres Seaver          +1 540-429-0999          tseaver at
Palladion Software   "Excellence by Design"


From benjamin at  Tue Aug 19 20:00:35 2014
From: benjamin at (Benjamin Peterson)
Date: Tue, 19 Aug 2014 11:00:35 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

On Tue, Aug 19, 2014, at 10:43, Ben Hoyt wrote:
> >> > The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
> >>
> >> Does that mean that new APIs should explicitly not support bytes? I'm
> >> thinking of os.scandir() (PEP 471), which I'm implementing at the
> >> moment. I was originally going to make it support bytes so it was
> >> compatible with listdir, but maybe that's a bad idea. Bytes paths are
> >> essentially broken on Windows.
> >
> > Bytes paths are "essential" on Unix, though, so I don't think we should
> > create new low-level APIs that don't support bytes.
> Fair enough. I don't quite understand, though -- why is the "official
> policy" to kill something that's "essential" on *nix?

Well, notice the official policy is desperately *wanting* them to go
away with the implication that we grudgingly bow to reality. :)

From antoine at  Tue Aug 19 20:06:29 2014
From: antoine at (Antoine Pitrou)
Date: Tue, 19 Aug 2014 14:06:29 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <lt03n5$cu2$>

On 19/08/2014 13:43, Ben Hoyt wrote:
>>>> The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
>>> Does that mean that new APIs should explicitly not support bytes? I'm
>>> thinking of os.scandir() (PEP 471), which I'm implementing at the
>>> moment. I was originally going to make it support bytes so it was
>>> compatible with listdir, but maybe that's a bad idea. Bytes paths are
>>> essentially broken on Windows.
>> Bytes paths are "essential" on Unix, though, so I don't think we should
>> create new low-level APIs that don't support bytes.
> Fair enough. I don't quite understand, though -- why is the "official
> policy" to kill something that's "essential" on *nix?

PEP 383 should actually work on Unix quite well, AFAIR.



From marko at  Tue Aug 19 20:16:40 2014
From: marko at (Marko Rauhamaa)
Date: Tue, 19 Aug 2014 21:16:40 +0300
Subject: [Python-Dev] Bytes path support
In-Reply-To: <lt0341$7re$> (Tres Seaver's message of "Tue, 19
 Aug 2014 13:56:16 -0400")
References: <lsv2ba$hj7$>
Message-ID: <>

Tres Seaver <tseaver at>:

> On 08/19/2014 01:43 PM, Ben Hoyt wrote:
>> Fair enough. I don't quite understand, though -- why is the "official
>> policy" to kill something that's "essential" on *nix?
> ISTM that the policy is based on a fantasy that "it looks like text to
> me in my use cases, so therefore it must be text for everyone."

What I like about Python is that it allows me to write native linux code
without having to make portability compromises that plague, say, Java. I
have select.epoll(). I have os.fork(). I have socket.TCP_CORK. The
"textualization" of Python3 seems part of a conscious effort to make
Python more Java-esque.


From stephen at  Tue Aug 19 20:44:14 2014
From: stephen at (Stephen J. Turnbull)
Date: Wed, 20 Aug 2014 03:44:14 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

Ben Hoyt writes:

 > Fair enough. I don't quite understand, though -- why is the "official
 > policy" to kill something that's "essential" on *nix?

They're not essential on *nix.  Unix paths at the OS level are "just
bytes" (even on Mac, although the most common Mac filesystem does
enforce UTF-8 Unicode NFD).  This use case is now perfectly well
served by codecs.

However, there are a lot of applications that involve reading a file
name from a directory, and passing it verbatim to another OS
function.  This case can be handled now using the surrogateescape
error handler, but when these APIs were introduced we didn't even have
a reliable way to roundtrip filenames because a Unix filename doesn't
need to be a string of characters from *any* character set.

And there's the undeniable convenience of treating file names as
opaque objects in those applications.
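
A minimal sketch of the surrogateescape round trip mentioned above. The
file name is invented for illustration; the point is only that bytes
which are not valid UTF-8 survive a decode/encode cycle unchanged:

```python
# A Latin-1 encoded name; the 0xE9 byte is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# Strict decoding refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xE9 becomes U+DCE9) so that nothing is lost.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")
```

On POSIX systems os.fsdecode() and os.fsencode() apply this handler
with the filesystem encoding, which is what lets a str name read from
a directory be passed back to the OS verbatim.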


From greg.ewing at  Wed Aug 20 00:01:11 2014
From: greg.ewing at (Greg Ewing)
Date: Wed, 20 Aug 2014 10:01:11 +1200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

Ben Hoyt wrote:

> Does that mean that new APIs should explicitly not support bytes?
> ... Bytes paths are essentially broken on Windows.

But on Unix, paths are essentially bytes. What's the
official policy for dealing with that?


From greg.ewing at  Wed Aug 20 00:09:24 2014
From: greg.ewing at (Greg Ewing)
Date: Wed, 20 Aug 2014 10:09:24 +1200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

Stephen J. Turnbull wrote:

> This case can be handled now using the surrogateescape
> error handler,

So maybe the way to make bytes paths go away is to always
use surrogateescape for paths on unix?


From guido at  Wed Aug 20 01:44:05 2014
From: guido at (Guido van Rossum)
Date: Tue, 19 Aug 2014 16:44:05 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <> <>
Message-ID: <>

I'm sorry my moment of levity was taken so seriously.

With my serious hat on, I would like to claim that *conceptually* filenames
are most definitely text. Due to various historical accidents the UNIX
system calls often encoded text as arguments, and we sometimes need to
control that encoding. Hence the occasional need for bytes arguments. But
most of the time you don't have to think about that, and forcing users to
worry about it is mostly as counter-productive as forcing them to think about
the encoding of every text file.

--Guido van Rossum

From stephen at  Wed Aug 20 07:01:10 2014
From: stephen at (Stephen J. Turnbull)
Date: Wed, 20 Aug 2014 14:01:10 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

Greg Ewing writes:
 > Stephen J. Turnbull wrote:
 > > This case can be handled now using the surrogateescape
 > > error handler,
 > So maybe the way to make bytes paths go away is to always
 > use surrogateescape for paths on unix?

Backward compatibility rules that out, I think.  I certainly would
recommend that for new code, but even for new code there are many
users who vehemently object to using Unicode as an intermediate
representation of things they think of as binary blobs.  Not worth the
hassle to even seriously propose removing those APIs IMO.

From guido at  Wed Aug 20 07:06:03 2014
From: guido at (Guido van Rossum)
Date: Tue, 19 Aug 2014 22:06:03 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

On Tuesday, August 19, 2014, Stephen J. Turnbull <stephen at> wrote:

> Greg Ewing writes:
>  > Stephen J. Turnbull wrote:
>  >
>  > > This case can be handled now using the surrogateescape
>  > > error handler,
>  >
>  > So maybe the way to make bytes paths go away is to always
>  > use surrogateescape for paths on unix?
> Backward compatibility rules that out, I think.  I certainly would
> recommend that for new code, but even for new code there are many
> users who vehemently object to using Unicode as an intermediate
> representation of things they think of as binary blobs.  Not worth the
> hassle to even seriously propose removing those APIs IMO.

But maybe we don't have to add new ones?


--Guido van Rossum (on iPad)

From marko at  Wed Aug 20 07:52:19 2014
From: marko at (Marko Rauhamaa)
Date: Wed, 20 Aug 2014 08:52:19 +0300
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
 (Guido van Rossum's message of "Tue, 19 Aug 2014 16:44:05 -0700")
References: <lsv2ba$hj7$>
Message-ID: <>

Guido van Rossum <guido at>:

> With my serious hat on, I would like to claim that *conceptually*
> filenames are most definitely text. Due to various historical
> accidents the UNIX system calls often encoded text as arguments, and
> we sometimes need to control that encoding.

Due to historical accidents, text (in the Python sense) is not a
first-class data type in Unix. Text, machine language, XML, Python etc
are interpretations of bytes. Bytes are the first-class data type
recognized by the kernel. That reality cannot be wished away.

> Hence the occasional need for bytes arguments. But most of the time
> you don't have to think about that, and forcing users to worry about
> it is mostly as counter-productive as forcing them to think about the
> encoding of every text file.

The users of Python programs can often be given higher-level facades.
Unix programmers, though, shouldn't be shielded from bytes.


From stephen at  Wed Aug 20 08:38:01 2014
From: stephen at (Stephen J. Turnbull)
Date: Wed, 20 Aug 2014 15:38:01 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

Guido van Rossum writes:
 > On Tuesday, August 19, 2014, Stephen J. Turnbull <stephen at> wrote:
 > > Greg Ewing writes:

 > >  > So maybe the way to make bytes paths go away is to always
 > >  > use surrogateescape for paths on unix?
 > >
 > > Backward compatibility rules that out, I think.  I certainly would
 > > recommend that for new code, but even for new code there are many
 > > users who vehemently object to using Unicode as an intermediate
 > > representation of things they think of as binary blobs.  Not worth the
 > > hassle to even seriously propose removing those APIs IMO.
 > But maybe we don't have to add new ones?

IMO, we should avoid it.

There may be some use cases.  Serhiy mentions two bug reports:

  issue19997: imghdr.what doesn't accept bytes paths
  issue20797: zipfile.extractall should accept bytes path as parameter

I'm very unsympathetic to these.  In both cases the bytes are coming
from outside of module in question.  Why are they in bytes?  That
question should scare you, because from the point of view of end users
there are no good answers: they all mean that the end user is going to
end up with uninterpretable bytes in their directories, for the
convenience of the programmer.

In the case of issue20797, I'd be a *little* sympathetic if the RFE
were for the *members* argument.  zipfiles evidently have no way to
specify the encodings of the name(s) of their members (and the zipfile
module doesn't have APIs for it!), so the programmer is kind of stuck,
especially if the requirement is that the extraction require no user
intervention.  But again, this is rarely what the user wants.

I would be sympathetic to an internal, bytes-based, "kids these stunts
are performed by trained professionals do NOT try this at home" API,
with a sane user-oriented str-based API for ordinary use for this
module.  I suppose it might be useful for such a multi-type API to be
polymorphic, but it would have to be a "if there are bytes anywhere,
everything must be bytes and return values will be bytes" and
similarly for str kind of polymorphism.  No mixing bytes and strings,

From stephen at  Wed Aug 20 08:43:32 2014
From: stephen at (Stephen J. Turnbull)
Date: Wed, 20 Aug 2014 15:43:32 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

Marko Rauhamaa writes:

 > Unix programmers, though, shouldn't be shielded from bytes.

Nobody's trying to do that.  But Python users should be shielded from
Unix programmers.

From ben+python at  Wed Aug 20 08:53:26 2014
From: ben+python at (Ben Finney)
Date: Wed, 20 Aug 2014 16:53:26 +1000
Subject: [Python-Dev] Bytes path support
References: <lsv2ba$hj7$>
Message-ID: <>

"Stephen J. Turnbull" <stephen at> writes:

> Marko Rauhamaa writes:
>  > Unix programmers, though, shouldn't be shielded from bytes.
> Nobody's trying to do that.  But Python users should be shielded from
> Unix programmers.

+1 QotW

 \        "Intellectual property is to the 21st century what the slave |
  `\                             trade was to the 16th." --David Mertz |
_o__)                                                                  |
Ben Finney

From p.f.moore at  Wed Aug 20 13:00:38 2014
From: p.f.moore at (Paul Moore)
Date: Wed, 20 Aug 2014 12:00:38 +0100
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

On 20 August 2014 07:53, Ben Finney <ben+python at> wrote:
> "Stephen J. Turnbull" <stephen at> writes:
>> Marko Rauhamaa writes:
>>  > Unix programmers, though, shouldn't be shielded from bytes.
>> Nobody's trying to do that.  But Python users should be shielded from
>> Unix programmers.
> +1 QotW

That quote is actually almost a "hidden extra Zen of Python" IMO :-)
Both parts of it.


From ncoghlan at  Wed Aug 20 13:08:16 2014
From: ncoghlan at (Nick Coghlan)
Date: Wed, 20 Aug 2014 21:08:16 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <>

On 20 Aug 2014 04:18, "Marko Rauhamaa" <marko at> wrote:
> Tres Seaver <tseaver at>:
> > On 08/19/2014 01:43 PM, Ben Hoyt wrote:
> >> Fair enough. I don't quite understand, though -- why is the "official
> >> policy" to kill something that's "essential" on *nix?
> >
> > ISTM that the policy is based on a fantasy that "it looks like text to
> > me in my use cases, so therefore it must be text for everyone."
> What I like about Python is that it allows me to write native linux code
> without having to make portability compromises that plague, say, Java. I
> have select.epoll(). I have os.fork(). I have socket.TCP_CORK. The
> "textualization" of Python3 seems part of a conscious effort to make
> Python more Java-esque.

It's not just the JVM that says text and binary APIs should be separate -
it's every widely used operating system services layer except POSIX. The
POSIX way works well *if* everyone reliably encodes things as UTF-8 or
always uses encoding detection, but its failure mode is unfortunately
silent data corruption.
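
A small sketch of that failure mode, with a made-up string: one side
writes UTF-8, the other side assumes Latin-1, and nothing raises an
error along the way:

```python
original = "café"

# One program stores the text as UTF-8 ...
stored = original.encode("utf-8")

# ... and another reads the same bytes assuming Latin-1. Every byte is
# a valid Latin-1 character, so no exception is raised; the text is
# simply wrong.
misread = stored.decode("latin-1")
```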

That said, there's a lot of Python software that is POSIX specific, where
bytes paths would be the least of the barriers to porting to Windows or
Jython. I'm personally +1 on consistently allowing binary paths in lower
level APIs, but disallowing them in higher level explicitly cross platform
abstractions like pathlib.


> Marko

From antoine at  Wed Aug 20 15:01:40 2014
From: antoine at (Antoine Pitrou)
Date: Wed, 20 Aug 2014 09:01:40 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <lt267l$qtg$>

On 20/08/2014 07:08, Nick Coghlan wrote:
> It's not just the JVM that says text and binary APIs should be separate
> - it's every widely used operating system services layer except POSIX.
> The POSIX way works well *if* everyone reliably encodes things as UTF-8
> or always uses encoding detection, but its failure mode is unfortunately
> silent data corruption.
> That said, there's a lot of Python software that is POSIX specific,
> where bytes paths would be the least of the barriers to porting to
> Windows or Jython. I'm personally +1 on consistently allowing binary
> paths in lower level APIs, but disallowing them in higher level
> explicitly cross platform abstractions like pathlib.

I fully agree with Nick's position here.

To elaborate specifically about pathlib, it doesn't handle bytes paths 
but allows you to generate them if desired:
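
A minimal sketch of what that can look like, assuming the bytes()
conversion that pathlib's pure path classes provide; the path itself
is made up for illustration:

```python
import os
from pathlib import PurePosixPath

# Paths are built from str components only.
path = PurePosixPath("/tmp") / "example.txt"

# Converting to bytes is an explicit, one-way step.
as_bytes = bytes(path)

# The same result can be obtained from the str form via os.fsencode().
via_fsencode = os.fsencode(str(path))

# Going the other way, a bytes argument is not accepted.
try:
    PurePosixPath(b"/tmp")
    accepts_bytes = True
except TypeError:
    accepts_bytes = False
```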

Adding full bytes support to pathlib would have added a lot of 
complication and fragility in the implementation *and* in the API (is it 
allowed to combine str and bytes paths? should they have separate 
classes?), for arguably little benefit.

I think if you want low-level features (such as unconverted bytes paths 
under POSIX), it is reasonable to point you to low-level APIs.



From brett at  Wed Aug 20 16:04:20 2014
From: brett at (Brett Cannon)
Date: Wed, 20 Aug 2014 14:04:20 +0000
Subject: [Python-Dev] Bytes path support
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <>

On Wed Aug 20 2014 at 9:02:25 AM Antoine Pitrou <antoine at> wrote:

> On 20/08/2014 07:08, Nick Coghlan wrote:
> >
> > It's not just the JVM that says text and binary APIs should be separate
> > - it's every widely used operating system services layer except POSIX.
> > The POSIX way works well *if* everyone reliably encodes things as UTF-8
> > or always uses encoding detection, but its failure mode is unfortunately
> > silent data corruption.
> >
> > That said, there's a lot of Python software that is POSIX specific,
> > where bytes paths would be the least of the barriers to porting to
> > Windows or Jython. I'm personally +1 on consistently allowing binary
> > paths in lower level APIs, but disallowing them in higher level
> > explicitly cross platform abstractions like pathlib.
> I fully agree with Nick's position here.
> To elaborate specifically about pathlib, it doesn't handle bytes paths
> but allows you to generate them if desired:
> Adding full bytes support to pathlib would have added a lot of
> complication and fragility in the implementation *and* in the API (is it
> allowed to combine str and bytes paths? should they have separate
> classes?), for arguably little benefit.
> I think if you want low-level features (such as unconverted bytes paths
> under POSIX), it is reasonable to point you to low-level APIs.

+1 from me as well. Allowing the low-level stuff to work on bytes but keeping
high-level actually high-level keeps with our consenting adults policy as
well as making things possible, but not at the detriment of the common

From tjreedy at  Wed Aug 20 20:41:26 2014
From: tjreedy at (Terry Reedy)
Date: Wed, 20 Aug 2014 14:41:26 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <lt267l$qtg$>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <lt2q5o$bc9$>

On 8/20/2014 9:01 AM, Antoine Pitrou wrote:
> On 20/08/2014 07:08, Nick Coghlan wrote:
>> It's not just the JVM that says text and binary APIs should be separate
>> - it's every widely used operating system services layer except POSIX.
>> The POSIX way works well *if* everyone reliably encodes things as UTF-8
>> or always uses encoding detection, but its failure mode is unfortunately
>> silent data corruption.
>> That said, there's a lot of Python software that is POSIX specific,
>> where bytes paths would be the least of the barriers to porting to
>> Windows or Jython. I'm personally +1 on consistently allowing binary
>> paths in lower level APIs, but disallowing them in higher level
>> explicitly cross platform abstractions like pathlib.
> I fully agree with Nick's position here.
> To elaborate specifically about pathlib, it doesn't handle bytes paths
> but allows you to generate them if desired:
> Adding full bytes support to pathlib would have added a lot of
> complication and fragility in the implementation *and* in the API (is it
> allowed to combine str and bytes paths? should they have separate
> classes?), for arguably little benefit.

I am glad you did not recreate the madness of pre 3.0 Python in that regard.

> I think if you want low-level features (such as unconverted bytes paths
> under POSIX), it is reasonable to point you to low-level APIs.

Do our docs somewhere explain the idea that file names are conceptually
*names*, not arbitrary bytes; explain the concept of low-level versus
high-level APIs; and point to the two types of APIs in Python?

Terry Jan Reedy

From greg.ewing at  Thu Aug 21 00:18:11 2014
From: greg.ewing at (Greg Ewing)
Date: Thu, 21 Aug 2014 10:18:11 +1200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <lt267l$qtg$>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <>

Antoine Pitrou wrote:
> I think if you want low-level features (such as unconverted bytes paths 
> under POSIX), it is reasonable to point you to low-level APIs.

The problem with scandir() in particular is that there is
currently *no* low-level API exposed that gives the same
functionality.

If scandir() is not to support bytes paths, I'd suggest
exposing the opendir() and readdir() system calls with
bytes path support.


From ncoghlan at  Thu Aug 21 00:31:52 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 21 Aug 2014 08:31:52 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
 <lt267l$qtg$> <>
Message-ID: <>

On 21 Aug 2014 08:19, "Greg Ewing" <greg.ewing at> wrote:
> Antoine Pitrou wrote:
>> I think if you want low-level features (such as unconverted bytes paths
>> under POSIX), it is reasonable to point you to low-level APIs.
> The problem with scandir() in particular is that there is
> currently *no* low-level API exposed that gives the same
> functionality.
> If scandir() is not to support bytes paths, I'd suggest
> exposing the opendir() and readdir() system calls with
> bytes path support.

scandir is low level (the entire os module is low level). In fact, aside
from pathlib, I'd consider pretty much every API we have that deals with
paths to be low level - that's a large part of the reason we needed pathlib!


> --
> Greg
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at
> Unsubscribe:

From chris.barker at  Thu Aug 21 01:04:34 2014
From: chris.barker at (Chris Barker)
Date: Wed, 20 Aug 2014 16:04:34 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <>

> but disallowing them in higher level
> explicitly cross platform abstractions like pathlib.
I think the trick here is that posix-using folks claim that filenames are
just bytes, and indeed they can be passed around with a char*, so they seem
to be.

But you can't possibly do anything other than pass them around if you
REALLY think they are just bytes.

So really, people treat them as
"bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
maybe a couple others)-is-ascii-compatible"

If you assume that, then you could write a pathlib that would work. And in
practice, I expect a lot of code designed only for POSIX works that way.
But of course, this gets ugly if you go to a platform where filenames are
not "bytes-in-some-arbitrary-encoding-where-at-least
the-slash-character-(and maybe a couple others)-is-ascii-compatible", like

I'm not sure if it's worth having a pathlib, etc. that uses this assumption
-- but it could help us all write code that actually works with this screwy
lack of specification.

 Antoine Pitrou wrote:

> To elaborate specifically about pathlib, it doesn't handle bytes paths
> but allows you to generate them if desired:

but that uses

os.fsencode:  Encode filename to the filesystem encoding

As I understand it, the whole problem with some posix systems is that there
is NO filesystem encoding -- i.e. you can't know for sure what encoding a
filename is in. So you need to be able to pass the bytes through as they
are.

(At least as I read Armin Ronacher's blog)
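
For reference, here is a minimal sketch of the pathlib behaviour being discussed. Converting a path object with bytes() goes through os.fsencode(), so the result depends on the filesystem encoding Python has settled on. The path name is made up for the example.

```python
import os
import pathlib

# Hypothetical path, used only for illustration.
p = pathlib.PurePosixPath("/tmp/example.txt")

# bytes() on a pathlib path encodes str(p) with os.fsencode().
as_bytes = bytes(p)
print(as_bytes)
print(as_bytes == os.fsencode(str(p)))
```

With a pure ASCII name the outcome is the same under any ASCII-compatible filesystem encoding; the disagreement in this thread is about what happens for names that are not.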



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From ncoghlan at  Thu Aug 21 01:26:51 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 21 Aug 2014 09:26:51 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <>

On 21 Aug 2014 09:06, "Chris Barker" <chris.barker at> wrote:

> As I understand it, the whole problem with some posix systems is that
> there is NO filesystem encoding -- i.e. you can't know for sure what
> encoding a filename is in. So you need to be able to pass the bytes through
> as they are.
> (At least as I read Armin Ronacher's blog)

Armin lets his astonishment at the idea we'd expect Linux vendors to fix
their broken OS get the better of him at times - he thinks the
responsibility lies entirely with us to work around its quirks and
limitations :)

The "surrogateescape" error handler is our main answer to the unreliability
of the POSIX encoding model - fsdecode will squirrel away arbitrary bytes as
lone surrogate code points (U+DC80 to U+DCFF, per PEP 383), and then
fsencode will restore them again later. That
works for the simple round tripping case, but we currently lack good
default tools for "cleaning" strings that may contain surrogates (or even
scanning a string to see if surrogates are present).
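
A minimal sketch of the round trip described above, using str/bytes methods with an explicit UTF-8 codec so the result does not depend on the local locale (os.fsdecode and os.fsencode apply the same error handler with the filesystem encoding on POSIX). The helper has_smuggled_bytes is a hypothetical name, not an existing stdlib function; it only illustrates what "scanning for surrogates" could look like.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Decoding with surrogateescape never fails: the undecodable byte 0xE9
# becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9.txt"

# Encoding with the same handler restores the original bytes exactly.
assert name.encode("utf-8", "surrogateescape") == raw


def has_smuggled_bytes(s):
    """Return True if s contains surrogates in the surrogateescape range."""
    return any("\udc80" <= ch <= "\udcff" for ch in s)


print(has_smuggled_bytes(name))        # True
print(has_smuggled_bytes("cafe.txt"))  # False
```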

One idea I had along those lines is a surrogatereplace error handler that
emitted an ASCII question mark for each smuggled byte, rather than
propagating the encoding problem.


> -Chris
> --
> Christopher Barker, Ph.D.
> Oceanographer
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
> Chris.Barker at

From ethan at  Thu Aug 21 01:33:27 2014
From: ethan at (Ethan Furman)
Date: Wed, 20 Aug 2014 16:33:27 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
 <lt267l$qtg$> <>
Message-ID: <>

On 08/20/2014 03:31 PM, Nick Coghlan wrote:
> On 21 Aug 2014 08:19, "Greg Ewing" <greg.ewing at> wrote:
>> Antoine Pitrou wrote:
>>> I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs.
>> The problem with scandir() in particular is that there is
>> currently *no* low-level API exposed that gives the same
>> functionality.
>> If scandir() is not to support bytes paths, I'd suggest
>> exposing the opendir() and readdir() system calls with
>> bytes path support.
> scandir is low level (the entire os module is low level). In fact, aside from pathlib, I'd consider pretty much every
> API we have that deals with paths to be low level - that's a large part of the reason we needed pathlib!

If scandir is low-level, and the low-level APIs are the ones that should support bytes paths, then scandir should
support bytes paths.

Is that what you meant to say?


From ncoghlan at  Thu Aug 21 02:15:15 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 21 Aug 2014 10:15:15 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
 <lt267l$qtg$> <>
Message-ID: <>

On 21 August 2014 09:33, Ethan Furman <ethan at> wrote:
> On 08/20/2014 03:31 PM, Nick Coghlan wrote:
>> On 21 Aug 2014 08:19, "Greg Ewing" <greg.ewing at> wrote:
>>> Antoine Pitrou wrote:
>>>> I think if you want low-level features (such as unconverted bytes paths
>>>> under POSIX), it is reasonable to point you to low-level APIs.
>>> The problem with scandir() in particular is that there is
>>> currently *no* low-level API exposed that gives the same
>>> functionality.
>>> If scandir() is not to support bytes paths, I'd suggest
>>> exposing the opendir() and readdir() system calls with
>>> bytes path support.
>> scandir is low level (the entire os module is low level). In fact, aside
>> from pathlib, I'd consider pretty much every
>> API we have that deals with paths to be low level - that's a large part of
>> the reason we needed pathlib!
> If scandir is low-level, and the low-level API's are the ones that should
> support bytes paths, then scandir should support bytes paths.
> Is that what you meant to say?

Yes. The discussions around PEP 471 *deferred* discussions of bytes
and file descriptor support to their own RFEs (not needing a PEP),
they didn't decide definitively not to support them. So Serhiy's
thread is entirely pertinent to that question.

Note that adding bytes support still *should not* hold up the initial
PEP 471 implementation - it should be done as a follow on RFE.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From ethan at  Thu Aug 21 02:25:24 2014
From: ethan at (Ethan Furman)
Date: Wed, 20 Aug 2014 17:25:24 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
 <lt267l$qtg$> <>
Message-ID: <>

On 08/20/2014 05:15 PM, Nick Coghlan wrote:
> On 21 August 2014 09:33, Ethan Furman <ethan at> wrote:
>> On 08/20/2014 03:31 PM, Nick Coghlan wrote:
>>> scandir is low level (the entire os module is low level). In fact, aside
>>> from pathlib, I'd consider pretty much every
>>> API we have that deals with paths to be low level - that's a large part of
>>> the reason we needed pathlib!
>> If scandir is low-level, and the low-level API's are the ones that should
>> support bytes paths, then scandir should support bytes paths.
>> Is that what you meant to say?
> Yes. The discussions around PEP 471 *deferred* discussions of bytes
> and file descriptor support to their own RFEs (not needing a PEP),
> they didn't decide definitively not to support them. So Serhiy's
> thread is entirely pertinent to that question.

Thanks for clearing that up.  I hate feeling confused.  ;)

> Note that adding bytes support still *should not* hold up the initial
> PEP 471 implementation - it should be done as a follow on RFE.



From joseph.martinot-lagarde at  Thu Aug 21 02:27:10 2014
From: joseph.martinot-lagarde at (Joseph Martinot-Lagarde)
Date: Thu, 21 Aug 2014 02:27:10 +0200
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

On 18/08/2014 03:02, Guido van Rossum wrote:
> On Sun, Aug 17, 2014 at 6:29 AM, Barry Warsaw <barry at> wrote:
>     On Aug 16, 2014, at 07:43 PM, Guido van Rossum wrote:
>      >(Don't understand this to mean that we should never deprecate things.
>      >Deprecations will happen, they are necessary for the evolution of any
>      >programming language. But they won't ever hurt in the way that Python 3
>      >hurt.)
>     It would be useful to explore what causes the most pain in the 2->3
>     transition?  IMHO, it's not the deprecations or changes such as print ->
>     print().  It's the bytes/str split - a fundamental change to core and
>     common data types.  The question then is whether you foresee any similar
>     looming pervasive change? [*]
> I'm unsure about what's the single biggest pain moving to Python 3. In
> the past I would have said that it's for sure the bytes/str split (which
> is both the biggest pain and the biggest payoff).

The pain was even bigger because in addition to the change in underlying
types, the names of the types were not compatible between the python
versions. I often try to write compatible code between python2 and 3,
and I can't use "str" because it does not have the same meaning in both
versions, I cannot use "unicode" because it disappeared in python3, and
I can't use "byte" because it doesn't exist in python2. Add __str__ and
__unicode__ to the mix and then you get the real pain.

Actually "str" is still useful in the cases where a library is
byte-only in python2 and unicode-only in python3 (hello, 


From benhoyt at  Thu Aug 21 03:22:12 2014
From: benhoyt at (Ben Hoyt)
Date: Wed, 20 Aug 2014 21:22:12 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
 <lt267l$qtg$> <>
Message-ID: <>

>> If scandir is low-level, and the low-level API's are the ones that should
>> support bytes paths, then scandir should support bytes paths.
>> Is that what you meant to say?
> Yes. The discussions around PEP 471 *deferred* discussions of bytes
> and file descriptor support to their own RFEs (not needing a PEP),
> they didn't decide definitively not to support them. So Serhiy's
> thread is entirely pertinent to that question.
> Note that adding bytes support still *should not* hold up the initial
> PEP 471 implementation - it should be done as a follow on RFE.

I agree with this (that scandir is low level and should support
bytes). As it happens, I'm implementing bytes support as well -- what
with the path_t support in posixmodule.c and the listdir
implementation to go on, it's not really any harder. So I think we'll
have it right off the bat.
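
As a sketch of what this looks like from the caller's side, assuming the os.scandir() API as it later shipped in Python 3.5 (the temporary directory and the file name are made up for the example): a str path yields str names, and a bytes path yields bytes names, mirroring os.listdir().

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # str path in, str names out
    str_names = [entry.name for entry in os.scandir(tmp)]

    # bytes path in, bytes names out
    bytes_names = [entry.name for entry in os.scandir(os.fsencode(tmp))]

print(str_names)
print(bytes_names)
```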

BTW, the Windows implementation of PEP 471 is basically done, and the
POSIX implementation is written but not working yet. And then there's
tests and docs.


From stephen at  Thu Aug 21 04:16:27 2014
From: stephen at (Stephen J. Turnbull)
Date: Thu, 21 Aug 2014 11:16:27 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <>

Nick Coghlan writes:

 > One idea I had along those lines is a surrogatereplace error handler
 > that emitted an ASCII question mark for
 > each smuggled byte, rather than propagating the encoding problem.

Please, don't.

"Smuggled bytes" are not independent events.  They tend to be
correlated *within* file names, and this handler would generate names
whose human semantics get lost (and there *are* human semantics,
otherwise the name would be str(some_counter)).  They tend to be
correlated across file names, and this handler will generate multiple
files with the same munged name (and again, the differentiating human
semantics get lost).

If you don't know the semantics of the intended file names, you can't
generate good replacement names.  This has to be an application-level
function, and often requires user intervention to get good names.

If you want to provide helper functions that applications can use to
clean names explicitly, that might be OK.

From cs at  Thu Aug 21 06:52:19 2014
From: cs at (Cameron Simpson)
Date: Thu, 21 Aug 2014 14:52:19 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
Message-ID: <>

On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.barker at> wrote:
>>  but disallowing them in higher level
>>> > explicitly cross platform abstractions like pathlib.
>I think the trick here is that posix-using folks claim that filenames are
>just bytes, and indeed they can be passed around with a char*, so they seem
>to be.
>but you can't possibly do anything other than pass them around if you
>REALLY think they are just bytes.
>So really, people treat them as
>"bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
>maybe a couple others)-is-ascii-compatible"

As someone who fought long and hard in the surrogate-escape listdir() wars, and 
was won over once the scheme was thoroughly explained to me, I take issue with 
these assertions: they are bogus or misleading.

Firstly, POSIX filenames _are_ just byte strings. The only forbidden character 
is the NUL byte, which terminates a C string, and the only special character is 
the slash, which separates pathname components.

Second, a bare low level program cannot do _much_ more than pass them around.  
It certainly can do things like compute their basename, or other path related 

The "bytes in some arbitrary encoding where at least the slash character (and
maybe a couple others) is ascii compatible" notion is completely bogus. There's 
only one special byte, the slash (code 47). There's no OS-level need that it or 
anything else be ASCII compatible. I think characterisations such as the one 
quoted are actively misleading.
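
To illustrate the point that path operations need nothing beyond the slash byte, here is a small sketch using posixpath directly on bytes, so it behaves the same on any platform. The file name is made up and is deliberately not valid UTF-8.

```python
import posixpath

# A file name that is not valid UTF-8 (made up for the example).
path = b"/srv/data/\xff\xfe-report"

# Only the slash byte (code 47) matters; the other bytes are opaque.
print(posixpath.basename(path))   # b'\xff\xfe-report'
print(posixpath.dirname(path))    # b'/srv/data'
print(path.split(b"/"))           # [b'', b'srv', b'data', b'\xff\xfe-report']
```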

The way you get UTF-8 (or some other encoding, fortunately getting less and 
less common) is by convention: you decide in your environment to work in some 
encoding (say utf-8) via the locale variables, and all your user-facing text 
gets used in UTF-8 encoding form when turned into bytes for the filename calls 
because your text<->bytes methods say to do so.

I think we'd all agree it is nice to have a system where filenames are all 
Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore 
the reality for such systems. I certainly think the Windows-side Babel of code
pages and multiple code systems is far far worse. (Disclaimer: not a Windows 
programmer, just based on hearing them complain.)

I'm +1000 on systems where the filesystem enforces Unicode (eg Plan 9 or Mac 
OSX, which forces a specific UTF-8 encoding in the bytes POSIX APIs - the 
underlying filesystems reject invalid byte sequences).

> Antoine Pitrou wrote:
>> To elaborate specifically about pathlib, it doesn't handle bytes paths
>> but allows you to generate them if desired:
>but that uses
>os.fsencode:  Encode filename to the filesystem encoding
>As I understand it, the whole problem with some posix systems is that there
>is NO filesystem encoding -- i.e. you can't know for sure what encoding a
>filename is in. So you need to be able to pass the bytes through as they
>are.

Yes and no. I made that argument too.

There's no _external_ "filesystem encoding" in the sense of something recorded 
in the filesystem that anyone can inspect. But there is the expressed locale 
settings, available at runtime to any program that cares to pay attention. It 
is a workable situation.
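
A short sketch of how a Python program can inspect those runtime settings. Both calls are standard library functions; the values they report depend entirely on the environment the interpreter was started in, so no particular output is implied.

```python
import locale
import sys

# Encoding Python uses to convert str file names to bytes and back.
fs_encoding = sys.getfilesystemencoding()

# Encoding implied by the user's locale settings (LANG, LC_ALL, LC_CTYPE).
locale_encoding = locale.getpreferredencoding(False)

print("filesystem encoding:", fs_encoding)
print("locale encoding:    ", locale_encoding)
```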

Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly 
internally consistent. It just doesn't match what he wants. (Indeed, what I 
want, and I'm a long time UNIX fanboy.)

Cameron Simpson <cs at>

God is real, unless declared integer.   - Johan Montald, johan at

From tjreedy at  Thu Aug 21 09:32:15 2014
From: tjreedy at (Terry Reedy)
Date: Thu, 21 Aug 2014 03:32:15 -0400
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <lt47b1$a0j$>

On 8/20/2014 8:27 PM, Joseph Martinot-Lagarde wrote:

> The pain was even bigger because in addition to the change in underlying
> types, the names of the types were not compatible between the python
> versions. I often try to write compatible code between python2 and 3,
> and I can't use "str" because it has not the same meaning in both
> versions, I can not use "unicode" because it disappeared in python3,

Any bridge library should have the equivalent of
if 'py3': unicode = str

> I can't use "byte" because it doesn't exist in python2.

2.7 (and 2.6?) already has
if 'py2': bytes = str
and I presume bridge libraries targeted before that was added include it 
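
A minimal sketch of the kind of shim being described. The names PY2, text_type, binary_type and ensure_text are illustrative choices for this example, not something defined in this thread.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (only defined on Python 2)
    binary_type = str
else:
    text_type = str
    binary_type = bytes


def ensure_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is binary."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value


print(ensure_text(b"abc") == u"abc")
```

Code that sticks to text_type and binary_type never has to mention "str" or "unicode" directly, which sidesteps the naming problem from the quoted message.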

Terry Jan Reedy

From phd at  Thu Aug 21 09:45:03 2014
From: phd at (Oleg Broytman)
Date: Thu, 21 Aug 2014 09:45:03 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
Message-ID: <>


On Thu, Aug 21, 2014 at 02:52:19PM +1000, Cameron Simpson <cs at> wrote:
> Oh, and I reject Nick's characterisation of POSIX as "broken". It's
> perfectly internally consistent. It just doesn't match what he
> wants. (Indeed, what I want, and I'm a long time UNIX fanboy.)
> Cheers,
> Cameron Simpson <cs at>

   +1 from another Unix fanboy. Like an old wine, Unix becomes better
with years! ;-)

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From ncoghlan at  Thu Aug 21 14:26:56 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 21 Aug 2014 22:26:56 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <lt0341$7re$> <>
Message-ID: <>

On 21 August 2014 12:16, Stephen J. Turnbull <stephen at> wrote:
> Nick Coghlan writes:
>  > One idea I had along those lines is a surrogatereplace error handler
>  > that emitted an ASCII question mark for
>  > each smuggled byte, rather than propagating the encoding problem.
> Please, don't.
> "Smuggled bytes" are not independent events.  They tend to be
> correlated *within* file names, and this handler would generate names
> whose human semantics get lost (and there *are* human semantics,
> otherwise the name would be str(some_counter)).  They tend to be
> correlated across file names, and this handler will generate multiple
> files with the same munged name (and again, the differentiating human
> semantics get lost).
> If you don't know the semantics of the intended file names, you can't
> generate good replacement names.  This has to be an application-level
> function, and often requires user intervention to get good names.
> If you want to provide helper functions that applications can use to
> clean names explicitly, that might be OK.

Yeah, I was thinking in the context of reproducing sys.stdout's
behaviour in Python 2, but that reproduces the bytes faithfully, so
'surrogateescape' already offers exactly the behaviour we want
(sys.stdout will have surrogateescape enabled by default in 3.5).

I'll keep pondering the question of possible helper functions in the
"string" module.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From martin at  Thu Aug 21 14:40:48 2014
From: martin at ("Martin v. Löwis")
Date: Thu, 21 Aug 2014 14:40:48 +0200
Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a
 Py3k style compatibility break again?
In-Reply-To: <>
References: <>
Message-ID: <>

Am 18.08.14 08:45, schrieb Nick Coghlan:
> It's certainly the one that has caused the most churn in CPython and
> the standard library - the ripples still haven't entirely settled on
> that front :)

For people porting their libraries and applications, the challenge is
often even bigger: they need to learn a new programming concept. For
many developers, it is a novel idea that character strings are not
just bytes. A similar split is in the number types (integers vs.
floats), but most developers have learned the distinction when they
learned programming. That a text file is not a file that contains text
(but bytes interpreted as text) is surprising. In addition, you also
have to learn a lot of facts (what is the ASCII encoding, what is
the iso-8859-1 encoding, what is UTF-8 (and how does it differ from

When you have all that understood, you *then* run into the design
choices to be made for your software.

> I think Guido's right that there's also a "death of a thousand cuts"
> aspect for large existing code bases, though, especially those that
> are lacking comprehensive test suites.

I think the second big challenge is "my dependencies are not ported
to Python 3". There is little you can do about it, short of porting
the dependencies yourself (fortunately, Python and most of its libraries
are free software).


From martin at  Thu Aug 21 14:54:36 2014
From: martin at ("Martin v. Löwis")
Date: Thu, 21 Aug 2014 14:54:36 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
Message-ID: <>

Am 19.08.14 19:43, schrieb Ben Hoyt:
>>>> The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
>>> Does that mean that new APIs should explicitly not support bytes? I'm
>>> thinking of os.scandir() (PEP 471), which I'm implementing at the
>>> moment. I was originally going to make it support bytes so it was
>>> compatible with listdir, but maybe that's a bad idea. Bytes paths are
>>> essentially broken on Windows.
>> Bytes paths are "essential" on Unix, though, so I don't think we should
>> create new low-level APIs that don't support bytes.
> Fair enough. I don't quite understand, though -- why is the "official
> policy" to kill something that's "essential" on *nix?

I think the people defending the "Unix file names are just bytes" side
often miss an important detail: displaying file names to the user, and
allowing the user to enter file names.

A script that just needs to traverse a directory tree and look at files
by certain criteria can easily do so without worrying about a text
interpretation of the file names.

When it comes to user interaction, it becomes apparent that, even on
Unix, file names are not just bytes. If you do "ls -l" in your shell,
the "system" (not just the kernel - but ultimately the terminal program,
which might be the console driver, or an X11 application) will interpret
the file name as having an encoding, and render them with a font.

So for Python, the question is: which of the use cases (processing
all files, vs. showing them to the user) should be better supported?
Python 3 took the latter as an answer, under the assumption that this
is the more common case.


From ncoghlan at  Thu Aug 21 14:55:33 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 21 Aug 2014 22:55:33 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
Message-ID: <>

On 21 August 2014 14:52, Cameron Simpson <cs at> wrote:
> Oh, and I reject Nick's characterisation of POSIX as "broken". It's
> perfectly internally consistent. It just doesn't match what he wants.
> (Indeed, what I want, and I'm a long time UNIX fanboy.)

The part that is broken is the idea that locale encodings are a viable
solution to conveying the appropriate encoding to use to talk to the
operating system. We've tried trusting them with Python 3, and they're
reliably wrong in certain situations. systemd is apparently better
than upstart at setting them correctly (e.g. for cron jobs), but even
it can't defend against an erroneous (or deliberate!) "LANG=C", or ssh
environment forwarding pushing a client's locale to the server. It's
worth looking through some of Armin Ronacher's complaints about Python
3 being broken on Linux, and seeing how many of them boil down to
"trusting the locale is wrong, Python 3 should just assume UTF-8 on
every POSIX system, the same way it does on Mac OS X". (I suspect
ShiftJIS, ISO-2022, et al users might object to that approach, but
it's at least a more viable choice now than it was back in 2008)

I still think we made the right call at least *trying* the idea of
trusting the locale encoding (since that's the officially supported
way of getting this information from the OS), and in many, many
situations it works fine. But I suspect we may eventually need to
resolve the technical issues currently preventing us from deciding to
ignore the environmental locale during interpreter startup and try
something different (such as always assuming UTF-8, or trying to force
C.UTF-8 if we detect the C locale, or looking for the systemd config
files and using those to set the OS encoding, rather than the
environmental locale).


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From antoine at  Thu Aug 21 15:20:27 2014
From: antoine at (Antoine Pitrou)
Date: Thu, 21 Aug 2014 09:20:27 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
Message-ID: <lt4rmr$a00$>

On 21/08/2014 00:52, Cameron Simpson wrote:
> The "bytes in some arbitrary encoding where at least the slash character
> (and maybe a couple others) is ascii compatible" notion is completely bogus.
> There's only one special byte, the slash (code 47). There's no OS-level
> need that it or anything else be ASCII compatible.

Of course there is. Try to split a UTF-16-encoded file path on the byte
47 and you'll get a lot of garbage. So, yes, POSIX implicitly mandates 
an ASCII-compatible encoding for file paths.
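
A small sketch of that failure mode. The path is made up; it contains U+012F, whose UTF-16-LE encoding happens to include the byte 0x2F (47).

```python
# "/tmp/" followed by U+012F, whose UTF-16-LE encoding is the bytes 2F 01.
path = "/tmp/\u012f"

utf8 = path.encode("utf-8")
utf16 = path.encode("utf-16-le")

# UTF-8 never uses byte 47 inside a multi-byte sequence, so this is safe.
print(utf8.split(b"/"))    # [b'', b'tmp', b'\xc4\xaf']

# In UTF-16 the byte 47 also appears as half of an unrelated character,
# and every component comes back with stray NUL bytes.
print(utf16.split(b"/"))   # [b'', b'\x00t\x00m\x00p\x00', b'\x00', b'\x01']
```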



From marko at  Thu Aug 21 15:58:03 2014
From: marko at (Marko Rauhamaa)
Date: Thu, 21 Aug 2014 16:58:03 +0300
Subject: [Python-Dev] Bytes path support
In-Reply-To: <> ("Martin v. Löwis"'s message of "Thu, 21 Aug 2014 14:54:36 +0200")
References: <lsv2ba$hj7$>
Message-ID: <>

"Martin v. Löwis" <martin at>:

> I think the people defending the "Unix file names are just bytes" side
> often miss an important detail: displaying file names to the user, and
> allowing the user to enter file names.

The user interface is a real issue and needs to be addressed. It is
separate from the OS interface, though.

> A script that just needs to traverse a directory tree and look at
> files by certain criteria can easily do so without worrying about a
> text interpretation of the file names.

A single system often has file names that have been encoded with
different schemes. Only today, I have had to deal with the JIS character
table (<URL:,MSDN.10%29.gif>) -- you
will notice that it doesn't have a backslash character. A coworker uses

I use UTF-8. UTF-8, of course, will refuse to deal with some byte

My point is that the poor programmer cannot ignore the possibility of
"funny" character sets. If Python tried to protect the programmer from
that possibility, the result might be even more intractable: how to act
on a file with a non-UTF-8 filename if you are unable to express it as
a text string?


From ncoghlan at  Thu Aug 21 16:12:50 2014
From: ncoghlan at (Nick Coghlan)
Date: Fri, 22 Aug 2014 00:12:50 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <> <>
Message-ID: <>

On 21 August 2014 23:58, Marko Rauhamaa <marko at> wrote:
> My point is that the poor programmer cannot ignore the possibility of
> "funny" character sets. If Python tried to protect the programmer from
> that possibility, the result might be even more intractable: how to act
> on a file with an non-UTF-8 filename if you are unable to express it as
> a text string?

That's what the "surrogateescape" codec is for - we use it by default
on most OS interfaces, and it's implicit in the use of "os.fsencode"
and "os.fsdecode". Starting with Python 3.5, it's also enabled on
sys.stdout by default, so that "print(os.listdir(dirname))" will pass
the original raw bytes through to the terminal the same way Python 2

The docs could use additional details as to which interfaces do and
don't have surrogateescape enabled by default, but for the time being,
the description of the codec error handler just links out to the
original definition in PEP 383.
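
As a rough illustration of why it matters which interfaces have the handler enabled, the sketch below takes a name carrying a smuggled byte and encodes it three ways. It uses an explicit UTF-8 codec rather than the real sys.stdout so that the behaviour does not depend on the local environment.

```python
# A name as os.fsdecode might return it under a UTF-8 locale.
name = b"caf\xe9".decode("utf-8", "surrogateescape")

# A strict encoder rejects the lone surrogate...
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...while these handlers give a displayable or byte-exact result.
escaped = name.encode("utf-8", "backslashreplace")
original = name.encode("utf-8", "surrogateescape")

print(strict_ok, escaped, original)
```

The "backslashreplace" form is lossy but safe to show to a user; the "surrogateescape" form reproduces the original bytes.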

It may also be useful to have some tools for detecting and cleaning
strings containing surrogate escaped data, but there hasn't been a
concrete proposal along those lines as yet. Personally, I'm currently
waiting to see if the Fedora or OpenStack folks indicate a need for
such tools before proposing any additions.


> Marko

Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From ncoghlan at  Thu Aug 21 16:13:37 2014
From: ncoghlan at (Nick Coghlan)
Date: Fri, 22 Aug 2014 00:13:37 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <> <>
Message-ID: <>

On 22 August 2014 00:12, Nick Coghlan <ncoghlan at> wrote:
> On 21 August 2014 23:58, Marko Rauhamaa <marko at> wrote:
>> My point is that the poor programmer cannot ignore the possibility of
>> "funny" character sets. If Python tried to protect the programmer from
>> that possibility, the result might be even more intractable: how to act
>> on a file with an non-UTF-8 filename if you are unable to express it as
>> a text string?
> That's what the "surrogateescape" codec is for

Oops, that should say "codec error handler" (I got it right later in the post).


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From arigo at  Thu Aug 21 16:41:23 2014
From: arigo at (Armin Rigo)
Date: Thu, 21 Aug 2014 16:41:23 +0200
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <>


On 18 August 2014 22:30, Oleg Broytman <phd at> wrote:
>    Aha, I see now -- the signing certificate is CAcert, which I've
> installed manually.

I don't suppose anyone is particularly annoyed by this fact?  I know
for sure two classes of people that will never click "Ignore".  The
first one is people that, for lack of a less negative term, I'll call
"security freaks".  The second is "serious business people" to which
the shiny new look of appeals; they are likely to heed the
warning "Legitimate banks, stores, etc. will never ask you to do this"
and would regard an official hint to ignore it as highly
unprofessional.

(The bug tracker of PyPy used to have the same problem.  We fixed the
situation recently, but previously, we used to argue that we didn't
have a lot of connections with either class of people...)

A bientôt,


From ncoghlan at  Thu Aug 21 17:44:37 2014
From: ncoghlan at (Nick Coghlan)
Date: Fri, 22 Aug 2014 01:44:37 +1000
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <>

On 22 August 2014 00:41, Armin Rigo <arigo at> wrote:
> Hi,
> On 18 August 2014 22:30, Oleg Broytman <phd at> wrote:
>>    Aha, I see now -- the signing certificate is CAcert, which I've
>> installed manually.
> I don't suppose anyone is particularly annoyed by this fact?  I know
> for sure two classes of people that will never click "Ignore".  The
> first one is people that, for lack of a less negative term, I'll call
> "security freaks".  The second is "serious business people" to which
> the shiny new look of appeals; they are likely to heed the
> warning "Legitimate banks, stores, etc. will never ask you to do this"
> and would regard an official hint to ignore it as highly
> unprofessional.

I've now raised this issue with the infrastructure team. The current
hosting arrangements for were put in place when the
PSF didn't have any on-call system administrators of its own, but now
that we do, it may be time to migrate that service to a location where
we can switch to a more appropriate SSL certificate.

Anyone interested in following the discussion further may wish to join
infrastructure at


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From martin at  Thu Aug 21 18:29:55 2014
From: martin at (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 21 Aug 2014 18:29:55 +0200
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <>

Am 21.08.14 17:44, schrieb Nick Coghlan:
> I've now raised this issue with the infrastructure team. The current
> hosting arrangements for were put in place when the
> PSF didn't have any on-call system administrators of its own, but now
> that we do, it may be time to migrate that service to a location where
> we can switch to a more appropriate SSL certificate.

Just to relay Noah's response: it's actually not the hosting that
prevents installation of a proper certificate, it's the limitation
that the certificate we could deploy would include "" as
a server name, which is considered risky regardless of where the
service is hosted. There are solutions to that as well, of course.


From ryan at  Thu Aug 21 18:48:11 2014
From: ryan at (Ryan Hiebert)
Date: Thu, 21 Aug 2014 11:48:11 -0500
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <>

> On Aug 21, 2014, at 11:29 AM, Martin v. Löwis <martin at> wrote:
> Am 21.08.14 17:44, schrieb Nick Coghlan:
>> I've now raised this issue with the infrastructure team. The current
>> hosting arrangements for were put in place when the
>> PSF didn't have any on-call system administrators of its own, but now
>> that we do, it may be time to migrate that service to a location where
>> we can switch to a more appropriate SSL certificate.
> Just to relay Noah's response: it's actually not the hosting that
> prevents installation of a proper certificate, it's the limitation
> that the certificate we could deploy would include "" as
> a server name, which is considered risky regardless of where the
> service is hosted. There are solutions to that as well, of course.

That sounds like a limitation I've seen with StartSSL. Perhaps there's a certificate authority that would be willing to sponsor a certificate for Python without this annoying limitation?

From stephen at  Thu Aug 21 19:27:21 2014
From: stephen at (Stephen J. Turnbull)
Date: Fri, 22 Aug 2014 02:27:21 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lsv2ba$hj7$>
 <> <>
Message-ID: <>

Marko Rauhamaa writes:

 > My point is that the poor programmer cannot ignore the possibility of
 > "funny" character sets.

*Poor* programmers do it all the time.  That's why Python codecs raise
when they encounter bytes they can't handle.

 > If Python tried to protect the programmer from that possibility,

I don't understand your point.  The existing interfaces aren't going
anywhere, and they're enough to do anything you need to do.  Although
there are a few radicals (like me in a past life :-) who might like to
see them go away in favor of opt-in to binary encoding via
surrogateescape error handling, nobody in their right mind supports

The question here is not about going backward, it's about whether to
add new bytes APIs, and which ones.

From benjamin at  Thu Aug 21 20:45:06 2014
From: benjamin at (Benjamin Peterson)
Date: Thu, 21 Aug 2014 11:45:06 -0700
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <>

On Thu, Aug 21, 2014, at 09:48, Ryan Hiebert wrote:
> > On Aug 21, 2014, at 11:29 AM, Martin v. Löwis <martin at> wrote:
> > 
> > Am 21.08.14 17:44, schrieb Nick Coghlan:
> >> I've now raised this issue with the infrastructure team. The current
> >> hosting arrangements for were put in place when the
> >> PSF didn't have any on-call system administrators of its own, but now
> >> that we do, it may be time to migrate that service to a location where
> >> we can switch to a more appropriate SSL certificate.
> > 
> > Just to relay Noah's response: it's actually not the hosting that
> > prevents installation of a proper certificate, it's the limitation
> > that the certificate we could deploy would include "" as
> > a server name, which is considered risky regardless of where the
> > service is hosted. There are solutions to that as well, of course.
> That sounds like a limitation I've seen with StartSSL. Perhaps there's a
> certificate authority that would be willing to sponsor a certificate for
> Python without this annoying limitation?

Perhaps some board members could comment, but I hope the PSF could just
pay a few hundred a year for a proper certificate.

From tjreedy at  Thu Aug 21 21:59:20 2014
From: tjreedy at (Terry Reedy)
Date: Thu, 21 Aug 2014 15:59:20 -0400
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <lt5j3r$378$>

On 8/21/2014 10:41 AM, Armin Rigo wrote:
> Hi,
> On 18 August 2014 22:30, Oleg Broytman <phd at> wrote:
>>     Aha, I see now -- the signing certificate is CAcert, which I've
>> installed manually.
> I don't suppose anyone is particularly annoyed by this fact?

I noticed the issue, and started this thread, because someone posted an 
https::/ link. I ordinarily just go to 
and get the http connection.  I have https-anywhere installed, but it 
must notice the dodgy certificate and silently not switch. So I never 
knew before that there was an https connection available, and never 
thought to try it.

Given that we are shipping both login credentials and files over the 
connection, making https routine, with a proper certificate, might be a 
good idea.

Terry Jan Reedy

From cs at  Fri Aug 22 00:27:21 2014
From: cs at (Cameron Simpson)
Date: Fri, 22 Aug 2014 08:27:21 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <lt4rmr$a00$>
References: <lt4rmr$a00$>
Message-ID: <>

On 21Aug2014 09:20, Antoine Pitrou <antoine at> wrote:
>Le 21/08/2014 00:52, Cameron Simpson a écrit :
>>The "bytes in some arbitrary encoding where at least the slash character
>>(and maybe a couple others) is ascii compatible" notion is completely bogus.
>>There's only one special byte, the slash (code 47). There's no OS-level
>>need that it or anything else be ASCII compatible.
>Of course there is. Try to split an UTF-16-encoded file path on the 
>byte 47 and you'll get a lot of garbage. So, yes, POSIX implicitly 
>mandates an ASCII-compatible encoding for file paths.

[Rolls eyes.] Looking at the UTF-16 encoding, it looks like it also embeds NUL 
bytes for various codes below 32768. How are they handled? As remarked, codes 0 
(NUL) and 47 (ASCII slash code) _are_ special to UNIX filename bytes strings.

If you imagine you can embed bare UTF-16 freely even excluding code 47, I think 
one of us is missing something.

That's not "ASCII compatible". That's "not all byte codes can be freely used 
without thought", and any multibyte coding will have to consider such things 
when embedding itself in another coding scheme.
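
For concreteness, a small sketch of what both sides are pointing at: UTF-16 output contains NUL bytes for ASCII text, and can contain byte 47 where the text has no slash at all. The sample strings are chosen purely for illustration.

```python
# An ordinary path encoded as UTF-16 (little endian, no BOM).
encoded = "a/b".encode("utf-16-le")
has_nul = b"\x00" in encoded  # every ASCII character gains a NUL byte

# U+2F00 (KANGXI RADICAL ONE) is not a slash, but one of its two
# UTF-16 bytes is 0x2F, the byte POSIX treats as the path separator.
kangxi_utf16 = "\u2f00".encode("utf-16-le")
false_split = kangxi_utf16.split(b"/")

# UTF-8 never uses bytes below 0x80 inside a multi-byte sequence, so
# the same character is harmless there.
kangxi_utf8 = "\u2f00".encode("utf-8")
```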

Cameron Simpson <cs at>

Microsoft:  Committed to putting the "backward" into "backward compatibility."

From chris.barker at  Fri Aug 22 00:30:20 2014
From: chris.barker at (Chris Barker)
Date: Thu, 21 Aug 2014 15:30:20 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
Message-ID: <>

On Wed, Aug 20, 2014 at 9:52 PM, Cameron Simpson <cs at> wrote:

> On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.barker at>
> wrote:

>  So really, people treat them as
>> "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
>> maybe a couple others)-is-ascii-compatible"
> As someone who fought long and hard in the surrogate-escape listdir()
> wars, and was won over once the scheme was thoroughly explained to me, I
> take issue with these assertions: they are bogus or misleading.
> Firstly, POSIX filenames _are_ just byte strings. The only forbidden
> character is the NUL byte, which terminates a C string, and the only
> special character is the slash, which separates pathname components.

so they are "just byte strings", oh, except that you can't have a  null,
and the "slash" had better be code 47 (and vice versa). How is that
different than "bytes-in-some-arbitrary-encoding-where-at-least
the-slash-character-is-ascii-compatible"?

(sorry about the "maybe a couple others", I was too lazy to do my research
and be sure).

But my point is that python users want to be able to work with paths, and
paths on posix are not strictly strings with a clearly defined encoding,
but they are also not quite "just arbitrary bytes". So it would be nice if
we could have a pathlib that would work with these odd beasts. I've lost
track a bit as to whether the surrogate-escape solution allows this to all
work now. If it does, then great, sorry for the noise.

> Second, a bare low level program cannot do _much_ more than pass them
> around.  It certainly can do things like compute their basename, or other
> path related operations.

only if you assume that pesky slash == 47 thing -- it's not much, but it's
not raw bytes either.
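
A minimal sketch of that point, using the bytes support that posixpath already has: path operations work on names that are not valid in any particular text encoding, as long as byte 47 is taken to be the separator. The path itself is made up for the example.

```python
import posixpath

# Hypothetical path whose last component is not valid UTF-8.
raw = b"/srv/data/\xff\xfe report.txt"

# posixpath accepts bytes and only looks for the separator byte b"/".
base = posixpath.basename(raw)
parent = posixpath.dirname(raw)

# The same assumption, spelled out by hand.
components = raw.split(b"/")
```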

> The "bytes in some arbitrary encoding where at least the slash character
> (and
> maybe a couple others) is ascii compatible" notion is completely bogus.
> There's only one special byte, the slash (code 47). There's no OS-level
> need that it or anything else be ASCII compatible. I think
> characterizations such as the one quoted are actively misleading.

code 47 == "slash" is ascii compatible -- where else did the 47 value come
from?

> I think we'd all agree it is nice to have a system where filenames are all
> Unicode, but since POSIX/UNIX predates it by decades it is a bit late to
> ignore the reality for such systems.

well, the community could have gone to "if you want anything other than
ascii, make it utf-8 -- but always, we're all a bunch of independent

But none of this is relevant -- systems in the wild do what they do --
clearly we all want Python to work with them as best it can.

> There's no _external_ "filesystem encoding" in the sense of something
> recorded in the filesystem that anyone can inspect. But there is the
> expressed locale settings, available at runtime to any program that cares
> to pay attention. It is a workable situation.

I haven't run into it, but it seems the folks that have don't think relying
on the locale setting is the least bit workable. If it were, we wouldn't be
having this discussion -- use the locale setting to decide how to decode
filenames -- done.

> Oh, and I reject Nick's characterisation of POSIX as "broken". It's
> perfectly internally consistent. It just doesn't match what he wants.
> (Indeed, what I want, and I'm a long time UNIX fanboy.)

bug or feature? you decide. Internal consistency is a good start, but it
punts the whole encoding issue to the client software, without giving it
the tools to do it right. I call that "really hard to work with" if not



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From p.f.moore at  Fri Aug 22 00:42:06 2014
From: p.f.moore at (Paul Moore)
Date: Thu, 21 Aug 2014 23:42:06 +0100
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
Message-ID: <>

On 21 August 2014 23:27, Cameron Simpson <cs at> wrote:
> That's not "ASCII compatible". That's "not all byte codes can be freely used
> without thought", and any multibyte coding will have to consider such things
> when embedding itself in another coding scheme.

I wonder how badly a Unix system would break if you specified UTF16 as
the system encoding...?

From antoine at  Fri Aug 22 00:54:47 2014
From: antoine at (Antoine Pitrou)
Date: Thu, 21 Aug 2014 18:54:47 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
Message-ID: <lt5tbn$108$>

Le 21/08/2014 18:27, Cameron Simpson a écrit :
> As
> remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX
> filename bytes strings.

So you admit that POSIX mandates that file paths are expressed in an 
ASCII-compatible encoding after all? Good. I've nothing to add to your rant.


From ijmorlan at  Fri Aug 22 01:06:55 2014
From: ijmorlan at (Isaac Morland)
Date: Thu, 21 Aug 2014 19:06:55 -0400 (EDT)
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
Message-ID: <>

On Thu, 21 Aug 2014, Chris Barker wrote:

> so they are "just byte strings", oh, except that you can't have a null, and
> the "slash" had better be code 47 (and vice versa). How is that different
> than "bytes-in-some-arbitrary-encoding-where-at-least
> the-slash-character-is-ascii-compatible"?

Actually, slash doesn't need to be code 47.  But no matter what code 47 
means outside of the context of a filename, it is the path arc separator 
byte (not character).

In fact, this isn't even entirely academic.  On a Mac OS X machine, go 
into Finder and try to create a directory called ":".  You'll get an error 
saying 'The name “:” can’t be used.'.  Now create a directory called "/". 
No problem, raising the question of what is going on at the filesystem 


$ ls -al
total 0
drwxr-xr-x   3 ijmorlan  staff   102 21 Aug 18:57 ./
drwxr-xr-x+ 80 ijmorlan  staff  2720 21 Aug 18:57 ../
drwxr-xr-x   2 ijmorlan  staff    68 21 Aug 18:57 :/

And of course in shell one would remove the directory with this:

rm -rf :


rm -rf /

So in effect the file system path arc encoding on Mac OS X is UTF-8 
*except* that : is outlawed and / is encoded as \x3A rather than the usual 
\x2F.  Of course, the path arc separator byte (not character) remains \x2F 
as always.

Just for fun, there are contexts in which one can give a full path at the 
GUI level, where : is used as the path separator.  This is for historical 
reasons and presumably is the reason for the above-noted behaviour.

I think the real tension here is between the POSIX level where filenames 
are byte strings (except for \x00, which is reserved for string 
termination) where \x2F has special interpretation, and absolutely every 
application ever written, in every language, which wants filenames to be 
character strings.

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist

From ncoghlan at  Fri Aug 22 01:25:05 2014
From: ncoghlan at (Nick Coghlan)
Date: Fri, 22 Aug 2014 09:25:05 +1000
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <>

On 22 Aug 2014 04:45, "Benjamin Peterson" <benjamin at> wrote:
> Perhaps some board members could comment, but I hope the PSF could just
> pay a few hundred a year for a proper certificate.

That's exactly what we're doing - MAL reminded me we reached the same
conclusion last time this came up, we'll just track it better this time to
make sure it doesn't slip through the cracks again.

(And yes, switching to forced HTTPS once this is addressed would also be a
good idea - we'll add it to the list)


From ncoghlan at  Fri Aug 22 01:38:55 2014
From: ncoghlan at (Nick Coghlan)
Date: Fri, 22 Aug 2014 09:38:55 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
Message-ID: <>

On 22 Aug 2014 09:24, "Isaac Morland" <ijmorlan at> wrote:
> I think the real tension here is between the POSIX level where filenames
> are byte strings (except for \x00, which is reserved for string
> termination) where \x2F has special interpretation, and absolutely every
> application ever written, in every language, which wants filenames to be
> character strings.

That's one of the best summaries of the situation I've ever seen :)

Most languages (including Python 2) throw up their hands and say this is
the developer's problem to deal with. Python 3 says it's *our* problem to
deal with on behalf of our developers. The "surrogateescape" error handler
allows recalcitrant bytes to be dealt with relatively gracefully in most
situations. We don't quite cover *everything* yet (hence the complaints
from some of the folks that are experts at dealing with Python 2 Unicode
handling on POSIX systems), but the remaining problems are a lot more
tractable than the "teach every native English speaker everywhere how to
handle Unicode properly" problem.


From v+python at  Fri Aug 22 02:00:02 2014
From: v+python at (Glenn Linderman)
Date: Thu, 21 Aug 2014 17:00:02 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
Message-ID: <>

On 8/21/2014 3:42 PM, Paul Moore wrote:
> I wonder how badly a Unix system would break if you specified UTF16 as
> the system encoding...?
> Paul

Does Unix even support UTF-16 as an encoding? I suppose, these days, it 
probably does, for reading contents of files created on Windows, etc. 
(Unicode was just gaining traction when I last used Unix in a 
significant manner; yes, my web host runs Linux, and I know enough to do 
what can be done there... but haven't experimented with encodings other 
than ASCII & UTF-8 on the web host, and don't intend to).

If it allows configuration of UTF-16 or UTF-32 as system encodings, I 
would consider that a bug, though, as too much of Unix predates Unicode, 
and would be likely to fail.

From v+python at  Fri Aug 22 01:56:59 2014
From: v+python at (Glenn Linderman)
Date: Thu, 21 Aug 2014 16:56:59 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <lt5tbn$108$>
References: <lt4rmr$a00$>
 <> <lt5tbn$108$>
Message-ID: <>

On 8/21/2014 3:54 PM, Antoine Pitrou wrote:
> Le 21/08/2014 18:27, Cameron Simpson a écrit :
>> As
>> remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX
>> filename bytes strings.
> So you admit that POSIX mandates that file paths are expressed in an 
> ASCII-compatible encoding after all? Good. I've nothing to add to your 
> rant.
> Antoine.

0 and 47 are certainly originally derived from ASCII.  However, there 
could be lots of encodings that are not ASCII compatible (but in 
practice, probably very few, since most encodings _are_ ASCII 
compatible) that could be fit those constraints.

So while as a technical matter, Cameron is correct that Unix only treats 
0 & 47 as special, and that is insufficient to declare that encodings 
must be ASCII compatible, as a practical matter, since most encodings 
are ASCII compatible anyway, it would be hard to find very many that 
could be used successfully with Unix file names that are not ASCII 
compatible, that could comply with the 0 & 47 requirements.

From phd at  Fri Aug 22 03:09:33 2014
From: phd at (Oleg Broytman)
Date: Fri, 22 Aug 2014 03:09:33 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
Message-ID: <>

On Thu, Aug 21, 2014 at 05:00:02PM -0700, Glenn Linderman <v+python at> wrote:
> On 8/21/2014 3:42 PM, Paul Moore wrote:
> >I wonder how badly a Unix system would break if you specified UTF16 as
> >the system encoding...?
> Does Unix even support UTF-16 as an encoding?

   As an encoding of file's content? Certainly yes. As a locale
encoding? Definitely no.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From chris.barker at  Fri Aug 22 02:30:14 2014
From: chris.barker at (Chris Barker - NOAA Federal)
Date: Thu, 21 Aug 2014 17:30:14 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
Message-ID: <5124983344373446869@unknownmsgid>

> Does Unix even support UTF-16 as an encoding? I suppose, these days, it probably does, for reading contents of files created on Windows, etc.

I don't think Unix supports any encodings at all for the _contents_ of
files -- that's up to applications. Of course the command line text
processing tools need to know -- I'm guessing those are never going to
work w/UTF-16!

"System encoding" is a nice idea, but pretty much worthless. Only
helpful for files created and processed on the same system -- not rare
for that not to be the case.

This brings up the other key problem. If file names are (almost)
arbitrary bytes, how do you write one to/read one from a text file
with a particular encoding? ( or for that matter display it on a
terminal)

And people still want to say posix isn't broken in this regard?



From tjreedy at  Fri Aug 22 04:32:24 2014
From: tjreedy at (Terry Reedy)
Date: Thu, 21 Aug 2014 22:32:24 -0400
Subject: [Python-Dev] -- Untrusted Connection
In-Reply-To: <>
References: <lstmo7$1sh$>
Message-ID: <lt6a4s$f3q$>

On 8/21/2014 7:25 PM, Nick Coghlan wrote:
> On 22 Aug 2014 04:45, "Benjamin Peterson" <benjamin at
> <mailto:benjamin at>> wrote:
>  >
>  > Perhaps some board members could comment, but I hope the PSF could just
>  > pay a few hundred a year for a proper certificate.
> That's exactly what we're doing - MAL reminded me we reached the same
> conclusion last time this came up, we'll just track it better this time
> to make sure it doesn't slip through the cracks again.
> (And yes, switching to forced HTTPS once this is addressed would also be
> a good idea - we'll add it to the list)

I just switched from a 'low variety' short password of the sort almost 
crackable with brute force (today, though not several years ago) to a 
higher variety longer password. People with admin privileges on the 
tracker might be reminded to recheck.  What was adequate 10 years ago is 
not so now.

Terry Jan Reedy

From phd at  Fri Aug 22 04:42:29 2014
From: phd at (Oleg Broytman)
Date: Fri, 22 Aug 2014 04:42:29 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <5124983344373446869@unknownmsgid>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <chris.barker at> wrote:
> This brings up the other key problem. If file names are (almost)
> arbitrary bytes, how do you write one to/read one from a text file
> with a particular encoding? ( or for that matter display it on a
> terminal)

   There is no such thing as an encoding of text files. So we just
write those bytes to the file or output them to the terminal. I often do
that. My filesystems are full of files with names and content in
at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a
terminal with koi8 or utf-8 locale and fonts and some files always look
weird. But however weird they are it's possible to work with them.

   The bigger problem is line feeds. A filename with linefeeds can be
put to a text file, but cannot be read back. So one has to transform
such names. Usually s/\\/\\\\/g and s/\n/\\n/g is enough. (-:
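
The two substitutions above can be sketched in Python roughly as follows. The helper names escape_name and unescape_name are hypothetical, and the sketch assumes one file name per line in the output file.

```python
def escape_name(name: bytes) -> bytes:
    """Apply s/\\/\\\\/g and then s/\n/\\n/g, as in the two regexes above."""
    return name.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")


def unescape_name(line: bytes) -> bytes:
    """Reverse escape_name by scanning left to right, one escape at a time."""
    out = bytearray()
    i = 0
    while i < len(line):
        if line[i:i + 1] == b"\\" and i + 1 < len(line):
            nxt = line[i + 1:i + 2]
            # "\n" becomes a linefeed; "\\" becomes a single backslash.
            out += b"\n" if nxt == b"n" else nxt
            i += 2
        else:
            out += line[i:i + 1]
            i += 1
    return bytes(out)


# A name containing both a real linefeed and a literal backslash + "n".
name = b"odd\nna\\nme"
escaped = escape_name(name)
```

The order matters: backslashes have to be doubled before linefeeds are rewritten, otherwise the backslash introduced for "\n" would itself be doubled.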

> And people still want to say posix isn't broken in this regard?

   Not at all! And broken or not broken it's what I (for many different
reasons) prefer to use for my desktops, servers, notebooks, routers and
smartphones, so if Python stood in my way I'd rather switch to
different tools.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From stephen at  Fri Aug 22 07:11:08 2014
From: stephen at (Stephen J. Turnbull)
Date: Fri, 22 Aug 2014 14:11:08 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <5124983344373446869@unknownmsgid>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

Chris Barker - NOAA Federal writes:

 > This brings up the other key problem. If file names are (almost)
 > arbitrary bytes, how do you write one to/read one from a text file
 > with a particular encoding? ( or for that matter display it on a
 > terminal)

"Very carefully."

But this is strictly from need.  *Nobody* (with the exception of the
crackers who like to name their programs things like "\u0007") *wants*
to do this.  Real people want to name their files in some human
language they understand, and spell it in the usual way, and encode
those characters as bytes in the usual way.

Decoding those characters in the usual way and getting nonsense is the
exceptional case, and it must be the application's or user's problem
to decide what to do.  They know where they got the file from and
usually have some idea of what its name should look like.  Python
doesn't, so Python cannot solve it for them.

For that reason, I believe that Python's "normal"/high-level approach
to file handling should treat file names as (human-oriented) text.  Of
course Python should be able to handle bytes straight from the disk,
but most programmers shouldn't have to.

 > And people still want to say posix isn't broken in this regard?

Deal with it, bro'.<wink/>

From marko at  Fri Aug 22 07:24:42 2014
From: marko at (Marko Rauhamaa)
Date: Fri, 22 Aug 2014 08:24:42 +0300
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
 (Nick Coghlan's message of "Fri, 22 Aug 2014 09:38:55 +1000")
References: <>
Message-ID: <>

Nick Coghlan <ncoghlan at>:

> Python 3 says it's *our* problem to deal with on behalf of our
> developers.


    Flik: I was just trying to help.

    Mr. Soil: Then help us; *don't* help us.


From steve at  Fri Aug 22 17:19:14 2014
From: steve at (Steven D'Aprano)
Date: Sat, 23 Aug 2014 01:19:14 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <20140822151911.GS25957@ando>

On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <chris.barker at> wrote:
> > This brings up the other key problem. If file names are (almost)
> > arbitrary bytes, how do you write one to/read one from a text file
> > with a particular encoding? ( or for that matter display it on a
> > terminal)
>    There is no such thing as an encoding of text files.

I don't understand this comment. It seems to me that *text* files have 
to have an encoding, otherwise you can't interpret the contents as text. 
Files, of course, only contain bytes, but to be treated as text you 
need some way of transforming byte N to char C (or multiple bytes to C), 
which is an encoding.

Perhaps you just mean that encodings are not recorded in the text file 
itself?

To answer Chris' question, you typically cannot include arbitrary 
bytes in text files, and displaying them to the user is likewise 
problematic. The usual solution is to support some form of 
escaping, like \t, &#x0A; or %0D, to give a few examples.
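
As one example of such escaping, percent-encoding (the %0D style) is available in the standard library for arbitrary bytes. This is only a sketch of one possible choice, with a made-up byte string, not a recommendation from the thread.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical name containing a carriage return and a non-UTF-8 byte.
raw = b"name\rwith\xffbytes"

# Every byte outside the unreserved ASCII set becomes %XX, so the
# result is plain ASCII text that fits in any ASCII-compatible file.
escaped = quote_from_bytes(raw)

# Decoding restores the original bytes exactly.
restored = unquote_to_bytes(escaped)
```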


From martin at  Fri Aug 22 17:25:16 2014
From: martin at (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Fri, 22 Aug 2014 17:25:16 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <lt5tbn$108$>
Message-ID: <>

Am 22.08.14 01:56, schrieb Glenn Linderman:
> 0 and 47 are certainly originally derived from ASCII.  However, there
> could be lots of encodings that are not ASCII compatible (but in
> practice, probably very few, since most encodings _are_ ASCII
> compatible) that could be fit those constraints.
> So while as a technical matter, Cameron is correct that Unix only treats
> 0 & 47 as special, and that is insufficient to declare that encodings
> must be ASCII compatible, as a practical matter, since most encodings
> are ASCII compatible anyway, it would be hard to find very many that
> could be used successfully with Unix file names that are not ASCII
> compatible, that could comply with the 0 & 47 requirements.

More importantly, existing encodings that are distinctively *not*
ASCII compatible (e.g. the EBCDIC ones) do not put the slash into 47
(instead, it is at 97 (0x61) in EBCDIC; 47 is the BEL control character).
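
The EBCDIC positions can be checked with Python's cp037 codec. Note the assumption: cp037 is just one EBCDIC code page, picked here because it ships with Python; other EBCDIC variants may differ in places.

```python
# Where does "/" live in EBCDIC (code page 037)?
slash_in_ebcdic = "/".encode("cp037")
slash_code = slash_in_ebcdic[0]

# What does byte 47 (0x2F), the POSIX path separator, mean in EBCDIC?
byte_47_in_ebcdic = b"\x2f".decode("cp037")
```

So a POSIX kernel splitting on byte 47 would be splitting EBCDIC names on the BEL control character rather than on a slash.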

There are boundary cases, of course. VISCII is "mostly ASCII
compatible", putting graphic characters into some of the control
characters, but using those that aren't used in ASCII, anyway.

And then there is the YUSCII family of encodings, which definitely
is not ASCII compatible, as it does not contain Latin characters,
but still puts the / into 47 (and also keeps the ASCII digits and
special characters in their positions). There is also SI 960, which
has the slash, the ASCII uppercase letters, digits and special
characters, but replaces the lower-case characters with Hebrew.

So yes, Unix doesn't mandate ASCII-compatible encodings; but it
still mandates ASCII-inspired encodings. I wonder how you would
run "gcc", though, on an SI 960 system; you'ld have to type


From phd at  Fri Aug 22 17:51:04 2014
From: phd at (Oleg Broytman)
Date: Fri, 22 Aug 2014 17:51:04 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <20140822151911.GS25957@ando>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
Message-ID: <>


On Sat, Aug 23, 2014 at 01:19:14AM +1000, Steven D'Aprano <steve at> wrote:
> On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
> > On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <chris.barker at> wrote:
> > > This brings up the other key problem. If file names are (almost)
> > > arbitrary bytes, how do you write one to/read one from a text file
> > > with a particular encoding? ( or for that matter display it on a
> > > terminal)
> > 
> >    There is no such thing as an encoding of text files.
> I don't understand this comment. It seems to me that *text* files have 
> to have an encoding, otherwise you can't interpret the contents as text. 

   What encoding does a text file have (an HTML file, to be precise) with
text in utf-8, ads in cp1251 (ad blocks were included from different
files) and comments in koi8-r?
   Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.

> Files, of course, only contain bytes, but to be treated as bytes you 
> need some way of transforming byte N to char C (or multiple bytes to C), 
> which is an encoding.

   But you don't need to treat the entire file in one encoding. Strange
characters are clearly visible so you can interpret them differently. I
am very much trained to distinguish koi8, cp1251 and utf-8 texts; I
cannot translate them mentally but I can recognize them.

> Perhaps you just mean that encodings are not recorded in the text file 
> itself?

   Yes, that too.

> To answer Chris' question, you typically cannot include arbitrary 
> bytes in text files, and displaying them to the user is likewise 
> problematic

   As a person who views utf-8 files in koi8 fonts (and vice versa) every
day I'd argue. (-:

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From status at  Fri Aug 22 18:08:12 2014
From: status at (Python tracker)
Date: Fri, 22 Aug 2014 18:08:12 +0200 (CEST)
Subject: [Python-Dev] Summary of Python tracker Issues
Message-ID: <>

ACTIVITY SUMMARY (2014-08-15 - 2014-08-22)
Python tracker at

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4621 (+19)
  closed 29399 (+28)
  total  34020 (+47)

Open issues with patches: 2179 

Issues opened (41)

#22207: Test for integer overflow on Py_ssize_t: explicitly cast to si  opened by haypo

#22208: tarfile can't add in memory files (reopened)  opened by markgrandi

#22209: Idle: add better access to extension information  opened by terry.reedy

#22210: pdb-run-restarting-a-pdb-session  opened by zhengxiexie

#22211: Remove VMS specific code in expat.h  & xmlrole.h  opened by John.Malmberg

#22212: fails if module fails to build.  opened by John.Malmberg

#22213: pyvenv style virtual environments unusable in an embedded syst  opened by grahamd

#22214: Tkinter: Don't stringify callbacks arguments  opened by serhiy.storchaka

#22215: "embedded NUL character" exceptions  opened by serhiy.storchaka

#22216: smtplip STARTTLS fails at second attampt due to unsufficiant q  opened by zvyn

#22217: Reprs for zipfile classes  opened by serhiy.storchaka

#22218: Fix more compiler warnings "comparison between signed and unsi  opened by haypo

#22219: python -mzipfile fails to add empty folders to created zip  opened by Antony.Lee

#22220: Ttk extensions test failure  opened by serhiy.storchaka

#22221: ast.literal_eval confused by coding declarations  opened by jorgenschaefer

#22222: dtoa.c: remove custom memory allocator  opened by haypo

#22223: argparse not including '--' arguments in previous optional REM  opened by Jurko.Gospodnetić

#22225: Add SQLite support to http.cookiejar  opened by demian.brecht

#22226: Refactor dict result handling in Tkinter  opened by serhiy.storchaka

#22227: Simplify tarfile iterator  opened by serhiy.storchaka

#22228: Adapt bash readline operate-and-get-next function  opened by lelit

#22229: wsgiref doesn't appear to ever set REMOTE_HOST in the environ  opened by alex

#22231: httplib: unicode url will cause an ascii codec error when comb  opened by Bob.Chen

#22232: str.splitlines splitting on none-\r\n characters  opened by scharron

#22233: http.client splits headers on none-\r\n characters  opened by scharron

#22234: urllib.parse.urlparse accepts any falsy value as an url  opened by Ztane

#22235: httplib: TypeError with file() object in  opened by erob

#22236: Do not use _default_root in Tkinter tests  opened by serhiy.storchaka

#22237: sorted() docs should state that the sort is stable  opened by Wilfred.Hughes

#22239: asyncio: nested event loop  opened by djarb

#22240: argparse support for "python -m module" in help  opened by tebeka

#22241: strftime/strptime round trip fails even for UTC datetime objec  opened by akira

#22242: Doc fix in the Import section in language reference.  opened by jon.poler

#22243: Documentation on try statement incorrectly implies target of e  opened by mwilliamson

#22244: load_verify_locations fails to handle unicode paths on Python  opened by alex

#22246: add strptime(s, '%s')  opened by akira

#22247: More incomplete module.__all__ lists  opened by vadmium

#22248: urllib.request.urlopen raises exception when 30X-redirect url  opened by tomasgroth

#22249: Possibly incorrect example is given for socket.getaddrinfo()  opened by Alexander.Patrakov

#22250: unittest lowercase methods  opened by simonzack

#22251: Various markup errors in documentation  opened by berker.peksag

Most recent 15 issues with no replies (15)

#22251: Various markup errors in documentation

#22250: unittest lowercase methods

#22249: Possibly incorrect example is given for socket.getaddrinfo()

#22246: add strptime(s, '%s')

#22244: load_verify_locations fails to handle unicode paths on Python

#22242: Doc fix in the Import section in language reference.

#22239: asyncio: nested event loop

#22234: urllib.parse.urlparse accepts any falsy value as an url

#22231: httplib: unicode url will cause an ascii codec error when comb

#22229: wsgiref doesn't appear to ever set REMOTE_HOST in the environ

#22227: Simplify tarfile iterator

#22225: Add SQLite support to http.cookiejar

#22216: smtplip STARTTLS fails at second attampt due to unsufficiant q

#22212: fails if module fails to build.

#22211: Remove VMS specific code in expat.h  & xmlrole.h

Most recent 15 issues waiting for review (15)

#22251: Various markup errors in documentation

#22246: add strptime(s, '%s')

#22242: Doc fix in the Import section in language reference.

#22240: argparse support for "python -m module" in help

#22236: Do not use _default_root in Tkinter tests

#22228: Adapt bash readline operate-and-get-next function

#22227: Simplify tarfile iterator

#22226: Refactor dict result handling in Tkinter

#22222: dtoa.c: remove custom memory allocator

#22219: python -mzipfile fails to add empty folders to created zip

#22218: Fix more compiler warnings "comparison between signed and unsi

#22217: Reprs for zipfile classes

#22216: smtplip STARTTLS fails at second attampt due to unsufficiant q

#22215: "embedded NUL character" exceptions

#22214: Tkinter: Don't stringify callbacks arguments

Top 10 most discussed issues (10)

#17535: IDLE: Add an option to show line numbers along the left side o   9 msgs

#22208: tarfile can't add in memory files (reopened)   8 msgs

#2527: Pass a namespace to timeit   6 msgs

#22195: Make it easy to replace print() calls with logging calls   6 msgs

#22241: strftime/strptime round trip fails even for UTC datetime objec   6 msgs

#22194: access to cdecimal / libmpdec API   5 msgs

#22198: Odd floor-division corner case   5 msgs

#22218: Fix more compiler warnings "comparison between signed and unsi   5 msgs

#20152: Derby #15: Convert 50 sites to Argument Clinic across 9 files   4 msgs

#20184: Derby #16: Convert 50 sites to Argument Clinic across 9 files   4 msgs

Issues closed (27)

#7283: test_site failure when .local/lib/pythonX.Y/site-packages hasn  closed by ned.deily

#15696: Correct __sizeof__ support for mmap  closed by serhiy.storchaka

#16599: unittest: Access test result from tearDown  closed by Claudiu.Popa

#19628: maxlevels -1 on compileall for unlimited recursion  closed by python-dev

#19714: Add tests for importlib.machinery.WindowsRegistryFinder  closed by brett.cannon

#19997: imghdr.what doesn't accept bytes paths  closed by serhiy.storchaka

#20797: zipfile.extractall should accept bytes path as parameter  closed by serhiy.storchaka

#21308: PEP 466: backport ssl changes  closed by python-dev

#21389: The repr of BoundMethod objects sometimes incorrectly identifi  closed by python-dev

#21549: Add the members parameter for TarFile.list()  closed by serhiy.storchaka

#22016: Add a new 'surrogatereplace' output only error handler  closed by ncoghlan

#22068: tkinter: avoid reference loops with Variables and Fonts  closed by serhiy.storchaka

#22118: urljoin fails with messy relative URLs  closed by pitrou

#22150: deprecated-removed directive is broken in Sphinx 1.2.2  closed by berker.peksag

#22156: Fix compiler warnings "comparison between signed and unsigned  closed by haypo

#22157: _ctypes on ppc64: libffi/src/powerpc/linux64.o: ABI version 1  closed by doko

#22165: Empty response from http.server when directory listing contain  closed by serhiy.storchaka

#22188: test_gdb fails on invalid gdbinit  closed by python-dev

#22191: warnings.__all__ incomplete  closed by brett.cannon

#22200: Remove distutils checks for Python version  closed by python-dev

#22201: python -mzipfile fails to unzip files with folders created by  closed by serhiy.storchaka

#22205: debugmallocstats test is cpython only  closed by python-dev

#22206: PyThread_create_key(): fix comparison between signed and unsig  closed by haypo

#22224: is prone to political blocking in Russia  closed by georg.brandl

#22230: 'python -mzipfile -c' does not zip empty directories  closed by serhiy.storchaka

#22238: fractions.gcd results in infinite loop when nan or inf given a  closed by mark.dickinson

#22245: test_urllib2_localnet prints out error messages  closed by orsenthil

From v+python at  Fri Aug 22 18:37:13 2014
From: v+python at (Glenn Linderman)
Date: Fri, 22 Aug 2014 09:37:13 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
Message-ID: <>

On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>     What encoding does have a text file (an HTML, to be precise) with
> text in utf-8, ads in cp1251 (ad blocks were included from different
> files) and comments in koi8-r?
>     Well, I must admit the HTML was rather an exception, but having a
> text file with some strange characters (binary strings, or paragraphs
> in different encodings) is not that exceptional.
That's not a text file. That's a binary file containing (hopefully 
delimited, and documented) sections of encoded text in different encodings.

If it is named .html and served by the server as UTF-8, then the server 
is misconfigured, or the file is incorrectly populated.

From phd at  Fri Aug 22 18:52:22 2014
From: phd at (Oleg Broytman)
Date: Fri, 22 Aug 2014 18:52:22 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
Message-ID: <>

On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at> wrote:
> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> >    What encoding does have a text file (an HTML, to be precise) with
> >text in utf-8, ads in cp1251 (ad blocks were included from different
> >files) and comments in koi8-r?
> >    Well, I must admit the HTML was rather an exception, but having a
> >text file with some strange characters (binary strings, or paragraphs
> >in different encodings) is not that exceptional.
> That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.

   Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep and so on. If it is not a text file by strict Python3
standards then these standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or find a different
tool. I cannot avoid such files because my reality is much more complex
than strict text/binary dichotomy in Python3.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From v+python at  Fri Aug 22 19:09:21 2014
From: v+python at (Glenn Linderman)
Date: Fri, 22 Aug 2014 10:09:21 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
Message-ID: <>

On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at> wrote:
>> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>>>     What encoding does have a text file (an HTML, to be precise) with
>>> text in utf-8, ads in cp1251 (ad blocks were included from different
>>> files) and comments in koi8-r?
>>>     Well, I must admit the HTML was rather an exception, but having a
>>> text file with some strange characters (binary strings, or paragraphs
>>> in different encodings) is not that exceptional.
>> That's not a text file. That's a binary file containing (hopefully
>> delimited, and documented) sections of encoded text in different
>> encodings.
>     Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
> Oleg.

I was not declaring your file not to be a "text file" from any 
definition obtained from Python3 documentation, just from a common sense 
definition of "text file".

Looking at it from Python3, though, it is clear that when opening a file 
in "text" mode, an encoding may be specified or will be assumed.  That 
is one encoding, applying to the whole file, not 3 encodings, with 
declarations on when to switch between them. So I think, in general, 
Python3 assumes or defines a definition of text file that matches my 
"common sense" definition. Also, if it is an HTML file, I doubt the 
browser will use multiple different encodings when interpreting it, so 
it is not clear that the file is of practical use for its intended 
purpose if it contains text in multiple different encodings, but is 
served using only a single encoding, unless there is javascript or some 
programming in the browser that reencodes the data.

On the other hand, Python3 provides various facilities for working with 
such files.

The first I'll mention is the one that follows from my description of 
what your file really is: Python3 allows opening files in binary mode, 
and then decoding various sections of it using whatever encoding you 
like, using the bytes.decode() operation on various sections of the 
file. Determination of which sections are in which encodings is beyond 
the scope of this description of the technique, and is application 
dependent.

The second is to specify an error handler, that, like you, is trained to 
recognize the other encodings and convert them appropriately. I'm not 
aware that such an error handler has been or could be written, myself 
not having your training.

The third is to specify the UTF-8 with the surrogate escape error 
handler. This allows non-UTF-8 codes to be loaded into memory. You, or 
algorithms as smart as you, could perhaps be developed to detect and 
manipulate the resulting "lone surrogate" codes in meaningful ways, or 
could simply allow them to ride along without interpretation, and be 
emitted as the original, into other files.

There may be other techniques that I am not aware of.
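
A minimal sketch of the first technique (binary mode plus bytes.decode()
on each section), under loud assumptions: the helper name decode_section
and the candidate order ("utf-8" first, then "cp1251") are hypothetical
choices for illustration, not anything defined by Python or by this
thread. A naive chain like this cannot tell cp1251 from koi8-r, since both
are single-byte encodings that accept nearly any byte; real detection
stays application dependent.

```python
def decode_section(raw, candidates=("utf-8", "cp1251")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    `candidates` is an assumed, application-specific guess list.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep the bytes recoverable instead of failing.
    return raw.decode("utf-8", errors="surrogateescape"), None


# Two sections of the same "file", each in a different encoding.
sections = ["Привет".encode("utf-8"), "Привет".encode("cp1251")]
decoded = [decode_section(raw) for raw in sections]
```

With a real file, the sections would come from open(path, "rb") and
whatever delimiter the application uses to separate them.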


From phd at  Fri Aug 22 20:50:05 2014
From: phd at (Oleg Broytman)
Date: Fri, 22 Aug 2014 20:50:05 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
Message-ID: <>

On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman <v+python at> wrote:
> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> >On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at> wrote:
> >>On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> >>>    What encoding does have a text file (an HTML, to be precise) with
> >>>text in utf-8, ads in cp1251 (ad blocks were included from different
> >>>files) and comments in koi8-r?
> >>>    Well, I must admit the HTML was rather an exception, but having a
> >>>text file with some strange characters (binary strings, or paragraphs
> >>>in different encodings) is not that exceptional.
> >>That's not a text file. That's a binary file containing (hopefully
> >>delimited, and documented) sections of encoded text in different
> >>encodings.
> >    Allow me to disagree. For me, this is a text file which I can (and
> >do) view with a pager, edit with a text editor, list on a console,
> >search with grep and so on. If it is not a text file by strict Python3
> >standards then these standards are too strict for me. Either I find a
> >simple workaround in Python3 to work with such texts or find a different
> >tool. I cannot avoid such files because my reality is much more complex
> >than strict text/binary dichotomy in Python3.
> I was not declaring your file not to be a "text file" from any
> definition obtained from Python3 documentation, just from a common
> sense definition of "text file".

   And in my opinion those files are perfect text. The files consist of
lines separated by EOL characters (not necessarily the EOL characters of my OS
because it could be a text file produced in a different OS), lines
consist of words and words of characters.

> Looking at it from Python3, though, it is clear that when opening a
> file in "text" mode, an encoding may be specified or will be
> assumed.  That is one encoding, applying to the whole file, not 3
> encodings, with declarations on when to switch between them. So I
> think, in general, Python3 assumes or defines a definition of text
> file that matches my "common sense" definition.

   I don't have problems with Python3 text. I have problems with Python3
trying to get rid of byte strings and treating bytes as strict non-text.

> On the other hand, Python3 provides various facilities for working
> with such files.
> The first I'll mention is the one that follows from my description
> of what your file really is: Python3 allows opening files in binary
> mode, and then decoding various sections of it using whatever
> encoding you like, using the bytes.decode() operation on various
> sections of the file. Determination of which sections are in which
> encodings is beyond the scope of this description of the technique,
> and is application dependent.

   This is perhaps the most promising approach. If I can open a text
file in binary mode, iterate over it line by line, split every line of
non-ascii bytes with .split() and process them, that'd satisfy my needs.
   But there are still dragons. If I read a filename from such a file I
read it as bytes, not str, so I can only use low-level APIs to
manipulate those filenames. Pity.

   Let's look at a perfectly normal situation I am quite often in. A person
sent me a directory full of MP3 files. The transport doesn't matter; it
could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
matters is that filenames and content are in alien encodings. Most often
it's cp1251 (the encoding used in Russian Windows) but can be koi8 or
utf8. There is a playlist among the files -- a text file that lists MP3
files, every file on a single line; usually with full paths
("C:\Audio\some.mp3").
   Now I want to read filenames from the file and process the filenames
(strip paths) and files (verify that the files exist, or renumber the files
or extract ID3 tags [Russian ID3 tags, whatever ID3 standard says, are
also in cp1251 of utf-8 encoding]...whatever). I don't know the encoding
of the playlist but I know it corresponds to the encoding of filenames
so I can expect those files to exist on my filesystem; they have
strange-looking unreadable names but they exist.
   Just a small example of why I do want to process filenames from a
text file in an alien encoding. Without knowing the encoding in advance.
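
A rough sketch of how the playlist case could be handled purely with
bytes, assuming the playlist uses Windows-style paths and an
ASCII-compatible encoding (true for cp1251, koi8-r and utf-8). The helper
names playlist_names and missing_files, and the sample entries, are
hypothetical. Nothing here needs to know which encoding the names are in.

```python
import ntpath
import os


def playlist_names(lines):
    """Yield the bare file name (as bytes) for each non-empty playlist line."""
    for raw_line in lines:
        entry = raw_line.rstrip(b"\r\n")
        if entry:
            # ntpath splits on both "\" and "/", so Windows-style full
            # paths are handled even when this runs on a POSIX system.
            yield ntpath.basename(entry)


def missing_files(lines, music_dir):
    """Return the playlist names (bytes) that do not exist under music_dir."""
    return [name for name in playlist_names(lines)
            if not os.path.exists(os.path.join(music_dir, name))]


# Sample entries; the first name holds cp1251-encoded Cyrillic bytes.
sample = [b"C:\\Audio\\\xef\xe5\xf1\xed\xff.mp3\r\n", b"C:\\Audio\\some.mp3\r\n"]
names = list(playlist_names(sample))
```

With a real playlist, `lines` would be the file object returned by
open(path, "rb"), and music_dir would be a bytes path such as b"." so that
os.path.join() stays in bytes throughout.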

> The second is to specify an error handler, that, like you, is
> trained to recognize the other encodings and convert them
> appropriately. I'm not aware that such an error handler has been or
> could be written, myself not having your training.
> The third is to specify the UTF-8 with the surrogate escape error
> handler. This allows non-UTF-8 codes to be loaded into memory. You,
> or algorithms as smart as you, could perhaps be developed to detect
> and manipulate the resulting "lone surrogate" codes in meaningful
> ways, or could simply allow them to ride along without
> interpretation, and be emitted as the original, into other files.

   Yes, these are different workarounds.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From v+python at  Fri Aug 22 22:17:44 2014
From: v+python at (Glenn Linderman)
Date: Fri, 22 Aug 2014 13:17:44 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
Message-ID: <>

On 8/22/2014 11:50 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman <v+python at> wrote:
>> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
>>> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at> wrote:
>>>> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>>>>>     What encoding does have a text file (an HTML, to be precise) with
>>>>> text in utf-8, ads in cp1251 (ad blocks were included from different
>>>>> files) and comments in koi8-r?
>>>>>     Well, I must admit the HTML was rather an exception, but having a
>>>>> text file with some strange characters (binary strings, or paragraphs
>>>>> in different encodings) is not that exceptional.
>>>> That's not a text file. That's a binary file containing (hopefully
>>>> delimited, and documented) sections of encoded text in different
>>>> encodings.
>>>     Allow me to disagree. For me, this is a text file which I can (and
>>> do) view with a pager, edit with a text editor, list on a console,
>>> search with grep and so on. If it is not a text file by strict Python3
>>> standards then these standards are too strict for me. Either I find a
>>> simple workaround in Python3 to work with such texts or find a different
>>> tool. I cannot avoid such files because my reality is much more complex
>>> than strict text/binary dichotomy in Python3.
>> I was not declaring your file not to be a "text file" from any
>> definition obtained from Python3 documentation, just from a common
>> sense definition of "text file".
>     And in my opinion those files are perfect text. The files consist of
> lines separated by EOL characters (not necessary EOL characters of my OS
> because it could be a text file produced in a different OS), lines
> consist of words and words of characters.

Until you know or can deduce the encoding of a file, it is binary. If it 
has multiple, different, embedded encodings of text, it is still binary. 
In my opinion. So these are just opinions, and naming conventions. If 
you call it text, you have a different definition of text file than I do.

>> Looking at it from Python3, though, it is clear that when opening a
>> file in "text" mode, an encoding may be specified or will be
>> assumed.  That is one encoding, applying to the whole file, not 3
>> encodings, with declarations on when to switch between them. So I
>> think, in general, Python3 assumes or defines a definition of text
>> file that matches my "common sense" definition.
>     I don't have problems with Python3 text. I have problems with Python3
> trying to get rid of byte strings and treating bytes as strict non-text.

Python3 is not trying to get rid of byte strings. But to some extent, it 
is wanting to treat bytes as non-text... bytes can be encoded text, but 
is not text until it is decoded. There is some processing that can be 
done on encoded text, but it has to be done differently (in many cases) 
than processing done on (non-encoded) text.

One difference is that the interpretation of which character is which 
varies from encoding to encoding, so if the processing requires 
understanding the characters, then the encoding must be known.
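
As a small illustration (the sample word is an arbitrary choice), the same
bytes decode without error under both koi8-r and cp1251, but only one of
the two results is the text that was actually written:

```python
raw = "Привет".encode("koi8_r")

as_koi8 = raw.decode("koi8_r")    # the intended characters
as_cp1251 = raw.decode("cp1251")  # also "succeeds", but yields other letters

# Both decodings are lossless at the byte level; only the meaning differs.
same_bytes = as_cp1251.encode("cp1251") == raw
```

So a successful decode says nothing about whether the chosen encoding was
the right one.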

On the other hand, if it suffices to detect blocks of opaque text 
delimited by a known set of delimiter codes (EOL: CR, LF, combinations 
thereof) then that can be done relatively easily on binary, as long as 
the encoding doesn't have data puns where a multibyte encoded character 
might contain the code for the delimiter as one of the bytes of the code 
for the character.
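
A short sketch of that caveat. The helper split_lines is hypothetical.
Splitting on the LF byte is safe for ASCII-compatible encodings such as
utf-8 or koi8-r, but UTF-16 is an example of an encoding with such "data
puns": U+010A encodes to the bytes 0A 01 in UTF-16-LE, so its first byte
looks like a line feed.

```python
def split_lines(raw):
    """Split on LF and strip an optional trailing CR, without decoding."""
    return [line.rstrip(b"\r") for line in raw.split(b"\n")]


two_lines = "\u010a\n\u010a"  # two lines, each holding the character U+010A

utf8_parts = split_lines(two_lines.encode("utf-8"))       # splits correctly
utf16_parts = split_lines(two_lines.encode("utf-16-le"))  # splits inside characters
koi8_parts = split_lines("раз\r\nдва".encode("koi8_r"))   # single-byte: safe
```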

>> On the other hand, Python3 provides various facilities for working
>> with such files.
>> The first I'll mention is the one that follows from my description
>> of what your file really is: Python3 allows opening files in binary
>> mode, and then decoding various sections of it using whatever
>> encoding you like, using the bytes.decode() operation on various
>> sections of the file. Determination of which sections are in which
>> encodings is beyond the scope of this description of the technique,
>> and is application dependent.
>     This is perhaps the most promising approach. If I can open a text
> file in binary mode, iterate it line by line, split every line of
> non-ascii bytes with .split() and process them that'd satisfy my needs.
>     But still there are dragons. If I read a filename from such file I
> read it as bytes, not str, so I can only use low-level APIs to
> manipulate with those filenames. Pity.

If the file names are in an unknown encoding, both in the directory and 
in the encoded text in the file listing, then unless you can deduce the 
encoding, you would be limited to doing manipulations with file APIs 
that support bytes, the low-level ones, yes.  If you can deduce the 
encoding, then you are freed from that limitation.
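
For illustration, a sketch of what the bytes-accepting APIs look like,
assuming a POSIX system; the sample path is made up. The os.path functions
accept bytes and return bytes, and os.fsdecode()/os.fsencode() convert
between bytes and str names. On POSIX they use the surrogateescape error
handler, so the conversion is lossless there even for names that are not
valid in the filesystem encoding; Windows behaves differently.

```python
import os

raw_path = b"/music/\xef\xe5\xf1\xed\xff.mp3"  # name in an unknown 8-bit encoding

# Path manipulation works directly on bytes and returns bytes.
directory, name = os.path.split(raw_path)
stem, extension = os.path.splitext(name)

# On POSIX, undecodable bytes survive a trip through str unchanged.
as_str = os.fsdecode(raw_path)
round_tripped = os.fsencode(as_str)
```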

>     Let see a perfectly normal situation I am quite often in. A person
> sent me a directory full of MP3 files. The transport doesn't matter; it
> could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
> matters is that filenames and content are in alien encodings. Most often
> it's cp1251 (the encoding used in Russian Windows) but can be koi8 or
> utf8. There is a playlist among the files -- a text file that lists MP3
> files, every file on a single line; usually with full paths
> ("C:\Audio\some.mp3").
>     Now I want to read filenames from the file and process the filenames
> (strip paths) and files (verify existing of files, or renumber the files
> or extract ID3 tags [Russian ID3 tags, whatever ID3 standard says, are
> also in cp1251 of utf-8 encoding]...whatever).

"cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is 
utf-8, but it is not both. Maybe you meant "or" instead of "of".

>   I don't know the encoding
> of the playlist but I know it corresponds to the encoding of filenames
> so I can expect those files exist on my filesystem; they have strangely
> looking unreadable names but they exist.
>     Just a small example of why I do want to process filenames from a
> text file in an alien encoding. Without knowing the encoding in advance.

An interesting example, for sure. Life will be easier when everyone 
converts to Unicode and UTF-8.

>> The second is to specify an error handler, that, like you, is
>> trained to recognize the other encodings and convert them
>> appropriately. I'm not aware that such an error handler has been or
>> could be written, myself not having your training.
>> The third is to specify the UTF-8 with the surrogate escape error
>> handler. This allows non-UTF-8 codes to be loaded into memory. You,
>> or algorithms as smart as you, could perhaps be developed to detect
>> and manipulate the resulting "lone surrogate" codes in meaningful
>> ways, or could simply allow them to ride along without
>> interpretation, and be emitted as the original, into other files.
>     Yes, these are different workarounds.
> Oleg.


From chris.barker at  Fri Aug 22 20:51:20 2014
From: chris.barker at (Chris Barker)
Date: Fri, 22 Aug 2014 11:51:20 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
Message-ID: <>

On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman <v+python at> wrote:

>     What encoding does have a text file (an HTML, to be precise) with
> text in utf-8, ads in cp1251 (ad blocks were included from different
> files) and comments in koi8-r?
>    Well, I must admit the HTML was rather an exception, but having a
> text file with some strange characters (binary strings, or paragraphs
> in different encodings) is not that exceptional.
>  That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.
>     Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.

First -- we're getting OT here -- this thread was about file and path
names, not the contents of files. But I suppose I brought that in when I
talked about writing file names to files...

> The first I'll mention is the one that follows from my description of what
> your file really is: Python3 allows opening files in binary mode, and then
> decoding various sections of it using whatever encoding you like, using the
> bytes.decode() operation on various sections of the file. Determination of
> which sections are in which encodings is beyond the scope of this
> description of the technique, and is application dependent.

right -- and you would have wanted to open such a file in binary mode with
py2 as well, but in that case you'd have the contents in a py2 string
object, which has a few more convenient ways to work with text (at least
ascii-compatible text) than the py3 bytes object does.

> The third is to specify the UTF-8 with the surrogate escape error handler.
> This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as
> smart as you, could perhaps be developed to detect and manipulate the
> resulting "lone surrogate" codes in meaningful ways, or could simply allow
> them to ride along without interpretation, and be emitted as the original,
> into other files.

Just so I'm clear here -- if you write that back out, encoded as utf-8 --
you'll get the exact same binary blob out as came in?

I wonder if this would make it hard to preserve byte boundaries, though.
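
A quick sketch of that round trip, assuming the same error handler is
used on the way out as on the way in (the sample text is arbitrary). Each
undecodable byte becomes exactly one lone surrogate in the range U+DC80 to
U+DCFF, so byte boundaries of the non-UTF-8 parts stay recoverable:

```python
foreign = "мир".encode("cp1251")  # three bytes that are not valid UTF-8
raw = "Привет, ".encode("utf-8") + foreign + b"\n"

text = raw.decode("utf-8", errors="surrogateescape")

# One lone surrogate per undecodable byte; its low 8 bits are the byte value.
lone = [ch for ch in text if "\udc80" <= ch <= "\udcff"]
recovered = bytes(ord(ch) - 0xDC00 for ch in lone)

written_back = text.encode("utf-8", errors="surrogateescape")
```

Encoding `text` with plain "utf-8" (no error handler) raises
UnicodeEncodeError instead, because lone surrogates are not valid UTF-8.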

By the way, IIUC, you can also use the Python latin-1 decoder --
anything that really is latin-1 will come through correctly, anything else
will come in as garbage, but if you re-encode with latin-1 the original
bytes will be preserved. I think this will also preserve a 1:1 relationship
between character count and byte count, which could be handy.
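
A small check of the latin-1 idea, using every possible byte value as the
sample input. latin-1 maps each byte 0-255 to the code point with the same
number, so decoding never fails and the lengths always match:

```python
raw = bytes(range(256))  # every possible byte value

text = raw.decode("latin-1")
same_length = len(text) == len(raw)

# Character offsets equal byte offsets, so slices line up as well.
slice_matches = text[10:20].encode("latin-1") == raw[10:20]
restored = text.encode("latin-1")
```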



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From chris.barker at  Fri Aug 22 20:53:01 2014
From: chris.barker at (Chris Barker)
Date: Fri, 22 Aug 2014 11:53:01 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On Thu, Aug 21, 2014 at 7:42 PM, Oleg Broytman <phd at> wrote:

> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <
> chris.barker at> wrote:
> > This brings up the other key problem. If file names are (almost)
> > arbitrary bytes, how do you write one to/read one from a text file
> > with a particular encoding? ( or for that matter display it on a
> > terminal)
>    There is no such thing as an encoding of text files. So we just
> write those bytes to the file

So I write bytes that are encoded one way into a text file that's encoded
another way, and expect to be able to read that later? you're kidding,
right? Only if that's the only thing in the file -- usually not the case
with my text files.

> or output them to the terminal. I often do
> that. My filesystems are full of files with names and content in
> at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a
> terminal with koi8 or utf-8 locale and fonts and some files always look
> weird. But however weird they are it's possible to work with them.

Not for me (or many other users) -- terminals are sometimes set with
ascii-only encoding, so non-ascii barfs -- or you get some weird control
characters that mess up your terminal -- dumping arbitrary bytes to a
terminal does not always "just work".

> > And people still want to say posix isn't broken in this regard?
>    Not at all! And broken or not broken it's what I (for many different
> reasons) prefer to use for my desktops, servers, notebooks, routers and
> smartphones,

Sorry -- that's a Red Herring -- I agree, "broken" or "simple and
consistent" is irrelevant, we all want Python to work as well as it can on
such systems.

The point is that if you are reading a file name from the system, and then
passing it back to the system, then you can treat it as just bytes -- who
cares? And if you add the byte value of 47 thing, then you can even do
basic path manipulations. But once you want to do other things with your
file name, then you need to know the encoding. And it is very, very common
for users to need to do other things with filenames, and they almost always
want them as text that they can read and understand.

Python3 supports this case very well. But it does indeed make it hard to
work with filenames when you don't know the encoding they are in. And
apparently that's pretty common -- or common enough that it would be nice
for Python to support it well. The trick is how -- we'd like the "just
pass it around and do path manipulations" case to work with (almost)
arbitrary bytes, but everything else to work naturally with text (unicode
text).

Which brings us to the "what APIs should accept bytes" question. I think
that's been pretty much answered: All the low-level ones, so that protocol
and library programmers can write code that works on systems with undefined
filename encodings.

But: casual users still need to do the normal things with file names and
paths, and ideally those should work the same way on all systems.

I think the way to do this is to abstract the path concept, like pathlib
does. Back in the day, paths were "just strings", and that worked OK with
py2 strings, because you could put arbitrary bytes in them. But the "py2
strings were perfect" folks seem to not acknowledge that while they are
nice for matching the posix filename model, they were a pain in the neck
when you needed to do something else like write them into a JSON file or
something. From my personal experience, non-ascii filenames are much easier
to deal with if I use unicode for filenames everywhere (py2). Somehow, I
have yet to be bitten by mixed encoding in filenames.

So will using a surrogate-escape error handling with pathlib make all this
just work?



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at

From rosuav at  Fri Aug 22 23:04:20 2014
From: rosuav at (Chris Angelico)
Date: Sat, 23 Aug 2014 07:04:20 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
 <> <>
Message-ID: <>

On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman <v+python at> wrote:
> "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
> utf-8, but it is not both. Maybe you meant "or" instead of "of".

I'd assume "or" was meant there, rather than "of"; it's a common typo.

Not sure why 1251, specifically, but it's not uncommon for boundary
code to attempt a decode that consists of something like "attempt
UTF-8 decode, and if that fails, attempt an eight-bit decode". For my
MUD clients, that's pretty much required; one of the servers I
frequent is completely bytes-oriented, so whatever encoding one client
uses will be dutifully echoed to every other client. There are some
that correctly use UTF-8, but others use whatever they feel like; and
since those naughty clients are mainly on Windows, I can reasonably
guess that they'll be using CP-1252. So that's what I do: UTF-8,
fall-back on 1252. (It's also possible some clients will be using
Latin-1, but 1252 is a superset of that.)

But it's important to note that this is a method of handling junk.
It's not a design intention; this is for a situation where I really
want to cope with any byte stream and attempt to display it as text.
And if I get something that's neither UTF-8 nor CP-1252, I will
display it wrongly, and there's nothing can be done about that.
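
A minimal sketch of that fallback, assuming Python 3. The function name is
made up for illustration, and errors="replace" is my own choice for the
second step, not necessarily what the real client does.

```python
def decode_with_fallback(data: bytes) -> str:
    """Try UTF-8 first; if that fails, fall back to CP-1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves five byte values undefined
        # (0x81, 0x8d, 0x8f, 0x90, 0x9d), so errors="replace" is used
        # here to guarantee that some text always comes back.
        return data.decode("cp1252", errors="replace")
```

Anything that happens to be valid UTF-8 is treated as UTF-8, so this is a
heuristic for junk, not a detector.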


From phd at  Sat Aug 23 00:09:31 2014
From: phd at (Oleg Broytman)
Date: Sat, 23 Aug 2014 00:09:31 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
 <> <>
Message-ID: <>

On Fri, Aug 22, 2014 at 01:17:44PM -0700, Glenn Linderman <v+python at> wrote:
> >in cp1251 of utf-8 encoding
> "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or
> it is utf-8, but it is not both. Maybe you meant "or" instead of
> "of".

   But of course!

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From phd at  Sat Aug 23 00:21:18 2014
From: phd at (Oleg Broytman)
Date: Sat, 23 Aug 2014 00:21:18 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker <chris.barker at> wrote:
> Back in the day, paths were "just strings", and that worked OK with
> py2 strings, because you could put arbitrary bytes in them. But the "py2
> strings were perfect" folks seem to not acknowledge that while they are
> nice for matching the posix filename model, they were a pain in the neck
> when you needed to do something else like write them into a JSON file or
> something.

   This is the core of the problem. Python2 favors Unix model but
Windows people pay the price. Python3 reverses that and I'm still
thinking if I want to pay the new price.

> So will using a surrogate-escape error handling with pathlib make all this
> just work?

   I'm involved in developing and maintaining a few big commercial
projects that will hardly be ported to Python3. So I'm stuck with
Python2 for many years and I haven't tried Python3. Maybe I should try
a small personal project, but certainly not this year. Maybe the next
one...

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From phd at  Sat Aug 23 00:26:37 2014
From: phd at (Oleg Broytman)
Date: Sat, 23 Aug 2014 00:26:37 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
 <> <>
Message-ID: <>

On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico <rosuav at> wrote:
> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman <v+python at> wrote:
> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
> > utf-8, but it is not both. Maybe you meant "or" instead of "of".
> I'd assume "or" meant there, rather than "of", it's a common typo.
> Not sure why 1251, specifically

   This is the encoding of Russian Windows. Files and emails in Russia
are mostly in cp1251 encoding; something like 60-70%, I think. The
second popular encoding is cp866 (Russian DOS); it's used by Windows as
OEM encoding.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From rosuav at  Sat Aug 23 00:28:09 2014
From: rosuav at (Chris Angelico)
Date: Sat, 23 Aug 2014 08:28:09 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
 <> <>
Message-ID: <>

On Sat, Aug 23, 2014 at 8:26 AM, Oleg Broytman <phd at> wrote:
> On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico <rosuav at> wrote:
>> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman <v+python at> wrote:
>> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
>> > utf-8, but it is not both. Maybe you meant "or" instead of "of".
>> I'd assume "or" meant there, rather than "of", it's a common typo.
>> Not sure why 1251, specifically
>    This is the encoding of Russian Windows. Files and emails in Russia
> are mostly in cp1251 encoding; something like 60-70%, I think. The
> second popular encoding is cp866 (Russian DOS); it's used by Windows as
> OEM encoding.

Yeah, that makes sense. In any case, you pick one "most likely" 8-bit
encoding and go with it.


From rdmurray at  Sat Aug 23 04:20:55 2014
From: rdmurray at (R. David Murray)
Date: Fri, 22 Aug 2014 22:20:55 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On Sat, 23 Aug 2014 00:21:18 +0200, Oleg Broytman <phd at> wrote:
>    I'm involved in developing and maintaining a few big commercial
> projects that will hardly be ported to Python3. So I'm stuck with
> Python2 for many years and I haven't tried Python3. Maybe I should try
> a small personal project, but certainly not this year. Maybe the next
> one...

Yes, you should try it.  Really, it's not the monster you are
constructing in your mind.  The functions that read filenames and return
them as text use surrogate escape to preserve the bytes, and the
functions that accept filenames use surrogate escape to recover those
bytes before passing them back to the OS.  So posix binary filenames
just work, as long as the only thing you depend on is being able to
split and join them on the / character (and possibly the . character)
and otherwise treat the names as black boxes...which is exactly the same
situation you are in in python2.

If you need to read filenames out of a file, you'll need to specify the
surrogate escape error handler so that the bytes will be there to be
recovered when you pass them to the file system functions, but it will

Or, as discussed, you can treat them as binary and use the os level
functions that accept binary input (which are exactly the ones you are
used to using in python2).  This includes os.path.split and
os.path.join, which as noted are the only things you can depend on
working correctly when you don't know the encoding of the filenames.
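
A small sketch of the bytes route, assuming Python 3 on a POSIX system. The
path is hypothetical and is never touched on disk; posixpath is used so the
split/join part behaves the same on any platform.

```python
import os
import posixpath  # the POSIX flavour of os.path, importable everywhere

# Hypothetical file name containing bytes that are not valid UTF-8.
raw_path = b"/tmp/ZZ\xdb\xdf\xfa\xff.txt"

# Splitting and joining work on bytes as long as every argument is bytes.
directory, name = posixpath.split(raw_path)
rejoined = posixpath.join(directory, name)

# Text view of the same name: os.fsdecode()/os.fsencode() apply the
# filesystem encoding; on POSIX they use the surrogateescape handler,
# so the bytes survive the trip through str.
as_text = os.fsdecode(raw_path)
round_tripped = os.fsencode(as_text)
```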

So, the way to look at this is that python3 is no worse[1] than python2 for
handling posix binary filenames, and also provides additional features
if you *do* know the correct encoding of the filenames.


[1] modulo any remaining API bugs, which is exactly where this thread
started: trying to figure out which APIs need to be able to handle
binary paths and/or surrogate escaped paths so that posix filenames
consistently work as well in python3 as they did in python2).

From stephen at  Sat Aug 23 10:02:25 2014
From: stephen at (Stephen J. Turnbull)
Date: Sat, 23 Aug 2014 17:02:25 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
Message-ID: <>

Chris Barker writes:

 > > The third is to specify the UTF-8 with the surrogate escape error
 > > handler.  This allows non-UTF-8 codes to be loaded into
 > > memory.

Read as bytes and incrementally decode.  If you hit an Exception,
retry from that point.

 > Just so I'm clear here -- if you write that back out, encoded as
 > utf-8 -- you'll get the exact same binary blob out as came in?

If and only if there are no changes to the content.

 > I wonder if this would make it hard to preserve byte boundaries,
 > though.

I'm not sure what you mean by "byte boundaries".  If you mean
after concatenation of such objects, yes, the uninterpretable bytes
will be encoded in such a way as to be identifiable as lone bytes;
they won't be interpreted as Unicode characters.

 > By the way, IIUC correctly, you can also use the python latin-1
 > decoder -- anything latin-1 will come through correctly, anything
 > not valid latin-1 will come in as garbage, but if you re-encode
 > with latin-1 the original bytes will be preserved. I think this
 > will also preserve a 1:1 relationship between character count and
 > byte count, which could be handy.

Bad idea, especially for Oleg's use case -- you can't decode those by
codec without reencoding to bytes first.  No point in abandoning
codecs just because there isn't one designed for his use case exactly.
Just read as bytes and decode piecewise in one way or another.  For
Oleg's HTML case, there's a well-understood structure that can be used
to determine retry points and a very few plausible coding systems,
which can be fairly well distinguished by the range of bytes used and
probably nearly perfectly with additional information from the
structure and distribution of apparently decoded characters.
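
A rough sketch of "decode piecewise and retry", assuming Python 3. This is
a simplification: it uses the position reported by UnicodeDecodeError as the
retry point instead of the document structure, and the function name and the
choice of cp1251 as the fallback are only illustrative.

```python
def decode_piecewise(data: bytes, primary: str = "utf-8",
                     fallback: str = "cp1251") -> str:
    """Decode with `primary`; decode each undecodable run with `fallback`."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode(primary))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end delimit the offending bytes,
            # relative to the slice that was being decoded.
            bad_start, bad_end = pos + exc.start, pos + exc.end
            pieces.append(data[pos:bad_start].decode(primary))
            pieces.append(data[bad_start:bad_end].decode(fallback,
                                                         errors="replace"))
            pos = bad_end  # retry from just past the bad run
    return "".join(pieces)


# Hypothetical mixed input: UTF-8 text followed by a cp1251 word.
mixed = "привет ".encode("utf-8") + "мир".encode("cp1251")
decoded = decode_piecewise(mixed)
```

Fallback bytes that happen to form valid UTF-8 will still be misread, which
is why real code would want the structural hints described above.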

From stephen at  Sat Aug 23 10:20:40 2014
From: stephen at (Stephen J. Turnbull)
Date: Sat, 23 Aug 2014 17:20:40 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
 <> <>
Message-ID: <>

Chris Angelico writes:

 > Not sure why 1251,

All of those codes have repertoires that are Cyrillic supersets,
presumably Russian-language content, based on Oleg's top domain.

 > But it's important to note that this is a method of handling junk.
 > It's not a design intention; this is for a situation where I really
 > want to cope with any byte stream and attempt to display it as text.
 > And if I get something that's neither UTF-8 nor CP-1252, I will
 > display it wrongly, and there's nothing can be done about that.

Of course there is.  It just gets more heuristic the more numerous the
potential encodings are.

From stephen at  Sat Aug 23 11:02:06 2014
From: stephen at (Stephen J. Turnbull)
Date: Sat, 23 Aug 2014 18:02:06 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

Chris Barker writes:

 > So I write bytes that are encoded one way into a text file that's encoded
 > another way, and expect to be able to read that later?

No, not you.  Crap software does that.  Your MUD server.  Oleg's
favorite web pages with ads, or more likely the ad servers.

 > Not for me (or many other users) -- terminals are sometimes set
 > with ascii-only encoding,

So?  That means you can't handle text files in general, only those
restricted to ASCII.  That's a completely different issue.

 > Python3 supports this case very well. But it does indeed make it
 > hard to work with filenames when you don't know the encoding they
 > are in.

No, it doesn't.  Reasonably handling "text streams" in unknown,
possibly multiple, encodings is just hard.  Python 3 has nothing to do
with it, and Oleg should know that very well.

It's true that code written in Python 2 to handle these issues needs
to be ported to Python 3.  Thing is, Oleg says "another tool" -- any
non-Python-2 tool will need porting of his code too.

 > And apparently that's pretty common -- or common enough that it
 > would be nice for Python to support it well. The trick is how --
 > we'd like the "just pass it around and do path manipulations" case
 > to work with (almost) arbitrary bytes,

It does.  That's what os.path is for.

 > but everything else to work naturally with text (unicode text).

No gloss, please.  It's text, period.  The internal Unicode encoding
is *not exposed*, with a few (important) exceptions such as Han

 > I think the way to do this is to abstract the path concept, like pathlib
 > does.

You forgot to append the word "well".<wink/>

 > From my personal experience, non-ascii filenames are much easier to
 > deal with if I use unicode for filenames everywhere (py2). Somehow,
 > I have yet to be bitten by mixed encoding in filenames.

.gov domain?  ASCII-only terminal settings?  It's not "somehow", it's
that you live a sheltered life.<wink/>

 > So will using a surrogate-escape error handling with pathlib make
 > all this just work?

Not answerable until you define "all this" more precisely.

And that's the big problem with Oleg's complaint, too.  It's not at
all clear what he wants, except that all of his current code should
continue to work in Python 3.  Just like all of us.  The question then
is persuading him that it's worth moving to Python 3 despite the
effort of porting Python-2-specific code.  Maybe he can be persuaded,
maybe not.  Python 2 is a better than average language.

From marko at  Sat Aug 23 10:21:57 2014
From: marko at (Marko Rauhamaa)
Date: Sat, 23 Aug 2014 11:21:57 +0300
Subject: [Python-Dev] Bytes path support
In-Reply-To: <> (Stephen J. Turnbull's
 message of "Sat, 23 Aug 2014 17:02:25 +0900")
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
Message-ID: <>

"Stephen J. Turnbull" <stephen at>:

> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points

HTML and XML are interesting examples since their encoding is initially
unknown:

  <?xml version="1.0"?>
                      ^
                      +--- Now I know it is UTF-8

  <?xml version="1.0" encoding="UTF-16"?>
                                      ^
                                      +--- Now I know it was UTF-16
                                           all along!

Then we have:

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=ISO-8859-1

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you realize the
content encoding is UTF-16.
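
To make the chicken-and-egg problem concrete, here is a deliberately
simplified sniffing sketch in Python 3. It only handles a byte order mark or
an ASCII-compatible XML declaration; the function name is made up, and it is
not the full detection procedure from the XML specification.

```python
import codecs
import re

# Matches an XML declaration whose bytes are ASCII-compatible.
_XML_DECL = re.compile(
    rb"<\?xml[^>]*?encoding=[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding from the first bytes of an XML document."""
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"  # the default when nothing is declared
```

Note that the declaration can only be read this way if the real encoding is
ASCII-compatible up to that point, which is exactly the oddity in the second
example.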


From rosuav at  Sat Aug 23 11:32:57 2014
From: rosuav at (Chris Angelico)
Date: Sat, 23 Aug 2014 19:32:57 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On Sat, Aug 23, 2014 at 7:02 PM, Stephen J. Turnbull <stephen at> wrote:
> Chris Barker writes:
>  > So I write bytes that are encoded one way into a text file that's encoded
>  > another way, and expect to be able to read that later?
> No, not you.  Crap software does that.  Your MUD server.  Oleg's
> favorite web pages with ads, or more likely the ad servers.

Just to clarify: Presumably you're referring to my previous post
regarding my MUD client's heuristic handling of broken encodings. It's
"my server" in the sense of the one that I'm connecting to, and not in
the sense that I control it. I do also run a MUD server, and it
guarantees that everything it sends is UTF-8. (Incidentally, that
server has the exact same set of heuristics for coping with broken
encodings from other clients. There's no escaping it.) Your point is
absolutely right: mess like that is to cope with the fact that there's
broken stuff out there.


From marko at  Sat Aug 23 11:46:34 2014
From: marko at (Marko Rauhamaa)
Date: Sat, 23 Aug 2014 12:46:34 +0300
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
 (Isaac Morland's message of "Sat, 23 Aug 2014 05:27:54 -0400 (EDT)")
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
Message-ID: <>

Isaac Morland <ijmorlan at>:

>>  HTTP/1.1 200 OK
>>  Content-Type: text/html; charset=ISO-8859-1
>>  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
>>  <html>
>>  <head>
>>  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
> For HTML it's not quite so bad.  According to the HTML 4 standard:
> [...]
> The Content-Type header takes precedence over a <meta> element. I
> thought I read once that the reason was to allow proxy servers to
> transcode documents but I don't have a cite for that. Also, the <meta>
> element "must only be used when the character encoding is organized
> such that ASCII-valued bytes stand for ASCII characters" so the
> initial UTF-16 example wouldn't be conformant in HTML.

That's not how I read it:

   The META declaration must only be used when the character encoding is
   organized such that ASCII characters stand for themselves (at least
   until the META element is parsed). META declarations should appear as
   early as possible in the HEAD element.


IOW, you must obey the HTTP character encoding until you have parsed a
conflicting META content-type declaration.

The author of the standard keeps a straight face and continues:

   For cases where neither the HTTP protocol nor the META element
   provides information about the character encoding of a document, HTML
   also provides the charset attribute on several elements. By combining
   these mechanisms, an author can greatly improve the chances that,
   when the user retrieves a resource, the user agent will recognize the
   character encoding.


From stephen at  Sat Aug 23 12:14:47 2014
From: stephen at (Stephen J. Turnbull)
Date: Sat, 23 Aug 2014 19:14:47 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

Oleg Broytman writes:

 >    This is the core of the problem. Python2 favors Unix model but
 > Windows people pay the price. Python3 reverses that

This is certainly not true.  What is true is that Python 3 makes no
attempt to make it easy to write crappy software in the old Unix
style, that breaks when unexpected character encodings are encountered.
Python 3 is designed to make it easier to write reliable software,
even if it will only ever be used on one platform.  Nevertheless, it's
still a reasonable language for writing byte-shoveling software, with
the last piece in place as of the acceptance of PEP 461.

As of that PEP, you can use regexps for tokenizing byte streams and
%-formatting to conveniently produce them.  If you want to treat them
piecewise as character streams with different encodings, you have a
large library of codecs, which provide an incremental decoder
interface.  While AFAIK no codec implements a decode-until-error mode,
that's not all that much of a loss, as many encodings overlap.  Eg, if
you start decoding using a latin-1 codec, decoding the whole document
will succeed, even if it switches to windows-1251 in the meantime.
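
A short sketch of those pieces, assuming Python 3.5 or later (bytes
%-formatting arrived with PEP 461). The header line and sample strings are
made up for illustration.

```python
import codecs
import re

# Tokenize a byte stream with a bytes regex (pattern and input are bytes).
header = b"Content-Type: text/html; charset=koi8-r\r\n"
match = re.search(rb"charset=([A-Za-z0-9_-]+)", header)
charset = match.group(1)

# PEP 461: %-formatting produces bytes directly.
line = b"Content-Type: text/plain; charset=%s\r\n" % charset

# Incremental decoder: one character split across two reads.
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [b"\xd0", b"\x9f"]
text = "".join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b"", final=True)

# latin-1 never raises, even on data that is really cp1251;
# the result is mojibake, but the bytes are recoverable.
cyrillic = "Привет".encode("cp1251")
never_fails = cyrillic.decode("latin-1")
```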

Oleg, I gather Russian is your native language.  That's moderately
complicated, I admit.  But the Russians are a distant second to the
Japanese in self-destructive proliferation of incompatible character
coding standards and non-standard variants.  After 24 years of dealing
with the mess that is East Asian encodings (which is even bound up
with the "religion" of Japanese exceptionalism -- some Japanese have
argued that there is a spiritual superiority to Japanese JIS codes!),
I cannot believe you are going to find a better environment for
dealing with these issues than Python 3.

From steve at  Sat Aug 23 13:08:29 2014
From: steve at (Steven D'Aprano)
Date: Sat, 23 Aug 2014 21:08:29 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <20140823110828.GY25957@ando>

On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker wrote:

> The point is that if you are reading a file name from the system, and then
> passing it back to the system, then you can treat it as just bytes -- who
> cares? And if you add the byte value of 47 thing, then you can even do
> basic path manipulations. But once you want to do other things with your
> file name, then you need to know the encoding. And it is very, very common
> for users to need to do other things with filenames, and they almost always
> want them as text that they can read and understand.
> Python3 supports this case very well. But it does indeed make it hard to
> work with filenames when you don't know the encoding they are in.

Just "not knowing" is not sufficient. In that case, you'll likely get a 
Unicode string containing mojibake:

# I write a file name using UTF-8 on my system:
filename = 'music by ????.txt'.encode('utf-8')
# You try to use it assuming ISO-8859-7 (Greek)
filename.decode('iso-8859-7')
=> 'music by ?\x9d??????.txt'

which, even though it looks wrong, still lets you refer to the file 
(provided you then encode back to bytes with ISO-8859-7 again). This 
won't always be the case, sometimes the encoding you guess will be 

When I started this email, I originally began to say that the actual 
problem was with byte file names that cannot be decoded into Unicode 
using the system encoding (typically UTF-8 on Linux systems). But I've 
actually had difficulty demonstrating that it actually is a problem. I 
started with a byte sequence which is invalid UTF-8, namely:

b'ZZ\xdb\xdf\xfa\xff'

created a file with that name, and then tried listing it with 
os.listdir. Even in Python 3.1 it worked fine. I was able to list the 
directory and open the file, so I'm not entirely sure where the problem 
lies exactly. Can somebody demonstrate the failure mode?


From rdmurray at  Sat Aug 23 15:41:22 2014
From: rdmurray at (R. David Murray)
Date: Sat, 23 Aug 2014 09:41:22 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <20140823110828.GY25957@ando>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On Sat, 23 Aug 2014 21:08:29 +1000, Steven D'Aprano <steve at> wrote:
> When I started this email, I originally began to say that the actual 
> problem was with byte file names that cannot be decoded into Unicode 
> using the system encoding (typically UTF-8 on Linux systems). But I've 
> actually had difficulty demonstrating that it actually is a problem. I 
> started with a byte sequence which is invalid UTF-8, namely:
> b'ZZ\xdb\xdf\xfa\xff'
> created a file with that name, and then tried listing it with 
> os.listdir. Even in Python 3.1 it worked fine. I was able to list the 
> directory and open the file, so I'm not entirely sure where the problem 
> lies exactly. Can somebody demonstrate the failure mode?

The "failure" happens only when you try to cross from the domain of
posix binary filenames into the domain of text streams (that is, a
stream with a consistent encoding).  If you stick with os interfaces
that handle filenames, Python3 handles posix bytes filenames just fine
(though there may be a few corner-case rough edges yet to be fixed, and
the standard streams were one of them).

The difficulty comes if you try to use a filename that contains
undecodable bytes in a non-os-interface text-context (such as writing it
to a text file that you have declared to be a utf-8 encoding): there you
will get an error...not completely unlike the old "your code works until
your user uses unicode" problem we had in python2, but in this case only
happening in a very narrow set of circumstances involving trying to
translate between one domain (posix binary filenames) and another domain
(io streams with a consistent declared encoding).  This is not a common
operation, but appears to be the one Oleg is concerned about. The old
unicode-blowup errors would happen almost any time someone with a
non-ascii language tried to use a program written by an ascii-only
programmer (which was most of us).

The same problem existed in python2 if your goal was to produce a stream
with a consistent encoding, but now python3 treats that as an error.  If
you really want a stream with an inconsistent encoding, open it as
binary and use the surrogate escape error handler to recover the bytes
in the filenames.  That is, *be explicit* about your intentions.
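
A minimal sketch of that domain crossing, assuming Python 3. In-memory
streams stand in for real files, and the name is the invalid-UTF-8 example
from earlier in the thread.

```python
import io

# A file name with undecodable bytes, as os.listdir() would return it on a
# UTF-8 POSIX system (lone surrogates stand in for the raw bytes).
name = b"ZZ\xdb\xdf\xfa\xff".decode("utf-8", errors="surrogateescape")

# Crossing into a stream with a declared encoding fails loudly.
text_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
try:
    text_stream.write(name)
    text_stream.flush()
    strict_write_failed = False
except UnicodeEncodeError:
    strict_write_failed = True

# Being explicit: use a binary stream and recover the original bytes
# with the surrogateescape handler.
binary_stream = io.BytesIO()
binary_stream.write(name.encode("utf-8", errors="surrogateescape") + b"\n")
```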

So yes, we've shifted a burden from those who want non-ascii text to
work consistently to those who wanted inconsistently encoded text to "just
work" (or rather *appear* to "just work").  The number of people who
benefit from the improved text model *greatly* outweighs the number of
people inconvenienced by the new strictness when the domain line (posix
binary filenames to consistently encoded text stream) is crossed.  And
the result is more *valid* programs, and fewer unexpected errors
overall, with no inconvenience unless that domain line is crossed,
and even then the inconvenience is limited to the open call that creates
the binary stream.


From phd at  Sat Aug 23 17:15:52 2014
From: phd at (Oleg Broytman)
Date: Sat, 23 Aug 2014 17:15:52 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" <stephen at> wrote:
> And that's the big problem with Oleg's complaint, too.  It's not at
> all clear what he wants

   The first thing is I want to understand why people continue to refer
to the Unix way as "broken". Better yet, to persuade them it's not.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From phd at  Sat Aug 23 17:16:39 2014
From: phd at (Oleg Broytman)
Date: Sat, 23 Aug 2014 17:16:39 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On Sat, Aug 23, 2014 at 07:14:47PM +0900, "Stephen J. Turnbull" <stephen at> wrote:
> I cannot believe you are going to find a better environment for
> dealing with these issues than Python 3.

   Well, that may be.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From ijmorlan at  Sat Aug 23 11:27:54 2014
From: ijmorlan at (Isaac Morland)
Date: Sat, 23 Aug 2014 05:27:54 -0400 (EDT)
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <20140822151911.GS25957@ando> <>
 <> <>
Message-ID: <>

On Sat, 23 Aug 2014, Marko Rauhamaa wrote:

> "Stephen J. Turnbull" <stephen at>:
>> Just read as bytes and decode piecewise in one way or another. For
>> Oleg's HTML case, there's a well-understood structure that can be used
>> to determine retry points
> HTML and XML are interesting examples since their encoding is initially
> unknown:
>  <?xml version="1.0"?>
>                      ^
>                      +--- Now I know it is UTF-8
>  <?xml version="1.0" encoding="UTF-16"?>
>                                      ^
>                                      +--- Now I know it was UTF-16
>                                           all along!
> Then we have:
>  HTTP/1.1 200 OK
>  Content-Type: text/html; charset=ISO-8859-1
>  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
>  <html>
>  <head>
>  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
> See how deep you have to parse the TCP stream before you realize the
> content encoding is UTF-16.

For HTML it's not quite so bad.  According to the HTML 4 standard:

The Content-Type header takes precedence over a <meta> element.  I thought 
I read once that the reason was to allow proxy servers to transcode 
documents but I don't have a cite for that.  Also, the <meta> element 
"must only be used when the character encoding is organized such that 
ASCII-valued bytes stand for ASCII characters" so the initial UTF-16 
example wouldn't be conformant in HTML.
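
As a rough illustration of that precedence rule, here is a minimal Python
sketch. The function name, the regular expressions and the fallback
default are all assumptions made for the example; this is not a
conforming HTML parser.

```python
import re

def pick_html_encoding(content_type, body, default="windows-1252"):
    """Pick a charset: HTTP header first, then <meta>, then a fallback.

    `default` is an arbitrary placeholder chosen for this sketch.
    """
    # 1. The HTTP Content-Type header wins when it names a charset.
    if content_type:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()
    # 2. Otherwise scan the first bytes for <meta ... charset=...>.
    #    Decoding them as ASCII only works for ASCII-compatible
    #    encodings, which is the restriction HTML 4 puts on <meta>.
    head = body[:1024].decode("ascii", errors="replace")
    m = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    return default
```

With the header and body from the quoted example, the header value is
returned and the conflicting <meta> element is never consulted.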

HTML 5 allows non-ASCII-compatible encodings as long as U+FEFF (byte 
order mark) is used:

Not sure about XML.

Of course this whole area is a bit of an "arms race" between programmers 
competing to get away with being as sloppy as possible and other 
programmers who have to deal with their mess.

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist

From marko at  Sat Aug 23 18:33:06 2014
From: marko at (Marko Rauhamaa)
Date: Sat, 23 Aug 2014 19:33:06 +0300
Subject: [Python-Dev] Bytes path support
In-Reply-To: <> (R. David Murray's
 message of "Sat, 23 Aug 2014 09:41:22 -0400")
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

"R. David Murray" <rdmurray at>:

> The same problem existed in python2 if your goal was to produce a stream
> with a consistent encoding, but now python3 treats that as an error.

I have a different interpretation of the situation: as a rule, use byte
strings in Python3. Text strings are a special corner case for
applications that have to deal with human languages.

If your application has to talk SMTP, use bytes.

If your application has to do IPC, use bytes.

If your application has to do file I/O, use bytes.

If your application is a word processor or an IM client, you have text
strings available. You might find, though, that barely any modern GUI
application is satisfied with crude text strings. You will need weights,
styles, sizes, emoticons, positions, directions, shadows, alignment etc
etc so it may be that Python's text strings are only good enough for
storing individual characters or short snippets.

In sum, Python's text strings might have one sweet spot: Usenet clients.


From p.f.moore at  Sat Aug 23 19:40:37 2014
From: p.f.moore at (Paul Moore)
Date: Sat, 23 Aug 2014 18:40:37 +0100
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On 23 August 2014 16:15, Oleg Broytman <phd at> wrote:
> On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" <stephen at> wrote:
>> And that's the big problem with Oleg's complaint, too.  It's not at
>> all clear what he wants
>    The first thing is I want to understand why people continue to refer
> to Unix was as "broken". Better yet, to persuade them it's not.

Generally, it seems to be mostly a reaction to the repeated claims
that Python, or Windows, or whatever, is "broken". Unix advocates (not
yourself) are prone to declaring anything *other* than the Unix model
as "broken", so it's tempting to give them a taste of their own
medicine. Sorry for that (to the extent that I was one of the people
doing so).

Rhetoric aside, none of Unix, Windows or Python are "broken". They
just react in different ways to fundamentally difficult edge cases.

But expecting Python (a cross-platform language) to prefer the Unix
model is putting all the pain on non-Unix users of Python, which I
don't feel is reasonable. Let's all compromise a little.


PS The key thing *I* think is a problem with the Unix behaviour is
that it treats filenames as bytes rather than Unicode. People name
files using *characters*. So every filename is semantically text, in
the mind of the person who created it. Unix enforces a transformation
to bytes, but does not retain the encoding of those bytes. So
information about the original author's intent is lost. But that's a
historical fact, baked into Unix at a low level. Whether that's
"broken" or just "something to deal with" is not important to me.

From phd at  Sat Aug 23 20:37:29 2014
From: phd at (Oleg Broytman)
Date: Sat, 23 Aug 2014 20:37:29 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>


On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore <p.f.moore at> wrote:
> On 23 August 2014 16:15, Oleg Broytman <phd at> wrote:
> > On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" <stephen at> wrote:
> >> And that's the big problem with Oleg's complaint, too.  It's not at
> >> all clear what he wants
> >
> >    The first thing is I want to understand why people continue to refer
> > to Unix was as "broken". Better yet, to persuade them it's not.

   "Unix was" => "Unix way"

> Generally, it seems to be mostly a reaction to the repeated claims
> that Python, or Windows, or whatever, is "broken".

   Ah, if that's the only problem I certainly can live with that. My
problem is that it *seems* this anti-Unix attitude infiltrates Python
core development. I very much hope I'm wrong and it really isn't.

> Unix advocates (not
> yourself) are prone to declaring anything *other* than the Unix model
> as "broken", so it's tempting to give them a taste of their own
> medicine. Sorry for that (to the extent that I was one of the people
> doing so).

   You didn't see me in my younger years. I surely was one of those
Windows bashers. Please take my apology.

> Rhetoric aside, none of Unix, Windows or Python are "broken". They
> just react in different ways to fundamentally difficult edge cases.
> But expecting Python (a cross-platform language) to prefer the Unix
> model is putting all the pain on non-Unix users of Python, which I
> don't feel is reasonable. Let's all compromise a little.
> Paul
> PS The key thing *I* think is a problem with the Unix behaviour is
> that it treats filenames as bytes rather than Unicode. People name
> files using *characters*. So every filename is semantically text, in
> the mind of the person who created it. Unix enforces a transformation
> to bytes, but does not retain the encoding of those bytes. So
> information about the original author's intent is lost. But that's a
> historical fact, baked into Unix at a low level. Whether that's
> "broken" or just "something to deal with" is not important to me.

   The problem is hardly specific to Unix. Despite Joel Spolsky's "There
Ain't No Such Thing As Plain Text" people create text files all the
time. Without specifying an encoding. And put filenames into those text
files (audio playlists, like .m3u and .pls are just text files with
   Unix takes the idea that everything is text and a stream of bytes to
its extreme.

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From p.f.moore at  Sat Aug 23 22:42:45 2014
From: p.f.moore at (Paul Moore)
Date: Sat, 23 Aug 2014 21:42:45 +0100
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On 23 August 2014 19:37, Oleg Broytman <phd at> wrote:
> Unix takes the idea that everything is text and a stream of bytes to
> its extreme.

I don't really understand the idea of "text and a stream of bytes".
The two are fundamentally different in my view. But I guess that's why
we have to agree to differ - our perspectives are just very different.


From greg.ewing at  Sun Aug 24 03:11:10 2014
From: greg.ewing at (Greg Ewing)
Date: Sun, 24 Aug 2014 13:11:10 +1200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
 <> <>
Message-ID: <>

Isaac Morland wrote:
> In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF 
> (byte order mark) is used:
> Not sure about XML.

According to Appendix F here:

an XML parser needs to be prepared to try all the encodings it
supports until it finds one that works well enough to decode
the XML declaration, then it can find out the exact encoding


From ncoghlan at  Sun Aug 24 05:27:55 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 24 Aug 2014 13:27:55 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On 24 August 2014 04:37, Oleg Broytman <phd at> wrote:
> On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore <p.f.moore at> wrote:
>> Generally, it seems to be mostly a reaction to the repeated claims
>> that Python, or Windows, or whatever, is "broken".
>    Ah, if that's the only problem I certainly can live with that. My
> problem is that it *seems* this anti-Unix attitude infiltrates Python
> core development. I very much hope I'm wrong and it really isn't.

The POSIX locale based approach to handling encodings is genuinely
broken - it's almost as broken as code pages are on Windows. The
fundamental flaw is that locales encourage *bilingual* computing:
handling English plus one other language correctly. Given a global
internet, bilingual computing *is a fundamentally broken approach*. We
need multilingual computing (any human language, all the time), and
that means Unicode.

As some examples of where bilingual computing breaks down:

* My NFS client and server may have different locale settings
* My FTP client and server may have different locale settings
* My SSH client and server may have different locale settings
* I save a file locally and send it to someone with a different locale setting
* I attempt to access a Windows share from a Linux client (or vice-versa)
* I clone my POSIX hosted git or Mercurial repository on a Windows client
* I have to connect my Linux client to a Windows Active Directory
domain (or vice-versa)
* I have to interoperate between native code and JVM code
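
To make the failure mode concrete, here is a small Python sketch of a
file name written under one legacy encoding and read under another. The
sample name and the choice of KOI8-R and Windows-1251 are assumptions
picked only for illustration.

```python
name = "\u0444\u0430\u0439\u043b.txt"   # a Cyrillic file name

# Bytes as a client configured for KOI8-R would store them.
stored = name.encode("koi8-r")

# A reader that assumes Windows-1251 decodes without any error,
# but gets different characters back.
misread = stored.decode("cp1251")

# A reader that assumes UTF-8 cannot decode the bytes at all.
try:
    stored.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Only a reader that already knows the original encoding recovers the
name; the bytes themselves carry no record of it.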

The entire computing industry is currently struggling with this
monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
encoding/code pages) -> multilingual (Unicode) transition. It's been
going on for decades, and it's still going to be quite some time
before we're done.

The POSIX world is slowly clawing its way towards a multilingual model
that actually works: UTF-8

Windows (including the CLR) and the JVM adopted a different
multilingual model, but still one that actually works: UTF-16-LE

POSIX is hampered by legacy ASCII defaults in various subsystems (most
notably the default locale) and the assumption that system metadata is
"just bytes" (an assumption that breaks down as soon as you have to
hand that metadata over to another machine that may have different
locale settings)

Windows is hampered by the fact they kept the old 8-bit APIs around
for backwards compatibility purposes, so applications using those APIs
are still only bilingual (at best) rather than multilingual.

JVM and CLR applications will at least handle the Basic Multilingual
Plane (UCS-2) correctly, but may not correctly handle code points
beyond the 16-bit boundary (this is the "Python narrow builds don't
handle Unicode correctly" problem that was resolved for Python 3.3+ by
PEP 393)
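
The 16-bit boundary can be seen from Python 3.3+ with a short sketch
(the helper name is made up for this example): a code point outside the
BMP is one character to Python but needs two UTF-16 code units.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral = "\U0001F40D"   # SNAKE, beyond the 16-bit boundary

print(len(bmp), utf16_code_units(bmp))        # 1 1
print(len(astral), utf16_code_units(astral))  # 1 2
```

Code that counts or slices by UTF-16 code unit can split the second
string in the middle of a surrogate pair.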

Individual users (including some organisations) may have the luxury of
saying "well, all my clients and all my servers are POSIX, so I don't
care about interoperability with other platforms". As the providers of
a cross-platform runtime environment, we don't have that luxury - we
need to figure out how to get *all* the major platforms playing nice
with each other, regardless of whether they chose UTF-8 or UTF-16-LE
as the basis for their approach towards providing multilingual
computing environments.

Historically, that question of cross platform interoperability for
open source software has been handled in a few different ways:

* Don't really interoperate with anybody, reinvent all the wheels (the JVM way)
* Emulate POSIX on Windows (the Cygwin/MinGW way)
* Let the application developer figure it out (the Python 2 way)

The first approach is inordinately expensive - it took the resources
of Sun in its heyday to make it possible, and it effectively locks the
JVM out of certain kinds of computing (e.g. it's hard to do array
oriented programming in JVM languages, because the CPU and GPU
vectorisation features aren't readily accessible).

The second approach prevents the creation of truly native Windows
applications, which makes it uncompelling as a way of attracting
Windows users - it sends a clear signal that the project doesn't
*really* care about supporting Windows as a platform, but instead only
grudgingly accepts that there are Windows users out there that might
like to use their software.

The third approach is the one we tried for a long time with Python 2,
and essentially found to be an "experts only" solution. Yes, you can
*make* it work, but the runtime isn't set up so it works *by default*.

The Unicode changes in Python 3 are a result of the Python core
development team saying "it really shouldn't be this hard for
application developers to get cross-platform interoperability between
correctly configured systems when dealing solely with correctly
encoded data and metadata". The idea of Python 3 is that applications
should require additional complexity solely to deal with *incorrectly*
configured systems and improperly encoded data and metadata (and,
ideally, the detection of the need for such handling should be "Python
3 threw an exception" rather than "something further down the line
detected corrupted data").

This is software rather than magic, though - these improvements only
happen through people actually knuckling down and solving the related
problems. When folks complain about Python 3's operating system
interface handling causing problems in some situations? They're almost
always referring to areas where we're still relying on the locale
system on POSIX or the code page system on Windows. Both of those
approaches are irredeemably broken - the answer is to stop relying on
them, but appropriately updating the affected subsystems generally
isn't a trivial task. A lot of the affected code runs before the
interpreter is fully initialised, which makes it really hard to test,
and a lot of it is incredibly convoluted due to various configuration
options and platform specific details, which makes it incredibly hard
to modify without breaking anything.

One of those areas is the fact that we still use the old 8-bit APIs to
interact with the Windows console. Those are just as broken in a
multilingual world as the other Windows 8-bit APIs, so Drekin came up
with a project to expose the Windows console as a UTF-16-LE stream
that uses the 16-bit APIs instead:

I personally hope we'll be able to get the issues Drekin references
there resolved for Python 3.5 - if other folks hope for the same
thing, then one of the best ways to help that happen is to try out the
win_unicode_console module and provide feedback on what does and
doesn't work.

Another was getting exceptions attempting to write OS data to
sys.stdout when the locale settings had been scrubbed from the
environment. For Python 3.5, we better tolerate that situation by
setting "errors=surrogateescape" on sys.stdout when the environment
claims "ascii" as a suitable encoding for talking to the operating
system (this is our way of saying "we don't actually believe you, but
also don't have the data we need to overrule you completely").
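
For readers who have not met the error handler before, a minimal sketch
of what "surrogateescape" does, using an in-memory stream in place of
sys.stdout (the sample bytes are arbitrary):

```python
import io

raw = b"caf\xc3\xa9 \xff"   # UTF-8 text followed by a stray 0xFF byte

# Strict ASCII decoding would fail here; surrogateescape instead maps
# each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("ascii", errors="surrogateescape") == raw

# A text stream configured with the same encoding and error handler
# passes the escaped bytes through instead of raising an exception.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="ascii",
                          errors="surrogateescape")
stream.write(text)
stream.flush()
```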

While I was going to wait for more feedback from Fedora folks before
pushing the idea again, this thread also makes me think it would be
worth our while to add more tools for dealing with surrogate escapes
and latin-1 binary data smuggling just to help make those techniques
more discoverable and accessible:

These various discussions are also giving me plenty of motivation to
get back to working on PEP 432 (the rewrite of the interpreter startup
sequence) for Python 3.5. A lot of these things are just plain hard to
change because of the complexity of the current startup code.
Redesigning that to use a cleaner, multiphase startup sequence that
gets the core interpreter running *before* configuring the operating
system integration should give us several more options when it comes
to dealing with some of these challenges.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From guido at  Sun Aug 24 06:17:34 2014
From: guido at (Guido van Rossum)
Date: Sat, 23 Aug 2014 21:17:34 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <>
Message-ID: <>

I declare this thread irreparably broken. Do not make any decisions in this
thread. Tell me (in another thread) when it's time to decide and I will.

On Sat, Aug 23, 2014 at 8:27 PM, Nick Coghlan <ncoghlan at> wrote:

> On 24 August 2014 04:37, Oleg Broytman <phd at> wrote:
> > On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore <p.f.moore at> wrote:
> >> Generally, it seems to be mostly a reaction to the repeated claims
> >> that Python, or Windows, or whatever, is "broken".
> >
> >    Ah, if that's the only problem I certainly can live with that. My
> > problem is that it *seems* this anti-Unix attitude infiltrates Python
> > core development. I very much hope I'm wrong and it really isn't.
> The POSIX locale based approach to handling encodings is genuinely
> broken - it's almost as broken as code pages are on Windows. The
> fundamental flaw is that locales encourage *bilingual* computing:
> handling English plus one other language correctly. Given a global
> internet, bilingual computing *is a fundamentally broken approach*. We
> need multilingual computing (any human language, all the time), and
> that means Unicode.
> As some examples of where bilingual computing breaks down:
> * My NFS client and server may have different locale settings
> * My FTP client and server may have different locale settings
> * My SSH client and server may have different locale settings
> * I save a file locally and send it to someone with a different locale
> setting
> * I attempt to access a Windows share from a Linux client (or vice-versa)
> * I clone my POSIX hosted git or Mercurial repository on a Windows client
> * I have to connect my Linux client to a Windows Active Directory
> domain (or vice-versa)
> * I have to interoperate between native code and JVM code
> The entire computing industry is currently struggling with this
> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
> encoding/code pages) -> multilingual (Unicode) transition. It's been
> going on for decades, and it's still going to be quite some time
> before we're done.
> The POSIX world is slowly clawing its way towards a multilingual model
> that actually works: UTF-8
> Windows (including the CLR) and the JVM adopted a different
> multilingual model, but still one that actually works: UTF-16-LE
> POSIX is hampered by legacy ASCII defaults in various subsystems (most
> notably the default locale) and the assumption that system metadata is
> "just bytes" (an assumption that breaks down as soon as you have to
> hand that metadata over to another machine that may have different
> locale settings)
> Windows is hampered by the fact they kept the old 8-bit APIs around
> for backwards compatibility purposes, so applications using those APIs
> are still only bilingual (at best) rather than multilingual.
> JVM and CLR applications will at least handle the Basic Multilingual
> Plane (UCS-2) correctly, but may not correctly handle code points
> beyond the 16-bit boundary (this is the "Python narrow builds don't
> handle Unicode correctly" problem that was resolved for Python 3.3+ by
> PEP 393)
> Individual users (including some organisations) may have the luxury of
> saying "well, all my clients and all my servers are POSIX, so I don't
> care about interoperability with other platforms". As the providers of
> a cross-platform runtime environment, we don't have that luxury - we
> need to figure out how to get *all* the major platforms playing nice
> with each other, regardless of whether they chose UTF-8 or UTF-16-LE
> as the basis for their approach towards providing multilingual
> computing environments.
> Historically, that question of cross platform interoperability for
> open source software has been handled in a few different ways:
> * Don't really interoperate with anybody, reinvent all the wheels (the JVM
> way)
> * Emulate POSIX on Windows (the Cygwin/MinGW way)
> * Let the application developer figure it out (the Python 2 way)
> The first approach is inordinately expensive - it took the resources
> of Sun in its heyday to make it possible, and it effectively locks the
> JVM out of certain kinds of computing (e.g. it's hard to do array
> oriented programming in JVM languages, because the CPU and GPU
> vectorisation features aren't readily accessible).
> The second approach prevents the creation of truly native Windows
> applications, which makes it uncompelling as a way of attracting
> Windows users - it sends a clear signal that the project doesn't
> *really* care about supporting Windows as a platform, but instead only
> grudgingly accepts that there are Windows users out there that might
> like to use their software.
> The third approach is the one we tried for a long time with Python 2,
> and essentially found to be an "experts only" solution. Yes, you can
> *make* it work, but the runtime isn't set up so it works *by default*.
> The Unicode changes in Python 3 are a result of the Python core
> development team saying "it really shouldn't be this hard for
> application developers to get cross-platform interoperability between
> correctly configured systems when dealing solely with correctly
> encoded data and metadata". The idea of Python 3 is that applications
> should require additional complexity solely to deal with *incorrectly*
> configured systems and improperly encoded data and metadata (and,
> ideally, the detection of the need for such handling should be "Python
> 3 threw an exception" rather than "something further down the line
> detected corrupted data").
> This is software rather than magic, though - these improvements only
> happen through people actually knuckling down and solving the related
> problems. When folks complain about Python 3's operating system
> interface handling causing problems in some situations? They're almost
> always referring to areas where we're still relying on the locale
> system on POSIX or the code page system on Windows. Both of those
> approaches are irredeemably broken - the answer is to stop relying on
> them, but appropriately updating the affected subsystems generally
> isn't a trivial task. A lot of the affected code runs before the
> interpreter is fully initialised, which makes it really hard to test,
> and a lot of it is incredibly convoluted due to various configuration
> options and platform specific details, which makes it incredibly hard
> to modify without breaking anything.
> One of those areas is the fact that we still use the old 8-bit APIs to
> interact with the Windows console. Those are just as broken in a
> multilingual world as the other Windows 8-bit APIs, so Drekin came up
> with a project to expose the Windows console as a UTF-16-LE stream
> that uses the 16-bit APIs instead:
> I personally hope we'll be able to get the issues Drekin references
> there resolved for Python 3.5 - if other folks hope for the same
> thing, then one of the best ways to help that happen is to try out the
> win_unicode_console module and provide feedback on what does and
> doesn't work.
> Another was getting exceptions attempting to write OS data to
> sys.stdout when the locale settings had been scrubbed from the
> environment. For Python 3.5, we better tolerate that situation by
> setting "errors=surrogateescape" on sys.stdout when the environment
> claims "ascii" as a suitable encoding for talking to the operating
> system (this is our way of saying "we don't actually believe you, but
> also don't have the data we need to overrule you completely").
> While I was going to wait for more feedback from Fedora folks before
> pushing the idea again, this thread also makes me think it would be
> worth our while to add more tools for dealing with surrogate escapes
> and latin-1 binary data smuggling just to help make those techniques
> more discoverable and accessible:
> These various discussions are also giving me plenty of motivation to
> get back to working on PEP 432 (the rewrite of the interpreter startup
> sequence) for Python 3.5. A lot of these things are just plain hard to
> change because of the complexity of the current startup code.
> Redesigning that to use a cleaner, multiphase startup sequence that
> gets the core interpreter running *before* configuring the operating
> system integration should give us several more options when it comes
> to dealing with some of these challenges.
> Regards,
> Nick.
> --
> Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

--Guido van Rossum (

From ncoghlan at  Sun Aug 24 06:44:36 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 24 Aug 2014 14:44:36 +1000
Subject: [Python-Dev] Bytes path related questions for Guido
Message-ID: <>

At Guido's request, splitting out two specific questions from Serhiy's
thread where I believe we could do with an explicit "yes or no" from
him.

1. Should we accept patches adding support for the direct use of bytes
paths in lower level filesystem manipulation APIs? (i.e. everything
that isn't pathlib)

This was Serhiy's original question (due to some open issues [1,2]). I
think the answer is yes, as we already do in some cases, and the
"pathlib doesn't support binary paths" design decision is a high level
platform independent API vs low level potentially platform dependent
API one rather than being about disallowing the use of bytes paths in
general.
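
For context, this is roughly what the existing bytes support in the
low-level APIs looks like today. A small sketch using os.listdir, with a
temporary directory and file name invented for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "data.txt"), "w"):
        pass

    # A str path in gives str names out.
    str_names = os.listdir(tmp)

    # A bytes path in gives the names back as bytes.
    bytes_names = os.listdir(os.fsencode(tmp))

# os.fsdecode converts a bytes name to str using the filesystem
# encoding, escaping anything it cannot decode.
decoded = [os.fsdecode(n) for n in bytes_names]
```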


2. Should we add some additional helpers to the string module for
dealing with surrogate escaped bytes and other techniques for
smuggling arbitrary binary data as text?

My proposal [3] is to add:

* string.escaped_surrogates (constant with the 128 escaped code points)
* string.clean(s): replaces surrogates with '\ufffd' or another
specified code point
* string.redecode(s, encoding): encodes a string back to bytes and
then decodes it again using the specified encoding (the old encoding
defaults to 'latin-1' to match the assumptions in WSGI)

"s != string.clean(s)" would then serve as a check for "does this
string contain any surrogate escaped bytes?"
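
To make the proposal easier to discuss, here is a rough pure-Python
sketch of the three helpers. These are hypothetical stand-ins and not
functions that exist in the string module; in particular, the use of
surrogateescape inside redecode() is an assumption of this sketch.

```python
# The 128 code points that surrogateescape can produce: U+DC80..U+DCFF.
escaped_surrogates = "".join(chr(cp) for cp in range(0xDC80, 0xDD00))

def clean(s, replacement="\ufffd"):
    """Replace escaped surrogates with U+FFFD or another code point."""
    return "".join(
        replacement if "\udc80" <= ch <= "\udcff" else ch for ch in s
    )

def redecode(s, encoding, old_encoding="latin-1"):
    """Encode s back to bytes, then decode with a different encoding."""
    raw = s.encode(old_encoding, errors="surrogateescape")
    return raw.decode(encoding, errors="surrogateescape")

def has_escaped_bytes(s):
    """The check described above: does s contain escaped bytes?"""
    return s != clean(s)
```

For example, redecode("caf\u00c3\u00a9", "utf-8") turns UTF-8 data that
was decoded as latin-1 back into the intended four-character string.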



Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From ncoghlan at  Sun Aug 24 15:04:31 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 24 Aug 2014 23:04:31 +1000
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <>

On 24 August 2014 14:44, Nick Coghlan <ncoghlan at> wrote:
> 2. Should we add some additional helpers to the string module for
> dealing with surrogate escaped bytes and other techniques for
> smuggling arbitrary binary data as text?
> My proposal [3] is to add:
> * string.escaped_surrogates (constant with the 128 escaped code points)
> * string.clean(s): replaces surrogates with '\ufffd' or another
> specified code point
> * string.redecode(s, encoding): encodes a string back to bytes and
> then decodes it again using the specified encoding (the old encoding
> defaults to 'latin-1' to match the assumptions in WSGI)

Serhiy & Ezio convinced me to scale this one back to a proposal for
"codecs.clean_surrogate_escapes(s)", which replaces surrogates that
may be produced by surrogateescape (that's what string.clean() above
was supposed to be, but my description was not correct, and the name
was too vague for that error to be obvious to the reader)

"s != codecs.clean_surrogate_escapes(s)" would then become the check
for "does this string contain any surrogate escaped bytes?"


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From antoine at  Sun Aug 24 16:23:52 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 24 Aug 2014 10:23:52 -0400
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <ltcsho$dn6$>

On 24/08/2014 09:04, Nick Coghlan wrote:
> On 24 August 2014 14:44, Nick Coghlan <ncoghlan at> wrote:
>> 2. Should we add some additional helpers to the string module for
>> dealing with surrogate escaped bytes and other techniques for
>> smuggling arbitrary binary data as text?
>> My proposal [3] is to add:
>> * string.escaped_surrogates (constant with the 128 escaped code points)
>> * string.clean(s): replaces surrogates with '\ufffd' or another
>> specified code point
>> * string.redecode(s, encoding): encodes a string back to bytes and
>> then decodes it again using the specified encoding (the old encoding
>> defaults to 'latin-1' to match the assumptions in WSGI)
> Serhiy & Ezio convinced me to scale this one back to a proposal for
> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
> may be produced by surrogateescape (that's what string.clean() above
> was supposed to be, but my description was not correct, and the name
> was too vague for that error to be obvious to the reader)

"clean" conveys the wrong meaning. It should use a scary word such as 
"trap". "Cleaning" surrogates is unlikely to be the right procedure when 
dealing with surrogates produced by undecodable byte sequences.



From ncoghlan at  Sun Aug 24 17:26:43 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 25 Aug 2014 01:26:43 +1000
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <ltcsho$dn6$>
References: <>
Message-ID: <>

On 25 August 2014 00:23, Antoine Pitrou <antoine at> wrote:
> On 24/08/2014 09:04, Nick Coghlan wrote:
>> Serhiy & Ezio convinced me to scale this one back to a proposal for
>> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
>> may be produced by surrogateescape (that's what string.clean() above
>> was supposed to be, but my description was not correct, and the name
>> was too vague for that error to be obvious to the reader)
> "clean" conveys the wrong meaning. It should use a scary word such as
> "trap". "Cleaning" surrogates is unlikely to be the right procedure when
> dealing with surrogates produced by undecodable byte sequences.

"purge_surrogate_escapes" was the other term that occurred to me.

Either way, my use case is to filter them out when I *don't* want to
pass them along to other software, but would prefer the Unicode
replacement character to the ASCII question mark created by using the
"replace" filter when encoding.

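A short sketch of the difference between the two options, with an
invented file name; the inline purge below is only a stand-in for
whatever the final helper ends up being called:

```python
# A name containing one undecodable byte, as surrogateescape yields it.
name = b"report-\xff.txt".decode("utf-8", errors="surrogateescape")

# The "replace" handler substitutes an ASCII question mark on encoding.
with_replace = name.encode("utf-8", errors="replace")

# Swapping escaped surrogates for U+FFFD first keeps the marker as a
# real Unicode character that encodes normally.
purged = "".join(
    "\ufffd" if "\udc80" <= ch <= "\udcff" else ch for ch in name
)
with_fffd = purged.encode("utf-8")
```

Here with_replace is b"report-?.txt", while with_fffd carries the three
UTF-8 bytes of U+FFFD in place of the bad byte.
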

Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From guido at  Sun Aug 24 19:55:25 2014
From: guido at (Guido van Rossum)
Date: Sun, 24 Aug 2014 10:55:25 -0700
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <>

Yes on #1 -- making the low-level functions more usable for edge cases by
supporting bytes seems fine (as long as the support for strings, where it
exists, is not compromised).

The status of pathlib is a little unclear to me -- is there a plan to
eventually support bytes or not?

For #2 I think you should probably just work with the others you have

On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan <ncoghlan at> wrote:

> At Guido's request, splitting out two specific questions from Serhiy's
> thread where I believe we could do with an explicit "yes or no" from
> him.
> 1. Should we accept patches adding support for the direct use of bytes
> paths in lower level filesystem manipulation APIs? (i.e. everything
> that isn't pathlib)
> This was Serhiy's original question (due to some open issues [1,2]). I
> think the answer is yes, as we already do in some cases, and the
> "pathlib doesn't support binary paths" design decision is a high level
> platform independent API vs low level potentially platform dependent
> API one rather than being about disallowing the use of bytes paths in
> general.
> [1]
> [2]
> 2. Should we add some additional helpers to the string module for
> dealing with surrogate escaped bytes and other techniques for
> smuggling arbitrary binary data as text?
> My proposal [3] is to add:
> * string.escaped_surrogates (constant with the 128 escaped code points)
> * string.clean(s): replaces surrogates with '\ufffd' or another
> specified code point
> * string.redecode(s, encoding): encodes a string back to bytes and
> then decodes it again using the specified encoding (the old encoding
> defaults to 'latin-1' to match the assumptions in WSGI)
> "s != string.clean(s)" would then serve as a check for "does this
> string contain any surrogate escaped bytes?"
> [3]
> Regards,
> Nick.
> --
> Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at
> Unsubscribe:

--Guido van Rossum (

From ncoghlan at  Mon Aug 25 01:19:19 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 25 Aug 2014 09:19:19 +1000
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <>

On 25 Aug 2014 03:55, "Guido van Rossum" <guido at> wrote:
> Yes on #1 -- making the low-level functions more usable for edge cases by
> supporting bytes seems fine (as long as the support for strings, where it
> exists, is not compromised).


> The status of pathlib is a little unclear to me -- is there a plan to
> eventually support bytes or not?

It's text only and Antoine plans to keep it that way - the concatenation
operations, etc, are really only safe if you decode first.

> For #2 I think you should probably just work with the others you have

Yes, that sounds like a good idea. There's been some good progress on the
issue tracker, so I think we can thrash out some workable (and
comprehensible!) utilities that will be useful in their own right while
also serving as aids to understanding for the underlying mechanisms.
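
As a rough illustration of the kind of utilities under discussion, the
proposal quoted below could be sketched as follows. The names, signatures
and module-level layout here are assumptions made for the sketch, not a
settled API; nothing like this exists in the string module today.

```python
# The 128 code points that surrogateescape produces for bytes 0x80..0xFF.
ESCAPED_SURROGATES = "".join(chr(cp) for cp in range(0xDC80, 0xDD00))
_ESCAPED_SET = frozenset(ESCAPED_SURROGATES)


def has_surrogate_escapes(text):
    """Return True if text contains any surrogate-escaped byte."""
    return any(ch in _ESCAPED_SET for ch in text)


def redecode(text, encoding, old_encoding="latin-1"):
    """Encode text back to bytes with old_encoding, then decode the
    bytes again with the encoding that was actually intended.

    The latin-1 default mirrors the WSGI convention of passing raw
    bytes through as latin-1 text.
    """
    return text.encode(old_encoding).decode(encoding)
```

For example, UTF-8 data that was decoded as latin-1 shows up as "cafÃ©";
redecode("cafÃ©", "utf-8") recovers "café".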


> On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan <ncoghlan at> wrote:
>> At Guido's request, splitting out two specific questions from Serhiy's
>> thread where I believe we could do with an explicit "yes or no" from
>> him.
>> 1. Should we accept patches adding support for the direct use of bytes
>> paths in lower level filesystem manipulation APIs? (i.e. everything
>> that isn't pathlib)
>> This was Serhiy's original question (due to some open issues [1,2]). I
>> think the answer is yes, as we already do in some cases, and the
>> "pathlib doesn't support binary paths" design decision is a high level
>> platform independent API vs low level potentially platform dependent
>> API one rather than being about disallowing the use of bytes paths in
>> general.
>> [1]
>> [2]
>> 2. Should we add some additional helpers to the string module for
>> dealing with surrogate escaped bytes and other techniques for
>> smuggling arbitrary binary data as text?
>> My proposal [3] is to add:
>> * string.escaped_surrogates (constant with the 128 escaped code points)
>> * string.clean(s): replaces surrogates with '\ufffd' or another
>> specified code point
>> * string.redecode(s, encoding): encodes a string back to bytes and
>> then decodes it again using the specified encoding (the old encoding
>> defaults to 'latin-1' to match the assumptions in WSGI)
>> "s != string.clean(s)" would then serve as a check for "does this
>> string contain any surrogate escaped bytes?"
>> [3]
>> Regards,
>> Nick.
>> --
>> Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia
> --
> --Guido van Rossum (

From phd at  Mon Aug 25 12:15:31 2014
From: phd at (Oleg Broytman)
Date: Mon, 25 Aug 2014 12:15:31 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

Hi! Thank you very much, Nick, for the long and detailed explanation!

On Sun, Aug 24, 2014 at 01:27:55PM +1000, Nick Coghlan <ncoghlan at> wrote:
> On 24 August 2014 04:37, Oleg Broytman <phd at> wrote:
> > On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore <p.f.moore at> wrote:
> >> Generally, it seems to be mostly a reaction to the repeated claims
> >> that Python, or Windows, or whatever, is "broken".
> >
> >    Ah, if that's the only problem I certainly can live with that. My
> > problem is that it *seems* this anti-Unix attitude infiltrates Python
> > core development. I very much hope I'm wrong and it really isn't.
> The POSIX locale based approach to handling encodings is genuinely
> broken - it's almost as broken as code pages are on Windows. The
> fundamental flaw is that locales encourage *bilingual* computing:
> handling English plus one other language correctly. Given a global
> internet, bilingual computing *is a fundamentally broken approach*. We
> need multilingual computing (any human language, all the time), and
> that means Unicode.
> As some examples of where bilingual computing breaks down:
> * My NFS client and server may have different locale settings
> * My FTP client and server may have different locale settings
> * My SSH client and server may have different locale settings
> * I save a file locally and send it to someone with a different locale setting
> * I attempt to access a Windows share from a Linux client (or vice-versa)
> * I clone my POSIX hosted git or Mercurial repository on a Windows client
> * I have to connect my Linux client to a Windows Active Directory
> domain (or vice-versa)
> * I have to interoperate between native code and JVM code
> The entire computing industry is currently struggling with this
> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
> encoding/code pages) -> multilingual (Unicode) transition. It's been
> going on for decades, and it's still going to be quite some time
> before we're done.
> The POSIX world is slowly clawing its way towards a multilingual model
> that actually works: UTF-8
> Windows (including the CLR) and the JVM adopted a different
> multilingual model, but still one that actually works: UTF-16-LE
> POSIX is hampered by legacy ASCII defaults in various subsystems (most
> notably the default locale) and the assumption that system metadata is
> "just bytes" (an assumption that breaks down as soon as you have to
> hand that metadata over to another machine that may have different
> locale settings)
> Windows is hampered by the fact they kept the old 8-bit APIs around
> for backwards compatibility purposes, so applications using those APIs
> are still only bilingual (at best) rather than multilingual.
> JVM and CLR applications will at least handle the Basic Multilingual
> Plane (UCS-2) correctly, but may not correctly handle code points
> beyond the 16-bit boundary (this is the "Python narrow builds don't
> handle Unicode correctly" problem that was resolved for Python 3.3+ by
> PEP 393)
> Individual users (including some organisations) may have the luxury of
> saying "well, all my clients and all my servers are POSIX, so I don't
> care about interoperability with other platforms". As the providers of
> a cross-platform runtime environment, we don't have that luxury - we
> need to figure out how to get *all* the major platforms playing nice
> with each other, regardless of whether they chose UTF-8 or UTF-16-LE
> as the basis for their approach towards providing multilingual
> computing environments.
> Historically, that question of cross platform interoperability for
> open source software has been handled in a few different ways:
> * Don't really interoperate with anybody, reinvent all the wheels (the JVM way)
> * Emulate POSIX on Windows (the Cygwin/MinGW way)
> * Let the application developer figure it out (the Python 2 way)
> The first approach is inordinately expensive - it took the resources
> of Sun in its heyday to make it possible, and it effectively locks the
> JVM out of certain kinds of computing (e.g. it's hard to do array
> oriented programming in JVM languages, because the CPU and GPU
> vectorisation features aren't readily accessible).
> The second approach prevents the creation of truly native Windows
> applications, which makes it uncompelling as a way of attracting
> Windows users - it sends a clear signal that the project doesn't
> *really* care about supporting Windows as a platform, but instead only
> grudgingly accepts that there are Windows users out there that might
> like to use their software.
> The third approach is the one we tried for a long time with Python 2,
> and essentially found to be an "experts only" solution. Yes, you can
> *make* it work, but the runtime isn't set up so it works *by default*.
> The Unicode changes in Python 3 are a result of the Python core
> development team saying "it really shouldn't be this hard for
> application developers to get cross-platform interoperability between
> correctly configured systems when dealing solely with correctly
> encoded data and metadata". The idea of Python 3 is that applications
> should require additional complexity solely to deal with *incorrectly*
> configured systems and improperly encoded data and metadata (and,
> ideally, the detection of the need for such handling should be "Python
> 3 threw an exception" rather than "something further down the line
> detected corrupted data").
> This is software rather than magic, though - these improvements only
> happen through people actually knuckling down and solving the related
> problems. When folks complain about Python 3's operating system
> interface handling causing problems in some situations? They're almost
> always referring to areas where we're still relying on the locale
> system on POSIX or the code page system on Windows. Both of those
> approaches are irredeemably broken - the answer is to stop relying on
> them, but appropriately updating the affected subsystems generally
> isn't a trivial task. A lot of the affected code runs before the
> interpreter is fully initialised, which makes it really hard to test,
> and a lot of it is incredibly convoluted due to various configuration
> options and platform specific details, which makes it incredibly hard
> to modify without breaking anything.
> One of those areas is the fact that we still use the old 8-bit APIs to
> interact with the Windows console. Those are just as broken in a
> multilingual world as the other Windows 8-bit APIs, so Drekin came up
> with a project to expose the Windows console as a UTF-16-LE stream
> that uses the 16-bit APIs instead:
> I personally hope we'll be able to get the issues Drekin references
> there resolved for Python 3.5 - if other folks hope for the same
> thing, then one of the best ways to help that happen is to try out the
> win_unicode_console module and provide feedback on what does and
> doesn't work.
> Another was getting exceptions attempting to write OS data to
> sys.stdout when the locale settings had been scrubbed from the
> environment. For Python 3.5, we better tolerate that situation by
> setting "errors=surrogateescape" on sys.stdout when the environment
> claims "ascii" as a suitable encoding for talking to the operating
> system (this is our way of saying "we don't actually believe you, but
> also don't have the data we need to overrule you completely").
> While I was going to wait for more feedback from Fedora folks before
> pushing the idea again, this thread also makes me think it would be
> worth our while to add more tools for dealing with surrogate escapes
> and latin-1 binary data smuggling just to help make those techniques
> more discoverable and accessible:
> These various discussions are also giving me plenty of motivation to
> get back to working on PEP 432 (the rewrite of the interpreter startup
> sequence) for Python 3.5. A lot of these things are just plain hard to
> change because of the complexity of the current startup code.
> Redesigning that to use a cleaner, multiphase startup sequence that
> gets the core interpreter running *before* configuring the operating
> system integration should give us several more options when it comes
> to dealing with some of these challenges.
> Regards,
> Nick.
> -- 
> Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

     Oleg Broytman              phd at
           Programmers don't die, they just GOSUB without RETURN.

From rdmurray at  Mon Aug 25 16:32:22 2014
From: rdmurray at (R. David Murray)
Date: Mon, 25 Aug 2014 10:32:22 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <20140823110828.GY25957@ando> <>
Message-ID: <>

On Sat, 23 Aug 2014 19:33:06 +0300, Marko Rauhamaa <marko at> wrote:
> "R. David Murray" <rdmurray at>:
> > The same problem existed in python2 if your goal was to produce a stream
> > with a consistent encoding, but now python3 treats that as an error.
> I have a different interpretation of the situation: as a rule, use byte
> strings in Python3. Text strings are a special corner case for
> applications that have to deal with human languages.

Clearly, then, you are writing unix (or perhaps posix)-only programs.

Also, as has been discussed in this thread previously, any program that
deals with filenames is dealing with human readable languages, even
if posix itself treats the filenames as bytes.


From ijmorlan at  Mon Aug 25 18:46:46 2014
From: ijmorlan at (Isaac Morland)
Date: Mon, 25 Aug 2014 12:46:46 -0400 (EDT)
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <20140822151911.GS25957@ando> <>
 <> <>
Message-ID: <>

On Sat, 23 Aug 2014, Marko Rauhamaa wrote:

> Isaac Morland <ijmorlan at>:
>>>  HTTP/1.1 200 OK
>>>  Content-Type: text/html; charset=ISO-8859-1
>>>  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
>>>  <html>
>>>  <head>
>>>  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
>> For HTML it's not quite so bad.  According to the HTML 4 standard:
>> [...]
>> The Content-Type header takes precedence over a <meta> element. I
>> thought I read once that the reason was to allow proxy servers to
>> transcode documents but I don't have a cite for that. Also, the <meta>
>> element "must only be used when the character encoding is organized
>> such that ASCII-valued bytes stand for ASCII characters" so the
>> initial UTF-16 example wouldn't be conformant in HTML.
> That's not how I read it:
>   The META declaration must only be used when the character encoding is
>   organized such that ASCII characters stand for themselves (at least
>   until the META element is parsed). META declarations should appear as
>   early as possible in the HEAD element.
>   <URL:
>   ml#doc-char-set>
> IOW, you must obey the HTTP character encoding until you have parsed a
> conflicting META content-type declaration.

From the same document:

To sum up, conforming user agents must observe the following priorities 
when determining a document's character encoding (from highest priority to 
lowest):

     An HTTP "charset" parameter in a "Content-Type" field.
     A META declaration with "http-equiv" set to "Content-Type" and a value 
     set for "charset".
     The charset attribute set on an element that designates an external 
     resource.

(In the original they are numbered)

This is a priority list - if the Content-Type header gives a charset, it 
takes precedence, and all other sources for the encoding are ignored.  The 
"charset=" on an <img> or similar is only used if it is the only source 
for the encoding.
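
The priority rule above can be sketched as a small function. This is only
an illustration of the ordering: the function name and its parameters are
hypothetical, and it assumes the three candidate values have already been
extracted from the HTTP header, the META element and the linking element.

```python
from typing import Optional


def pick_charset(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    element_charset: Optional[str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the first charset that is present, in HTML 4 priority order."""
    # Highest priority first: HTTP Content-Type, then META, then the
    # charset attribute on the element that linked to the resource.
    for candidate in (http_charset, meta_charset, element_charset):
        if candidate:
            return candidate
    return default
```

Applied to the example quoted earlier, pick_charset("ISO-8859-1", "utf-16",
None) selects ISO-8859-1, so the conflicting META declaration is ignored.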

The "at least until the META element is parsed" bit allows for the use of 
encodings which make use of shifting.  So maybe they start out 
ASCII-compatible, but after a particular shift byte is seen those bytes 
now stand for Japanese Kanji characters until another shift byte is seen. 
This is allowed by the specification, as long as none of the 
non-ASCII-compatible stuff is seen before the META element.

> The author of the standard keeps a straight face and continues:

I like your way of putting this - "straight face" indeed.  The third 
option really is a hack to allow working around nonsensical situations 
(and even the META tag is pretty questionable).  All this complexity 
because people can't be bothered to do things properly.

>   For cases where neither the HTTP protocol nor the META element
>   provides information about the character encoding of a document, HTML
>   also provides the charset attribute on several elements. By combining
>   these mechanisms, an author can greatly improve the chances that,
>   when the user retrieves a resource, the user agent will recognize the
>   character encoding.

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist

From stephen at  Tue Aug 26 04:11:31 2014
From: stephen at (Stephen J. Turnbull)
Date: Tue, 26 Aug 2014 11:11:31 +0900
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <>

Nick Coghlan writes:

 > "purge_surrogate_escapes" was the other term that occurred to me.

"purge" suggests removal, not replacement.  That may be useful too.

neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')

maybe?  (Of course the remove argument is feature creep, so I'm only
about +0.5 myself.  And the name is long, but I can't think of any
better synonyms for "make safe" in English right now).

 > Either way, my use case is to filter them out when I *don't* want to
 > pass them along to other software, but would prefer the Unicode
 > replacement character to the ASCII question mark created by using the
 > "replace" filter when encoding.

I think it would be preferable to be unicodely correct here by
default, since this is a str -> str function.

From stephen at  Tue Aug 26 04:25:19 2014
From: stephen at (Stephen J. Turnbull)
Date: Tue, 26 Aug 2014 11:25:19 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

R. David Murray writes:

 > Also, as has been discussed in this thread previously, any program that
 > deals with filenames is dealing with human readable languages, even
 > if posix itself treats the filenames as bytes.

That's a bit extreme.  I can name two interesting applications
offhand: git's object database and the Coda filesystem's containers.

It's true that for debugging purposes bytestrings representing largish
numbers are readably encoded (in hexadecimal and decimal,
respectively), but they're clearly not "human readable" in the sense
you mean.

Nevertheless, these are the applications that prove your rule.  You
don't need the power of pathlib to conveniently (for the programmer)
and efficiently handle the file structures these programs use.
os.path is plenty.

From rdmurray at  Tue Aug 26 04:41:31 2014
From: rdmurray at (R. David Murray)
Date: Mon, 25 Aug 2014 22:41:31 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <20140823110828.GY25957@ando> <>
Message-ID: <>

On Tue, 26 Aug 2014 11:25:19 +0900, "Stephen J. Turnbull" <stephen at> wrote:
> R. David Murray writes:
>  > Also, as has been discussed in this thread previously, any program that
>  > deals with filenames is dealing with human readable languages, even
>  > if posix itself treats the filenames as bytes.
> That's a bit extreme.  I can name two interesting applications
> offhand: git's object database and the Coda filesystem's containers.

As soon as I hit send I realized there were a few counter examples :)
So, replace "any" with "most".


From stephen at  Tue Aug 26 04:47:24 2014
From: stephen at (Stephen J. Turnbull)
Date: Tue, 26 Aug 2014 11:47:24 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
Message-ID: <>

Isaac Morland writes:

 > I like your way of putting this - "straight face" indeed.  The third 
 > option really is a hack to allow working around nonsensical situations 
 > (and even the META tag is pretty questionable).  All this complexity 
 > because people can't be bothered to do things properly.

At least in Japan and Russia, doing things "properly" in your sense in
heterogeneous distributed systems is really hard, requiring use of
rather fragile encoding detection heuristics that break at the
slightest whiff of encodings that are unusual in the particular
locale, and in Japan requiring equally fragile transcoding programs
that break on vendor charset variations.  The META "charset" attribute
is useful in those contexts, and the "charset" attribute for external
elements may have been useful in the past as well, although I've never
needed it.

I agree that an environment where "charset" attributes on META and
other elements are needed kinda sucks, but the prerequisite for "doing
things properly" is basically Unicode[1], and that just wasn't going
to happen until at least the 1990s.  To make the transition in less
than several decades would have required a degree of monopoly in
software production that I shudder to contemplate.  Even today there
are programmers around the world grumbling about having to deal with
the Unicode coded character set.

[1]  More precisely, a universal coded character set.  TRON code or
MULE code would have done (but yuck!)  ISO 2022 won't do!

From ncoghlan at  Tue Aug 26 09:32:51 2014
From: ncoghlan at (Nick Coghlan)
Date: Tue, 26 Aug 2014 17:32:51 +1000
Subject: [Python-Dev] Fwd: Accepting PEP 440: Version Identification and
	Dependency Specification
In-Reply-To: <>
References: <>
Message-ID: <>

Antoine pointed out that it would still be a good idea to forward
packaging PEP acceptance announcements to python-dev, even when the
actual acceptance happens on distutils-sig.

That makes sense to me, so here's last week's notice of the acceptance
of PEP 440, the implementation independent versioning standard derived
from pkg_resources, PEP 386, and ideas from both Linux distributions
and other open source language communities.


---------- Forwarded message ----------
From: Nick Coghlan <ncoghlan at>
Date: 22 August 2014 22:34
Subject: Accepting PEP 440: Version Identification and Dependency Specification
To: DistUtils mailing list <distutils-sig at>

I just pushed Donald's final round of edits in response to the
feedback on the last PEP 440 thread, and as such I'm happy to announce
that I am accepting PEP 440 as the recommended approach to identifying
versions and specifying dependencies when distributing Python

The PEP is available in the usual place at

It's been a long road to get to an implementation independent
versioning standard that has a feasible migration path from the
current pkg_resources defined de facto standard, and I'd like to thank
a few folks:

* Donald Stufft for his extensive work on PEP 440 itself, especially
the proof of concept integration into pip
* Vinay Sajip for his efforts in validating earlier versions of the PEP
* Tarek Ziadé for starting us down the road to an implementation
independent versioning standard with the initial creation of PEP 386
back in June 2009, more than five years ago!


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From martin at  Tue Aug 26 13:14:23 2014
From: martin at ("Martin v. Löwis")
Date: Tue, 26 Aug 2014 13:14:23 +0200
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <20140822151911.GS25957@ando>
 <> <>
 <> <>
Message-ID: <>

Am 24.08.14 03:11, schrieb Greg Ewing:
> Isaac Morland wrote:
>> In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF
>> (byte order mark) is used:
>> Not sure about XML.
> According to Appendix F here:
> an XML parser needs to be prepared to try all the encodings it
> supports until it finds one that works well enough to decode
> the XML declaration, then it can find out the exact encoding
> used.

That's not what this section says. Instead, it says that
you need to auto-detect UCS-4, UTF-16, UTF-8 from the BOM,
or guess them or EBCDIC from the encoding of '<?'. This should
be enough to actually parse the encoding declaration. Other
non-ASCII-compatible encodings can only be used if declared
in an upper-level protocol (such as HTTP).

The parser is not expected to try out all encodings it supports.
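
A sketch of that detection step, as I understand the (non-normative)
appendix: look at the first four bytes only, first for a BOM and then for
the encoded form of '<?xm'. The function name and the returned labels are
my own for illustration, and the unusual UCS-4 octet orders are left out.

```python
from typing import Optional

# Byte order marks. The 4-byte UCS-4 patterns must be tried before the
# 2-byte UTF-16 ones, because FF FE is a prefix of FF FE 00 00.
_WITH_BOM = [
    (b"\x00\x00\xfe\xff", "UCS-4-BE"),
    (b"\xff\xfe\x00\x00", "UCS-4-LE"),
    (b"\xfe\xff", "UTF-16-BE"),
    (b"\xff\xfe", "UTF-16-LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
]

# No BOM: recognise how the start of '<?xml' looks in each family.
_WITHOUT_BOM = [
    (b"\x00\x00\x00<", "UCS-4-BE"),
    (b"<\x00\x00\x00", "UCS-4-LE"),
    (b"\x00<\x00?", "UTF-16-BE"),
    (b"<\x00?\x00", "UTF-16-LE"),
    (b"<?xm", "ASCII-compatible"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def guess_xml_encoding_family(data: bytes) -> Optional[str]:
    """Guess just enough to read the XML encoding declaration.

    Returns None when nothing matches; what to do then (for example,
    assume UTF-8 or consult a higher-level protocol) is up to the caller.
    """
    for prefix, family in _WITH_BOM + _WITHOUT_BOM:
        if data.startswith(prefix):
            return family
    return None
```

Note that this is a fixed table lookup, not a trial of every codec the
parser supports; the exact encoding still comes from the declaration.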


From python at  Tue Aug 26 13:31:19 2014
From: python at (MRAB)
Date: Tue, 26 Aug 2014 12:31:19 +0100
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <>

On 2014-08-26 03:11, Stephen J. Turnbull wrote:
> Nick Coghlan writes:
>   > "purge_surrogate_escapes" was the other term that occurred to me.
> "purge" suggests removal, not replacement.  That may be useful too.
> neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
How about:

     replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.
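
A minimal sketch of what I have in mind (the name is only the suggestion
above, not an existing function, and the regex-based implementation is just
one way it might be done):

```python
import re

# Lone surrogates U+DC80..U+DCFF, i.e. the code points that the
# "surrogateescape" error handler uses for undecodable bytes.
_SURROGATE_ESCAPES = re.compile("[\udc80-\udcff]")


def replace_surrogate_escapes(s, replacement="\ufffd"):
    """Return s with every surrogate escape replaced by replacement.

    Passing replacement="" removes the escapes entirely.
    """
    # A function is used as the repl argument so that backslashes in
    # replacement are never interpreted as regex group references.
    return _SURROGATE_ESCAPES.sub(lambda match: replacement, s)
```

Other surrogates, such as a stray high surrogate, are deliberately left
alone, since they were not produced by surrogateescape.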

> maybe?  (Of course the remove argument is feature creep, so I'm only
> about +0.5 myself.  And the name is long, but I can't think of any
> better synonyms for "make safe" in English right now).
>   > Either way, my use case is to filter them out when I *don't* want to
>   > pass them along to other software, but would prefer the Unicode
>   > replacement character to the ASCII question mark created by using the
>   > "replace" filter when encoding.
> I think it would be preferable to be unicodely correct here by
> default, since this is a str -> str function.

From rdmurray at  Tue Aug 26 15:11:31 2014
From: rdmurray at (R. David Murray)
Date: Tue, 26 Aug 2014 09:11:31 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <>
Message-ID: <>

On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan <ncoghlan at> wrote:
> As some examples of where bilingual computing breaks down:
> * My NFS client and server may have different locale settings
> * My FTP client and server may have different locale settings
> * My SSH client and server may have different locale settings
> * I save a file locally and send it to someone with a different locale setting
> * I attempt to access a Windows share from a Linux client (or vice-versa)
> * I clone my POSIX hosted git or Mercurial repository on a Windows client
> * I have to connect my Linux client to a Windows Active Directory
> domain (or vice-versa)
> * I have to interoperate between native code and JVM code
> The entire computing industry is currently struggling with this
> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
> encoding/code pages) -> multilingual (Unicode) transition. It's been
> going on for decades, and it's still going to be quite some time
> before we're done.
> The POSIX world is slowly clawing its way towards a multilingual model
> that actually works: UTF-8
> Windows (including the CLR) and the JVM adopted a different
> multilingual model, but still one that actually works: UTF-16-LE

This kind of puts the "length" of the python2->python3 transition
period in perspective, doesn't it?


From p.f.moore at  Tue Aug 26 17:23:30 2014
From: p.f.moore at (Paul Moore)
Date: Tue, 26 Aug 2014 16:23:30 +0100
Subject: [Python-Dev] Windows Unicode console support [Was: Bytes path
Message-ID: <>

On 24 August 2014 04:27, Nick Coghlan <ncoghlan at> wrote:
> One of those areas is the fact that we still use the old 8-bit APIs to
> interact with the Windows console. Those are just as broken in a
> multilingual world as the other Windows 8-bit APIs, so Drekin came up
> with a project to expose the Windows console as a UTF-16-LE stream
> that uses the 16-bit APIs instead:
> I personally hope we'll be able to get the issues Drekin references
> there resolved for Python 3.5 - if other folks hope for the same
> thing, then one of the best ways to help that happen is to try out the
> win_unicode_console module and provide feedback on what does and
> doesn't work.

This looks very cool, and I plan on giving it a try. But I don't see
any issues mentioned there (unless you mean the fact that it's not
possible to hook into Python's interactive interpreter directly, but I
don't see how that could be fixed in an external module). There are no
open issues on the project's GitHub tracker.

I'd love to see this go into 3.5, so any more specific suggestions as
to what would be needed to move it forwards would be great.


From tjreedy at  Tue Aug 26 18:51:02 2014
From: tjreedy at (Terry Reedy)
Date: Tue, 26 Aug 2014 12:51:02 -0400
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <>
Message-ID: <ltidtp$rpb$>

On 8/26/2014 9:11 AM, R. David Murray wrote:
> On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan <ncoghlan at> wrote:
>> As some examples of where bilingual computing breaks down:
>> * My NFS client and server may have different locale settings
>> * My FTP client and server may have different locale settings
>> * My SSH client and server may have different locale settings
>> * I save a file locally and send it to someone with a different locale setting
>> * I attempt to access a Windows share from a Linux client (or vice-versa)
>> * I clone my POSIX hosted git or Mercurial repository on a Windows client
>> * I have to connect my Linux client to a Windows Active Directory
>> domain (or vice-versa)
>> * I have to interoperate between native code and JVM code
>> The entire computing industry is currently struggling with this
>> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
>> encoding/code pages) -> multilingual (Unicode) transition. It's been
>> going on for decades, and it's still going to be quite some time
>> before we're done.
>> The POSIX world is slowly clawing its way towards a multilingual model
>> that actually works: UTF-8
>> Windows (including the CLR) and the JVM adopted a different
>> multilingual model, but still one that actually works: UTF-16-LE

Nick, I think the first half of your post is one of the clearest 
expositions yet of 'why Python 3' (in particular, the str to unicode 
change).  It is worthy of wider distribution and without much change, it 
would be a great blog post.

> This kind of puts the "length" of the python2->python3 transition
> period in perspective, doesn't it?

Terry Jan Reedy

From ncoghlan at  Wed Aug 27 00:52:32 2014
From: ncoghlan at (Nick Coghlan)
Date: Wed, 27 Aug 2014 08:52:32 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <ltidtp$rpb$>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy at> wrote:
> On 8/26/2014 9:11 AM, R. David Murray wrote:
>> On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan <ncoghlan at> wrote:
>>> As some examples of where bilingual computing breaks down:
>>> * My NFS client and server may have different locale settings
>>> * My FTP client and server may have different locale settings
>>> * My SSH client and server may have different locale settings
>>> * I save a file locally and send it to someone with a different locale setting
>>> * I attempt to access a Windows share from a Linux client (or vice-versa)
>>> * I clone my POSIX hosted git or Mercurial repository on a Windows client
>>> * I have to connect my Linux client to a Windows Active Directory
>>> domain (or vice-versa)
>>> * I have to interoperate between native code and JVM code
>>> The entire computing industry is currently struggling with this
>>> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
>>> encoding/code pages) -> multilingual (Unicode) transition. It's been
>>> going on for decades, and it's still going to be quite some time
>>> before we're done.
>>> The POSIX world is slowly clawing its way towards a multilingual model
>>> that actually works: UTF-8
>>> Windows (including the CLR) and the JVM adopted a different
>>> multilingual model, but still one that actually works: UTF-16-LE
> Nick, I think the first half of your post is one of the clearest
> expositions yet of 'why Python 3' (in particular, the str to unicode
> change).  It is worthy of wider distribution and without much change, it
> would be a great blog post.

Indeed, I had the same idea - I had been assuming users already understood
this context, which is almost certainly an invalid assumption.

The blog post version is already mostly written, but I ran out of weekend.
Will hopefully finish it up and post it some time in the next few days :)

>> This kind of puts the "length" of the python2->python3 transition
>> period in perspective, doesn't it?

I realised in writing the post that ASCII is over 50 years old at this
point, while Unicode as an official standard is more than 20. By the time
this is done, we'll likely be talking 30+ years for Unicode to displace the
confusing mess that is code pages and locale encodings :)


> --
> Terry Jan Reedy
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at
> Unsubscribe:

From Nikolaus at  Wed Aug 27 03:39:35 2014
From: Nikolaus at (Nikolaus Rath)
Date: Tue, 26 Aug 2014 18:39:35 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
 (Nick Coghlan's message of "Wed, 27 Aug 2014 08:52:32 +1000")
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

Nick Coghlan <ncoghlan at> writes:
>>>> As some examples of where bilingual computing breaks down:
>>>> * My NFS client and server may have different locale settings
>>>> * My FTP client and server may have different locale settings
>>>> * My SSH client and server may have different locale settings
>>>> * I save a file locally and send it to someone with a different locale
>>>> setting
>>>> * I attempt to access a Windows share from a Linux client (or
>>>> vice-versa)
>>>> * I clone my POSIX hosted git or Mercurial repository on a Windows
>>>> client
>>>> * I have to connect my Linux client to a Windows Active Directory
>>>> domain (or vice-versa)
>>>> * I have to interoperate between native code and JVM code
>>>> The entire computing industry is currently struggling with this
>>>> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
>>>> encoding/code pages) -> multilingual (Unicode) transition. It's been
>>>> going on for decades, and it's still going to be quite some time
>>>> before we're done.
>>>> The POSIX world is slowly clawing its way towards a multilingual model
>>>> that actually works: UTF-8
>>>> Windows (including the CLR) and the JVM adopted a different
>>>> multilingual model, but still one that actually works: UTF-16-LE
>> Nick, I think the first half of your post is one of the clearest
>> expositions yet of 'why Python 3' (in particular, the str to unicode
>> change).  It is worthy of wider distribution and without much change, it
>> would be a great blog post.
> Indeed, I had the same idea - I had been assuming users already understood
> this context, which is almost certainly an invalid assumption.
> The blog post version is already mostly written, but I ran out of weekend.
> Will hopefully finish it up and post it some time in the next few days
> :)

In that case, maybe it'd be nice to also explain why you use the term
"bilingual" for codepage based encoding. At least to me, a
codepage/locale is pretty monolingual, or alternatively covering a whole
region (e.g. western europe). I figure with bilingual you mean ascii +
something, but that's mostly a guess from my side.


GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             "Time flies like an arrow, fruit flies like a Banana."

From stephen at  Wed Aug 27 04:52:46 2014
From: stephen at (Stephen J. Turnbull)
Date: Wed, 27 Aug 2014 11:52:46 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

Nikolaus Rath writes:

 > In that case, maybe it'd be nice to also explain why you use the
 > term "bilingual" for codepage based encoding.

Modern computing systems are written in languages which are invariably
based on syntax expressed using ASCII, and provide by default
functionality for expressing dates etc suitable for rendering American
English.  Thus ASCII (ie, American English) is always an available
language.  Code pages provide facilities for rendering one or more
languages sharing a common coded character set, but are
unsuitable for rendering most of the rest of the world's dozens of
language groups (grouping languages by common character set).

Multilingual has come to mean "able to express (almost) any set of
languages in a single text" (see, for example, Emacs's "HELLO" file),
not just "more than two".  So code pages are closer in spirit to
"bilingual" (two of many) than to "multilingual" (all of many).

It's messy, analogical terminology.  But then, natural language is
messy and analogical.<wink/>
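
As a rough illustration of the "two of many" versus "all of many"
distinction, the sketch below uses Python's built-in codecs. The sample
strings and the choice of cp1251 as the code page are assumptions made
for the example; they do not come from the thread.

```python
# Illustrative only: cp1251 is one arbitrary code page, and the sample
# strings are assumptions chosen for the demonstration.
samples = {
    "English": "hello",
    "Russian": "\u041f\u0440\u0438\u0432\u0435\u0442",   # "Privet" in Cyrillic
    "Japanese": "\u3053\u3093\u306b\u3061\u306f",        # "Konnichiwa" in hiragana
}


def encodable(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# A code page covers ASCII plus one language group ("bilingual") ...
cp1251_support = {name: encodable(text, "cp1251") for name, text in samples.items()}
# ... while a Unicode encoding covers all of them ("multilingual").
utf8_support = {name: encodable(text, "utf-8") for name, text in samples.items()}

# Bytes written under one code page and read under another are silently
# misinterpreted rather than rejected.
mojibake = samples["Russian"].encode("cp1251").decode("latin-1")
```

The last line is the failure mode behind the "different locale
settings" examples quoted earlier in the thread: nothing raises, the
text is just wrong.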

From ncoghlan at  Wed Aug 27 10:09:13 2014
From: ncoghlan at (Nick Coghlan)
Date: Wed, 27 Aug 2014 18:09:13 +1000
Subject: [Python-Dev] Windows Unicode console support [Was: Bytes path
In-Reply-To: <>
References: <>
Message-ID: <>

On 27 August 2014 01:23, Paul Moore <p.f.moore at> wrote:
> On 24 August 2014 04:27, Nick Coghlan <ncoghlan at> wrote:
>> One of those areas is the fact that we still use the old 8-bit APIs to
>> interact with the Windows console. Those are just as broken in a
>> multilingual world as the other Windows 8-bit APIs, so Drekin came up
>> with a project to expose the Windows console as a UTF-16-LE stream
>> that uses the 16-bit APIs instead:
>> I personally hope we'll be able to get the issues Drekin references
>> there resolved for Python 3.5 - if other folks hope for the same
>> thing, then one of the best ways to help that happen is to try out the
>> win_unicode_console module and provide feedback on what does and
>> doesn't work.
> This looks very cool, and I plan on giving it a try. But I don't see
> any issues mentioned there (unless you mean the fact that it's not
> possible to hook into Python's interactive interpreter directly, but I
> don't see how that could be fixed in an external module). There's no
> open issues on the project's github tracker.

There are two links to CPython issues from the project description:

Part of the feedback on those was that as much as possible should be
made available as a third party module before returning to the
question of how to update CPython.

If we can get additional confirmation that the module addresses the
CLI integration issues, then we can take a closer look at switching
CPython itself over.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From p.f.moore at  Wed Aug 27 11:46:53 2014
From: p.f.moore at (Paul Moore)
Date: Wed, 27 Aug 2014 10:46:53 +0100
Subject: [Python-Dev] Windows Unicode console support [Was: Bytes path
In-Reply-To: <>
References: <>
Message-ID: <>

On 27 August 2014 09:09, Nick Coghlan <ncoghlan at> wrote:
> There are two links to CPython issues from the project description:
> Part of the feedback on those was that as much as possible should be
> made available as a third party module before returning to the
> question of how to update CPython.

OK, ta.

The only issues I'm seeing are that it doesn't play well with the
interactive interpreter, which is a known problem but unfortunately
makes it pretty hard for me to do any significant testing (nearly all
of the stuff that I do which prints to the screen is in the REPL, or
in IPython which has its own custom interpreter loop).

If I come up with anything worth commenting on, I will do so (I assume
that comments of the form "+1 me too!" are not needed ;-))


From ncoghlan at  Wed Aug 27 14:16:35 2014
From: ncoghlan at (Nick Coghlan)
Date: Wed, 27 Aug 2014 22:16:35 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On 27 August 2014 08:52, Nick Coghlan <ncoghlan at> wrote:
> On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy at> wrote:
>> Nick, I think the first half of your post is one of the clearest
>> expositions yet of 'why Python 3' (in particular, the str to unicode
>> change).  It is worthy of wider distribution and without much change, it
>> would be a great blog post.
> Indeed, I had the same idea - I had been assuming users already understood
> this context, which is almost certainly an invalid assumption.
> The blog post version is already mostly written, but I ran out of weekend.
> Will hopefully finish it up and post it some time in the next few days :)

Aaand, it's up:


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From ndbecker2 at  Wed Aug 27 14:58:56 2014
From: ndbecker2 at (Neal Becker)
Date: Wed, 27 Aug 2014 08:58:56 -0400
Subject: [Python-Dev] pip enhancement
Message-ID: <ltkkmg$ckh$>

On systems where os-level packaging is available (e.g., fedora linux), it is not 
unusual to want a newer python package installed than available from the vendor.  
pip install --user can be used for this.

But then there is the danger that these pip installed packages are not 

At least, pip should have the ability to alert the user to potential updates,

pip update

could list which packages need updating, and offer to perform the update.  I 
think this would go a long way to helping with this problem.

-- Those who don't understand recursion are doomed to repeat it

From skip at  Wed Aug 27 15:21:24 2014
From: skip at (Skip Montanaro)
Date: Wed, 27 Aug 2014 08:21:24 -0500
Subject: [Python-Dev] pip enhancement
In-Reply-To: <ltkkmg$ckh$>
References: <ltkkmg$ckh$>
Message-ID: <>

On Wed, Aug 27, 2014 at 7:58 AM, Neal Becker <ndbecker2 at> wrote:
> On systems where os-level packaging is available (e.g., fedora linux), it is not
> unusual to want a newer python package installed than available from the vendor.
> pip install --user can be used for this.

How? I have exactly this problem with nose. We actually get it bundled
(currently at ancient 1.1.2, trying to get to 1.3.4) with a bunch of
other open source software from an outside packaging company, and even
though I add the --user flag, it still complains that a version is
already installed. When I add the --upgrade flag it tries to uninstall
the global version.


From p.f.moore at  Wed Aug 27 15:24:42 2014
From: p.f.moore at (Paul Moore)
Date: Wed, 27 Aug 2014 14:24:42 +0100
Subject: [Python-Dev] pip enhancement
In-Reply-To: <ltkkmg$ckh$>
References: <ltkkmg$ckh$>
Message-ID: <>

On 27 August 2014 13:58, Neal Becker <ndbecker2 at> wrote:
> At least, pip should have the ability to alert the user to potential updates,
> pip update
> could list which packages need updating, and offer to perform the update.  I
> think this would go a long way to helping with this problem.

Do you mean something like "pip list --outdated"?

From skip at  Wed Aug 27 15:46:01 2014
From: skip at (Skip Montanaro)
Date: Wed, 27 Aug 2014 08:46:01 -0500
Subject: [Python-Dev] pip enhancement
In-Reply-To: <>
References: <ltkkmg$ckh$>
Message-ID: <>

On Wed, Aug 27, 2014 at 8:24 AM, Paul Moore <p.f.moore at> wrote:
> Do you mean something like "pip list --outdated"?

I was unaware of that command, as we were stuck at pip 1.2.1. I just
updated pip manually to 1.5.6. That is a very helpful command. It
would be even better if it understood --user so it could restrict its
view to user-installed stuff.

Also, given that packages can be found in multiple places on a system, for me:

* the OpenSuSE system packages
* TWW-provided system-wide packages
* our own system-wide packages in /opt/local
* my private stuff in ~/.local

it would be great if there was a way for it to tell me where on my
system it found outdated package X. The --verbose flag tells me all
sorts of other stuff I'm not really interested in, but not the
installed location of the outdated package.
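
For what it's worth, the "where is it installed" half of this can be
sketched outside pip. The snippet below is only an illustration, not a
pip feature: it assumes a Python new enough to ship importlib.metadata
(3.8 or later), and the helper name installed_locations is made up for
the example.

```python
import importlib.metadata
import os
import site


def installed_locations():
    """Map distribution name -> list of (version, location, in_user_site)."""
    user_site = os.path.realpath(site.getusersitepackages())
    found = {}
    for dist in importlib.metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name")
        if not name:
            continue  # skip entries whose metadata carries no name
        # locate_file("") resolves to the directory that holds the
        # distribution, i.e. the site-packages style directory.
        location = os.path.realpath(str(dist.locate_file("")))
        found.setdefault(name, []).append(
            (meta.get("Version"), location, location == user_site)
        )
    return found


if __name__ == "__main__":
    for name, entries in sorted(installed_locations().items()):
        for version, location, in_user_site in entries:
            marker = "user" if in_user_site else "other"
            print(f"{name} {version} [{marker}] {location}")
```

A distribution that shows up more than once in the mapping is installed
in several places on sys.path, which is exactly the situation described
above.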


From p.f.moore at  Wed Aug 27 15:57:57 2014
From: p.f.moore at (Paul Moore)
Date: Wed, 27 Aug 2014 14:57:57 +0100
Subject: [Python-Dev] pip enhancement
In-Reply-To: <>
References: <ltkkmg$ckh$>
Message-ID: <>

On 27 August 2014 14:46, Skip Montanaro <skip at> wrote:
> it would be great if there was a way for it to tell me where on my
> system it found outdated package X. The --verbose flag tells me all
> sorts of other stuff I'm not really interested in, but not the
> installed location of the outdated package.

There are also packaged environments like conda. It would be nice if pip
could distinguish between conda-managed packages and ones I installed

Really, though, this is what the PEP 376 "INSTALLER" file was intended
for. As far as I know, though, it was never implemented (and you'd
also need to persuade the Linux vendors, the conda people, etc, to use
it as well if it were to be of any practical use).
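
As a sketch of how such a file could be consulted where one exists:
PEP 376 places INSTALLER inside the distribution's .dist-info
directory. The code below assumes Python 3.8 or later for
importlib.metadata, the helper names are hypothetical, and it simply
reports None for any distribution whose installer did not record
itself.

```python
import importlib.metadata


def installer_of(dist):
    """Return the contents of a distribution's INSTALLER file, or None."""
    text = dist.read_text("INSTALLER")  # None when the file is absent
    return text.strip() if text else None


def installers():
    """Map distribution name -> recorded installer (None if not recorded)."""
    result = {}
    for dist in importlib.metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            result[name] = installer_of(dist)
    return result


if __name__ == "__main__":
    for name, tool in sorted(installers().items()):
        print(f"{name}: {tool or 'unknown'}")
```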

Agreed about reporting the installed location, though. Specific
suggestions like this would be good things to add to the pip issue


From graffatcolmingov at  Wed Aug 27 16:04:17 2014
From: graffatcolmingov at (Ian Cordasco)
Date: Wed, 27 Aug 2014 09:04:17 -0500
Subject: [Python-Dev] pip enhancement
In-Reply-To: <>
References: <ltkkmg$ckh$>
Message-ID: <>

On Wed, Aug 27, 2014 at 8:24 AM, Paul Moore <p.f.moore at> wrote:
> On 27 August 2014 13:58, Neal Becker <ndbecker2 at> wrote:
>> At least, pip should have the ability to alert the user to potential updates,
>> pip update
>> could list which packages need updating, and offer to perform the update.  I
>> think this would go a long way to helping with this problem.
> Do you mean something like "pip list --outdated"?
> Paul

Also, isn't this discussion better suited for Distutils-SIG?

From skip at  Wed Aug 27 17:36:34 2014
From: skip at (Skip Montanaro)
Date: Wed, 27 Aug 2014 10:36:34 -0500
Subject: [Python-Dev] pip enhancement
In-Reply-To: <>
References: <ltkkmg$ckh$>
Message-ID: <>

On Wed, Aug 27, 2014 at 9:04 AM, Ian Cordasco
<graffatcolmingov at> wrote:
> Also, isn't this discussion better suited for Distutils-SIG?

I started up a thread there. I'd post an archive link, but it hasn't
yet turned up in the distutils-sig archive.


From ndbecker2 at  Wed Aug 27 15:36:13 2014
From: ndbecker2 at (Neal Becker)
Date: Wed, 27 Aug 2014 09:36:13 -0400
Subject: [Python-Dev] pip enhancement
In-Reply-To: <>
References: <ltkkmg$ckh$>
Message-ID: <>

Wow, I didn't know that existed.  Maybe needs to be more obvious.

But not quite.  It doesn't distinguish between locally installed files, and
globally installed.  Here, globally installed are maintained by the OS
vendor packaging, while locally (user, not virtualenv) installed are
managed by pip.

Really what's needed is for pip --user to apply to all pip commands, and
tell pip to ignore the system stuff.

Running pip list --outdated runs a long time, and gives me a very long list
of packages that are outdated, leaving me to still sort through which are
--user (and I might want to update via pip) and which are global (and I
can't really do anything about, other than filing a bug report requesting
an update).

On Wed, Aug 27, 2014 at 9:24 AM, Paul Moore <p.f.moore at> wrote:

> On 27 August 2014 13:58, Neal Becker <ndbecker2 at> wrote:
> > At least, pip should have the ability to alert the user to potential
> updates,
> >
> > pip update
> >
> > could list which packages need updating, and offer to perform the
> update.  I
> > think this would go a long way to helping with this problem.
> Do you mean something like "pip list --outdated"?
> Paul

*Those who don't understand recursion are doomed to repeat it*

From v+python at  Wed Aug 27 20:18:11 2014
From: v+python at (Glenn Linderman)
Date: Wed, 27 Aug 2014 11:18:11 -0700
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
 <> <>
 <> <ltidtp$rpb$>
Message-ID: <>

On 8/27/2014 5:16 AM, Nick Coghlan wrote:
> On 27 August 2014 08:52, Nick Coghlan <ncoghlan at> wrote:
>> On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy at> wrote:
>>> Nick, I think the first half of your post is one of the clearest
>>> expositions yet of 'why Python 3' (in particular, the str to unicode
>>> change).  It is worthy of wider distribution and without much change, it
>>> would be a great blog post.
>> Indeed, I had the same idea - I had been assuming users already understood
>> this context, which is almost certainly an invalid assumption.
>> The blog post version is already mostly written, but I ran out of weekend.
>> Will hopefully finish it up and post it some time in the next few days :)
> Aaand, it's up:
> Cheers,
> Nick.

Indeed, I also enjoyed and found enlightening your response to this 
issue, including the broader historical context. I remember when Unicode 
was first published back in 1991, and it sounded interesting, but far 
removed from the reality of implementations of the day. I was intrigued 
by UTF-8 at the time, and even wrote an encoder and decoder for it for a 
software package that eventually never reached any real customers.

Your blog post says:
> Choosing UTF-8 aims to treat formatting text for communication with 
> the user as "just a display issue". It's a low impact design that will 
> "just work" for a lot of software, but it comes at a price:
>   * because encoding consistency checks are mostly avoided, data in
>     different encodings may be freely concatenated and passed on to
>     other applications. Such data is typically not usable by the
>     receiving application.

I don't believe this is a necessary result of using UTF-8. It is a 
possible result, and I guess some implementations are using it this way, 
but a proper language could still provide and/or require proper usage of 
UTF-8 data through its type system just as Python3 is doing with PEP 
393.  In fact, if it were not for the requirement to support passing 
character strings in other formats (UTF-16, UTF-32) to historical APIs 
(in CPython add-on packages) and the resulting practical performance 
considerations of converting to/from UTF-8 repeatedly when calling those 
APIs, Python3 could have evolved to using UTF-8 as its underlying data 
format, and obtained equal encoding consistency as it has today.

Of course, nothing can be "required" if the user chooses to continue 
operating in the encoded domain, and manipulate data using the necessary 
byte-oriented features of whatever language is in use.

One of the choices of Python3 was to retain character indexing as an 
underlying arithmetic implementation, citing algorithmic speed, but that 
is a seldom needed operation, and of limited general applicability when 
considering grapheme clusters. An iterator based approach can solve both 
problems, but would have been best introduced as part of Python3.0, 
although it may have made 2to3 harder, and may have made it harder to 
implement six and other "run on both Py2 and Py3" type solutions, 
without introducing those same iterative solutions into Python 2.6 or 
2.7.

Such solutions could still be implemented as options. Even PEP 393 
grudgingly supports some use of UTF-8 when requested by the user, as I 
understand it. Whether such an implementation would be better based on 
bytes or str is uncertain without further analysis, although type 
checking would probably be easier if based on str. A high-performance 
implementation would likely need to be implemented at least partly in C 
rather than CPython, although it could be prototyped in Python for proof 
of functionality. The iterators could obviously be implemented to work 
based on top of solutions such as PEP 393, by simply using indexing 
underneath, when fixed-width characters are available, and other 
techniques when UTF-8 is the only available format (rather than 
converting from UTF-8 to fixed-width characters because of calling the 
iterator).

From v+python at  Wed Aug 27 20:21:06 2014
From: v+python at (Glenn Linderman)
Date: Wed, 27 Aug 2014 11:21:06 -0700
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <>

On 8/26/2014 4:31 AM, MRAB wrote:
> On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>> Nick Coghlan writes:
>>   > "purge_surrogate_escapes" was the other term that occurred to me.
>> "purge" suggests removal, not replacement.  That may be useful too.
>> neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
> How about:
>     replace_surrogate_escapes(s, replacement='\uFFFD')
> If you want them removed, just pass an empty string as the replacement.

And further, replacement could be a vector of 128 characters, to do 
immediate transcoding, or a single character to do wholesale replacement 
with some gibberish character, or None to remove (or an empty string).

From ncoghlan at  Thu Aug 28 01:54:31 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 28 Aug 2014 09:54:31 +1000
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

On 28 Aug 2014 04:20, "Glenn Linderman" <v+python at> wrote:
> On 8/27/2014 5:16 AM, Nick Coghlan wrote:
>> On 27 August 2014 08:52, Nick Coghlan <ncoghlan at> wrote:
>>> On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy at> wrote:
>>>> Nick, I think the first half of your post is one of the clearest
>>>> expositions yet of 'why Python 3' (in particular, the str to unicode
>>>> change).  It is worthy of wider distribution and without much change,
>>>> it would be a great blog post.
>>> Indeed, I had the same idea - I had been assuming users already
>>> understood this context, which is almost certainly an invalid assumption.
>>> The blog post version is already mostly written, but I ran out of
>>> weekend. Will hopefully finish it up and post it some time in the next
>>> few days :)
>> Aaand, it's up:
>> Cheers,
>> Nick.
> Indeed, I also enjoyed and found enlightening your response to this
> issue, including the broader historical context. I remember when Unicode
> was first published back in 1991, and it sounded interesting, but far
> removed from the reality of implementations of the day. I was intrigued by
> UTF-8 at the time, and even wrote an encoder and decoder for it for a
> software package that eventually never reached any real customers.
> Your blog post says:
>> Choosing UTF-8 aims to treat formatting text for communication with the
>> user as "just a display issue". It's a low impact design that will "just
>> work" for a lot of software, but it comes at a price:
>> because encoding consistency checks are mostly avoided, data in
>> different encodings may be freely concatenated and passed on to other
>> applications. Such data is typically not usable by the receiving
>> application.
> I don't believe this is a necessary result of using UTF-8. It is a
> possible result, and I guess some implementations are using it this way,
> but a proper language could still provide and/or require proper usage of
> UTF-8 data through its type system just as Python3 is doing with PEP 393.

Yes, Go works that way, for example. I doubt it actually checks for valid
UTF-8 at OS boundaries though - that would be a potentially expensive
check, and as a network service centric language, Go can afford to place
more constraints on the operating environment than we can.

> In fact, if it were not for the requirement to support passing character
> strings in other formats (UTF-16, UTF-32) to historical APIs (in CPython
> add-on packages) and the resulting practical performance considerations of
> converting to/from UTF-8 repeatedly when calling those APIs, Python3 could
> have evolved to using UTF-8 as its underlying data format, and obtained
> equal encoding consistency as it has today.

We already have string processing algorithms that work for fixed width
encodings (and are known not to work for variable width encodings, hence
the bugs in Unicode handling on the old narrow builds).

It isn't that variable width encodings aren't a viable choice for
programming language text modelling, it's that the assumption of a fixed
width model is more deeply entrenched in CPython (and especially the C API)
than the exact number of bits used per code point.
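
A small sketch of what "fixed width, but not a fixed number of bits"
looks like from Python code under PEP 393. It assumes CPython 3.3 or
later; sys.getsizeof is implementation-specific and the exact byte
counts differ between versions, so only their ordering is meaningful.

```python
import sys

N = 1000
ascii_text = "a" * N              # every code point below U+0080
bmp_text = "\u20ac" * N           # EURO SIGN, needs 2 bytes per code point
astral_text = "\U0001F600" * N    # outside the BMP, needs 4 bytes per code point

# len() and indexing count code points whatever the storage width; an old
# narrow build would have reported 2 * N for the astral string.
lengths = [len(s) for s in (ascii_text, bmp_text, astral_text)]

# Each string is stored with one fixed width chosen from its widest code
# point, so the memory footprint grows while indexing stays O(1).
sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]
```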

> Of course, nothing can be "required" if the user chooses to continue
> operating in the encoded domain, and manipulate data using the necessary
> byte-oriented features of whatever language is in use.
> One of the choices of Python3 was to retain character indexing as an
> underlying arithmetic implementation, citing algorithmic speed, but that is
> a seldom needed operation, and of limited general applicability when
> considering grapheme clusters.

The choice that was made was to say no to the question "Do we rewrite a
Unicode type that we already know works from scratch?". The decisions about
how to handle *text* were made way back before the PEP process even
existed, and later captured as PEP 100.

What changed in Python 3 was dropping the hybrid 8-bit str type with its
locale dependent behaviour, and parcelling its responsibilities out to
either the existing unicode type (renamed as str, as it was the default
choice), or the new locale independent bytes type.
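
In practice that split shows up as follows. This is a minimal sketch
for Python 3; the sample data and the latin-1 encoding are assumptions
picked for the example.

```python
raw = "caf\u00e9".encode("latin-1")   # bytes in some locale-style encoding

try:
    label = "menu: " + raw             # mixing text and bytes
except TypeError:
    label = None                       # Python 3 refuses to guess an encoding

# The caller has to name the encoding explicitly to cross the boundary.
text = "menu: " + raw.decode("latin-1")
```

Under the Python 2 str type the concatenation would have succeeded and
produced bytes in an unrecorded encoding.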

> An iterator based approach can solve both problems, but would have been
> best introduced as part of Python3.0, although it may have made 2to3
> harder, and may have made it harder to implement six and other "run on
> both Py2 and Py3" type solutions, without introducing those same
> iterative solutions into Python 2.6 or 2.7.

The option of fundamentally changing the text handling design was never on
the table. The Python 2 unicode type works fine, it is the Python 2 str
type that needed changing.

> Such solutions could still be implemented as options. Even PEP 393
> grudgingly supports some use of UTF-8 when requested by the user, as I
> understand it.

Not quite. PEP 393 heavily favours and optimises UTF-8, trading memory for
speed by implicitly caching the UTF-8 representation. The support isn't
begrudged, it's enthusiastic. We just don't use it for the text processing
algorithms, because those assume a fixed width encoding.

> Whether such an implementation would be better based on bytes or str is
> uncertain without further analysis, although type checking would probably
> be easier if based on str. A high-performance implementation would likely
> need to be implemented at least partly in C rather than CPython, although
> it could be prototyped in Python for proof of functionality. The iterators
> could obviously be implemented to work based on top of solutions such as
> PEP 393, by simply using indexing underneath, when fixed-width characters
> are available, and other techniques when UTF-8 is the only available format
> (rather than converting from UTF-8 to fixed-width characters because of
> calling the iterator).

For the cost of rewriting every single string manipulation algorithm in
CPython to avoid relying on C array access, the only thing you would save
over PEP 393 is a bit of memory - we already store the UTF-8 representation
when appropriate.

There's simply not a sufficient payoff to justify the cost.



From stephen at  Thu Aug 28 03:08:43 2014
From: stephen at (Stephen J. Turnbull)
Date: Thu, 28 Aug 2014 10:08:43 +0900
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <>

Glenn Linderman writes:
 > On 8/26/2014 4:31 AM, MRAB wrote:
 > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
 > >> Nick Coghlan writes:

 > > How about:
 > >
 > >     replace_surrogate_escapes(s, replacement='\uFFFD')
 > >
 > > If you want them removed, just pass an empty string as the
 > > replacement.

That seems better to me (I had too much C for breakfast, I think).
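
For concreteness, here is one way the suggested function could look.
The name and signature are taken from the proposal above; the body is
only a sketch, and no such function exists in the standard library.
It relies on the PEP 383 convention that the surrogateescape error
handler maps undecodable bytes 0x80-0xFF to U+DC80-U+DCFF.

```python
import re

# Code points produced by errors="surrogateescape" for bytes 0x80..0xFF.
_SURROGATE_ESCAPES = re.compile("[\udc80-\udcff]")


def replace_surrogate_escapes(s, replacement="\ufffd"):
    """Replace every surrogate escape in s with the replacement string."""
    # A function is used as the substitution so that backslashes or
    # group references in `replacement` are never interpreted by re.
    return _SURROGATE_ESCAPES.sub(lambda match: replacement, s)


escaped = b"abc\xff".decode("utf-8", errors="surrogateescape")
cleaned = replace_surrogate_escapes(escaped)
removed = replace_surrogate_escapes(escaped, replacement="")
```

Passing an empty string removes the escapes, as suggested, and a
multi-character replacement works the same way.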

 > And further, replacement could be a vector of 128 characters, to do
 > immediate transcoding,

Using what encoding?  If you knew that much, why didn't you use
(write, if necessary) an appropriate codec?  I can't envision this
being useful.

OTOH, I could see using

    replace_surrogate_escapes(s, replacement='&#65533;')

in HTML.  (Actually, probably not; if it makes sense to use Unicode
features you're probably using Unicode as the external encoding, so a
character entity is silly.  But there might be contexts with a useful
multicharacter replacements.)

 > or a single character to do wholesale replacement with some
 > gibberish character, or None to remove (or an empty string).

Not None, that means default (which should be the Unicode standard


From stephen at  Thu Aug 28 04:04:01 2014
From: stephen at (Stephen J. Turnbull)
Date: Thu, 28 Aug 2014 11:04:01 +0900
Subject: [Python-Dev] Bytes path support
In-Reply-To: <>
References: <lt4rmr$a00$>
 <> <5124983344373446869@unknownmsgid>
Message-ID: <>

Glenn Linderman writes:
 > On 8/27/2014 5:16 AM, Nick Coghlan wrote:

 > > Choosing UTF-8 aims to treat formatting text for communication with 
 > > the user as "just a display issue". It's a low impact design that will 
 > > "just work" for a lot of software, but it comes at a price:
 > >
 > >   * because encoding consistency checks are mostly avoided, data in
 > >     different encodings may be freely concatenated and passed on to
 > >     other applications. Such data is typically not usable by the
 > >     receiving application.
 > I don't believe this is a necessary result of using UTF-8.

No, it's not, but if you're going to do the same kind of checks that
are necessary for transcoding UTF-8 to abstract Unicode, there's no
benefit to using UTF-8 internally, and you lose a lot.  The only
operations that you can do efficiently are concatenation and
iteration.  I've worked with a UTF-8-like internal encoding for 20
years now -- it's a huge cost.

 > Python3 could have evolved to using UTF-8 as its underlying data
 > format, and obtained equal encoding consistency as it has today.

Thank heaven it didn't!

 > One of the choices of Python3, was to retain character indexing as an 
 > underlying arithmetic implementation citing algorithmic speed, but that 
 > is a seldom needed operation,

That simply isn't true.  The negative effects of algorithmic slowness
in Emacsen are visible both as annoying user delays, and as excessive
developer concentration on optimizing a fundamentally insufficient
data structure.

 > and of limited general applicability when considering grapheme
 > clusters.  An iterator based approach can solve both problems,

On the contrary, grapheme clusters are the relatively rare use case in
textual computing, at least currently, that can be optimized for when
necessary.  There's no problem with creating iterators from arrays,
but making an iterator behave like an array ... well, that involves
creating the array.
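
A toy sketch of that asymmetry, assuming well-formed UTF-8 input. The
function names are made up for the example, and a real implementation
would live in C rather than in a Python loop.

```python
def iter_code_points(data):
    """Yield each code point of well-formed UTF-8 bytes as a 1-char str."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1
        elif 0xC2 <= lead <= 0xDF:
            width = 2
        elif 0xE0 <= lead <= 0xEF:
            width = 3
        elif 0xF0 <= lead <= 0xF4:
            width = 4
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        # decode() validates the continuation bytes of this one sequence.
        yield data[i:i + width].decode("utf-8")
        i += width


def code_point_at(data, index):
    """Index into UTF-8 bytes: a linear scan, since widths vary."""
    for position, char in enumerate(iter_code_points(data)):
        if position == index:
            return char
    raise IndexError(index)
```

Iteration is cheap, but every call to code_point_at walks the buffer
from the start; making that O(1) means building an index or an array
of code points first.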

 > Such solutions could still be implemented as options.

Sure, but the problems to be solved in that implementation are not due
to Python 3's internal representation.  A lot of painstaking (and
possibly hard?) work remains to be done.

 > A high-performance implementation would likely need to be
 > implemented at least partly in C rather than CPython,

That's how Emacs did it, and (a) over the decades it has involved an
inordinate amount of effort compared to rewriting the text-handling
functions for an array, (b) is fragile, and (c) performance sucks in

Unicode, not UTF-8, is the central component of the solution.  The
various UTFs are application-specific implementations of Unicode.
UTF-8 is an excellent solution for text streams, such as disk files
and network communication.  Fixed-width representations (ISO-8859-1,
UCS-2, UTF-32, PEP-393) are useful for applications of large buffers
that need O(1) "random" access, and can trivially be iterated for
stream applications.
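
To illustrate the access-cost difference described above, here is a
minimal sketch. The helper utf8_code_point_at is hypothetical (written
for this example, not part of any library) and assumes valid UTF-8
input. Indexing a str is a direct array lookup, whereas finding the
i-th code point in UTF-8 bytes means scanning past every preceding
character:

```python
def utf8_code_point_at(data: bytes, index: int) -> str:
    """Return the index-th code point of valid UTF-8 data by scanning from the start."""
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    seen = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new code point begins
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


text = "naïve café"
encoded = text.encode("utf-8")
# Direct indexing on str versus a linear scan over the UTF-8 bytes.
same = all(text[i] == utf8_code_point_at(encoded, i) for i in range(len(text)))
```

Both approaches give the same answers; the difference is only in how
much work each lookup costs.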


From v+python at  Thu Aug 28 06:56:50 2014
From: v+python at (Glenn Linderman)
Date: Wed, 27 Aug 2014 21:56:50 -0700
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>	<>	<ltcsho$dn6$>	<>	<>	<>	<>
Message-ID: <>

On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>   > On 8/26/2014 4:31 AM, MRAB wrote:
>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>   > >> Nick Coghlan writes:
>   > > How about:
>   > >
>   > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>   > >
>   > > If you want them removed, just pass an empty string as the
>   > > replacement.
> That seems better to me (I had too much C for breakfast, I think).
>   > And further, replacement could be a vector of 128 characters, to do
>   > immediate transcoding,
> Using what encoding?

The vector would contain the transcoding. Each lone surrogate would map 
to a character in the vector.

> If you knew that much, why didn't you use
> (write, if necessary) an appropriate codec?  I can't envision this
> being useful.

If the data format describes its encoding, possibly containing data from 
several encodings in various spots, then perhaps it is best read as 
binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes, and then 
when the encoding is determined, to transcode the data. Previously, a 
proposal was made to reverse the surrogate escapes to the original 
bytes, and then apply the (now known) appropriate codec. There are not 
appropriate codecs that can convert directly from surrogate escapes to 
the desired end result. This technique could be used instead, for 
single-byte, non-escaped encodings. On the other hand, writing specialty 
codecs for the purpose would be more general.

From stephen at  Thu Aug 28 08:30:44 2014
From: stephen at (Stephen J. Turnbull)
Date: Thu, 28 Aug 2014 15:30:44 +0900
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
Message-ID: <>

Glenn Linderman writes:
 > On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
 > > Glenn Linderman writes:

 > >   > And further, replacement could be a vector of 128 characters, to do
 > >   > immediate transcoding,
 > >
 > > Using what encoding?
 > The vector would contain the transcoding. Each lone surrogate would map 
 > to a character in the vector.

Yes, that's obvious.  The question is where do you get the vector?

 > > If you knew that much, why didn't you use (write, if necessary)
 > > an appropriate codec?  I can't envision this being useful.
 > If the data format describes its encoding, possibly containing data from 
 > several encodings in various spots, then perhaps it is best read as 
 > binary, and processed as binary until those definitions are found.

Exactly.  That's precisely why bytes have a .decode method.

 > But an alternative would be to read with surrogate escapes, and
 > then when the encoding is determined, to transcode the data.

Not every one-line expression needs to be in the stdlib:

data[start:end] = data[start:end].encode('utf-8', errors='surrogateescape').decode('DTRT-now')

Note that you *do* need to know start and end, because of the
possibility of "several encodings", where once you apply this
technique to the whole text, you can't recover the surrogates when you
get the encoding wrong.
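
A runnable sketch of that decode(encode(slice)) idea, under stated
assumptions: the sample bytes, the slice bounds, and the choice of
Latin-1 as the "now known" encoding are all made up for illustration.
Because str is immutable, the result is built by concatenation rather
than slice assignment:

```python
raw = b"name: caf\xe9\n"  # hypothetical input; the value is Latin-1, not UTF-8
text = raw.decode("utf-8", errors="surrogateescape")

start, end = 6, 10  # assumed to be known from the data format
# Recover the original bytes of the slice, then decode them with the right codec.
original_bytes = text[start:end].encode("utf-8", errors="surrogateescape")
fixed = text[:start] + original_bytes.decode("latin-1") + text[end:]
```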

 > Previously, a proposal was made to reverse the surrogate escapes to
 > the original bytes, and then apply the (now known) appropriate
 > codec.

Sure.  And in fact I do this kind of thing all the time in Emacs,
using the decode(encode(slice)) approach.  The only times in 25 years
of working with the insanity of digitized Japanese I've had a use for
anything other than that is when I don't have a round-tripping codec.
In that case I have to preserve the bytes or suffer lossy conversion
anyway, regardless of the method used to reconvert.

But surrogateescape is necessarily round-tripping (maybe with a few
exceptions in Chinese and a very small number in other languages, but
those failures are due to Unicode, not to surrogateescape).
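
As a small check of the surrogateescape half of that claim (it says
nothing about legacy codecs that are themselves lossy), decoding
arbitrary bytes with surrogateescape and encoding them again returns
the original bytes:

```python
raw = bytes(range(256))  # every byte value; the upper half is invalid as UTF-8 here
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")
# Each undecodable byte 0xNN became the lone surrogate U+DCNN.
escapes = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
```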

 > There are not appropriate codecs that can convert directly from
 > surrogate escapes to the desired end result.

And there currently cannot be.  codecs are bytes<->str, not str->str.

 > This technique could be used instead, for single-byte, non-escaped
 > encodings.

That's pure theory, not a use case.  We have codecs for all the
encodings with significant numbers of users, and writing a new one
simply isn't that hard.


From python at  Thu Aug 28 09:30:39 2014
From: python at (MRAB)
Date: Thu, 28 Aug 2014 08:30:39 +0100
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>	<>	<ltcsho$dn6$>	<>	<>	<>	<>
 <> <>
Message-ID: <>

On 2014-08-28 05:56, Glenn Linderman wrote:
> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>> Glenn Linderman writes:
>>   > On 8/26/2014 4:31 AM, MRAB wrote:
>>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>   > >> Nick Coghlan writes:
>>   > > How about:
>>   > >
>>   > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>>   > >
>>   > > If you want them removed, just pass an empty string as the
>>   > > replacement.
>> That seems better to me (I had too much C for breakfast, I think).
>>   > And further, replacement could be a vector of 128 characters, to do
>>   > immediate transcoding,
>> Using what encoding?
> The vector would contain the transcoding. Each lone surrogate would map
> to a character in the vector.
>> If you knew that much, why didn't you use
>> (write, if necessary) an appropriate codec?  I can't envision this
>> being useful.
> If the data format describes its encoding, possibly containing data from
> several encodings in various spots, then perhaps it is best read as
> binary, and processed as binary until those definitions are found.
> But an alternative would be to read with surrogate escapes, and then
> when the encoding is determined, to transcode the data. Previously, a
> proposal was made to reverse the surrogate escapes to the original
> bytes, and then apply the (now known) appropriate codec. There are not
> appropriate codecs that can convert directly from surrogate escapes to
> the desired end result. This technique could be used instead, for
> single-byte, non-escaped encodings. On the other hand, writing specialty
> codecs for the purpose would be more general.
There'll be a surrogate escape if a byte couldn't be decoded, but just
because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong
too.
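
A short sketch of that point, using made-up sample text: UTF-8 bytes
decoded as Latin-1 never fail, so no surrogate escapes appear, yet the
result is still wrong.

```python
original = "caf\xe9"  # 'café'
utf8_bytes = original.encode("utf-8")
# Latin-1 maps every byte to a code point, so the handler is never invoked.
wrong = utf8_bytes.decode("latin-1", errors="surrogateescape")
has_escapes = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in wrong)
```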

From ncoghlan at  Thu Aug 28 14:26:16 2014
From: ncoghlan at (Nick Coghlan)
Date: Thu, 28 Aug 2014 22:26:16 +1000
Subject: [Python-Dev] Cleaning up surrogate escaped strings (was Bytes path
 related questions for Guido)
In-Reply-To: <>
References: <>
Message-ID: <>

On 26 Aug 2014 21:34, "MRAB" <python at> wrote:
> On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>> Nick Coghlan writes:
>>   > "purge_surrogate_escapes" was the other term that occurred to me.
>> "purge" suggests removal, not replacement.  That may be useful too.
>> neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
> How about:
>     replace_surrogate_escapes(s, replacement='\uFFFD')
> If you want them removed, just pass an empty string as the replacement.

The current proposal on the issue tracker is to instead take advantage of
the existing error handlers:

    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

That code is short, but semantically dense - it took a few iterations to
come up with that version. (Added bonus: once you're alerted to the
possibility, it's trivial to write your own version for existing Python 3
versions. The standard name just makes it easier to look up when you come
across it in a piece of code, and provides the option of optimising it
later if it ever seems worth the extra work)

I also filed a separate RFE to make backslashreplace usable on input, since
that allows the option of separating the replacement operation from the
encoding operation.
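
A runnable sketch of the helper quoted above, tried with two of the
existing error handlers. The sample input is made up for illustration,
and strings containing surrogates outside U+DC80..U+DCFF are not
handled by this approach:

```python
def convert_surrogateescape(data, errors="replace"):
    """Turn surrogate escapes back into bytes, then decode again with a lossy handler."""
    return data.encode("utf-8", "surrogateescape").decode("utf-8", errors)


escaped = b"caf\xe9".decode("utf-8", "surrogateescape")  # one escaped byte at the end
replaced = convert_surrogateescape(escaped)              # escape becomes U+FFFD
dropped = convert_surrogateescape(escaped, "ignore")     # escape is removed
```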


From p.f.moore at  Thu Aug 28 15:22:55 2014
From: p.f.moore at (Paul Moore)
Date: Thu, 28 Aug 2014 14:22:55 +0100
Subject: [Python-Dev] Windows Unicode console support [Was: Bytes path
In-Reply-To: <>
References: <>
Message-ID: <>

On 27 August 2014 10:46, Paul Moore <p.f.moore at> wrote:
> If I come up with anything worth commenting on, I will do so (I assume
> that comments of the form "+1 me too!" are not needed ;-))

Nevertheless, here's a "Me, too". I've just been writing some PyPI
interrogation scripts, and it's absolutely awful having to deal with
random encoding errors in the output. Being able to just print
*anything* is a HUGE benefit. This is how sys.stdout should behave -
presumably the Unix guys are now all rolling their eyes and saying
"but it does - just use a proper OS" :-)

Enlightened-ly y'rs,

From v+python at  Thu Aug 28 19:15:40 2014
From: v+python at (Glenn Linderman)
Date: Thu, 28 Aug 2014 10:15:40 -0700
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>	<>	<ltcsho$dn6$>	<>	<>	<>	<>
 <> <>
Message-ID: <>

On 8/28/2014 12:30 AM, MRAB wrote:
> On 2014-08-28 05:56, Glenn Linderman wrote:
>> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>>> Glenn Linderman writes:
>>>   > On 8/26/2014 4:31 AM, MRAB wrote:
>>>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>>   > >> Nick Coghlan writes:
>>>   > > How about:
>>>   > >
>>>   > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>>>   > >
>>>   > > If you want them removed, just pass an empty string as the
>>>   > > replacement.
>>> That seems better to me (I had too much C for breakfast, I think).
>>>   > And further, replacement could be a vector of 128 characters, to do
>>>   > immediate transcoding,
>>> Using what encoding?
>> The vector would contain the transcoding. Each lone surrogate would map
>> to a character in the vector.
>>> If you knew that much, why didn't you use
>>> (write, if necessary) an appropriate codec?  I can't envision this
>>> being useful.
>> If the data format describes its encoding, possibly containing data from
>> several encodings in various spots, then perhaps it is best read as
>> binary, and processed as binary until those definitions are found.
>> But an alternative would be to read with surrogate escapes, and then
>> when the encoding is determined, to transcode the data. Previously, a
>> proposal was made to reverse the surrogate escapes to the original
>> bytes, and then apply the (now known) appropriate codec. There are not
>> appropriate codecs that can convert directly from surrogate escapes to
>> the desired end result. This technique could be used instead, for
>> single-byte, non-escaped encodings. On the other hand, writing specialty
>> codecs for the purpose would be more general.
> There'll be a surrogate escape if a byte couldn't be decoded, but just
> because a byte could be decoded, it doesn't mean that it's correct.
> If you picked the wrong encoding, the other codepoints could be wrong
> too.

Aha! Thanks for pointing out the flaw in my reasoning. But that means it 
is also pretty useless to "replace_surrogate_escapes" at all, because it 
only cleans out the non-decodable characters, not the incorrectly 
decoded characters.

From rdmurray at  Thu Aug 28 19:41:03 2014
From: rdmurray at (R. David Murray)
Date: Thu, 28 Aug 2014 13:41:03 -0400
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
 <> <>
 <> <>
 <> <>
Message-ID: <>

On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python at> wrote:
> On 8/28/2014 12:30 AM, MRAB wrote:
> > On 2014-08-28 05:56, Glenn Linderman wrote:
> >> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
> >>> Glenn Linderman writes:
> >>>   > On 8/26/2014 4:31 AM, MRAB wrote:
> >>>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
> >>>   > >> Nick Coghlan writes:
> >>>
> >>>   > > How about:
> >>>   > >
> >>>   > >     replace_surrogate_escapes(s, replacement='\uFFFD')
> >>>   > >
> >>>   > > If you want them removed, just pass an empty string as the
> >>>   > > replacement.
> >>>
> >>> That seems better to me (I had too much C for breakfast, I think).
> >>>
> >>>   > And further, replacement could be a vector of 128 characters, to do
> >>>   > immediate transcoding,
> >>>
> >>> Using what encoding?
> >>
> >> The vector would contain the transcoding. Each lone surrogate would map
> >> to a character in the vector.
> >>
> >>> If you knew that much, why didn't you use
> >>> (write, if necessary) an appropriate codec?  I can't envision this
> >>> being useful.
> >>
> >> If the data format describes its encoding, possibly containing data from
> >> several encodings in various spots, then perhaps it is best read as
> >> binary, and processed as binary until those definitions are found.
> >>
> >> But an alternative would be to read with surrogate escapes, and then
> >> when the encoding is determined, to transcode the data. Previously, a
> >> proposal was made to reverse the surrogate escapes to the original
> >> bytes, and then apply the (now known) appropriate codec. There are not
> >> appropriate codecs that can convert directly from surrogate escapes to
> >> the desired end result. This technique could be used instead, for
> >> single-byte, non-escaped encodings. On the other hand, writing specialty
> >> codecs for the purpose would be more general.
> >>
> > There'll be a surrogate escape if a byte couldn't be decoded, but just
> > because a byte could be decoded, it doesn't mean that it's correct.
> >
> > If you picked the wrong encoding, the other codepoints could be wrong
> > too.
> Aha! Thanks for pointing out the flaw in my reasoning. But that means it 
> is also pretty useless to "replace_surrogate_escapes" at all, because it 
> only cleans out the non-decodable characters, not the incorrectly 
> decoded characters.

Well, replace would still be useful for ASCII+surrogateescape.  Also for
cases where the data stream is *supposed* to be in a given encoding, but
contains undecodable bytes.  Showing the stuff that incorrectly decodes
as whatever it decodes to is generally what you want in that case.


From v+python at  Thu Aug 28 19:54:44 2014
From: v+python at (Glenn Linderman)
Date: Thu, 28 Aug 2014 10:54:44 -0700
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
 <> <>
 <> <>
 <> <>
Message-ID: <>

On 8/28/2014 10:41 AM, R. David Murray wrote:
> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python at> wrote:
>> On 8/28/2014 12:30 AM, MRAB wrote:
>>> On 2014-08-28 05:56, Glenn Linderman wrote:
>>>> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>>>>> Glenn Linderman writes:
>>>>>    > On 8/26/2014 4:31 AM, MRAB wrote:
>>>>>    > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>>>>    > >> Nick Coghlan writes:
>>>>>    > > How about:
>>>>>    > >
>>>>>    > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>>>>>    > >
>>>>>    > > If you want them removed, just pass an empty string as the
>>>>>    > > replacement.
>>>>> That seems better to me (I had too much C for breakfast, I think).
>>>>>    > And further, replacement could be a vector of 128 characters, to do
>>>>>    > immediate transcoding,
>>>>> Using what encoding?
>>>> The vector would contain the transcoding. Each lone surrogate would map
>>>> to a character in the vector.
>>>>> If you knew that much, why didn't you use
>>>>> (write, if necessary) an appropriate codec?  I can't envision this
>>>>> being useful.
>>>> If the data format describes its encoding, possibly containing data from
>>>> several encodings in various spots, then perhaps it is best read as
>>>> binary, and processed as binary until those definitions are found.
>>>> But an alternative would be to read with surrogate escapes, and then
>>>> when the encoding is determined, to transcode the data. Previously, a
>>>> proposal was made to reverse the surrogate escapes to the original
>>>> bytes, and then apply the (now known) appropriate codec. There are not
>>>> appropriate codecs that can convert directly from surrogate escapes to
>>>> the desired end result. This technique could be used instead, for
>>>> single-byte, non-escaped encodings. On the other hand, writing specialty
>>>> codecs for the purpose would be more general.
>>> There'll be a surrogate escape if a byte couldn't be decoded, but just
>>> because a byte could be decoded, it doesn't mean that it's correct.
>>> If you picked the wrong encoding, the other codepoints could be wrong
>>> too.
>> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
>> is also pretty useless to "replace_surrogate_escapes" at all, because it
>> only cleans out the non-decodable characters, not the incorrectly
>> decoded characters.
> Well, replace would still be useful for ASCII+surrogateescape.

How?

> Also for
> cases where the data stream is *supposed* to be in a given encoding, but
> contains undecodable bytes.  Showing the stuff that incorrectly decodes
> as whatever it decodes to is generally what you want in that case.
Sure, people can learn to recognize mojibake for what it is, and maybe 
even learn to recognize it for what it was intended to be, in limited 
domains. But suppressing/replacing the surrogates doesn't help with 
that... would it not be better to replace the surrogates with an escape 
sequence that shows the original, undecodable, byte value?  Like  \xNN ?

From rdmurray at  Thu Aug 28 20:43:51 2014
From: rdmurray at (R. David Murray)
Date: Thu, 28 Aug 2014 14:43:51 -0400
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
 <> <>
 <> <>
 <> <>
 <> <>
Message-ID: <>

On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman <v+python at> wrote:
> On 8/28/2014 10:41 AM, R. David Murray wrote:
> > On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python at> wrote:
> >> On 8/28/2014 12:30 AM, MRAB wrote:
> >>> There'll be a surrogate escape if a byte couldn't be decoded, but just
> >>> because a byte could be decoded, it doesn't mean that it's correct.
> >>>
> >>> If you picked the wrong encoding, the other codepoints could be wrong
> >>> too.
> >> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
> >> is also pretty useless to "replace_surrogate_escapes" at all, because it
> >> only cleans out the non-decodable characters, not the incorrectly
> >> decoded characters.
> > Well, replace would still be useful for ASCII+surrogateescape.
> How?

Because there "can't" be any incorrectly decoded bytes in the ASCII part,
so all undecodable bytes turning into 'unrecognized character' glyphs
is useful. "can't" is in quotes because of course if you decode random
binary data as ASCII+surrogate escape you could get a mess just like any
other encoding, so this is really a "more *likely* to be useful" version
of my second point, because "real" ASCII with some junk bytes mixed in
is much more likely to be encountered in the wild than, say, utf-8 with
some junk bytes mixed in (although that is probably changing as use of utf-8
becomes more widespread, so this point applies to utf-8 as well).

> > Also for
> > cases where the data stream is *supposed* to be in a given encoding, but
> > contains undecodable bytes.  Showing the stuff that incorrectly decodes
> > as whatever it decodes to is generally what you want in that case.
> Sure, people can learn to recognize mojibake for what it is, and maybe 
> even learn to recognize it for what it was intended to be, in limited 
> domains. But suppressing/replacing the surrogates doesn't help with 

Well, it does if the alternative is not being able to display the string
to the user at all.  And yeah, people being able to recognize mojibake
in specific problem domains is what I'm talking about...not perhaps a
great use case, but it is a use case.

> that... would it not be better to replace the surrogates with an escape 
> sequence that shows the original, undecodable, byte value?  Like  \xNN ?

Yeah, that idea has been floated as well, and I think it would indeed be
more useful than the 'unknown character' glyph.  I've also seen fonts
that display the hex code inside a box character when the code point is
unknown, which would be cool...but that can hardly be part of unicode,
can it? :)
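
One codec-free way to get the \xNN display discussed above is a plain
str-to-str substitution. This is only a sketch; the function name
show_escaped_bytes is hypothetical and not an existing API:

```python
import re

# Surrogate escapes are the lone surrogates U+DC80..U+DCFF (one per undecodable byte).
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def show_escaped_bytes(text):
    """Replace each surrogate escape with a visible \\xNN for the original byte."""
    return _ESCAPED_BYTE.sub(
        lambda m: "\\x%02x" % (ord(m.group()) - 0xDC00), text
    )


shown = show_escaped_bytes(b"caf\xe9!".decode("ascii", "surrogateescape"))
```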


From stephen at  Fri Aug 29 02:32:58 2014
From: stephen at (Stephen J. Turnbull)
Date: Fri, 29 Aug 2014 09:32:58 +0900
Subject: [Python-Dev] Cleaning up surrogate escaped strings (was Bytes path
 related questions for Guido)
In-Reply-To: <>
References: <>
Message-ID: <>

Nick Coghlan writes:

 > The current proposal on the issue tracker is to instead take advantage of
 > the existing error handlers:
 >     def convert_surrogateescape(data, errors='replace'):
 >         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
 > That code is short, but semantically dense

And it doesn't implement your original suggestion of replacement with
'?' (and another possibility for history buffs is 0x1A, ASCII SUB).  At
least, AFAICT from the docs there's no way to specify the replacement
character; decoding always uses U+FFFD.  (If I knew how to do that, I
would have suggested this.)
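
For what it's worth, one possible way to choose the replacement
character, not proposed in this thread, is a custom error handler
registered with codecs.register_error. The handler name "qmark" and
the function names below are made up for this sketch:

```python
import codecs


def question_mark_errors(exc):
    """Decoding error handler: substitute '?' and resume after the bad bytes."""
    if isinstance(exc, UnicodeDecodeError):
        return ("?", exc.end)
    raise exc


codecs.register_error("qmark", question_mark_errors)


def convert_surrogateescape(data, errors="qmark"):
    return data.encode("utf-8", "surrogateescape").decode("utf-8", errors)


result = convert_surrogateescape("caf\udce9 ok")
```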

 > (Added bonus: once you're alerted to the possibility, it's trivial
 > to write your own version for existing Python 3 versions.

I'm not sure that's true.  At least, to me that code was obvious -- I
got the exact definition (except for the function name) on the first
try -- but I ruled it out because it didn't implement your suggestion
of replacement with '?', even as an option.

OTOH, I think a lot of the resistance to codec-based solutions is the
misconception that en/decoding streams is expensive, or the
misconception that Python's internal representation of text as an
array of code points (rather than an array of "characters" or
"grapheme clusters") is somehow insufficient for text processing.


From stephen at  Fri Aug 29 02:41:03 2014
From: stephen at (Stephen J. Turnbull)
Date: Fri, 29 Aug 2014 09:41:03 +0900
Subject: [Python-Dev] surrogatepass - she's a witch,
	burn 'er!  [was: Cleaning up ...]
In-Reply-To: <>
References: <>
Message-ID: <>

In the process of booking up for my other post in this thread, I
noticed the 'surrogatepass' handler.

Is there a real use case for the 'surrogatepass' error handler?  It
seems like a horrible break in the abstraction.  IMHO, if there's a
need, the application should handle this.  Python shouldn't provide
it on encoding as the resulting streams are not Unicode conformant,
nor on decoding UTF-16, as conversion of surrogate pairs is a
requirement of all Unicode versions since about 1995.


From ncoghlan at  Fri Aug 29 06:55:39 2014
From: ncoghlan at (Nick Coghlan)
Date: Fri, 29 Aug 2014 14:55:39 +1000
Subject: [Python-Dev] Cleaning up surrogate escaped strings (was Bytes
 path related questions for Guido)
In-Reply-To: <>
References: <>
Message-ID: <>

On 29 August 2014 10:32, Stephen J. Turnbull <stephen at> wrote:
> Nick Coghlan writes:
>  > The current proposal on the issue tracker is to instead take advantage of
>  > the existing error handlers:
>  >
>  >     def convert_surrogateescape(data, errors='replace'):
>  >         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
>  >
>  > That code is short, but semantically dense
> And it doesn't implement your original suggestion of replacement with
> '?' (and another possibility for history buffs is 0x1A, ASCII SUB).  At
> least, AFAICT from the docs there's no way to specify the replacement
> character; decoding always uses U+FFFD.  (If I knew how to do that, I
> would have suggested this.)

If that actually matters in a given context, I can do an ordinary
string replacement later. I couldn't think of a case where it actually
mattered though - if "must be ASCII" was a requirement, then
backslashreplace was a suitable alternative that lost less information
(hence the RFE to make that also usable on input).

>  > (Added bonus: once you're alerted to the possibility, it's trivial
>  > to write your own version for existing Python 3 versions.
> I'm not sure that's true.  At least, to me that code was obvious -- I
> got the exact definition (except for the function name) on the first
> try -- but I ruled it out because it didn't implement your suggestion
> of replacement with '?', even as an option.

Yeah, part of the tracker discussion involved me realising that part
wasn't a necessary requirement - the key is being able to get rid of
the surrogates, or replace them with something readily identifiable,
and less about being able to control exactly what they get replaced

> OTOH, I think a lot of the resistance to codec-based solutions is the
> misconception that en/decoding streams is expensive, or the
> misconception that Python's internal representation of text as an
> array of code points (rather than an array of "characters" or
> "grapheme clusters") is somehow insufficient for text processing.

We don't actually have any technical deep dives into how Python 3's
text handling works readily available online, so there's a lot of
speculation and misinformation floating around. My recent article
gives the high level context, but it really needs to be paired up with
a piece (or pieces) that go deep into the details of codec
optimisation, the UTF-8 caching, how it integrates with the UTF-16-LE
Windows APIs, how the internal storage structure is determined at
allocation time, how it maintains compatibility with the legacy C
extension APIs, etc. The only current widely distributed articles on
those topics are written from a perspective that assumes we don't know
anything about Unicode, and are just making things unnecessarily
complicated (rather than solving hard cross platform compatibility and
text processing performance problems). That perspective is incorrect,
but "trust me, they're wrong" doesn't work very well with people that
are already angry.

Text manipulation is one of the most sophisticated subsystems in the
interpreter, though, so it's hard to know where to start on such a
series (and easy to get intimidated by the sheer magnitude of the work
involved in doing it right).
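
As one small, CPython-specific illustration of the "internal storage
structure is determined at allocation time" point (PEP 393), the
per-code-point cost of a string depends on the widest character it
contains. This sketch assumes CPython 3.3 or later; sys.getsizeof
results are implementation details, and the helper name is made up:

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point by comparing two freshly built strings."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


widths = {
    "ascii": bytes_per_code_point("a"),
    "latin-1": bytes_per_code_point("\xe9"),
    "bmp": bytes_per_code_point("\u20ac"),
    "astral": bytes_per_code_point("\U0001F600"),
}
```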


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From mal at  Fri Aug 29 09:48:50 2014
From: mal at (M.-A. Lemburg)
Date: Fri, 29 Aug 2014 09:48:50 +0200
Subject: [Python-Dev] surrogatepass - she's a witch,
 burn 'er!  [was: Cleaning up ...]
In-Reply-To: <>
References: <>	<>
Message-ID: <>

On 29.08.2014 02:41, Stephen J. Turnbull wrote:
> In the process of booking up for my other post in this thread, I
> noticed the 'surrogatepass' handler.
> Is there a real use case for the 'surrogatepass' error handler?  It
> seems like a horrible break in the abstraction.  IMHO, if there's a
> need, the application should handle this.  Python shouldn't provide
> it on encoding as the resulting streams are not Unicode conformant,
> nor on decoding UTF-16, as conversion of surrogate pairs is a
> requirement of all Unicode versions since about 1995.

This error handler allows applications to reactivate the Python 2
style behavior of the UTF codecs in Python 3, which allow reading
lone surrogates on input.
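
A short sketch of what the handler changes in Python 3 (the sample
code point U+D800 is chosen arbitrarily): the strict UTF-8 codec
rejects lone surrogates in both directions, while 'surrogatepass'
lets them through as three-byte sequences.

```python
lone = "\ud800"

try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

passed = lone.encode("utf-8", "surrogatepass")      # three-byte encoding of U+D800
restored = passed.decode("utf-8", "surrogatepass")  # back to the lone surrogate

try:
    passed.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False
```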

Since Python allows working with lone surrogates in Unicode (they
are valid code points) and we're using UTF-8 for marshal, we needed
a way to make sure that Python 3 also optionally supports working
with lone surrogates in such UTF-8 streams (nowadays called CESU-8:


for discussions.

Marc-Andre Lemburg

Professional Python Services directly from the Source  (#1, Aug 29 2014)
>>> Python Projects, Consulting and Support ...
>>> mxODBC.Zope/Plone.Database.Adapter ...
>>> mxODBC, mxDateTime, mxTextTools ...
2014-08-27: Released eGenix PyRun 2.0.1 ...
2014-09-19: PyCon UK 2014, Coventry, UK ...                21 days to go
2014-09-27: PyDDF Sprint 2014 ...                          29 days to go Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611

From walter at  Fri Aug 29 12:09:54 2014
From: walter at (Walter =?utf-8?q?D=C3=B6rwald?=)
Date: Fri, 29 Aug 2014 12:09:54 +0200
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To: <>
References: <>
 <> <>
 <> <>
 <> <>
 <> <>
Message-ID: <>

On 28 Aug 2014, at 19:54, Glenn Linderman wrote:

> On 8/28/2014 10:41 AM, R. David Murray wrote:
>> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman 
>> <v+python at> wrote:
>> [...]
>> Also for
>> cases where the data stream is *supposed* to be in a given encoding, 
>> but
>> contains undecodable bytes.  Showing the stuff that incorrectly 
>> decodes
>> as whatever it decodes to is generally what you want in that case.
> Sure, people can learn to recognize mojibake for what it is, and maybe 
> even learn to recognize it for what it was intended to be, in limited 
> domains. But suppressing/replacing the surrogates doesn't help with 
> that... would it not be better to replace the surrogates with an 
> escape sequence that shows the original, undecodable, byte value?  
> Like  \xNN ?

For that we could extend the "backslashreplace" codec error callback, so 
that it can be used for decoding too, not just for encoding. I.e.

    b"a\xffb".decode("utf-8", "backslashreplace")

would return

    "a\\xffb"

From mal at  Fri Aug 29 14:18:34 2014
From: mal at (M.-A. Lemburg)
Date: Fri, 29 Aug 2014 14:18:34 +0200
Subject: [Python-Dev] surrogatepass - she's a witch,
 burn 'er!  [was: Cleaning up ...]
In-Reply-To: <>
References: <>	<>	<>	<>
Message-ID: <>

On 29.08.2014 13:22, Isaac Morland wrote:
> On Fri, 29 Aug 2014, M.-A. Lemburg wrote:
>> On 29.08.2014 02:41, Stephen J. Turnbull wrote:
>> Since Python allows working with lone surrogates in Unicode (they
>> are valid code points) and we're using UTF-8 for marshal, we needed
>> a way to make sure that Python 3 also optionally supports working
>> with lone surrogates in such UTF-8 streams (nowadays called CESU-8:
> If I want that wouldn't I specify "cesu-8" as the encoding?
> i.e., instead of .decode ('utf-8') I would use .decode ('cesu-8').  Right now, trying this I get
> that cesu-8 is an unknown encoding but that could be changed without affecting the behaviour of the
> utf-8 codec.

Why write a new codec that's almost identical to the utf-8 codec,
if you can get the same functionality by explicitly using a
special error handler ?

From a maintenance POV that does not sound like a good approach.

> It seems to me that .decode ('utf-8') should decode exactly and only valid utf-8, including the
> non-use of surrogate pairs as an intermediate encoding step.

It does in Python 3.

Marc-Andre Lemburg


From ijmorlan at  Fri Aug 29 13:22:10 2014
From: ijmorlan at (Isaac Morland)
Date: Fri, 29 Aug 2014 07:22:10 -0400 (EDT)
Subject: [Python-Dev] surrogatepass - she's a witch,
 burn 'er!  [was: Cleaning up ...]
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

On Fri, 29 Aug 2014, M.-A. Lemburg wrote:

> On 29.08.2014 02:41, Stephen J. Turnbull wrote:
> Since Python allows working with lone surrogates in Unicode (they
> are valid code points) and we're using UTF-8 for marshal, we needed
> a way to make sure that Python 3 also optionally supports working
> with lone surrogates in such UTF-8 streams (nowadays called CESU-8:

If I want that wouldn't I specify "cesu-8" as the encoding?

i.e., instead of .decode ('utf-8') I would use .decode ('cesu-8').  Right 
now, trying this I get that cesu-8 is an unknown encoding but that could 
be changed without affecting the behaviour of the utf-8 codec.

It seems to me that .decode ('utf-8') should decode exactly and only valid 
utf-8, including the non-use of surrogate pairs as an intermediate 
encoding step.

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist

From status at  Fri Aug 29 18:08:07 2014
From: status at (Python tracker)
Date: Fri, 29 Aug 2014 18:08:07 +0200 (CEST)
Subject: [Python-Dev] Summary of Python tracker Issues
Message-ID: <>

ACTIVITY SUMMARY (2014-08-22 - 2014-08-29)
Python tracker at

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4638 (+17)
  closed 29431 (+32)
  total  34069 (+49)

Open issues with patches: 2193 

Issues opened (41)

#17095: Modules/Setup *shared* support broken  reopened by haypo

#22200: Remove distutils checks for Python version  reopened by Arfrever

#22232: str.splitlines splitting on non-\r\n characters  reopened by terry.reedy

#22252: ssl blocking IO errors  opened by h.venev

#22253: ConfigParser does not handle files without sections  opened by kernc

#22255: Multiprocessing freeze_support raises RuntimeError  opened by Michael.McAuliffe

#22256: pyvenv should display a progress indicator while creating an e  opened by ncoghlan

#22257: PEP 432: Redesign the interpreter startup sequence  opened by ncoghlan

#22258: set_inheritable(): ioctl(FIOCLEX) is available but fails with  opened by igor.pashev

#22260: Rearrange tkinter tests, use test discovery  opened by zach.ware

#22261: Document how to use Concurrent Build when using MsBuild  opened by sbspider

#22263: Add a resource for CLI tests  opened by serhiy.storchaka

#22264: Add wsgiref.util helpers for dealing with "WSGI strings"  opened by ncoghlan

#22268: mrohasattr and mrogetattr  opened by Gregory.Salvan

#22269: Resolve distutils option conflicts with priorities  opened by minrk

#22270: cache version selection for documentation  opened by thejj

#22271: Deprecate PyUnicode_AsUnicode(): emit a DeprecationWarning  opened by haypo

#22273: abort when passing certain structs by value using ctypes  opened by weeble

#22274: subprocess.Popen(stderr=STDOUT) fails to redirect subprocess s  opened by akira

#22275: asyncio: enhance documentation of OS support  opened by haypo

#22276: pathlib glob issues  opened by

#22277: add parameters to suppress output on stdout and  opened by CristianCantoro

#22278: urljoin duplicate slashes  opened by demian.brecht

#22279: read() vs read1() in asyncio.StreamReader documentation  opened by oconnor663

#22281: ProcessPoolExecutor/ThreadPoolExecutor should provide introspe  opened by dan.oreilly

#22282: ipaddress module accepts octal formatted IPv4 addresses in IPv  opened by xZise

#22283: "AMD64 FreeBSD 9.0 3.x" fails to build the _decimal module: #e  opened by haypo

#22284: decimal module contains less symbols when the _decimal module  opened by haypo

#22285: The Modules/ directory should not be added to sys.path  opened by haypo

#22286: Allow backslashreplace error handler to be used on input  opened by ncoghlan

#22289: support.transient_internet() doesn't catch timeout on FTP test  opened by haypo

#22290: "AMD64 OpenIndiana 3.x" buildbot: assertion failed in PyObject  opened by haypo

#22292: pickle whichmodule RuntimeError  opened by attilio.dinisio

#22293: unittest.mock: use slots in MagicMock to reduce memory footpri  opened by james-w

#22294: 2to3 consuming_calls:  len, min, max, zip, map, reduce, filter  opened by eddygeek

#22295: Clarify available commands for package installation  opened by ncoghlan

#22296: cookielib uses time.time(), making incorrect checks of expirat  opened by regu0004

#22297: 2.7 json encoding broken for enums  opened by eddygeek

#22298: Lib/ _show_warning does not protect against being c  opened by Julius.Lehmann-Richter

#22299: resolve() on Windows makes some pathological paths unusable  opened by Kevin.Norris

#22300: PEP 446 What's New Updates for 2.7.9  opened by ncoghlan

Most recent 15 issues with no replies (15)

#22300: PEP 446 What's New Updates for 2.7.9

#22298: Lib/ _show_warning does not protect against being c

#22297: 2.7 json encoding broken for enums

#22296: cookielib uses time.time(), making incorrect checks of expirat

#22294: 2to3 consuming_calls:  len, min, max, zip, map, reduce, filter

#22289: support.transient_internet() doesn't catch timeout on FTP test

#22286: Allow backslashreplace error handler to be used on input

#22278: urljoin duplicate slashes

#22275: asyncio: enhance documentation of OS support

#22271: Deprecate PyUnicode_AsUnicode(): emit a DeprecationWarning

#22268: mrohasattr and mrogetattr

#22255: Multiprocessing freeze_support raises RuntimeError

#22251: Various markup errors in documentation

#22249: Possibly incorrect example is given for socket.getaddrinfo()

#22246: add strptime(s, '%s')

Most recent 15 issues waiting for review (15)

#22300: PEP 446 What's New Updates for 2.7.9

#22294: 2to3 consuming_calls:  len, min, max, zip, map, reduce, filter

#22292: pickle whichmodule RuntimeError

#22289: support.transient_internet() doesn't catch timeout on FTP test

#22285: The Modules/ directory should not be added to sys.path

#22282: ipaddress module accepts octal formatted IPv4 addresses in IPv

#22281: ProcessPoolExecutor/ThreadPoolExecutor should provide introspe

#22278: urljoin duplicate slashes

#22277: add parameters to suppress output on stdout and

#22275: asyncio: enhance documentation of OS support

#22274: subprocess.Popen(stderr=STDOUT) fails to redirect subprocess s

#22269: Resolve distutils option conflicts with priorities

#22268: mrohasattr and mrogetattr

#22261: Document how to use Concurrent Build when using MsBuild

#22260: Rearrange tkinter tests, use test discovery

Top 10 most discussed issues (10)

#18814: Add tools for "cleaning" surrogate escaped strings  15 msgs

#22232: str.splitlines splitting on non-\r\n characters  13 msgs

#22264: Add wsgiref.util helpers for dealing with "WSGI strings"  10 msgs

#22194: access to cdecimal / libmpdec API   9 msgs

#22277: add parameters to suppress output on stdout and   9 msgs

#22240: argparse support for "python -m module" in help   8 msgs

#22261: Document how to use Concurrent Build when using MsBuild   8 msgs

#22279: read() vs read1() in asyncio.StreamReader documentation   7 msgs

#22285: The Modules/ directory should not be added to sys.path   7 msgs

#21720: "TypeError: Item in ``from list'' not a string"  message   6 msgs

Issues closed (31)

#2527: Pass a namespace to timeit  closed by pitrou

#6550: asyncore incorrect failure when connection is refused and usin  closed by haypo

#11267: asyncore does not check for POLLERR and POLLHUP if neither rea  closed by haypo

#16808: inspect.stack() should return list of named tuples  closed by pitrou

#18530: posixpath.ismount performs extra lstat calls  closed by alex

#19447: py_compile.compile raises if a file has bad encoding  closed by berker.peksag

#20745: test_statistics fails in refleak mode  closed by zach.ware

#20996: Backport TLS 1.1 and 1.2 support for ssl_version  closed by alex

#21305: PEP 466: update os.urandom  closed by python-dev

#22034: posixpath.join() and bytearray  closed by serhiy.storchaka

#22042: signal.set_wakeup_fd(fd): raise an exception if the fd is in b  closed by haypo

#22059: incorrect type conversion from str to bytes in asynchat module  closed by r.david.murray

#22090: Decimal and float formatting treat '%' differently for infinit  closed by skrah

#22182: distutils.file_util.move_file unpacks wrongly an exception  closed by berker.peksag

#22199: 2.7 sysconfig._get_makefile_filename should be sysconfig.get_m  closed by ned.deily

#22236: Do not use _default_root in Tkinter tests  closed by serhiy.storchaka

#22239: asyncio: nested event loop  closed by gvanrossum

#22243: Documentation on try statement incorrectly implies target of e  closed by terry.reedy

#22244: load_verify_locations fails to handle unicode paths on Python  closed by python-dev

#22250: unittest lowercase methods  closed by ezio.melotti

#22254: match object generated by re.finditer cannot call groups() on  closed by leiju

#22259: fdopen of directory causes segmentation fault  closed by python-dev

#22262: Python External Libraries are stored in directory above where  closed by zach.ware

#22265: fix reliance on refcounting in test_itertools  closed by python-dev

#22266: fix reliance on refcounting in tarfile.gzopen  closed by python-dev

#22267: fix reliance on refcounting in test_weakref  closed by python-dev

#22272: sqlite3 memory leaks in cursor.execute  closed by haypo

#22280: _decimal: successful import despite build failure  closed by skrah

#22287: Use clock_gettime() in pytime.c  closed by haypo

#22288: Incorrect Call grammar in documentation  closed by mjpieters

#22291: Typo in docs - Lib/random  closed by r.david.murray

From alex.gaynor at  Fri Aug 29 21:47:16 2014
From: alex.gaynor at (Alex Gaynor)
Date: Fri, 29 Aug 2014 19:47:16 +0000 (UTC)
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
Message-ID: <>

Hi all,

I've just submitted PEP 476, on enabling certificate validation by default for
HTTPS clients in Python. Please have a look and let me know what you think.

PEP text follows.



PEP: 476
Title: Enabling certificate verification by default for stdlib http clients
Version: $Revision$
Last-Modified: $Date$
Author: Alex Gaynor <alex.gaynor at>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-August-2014


Currently when a standard library http client (the ``urllib`` and ``http``
modules) encounters an ``https://`` URL it will wrap the network HTTP traffic
in a TLS stream, as is necessary to communicate with such a server. However,
during the TLS handshake it will not actually check that the server has an X509
certificate that is signed by a CA in any trust root, nor will it verify that the
Common Name (or Subject Alternate Name) on the presented certificate matches
the requested host.

The failure to do these checks means that anyone with a privileged network
position is able to trivially execute a man in the middle attack against a
Python application using either of these HTTP clients, and change traffic at
will.

This PEP proposes to enable verification of X509 certificate signatures, as
well as hostname verification for Python's HTTP clients by default, subject to
opt-out on a per-call basis.


The "S" in "HTTPS" stands for secure. When Python's users type "HTTPS" they are
expecting a secure connection, and Python should adhere to a reasonable
standard of care in delivering this. Currently we are failing at this, and in
doing so, APIs which appear simple are misleading users.

When asked, many Python users state that they were not aware that Python failed
to perform these validations, and are shocked.

The popularity of ``requests`` (which enables these checks by default)
demonstrates that these checks are not overly burdensome in any way, and the
fact that it is widely recommended as a major security improvement over the
standard library clients demonstrates that many expect a higher standard for
"security by default" from their tools.

The failure of various applications to note Python's negligence in this matter
is a source of *regular* CVE assignment [#]_ [#]_ [#]_ [#]_ [#]_ [#]_ [#]_ [#]_
[#]_ [#]_ [#]_.

.. [#]
.. [#]
.. [#]
.. [#]
.. [#]
.. [#]
.. [#]
.. [#]
.. [#]
.. [#]
.. [#]

Technical Details

Python would use the system provided certificate database on all platforms.
Failure to locate such a database would be an error, and users would need to
explicitly specify a location to fix it.

This can be achieved by simply replacing the use of
``ssl._create_stdlib_context`` with ``ssl.create_default_context`` in

Trust database

This PEP proposes using the system-provided certificate database. Previous
discussions have suggested bundling Mozilla's certificate database and using
that by default. This was decided against for several reasons:

* Using the platform trust database imposes a lower maintenance burden on the
  Python developers -- shipping our own trust database would require doing a
  release every time a certificate was revoked.
* Linux vendors, and other downstreams, would unbundle the Mozilla
  certificates, resulting in a more fragmented set of behaviors.
* Using the platform stores makes it easier to handle situations such as
  corporate internal CAs.

Backwards compatibility

This change will have the appearance of causing some HTTPS connections to
"break", because they will now raise an Exception during handshake.

This is misleading, however: in fact these connections are presently failing
silently. An HTTPS URL indicates an expectation of confidentiality and
authentication. The fact that Python does not actually verify that the user's
request has been made is a bug; further: "Errors should never pass silently."

Nevertheless, users who have a need to access servers with self-signed or
incorrect certificates would be able to do so by providing a context with
custom trust roots or which disables validation (documentation should strongly
recommend the former where possible). Users will also be able to add necessary
certificates to system trust stores in order to trust them globally.
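
The opt-out described above could look roughly like the following sketch. It assumes the ``ssl.create_default_context`` API and the ``context`` argument of ``urllib.request.urlopen``; ``make_context`` and ``fetch`` are hypothetical helper names used only for illustration, and nothing here is part of the PEP text.

```python
import ssl
import urllib.request


def make_context(cafile=None, verify=True):
    """Hypothetical helper: build a context for an HTTPS request.

    cafile -- optional path to a PEM bundle of custom trust roots
    verify -- set to False to disable validation entirely (discouraged)
    """
    ctx = ssl.create_default_context(cafile=cafile)
    if not verify:
        # check_hostname has to be cleared before verify_mode is relaxed,
        # otherwise the ssl module raises ValueError.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url, ctx):
    """Hypothetical helper: open a URL using an explicit context."""
    with urllib.request.urlopen(url, context=ctx) as response:
        return response.read()


verified = make_context()                # chain and hostname are checked
unverified = make_context(verify=False)  # explicit per-call opt-out
```

Passing ``cafile`` is the preferred route for self-signed certificates, since it keeps both the chain and hostname checks in place.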

Twisted's 14.0 release made this same change, and it has been met with almost
no opposition.

Other protocols

This PEP only proposes requiring this level of validation for HTTP clients, not
for other protocols such as SMTP.

This is because while a high percentage of HTTPS servers have correct
certificates, as a result of the validation performed by browsers, for other
protocols self-signed or otherwise incorrect certificates are far more common.
Note that for SMTP at least, this appears to be changing and should be reviewed
for a potential similar PEP in the future:


Python Versions

This PEP proposes making these changes to the ``default`` (Python 3) branch. I
strongly believe these changes also belong in Python 2, but doing them in a
patch-release isn't reasonable, and there is strong opposition to doing a 2.8


This document has been placed into the public domain.


From mal at  Fri Aug 29 22:00:00 2014
From: mal at (M.-A. Lemburg)
Date: Fri, 29 Aug 2014 22:00:00 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

On 29.08.2014 21:47, Alex Gaynor wrote:
> Hi all,
> I've just submitted PEP 476, on enabling certificate validation by default for
> HTTPS clients in Python. Please have a look and let me know what you think.
> PEP text follows.

Thanks for the PEP. I think this is generally a good idea,
but some important parts are missing from the PEP:

 * transition plan:

   I think starting with warnings in Python 3.5 and going
   for exceptions in 3.6 would make a good transition

   Going straight for exceptions in 3.5 is not in line with
   our normal procedures for backwards incompatible changes.

 * configuration:

   It would be good to be able to switch this on or off
   without having to change the code, e.g. via a command
   line switch and environment variable; perhaps even
   controlling whether or not to raise an exception or
   warning.

 * choice of trusted certificate:

   Instead of hard wiring using the system CA roots into
   Python it would be good to just make this default and
   permit the user to point Python to a different set of
   CA roots.

   This would enable using self signed certs more easily.
   Since these are often used for tests, demos and education,
   I think it's important to allow having more control of
   the trusted certs.

Marc-Andre Lemburg


From dreid at  Fri Aug 29 21:56:59 2014
From: dreid at (David Reid)
Date: Fri, 29 Aug 2014 19:56:59 +0000 (UTC)
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
References: <>
Message-ID: <>

Alex Gaynor <alex.gaynor <at>> writes:

> Hi all,
> I've just submitted PEP 476, on enabling certificate validation by default for
> HTTPS clients in Python. Please have a look and let me know what you think.

Yes please.

The two most common answers I get to "Why did you switch to Go?" are
"Concurrency" and "The stdlib HTTP client verifies TLS by default."

In a work-related survey of webhook providers I found that only ~7% of HTTPS
URLs would be affected by a change like this.


From ethan at  Fri Aug 29 22:07:00 2014
From: ethan at (Ethan Furman)
Date: Fri, 29 Aug 2014 13:07:00 -0700
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/29/2014 01:00 PM, M.-A. Lemburg wrote:
> On 29.08.2014 21:47, Alex Gaynor wrote:
>> I've just submitted PEP 476, on enabling certificate validation by default for
>> HTTPS clients in Python. Please have a look and let me know what you think.
> Thanks for the PEP. I think this is generally a good idea,
> but some important parts are missing from the PEP:
>   * transition plan:
>     I think starting with warnings in Python 3.5 and going
>     for exceptions in 3.6 would make a good transition
>     Going straight for exceptions in 3.5 is not in line with
>     our normal procedures for backwards incompatible changes.
>   * configuration:
>     It would be good to be able to switch this on or off
>     without having to change the code, e.g. via a command
>     line switch and environment variable; perhaps even
>     controlling whether or not to raise an exception or
>     warning.
>   * choice of trusted certificate:
>     Instead of hard wiring using the system CA roots into
>     Python it would be good to just make this default and
>     permit the user to point Python to a different set of
>     CA roots.
>     This would enable using self signed certs more easily.
>     Since these are often used for tests, demos and education,
>     I think it's important to allow having more control of
>     the trusted certs.

+1 for PEP with above changes.


From donald at  Fri Aug 29 22:10:03 2014
From: donald at (Donald Stufft)
Date: Fri, 29 Aug 2014 16:10:03 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

> On Aug 29, 2014, at 4:00 PM, "M.-A. Lemburg" <mal at> wrote:
> * choice of trusted certificate:
>   Instead of hard wiring using the system CA roots into
>   Python it would be good to just make this default and
>   permit the user to point Python to a different set of
>   CA roots.
>   This would enable using self signed certs more easily.
>   Since these are often used for tests, demos and education,
>   I think it's important to allow having more control of
>   the trusted certs.

If I recall correctly, OpenSSL already allows this to be configured via an environment variable, and the Python API already allows it to be configured in code.

From donald at  Fri Aug 29 23:11:35 2014
From: donald at (Donald Stufft)
Date: Fri, 29 Aug 2014 17:11:35 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

Sorry I was on my phone and didn't get to fully reply to this.

> On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg <mal at> wrote:
> On 29.08.2014 21:47, Alex Gaynor wrote:
>> Hi all,
>> I've just submitted PEP 476, on enabling certificate validation by default for
>> HTTPS clients in Python. Please have a look and let me know what you think.
>> PEP text follows.
> Thanks for the PEP. I think this is generally a good idea,
> but some important parts are missing from the PEP:
> * transition plan:
>   I think starting with warnings in Python 3.5 and going
>   for exceptions in 3.6 would make a good transition
>   Going straight for exceptions in 3.5 is not in line with
>   our normal procedures for backwards incompatible changes.

As far as a transition plan goes, I think that this is an important
enough thing to have an accelerated process. If we need
to provide a warning then let's add it to the next 3.4, otherwise
it's going to be 2.5+ years until we stop being unsafe by
default.

Another problem with this is that I don't think it's actually
possible to do. Python itself isn't validating the TLS certificates,
OpenSSL is doing that. To my knowledge OpenSSL doesn't
have a way to say "please validate these certificates and if
they don't validate go ahead and keep going and just let me
get a warning from it". It's a 3 way switch, no validation, validation
if a certificate is provided, and validation always.

Now that's strictly for the "verify the certificate chain" portion,
the hostname verification is done entirely on our end and we
could do something there... but I'm not sure it makes sense
to do so if we can't do it for invalid certificates too.
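
For reference, the three-way switch corresponds to the verify-mode constants in Python's ssl module. The sketch below only lists them; the descriptions paraphrase the message above, and ``MODES`` is an illustrative name. ``mode.name`` assumes Python 3.6 or later, where these constants are enum members.

```python
import ssl

# The three verification modes exposed by the ssl module, in the same
# order as the "3 way switch" described above.
MODES = {
    ssl.CERT_NONE: "no validation",
    ssl.CERT_OPTIONAL: "validation if a certificate is provided",
    ssl.CERT_REQUIRED: "validation always",
}

for mode, meaning in MODES.items():
    print(f"{mode.name} = {int(mode)}: {meaning}")
```

None of the three asks OpenSSL to validate and carry on with only a warning, which is the gap being pointed out.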

> * configuration:
>   It would be good to be able to switch this on or off
>   without having to change the code, e.g. via a command
>   line switch and environment variable; perhaps even
>   controlling whether or not to raise an exception or
>   warning.

I'm on the fence about this, if someone provides a certificate
that we can validate against (which can be done without
touching the code) then the only thing that really can't be
"fixed" without touching the code is if someone has a certificate
that is otherwise invalid (expired, not yet valid, wrong hostname,
etc). I'd say if I was voting on this particular thing I'd be -0, I'd
rather it didn't exist but I wouldn't cry too much if it did.

> * choice of trusted certificate:
>   Instead of hard wiring using the system CA roots into
>   Python it would be good to just make this default and
>   permit the user to point Python to a different set of
>   CA roots.
>   This would enable using self signed certs more easily.
>   Since these are often used for tests, demos and education,
>   I think it's important to allow having more control of
>   the trusted certs.

Like my other email said, the Python API has everything needed
to easily specify your own CA roots and/or disable the validations.
The OpenSSL library also allows you to specify either a directory
or a file to change the root certificates without code changes. The
only real problems with the APIs are that the default is bad and
an unrelated thing where you can't pass in an in memory certificate.

Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


From rdmurray at  Fri Aug 29 23:42:34 2014
From: rdmurray at (R. David Murray)
Date: Fri, 29 Aug 2014 17:42:34 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

On Fri, 29 Aug 2014 17:11:35 -0400, Donald Stufft <donald at> wrote:
> Sorry I was on my phone and didn't get to fully reply to this.
> > On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg <mal at> wrote:
> > 
> > * configuration:
> > 
> >   It would be good to be able to switch this on or off
> >   without having to change the code, e.g. via a command
> >   line switch and environment variable; perhaps even
> >   controlling whether or not to raise an exception or
> >   warning.
> I'm on the fence about this, if someone provides a certificate
> that we can validate against (which can be done without
> touching the code) then the only thing that really can't be
> "fixed" without touching the code is if someone has a certificate
> that is otherwise invalid (expired, not yet valid, wrong hostname,
> etc). I'd say if I was voting on this particular thing I'd be -0, I'd
> rather it didn't exist but I wouldn't cry too much if it did.

Especially if you want an accelerated change, there must be a way to
*easily* get back to the previous behavior, or we are going to catch a
lot of flack.  There may be only 7% of public certs that are problematic,
but I'd be willing to bet you that there are more not-really-public ones
that are critical to day to day operations *somewhere* :)

wget and curl have 'ignore validation' as a command line flag for a reason.


From solipsis at  Fri Aug 29 23:55:40 2014
From: solipsis at (Antoine Pitrou)
Date: Fri, 29 Aug 2014 23:55:40 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
References: <>
Message-ID: <20140829235540.0f73b1d0@fsol>

On Fri, 29 Aug 2014 17:11:35 -0400
Donald Stufft <donald at> wrote:
> Another problem with this is that I don't think it's actually
> possible to do. Python itself isn't validating the TLS certificates,
> OpenSSL is doing that. To my knowledge OpenSSL doesn't
> have a way to say "please validate these certificates and if
> they don't validate go ahead and keep going and just let me
> get a warning from it".

Actually, there may be a solution.
In client mode, OpenSSL always verifies the server cert chain and
stores the verification result in the SSL structure. It will then only
report an error if the verify mode is not SSL_VERIFY_NONE.
(see ssl3_get_server_certificate() in s3_clnt.c)

The verification result should then be readable using
SSL_get_verify_result(), even with SSL_VERIFY_NONE.

(note this is only from reading the source code and needs verifying)

Then we could have the following transition phase:
- define a new CERT_WARN value for SSLContext.verify_mode
- use that value as the default in the HTTP stack (people who want the
  old silent default will have to set verify_mode explicitly to
- with CERT_WARN, SSL_VERIFY_NONE is passed to OpenSSL and Python
  manually calls SSL_get_verify_result() after a handshake; if there
  was a verification error, a warning is printed out

And in the following version we switch the HTTP default to



From mal at  Fri Aug 29 23:58:29 2014
From: mal at (M.-A. Lemburg)
Date: Fri, 29 Aug 2014 23:58:29 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>	<>
Message-ID: <>

On 29.08.2014 23:11, Donald Stufft wrote:
> Sorry I was on my phone and didn't get to fully reply to this.
>> On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg <mal at> wrote:
>> On 29.08.2014 21:47, Alex Gaynor wrote:
>>> Hi all,
>>> I've just submitted PEP 476, on enabling certificate validation by default for
>>> HTTPS clients in Python. Please have a look and let me know what you think.
>>> PEP text follows.
>> Thanks for the PEP. I think this is generally a good idea,
>> but some important parts are missing from the PEP:
>> * transition plan:
>>   I think starting with warnings in Python 3.5 and going
>>   for exceptions in 3.6 would make a good transition
>>   Going straight for exceptions in 3.5 is not in line with
>>   our normal procedures for backwards incompatible changes.
> As far as a transition plan, I think that this is an important
> enough thing to have an accelerated process. If we need
> to provide a warning then let's add it to the next 3.4, otherwise
> it's going to be 2.5+ years until we stop being unsafe by
> default.

Fine with me; we're still early in the Python 3.4
patch level releases.

> Another problem with this is that I don't think it's actually
> possible to do. Python itself isn't validating the TLS certificates,
> OpenSSL is doing that. To my knowledge OpenSSL doesn't
> have a way to say "please validate these certificates and if
> they don't validate go ahead and keep going and just let me
> get a warning from it". It's a 3 way switch, no validation, validation
> if a certificate is provided, and validation always.
> Now that's strictly for the "verify the certificate chain" portion,
> the hostname verification is done entirely on our end and we
> could do something there... but I'm not sure it makes sense
> to do so if we can't do it for invalid certificates too.

OpenSSL provides a callback for certificate validation,
so it is possible to issue a warning and continue with
accepting the certificate.

>> * configuration:
>>   It would be good to be able to switch this on or off
>>   without having to change the code, e.g. via a command
>>   line switch and environment variable; perhaps even
>>   controlling whether or not to raise an exception or
>>   warning.
> I'm on the fence about this, if someone provides a certificate
> that we can validate against (which can be done without
> touching the code) then the only thing that really can't be
> "fixed" without touching the code is if someone has a certificate
> that is otherwise invalid (expired, not yet valid, wrong hostname,
> etc). I'd say if I was voting on this particular thing I'd be -0, I'd
> rather it didn't exist but I wouldn't cry too much if it did.

If you're testing code or trying out some new stuff, you
don't want to get a valid cert first, but instead go ahead
with a self signed one. That's the use case.

>> * choice of trusted certificate:
>>   Instead of hard wiring using the system CA roots into
>>   Python it would be good to just make this default and
>>   permit the user to point Python to a different set of
>>   CA roots.
>>   This would enable using self signed certs more easily.
>>   Since these are often used for tests, demos and education,
>>   I think it's important to allow having more control of
>>   the trusted certs.
> Like my other email said, the Python API has everything needed
> to easily specify your own CA roots and/or disable the validations.
> The OpenSSL library also allows you to specify either a directory
> or a file to change the root certificates without code changes. The
> only real problems with the APIs are that the default is bad and
> an unrelated thing where you can't pass in an in memory certificate.

Are you sure that's possible ? Python doesn't load the
openssl.cnf file and the SSL_CERT_FILE, SSL_CERT_DIR env
vars only work for the openssl command line binary, AFAIK.
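
One way to check this, offered only as a sketch: ssl.get_default_verify_paths() reports the environment variable names and the default file and directory locations that the linked OpenSSL library advertises. Whether a given context actually honours them depends on how its default certificates are loaded, so the output is a starting point rather than an answer.

```python
import ssl

# Ask the ssl module what the linked OpenSSL library uses as its defaults.
paths = ssl.get_default_verify_paths()

# Names of the environment variables OpenSSL consults for overrides.
print("CA file env var:", paths.openssl_cafile_env)
print("CA dir env var: ", paths.openssl_capath_env)

# Locations resolved for this process (None when nothing exists there).
print("cafile:", paths.cafile)
print("capath:", paths.capath)
```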

In any case, Python will have to tap into the OS CA root
provider using special code and this code could easily be
made to check other dirs/files as well.

The point is that it should be possible to change this default
at the Python level, without needing application code changes.

Marc-Andre Lemburg

Professional Python Services directly from the Source  (#1, Aug 29 2014)
>>> Python Projects, Consulting and Support ...
>>> mxODBC.Zope/Plone.Database.Adapter ...
>>> mxODBC, mxDateTime, mxTextTools ...
2014-08-27: Released eGenix PyRun 2.0.1 ...
2014-09-19: PyCon UK 2014, Coventry, UK ...                21 days to go
2014-09-27: PyDDF Sprint 2014 ...                          29 days to go Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611

From solipsis at  Fri Aug 29 23:57:41 2014
From: solipsis at (Antoine Pitrou)
Date: Fri, 29 Aug 2014 23:57:41 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
References: <>
Message-ID: <20140829235741.3bf75d30@fsol>

On Fri, 29 Aug 2014 17:42:34 -0400
"R. David Murray" <rdmurray at> wrote:
> Especially if you want an accelerated change, there must be a way to
> *easily* get back to the previous behavior, or we are going to catch a
> lot of flack.  There may be only 7% of public certs that are problematic,
> but I'd be willing to bet you that there are more not-really-public ones
> that are critical to day to day operations *somewhere* :)

Actually, by construction, there are certs which will always fail
verification, for example because they are embedded in telco equipment
which doesn't have a predefined hostname or IP address.
(I have encountered some of those.)



From donald at  Sat Aug 30 00:00:50 2014
From: donald at (Donald Stufft)
Date: Fri, 29 Aug 2014 18:00:50 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

> On Aug 29, 2014, at 5:42 PM, R. David Murray <rdmurray at> wrote:
> On Fri, 29 Aug 2014 17:11:35 -0400, Donald Stufft <donald at> wrote:
>> Sorry I was on my phone and didn't get to fully reply to this.
>>> On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg <mal at> wrote:
>>> * configuration:
>>>  It would be good to be able to switch this on or off
>>>  without having to change the code, e.g. via a command
>>>  line switch and environment variable; perhaps even
>>>  controlling whether or not to raise an exception or
>>>  warning.
>> I'm on the fence about this, if someone provides a certificate
>> that we can validate against (which can be done without
>> touching the code) then the only thing that really can't be
>> "fixed" without touching the code is if someone has a certificate
>> that is otherwise invalid (expired, not yet valid, wrong hostname,
>> etc). I'd say if I was voting on this particular thing I'd be -0, I'd
>> rather it didn't exist but I wouldn't cry too much if it did.
> Especially if you want an accelerated change, there must be a way to
> *easily* get back to the previous behavior, or we are going to catch a
> lot of flack.  There may be only 7% of public certs that are problematic,
> but I'd be willing to bet you that there are more not-really-public ones
> that are critical to day to day operations *somewhere* :)
> wget and curl have 'ignore validation' as a command line flag for a reason.

Right, that's why I'm on the fence :)

On one hand, it's going to break things for some people, (arguably they are
already broken, just silently so, but we'll leave that argument aside) and a
way to get back the old behavior is good. There are already ways within
the Python code itself, so that's covered. From outside of the Python code
there are ways if the certificate is untrusted but otherwise valid which are
pretty easy to do. The major "gap" is when you have an actual invalid
certificate due to expiration or hostname or some other such thing.

On the other hand Python is not wget/curl and the people who are most
likely to be the target for a "I can't change the code but I need to get the
old behavior back" are people who are likely to not be invoking Python
itself but using something written in Python which happens to be using
Python. IOW they might be executing "foobar" not "python -m foobar".

Like I said though, I'm personally fine either way so don't take this as
being against that particular change!

Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


From donald at  Sat Aug 30 00:08:19 2014
From: donald at (Donald Stufft)
Date: Fri, 29 Aug 2014 18:08:19 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

> On Aug 29, 2014, at 5:58 PM, M.-A. Lemburg <mal at> wrote:
> On 29.08.2014 23:11, Donald Stufft wrote:
>> Sorry I was on my phone and didn't get to fully reply to this.
>>> On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg <mal at> wrote:
>>> On 29.08.2014 21:47, Alex Gaynor wrote:
>>>> Hi all,
>>>> I've just submitted PEP 476, on enabling certificate validation by default for
>>>> HTTPS clients in Python. Please have a look and let me know what you think.
>>>> PEP text follows.
>>> Thanks for the PEP. I think this is generally a good idea,
>>> but some important parts are missing from the PEP:
>>> * transition plan:
>>>  I think starting with warnings in Python 3.5 and going
>>>  for exceptions in 3.6 would make a good transition
>>>  Going straight for exceptions in 3.5 is not in line with
>>>  our normal procedures for backwards incompatible changes.
>> As far as a transition plan, I think that this is an important
>> enough thing to have an accelerated process. If we need
>> to provide a warning then let's add it to the next 3.4 otherwise
>> it's going to be 2.5+ years until we stop being unsafe by
>> default.
> Fine with me; we're still early in the Python 3.4
> patch level releases.
>> Another problem with this is that I don't think it's actually
>> possible to do. Python itself isn't validating the TLS certificates,
>> OpenSSL is doing that. To my knowledge OpenSSL doesn't
>> have a way to say "please validate these certificates and if
>> they don't validate go ahead and keep going and just let me
>> get a warning from it". It's a 3 way switch, no validation, validation
>> if a certificate is provided, and validation always.
>> Now that's strictly for the "verify the certificate chain" portion,
>> the hostname verification is done entirely on our end and we
>> could do something there... but I'm not sure it makes sense
>> to do so if we can't do it for invalid certificates too.
> OpenSSL provides a callback for certificate validation,
> so it is possible to issue a warning and continue with
> accepting the certificate.

Ah right, I forgot about that. I was thinking in terms of CERT_NONE,
CERT_OPTIONAL, CERT_REQUIRED. I think it's fine to add a warning
if possible to Python 3.4, I just couldn't think of a way of doing it
off the top of my head.
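
For reference, the three-way switch mentioned above corresponds to the
verify_mode constants in the ssl module. A minimal sketch, assuming a Python
recent enough to provide ssl.PROTOCOL_TLS_CLIENT (3.6 or later); the helper
name make_context is made up for illustration and is not part of any API:

```python
import ssl


def make_context(verify=True):
    """Return a client-side context (illustrative helper, not a stdlib API)."""
    # PROTOCOL_TLS_CLIENT starts out with CERT_REQUIRED and hostname checking.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if not verify:
        # check_hostname has to be cleared before verify_mode may be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


# The three settings, from weakest to strictest.
for mode in (ssl.CERT_NONE, ssl.CERT_OPTIONAL, ssl.CERT_REQUIRED):
    print(mode.name)
```

Note that none of these modes means "validate, but only warn on failure",
which is the gap being discussed.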

>>> * configuration:
>>>  It would be good to be able to switch this on or off
>>>  without having to change the code, e.g. via a command
>>>  line switch and environment variable; perhaps even
>>>  controlling whether or not to raise an exception or
>>>  warning.
>> I'm on the fence about this, if someone provides a certificate
>> that we can validate against (which can be done without
>> touching the code) then the only thing that really can't be
>> "fixed" without touching the code is if someone has a certificate
>> that is otherwise invalid (expired, not yet valid, wrong hostname,
>> etc). I'd say if I was voting on this particular thing I'd be -0, I'd
>> rather it didn't exist but I wouldn't cry too much if it did.
> If you're testing code or trying out some new stuff, you
> don't want to get a valid cert first, but instead go ahead
> with a self signed one. That's the use case.
>>> * choice of trusted certificate:
>>>  Instead of hard wiring using the system CA roots into
>>>  Python it would be good to just make this default and
>>>  permit the user to point Python to a different set of
>>>  CA roots.
>>>  This would enable using self signed certs more easily.
>>>  Since these are often used for tests, demos and education,
>>>  I think it's important to allow having more control of
>>>  the trusted certs.
>> Like my other email said, the Python API has everything needed
>> to easily specify your own CA roots and/or disable the validations.
>> The OpenSSL library also allows you to specify either a directory
>> or a file to change the root certificates without code changes. The
>> only real problems with the APIs are that the default is bad and
>> an unrelated thing where you can't pass in an in memory certificate.
> Are you sure that's possible ? Python doesn't load the
> openssl.cnf file and the SSL_CERT_FILE, SSL_CERT_DIR env
> vars only work for the openssl command line binary, AFAIK.

I'm not 100% sure on that. I know they are not limited to the command
line binary as ruby uses those environment variables in the way I
described above. I do not believe that Ruby has done anything
special to enable the use of those variables. It's possible we're doing
something differently that bypasses those variables though. If that is the
case then yes let's add it, ideally doing whatever it needs to be to make
OpenSSL respect those variables, or else respecting them ourselves.

> In any case, Python will have to tap into the OS CA root
> provider using special code and this code could easily be
> made to check other dirs/files as well.
> The point is that it should be possible to change this default
> at the Python level, without needing application code changes.

Ok, I'm not opposed to it FWIW. Just saying I'm pretty sure those things
already exist in the form of environment variables and at the python level
APIs. Not sure what else there is, global state for the "default"? A CLI
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


From solipsis at  Sat Aug 30 00:22:54 2014
From: solipsis at (Antoine Pitrou)
Date: Sat, 30 Aug 2014 00:22:54 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
References: <>
Message-ID: <20140830002254.26351339@fsol>

On Fri, 29 Aug 2014 18:08:19 -0400
Donald Stufft <donald at> wrote:
> > 
> > Are you sure that's possible ? Python doesn't load the
> > openssl.cnf file and the SSL_CERT_FILE, SSL_CERT_DIR env
> > vars only work for the openssl command line binary, AFAIK.
> I'm not 100% sure on that. I know they are not limited to the command
> line binary as ruby uses those environment variables in the way I
> described above.

SSL_CERT_DIR and SSL_CERT_FILE are used, if set, when
SSLContext.load_verify_locations() is called.

Actually, come to think of it, this allows us to write a better
test for that method. Patch welcome!
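
A quick way to see which environment variables the linked OpenSSL consults
is ssl.get_default_verify_paths(). The sketch below assumes Python 3.6+
(for PROTOCOL_TLS_CLIENT). As far as I can tell, the variables take effect
when the default locations are loaded (set_default_verify_paths() or
load_default_certs()), so that is the call used here; treat that as an
assumption to check against your own build:

```python
import ssl

# Report the environment variable names and the paths currently in effect.
paths = ssl.get_default_verify_paths()
print("file env var:", paths.openssl_cafile_env)
print("dir env var: ", paths.openssl_capath_env)
print("cafile in use:", paths.cafile)
print("capath in use:", paths.capath)

# Load whatever the default locations (or the env vars overriding them)
# point at into a fresh client context.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.set_default_verify_paths()
```

Running this once with and once without SSL_CERT_FILE set in the
environment shows whether the override is picked up.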



From rdmurray at  Sat Aug 30 00:57:35 2014
From: rdmurray at (R. David Murray)
Date: Fri, 29 Aug 2014 18:57:35 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

On Fri, 29 Aug 2014 18:00:50 -0400, Donald Stufft <donald at> wrote:
> On Aug 29, 2014, at 5:42 PM, R. David Murray <rdmurray at> wrote:
> > Especially if you want an accelerated change, there must be a way to
> > *easily* get back to the previous behavior, or we are going to catch a
> > lot of flack.  There may be only 7% of public certs that are problematic,
> > but I'd be willing to bet you that there are more not-really-public ones
> > that are critical to day to day operations *somewhere* :)
> > 
> > wget and curl have 'ignore validation' as a command line flag for a reason.
> > 
> Right, that's why I'm on the fence :)
> On one hand, it's going to break things for some people, (arguably they are
> already broken, just silently so, but we'll leave that argument aside) and a
> way to get back the old behavior is good. There are already ways within
> the Python code itself, so that's covered. From outside of the Python code
> there are ways if the certificate is untrusted but otherwise valid which are
> pretty easy to do. The major "gap" is when you have an actual invalid
> certificate due to expiration or hostname or some other such thing.
> On the other hand Python is not wget/curl and the people who are most
> likely to be the target for a "I can't change the code but I need to get the
> old behavior back" are people who are likely to not be invoking Python
> itself but using something written in Python which happens to be using
> Python. IOW they might be executing "foobar" not "python -m foobar".

Right, so an environment variable is better than a command line switch,
for Python.

> Like I said though, I'm personally fine either way so don't take this as
> being against that particular change!



From greg.ewing at  Sat Aug 30 01:37:18 2014
From: greg.ewing at (Greg Ewing)
Date: Sat, 30 Aug 2014 11:37:18 +1200
Subject: [Python-Dev] surrogatepass - she's a witch,
 burn 'er!  [was: Cleaning up ...]
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

M.-A. Lemburg wrote:
> we needed
> a way to make sure that Python 3 also optionally supports working
> with lone surrogates in such UTF-8 streams (nowadays called CESU-8:

I don't think CESU-8 is the same thing. According to the wiki
page, CESU-8 *requires* all code points above 0xffff to be split
into surrogate pairs before encoding. It also doesn't say that
lone surrogates are valid -- it doesn't mention lone surrogates
at all, only pairs. Neither does the linked technical report.

The technical report also says that CESU-8 forbids any UTF-8
sequences of more than three bytes, so it's definitely not
"UTF-8 plus lone surrogates".

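The difference between CESU-8 and UTF-8 for code points above 0xffff can be
seen with a small sketch. The function name cesu8_encode is made up for
illustration (Python ships no CESU-8 codec); it follows the description
above by splitting the text into UTF-16 code units and encoding each unit
separately, using surrogatepass so the surrogate halves get through:

```python
def cesu8_encode(text):
    """Encode well-formed text as CESU-8 (illustrative, not a stdlib codec)."""
    utf16 = text.encode("utf-16-be")
    # Split into 16-bit code units; astral characters become surrogate pairs.
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    # Encode each unit on its own; no sequence is longer than three bytes.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


astral = "\U0001F600"
print(astral.encode("utf-8"))   # one four-byte sequence
print(cesu8_encode(astral))     # two three-byte sequences
```

The two outputs differ, and the CESU-8 one contains an encoded surrogate
pair rather than a lone surrogate.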

From alex.gaynor at  Sat Aug 30 04:44:12 2014
From: alex.gaynor at (Alex Gaynor)
Date: Sat, 30 Aug 2014 02:44:12 +0000 (UTC)
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
References: <>
Message-ID: <>

Thanks for the rapid feedback everyone!

I want to summarize the action items and discussion points that have come up so
far:
To add to the PEP:

* Emit a warning in for cases that would raise a Exception in 3.5
* Clearly state that the existing OpenSSL environment variables will be
  respected for setting the trust root

Discussion points:

* Disabling verification entirely externally to the program, through a CLI flag
  or environment variable. I'm pretty down on this idea, the problem you hit is
  that it's a pretty blunt instrument to swing, and it's almost impossible to
  imagine it not hitting things it shouldn't; it's far too likely to be used in
  applications that make two sets of outbound connections: 1) to some internal
  service which you want to disable verification on, and 2) some external
  service which needs strong validation. A global flag causes the latter to
  fail silently when subjected to a MITM attack, and that's exactly what we're
  trying to avoid. It also makes things much harder for library authors: I
  write an API client for some API, and make TLS connections to it. I want
  those to be verified by default. I can't even rely on the httplib defaults,
  because someone might disable them from the outside.
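
The alternative to a global flag is scoping the relaxed setting to the one
connection that needs it. A minimal sketch, assuming Python 3.4+ for
ssl.create_default_context(); the function names and the idea of an
"internal" service with its own CA file are hypothetical:

```python
import ssl


def external_context():
    """Strict defaults: chain validation plus hostname checking."""
    return ssl.create_default_context()


def internal_context(cafile=None):
    """Context for a hypothetical internal service only.

    Preferably trust the service's own CA file; fall back to no
    verification only when no CA file is available.
    """
    if cafile is not None:
        return ssl.create_default_context(cafile=cafile)
    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # must be cleared before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

Each context is passed explicitly to the connection that uses it, so
turning verification off for the internal service leaves the external one
untouched.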


From stephen at  Sat Aug 30 06:21:56 2014
From: stephen at (Stephen J. Turnbull)
Date: Sat, 30 Aug 2014 13:21:56 +0900
Subject: [Python-Dev] surrogatepass - she's a witch, burn 'er!
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

Greg Ewing writes:
 > M.-A. Lemburg wrote:
 > > we needed
 > > a way to make sure that Python 3 also optionally supports working
 > > with lone surrogates in such UTF-8 streams (nowadays called CESU-8:
 > >

Besides what Greg says, CESU-8 is a UTF, and therefore encodes valid
Unicode.  Speaking imprecisely, CESU-8 is UTF-16 with variable-width
code units (ie, each 16-bit code point is represented using the UTF-8
variable-width representation).[1]

I think you are thinking of Markus Kuhn's utf-8b (which I believe is
exactly what is implemented by the surrogateescape handler).

As far as the goal of "working with lone surrogates in such UTF-8
streams", the surrogateescape handler already permits that, and does
so consistently across streams in the sense that lone surrogates in
the UTF-8 stream cannot be mixed with garbage bytes decoded by
surrogateescape in another stream, which produces an unencodable mess.

I still don't see a justification for the surrogatepass handler.  What
applications are producing (not merely passing through) UTF-8-encoded
surrogates these days?

[1]  For the curious, it's imprecise because in Unicode code units are
fixed-width by definition.
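
The behaviour of the two error handlers can be compared directly. A small
sketch using only the built-in codecs; the sample byte strings are
arbitrary:

```python
# surrogateescape: undecodable bytes map to U+DC80..U+DCFF and round-trip.
raw = b"caf\xe9"                      # Latin-1 bytes, not valid UTF-8
escaped = raw.decode("utf-8", "surrogateescape")
roundtrip = escaped.encode("utf-8", "surrogateescape")

# surrogatepass: a lone surrogate is written out as a three-byte sequence.
lone = "\ud800"
passed = lone.encode("utf-8", "surrogatepass")
restored = passed.decode("utf-8", "surrogatepass")


def can_encode(text, errors):
    """Report whether text encodes to UTF-8 under the given error handler."""
    try:
        text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return False
    return True


# A high surrogate is outside the range surrogateescape knows how to undo,
# so mixing the two kinds of data yields a string that handler rejects.
print(can_encode(lone + escaped, "surrogateescape"))
```

The last line illustrates the "unencodable mess": a string holding both a
genuine lone surrogate and escaped garbage bytes cannot be written back out
with surrogateescape alone.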

From mal at  Sat Aug 30 12:03:06 2014
From: mal at (M.-A. Lemburg)
Date: Sat, 30 Aug 2014 12:03:06 +0200
Subject: [Python-Dev] surrogatepass - she's a witch,
 burn 'er!  [was: Cleaning up ...]
In-Reply-To: <>
References: <>	<>	<>	<>
Message-ID: <>

On 30.08.2014 01:37, Greg Ewing wrote:
> M.-A. Lemburg wrote:
>> we needed
>> a way to make sure that Python 3 also optionally supports working
>> with lone surrogates in such UTF-8 streams (nowadays called CESU-8:
> I don't think CESU-8 is the same thing. According to the wiki
> page, CESU-8 *requires* all code points above 0xffff to be split
> into surrogate pairs before encoding. It also doesn't say that
> lone surrogates are valid -- it doesn't mention lone surrogates
> at all, only pairs. Neither does the linked technical report.
> The technical report also says that CESU-8 forbids any UTF-8
> sequences of more than three bytes, so it's definitely not
> "UTF-8 plus lone surrogates".

You're right, it's not the same as UTF-8 plus lone surrogates.

CESU-8 does encode surrogates as individual code points using
the UTF-8 encoding, which is what probably caused it to be
mentioned in discussions when talking about having UTF-8 streams
do the same for lone surrogates.

So let's call the encoding UTF-8-py so that everyone knows
what we're talking about :-)

Marc-Andre Lemburg


From mal at  Sat Aug 30 12:19:11 2014
From: mal at (M.-A. Lemburg)
Date: Sat, 30 Aug 2014 12:19:11 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

On 30.08.2014 04:44, Alex Gaynor wrote:
> Thanks for the rapid feedback everyone!
> I want to summarize the action items and discussion points that have come up so
> far:
> To add to the PEP:
> * Emit a warning in for cases that would raise a Exception in 3.5
> * Clearly state that the existing OpenSSL environment variables will be
>   respected for setting the trust root

I'd also suggest to compile Python with OPENSSL_LOAD_CONF, since that
causes OpenSSL to read the global openssl.cnf file for additional
configuration.

> Discussion points:
> * Disabling verification entirely externally to the program, through a CLI flag
>   or environment variable. I'm pretty down on this idea, the problem you hit is
>   that it's a pretty blunt instrument to swing, and it's almost impossible to
>   imagine it not hitting things it shouldn't; it's far too likely to be used in
>   applications that make two sets of outbound connections: 1) to some internal
>   service which you want to disable verification on, and 2) some external
>   service which needs strong validation. A global flag causes the latter to
>   fail silently when subjected to a MITM attack, and that's exactly what we're
>   trying to avoid. It also makes things much harder for library authors: I
>   write an API client for some API, and make TLS connections to it. I want
>   those to be verified by default. I can't even rely on the httplib defaults,
>   because someone might disable them from the outside.

The reasoning here is the same as for hash randomization. There
are cases where you want to test your application using self-signed
certificates which don't validate against the system CA root list.

In those cases, you do know what you're doing. The test would fail
otherwise and the reason is not a bug in your code, it's just
the fact that the environment you're running it in is a test

Ideally, all applications should give you this choice, but this is
unlikely to happen, so it's good to be able to change the Python
default, since with the proposed change, most applications will
probably continue to use the Python defaults as they do now.

Marc-Andre Lemburg


From solipsis at  Sat Aug 30 12:40:26 2014
From: solipsis at (Antoine Pitrou)
Date: Sat, 30 Aug 2014 12:40:26 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
References: <>
Message-ID: <20140830124026.6dd1d92b@fsol>

On Sat, 30 Aug 2014 12:19:11 +0200
"M.-A. Lemburg" <mal at> wrote:
> > To add to the PEP:
> > 
> > * Emit a warning in for cases that would raise a Exception in 3.5
> > * Clearly state that the existing OpenSSL environment variables will be
> >   respected for setting the trust root
> I'd also suggest to compile Python with OPENSSL_LOAD_CONF, since that
> causes OpenSSL to read the global openssl.cnf file for additional
> configuration.

Python links against OpenSSL as a shared library, not statically. It's
unlikely that setting a compile constant inside Python would affect
OpenSSL at all.

> > Discussion points:
> > 
> > * Disabling verification entirely externally to the program, through a CLI flag
> >   or environment variable. I'm pretty down on this idea, the problem you hit is
> >   that it's a pretty blunt instrument to swing, and it's almost impossible to
> >   imagine it not hitting things it shouldn't; it's far too likely to be used in
> >   applications that make two sets of outbound connections: 1) to some internal
> >   service which you want to disable verification on, and 2) some external
> >   service which needs strong validation. A global flag causes the latter to
> >   fail silently when subjected to a MITM attack, and that's exactly what we're
> >   trying to avoid. It also makes things much harder for library authors: I
> >   write an API client for some API, and make TLS connections to it. I want
> >   those to be verified by default. I can't even rely on the httplib defaults,
> >   because someone might disable them from the outside.
> The reasoning here is the same as for hash randomization. There
> are cases where you want to test your application using self-signed
> certificates which don't validate against the system CA root list.

That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE
env vars (or, better, by specific settings *inside* the application).

I'm against multiplying environment variables, as it makes it more
difficult to assess the actual security of a setting. The danger of an
ill-secure setting is much more severe than with hash randomization.



From mal at  Sat Aug 30 12:46:47 2014
From: mal at (M.-A. Lemburg)
Date: Sat, 30 Aug 2014 12:46:47 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <20140830124026.6dd1d92b@fsol>
References: <>	<>	<>
Message-ID: <>

On 30.08.2014 12:40, Antoine Pitrou wrote:
> On Sat, 30 Aug 2014 12:19:11 +0200
> "M.-A. Lemburg" <mal at> wrote:
>>> To add to the PEP:
>>> * Emit a warning in for cases that would raise a Exception in 3.5
>>> * Clearly state that the existing OpenSSL environment variables will be
>>>   respected for setting the trust root
>> I'd also suggest to compile Python with OPENSSL_LOAD_CONF, since that
>> causes OpenSSL to read the global openssl.cnf file for additional
>> configuration.
> Python links against OpenSSL as a shared library, not statically. It's
> unlikely that setting a compile constant inside Python would affect
> OpenSSL at all.

The change is to the OpenSSL API, not the OpenSSL lib. By setting
the variable you enable a few special calls to the config loader
functions in OpenSSL when calling the initializer it:

>>> Discussion points:
>>> * Disabling verification entirely externally to the program, through a CLI flag
>>>   or environment variable. I'm pretty down on this idea, the problem you hit is
>>>   that it's a pretty blunt instrument to swing, and it's almost impossible to
>>>   imagine it not hitting things it shouldn't; it's far too likely to be used in
>>>   applications that make two sets of outbound connections: 1) to some internal
>>>   service which you want to disable verification on, and 2) some external
>>>   service which needs strong validation. A global flag causes the latter to
>>>   fail silently when subjected to a MITM attack, and that's exactly what we're
>>>   trying to avoid. It also makes things much harder for library authors: I
>>>   write an API client for some API, and make TLS connections to it. I want
>>>   those to be verified by default. I can't even rely on the httplib defaults,
>>>   because someone might disable them from the outside.
>> The reasoning here is the same as for hash randomization. There
>> are cases where you want to test your application using self-signed
>> certificates which don't validate against the system CA root list.
> That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE
> env vars (or, better, by specific settings *inside* the application).
> I'm against multiplying environment variables, as it makes it more
> difficult to assess the actual security of a setting. The danger of an
> ill-secure setting is much more severe than with hash randomization.

You have a point there. So how about just a python run-time switch
and no env var ?

Marc-Andre Lemburg


From p.f.moore at  Sat Aug 30 12:48:55 2014
From: p.f.moore at (Paul Moore)
Date: Sat, 30 Aug 2014 11:48:55 +0100
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

On 30 August 2014 03:44, Alex Gaynor <alex.gaynor at> wrote:
> Discussion points:
> * Disabling verification entirely externally to the program, through a CLI flag
>   or environment variable. I'm pretty down on this idea, the problem you hit is
>   that it's a pretty blunt instrument to swing, and it's almost impossible to
>   imagine it not hitting things it shouldn't

As a data point, I use --no-check-certificates extensively, in wget,
curl and some Python programs which have it, like youtube-dl.

The reason I do so is typically because the programs do not use the
Windows certificate store, and configuring a second certificate store
on a per-program basis is too much of a pain to be worth it
(per-program because the hacks such programs use to get round the fact
that Windows has no central location like /etc are inconsistent).

The key question for me is therefore, does Python's ssl support use
the Windows store directly these days? I checked the docs and couldn't
find anything explicitly stating this (but all the terminology is
foreign to me, so I may have missed it). If it does, programs like
youtube-dl will start to "just work" and I won't have the need for a
"switch off everything" flag.

If a new Python 3.5 installation on a Windows machine will enforce
https cert checking and yet will not check the system store (or, I
guess, come with an embedded store, but aren't there maintenance
issues with doing that?) then I believe a global "don't check" flag
will be needed, as not all programs offer a "don't check certificates"
mode. And naive users like me may not even know how to code the
behaviour for such an option - and the tone of the debate here leads
me to believe that it'll be hard for developers to get unbiased advice
on how to switch off checking, so it'll end up being patchily


From solipsis at  Sat Aug 30 12:55:54 2014
From: solipsis at (Antoine Pitrou)
Date: Sat, 30 Aug 2014 12:55:54 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <20140830124026.6dd1d92b@fsol>
Message-ID: <20140830125554.0a6c06e0@fsol>

On Sat, 30 Aug 2014 12:46:47 +0200
"M.-A. Lemburg" <mal at> wrote:
> The change is to the OpenSSL API, not the OpenSSL lib. By setting
> the variable you enable a few special calls to the config loader
> functions in OpenSSL when calling the initializer it:

Ah, ok. Do you have experience with openssl.cnf? Apparently, it is
meant for offline tools such as certificate generation, I am not sure
how it could impact certification validation.

> > That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE
> > env vars (or, better, by specific settings *inside* the application).
> > 
> > I'm against multiplying environment variables, as it makes it more
> > difficult to assess the actual security of a setting. The danger of an
> > ill-secure setting is much more severe than with hash randomization.
> You have a point there. So how about just a python run-time switch
> and no env var ?

Well, why not, but does it have a value over letting the code properly
configure their SSLContext?



From mal at  Sat Aug 30 14:03:57 2014
From: mal at (M.-A. Lemburg)
Date: Sat, 30 Aug 2014 14:03:57 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <20140830125554.0a6c06e0@fsol>
References: <>	<>	<>
 <20140830124026.6dd1d92b@fsol>	<>
Message-ID: <>

On 30.08.2014 12:55, Antoine Pitrou wrote:
> On Sat, 30 Aug 2014 12:46:47 +0200
> "M.-A. Lemburg" <mal at> wrote:
>> The change is to the OpenSSL API, not the OpenSSL lib. By setting
>> the variable you enable a few special calls to the config loader
>> functions in OpenSSL when calling the initializer it:
> Ah, ok. Do you have experience with openssl.cnf? Apparently, it is
> meant for offline tools such as certificate generation, I am not sure
> how it could impact certification validation.

I'm still exploring this: the OpenSSL documentation is, well,
less than complete on these things, so searching mailing lists
and reading source code appears to be the only reasonable way
to figure out what is possible and what not.

The openssl.cnf config file is indeed mostly used by the various
openssl subcommands (e.g. req and ca), but it can also be used to
configure engines and my hope is that configuration of
e.g. default certificate stores also becomes possible.

One of the engines can tap into the Windows certificate store,
for example.

>>> That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE
>>> env vars (or, better, by specific settings *inside* the application).
>>> I'm against multiplying environment variables, as it makes it more
>>> difficult to assess the actual security of a setting. The danger of an
>>> ill-secure setting is much more severe than with hash randomization.
>> You have a point there. So how about just a python run-time switch
>> and no env var ?
> Well, why not, but does it have a value over letting the code properly
> configure their SSLContext?

Yes, because when Python changes the default to be validating
and more secure, application developers will do the same as
they do now: simply use the defaults ;-)

Marc-Andre Lemburg

Professional Python Services directly from the Source  (#1, Aug 30 2014)
>>> Python Projects, Consulting and Support ...
>>> mxODBC.Zope/Plone.Database.Adapter ...
>>> mxODBC, mxDateTime, mxTextTools ...
2014-08-27: Released eGenix PyRun 2.0.1 ...
2014-09-19: PyCon UK 2014, Coventry, UK ...                20 days to go
2014-09-27: PyDDF Sprint 2014 ...                          28 days to go Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611

From rdmurray at  Sat Aug 30 15:32:32 2014
From: rdmurray at (R. David Murray)
Date: Sat, 30 Aug 2014 09:32:32 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140830124026.6dd1d92b@fsol> <>
 <20140830125554.0a6c06e0@fsol> <>
Message-ID: <>

On Sat, 30 Aug 2014 14:03:57 +0200, "M.-A. Lemburg" <mal at> wrote:
> On 30.08.2014 12:55, Antoine Pitrou wrote:
> > On Sat, 30 Aug 2014 12:46:47 +0200
> > "M.-A. Lemburg" <mal at> wrote:
> >>> That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE
> >>> env vars (or, better, by specific settings *inside* the application).
> >>>
> >>> I'm against multiplying environment variables, as it makes it more
> >>> difficult to assess the actual security of a setting. The danger of an
> >>> ill-secure setting is much more severe than with hash randomization.
> >>
> >> You have a point there. So how about just a python run-time switch
> >> and no env var ?
> > 
> > Well, why not, but does it have a value over letting the code properly
> > configure their SSLContext?
> Yes, because when Python changes the default to be validating
> and more secure, application developers will do the same as
> they do now: simply use the defaults ;-)

But neither of those addresses the articulated use case: someone *using*
a program implemented in python that does not itself provide a way to
disable the new default security (because it is *new*).  Only an
environment variable will do that.

Since the environment variable is opt-in, I think the "consenting
adults" argument applies to Alex's demurral about "multiple connections".
It could still emit the warnings.


From mal at  Sat Aug 30 16:20:22 2014
From: mal at (M.-A. Lemburg)
Date: Sat, 30 Aug 2014 16:20:22 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>	<>
 <>	<20140830124026.6dd1d92b@fsol>
 <>	<20140830125554.0a6c06e0@fsol>
 <> <>
Message-ID: <>

On 30.08.2014 15:32, R. David Murray wrote:
> On Sat, 30 Aug 2014 14:03:57 +0200, "M.-A. Lemburg" <mal at> wrote:
>> On 30.08.2014 12:55, Antoine Pitrou wrote:
>>> On Sat, 30 Aug 2014 12:46:47 +0200
>>> "M.-A. Lemburg" <mal at> wrote:
>>>>> That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE
>>>>> env vars (or, better, by specific settings *inside* the application).
>>>>> I'm against multiplying environment variables, as it makes it more
>>>>> difficult to assess the actual security of a setting. The danger of an
>>>>> ill-secure setting is much more severe than with hash randomization.
>>>> You have a point there. So how about just a python run-time switch
>>>> and no env var ?
>>> Well, why not, but does it have a value over letting the code properly
>>> configure their SSLContext?
>> Yes, because when Python changes the default to be validating
>> and more secure, application developers will do the same as
>> they do now: simply use the defaults ;-)
> But neither of those addresses the articulated use case: someone *using*
> a program implemented in python that does not itself provide a way to
> disable the new default security (because it is *new*).  Only an
> environment variable will do that.
> Since the environment variable is opt-in, I think the "consenting
> adults" argument applies to Alex's demure about "multiple connections".
> It could still emit the warnings.

That would be a possibility as well, yes.

I'd just like to see a way to say: I know what I'm doing
and I'm not in the mood to configure my own CA list, so
please go ahead and accept whatever certs you find --
much like what --no-check-certificate does for wget.

Marc-Andre Lemburg


From Steve.Dower at  Sat Aug 30 16:24:05 2014
From: Steve.Dower at (Steve Dower)
Date: Sat, 30 Aug 2014 14:24:05 +0000
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: <>
References: <>
 <> <>
 <20140830124026.6dd1d92b@fsol> <>
Message-ID: <>

This sounds great, but the disable switch worries me if it's an ENVVAR=1 kind of deal. Those switches have a tendency on Windows of becoming "well known tricks" and they get set globally and permanently, often by application installers or sysadmins (PYTHONPATH suffers the exact same problem).

It sounds like the likely approach is a certificate name, which is fine, provided there's no option for "accept everything". I just wanted to get an early vote in against a boolean switch.



From alex.gaynor at  Sat Aug 30 17:22:01 2014
From: alex.gaynor at (Alex Gaynor)
Date: Sat, 30 Aug 2014 15:22:01 +0000 (UTC)
Subject: [Python-Dev]
References: <>
Message-ID: <>

The Windows certificate store is used by ``load_default_certs``:


Cheers, Alex

From p.f.moore at  Sat Aug 30 17:36:23 2014
From: p.f.moore at (Paul Moore)
Date: Sat, 30 Aug 2014 16:36:23 +0100
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

On 30 August 2014 16:22, Alex Gaynor <alex.gaynor at> wrote:
> The Windows certificate store is used by ``load_default_certs``:

Cool, in which case this sounds like a good plan. I have no particular
opinion on whether there should be a global Python-level "don't check
certificates" option, but I would suggest that the docs include a
section explaining how a user can implement a
"--no-check-certificates" flag in their program if they want to (with
appropriate warnings as to the risks, of course!). Better to explain
how to do it properly than to say "you shouldn't do that" and have
developers implement awkward or incorrect hacks in spite of the


From marko at  Sat Aug 30 18:17:28 2014
From: marko at (Marko Rauhamaa)
Date: Sat, 30 Aug 2014 19:17:28 +0300
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
 (Paul Moore's message of "Sat, 30 Aug 2014 16:36:23 +0100")
References: <>
Message-ID: <>

Paul Moore <p.f.moore at>:

> Cool, in which case this sounds like a good plan. I have no particular
> opinion on whether there should be a global Python-level "don't check
> certificates" option, but I would suggest that the docs include a
> section explaining how a user can implement a
> "--no-check-certificates" flag in their program if they want to (with
> appropriate warnings as to the risks, of course!). Better to explain
> how to do it properly than to say "you shouldn't do that" and have
> developers implement awkward or incorrect hacks in spite of the
> advice.

Will there be a way to specify a particular CA certificate (as in "wget

Will there be a way to specify a particular CA certificate directory (as
in "wget --ca-directory")?
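
For reference, a sketch of how an application can already do this at the
``ssl`` module level in Python 3.4+, assuming the CA file is in PEM format
and the directory uses OpenSSL's hashed layout. ``context_for_ca`` is an
illustrative name, not an existing API:

```python
import ssl


def context_for_ca(cafile=None, capath=None):
    """Build a verifying client context from a CA file and/or directory.

    With both arguments left as None, create_default_context() falls back
    to the system's default CA certificates.
    """
    # cafile: path to a PEM file of concatenated CA certificates.
    # capath: path to a directory of CA certificates in OpenSSL's layout.
    return ssl.create_default_context(cafile=cafile, capath=capath)
```

Whether a program exposes these as command line options is still up to
the program itself.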


From barry at  Sat Aug 30 18:42:12 2014
From: barry at (Barry Warsaw)
Date: Sat, 30 Aug 2014 09:42:12 -0700
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <20140830094212.5e9c13b5@anarchist>

On Aug 30, 2014, at 12:19 PM, M.-A. Lemburg wrote:

>The reasoning here is the same as for hash randomization. There
>are cases where you want to test your application using self-signed
>certificates which don't validate against the system CA root list.
>In those cases, you do know what you're doing. The test would fail
>otherwise and the reason is not a bug in your code, it's just
>the fact that the environment you're running it in is a test

Exactly.  I have test cases where I have to load up a self-signed cert via
.load_cert_chain() and in the good-path tests, I expect to make successful
https connections.  I also have test cases that expect to fail when:

 * I load bogus self-signed certs
 * I have an http server masquerading as an https server
 * I load an expired self-signed cert

It certainly makes sense for the default to be the most secure, but other use
cases must be preserved.


From christian at  Sat Aug 30 19:21:41 2014
From: christian at (Christian Heimes)
Date: Sat, 30 Aug 2014 19:21:41 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

On 30.08.2014 17:22, Alex Gaynor wrote:
> The Windows certificate store is used by ``load_default_certs``:
> *
> *

The Windows part of load_default_certs() has one major flaw: it can only
load certificates that are already in Windows's cert store. However
Windows comes only with a small set of default certs and downloads more
certs on demand. In order to trigger a download Python or OpenSSL would
have to use the Windows API to verify root certificates.
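
A rough way to see what actually got loaded on a given machine, assuming
Python 3.6+ for ``ssl.PROTOCOL_TLS_CLIENT``. ``default_store_summary`` is
an illustrative helper name, and the counts will differ from system to
system:

```python
import ssl
import sys


def default_store_summary():
    """Report how many certificates load_default_certs() put in a context."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.load_default_certs()
    # cert_store_stats() counts only certificates already loaded into
    # memory; certificates in a capath directory are loaded lazily and
    # are therefore not counted here.
    summary = dict(context.cert_store_stats())
    if sys.platform == "win32":
        # enum_certificates() exists only on Windows; "ROOT" is the
        # system store of trusted root certificates.
        summary["windows_root_entries"] = len(ssl.enum_certificates("ROOT"))
    return summary
```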


From martin at  Sat Aug 30 22:03:20 2014
From: martin at (martin at
Date: Sat, 30 Aug 2014 22:03:20 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

Quoting Christian Heimes <christian at>:

> On 30.08.2014 17:22, Alex Gaynor wrote:
>> The Windows certificate store is used by ``load_default_certs``:
>> *
>> *
> The Windows part of load_default_certs() has one major flaw: it can only
> load certificates that are already in Windows's cert store. However
> Windows comes only with a small set of default certs and downloads more
> certs on demand. In order to trigger a download Python or OpenSSL would
> have to use the Windows API to verify root certificates.

It's better than you think. Vista+ has a weekly prefetching procedure that
should ensure that virtually all root certificates are available:

BTW, it's patented:


From ncoghlan at  Sun Aug 31 01:26:30 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 31 Aug 2014 09:26:30 +1000
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

On 30 Aug 2014 06:08, "Ethan Furman" <ethan at> wrote:
> On 08/29/2014 01:00 PM, M.-A. Lemburg wrote:
>> On 29.08.2014 21:47, Alex Gaynor wrote:
>>> I've just submitted PEP 476, on enabling certificate validation by default
>>> for HTTPS clients in Python. Please have a look and let me know what you
>> Thanks for the PEP. I think this is generally a good idea,
>> but some important parts are missing from the PEP:
>>   * transition plan:
>>     I think starting with warnings in Python 3.5 and going
>>     for exceptions in 3.6 would make a good transition
>>     Going straight for exceptions in 3.5 is not in line with
>>     our normal procedures for backwards incompatible changes.
>>   * configuration:
>>     It would be good to be able to switch this on or off
>>     without having to change the code, e.g. via a command
>>     line switch and environment variable; perhaps even
>>     controlling whether or not to raise an exception or
>>     warning.
>>   * choice of trusted certificate:
>>     Instead of hard wiring using the system CA roots into
>>     Python it would be good to just make this default and
>>     permit the user to point Python to a different set of
>>     CA roots.
>>     This would enable using self signed certs more easily.
>>     Since these are often used for tests, demos and education,
>>     I think it's important to allow having more control of
>>     the trusted certs.
> +1 for PEP with above changes.

Ditto from me.

In relation to changing the Python CLI API to offer some of the wget/curl
style command line options, I like the idea of providing recipes in the
docs for implementing them at the application layer, but postponing making
the *default* behaviour configurable that way.

Longer term, I'd like to actually have a per-runtime configuration file for
some of these things that also integrated with the pyvenv support, but that
requires untangling the current startup code first (and there are only so
many hours in the day).



From solipsis at  Sun Aug 31 03:25:25 2014
From: solipsis at (Antoine Pitrou)
Date: Sun, 31 Aug 2014 03:25:25 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
References: <>
 <> <>
Message-ID: <20140831032525.19b7e48c@fsol>

On Sun, 31 Aug 2014 09:26:30 +1000
Nick Coghlan <ncoghlan at> wrote:
> >>
> >>   * configuration:
> >>
> >>     It would be good to be able to switch this on or off
> >>     without having to change the code, e.g. via a command
> >>     line switch and environment variable; perhaps even
> >>     controlling whether or not to raise an exception or
> >>     warning.
> >>
> >>   * choice of trusted certificate:
> >>
> >>     Instead of hard wiring using the system CA roots into
> >>     Python it would be good to just make this default and
> >>     permit the user to point Python to a different set of
> >>     CA roots.
> >>
> >>     This would enable using self signed certs more easily.
> >>     Since these are often used for tests, demos and education,
> >>     I think it's important to allow having more control of
> >>     the trusted certs.
> >
> >
> > +1 for PEP with above changes.
> Ditto from me.
> In relation to changing the Python CLI API to offer some of the wget/curl
> style command line options, I like the idea of providing recipes in the
> docs for implementing them at the application layer, but postponing making
> the *default* behaviour configurable that way.

I'm against any additional environment variables and command-line
options. It will only complicate and obscure the security parameters of
certificate validation.

The existing knobs have already been mentioned in this thread, I won't
mention them here again.



From rdmurray at  Sun Aug 31 04:21:49 2014
From: rdmurray at (R. David Murray)
Date: Sat, 30 Aug 2014 22:21:49 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <20140831032525.19b7e48c@fsol>
References: <>
 <> <>
Message-ID: <>

On Sun, 31 Aug 2014 03:25:25 +0200, Antoine Pitrou <solipsis at> wrote:
> On Sun, 31 Aug 2014 09:26:30 +1000
> Nick Coghlan <ncoghlan at> wrote:
> > >>
> > >>   * configuration:
> > >>
> > >>     It would be good to be able to switch this on or off
> > >>     without having to change the code, e.g. via a command
> > >>     line switch and environment variable; perhaps even
> > >>     controlling whether or not to raise an exception or
> > >>     warning.
> > >>
> > >>   * choice of trusted certificate:
> > >>
> > >>     Instead of hard wiring using the system CA roots into
> > >>     Python it would be good to just make this default and
> > >>     permit the user to point Python to a different set of
> > >>     CA roots.
> > >>
> > >>     This would enable using self signed certs more easily.
> > >>     Since these are often used for tests, demos and education,
> > >>     I think it's important to allow having more control of
> > >>     the trusted certs.
> > >
> > >
> > > +1 for PEP with above changes.
> > 
> > Ditto from me.
> > 
> > In relation to changing the Python CLI API to offer some of the wget/curl
> > style command line options, I like the idea of providing recipes in the
> > docs for implementing them at the application layer, but postponing making
> > the *default* behaviour configurable that way.
> I'm against any additional environment variables and command-line
> options. It will only complicate and obscure the security parameters of
> certificate validation.
> The existing knobs have already been mentioned in this thread, I won't
> mention them here again.

Do those knobs allow one to instruct urllib to accept an invalid
certificate without changing the program code?


From stephen at  Sun Aug 31 07:53:17 2014
From: stephen at (Stephen J. Turnbull)
Date: Sun, 31 Aug 2014 14:53:17 +0900
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
Message-ID: <>

martin at writes:

 > BTW, it's patented:

Damn them.  I hope they never get a look at my crontab.

From ncoghlan at  Sun Aug 31 08:09:26 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 31 Aug 2014 16:09:26 +1000
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

On 31 August 2014 12:21, R. David Murray <rdmurray at> wrote:
> On Sun, 31 Aug 2014 03:25:25 +0200, Antoine Pitrou <solipsis at> wrote:
>> On Sun, 31 Aug 2014 09:26:30 +1000
>> Nick Coghlan <ncoghlan at> wrote:
>> > In relation to changing the Python CLI API to offer some of the wget/curl
>> > style command line options, I like the idea of providing recipes in the
>> > docs for implementing them at the application layer, but postponing making
>> > the *default* behaviour configurable that way.
>> I'm against any additional environment variables and command-line
>> options. It will only complicate and obscure the security parameters of
>> certificate validation.

As Antoine says here, I'm also opposed to adding more Python specific
configuration options. However, I think there may be something
worthwhile we can do that's closer to the way browsers work, and has
the significant benefit of being implementable as a PyPI module first
(more on that in a separate reply).

>> The existing knobs have already been mentioned in this thread, I won't
>> mention them here again.
> Do those knobs allow one to instruct urllib to accept an invalid
> certificate without changing the program code?

Only if you add the specific certificate concerned to the certificate
store that Python is using (which PEP 476 currently suggests will be
the platform wide certificate store). Whether or not that is an
adequate solution is the point currently in dispute.

My view is that the core problem/concern we need to address here is
how we manage the migration away from a network communication model
that trusts the network by default. That transition will happen
regardless of whether or not we adapt Python as a platform - the
challenge for us is how we can address it in a way that minimises the
impact on existing users, while still ensuring future users are
protected by default.

This would be relatively easy if we only had to worry about the public
internet (since we're followers rather than leaders in that
environment), but we don't. Python made the leap into enterprise
environments long ago, so we not only need to cope with corporate
intranets, we need to cope with corporate intranets that aren't
necessarily being well managed. That's what makes this a harder
problem for us than it is for a new language like Go that was created
by a public internet utility, specifically for use over the public
internet - they didn't *have* an installed base to manage, they could
just build a language specifically tailored for the task of running
network services on Linux, without needing to account for any other
use cases.

The reason our existing installed base creates a problem is because
corporate network security has historically focused on "perimeter
defence": carving out a trusted island behind the corporate firewall
where users and other computer systems could be "safely" assumed not
to be malicious.

As an industry, we have learned through harsh experience that *this
model doesn't work*. You can't trust the network, period. A corporate
intranet is *less* dangerous than the public internet, but you still
can't trust it. This "don't trust the network" ethos is also
reinforced by the broad shift to "utility computing" where more and
more companies are running distributed networks, where some of their
systems are actually running on vendor provided servers. The "network
perimeter" is evaporating, as corporate "intranets" start to look a
lot more like recreations of the internet in miniature, with the only
difference being the existence of more formal contractual
relationships than typically exist between internet peers.

Unfortunately, far too many organisations (especially those outside
the tech industry) still trust in perimeter defence for their internal
network security, and hence tolerate the use of unsecured connections,
or skipping certificate validation internally. This is actually a
really terrible idea, but it's still incredibly common due to the
general failure of the technology industry to take usability issues
seriously when we design security systems - doing the wrong "unsafe"
thing is genuinely easier than doing things right.

We have enough evidence now to be able to say (as Alex does in PEP
476) that it has been comprehensively demonstrated that "opt-in
security" really just means "security failures are common and silent
by default". We've seen it with C buffer overflow vulnerabilities,
we've seen it with plain text communication links, we've seen it with
SSL certificate validation - the vast majority of users and developers
will just run with the default behaviour of the platform or
application they're using, even if those defaults have serious
problems. As the saying goes, "you can't document your way out of a
usability problem" - connections that are unencrypted, or that are
vulnerable to a man-in-the-middle attack, appear to work for all
functional purposes; they're just vulnerable to monitoring and subversion.

It turns out "opt-out security with a global off switch" isn't
actually much better when it comes to changing *existing* behaviours,
as people just turn the new security features off and continue on as
they were, rather than figuring out what dangers the new security
system is trying to warn them about and encourage them to
pre-emptively address them. Offering that kind of flag may sometimes
be a necessary transition phase (or we wouldn't have things like
"setenforce 0" for SELinux) but it should be considered an absolute
last resort.

In the specific case of network security, we need to take
responsibility as an industry for the endemic failure of the
networking infrastructure to provide robust end user security and
privacy, and figure out how to get to a state where encrypted and
authenticated network connections are as easy to use as unencrypted
ones. I see Alex's PEP (along with the preceding work on the SSL
module that makes it feasible) as a significant step in that

At the same time, we need to account for the fact that most existing
organisations still trust in perimeter defence for their internal
network security, and hence tolerate (or even actively encourage) the
use of unsecured connections, or skipping certificate validation,
internally. This is actually a really terrible idea, but it's still
incredibly common due to the general failure of the technology
industry to take usability issues seriously when we design security
systems (at least until recently) - doing the wrong "unsafe" thing is
genuinely easier than doing things right.

We can, and should, tackle this as a design problem, and ensure PEP
476 covers this scenario adequately. We also need to make sure we do
it in a way that avoids placing any significant additional burdens on
teams that may already be trying to explain what "long term
maintenance" means, and why the flow of free feature releases for the
Python 2 series stopped.

This message is already rather long, however, so I'll go into more
technical details in a separate reply to David's question.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From donald at  Sun Aug 31 08:16:55 2014
From: donald at (Donald Stufft)
Date: Sun, 31 Aug 2014 02:16:55 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
Message-ID: <>

> On Aug 31, 2014, at 2:09 AM, Nick Coghlan <ncoghlan at> wrote:
> At the same time, we need to account for the fact that most existing
> organisations still trust in perimeter defence for their internal
> network security, and hence tolerate (or even actively encourage) the
> use of unsecured connections, or skipping certificate validation,
> internally. This is actually a really terrible idea, but it's still
> incredibly common due to the general failure of the technology
> industry to take usability issues seriously when we design security
> systems (at least until recently) - doing the wrong "unsafe" thing is
> genuinely easier than doing things right.

Just a quick clarification in order to be a little clearer, this change will
(obviously) only affect those who trust perimeter security *and* decided to
install an invalid certificate instead of just using HTTP. I'm not saying that
this doesn't happen, just being specific (I'm not actually sure why they would
install a TLS certificate at all if they are trusting perimeter security, but
I'm sure folks do).

Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


From ncoghlan at  Sun Aug 31 08:24:43 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 31 Aug 2014 16:24:43 +1000
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

On 31 August 2014 12:21, R. David Murray <rdmurray at> wrote:
> Do those knobs allow one to instruct urllib to accept an invalid
> certificate without changing the program code?

My first reply ended up being a context dump of the challenges created
by legacy corporate intranets that may not be immediately obvious to
folks that spend most of their time working on or with the public
internet. I decided to split these more technical details out to a new
reply for the benefit of folks that already know all that history :)

To answer David's specific question, the existing knobs at the OpenSSL
level (SSL_CERT_DIR and SSL_CERT_FILE) let people add an internal CA,
opt out of the default CA system, and trust *specific* self-signed
certs.
What they don't allow is a global "trust any cert" setting -
exceptions need to be added at the individual cert level or at the CA
level, or the application needs to offer an option to not do cert
validation at all. That "trust anything" option at the platform level
is the setting that is a really bad idea - if an organisation thinks
it needs that (because they have a lot of self-signed certs, but
aren't verifying their HTTPS connections to those servers), then what
they really need is an internal CA, where their systems just need to
be set up to trust the internal CA in addition to the platform CA

With Alex's proposal, organisations that are already running an
internal CA should be just fine - Python 3.5 will see the CA cert in
the platform cert store and accept certs signed by it as valid. (Note:
the Python 3.4 warning should take this into account, which could be a
problem since we don't currently do validity checks against the
platform store by default. The PEP needs to cover the mechanics of
that in more detail, as I think it means we'll need to make *some*
changes to the default configuration even in Python 3.4 to get
accurate validity data back from OpenSSL)

However, we also need to accept that there's a reason browser vendors
still offer "click through insecurity" for sites with self-signed
certificates, and tools like wget/curl offer the option to say "don't
check the certificate": these are necessary compromises to make SSL
based network connections actually work on many current corporate

It is corporate environments that also make it desirable to be able to
address this potential problem at a *user* level, since many Python
users in large organisations are actually running Python entirely
out of their home directories, rather than as a system installation
(they may not even have admin access to their own systems).

My suggestion at this point is that we take a leaf from both browser
vendors and the design of SSH: make it easy to *add* a specific
self-signed cert to the set a *particular user* trusts by default
(preferably *only* for a particular host, to limit the power of such
certs). "python -m ssl" doesn't currently do anything interesting, so
it could be used to provide an API for managing that user level
certificate store.
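
To make the idea concrete, a purely hypothetical sketch of such a per-user
store, loosely modelled on SSH's known_hosts: it pins one certificate
fingerprint per host in a JSON file. The class name, file format, and
methods are all assumptions for illustration; nothing like this exists in
the ``ssl`` module today:

```python
import hashlib
import json
from pathlib import Path


class UserCertStore:
    """Hypothetical per-user store of trusted cert fingerprints by host."""

    def __init__(self, path):
        self.path = Path(path)
        # Load existing pins if the file is already there.
        if self.path.exists():
            self.pins = json.loads(self.path.read_text())
        else:
            self.pins = {}

    @staticmethod
    def fingerprint(der_cert):
        # SHA-256 over the DER-encoded certificate bytes.
        return hashlib.sha256(der_cert).hexdigest()

    def trust(self, host, der_cert):
        """Record der_cert as trusted for this one host, and persist it."""
        self.pins[host] = self.fingerprint(der_cert)
        self.path.write_text(json.dumps(self.pins, indent=2, sort_keys=True))

    def is_trusted(self, host, der_cert):
        """True only if this exact certificate was trusted for this host."""
        return self.pins.get(host) == self.fingerprint(der_cert)
```

The DER bytes could come from ``SSLSocket.getpeercert(binary_form=True)``
on a connection made without validation. Tying each pin to a single host
limits what a leaked or careless self-signed cert can be used for.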

A Python-specific user level cert store is something that could be
developed as a PyPI library for Python 2.7.9+ and 3.4+ (Is cert
management considered in scope for If so, that could
be a good home).

So while I agree with the intent of PEP 476, and like the suggested
end state, I'm back to thinking that the transition plan for existing
corporate users needs more work before it can be accepted. This is
especially true since it becomes another barrier to migrating from
Python 2.7 to Python 3.5+ (a warning in Python 3.4 doesn't help with
that aspect, although a new -3 warning might).

A third party module that offers a user level certificate store, and a
gevent.monkey style way of opting in to this behaviour for existing
Python versions would be one way to provide a more compelling
transition plan.


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From ncoghlan at  Sun Aug 31 08:45:42 2014
From: ncoghlan at (Nick Coghlan)
Date: Sun, 31 Aug 2014 16:45:42 +1000
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

On 31 August 2014 16:16, Donald Stufft <donald at> wrote:
> On Aug 31, 2014, at 2:09 AM, Nick Coghlan <ncoghlan at> wrote:
> At the same time, we need to account for the fact that most existing
> organisations still trust in perimeter defence for their internal
> network security, and hence tolerate (or even actively encourage) the
> use of unsecured connections, or skipping certificate validation,
> internally. This is actually a really terrible idea, but it's still
> incredibly common due to the general failure of the technology
> industry to take usability issues seriously when we design security
> systems (at least until recently) - doing the wrong "unsafe" thing is
> genuinely easier than doing things right.
> Just a quick clarification in order to be a little clearer, this change will
> (obviously) only affect those who trust perimeter security *and* decided to
> install an invalid certificate instead of just using HTTP. I'm not saying that
> this doesn't happen, just being specific (I'm not actually sure why they would
> install a TLS certificate at all if they are trusting perimeter security, but
> I'm sure folks do).

It's the end result when a company wide edict to use HTTPS isn't
backed up by the necessary documentation and training on how to get a
properly signed cert from your internal CA (or, even better, when such
an edict comes down without setting up an internal CA first). Folks
hit the internet instead, find instructions on creating a self-signed
cert, install that, and tell their users to ignore the security
warning and accept the cert. Historically, Python clients have "just
worked" in environments that required a click-through on the browser
side, since you had to opt in to checking the certificates properly.

Self-signed certificates can also be really handy for doing local
testing - you're not really aiming to authenticate the connection in
that case, you're just aiming to test that the secure connection
machinery is all working properly.

(As far as the "what about requests?" question goes - that's in a
similar situation to Go, where being new allows it to choose different
defaults, and folks for whom those defaults don't work just won't use
it. There's also the fact that most corporate Python users are
unlikely to know that PyPI exists, let alone that it contains a module
called "requests" that does SSL certificate validation by default.
Those of us in the corporate world that interact directly with
upstream are still the exception rather than the rule)


Nick Coghlan   |   ncoghlan at   |   Brisbane, Australia

From cory at  Sun Aug 31 12:42:12 2014
From: cory at (Cory Benfield)
Date: Sun, 31 Aug 2014 11:42:12 +0100
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

On 31 August 2014 07:45, Nick Coghlan <ncoghlan at> wrote:
> There's also the fact that most corporate Python users are
> unlikely to know that PyPI exists, let alone that it contains a module
> called "requests" that does SSL certificate validation by default.
> Those of us in the corporate world that interact directly with
> upstream are still the exception rather than the rule)

I think this point deserves just a little bit more emphasis. This is
why any solution that begins with 'use PyPI' is insufficient. I've
worked on requests for 3 years now and most of my colleagues have
never heard of it, and it's not because I don't talk about it (I talk
about it all the time!).

When building internal tools, corporate environments frequently
restrict themselves to the standard library. This is because it's hard
enough to get adoption of a tool when it requires a new language
runtime, let alone if you have to get people ramped up on package
distribution as well! I have enough trouble getting people to upgrade
Python versions at work: trying to get them up to speed on pip and
PyPI is worse.

It is no longer tenable in the long term for Python to trust the
network: you're right in this regard Nick. In the past, on this very
list, I've been bullish about fixing up Python's network security
position. I was an aggressive supporter of PEP 466 (and there are some
corners of PEP 466 that I think didn't go far enough). However, I'm
with you here: we should do this once and do it right. Corporate users
*will* bump into it, and they will look to the docs to fix it. That
fix needs to be easy and painless.

A user-level cert store is a good start, and if aren't
interested in it I might take a look at implementing it under the umbrella instead.


From christian at  Sun Aug 31 13:18:28 2014
From: christian at (Christian Heimes)
Date: Sun, 31 Aug 2014 13:18:28 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <20140830002254.26351339@fsol>
References: <>
Message-ID: <>

On 30.08.2014 00:22, Antoine Pitrou wrote:
> SSL_CERT_DIR and SSL_CERT_FILE are used, if set, when
> SSLContext.load_verify_locations() is called.
> Actually, come to think of it, this allows us to write a better
> test for that method. Patch welcome!

The environment vars are used only when
SSLContext.set_default_verify_paths() is called. load_verify_locations()
loads certificates from a given file, directory or memory but it doesn't
look at the env vars.

create_default_context() calls SSLContext.load_default_certs() when
neither cafile, capath nor cadata is given as an argument.
SSLContext.load_default_certs() then calls
SSLContext.set_default_verify_paths(). However there is a catch:
SSLContext.set_default_verify_paths() is not called on Windows. In
retrospect it was a bad decision by me to omit the call.


PS: SSL_CERT_DIR and SSL_CERT_FILE are the default names. It's possible
to change the names in OpenSSL. ssl.get_default_verify_paths() returns
the names and paths to the default verify locations.
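
For anyone who wants to check what their own build uses, a small sketch
that prints the values mentioned above. It only reads the fields of the
tuple returned by ssl.get_default_verify_paths(); the output differs from
platform to platform, so none is shown here.

```python
import ssl

paths = ssl.get_default_verify_paths()

# Names of the environment variables this OpenSSL build consults.
print("file env var:", paths.openssl_cafile_env)
print("dir env var: ", paths.openssl_capath_env)

# Locations that will actually be used; None if they do not exist.
print("cafile:", paths.cafile)
print("capath:", paths.capath)
```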

From victor.stinner at  Sun Aug 31 14:44:35 2014
From: victor.stinner at (Victor Stinner)
Date: Sun, 31 Aug 2014 14:44:35 +0200
Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR
Message-ID: <>

HTML version:

PEP: 475
Title: Retry system calls failing with EINTR
Version: $Revision$
Last-Modified: $Date$
Author: Charles-François Natali <cf.natali at>, Victor Stinner
<victor.stinner at>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 29-July-2014
Python-Version: 3.5


Retry system calls failing with the ``EINTR`` error and recompute
timeout if needed.


Interrupted system calls

On POSIX systems, signals are common. Your program must be prepared to
handle them. Examples of signals:

* The most common signal is ``SIGINT``, signal sent when CTRL+c is
  pressed. By default, Python raises a ``KeyboardInterrupt`` exception
  when this signal is received.
* When running subprocesses, the ``SIGCHLD`` signal is sent when a
  child process exits.
* Resizing the terminal sends the ``SIGWINCH`` signal to the
  applications running in the terminal.
* Putting the application in background (ex: press CTRL-z and then
  type the ``bg`` command) sends the ``SIGCONT`` signal.

Writing a signal handler is difficult: only "async-signal safe"
functions can be called.  For example, ``printf()`` and ``malloc()``
are not async-signal safe. When a signal is sent to a process calling
a system call, the system call can fail with the ``EINTR`` error to
give the program an opportunity to handle the signal without the
restriction on signal safe functions. Depending on the platform, on
the system call and the ``SA_RESTART`` flag, the system call may or
may not fail with ``EINTR``.

If the signal handler was set with the ``SA_RESTART`` flag set, the
kernel retries some system calls instead of failing with
``EINTR``. For example, ``read()`` is retried, whereas ``select()`` is
not retried. The Python function ``signal.signal()`` clears the
``SA_RESTART`` flag when setting the signal handler: all system calls
should fail with ``EINTR`` in Python.

The problem is that handling ``EINTR`` should be done for all system
calls. The problem is similar to handling errors in the C language
which does not have exceptions: you must check all function returns to
check for error, and usually duplicate the code checking for errors.
Python does not have this issue, it uses exceptions to notify errors.

Current status

Currently in Python, the code to handle the ``InterruptedError``
exception (``EINTR`` error) is duplicated case by case. Only a few
Python modules handle this exception, and fixes usually took several
years to cover a whole module. Example of code retrying
```` on ``InterruptedError``::

    while True:
        try:
            data =
            break
        except InterruptedError:
            continue
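
A minimal sketch of how such a retry loop could be factored out, so that
it is written once rather than at every call site. The name
``retry_on_eintr`` is hypothetical; it is not part of the standard library
or of this PEP.

```python
def retry_on_eintr(func, *args, **kwargs):
    """Call func(*args, **kwargs), retrying while it raises InterruptedError."""
    while True:
        try:
            return func(*args, **kwargs)
        except InterruptedError:
            # The call was interrupted by a signal before it completed.
            continue
```

Even with a helper like this, every call site still has to remember to use
it, which is the duplication this PEP aims to remove.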

List of Python modules of the standard library which handle
``InterruptedError``:

* ``asyncio``
* ``asyncore``
* ``io``, ``_pyio``
* ``multiprocessing``
* ``selectors``
* ``socket``
* ``socketserver``
* ``subprocess``

Other programming languages like Perl, Java and Go already retry
system calls failing with ``EINTR``.

Use Case 1: Don't Bother With Signals

In most cases, you don't want to be interrupted by signals and you
don't expect to get ``InterruptedError`` exceptions. For example, do
you really want to write such complex code for a "Hello World"
example? ::


    while True:
        try:
            print("Hello World")
            break
        except InterruptedError:
            continue

``InterruptedError`` can happen in unexpected places. For example,
``os.close()`` and ``FileIO.close()`` can raise ``InterruptedError``:
see the article `close() and EINTR

The `Python issues related to EINTR`_ section below gives examples of
bugs caused by "EINTR".

The expectation is that Python hides the ``InterruptedError``: retry
system calls failing with the ``EINTR`` error.

Use Case 2: Be notified of signals as soon as possible

Sometimes, you expect some signals and you want to handle them as soon
as possible.  For example, you may want to quit a program immediately
using the ``CTRL+c`` keyboard shortcut.

Some signals are not interesting and should not interrupt the
application.  There are two options to only interrupt an application
on some signals:

* Raise an exception in the signal handler, like ``KeyboardInterrupt`` for
  ``SIGINT``
* Use a I/O multiplexing function like ``select()`` with the Python
  signal "wakeup" file descriptor: see the function
  ``signal.set_wakeup_fd()``


Proposition

If a system call fails with ``EINTR``, Python must call signal
handlers: call ``PyErr_CheckSignals()``. If a signal handler raises
an exception, the Python function fails with the exception.
Otherwise, the system call is retried.  If the system call takes a
timeout parameter, the timeout is recomputed.
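
As a rough illustration of the retry-and-recompute rule, the sketch below
emulates it in pure Python. The PEP proposes doing this inside the C
implementation of each function; the helper name ``retry_with_timeout``
and the convention that ``func`` takes the remaining timeout as its only
argument are assumptions made for this example.

```python
import time


def retry_with_timeout(func, timeout):
    """Call func(remaining), retrying on InterruptedError.

    The deadline is fixed once, so repeated interruptions cannot stretch
    the total wait beyond `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    remaining = timeout
    while True:
        try:
            return func(remaining)
        except InterruptedError:
            # Recompute the time left; clamp at zero so the final retry
            # becomes a non-blocking poll instead of a negative timeout.
            remaining = max(0.0, deadline - time.monotonic())
```

A monotonic clock is used so that adjustments to the system clock cannot
lengthen or shorten the wait.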

Modified functions

Example of functions that need to be modified:

* ````, ````, ``io.FileIO.readinto()``
* ``os.write()``, ``io.FileIO.write()``
* ``os.waitpid()``
* ``socket.accept()``
* ``socket.connect()``
* ``socket.recv()``, ``socket.recv_into()``
* ``socket.recvfrom()``
* ``socket.send()``
* ``socket.sendto()``
* ``time.sleep()``
* ````
* ``select.poll()``
* ``select.epoll.poll()``
* ``select.devpoll.poll()``
* ``select.kqueue.control()``
* ```` and other selector classes

Note: The ``selectors`` module already retries on ``InterruptedError``, but it
doesn't recompute the timeout yet.

Backward Compatibility

Applications relying on the fact that system calls are interrupted
with ``InterruptedError`` will hang. The authors of this PEP don't
think that such applications exist.

If such applications exist, they are not portable and are subject to
race conditions (deadlock if the signal comes before the system call).
These applications must be fixed to handle signals differently, to
have a reliable behaviour on all platforms and all Python versions.
For example, use a signal handler which raises an exception, or use a
wakeup file descriptor.

For applications using event loops, ``signal.set_wakeup_fd()`` is the
recommended option to handle signals. The signal handler writes signal
numbers into the file descriptor and the event loop is awakened to read
them. The event loop can handle these signals without the restriction
of signal handlers.
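
A minimal sketch of the wakeup file descriptor pattern, assuming a POSIX
system and code running in the main thread. The helper names
``install_wakeup`` and ``wait_for_signals`` are hypothetical; a real event
loop would register the reading socket with its own selector instead of
calling ``select()`` directly.

```python
import select
import signal
import socket

# The C-level signal handler writes to one end of the pair; the event
# loop watches the other end.
reader, writer = socket.socketpair()
reader.setblocking(False)
writer.setblocking(False)  # the wakeup fd must be non-blocking


def install_wakeup(signum):
    """Route `signum` through the wakeup fd (main thread only)."""
    # A Python-level handler must be installed, otherwise the default
    # action for the signal applies; a no-op handler is enough.
    signal.signal(signum, lambda signo, frame: None)
    signal.set_wakeup_fd(writer.fileno())


def wait_for_signals(timeout):
    """Return the set of signal numbers received within `timeout` seconds."""
    ready, _, _ = select.select([reader], [], [], timeout)
    if not ready:
        return set()
    # Since Python 3.3, each byte written is a signal number.
    return set(reader.recv(4096))
```

The signal numbers are read and processed in ordinary Python code, outside
any signal handler, so the usual restrictions on handlers do not apply.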


Wakeup file descriptor

Since Python 3.3, ``signal.set_wakeup_fd()`` writes the signal number
into the file descriptor, whereas it only wrote a null byte before.
It becomes possible to handle different signals using the wakeup file
descriptor.

Linux has a ``signalfd()`` which provides more information on each
signal.  For example, it's possible to know the pid and uid who sent
the signal.  This function is not exposed in Python yet (see
issue 12304).

On Unix, the ``asyncio`` module uses the wakeup file descriptor to
wake up its event loop.


A C signal handler can be called from any thread, but the Python
signal handler should only be called in the main thread.

Python has a ``PyErr_SetInterrupt()`` function which calls the
``SIGINT`` signal handler to interrupt the Python main thread.

Signals on Windows

Control events

Windows uses "control events":

* ``CTRL_CLOSE_EVENT``: Close event

The ``SetConsoleCtrlHandler()`` function
can be used to install a control handler.

The ``CTRL_C_EVENT`` and ``CTRL_BREAK_EVENT`` events can be sent to a
process using the ``GenerateConsoleCtrlEvent()`` function.
This function is exposed in Python as ``os.kill()``.


The following signals are supported on Windows:

* ``SIGBREAK`` (``CTRL_BREAK_EVENT``): signal only available on Windows
* ``SIGFPE``
* ``SIGILL``


The default Python signal handler for ``SIGINT`` sets a Windows event
object: ``sigint_event``.

``time.sleep()`` is implemented with ``WaitForSingleObjectEx()``, it
waits for the ``sigint_event`` object using ``time.sleep()`` parameter
as the timeout.  So the sleep can be interrupted by ``SIGINT``.

``_winapi.WaitForMultipleObjects()`` automatically adds
``sigint_event`` to the list of watched handles, so it can also be
interrupted.

``PyOS_StdioReadline()`` also used ``sigint_event`` when ``fgets()``
failed to check if Ctrl-C or Ctrl-Z was pressed.



* glibc manual: Primitives Interrupted by Signals
* Bug #119097 for perl5: print returning EINTR in 5.14

Python issues related to EINTR

The main issue is: handle EINTR in the stdlib

Open issues:

* Add a new signal.set_wakeup_socket() function
* signal.set_wakeup_fd(fd): set the fd to non-blocking mode
* Use a monotonic clock to compute timeouts
* sys.stdout.write on OS X is not EINTR safe
* platform.uname() not EINTR safe
* asyncore does not handle EINTR in recv, send, connect, accept,
* socket.create_connection() doesn't handle EINTR properly

Closed issues:

* Interrupted system calls are not retried
* Solaris: EINTR exception in select/socket calls in telnetlib
* subprocess: Popen.communicate() doesn't handle EINTR in some cases
* multiprocessing.util._eintr_retry doen't recalculate timeouts
* file readline, readlines & readall methods can lose data on EINTR
* multiprocessing BaseManager serve_client() does not check EINTR on recv
* selectors behaviour on EINTR undocumented
* asyncio: limit EINTR occurrences with SA_RESTART
* socket.create_connection() also doesn't handle EINTR properly
* Faulty RESTART/EINTR handling in Parser/myreadline.c
* test_httpservers intermittent failure, test_post and EINTR
* os.spawnv(P_WAIT, ...) on Linux doesn't handle EINTR
* asyncore fails when EINTR happens in pol
* file.write and don't handle EINTR
* socket.readline() interface doesn't handle EINTR properly
* subprocess is not EINTR-safe
* SocketServer doesn't handle syscall interruption
* subprocess deadlock when read() is interrupted
* time.sleep(1): call PyErr_CheckSignals() if the sleep was interrupted
* siginterrupt with flag=False is reset when signal received
* need siginterrupt()  on Linux - impossible to do timeouts
* [Windows] Can not interrupt time.sleep()

Python issues related to signals

Open issues:

* signal.default_int_handler should set signal number on the raised
  exception
* expose signalfd(2) in the signal module
* missing return in win32_kill?
* Interrupts are lost during readline PyOS_InputHook processing
* cannot catch KeyboardInterrupt when using curses getkey()
* Deferred KeyboardInterrupt in interactive mode

Closed issues:

* sys.interrupt_main()


This document has been placed in the public domain.

   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8

From rdmurray at  Sun Aug 31 16:16:27 2014
From: rdmurray at (R. David Murray)
Date: Sun, 31 Aug 2014 10:16:27 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
Message-ID: <>

On Sun, 31 Aug 2014 16:45:42 +1000, Nick Coghlan <ncoghlan at> wrote:
> On 31 August 2014 16:16, Donald Stufft <donald at> wrote:
> >
> > On Aug 31, 2014, at 2:09 AM, Nick Coghlan <ncoghlan at> wrote:
> >
> > At the same time, we need to account for the fact that most existing
> > organisations still trust in perimeter defence for their internal
> > network security, and hence tolerate (or even actively encourage) the
> > use of unsecured connections, or skipping certificate validation,
> > internally. This is actually a really terrible idea, but it's still
> > incredibly common due to the general failure of the technology
> > industry to take usability issues seriously when we design security
> > systems (at least until recently) - doing the wrong "unsafe" thing is
> > genuinely easier than doing things right.
> >
> >
> > Just a quick clarification in order to be a little clearer, this change will
> > (obviously) only affect those who trust perimeter security *and* decided to
> > install an invalid certificate instead of just using HTTP. I'm not saying
> > that this doesn't happen, just being specific (I'm not actually sure why
> > they would install a TLS certificate at all if they are trusting perimeter
> > security, but I'm sure folks do).
> It's the end result when a company wide edict to use HTTPS isn't
> backed up by the necessary documentation and training on how to get a
> properly signed cert from your internal CA (or, even better, when such
> an edict comes down without setting up an internal CA first). Folks
> hit the internet instead, find instructions on creating a self-signed
> cert, install that, and tell their users to ignore the security
> warning and accept the cert. Historically, Python clients have "just
> worked" in environments that required a click-through on the browser
> side, since you had to opt in to checking the certificates properly.
> Self-signed certificates can also be really handy for doing local
> testing - you're not really aiming to authenticate the connection in
> that case, you're just aiming to test that the secure connection
> machinery is all working properly.

Self-signed certificates are not crazy in an internal corporate
environment even when properly playing the defense in depth game.  Once
you've acked the cert the first time, you will be warned if it changes
(like an ssh host key).  Sure, as Nick says the corp could set up an
internal signing authority and make sure everyone has their CA...and
they *should*...but realistically, that is probably relatively rare at
the moment, because it is not particularly easy to accomplish
(distributing the CA everywhere it needs to go is still a Hard Problem,
though it has gotten a lot better).

Given the reality of human nature, even when the documentation
accompanying the HTTPS initiative is good, there will *still* be someone
who hasn't followed the internal rules, yet you really need to talk to
the piece of infrastructure they are maintaining.  At least that one is
a short term problem (for some definition of "short" that may be several
months long), but it does exist.

In addition, as has been mentioned before, self-signed certs are often
embedded in *devices* from vendors (I'm looking at you, Cisco).  This is
another area where security consciousness has gotten better (the cert
exists) but isn't good yet (the cert is self-signed and replacing it
isn't trivial when it is even possible; and, because the self-signed cert
happens by default, it gets left in place).  And in the case of those
embedded certs, the cert can wind up *invalid* (expired) as well as
self-signed.  (This last item is where my concern about being able
to talk to invalid certs comes from.)

And yes, I have encountered all of this in the wild.


From stefan at  Sun Aug 31 16:51:24 2014
From: stefan at (Stefan Krah)
Date: Sun, 31 Aug 2014 16:51:24 +0200
Subject: [Python-Dev] [libmpdec] mpdecimal-2.4.1 released
Message-ID: <>


I've released mpdecimal-2.4.1:

da74d3cfab559971a4fbd4fb506e1b4498636eb77d0fd09e44f8e546d18ac068  mpdecimal-2.4.1.tar.gz

Starting with Python 3.4.2, this version should be used for an external libmpdec.

Stefan Krah

From marko at  Sun Aug 31 17:19:32 2014
From: marko at (Marko Rauhamaa)
Date: Sun, 31 Aug 2014 18:19:32 +0300
Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR
In-Reply-To: <>
 (Victor Stinner's message of "Sun, 31 Aug 2014 14:44:35 +0200")
References: <>
Message-ID: <>

Victor Stinner <victor.stinner at>:

> Proposition
> ===========
> If a system call fails with ``EINTR``, Python must call signal
> handlers: call ``PyErr_CheckSignals()``. If a signal handler raises
> an exception, the Python function fails with the exception.
> Otherwise, the system call is retried.  If the system call takes a
> timeout parameter, the timeout is recomputed.

Signals are tricky and easy to get wrong, to be sure, but I think it is
dangerous for Python to unconditionally commandeer signal handling. If
the proposition is accepted, there should be a way to opt out.


From christian at  Sun Aug 31 18:27:48 2014
From: christian at (Christian Heimes)
Date: Sun, 31 Aug 2014 18:27:48 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
Message-ID: <>

On 31.08.2014 16:16, R. David Murray wrote:
> Self-signed certificates are not crazy in an internal corporate
> environment even when properly playing the defense in depth game.  Once
> you've acked the cert the first time, you will be warned if it changes
> (like an ssh host key).  Sure, as Nick says the corp could set up an
> internal signing authority and make sure everyone has their CA...and
> they *should*...but realistically, that is probably relatively rare at
> the moment, because it is not particularly easy to accomplish
> (distributing the CA everywhere it needs to go is still a Hard Problem,
> though it has gotten a lot better).

It's very simple to trust a self-signed certificate: just download it
and stuff it into the trust store. That's all. A self-signed certificate
acts as its own root CA (so to speak). But there is a downside, too. The
certificate is trusted for any and all connections. Python's SSL module
has no way to trust a specific certificate for a host.


From p.f.moore at  Sun Aug 31 19:03:30 2014
From: p.f.moore at (Paul Moore)
Date: Sun, 31 Aug 2014 18:03:30 +0100
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

On 31 August 2014 17:27, Christian Heimes <christian at> wrote:
> It's very simple to trust a self-signed certificate: just download it
> and stuff it into the trust store.

"Stuff it into the trust store" is the hard bit, though. I have
honestly no idea how to do that. Or if it's temporary (which it likely
is) how to manage it - delete it when I no longer need it, list what
junk I've added over time, etc. And equally relevantly, no idea how to
do that in a way that won't clash with my company's policies...


From christian at  Sun Aug 31 19:23:53 2014
From: christian at (Christian Heimes)
Date: Sun, 31 Aug 2014 19:23:53 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
Message-ID: <>

On 31.08.2014 08:24, Nick Coghlan wrote:
> To answer David's specific question, the existing knobs at the OpenSSL
> level (SSL_CERT_DIR and SSL_CERT_FILE ) let people add an internal CA,
> opt out of the default CA system, and trust *specific* self-signed
> certs.

This works only on Unix platforms iff SSL_CERT_DIR and SSL_CERT_FILE are
both set to a non-empty string that points to non-existing files or
something like /dev/null.

On Windows my enhancement will always cause the system trust store to
kick in. There is currently no way to disable the Windows system store
for ssl.create_default_context() and ssl._create_stdlib_context() with
the functions' default arguments.

On Mac OS X the situation is even more obscure. Apple's OpenSSL binaries
are using Apple's Trust Evaluation Agent. You have to set
OPENSSL_X509_TEA_DISABLE=1 in order to prevent the agent from adding
trusted certs from OSX key chain. Hynek Schlawack did a deep dive into

> A Python-specific user level cert store is something that could be
> developed as a PyPI library for Python 2.7.9+ and 3.4+ (Is cert
> management considered in scope for If so, that could
> be a good home).

Python's SSL module is lacking some functionalities in order to
implement a fully functional cert store.

* no verify hook to verify each certificate in the chain like

* no way to get the full cert chain including the root certificate.

* no API to get the subject public key information (SPKI). The SPKI hash
can be used to identify a certificate. For example it's used in Google's

* the cert validation exception could use some additional information.

There are probably some more things missing. An X509 object would help, too.


From antoine at  Sun Aug 31 19:29:38 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 31 Aug 2014 19:29:38 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
 <> <>
Message-ID: <ltvm22$vrt$>

On 31/08/2014 19:03, Paul Moore wrote:
> On 31 August 2014 17:27, Christian Heimes <christian at> wrote:
>> It's very simple to trust a self-signed certificate: just download it
>> and stuff it into the trust store.
> "Stuff it into the trust store" is the hard bit, though. I have
> honestly no idea how to do that.

You certainly shouldn't do so. If an application has special needs that 
require trusting a self-signed certificate, then it should expose a 
configuration setting to let users specify the cert's location. Stuffing 
self-signed certs into the system trust store is really a measure of 
last resort.

There's another case which isn't solved by this, though, which is when a 
cert is invalid. The common situation being that it has expired 
(renewing certs is a PITA and therefore expired certs are more common 
than it sounds they should be). In this case, there is no way to 
whitelist it: you have to disable certificate checking altogether. This 
can be exposed by the application as configuration option if necessary, 
as well.
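
As an illustration only, here is a sketch of what exposing both choices as
application settings could look like. The make_context() helper and its
two parameters are hypothetical names for this example; the ssl calls are
the existing ones.

```python
import ssl


def make_context(cafile=None, verify=True):
    """Build an SSLContext from two hypothetical application settings.

    cafile: path to a certificate (e.g. a self-signed one) to trust
            instead of the platform defaults.
    verify: set to False to skip certificate checks entirely, e.g. for
            a host whose certificate has expired.
    """
    context = ssl.create_default_context(cafile=cafile)
    if not verify:
        # check_hostname has to be switched off before verify_mode can
        # be set to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

An application would pass the resulting context to whichever client it
uses, so the relaxed behaviour stays limited to the connections the user
configured it for.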



From p.f.moore at  Sun Aug 31 20:28:58 2014
From: p.f.moore at (Paul Moore)
Date: Sun, 31 Aug 2014 19:28:58 +0100
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <ltvm22$vrt$>
References: <>
 <> <>
Message-ID: <>

On 31 August 2014 18:29, Antoine Pitrou <antoine at> wrote:
> If an application has special needs that require trusting a self-signed
> certificate, then it should expose a configuration setting to let users
> specify the cert's location.

I can't see how that would be something the application would know.
For example, pip allows me to specify an "alternate cert bundle" but
not a single additional cert. So IIUC, I can't use my local index that
serves https using a self-signed cert. I'd find it hard to argue that
it's pip's responsibility to think of that use case - pretty much any
program that interacts with a web service *might* need to interact
with a self-signed dummy version, if only under test conditions.

Or did you mean that Python should provide such a setting that would
cover all applications written in Python?


From antoine at  Sun Aug 31 20:37:50 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 31 Aug 2014 20:37:50 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
 <> <>
Message-ID: <ltvq1u$dao$>

On 31/08/2014 20:28, Paul Moore wrote:
> I can't see how that would be something the application would know.
> For example, pip allows me to specify an "alternate cert bundle" but
> not a single additional cert. So IIUC, I can't use my local index that
> serves https using a self-signed cert. I'd find it hard to argue that
> it's pip's responsibility to think of that use case - pretty much any
> program that interacts with a web service *might* need to interact
> with a self-signed dummy version, if only under test conditions.

Well, it's certainly pip's responsibility more than Python's. What would 
Python do? Provide a setting that would blindly add a cert for all uses 
of httplib?

pip knows about the use cases here, Python doesn't.

(perhaps you want to serve your local index using http, though)



From p.f.moore at  Sun Aug 31 21:12:28 2014
From: p.f.moore at (Paul Moore)
Date: Sun, 31 Aug 2014 20:12:28 +0100
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <ltvq1u$dao$>
References: <>
 <> <>
Message-ID: <>

On 31 August 2014 19:37, Antoine Pitrou <antoine at> wrote:
> Well, it's certainly pip's responsibility more than Python's. What would
> Python do? Provide a setting that would blindly add a cert for all uses of
> httplib?

That's more or less my point, pip doesn't have that much better idea
than Python. I was talking about putting the cert in my local cert
store, so that *I* can decide, and applications don't need to take
special care to allow me to handle this case. You said that doing so
was bad, but I don't see why. It seems to me that you're saying that I
should raise a feature request for pip instead, which seems
unreasonable. Am I missing something?


From antoine at  Sun Aug 31 22:15:10 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 31 Aug 2014 22:15:10 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
 <> <>
Message-ID: <ltvvof$gta$>

On 31/08/2014 21:12, Paul Moore wrote:
> On 31 August 2014 19:37, Antoine Pitrou <antoine at> wrote:
>> Well, it's certainly pip's responsibility more than Python's. What would
>> Python do? Provide a setting that would blindly add a cert for all uses of
>> httplib?
> That's more or less my point, pip doesn't have that much better idea
> than Python. I was talking about putting the cert in my local cert
> store, so that *I* can decide, and applications don't need to take
> special care to allow me to handle this case.

What do you call your local cert store?

If you mean the system cert store, then that will affect all users.



From christian at  Sun Aug 31 22:16:22 2014
From: christian at (Christian Heimes)
Date: Sun, 31 Aug 2014 22:16:22 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <ltvm22$vrt$>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
 <> <>
Message-ID: <>

On 31.08.2014 19:29, Antoine Pitrou wrote:
> You certainly shouldn't do so. If an application has special needs that
> require trusting a self-signed certificate, then it should expose a
> configuration setting to let users specify the cert's location. Stuffing
> self-signed certs into the system trust store is really a measure of
> last resort.


I merely wanted to state that OpenSSL can verify a self-signed
certificate easily. The certificate 'just' has to be added to the
SSLContext's store of trusted root certs. Somebody has to figure out how
Python can accomplish the task.
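
A minimal sketch of what adding such a cert to an SSLContext's store
could look like, using only the ssl module. The helper name
context_trusting() and the file name in the usage note are made up for
illustration; the context starts with an empty trust store, so nothing
from the operating system's store is involved.

```python
import ssl


def context_trusting(cafile=None):
    """Build a client context that trusts only the certs in `cafile`.

    `cafile` is a hypothetical path to a PEM file holding the
    self-signed certificate; nothing is loaded when it is None.
    """
    # PROTOCOL_TLS_CLIENT enables CERT_REQUIRED and hostname checking
    # and does not load any system certificates by itself.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cafile is not None:
        # Add the self-signed cert to this context's trust store only.
        ctx.load_verify_locations(cafile=cafile)
    return ctx
```

An application would call something like
context_trusting("self_signed.pem") and pass the result to whatever
opens the connection; other applications and users are unaffected.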

> There's another case which isn't solved by this, though, which is when a
> cert is invalid. The common situation being that it has expired
> (renewing certs is a PITA and therefore expired certs are more common
> than it sounds they should be). In this case, there is no way to
> whitelist it: you have to disable certificate checking altogether. This
> can be exposed by the application as configuration option if necessary,
> as well.

It's possible to ignore errors with a verify callback. OpenSSL's wiki
has an example for the expired certs


From p.f.moore at  Sun Aug 31 22:30:28 2014
From: p.f.moore at (Paul Moore)
Date: Sun, 31 Aug 2014 21:30:28 +0100
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <ltvvof$gta$>
References: <>
 <> <>
Message-ID: <>

On 31 August 2014 21:15, Antoine Pitrou <antoine at> wrote:
> What do you call your local cert store?

I was referring to Christian's comment
> It's very simple to trust a self-signed certificate: just download it and stuff it into the trust store.

From his recent response, I guess he meant the system store, and he
agrees that this is a bad option.

OK, that's fair, but:

a) Is there really no OS-level personal trust store? I'm thinking of
Windows here for my own personal use, but the same question applies
elsewhere.
b) I doubt my confusion over Christian's response is atypical. Based
on what he said, if we hadn't had the subsequent discussion, I would
probably have found a way to add a cert to "the store" without
understanding the implications. While it's not Python's job to educate
users, it would be a shame if its default behaviour led people to make
ill-informed decisions.

Maybe an SSL HOWTO would be a useful addition to the docs, if anyone
feels motivated to write one.

Regardless, thanks for the education!


From victor.stinner at  Sun Aug 31 22:59:16 2014
From: victor.stinner at (Victor Stinner)
Date: Sun, 31 Aug 2014 22:59:16 +0200
Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR
In-Reply-To: <>
References: <>
Message-ID: <>


Sorry but I don't understand your remark. What is your problem with
retrying syscall on EINTR? Can you please elaborate? What do you mean by
"get wrong"?


On Sunday, 31 August 2014, Marko Rauhamaa <marko at> wrote:

> Victor Stinner <victor.stinner at>:
> > Proposition
> > ===========
> >
> > If a system call fails with ``EINTR``, Python must call signal
> > handlers: call ``PyErr_CheckSignals()``. If a signal handler raises
> > an exception, the Python function fails with the exception.
> > Otherwise, the system call is retried.  If the system call takes a
> > timeout parameter, the timeout is recomputed.
> Signals are tricky and easy to get wrong, to be sure, but I think it is
> dangerous for Python to unconditionally commandeer signal handling. If
> the proposition is accepted, there should be a way to opt out.
> Marko
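
The retry rule quoted above can be sketched in pure Python. This is
only an illustration of the idea, not the PEP's C implementation: the
helper call_with_retry() and the convention that the wrapped callable
takes the remaining timeout as its only argument are assumptions.

```python
import time


def call_with_retry(func, timeout=None):
    """Call func(remaining_timeout), retrying on InterruptedError.

    An exception raised by a signal handler is not caught here, so it
    propagates to the caller instead of triggering a retry.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    remaining = timeout
    while True:
        try:
            return func(remaining)
        except InterruptedError:
            # The call was interrupted (EINTR): recompute the timeout
            # so the total wait does not exceed the original value.
            if deadline is not None:
                remaining = max(0.0, deadline - time.monotonic())
```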

From marko at  Sun Aug 31 23:19:15 2014
From: marko at (Marko Rauhamaa)
Date: Mon, 01 Sep 2014 00:19:15 +0300
Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR
In-Reply-To: <>
 (Victor Stinner's message of "Sun, 31 Aug 2014 22:59:16 +0200")
References: <>
Message-ID: <>

Victor Stinner <victor.stinner at>:

> Sorry but I don't understand your remark. What is your problem with
> retrying syscall on EINTR?

The application will often want the EINTR return (exception) instead of
having the function resume on its own.

> Can you please elaborate? What do you mean by "get wrong"?

Proper handling of signals is difficult and at times even impossible.
For example it is impossible to wake up reliably from the select(2)
system call when a signal is generated (which is why Linux now has
pselect).


From ethan at  Sun Aug 31 23:38:04 2014
From: ethan at (Ethan Furman)
Date: Sun, 31 Aug 2014 14:38:04 -0700
Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR
In-Reply-To: <>
References: <>
Message-ID: <>

On 08/31/2014 02:19 PM, Marko Rauhamaa wrote:
> Victor Stinner <victor.stinner at>:
>> Sorry but I don't understand your remark. What is your problem with
>> retrying syscall on EINTR?
> The application will often want the EINTR return (exception) instead of
> having the function resume on its own.


As an ignorant person in this area, I do not know why I would ever want to
have EINTR raised instead of just getting the results of, say, my read()
call.


From victor.stinner at  Sun Aug 31 23:38:38 2014
From: victor.stinner at (Victor Stinner)
Date: Sun, 31 Aug 2014 23:38:38 +0200
Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR
In-Reply-To: <>
References: <>
Message-ID: <>

On Sunday, 31 August 2014, Marko Rauhamaa <marko at> wrote:

> Victor Stinner <victor.stinner at>:
> > Sorry but I don't understand your remark. What is your problem with
> > retrying syscall on EINTR?
> The application will often want the EINTR return (exception) instead of
> having the function resume on its own.

This case is described as the use case #2 in the PEP, so it is supported.
As written in the PEP, if you want to be notified of the signal, set a
signal handler which raises an exception. For example the default signal
handler for SIGINT raises KeyboardInterrupt.
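
A minimal sketch of that use case, assuming a Unix platform and the
main thread: a handler that raises makes the blocking call fail with
that exception instead of being retried. The names Interrupted and
sleep_until_signal() are made up for illustration, and SIGALRM with a
short timer stands in for whatever signal the application cares about.

```python
import signal
import time


class Interrupted(Exception):
    """Raised by the signal handler to abort the blocking call."""


def _raise_interrupted(signum, frame):
    raise Interrupted(signum)


def sleep_until_signal(delay=0.05, max_sleep=5.0):
    """Sleep up to max_sleep seconds; return the elapsed time if the
    SIGALRM handler interrupted the sleep, or None otherwise."""
    old_handler = signal.signal(signal.SIGALRM, _raise_interrupted)
    signal.setitimer(signal.ITIMER_REAL, delay)  # deliver SIGALRM soon
    start = time.monotonic()
    try:
        time.sleep(max_sleep)
        return None
    except Interrupted:
        return time.monotonic() - start
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)
```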

> > Can you please elaborate? What do you mean by "get wrong"?
> Proper handling of signals is difficult and at times even impossible.
> For example it is impossible to wake up reliably from the select(2)
> system call when a signal is generated (which is why linux now has
> pselect).

In my experience, using signal.set_wakeup_fd() works well with select(),
even on Windows. The PEP promotes this. It is even thread safe.
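
A minimal sketch of the wakeup-fd approach, assuming Unix (SIGUSR1 does
not exist on Windows), the main thread, and a Python version where the
wakeup fd receives the signal number as a byte. The helper name
wait_for_signal() is made up for illustration. Because the byte stays
in the socket, a signal that arrives just before select() is not lost.

```python
import os
import select
import signal
import socket


def wait_for_signal(timeout=5.0):
    """Send SIGUSR1 to ourselves and wait for it with select().

    Returns the signal number read from the wakeup fd, or None on
    timeout.
    """
    r, w = socket.socketpair()
    r.setblocking(False)
    w.setblocking(False)  # the wakeup fd must be non-blocking
    # A handler must be installed, otherwise SIGUSR1 kills the process.
    old_handler = signal.signal(signal.SIGUSR1, lambda signum, frame: None)
    old_fd = signal.set_wakeup_fd(w.fileno())
    try:
        os.kill(os.getpid(), signal.SIGUSR1)
        readable, _, _ = select.select([r], [], [], timeout)
        if not readable:
            return None
        return r.recv(1)[0]
    finally:
        signal.set_wakeup_fd(old_fd)
        signal.signal(signal.SIGUSR1, old_handler)
        r.close()
        w.close()
```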

I don't know of issues with signals and select() (when no file descriptor
is used to wake it up). Python now exposes signal.pthread_sigmask(); I don't
know if it helps. In my experience, signals don't play well with
multithreading. On FreeBSD, the signal is sent to a "random" thread, so you
must have the same signal mask on all threads if you want to rely on them.

But I don't get your point. How does this PEP make the situation worse?


From ncoghlan at  Sun Aug 31 23:41:21 2014
From: ncoghlan at (Nick Coghlan)
Date: Mon, 1 Sep 2014 07:41:21 +1000
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

On 1 Sep 2014 06:32, "Paul Moore" <p.f.moore at> wrote:
> On 31 August 2014 21:15, Antoine Pitrou <antoine at> wrote:
> > What do you call your local cert store?
> I was referring to Christian's comment
> > It's very simple to trust a self-signed certificate: just download it
> > and stuff it into the trust store.
> From his recent response, I guess he meant the system store, and he
> agrees that this is a bad option.
> OK, that's fair, but:
> a) Is there really no OS-level personal trust store? I'm thinking of
> Windows here for my own personal use, but the same question applies
> elsewhere.
> b) I doubt my confusion over Christian's response is atypical. Based
> on what he said, if we hadn't had the subsequent discussion, I would
> probably have found a way to add a cert to "the store" without
> understanding the implications. While it's not Python's job to educate
> users, it would be a shame if its default behaviour led people to make
> ill-informed decisions.

Right, this is why I came to the conclusion we need to follow the browser
vendors' lead here and support a per-user, Python-specific supplementary
certificate cache before we can start validating certs by default at the
*Python* level. There are still too many failure modes for cert management
on private networks for us to safely ignore the use case of needing to
force connections to services with invalid certs.

We don't need to *solve* that problem here today - we can push it back to
Alex (and anyone else interested) as a building block to investigate
providing as part of or, with a view to making a
standard library version of that (along with any SSL module updates) part
of PEP 476.

In the meantime, we can update the security considerations for the ssl
module to make it clearer that the defaults are set up for trusted networks
and that using it safely on the public internet may mean you're better off
with a third party library like requests or Twisted. (I'll start another
thread shortly that is highly relevant to that topic)


> Maybe an SSL HOWTO would be a useful addition to the docs, if anyone
> feels motivated to write one.
> Regardless, thanks for the education!
> Paul

From christian at  Sun Aug 31 23:43:05 2014
From: christian at (Christian Heimes)
Date: Sun, 31 Aug 2014 23:43:05 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
Message-ID: <>

On 31.08.2014 08:09, Nick Coghlan wrote:
> As Antoine says here, I'm also opposed to adding more Python specific
> configuration options. However, I think there may be something
> worthwhile we can do that's closer to the way browsers work, and has
> the significant benefit of being implementable as a PyPI module first
> (more on that in a separate reply).

I'm on your and Antoine's side and strictly against any additional
environment variables or command line arguments. That would make the
whole validation process even more complex and harder to understand.

There might be a better option to give people and companies the option
to tune the SSL module to their needs. Python already has a
customization hook for the site module called sitecustomize. How about
another module named sslcustomize? Such a module could be used to tune
the ssl module to the needs of users, e.g. configure a different default
context, add certificates to a default context etc.

Companies could install them in a system global directory on their
servers. Users could put them in their own user site directory and even
each virtual env can have one sslcustomize of its own. It's fully
backward compatible, doesn't add any flags and developers have the full
power of Python for configuration and customization.
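
A rough sketch of how such a hook might be consulted. Everything here
is hypothetical: no sslcustomize module exists in the standard library,
and the convention of an optional customize(ctx) function, as well as
the helper names load_sslcustomize() and default_context(), are
assumptions made purely for illustration.

```python
import importlib
import ssl


def load_sslcustomize():
    """Return the hypothetical sslcustomize module, or None if absent."""
    try:
        return importlib.import_module("sslcustomize")
    except ImportError:
        return None


def default_context():
    """Build a default client context, then let sslcustomize adjust it."""
    ctx = ssl.create_default_context()
    hook = load_sslcustomize()
    # Assumed protocol: sslcustomize may define customize(ctx), which
    # mutates the context (e.g. calls ctx.load_verify_locations()).
    customize = getattr(hook, "customize", None)
    if callable(customize):
        customize(ctx)
    return ctx
```

A company-wide sslcustomize.py could then add an internal CA with
ctx.load_verify_locations(), while a per-user or per-virtualenv copy
could do something different.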


From antoine at  Sun Aug 31 23:53:14 2014
From: antoine at (Antoine Pitrou)
Date: Sun, 31 Aug 2014 23:53:14 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
 <> <>
Message-ID: <lu05ga$eq8$>

On 31/08/2014 23:41, Nick Coghlan wrote:
> Right, this is why I came to the conclusion we need to follow the browser
> vendors lead here and support a per-user Python specific supplementary
> certificate cache before we can start validating certs by default at the
> *Python* level. There are still too many failure modes for cert management
> on private networks for us to safely ignore the use case of needing to
> force connections to services with invalid certs.

We are not ignoring that use case. The proper solution is simply to 
disable cert validation in the application code (or, for more 
sophisticated needs, provide an application configuration setting for 
cert validation).
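
A minimal sketch of such an application-level setting, assuming the
application reads `verify` and `cafile` from its own configuration.
The helper name make_context() and both parameter names are made up
for illustration.

```python
import ssl


def make_context(verify=True, cafile=None):
    """Build a client context from two application settings.

    verify: False disables certificate validation entirely.
    cafile: optional path to a PEM bundle to trust instead of the
            system defaults.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    if not verify:
        # check_hostname must be switched off before verify_mode can
        # be set to CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```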

> In the meantime, we can update the security considerations for the ssl
> module to make it clearer that the defaults are set up for trusted networks
> and that using it safely on the public internet may mean you're better off
> with a third party library like requests or Twisted.

No, you simply have to select the proper validation settings.



From christian at  Sun Aug 31 23:59:10 2014
From: christian at (Christian Heimes)
Date: Sun, 31 Aug 2014 23:59:10 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by
In-Reply-To: <>
References: <>
 <> <>
 <20140831032525.19b7e48c@fsol> <>
 <> <>
Message-ID: <>

On 31.08.2014 22:30, Paul Moore wrote:
> On 31 August 2014 21:15, Antoine Pitrou <antoine at> wrote:
>> What do you call your local cert store?
> I was referring to Christian's comment
>> It's very simple to trust a self-signed certificate: just download it and stuff it into the trust store.

I was referring to the trust store of the SSLContext object and not
to any kind of cert store of the operating system. Sorry for the confusion.

> a) Is there really no OS-level personal trust store? I'm thinking of
> Windows here for my own personal use, but the same question applies
> elsewhere.

Windows and OSX have superior cert stores compared to Linux and BSD.
They have means for user and system wide cert stores and trust settings.
Linux just has one central directory or file with all trusted certs. My
KDE has some options to disable certs but I don't know how to make use
of the configuration.

Even worse: Linux distros don't differentiate between purposes. On
Windows a user can trust a certificate for S/MIME but not for server
auth or client auth. Ubuntu just puts all certificates in one directory,
but that's wrong. :(
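
A small sketch of how the difference shows up from Python, using only
the ssl module. The helper name describe_trust_sources() is made up
for illustration, and the "ROOT" store is just one of the Windows
system stores.

```python
import ssl
import sys


def describe_trust_sources():
    """Report where the default trusted certs come from."""
    paths = ssl.get_default_verify_paths()
    # On Linux/BSD this is the single central file and/or directory.
    info = {"cafile": paths.cafile, "capath": paths.capath}
    if sys.platform == "win32":
        # Windows only: each entry is (cert_bytes, encoding_type,
        # trust), where trust is True or a set of purpose OIDs.
        info["windows_root_count"] = len(ssl.enum_certificates("ROOT"))
    return info
```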

