
Hi, I didn't want to spread FUD on python-dev, but the FUD there seems to be a good topic for discussion here. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html As you may see, Python is losing its position. I blame Python 3, the fact that Python development is not concentrating on users enough [1], the big resistance to getting things done (the /moin/ prefix story), and a communication process that is a bit discouraging. If that is not the cause, then the cause is a lack of visibility into the real problem, but what is the real problem? I guess the topic is for the upcoming language summit at PyCon, but it will be hard for me to get there this year from Belarus, so it would be nice to read some opinions here. 1. http://python-for-humans.heroku.com/ -- anatoly t.

On Thu, Feb 9, 2012 at 5:05 PM, Masklinn <masklinn@masklinn.net> wrote:
1. Where would be the correct place to talk about a grand state of python affairs? 2. Like it or not, many use such ratings to decide which language to learn, which language to use for their next project and whether or not to be proud of their language of choice. I think it's important for python to be popular and good. One without the other isn't too useful. Yuval

On Thu, Feb 09, 2012 at 05:13:03PM +0200, Yuval Greenfield wrote:
Nowhere because: 1. Nobody cares. This is Free Software, and we are scratching our own itches. 2. Do you consider Python developers stupid? Do you think they don't have any idea how things are going on in the wild?
Java (or Perl, or whatever) has won, hands down. Congrats to them! Can we please return to our own development? We are not going to conquer the world, are we? Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

The reason python is slipping in the index is the same reason that its popularity doesn't matter (much). Wrapper-generating tools, cross-language interfaces and whatnot are making "polyglot" programming a pretty simple affair these days... The TIOBE index for the most part has two distinct groups: languages that people use at work, where risk aversion is a large driving force (see java, c/++, php), and languages people use personally because they enjoy programming in them. Because the library issue for a new or less popular language is not as big a deal as it once was, people have more freedom in their choice, and that is reflected in the diversification of "fun" languages. Javascript is an outlier here: you don't have a choice if you target the browser. Nathan

On Thu, 9 Feb 2012 16:19:23 +0100 Antoine Pitrou <solipsis@pitrou.net> wrote:
And to elaborate a bit, here's the description of the python-ideas list: “This list is to contain discussion of speculative language ideas for Python for possible inclusion into the language. If an idea gains traction it can then be discussed and honed to the point of becoming a solid proposal to put to either python-dev or python-3000 as appropriate.” (*) python-ideas is not a catchall for random opinions about Python. (*) someone should really remove that python-3000 reference

Here is another data point: http://redmonk.com/sogrady/2012/02/08/language-rankings-2-2012/ Unfortunately the TIOBE index does matter. I can speak for python in education and the trends I have seen. Python is and remains the easiest language to teach, but it is no longer true that getting Python to run is easier than the alternatives (not for the average undergrad student). It used to be that you downloaded python 2.5 and you were in business. Now you have to make a choice: 2.x or 3.x. 20% of the students cannot tell one from the other (even after being told repeatedly which one to use). Three weeks into the class they complain that "the class code won't compile" (the same 20% cannot tell a compiler from an interpreter). 50+% of the students have a mac and an increasing number of packages depend on numpy. Installing numpy on mac is a lottery. Those who do not have a mac have windows and they expect an IDE like eclipse. I know you can use Python with eclipse but they do not. They download Python and complain that IDLE has no autocompletion, no line numbers, no collapsible functions/classes. From the hard-core computer scientists' perspective there are usually three objections to using Python: - Most software engineers think we should only teach statically typed languages - Those who care about scalability complain about the GIL - The programming language purists complain about the use of reference counting instead of garbage collection The net result is that people cannot agree and it is getting increasingly difficult to make the case for the use of Python in intro CS courses. For some reason JavaScript seems to win these days. Massimo On Feb 9, 2012, at 8:36 AM, anatoly techtonik wrote:

I think if easy_install, gevent, numpy (*), and win32 extensions were included in 3.x, together with a slightly better Idle (still based on Tkinter, with multiple pages, autocompletion, collapsible, line numbers, better printing with syntax highlighting), and if easy_install were accessible via Idle, this would be a killer version. Longer term, removing the GIL and using garbage collection should be a priority. I am not sure what is involved and how difficult it is, but perhaps this is what PyCon money can be used for. If this cannot be done without breaking backward compatibility again, then 3.x should be considered an experimental branch, people should be advised to stay with 2.7 (2.8?) and then skip to 4.x directly when these problems are resolved. Python should not make a habit of breaking backward compatibility. It would be really nice if it were to include an async web server (based on gevent for example), a better parser for HTTP headers, and a python based template language (like mako or the web2py one), not just for the web but for document generation in general. Massimo On Feb 9, 2012, at 11:12 AM, Edward Lesmes wrote:

Massimo Di Pierro wrote:
IDLE does look a little long in the tooth.
It isn't difficult to find out about previous attempts to remove the GIL. Googling for "python removing the gil" brings up plenty of links, including: http://www.artima.com/weblogs/viewpost.jsp?thread=214235 http://dabeaz.blogspot.com.au/2011/08/inside-look-at-gil-removal-patch-of.ht... Or just use Jython or IronPython, neither of which have a GIL. And since neither of them support Python 3 yet, you have no confusing choice of version to make. I'm not sure if IronPython is suitable for teaching, if you have to support Macs as well as Windows, but as a counter-argument against GIL trolls, there are two successful implementations of Python without the GIL. (And neither is as popular as CPython, which I guess says something about where people's priorities lie. If the GIL was as serious a problem in practice as people claim, there would be far more interest in Jython and IronPython.)
Python 4.x (Python 4000) is pure vapourware. It is irresponsible to tell people to stick to Python 2.7 (there will be no 2.8) in favour of something which may never exist. http://www.python.org/dev/peps/pep-0404/ -- Steven

On 10Feb2012 05:50, Steven D'Aprano <steve@pearwood.info> wrote: | Python 4.x (Python 4000) is pure vapourware. It it irresponsible to tell | people to stick to Python 2.7 (there will be no 2.8) in favour of something | which may never exist. | | http://www.python.org/dev/peps/pep-0404/ Please tell me this PEP number is deliberate! -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Once I reached adulthood, I never had enemies until I posted to Usenet. - Barry Schwartz <bbs@hankel.rutgers.edu>

Le 12/02/2012 00:18, Cameron Simpson a écrit :
It is, sir! At first the number was taken by the virtualenv PEP with no special meaning, just the next number in sequence, but when Barry wrote up the 2.8 Unrelease PEP and took the number 405, the occasion was too good to be missed and the numbers were swapped. Cheers

On Thu, Feb 9, 2012 at 9:46 AM, Massimo Di Pierro < massimo.dipierro@gmail.com> wrote:
IIRC gevent still needs to be ported to 3.x (maybe someone with the necessary skills should apply to the PSF for funding). But the rest sounds like the domain of a superinstaller, not inclusion in the stdlib. IDLE will never be able to compete with Eclipse -- you can love one or the other but not both. Longer term removing the GIL and using garbage collection should be a
priority. I am not sure what is involved and how difficult it is but perhaps this is what PyCon money can be used for.
I think the best way to accomplish both is to focus on PyPy. It needs porting to 3.x; Google has already given them some money towards this goal.
That's really bad advice. 4.x will not be here for another decade.
Python should not make a habit of breaking backward compatibility.
Agreed. 4.x should be fully backwards compatible -- with 3.x, not with 2.x. It would be really nice if it were to include an async web server (based on
Again, that's a bundling issue. With the infrequency of Python releases, anything still under development is much better off being distributed separately. Bundling into core Python requires a package to be essentially stable, i.e., dead. -- --Guido van Rossum (python.org/~guido)

On 2/9/2012 12:46 PM, Massimo Di Pierro wrote:
I think if easy_install, gevent, numpy (*), and win32 extensions were included in 3.x, together with a slightly better Idle (still based on
I am working on the patches already on the tracker, starting with bug fixes.
Tkinter, with multiple pages,
If you mean multiple tabbed pages in one window, I believe there is a patch.
autocompletion,
IDLE already has 'auto-completion'. If you mean something else, please explain.
collapsible [blocks], line numbers,
I have thought about those.
better printing with syntax highlighting),
Better basic printing support is really needed. #1528593 Color printing, if not possible now, would be nice, as color printers are common now. I have no idea if tkinter print support makes either easier now.
and if easy_install were accessible via Idle, this would be a killer version.
That should be possible with an extension.
Longer term removing the GIL and using garbage collection should be a priority. I am not sure what is involved and how difficult it is but
As has been discussed here and on pydev, the problems include things like making Python slower and disabling C extensions.
For non-Euro-Americans, a major problem with Python 1/2 was the restriction of identifiers to ascii. This was *fixed* by Python 3. When I went to Japan a couple of years ago and stopped in a general bookstore (like Borders), its computer language section had about 10 books on Python, most in Japanese as I remember. So it is apparently in use there.
resolved. Python should not make a habit of breaking backward compatibility.
I believe the main problem has been the unicode switch, which is critical to Python being a world language. Removal of old-style classes was mostly a non-issue, except for the very few who intentionally continued to use them. -- Terry Jan Reedy

On Thu, Feb 09, 2012 at 11:46:45AM -0600, Massimo Di Pierro wrote:
I am not sure that popularity contests are based on technical merits/demerits alone. I guess people here care less about popularity and more about good tools in python land. So if there are things lacking in the Python world, then those are good project opportunities. What I personally feel is that the various plug-and-play libraries are giving JavaScript a thumbs up, and more is going on in the web world on the front-end than the back-end. So, if there is a requirement for a Python programmer, there is an assumption that he should know web techs too. There are also PHP/Ruby/Java folks who know web technologies. So a web tech like javascript gets counted 4x. -- Senthil

On Feb 9, 2012, at 12:46 PM, Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
I think if easy_install, gevent, numpy (*), and win32 extensions were included in 3.x, together with a slightly better Idle (still based on Tkinter, with multiple pages, autocompletion, collapsible, line numbers, better printing with syntax highlighting), and if easy_install were accessible via Idle, this would be a killer version.
Longer term removing the GIL and using garbage collection should be a priority. I am not sure what is involved and how difficult it is but perhaps this is what PyCon money can be used for.
Please do not volunteer revenue that does not exist, or PSF funds for things without a grant proposal or working group. Especially PyCon revenue - which does not exist. Jesse

On 09.02.2012 19:02, Terry Reedy wrote:
And make installing Python on the Mac a lottery?
Or a subset of NumPy? The main offender is numpy.linalg, which needs a BLAS library that should be tuned to the hardware. (There is a reason NumPy and SciPy binary installers on Windows are bloated.) And from what I have seen of complaints about building NumPy on Mac, it tends to be the BLAS/LAPACK stuff that drives people crazy, particularly those who want to use ATLAS (which is a bit stupid, as OpenBLAS/GotoBLAS2 is easier to build and much faster.) If Python comes with NumPy built against Netlib reference BLAS, there will be lots of complaints that "Matlab is so much faster than Python" when it is actually the BLAS libraries that are different. But I am not sure we want 50-100 MB of bloat in the Python binary installer just to cover all possible cases of CPU-tuned OpenBLAS/GotoBLAS2 or ATLAS libraries. Sturla

On Thu, Feb 9, 2012 at 10:30 AM, Sturla Molden <sturla@molden.no> wrote:
I don't know much of this area, but maybe this is something where a dynamic installer (along the lines of easy_install) might actually be handy? The funny thing is that most Java software is even more bloated and you rarely hear about that (at least not from Java users ;-). -- --Guido van Rossum (python.org/~guido)

On 09.02.2012 19:36, Guido van Rossum wrote:
I don't know much of this area, but maybe this is something where a dynamic installer (along the lines of easy_install) might actually be handy?
That is what NumPy and SciPy do on Windows. But it also means the "superpack" installer is a very big download. Sturla

Sturla Molden, 09.02.2012 19:46:
I think this is an area where distributors can best play their role. If you want Python to include SciPy, go and ask Enthought. If you also want an operating system with it, go and ask Debian or Canonical. Or macports, if you prefer paying for your apples instead. Stefan

From my own observations, the recent drop is surely due to uncertainty around Python 3, and an increase in alternatives on the server side, such as Node.
The transition is only going to get more painful as system critical software lags on 2.x while users clamour for 3.x. I understand there are some fundamental problems in running both simultaneously which makes gradual integration not a possibility. Dynamic typing also doesn't help, making it very hard to automatically port, and update dependencies. Lesser reasons include an increasing gap in scalability to multicore compared with other languages (the GIL being the gorilla here, multiprocessing is unacceptable as long as native threading is the only supported concurrency mechanism), and a lack of enthusiasm from key technologies and vendors: GAE, gevent, matplotlib, are a few encountered personally.

On Fri, 10 Feb 2012 01:35:17 +0800 Matt Joiner <anacrolix@gmail.com> wrote:
the GIL being the gorilla here, multiprocessing is unacceptable as long as native threading is the only supported concurrency mechanism
If threading is the only acceptable concurrency mechanism, then Python is the wrong language to use. But you're also not building scaleable systems, which is most of where it really matters. If you're willing to consider things other than threading - and you have to if you want to build scaleable systems - then Python makes a good choice. Personally, I'd like to see a modern threading model in Python, especially if its tools can be extended to work with other concurrency mechanisms. But that's a *long* way into the future. As for "popular vs. good" - "good" is a subjective measure. So the two statements "anything popular is good" and "nothing popular was ever good unless it had no competition" can both be true. Personally, I lean toward the latter. I tend to find things that are popular to not be very good, which makes me distrust the taste of the populace. The python core developers, on the other hand, have an excellent record when it comes to keeping the language good - and the failures tend to be concessions to popularity! So I'd rather the current system for adding features stay in place and *not* see the language add features just to gain popularity. We already have Perl if you want that kind of language. That said, it's perfectly reasonable to suggest changes you think will improve the popularity of the language. But be prepared to show that they're actually good, as opposed to merely possibly popular. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 09.02.2012 19:42, Mike Meyer wrote:
Yes or no... Python is used for parallel computing on the biggest supercomputers, monsters like Cray and IBM blue genes with tens of thousands of CPUs. But what really fails to scale is the Python module loader! For example it can take hours to "import numpy" for 30,000 Python processes on a blue gene. And yes, nobody would consider using Java for such systems, even though Java does not have a GIL (well, threads do not matter that much on a cluster with distributed memory anyway). It is Python, C and Fortran that are popular. So that really disproves the claim that Python sucks for big concurrency, except perhaps for the module loader. Sturla

On Thu, Feb 9, 2012 at 10:57 AM, Sturla Molden <sturla@molden.no> wrote:
I'm curious about the module loader problem. Did someone ever analyze the cause and come up with a fix? Is it the import lock? Maybe it's something for the bug tracker. -- --Guido van Rossum (python.org/~guido)

On 09.02.2012 20:05, Guido van Rossum wrote:
See this: http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html The offender is actually imp.find_module, which results in a huge number of failed open() calls when used concurrently from many processes. So a solution is to have one process locate the modules and then broadcast their location to the other processes. There is even a paper on the issue. Here they suggest importing from ramdisk might work on IBM blue gene, but not on Cray. http://www.cs.uoregon.edu/Research/paracomp/papers/iccs11/iccs_paper_final.p... Another solution might be to use sys.meta_path to bypass imp.find_module: http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059813.html The best solution would of course be to fix imp.find_module so it scales properly. Sturla
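
(For concreteness, a rough sketch of the sys.meta_path workaround Sturla mentions, written against the old PEP 302 finder/loader protocol. The mapping and the broadcast step are only hinted at in comments, the names are made up, and packages would need extra care.)

    # Hypothetical sketch: one "root" process locates the needed modules once,
    # broadcasts the {name: filename} mapping (e.g. via MPI), and every other
    # process installs this finder so that imports become dict lookups instead
    # of thousands of failed open() calls inside imp.find_module.
    import imp
    import sys

    class CachedLocationFinder(object):
        """PEP 302 finder/loader backed by a precomputed module->path mapping."""

        def __init__(self, locations):
            # e.g. {'mymodule': '/shared/project/mymodule.py'}; plain modules
            # only -- packages would also need __path__ handling.
            self.locations = locations

        def find_module(self, fullname, path=None):
            return self if fullname in self.locations else None

        def load_module(self, fullname):
            if fullname in sys.modules:      # required by the PEP 302 protocol
                return sys.modules[fullname]
            return imp.load_source(fullname, self.locations[fullname])

    # On each worker, after receiving 'locations' from the root rank:
    #     sys.meta_path.insert(0, CachedLocationFinder(locations))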

On Thu, 09 Feb 2012 20:25:48 +0100 Sturla Molden <sturla@molden.no> wrote:
The offender is actually imp.find_module, which results in huge number of failed open() calls when used concurrently from many processes.
Ah, I see why I never ran into it. I build systems that start by loading all the modules they need, then fork()ing many processes from that parent. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 09.02.2012 20:48, Mike Meyer wrote:
Yes, but that would not work with MPI (e.g. mpi4py) where the MPI runtime (e.g. MPICH2) is starting the Python processes. Theoretically the issue should also be present on Windows when using multiprocessing, but not on Linux, as multiprocessing uses os.fork. Sturla

On Thu, 09 Feb 2012 19:57:20 +0100 Sturla Molden <sturla@molden.no> wrote:
Whether or not hours of time to import is an issue depends on what you're doing. I typically build systems running on hundreds of CPUs for weeks on end, meaning you get years of CPU time per run. So if it took a few hours of CPU time to get started, it wouldn't be much of a problem. If it took a few hours of wall clock time - well, that would be more of a problem, mostly because that long of an outage would be unacceptable. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 2/9/2012 1:57 PM, Sturla Molden wrote:
Mike Meyer posted that on pydev today http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html They determined that the time was gobbled by *finding* modules in each process, so they cut hours by finding them in 1 process and sending the locations to the other 29,999. We are already discussing how to use this lesson in core Python. The sub-thread is today's posts in "requirements for moving __import__ over to importlib?" -- Terry Jan Reedy

Yes, but core Python doesn't have any true concurrency mechanisms other than native threading, and they're too heavyweight for this purpose alone. On top of this they're useless for Python-only parallelism.
Too far. It needs to be now. The downward spiral is already beginning. Mobile phones are going multicore. My next desktop will probably have 8 cores or more. All the heavyweight languages are firing up thread/STM standardizations and implementations to make this stuff more performant and easier than it already is.
This doesn't apply to "enabling" features: features that make it possible for popular stuff to happen. Concurrency isn't popular, but parallelism is. At least where the GIL is concerned, a good alternative concurrency mechanism doesn't exist. (The popular one is native threading.)

On Fri, 10 Feb 2012 03:16:00 +0800 Matt Joiner <anacrolix@gmail.com> wrote:
Huh? Core python has concurrency mechanisms other than native threading. I don't know what your purpose is, but for mine (building horizontally scaleable systems of various types), they work fine. They're much easier to design with and maintain than using threads as well. They also work well in Python-only systems. If you're using "true" to exclude anything but threading, then you're just playing word games. The reality is that most problems don't need threading. The only thing it buys you over the alternatives is easy shared memory. Very few problems actually require that.
Yes, Python needs something like that. You can't have it without breaking backwards compatibility. It's not clear you can have it without serious performance hits in Python's primary use area, which is single-threaded scripts. Which means it's probably a Python 4K feature. There have been a number of discussions on python-ideas about this. I submitted a proto-pep that covers most of that to python-dev for further discussion and approval. I'd suggest you chase those things down.
No, the process needs to apply to *all* changes. Even changes to implementation details - like removing the GIL. If your implementation that removes the GIL causes a 50% slowdown in single-threaded python code, it ain't gonna happen. But until you actually propose a change, it won't matter. Nothing's going to happen until someone actually does something more than talk about it. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Il 09 febbraio 2012 18:35, Matt Joiner <anacrolix@gmail.com> ha scritto:
I think it's not only a matter of 3rd party modules not being ported quickly enough or the amount of work involved in the 2->3 conversion. I bet a lot of people don't want to upgrade for another reason: unicode. The impression I got is that python 3 forces the user to use and *understand* unicode, and a lot of people simply don't want to deal with that. In python 2 there was no such strong imposition. Python 2's string type acting both as bytes and as text was certainly ambiguous and "impure" on different levels, and changing that was definitely a win in terms of purity and correctness. I bet most advanced users are happy with this change. On the other hand, the average Python 2 user was free to ignore that distinction, even if that meant having subtle bugs hidden somewhere in his/her code. I think this aspect shouldn't be underestimated. --- Giampaolo http://code.google.com/p/pyftpdlib/ http://code.google.com/p/psutil/ http://code.google.com/p/pysendfile/
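
(To make the point concrete, here is the classic way the Python 2 ambiguity bites; the file name is made up and the snippet is only illustrative.)

    # -*- coding: utf-8 -*-
    # Python 2: read() returns a byte str, which happily mixes with unicode...
    data = open('names.txt').read()    # bytes pretending to be text
    greeting = u"Hello " + data        # implicit ASCII decode happens here; it
                                       # works until names.txt contains a
                                       # non-ASCII byte, then UnicodeDecodeError
                                       # pops up here, far from wherever the
                                       # bytes actually came from.

    # Python 3 makes the decision explicit at the boundary instead:
    #     data = open('names.txt', encoding='utf-8').read()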

On Thu, Feb 9, 2012 at 11:31 AM, Matt Joiner <anacrolix@gmail.com> wrote:
The difference is that *if* you hit a Unicode error in 2.x, you're done for. Even understanding Unicode doesn't help. In 3.x, you will hit Unicode problems less frequently than in 2.x, and when you do, the problem can actually be overcome, and then your code is better. In 2.x, the typical solution, when there *is* a solution, involves making your code messier and sending up frequent prayers to the gods of Unicode. -- --Guido van Rossum (python.org/~guido)

On 2/9/2012 2:31 PM, Matt Joiner wrote:
I am really puzzled what you mean. I have used Python 3 since 3.0 alpha and as long as I have used strictly ascii, I have encountered no such issues.
I have learned about unicode, but just so I could play around with other characters.
I had to learn Unicode right then and there. Fortunately, the Python docs HOWTO on Unicode is excellent.
Were you doing some non-ascii or non-average framework-like things? Would you really not have had to learn the same about unicode if you were using 2.x? -- Terry Jan Reedy

On Fri, Feb 10, 2012 at 5:25 AM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
The problem for average users *right now* is that many of the Unicode handling tools that were written for the blurry "is-it-bytes-or-is-it-text?" 2.x 8-bit str type haven't been ported to 3.x yet. That's currently happening, and the folks doing it are the ones who really have to make the adjustment, and figure out what they can deal with on behalf of their users and what they need to expose (if anything). The idea with Python 3 unicode is to have errors happen at (or at least close to) the point where the code is doing something wrong, unlike the Python 2 implicit conversion model, where either data gets silently corrupted, or you get a Unicode error far from the location that introduced the problem. I actually find it somewhat amusing when people say that python-dev isn't focusing on users enough because of the Python 3 transition or the Windows installer problems. What they *actually* seem to be complaining about is that python-dev isn't focused entirely on users that are native English speakers using an expensive proprietary OS. And that's a valid observation - most of us are here because we like Python and primarily want to make it better for the environments where *we* use it, which is mostly a combination of Linux and Mac users, a few other POSIX based platforms and a small minority of Windows developers. Given the contrariness of Windows as a target platform, the time of those developers is mostly spent on making it keep working, and bringing it up to feature parity with the POSIX version, so cleaning up the installation process falls by the wayside. (And, for all the cries of "Python should be better supported on Windows!", we just don't see many Windows devs signing up to help - since I consider developing for Windows its own special kind of hell that I'm happy to never have to do again, it doesn't actually surprise me there's a shortage of people willing to do it as a hobby.) In terms of actually *fixing it*, the PSF doesn't generally solicit grant proposals, it reviews (and potentially accepts) them. If anyone is serious about getting something done for 3.3, then *write and submit a grant proposal* to the PSF board with the goal of either finalising the Python launcher for Windows, or else just closing out various improvements to the current installer that are already on the issue tracker (e.g. version numbers in the shortcut names, an option to modify the system PATH). Even without going all the way to a grant proposal, go find those tracker items I mentioned and see if there's anything you can do to help folks like Martin von Loewis, Brian Curtin and Terry Reedy close them out. In the meantime, if the python.org packages for Windows aren't up to scratch (and they aren't in many ways), *use the commercially backed ones* (or one of the other sumo distributions that are out there). Don't tell your students to grab the raw installers directly from python.org, redirect them to the free rebuilds from ActiveState or Enthought, or go all out and get them to install something like Python(X, Y). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

First of all all the Python developers are doing an amazing job, and none of the comments should be taken as a critique but only as a suggestion. On Feb 9, 2012, at 3:34 PM, Nick Coghlan wrote: [...]
This is what I do now. I tell my students to use Enthought if they have trouble. Yet there are issues with the license and with 32-bit (free) vs 64-bit (not free). Long term I do not think this is what we should encourage.

On 2/9/2012 2:16 PM, Giampaolo Rodolà wrote:
Do *you* think that? Or are you reporting what others think? In either case, we have another communication problem. If one only uses the ascii subset, the usage of 3.x strings is transparent. As far as I can think, one does not need to know *anything* about unicode to use 3.x. In 3.3, there will not even be a memory hit. We should be saying that. Thanks for the heads-up. It is hard to know what misconceptions people have until someone reports them ;-).
In python 2 there was no such a strong imposition.
Nor is there in 3.x. We need to communicate that. I may give it a try on python-list. If and when one does want to use more characters, it should be *easier* in 3.x than in 2.x, especially for non-Latin1 Western European chars. -- Terry Jan Reedy

On 2/9/2012 11:30 PM, Matt Joiner wrote:
Not true, it's necessary to understand that encodings translate to and from bytes,
Only if you use byte encodings for ascii text. I never have, and I do not know why you would, unless you are using internet modules that do not sufficiently hide such details. Anyway...
So one only needs to know one encoding name, which most should know anyway, and that it *is* an encoding name.
and how to use the API.
Give the required parameter, which is standard.
In 2.x you rarely needed to know what unicode is.
All one *needs* to know about unicode, that I can see, is that unicode is a superset of ascii, that ascii number codes remain the same, and that one can ignore just about everything else until one uses (or wants to know about) non-ascii characters. Since one will see 'utf-8' here and there, it is probably good to know that the utf-8 encoding is a superset of the ascii encoding, so that ascii text *is* utf-8 text. -- Terry Jan Reedy
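
(A tiny illustration of that last point, in Python 3: ascii text survives a round trip through utf-8 unchanged, so 'utf-8' is the only encoding name a pure-ascii user ever has to remember.)

    s = "plain ascii text"
    b = s.encode('utf-8')              # b'plain ascii text' -- the same bytes
    assert b.decode('utf-8') == s and b.decode('ascii') == s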

There are a lot of things covered in this thread. I want to address 2 of them. 1. Garbage Collection. Python has garbage collection. There is no free() function in Python; anyone who says that Python does not have GC is talking nonsense. CPython uses reference counting as its means of implementing GC. Ref counting has different performance characteristics from tracing GC, but it only makes sense to consider this in the context of overall Python performance. One key disadvantage of ref-counting is that it does not play well with threads, which leads on to... 2. Global Interpreter Lock and Threads. The GIL is so deeply embedded into CPython that I think it cannot be removed. There are too many subtle assumptions pervading both the VM and 3rd party code to make truly concurrent threads possible. But are threads the way to go? Javascript does not have threads. Lua does not have threads. Erlang does not have threads; Erlang processes are implemented (in the BEAM engine) as coroutines. One of the Lua authors said this about threads (I can't remember the quote so I will paraphrase): "How can you program in a language where 'a = a + 1' is not deterministic?" Indeed. What Python needs are better libraries for concurrent programming based on processes and coroutines. Cheers, Mark.
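
(As a small illustration of the coroutine style Mark alludes to, here is a plain generator-based coroutine; the example itself is not Mark's. It only runs when explicitly resumed with send(), so there is no preemption and 'a = a + 1' stays deterministic.)

    # A running-average "worker" implemented as a generator coroutine.
    def averager():
        total = 0.0
        count = 0
        average = None
        while True:
            value = yield average      # suspend until the caller sends a value
            total += value
            count += 1
            average = total / count

    avg = averager()
    next(avg)                          # prime the coroutine
    print(avg.send(10))                # 10.0
    print(avg.send(20))                # 15.0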

The way I see it, the question is not whether Python has threads, fibers, coroutines, etc. The problem is that in 5 years we are going to have on the market CPUs with 100 cores (my phone has 2, my office computer has 8, not counting GPUs). The compilers/interpreters must be able to parallelize tasks using those cores without duplicating the memory space. Erlang may not have threads in the sense that it does not expose threads via an API, but it provides optional parallel schedulers where coroutines are distributed automatically over the available cores/CPUs (http://erlang.2086793.n4.nabble.com/Some-facts-about-Erlang-and-SMP-td210877...). Different languages have different mechanisms for taking advantage of multiple cores without forking. Python does not provide a mechanism and I do not know if anybody is working on one. In Python, currently, you can only do threading to parallelize your code without duplicating memory space, but performance decreases instead of increasing with the number of cores. This means threading is only good for concurrency, not for scalability. The GC vs reference counting (RC) issue is the heart of the matter. With RC, every time a variable is allocated or deallocated you need to lock the counter because you do not know who else is accessing the same variable from another thread. This forces the interpreter to basically serialize the program even if you have threads, cores, coroutines, etc. Forking is a solution only for simple toy cases and in trivially parallel cases. People use processes to parallelize web servers and task queues where the tasks do not need to talk to each other (except with the parent/master process). If you have 100 cores, even with a small 50MB program, in order to parallelize it you go from 50MB to 5GB. Memory and memory access become a major bottleneck. Massimo On Feb 10, 2012, at 3:29 AM, Mark Shannon wrote:

On 2012-02-10, at 15:52 , Massimo Di Pierro wrote:
Erlang may not have threads in the sense that it does not expose threads via an API but provides optional parallel schedulers
-smp has been enabled by default since R13 or R14, it's as optional as multithreading being optional because you can bind a process to a core.
In Python, currently, you can only do threading to parallelize your code without duplicating memory space, but performance decreases instead of increasing with number of cores. This means threading is only good for concurrency not for scalability.
That's definitely not true: you can also fork, and multiprocessing, while not ideal by a long shot, provides a number of tools for building concurrent applications via multiple processes.

Massimo Di Pierro, 10.02.2012 15:52:
Seriously - what's wrong with forking? multiprocessing is so incredibly easy to use that it's hard for me to understand why anyone would fight for getting threading to do essentially the same thing, just less safely. Threading is a seriously hard problem, very tricky to get right and full of land mines. Basically, you start from a field that's covered with one big mine, and start cutting it down until you can get yourself convinced that the remaining mines (if any, right?) are small enough to not hurt anyone. They usually do anyway, but at least not right away. This is generally worth a read (not necessarily for the conclusion, but definitely for the problem description): http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
Well, nothing keeps you from putting your data into shared memory if you use multiple processes. It's not that hard either, but it has the major advantage over threading that you can choose exactly what data should be shared, so that you can more easily avoid race conditions and unintended interdependencies. Basically, you start from a safe split and then add explicit data sharing and messaging until you have enough shared data and synchronisation points to make it work, while still keeping up a safe and efficient concurrent system. Note how this is the opposite of threading, where you start off from the maximum possible unsafety where all state is shared, and then wade through it with a machete trying to cut down unsafe interaction points. And if you miss any one spot, you have a problem.
This means threading is only good for concurrency not for scalability.
Yes, concurrency, or more specifically, I/O concurrency is still a valid use case for threading.
I think you should read up a bit on the various mechanisms for parallel processing. Stefan
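
(A minimal sketch of the "start from a safe split, then share explicitly" approach Stefan describes, using only the stdlib multiprocessing module; the names and numbers are illustrative.)

    import multiprocessing as mp

    def worker(chunk, counter, results):
        # Pure computation on private data...
        total = sum(x * x for x in chunk)
        results.put(total)             # explicit message passing
        # ...with one explicit, clearly visible synchronisation point.
        with counter.get_lock():
            counter.value += 1

    if __name__ == '__main__':
        counter = mp.Value('i', 0)     # the *only* shared state
        results = mp.Queue()
        chunks = [list(range(i, i + 10000)) for i in range(0, 40000, 10000)]
        procs = [mp.Process(target=worker, args=(c, counter, results))
                 for c in chunks]
        for p in procs:
            p.start()
        totals = [results.get() for _ in procs]
        for p in procs:
            p.join()
        print(sum(totals), "from", counter.value, "workers")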

On Feb 10, 2012, at 9:28 AM, Stefan Behnel wrote:
yes I should ;-) (Perhaps I should take this course http://www.cdm.depaul.edu/academics/pages/courseinfo.aspx?CrseId=001533) The fact is, in my experience, many modern applications where performance is important try to take advantage of all the parallelization available. I have worked for many years in lattice QCD and I have written code that runs on various parallel machines. We used processes to parallelize across nodes, threads to parallelize on a single node, and assembly vectorial instructions to parallelize within each core. This used to be a state-of-the-art way of programming but now I see these patterns trickling down to many consumer applications, for example games. People do not like threads because of the need for locking but, as you increase the number of cores, the bottleneck becomes memory access. If you use processes, you don't just bloat ram usage, killing cache performance, but you also need to use message passing for interprocess communication. Message passing requires copying data, which is expensive (remember, ram is the bottleneck). Even worse, sometimes message passing cannot be done using ram only and you need disk-buffered messages for interprocess communication. Some programs are parallelized ok with processes. Those I have experience with require both processes and threads. Again, this does not mean using threading APIs. The VM should use threads to parallelize tasks. How this is exposed to the developer is a different matter. Massimo

On 10 February 2012 14:52, Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
I don't know much about forking, but I'm pretty sure that forking a process doesn't mean you double the amount of physical memory used. With copy-on-write, a lot of physical memory can be shared. -- Arnaud

On Feb 10, 2012, at 9:43 AM, Arnaud Delobelle wrote:
Anyway, copy-on-write does not solve the problem. The OS tries to save memory by not duplicating physical memory space and by mapping the different address spaces of the various forked processes to the same physical memory. But as soon as one process writes into the segment, the entire segment is copied. It has to be: the processes must have different address spaces. That is what fork does. Anyway, there are many applications that are parallelized well with processes (at least for a small number of cores/cpus).

Arnaud Delobelle, 10.02.2012 16:43:
That applies to systems that support both fork and copy-on-write. Not all systems are that lucky, although many major Unices have caught up in recent years. The Cygwin implementation of fork() is especially involved, for example, simply because Windows lacks this idiom completely (well, in its normal non-POSIX identity, that is, where basically all Windows programs run). http://seit.unsw.adfa.edu.au/staff/sites/hrp/webDesignHelp/cygwin-ug-net-noc... Stefan
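
(For reference, the fork-and-read pattern under discussion looks roughly like this on a POSIX system. A sketch only; note that CPython's reference-count updates write into object headers, so in practice even "read-only" pages get copied as they are touched.)

    import os

    # Build a large structure once in the parent; fork()ed children inherit
    # the same physical pages via copy-on-write.
    big_table = {i: i * i for i in range(1000000)}

    children = []
    for key in (10, 200, 3000):
        pid = os.fork()
        if pid == 0:                   # child: read from the inherited pages
            print(os.getpid(), big_table[key])
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)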

On Fri, Feb 10, 2012 at 9:38 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Intel already has immediate plans for 10 core cpus, those have well functioning HT so they should be considered 20 core. Two socket boards are quite common, there's 40 cores. 4+ socket boards exist bringing your total to 80+ cores connected to a bucket of dram on a single motherboard. These are the types of systems in data centers being made available to people to run their computationally intensive software on. That counts as general purpose in my book. -gps

On Fri, 10 Feb 2012 08:52:16 -0600 Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
Forking is a solution only for simple toy cases and in trivially parallel cases.
But threading is only a solution for simple toy cases and trivial levels of scaling.
Only if they haven't thought much about using processes to build parallel systems. They work quite well for data that can be handed off to the next process, for cases where communication is a small enough part of the problem that serializing the data is reasonable, and for cases where the data that needs high-speed communication can be treated as a relocatable chunk of memory. And any combination of those three, of course. The real problem with using processes in python is that there's no way to share complex python objects between processes - you're restricted to ctypes values or arrays of those. For many applications, that's fine. If you need to share a large searchable structure, you're reduced to FORTRAN techniques.
That should be fixed in the OS, not by making your problem 2**100 times as hard to analyze. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Massimo Di Pierro wrote:
Forking is a solution only for simple toy cases and in trivially parallel cases. People use processes to parallelize web servers and task queues where the tasks do not need to talk to each other (except with the parent/master process). If you have 100 cores, even with a small 50MB program, in order to parallelize it you go from 50MB to 5GB. Memory and memory access become a major bottleneck.
By the time we have 100 core CPUs, we'll be measuring RAM in TB, so that shouldn't be a problem ;-) Many Python use cases are indeed easy to scale using multiple processes which then each run on a separate core, so that approach is a very workable way forward. If you need to share data across processes, you can use a shared memory mechanism. In many cases, the data to be shared will already be stored in a database and those can easily be accessed from all processes (again using shared memory). I often find these GIL discussions a bit theoretical. In practice I've so far never run into any issues with Python scalability. It's other components that cause a lot more trouble, like e.g. database query scalability, network congestion or disk access being too slow. In cases where the GIL does cause problems, it's usually better to consider changing the application design and use asynchronous processing with a single threaded design or a multi-process design where each of the processes only uses a low number of threads (20-50 per process). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 10 2012)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Fri, 10 Feb 2012 19:36:52 +0100 "M.-A. Lemburg" <mal@egenix.com> wrote:
Just a warning: mixing threads and forks can be hazardous to your sanity. In particular, forking a process that has threads running has behaviors, problems and solutions that vary between Unix variants. Best to make sure you've done all your forks before you create a thread if you want your code to be portable. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Mike Meyer wrote:
Right. Applications using such strategies will usually have long running processes, so it's often better to spawn new processes than to use fork. This also helps if you want to bind processes to cores. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 10 2012)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

I think the much, much better response to the questions and comments around Python, the GIL and parallel computing in general is this: Yes, let's have more of that! It's like asking if people like pie, or babies. 99% of people polled are going to say "Yes, let's have more of that!" - so it goes with Python, the GIL, STM, Multiprocessing, Threads, etc. Where all of these discussions break down - and they always do - is that we lack: 1> Someone with a working patch for Pie 2> Someone with a fleshed out proposal/PEP on how to get more Pie 3> A group of people with time to bake more Pies, or who could be paid to make Pie Banging on the table and asking for more Pie won't get us more Pie - what we need are actual proposals, in the form of well thought out PEPs, the people to implement and maintain the thing (see: unladen swallow), or working implementations. No one in this thread is arguing that having more Pie, or babies, would be bad. No one is arguing that more/better concurrency constructs wouldn't be good. Tools like concurrent.futures in Python 3 would be a good example of something recently added. The problem is people, plans and time. If we can solve the People and Time problems, instead of looking to already overworked volunteers, then I'm sure we can come up with a good Pie plan. I really like pie. Jesse

On 10.02.2012 19:36, M.-A. Lemburg wrote:
By the time we have 100 core CPUs, we'll be measuring RAM in TB, so that shouldn't be a problem ;-)
Actually, Python is already great for those. They are called GPUs, and OpenCL is all about text processing.
The "GIL problem" is much easier to analyze than most Python developers using Linux might think: - Windows has no fork system call. SunOS used to have a very slow fork system call. The majority of Java developers worked with Windows or Sun, and learned to work with threads. For which the current summary is: - The GIL sucks because Windows has no fork. Which some might say is the equivalent of: - Windows sucks. Sturla

Sturla Molden wrote:
I'm not sure why you think you need os.fork() in order to work with multiple processes. Spawning processes works just as well and, often enough, is all you really need to get the second variant working. The first variant doesn't need threads at all, but can not always be used since it requires all application components to play along nicely with the async approach. I forgot to mention a third variant: use a multi-process design with single threaded asynchronous processing in each process. This third variant is becoming increasingly popular, esp. if you have to handle lots and lots of individual requests with relatively low need for data sharing between the requests. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 10 2012)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Fri, Feb 10, 2012 at 11:54 AM, Sturla Molden <sturla@molden.no> wrote:
Please do not claim that fork() semantics and copy-on-write are good things to build off of... They are not. fork() was designed in a world *before threads* existed. It simply cannot be used reliably in a process that uses threads, and tons of real-world practical C and C++ software that Python programs need to interact with, be embedded in, or use via extension modules these days uses threads quite effectively. The multiprocessing module on posix would be better off if it offered a windows CreateProcess() work-a-like mode that spawns a *new* python interpreter process rather than depending on fork(). The use of fork() means multithreaded processes cannot reliably use the multiprocessing module (and those other threads could come from libraries or C/C++ extension modules that you cannot control within the scope of your own software that desires to use multiprocessing). This is likely not hard to implement, if nobody has done it already, as I believe the windows support already has to do much the same thing today. -gps
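
(To illustrate what a CreateProcess-style "spawn" mode amounts to, here is a deliberately simplified sketch using only subprocess and pickle. Python 3 is assumed, and the helper name is made up; a real implementation would have to deal with imports, error handling and cleanup.)

    import pickle
    import subprocess
    import sys

    # The child is a *fresh* interpreter: no inherited threads, locks or state.
    _CHILD = ("import pickle, sys; "
              "func, args = pickle.load(sys.stdin.buffer); "
              "pickle.dump(func(*args), sys.stdout.buffer)")

    def run_in_fresh_interpreter(func, *args):
        # Caveat: pickle serialises functions by reference, so 'func' must
        # live in an importable module (not __main__) for the child to find it.
        proc = subprocess.Popen([sys.executable, "-c", _CHILD],
                                stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        out, _ = proc.communicate(pickle.dumps((func, args)))
        return pickle.loads(out)

    import os.path
    print(run_in_fresh_interpreter(os.path.join, "spam", "eggs"))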

Sorry for the late reply, but this itch finally got to me...
Please do not claim that fork() semantics and copy-on-write are good things to build off of...
They work just fine for large classes of problems that require hundreds or thousands of cores.
They are not. fork() was designed in a world *before threads* existed.
This is wrong. While the name "thread" may not have existed when fork() was created, the *concept* of concurrent execution in a shared address space predates the creation of Unix by a good decade. Most notably, Multics - what the creators of Unix were working on before they did Unix - at least discussed the idea, though it may never have been implemented (a common fate of Multics features). Also notable is that Unix introduced the then ground-breaking idea of having the command processor create a new process to run user programs. Before Unix, user commands were run in the process (and hence address space) of the command processor. Running things in what is now called "the background" (which this architecture made a major PITA) gave you concurrent execution in a shared address space - what we today call threads. The reason those systems did this was because creating a process was *expensive*. That's also why the Multics folks looked at threads. The Unix fork/exec pair was cheap and flexible, allowing the creation of a command processor that supported easy backgrounding, pipes, and IO redirection. Fork has since gotten more expensive, in spite of the ongoing struggles to keep it cheap.
Personally, I find that the fact that threads can't be used reliably in a process that forks makes threads bad things to build off of. After all, there's tons of real-world practical software in many languages that python needs to interact with that uses fork effectively.
While it's a throwback to the 60s, it would make using threads and processes more convenient, but I don't need it. Why don't you submit a patch? <mike

[Replies have been sent to concurrency-sig@python.org] On Sun, 12 Feb 2012 23:14:51 +0100 Sturla Molden <sturla@molden.no> wrote:
subprocess and threads interact *really* badly on Unix systems. Python is missing the tools needed to deal with this situation properly. See http://bugs.python.org/issue6923. Just another of the minor reasons not to use threads in Python. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Mon, 13 Feb 2012 08:13:36 +0800 Matt Joiner <anacrolix@gmail.com> wrote:
This attitude is exemplary of the status quo in Python on threads: Pretend they don't exist or you'll get hurt.
Yup. After all, the answer to the question "Which modules in the standard library are thread-safe?" is "threading, queue, logging and functools" (at least, that's my best guess). Any effort to "fix" threading in Python is pretty much doomed until the authoritative answer to that question includes most of the standard library. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Mon, 13 Feb 2012 01:41:48 +0100 Sturla Molden <sturla@molden.no> wrote:
Not (quite) true. There are a few fringe languages that have embraced threading and been built (or worked over) from the ground up to work well with it. I haven't seen any that let you mix multiprocessing and threading safely, though, so the attitude there is "pretend fork doesn't exist or you'll get hurt." These are the places where I've seen safe (as in, I trusted them as much as I'd have trusted a version written using processes) non-trivial (as in, they were complex enough that if they'd been written in a mainstream language like Python, I wouldn't have trusted them) threaded applications. I strongly believe we need better concurrency solutions in Python. I'm not convinced that threading is the best general solution, because threading is like the GIL: a kludge that solves the problem by fixing *everything*, whether it needs it or not, and at very high cost. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Den 12.02.2012 21:56, skrev Mike Meyer:
The "expensive" argument is also why the Windows API has no fork, although the Windows NT-kernel supports it. (There is even a COW fork in Windows' SUA.) I think fork() is the one function I have missed most when programming for Windows. It is the best reason to use SUA or Cygwin instead of the Windows API. Sturla

On Fri, Feb 10, 2012 at 9:52 AM, Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
uh... if you need to lock it for allocation, that is an issue with the malloc, rather than refcounting. And if you need to lock it for deallocation, then your program already has a (possibly threading-race-condition-related) bug. The problem is that you need to lock the memory for writing every time you acquire or release a view of the object, even if you won't be modifying the object. (And this changing of the refcount makes copy-on-write copy too much.) There are plenty of ways around that, mostly by using thread-local (or process-local or machine-local) proxies; the original object only gets one incref/decref from each remote thread; if sharable objects are delegated to a memory-controller thread, even better. Once you have the infrastructure for this, you could also more easily support "permanent" objects like None. The catch is that the overhead of having the refcount+pointer (even without the proxies) instead of just "refcount 4 bytes ahead" turns out to be pretty high, so those forks (and extensions, if I remember pyro http://irmen.home.xs4all.nl/pyro/ correctly) never really caught on. Maybe that will change when the number of cores that aren't already in use for other processes really does skyrocket. -jJ

Terry Reedy writes:
Sorry, Terry, but you're basically wrong here. True, if one sticks to pure ASCII, there's no difference to notice, but that's just not possible for people who live outside of the U.S., or who share text with people outside of the U.S. They need currency symbols, they have friends whose names have little dots on them. Every single one of those is a backtrace waiting to happen. A backtrace on
    f = open('text-file.txt')
    for line in f:
        pass
is an imposition. That doesn't happen in 2.x (for the wrong reasons, but it's very convenient 95% of the time). This is what Victor's "locale" codec is all about. I think that's the wrong spelling for the feature, but there does need to be a way to express "don't bother me about Unicode" in most scripts for most people. We don't have a decent boilerplate for that yet.

On 2/10/2012 3:41 AM, Stephen J. Turnbull wrote:
The claim is that Python3 imposes a large burden on users that Python2 does not.
Nor is there in 3.x.
I view that claim as FUD, at least for many users, and at least until the persons making the claim demonstrate it. In particular, I claim that people who use Python2 knowing nothing of unicode do not need to know much more to do the same things in Python3; and that, if someone uses Python2 with full knowledge of Unicode, Python3 cannot impose any extra burden. Since I am claiming negatives, the burden of proof is on those who claim otherwise.
Sorry, Terry, but you're basically wrong here.
This is not a nice way to start a response, especially when you go on to admit that I was right as to the user case I discussed. Here is what you clipped:
If one only uses the ascii subset, the usage of 3.x strings is transparent.
True, if one sticks to pure ASCII, there's no difference to notice,
Which is a restatement of what you clipped. In another post I detailed the *small* amount (one paragraph) that I believe such people need to know to move to Python3. I have not seen this minimum laid out before and I think it would be useful to help such people move to Python3 without FUD-induced fear.
but that's just not possible for people who live outside of the U.S.,
Who *already* have to know about more than ascii to use Python2. The question is whether they have to learn *substantially* more to use Python3.
OK, real-life example. My wife has colleagues in China. They interchange emails (utf-8 encoded) with project budgets and some Chinese characters. Suppose she asks me to use Python to pick out ¥ renminbi/yuan figures and convert to dollars. What 'strong imposition' does Python3 make, requiring me to learn things I would not have to know to do the same thing in Python2?
I do not consider adding an encoding argument to make the same code work in Python3 to be "a strong imposition of unicode awareness". Do you? In order to do much other than pass, I believe one typically needs to know the encoding of the file, even in Python2. And of course, knowing about and using the one unicode byte encoding is *much* easier than knowing about and using the 100 or so non-unicode (or unicode subset) encodings. To me, Python3's

    s = open('text.txt', encoding='utf-8').read()

is easier and simpler than either Python2 version below (and please pardon any errors as I never actually did this)

    import codecs
    s = codecs.open('text.txt', encoding='utf-8').read()

or

    f = open('text.txt')
    s = unicode(f.read(), 'utf-8')

-- Terry Jan Reedy

Threading is a tool (the most popular, and most flexible tool) for concurrency and parallelism. Compared to forking, multiprocessing, shared memory, mmap, and dozens of other auxiliary OS concepts it's also the easiest. Not all problems are clearly chunkable or fit some alternative parallelism pattern. Threading is arguably the cheapest method for parallelism, as we've heard throughout this thread. Just because it can be dangerous is no reason to discourage it. Many alternatives are equally as dangerous, more difficult and less portable. Python is a very popular language. Someone mentioned earlier that popularity shouldn't be an argument for features, but here it's fair ground. If Python 3 had unrestrained threading, this transition plunge would not be happening. People would be flocking to it for their free lunches. The lack of unrestrained single-process parallelism is the #1 reason not to choose Python for a future project. Note that certain fields use alternative parallelism like MPI, and whoopee for them, but these aren't applicable to general programming. Nor is the old argument "write a C extension". Except for old stooges who can't let go of curly braces, most people agree Python is the best mainstream language, but the implementation is holding it back. The GIL has to go if CPython is to remain viable in the future for non-trivial applications. The current transition is like VB when .NET came out: everyone switched to C# rather than upgrade to VB.NET, because it was wiser to switch to the better language than to pay the high upgrade cost. Unfortunately the Python 3 ship has sailed, and presumably the GIL has to remain until 4.x at the least. Given this, it seems there is some wisdom in the current head-in-the-sand advice: it's too hard to remove the GIL, so just use some other mechanism if you want parallelism. But it's misleading to suggest those mechanisms are superior, as described above. So with that in mind, can the following changes occur in Python 3 without breaking spec?
- Replace the ref-counting with another GC?
- Remove the GIL?
If not, should these be relegated to Python 4 and alternate implementation discussions?

Matt: directing a threading rant at me because I posted about unicode, a completely different subject, is bizarre. I have not said a word on this thread, and hardly ever on any other thread, about threading, concurrency, and the GIL. I have no direct interest in these subjects. But since you directed this at me, I will answer. On 2/10/2012 9:24 PM, Matt Joiner wrote: ...
If you had paid attention to this thread and others, you would know:
1. These are implementation details not in the spec.
2. There are other implementations without these.
3. People have attempted the changes you want for CPython. But so far, both would have substantial negative impacts on many CPython users, including me.
4. You are free to try to improve on previous work.
As to the starting subject of this thread: I switched to Python 1.3, just before 1.4, when Python was an obscure language in the Tiobe 20s. I thought then and still do that it was best for *me*, regardless of what others decided for themselves. So while I am pleased that its usage has risen considerably, I do not mind that it has (relatively) plateaued over the last 5 years. And I am not panicked that an up wiggle was followed by a down wiggle. -- Terry Jan Reedy

I'm asking if it'd actually be accepted in 3. I know well, and have seen how quickly things are blocked and rejected in core (dabeaz and shannon's attempts come to mind). I'm well familiar with previous attempts. As an example consider that replacing ref counting would probably change the API, but is a prerequisite for performant removal of the GIL.

On Sat, Feb 11, 2012 at 1:40 PM, Matt Joiner <anacrolix@gmail.com> wrote:
I'm asking if it'd actually be accepted in 3.
Why is that relevant? If free threading is the all-singing, all-dancing wonderment you believe:
1. Fork CPython
2. Make it free-threaded (while retaining backwards compatibility with all the C extensions out there!)
3. Watch the developer hordes flock to your door (after all, it's the lack of free-threading that has held Python's growth back for the last two decades, so everyone will switch in a heartbeat the second you, or anyone else, publishes a free-threaded alternative where all their C extensions work. Right?).
If that's what you think happened, then no, you're not familiar with them at all. python-dev has just a few simple rules for accepting a free-threading patch:

1. All current third party C extension modules must continue to work (ask the folks working on Ironclad for IronPython and cpyext for PyPy how much fun *that* requirement is)
2. Calls to builtin functions and methods should remain atomic (the Jython experience can help a lot here)
3. The performance impact on single threaded scripts must be minimal (which basically means eliding all the locks in single-threaded mode the way CPython currently does with the GIL, but then creating those elided locks in the correct state when Python's threading support gets initialised)

That's it, that's basically all the criteria we have for accepting a free-threading patch. However, while most people are quite happy to say "Hey, someone should make CPython free-threaded!", they're suddenly far less interested when faced with the task of implementing it *themselves* while preserving backwards compatibility (and if you think the Python 2 -> Python 3 transition is rough going, you are definitely *not* prepared for the enormity of the task of trying to move the entire C extension ecosystem away from the refcounting APIs. The refcounting C API compatibility requirement is *not* optional if you want a free-threaded CPython to be the least bit interesting in real world terms).

When we can't even get enough volunteers willing to contribute back their fixes and workarounds for the known flaws in multiprocessing, do people think there is some magical concurrency fairy that will sprinkle free threading pixie dust over CPython and the GIL will be gone?

Removing the GIL *won't* be fun. Just like fixing multiprocessing, or making improvements to the GIL itself, it will be a long, tedious grind dealing with subtleties of the locking and threading implementations on Windows, Linux, Mac OS X, *BSD, Solaris and various other platforms where CPython is supported (or at least runs). For extra fun, try to avoid breaking the world for CPython variants on platforms that don't even *have* threading (e.g. PyMite). And your reward for all that effort? A CPython with better support for what is arguably one of the *worst* approaches to concurrency that computer science has ever invented.

If a fraction of the energy that people put into asking for free threading was instead put into asking "how can we make inter-process communication better?", we'd no doubt have a good shared object implementation in the mmap module by now (and someone might have actually responded to Jesse's request for a new multiprocessing maintainer when circumstances forced him to step down). But no, this is the internet: it's much easier to incessantly berate python-dev for pointing out that free threading would be extraordinarily hard to implement correctly and isn't the panacea that many folks seem to think it is than it is to go *do* something that's more likely to offer a good return on the time investment required.

My own personal wishlist for Python's concurrency support?
* I want to see mmap2 up on PyPI, with someone working on fast shared object IPC that can then be incorporated into the stdlib's mmap module
* I want to see multiprocessing2 on PyPI, with someone working on the long list of multiprocessing bugs on the python.org bug tracker (including adding support for Windows-style, non-fork based child processes on POSIX platforms)
* I want to see progress on PEP 3153, so that some day we can have a "Python event loop" instead of a variety of framework specific event loops, as well as solid cross-platform async IO support in the stdlib.

As Jesse said earlier, asking for free threading in CPython is like asking for free pie. Sure, free pie would be nice, but who's going to bake it? And what else could those people be doing with their time if they weren't busy baking pie? Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 11.02.2012 07:15, Nick Coghlan wrote:
There are several solutions to this, I think. One is to use one interpreter per thread, and share no data between them, similar to tcl and PerlFork. The drawback is developers who forget to duplicate file handles, so one interpreter can close a handle used by another. Another solution is transactional memory. Consider a database with commit and rollback. Not sure how to fit this with C extensions though, but one could in theory build a multithreaded interpreter like that.
I think I already explained why BSD mmap is a dead end. We need named kernel objects (System V IPC or Unix domain sockets) as they can be communicated between processes. There are also reasons to prefer SysV message queues over shared memory (SysV or BSD), such as thread safety, i.e. access is synchronized by the kernel. SysV message queues also have atomic read/write, unlike sockets, and they are generally faster than pipes. With sockets we have to ensure that the correct number of bytes were read or written, which is a PITA for any IPC use (or any other messaging for that matter). In the meanwhile, take a look at ZeroMQ (actually written ØMQ). ZeroMQ also has atomic read/write messages. Sturla
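A minimal sketch of that atomic-message style using the third-party pyzmq package (illustrative only; the transport address and payload are made up):

    import zmq

    # Each send()/recv() transfers one whole message atomically -- no
    # bookkeeping for short reads/writes as with raw sockets or pipes.
    ctx = zmq.Context()

    sender = ctx.socket(zmq.PUSH)
    sender.bind("tcp://127.0.0.1:5555")

    receiver = ctx.socket(zmq.PULL)
    receiver.connect("tcp://127.0.0.1:5555")

    sender.send(b"one complete message")
    print(receiver.recv())   # b'one complete message'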

Matt Joiner, 11.02.2012 03:24:
Sure, "easy" as in "nothing is easier to get wrong". You did read my post on this matter, right? I'm yet to see a piece of non-trivially parallel code that uses threading and is known to be completely safe under all circumstances. And I've seen a lot.
Wrong again. Threading can be pretty expensive in terms of unexpected data dependencies, and it certainly is in terms of debugging time. Debugging spurious threading issues is amongst the hardest problems for a programmer.
Just because it can be dangerous is no reason to discourage it. Many alternatives are equally as dangerous, more difficult and less portable.
Seriously - how is running separate processes less portable than threading?
Note that this is not a "Python 2 vs. Python 3" issue. In fact, it has nothing to do with Python 3 in particular. [stripped some additional garbage] Stefan

On 2012-02-11, at 03:24 , Matt Joiner wrote:
Such a statement unqualified can only be declared wrong. Threading is the most common due to Windows issues (historically, Unix parallelism used multiple processes and the switch happened with the advent of multiplatform tools, which focused on threading due to Windows' poor performance and high overhead with processes), and it is also the easiest tool *to start using*, because you just say "start a thread". Which is equivalent to saying grenades are the easiest tool to handle conversations because you just pull the pin. Threads are by far the hardest concurrency tool to use because they throw all determinism in the whole program out the window, and that determinism then needs to be reclaimed through (very) careful analysis and the use of locks or other such sub-tools. And the flexibility claim is complete nonsense. Oh, and so are your comparisons: "shared memory" and "mmap" are not comparable to threading since they *are used* by and in threading. And forking and multiprocessing are the same thing, only the initialization call changes. Finally, multiprocessing has a far better upgrade path (as e.g. Erlang demonstrates): if your non-deterministic points are well delineated and your interfaces to other concurrent execution points are well defined, scaling from multiple cores to multiple machines becomes possible.
Of course it is, just as manual memory management is "discouraged".
Many alternatives are equally as dangerous, more difficult and less portable.
The main alternative to threading is multiprocessing (whether via fork or via starting new processes does not matter); it is significantly less dangerous, it is only more difficult in that you can't take extremely dangerous shortcuts, and it is just as portable (if not more so).
Threading is a red herring: nobody fundamentally cares about threading, what users want is a way to exploit their cores. If `multiprocessing` were rock-solid and easier to use, `threading` could just be killed and nobody would care. And we'd probably find ourselves in a far better world.

Terry Reedy writes:
The point is that the user case you discuss is a toy case. Of course the problem goes away if you get to define the problem away. I don't know of any nice way to say that.
I'll go back and take a look at it. It probably is useful. But I don't think it deals with the real issue. The problem is that without substantially more knowledge than what you describe as the minimum, the fear, uncertainty, and doubt is *real*. Anybody who follows Mailman, for example, is going to hear (even today, though much less frequently than 3 years ago, and only for installations with ancient Mailman from 2006 or so) of weird Unicode errors that cause messages to be "lost". Hearing that Python 3 requires everything be decoded to Unicode is not going to give the innocent confidence. There's also a lot of FUD being created out of whole cloth, such as the alleged inefficiency of recoding ASCII into Unicode, etc., which doesn't matter for most applications. The problem is that the FUD based on real issues that you don't understand gives credibility to the FUD that somebody made up.
None. The FUD is not about *processing* non-ASCII. It's about non-ASCII horking your process even though you have no intention of processing it.
Yes, I do. If you get it wrong, you will still get a fatal UnicodeError.
In order to do much other than pass, I believe one typically needs to know the encoding of the file, even in Python2.
The gentleman once again seems to be suffering from a misconception. Quite often you need to know nothing about the encoding of a file, except that the parts you care about are ASCII-encoded. For example, in an American programming shop

    git log | ./count-files-touched-per-day.py

will founder on 'Óscar Fuentes' as author, unless you know what coding system is used, or know enough to use latin-1 (because it's effectively binary, not because it's the actual encoding).
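A sketch of such a filter (the script name above is from the example; this simplified counting body is hypothetical) that sidesteps the problem by wrapping stdin as latin-1:

    import collections
    import io
    import sys

    # Re-wrap stdin as latin-1: every byte maps to some code point, so author
    # names like 'Óscar Fuentes' can never raise UnicodeDecodeError, while the
    # ASCII parts we actually parse are unaffected.
    stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="latin-1")

    per_day = collections.Counter()
    for line in stdin:
        if line.startswith("Date:"):
            # e.g. "Date:   Thu Feb 9 16:19:23 2012 +0100"
            per_day[" ".join(line.split()[1:4])] += 1

    for day, count in per_day.most_common():
        print(day, count)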
Indeed, it is. But we're not talking about dealing with Unicode; we're talking about why somebody who really only wants to deal with ASCII needs to know more about Unicode in Python 3 than in Python 2.
The reason why Unicode is part of the FUD is that in Python 2 you never needed to do that, unless you wanted to deal with a non-English language. With Python 3 you need to deal with the codec, always, or risk a UnicodeError simply because some Spaniard's name gets mentioned by somebody who cares about orthography.

On Feb 10, 2012, at 5:32 PM, Stephen J. Turnbull wrote:
Or just use errors="surrogateescape". I think we should tell people who are scared of unicode and refuse to learn how to use it to just add an errors="surrogateescape" keyword to their file open arguments. Obviously, it's the wrong thing to do, but it's wrong in the same way that Python 2 bytes are wrong, so if you're absolutely committed to remaining ignorant of encodings, you can continue to do that.
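Spelled out (with a made-up filename, and using encoding='ascii' explicitly, as discussed later in the thread), the recipe looks like this; undecodable bytes come in as lone surrogates instead of raising, and go back out verbatim if the same handler is used for writing:

    # Read a mostly-ASCII file of unknown encoding without decode errors.
    with open('notes.txt', encoding='ascii', errors='surrogateescape') as f:
        text = f.read()

    # Round-trip: the escaped bytes are restored verbatim on the way out.
    with open('notes-copy.txt', 'w', encoding='ascii',
              errors='surrogateescape') as f:
        f.write(text)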

Carl M. Johnson writes:
No, it's not the same as Python 2, and it's *subtly* the wrong thing to do, too. surrogateescape is intended to roundtrip on input from a specific API to unchanged output to that same API, and that's all it is guaranteed to do. Less pedantically, if you use latin-1, the internal representation is valid Unicode but (partially) incorrect content. No UnicodeErrors. If you use errors="surrogateescape", any code that insists on valid Unicode will crash. Here I'm talking about a use case where the user believes that as long as the ASCII content is correct they will get correct output. It's arguable that using errors="surrogateescape" is a better approach, *because* of the possibility of a validity check. I tend to think not. But that's a different argument from "same as Python 2".

On 2/10/2012 10:32 PM, Stephen J. Turnbull wrote: The issue is whether Python 3 has a "strong imposition of Unicode awareness" that Python 2 does not. If the OP only meant awareness of the fact that something called 'unicode' exists, then I suppose that could be argued. I interpreted the claim as being about some substantive knowledge of unicode. In any case, the claim that I disagree with is not about people's reactions to Python 3 or about human psychology and the propensity to stick with the known. In response to Jim Jewett, you wrote
That is pretty much my counterclaim, with the note that the 'little bit of knowledge' is mostly about non-unicode encodings and the change to some Python details.
The point is that the user case you discuss is a toy case.
Thanks for dismissing me and perhaps a hundred thousand users as 'toy cases'.
the problem goes away if you get to define the problem away.
Doing case analysis, starting with the easiest cases, is not defining the problem away. It is, rather, an attempt to find the 'little bit of knowledge' needed in various cases. In your response, you went on to write
Exactly, and finding the Python 3 version of the magic spells needed in various cases, so they can be documented and publicized, is what I have been trying to do. For ascii-only use, the magic spell is 'ascii' in bytes() calls. For some other uses, it is 'encoding=latin-1' in open(), str(), and bytes() calls, and perhaps elsewhere. Neither of these constitute substantial 'unicode awareness'.
I don't know of any nice way to say that.
There was no need to say it. -- Terry Jan Reedy

Terry Reedy writes:
I interpreted the claim as being about changing their coding practice, including maintaining existing scripts and modules that deal with textual input that people may need/want to transition to Python 3. As Paul Moore pointed out, adding "encoding='latin-1'" to their scripts doesn't come naturally to everyone. I'm sure that at a higher level, that's the stance you intend to take, too. I think there's a disconnect between that high-level stance, and the interpretation that it's about "substantive knowledge of Unicode".
OK. But then I think you are failing to deal with the problem, because I think *that* is the problem. Python 3 doesn't lack simple idioms for making (most naive, near-English) processing look like Python 2 to a greater or lesser extent. The question is which of those idioms we should teach, and AFAICS what's controversial about that depends on human psychology, not on the admitted facts about Python 3.
And my counterrebuttal is "true -- but that's not what these users want, and they probably don't need it." That is, they don't want to debug a crash when they don't care what happens to non-ASCII in their mostly-ASCII, nearly-readable-as-English byte streams.
Thanks for unwarrantedly dissing me. I do *not* dismiss people. I claim that the practical use case for these users is *not* 6-sigma-pure ASCII. You, too, will occasionally see Mr. Fuentes or even his Israeli sister-in-law show up in your "pure ASCII, or so I thought" texts. Better-than-Ivory-soap-pure *is* a "toy" case. Only in one's own sandbox can that be guaranteed. Otherwise, Python 3 needs to be instructed to prepare for (occasional) non-ASCII.
Except that AFAIK Python 3 already handles pure ASCII pretty much automatically. But pure ASCII doesn't exist for most people any more, even in Kansas; that magic spell will crash. 'latin-1' is a much better spell (except for people who want to crash in appropriate circumstances -- but AFAIK in the group whose needs this thread addresses, they are a tiny minority).
I don't know of any nice way to say that.
There was no need to say it.
Maybe not, but I think there was. Some of your well-intended recommendations are unrealistic, and letting them pass would be a disservice to the users we are *both* trying to serve.

On 11 February 2012 00:07, Terry Reedy <tjreedy@udel.edu> wrote:
Concrete example, then. I have a text file, in an unknown encoding (yes, it does happen to me!) but opening in an editor shows it's mainly-ASCII. I want to find all the lines starting with a '*'. The simple

    with open('myfile.txt') as f:
        for line in f:
            if line.startswith('*'):
                print(line)

fails with encoding errors. What do I do? Short answer, grumble and go and use grep (or in more complex cases, awk) :-( Paul.

On 2012-02-11, at 13:33 , Stefan Behnel wrote:
It's true that it requires handling encodings upfront where Python 2 allowed you to play fast and loose, though. And using latin-1 in that context looks and feels weird/icky: the file is not encoded using latin-1, the encoding just happens to work to manipulate bytes as ascii text + non-ascii stuff.

Masklinn, 11.02.2012 13:41:
Well, except for the cases where that didn't work. Remember that implicit encoding behaves in a platform dependent way in Python 2, so even if your code runs on your machine, that doesn't mean it will work for anyone else.
Correct. That's precisely the use case described above. Besides, it's perfectly possible to process bytes in Python 3. You just have to open the file in binary mode and do the processing at the byte string level. But if you don't care (and if most of the data is really ASCII-ish), using the ISO-8859-1 encoding in and out will work just fine for problems like the above. Stefan

On 2012-02-11, at 13:53 , Stefan Behnel wrote:
Sure, I said it allowed you, not that this allowance actually worked.
Yes, but now instead of just ignoring that stuff you have to actively and knowingly lie to Python to get it to shut up.
I think that's the route which should be taken, but (and I'll readily admit not to have followed the current state of this story) I'd understood manipulations of bytes-as-ascii-characters-and-stuff to be far more annoying (in Python 3) than string manipulation even for simple use cases.

Masklinn, 11.02.2012 17:18:
The advantage is that it becomes explicit what you are doing. In Python 2, without any encoding, you are implicitly assuming that the encoding is Latin-1, because that's how you are processing it. You're just not spelling it out anywhere, thus leaving it to the innocent reader to guess what's happening. In Python 3, and in better Python 2 code (using codecs.open(), for example), you'd make it clear right in the open() call that Latin-1 is the way you are going to process the data.
Oh, absolutely not. When it's text, it's best to process it as Unicode. Stefan

On 2012-02-11, at 20:35 , Stefan Behnel wrote:
I'm not sure going from "ignoring it" to "explicitly lying about it" is a great step forward. latin-1 is not "the way you are going to process the data" in this case, it's just the easiest way to get Python to shut up and open the damn thing.
Except it's not processed as text, it's processed as "stuff with ascii characters in it". Might just as well be cp-1252, or UTF-8, or Shift JIS (which is kinda-sorta-extended-ascii but not exactly), and while using an ISO-8859 codec will yield unicode data, that's about the only thing you can say about it; the actual result will probably be mojibake either way. By processing it as bytes, it's made explicit that this is not known and decoded text (which is what unicode strings imply) but that it's some semi-arbitrary ascii-compatible encoding and that's the extent of the developer's knowledge and interest in it.

Masklinn, 11.02.2012 20:46:
Well, you are still processing it as text because you are (again, implicitly) assuming those ASCII characters to be just that: ASCII encoded characters. You couldn't apply the same byte processing algorithm to UCS2 encoded text or a compressed gzip file, for example, at least not with a useful outcome. Mind you, I'm not regarding any text semantics here. I'm not considering whether the thus decoded data results in French, Danish, German or other human words, or in completely incomprehensible garbage. That's not relevant. What is relevant is that the program assumes an identity mapping from 1 byte to 1 character to work correctly, which, speaking in Unicode terms, implies Latin-1 decoding. Therefore my advice to make that assumption explicit. Stefan

On 11 February 2012 19:46, Masklinn <masklinn@masklinn.net> wrote:
No, not at all. It *is* text. I *know* it's text. I know that it is encoded in an ASCII-superset (because I can read it in a text editor and *see* that it is). What I *don't* know is what those funny bits of mojibake I see in the text editor are. But I don't really care. Yes, I could do some analysis based on the surrounding text and confirm whether it's latin-1, utf-8, or something similar. But it honestly doesn't matter to me, as all I care about is parsing the file to find the change authors, and printing their names (to re-use the "manipulating a ChangeLog file" example). And even if it did matter, the next file might be in a different ASCII-superset encoding, but I *still* won't care because the parsing code will be exactly the same. Saying "it's bytes" is even more of a lie than "it's latin-1". The honest truth is "it's an ASCII superset", and that's all I need to know to do the job manually, so I'd like to write code to do the same job without needing to lie about what I know. I'm now 100% convinced that encoding="ascii",errors="surrogateescape" is the way to say this in code. Paul.

On Sun, Feb 12, 2012 at 9:24 AM, Paul Moore <p.f.moore@gmail.com> wrote:
I created http://bugs.python.org/issue13997 to suggest codifying this explicitly as an open_ascii() builtin. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 11 February 2012 21:24, Paul Moore <p.f.moore@gmail.com> wrote:
What I *don't* know is what those funny bits of mojibake I see in the text editor are.
So, do yourself and us, "the rest of the world", a favor, and open the file in binary mode. Also, I'd suggest you and anyone being picky about encoding to read http://www.joelonsoftware.com/articles/Unicode.html so you can finally have in your mind that *** ASCII is not text ***. It used to be text back when, to get non-[A-Z|a-z] text, you had to have someone record a file on a tape, pack it in their luggage, and take a plane "overseas" to the U.S.A. That is not the case anymore, and that, as far as I understand, is the reasoning for Python 3 defaulting to unicode. Anyone can work "ignoring text" and treating bytes as bytes, opening a file in binary mode. You can use "os.linesep" instead of a hard-coded "\n" to overcome linebreaking. (Of course you might accidentally break a line inside a multi-byte character in some encoding, since you prefer to ignore them altogether, but it should be rare). js -><-

On Sun, Feb 12, 2012 at 1:43 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
See http://bugs.python.org/issue13997 , mentioned earlier in the thread. Cheers, Chris

Greg Ewing writes:
Yes! However, I don't think this 1.5-liner needs to be a built-in. (The 1.5-liner for 'open_as_ascii_compatible' was posted elsewhere.) There's also the issue of people who strongly prefer sloppy encoding and Read My Lips: No UnicodeErrors. I disagree with them in all purity, but you know ....
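Such a helper really is tiny. This is only a guess at the kind of thing that was posted elsewhere, reusing the name mentioned above:

    def open_as_ascii_compatible(filename, mode='r', **kwargs):
        # Treat the file as "ASCII plus opaque bytes": reads never raise, and
        # the unknown bytes are reproduced verbatim if written back out with
        # the same handler.
        return open(filename, mode, encoding='ascii',
                    errors='surrogateescape', **kwargs)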

Paul Moore writes:
It probably is, for you. If that ever gives you a UnicodeError, you know how to find out how to deal with it. And it probably won't.<wink/> That may also be a good universal default for Python 3, as it will pass through non-ASCII text unchanged, while raising an error if the program tries to manipulate it (or hand it to a module that validates). (encoding='latin-1' definitely is not a good default.) But I'm not sure of that, and the current approach of using the preferred system encoding is probably better. I don't think either argument applies to everybody who needs such a recipe, though. Many will be best served with encoding='latin-1' by some name.

On 13 February 2012 05:12, Stephen J. Turnbull <stephen@xemacs.org> wrote:
And yet, after your earlier posting on latin-1, and your comments here, I'm less certain. Thank you so much :-) Seriously, I find these discussions about Unicode immensely useful. I now have a much better feel for how to deal with (and think about) text in "unknown but mostly ASCII" format, which can only be a good thing.
Probably the key question is, how do we encapsulate this debate in a simple form suitable for people to find out about *without* feeling like they "have to learn all about Unicode"? A note in the Unicode HOWTO seems worthwhile, but how to get people to look there? Given that this is people who don't want to delve too deeply into Unicode issues. Just to be clear, my reluctance to "do the right thing" was *not* because I didn't want to understand Unicode - far from it, I'm interested in, and inclined towards, "doing Unicode right". The problem is that I know enough to realise that "proper" handling of files where I don't know the encoding, and it seems to be inconsistent sometimes (both between files, and even on occasion within a file), is a seriously hard issue. And I don't want to get into really hard Unicode issues for what, in practical terms, is a simple problem as it's one-off code and minor corruption isn't really an issue. Paul.

+1 for the URL in the exception. Well, in all exceptions. Bringing the language into the 21st century. Great entry points for learning about the language. Whilst Google provides an excellent service in finding documentation, it seems that a programming language has other methods of defining entry points for learning, being a complex but (mostly) deterministic thing. So, exceptions with URLs. The URLs point to "knowledge base wiki" sorts of things where the "what is your intent/use case" question can be matched up with the deterministic state we know the interpreter is in. With something like encodings, which can be happily ignored by someone until poof, suddenly they just have mush, finding out things like "it's possible that printing the string to the screen is giving the error", "there are libraries which guess encodings" and "latin-1 is a magic bullet" can take many, many days of searching. Also, it may be possible, from this perspective, to show ways that the developer can gather more deterministic information about his interpreter's state to narrow down his intent for the knowledge base (e.g. if it's a print statement that throws the error, it's possible the program doesn't have any encoding issues, except in debugging statements). The encoding issue here is a great example of this because of the complexity and mobility of encodings (i.e. they've changed a lot). There must be other good examples which can fire up equally strong and informative discussions on "options" and their limitations and benefits. I'd be very interested in formalising the idea of a "knowledge base wiki thing"; maybe there already is one...

On Feb 12, 2012, at 10:50 PM, Christopher Reay wrote:
That's not a bad idea. We might want to use some kind of URL shortener for length and future proofing though. If the site changes, we can have redirection of the short URLs updated. Something like http://pyth.on/e1234 --> http://docs.python.org/library/exceptions.html

On Mon, Feb 13, 2012 at 11:19 AM, Carl M. Johnson < cmjohnson.mailinglist@gmail.com> wrote:
I think we can use wiki.python.org/ for hosting exception specific content. E.g. http://wiki.python.org/moin/PrintFails needs a lot of love and care. Microsoft actually has documentation for every single compiler and linker error that ever existed. Not that we have the same amount of resources at our disposal, but it is a nice concept. Concerning the shortened url's - I'd go with trustworthiness over compactness - http://python.org/s1 or http://python.org/s/1 would be better than http://pyth.on/1 imo. Yuval

Entry points:

Google: natural language user searches based on "intent of code"
Module name/function names: user wants more details on something he already knows exists
Exception name: great, finds you the exception definition just like any other class name.

Googling for "UnicodeEncodeError Python" gives me a link to the 2.7 documentation which says at the top "this is not yet updated for python 3" - I don't know how important this is. Googling for "UnicodeEncodeError Python 3" gives http://docs.python.org/release/3.0.1/howto/unicode.html This is a great document. It explains encoding very well. The unicode tutorial doesn't mention anything about the terminal output encoding to STDOUT, and whilst this is obvious after a while, it is not always clear that printing to the terminal is the cause of the attempt to encode as ascii during a print statement. To some extent, the unicode tutorial doesn't have the practical specifics that are being discussed in this thread, which is targeted at the "learning curve into Python".

I think the most important points here are:
- The exception knows what version of Python it's from (which allows the language to make changes)
- It would be nice to have a wiki-type document targeted by the exception/error, with sections like:
  - "Python Official Docs"
  - "Murgh, fix this NOW, don't care how dirty"
  - Contributed docs we have known and loved / Stack Overflow etc...
  - Discussions from python-dev / python-ideas
  - PEPs that apply

The point is that Google can't be responsible for making sure all these sections are laid out, obvious, correct or constant.

Masklinn writes:
That's the coding pedant's way to look at it. However, people who speak only ASCII or Latin 1 are in general not going to see it that way. The ASCII speakers are a pretty clear-cut case. Using 'latin-1' as the codec, almost all things they can do with a 100% ASCII program and a sanely-encoded text (which leaves out Shift JIS, Big 5, and maybe some obsolete Vietnamese encodings, but not much else AFAIK) will pass through the non-ASCII verbatim, or delete it. Latin 1 speakers are harder, because they might do things like convert accented characters to their base, which would break multibyte characters in Asian languages. Still, one suspects that they mostly won't care terribly much about that (if they did, they'd be interested in using Unicode properly, and it would be worth investing the small amount of time required to learn a couple of recipes).
No, decoding with 'latin-1' is a far better approach for this purpose. If the name bothers you, give it an alias like 'semi-arbitrary-ascii-compatible'. The problem is that for many operations, b'*' and 'trailing text' are incompatible. Try concatenating them, or testing one against the other with .startswith(), or whatever. Such literals are buried in many modules, and you will lose if you're using bytes because those modules generally assume you're working with str.

On Mon, Feb 13, 2012 at 3:04 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'd hazard a guess that the non-ASCII compatible encoding most likely to be encountered outside Asia is UTF-16. The choice is really between "never give me UnicodeErrors, but feel free to silently corrupt the data stream if I do the wrong thing with that data" (i.e. "latin-1") and "correctly handle any ASCII compatible encoding, but still throw UnicodeEncodeError if I'm about to emit corrupted data" ("ascii+surrogateescape"). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
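A small illustration of that trade-off (the byte string is made up; '\xe9' is 'é' in latin-1 but an invalid start of a UTF-8 sequence):

    data = b"caf\xe9 special"   # bytes in some unknown ASCII-compatible encoding

    # Option 1: latin-1 -- never raises, but will happily emit corrupted text.
    text = data.decode("latin-1")
    print(text.encode("utf-8"))   # silently writes b'caf\xc3\xa9 special'

    # Option 2: ascii + surrogateescape -- round-trips cleanly, but complains
    # as soon as the data would be emitted in a real encoding.
    text = data.decode("ascii", errors="surrogateescape")
    assert text.encode("ascii", errors="surrogateescape") == data
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        print("refused to emit possibly corrupted data:", exc)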

Nick Coghlan writes:
I'd hazard a guess that the non-ASCII compatible encoding most likely to be encountered outside Asia is UTF-16.
In other words, only people who insist on messing with application/octet-stream files (like Word ;-). They don't deserve the pain, but they're gonna feel it anyway.
Yes.
Not if I understand what ascii+surrogateescape would do correctly. Yes, you can pass through verbatim, but AFAICS you would have to work quite hard to do anything to that stream that would cause a UnicodeError in your program, even though you corrupt it. (Eg, delete half of a multibyte EUC character.) The question is what happens if you run into a validating processor internally -- then you'll see an error (even though you're just passing it through verbatim!)

On Tue, Feb 14, 2012 at 6:02 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
If you're only round-tripping (i.e. writing back out as "ascii+surrogateescape") it's very hard to corrupt your data stream with processing that assumes an ASCII compatible encoding (as you point out, you'd have to be splitting on arbitrary codepoints instead of searching for ASCII first). However, it's trivial to get an error when you go to encode the data stream without one of the silencing error handlers set. In particular, sys.stdout has error handling set to strict, which I believe is likely to throw UnicodeEncodeError if you try to feed a string containing surrogate escaped bytes to an encoding that can't handle them. (Of course, if sys.stdout.encoding is "UTF-8", then you're right, those characters will just be displayed as gibberish, as they would in the latin-1 case. I guess it's only on Windows and in other locations with a more restrictive default stdout encoding that errors are particularly likely). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Feb 13, 2012, at 10:45 PM, Nick Coghlan wrote:
I don't think that's right. I think that by default Python refuses to turn surrogate characters into UTF-8:
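For instance (an illustrative interpreter session; strict is the default error handler for str.encode):

    >>> '\udce9'.encode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed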
OK, so concrete proposals: update the docs and maybe make a synonym for Latin-1 that makes it more semantically obvious that you're not really using it as Latin-1, just as an easy pass-through encoding. Anything else? Any bike shedding on the synonym? -- Carl Johnson

On Tue, Feb 14, 2012 at 6:39 AM, Carl M. Johnson <cmjohnson.mailinglist@gmail.com> wrote:
encoding="ascii-ish"     # gets the sloppiness right
encoding="passthrough"   # I would like "ignore", if it wouldn't cause confusion with the error handler
encoding="binpass"
encoding="rawbytes"
-jJ

MRAB wrote:
"Ignore" won't do. Ignore what? Everything? Don't actually run an encoder? That doesn't even make sense! "Passthrough" is bad too, because it perpetrates the idea that ASCII characters are "plain text" which are bytes. Unicode strings, even those that are purely ASCII, are not strings of bytes (except in the sense that every data structure is a string of bytes). You can't just "pass bytes through" to turn them into Unicode.
You have a smiley, but I think that's the best name I've seen yet. It's explicit in what you get -- mojibake. The only downside is that it's a little obscure. Not everyone knows what mojibake is called, or calls it mojibake, although I suppose we could add aliases to other terms such as Buchstabensalat and Krähenfüße if German users complain <wink>.

But remind me again, why are we doing this? If you have to teach people the recipe

    open(filename, encoding='mojibake')

why not just teach them the very slightly more complex recipe

    open(filename, encoding='ascii', errors='surrogateescape')

which captures the user's intent ("I want ASCII, with some way of escaping errors so I don't have to deal with them") much more accurately. Sometimes brevity is *not* a virtue. -- Steven

Steven D'Aprano writes:
MRAB wrote:
encoding="ascii-ish" # gets the sloppiness right
+0.8 I'd prefer the more precise "ascii-compatible". Shift JIS is "ASCII-ish", but should not be decoded with this codec.
Explicit, but incorrect. Mojibake ("bake" means "change") is what you get when you use one encoding to encode characters, and another to decode them. Here, not only are we talking about using the same codec at both ends, but in fact it's inside out (we are decoding then encoding). This is GIGO, not mojibake.
Why not? Because 'surrogateescape' does not express the user's intent. That user *will* have to deal with errors as soon as she invokes modules that validate their input, or include some portion of the text being treated in output of any kind, unless they use an error-suppressing handler themselves. Surrogates are errors in Unicode, and that's the way it should be. That's precisely why Martin felt it necessary to use this technique in PEP 383: to ensure that errors *will* occur unless you are very careful in handling strings produced with the surrogateescape handler active. It's arguable that most applications *should* want errors in these cases; I've made that argument myself. But it's quite clearly not the user's intent.

On Wed, Feb 15, 2012 at 12:43 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
However, from a correctness point of view, it's a big step up from just saying "latin-1" (which effectively turns off *all* of the additional encoding related sanity checking Python 3 offers over Python 2). For many "I don't care about Unicode" use cases, using "ascii+surrogateescape" for your own I/O and setting "backslashreplace" on sys.stdout should cover you (and any exceptions you get will be warning you about cases where your original assumptions about not caring about Unicode validity have been proven wrong). If the logging module doesn't do it already, it should probably be defaulting to backslashreplace when encoding messages, too (for the same reason sys.stderr already defaults to that - you don't want your error reporting system failing to encode corrupted Unicode data). sys.stdin and sys.stdout are different due to the role they play in pipeline processing - for those, locale.getpreferredencoding()+"strict" is a more reasonable default (but we should make it easy to replace them with something more specific for a given application, hence http://bugs.python.org/issue14017) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
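The sys.stdout replacement being described looks roughly like this (a sketch; making this kind of replacement easier is what issue 14017 is about):

    import io
    import sys

    # Re-wrap stdout so that printing strings containing surrogate-escaped
    # bytes (or any other unencodable character) degrades to backslash escapes
    # instead of raising UnicodeEncodeError.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                                  encoding=sys.stdout.encoding,
                                  errors="backslashreplace",
                                  line_buffering=True)

    print("caf\udce9")   # prints caf\udce9 rather than dying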

Nick Coghlan writes:
Are you saying you know more than the user about her application?
If the logging module doesn't do it already, it should probably be defaulting to backslashreplace when encoding messages, too
See, *you* don't know whether it will raise, either, and that for an important stdlib module. Why should somebody who is not already a Unicode geek and is just using a module they've downloaded off of PyPI be required to audit its IO foibles? Really, I think use of 'latin1' in this context is covered by "consenting adults." We *should* provide an alias that says "all we know about this string is that the ASCII codes represent ASCII characters," and document that even if your own code is ASCII compatible (ie, treats runs of non-ASCII as opaque, atomic blobs), third party modules may corrupt the text. And use the word "corrupt"; all UnicodelyRightThinking folks will run away screaming. That statement about corrupting text is true in Python 2, and pre-PEP-393 Python 3, anyway (on Windows and UCS-2 builds elsewhere), you know, since they can silently slice a surrogate pair in half.

If I can I would like to offer one argument for surrogateescape over latin-1 as the newbie approach. Suppose I am naively processing text files to create a webpage and one of my filters is a "smart quotes" filter to change "" to “”. Of course, there's no way to smarten quotes up if you don't know the encoding of your input or output files; you'll just make a mess. In this situation, Latin-1 lets you mojibake it up. If your input turns out not to have been Latin-1, the final result will be corrupted by the quote smartener. On the other hand, if you use encoding="ascii", errors="surrogateescape" Python will complain, because the smart quotes being added aren't ascii. In other words, the surrogate escape forces naive users to stick to ASCII unless they can determine what encoding they want to use for their input/output. It's not perfect, but I think it strikes a better balance than letting the users shoot themselves in the foot.
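A sketch of that argument in code (the byte string and the naive filter are made up for illustration):

    data = b'He said "ol\xe9" loudly.'   # bytes in an unknown ASCII-compatible encoding

    def smarten(text):
        # naive "smart quotes" filter: straight double quotes -> curly quotes
        return text.replace('"', '\u201c', 1).replace('"', '\u201d', 1)

    # latin-1 route: no errors anywhere, but if the input was really UTF-8 or
    # cp1252 the re-encoded output is quietly corrupted (mojibake).
    out = smarten(data.decode("latin-1")).encode("utf-8")

    # ascii+surrogateescape route: the added curly quotes are neither ASCII nor
    # escaped surrogates, so writing back out fails loudly instead.
    try:
        smarten(data.decode("ascii", errors="surrogateescape")).encode(
            "ascii", errors="surrogateescape")
    except UnicodeEncodeError as exc:
        print("refused:", exc)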

Carl M. Johnson writes:
If I can I would like to offer one argument for surrogateescape over latin-1 as the newbie approach.
This isn't the newbie approach. What should be recommended to newbies is to use the default (which is locale-dependent, and therefore "usually" "good enough"), and live with the risk of occasional exceptions. If they get exceptions, or must avoid exceptions, learn about encodings or consult with someone who already knows.[1] *Neither* of the approaches discussed here is reliable for tasks like automatically processing email or uploaded files on the web, and neither should be recommended to people who aren't already used to encoding-agnostic processing in the Python 2 "str" style. So, now that you mention "newbies", I don't know what other people are discussing, but what I've been discussing here is an approach for people who are comfortable working around (or never experience!) the defects of Python 2's ASCII-compatible approach to handling varied encodings in a single program, and want a workalike for Python 3. The choice between the two is task-dependent. The encoding='latin1' method is for tasks where a little mojibake can be tolerated, but an exception would stop the show. The errors='surrogateescape' method is for tasks where any mojibake at all is a disaster, but occasional exceptions can be handled as they arise. Footnotes: [1] When this damned term is over in a few weeks, I'll take a look at the tutorial-level docs and see if I can come up with a gentle approach for those who are finding out for the first time that the locale-dependent default isn't good enough for them.

On Feb 15, 2012, at 04:46 PM, Stephen J. Turnbull wrote:
I really hope you do this, but note that it would be very helpful to have guidelines and recommendations even for advanced, knowledgeable Python developers. I have participated in many discussions in various forums with other Python developers where genuine differences of opinion or experience lead to different solutions. It would be very helpful to point to a document and say "here are the best practices for your [application|library] as recommended by core Python experts in Unicode handling." Cheers, -Barry

Barry Warsaw writes:
I'll see what I can do, but for *best practices* going beyond the level of Paul Moore's use case is difficult for the reasons elaborated elsewhere (by others as well as myself): basic Unicode handling is no harder than ASCII handling as long as everything is Unicode. So the real answer is to insist on valid Unicode for your text I/O, failing that, text labeled *as* text *with* an encoding[1], and failing that (or failing validation of the input), reject the input.[2] If that's not acceptable -- all too often it is not -- you're in a world of pain, and the solutions are going to be ad hoc. The WSGI folks will not find the solutions proposed for email acceptable, and vice versa. Something like the format Nick proposed, where the tradeoffs are described, would be useful, I guess. But the tradeoffs have to be made ad hoc. Footnotes: [1] Of course it's OK if these are implicitly labeled by requirements or defaults of a higher-level protocol. [2] This is the Unicode party line, of course. But it's really the only generally applicable advice.

On Wed, Feb 15, 2012 at 2:12 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
No, I'm merely saying that at least 3 options (latin-1, ascii+surrogateescape, chardet2) should be presented clearly to beginners and the trade-offs explained. For example:

Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: None
Approach: Specify encoding="latin-1"
Bytes/bytearray: data.decode("latin-1")
Text files: open(fname, encoding="latin-1")
Stdin replacement: sys.stdin = io.TextIOWrapper(sys.stdin.buffer, "latin-1")
Stdout replacement (pipeline): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, "latin-1", line_buffering=True)
Stdout replacement (terminal): Leave it alone

By decoding with latin-1, an application won't get *any* Unicode decoding errors, as that encoding maps byte values directly to the first 256 Unicode code points. However, any output data generated by that application *will* be corrupted if the assumption of ASCII compatibility is violated, or if implicit transcoding to any encoding other than "latin-1" occurs (e.g. when writing to sys.stdout or a log file, communicating over a network socket, or serialising the string with the json module). This is the closest Python 3 comes to emulating the permissive behaviour of Python 2's 8-bit strings (implicit interoperation with byte sequences is still disallowed).

Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: Minimal
Approach: Use encoding="ascii" and errors="surrogateescape" (or, alternatively, errors="backslashreplace" for sys.stdout)
Bytes/bytearray: data.decode("ascii", errors="surrogateescape")
Text files: open(fname, encoding="ascii", errors="surrogateescape")
Stdin replacement: sys.stdin = io.TextIOWrapper(sys.stdin.buffer, "ascii", "surrogateescape")
Stdout replacement (pipeline): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, "ascii", "surrogateescape", line_buffering=True)
Stdout replacement (terminal): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, sys.stdout.encoding, "backslashreplace", line_buffering=True)

Using "ascii+surrogateescape" instead of "latin-1" is a small initial step into the Unicode-aware world. It still lets an application process any ASCII-compatible encoding *without* having to know the exact encoding of the source data, but will complain if there is an implicit attempt to transcode the data to another encoding, or if the application inserts non-ASCII data into the strings before writing them out. Whether non-ASCII compatible encodings trigger errors or get corrupted will depend on the specifics of the encoding and how the program manipulates the data. The "backslashreplace" error handler (enabled by default for sys.stderr, optionally enabled as shown above for sys.stdout) can be useful to help ensure that printing out strings will not trigger UnicodeEncodeErrors (note: the *repr* of strings already escapes surrogates and other unprintable characters, so UnicodeEncodeErrors will occur only when encoding the string itself using the "strict" error handler, or when another library performs equivalent validation on the string).

Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: High
Approach: Use binary APIs and the "chardet2" module from PyPI to detect the character encoding
Bytes/bytearray: data.decode(detected_encoding)
Text files: open(fname, encoding=detected_encoding)

The *right* way to process text in an unknown encoding is to do your best to derive the encoding from the data stream. The "chardet2" module on PyPI allows this. Refer to that module's documentation (WHERE?)
for details. With this approach, transcoding to the default sys.stdin and sys.stdout encodings should generally work (although the default restrictive character set on Windows and in some locales may cause problems). -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

+1000. Great, let's do that. Will I be repetitive if I say "can we put a link in the "UnicodeDecodeError" docstring"? At the top of that page have "FOR BEGINNERS" or "Mugh, just make this error go away, now", and this info from Nick. Also link to all the other tons and tons of stuff that exists on Unicode decoding... Chardet does nothing like the complex character set detection that any of the browsers accomplish. Also, it almost always calls "latin-1" encoded files "latin-2" and "latin-someOtherNumber", which actually doesn't work to decode the data. The browsers can translate a seemingly untouchable mush of mixed char encodings into UTF-8 (on my linux box) without hiccupping. I tried to emulate their behaviour for almost a week before I gave up. To be fair, I was at that time a char set newbie, and I guess I still am, though my scraper works properly. Christopher

It seems we once again agree violently on the principles. I think our differences here are mostly due to me giving a lot of attention to audience and presentation, and you focusing on the content of what to say. Re: spin control: Nick Coghlan writes:
Are you defining "beginner" as "Python 2 programmer experienced in a multilingual context but new to Python 3"? My point is that, by other definitions of "beginner", I don't think the tradeoffs can be usefully explained to beginners without substantial discussion of the issues involved in ASCII vs. the encoding Babel vs. Unicode. Only in extreme cases where the beginner only cares about *never* getting a Unicode error, or only cares about *never* getting mojibake, will they be able to get much out of this. Re: descriptions
Task: Process data in any ASCII compatible encoding Unicode Awareness Care Factor: None
I don't understand what "Unicode awareness" means here. The degree to which Python will raise Unicode errors? The awareness of the programmer?
As advice, I think this is mostly false. In particular, unless you do language-specific manipulations (transforming particular words and the like), the Latin-N family is going to be 6-sigma interoperable with Latin-1, and the rest of the ISO 8859 and Windows-125x family tolerably so. This is why it is so hard to root out the "Python 3 is just Unicode-me-harder by another name" meme. The most you should say here is that data *may* be corrupted and that, depending on the program, the risk *may* be non-negligible for non-Latin-1 data if you ever encounter it.
That last line would be better "attempt to validate the data, or output it without an error-suppressing handler (which may occur implicitly, in a module your program uses)."
You can be a little more precise: non-ASCII-compatible encodings will trigger errors in the same circumstances as ASCII-compatible encodings. They are also likely to be corrupted, depending on the specifics of the encoding and how the program manipulates the data. I don't know if it's worth the extra verbosity, though.
The claim of "right" isn't good advice. The *right* way to process text is to insist on knowing the encoding in advance. If you have to process text in unknown encodings, then what is "right" will vary with the application. For one thing, accurate detection is generally impossible without advice from outside. Given the inaccuracy of automatic detection, I would often prefer to fall back to a generic ASCII-compatible algorithm that omits any processing that requires identifying non-ASCII characters or inserting non-ASCII characters into the text stream, rather than risk mojibake. In other cases, all of the significant processing is done on ASCII characters, and non-ASCII is simply passed through verbatim. Then if you need to process text in assorted encodings, the 'latin1' method is not merely acceptable, it is the obvious winning strategy. And to some extent the environment:
[T]he default restrictive character set on Windows and in some locales may cause problems.
In sum, most likely naive use of chardet is most effective as a way to rule out non-ASCII-compatible encodings, which *can* be done rather accurately (Shift JIS, Big5, UTF-16, and UTF-32 all have characteristic patterns of use of non-ASCII octets).

I really like a task-oriented approach like this. +1000 for this sort of thing in the docs. On 15 February 2012 08:03, Nick Coghlan <ncoghlan@gmail.com> wrote:
Task: Process data in any ASCII compatible encoding
This is actually closest to how I think about what I'm doing, so thanks for spelling it out.
Unicode Awareness Care Factor: High
I'm not entirely sure how to interpret this - "High level of interest in getting it right" or "High amount of investment in understanding Unicode needed"? Or something else?
If this is going into the Unicode FAQ or somewhere similar, it probably needs a more complete snippet of sample code. Without having looked for and read the chardet2 documentation, do I need to read the file once in binary mode (possibly only partially) to scan it for an encoding, and then start again "for real"? That's arguably a downside to this approach.
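A minimal sketch of that two-pass approach, assuming the chardet package's detect() function is available (the Python 3 port discussed in this thread may install under a different name); the file name and the guess_encoding helper are purely illustrative:

    import chardet  # third-party package, assumed installed

    def guess_encoding(path, sample_size=64 * 1024):
        # First pass: read a binary sample and ask chardet for its best guess.
        with open(path, 'rb') as f:
            raw = f.read(sample_size)
        result = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
        return result['encoding'] or 'ascii'

    # Second pass: reopen "for real" as text, keeping surrogateescape so any
    # bytes the guessed codec cannot decode still round-trip unchanged.
    path = 'changelog.txt'
    with open(path, encoding=guess_encoding(path), errors='surrogateescape') as f:
        for line in f:
            pass  # process each line here

Yes, that does mean reading (part of) the file twice, which is the downside mentioned above.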
There is arguably another, simpler approach, which is to pick a default encoding (probably what Python gives you by default) and add a command line argument to your program (or equivalent if your program isn't a command line app) to manually specify an alternative. That's probably more complicated than the naive user wanted to deal with when they started reading this summary, but may well not sound so bad by the time they get to this point :-)
A couple of other tasks spring to mind:

Task: Process data in a file whose encoding I don't know
Unicode Understanding Needed: Medium-Low
Unicode Correctness: High
Approach: Use external tools to identify the encoding, then simply specify it when opening the file. On Unix, "file -i FILENAME" will attempt to detect the encoding, on Windows, XXX. If, and only if, this approach doesn't identify the encoding clearly, then the other options allow you to do the best you can. (Needs a better description of what tools to use, and maybe a sample Python script using chardet2 as a fallback.)

This is actually the "right way", and should be highlighted as such. By describing it this way, it's also rather clear that it's *not hard*, once you get over the idea that you don't know how to get the encoding, because it's not specified in the file. Having read through and extended Nick's analysis to this point, I'm thinking that it actually fits my use cases fine (and correct Unicode handling no longer feels like such a hard problem to me :-))

Task: Process data in a file believed to have inconsistent encodings
Unicode Understanding Needed: High
Unicode Correctness: Low
Approach: ??? Panic :-)

This is the killer, but should be extremely rare. We don't need to explain what to do here, but maybe offer a simple strategy:
1. Are you sure the file has mixed encodings? Have you checked twice?
2. If it's ASCII-compatible, can you work on a basis that you just pass the mixed-encoding bytes through unchanged? If so, use one of the other recipes Nick explained.
3. Do you care about mojibake or corruption? Can you afford not to?
4. Are you a Unicode expert, or do you know one? :-)

I think something like this would be a huge benefit for the Unicode FAQ. I haven't got the time or expertise to write it, but I wish I did. If I get some spare time, I might well have a go anyway, but I can't promise.

Paul

On 15/02/2012 6:51pm, Paul Moore wrote:
Don't recommend "file -i". I just tried it on the files in /usr/share/libtextcat/ShortTexts/. Basically, everything is identified as us-ascii, iso-8859-1 or unknown-8bit. Examples:

chinese-big5.txt:        text/plain; charset=iso-8859-1
chinese-gb2312.txt:      text/plain; charset=iso-8859-1
japanese-euc_jp.txt:     text/plain; charset=iso-8859-1
korean.txt:              text/plain; charset=iso-8859-1
arabic-windows1256.txt:  text/plain; charset=iso-8859-1
georgian.txt:            text/plain; charset=iso-8859-1
greek-iso8859-7.txt:     text/plain; charset=iso-8859-1
hebrew-iso8859_8.txt:    text/plain; charset=iso-8859-1
russian-windows1251.txt: text/plain; charset=iso-8859-1
ukrainian-koi8_r.txt:    text/plain; charset=iso-8859-1

sbt

On 15 February 2012 19:53, shibturn <shibturn@gmail.com> wrote:
Don't recommend "file -i".
Fair enough - I have no experience to comment one way or another. It was just something I'd seen mentioned in the thread. If there isn't a good standard encoding detector, maybe a small Python script using chardet2 would be the best thing to recommend... Paul.

MRAB <python@mrabarnett.plus.com> writes:
encoding="mojibake" # :-)
+1 If people want to remain wilfully ignorant of text encoding in the third millennium of our calendar, then a name like “mojibake” is clear about what they'll get, and will perhaps be publicly embarrassing enough that some proportion of programmers will decide to reduce their ignorance and use a specific encoding instead. -- \ “Science is a way of trying not to fool yourself. The first | `\ principle is that you must not fool yourself, and you are the | _o__) easiest person to fool.” —Richard P. Feynman, 1964 | Ben Finney

On Wed, Feb 15, 2012 at 11:15:36AM +1100, Ben Finney wrote:
If people want to remain wilfully ignorant of text encoding in the third millennium
This returns us to the very beginning of the thread. The original complaint was: Python 3 requires users to learn too much about unicode, more than they really need. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

Matt Joiner wrote:
The thread was reasons for a possible drop in popularity. Somehow the other reasons have been sabotaged leaving only the unicode discussion still alive.
Not so much sabotaged as ignored. Perhaps because we don't believe this alleged drop in popularity represents anything real, while the Unicode issue is a genuine problem that needs a solution. -- Steven

On 16/02/12 02:39, Oleg Broytman wrote:
I don't think it's helpful to label everyone who wants to use the techniques being discussed here as lazy or ignorant. As we've seen, there are cases where you truly *can't* know the true encoding, and at the same time it *doesn't matter*, because all you want to do is treat the unknown bytes as opaque data. To tell someone in that position that they're being lazy is both wrong and insulting. It seems to me that what surrogateescape is effectively doing is creating a new data type that consists of a mixture of ASCII characters and raw bytes, and enables you to tell which is which. Maybe there should be a real data type like this, or a flag on the unicode type. The data would be stored in the same way as a latin1-decoded string, but anything with the high bit set would be regarded as a byte instead of a character. This might make it easier to interoperate with external libraries that expect well-formed unicode. -- Greg
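A rough sketch of the kind of "which is which" information Greg describes, using what surrogateescape already provides today; split_text_and_bytes is a hypothetical helper, not an existing API:

    from itertools import groupby

    def is_smuggled_byte(ch):
        # errors='surrogateescape' represents undecodable bytes as the lone
        # surrogates U+DC80..U+DCFF, so they are easy to tell apart.
        return '\udc80' <= ch <= '\udcff'

    def split_text_and_bytes(s):
        # Split a surrogateescape-decoded str into ('text', str) and
        # ('bytes', bytes) segments.
        for smuggled, group in groupby(s, key=is_smuggled_byte):
            chunk = ''.join(group)
            if smuggled:
                yield 'bytes', chunk.encode('ascii', 'surrogateescape')
            else:
                yield 'text', chunk

    data = b'caf\xe9 et al'                      # bytes in some unknown ASCII-compatible encoding
    decoded = data.decode('ascii', 'surrogateescape')
    print(list(split_text_and_bytes(decoded)))
    # [('text', 'caf'), ('bytes', b'\xe9'), ('text', ' et al')]

A dedicated type or flag could make the distinction explicit in the object itself, but the information is already recoverable.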

On Thu, Feb 16, 2012 at 02:37:12PM +1300, Greg Ewing wrote:
In fairness, this thread was originally started with the scenario "I'm reading files which are only mostly ASCII, but I don't want to learn about Unicode" rather than "I know about Unicode, but it doesn't help me in this situation because the encoding truly is unknown". So wilful ignorance does apply, at least in the use-case the thread started with. (If it helps, think of them as too busy to learn, not too lazy.) If you already know about Unicode, then you probably don't need to be given a simple recipe to follow, because you probably already have a solution that works for you.

Which brings us back to the original use-case: "I have a file which is only mostly ASCII, and I don't care to learn about Unicode at this time to deal with it. I need a recipe I can follow that will do the right thing so I can continue to ignore the issue for a little longer."

I don't think we should either insist that these people learn Unicode or expect to be able to solve every possible problem they might find. A couple of recipes in the FAQs, and discussion of why you might prefer one to the other, should be able to cover most simple cases:

    open(filename, encoding='ascii', errors='surrogateescape')
    open(filename, encoding='latin1')

Both recipes hint at the wider world of encodings and error handlers, and hence act as a non-threatening introduction to Unicode.

-- Steven
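For concreteness, a minimal round-trip sketch of the first recipe; the file name and the TODO/DONE tweak are made-up examples:

    # Read mostly-ASCII text whose exact encoding is unknown, tweak it,
    # and write it back out without corrupting the non-ASCII bytes.
    with open('notes.txt', encoding='ascii', errors='surrogateescape') as src:
        text = src.read()

    text = text.replace('TODO', 'DONE')   # ASCII-only manipulation is safe

    with open('notes.txt', 'w', encoding='ascii', errors='surrogateescape') as dst:
        dst.write(text)

The second recipe is the same with encoding='latin1' and no errors argument; it never raises, but nothing will complain later if the result is re-encoded under a different codec.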

On 16 February 2012 04:08, Steven D'Aprano <steve@pearwood.info> wrote:
As the person who started the thread with this use case, I'd dispute that description of what I said. To restate it: "I'm reading files which are mostly ASCII but not all. I know that I should identify the encoding, and what to do if I did know the encoding, but I'm not sure how to find out reliably what the encoding is. Also, the problem doesn't really warrant investing the time needed to research means of doing so - given that I don't need to process the non-ASCII, I just want to avoid decoding errors and not corrupt the data".

I'm not lazy, I've just done a cost/benefit analysis and determined that my limited knowledge should be enough. Experience with other tools which aren't as strict as Python 3 on Unicode matters confirms that a "good enough" job does satisfy my needs. And I'm not willfully ignorant, I actually have a good feel for Unicode and the issues involved, and I certainly know what's right. I've just found that everything I've read assumes that "knowing the encoding" isn't hard - and my experience differs, so I don't know where to go for answers. Add to this the fact that I *know* I've seen supposed text files with mixed encoding content, and no-one has *ever* explained how to handle that (it's basically a damaged file, and so all the "right way to deal with Unicode" discussions ignore it) even though tools like grep and awk do a perfectly acceptable job to the level I care about.

I'm very pleased with the way this thread has gone, because it has answered all of the questions I've had about "nearly-ASCII" text files. But there's no way I'd have expected to spend this much time, and involve this many other people with more knowledge than me, just to handle my original changelog-parsing problem that I could do in awk or Python 2 in about 5 minutes. Now, I could also do it in Python 3. But then, I couldn't. Hopefully the knowledge from this thread can be captured so that other people can avoid my dilemma.

OK, so maybe I do feel somewhat insulted... Cheers, Paul.

Paul Moore wrote:
I am sorry, I spoke poorly. Apologies if you feel I misrepresented you. To be honest, this thread has been so large, and so rambling, and covering so much ground, I have no idea what the *actual* first mention of encoding related issues was. The oldest I can find was Giampaolo Rodolà on 9 Feb 2012 20:16:00 +0100:

I bet a lot of people don't want to upgrade for another reason: unicode. The impression I got is that python 3 forces the user to use and *understand* unicode and a lot of people simply don't want to deal with that.

two days before the first post from you mentioning encoding issues that I can find. Another mention of a similar use-case was by Stephen J Turnbull on 10 Feb 2012 17:41:21 +0900:

True, if one sticks to pure ASCII, there's no difference to notice, but that's just not possible for people who live outside of the U.S., or who share text with people outside of the U.S. They need currency symbols, they have friends whose names have little dots on them. Every single one of those is a backtrace waiting to happen. A backtrace on

    f = open('text-file.txt')
    for line in f:
        pass

is an imposition. That doesn't happen in 2.x (for the wrong reasons, but it's very convenient 95% of the time). This is what Victor's "locale" codec is all about. I think that's the wrong spelling for the feature, but there does need to be a way to express "don't bother me about Unicode" in most scripts for most people. We don't have a decent boilerplate for that yet.

which I *paraphrased* as "I have text files that are mostly ASCII and I don't want to deal with Unicode yadda yadda yadda". But in any case, I expressed myself poorly, and I'm sorry about that.

Regardless of who made the very first mention of the encoding problem in this thread, I think we should all be able to agree that laziness is *not* the only reason for having encoding problems. I thought I made it clear that I did not subscribe to that opinion.

-- Steven

On 16 February 2012 13:44, Steven D'Aprano <steve@pearwood.info> wrote:
Not a problem. Equally, my "I feel insulted" dig was uncalled for - it was the sort of semi-humorous comment that doesn't translate itself well in email. I think the debate here has been immensely useful, and I appreciate everyone's comments. Paul.

Paul Moore writes:
Add to this the fact that I *know* I've seen supposed text files with mixed encoding content,
Heck, I've seen *file names* with mixed encoding content.
The right way to handle such a file is ad hoc: operate on the features you can identify, and treat runs of bytes of unknown encoding as atomic blobs. In practice, there is a generic such feature that supports many applications: runs of ASCII text. Which is the intuition all the pragmatists start with -- it's correct.
OK, so maybe I do feel somewhat insulted...
I'm sorry you feel that way. (I've sided with the pragmatists in this thread, but on this issue I'm a purist at heart.)

On 16 February 2012 15:25, Stephen J. Turnbull <stephen@xemacs.org> wrote:
As I said elsewhere that was a lame attempt at a joke. My apologies. No-one has been anything but helpful in this thread, I was just reacting (a little) to the occasional characterisation I've noticed of people as "lazy" - your term "pragmatists" is much less emotive. (And it wasn't so much a personal reaction anyway, just an awareness that we need to be careful how we express things to people struggling with this) Paul.

On 2/16/2012 7:59 AM, Paul Moore wrote:
Before unicode, mixed encodings were the only way to have multi-lingual digital text (with multiple symbol systems) in one file. I presume such texts used some sort of language markup like <English>, <Hindi> (or <Sanskrit>), and <Tibetan>, along with software that understood the markup. Such files were not broken, just the pre-unicode system of different codes for each language or nation. To handle such a file, the program, whatever the language, has to understand the custom markup, segment the bytes, and handle each segment appropriately. Crazy text that switches among unknown encodings without notice is a possibly unsolvable decryption problem. Such have no guaranteed algorithms, only heuristics. -- Terry Jan Reedy

Terry Reedy writes:
Before unicode, mixed encodings were the only way to have multi-lingual digital text (with multiple symbol systems) in one file.
There is a long-accepted standard for doing this, ISO 2022. IIRC it's available online from ISO now, and if not, ECMA 35 is the same. The X Compound Text standard (I think this is documented in the ICCCM) and the Motif Compound String are profiles of ISO 2022. If that is what Paul is seeing, then the iso-2022-jp codec might be good enough to decode the files he has, depending on which version of ISO-2022-JP is implemented. If not, iconv -f ISO-2022-JP-2 (or ISO-2022-JP-3) should work (at least for GNU's iconv implementation).
They would use encoding "markup" (specifically escape sequences). Language is not enough, as all languages have had multiple encodings since the invention of ASCII (or EBCDIC, whichever came second ;-), and in many cases multilingual standards have evolved (Japanese, for example, includes Greek and Cyrillic alphabets in its JIS standard coded character set). More recently, many languages have several ISO 2022-based encodings (the ISO 8859 family is a conformant profile of ISO 2022, as are the EUC encodings for Asian languages; the Windows 125x code pages are non-conformant extensions of ASCII based on ISO 8859).
Crazy text that switches among unknown encodings without notice is a possibly unsolvable decryption problem.
True, and occasionally seen even today in Japan (cat(1) will produce such files easily, and any system for including files).

Greg Ewing writes:
Maybe there should be a real data type [parallel to str and bytes that mixes str and bytes], or a flag on the unicode type.
-1. This is yesterday's problem. It still hurts today; we need workarounds. But it's going to be less and less important as time goes on, because nobody can afford one-locale software anymore, and the cheapest way to be multilocale is to process in Unicode, and insist on Unicode on input and output. The unknown encoding problem is not one with a generally acceptable solution. That's why Unicode was invented. To "solve" the problem by ensuring it doesn't occur in the first place.

Greg Ewing wrote:
How so? Sounds like this new data type assumes everything over 127 is a raw byte, but there are plenty of applications where values between 0 and 127 should be interpreted as raw bytes even when the majority are indeed just plain ascii.
I can see a data type that is easier to work with than bytes (ascii-string, anybody? ;) but I don't think we want to make it any kind of unicode -- once the text has been extracted from this ascii-string it should be converted to unicode for further processing, while any other non-convertible bytes should stay as bytes (or ascii-string, or whatever we call it). The above is not arguing with the 'latin-1' nor 'surrogateescape' techniques, but only commenting on a different data type with probably different uses. ~Ethan~

Ethan Furman writes:
But there really aren't any uses that aren't equally well dealt with by 'surrogateescape' that I can see. You have to process it code unit by code unit (just like surrogateescape) and if you find a non-character code unit, you then have an ad hoc decision to make about what to do with it. surrogateescape makes one particular treatment blazingly efficient (namely, turning the surrogate back into a byte with no known meaning). What other treatment of a byte of by-definition unknown semantics deserves the blazing efficiency that a new (presumably builtin) type could give?

Stephen J. Turnbull wrote:
It wasn't the 'unknown semantics' that I was responding to (latin-1 and surrogateescape deal with that just fine), but rather a new data type with a mixture of valid unicode (0-127) and raw bytes (128-255) -- I don't think that would be common enough to justify, and I can see confusion again creeping in when somebody (like myself ;) sees a datatype which seemingly supports a mixture of unicode and raw bytes only to find out that 'uni_raw(...)[5] != 32' because a u' ' was returned and an integer (or raw byte) was expected at that location. ~Ethan~

On 2/14/2012 6:39 AM, Carl M. Johnson wrote:
While this is a Py3 str object, it is not unicode. Unicode only allows proper surrogate codeunit pairs. Py2 allowed malformed unicode objects and that was not changed in Py3 -- or 3.3. It seems appropriate that bytes that are meaningless to ascii should be translated to codeunits that are meaningless (by themselves) to unicode.
utf-8 only encodes proper unicode.
_.encode("utf-8", errors="surrogateescape") b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
The result is not utf-8, and it would be better to use 'ascii' rather than 'utf-8' in the expression. The above encodes to ascii + uninterpreted high-bit-set bytes.
-- Terry Jan Reedy

On Tue, Feb 14, 2012 at 9:39 PM, Carl M. Johnson <cmjohnson.mailinglist@gmail.com> wrote:
Oops, that's what I get for posting without testing :) Still, your example clearly illustrates the point I was trying to make - that using "ascii+surrogateescape" is less likely to silently corrupt the data stream than using "latin-1", because attempts to encode it under the "strict" error handler will generally fail, even for an otherwise universal encoding like UTF-8.
OK, so concrete proposals: update the docs and maybe make a synonym for Latin-1 that makes it more semantically obvious that you're not really using it as Latin-1, just as an easy pass-through encoding. Anything else? Any bike shedding on the synonym?
I don't see any reason to obfuscate the use of "latin-1" as a workaround that maps 8-bit bytes directly to the corresponding Unicode code points. My proposal would be two-fold:

Firstly, that we document three alternatives for working with arbitrary ASCII compatible encodings (from simplest to most flexible):

1. Use the "latin-1" encoding

The latin-1 encoding accepts arbitrary binary data by mapping individual bytes directly to the first 256 Unicode code points. Thus, any sequence of bytes may be translated to a sequence of code points, effectively reproducing the behaviour of Python 2's 8-bit strings. If all data supplied is genuinely in an ASCII compatible encoding then this will work correctly. However, it fails badly if the supplied data is ever in an ASCII incompatible encoding, or if the decoded string is written back out using a different encoding. Using this option switches off *all* of Python 3's support for ensuring transcoding correctness - errors will frequently pass silently and result in corrupted output data rather than explicit exceptions.

2. Use the "ascii" encoding with the "surrogateescape" error handler

This is the most correct approach that doesn't involve attempting to guess the string encoding. Behaviour if given data in an ASCII incompatible encoding is still unpredictable (and likely to result in data corruption). This approach retains most of Python 3's support for ensuring transcoding correctness, while still accepting any ASCII compatible encoding. If UnicodeEncodeErrors when displaying surrogate escaped strings are not desired, sys.stdout should also be updated to use the "backslashreplace" error handler (see below).

3. Initially process the data as binary, using the "chardet" package from PyPI to guess the encoding

This is the most correct option that can even cope with many ASCII incompatible encodings. Unfortunately, the chardet site is gone, since Mark Pilgrim took down his entire web presence. This (including the dead home page link from the PyPI entry) would need to be addressed before its use could be recommended in the official documentation (or, failing that, is there a properly documented alternative package available?)

Secondly, that we make it easy to replace a TextIOWrapper with an equivalent wrapper that has only selected settings changed (e.g. encoding or errors). In 3.2, that is currently not possible, since the original "newline" argument is not made available as a public attribute. The closest we can get is to force universal newlines mode along with whatever other changes we want to make:

    old = sys.stdout
    sys.stdout = io.TextIOWrapper(old.buffer, old.encoding,
                                  "backslashreplace", None,
                                  old.line_buffering)

3.3 currently makes this even worse by accepting a "write_through" argument that isn't available for introspection. I propose that we make it possible to write the above as:

    sys.stdout = sys.stdout.rewrap(errors="backslashreplace")

For the latter point, see http://bugs.python.org/issue14017

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
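A small sketch of what option 2 plus the stdout tweak looks like in practice (using the 3.2 workaround shown above; the file name is hypothetical):

    import io
    import sys

    # Replace sys.stdout with an equivalent wrapper that backslash-escapes
    # anything the console encoding cannot represent, including the lone
    # surrogates produced by errors='surrogateescape'.
    old = sys.stdout
    sys.stdout = io.TextIOWrapper(old.buffer, old.encoding, "backslashreplace",
                                  None, old.line_buffering)

    with open('data.txt', encoding='ascii', errors='surrogateescape') as f:
        for line in f:
            print(line, end='')   # undecodable bytes print as \udcXX instead of raising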

Nick Coghlan writes:
If you're only round-tripping (i.e. writing back out as "ascii+surrogateescape")
This is the only case that makes sense in this thread. We're talking about people coming from Python 2 who want an encoding-agnostic way to script ASCII-oriented operations for an ASCII-compatible environment, and not to learn about encodings at all. While my opinions on this are (probably obviously) informed by the WSGI discussion, this is not about making life come up roses for the WSGI folks. They work in a sewer; life stinks for them, and all they can do about it is to hold their noses. This thread is about people who are not trying to handle sewage in a sanitary fashion, rather just cook a meal and ignore the occasional hairs that inevitably fall in.
However, it's trivial to get an error when you go to encode the data stream without one of the silencing error handlers set.
Sure, but getting errors is for people who want to learn how to do it right, not for people who just need to get a job done. Cf. the fevered opposition to giving "import cElementTree" a DeprecationWarning.
No, it should *always* throw a UnicodeEncodeError, because there are *no* encodings that can handle them -- they're not characters, so they can't be encoded.
(Of course, if sys.stdout.encoding is "UTF-8", then you're right, those characters will just be displayed as gibberish,
No, they will raise UnicodeEncodeError; that's why surrogateescape was invented, to work around the problem of what to do with bytes that the programmer knows are meaningful to somebody, but do not represent characters as far as Python can know:

wideload:~ 10:06$ python3.2
Python 3.2 (r32:88445, Mar 20 2011, 01:56:57)
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
position 0: surrogates not allowed
The reason I advocate 'latin-1' (preferably under an appropriate alias) is that you simply can't be sure that those surrogates won't be passed to some module that decides to emit information about them somewhere (eg, a warning or logging) -- without the protection of a "silencing error handler". Bang-bang! Python's silver hammer comes down upon your head!
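A self-contained illustration of the behaviour Stephen describes; the lone surrogate used here (U+DCE9, what surrogateescape produces for the byte 0xE9) is just an example:

    s = '\udce9'                      # a lone surrogate smuggled in by surrogateescape

    try:
        s.encode('utf-8')             # the strict handler always rejects lone surrogates
    except UnicodeEncodeError as exc:
        print(exc)                    # "... surrogates not allowed"

    print(s.encode('ascii', 'surrogateescape'))   # b'\xe9' -- round-trips the original byte
    print(s.encode('utf-8', 'backslashreplace'))  # b'\\udce9' -- safe for display or logging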

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
[…]
You have made me feel strange emotions with this message. I don't know what they are, but a combination of “sickened” and “admiring” and “nostalgia”, with a pinch of fear, seems close. Maybe this is what it's like to read poetry. -- \ “[Entrenched media corporations will] maintain the status quo, | `\ or die trying. Either is better than actually WORKING for a | _o__) living.” —ringsnake.livejournal.com, 2007-11-12 | Ben Finney

On 2/11/2012 7:53 AM, Stefan Behnel wrote:
If one has ascii text + unspecified 'other stuff', one can either process as 'polluted text' or as 'bytes with some ascii character codes'. Since (as I just found out) one can iterate binary mode files by line just as with text mode, I am not sure what the tradeoffs are. I would guess it is mostly whether one wants to process a sequence of characters or a sequence of character codes (ints). -- Terry Jan Reedy

On 11 February 2012 12:41, Masklinn <masklinn@masklinn.net> wrote:
To be honest, I'm fine with the answer "use latin1" for this case. Practicality beats purity and all that. But as you say, it feels wrong somehow. I suspect that errors=surrogateescape is the better "I don't really care" option. And I still maintain it would be useful for combating FUD if there was a commonly-accepted idiom for this. Interestingly, on my Windows PC, if I open a file using no encoding in Python 3, I seem to get code page 1252: Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information.
So actually, on this PC, I can't really provoke these sorts of decoding error problems (CP1252 accepts all bytes, it's basically latin1). Whether this is a good thing or a bad thing, I'm not sure :-) Paul

Masklinn writes:
So give latin-1 an additional name. Emacsen use "raw-text" (there's also binary, but raw-text will do a loose equivalent of universal newlines for you, binary doesn't). You could also use a name more exact and less English-biased like "ascii-compatible-bytes". Same codec, name denotes different semantics.

On 2/11/2012 5:47 AM, Paul Moore wrote:
Good example. I believe adding ", encoding='latin-1'" to open() is sufficient. (And from your response elsewhere to Stephen, you seem to know that.) This should be in the tutorial if not already. But in reference to what I wrote above, knowing that magic phrase is not 'knowledge of unicode'. And I include it in the 'not much more knowledge' needed for Python 3. -- Terry Jan Reedy

On 2/11/2012 12:00 PM, Masklinn wrote:
When I wrote that response, I thought that 'for line in f' would not work for binary-mode files. I then opened IDLE, experimented with 'rb', and discovered otherwise. So the remaining issue is how one wants the unknown encoding bytes to appear when printed -- as hex escapes, or as arbitrary but more readable non-ascii latin-1 chars. -- Terry Jan Reedy

On 11 February 2012 17:00, Masklinn <masklinn@masklinn.net> wrote:
In my view, that's less scalable to more complex cases. It's likely you'll hit things you need to do that don't translate easily to bytes sooner than if you stick to a string-only world. A simple example: check for a regex rather than a simple starting character. The problem I have with encoding="latin-1" is that in many cases I *know* that's a lie. From what's been said in this discussion so far, I think that the "better" way to say "I know this file contains mostly ASCII, but there are some other bits I'm not sure about but don't care too much about, as long as they round-trip cleanly" is encoding="ascii", errors="surrogateescape". But as we've seen here, that's not the idiom that gets recommended by everyone (the "One Obvious Way", if you like). I suspect that if the community did embrace a "one obvious way", that would reduce the "Python 3 makes me need to know Unicode" FUD that's around. But as long as people get 3 different answers when they ask the question, there's going to be uncertainty and doubt (and hence, probably, fear...) Paul. PS I'm pretty confident that I have *my* answer now (ascii/surrogateescape). So this thread was of benefit to me, if nothing else, and my thanks for that.

Masklinn writes:
Why not open the file in binary mode in stead? (and replace `'*'` by `b'*'` in the startswith call)
This will often work, but it's task-dependent. In particular, I believe not just `.startswith()`, but general regexps work with either bytes or str in Python 3. But other APIs may not, and you're going to need to prefix *all* literals (including those in modules your code imports!) with `b`. So you import a module that does exactly what you want, only to be stymied by a TypeError because the module wants Unicode. This would not happen with Python 2, and there's the rub.
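A quick illustration of that type split (the pattern and the data must be the same type; the sample line is made up):

    import re

    line_bytes = b'* fixed the frobnicator'
    line_text = line_bytes.decode('ascii', 'surrogateescape')

    print(line_bytes.startswith(b'*'))            # True -- bytes APIs want bytes literals
    print(re.match(br'\*\s+(\w+)', line_bytes))   # works: bytes pattern, bytes data
    print(re.match(r'\*\s+(\w+)', line_text))     # works: str pattern, str data

    # re.match(r'\*', line_bytes) would raise TypeError: can't use a string
    # pattern on a bytes-like object.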

On Mon, Feb 13, 2012 at 2:50 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
The other trap is APIs like urllib.parse which explicitly refuse the temptation to guess when it comes to bytes data and decode it as "ascii+strict". If you want something more permissive (e.g. "latin-1" or "ascii+surrogateescape") then you *have* to decode the data to Unicode yourself before handing it over.

Really, Python 3 forces programmers to learn enough about Unicode to be able to make the choice between the 4 possible options for processing ASCII-compatible encodings:

1. Process them as binary data. This is often *not* going to be what you want, since many text processing APIs will either only accept Unicode, or only pure ASCII, or require you to supply encoding+errors if you want them to process binary data.

2. Process them as "latin-1". This is the answer that completely bypasses all Unicode integrity checks. If you get fed non-ASCII data, you *will* silently produce gibberish as output.

3. Process them as "ascii+surrogateescape". This is the *right* answer if you plan solely to manipulate the text and then write it back out in the same encoding as was originally received. You will get errors if you try to write a string with escaped characters out to a non-ascii channel or an ascii channel without surrogateescape enabled. To write such strings to non-ascii channels (e.g. sys.stdout), you need to remember to use something like "ascii+replace" to mask out the values with unknown encoding first. You may still get hard to debug UnicodeEncodeError exceptions when handed data in a non-ASCII compatible encoding (like UTF-16 or UTF-32), but your odds of silently corrupting data are fairly low.

4. Get a third party encoding guessing library and use that instead of waving away the problem of ASCII-incompatible encodings.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

2012/2/11 Paul Moore <p.f.moore@gmail.com>
I just looked at the Python 3 documentation (http://docs.python.org/release/3.1.3/library/functions.html#open); there is an "errors" parameter to the open function. When set to "ignore" or "replace" it will solve your problem. Another way is to try to guess the encoding programmatically (I found the chardet module, http://pypi.python.org/pypi/chardet) and pass it in to decode your file with an unknown encoding. Then why not make a value "auto" available for the "encoding" parameter, which makes "open" call a detector before opening and throw an error when the guess is below a certain confidence? Gabriel AHTUNE

Massimo Di Pierro wrote:
Is that a commentary on Python, or the average undergrad student?
Python has a compiler. The "c" in .pyc files stands for "compiled" and Python has a built-in function called "compile". It just happens to compile to byte code that runs on a virtual machine, not machine code running on physical hardware. PyPy takes it even further, with a JIT compiler that operates on the byte code.
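For instance, a tiny illustration of that compilation step (output details vary between CPython versions):

    import dis

    code = compile('x = a + 1', '<example>', 'exec')   # source text -> code object
    dis.dis(code)   # disassemble the byte code the virtual machine will run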
How is that relevant to a language being taught to undergrads? Sounds more like an excuse to justify dislike of teaching Python rather than an actual reason to dislike Python.
- The programming language purists complain about the use of reference counting instead of garbage collection
The programming language purists should know better than that. The choice of which garbage collection implementation (ref counting is garbage collection) is a quality of implementation detail, not a language feature. -- Steven

On 2012-02-09, at 19:03 , Steven D'Aprano wrote:
The choice of which garbage collection implementation (ref counting is garbage collection) is a quality of implementation detail, not a language feature.
That's debatable, it's an implementation detail with very different semantics which tends to leak out into usage patterns of the language (as it did with CPython, which basically did not get fixed in the community until Pypy started ascending), especially when the language does not provide "better" ways to handle things (as Python finally did by adding context managers in 2.5). So theoretically, automatic refcounting is a detail, but practically it influences language usage differently than most other GC techniques (when it's the only GC strategy in the language anyway)

On Thu, Feb 9, 2012 at 10:14 AM, Masklinn <masklinn@masklinn.net> wrote:
I think it was actually Jython that first sensitized the community to this issue.
Are there still Python idioms/patterns/recipes around that depend on refcounting? (There also used to be some well-known anti-patterns that were only bad because of the refcounting, mostly around saving exceptions. But those should all have melted away -- CPython has had auxiliary GC for over a decade.) -- --Guido van Rossum (python.org/~guido)

On 2012-02-09, at 19:26 , Guido van Rossum wrote:
The first one was Jython yes, of course, but I did not see the "movement" gain much prominence before Pypy started looking like a serious CPython alternative; before that there were a few voices lost in the desert.
There shouldn't be, but I'm not going to rule out reliance on automatic resource cleanup just yet, I'm sure there are still significant pieces of code using those in the wild.

On Thu, Feb 9, 2012 at 10:37 AM, Masklinn <masklinn@masklinn.net> wrote:
I guess everyone has a different perspective.
There shouldn't be, but I'm not going to rule out reliance on automatic
resource cleanup just yet, I'm sure there are still significant pieces of code using those in the wild.
I am guessing in part that's a function of resistance to change, and in part it means PyPy hasn't gotten enough mindshare yet. (Raise your hand if you have PyPy installed on one of your systems. Raise your hand if you use it. Raise your hand if you are a PyPy contributor. :-) Anyway, the refcounting objection seems the least important one. The more important trolls to fight are "static typing is always better" and "the GIL makes Python multicore-unfriendly". TBH, I see some movement in the static typing discussion, evidence that the static typing zealots are considering a hybrid approach (e.g. C# dynamic, and the optional static type checks in Dart). -- --Guido van Rossum (python.org/~guido)

On 2012-02-09, at 19:44 , Guido van Rossum wrote:
These seem to be efforts of people trying for both sides (for various reasons) more than people firmly rooted in one camp or another. Dart was widely panned for its wonky approach to "static typing", which is generally considered a joke amongst people looking for actual static types (in that they're about as useful as Python 3's type annotations).

On Thu, Feb 09, 2012 at 10:44:42AM -0800, Guido van Rossum wrote:
I don't know if you actually want replies, but I'll bite. I have pypy installed (from the standard Fedora pypy package), and for a particular project it provided a 20x speedup. I'm not a PyPy contributor, but I'm a believer. I would use PyPy everywhere if it worked with Python 3 and scipy. My apologies if this was just a rhetorical question. :) -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868

On 10 February 2012 06:06, Guido van Rossum <guido@python.org> wrote:
In that case ... - I have various versions of PyPy installed (regularly pull the latest working Windows build); - I use it occasionally, but most of my Python work ATM is Google App Engine-based, and the GAE SDK doesn't work with PyPy; - I'm not a PyPy contributor, but am also a believer - I definitely think that PyPy is the future and should be the base for Python4K. - I won't be at PyCon. Cheers, Tim Delaney

Andrew McNabb, 09.02.2012 19:58:
AFAIK, there is no concrete roadmap towards supporting SciPy on top of PyPy. PyPy is getting its own implementation of NumPy-like arrays, but there is currently no interaction with anything in the SciPy world outside of those. Given the sheer size of SciPy, reimplementing it on top of numpypy is unrealistic. That being said, it's quite possible to fire up CPython from PyPy (or vice versa) and interact with that, if you really need both PyPy and SciPy. It even seems to be supported through multiprocessing. I find that pretty cool. http://thread.gmane.org/gmane.comp.python.pypy/9159/focus=9161 Stefan

On Thu, Feb 09, 2012 at 08:53:55PM +0100, Stefan Behnel wrote:
I understand that there is some hope in getting cython to support pure python and ctypes as a backend, and then to migrate scipy to use cython. This is definitely a long-term solution. Most people don't depend on all of scipy, and for some use cases, it's not too hard to find alternatives. Today I migrated a project from scipy to the GNU Scientific Library (with ctypes). It now works great with PyPy, and I saw a total speedup of 10.6x. Dropping from 27 seconds to 2.55 seconds is huge. It's funny, but for a new project I would go to great lengths to try to use the GSL instead of scipy (though I'm sure for some use cases it wouldn't be possible).
That's a fascinating idea that I had never considered. Thanks for sharing. -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868

On 2012-02-10, at 01:03 , Guido van Rossum wrote:
I'm not sure what could be open to negotiation: being part of the GNU constellation, I don't see GSL budging from the GPL, and SciPy is backed by industry members and used in "nonfree" products (notably the Enthought Python Distribution) so there's little room for it to use the GPL. The best thing that could happen (and I'm not even sure it's allowed by the GSL's license, which is under the GPL, not the LGPL) would be for SciPy to grow some sort of GSL backend to delegate its operations to, when the GSL is installed.

On 2/10/12 8:49 AM, Masklinn wrote:
While I am an Enthought employee and really do want to keep scipy BSD so I can continue to use it in the proprietary software that I write for clients, I must also add that the most vociferous BSD advocates in our community are the academics. They have to wade through more weird licensing arrangements than I do, and the flexibility of the BSD license is quite important to let them get their jobs done.
We've done that kind of thing in the past for FFTW and other libraries but have since removed them for all of the installation and maintenance headaches it causes. In my mind (and others disagree), having scipy-the-package subsume every relevant library is not a worthwhile pursuit. The important thing is that these packages are available to the scientific Python community. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 2/10/12 10:41 AM, Masklinn wrote:
No apologies necessary. I just wanted to be thorough. :-) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

GIL + Threads = Fibers

CPython doesn't have threads but it calls its fibers "threads", which causes confusion and disappointment. The underlying implementation is not important: e.g., when you implement a "lock" using "events", does the lock become an event? No. This is a PR disaster. 100% agree we need a PR offensive, but first we need a strategy. Erlang champions the actor/message paradigm, so they dodge the threading bullet completely. What's the Python-championed parallelism paradigm? It should be on the front page of python.org and in the first paragraph of Wikipedia on Python. One of the Lua authors said this about threads:

Anyone who cares enough about performance doesn't mind that 'a = a + 1' is only as deterministic as you design it to be, with or without locks. Multiprocessing has this same problem btw.

What Python needs are better libraries for concurrent programming based on processes and coroutines.

The killer feature for threads (vs multiprocessing) is access to shared state with nearly zero overhead.

And note that a single-threaded event-driven process can serve 100,000 open sockets -- while no JVM can create 100,000 threads.

Graphics engines, simulations, games, etc. don't want 100,000 threads, they just want true threads, as many as there are CPUs.

Yuval

Pure python code running in python "threads" on CPython behaves like fibers. I'd like to point out the word "external" in your statement.
I don't believe this to be true. Fibers are not preempted. The GIL is released at regular intervals to allow the effect of preemptive switching. Many other behaviours of Python threads are still native-thread-like, particularly in their interaction with other components and the OS. GIL + Threads = Simplified, non parallel interpreter

Matt Joiner, 10.02.2012 15:48:
Absolutely. Even C extensions cannot always prevent a thread switch from happening when they need to call back into CPython's C-API.
GIL + Threads = Simplified, non parallel interpreter
Note that this also applies to PyPy, so even "interpreter" isn't enough of a generalisation. I think it's best to speak of the GIL as what it is: a lock that protects internal state of the CPython runtime (and also some external code, when used that way). Rather convenient, if you ask me. Stefan

On Sat, Feb 11, 2012 at 1:30 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Armin Rigo's series on Software Transactional Memory on the PyPy blog is also required reading for anyone seriously interested in practical shared memory concurrency that doesn't impose a horrendous maintenance burden on developers that try to use it: http://morepypy.blogspot.com.au/2011/06/global-interpreter-lock-or-how-to-ki... http://morepypy.blogspot.com.au/2011/08/we-need-software-transactional-memor... http://morepypy.blogspot.com.au/2012/01/transactional-memory-ii.html And for those that may be inclined to dismiss STM as pie-in-the-sky stuff that is never going to be practical in the "real world", the best I can offer is Intel's plans to bake an initial attempt at it into a consumer grade chip within the next couple of years: http://arstechnica.com/business/news/2012/02/transactional-memory-going-main... I do like Armin's analogy that free threading is to concurrency as malloc() and free() are to memory management :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Can we please break this thread out into multiple subject headers? It's very difficult to follow the flow of conversation with some many different discussions all lumped under one name. Some proposed subjects: - Refcounting vs. Other GC - Numpy - Windows Installers - Unicode - Python in Education - Python's Popularity

On Fri, Feb 10, 2012 at 12:43 PM, Carl M. Johnson < cmjohnson.mailinglist@gmail.com> wrote:
No, the subject is correct; we have a -3% problem in the index, so the solution is to keep this thread long with many keywords like python, pypy, jython etc... and then the % will grow! (at least @TIOBE, since it relies on google search ;) )

On 2/9/2012 1:26 PM, Guido van Rossum wrote:
Yes, it was. The first PyPy status blog in Oct 2007 http://morepypy.blogspot.com/2007/10/first-post.html long before any practical release, was a year after the 2.5 release. -- Terry Jan Reedy

On Thu, Feb 9, 2012 at 19:26, Guido van Rossum <guido@python.org> wrote:
There are some simple patterns that are great with refcounting and not so great with garbage collection. We encountered some of these with Mercurial. IIRC, the basic example is just

    open('foo').read()

With refcounting, the file will be closed soon. With garbage collection, it won't. Being able to rely on cleanup per frame/function call is pretty useful. Cheers, Dirkjan

On 10 February 2012 20:16, Dirkjan Ochtman <dirkjan@ochtman.nl> wrote:
This is the #1 anti-pattern that shouldn't be encouraged. Using this idiom is just going to cause problems (mysterious exceptions while trying to open files due to running out of file handles for the process) for anyone trying to port your code to other implementations of Python. If you read PEP 343 (and the various discussions around that time) it's clear that the above anti-pattern is one of the major driving forces for the introduction of the 'with' statement. Tim Delaney
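The portable spelling of the same one-liner, using the with statement that PEP 343 introduced:

    # The file is closed deterministically when the block exits, on CPython,
    # PyPy, Jython, etc., regardless of when (or whether) the GC runs.
    with open('foo') as f:
        data = f.read()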

On Fri, Feb 10, 2012 at 4:32 AM, Tim Delaney <timothy.c.delaney@gmail.com> wrote:
It's not that open('foo').read() is "good". Clearly with the presence of nondeterministic garbage collection, it's bad. But it is convenient and compact. Refcounting GCs in general give very nice, predictable behavior, which lets us ignore a lot of the details of destroying things. Without something like this, we have to do some forms of resource management by hand that we could otherwise push to the garbage collector, and while sometimes this is as easy as a with statement, sometimes it isn't. For example, what do you do if multiple objects are meant to hold onto a file and take turns reading it? How do we close the file at the end when all the objects are done? Is the answer "manual refcounting"? Or is the answer "I don't care, let the GC handle it"? -- Devin

On Thu, Feb 9, 2012 at 10:03 AM, Steven D'Aprano <steve@pearwood.info>wrote:
Well either way it's depressing...
Not sure how that's relevant. Massimo used "won't compile" as a shorthand for "has a syntax error". 50+% of the students have a mac and an increasing number of packages
depend on numpy. Installing numpy on mac is a lottery.
But that was the same in the 2.5 days. The problem is worse now because (a) numpy is going mainstream, and (b) Macs don't come with a C compiler any more. I think the answer will have to be in making an effort to produce robust and frequently updated downloads of numpy to match various popular Python versions and platforms. This is a major pain (packaging always is) so maybe some incentive is necessary (just like ActiveState has its Python distros).
Hm. I know a fair number of people who use Eclipse to edit Python (there's some plugin). This seems easy enough to address by just pointing people to the plugin, I don't think Python itself is to blame here. From the hard core computer scientists prospective there are usually
How is that relevant to a language being taught to undergrads? Sounds more
like an excuse to justify dislike of teaching Python rather than an actual reason to dislike Python.
I can see the discomfort if the other professors keep bringing this up. It is, sadly, a very effective troll. (Before it was widely known, the most common troll was the whitespace. People would declare it to be ridiculous without ever having tried it. Same with the GIL.) - The programming language purists complain about the use of reference
counting instead of garbage collection
The programming language purists should know better than that. The choice
of which garbage collection implementation (ref counting is garbage collection) is a quality of implementation detail, not a language feature.
Yeah, trolls are a pain. We need to start spreading more effective counter-memes. -- --Guido van Rossum (python.org/~guido)

On Feb 9, 2012, at 12:03 PM, Steven D'Aprano wrote:
I teach, so the average student is my benchmark. Please do not misunderstand. While some may be lazy, the average CS undergrad is not stupid but quite intelligent. They just do not like wasting time with setups, and I sympathize with that. Batteries included is the Python motto.
Don't shoot the messenger please. You can dismiss or address the problem. Anyway... undergrads do care because they will take 4 years to graduate and they do not want to come out with obsolete skills. Our undergrads learn Python, Ruby, Java, Javascript and C++. Many know other languages which they learn on their own (Scala and Clojure are popular). They all agree multi-core is the future and whichever language can deal with it better is the future too. As masklinn says, the difference between garbage collection and reference counting is more than an implementation issue.

On Thu, Feb 9, 2012 at 10:25 AM, Massimo Di Pierro < massimo.dipierro@gmail.com> wrote:
I'd give those students a bonus for being in touch with what's popular in academia. Point them to Haskell next. They may amount to something.
They all agree multi-core is the future and whichever language can deal with them better is the future too.
Surely not JavaScript (which is single-threaded and AFAIK also uses refcounting :-). Also, AFAIK Ruby has a GIL much like Python. I think it's time to start a PR offensive explaining why these are not the problem the trolls make them out to be, and how you simply have to use different patterns for scaling in some languages than in others. And note that a single-threaded event-driven process can serve 100,000 open sockets -- while no JVM can create 100,000 threads. As masklinn says, the difference between garbage collection and reference
counting is more than an implementation issue.
-- --Guido van Rossum (python.org/~guido)

On 2012-02-09, at 19:34 , Guido van Rossum wrote:
I don't think I've seen a serious refcounted JS implementation in the last decade, although it is possible that JS runtimes have localized usage of references and reference-counted resources. AFAIK all modern JS runtimes are JITed which probably does not mesh well with refcounting. In any case, V8 (Chrome's runtime) uses a stop-the-world generational GC for sure[0], Mozilla's SpiderMonkey uses a GC as well[1] although I'm not sure which type (the reference to JS_MarkGCThing indicates it could be or at least use a mark-and-sweep amongst its strategies), Webkit/Safari's JavaScriptCore uses a GC as well[2] and MSIE's JScript used a mark-and-sweep GC back in 2003[3] (although the DOM itself was in COM, and reference-counted).
Only because it's OS threads of course, Erlang is not evented and has no problem spawning half a million (preempted) processes if there's RAM enough to store them. [0] http://code.google.com/apis/v8/design.html#garb_coll [1] https://developer.mozilla.org/en/SpiderMonkey/1.8.5#Garbage_collection [2] Since ~2009 http://www.masonchang.com/blog/2009/3/26/nitros-garbage-collector.html [3] http://blogs.msdn.com/b/ericlippert/archive/2003/09/17/53038.aspx

On Thu, Feb 9, 2012 at 10:50 AM, Masklinn <masklinn@masklinn.net> wrote:
I stand corrected (but I am right about the single-threadedness :-).
Only because it's OS threads of course, Erlang is not evented and has no
problem spawning half a million (preempted) processes if there's RAM enough to store them.
Sure. But the people complaining about the GIL come from Java, not from Erlang. (Erlang users typically envy Python because of its superior standard library. :-)
-- --Guido van Rossum (python.org/~guido)

On 2012-02-09, at 19:54 , Guido van Rossum wrote:
I stand corrected (but I am right about the single-threadedness :-).
Absolutely (until WebWorkers anyway)
True. Then they remember how good Python is with concurrency, distribution and distributed resilience :D (don't forget syntax, one of Erlang's biggest failures) (although it pleased cfbolz since he could get syntax coloration for his prolog)

On 09.02.2012 19:50, Masklinn wrote:
And Chrome uses one *process* for each tab, right? Is there a reason Chrome does not use one thread for each tab, such as security?
Actually, spawning half a million OS threads will burn the computer. *POFF* ... and it goes up in a ball of smoke. Spawning half a million threads is the Windows equivalent of a fork bomb. I think you confuse threads and fibers/coroutines. Sturla

On Thu, Feb 09, 2012 at 08:08:36PM +0100, Sturla Molden wrote:
And Chrome uses one *process* for each tab, right? Is there a reason Chrome does not use one thread for each tab, such as security?
Safety, I dare say. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Thu, Feb 9, 2012 at 11:08 AM, Sturla Molden <sturla@molden.no> wrote:
Stability and security. If something goes wrong/rogue, the effects are reasonably isolated to the individual tab in question. And they can use OS resource/privilege limiting APIs to lock down these processes as much as possible. Cheers, Chris

On 2012-02-09, at 20:08 , Sturla Molden wrote:
I do not know the precise reasons no, but it probably has to do with security and ensuring isolation yes (webpage semantics mandate that each page gets its very own isolated javascript execution context)
No. You probably misread my comment somehow.

On Thu, Feb 9, 2012 at 2:08 PM, Sturla Molden <sturla@molden.no> wrote:
And Chrome uses one *process* for each tab, right?
Supposedly. If you click the wrench, then select Tools/Task Manager, it looks like there are actually several tabs/process (at least if you have enough tabs), but there can easily be several processes controlling separate tabs within the same window.
Is there a reason Chrome does not use one thread for each tab, such as security?
That too, but the reason they documented when introducing Chrome was for stability. I can say that Chrome often warns me that a selection of tabs[1] appears to be stopped, and asks if I want to kill them; it more often appears to freeze -- but switching to a different tab is usually effective in getting some response, while I wait the issue out. [1] Not sure if the selection is exactly equal to those handled by a single process, but it seems so. -jJ

On Thu, Feb 9, 2012 at 11:39 AM, Jim Jewett <jimjjewett@gmail.com> wrote:
On Thu, Feb 9, 2012 at 2:08 PM, Sturla Molden <sturla@molden.no> wrote:
And Chrome uses one *process* for each tab, right?
Can we stop discussing Chrome here? It doesn't really matter. -- --Guido van Rossum (python.org/~guido)

On 09.02.2012 19:25, Massimo Di Pierro wrote:
As masklinn says, the difference between garbage collection and reference counting is more than an implementation issue.
Actually it is not. The GIL is a problem for those who want to use threading.Thread and plain Python code for parallel processing. Those who think in those terms typically have prior experience with Java or .NET. Processes are excellent for concurrency, cf. multiprocessing, os.fork and MPI. They actually are more efficient than threads (due to avoidance of false sharing of cache lines) and safer (deadlocks and livelocks are more difficult to produce). And I assume students who learn to use such tools from the start are not annoyed by the GIL. The GIL annoys those who have learned to expect threading.Thread for CPU-bound concurrency in advance -- which typically means prior experience with Java. Python threads are fine for their intended use -- e.g. I/O and background tasks in a GUI. Sturla
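A minimal sketch of that process-based style using multiprocessing; the work function is just a CPU-bound stand-in:

    from multiprocessing import Pool

    def work(n):
        # Runs in a separate process, so the parent interpreter's GIL is irrelevant.
        return sum(i * i for i in range(n))

    if __name__ == '__main__':
        pool = Pool()          # one worker process per CPU by default
        try:
            print(pool.map(work, [10**5] * 8))
        finally:
            pool.close()
            pool.join()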

On Fri, Feb 10, 2012 at 03:19:36AM +0800, Matt Joiner wrote:
The GIL is almost entirely a PR issue. In actual practice, it is so great (simple, straightforward, functional) I believe that it is a sign of Guido's time machine-enabled foresight. --titus -- C. Titus Brown, ctb@msu.edu

On Thu, Feb 09, 2012 at 08:42:40PM +0100, Masklinn wrote:
Are we scheduling interventions for me now? 'cause there's a lot of people who want to jump in that queue :) dabeaz understands this stuff at a deeper level than me, which is often a handicap in these kinds of discussions, IMO. (He's also said that he prefers message passing to threading.) The point is that in terms of actually making my own libraries and parallelizing code, the GIL has been very straightforward, cross platform, and quite simple for understanding the consequences of a fairly wide range of multithreading models. Most people want to go do inappropriately complex things ("ooh! threads! shiny!") with threads and then fail to write robust code or understand the scaling of their code; I think the GIL does a fine job of blocking the simplest stupidities. Anyway, I love the GIL myself, although I think there is a great opportunity for a richer & more usable mid-level C API for both thread states and interpreters. cheers, --titus -- C. Titus Brown, ctb@msu.edu

On Thu, Feb 9, 2012 at 11:19 AM, Matt Joiner <anacrolix@gmail.com> wrote:
I'd actually say that using OS threads is too heavy *specifically* for trivial cases. If you spawn a thread to add two numbers you'll have a huge overhead. If you spawn a thread to do something significant, the overhead doesn't matter much. Note that even in Java, everyone uses thread pools to reduce thread creation overhead. -- --Guido van Rossum (python.org/~guido)

On Fri, Feb 10, 2012 at 5:19 AM, Matt Joiner <anacrolix@gmail.com> wrote:
Have you even *tried* concurrent.futures (http://docs.python.org/py3k/library/concurrent.futures)? Or the 2.x backport on PyPI (http://pypi.python.org/pypi/futures)? Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
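
For reference, a minimal concurrent.futures sketch (the recursive fib is just an illustrative CPU-bound placeholder); switching between ProcessPoolExecutor and ThreadPoolExecutor is a one-line change:

    from concurrent.futures import ProcessPoolExecutor

    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    if __name__ == '__main__':
        with ProcessPoolExecutor(max_workers=4) as executor:
            for n, value in zip(range(25, 29), executor.map(fib, range(25, 29))):
                print(n, value)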

On Fri, 10 Feb 2012 21:01:07 +0100 Sturla Molden <sturla@molden.no> wrote:
In what way does the mmap module fail to provide your binary file interface? <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
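
For completeness, a minimal mmap sketch of the kind of binary file access being discussed here; 'data.bin' is a placeholder name and the file is assumed to exist and be at least 8 bytes long:

    import mmap

    f = open('data.bin', 'r+b')
    m = mmap.mmap(f.fileno(), 0)      # length 0 maps the whole file
    header = m[:4]                    # slices behave like bytes
    m[4:8] = b'\x00\x00\x00\x00'      # writes go straight back to the file
    m.close()
    f.close()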

Massimo Di Pierro writes:
Well, maybe you should teach your students the rudiments of lying, erm, "statistics". That -3% on the TIOBE index is a steaming heap of FUD, as Anatoly himself admitted. Feb 2011 is clearly above trend, Feb 2012 below it. Variables vary, OK? So at the moment it is absolutely unclear whether Python's trend line has turned down or even decreased slope. And the RedMonk ranking shows Python at the very top.
Maybe they should learn something about the reality of the IT industry, too. According to the TIOBE survey, COBOL and PL/1 are in the same class (rank 51-100, basically indistinguishable) with POSIX shell. Old programming languages never die ... and experts in them only become more valuable with time. Python skills will hardly become "obsolete" in the next decade, certainly not in the next 4 years. You say "dismiss or address the problem." Is there a problem? I dunno. Popularity is nice, but I really don't know if I would want to use a Python that spent the next five years (because that's what it will take) fixing what ain't broke to conform to undergraduate misconceptions. Sure, it would be nice to have more robust support for installing non-stdlib modules such as numpy. But guess what? That's a hard nut to crack, and more, people have been working quite hard on the issue for a while. The distutils folks seem to be about to release at this point -- I guess the Time Machine has struck again! And by the way, which of Ruby, Java, Javascript, and C++ provides something like numpy that's easier to install? Preferably part of their stdlib? In my experience on Linux and Mac, at least, numerical code has always been an issue, whether it's numpy (once that I can remember, and that was because of some dependency which wouldn't build, not numpy itself), Steel Bank Common Lisp, ATLAS, R, .... The one thing that bothers me about the picture at TIOBE is the Objective-C line. I assume that's being driven by iPhone and iPad apps, and I suppose Java is being driven in part by Android. It's too bad Python can't get a piece of that action!

On Thu, Feb 9, 2012 at 1:14 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
It's too bad Python can't get a piece of that action!
Getting closer: http://morepypy.blogspot.com/2012/02/almost-there-pypys-arm-backend_01.html -eric

On Thu, Feb 9, 2012 at 11:49 AM, Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
At the University of Toronto we tell students to use the Wing IDE (Wing 101 was developed specifically for our use in the classroom, in fact). All classroom examples are done either in the interactive interpreter, or in a session of Wing 101. All computer lab sessions are done using Wing 101, and the first lab is dedicated specifically for introducing how to edit files with it and use its debugging features. If students don't like IDLE, tell them to use a different editor instead, and pretend that Python doesn't include one with itself. (By default IDLE only shows an interactive session, so if they get curious and click-y they'll still be in the dark.) -- Devin

I think if easy_install, gevent, numpy (*), and win32 extensions were included in 3.x, together with a slightly better Idle (still based on Tkinter, with multiple pages, autocompletion, collapsible blocks, line numbers, better printing with syntax highlighting), and if easy_install were accessible via Idle, this would be a killer version. Longer term, removing the GIL and using garbage collection should be a priority. I am not sure what is involved and how difficult it is, but perhaps this is what PyCon money can be used for. If this cannot be done without breaking backward compatibility again, then 3.x should be considered an experimental branch, people should be advised to stay with 2.7 (2.8?) and then skip to 4.x directly when these problems are resolved. Python should not make a habit of breaking backward compatibility. It would be really nice if it were to include an async web server (based on gevent for example) and a better parser for HTTP headers and a python based template language (like mako or the web2py one), not just for the web but for document generation in general. Massimo On Feb 9, 2012, at 11:12 AM, Edward Lesmes wrote:

Massimo Di Pierro wrote:
IDLE does look a little long in the tooth.
It isn't difficult to find out about previous attempts to remove the GIL. Googling for "python removing the gil" brings up plenty of links, including: http://www.artima.com/weblogs/viewpost.jsp?thread=214235 http://dabeaz.blogspot.com.au/2011/08/inside-look-at-gil-removal-patch-of.ht... Or just use Jython or IronPython, neither of which have a GIL. And since neither of them support Python 3 yet, you have no confusing choice of version to make. I'm not sure if IronPython is suitable for teaching, if you have to support Macs as well as Windows, but as a counter-argument against GIL trolls, there are two successful implementations of Python without the GIL. (And neither is as popular as CPython, which I guess says something about where people's priorities lie. If the GIL was as serious a problem in practice as people claim, there would be far more interest in Jython and IronPython.)
Python 4.x (Python 4000) is pure vapourware. It is irresponsible to tell people to stick to Python 2.7 (there will be no 2.8) in favour of something which may never exist. http://www.python.org/dev/peps/pep-0404/ -- Steven

On 10Feb2012 05:50, Steven D'Aprano <steve@pearwood.info> wrote: | Python 4.x (Python 4000) is pure vapourware. It it irresponsible to tell | people to stick to Python 2.7 (there will be no 2.8) in favour of something | which may never exist. | | http://www.python.org/dev/peps/pep-0404/ Please tell me this PEP number is deliberate! -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Once I reached adulthood, I never had enemies until I posted to Usenet. - Barry Schwartz <bbs@hankel.rutgers.edu>

On 12/02/2012 00:18, Cameron Simpson wrote:
It is, sir! At first the number was taken by the virtualenv PEP with no special meaning, just the next number in sequence, but when Barry wrote up the 2.8 Unrelease PEP and took the number 405, the occasion was too good to be missed and the numbers were swapped. Cheers

On Thu, Feb 9, 2012 at 9:46 AM, Massimo Di Pierro < massimo.dipierro@gmail.com> wrote:
IIRC gevent still needs to be ported to 3.x (maybe someone with the necessary skills should apply to the PSF for funding). But the rest sounds like the domain of a superinstaller, not inclusion in the stdlib. IDLE will never be able to compete with Eclipse -- you can love one or the other but not both. Longer term removing the GIL and using garbage collection should be a
priority. I am not sure what is involved and how difficult it is but perhaps this is what PyCon money can be used for.
I think the best way to accomplish both is to focus on PyPy. It needs porting to 3.x; Google has already given them some money towards this goal.
That's really bad advice. 4.x will not be here for another decade.
Python should not make a habit of breaking backward compatibility.
Agreed. 4.x should be fully backwards compatible -- with 3.x, not with 2.x. It would be really nice if it were to include an async web server (based on
Again, that's a bundling issue. With the infrequency of Python releases, anything still under development is much better off being distributed separately. Bundling into core Python requires a package to be essentially stable, i.e., dead. -- --Guido van Rossum (python.org/~guido)

On 2/9/2012 12:46 PM, Massimo Di Pierro wrote:
I think if easy_install, gevent, numpy (*), and win32 extensions where included in 3.x, together with a slightly better Idle (still based on
I am working on the patches already on the tracker, starting with bug fixes.
Tkinter, with multiple pages,
If you mean multiple tabbed pages in one window, I believe there is a patch. As for autocompletion: IDLE already has 'auto-completion'. If you mean something else, please explain.
collapsible [blocks], line numbers,
I have thought about those.
better printing with syntax highlighting),
Better basic printing support is really needed (#1528593). Color printing, if not possible now, would be nice, as color printers are common now. I have no idea if tkinter print support makes either easier now.
and if easy_install were accessible via Idle, this would be a killer version.
That should be possible with an extension.
Longer term removing the GIL and using garbage collection should be a priority. I am not sure what is involved and how difficult it is but
As has been discussed here and on pydev, the problems include things like making Python slower and disabling C extensions.
For non-Euro-Americans, a major problem with Python 1/2 was the restriction of identifiers to ascii. This was *fixed* by Python 3. When I went to Japan a couple of years ago and stopped in a general bookstore (like Borders), its computer language section had about 10 books on Python, most in Japanese as I remember. So it is apparently in use there.
resolved. Python should not make a habit of breaking backward compatibility.
I believe the main problem has been the unicode switch, which is critical to Python being a world language. Removal of old-style classes was mostly a non-issue, except for the very few who intentionally continued to use them. -- Terry Jan Reedy

On Thu, Feb 09, 2012 at 11:46:45AM -0600, Massimo Di Pierro wrote:
I am not sure that popularity contests are based on technical merits/demerits alone. I guess people here care less about popularity and more about good tools in python land. So if there are things lacking in the Python world, then those are good project opportunities. What I personally feel is that the various plug-and-play libraries are giving JavaScript a thumbs up, and more is going on in the web world on the front-end than the back-end. So, if there is a requirement for a Python programmer, there is an assumption that he should know web techs too. There are also PHP/Ruby/Java folks who know web technologies. So web tech like JavaScript gets counted 4x. -- Senthil

On Feb 9, 2012, at 12:46 PM, Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
I think if easy_install, gevent, numpy (*), and win32 extensions were included in 3.x, together with a slightly better Idle (still based on Tkinter, with multiple pages, autocompletion, collapsible blocks, line numbers, better printing with syntax highlighting), and if easy_install were accessible via Idle, this would be a killer version.
Longer term removing the GIL and using garbage collection should be a priority. I am not sure what is involved and how difficult it is but perhaps this is what PyCon money can be used for.
Please do not volunteer revenue that does not exist, or PSF funds for things without a grant proposal or working group. Especially PyCon revenue - which does not exist. Jesse

On 09.02.2012 19:02, Terry Reedy wrote:
And make installing Python on the Mac a lottery?
Or a subset of NumPy? The main offender is numpy.linalg, which needs a BLAS library that should be tuned to the hardware. (There is a reason NumPy and SciPy binary installers on Windows are bloated.) And from what I have seen of complaints about building NumPy on Mac, it tends to be the BLAS/LAPACK stuff that drives people crazy, particularly those who want to use ATLAS (which is a bit stupid, as OpenBLAS/GotoBLAS2 is easier to build and much faster). If Python comes with NumPy built against the Netlib reference BLAS, there will be lots of complaints that "Matlab is so much faster than Python" when it is actually the BLAS libraries that are different. But I am not sure we want 50-100 MB of bloat in the Python binary installer just to cover all possible cases of CPU-tuned OpenBLAS/GotoBLAS2 or ATLAS libraries. Sturla

On Thu, Feb 9, 2012 at 10:30 AM, Sturla Molden <sturla@molden.no> wrote:
I don't know much of this area, but maybe this is something where a dynamic installer (along the lines of easy_install) might actually be handy? The funny thing is that most Java software is even more bloated and you rarely hear about that (at least not from Java users ;-). -- --Guido van Rossum (python.org/~guido)

On 09.02.2012 19:36, Guido van Rossum wrote:
I don't know much of this area, but maybe this is something where a dynamic installer (along the lines of easy_install) might actually be handy?
That is what NumPy and SciPy does on Windows. But it also means the "superpack" installer is a very big download. Sturla

Sturla Molden, 09.02.2012 19:46:
I think this is an area where distributors can best play their role. If you want Python to include SciPy, go and ask Enthought. If you also want an operating system with it, go and ask Debian or Canonical. Or macports, if you prefer paying for your apples instead. Stefan

From my own observations, the recent drop is surely due to uncertainty around Python 3, and an increase of alternatives on the server side, such as Node.
The transition is only going to get more painful as system critical software lags on 2.x while users clamour for 3.x. I understand there are some fundamental problems in running both simultaneously which makes gradual integration not a possibility. Dynamic typing also doesn't help, making it very hard to automatically port, and update dependencies. Lesser reasons include an increasing gap in scalability to multicore compared with other languages (the GIL being the gorilla here, multiprocessing is unacceptable as long as native threading is the only supported concurrency mechanism), and a lack of enthusiasm from key technologies and vendors: GAE, gevent, matplotlib, are a few encountered personally.

On Fri, 10 Feb 2012 01:35:17 +0800 Matt Joiner <anacrolix@gmail.com> wrote:
the GIL being the gorilla here, multiprocessing is unacceptable as long as native threading is the only supported concurrency mechanism
If threading is the only acceptable concurrency mechanism, then Python is the wrong language to use. But you're also not building scaleable systems, which is most of where it really matters. If you're willing to consider things other than threading - and you have to if you want to build scaleable systems - then Python makes a good choice. Personally, I'd like to see a modern threading model in Python, especially if its tools can be extended to work with other concurrency mechanisms. But that's a *long* way into the future. As for "popular vs. good" - "good" is a subjective measure. So the two statements "anything popular is good" and "nothing popular was ever good unless it had no competition" can both be true. Personally, I lean toward the latter. I tend to find things that are popular to not be very good, which makes me distrust the taste of the populace. The python core developers, on the other hand, have an excellent record when it comes to keeping the language good - and the failures tend to be concessions to popularity! So I'd rather the current system for adding features stay in place and *not* see the language add features just to gain popularity. We already have Perl if you want that kind of language. That said, it's perfectly reasonable to suggest changes you think will improve the popularity of the language. But be prepared to show that they're actually good, as opposed to merely possibly popular. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 09.02.2012 19:42, Mike Meyer wrote:
Yes or no... Python is used for parallel computing on the biggest supercomputers, monsters like Cray and IBM blue genes with tens of thousands of CPUs. But what really fails to scale is the Python module loader! For example it can take hours to "import numpy" for 30,000 Python processes on a blue gene. And yes, nobody would consider using Java for such systems, even though Java does not have a GIL (well, threads do not matter that much on a cluster with distributed memory anyway). It is Python, C and Fortran that are popular. So that really disproves the claim that Python sucks for big concurrency, except perhaps for the module loader. Sturla

On Thu, Feb 9, 2012 at 10:57 AM, Sturla Molden <sturla@molden.no> wrote:
I'm curious about the module loader problem. Did someone ever analyze the cause and come up with a fix? Is it the import lock? Maybe it's something for the bug tracker. -- --Guido van Rossum (python.org/~guido)

On 09.02.2012 20:05, Guido van Rossum wrote:
See this: http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html The offender is actually imp.find_module, which results in huge number of failed open() calls when used concurrently from many processes. So a solution is to have one process locate the modules and then broadcast their location to the other processes. There is even a paper on the issue. Here they suggest importing from ramdisk might work on IBM blue gene, but not on Cray. http://www.cs.uoregon.edu/Research/paracomp/papers/iccs11/iccs_paper_final.p... Another solution might be to use sys.meta_path to bypass imp.find_module: http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059813.html The best solution would of course be to fix imp.find_module so it scales properly. Sturla
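
A minimal sketch of the sys.meta_path idea, written against the later importlib API (Python 3.4+); the locations mapping, and broadcasting it from rank 0 over MPI, are assumptions for illustration rather than anything from the linked posts:

    import sys
    import importlib.util

    class CachedLocationFinder:
        """Resolve module names from a precomputed {name: filename} map
        instead of stat()ing every directory on sys.path."""

        def __init__(self, locations):
            self.locations = locations

        def find_spec(self, name, path=None, target=None):
            filename = self.locations.get(name)
            if filename is None:
                return None      # fall through to the normal finders
            return importlib.util.spec_from_file_location(name, filename)

    # Rank 0 would scan the filesystem once, broadcast `locations` to the
    # other ranks, and each process would then install the finder:
    # sys.meta_path.insert(0, CachedLocationFinder(locations))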

On Thu, 09 Feb 2012 20:25:48 +0100 Sturla Molden <sturla@molden.no> wrote:
The offender is actually imp.find_module, which results in huge number of failed open() calls when used concurrently from many processes.
Ah, I see why I never ran into it. I build systems that start by loading all the modules they need, then fork()ing many processes from that parent. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
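
A bare-bones sketch of that pattern (POSIX only; numpy stands in for any expensive import):

    import os
    import numpy                      # the heavy import happens once, in the parent

    children = []
    for i in range(4):
        pid = os.fork()
        if pid == 0:                  # child inherits the already-loaded modules
            print(os.getpid(), numpy.sqrt(i))
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)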

On 09.02.2012 20:48, Mike Meyer wrote:
Yes, but that would not work with MPI (e.g. mpi4py) where the MPI runtime (e.g. MPICH2) is starting the Python processes. Theoretically the issue should be present on Windows when using multiprocessing, but not on Linux as multiprocessing is using os.fork. Sturla

On Thu, 09 Feb 2012 19:57:20 +0100 Sturla Molden <sturla@molden.no> wrote:
Whether or not hours of time to import is an issue depends on what you're doing. I typically build systems running on hundreds of CPUs for weeks on end, meaning you get years of CPU time per run. So if it took a few hours of CPU time to get started, it wouldn't be much of a problem. If it took a few hours of wall clock time - well, that would be more of a problem, mostly because that long of an outage would be unacceptable. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 2/9/2012 1:57 PM, Sturla Molden wrote:
Mike Meyer posted that on pydev today http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html They determined that the time was gobbled by *finding* modules in each process, so they cut hours by finding them in 1 process and sending the locations to the other 29,999. We are already discussing how to use this lesson in core Python. The sub-thread is today's posts in "requirements for moving __import__ over to importlib?" -- Terry Jan Reedy

Yes, but core Python doesn't have any true concurrency mechanisms other than native threading, and they're too heavyweight for this purpose alone. On top of this they're useless for Python-only parallelism.
Too far. It needs to be now. The downward spiral is already beginning. Mobile phones are going multicore. My next desktop will probably have 8 cores or more. All the heavyweight languages are firing up thread/STM standardizations and implementations to make this stuff more performant and easier than it already is.
This doesn't apply to "enabling" features. Features that make it possible for popular stuff to happen. Concurrency isn't popular, but parallelism is. At least where the GIL is concerned, a good alternative concurrency mechanism doesn't exist. (The popular one is native threading).

On Fri, 10 Feb 2012 03:16:00 +0800 Matt Joiner <anacrolix@gmail.com> wrote:
Huh? Core python has concurrency mechanisms other than native threading. I don't know what your purpose is, but for mine (building horizontally scaleable systems of various types), they work fine. They're much easier to design with and maintain than using threads as well. They also work well in Python-only systems. If you're using "true" to exclude anything but threading, then you're just playing word games. The reality is that most problems don't need threading. The only thing it buys you over the alternatives is easy shared memory. Very few problems actually require that.
Yes, Python needs something like that. You can't have it without breaking backwards compatibility. It's not clear you can have it without serious performance hits in Python's primary use area, which is single-threaded scripts. Which means it's probably a Python 4K feature. There have been a number of discussions on python-ideas about this. I submitted a proto-pep that covers most of that to python-dev for further discussion and approval. I'd suggest you chase those things down.
No, the process needs to apply to *all* changes. Even changes to implementation details - like removing the GIL. If your implementation that removes the GIL causes a 50% slowdown in single-threaded python code, it ain't gonna happen. But until you actually propose a change, it won't matter. Nothing's going to happen until someone actually does something more than talk about it. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 9 February 2012 18:35, Matt Joiner <anacrolix@gmail.com> wrote:
I think it's not only a matter of third-party modules not being ported quickly enough or the amount of work involved when facing the 2->3 conversion. I bet a lot of people don't want to upgrade for another reason: unicode. The impression I got is that python 3 forces the user to use and *understand* unicode and a lot of people simply don't want to deal with that. In python 2 there was no such a strong imposition. Python 2's string type acting both as bytes and as text was certainly ambiguous and "impure" on different levels, and changing that was definitely a win in terms of purity and correctness. I bet most advanced users are happy with this change. On the other hand, the Python 2 average user was free to ignore that distinction even if that meant having subtle bugs hidden somewhere in his/her code. I think this aspect shouldn't be underestimated. --- Giampaolo http://code.google.com/p/pyftpdlib/ http://code.google.com/p/psutil/ http://code.google.com/p/pysendfile/

On Thu, Feb 9, 2012 at 11:31 AM, Matt Joiner <anacrolix@gmail.com> wrote:
The difference is that *if* you hit a Unicode error in 2.x, you're done for. Even understanding Unicode doesn't help. In 3.x, you will hit Unicode problems less frequently than in 2.x, and when you do, the problem can actually be overcome, and then your code is better. In 2.x, the typical solution, when there *is* a solution, involves making your code messier and sending up frequent prayers to the gods of Unicode. -- --Guido van Rossum (python.org/~guido)
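
A small Python 3 sketch of what "the problem can actually be overcome" looks like in practice: the error is raised where the bad bytes are decoded, and the fix is explicit at that spot (the latin-1 fallback is just an example policy, not a recommendation):

    data = b'caf\xe9'                  # latin-1 bytes, not valid UTF-8
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        text = data.decode('latin-1')  # or data.decode('utf-8', errors='replace')
    print(text)                        # café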

On 2/9/2012 2:31 PM, Matt Joiner wrote:
I am really puzzled what you mean. I have used Python 3 since 3.0 alpha and as long as I have used strictly ascii, I have encountered no such issues.
I have learned about unicode, but just so I could play around with other characters.
I had to learn Unicode right then and there. Fortunately, the Python docs HOWTO on Unicode is excellent.
Were you doing some non-ascii or non-average framework-like things? Would you really not have had to learn the same about unicode if you were using 2.x? -- Terry Jan Reedy

On Fri, Feb 10, 2012 at 5:25 AM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
The problem for average users *right now* is that many of the Unicode handling tools that were written for the blurry "is-it-bytes-or-is-it-text?" 2.x 8-bit str type haven't been ported to 3.x yet. That's currently happening, and the folks doing it are the ones who really have to make the adjustment, and figure out what they can deal with on behalf of their users and what they need to expose (if anything). The idea with Python 3 unicode is to have errors happen at (or at least close to) the point where the code is doing something wrong, unlike the Python 2 implicit conversion model, where either data gets silently corrupted, or you get a Unicode error far from the location that introduced the problem. I actually find it somewhat amusing when people say that python-dev isn't focusing on users enough because of the Python 3 transition or the Windows installer problems. What they *actually* seem to be complaining about is that python-dev isn't focused entirely on users that are native English speakers using an expensive proprietary OS. And that's a valid observation - most of us are here because we like Python and primarily want to make it better for the environments where *we* use it, which is mostly a combination of Linux and Mac users, a few other POSIX based platforms and a small minority of Windows developers. Given the contrariness of Windows as a target platform, the time of those developers is mostly spent on making it keep working, and bringing it up to feature parity with the POSIX version, so cleaning up the installation process falls to the wayside. (And, for all the cries of, "Python should be better supported on Windows!", we just don't see many Windows devs signing up to help - since I consider developing for Windows its own special kind of hell that I'm happy to never have to do again, it doesn't actually surprise me there's a shortage of people willing to do it as a hobby) In terms of actually *fixing it*, the PSF doesn't generally solicit grant proposals, it reviews (and potentially accepts) them. If anyone is serious about getting something done for 3.3, then *write and submit a grant proposal* to the PSF board with the goal of either finalising the Python launcher for Windows, or else just closing out various improvements to the current installer that are already on the issue tracker (e.g. version numbers in the shortcut names, an option to modify the system PATH). Even without going all the way to a grant proposal, go find those tracker items I mentioned and see if there's anything you can do to help folks like Martin von Loewis, Brian Curtin and Terry Reedy close them out. In the meantime, if the python.org packages for Windows aren't up to scratch (and they aren't in many ways), *use the commercially backed ones* (or one of the other sumo distributions that are out there). Don't tell your students to grab the raw installers directly from python.org, redirect them to the free rebuilds from ActiveState or Enthought, or go all out and get them to install something like Python(X, Y). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

First of all all the Python developers are doing an amazing job, and none of the comments should be taken as a critique but only as a suggestion. On Feb 9, 2012, at 3:34 PM, Nick Coghlan wrote: [...]
This is what I do now. I tell my students to use Enthought if they have trouble. Yet there are issues with the license and with 32-bit (free) vs 64-bit (not free) versions. Long term, I do not think this is what we should encourage.

On 2/9/2012 2:16 PM, Giampaolo Rodolà wrote:
Do *you* think that? Or are you reporting what others think? In either case, we have another communication problem. If one only uses the ascii subset, the usage of 3.x strings is transparent. As far as I can think, one does not need to know *anything* about unicode to use 3.x. In 3.3, there will not even be a memory hit. We should be saying that. Thanks for the heads-up. It is hard to know what misconceptions people have until someone reports them ;-).
In python 2 there was no such a strong imposition.
Nor is there in 3.x. We need to communicate that. I may give it a try on python-list. If and when one does want to use more characters, it should be *easier* in 3.x than in 2.x, especially for non-Latin1 Western European chars. -- Terry Jan Reedy

On 2/9/2012 11:30 PM, Matt Joiner wrote:
Not true, it's necessary to understand that encodings translate to and from bytes,
Only if you use byte encodings for ascii text. I never have, and I would not know why you do unless you are using internet modules that do not sufficiently hide such details. Anyway...
So one only needs to know one encoding name, which most should know anyway, and that it *is* an encoding name.
and how to use the API.
Give the required parameter, which is standard.
In 2.x you rarely needed to know what unicode is.
All one *needs* to know about unicode, that I can see, is that unicode is a superset of ascii, that ascii number codes remain the same, and that one can ignore just about everything else until one uses (or wants to know about) non-ascii characters. Since one will see 'utf-8' here and there, it is probably useful to know that the utf-8 encoding is a superset of the ascii encoding, so that ascii text *is* utf-8 text. -- Terry Jan Reedy
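
A quick interpreter session illustrating that last point, that ascii text is already valid utf-8:

    >>> 'abc'.encode('ascii') == 'abc'.encode('utf-8')
    True
    >>> 'abc'.encode('utf-8')
    b'abc'
    >>> 'café'.encode('utf-8')         # non-ascii characters become multi-byte sequences
    b'caf\xc3\xa9'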

There are a lot of things covered in this thread. I want to address 2 of them. 1. Garbage Collection. Python has garbage collection. There is no free() function in Python, anyone who says that Python does not have GC is talking nonsense. CPython uses reference counting as its means of implementing GC. Ref counting has different performance characteristics from tracing GC, but it only makes sense to consider this in the context of overall Python performance. One key disadvantage of ref-counting is that it does not play well with threads, which leads on to... 2. Global Interpreter Lock and Threads. The GIL is so deeply embedded into CPython that I think it cannot be removed. There are too many subtle assumptions pervading both the VM and 3rd party code, to make truly concurrent threads possible. But are threads the way to go? Javascript does not have threads. Lua does not have threads. Erlang does not have threads; Erlang processes are implemented (in the BEAM engine) as coroutines. One of the Lua authors said this about threads: (I can't remember the quote so I will paraphrase) "How can you program in a language where 'a = a + 1' is not deterministic?" Indeed. What Python needs are better libraries for concurrent programming based on processes and coroutines. Cheers, Mark.
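
The 'a = a + 1' point is easy to demonstrate in CPython itself, GIL and all; a minimal sketch (the tiny switch interval is only there to make the race show up quickly):

    import sys
    import threading

    sys.setswitchinterval(1e-6)        # force very frequent thread switches

    counter = 0

    def bump(times):
        global counter
        for _ in range(times):
            counter = counter + 1      # load, add, store: not atomic

    threads = [threading.Thread(target=bump, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                     # usually well below 400000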

The way I see it is not whether Python has threads, fibers, coroutines, etc. The problem is that in 5 years we are going to have on the market CPUs with 100 cores (my phone has 2, my office computer has 8 not counting GPUs). The compiler/interpreters must be able to parallelize tasks using those cores without duplicating the memory space. Erlang may not have threads in the sense that it does not expose threads via an API but provides optional parallel schedulers where coroutines are distributed automatically over the available cores/CPUs (http://erlang.2086793.n4.nabble.com/Some-facts-about-Erlang-and-SMP-td210877...). Different languages have different mechanisms for taking advantage of multiple cores without forking. Python does not provide a mechanism and I do not know if anybody is working on one. In Python, currently, you can only do threading to parallelize your code without duplicating memory space, but performance decreases instead of increasing with the number of cores. This means threading is only good for concurrency not for scalability. The GC vs reference counting (RC) is the heart of the matter. With RC every time a variable is allocated or deallocated you need to lock the counter because you do not know who else is accessing the same variable from another thread. This forces the interpreter to basically serialize the program even if you have threads, cores, coroutines, etc. Forking is a solution only for simple toy cases and in trivially parallel cases. People use processes to parallelize web servers and task queues where the tasks do not need to talk to each other (except with the parent/master process). If you have 100 cores even with a small 50MB program, in order to parallelize it you go from 50MB to 5GB. Memory and memory access become a major bottleneck. Erlang Massimo On Feb 10, 2012, at 3:29 AM, Mark Shannon wrote:

On 2012-02-10, at 15:52 , Massimo Di Pierro wrote:
Erlang may not have threads in the sense that it does not expose threads via an API but provides optional parallel schedulers
-smp has been enabled by default since R13 or R14, it's as optional as multithreading being optional because you can bind a process to a core.
In Python, currently, you can only do threading to parallelize your code without duplicating memory space, but performance decreases instead of increasing with number of cores. This means threading is only good for concurrency not for scalability.
That's definitely not true: you can also fork, and multiprocessing, while not ideal by a long shot, provides a number of tools for building concurrent applications via multiple processes.

Massimo Di Pierro, 10.02.2012 15:52:
Seriously - what's wrong with forking? multiprocessing is so incredibly easy to use that it's hard for me to understand why anyone would fight for getting threading to do essentially the same thing, just less safe. Threading is a seriously hard problem, very tricky to get right and full of land mines. Basically, you start from a field that's covered with one big mine, and start cutting it down until you can get yourself convinced that the remaining mines (if any, right?) are small enough to not hurt anyone. They usually do anyway, but at least not right away. This is generally worth a read (not necessarily for the conclusion, but definitely for the problem description): http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
Well, nothing keeps you from putting your data into shared memory if you use multiple processes. It's not that hard either, but it has the major advantage over threading that you can choose exactly what data should be shared, so that you can more easily avoid race conditions and unintended interdependencies. Basically, you start from a safe split and then add explicit data sharing and messaging until you have enough shared data and synchronisation points to make it work, while still keeping up a safe and efficient concurrent system. Note how this is the opposite of threading, where you start off from the maximum possible unsafety where all state is shared, and then wade through it with a machete trying to cut down unsafe interaction points. And if you miss any one spot, you have a problem.
This means threading is only good for concurrency not for scalability.
Yes, concurrency, or more specifically, I/O concurrency is still a valid use case for threading.
I think you should read up a bit on the various mechanisms for parallel processing. Stefan
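
A minimal sketch of "choose exactly what data should be shared" using multiprocessing's shared-memory types; the worker function and the split into two halves are purely illustrative:

    from multiprocessing import Process, Value, Array

    def work(total, samples, start, stop):
        partial = 0.0
        for i in range(start, stop):
            samples[i] *= 2            # visible to the parent: shared memory
            partial += samples[i]
        with total.get_lock():         # the one explicit synchronisation point
            total.value += partial

    if __name__ == '__main__':
        total = Value('d', 0.0)
        samples = Array('d', [0.5, 1.5, 2.5, 3.5])
        workers = [Process(target=work, args=(total, samples, 0, 2)),
                   Process(target=work, args=(total, samples, 2, 4))]
        for p in workers:
            p.start()
        for p in workers:
            p.join()
        print(total.value, list(samples))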

On Feb 10, 2012, at 9:28 AM, Stefan Behnel wrote:
yes I should ;-) (Perhaps I should take this course http://www.cdm.depaul.edu/academics/pages/courseinfo.aspx?CrseId=001533) The fact is, in my experience, many modern applications where performance is important try to take advantage of all parallelization available. I have worked for many years in lattice QCD and I have written code that runs on various parallel machines. We used processes to parallelize across nodes, threads to parallelize on a single node, and assembly vectorial instructions to parallelize within each core. This used to be a state-of-the-art way of programming but now I see these patterns trickling down to many consumer applications, for example games. People do not like threads because of the need for locking but, as you increase the number of cores, the bottleneck becomes memory access. If you use processes, you don't just bloat ram usage killing cache performance but you need to use message passing for interprocess communication. Message passing requires copying data, which is expensive (remember ram is the bottleneck). Even worse, sometimes message passing cannot be done using ram only and you need disk-buffered messages for interprocess communication. Some programs are parallelized ok with processes. Those I have experience with require both processes and threads. Again, this does not mean using threading APIs. The VM should use threads to parallelize tasks. How this is exposed to the developer is a different matter. Massimo

On 10 February 2012 14:52, Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
I don't know much about forking, but I'm pretty sure that forking a process doesn't mean you double the amount of physical memory used. With copy-on-write, a lot of physical memory can be shared. -- Arnaud

On Feb 10, 2012, at 9:43 AM, Arnaud Delobelle wrote:
Anyway, copy-on-write does not solve the problem. The OS tries to save memory by not duplicating physical memory space and by assigning the different address spaces of the various forked processes to the same physical memory. But as soon as one process writes into the segment, the entire segment is copied. It has to be, the processes must have different address spaces. That is what fork does. Anyway, there are many applications that are parallelized well with processes (at least for a small number of cores/cpus).
-- Arnaud

Arnaud Delobelle, 10.02.2012 16:43:
That applies to systems that support both fork and copy-on-write. Not all systems are that lucky, although many major Unices have caught up in recent years. The Cygwin implementation of fork() is especially involved for example, simply because Windows lacks this idiom completely (well, in its normal non-POSIX identity, that is, where basically all Windows programs run). http://seit.unsw.adfa.edu.au/staff/sites/hrp/webDesignHelp/cygwin-ug-net-noc... Stefan

On Fri, Feb 10, 2012 at 9:38 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Intel already has immediate plans for 10 core cpus, those have well functioning HT so they should be considered 20 core. Two socket boards are quite common, there's 40 cores. 4+ socket boards exist bringing your total to 80+ cores connected to a bucket of dram on a single motherboard. These are the types of systems in data centers being made available to people to run their computationally intensive software on. That counts as general purpose in my book. -gps

On Fri, 10 Feb 2012 08:52:16 -0600 Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
Forking is a solution only for simple toy cases and in trivially parallel cases.
But threading is only a solution for simple toy cases and trivial levels of scaling.
Only if they haven't thought much about using processes to build parallel systems. They work quite well for data that can be handed off to the next process, and where the communications is a small enough part of the problem that serializing it for communications is reasonable, and for cases where the data that needs high-speed communications can be treated as a relocatable chunk of memory. And any combination of those three, of course. The real problem with using processes in python is that there's no way to share complex python objects between processes - you're restricted to ctypes values or arrays of those. For many applications, that's fine. If you need to share a large searchable structure, you're reduced to FORTRAN techniques.
That should be fixed in the OS, not by making your problem 2**100 times as hard to analyze. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Massimo Di Pierro wrote:
Forking is a solution only for simple toy cases and in trivially parallel cases. People use processes to parallelize web servers and task queues where the tasks do not need to talk to each other (except with the parent/master process). If you have 100 cores even with a small 50MB program, in order to parallelize it you go from 50MB to 5GB. Memory and memory access become a major bottleneck.
By the time we have 100-core CPUs, we'll be measuring RAM in TB, so that shouldn't be a problem ;-) Many Python use cases are indeed easy to scale using multiple processes which then each run on a separate core, so that approach is a very workable way forward. If you need to share data across processes, you can use a shared memory mechanism. In many cases, the data to be shared will already be stored in a database and those can easily be accessed from all processes (again using shared memory). I often find these GIL discussions a bit theoretical. In practice I've so far never run into any issues with Python scalability. It's other components that cause a lot more trouble, like e.g. database query scalability, network congestion or disk access being too slow. In cases where the GIL does cause problems, it's usually better to consider changing the application design and use asynchronous processing with a single threaded design or a multi-process design where each of the processes only uses a low number of threads (20-50 per process). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 10 2012)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Fri, 10 Feb 2012 19:36:52 +0100 "M.-A. Lemburg" <mal@egenix.com> wrote:
Just a warning: mixing threads and forks can be hazardous to your sanity. In particular, forking a process that has threads running has behaviors, problems and solutions that vary between Unix variants. Best to make sure you've done all your forks before you create a thread if you want your code to be portable. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Mike Meyer wrote:
Right. Applications using such strategies will usually have long running processes, so it's often better to spawn new processes than to use fork. This also helps if you want to bind processes to cores. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 10 2012)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

I think the much, much better response to the questions and comments around Python, the GIL and parallel computing in general is this: Yes, let's have more of that! It's like asking if people like pie, or babies. 99% of people polled are going to say "Yes, let's have more of that!" - so it goes with Python, the GIL, STM, Multiprocessing, Threads, etc. Where all of these discussions break down - and they always do - is that we lack:
1> Someone with a working patch for Pie
2> Someone with a fleshed out proposal/PEP on how to get more Pie
3> A group of people with time to bake more Pies, or that could be paid to make Pie
Banging on the table and asking for more Pie won't get us more Pie - what we need are actual proposals, in the form of well thought out PEPs, the people to implement and maintain the thing (see: unladen swallow), or working implementations. No one in this thread is arguing that having more Pie, or babies, would be bad. No one is arguing that more/better concurrency constructs would be bad either. Tools like concurrent.futures in Python 3 would be a good example of something recently added. The problem is people, plans and time. If we can solve the People and Time problems, instead of looking to already overworked volunteers, then I'm sure we can come up with a good Pie plan. I really like pie. Jesse

On 10.02.2012 19:36, M.-A. Lemburg wrote:
By the time we have 100-core CPUs, we'll be measuring RAM in TB, so that shouldn't be a problem ;-)
Actually, Python is already great for those. They are called GPUs, and programming them from Python with OpenCL is mostly text processing (the kernels are compiled from source strings at run time).
The "GIL problem" is much easier to analyze than most Python developers using Linux might think:
- Windows has no fork system call. SunOS used to have a very slow fork system call. The majority of Java developers worked with Windows or Sun, and learned to work with threads.
For which the current summary is:
- The GIL sucks because Windows has no fork.
Which some might say is the equivalent of:
- Windows sucks.
Sturla

Sturla Molden wrote:
I'm not sure why you think you need os.fork() in order to work with multiple processes. Spawning processes works just as well and, often enough, is all you really need to get the second variant working. The first variant doesn't need threads at all, but can not always be used since it requires all application components to play along nicely with the async approach. I forgot to mention a third variant: use a multi-process design with single threaded asynchronous processing in each process. This third variant is becoming increasingly popular, esp. if you have to handle lots and lots of individual requests with relatively low need for data sharing between the requests. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 10 2012)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Fri, Feb 10, 2012 at 11:54 AM, Sturla Molden <sturla@molden.no> wrote:
Please do not claim that fork() semantics and copy-on-write are good things to build off of... They are not. fork() was designed in a world *before threads* existed. It simply can not be used reliably in a process that uses threads and tons of real world practical C and C++ software that Python programs need to interact with, be embedded in or use via extension modules these days uses threads quite effectively. The multiprocessing module on posix would be better off if it offered a windows CreateProcess() work-a-like mode that spawns a *new* python interpreter process rather than depending on fork(). The fork() means multithreaded processes cannot reliably use the multiprocessing module (and those other threads could come from libraries or C/C++ extension modules that you cannot control within the scope of your own software that desires to use multiprocessing). This is likely not hard to implement, if nobody has done it already, as I believe the windows support already has to do much the same thing today. -gps
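
For what it's worth, later Python versions grew essentially what is being asked for here: a "spawn" start method (multiprocessing.get_context, Python 3.4+) that launches a fresh interpreter instead of fork()ing. A minimal sketch:

    import multiprocessing as mp

    def work(n):
        return n * n

    if __name__ == '__main__':
        ctx = mp.get_context('spawn')  # CreateProcess-style children, even on POSIX
        with ctx.Pool(2) as pool:
            print(pool.map(work, range(8)))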

Sorry for the late reply, but this itch finally got to me...
Please do not claim that fork() semantics and copy-on-write are good things to build off of...
They work just fine for large classes of problems that require hundreds or thousands of cores.
They are not. fork() was designed in a world *before threads* existed.
This is wrong. While the "name" thread may not have existed when fork() was created. the *concept* of concurrent execution in a shared address space predates the creation of Unix by a good decade. Most notably, Multics - what the creators of Unix were working on before they did Unix - at least discussed the idea, though it may never have been implemented (a common fate of Multics features). Also notable is that Unix introduced the then ground-breaking idea of having the command processor create a new process to run user programs. Before Unix, user commands were run in the process (and hence address space) of the command processor. Running things in what is now called "the background" (which this architecture made a major PITA) gave you concurrent execution in a shared address space - what we today call threads. The reason those systems did this was because creating a process was *expensive*. That's also why the Multics folks looked at threads. The Unix fork/exec pair was cheap and flexible, allowing the creation of a command processor that supported easy backgrounding, pipes, and IO redirection. Fork has since gotten more expensive, in spite of the ongoing struggles to keep it cheap.
Personally, I find that the fact that threads can't be used reliably in a process that forks makes threads a bad thing to build off of. After all, there's tons of real world practical software in many languages that python needs to interact with that uses fork effectively.
While it's a throwback to the 60s, it would make using threads and processes more convenient, but I don't need it. Why don't you submit a patch? <mike

[Replies have been sent to concurrency-sig@python.org] On Sun, 12 Feb 2012 23:14:51 +0100 Sturla Molden <sturla@molden.no> wrote:
subprocess and threads interact *really* badly on Unix systems. Python is missing the tools needed to deal with this situation properly. See http://bugs.python.org/issue6923. Just another of the minor reasons not to use threads in Python. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Mon, 13 Feb 2012 08:13:36 +0800 Matt Joiner <anacrolix@gmail.com> wrote:
This attitude is exemplary of the status quo in Python on threads: Pretend they don't exist or you'll get hurt.
Yup. After all, the answer to the question "Which modules in the standard library are thread-safe?" is "threading, queue, logging and functools" (at least, that's my best guess). Any effort to "fix" threading in Python is pretty much doomed until the authoritative answer to that question includes most of the standard library. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Mon, 13 Feb 2012 01:41:48 +0100 Sturla Molden <sturla@molden.no> wrote:
Not (quite) true. There are a few fringe languages that have embraced threading and been built (or worked over) from the ground up to work well with it. I haven't seen any let you mix multiprocessing and threading safely, though, so the attitude there is "pretend fork doesn't exist or you'll get hurt." These are the places where I've seen safe (as in, I trusted them as much as I'd have trusted a version written using processes) non-trivial (as in, they were complex enough that if they'd been written in a mainstream language like Python, I wouldn't have trusted them) threaded applications. I strongly believe we need better concurrency solutions in Python. I'm not convinced that threading is the best general solution, because threading is like the GIL: a kludge that solves the problem by fixing *everything*, whether it needs it or not, and at very high cost. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Den 12.02.2012 21:56, skrev Mike Meyer:
The "expensive" argument is also why the Windows API has no fork, although the Windows NT-kernel supports it. (There is even a COW fork in Windows' SUA.) I think fork() is the one function I have missed most when programming for Windows. It is the best reason to use SUA or Cygwin instead of the Windows API. Sturla

On Fri, Feb 10, 2012 at 9:52 AM, Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
uh... if you need to lock it for allocation, that is an issue with the malloc, rather than refcounting. And if you need to lock it for deallocation, then your program already has a (possibly threading-race-condition-related) bug. The problem is that you need to lock the memory for writing every time you acquire or release a view of the object, even if you won't be modifying the object. (And this changing of the refcount makes copy-on-write copy too much.) There are plenty of ways around that, mostly by using thread-local (or process-local or machine-local) proxies; the original object only gets one incref/decref from each remote thread; if sharable objects are delegated to a memory-controller thread, even better. Once you have the infrastructure for this, you could also more easily support "permanent" objects like None. The catch is that the overhead of having the refcount+pointer (even without the proxies) instead of just "refcount 4 bytes ahead" turns out to be pretty high, so those forks (and extensions, if I remember pyro http://irmen.home.xs4all.nl/pyro/ correctly) never really caught on. Maybe that will change when the number of cores that aren't already in use for other processes really does skyrocket. -jJ
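(A tiny illustration of the refcount-versus-copy-on-write point above: even a purely read-only use of an object writes to its header, which is why forked children end up copying pages they never logically modify.)

    import sys

    data = ['shared', 'after', 'fork']
    print(sys.getrefcount(data))   # e.g. 2: the name 'data' plus the call's own temporary reference
    alias = data                   # a read-only use of the object...
    print(sys.getrefcount(data))   # ...still wrote to its header: the count went up by one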

Terry Reedy writes:
Sorry, Terry, but you're basically wrong here. True, if one sticks to pure ASCII, there's no difference to notice, but that's just not possible for people who live outside of the U.S., or who share text with people outside of the U.S. They need currency symbols, they have friends whose names have little dots on them. Every single one of those is a backtrace waiting to happen. A backtrace on

    f = open('text-file.txt')
    for line in f:
        pass

is an imposition. That doesn't happen in 2.x (for the wrong reasons, but it's very convenient 95% of the time). This is what Victor's "locale" codec is all about. I think that's the wrong spelling for the feature, but there does need to be a way to express "don't bother me about Unicode" in most scripts for most people. We don't have a decent boilerplate for that yet.

On 2/10/2012 3:41 AM, Stephen J. Turnbull wrote:
The claim is that Python3 imposes a large burden on users that Python2 does not.
Nor is there in 3.x.
I view that claim as FUD, at least for many users, and at least until the persons making the claim demonstrate it. In particular, I claim that people who use Python2 knowing nothing of unicode do not need to know much more to do the same things in Python3. And that, for someone who uses Python2 with full knowledge of Unicode, Python3 cannot impose any extra burden. Since I am claiming negatives, the burden of proof is on those who claim otherwise.
Sorry, Terry, but you're basically wrong here.
If one only uses the ascii subset, the usage of 3.x strings is
This is not a nice way to start a response, especially when you go on to admit that I was right as to the user case I discussed. Here is what you clipped: transparent.
True, if one sticks to pure ASCII, there's no difference to notice,
Which is a restatement of what you clipped. In another post I detailed the *small* amount (one paragraph) that I believe such people need to know to move to Python3. I have not seen this minimum laid out before and I think it would be useful to help such people move to Python3 without FUD-driven fear.
but that's just not possible for people who live outside of the U.S.,
Who *already* have to know about more than ascii to use Python2. The question is whether they have to learn *substantially* more to use Python3.
OK, real-life example. My wife has colleagues in China. They interchange emails (utf-8 encoded) with project budgets and some Chinese characters. Suppose she asks me to use Python to pick out ¥ renminbi/yuan figures and convert to dollars. What 'strong imposition' does Python3 make that requires me to learn things I would not have to know to do the same thing in Python2?
I do not consider adding an encoding argument to make the same work in Python3 to be "a strong imposition of unicode awareness". Do you? In order to do much other than pass, I believe one typically needs to know the encoding of the file, even in Python2. And of course, knowing about and using the one unicode byte encoding is *much* easier than knowing about and using the 100 or so non-unicode (or unicode subset) encodings. To me, Python3's

    s = open('text.txt', encoding='utf-8').read()

is easier and simpler than either Python2 version below (and please pardon any errors as I never actually did this)

    import codecs
    s = codecs.open('text.txt', encoding='utf-8').read()

or

    f = open('text.txt')
    s = unicode(f.read(), 'utf-8')

-- Terry Jan Reedy
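(A minimal sketch of the renminbi example above; the file name, message format and exchange rate are made-up placeholders.)

    import re

    RATE = 0.16   # hypothetical yuan -> dollar rate
    with open('budget_mail.txt', encoding='utf-8') as f:
        for match in re.finditer(r'¥\s*([\d,]+(?:\.\d+)?)', f.read()):
            yuan = float(match.group(1).replace(',', ''))
            print('¥{:,.2f} is about ${:,.2f}'.format(yuan, yuan * RATE))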

Threading is a tool (the most popular, and most flexible tool) for concurrency and parallelism. Compared to forking, multiprocessing, shared memory, mmap, and dozens of other auxiliary OS concepts it's also the easiest. Not all problems are clearly chunkable or fit some alternative parallelism pattern. Threading is arguably the cheapest method for parallelism, as we've heard throughout this thread. Just because it can be dangerous is no reason to discourage it. Many alternatives are equally as dangerous, more difficult and less portable.

Python is a very popular language. Someone mentioned earlier that popularity shouldn't be an argument for features, but here it's fair ground. If Python 3 had unrestrained threading, this transition plunge would not be happening. People would be flocking to it for their free lunches. The lack of unrestrained single-process parallelism is the #1 reason not to choose Python for a future project. Note that certain fields use alternative parallelism like MPI, and whoopee for them, but these aren't applicable to general programming. Nor is the old argument "write a C extension". Except for old stooges who can't let go of curly braces, most people agree Python is the best mainstream language, but the implementation is holding it back. The GIL has to go if CPython is to remain viable in the future for non-trivial applications.

The current transition is like VB when .NET came out: everyone switched to C# rather than upgrade to VB.NET, because it was wiser to switch to the better language than to pay the high upgrade cost. Unfortunately the Python 3 ship has sailed, and presumably the GIL has to remain until 4.x at the least. Given this, it seems there is some wisdom in the current head-in-the-sand advice: it's too hard to remove the GIL, so just use some other mechanism if you want parallelism. But it's misleading to suggest those mechanisms are superior, as described above.

So with that in mind, can the following changes occur in Python 3 without breaking spec?

- Replace the ref-counting with another GC?
- Remove the GIL?

If not, should these be relegated to Python 4 and alternate implementation discussions?

Matt: directing a threading rant at me because I posted about unicode, a completely different subject, is bizarre. I have not said a word on this thread, and hardly ever on any other thread, about threading, concurrency, and the GIL. I have no direct interest in these subjects. But since you directed this at me, I will answer. On 2/10/2012 9:24 PM, Matt Joiner wrote: ...
If you had paid attention to this thread and others, you would know 1. These are implementation details not in the spec. 2. There are other implementations without these. 3. People have attempted the changes you want for CPython. But so far, both would have substantial negative impacts on many CPython users, including me. 4. You are free to try to improve on previous work. As to the starting subject of this thread: I switched to Python 1.3, just before 1.4, when Python was an obscure language in the Tiobe 20s. I thought then and still do that it was best for *me*, regardless of what others decided for themselves. So while I am pleased that its usage has risen considerably, I do not mind that it has (relatively) plateaued over the last 5 years. And I am not panicked that an up wiggle was followed by a down wiggle. -- Terry Jan Reedy

I'm asking if it'd actually be accepted in 3. I know well, and have seen, how quickly things are blocked and rejected in core (dabeaz and shannon's attempts come to mind). I'm well familiar with previous attempts. As an example, consider that replacing ref counting would probably change the API, but is a prerequisite for performant removal of the GIL.

On Sat, Feb 11, 2012 at 1:40 PM, Matt Joiner <anacrolix@gmail.com> wrote:
I'm asking if it'd actually be accepted in 3.
Why is that relevant? If free threading is the all-singing, all-dancing wonderment you believe:

1. Fork CPython
2. Make it free-threaded (while retaining backwards compatibility with all the C extensions out there!)
3. Watch the developer hordes flock to your door (after all, it's the lack of free-threading that has held Python's growth back for the last two decades, so everyone will switch in a heartbeat the second you, or anyone else, publishes a free-threaded alternative where all their C extensions work. Right?).
If that's what you think happened, then no, you're not familiar with them at all. python-dev has just a few simple rules for accepting a free-threading patch:

1. All current third party C extension modules must continue to work (ask the folks working on Ironclad for IronPython and cpyext for PyPy how much fun *that* requirement is)
2. Calls to builtin functions and methods should remain atomic (the Jython experience can help a lot here)
3. The performance impact on single threaded scripts must be minimal (which basically means eliding all the locks in single-threaded mode the way CPython currently does with the GIL, but then creating those elided locks in the correct state when Python's threading support gets initialised)

That's it, that's basically all the criteria we have for accepting a free-threading patch. However, while most people are quite happy to say "Hey, someone should make CPython free-threaded!", they're suddenly far less interested when faced with the task of implementing it *themselves* while preserving backwards compatibility (and if you think the Python 2 -> Python 3 transition is rough going, you are definitely *not* prepared for the enormity of the task of trying to move the entire C extension ecosystem away from the refcounting APIs. The refcounting C API compatibility requirement is *not* optional if you want a free-threaded CPython to be the least bit interesting in real world terms).

When we can't even get enough volunteers willing to contribute back their fixes and workarounds for the known flaws in multiprocessing, do people think there is some magical concurrency fairy that will sprinkle free threading pixie dust over CPython and the GIL will be gone? Removing the GIL *won't* be fun. Just like fixing multiprocessing, or making improvements to the GIL itself, it will be a long, tedious grind dealing with subtleties of the locking and threading implementations on Windows, Linux, Mac OS X, *BSD, Solaris and various other platforms where CPython is supported (or at least runs). For extra fun, try to avoid breaking the world for CPython variants on platforms that don't even *have* threading (e.g. PyMite). And your reward for all that effort? A CPython with better support for what is arguably one of the *worst* approaches to concurrency that computer science has ever invented.

If a fraction of the energy that people put into asking for free threading was instead put into asking "how can we make inter-process communication better?", we'd no doubt have a good shared object implementation in the mmap module by now (and someone might have actually responded to Jesse's request for a new multiprocessing maintainer when circumstances forced him to step down). But no, this is the internet: it's much easier to incessantly berate python-dev for pointing out that free threading would be extraordinarily hard to implement correctly and isn't the panacea that many folks seem to think it is than it is to go *do* something that's more likely to offer a good return on the time investment required.

My own personal wishlist for Python's concurrency support?
* I want to see mmap2 up on PyPI, with someone working on fast shared object IPC that can then be incorporated into the stdlib's mmap module
* I want to see multiprocessing2 on PyPI, with someone working on the long list of multiprocessing bugs on the python.org bug tracker (including adding support for Windows-style, non-fork based child processes on POSIX platforms)
* I want to see progress on PEP 3153, so that some day we can have a "Python event loop" instead of a variety of framework specific event loops, as well as solid cross-platform async IO support in the stdlib.

As Jesse said earlier, asking for free threading in CPython is like asking for free pie. Sure, free pie would be nice, but who's going to bake it? And what else could those people be doing with their time if they weren't busy baking pie? Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Den 11.02.2012 07:15, skrev Nick Coghlan:
There are several solutions to this, I think. One is to use one interpreter per thread, and share no data between them, similar to tcl and PerlFork. The drawback is developers who forget to duplicate file handles, so one interpreter can close a handle used by another. Another solution is transactional memory. Consider a database with commit and rollback. Not sure how to fit this with C extensions though, but one could in theory build a multithreaded interpreter like that.
I think I already explained why BSD mmap is a dead end. We need named kernel objects (System V IPC or Unix domain sockets) as they can be communicated between processes. There are also reasons to prefer SysV message queues over shared memory (Sys V or BSD), such as thread safety. I.e. access is synchronized by the kernel. SysV message queues also have atomic read/write, unlike sockets, and they are generally faster than pipes. With sockets we have to ensure that the correct number of bytes were read or written, which is a PITA for any IPC use (or any other messaging for that matter). In the meanwhile, take a look at ZeroMQ (actually written ØMQ). ZeroMQ also has atomic read/write messages. Sturla
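(A minimal sketch of the atomic-message point, assuming the pyzmq bindings for ZeroMQ are installed; the ipc endpoint name is arbitrary.)

    import zmq

    ctx = zmq.Context()
    receiver = ctx.socket(zmq.PULL)
    receiver.bind('ipc:///tmp/mq-demo')       # arbitrary endpoint; use tcp://... on Windows
    sender = ctx.socket(zmq.PUSH)
    sender.connect('ipc:///tmp/mq-demo')

    sender.send(b'one whole message')         # delivered as a unit...
    print(receiver.recv())                    # ...and received as a unit, never a partial read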

Matt Joiner, 11.02.2012 03:24:
Sure, "easy" as in "nothing is easier to get wrong". You did read my post on this matter, right? I'm yet to see a piece of non-trivially parallel code that uses threading and is known to be completely safe under all circumstances. And I've seen a lot.
Wrong again. Threading can be pretty expensive in terms of unexpected data dependencies, and it certainly is expensive in terms of debugging time. Debugging spurious threading issues is amongst the hardest problems for a programmer.
Just because it can be dangerous is no reason to discourage it. Many alternatives are equally as dangerous, more difficult and less portable.
Seriously - how is running separate processes less portable than threading?
Note that this is not a "Python 2 vs. Python 3" issue. In fact, it has nothing to do with Python 3 in particular. [stripped some additional garbage] Stefan

On 2012-02-11, at 03:24 , Matt Joiner wrote:
Such a statement unqualified can only be declared wrong. Threading is the most common due to Windows issues (historically, Unix parallelism used multiple processes, and the switch happened with the advent of multiplatform tools, which focused on threading due to Windows's poor performance and high overhead with processes), and it is also the easiest tool *to start using*, because you just say "start a thread". Which is equivalent to saying grenades are the easiest tool for handling conversations because you just pull the pin. Threads are by far the hardest concurrency tool to use because they throw out the window all determinism in the whole program, and that determinism then needs to be reclaimed through (very) careful analysis and the use of locks or other such sub-tools. And the flexibility claim is complete nonsense. Oh, and so are your comparisons: "shared memory" and "mmap" are not comparable to threading since they *are used* by and in threading. And forking and multiprocessing are the same thing, only the initialization call changes. Finally, multiprocessing has a far better upgrade path (as e.g. Erlang demonstrates): if your non-deterministic points are well delineated and your interfaces to other concurrent execution points are well defined, scaling from multiple cores to multiple machines becomes possible.
Of course it is, just as manual memory management is "discouraged".
Many alternatives are equally as dangerous, more difficult and less portable.
The main alternative to threading is multiprocessing (whether via fork or via starting new processes does not matter); it is significantly less dangerous, it is only more difficult in that you can't take extremely dangerous shortcuts, and it is just as portable (if not more so).
Threading is a red herring: nobody fundamentally cares about threading; what users want is a way to exploit their cores. If `multiprocessing` was rock-solid and easier to use, `threading` could just be killed and nobody would care. And we'd probably find ourselves in far better a world.

Terry Reedy writes:
The point is that the user case you discuss is a toy case. Of course the problem goes away if you get to define the problem away. I don't know of any nice way to say that.
I'll go back and take a look at it. It probably is useful. But I don't think it deals with the real issue. The problem is that without substantially more knowledge than what you describe as the minimum, the fear, uncertainty, and doubt is *real*. Anybody who follows Mailman, for example, is going to hear (even today, though much less frequently than 3 years ago, and only for installations with ancient Mailman from 2006 or so) of weird Unicode errors that cause messages to be "lost". Hearing that Python 3 requires everything be decoded to Unicode is not going to give innocent confidence. There's also a lot of FUD being created out of whole cloth, as well, such as the alleged inefficiency of recoding ASCII into Unicode, etc., which doesn't matter for most applications. The problem is that the FUD based on real issues that you don't understand gives credibility to the FUD that somebody made up.
None. The FUD is not about *processing* non-ASCII. It's about non-ASCII horking your process even though you have no intention of processing it.
Yes, I do. If you get it wrong, you will still get a fatal UnicodeError.
In order to do much other than pass, I believe one typically needs to know the encoding of the file, even in Python2.
The gentleman once again seems to be suffering from a misconception. Quite often you need to know nothing about the encoding of a file, except that the parts you care about are ASCII-encoded. For example, in an American programming shop

    git log | ./count-files-touched-per-day.py

will founder on 'Óscar Fuentes' as author, unless you know what coding system is used, or know enough to use latin-1 (because it's effectively binary, not because it's the actual encoding).
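(A minimal sketch in the spirit of that hypothetical script; it counts commits per day rather than files touched, and assumes `git log` output on stdin.)

    import collections
    import io
    import sys

    # latin-1 so an 'Óscar Fuentes' author line can never raise UnicodeDecodeError
    log = io.TextIOWrapper(sys.stdin.buffer, encoding='latin-1')
    per_day = collections.Counter()
    for line in log:
        if line.startswith('Date:'):
            parts = line.split()
            per_day[' '.join(parts[1:4] + parts[5:6])] += 1   # e.g. 'Thu Feb 9 2012'
    for day, count in per_day.most_common():
        print(day, count)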
Indeed, it is. But we're not talking about dealing with Unicode; we're talking about why somebody who really only wants to deal with ASCII needs to know more about Unicode in Python 3 than in Python 2.
The reason why Unicode is part of the FUD is that in Python 2 you never needed to do that, unless you wanted to deal with a non-English language. With Python 3 you need to deal with the codec, always, or risk a UnicodeError simply because some Spaniard's name gets mentioned by somebody who cares about orthography.

On Feb 10, 2012, at 5:32 PM, Stephen J. Turnbull wrote:
Or just use errors="surrogateescape". I think we should tell people who are scared of unicode and refuse to learn how to use it to just add an errors="surrogateescape" keyword to their file open arguments. Obviously, it's the wrong thing to do, but it's wrong in the same way that Python 2 bytes are wrong, so if you're absolutely committed to remaining ignorant of encodings, you can continue to do that.
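(A minimal sketch of that suggestion; file names are placeholders. Undecodable bytes come back as lone surrogates and are written back out unchanged as long as the same codec and error handler are used on both sides.)

    with open('input.txt', encoding='ascii', errors='surrogateescape') as src:
        text = src.read()            # never raises, whatever bytes are in the file
    with open('copy.txt', 'w', encoding='ascii', errors='surrogateescape') as dst:
        dst.write(text)              # the original bytes round-trip exactly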

Carl M. Johnson writes:
No, it's not the same as Python 2, and it's *subtly* the wrong thing to do, too. surrogateescape is intended to roundtrip on input from a specific API to unchanged output to that same API, and that's all it is guaranteed to do. Less pedantically, if you use latin-1, the internal representation is valid Unicode but (partially) incorrect content. No UnicodeErrors. If you use errors="surrogateescape", any code that insists on valid Unicode will crash. Here I'm talking about a use case where the user believes that as long as the ASCII content is correct they will get correct output. It's arguable that using errors="surrogateescape" is a better approach, *because* of the possibility of a validity check. I tend to think not. But that's a different argument from "same as Python 2".

On 2/10/2012 10:32 PM, Stephen J. Turnbull wrote: The issue is whether Python 3 has a "strong imposition of Unicode awareness" that Python 2 does not. If the OP only meant awareness of the fact that something called 'unicode' exists, then I suppose that could be argued. I interpreted the claim as being about some substantive knowledge of unicode. In any case, the claim that I disagree with is not about people's reactions to Python 3 or about human psychology and the propensity to stick with the known. In response to Jim Jewett, you wrote
That is pretty much my counterclaim, with the note that the 'little bit of knowledge' is mostly about non-unicode encodings and the change to some Python details.
The point is that the user case you discuss is a toy case.
Thanks for dismissing me and perhaps a hundred thousand users as 'toy cases'.
the problem goes away if you get to define the problem away.
Doing case analysis, starting with the easiest cases, is not defining the problem away. It is, rather, an attempt to find the 'little bit of knowledge' needed in various cases. In your response, you went on to write
Exactly, and finding the Python 3 version of the magic spells needed in various cases, so they can be documented and publicized, is what I have been trying to do. For ascii-only use, the magic spell is 'ascii' in bytes() calls. For some other uses, it is encoding='latin-1' in open(), str(), and bytes() calls, and perhaps elsewhere. Neither of these constitutes substantial 'unicode awareness'.
I don't know of any nice way to say that.
There was no need to say it. -- Terry Jan Reedy

Terry Reedy writes:
I interpreted the claim as being about changing their coding practice, including maintaining existing scripts and modules that deal with textual input that people may need/want to transition to Python 3. As Paul Moore pointed out, adding "encoding='latin-1'" to their scripts doesn't come naturally to everyone. I'm sure that at a higher level, that's the stance you intend to take, too. I think there's a disconnect between that high-level stance, and the interpretation that it's about "substantive knowledge of Unicode".
OK. But then I think you are failing to deal with the problem, because I think *that* is the problem. Python 3 doesn't lack simple idioms for making (most naive, near-English) processing look like Python 2 to a greater or lesser extent. The question is which of those idioms we should teach, and AFAICS what's controversial about that depends on human psychology, not on the admitted facts about Python 3.
And my counterrebuttal is "true -- but that's not what these users want, and they probably don't need it." That is, they don't want to debug a crash when they don't care what happens to non-ASCII in their mostly-ASCII, nearly-readable-as-English byte streams.
Thanks for unwarrantedly dissing me. I do *not* dismiss people. I claim that the practical use case for these users is *not* 6-sigma-pure ASCII. You, too, will occasionally see Mr. Fuentes or even his Israeli sister-in-law show up in your "pure ASCII, or so I thought" texts. Better-than-Ivory-soap-pure *is* a "toy" case. Only in one's own sandbox can that be guaranteed. Otherwise, Python 3 needs to be instructed to prepare for (occasional) non-ASCII.
Except that AFAIK Python 3 already handles pure ASCII pretty much automatically. But pure ASCII doesn't exist for most people any more, even in Kansas; that magic spell will crash. 'latin-1' is a much better spell (except for people who want to crash in appropriate circumstances -- but AFAIK in the group whose needs this thread addresses, they are a tiny minority).
I don't know of any nice way to say that.
There was no need to say it.
Maybe not, but I think there was. Some of your well-intended recommendations are unrealistic, and letting them pass would be a disservice to the users we are *both* trying to serve.

On 11 February 2012 00:07, Terry Reedy <tjreedy@udel.edu> wrote:
Concrete example, then. I have a text file, in an unknown encoding (yes, it does happen to me!) but opening in an editor shows it's mainly-ASCII. I want to find all the lines starting with a '*'. The simple with open('myfile.txt') as f: for line in f: if line.startswith('*'): print(line) fails with encoding errors. What do I do? Short answer, grumble and go and use grep (or in more complex cases, awk) :-( Paul.

On 2012-02-11, at 13:33 , Stefan Behnel wrote:
It's true that it requires handling encodings up front where Python 2 allowed you to play fast and loose, though. And using latin-1 in that context looks and feels weird/icky: the file is not encoded using latin-1, the encoding just happens to work for manipulating bytes as ascii text + non-ascii stuff.

Masklinn, 11.02.2012 13:41:
Well, except for the cases where that didn't work. Remember that implicit encoding behaves in a platform-dependent way in Python 2, so even if your code runs on your machine, that doesn't mean it will work for anyone else.
Correct. That's precisely the use case described above. Besides, it's perfectly possible to process bytes in Python 3. You just have to open the file in binary mode and do the processing at the byte string level. But if you don't care (and if most of the data is really ASCII-ish), using the ISO-8859-1 encoding in and out will work just fine for problems like the above. Stefan

On 2012-02-11, at 13:53 , Stefan Behnel wrote:
Sure, I said it allowed you, not that this allowance actually worked.
Yes, but now instead of just ignoring that stuff you have to actively and knowingly lie to Python to get it to shut up.
I think that's the route which should be taken, but (and I'll readily admit not to have followed the current state of this story) I'd understood manipulations of bytes-as-ascii-characters-and-stuff to be far more annoying (in Python 3) than string manipulation even for simple use cases.

Masklinn, 11.02.2012 17:18:
The advantage is that it becomes explicit what you are doing. In Python 2, without any encoding, you are implicitly assuming that the encoding is Latin-1, because that's how you are processing it. You're just not spelling it out anywhere, thus leaving it to the innocent reader to guess what's happening. In Python 3, and in better Python 2 code (using codecs.open(), for example), you'd make it clear right in the open() call that Latin-1 is the way you are going to process the data.
Oh, absolutely not. When it's text, it's best to process it as Unicode. Stefan

On 2012-02-11, at 20:35 , Stefan Behnel wrote:
I'm not sure going from "ignoring it" to "explicitly lying about it" is a great step forward. latin-1 is not "the way you are going to process the data" in this case, it's just the easiest way to get Python to shut up and open the damn thing.
Except it's not processed as text, it's processed as "stuff with ascii characters in it". Might just as well be cp-1252, or UTF-8, or Shift JIS (which is kinda-sorta-extended-ascii but not exactly), and while using an ISO-8859 codec will yield unicode data, that's about the only thing you can say about it, and the actual result will probably be mojibake either way. By processing it as bytes, it's made explicit that this is not known and decoded text (which is what unicode strings imply) but that it's some semi-arbitrary ascii-compatible encoding, and that's the extent of the developer's knowledge and interest in it.

Masklinn, 11.02.2012 20:46:
Well, you are still processing it as text because you are (again, implicitly) assuming those ASCII characters to be just that: ASCII encoded characters. You couldn't apply the same byte processing algorithm to UCS2 encoded text or a compressed gzip file, for example, at least not with a useful outcome. Mind you, I'm not regarding any text semantics here. I'm not considering whether the thus decoded data results in French, Danish, German or other human words, or in completely incomprehensible garbage. That's not relevant. What is relevant is that the program assumes an identity mapping from 1 byte to 1 character to work correctly, which, speaking in Unicode terms, implies Latin-1 decoding. Therefore my advice to make that assumption explicit. Stefan

On 11 February 2012 19:46, Masklinn <masklinn@masklinn.net> wrote:
No, not at all. It *is* text. I *know* it's text. I know that it is encoded in an ASCII-superset (because I can read it in a text editor and *see* that it is). What I *don't* know is what those funny bits of mojibake I see in the text editor are. But I don't really care. Yes, I could do some analysis based on the surrounding text and confirm whether it's latin-1, utf-8, or something similar. But it honestly doesn't matter to me, as all I care about is parsing the file to find the change authors, and printing their names (to re-use the "manipulating a ChangeLog file" example). And even if it did matter, the next file might be in a different ASCII-superset encoding, but I *still* won't care because the parsing code will be exactly the same. Saying "it's bytes" is even more of a lie than "it's latin-1". The honest truth is "it's an ASCII superset", and that's all I need to know to do the job manually, so I'd like to write code to do the same job without needing to lie about what I know. I'm now 100% convinced that encoding="ascii",errors="surrogateescape" is the way to say this in code. Paul.
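(A minimal sketch of that conclusion applied to the ChangeLog example; the file name and the GNU-style entry format are assumptions. Writing with the same codec and error handler keeps non-ASCII author names intact without the script ever having to identify their encoding.)

    with open('ChangeLog', encoding='ascii', errors='surrogateescape') as log, \
         open('authors.txt', 'w', encoding='ascii', errors='surrogateescape') as out:
        for line in log:
            # e.g. "2012-02-11  Óscar Fuentes  <address>"
            if line[:2] == '20' and '  ' in line:
                out.write(line.split('  ')[1] + '\n')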

On Sun, Feb 12, 2012 at 9:24 AM, Paul Moore <p.f.moore@gmail.com> wrote:
I created http://bugs.python.org/issue13997 to suggest codifying this explicitly as an open_ascii() builtin. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 11 February 2012 21:24, Paul Moore <p.f.moore@gmail.com> wrote:
What I *don't* know is what those funny bits of mojibake I see in the text editor are.
So, do yourself, and us, "the rest of the world", a favor, and open the file in binary mode. Also, I'd suggest that you, and anyone else being picky about encodings, read http://www.joelonsoftware.com/articles/Unicode.html so you can finally have in your mind that *** ASCII is not text ***. It used to be text back when, to get non-[A-Z|a-z] text, you had to have someone record a file on a tape, pack it in their luggage, and take a plane overseas to the U.S.A. That is not the case anymore, and that, as far as I understand, is the reasoning for Python 3 defaulting to unicode. Anyone can work "ignoring text" and treating bytes as bytes by opening a file in binary mode. You can use "os.linesep" instead of a hard-coded "\n" to overcome linebreaking. (Of course you might accidentally break a line inside a multi-byte character in some encoding, since you prefer to ignore them altogether, but it should be rare). js -><-
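(A minimal sketch of the binary-mode approach, reusing Paul's '*' example with a placeholder file name; note that in binary mode the line separator itself has to be bytes.)

    import os

    sep = os.linesep.encode('ascii')        # b'\n' or b'\r\n', per the suggestion above
    with open('myfile.txt', 'rb') as f:
        for line in f.read().split(sep):
            if line.startswith(b'*'):
                print(line)                 # prints a bytes repr; no decoding ever happens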

On Sun, Feb 12, 2012 at 1:43 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
See http://bugs.python.org/issue13997 , mentioned earlier in the thread. Cheers, Chris

Greg Ewing writes:
Yes! However, I don't think this 1.5-liner needs to be a built-in. (The 1.5-liner for 'open_as_ascii_compatible' was posted elsewhere.) There's also the issue of people who strongly prefer sloppy encoding and Read My Lips: No UnicodeErrors. I disagree with them in all purity, but you know ....
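(The 1.5-liner itself isn't reproduced in this thread; the body below is a guess at what it would look like, under the name mentioned above.)

    def open_as_ascii_compatible(fname, mode='r'):
        return open(fname, mode, encoding='ascii', errors='surrogateescape')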

Paul Moore writes:
It probably is, for you. If that ever gives you a UnicodeError, you know how to find out how to deal with it. And it probably won't.<wink/> That may also be a good universal default for Python 3, as it will pass through non-ASCII text unchanged, while raising an error if the program tries to manipulate it (or hand it to a module that validates). (encoding='latin-1' definitely is not a good default.) But I'm not sure of that, and the current approach of using the preferred system encoding is probably better. I don't think either argument applies to everybody who needs such a recipe, though. Many will be best served with encoding='latin-1' by some name.

On 13 February 2012 05:12, Stephen J. Turnbull <stephen@xemacs.org> wrote:
And yet, after your earlier posting on latin-1, and your comments here, I'm less certain. Thank you so much :-) Seriously, I find these discussions about Unicode immensely useful. I now have a much better feel for how to deal with (and think about) text in "unknown but mostly ASCII" format, which can only be a good thing.
Probably the key question is, how do we encapsulate this debate in a simple form suitable for people to find out about *without* feeling like they "have to learn all about Unicode"? A note in the Unicode HOWTO seems worthwhile, but how to get people to look there? Given that this is people who don't want to delve too deeply into Unicode issues. Just to be clear, my reluctance to "do the right thing" was *not* because I didn't want to understand Unicode - far from it, I'm interested in, and inclined towards, "doing Unicode right". The problem is that I know enough to realise that "proper" handling of files where I don't know the encoding (and the encoding sometimes seems to be inconsistent, both between files and even on occasion within a file) is a seriously hard issue. And I don't want to get into really hard Unicode issues for what, in practical terms, is a simple problem, as it's one-off code and minor corruption isn't really an issue. Paul.

+1 for the URL in the exception. Well, in all exceptions. Bringing the language into the 21st century: great entry points for learning about the language. Whilst google provides an excellent service in finding documentation, it seems that a programming language, being a complex but (mostly) deterministic thing, has other methods of defining entry points for learning. So: exceptions with URLs. The URLs point to "knowledge base wiki" sorts of things where the "What is your intent/use case?" question can be matched up with the deterministic state we know the interpreter is in. With something like encodings, which can be happily ignored by someone until poof, suddenly they just have mush, finding out things like "It's possible printing the string to the screen is giving the error", "There are libraries which guess encodings" and "latin-1 is a magic bullet" can take many, many days of searching. Also it may be possible, from this perspective, to show ways that the developer can gather more deterministic information about his interpreter's state to narrow down his intent for the knowledge base (e.g. if it's a print statement that throws the error, it's possible the program doesn't have any encoding issues, except in debugging statements). The encoding issue here is a great example of this because of the complexity and mobility of encodings (i.e. they've changed a lot). There must be other good examples which can fire up equally strong and informative discussion on "options" and their limitations and benefits. I'd be very interested in formalising the idea of a "KnowledgeBase Wiki thing"; maybe there already is one...

On Feb 12, 2012, at 10:50 PM, Christopher Reay wrote:
That's not a bad idea. We might want to use some kind of URL shortener for length and future proofing though. If the site changes, we can have redirection of the short URLs updated. Something like http://pyth.on/e1234 --> http://docs.python.org/library/exceptions.html

On Mon, Feb 13, 2012 at 11:19 AM, Carl M. Johnson < cmjohnson.mailinglist@gmail.com> wrote:
I think we can use wiki.python.org/ for hosting exception-specific content. E.g. http://wiki.python.org/moin/PrintFails needs a lot of love and care. Microsoft actually has documentation for every single compiler and linker error that ever existed. Not that we have the same amount of resources at our disposal, but it is a nice concept. Concerning the shortened URLs - I'd go with trustworthiness over compactness - http://python.org/s1 or http://python.org/s/1 would be better than http://pyth.on/1 imo. Yuval

Entry Points:

- Google: natural language user searches based on "intent of code"
- Module Name/Function names: user wants more details on something he already knows exists
- Exception Name: great, finds you the exception definition just like any other class name.

Googling for "UnicodeEncodeError Python" gives me a link to the 2.7 documentation, which says at the top "this is not yet updated for python 3" - I don't know how important this is. Googling for "UnicodeEncodeError Python 3" gives http://docs.python.org/release/3.0.1/howto/unicode.html This is a great document. It explains encoding very well. The unicode tutorial doesn't mention anything about the terminal output encoding to STDOUT, and whilst this is obvious after a while, it is not always clear that printing to the terminal is the cause of the attempt to encode as ascii during a print statement. To some extent, the unicode tutorial doesn't have the practical specifics that are being discussed in this thread, which is targeted at the "learning curve into Python".

I think the most important points here are:

- The exception knows what version of Python it's from (which allows the language to make changes)
- It would be nice to have a wiki-type document targeted by the exception/error, with sections like:
  - "Python Official Docs"
  - "Murgh, Fix This NOW, don't care how dirty"
  - Contributed docs we have known and loved / Stack Overflow etc...
  - Discussions from python-dev / python-ideas
  - PEPs that apply

The point is that Google can't be responsible for making sure all these sections are laid out, obvious, correct or constant

Masklinn writes:
That's the coding pedant's way to look at it. However, people who speak only ASCII or Latin 1 are in general not going to see it that way. The ASCII speakers are a pretty clear-cut case. Using 'latin-1' as the codec, almost all things they can do with a 100% ASCII program and a sanely-encoded text (which leaves out Shift JIS, Big 5, and maybe some obsolete Vietnamese encodings, but not much else AFAIK) will pass through the non-ASCII verbatim, or delete it. Latin 1 speakers are harder, because they might do things like convert accented characters to their base, which would break multibyte characters in Asian languages. Still, one suspects that they mostly won't care terribly much about that (if they did, they'd be interested in using Unicode properly, and it would be worth investing the small amount of time required to learn a couple of recipes).
No, decoding with 'latin-1' is a far better approach for this purpose. If the name bothers you, give it an alias like 'semi-arbitrary-ascii-compatible'. The problem is that for many operations, b'*' and 'trailing text' are incompatible. Try concatenating them, or testing one against the other with .startswith(), or whatever. Such literals are buried in many modules, and you will lose if you're using bytes because those modules generally assume you're working with str.
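(A tiny illustration of the mismatch described above.)

    line = b"* 2012-02-11 entry"
    try:
        line.startswith('*')        # a bytes method given a str argument
    except TypeError as exc:
        print(exc)                  # Python 3 refuses to mix bytes and str here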

On Mon, Feb 13, 2012 at 3:04 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'd hazard a guess that the non-ASCII compatible encoding mostly likely to be encountered outside Asia is UTF-16. The choice is really between "never give me UnicodeErrors, but feel free to silently corrupt the data stream if I do the wrong thing with that data" (i.e. "latin-1") and "correctly handle any ASCII compatible encoding, but still throw UnicodeEncodeError if I'm about to emit corrupted data" ("ascii+surrogateescape"). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
I'd hazard a guess that the non-ASCII compatible encoding mostly likely to be encountered outside Asia is UTF-16.
In other words, only people who insist on messing with application/octet-stream files (like Word ;-). They don't deserve the pain, but they're gonna feel it anyway.
Yes.
Not if I understand what ascii+surrogateescape would do correctly. Yes, you can pass through verbatim, but AFAICS you would have to work quite hard to do anything to that stream that would cause a UnicodeError in your program, even though you corrupt it. (Eg, delete half of a multibyte EUC character.) The question is what happens if you run into a validating processor internally -- then you'll see an error (even though you're just passing it through verbatim!)

On Tue, Feb 14, 2012 at 6:02 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
If you're only round-tripping (i.e. writing back out as "ascii+surrogateescape") it's very hard to corrupt your data stream with processing that assumes an ASCII compatible encoding (as you point out, you'd have to be splitting on arbitrary codepoints instead of searching for ASCII first). However, it's trivial to get an error when you go to encode the data stream without one of the silencing error handlers set. In particular, sys.stdout has error handling set to strict, which I believe is likely to throw UnicodeEncodeError if you try to feed a string containing surrogate escaped bytes to an encoding that can't handle them. (Of course, if sys.stdout.encoding is "UTF-8", then you're right, those characters will just be displayed as gibberish, as they would in the latin-1 case. I guess it's only on Windows and in any other locations with a more restrictive default stdout encoding that errors are particularly likely). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Feb 13, 2012, at 10:45 PM, Nick Coghlan wrote:
I don't think that's right. I think that by default Python refuses to turn surrogate characters into UTF-8:
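(A minimal reconstruction of the interpreter session implied here; the exact message wording varies between versions.)

    >>> s = b'caf\xe9'.decode('ascii', 'surrogateescape')
    >>> s
    'caf\udce9'
    >>> s.encode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed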
OK, so concrete proposals: update the docs and maybe make a synonym for Latin-1 that makes it more semantically obvious that you're not really using it as Latin-1, just as an easy pass-through encoding. Anything else? Any bike shedding on the synonym? -- Carl Johnson

On Tue, Feb 14, 2012 at 6:39 AM, Carl M. Johnson <cmjohnson.mailinglist@gmail.com> wrote:
encoding="ascii-ish"    # gets the sloppyness right
encoding="passthrough"  # I would like "ignore", if it wouldn't cause confusion with the errorhandler
encoding="binpass"
encoding="rawbytes"

-jJ

MRAB wrote:
"Ignore" won't do. Ignore what? Everything? Don't actually run an encoder? That doesn't even make sense! "Passthrough" is bad too, because it perpetrates the idea that ASCII characters are "plain text" which are bytes. Unicode strings, even those that are purely ASCII, are not strings of bytes (except in the sense that every data structure is a string of bytes). You can't just "pass bytes through" to turn them into Unicode.
You have a smiley, but I think that's the best name I've seen yet. It's explicit in what you get -- mojibake. The only downside is that it's a little obscure. Not everyone knows what mojibake is called, or calls it mojibake, although I suppose we could add aliases to other terms such as Buchstabensalat and Krähenfüße if German users complain <wink>

But remind me again, why are we doing this? If you have to teach people the recipe

    open(filename, encoding='mojibake')

why not just teach them the very slightly more complex recipe

    open(filename, encoding='ascii', errors='surrogateescape')

which captures the user's intent ("I want ASCII, with some way of escaping errors so I don't have to deal with them") much more accurately. Sometimes brevity is *not* a virtue. -- Steven

Steven D'Aprano writes:
MRAB wrote:
encoding="ascii-ish" # gets the sloppyness right
+0.8 I'd prefer the more precise "ascii-compatible". Shift JIS is "ASCII-ish", but should not be decoded with this codec.
Explicit, but incorrect. Mojibake ("bake" means "change") is what you get when you use one encoding to encode characters, and another to decode them. Here, not only are we talking about using the same codec at both ends, but in fact it's inside out (we are decoding then encoding). This is GIGO, not mojibake.
Why not? Because 'surrogateescape' does not express the user's intent. That user *will* have to deal with errors as soon as she invokes modules that validate their input, or includes some portion of the text being treated in output of any kind, unless she uses an error-suppressing handler herself. Surrogates are errors in Unicode, and that's the way it should be. That's precisely why Martin felt it necessary to use this technique in PEP 383: to ensure that errors *will* occur unless you are very careful in handling strings produced with the surrogateescape handler active. It's arguable that most applications *should* want errors in these cases; I've made that argument myself. But it's quite clearly not the user's intent.

On Wed, Feb 15, 2012 at 12:43 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
However, from a correctness point of view, it's a big step up from just saying "latin-1" (which effectively turns off *all* of the additional encoding related sanity checking Python 3 offers over Python 2). For many "I don't care about Unicode" use cases, using "ascii+surrogateescape" for your own I/O and setting "backslashreplace" on sys.stdout should cover you (and any exceptions you get will be warning you about cases where your original assumptions about not caring about Unicode validity have been proven wrong). If the logging module doesn't do it already, it should probably be defaulting to backslashreplace when encoding messages, too (for the same reason sys.stderr already defaults to that - you don't want your error reporting system failing to encode corrupted Unicode data). sys.stdin and sys.stdout are different due to the role they play in pipeline processing - for those, locale.getpreferredencoding()+"strict" is a more reasonable default (but we should make it easy to replace them with something more specific for a given application, hence http://bugs.python.org/issue14017) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
Are you saying you know more than the user about her application?
If the logging module doesn't do it already, it should probably be defaulting to backslashreplace when encoding messages, too
See, *you* don't know whether it will raise, either, and that's for an important stdlib module. Why should somebody who is not already a Unicode geek and is just using a module they've downloaded off of PyPI be required to audit its IO foibles? Really, I think use of 'latin1' in this context is covered by "consenting adults." We *should* provide an alias that says "all we know about this string is that the ASCII codes represent ASCII characters," and document that even if your own code is ASCII compatible (ie, treats runs of non-ASCII as opaque, atomic blobs), third party modules may corrupt the text. And use the word "corrupt"; all UnicodelyRightThinking folks will run away screaming. That statement about corrupting text is true in Python 2, and pre-PEP-393 Python 3, anyway (on Windows and UCS-2 builds elsewhere), you know, since they can silently slice a surrogate pair in half.

If I can I would like to offer one argument for surrogateescape over latin-1 as the newbie approach. Suppose I am naively processing text files to create a webpage and one of my filters is a "smart quotes" filter to change straight quotes ("") to curly ones (“”). Of course, there's no way to smarten quotes up if you don't know the encoding of your input or output files; you'll just make a mess. In this situation, Latin-1 lets you mojibake it up. If your input turns out not to have been Latin-1, the final result will be corrupted by the quote smartener. On the other hand, if you use encoding="ascii", errors="surrogateescape" Python will complain, because the smart quotes being added aren't ascii. In other words, surrogate escape forces naive users to stick to ASCII unless they can determine what encoding they want to use for their input/output. It's not perfect, but I think it strikes a better balance than letting the users shoot themselves in the foot.
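(A minimal sketch of that scenario; file names are placeholders. The write raises because the curly quote added by the filter is not ASCII and was not produced by the surrogateescape handler.)

    with open('page.txt', encoding='ascii', errors='surrogateescape') as src:
        text = src.read().replace('"', '\u201c')   # naive quote "smartening"
    try:
        with open('page.html', 'w', encoding='ascii', errors='surrogateescape') as dst:
            dst.write(text)
    except UnicodeEncodeError as exc:
        print('refusing to guess an output encoding:', exc)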

Carl M. Johnson writes:
If I can I would like to offer one argument for surrogateescape over latin-1 as the newbie approach.
This isn't the newbie approach. What should be recommended to newbies is to use the default (which is locale-dependent, and therefore "usually" "good enough"), and live with the risk of occasional exceptions. If they get exceptions, or must avoid exceptions, learn about encodings or consult with someone who already knows.[1] *Neither* of the approaches discussed here is reliable for tasks like automatically processing email or uploaded files on the web, and neither should be recommended to people who aren't already used to encoding-agnostic processing in the Python 2 "str" style. So, now that you mention "newbies", I don't know what other people are discussing, but what I've been discussing here is an approach for people who are comfortable working around (or never experience!) the defects of Python 2's ASCII-compatible approach to handling varied encodings in a single program, and want a workalike for Python 3. The choice between the two is task-dependent. The encoding='latin1' method is for tasks where a little mojibake can be tolerated, but an exception would stop the show. The errors='surrogateescape' method is for tasks where any mojibake at all is a disaster, but occasional exceptions can be handled as they arise. Footnotes: [1] When this damned term is over in a few weeks, I'll take a look at the tutorial-level docs and see if I can come up with a gentle approach for those who are finding out for the first time that the locale-dependent default isn't good enough for them.

On Feb 15, 2012, at 04:46 PM, Stephen J. Turnbull wrote:
I really hope you do this, but note that it would be very helpful to have guidelines and recommendations even for advanced, knowledgeable Python developers. I have participated in many discussions in various forums with other Python developers where genuine differences of opinion or experience lead to different solutions. It would be very helpful to point to a document and say "here are the best practices for your [application|library] as recommended by core Python experts in Unicode handling." Cheers, -Barry

Barry Warsaw writes:
I'll see what I can do, but for *best practices* going beyond the level of Paul Moore's use case is difficult for the reasons elaborated elsewhere (by others as well as myself): basic Unicode handling is no harder than ASCII handling as long as everything is Unicode. So the real answer is to insist on valid Unicode for your text I/O, failing that, text labeled *as* text *with* an encoding[1], and failing that (or failing validation of the input), reject the input.[2] If that's not acceptable -- all too often it is not -- you're in a world of pain, and the solutions are going to be ad hoc. The WSGI folks will not find the solutions proposed for email acceptable, and vice versa. Something like the format Nick proposed, where the tradeoffs are described, would be useful, I guess. But the tradeoffs have to be made ad hoc. Footnotes: [1] Of course it's OK if these are implicitly labeled by requirements or defaults of a higher-level protocol. [2] This is the Unicode party line, of course. But it's really the only generally applicable advice.

On Wed, Feb 15, 2012 at 2:12 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
No, I'm merely saying that at least 3 options (latin-1, ascii+surrogateescape, chardet2) should be presented clearly to beginners and the trade-offs explained. For example:

Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: None
Approach: Specify encoding="latin-1"
Bytes/bytearray: data.decode("latin-1")
Text files: open(fname, encoding="latin-1")
Stdin replacement: sys.stdin = io.TextIOWrapper(sys.stdin.buffer, "latin-1")
Stdout replacement (pipeline): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, "latin-1", line_buffering=True)
Stdout replacement (terminal): Leave it alone

By decoding with latin-1, an application won't get *any* Unicode decoding errors, as that encoding maps byte values directly to the first 256 Unicode code points. However, any output data generated by that application *will* be corrupted if the assumption of ASCII compatibility is violated, or if implicit transcoding to any encoding other than "latin-1" occurs (e.g. when writing to sys.stdout or a log file, communicating over a network socket, or serialising the string with the json module). This is the closest Python 3 comes to emulating the permissive behaviour of Python 2's 8-bit strings (implicit interoperation with byte sequences is still disallowed).

Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: Minimal
Approach: Use encoding="ascii" and errors="surrogateescape" (or, alternatively, errors="backslashreplace" for sys.stdout)
Bytes/bytearray: data.decode("ascii", errors="surrogateescape")
Text files: open(fname, encoding="ascii", errors="surrogateescape")
Stdin replacement: sys.stdin = io.TextIOWrapper(sys.stdin.buffer, "ascii", "surrogateescape")
Stdout replacement (pipeline): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, "ascii", "surrogateescape", line_buffering=True)
Stdout replacement (terminal): sys.stdout = io.TextIOWrapper(sys.stdout.buffer, sys.stdout.encoding, "backslashreplace", line_buffering=True)

Using "ascii+surrogateescape" instead of "latin-1" is a small initial step into the Unicode-aware world. It still lets an application process any ASCII-compatible encoding *without* having to know the exact encoding of the source data, but will complain if there is an implicit attempt to transcode the data to another encoding, or if the application inserts non-ASCII data into the strings before writing them out. Whether non-ASCII compatible encodings trigger errors or get corrupted will depend on the specifics of the encoding and how the program manipulates the data. The "backslashreplace" error handler (enabled by default for sys.stderr, optionally enabled as shown above for sys.stdout) can be useful to help ensure that printing out strings will not trigger UnicodeEncodeErrors (note: the *repr* of strings already escapes non-ASCII characters internally, such that repr(x) == ascii(x). Thus, UnicodeEncodeErrors will occur only when encoding the string itself using the "strict" error handler, or when another library performs equivalent validation on the string).

Task: Process data in any ASCII compatible encoding
Unicode Awareness Care Factor: High
Approach: Use binary APIs and the "chardet2" module from PyPI to detect the character encoding
Bytes/bytearray: data.decode(detected_encoding)
Text files: open(fname, encoding=detected_encoding)

The *right* way to process text in an unknown encoding is to do your best to derive the encoding from the data stream. The "chardet2" module on PyPI allows this. Refer to that module's documentation (WHERE?)
for details. With this approach, transcoding to the default sys.stdin and sys.stdout encodings should generally work (although the default restrictive character set on Windows and in some locales may cause problems). -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
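A rough sketch of that last option (this assumes chardet2 exposes a chardet-style detect() function returning an encoding name plus a confidence value; check the module's documentation for the real API before copying this):

    import chardet2   # assumed: a Python 3 port with a chardet-style detect()

    def open_detected(fname):
        with open(fname, "rb") as f:
            raw = f.read()                        # a partial read is often enough
        guess = chardet2.detect(raw)              # e.g. {"encoding": ..., "confidence": ...}
        detected_encoding = guess["encoding"] or "latin-1"   # fall back if detection fails
        return open(fname, encoding=detected_encoding)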

+1000 Great, let's do that. Will I be repetitive if I say "can we put a link in the 'UnicodeDecodeError' docstring?" At the top of that page have "FOR BEGINNERS" or "Ugh, just make this error go away, now", and this info from Nick. Also link to all the other tons and tons of stuff that exists on Unicode decoding...

Chardet does nothing like the complex character set decoding that any of the browsers accomplish. Also, it almost always calls "latin-1" encoded files "latin-2" or "latin-someOtherNumber", which doesn't actually work to decode the data. The browsers can translate a seemingly untouchable mush of mixed char encodings into UTF-8 (on my linux box) without hiccupping. I tried to emulate their behaviour for almost a week before I gave up. To be fair, I was a char set newbie at that time, and I guess I still am, though my scraper works properly. Christopher

It seems we once again agree violently on the principles. I think our differences here are mostly due to me giving a lot of attention to audience and presentation, and you focusing on the content of what to say. Re: spin control: Nick Coghlan writes:
Are you defining "beginner" as "Python 2 programmer experienced in a multilingual context but new to Python 3"? My point is that, by other definitions of "beginner", I don't think the tradeoffs can be usefully explained to beginners without substantial discussion of the issues involved in ASCII vs. the encoding Babel vs. Unicode. Only in extreme cases where the beginner only cares about *never* getting a Unicode error, or only cares about *never* getting mojibake, will they be able to get much out of this. Re: descriptions
Task: Process data in any ASCII compatible encoding Unicode Awareness Care Factor: None
I don't understand what "Unicode awareness" means here. The degree to which Python will raise Unicode errors? The awareness of the programmer?
As advice, I think this is mostly false. In particular, unless you do language-specific manipulations (transforming particular words and the like), the Latin-N family is going to be 6-sigma interoperable with Latin-1, and the rest of the ISO 8859 and Windows-125x family tolerably so. This is why it is so hard to root out the "Python 3 is just Unicode-me-harder by another name" meme. The most you should say here is that data *may* be corrupted and that, depending on the program, the risk *may* be non-negligible for non-Latin-1 data if you ever encounter it.
That last line would be better "attempt to validate the data, or output it without an error-suppressing handler (which may occur implicitly, in a module your program uses)."
You can be a little more precise: non-ASCII-compatible encodings will trigger errors in the same circumstances as ASCII-compatible encodings. They are also likely to be corrupted, depending on the specifics of the encoding and how the program manipulates the data. I don't know if it's worth the extra verbosity, though.
The claim of "right" isn't good advice. The *right* way to process text is to insist on knowing the encoding in advance. If you have to process text in unknown encodings, then what is "right" will vary with the application. For one thing, accurate detection is generally impossible without advice from outside. Given the inaccuracy of automatic detection, I would often prefer to fall back to a generic ASCII-compatible algorithm that omits any processing that requires identifying non-ASCII characters or inserting non-ASCII characters into the text stream, rather than risk mojibake. In other cases, all of the significant processing is done on ASCII characters, and non-ASCII is simply passed through verbatim. Then if you need to process text in assorted encodings, the 'latin1' method is not merely acceptable, it is the obvious winning strategy. And to some extent the environment:
[T]he default restrictive character set on Windows and in some locales may cause problems.
In sum, naive use of chardet is probably most effective as a way to rule out non-ASCII-compatible encodings, which *can* be done rather accurately (Shift JIS, Big5, UTF-16, and UTF-32 all have characteristic patterns of use of non-ASCII octets).

I really like a task-oriented approach like this. +1000 for this sort of thing in the docs. On 15 February 2012 08:03, Nick Coghlan <ncoghlan@gmail.com> wrote:
Task: Process data in any ASCII compatible encoding
This is actually closest to how I think about what I'm doing, so thanks for spelling it out.
Unicode Awareness Care Factor: High
I'm not entirely sure how to interpret this - "High level of interest in getting it right" or "High amount of investment in understanding Unicode needed"? Or something else?
If this is going into the Unicode FAQ or somewhere similar, it probably needs a more complete snippet of sample code. Without having looked for and read the chardet2 documentation, do I need to read the file once in binary mode (possibly only partially) to scan it for an encoding, and then start again "for real"? That's arguably a downside to this approach.
There is arguably another, simpler approach, which is to pick a default encoding (probably what Python gives you by default) and add a command line argument to your program (or equivalent if your program isn't a command line app) to manually specify an alternative. That's probably more complicated than the naive user wanted to deal with when they started reading this summary, but may well not sound so bad by the time they get to this point :-)
A couple of other tasks spring to mind:

Task: Process data in a file whose encoding I don't know
Unicode Understanding Needed: Medium-Low
Unicode Correctness: High
Approach: Use external tools to identify the encoding, then simply specify it when opening the file. On Unix, "file -i FILENAME" will attempt to detect the encoding, on Windows, XXX. If, and only if, this approach doesn't identify the encoding clearly, then the other options allow you to do the best you can. (Needs a better description of what tools to use, and maybe a sample Python script using chardet2 as a fallback -- see the sketch below.)

This is actually the "right way", and should be highlighted as such. By describing it this way, it's also rather clear that it's *not hard*, once you get over the idea that you don't know how to get the encoding, because it's not specified in the file. Having read through and extended Nick's analysis to this point, I'm thinking that it actually fits my use cases fine (and correct Unicode handling no longer feels like such a hard problem to me :-))

Task: Process data in a file believed to have inconsistent encodings
Unicode Understanding Needed: High
Unicode Correctness: Low
Approach: ??? Panic :-)

This is the killer, but should be extremely rare. We don't need to explain what to do here, but maybe offer a simple strategy:
1. Are you sure the file has mixed encodings? Have you checked twice?
2. If it's ASCII-compatible, can you work on a basis that you just pass the mixed-encoding bytes through unchanged? If so use one of the other recipes Nick explained.
3. Do you care about mojibake or corruption? Can you afford not to?
4. Are you a Unicode expert, or do you know one? :-)

I think something like this would be a huge benefit for the Unicode FAQ. I haven't got the time or expertise to write it, but I wish I did. If I get some spare time, I might well have a go anyway, but I can't promise. Paul
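A rough sketch of the "command line override plus chardet2 fallback" idea mentioned above (the detect() API and the confidence threshold are assumptions, not checked against the actual package):

    import argparse
    import chardet2   # assumed chardet-style API

    def guess_encoding(fname, default="latin-1", min_confidence=0.5):
        with open(fname, "rb") as f:
            raw = f.read(64 * 1024)               # a sample is usually enough
        guess = chardet2.detect(raw)
        if guess.get("encoding") and guess.get("confidence", 0) >= min_confidence:
            return guess["encoding"]
        return default

    parser = argparse.ArgumentParser()
    parser.add_argument("filename")
    parser.add_argument("--encoding", default=None,
                        help="source encoding; guessed if omitted")
    args = parser.parse_args()

    encoding = args.encoding or guess_encoding(args.filename)
    with open(args.filename, encoding=encoding) as f:
        for line in f:
            pass  # process each line here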

On 15/02/2012 6:51pm, Paul Moore wrote:
Don't recommend "file -i". I just tried it on the files in /usr/share/libtextcat/ShortTexts/. Basically, everything is identified as us-ascii, iso-8859-1 or unknown-8bit. Examples:

chinese-big5.txt: text/plain; charset=iso-8859-1
chinese-gb2312.txt: text/plain; charset=iso-8859-1
japanese-euc_jp.txt: text/plain; charset=iso-8859-1
korean.txt: text/plain; charset=iso-8859-1
arabic-windows1256.txt: text/plain; charset=iso-8859-1
georgian.txt: text/plain; charset=iso-8859-1
greek-iso8859-7.txt: text/plain; charset=iso-8859-1
hebrew-iso8859_8.txt: text/plain; charset=iso-8859-1
russian-windows1251.txt: text/plain; charset=iso-8859-1
ukrainian-koi8_r.txt: text/plain; charset=iso-8859-1

sbt

On 15 February 2012 19:53, shibturn <shibturn@gmail.com> wrote:
Don't recommend "file -i".
Fair enough - I have no experience to comment one way or another; it was just something I'd seen mentioned in the thread. If there isn't a good standard encoding detector, maybe a small Python script using chardet2 would be the best thing to recommend... Paul.

MRAB <python@mrabarnett.plus.com> writes:
encoding="mojibake" # :-)
+1 If people want to remain wilfully ignorant of text encoding in the third millennium of our calendar, then a name like “mojibake” is clear about what they'll get, and will perhaps be publicly embarrassing enough that some proportion of programmers will decide to reduce their ignorance and use a specific encoding instead. -- \ “Science is a way of trying not to fool yourself. The first | `\ principle is that you must not fool yourself, and you are the | _o__) easiest person to fool.” —Richard P. Feynman, 1964 | Ben Finney

On Wed, Feb 15, 2012 at 11:15:36AM +1100, Ben Finney wrote:
If people want to remain wilfully ignorant of text encoding in the third millennium
This returns us to the very beginning of the thread. The original complaint was: Python 3 requires users to learn too much about unicode, more than they really need. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

Matt Joiner wrote:
The thread was reasons for a possible drop in popularity. Somehow the other reasons have been sabotaged leaving only the unicode discussion still alive.
Not so much sabotaged as ignored. Perhaps because we don't believe this alleged drop in popularity represents anything real, while the Unicode issue is a genuine problem that needs a solution. -- Steven

On 16/02/12 02:39, Oleg Broytman wrote:
I don't think it's helpful to label everyone who wants to use the techniques being discussed here as lazy or ignorant. As we've seen, there are cases where you truly *can't* know the true encoding, and at the same time it *doesn't matter*, because all you want to do is treat the unknown bytes as opaque data. To tell someone in that position that they're being lazy is both wrong and insulting. It seems to me that what surrogateescape is effectively doing is creating a new data type that consists of a mixture of ASCII characters and raw bytes, and enables you to tell which is which. Maybe there should be a real data type like this, or a flag on the unicode type. The data would be stored in the same way as a latin1-decoded string, but anything with the high bit set would be regarded as a byte instead of a character. This might make it easier to interoperate with external libraries that expect well-formed unicode. -- Greg

On Thu, Feb 16, 2012 at 02:37:12PM +1300, Greg Ewing wrote:
In fairness, this thread was originally started with the scenario "I'm reading files which are only mostly ASCII, but I don't want to learn about Unicode" rather than "I know about Unicode, but it doesn't help me in this situation because the encoding truly is unknown". So wilful ignorance does apply, at least in the use-case the thread started with. (If it helps, think of them as too busy to learn, not too lazy.) If you already know about Unicode, then you probably don't need to be given a simple recipe to follow, because you probably already have a solution that works for you. Which brings us back to the original use-case: "I have a file which is only mostly ASCII, and I don't care to learn about Unicode at this time to deal with it. I need a recipe I can follow that will do the right thing so I can continue to ignore the issue for a little longer." I don't think we should either insist that these people be forced to learn Unicode, or expect to be able to solve every possible problem they might find. A couple of recipes in the FAQs, and discussion of why you might prefer one to the other, should be able to cover most simple cases:

open(filename, encoding='ascii', errors='surrogateescape')
open(filename, encoding='latin1')

Both recipes hint at the wider world of encodings and error handlers, hence act as a non-threatening introduction to Unicode. -- Steven
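For example, a round trip with the first recipe might look like this (file names are invented; the point is that the escaped bytes survive unchanged as long as the same encoding and error handler are used on the way back out):

    # Read, manipulate the ASCII parts, write back out -- non-ASCII bytes round-trip.
    with open("changelog.txt", encoding="ascii", errors="surrogateescape") as f:
        lines = [line.upper() if line.startswith("*") else line for line in f]

    with open("changelog.out", "w", encoding="ascii", errors="surrogateescape") as f:
        f.writelines(lines)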

On 16 February 2012 04:08, Steven D'Aprano <steve@pearwood.info> wrote:
As the person who started the thread with this use case, I'd dispute that description of what I said. To restate it "I'm reading files which are mostly ASCII but not all. I know that I should identify the encoding, and what to do if I did know the encoding, but I'm not sure how to find out reliably what the encoding is. Also, the problem doesn't really warrant investing the time needed to research means of doing so - given that I don't need to process the non-ASCII, I just want to avoid decoding errors and not corrupt the data". I'm not lazy, I've just done a cost/benefit analysis and determined that my limited knowledge should be enough. Experience with other tools which aren't as strict as Python 3 on Unicode matters confirms that a "good enough" job does satisfy my needs. And I'm not willfully ignorant, I actually have a good feel for Unicode and the issues involved, and I certainly know what's right. I've just found that everything I've read assumes that "knowing the encoding" isn't hard - and my experience differs, so I don't know where to go for answers. Add to this the fact that I *know* I've seen supposed text files with mixed encoding content, and no-one has *ever* explained how to handle that (it's basically a damaged file, and so all the "right way to deal with Unicode" discussions ignore it) even though tools like grep and awk do a perfectly acceptable job to the level I care about. I'm very pleased with the way this thread has gone, because it has answered all of the questions I've had about "nearly-ASCII" text files. But there's no way I'd have expected to spend this much time, and involve this many other people with more knowledge than me, just to handle my original changelog-parsing problem that I could do in awk or Python 2 in about 5 minutes. Now, I could also do it in Python 3. But then, I couldn't. Hopefully the knowledge from this thread can be captured so that other people can avoid my dilemma. OK, so maybe I do feel somewhat insulted... Cheers, Paul.

Paul Moore wrote:
I am sorry, I spoke poorly. Apologies if you feel I misrepresented you. To be honest, this thread has been so large, and so rambling, and covering so much ground, I have no idea what the *actual* first mention of encoding related issues was. The oldest I can find was Giampaolo Rodolà on 9 Feb 2012 20:16:00 +0100: I bet a lot of people don't want to upgrade for another reason: unicode. The impression I got is that python 3 forces the user to use and *understand* unicode and a lot of people simply don't want to deal with that. two days before the first post from you mentioning encoding issues that I can find. Another mention of a similar use-case was by Stephen J Turnbull on 10 Feb 2012 17:41:21 +0900: True, if one sticks to pure ASCII, there's no difference to notice, but that's just not possible for people who live outside of the U.S., or who share text with people outside of the U.S. They need currency symbols, they have friends whose names have little dots on them. Every single one of those is a backtrace waiting to happen. A backtrace on f = open('text-file.txt') for line in f: pass is an imposition. That doesn't happen in 2.x (for the wrong reasons, but it's very convenient 95% of the time). This is what Victor's "locale" codec is all about. I think that's the wrong spelling for the feature, but there does need to be a way to express "don't bother me about Unicode" in most scripts for most people. We don't have a decent boilerplate for that yet. which I *paraphrased* as "I have text files that are mostly ASCII and I don't want to deal with Unicode yadda yadda yadda". But in any case, I expressed myself poorly, and I'm sorry about that. Regardless of who made the very first mention of the encoding problem in this thread, I think we should all be able to agree that laziness is *not* the only reason for having encoding problems. I thought I made it clear that I did not subscribe to that opinion. -- Steven

On 16 February 2012 13:44, Steven D'Aprano <steve@pearwood.info> wrote:
Not a problem. Equally, my "I feel insulted" dig was uncalled for - it was the sort of semi-humorous comment that doesn't translate itself well in email. I think the debate here has been immensely useful, and I appreciate everyone's comments. Paul.

Paul Moore writes:
Add to this the fact that I *know* I've seen supposed text files with mixed encoding content,
Heck, I've seen *file names* with mixed encoding content.
The right way to handle such a file is ad hoc: operate on the features you can identify, and treat runs of bytes of unknown encoding as atomic blobs. In practice, there is a generic such feature that supports many applications: runs of ASCII text. Which is the intuition all the pragmatists start with -- it's correct.
OK, so maybe I do feel somewhat insulted...
I'm sorry you feel that way. (I've sided with the pragmatists in this thread, but on this issue I'm a purist at heart.)

On 16 February 2012 15:25, Stephen J. Turnbull <stephen@xemacs.org> wrote:
As I said elsewhere that was a lame attempt at a joke. My apologies. No-one has been anything but helpful in this thread, I was just reacting (a little) to the occasional characterisation I've noticed of people as "lazy" - your term "pragmatists" is much less emotive. (And it wasn't so much a personal reaction anyway, just an awareness that we need to be careful how we express things to people struggling with this) Paul.

On 2/16/2012 7:59 AM, Paul Moore wrote:
Before unicode, mixed encodings were the only way to have multi-lingual digital text (with multiple symbol systems) in one file. I presume such texts used some sort of language markup like <English>, <Hindi> (or <Sanskrit>), and <Tibetan>, along with software that understood the markup. Such files were not broken, just the pre-unicode system of different codes for each language or nation. To handle such a file, the program, whatever the language, has to understand the custom markup, segment the bytes, and handle each segment appropriately. Crazy text that switches among unknown encodings without notice is a possibly unsolvable decryption problem. Such problems have no guaranteed algorithms, only heuristics. -- Terry Jan Reedy

Terry Reedy writes:
Before unicode, mixed encodings were the only way to have multi-lingual digital text (with multiple symbol systems) in one file.
There is a long-accepted standard for doing this, ISO 2022. IIRC it's available online from ISO now, and if not, ECMA 35 is the same. The X Compound Text standard (I think this is documented in the ICCCM) and the Motif Compound String are profiles of ISO 2022. If that is what Paul is seeing, then the iso-2022-jp codec might be good enough to decode the files he has, depending on which version of ISO-2022-JP is implemented. If not, iconv -f ISO-2022-JP-2 (or ISO-2022-JP-3) should work (at least for GNU's iconv implementation).
They would use encoding "markup" (specifically escape sequences). Language is not enough, as all languages have had multiple encodings since the invention of ASCII (or EBCDIC, whichever came second ;-), and in many cases multilingual standards have evolved (Japanese, for example, includes Greek and Cyrillic alphabets in its JIS standard coded character set). More recently, many languages have several ISO 2022-based encodings (the ISO 8859 family is a conformant profile of ISO 2022, as are the EUC encodings for Asian languages; the Windows 125x code pages are non-conformant extensions of ASCII based on ISO 8859).
Crazy text that switches among unknown encodings without notice is a possibly unsolvable decryption problem.
True, and occasionally seen even today in Japan (cat(1) will produce such files easily, and any system for including files).

Greg Ewing writes:
Maybe there should be a real data type [parallel to str and bytes that mixes str and bytes], or a flag on the unicode type.
-1. This is yesterday's problem. It still hurts today; we need workarounds. But it's going to be less and less important as time goes on, because nobody can afford one-locale software anymore, and the cheapest way to be multilocale is to process in Unicode, and insist on Unicode on input and output. The unknown encoding problem is not one with a generally acceptable solution. That's why Unicode was invented. To "solve" the problem by ensuring it doesn't occur in the first place.

Greg Ewing wrote:
How so? Sounds like this new data type assumes everything over 127 is a raw byte, but there are plenty of applications where values between 0 - 127 should be interpreted as raw bytes even when the majority are indeed just plain ascii.
I can see a data type that is easier to work with than bytes (ascii-string, anybody? ;) but I don't think we want to make it any kind of unicode -- once the text has been extracted from this ascii-string it should be converted to unicode for further processing, while any other non-convertible bytes should stay as bytes (or ascii-string, or whatever we call it). The above is not arguing with the 'latin-1' nor 'surrogateescape' techniques, but only commenting on a different data type with probably different uses. ~Ethan~

Ethan Furman writes:
But there really aren't any uses that aren't equally well dealt with by 'surrogateescape' that I can see. You have to process it code unit by code unit (just like surrogateescape) and if you find a non-character code unit, you then have an ad hoc decision to make about what to do with it. surrogateescape makes one particular treatment blazingly efficient (namely, turning the surrogate back into a byte with no known meaning). What other treatment of a byte of by-definition unknown semantics deserves the blazing efficiency that a new (presumably builtin) type could give?

Stephen J. Turnbull wrote:
It wasn't the 'unknown semantics' that I was responding to (latin-1 and surrogateescape deal with that just fine), but rather a new data type with a mixture of valid unicode (0-127) and raw bytes (128-255) -- I don't think that would be common enough to justify, and I can see confusion again creeping in when somebody (like myself ;) sees a datatype which seemingly supports a mixture of unicode and raw bytes only to find out that 'uni_raw(...)[5] != 32' because a u' ' was returned and an integer (or raw byte) was expected at that location. ~Ethan~

On 2/14/2012 6:39 AM, Carl M. Johnson wrote:
While this is a Py3 str object, it is not unicode. Unicode only allows proper surrogate codeunit pairs. Py2 allowed mal-formed unicode objects and that was not changed in Py3 -- or 3.3. It seems appropriate that bytes that are meaningless to ascii should be translated to codeunits that are meaningless (by themselves) to unicode.
utf-8 only encodes proper unicode.
_.encode("utf-8", errors="surrogateescape") b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
The result is not utf-8, so it would be better to use 'ascii' rather than 'utf-8' in the expression. The above encodes to ascii + uninterpreted high-bit-set bytes.
-- Terry Jan Reedy

On Tue, Feb 14, 2012 at 9:39 PM, Carl M. Johnson <cmjohnson.mailinglist@gmail.com> wrote:
Oops, that's what I get for posting without testing :) Still, your example clearly illustrates the point I was trying to make - that using "ascii+surrogateescape" is less likely to silently corrupt the data stream than using "latin-1", because attempts to encode it under the "strict" error handler will generally fail, even for an otherwise universal encoding like UTF-8.
OK, so concrete proposals: update the docs and maybe make a synonym for Latin-1 that makes it more semantically obvious that you're not really using it as Latin-1, just as an easy pass-through encoding. Anything else? Any bike shedding on the synonym?
I don't see any reason to obfuscate the use of "latin-1" as a workaround that maps 8-bit bytes directly to the corresponding Unicode code points. My proposal would be two-fold:

Firstly, that we document three alternatives for working with arbitrary ASCII compatible encodings (from simplest to most flexible):

1. Use the "latin-1" encoding

The latin-1 encoding accepts arbitrary binary data by mapping individual bytes directly to the first 256 Unicode code points. Thus, any sequence of bytes may be translated to a sequence of code points, effectively reproducing the behaviour of Python 2's 8-bit strings. If all data supplied is genuinely in an ASCII compatible encoding then this will work correctly. However, it fails badly if the supplied data is ever in an ASCII incompatible encoding, or if the decoded string is written back out using a different encoding. Using this option switches off *all* of Python 3's support for ensuring transcoding correctness - errors will frequently pass silently and result in corrupted output data rather than explicit exceptions.

2. Use the "ascii" encoding with the "surrogateescape" error handler

This is the most correct approach that doesn't involve attempting to guess the string encoding. Behaviour if given data in an ASCII incompatible encoding is still unpredictable (and likely to result in data corruption). This approach retains most of Python 3's support for ensuring transcoding correctness, while still accepting any ASCII compatible encoding. If UnicodeEncodeErrors when displaying surrogate escaped strings are not desired, sys.stdout should also be updated to use the "backslashreplace" error handler. (see below)

3. Initially process the data as binary, using the "chardet" package from PyPI to guess the encoding

This is the most correct option that can even cope with many ASCII incompatible encodings. Unfortunately, the chardet site is gone, since Mark Pilgrim took down his entire web presence. This (including the dead home page link from the PyPI entry) would need to be addressed before its use could be recommended in the official documentation (or, failing that, is there a properly documented alternative package available?)

Secondly, that we make it easy to replace a TextIOWrapper with an equivalent wrapper that has only selected settings changed (e.g. encoding or errors). In 3.2, that is currently not possible, since the original "newline" argument is not made available as a public attribute. The closest we can get is to force universal newlines mode along with whatever other changes we want to make:

old = sys.stdout
sys.stdout = io.TextIOWrapper(old.buffer, old.encoding, "backslashreplace", None, old.line_buffering)

3.3 currently makes this even worse by accepting a "write_through" argument that isn't available for introspection. I propose that we make it possible to write the above as:

sys.stdout = sys.stdout.rewrap(errors="backslashreplace")

For the latter point, see http://bugs.python.org/issue14017

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
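Until something like rewrap() exists, a rough stand-in can be written by hand. This is a sketch only, with the limitation described above: the original "newline" setting cannot be recovered, so universal newlines are forced.

    import io
    import sys

    def rewrap(stream, **changes):
        # Rebuild a TextIOWrapper around the same buffer with selected settings changed.
        settings = dict(
            encoding=stream.encoding,
            errors=stream.errors,
            newline=None,                     # not introspectable; forced to universal newlines
            line_buffering=stream.line_buffering,
        )
        settings.update(changes)
        buffer = stream.detach()              # detach so the old wrapper can't close the buffer
        return io.TextIOWrapper(buffer, **settings)

    # Example: make stdout tolerant of surrogate-escaped data.
    # Note that after detaching, the old wrapper (e.g. sys.__stdout__) is unusable.
    sys.stdout = rewrap(sys.stdout, errors="backslashreplace")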

Nick Coghlan writes:
If you're only round-tripping (i.e. writing back out as "ascii+surrogateescape")
This is the only case that makes sense in this thread. We're talking about people coming from Python 2 who want an encoding-agnostic way to script ASCII-oriented operations for an ASCII-compatible environment, and not to learn about encodings at all. While my opinions on this are (probably obviously) informed by the WSGI discussion, this is not about making life come up roses for the WSGI folks. They work in a sewer; life stinks for them, and all they can do about it is to hold their noses. This thread is about people who are not trying to handle sewage in a sanitary fashion, rather just cook a meal and ignore the occasional hairs that inevitably fall in.
However, it's trivial to get an error when you go to encode the data stream without one of the silencing error handlers set.
Sure, but getting errors is for people who want to learn how to do it right, not for people who just need to get a job done. Cf. the fevered opposition to giving "import cElementTree" a DeprecationWarning.
No, it should *always* throw a UnicodeEncodeError, because there are *no* encodings that can handle them -- they're not characters, so they can't be encoded.
(Of course, if sys.stdout.encoding is "UTF-8", then you're right, those characters will just be displayed as gibberish,
No, they will raise UnicodeEncodeError; that's why surrogateescape was invented, to work around the problem of what to do with bytes that the programmer knows are meaningful to somebody, but do not represent characters as far as Python can know:

wideload:~ 10:06$ python3.2
Python 3.2 (r32:88445, Mar 20 2011, 01:56:57)
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
... position 0: surrogates not allowed
The reason I advocate 'latin-1' (preferably under an appropriate alias) is that you simply can't be sure that those surrogates won't be passed to some module that decides to emit information about them somewhere (eg, a warning or logging) -- without the protection of a "silencing error handler". Bang-bang! Python's silver hammer comes down upon your head!

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
[…]
You have made me feel strange emotions with this message. I don't know what they are, but a combination of “sickened” and “admiring” and “nostalgia”, with a pinch of fear, seems close. Maybe this is what it's like to read poetry. -- \ “[Entrenched media corporations will] maintain the status quo, | `\ or die trying. Either is better than actually WORKING for a | _o__) living.” —ringsnake.livejournal.com, 2007-11-12 | Ben Finney

On 2/11/2012 7:53 AM, Stefan Behnel wrote:
If one has ascii text + unspecified 'other stuff', one can either process as 'polluted text' or as 'bytes with some ascii character codes'. Since (as I just found out) one can iterate binary mode files by line just as with text mode, I am not sure what the tradeoffs are. I would guess it is mostly whether one wants to process a sequence of characters or a sequence of character codes (ints). -- Terry Jan Reedy

On 11 February 2012 12:41, Masklinn <masklinn@masklinn.net> wrote:
To be honest, I'm fine with the answer "use latin1" for this case. Practicality beats purity and all that. But as you say, it feels wrong somehow. I suspect that errors=surrogateescape is the better "I don't really care" option. And I still maintain it would be useful for combating FUD if there was a commonly-accepted idiom for this. Interestingly, on my Windows PC, if I open a file using no encoding in Python 3, I seem to get code page 1252: Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information.
So actually, on this PC, I can't really provoke these sorts of decoding error problems (CP1252 accepts all bytes, it's basically latin1). Whether this is a good thing or a bad thing, I'm not sure :-) Paul

Masklinn writes:
So give latin-1 an additional name. Emacsen use "raw-text" (there's also binary, but raw-text will do a loose equivalent of universal newlines for you, binary doesn't). You could also use a name more exact and less English-biased like "ascii-compatible-bytes". Same codec, name denotes different semantics.

On 2/11/2012 5:47 AM, Paul Moore wrote:
Good example. I believe adding ", encoding='latin-1'" to open() is sufficient. (And from your response elsewhere to Stephen, you seem to know that.) This should be in the tutorial if not already. But in reference to what I wrote above, knowing that magic phrase is not 'knowledge of unicode'. And I include it in the 'not much more knowledge' needed for Python 3. -- Terry Jan Reedy

On 2/11/2012 12:00 PM, Masklinn wrote:
When I wrote that response, I thought that 'for line in f' would not work for binary-mode files. I then opened IDLE, experimented with 'rb', and discovered otherwise. So the remaining issue is how one wants the unknown encoding bytes to appear when printed -- as hex escapes, or as arbitrary but more readable non-ascii latin-1 chars. -- Terry Jan Reedy
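A tiny illustration of the two representations (the byte string is invented):

    raw = b"caf\xe9 *"                  # latin-1 encoded "café *"
    print(raw)                          # b'caf\xe9 *'  -- unknown bytes shown as hex escapes
    print(raw.decode("latin-1"))        # café *        -- same bytes as latin-1 characters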

On 11 February 2012 17:00, Masklinn <masklinn@masklinn.net> wrote:
In my view, that's less scalable to more complex cases. It's likely you'll hit things you need to do that don't translate easily to bytes sooner than if you stick to a string-only world. A simple example: check for a regex rather than a simple starting character. The problem I have with encoding="latin-1" is that in many cases I *know* that's a lie. From what's been said in this discussion so far, I think that the "better" way to say "I know this file contains mostly ASCII, but there's some other bits I'm not sure about but don't care too much as long as they round-trip cleanly" is encoding="ascii", errors="surrogateescape". But as we've seen here, that's not the idiom that gets recommended by everyone (the "One Obvious Way", if you like). I suspect that if the community did embrace a "one obvious way", that would reduce the "Python 3 makes me need to know Unicode" FUD that's around. But as long as people get 3 different answers when they ask the question, there's going to be uncertainty and doubt (and hence, probably, fear...) Paul. PS I'm pretty confident that I have *my* answer now (ascii/surrogateescape). So this thread was of benefit to me, if nothing else, and my thanks for that.
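A minimal sketch of that idiom applied to a changelog-style file (the file name and the regex are placeholders, not from any real project):

    import re

    entry = re.compile(r"^\*\s+(\S+)")      # e.g. "* 1.2.3" changelog headers

    versions = []
    with open("ChangeLog", encoding="ascii", errors="surrogateescape") as f:
        for line in f:
            m = entry.match(line)
            if m:
                versions.append(m.group(1))
    # Any non-ASCII bytes elsewhere are carried as surrogates and can be written
    # back out with the same encoding/errors settings if needed.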

Masklinn writes:
Why not open the file in binary mode in stead? (and replace `'*'` by `b'*'` in the startswith call)
This will often work, but it's task-dependent. In particular, I believe not just `.startswith()`, but general regexps work with either bytes or str in Python 3. But other APIs may not, and you're going to need to prefix *all* literals (including those in modules your code imports!) with `b`. So you import a module that does exactly what you want, and are stymied by a TypeError because the module wants Unicode. This would not happen with Python 2, and there's the rub.

On Mon, Feb 13, 2012 at 2:50 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
The other trap is APIs like urllib.parse which explicitly refuse the temptation to guess when it comes to bytes data, and decode it as "ascii+strict". If you want it to do something else that's more permissive (e.g. "latin-1" or "ascii+surrogateescape") then you *have* to decode it to Unicode yourself before handing it over. Really, Python 3 forces programmers to learn enough about Unicode to be able to make the choice between the 4 possible options for processing ASCII-compatible encodings:

1. Process them as binary data. This is often *not* going to be what you want, since many text processing APIs will either only accept Unicode, or only pure ASCII, or require you to supply encoding+errors if you want them to process binary data.

2. Process them as "latin-1". This is the answer that completely bypasses all Unicode integrity checks. If you get fed non-ASCII data, you *will* silently produce gibberish as output.

3. Process them as "ascii+surrogateescape". This is the *right* answer if you plan solely to manipulate the text and then write it back out in the same encoding as was originally received. You will get errors if you try to write a string with escaped characters out to a non-ascii channel or an ascii channel without surrogateescape enabled. To write such strings to non-ascii channels (e.g. sys.stdout), you need to remember to use something like "ascii+replace" to mask out the values with unknown encoding first. You may still get hard to debug UnicodeEncodeError exceptions when handed data in a non-ASCII compatible encoding (like UTF-16 or UTF-32), but your odds of silently corrupting data are fairly low.

4. Get a third party encoding guessing library and use that instead of waving away the problem of ASCII-incompatible encodings.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

2012/2/11 Paul Moore <p.f.moore@gmail.com>
I just looked at the Python 3 documentation (http://docs.python.org/release/3.1.3/library/functions.html#open); there is an "errors" parameter to the open function. When set to "ignore" or "replace" it will solve your problem. Another way is to try to guess the encoding programmatically (I found the chardet module: http://pypi.python.org/pypi/chardet) and pass it to decode your file with unknown encoding. Then why not make an "auto" value available for the "encoding" parameter, which makes "open" call a detector before opening and throw an error when the guess is below a certain confidence? Gabriel AHTUNE

Massimo Di Pierro wrote:
Is that a commentary on Python, or the average undergrad student?
Python has a compiler. The "c" in .pyc files stands for "compiled" and Python has a built-in function called "compile". It just happens to compile to byte code that runs on a virtual machine, not machine code running on physical hardware. PyPy takes it even further, with a JIT compiler that operates on the byte code.
How is that relevant to a language being taught to undergrads? Sounds more like an excuse to justify dislike of teaching Python rather than an actual reason to dislike Python.
- The programming language purists complain about the use of reference counting instead of garbage collection
The programming language purists should know better than that. The choice of which garbage collection implementation (ref counting is garbage collection) is a quality of implementation detail, not a language feature. -- Steven

On 2012-02-09, at 19:03 , Steven D'Aprano wrote:
The choice of which garbage collection implementation (ref counting is garbage collection) is a quality of implementation detail, not a language feature.
That's debatable: it's an implementation detail with very different semantics which tends to leak out into usage patterns of the language (as it did with CPython, which basically did not get fixed in the community until PyPy started ascending), especially when the language does not provide "better" ways to handle things (as Python finally did by adding context managers in 2.5). So theoretically, automatic refcounting is a detail, but practically it influences language usage differently than most other GC techniques (when it's the only GC strategy in the language, anyway)

On Thu, Feb 9, 2012 at 10:14 AM, Masklinn <masklinn@masklinn.net> wrote:
I think it was actually Jython that first sensitized the community to this issue.
Are there still Python idioms/patterns/recipes around that depend on refcounting? (There also used to be some well-known anti-patterns that were only bad because of the refcounting, mostly around saving exceptions. But those should all have melted away -- CPython has had auxiliary GC for over a decade.) -- --Guido van Rossum (python.org/~guido)

On 2012-02-09, at 19:26 , Guido van Rossum wrote:
The first one was Jython yes, of course, but I did not see the "movement" gain much prominence before Pypy started looking like a serious CPython alternative, before that there were a few voices lost in the desert.
There shouldn't be, but I'm not going to rule out reliance on automatic resource cleanup just yet, I'm sure there are still significant pieces of code using those in the wild.

On Thu, Feb 9, 2012 at 10:37 AM, Masklinn <masklinn@masklinn.net> wrote:
I guess everyone has a different perspective.
There shouldn't be, but I'm not going to rule out reliance on automatic
resource cleanup just yet, I'm sure there are still significant pieces of code using those in the wild.
I am guessing in part that's a function of resistance to change, and in part it means PyPy hasn't gotten enough mindshare yet. (Raise your hand if you have PyPy installed on one of your systems. Raise your hand if you use it. Raise your hand if you are a PyPy contributor. :-) Anyway, the refcounting objection seems the least important one. The more important trolls to fight are "static typing is always better" and "the GIL makes Python multicore-unfriendly". TBH, I see some movement in the static typing discussion, evidence that the static typing zealots are considering a hybrid approach (e.g. C# dynamic, and the optional static type checks in Dart). -- --Guido van Rossum (python.org/~guido)

On 2012-02-09, at 19:44 , Guido van Rossum wrote:
These seem to be efforts of people trying for both sides (for various reasons) more than people firmly rooted in one camp or another. Dart was widely panned for its wonky approach to "static typing", which is generally considered a joke amongst people looking for actual static type (in that they're about as useful as Python 3's type annotations).

On Thu, Feb 09, 2012 at 10:44:42AM -0800, Guido van Rossum wrote:
I don't know if you actually want replies, but I'll bite. I have pypy installed (from the standard Fedora pypy package), and for a particular project it provided a 20x speedup. I'm not a PyPy contributor, but I'm a believer. I would use PyPy everywhere if it worked with Python 3 and scipy. My apologies if this was just a rhetorical question. :) -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868

On 10 February 2012 06:06, Guido van Rossum <guido@python.org> wrote:
In that case ... - I have various versions of PyPy installed (regularly pull the latest working Windows build); - I use it occasionally, but most of my Python work ATM is Google App Engine-based, and the GAE SDK doesn't work with PyPy; - I'm not a PyPy contributor, but am also a believer - I definitely think that PyPy is the future and should be the base for Python4K. - I won't be at PyCon. Cheers, Tim Delaney

Andrew McNabb, 09.02.2012 19:58:
AFAIK, there is no concrete roadmap towards supporting SciPy on top of PyPy. Currently, PyPy is getting its own implementation of NumPy-like arrays, but there is currently no interaction with anything in the SciPy world outside of those. Given the sheer size of SciPy, reimplementing it on top of numpypy is unrealistic. That being said, it's quite possible to fire up CPython from PyPy (or vice versa) and interact with that, if you really need both PyPy and SciPy. It even seems to be supported through multiprocessing. I find that pretty cool. http://thread.gmane.org/gmane.comp.python.pypy/9159/focus=9161 Stefan

On Thu, Feb 09, 2012 at 08:53:55PM +0100, Stefan Behnel wrote:
I understand that there is some hope in getting cython to support pure python and ctypes as a backend, and then to migrate scipy to use cython. This is definitely a long-term solution. Most people don't depend on all of scipy, and for some use cases, it's not too hard to find alternatives. Today I migrated a project from scipy to the GNU Scientific Library (with ctypes). It now works great with PyPy, and I saw a total speedup of 10.6. Dropping from 27 seconds to 2.55 seconds is huge. It's funny, but for a new project I would go to great lengths to try to use the GSL instead of scipy (though I'm sure for some use cases it wouldn't be possible).
That's a fascinating idea that I had never considered. Thanks for sharing. -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868

On 2012-02-10, at 01:03 , Guido van Rossum wrote:
I'm not sure what could be open to negotiate: being part of the GNU constellation, I don't see GSL budging from the GPL, and SciPy is backed by industry members and used in "nonfree" products (notably the Enthought Python Distribution), so there's little room for it to use the GPL. The best thing that could happen (and I'm not even sure it's allowed by the GSL's license, which is the GPL, not the LGPL) would be for SciPy to grow some sort of GSL backend to delegate its operations to, when the GSL is installed.

On 2/10/12 8:49 AM, Masklinn wrote:
While I am an Enthought employee and really do want to keep scipy BSD so I can continue to use it in the proprietary software that I write for clients, I must also add that the most vociferous BSD advocates in our community are the academics. They have to wade through more weird licensing arrangements than I do, and the flexibility of the BSD license is quite important to let them get their jobs done.
We've done that kind of thing in the past for FFTW and other libraries but have since removed them for all of the installation and maintenance headaches it causes. In my mind (and others disagree), having scipy-the-package subsume every relevant library is not a worthwhile pursuit. The important thing is that these packages are available to the scientific Python community. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 2/10/12 10:41 AM, Masklinn wrote:
No apologies necessary. I just wanted to be thorough. :-) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

GIL + Threads = Fibers

CPython doesn't have threads but it calls its fibers "threads", which causes confusion and disappointment. The underlying implementation is not important, e.g. when you implement a "lock" using "events", does the lock become an event? No. This is a PR disaster.

100% agree we need a PR offensive, but first we need a strategy. Erlang champions the actor/message paradigm so they dodge the threading bullet completely. What's the Python-championed parallelism paradigm? It should be on the front page of python.org and in the first paragraph of wikipedia on python.

One of the Lua authors said this about threads:
Anyone who cares enough about performance doesn't mind that 'a = a + 1' is only as deterministic as you design it to be with or without locks. Multiprocessing has this same problem btw. What Python needs are better libraries for concurrent programming based on
processes and coroutines.
The killer feature for threads (vs multiprocessing) is access to shared state with nearly zero overhead. And note that a single-threaded event-driven process can serve 100,000 open
sockets -- while no JVM can create 100,000 threads.
Graphics engines, simulations, games, etc don't want 100,000 threads; they just want as many true threads as there are CPUs. Yuval
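For what it's worth, the stdlib already ships a process-based answer for CPU-bound work of that shape. A minimal sketch with concurrent.futures (the work function is just a stand-in):

    import concurrent.futures

    def simulate(chunk):                     # stand-in for a CPU-bound task
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i * 100000, (i + 1) * 100000) for i in range(8)]
        # One worker process per core sidesteps the GIL for CPU-bound work,
        # at the cost of passing data between processes instead of sharing it.
        with concurrent.futures.ProcessPoolExecutor() as pool:
            results = list(pool.map(simulate, chunks))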

Pure python code running in python "threads" on CPython behaves like fibers. I'd like to point out the word "external" in your statement.
I don't believe this to be true. Fibers are not preempted. The GIL is released at regular intervals to allow the effect of preemptive switching. Many other behaviours of Python threads are still native-thread-like, particularly in their interaction with other components and the OS. GIL + Threads = Simplified, non-parallel interpreter

Matt Joiner, 10.02.2012 15:48:
Absolutely. Even C extensions cannot always prevent a thread switch from happening when they need to call back into CPython's C-API.
GIL + Threads = Simplified, non parallel interpreter
Note that this also applies to PyPy, so even "interpreter" isn't enough of a generalisation. I think it's best to speak of the GIL as what it is: a lock that protects internal state of the CPython runtime (and also some external code, when used that way). Rather convenient, if you ask me. Stefan

On Sat, Feb 11, 2012 at 1:30 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Armin Rigo's series on Software Transactional Memory on the PyPy blog is also required reading for anyone seriously interested in practical shared memory concurrency that doesn't impose a horrendous maintenance burden on developers that try to use it: http://morepypy.blogspot.com.au/2011/06/global-interpreter-lock-or-how-to-ki... http://morepypy.blogspot.com.au/2011/08/we-need-software-transactional-memor... http://morepypy.blogspot.com.au/2012/01/transactional-memory-ii.html And for those that may be inclined to dismiss STM as pie-in-the-sky stuff that is never going to be practical in the "real world", the best I can offer is Intel's plans to bake an initial attempt at it into a consumer grade chip within the next couple of years: http://arstechnica.com/business/news/2012/02/transactional-memory-going-main... I do like Armin's analogy that free threading is to concurrency as malloc() and free() are to memory management :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Can we please break this thread out into multiple subject headers? It's very difficult to follow the flow of conversation with so many different discussions all lumped under one name. Some proposed subjects:
- Refcounting vs. Other GC
- Numpy
- Windows Installers
- Unicode
- Python in Education
- Python's Popularity

On Fri, Feb 10, 2012 at 12:43 PM, Carl M. Johnson < cmjohnson.mailinglist@gmail.com> wrote:
No, the subject is correct: we have a -3% problem in the index, so the solution is to keep this thread long with many keywords like python, pypy, jython etc... and then the % will grow! (at least @TIOBE, since it relies on google search ;) )

On 2/9/2012 1:26 PM, Guido van Rossum wrote:
Yes, it was. The first PyPy status blog in Oct 2007 http://morepypy.blogspot.com/2007/10/first-post.html long before any practical release, was a year after the 2.5 release. -- Terry Jan Reedy

On Thu, Feb 9, 2012 at 19:26, Guido van Rossum <guido@python.org> wrote:
There are some simple patterns that are great with refcounting and not so great with garbage collection. We encountered some of these with Mercurial. IIRC, the basic example is just

    open('foo').read()

With refcounting, the file will be closed soon. With garbage collection, it won't. Being able to rely on cleanup per frame/function call is pretty useful. Cheers, Dirkjan

On 10 February 2012 20:16, Dirkjan Ochtman <dirkjan@ochtman.nl> wrote:
This is the #1 anti-pattern that shouldn't be encouraged. Using this idiom is just going to cause problems (mysterious exceptions while trying to open files due to running out of file handles for the process) for anyone trying to port your code to other implementations of Python. If you read PEP 343 (and the various discussions around that time) it's clear that the above anti-pattern is one of the major driving forces for the introduction of the 'with' statement. Tim Delaney
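For reference, the deterministic spelling is barely longer and behaves the same on CPython, PyPy and Jython:

    # Closes the file promptly on every implementation, not just refcounted ones
    with open('foo') as f:
        data = f.read()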

On Fri, Feb 10, 2012 at 4:32 AM, Tim Delaney <timothy.c.delaney@gmail.com> wrote:
It's not that open('foo').read() is "good". Clearly with the presence of nondeterministic garbage collection, it's bad. But it is convenient and compact. Refcounting GCs in general give very nice, predictable behavior, which lets us ignore a lot of the details of destroying things. Without something like this, we have to do some forms of resource management by hand that we could otherwise push to the garbage collector, and while sometimes this is as easy as a with statement, sometimes it isn't. For example, what do you do if multiple objects are meant to hold onto a file and take turns reading it? How do we close the file at the end when all the objects are done? Is the answer "manual refcounting"? Or is the answer "I don't care, let the GC handle it"? -- Devin
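A minimal sketch of what "manual refcounting" could look like for the shared-file case (the class and method names are made up purely for illustration):

    import threading

    class SharedFile:
        """Consumers call acquire() before use and release() when done;
        the underlying file is closed when the last consumer releases it."""

        def __init__(self, path, mode="r", **kwargs):
            self._f = open(path, mode, **kwargs)
            self._refs = 0
            self._lock = threading.Lock()

        def acquire(self):
            with self._lock:
                self._refs += 1
                return self._f

        def release(self):
            with self._lock:
                self._refs -= 1
                if self._refs == 0:
                    self._f.close()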

On Thu, Feb 9, 2012 at 10:03 AM, Steven D'Aprano <steve@pearwood.info>wrote:
Well either way it's depressing...
Not sure how that's relevant. Massimo used "won't compile" as a shorthand for "has a syntax error". 50+% of the students have a mac and an increasing number of packages
depend on numpy. Installing numpy on mac is a lottery.
But that was the same in the 2.5 days. The problem is worse now because (a) numpy is going mainstream, and (b) Macs don't come with a C compiler any more. I think the answer will have to be in making an effort to produce robust and frequently updated downloads of numpy to match various popular Python versions and platforms. This is a major pain (packaging always is) so maybe some incentive is necessary (just like ActiveState has its Python distros).
Hm. I know a fair number of people who use Eclipse to edit Python (there's some plugin). This seems easy enough to address by just pointing people to the plugin, I don't think Python itself is to blame here. From the hard core computer scientists prospective there are usually
How is that relevant to a language being taught to undergrads? Sounds more
like an excuse to justify dislike of teaching Python rather than an actual reason to dislike Python.
I can see the discomfort if the other professors keep bringing this up. It is, sadly, a very effective troll. (Before it was widely known, the most common troll was the whitespace. People would declare it to be ridiculous without ever having tried it. Same with the GIL.) - The programming language purists complain about the use of reference
counting instead of garbage collection
The programming language purists should know better than that. The choice
of which garbage collection implementation (ref counting is garbage collection) is a quality of implementation detail, not a language feature.
Yeah, trolls are a pain. We need to start spreading more effective counter-memes. -- --Guido van Rossum (python.org/~guido)

On Feb 9, 2012, at 12:03 PM, Steven D'Aprano wrote:
I teach, so the average student is my benchmark. Please do not misunderstand. While some may be lazy, the average CS undergrad is not stupid but quite intelligent. They just do not like wasting time with setups, and I sympathize with that. Batteries included is the Python motto.
Don't shoot the messenger please. You can dismiss or address the problem. Anyway... undergrads do care, because they will take 4 years to graduate and they do not want to come out with obsolete skills. Our undergrads learn Python, Ruby, Java, Javascript and C++. Many know other languages which they learn on their own (Scala and Clojure are popular). They all agree multi-core is the future and whichever language can deal with it better is the future too. As masklinn says, the difference between garbage collection and reference counting is more than an implementation issue.

On Thu, Feb 9, 2012 at 10:25 AM, Massimo Di Pierro < massimo.dipierro@gmail.com> wrote:
I'd give those students a bonus for being in touch with what's popular in academia. Point them to Haskell next. They may amount to something.
They all agree multi-core is the future and whichever language can deal with them better is the future too.
Surely not JavaScript (which is single-threaded and AFAIK also uses refcounting :-). Also, AFAIK Ruby has a GIL much like Python. I think it's time to start a PR offensive explaining why these are not the problem the trolls make them out to be, and how you simply have to use different patterns for scaling in some languages than in others. And note that a single-threaded event-driven process can serve 100,000 open sockets -- while no JVM can create 100,000 threads. As masklinn says, the difference between garbage collection and reference
counting is more than an implementation issue.
-- --Guido van Rossum (python.org/~guido)

On 2012-02-09, at 19:34 , Guido van Rossum wrote:
I don't think I've seen a serious refcounted JS implementation in the last decade, although it is possible that JS runtimes have localized usage of references and reference-counted resources. AFAIK all modern JS runtimes are JITed, which probably does not mesh well with refcounting. In any case, V8 (Chrome's runtime) uses a stop-the-world generational GC for sure[0], Mozilla's SpiderMonkey uses a GC as well[1] although I'm not sure which type (the reference to JS_MarkGCThing indicates it could be or at least use a mark-and-sweep amongst its strategies), Webkit/Safari's JavaScriptCore uses a GC as well[2] and MSIE's JScript used a mark-and-sweep GC back in 2003[3] (although the DOM itself was in COM, and reference-counted).
Only because it's OS threads, of course; Erlang is not evented and has no problem spawning half a million (preempted) processes if there's RAM enough to store them.
[0] http://code.google.com/apis/v8/design.html#garb_coll
[1] https://developer.mozilla.org/en/SpiderMonkey/1.8.5#Garbage_collection
[2] Since ~2009 http://www.masonchang.com/blog/2009/3/26/nitros-garbage-collector.html
[3] http://blogs.msdn.com/b/ericlippert/archive/2003/09/17/53038.aspx

On Thu, Feb 9, 2012 at 10:50 AM, Masklinn <masklinn@masklinn.net> wrote:
I stand corrected (but I am right about the single-threadedness :-).
Only because it's OS threads of course, Erlang is not evented and has no
problem spawning half a million (preempted) processes if there's RAM enough to store them.
Sure. But the people complaining about the GIL come from Java, not from Erlang. (Erlang users typically envy Python because of its superior standard library. :-)
-- --Guido van Rossum (python.org/~guido)

On 2012-02-09, at 19:54 , Guido van Rossum wrote:
I stand corrected (but I am right about the single-threadedness :-).
Absolutely (until WebWorkers anyway)
True. Then they remember how good Python is with concurrency, distribution and distributed resilience :D (don't forget syntax, one of Erlang's biggest failures) (although it pleased cfbolz since he could get syntax coloration for his prolog)

On 09.02.2012 19:50, Masklinn wrote:
And Chrome uses one *process* for each tab, right? Is there a reason Chrome does not use one thread for each tab, such as security?
Actually, spawning half a million OS threads will burn the computer. *POFF* ... and it goes up in a ball of smoke. Spawning half a million threads is the Windows equivalent of a fork bomb. I think you confuse threads and fibers/coroutines. Sturla
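(A rough illustration of the distinction, not from the thread: coroutine-style tasks are just small heap objects, so half a million of them fit in a few hundred MB of RAM, whereas half a million threading.Thread objects would each demand an OS thread and its stack. The worker function below is made up:)

    def worker(n):
        # a toy "task": yield control once, then produce a result
        yield
        yield n * n

    # Half a million coroutine-style tasks: just half a million small
    # generator objects (a few hundred MB of RAM, nothing more).
    tasks = [worker(n) for n in range(500000)]
    for t in tasks:
        next(t)                 # run each task up to its first yield

    # Doing the same with threading.Thread would try to allocate half a
    # million OS threads and their stacks, and fall over long before done.
    results = [next(t) for t in tasks]
    print(len(results), results[:3])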

On Thu, Feb 09, 2012 at 08:08:36PM +0100, Sturla Molden wrote:
And Chrome uses one *process* for each tab, right? Is there a reason Chrome does not use one thread for each tab, such as security?
Safety, I dare say. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Thu, Feb 9, 2012 at 11:08 AM, Sturla Molden <sturla@molden.no> wrote:
Stability and security. If something goes wrong/rogue, the effects are reasonably isolated to the individual tab in question. And they can use OS resource/privilege limiting APIs to lock down these processes as much as possible. Cheers, Chris

On 2012-02-09, at 20:08 , Sturla Molden wrote:
I do not know the precise reasons, no, but it probably has to do with security and ensuring isolation, yes (webpage semantics mandate that each page gets its very own isolated javascript execution context).
No. You probably misread my comment somehow.

On Thu, Feb 9, 2012 at 2:08 PM, Sturla Molden <sturla@molden.no> wrote:
And Chrome uses one *process* for each tab, right?
Supposedly. If you click the wrench, then select Tools/Task Manager, it looks like there are actually several tabs per process (at least if you have enough tabs), but there can easily be several processes controlling separate tabs within the same window.
Is there a reason Chrome does not use one thread for each tab, such as security?
That too, but the reason they documented when introducing Chrome was stability. I can say that Chrome often warns me that a selection of tabs[1] appears to be stopped, and asks if I want to kill them; it more often appears to freeze -- but switching to a different tab is usually effective in getting some response while I wait the issue out.
[1] Not sure if the selection is exactly equal to those handled by a single process, but it seems so. -jJ

On Thu, Feb 9, 2012 at 11:39 AM, Jim Jewett <jimjjewett@gmail.com> wrote:
On Thu, Feb 9, 2012 at 2:08 PM, Sturla Molden <sturla@molden.no> wrote:
And Chrome uses one *process* for each tab, right?
Can we stop discussing Chrome here? It doesn't really matter. -- --Guido van Rossum (python.org/~guido)

On 09.02.2012 19:25, Massimo Di Pierro wrote:
As masklinn says, the difference between garbage collection and reference counting is more than an implementation issue.
Actually it is not. The GIL is a problem for those who want to use threading.Thread and plain Python code for parallel processing. Those who think in those terms typically have prior experience with Java or .NET.
Processes are excellent for concurrency, cf. multiprocessing, os.fork and MPI. They are actually more efficient than threads (due to avoidance of false sharing of cache lines) and safer (deadlocks and livelocks are more difficult to produce). And I assume students who learn to use such tools from the start are not annoyed by the GIL. The GIL annoys those who have learned to expect threading.Thread for CPU-bound concurrency in advance -- which typically means prior experience with Java. Python threads are fine for their intended use -- e.g. I/O and background tasks in a GUI. Sturla
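(A minimal sketch of the process-based approach described above, using only the stdlib multiprocessing module; the cpu_bound function and its inputs are made up for illustration:)

    import multiprocessing

    def cpu_bound(n):
        # a deliberately dumb CPU-bound task
        return sum(i * i for i in range(n))

    if __name__ == '__main__':
        # One worker process per core by default; no GIL contention,
        # because each process has its own interpreter (and its own GIL).
        pool = multiprocessing.Pool()
        print(pool.map(cpu_bound, [10 ** 6] * 8))
        pool.close()
        pool.join()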

On Fri, Feb 10, 2012 at 03:19:36AM +0800, Matt Joiner wrote:
The GIL is almost entirely a PR issue. In actual practice, it is so great (simple, straightforward, functional) I believe that it is a sign of Guido's time machine-enabled foresight. --titus -- C. Titus Brown, ctb@msu.edu

On Thu, Feb 09, 2012 at 08:42:40PM +0100, Masklinn wrote:
Are we scheduling interventions for me now? 'cause there's a lot of people who want to jump in that queue :) dabeaz understands this stuff at a deeper level than me, which is often a handicap in these kinds of discussions, IMO. (He's also said that he prefers message passing to threading.) The point is that in terms of actually making my own libraries and parallelizing code, the GIL has been very straightforward, cross platform, and quite simple for understanding the consequences of a fairly wide range of multithreading models. Most people want to go do inappropriately complex things ("ooh! threads! shiny!") with threads and then fail to write robust code or understand the scaling of their code; I think the GIL does a fine job of blocking the simplest stupidities. Anyway, I love the GIL myself, although I think there is a great opportunity for a richer & more usable mid-level C API for both thread states and interpreters. cheers, --titus -- C. Titus Brown, ctb@msu.edu

On Thu, Feb 9, 2012 at 11:19 AM, Matt Joiner <anacrolix@gmail.com> wrote:
I'd actually say that using OS threads is too heavy *specifically* for trivial cases. If you spawn a thread to add two numbers you'll have a huge overhead. If you spawn a thread to do something significant, the overhead doesn't matter much. Note that even in Java, everyone uses thread pools to reduce thread creation overhead. -- --Guido van Rossum (python.org/~guido)

On Fri, Feb 10, 2012 at 5:19 AM, Matt Joiner <anacrolix@gmail.com> wrote:
Have you even *tried* concurrent.futures (http://docs.python.org/py3k/library/concurrent.futures)? Or the 2.x backport on PyPI (http://pypi.python.org/pypi/futures)? Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
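(For anyone who hasn't tried it, a minimal sketch of what concurrent.futures looks like in practice; swap ThreadPoolExecutor for ProcessPoolExecutor when the work is CPU-bound and you want to sidestep the GIL. The fetch function and URL list are made up, and the import is the Python 3 spelling:)

    import concurrent.futures
    import urllib.request

    URLS = ['http://python.org/', 'http://pypi.python.org/']

    def fetch(url):
        # blocking I/O; the GIL is released while waiting on the network
        return url, len(urllib.request.urlopen(url, timeout=10).read())

    # A pool of worker threads; the downloads genuinely overlap.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for url, size in pool.map(fetch, URLS):
            print(url, size)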

On Fri, 10 Feb 2012 21:01:07 +0100 Sturla Molden <sturla@molden.no> wrote:
In what way does the mmap module fail to provide your binary file interface? <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/ Independent Software developer/SCM consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
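(The binary file interface under discussion isn't quoted above, but for reference the stdlib mmap module already exposes a file's contents as a mutable, sliceable buffer. A minimal sketch with a made-up file name, assuming the file exists and is at least 8 bytes long:)

    import mmap

    with open('data.bin', 'r+b') as f:          # made-up example file
        m = mmap.mmap(f.fileno(), 0)            # 0 = map the whole file
        header = m[:4]                          # random access by slicing
        m[4:8] = b'\x00\x00\x00\x01'            # in-place modification
        m.flush()                               # push changes back to disk
        m.close()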

Massimo Di Pierro writes:
Well, maybe you should teach your students the rudiments of lying, erm, "statistics". That -3% on the TIOBE index is a steaming heap of FUD, as Anatoly himself admitted. Feb 2011 is clearly above trend, Feb 2012 below it. Variables vary, OK? So at the moment it is absolutely unclear whether Python's trend line has turned down or even decreased slope. And the RedMonk ranking shows Python at the very top.
Maybe they should learn something about the reality of the IT industry, too. According to the TIOBE survey, COBOL and PL/1 are in the same class (rank 51-100, basically indistinguishable) with POSIX shell. Old programming languages never die ... and experts in them only become more valuable with time. Python skills will hardly become "obsolete" in the next decade, certainly not in the next 4 years.
You say "dismiss or address the problem." Is there a problem? I dunno. Popularity is nice, but I really don't know if I would want to use a Python that spent the next five years (because that's what it will take) fixing what ain't broke to conform to undergraduate misconceptions.
Sure, it would be nice to have more robust support for installing non-stdlib modules such as numpy. But guess what? That's a hard nut to crack, and more, people have been working quite hard on the issue for a while. The distutils folks seem to be about to release at this point -- I guess the Time Machine has struck again! And by the way, which of Ruby, Java, Javascript, and C++ provides something like numpy that's easier to install? Preferably part of their stdlib? In my experience on Linux and Mac, at least, numerical code has always been an issue, whether it's numpy (once that I can remember, and that was because of some dependency which wouldn't build, not numpy itself), Steel Bank Common Lisp, ATLAS, R, ....
The one thing that bothers me about the picture at TIOBE is the Objective-C line. I assume that's being driven by iPhone and iPad apps, and I suppose Java is being driven in part by Android. It's too bad Python can't get a piece of that action!

On Thu, Feb 9, 2012 at 1:14 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
It's too bad Python can't get a piece of that action!
Getting closer: http://morepypy.blogspot.com/2012/02/almost-there-pypys-arm-backend_01.html -eric

On Thu, Feb 9, 2012 at 11:49 AM, Massimo Di Pierro <massimo.dipierro@gmail.com> wrote:
At the University of Toronto we tell students to use the Wing IDE (Wing 101 was developed specifically for our use in the classroom, in fact). All classroom examples are done either in the interactive interpreter or in a session of Wing 101. All computer lab sessions are done using Wing 101, and the first lab is dedicated specifically to introducing how to edit files with it and use its debugging features. If students don't like IDLE, tell them to use a different editor instead, and pretend that Python doesn't include one. (By default IDLE only shows an interactive session, so if they get curious and click-y they'll still be in the dark.) -- Devin
participants (50)
- anatoly techtonik
- Andrew McNabb
- Antoine Pitrou
- Arnaud Delobelle
- Barry Warsaw
- Ben Finney
- Benjamin Peterson
- C. Titus Brown
- Cameron Simpson
- Carl M. Johnson
- Chris Rebert
- Christopher Reay
- Devin Jeanpierre
- Dirkjan Ochtman
- Edward Lesmes
- Eric Snow
- Ethan Furman
- Gabriel AHTUNE
- Giampaolo Rodolà
- Greg Ewing
- Gregory P. Smith
- Guido van Rossum
- Jesse Noller
- Jim Jewett
- Joao S. O. Bueno
- M.-A. Lemburg
- Mark Lawrence
- Mark Shannon
- Masklinn
- Massimo Di Pierro
- Matt Joiner
- Mike Meyer
- MRAB
- Nathan Rice
- Nick Coghlan
- Oleg Broytman
- Paul Moore
- Robert Kern
- Senthil Kumaran
- Serhiy Storchaka
- shibturn
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Sturla Molden
- Terry Reedy
- Tim Delaney
- yoav glazner
- Yuval Greenfield
- Éric Araujo