Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

On Wed, Jan 08, 2014 at 07:12:06PM +0100, Stefan Behnel wrote:
Why can't someone write a third-party library that does what these projects need, and that works in both Py2 and Py3, so that these projects can be modified to use that library and thus get on with their porting to Py3?
Apologies if this is out of place and slightly OT and soap-boxey... Does it not strike anyone here how odd it is that one would need a library to manipulate binary data in a programming language with "batteries included" on a binary computer? And maybe you can do it with existing facilities in both versions of Python, although in python3, I need to understand what bytes, format, ascii, and surrogateescape mean - among other things.

I started in Python blissfully unaware of unicode - it was a different time for sure, but what I knew from C worked pretty much the same in Python - I could read some binary data out of a file, twiddle some bits, and write it back out again without any of these complexities - life was good. Granted, I was naive, but it made Python approachable for me and I enjoyed it. I stuck with it and learned about unicode and the complexities of encoding data, and now I'm astonished at how many professional programmers don't know the slightest bit about it and how horribly munged some data you can consume on the web might be - I agree it's all quite a mess.

So now I'm getting more serious about Python3 and my fear is that the development community (python3) has fractured from the user community (python2) in that they've built something that solves their problems (to oversimplify, let's say a webapp) - sure, a bunch of stuff got fixed along the way and we gave users the division they would expect (3/2 == 1.5), but somewhere what I felt was more like a hobbyist language has become big and complex and "we need to protect our users from doing the wrong thing."

And I think everyone was well intentioned - and python3 covers most of the bases, but working with binary data is not only a "wire-protocol programmer's" problem. Needing a library to wrap bytesthing.format('ascii', 'surrogateescape') or some such thing makes python3 less approachable for those who haven't learned that yet - which was almost all of us at some point when we started programming.
I appreciate everyone's hard work - I'm confident the community will cross the 2-3 chasm and I hope we preserve the approachability I first came to love about Python when I started using it for all sorts of applications. thx m -- Matt Billenstein matt@vazor.com http://www.vazor.com/

On Wed, 08 Jan 2014 19:22:08 +0000, "Matt Billenstein" <matt@vazor.com> wrote:
I started in Python blissfully unaware of unicode - it was a different time for sure, but what I knew from C worked pretty much the same in Python - I could read some binary data out of a file, twiddle some bits, and write it back out again without any of these complexities - life was good and granted I was naive, but it made Python approachable for me and I enjoyed it. I stuck with it and learned about unicode and the complexities of encoding data and now I'm astonished at how many professional programmers don't know the slightest bit about it and how horribly munged some data you can consume on the web might be - I agree it's all quite a mess.
So now I'm getting more serious about Python3 and my fear is that the development community (python3) has fractured from the user community (python2) in that they've built something that solves their problems (to oversimplify lets say a webapp) - sure, a bunch of stuff got fixed along the way and we gave the users division they would expect (3/2 == 1.5), but somewhere what I felt
I believe this is a mis-perception. I think Python3 is *simpler* and *less complex* than Python2, both at the Python language level and at the CPython implementation level. (I'm using a definition of these terms that roughly works out to "easier to understand".) That was part of the point. Python3 is *easier* to use for new projects than Python2. I'm not speaking from theory here, I've written and worked on non-trivial new projects in both versions.[1]

It is true that in Python3 you *must* learn the difference between bytes and strings. But in the modern world, you had better learn to do that anyway, and learn to do it right up front. If you don't want to, I suppose you could stay stuck in an earlier age and keep using Python2.

It also is true that it would be nice to have a more convenient API for, as Antoine put it, interpolating into a binary stream. But really, the vast majority of programs have no need to do that. It is pretty much only the low level libraries, most of them dealing with data-interchange (wire protocols), that would use this.
was more like a hobbyist language has become big and complex and "we need to protect our users from doing the wrong thing."
As I just learned recently, Python was always intended to be a "real" programming language, and not a hobbyist language :) But it was also always meant to be easy to learn and use.

Python3's goal is to make it *easier* to do the *right* thing. The fact that in some cases it also makes it harder to do the wrong thing is mostly a consequence of making it easier to do the right thing. Python's philosophy is still one of "consenting adults", despite a few voices agitating for preventing users from shooting themselves in the foot. But making "the one obvious way to do it" easy, and consequently making the other ways harder, fits in to its overall philosophy just fine. As does trying to prevent the wrong thing from happening *by accident* (read: mojibake).

--David

[1] I also find it easier to maintain my python3 programs than I do my python2 programs, probably because I've gotten used to the convenience of the new Python3 features, and miss them when working in Python2.

[2] With perfect hindsight I think we'd have focused more right from the start on single-codebase, rather than on 2to3; but perfect hindsight doesn't do you any good when it comes to foresight.

Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files. Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?

I have lots of little utilities to help me with day to day stuff like this. One fine morning I decided to start using Python 3 for the job. Imagine my surprise when it turned out to make my job more complicated, not easier. Suddenly I had to start thinking about stuff that hadn't mattered at all, and still didn't really matter. All it did was complicate things for no benefit.

Python forcing you to think about this is like the cashier at the hardware store who won't let you buy the hammer you brought to the cash register because you don't know what wood its handle is made of. Sure, Python should make it easier to do the *right* thing. That's equivalent to placing the indicator selector at a convenient place near the steering wheel. What it shouldn't do is make the flashing of the indicator mandatory whenever you turn the wheel.

All of this talk is positive, though. The fact that these topics have finally reached the halls of python-dev is an indication that people out there are _trying_ to move to 3.3 :)

Cheers, K ________________________________________ From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com@python.org] on behalf of R. David Murray [rdmurray@bitdance.com] Sent: Wednesday, January 08, 2014 21:29 To: python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...) ... It is true that in Python3 you *must* learn the difference between bytes and strings. But in the modern world, you had better learn to do that anyway, and learn to do it right up front. If you don't want to, I suppose you could stay stuck in an earlier age and keep using Python2. ...
Python3's goal is to make it *easier* to do the *right* thing. The fact that in some cases it also makes it harder to do the wrong thing is mostly a consequence of making it easier to do the right thing.

On 8 January 2014 20:04, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files. Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
Kristján, the answer is obviously "yes you must" :-)

Hi,
Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses.
You can open a text file using ascii + surrogateescape, or just open the file in binary. Victor
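Victor's two options can be sketched concretely. This is a minimal example, not from the thread; the file name and its contents are invented:

```python
# Write a file containing one byte (0xfa) that is not valid ASCII.
with open("phones.txt", "wb") as f:
    f.write(b"555-1234,J\xfansson\n")

# Option 1: decode as ascii with surrogateescape. Each undecodable byte
# becomes a lone surrogate (0xfa -> U+DCFA) and round-trips unchanged.
with open("phones.txt", encoding="ascii", errors="surrogateescape") as f:
    text = f.read()
assert text.encode("ascii", "surrogateescape") == b"555-1234,J\xfansson\n"

# Option 2: stay in binary and work on bytes directly.
with open("phones.txt", "rb") as f:
    data = f.read()
assert data.split(b",")[0] == b"555-1234"
```

Either way the phone number is extractable without ever knowing what 0xfa was supposed to mean.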

On Wed, 08 Jan 2014 22:04:56 +0000, <kristjan@ccpgames.com> wrote:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files. Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
Why *do* you care? Isn't your system configured for utf-8, and all your .txt files encoded with utf-8 by default? Or at least configured with a single consistent encoding? If that's the case, Python3 doesn't make you think about the encoding. Knowing the right encoding is different from needing to know the difference between text and bytes; you only need to worry about encodings when your system isn't configured consistently to begin with. If you do have to care, your little utilities only work by accident in Python2, and must have produced mojibake when the encoding was wrong, unless I'm completely confused. So yeah, sorting that out is harder if you were just living with the mojibake before...but if so I'm surprised you haven't wanted to fix that before this. --David

Just to avoid confusion, let me state up front that I am very well aware of encodings and all that, having internationalized one largish app in python 2.x. I know the problems that 2.x had with tracking down the source of errors and understand the beautiful concept of encodings on the boundary.

However: For a lot of data processing and tools, encoding isn't an issue. Either you assume ascii, or you're working with something like latin1. A single byte encoding. This is because you're working with a text file that _you_ wrote. And you're not assigning any semantics to the characters. If there is actual "text" in there it is just English, not Norwegian or Turkish. A byte read at code 0xfa doesn't mean anything special. It's just that, a byte with that value. The file system doesn't have any default encoding. A file on disk is just a file on disk consisting of bytes. There can never be any wrong encoding, no mojibake.

With python 2, you can read that file into a string object. You can scan for your field delimiter, e.g. a comma, split up your string, interpolate some binary data, spit it out again. All without ever thinking about encodings. Even though the file is conceptually encoded in something, if you insist on attaching a particular semantic meaning to every ordinal value, whatever that meaning is is in many cases irrelevant to the program.

I understand that surrogateescape allows you to do this. But it is an awkward extra step and forces an extra layer of needless semantics on to that guy that just wants to read a file. Sure, vegetarians and people with allergies like to read the list of ingredients on everything that they eat. But others are just omnivores and want to be able to eat whatever is on the table, and not worry about what it is made of. And yes, you can read the file in binary mode, but then you end up with those bytes objects that we have just found are tedious to work with.
So, what I'm saying is that at least I have a very common use case that has just become a) more confusing (having to needlessly derail the train of thought about the data processing to be done by thinking about text encodings) and b) more complicated. Not sure if there is anything to be done about it though :)

I think there might be a different analogy: Having to specify an encoding is like having strong typing. In Python 2.7, we _can_ forego that and just duck-type our strings :)

K ________________________________________ From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com@python.org] on behalf of R. David Murray [rdmurray@bitdance.com] Sent: Wednesday, January 08, 2014 23:40 To: python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...) Why *do* you care? Isn't your system configured for utf-8, and all your .txt files encoded with utf-8 by default? Or at least configured with a single consistent encoding? If that's the case, Python3 doesn't make you think about the encoding. Knowing the right encoding is different from needing to know the difference between text and bytes; you only need to worry about encodings when your system isn't configured consistently to begin with. If you do have to care, your little utilities only work by accident in Python2, and must have produced mojibake when the encoding was wrong, unless I'm completely confused. So yeah, sorting that out is harder if you were just living with the mojibake before...but if so I'm surprised you haven't wanted to fix that before this.

On 09/01/2014 00:12, Kristján Valur Jónsson wrote:
Just to avoid confusion, let me state up front that I am very well aware of encodings and all that, having internationalized one largish app in python 2.x. I know the problems that 2.x had with tracking down the source of errors and understand the beautiful concept of encodings on the boundary.
However: For a lot of data processing and tools, encoding isn't an issue. Either you assume ascii, or you're working with something like latin1. A single byte encoding. This is because you're working with a text file that _you_ wrote. And you're not assigning any semantics to the characters. If there is actual "text" in there it is just english, not Norwegian or Turkish. A byte read at code 0xfa doesn't mean anything special. It's just that, a byte with that value. The file system doesn't have any default encoding. A file on disk is just a file on disk consisting of bytes. There can never be any wrong encoding, no mojibake.
With python 2, you can read that file into a string object. You can scan for your field delimiter, e.g. a comma, split up your string, interpolate some binary data, spit it out again. All without ever thinking about encodings.
Even though the file is conceptually encoded in something, if you insist on attaching a particular semantic meaning to every ordinal value, whatever that meaning is is in many cases irrelevant to the program.
I understand that surrogateescape allows you to do this. But it is an awkward extra step and forces an extra layer of needless semantics on to that guy that just wants to read a file. Sure, vegetarians and people with allergies like to read the list of ingredients on everything that they eat. But others are just omnivores and want to be able to eat whatever is on the table, and not worry about what it is made of. And yes, you can read the file in binary mode, but then you end up with those bytes objects that we have just found are tedious to work with.
All I can say is that I've been using python 3 for years and wouldn't know what a surrogateescape was if you were to hit me around the head with it. I open my files, I process them, and Python kindly closes them for me via a context manager. So if you're not bothered about encoding, where has the "awkward extra step and forces an extra layer of needless semantics" bit come from? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

On Thu, 09 Jan 2014 00:12:57 +0000, <kristjan@ccpgames.com> wrote:
I think there might be a different analogy: Having to specify an encoding is like having strong typing. In Python 2.7, we _can_ forego that and just duck-type our strings :)
Python is a strongly typed language. Saying that python2 let you duck type bytestrings (ie: postpone the decision as to what encoding they were in until the last minute) is an interesting perspective...but as we know it led to many many program bugs. Which were the result, essentially, of a failure to strongly type the string and bytes types the way other python types are strongly typed. However, I do now understand your use case better, even though I wouldn't myself write programs like that. Or, rather, I make sure all my files are in the same encoding (utf-8). I suppose that this is because I, as an English-speaking USAian, came late to the need for non-ascii characters, after utf-8 was already well established. The rest of the world didn't have that luxury. --David

On Wed, 8 Jan 2014, Kristján Valur Jónsson wrote:
Believe it or not, sometimes you really don't care about encodings.
Sometimes you just want to parse text files. Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
Mostly staying out of this, but I need to say something here. If you don't know what encoding the file has, you don't know what bytes correspond to phone numbers. So yes, you must care, or else you simply cannot write your code.

Of course, in practice it's probably encoded in an ASCII-compatible encoding, so '0' encodes as the single byte 0x30. Whether it's UTF-8, ISO-8859-1, or something else that is ASCII-compatible doesn't really matter. So, as a practical matter, you can just use ISO-8859-1, even though in principle this is totally wrong. Then ASCII is one byte per character as you expect, and all other bytes will round-trip unchanged. Just don't do any non-trivial processing on non-ASCII characters.

I don't see how it could be made any simpler without going back to making it easy for people to pretend the issue doesn't exist at all and bringing back the attendant confusion and problems.
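The ISO-8859-1 trick works because latin-1 maps every byte value 0x00-0xFF to the code point with the same number, so decoding never fails and re-encoding restores the bytes exactly. A quick sketch (the data is invented):

```python
raw = b"tel: 555-0100 \xfa\x93 end"   # ASCII with some arbitrary high bytes

# Decoding as latin-1 cannot fail; ASCII stays one character per byte.
text = raw.decode("latin-1")
assert "555-0100" in text

# Re-encoding gives back the original bytes unchanged.
assert text.encode("latin-1") == raw
```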
I have lots of little utilities to help me with day to day stuff like this. One fine morning I decided to start using Python 3 for the job. Imagine my surprise when it turned out to make my job more complicated, not easier. Suddenly I had to start thinking about stuff that hadn't mattered at all, and still didn't really matter. All it did was complicate things for no benefit. [....]
All of this talk is positive, though. The fact that these topics have finally reached the halls of python-dev is an indication that people out there are _trying_ to move to 3.3 :)
Agreed. Isaac Morland CSCF Web Guru DC 2619, x36650 WWW Software Specialist

Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding. Python should not guess the encoding if it's unknown. Without the right encoding, you don't get text, you get partial or complete gibberish. So, if what you want is to parse text and not get gibberish, you need to *tell* Python what the encoding is. That's a brute fact of the world of text in computing.
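The point that bytes only become text through an encoding can be shown in a couple of lines: the very same byte decodes to different characters under different codecs (the byte values here are arbitrary examples):

```python
raw = b"Jos\xe9"   # three ASCII bytes plus one high byte, 0xe9

assert raw.decode("latin-1") == "Jos\xe9"    # 0xe9 is 'é' in latin-1
assert raw.decode("cp1251") == "Jos\u0439"   # ...but 'й' in Cyrillic cp1251

# Under UTF-8, a lone 0xe9 is simply an invalid sequence.
try:
    raw.decode("utf-8")
    ok = True
except UnicodeDecodeError:
    ok = False
assert not ok
```

Without being told the encoding, Python has no way to know which of those readings was meant.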
Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
Yes, you must.
Python forcing you to think about this is like the cashier at the hardware store who won't let you buy the hammer you brought to the cash register because you don't know what wood its handle is made of.
The cashier is making a mistake: the hammer, regardless of the wood in the handle, still functions just fine as a hammer. Hence, the question is unimportant to the purpose. The same is not true of changing the encoding for text. The encoding matters, and the programmer needs to care. -- \ “How wonderful that we have met with a paradox. Now we have | `\ some hope of making progress.” —Niels Bohr | _o__) | Ben Finney

On 2014-01-09 00:07, Ben Finney wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding.
Python should not guess the encoding if it's unknown. Without the right encoding, you don't get text, you get partial or complete gibberish.
So, if what you want is to parse text and not get gibberish, you need to *tell* Python what the encoding is. That's a brute fact of the world of text in computing.
Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
Yes, you must.
Python forcing you to think about this is like the cashier at the hardware store who won't let you buy the hammer you brought to the cash register because you don't know what wood its handle is made of.
The cashier is making a mistake: the hammer, regardless of the wood in the handle, still functions just fine as a hammer. Hence, the question is unimportant to the purpose.
On the other hand: "I need a new battery." "What kind of battery?" "I don't care!"
The same is not true of changing the encoding for text. The encoding matters, and the programmer needs to care.

On 09/01/2014 00:21, MRAB wrote:
"I need a new battery."
"What kind of battery?"
"I don't care!"
A neat summary of the draft requirements specification for Python 2.8. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

MRAB <python@mrabarnett.plus.com> writes:
On 2014-01-09 00:07, Ben Finney wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
Yes, you must.
Python forcing you to think about this is like the cashier at the hardware store who won't let you buy the hammer you brought to the cash register because you don't know what wood its handle is made of.
The cashier is making a mistake: the hammer, regardless of the wood in the handle, still functions just fine as a hammer. Hence, the question is unimportant to the purpose.
On the other hand:
"I need a new battery."
"What kind of battery?"
"I don't care!"
That's a much better analogy. The customer may not care, but the question is essential and must be answered; if the supplier guesses what the customer wants, they are doing the customer a disservice. If the customer insists the supplier just give them a battery which will work regardless of what type of battery the device requires, the *customer is wrong*. Such customers need to be educated about the necessity to care about details they may have no interest in, if they want to get their device working reliably. We can all work toward a world where there is just one encoding which works for all text and no other encodings to confuse the matter. Until then, everyone needs to deal with the world as it is. (good sigmonster, have a cookie) -- \ “Ours is a world where people don't know what they want and are | `\ willing to go through hell to get it.” —Donald Robert Perry | _o__) Marquis | Ben Finney

Ben Finney writes:
That's a much better analogy. The customer may not care, but the question is essential and must be answered; if the supplier guesses what the customer wants, they are doing the customer a disservice.
It is a much better analogy for me on my desktop, and for programmers working for global enterprises, too. It is not for Kristján, nor for many other American, European, and yes, even Australian programmers. You're making the same kind of mistake he is (although I personally benefit from your mistake, and have suffered for decades from his :-). Diff'rent folks, diff'rent strokes. It would be nice if we could serve both use cases *by default*. We haven't found the way yet, that's all.

On Thu, Jan 9, 2014 at 11:21 AM, MRAB <python@mrabarnett.plus.com> wrote:
On the other hand:
"I need a new battery."
"What kind of battery?"
"I don't care!"
Or, bringing it back to Python: How do you write a set out to a file?

foo = {1, 2, 4, 8, 16, 32}
open("foo.txt", "w").write(foo)  # Uh... nope!

I don't want to have to worry about how it's formatted! I just want to write that set out and have someone read it in later!

A text string is just as abstract as any other complex type. For some reason, we've grown up thinking that "ABCD" == \x41\x42\x43\x44 == "ABCD", even though it's just as logical for those bytes to represent 12.1414 or 1094861636 or 1145258561. There's no difference between encoding one thing to bytes and encoding another thing to bytes, and it's critical to get those encodes/decodes right.

ChrisA
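Those three numbers can be checked with the standard struct module: the four bytes 0x41 0x42 0x43 0x44 really are "ABCD", 1094861636, 1145258561, or roughly 12.1414, depending entirely on how you choose to interpret them. A short sketch:

```python
import struct

raw = b"ABCD"    # the bytes 0x41 0x42 0x43 0x44

assert raw.decode("ascii") == "ABCD"              # as ASCII text
assert struct.unpack(">I", raw)[0] == 1094861636  # as a big-endian uint32
assert struct.unpack("<I", raw)[0] == 1145258561  # as a little-endian uint32
assert abs(struct.unpack(">f", raw)[0] - 12.1414) < 1e-3  # as a big-endian float32
```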

Still playing the devil's advocate: I didn't used to must. Why must I must now? Did the universe just shift when I fired up python3? Things were demonstrably working just fine before without doing so. K ________________________________________ From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com@python.org] on behalf of Ben Finney [ben+python@benfinney.id.au] Sent: Thursday, January 09, 2014 00:07 To: python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity" Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
Yes, you must.

Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
I didn't used to must. Why must I must now? Did the universe just shift when I fired up python3?
In a sense, yes. The world of software has been shifting for decades, as a result of broader changes in how different segments of humanity have changed their interactions, and thereby changed their expectations of what computers can do with their data. While for some programmers, in past decades, it used to be reasonable to stick one's head in the sand and ignore all encodings except one privileged local encoding, that is no longer reasonable today. As a result, it is incumbent on any programmer working with text to care about text encodings. You've likely already seen it, but the point I'm making is better made in this essay <URL:http://www.joelonsoftware.com/articles/Unicode.html>. -- \ 己所不欲、勿施于人。 | `\ (What is undesirable to you, do not do to others.) | _o__) —孔夫子 Confucius, 551 BCE – 479 BCE | Ben Finney

-----Original Message----- From: Python-Dev [mailto:python-dev- bounces+kristjan=ccpgames.com@python.org] On Behalf Of Ben Finney Sent: 9. janúar 2014 00:50 To: python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity"
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
I didn't used to must. Why must I must now? Did the universe just shift when I fired up python3?
In a sense, yes. The world of software has been shifting for decades, as a result of broader changes in how different segments of humanity have changed their interactions, and thereby changed their expectations of what computers can do with their data.
Do I speak Chinese to my grocer because China is a growing force in the world? Or start every discussion with my children with a negotiation on what language to use?

I get all the talk about Unicode, and interoperability and foreign languages and the world (I'm Icelandic, after all.) The point I'm trying to make, and which I think you are missing, is this: A tool that I have been happily using on my own system, to my own ends (I'm not writing international spam posts or hosting a United Nations election, but parsing and writing config.ini files, say) just became harder to use for that purpose. I think I'm not the only one to realize this; otherwise, PEP 460 wouldn't be there.

Anyway, I'll duck out now *ducks* K

just became harder to use for that purpose.
The entire discussion reminds me very much of the situation with file names in OS X. Whenever I want to look at an old zip file or tarball which happens to have been lying around on my hard drive for a decade or more, I can't, because OS X insists that file names be encoded in UTF-8 and just throws errors if that requirement is not met. And certainly I cannot be required to re-encode all files to the then-favored encoding continually - although the favored encoding doesn't change often, and I'm willing to bet that UTF-8 is here to stay, it has already happened twice in my active computer life (DOS -> latin-1 -> UTF-8). Going back to the old tarballs, OS X is completely useless for handling them as a result of their encoding decision, and I have to move to a Linux machine, which just does not care about encodings.

PS: I was very relieved to find out that os.listdir() - just to pick one file name-related function - will still return bytes if requested, as it is not at all uncommon (at least for me) to have conflicting file name encodings in different parts of a filesystem.
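The bytes behaviour of os.listdir() mentioned in that PS can be sketched as follows. This assumes a POSIX filesystem, where file names are raw bytes (OS X would reject the non-UTF-8 name); the directory and file name are invented for the example:

```python
import os
import tempfile

d = tempfile.mkdtemp()
# Create a file whose name is not valid UTF-8.
weird = os.path.join(d.encode(), b"old-\xfa.txt")
open(weird, "wb").close()

# Passing bytes to os.listdir returns the names as bytes, untouched.
names = os.listdir(d.encode())
assert b"old-\xfa.txt" in names

# Passing str returns str; undecodable bytes are smuggled through as
# surrogates (PEP 383), so os.fsencode still recovers the original name.
names_str = os.listdir(d)
assert any(os.fsencode(n) == b"old-\xfa.txt" for n in names_str)
```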

-----Original Message----- From: Python-Dev [mailto:python-dev- bounces+kristjan=ccpgames.com@python.org] On Behalf Of Stefan Ring Sent: 9. janúar 2014 09:32 To: python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity"
just became harder to use for that purpose.
The entire discussion reminds me very much of the situation with file names in OS X. Whenever I want to look at an old zip file or tarball which happens to have been lying around on my hard drive for a decade or more, I can't, because OS X insists that file names be encoded in UTF-8 and just throws errors if that requirement is not met. And certainly I cannot be required to re-encode all files to the then-favored encoding continually - although the favored encoding doesn't change often, and I'm willing to bet that UTF-8 is here to stay, it has already happened twice in my active computer life (DOS -> latin-1 -> UTF-8).
Well, yes. Also, the problem I'm describing has to do with real world stuff. This is the python 2 program:

with open(fn1) as f1:
    with open(fn2, 'w') as f2:
        f2.write(process_text(f1.read()))

Moving to python 3, I found that this quickly caused problems. So, I explicitly added an encoding. Better to guess an encoding that is likely, e.g. cp1252:

with open(fn1, encoding='cp1252') as f1:
    with open(fn2, 'w', encoding='cp1252') as f2:
        f2.write(process_text(f1.read()))

This mostly worked. But then, with real world data, sometimes we found that even files we declared to be cp1252 sometimes contained invalid code points. Was the file really in cp1252? Or did someone mess up somewhere? Or simply take a little poetic license with the specification? This is when it started to become annoying. I mean, clearly something was broken at some point, or I don't know the exactly correct encoding of the file. But this is not the place to correct that mistake. I want my program to be robust towards such errors. And these errors exist. So, the third version was:

with open(fn1, 'rb') as f1:
    with open(fn2, 'wb') as f2:
        f2.write(process_bytes(f1.read()))

This works, but now I have a bytes object which is rather limited in what it can do. Also, all string constants in my process_bytes() function have to be b'foo', rather than 'foo'. Only much later did I learn about 'surrogateescape'. How is a new user to python to know about it? The final version would probably be this:

with open(fn1, encoding='cp1252', errors='surrogateescape') as f1:
    with open(fn2, 'w', encoding='cp1252', errors='surrogateescape') as f2:
        f2.write(process_text(f1.read()))

Will this always work? I don't know. I hope so. But it seems very verbose when all you want to do is munge on some bytes. And the 'surrogateescape' error handler is not something that a newcomer to the language, or someone coming from python2, is likely to automatically know about. Could this be made simpler?
What if we had an encoding that combined 'ascii' and 'surrogateescape'? Something that allowed you to read ascii text with unknown high-order bytes without this unneeded verbosity? Something that would be immediately obvious to the newcomer? K

On 9 January 2014 10:15, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Also, the problem I'm describing has to do with real world stuff. This is the python 2 program:

    with open(fn1) as f1:
        with open(fn2, 'w') as f2:
            f2.write(process_text(f1.read()))
Moving to python 3, I found that this quickly caused problems.
You don't say what problems, but I assume encoding/decoding errors. So the files apparently weren't in the system encoding. OK, at that point I'd probably say to heck with it and use latin-1. Assuming I was sure that (a) I'd never hit a non-ascii compatible file (e.g., UTF16) and (b) I didn't have a decent means of knowing the encoding. One thing that genuinely is difficult is that because disk files don't have any out-of-band data defining their encoding, it *can* be hard to know what encoding to use in an environment where more than one encoding is common. But this isn't really a Python issue - as I say, I've hit it with GNU tools, and I've had to explain the issue to colleagues using Java on many occasions. The key difference is that with grep, people blame the file, whereas with Python people blame the language :-) (Of course, with Java, people expect this sort of problem so they blame the perverseness of the universe as a whole... ;-)) Paul.

-----Original Message----- From: Paul Moore [mailto:p.f.moore@gmail.com] Sent: 9. janúar 2014 10:53 To: Kristján Valur Jónsson Cc: Stefan Ring; python-dev@python.org
Moving to python 3, I found that this quickly caused problems.
You don't say what problems, but I assume encoding/decoding errors. So the files apparently weren't in the system encoding. OK, at that point I'd probably say to heck with it and use latin-1. Assuming I was sure that (a) I'd never hit a non-ascii compatible file (e.g., UTF16) and (b) I didn't have a decent means of knowing the encoding. Right. But even latin-1, or better, cp1252 (on windows) does not solve it because these have undefined code points. So you need 'surrogateescape' error handling as well. Something that I didn't know at
the time, having just come from python 2 and knowing its Unicode model well.
One thing that genuinely is difficult is that because disk files don't have any out-of-band data defining their encoding, it *can* be hard to know what encoding to use in an environment where more than one encoding is common. But this isn't really a Python issue - as I say, I've hit it with GNU tools, and I've had to explain the issue to colleagues using Java on many occasions. The key difference is that with grep, people blame the file, whereas with Python people blame the language :-) (Of course, with Java, people expect this sort of problem so they blame the perverseness of the universe as a whole... ;-))
Which reminds me, can Python3 read text files with BOM automatically yet? K

On 9 January 2014 13:00, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
You don't say what problems, but I assume encoding/decoding errors. So the files apparently weren't in the system encoding. OK, at that point I'd probably say to heck with it and use latin-1. Assuming I was sure that (a) I'd never hit a non-ascii compatible file (e.g., UTF16) and (b) I didn't have a decent means of knowing the encoding. Right. But even latin-1, or better, cp1252 (on windows) does not solve it because these have undefined code points. So you need 'surrogateescape' error handling as well. Something that I didn't know at the time, having just come from python 2 and knowing its Unicode model well.
    >>> bin = bytes(range(256))
    >>> bin
    b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
    >>> bin.decode('latin-1')
    '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
No undefined bytes there. If you mean that latin-1 can't encode all of the Unicode code points, then how did those code points get in there? Presumably you put them in, and so you're not just playing with the ASCII text parts. And you *do* need to understand encodings.
One thing that genuinely is difficult is that because disk files don't have any out-of-band data defining their encoding, it *can* be hard to know what encoding to use in an environment where more than one encoding is common. But this isn't really a Python issue - as I say, I've hit it with GNU tools, and I've had to explain the issue to colleagues using Java on many occasions. The key difference is that with grep, people blame the file, whereas with Python people blame the language :-) (Of course, with Java, people expect this sort of problem so they blame the perverseness of the universe as a whole... ;-))
Which reminds me, can Python3 read text files with BOM automatically yet?
If by "automatically" you mean "reads the BOM and chooses an appropriate encoding based on it" then I don't know, but I suspect not. But unless you're worried about 2-byte encodings (see! you need to understand encodings again!) latin-1 will still work. It sounds to me like what you *really* want is something that autodetects encodings on Windows in the same sort of way as other Windows tools like Notepad does. That's a fair thing to want, but no, Python doesn't provide it (nor did Python 2). I suspect that it would be possible to write a codec to do this, though. Maybe there's even one on PyPI. Paul
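[Editor's note: a sketch of the kind of BOM-sniffing helper Paul alludes to; `detect_bom` and `open_text` are invented names, not stdlib APIs, and only UTF-8/UTF-16 BOMs are handled.]

```python
import codecs

# 'utf-8-sig' strips the UTF-8 BOM on decode; the plain 'utf-16'
# codec consumes its own BOM and picks the right byte order.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def detect_bom(raw, default='cp1252'):
    """Return a codec name based on a leading BOM, else *default*."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default

def open_text(path, default='cp1252'):
    # Peek at the first few bytes, then reopen with the detected codec.
    with open(path, 'rb') as f:
        enc = detect_bom(f.read(4), default)
    return open(path, encoding=enc)
```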

On Thu, Jan 09, 2014 at 01:00:59PM +0000, Kristján Valur Jónsson wrote:
Which reminds me, can Python3 read text files with BOM automatically yet?
I'm not sure what you mean by that. If you mean, can Python3 distinguish between UTF-16BE and UTF-16LE on the basis of a BOM, then it's been able to do that for a long time:

    steve@orac:~$ hexdump sample-utf-16.txt
    0000000 feff 0048 0065 006c 006c 006f 0020 0057
    0000010 006f 0072 006c 0064 0021 000a 00a2 00a3
    0000020 00a7 2022 00b6 00df 03c0 2248 2206 000a
    0000030
    steve@orac:~$ python3.1 -c "print(open('sample-utf-16.txt', encoding='utf-16').read())"
    Hello World!
    ¢£§•¶ßπ≈∆

If you mean, "Will Python assume that the presence of bytes FEFF or FFFE at the start of a file means that it is encoded in UTF-16?", then as far as I know, the answer is "No":

    [steve@ando ~]$ python3.3 -c "print(open('sample-utf-16.txt').read())"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/local/lib/python3.3/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I wouldn't want it to guess the encoding by default. See the Zen about ambiguity.

-- Steven

On Thu, 9 Jan 2014 10:15:08 +0000 Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Moving to python 3, I found that this quickly caused problems. So, I explicitly added an encoding. Better to guess an encoding, something that is likely, e.g. cp1252:

    with open(fn1, encoding='cp1252') as f1:
        with open(fn2, 'w', encoding='cp1252') as f2:
            f2.write(process_text(f1.read()))
If you don't "care" about the encoding, why don't you use latin1? Things will roundtrip fine and work as well as under Python 2. Regards Antoine.

-----Original Message----- From: Python-Dev [mailto:python-dev- bounces+kristjan=ccpgames.com@python.org] On Behalf Of Antoine Pitrou Sent: 9. janúar 2014 12:42 To: python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity"
On Thu, 9 Jan 2014 10:15:08 +0000 Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Moving to python 3, I found that this quickly caused problems. So, I explicitly added an encoding. Better guess an encoding, something that is
likely, e.g. cp1252 with open(fn1, encoding='cp1252') as f1:
with open(fn2, 'w', encoding='cp1252') as f2: f2.write(process_text(f1.read())
If you don't "care" about the encoding, why don't you use latin1? Things will roundtrip fine and work as well as under Python 2.
Because latin1 does not define all code points, giving you errors there. Same with cp1252. Which is why you need 'surrogateescape' in addition. K

On Thu, 9 Jan 2014 12:55:35 +0000 Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
If you don't "care" about the encoding, why don't you use latin1? Things will roundtrip fine and work as well as under Python 2.
Because latin1 does not define all code points, giving you errors there.
    >>> b = bytes(range(256))
    >>> b.decode('latin1')
    '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
Not sure which errors you were getting? Regards Antoine.

-----Original Message----- From: Python-Dev [mailto:python-dev- bounces+kristjan=ccpgames.com@python.org] On Behalf Of Antoine Pitrou Sent: 9. janúar 2014 13:18 To: python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity"
On Thu, 9 Jan 2014 12:55:35 +0000 Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
If you don't "care" about the encoding, why don't you use latin1? Things will roundtrip fine and work as well as under Python 2.
Because latin1 does not define all code points, giving you errors there.
    >>> b = bytes(range(256))
    >>> b.decode('latin1')
    '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
You are right. I'm talking about "cp1252" which is the windows version thereof:
    >>> s = ''.join(chr(i) for i in range(256))
    >>> s.decode('cp1252')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode
        return codecs.charmap_decode(input,errors,decoding_table)
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 129: character maps to <undefined>
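[Editor's note: the same experiment on Python 3, as a sketch not from the thread, shows both halves of the story: cp1252 has unassigned bytes while latin-1 does not, and 'surrogateescape' papers over the gap.]

```python
# cp1252 leaves bytes 0x81, 0x8d, 0x8f, 0x90 and 0x9d unassigned, so a
# strict decode of the full byte range fails at position 129 (0x81)...
data = bytes(range(256))
try:
    data.decode('cp1252')
except UnicodeDecodeError as exc:
    assert exc.start == 129 and data[exc.start] == 0x81
else:
    raise AssertionError('cp1252 decoded the full range?')

# ...while latin-1 assigns all 256 bytes, so it round-trips cleanly.
assert data.decode('latin-1').encode('latin-1') == data

# With surrogateescape, the cp1252 decode round-trips too.
text = data.decode('cp1252', errors='surrogateescape')
assert text.encode('cp1252', errors='surrogateescape') == data
```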
This definition is funny, because according to Wikipedia, it is a "superset" of 8859-1 (latin1). See http://en.wikipedia.org/wiki/Cp1252 Also, see http://en.wikipedia.org/wiki/Latin1 There is confusion there. ISO 8859-1 does in fact not define the control codes in the range 128 to 159, whereas the Unicode page Latin 1 does. Strictly speaking, then, a Latin1 (or more specifically, ISO 8859-1) decoder should error on these characters. The 'Latin1' codec therefore is not a true 8859-1 codec. K

-----Original Message----- From: Python-Dev [mailto:python-dev- bounces+kristjan=ccpgames.com@python.org] On Behalf Of Kristján Valur Jónsson Sent: 9. janúar 2014 13:37 To: Antoine Pitrou; python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity"
This definition is funny, because according to Wikipedia, it is a "superset" of 8869-1 ( latin1) See http://en.wikipedia.org/wiki/Cp1252 Also, see http://en.wikipedia.org/wiki/Latin1
There is confusion there. The iso8859-1 does in fact not define the control codes in range 128 to 158, whereas the Unicode page Latin 1 does. Strictly speaking, then, a Latin1 (or more specifically, ISO8859-1) decoder should error on these characters. the 'Latin1' codec therefore is not a true 8859-1 codec.
See also: http://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) for the latin-1 supplement, not to be confused with 8859-1. The header of the 8859-1 page is telling: """ ISO/IEC 8859-1 From Wikipedia, the free encyclopedia (Redirected from Latin1) For the Unicode block also called "Latin 1", see Latin-1 Supplement (Unicode block). For the character encoding commonly mislabeled as "ISO-8859-1", see Windows-1252. """ K

2014/1/9 Kristján Valur Jónsson <kristjan@ccpgames.com>:
This definition is funny, because according to Wikipedia, it is a "superset" of 8859-1 (latin1)
Bytes 0x80..0x9f are unassigned in ISO/CEI 8859-1... but are assigned in (IANA's) ISO-8859-1. Python implements the latter, ISO-8859-1. Wikipedia says "This encoding is a superset of ISO 8859-1, but differs from the IANA's ISO-8859-1". Victor

-----Original Message----- From: Victor Stinner [mailto:victor.stinner@gmail.com] Sent: 9. janúar 2014 13:51 To: Kristján Valur Jónsson Cc: Antoine Pitrou; python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity"
2014/1/9 Kristján Valur Jónsson <kristjan@ccpgames.com>:
This definition is funny, because according to Wikipedia, it is a "superset" of 8859-1 (latin1)
Bytes 0x80..0x9f are unassigned in ISO/CEI 8859-1... but are assigned in (IANA's) ISO-8859-1.
Python implements the latter, ISO-8859-1.
Wikipedia says "This encoding is a superset of ISO 8859-1, but differs from the IANA's ISO-8859-1".
Thanks. That's entirely non-confusing :) "ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429." So anyway, yes, Python's "latin1" encoding does cover the entire 256 range. But on windows we use cp1252 instead, which does not, but instead defines useful and common windows characters in many of the control character slots. Hence the need for "surrogateescape" to be able to roundtrip characters. Again, this is non-obvious, and coming from my experience with cp1252, I had no way of guessing that the "subset", i.e. latin1, would indeed cover the full range. Two things, then, I have learned since my initial foray into parsing ascii files with python3: surrogateescape, and that "latin1 in python == IANA's ISO-8859-1, which does indeed define the whole 8 bit range". K

On 9 Jan 2014 22:25, "Kristján Valur Jónsson" <kristjan@ccpgames.com> wrote:
-----Original Message----- From: Victor Stinner [mailto:victor.stinner@gmail.com] Sent: 9. janúar 2014 13:51 To: Kristján Valur Jónsson Cc: Antoine Pitrou; python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity"
2014/1/9 Kristján Valur Jónsson <kristjan@ccpgames.com>:
This definition is funny, because according to Wikipedia, it is a "superset" of 8859-1 (latin1)
Bytes 0x80..0x9f are unassigned in ISO/CEI 8859-1... but are assigned in (IANA's) ISO-8859-1.
Python implements the latter, ISO-8859-1.
Wikipedia says "This encoding is a superset of ISO 8859-1, but differs
the IANA's ISO-8859-1".
Thanks. That's entirely non-confusing :) " ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429."
So anyway, yes, Python's "latin1" encoding does cover the entire 256 range. But on windows we use cp1252 instead which does not, but instead defines useful and common windows characters in many of the control character slots. Hence the need for "surrogateescape" to be able to roundtrip characters.
Again, this is non-obvious, and knowing from my experience with cp1252, I had no way of guessing that the "subset", i.e. latin1, would indeed cover all the range. Two things then I have learned since my initial foray into
from parsing ascii files with python3: Surrogateescapes and "latin1 in python == IANA's ISO-8859-1 which does indeed define the whole 8 bit range". http://python-notes.curiousefficiency.org/en/latest/python3/text_file_proces... is currently linked from the Unicode HOWTO. However, I'd be happy to offer it for direct inclusion to help make it more discoverable. Cheers, Nick.
K _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com

Thanks Nick. This does seem to cover it all. Perhaps it is worth mentioning cp1252 as the windows version of latin1, which _does_not_ cover all code points and hence requires surrogateescape for a best-effort solution. K ________________________________ From: Nick Coghlan [ncoghlan@gmail.com] Sent: Thursday, January 09, 2014 18:08 To: Kristján Valur Jónsson Cc: Victor Stinner; Antoine Pitrou; python-dev@python.org Subject: Re: [Python-Dev] Python3 "complexity" http://python-notes.curiousefficiency.org/en/latest/python3/text_file_proces... is currently linked from the Unicode HOWTO. However, I'd be happy to offer it for direct inclusion to help make it more discoverable.

This has all gotten a bit complicated because everyone has been thinking in terms of actual encodings and actual text files. But I think the use-case here is something different: a file with a bunch of bytes in it, _some_ of which are ascii, and the rest are other bytes (maybe binary data, maybe non-ascii-encoded text).

I think this is the use-case that "just worked" in py2, but doesn't in py3 -- i.e. in py3 you have to choose either the binary interpretation or the ascii one, but you can't have both. If you choose ascii, it will barf when you try to decode it; if you choose binary, you lose the ability to do simple stuff with the ascii subset -- parsing, substitution, etc.

Some folks have suggested using latin-1 (or another 8-bit encoding) -- is that guaranteed to work with any binary data, and round-trip accurately? And will surrogateescape work for arbitrary binary data?

If this is a common need, then it would be nice for py3 to address it. I know that I work with a couple of file formats that have text headers followed by binary data (not as hard to deal with, but still harder in py3). And from this discussion, it seems that "wire protocols" commonly mix ascii and binary.

So the decisions to be made: Is this a use-case worth supporting in the standard library? If so, how?

1) Add some of the basic stuff to the bytes object -- i.e. string formatting, which is what this all started with.
2) Create a custom encoding that could losslessly convert this mixture to/from a unicode object. I'm not sure if that is even possible, but it would be kind of cool.
3) Create a new object, neither a string nor a bytes object, that did what we want (it would look a lot like the py2 string...)
4) Create a module for doing the stuff wanted with a bytes object (not very OO).

Does that clarify the discussion at all?

On Thu, Jan 9, 2014 at 2:15 AM, Kristján Valur Jónsson < kristjan@ccpgames.com> wrote:
This is the python 2 program:

    with open(fn1) as f1:
        with open(fn2, 'w') as f2:
            f2.write(process_text(f1.read()))
I think the key point here is that this worked because a common case was ascii text and arbitrary binary mixed. As long as all the process_text() stuff is ascii only, that would work, either with arbitrary binary data or ascii-compatible encoding. The fact that it would NOT work with arbitrarily encoded data doesn't mean it's not useful for this special, but perhaps common, case. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Thu, 9 Jan 2014 13:36:05 -0800 Chris Barker <chris.barker@noaa.gov> wrote:
Some folks have suggested using latin-1 (or other 8-bit encoding) -- is that guaranteed to work with any binary data, and round-trip accurately?
Yes, it is.
and will surrogateescape work for arbitrary binary data?
Yes, it will. Regards Antoine.

On Thu, Jan 9, 2014 at 1:45 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
latin-1 guaranteed to work with any binary data, and round-trip accurately?
Yes, it is.
and will surrogateescape work for arbitrary binary data?
Yes, it will.
Then maybe this is really a documentation issue, after all. I know I learned something. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On 01/09/2014 02:00 PM, Chris Barker wrote:
On Thu, Jan 9, 2014 at 1:45 PM, Antoine Pitrou wrote:
Chris Barker wrote:
latin-1 guaranteed to work with any binary data, and round-trip accurately?
Yes, it is.
and will surrogateescape work for arbitrary binary data?
Yes, it will.
Then maybe this is really a documentation issue, after all.
I know I learned something.
If latin1 is used to convert binary to text, how convoluted is it to then take chunks of that text and convert to int, or some other variety of unicode? For example: b'\x01\x00\xd1\x80\xd1\x83\xd0\x80' If that were decoded using latin1 how would I then get the first two bytes to the integer 256 and the last six bytes to their Cyrillic meaning? (Apologies for not testing myself, short on time.) -- ~Ethan~

On 9 January 2014 22:08, Ethan Furman <ethan@stoneleaf.us> wrote:
For example: b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
If that were decoded using latin1 how would I then get the first two bytes to the integer 256 and the last six bytes to their Cyrillic meaning? (Apologies for not testing myself, short on time.)
I cannot conceive why you would. Slice the bytes then use struct.unpack on the first 2 bytes and decode on the last 6. We're talking about using latin1 for cases where you want to treat the text as essentially ascii (with a few bits of binary junk you want to ignore). Please don't take away the message that latin1 makes things "just like Python 2.X" - that's completely the wrong idea. Paul
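[Editor's note: Paul's slice-then-unpack suggestion, written out as a runnable sketch not from the thread; for illustration it assumes the six-byte tail is UTF-8-encoded Cyrillic.]

```python
import struct

# Slice the bytes, unpack the numeric prefix, decode the textual tail.
raw = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'

(num,) = struct.unpack('>h', raw[:2])   # big-endian signed short
assert num == 256

# The trailing six bytes happen to be valid UTF-8 Cyrillic.
tail = raw[2:].decode('utf-8')
assert tail == '\u0440\u0443\u0400'
```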

On 01/09/2014 02:54 PM, Paul Moore wrote:
On 9 January 2014 22:08, Ethan Furman wrote:
For example: b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
If that were decoded using latin1 how would I then get the first two bytes to the integer 256 and the last six bytes to their Cyrillic meaning? (Apologies for not testing myself, short on time.)
I cannot conceive why you would.
Sorry, I was too short with my example. My use case is binary files, with ASCII metadata and binary metadata, as well as ASCII-encoded numeric values, binary-coded numeric values, ASCII-encoded boolean values, and who-knows-what-(before checking the in-band metadata)-encoded text. I have to process all of it, and before we say "It's just a documentation issue" I want to make sure it /is/ just a documentation issue. -- ~Ethan~

On Thu, Jan 9, 2014 at 3:14 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Sorry, I was too short with my example. My use case is binary files, with ASCII metadata and binary metadata, as well as ASCII-encoded numeric values, binary-coded numeric values, ASCII-encoded boolean values, and who-knows-what-(before checking the in-band metadata)-encoded text. I have to process all of it, and before we say "It's just a documentation issue" I want to make sure it /is/ just a documentation issue.
As I am coming to understand it -- yes, using latin-1 would let you work with all that. You could decode the binary data using latin-1, which would give you a unicode object, which would:

1) act like ascii for ascii values, for the normal string operations, search, replace, etc, etc...
2) have a 1:1 mapping of indexes to bytes in the original.
3) be not-too-bad for memory and other performance (as I understand it py3 now has a cool unicode implementation that does not waste a lot of bytes for low codepoints)
4) would preserve the binary data that was not directly touched.

Though you'd still have to encode() to bytes to get chunks that could be used as binary -- i.e. passed to the struct module, or to a frombytes() or frombuffer() method of say numpy, or PIL or something...

But I'm no expert....

-Chris
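[Editor's note: a short sketch of that workflow, not from the thread; the sample bytes are invented for illustration.]

```python
import struct

# Decode with latin-1, edit the ASCII parts as text, re-encode, and
# hand binary slices to struct.
raw = b'\x01\x00\xd1\x80name=alice\xd0\x80'

text = raw.decode('latin-1')
assert len(text) == len(raw)            # 1:1 index mapping (point 2)

text = text.replace('alice', 'bob')     # ASCII string ops work (point 1)

out = text.encode('latin-1')            # untouched bytes survive (point 4)
assert out == b'\x01\x00\xd1\x80name=bob\xd0\x80'

(num,) = struct.unpack('>H', out[:2])   # back to bytes for binary APIs
assert num == 256
```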
-- ~Ethan~
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ chris.barker%40noaa.gov
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

latin1 is OK, but is it Pythonic? I've posted a suggestion about adding 'bytes' as an alias for 'latin1'. http://comments.gmane.org/gmane.comp.python.ideas/10315 I want one Pythonic way to handle "binary containing ascii (or latin1 or utf-8 or other ascii compatible)". On Fri, Jan 10, 2014 at 8:53 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Jan 9, 2014 at 3:14 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Sorry, I was too short with my example. My use case is binary files, with ASCII metadata and binary metadata, as well as ASCII-encoded numeric values, binary-coded numeric values, ASCII-encoded boolean values, and who-knows-what-(before checking the in-band metadata)-encoded text. I have to process all of it, and before we say "It's just a documentation issue" I want to make sure it /is/ just a documentation issue.
As I am coming to understand it -- yes, using latin-1 would let you work with all that. You could decode the binary data using latin-1, which would give you a unicode object, which would:
1) act like ascii for ascii values, for the normal string operations, search, replace, etc, etc...
2) have a 1:1 mapping of indexes to bytes in the original.
3) be not-too-bad for memory and other performance (as I understand it py3 now has a cool unicode implementation that does not waste a lot of bytes for low codepoints)
4) would preserve the binary data that was not directly touched.
Though you'd still have to encode() to bytes to get chunks that could be used as binary -- i.e. passed to the struct module, or to a frombytes() or frombuffer() method of say numpy, or PIL or something...
But I'm no expert....
-Chris
-- ~Ethan~
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ chris.barker%40noaa.gov
--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com
-- INADA Naoki <songofacandy@gmail.com>

INADA Naoki writes:
latin1 is OK but is it Pythonic?
Yes. EIBTI, including being explicit that you're doing something that has semantics that you are ignoring but may come back to bite you or somebody who naively uses your module. There's nothing un-Pythonic about using potentially dangerous idioms. We assume that you know what you are doing and either have taken measures to trap exceptional cases or are willing to accept the risk of an unhandled exception.
I've posted suggestion about add 'bytes' as a alias for 'latin1'.
Unpythonic. Such alternative names hide the fact that there are semantics that you may not want. Only the programmer can know whether it's safe. If you want an ascii-compatible and space-efficient representation that is safe even if the bytestream is something you don't expect, you need to do something like I proposed. If you don't need efficiency, (encoding='ascii', errors='surrogateescape') is the way to go. But these still don't provide convenient interpolation of binary data, as we discovered earlier.

INADA Naoki wrote:
latin1 is OK but is it Pythonic?
Latin is most certainly a Pythonic subject: http://www.youtube.com/watch?v=IIAdHEwiAy8 -- Greg

On 01/09/2014 02:54 PM, Paul Moore wrote:
On 9 January 2014 22:08, Ethan Furman <ethan@stoneleaf.us> wrote:
For example: b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
If that were decoded using latin1 how would I then get the first two bytes to the integer 256 and the last six bytes to their Cyrillic meaning? (Apologies for not testing myself, short on time.)
Please don't take away the message that latin1 makes things "just like Python 2.X" - that's completely the wrong idea.
Sure is!

    --> struct.unpack('>h', '\x01\x00')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' does not support the buffer interface

-- ~Ethan~

On Thu, Jan 9, 2014 at 2:54 PM, Paul Moore
For example: b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
If that were decoded using latin1 how would I then get the first two
bytes
to the integer 256 and the last six bytes to their Cyrillic meaning? (Apologies for not testing myself, short on time.)
I cannot conceive why you would. Slice the bytes then use struct.unpack on the first 2 bytes and decode on the last 6.
exactly.
We're talking about using latin1 for cases where you want to treat the text as essentially ascii (with a few bits of binary junk you want to ignore).
as so -- I want to replace a bit of ascii text surrounded by arbitrary binary: (apologies for the py2...)

    In [24]: b
    Out[24]: '\x01\x00\xd1\x80\xd1a name\xd0\x80'
    In [25]: u = b.decode('latin-1')
    In [26]: u2 = u.replace('a name', 'a different name')
    In [28]: b2 = u2.encode('latin-1')
    In [29]: b2
    Out[29]: '\x01\x00\xd1\x80\xd1a different name\xd0\x80'

-Chris
Please don't take away the message that latin1 makes things "just like Python 2.X" - that's completely the wrong idea.
Paul _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/chris.barker%40noaa.gov
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On 1/9/2014 6:25 PM, Chris Barker wrote:
as so -- I want to replace a bit of ascii text surrounded by arbitrary binary: (apologies for the py2...) In [24]: b Out[24]: '\x01\x00\xd1\x80\xd1a name\xd0\x80' In [25]: u = b.decode('latin-1') In [26]: u2 = u.replace('a name', 'a different name') In [28]: b2 = u2.encode('latin-1') In [29]: b2 Out[29]: '\x01\x00\xd1\x80\xd1a different name\xd0\x80'
Just to check, with 3.4:

    print(b'\x01\x00\xd1\x80\xd1a name\xd0\x80'
          .decode('latin-1')
          .replace('a name', 'a different name')
          .encode('latin-1')
          == b'\x01\x00\xd1\x80\xd1a different name\xd0\x80')
True
The b prefix works in 2.6/7, so this code does the same thing in 2.6+ and 3.x. -- Terry Jan Reedy

On Thu, Jan 09, 2014 at 02:08:57PM -0800, Ethan Furman wrote:
If latin1 is used to convert binary to text, how convoluted is it to then take chunks of that text and convert to int, or some other variety of unicode?
For example: b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
If that were decoded using latin1 how would I then get the first two bytes to the integer 256 and the last six bytes to their Cyrillic meaning? (Apologies for not testing myself, short on time.)
Not terribly convoluted, but there is some double-processing. When you know up-front that some data is non-text, you shouldn't convert it to text, otherwise you're just double-processing:

py> b = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
py> s = b.decode('latin1')
py> num, = struct.unpack('>h', s[:2].encode('latin1'))
py> assert num == 0x100

Better to just go straight from bytes to the struct, if you can:

py> struct.unpack('>h', b[:2])
(256,)

As for the last six bytes and "their Cyrillic meaning", which Cyrillic meaning did you have in mind?

py> s = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'.decode('latin1')
py> for encoding in "cp1251 ibm866 iso-8859-5 koi8-r koi8-u mac_cyrillic".split():
...     print(s[-6:].encode('latin1').decode(encoding))
...
СЂСѓРЂ
╤А╤Г╨А
бба
я─я┐п─
я─я┐п─
—А—Г–А

I understand that Cyrillic is an especially poor choice, since there are many incompatible Cyrillic code-pages. On the other hand, it's also an especially good example of how you need to know the encoding before you can make sense of the data.

Again, note that if you know the encoding you are intending to use is not Latin-1, decoding to Latin-1 first just ends up double-handling. If you can, it is best to split your data into fields up front, and then decode each piece once only.

-- Steven
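Steven's "split up front, decode each piece once" advice can be sketched as follows. (The choice of koi8-u for the text field is an assumption made purely for illustration; in real code you would use whatever codec the format specifies.)

```python
import struct

record = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'

# Split the record into its fields first...
num_field, text_field = record[:2], record[2:]

# ...then handle each field exactly once, with the right tool:
num, = struct.unpack('>h', num_field)   # big-endian signed 16-bit integer
text = text_field.decode('koi8-u')      # assumed codec, for illustration only

assert num == 0x100
```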

On Thu, Jan 9, 2014 at 5:00 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Jan 9, 2014 at 1:45 PM, Antoine Pitrou <solipsis@pitrou.net>wrote:
latin-1 guaranteed to work with any binary data, and round-trip accurately?
Yes, it is.
and will surrogateescape work for arbitrary binary data?
Yes, it will.
Then maybe this is really a documentation issue, after all.
I know I learned something.
I think the other issue is everyone is talking about keeping the data from the file in a single object. If you slice it up into pieces and decode the parts as necessary this also solves the issue. So if you had an HTTP header you could do::

    raw_header, body = data.split(b'\r\n\r\n')
    header = raw_header.decode('ascii')  # Or whatever HTTP headers are encoded in.

Now that might not easily solve the issue of the ASCII text interspersed (such as Kristján's "phone number in the middle of stuff" example), but it will deal with the problem. And if the numbers were separated with clean markers then this would probably still work.

On 9 January 2014 22:00, Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Jan 9, 2014 at 1:45 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
latin-1 guaranteed to work with any binary data, and round-trip accurately?
Yes, it is.
and will surrogateescape work for arbitrary binary data?
Yes, it will.
Then maybe this is really a documentation issue, after all.
Certainly, the idea that you can use the latin1 codec and you'll get the same sort of "ascii works and you can safely ignore the rest"[1] behaviour that you get in Python 2 is not well promoted, and is non-obvious.

Paul

[1] Where "safely" means "probably not as safely as you think, but I'll try not to nag you" :-) And of course you have to make sure you don't *add* any content that uses unicode characters beyond 255, or you get encoding errors. But you weren't going to do that, were you?
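Paul's footnote is easy to demonstrate: latin-1 round-trips anything you read, but the encode step fails the moment the string picks up a character above U+00FF.

```python
s = b'\x01\x00some ascii\xd0\x80'.decode('latin-1')

# Fine: nothing above U+00FF was introduced, so this encodes back losslessly.
s.replace('some ascii', 'other ascii').encode('latin-1')

# But add any character beyond latin-1's range and encoding blows up:
try:
    (s + '\u20ac').encode('latin-1')   # U+20AC (euro sign) is not in latin-1
    raise AssertionError('expected UnicodeEncodeError')
except UnicodeEncodeError:
    pass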

On 09.01.2014 22:45, Antoine Pitrou wrote:
On Thu, 9 Jan 2014 13:36:05 -0800 Chris Barker <chris.barker@noaa.gov> wrote:
Some folks have suggested using latin-1 (or other 8-bit encoding) -- is that guaranteed to work with any binary data, and round-trip accurately?
Yes, it is.
Just a word of caution: Using 'latin-1' to mean unknown encoding can easily result in Mojibake (unreadable text) entering your application with dangerous effects on your other text data.

E.g. "Marc-André" read using 'latin-1' when the string itself is encoded as UTF-8 will give you "Marc-AndrÃ©" in your application. (Yes, I see that a lot in applications and websites I use ;-))

Also note that indexing based on code points will likely break that way as well, i.e. if you pass an index to an application based on what you see in your editor or shell, those indexes can be wrong when used on the encoded data. UTF-8 is an example of a popular variable length encoding for Unicode, so you'll hit this problem whenever dealing with non-ASCII UTF-8 data.
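Both halves of this warning (silent Mojibake, and broken code-point indexing) can be shown in a few lines:

```python
name = 'Marc-André'
utf8_bytes = name.encode('utf-8')        # b'Marc-Andr\xc3\xa9'

# Decoding UTF-8 data with the wrong codec succeeds silently -- Mojibake:
garbled = utf8_bytes.decode('latin-1')
assert garbled == 'Marc-AndrÃ©'

# The indexing point: one visible character became two code points, so any
# index computed against the real text is now off by one.
assert len(name) == 10 and len(garbled) == 11
```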
and will surrogateescape work for arbitrary binary data?
Yes, it will.
The surrogateescape trick only works if you are encoding your work using the same encoding that you used for decoding it. Otherwise, you'll get a mix of the input encoding and the output encoding as output. Note that the error handler trick has an advantage over the latin-1 trick: if you try to encode a Unicode string with escape surrogates without using the error handler, it will fail, so you at least know that there are "funny" code points in your output string that need some extra care. BTW: Perhaps it would be a good idea to backport the surrogateescape error handler to Python 2.7 to simplify writing code which works in both Python 2 and 3. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jan 10 2014)
Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
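MAL's two points above (surrogateescape only round-trips through the same codec, and encoding fails loudly without the handler) can be checked directly:

```python
raw = b'abc\xff\xfe'   # \xff and \xfe can never appear in valid UTF-8

s = raw.decode('utf-8', errors='surrogateescape')
assert s == 'abc\udcff\udcfe'   # undecodable bytes become lone surrogates

# Round-trips through the same codec with the same error handler:
assert s.encode('utf-8', errors='surrogateescape') == raw

# Without the handler, encoding fails loudly -- you at least find out that
# there are "funny" code points in the string:
try:
    s.encode('utf-8')
    raise AssertionError('expected UnicodeEncodeError')
except UnicodeEncodeError:
    pass
```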

On 10 January 2014 12:19, M.-A. Lemburg <mal@egenix.com> wrote:
Just a word of caution:
Using the 'latin-1' to mean unknown encoding can easily result in Mojibake (unreadable text) entering your application with dangerous effects on your other text data.
Agreed. The latin-1 suggestion is purely for people who object to learning how to handle the encodings in their data more accurately. That's not a criticism; wanting to avoid getting sidetracked into understanding encodings when porting a personal script is a classic "practicality vs purity" situation. Current responses to people with encoding issues tend towards an idealistic "you should understand your data better" position, which while true in the abstract is not always what the requester wants to hear.

Paul.

On Fri, Jan 10, 2014 at 6:05 AM, Paul Moore <p.f.moore@gmail.com> wrote:
Using the 'latin-1' to mean unknown encoding can easily result in Mojibake (unreadable text) entering your application with dangerous effects on your other text data.
Agreed. The latin-1 suggestion is purely for people who object to learning how to handle the encodings in their data more accurately.
I'm not so sure -- it could be used (abused?) for that, but I'm suggesting it be used for mixed ascii-binary data. I don't know that there IS a "right" way to do that -- at least not an efficient or easy to read and write one.

-Chris

On 10/01/2014 22:06, Chris Barker wrote:
On Fri, Jan 10, 2014 at 6:05 AM, Paul Moore <p.f.moore@gmail.com <mailto:p.f.moore@gmail.com>> wrote:
> Using the 'latin-1' to mean unknown encoding can easily result > in Mojibake (unreadable text) entering your application with > dangerous effects on your other text data.
Agreed. The latin-1 suggestion is purely for people who object to learning how to handle the encodings in their data more accurately.
I'm not so sure -- it could be used (abused?) for that, but I'm suggesting it be used for mixed ascii-binary data. I don't know that there IS a "right" way to do that -- at least not an efficient or easy to read and write one.
-Chris
The correct way is to read the interface specification which tells you what should be in the data. Or do people not use interface specifications these days, preferring to guess what they've got instead? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

On Fri, Jan 10, 2014 at 3:22 PM, Mark Lawrence <breamoreboy@yahoo.co.uk>wrote:
The correct way is to read the interface specification which tells you what should be in the data. Or do people not use interface specifications these days, preferring to guess what they've got instead?
No one is suggesting guessing (OK, sometimes for what encoding text is in -- but that's when you already know it's text). But while some specs for mixed ascii and binary may specify which bytes are which, not all do -- there may be a read the file 'till you find this text, then the next n bytes are binary, or maybe the next bytes are binary until you get to this ascii text, etc... This is not guessing, but it does require working with an object which has both ascii text and binary in it -- and why shouldn't Python provide a reasonable way to work with that?

-Chris
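The "scan for ASCII markers, then slice out the binary" pattern Chris describes works directly on bytes objects, since they support find/index/startswith with ASCII literals. The layout and marker names below are invented purely for illustration:

```python
# Hypothetical record: an ASCII marker tells us where the binary payload
# starts, another marker where it ends.
data = b'HDR v1\nPAYLOAD:\x00\x9c\x01\x02\x03\x04END\n'

start = data.index(b'PAYLOAD:') + len(b'PAYLOAD:')
end = data.index(b'END', start)

payload = data[start:end]              # raw bytes, never decoded
header = data[:start].decode('ascii')  # only the part known to be ASCII

assert payload == b'\x00\x9c\x01\x02\x03\x04'
```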

On 01/10/2014 03:22 PM, Mark Lawrence wrote:
On 10/01/2014 22:06, Chris Barker wrote:
I'm not so sure -- it could be used (abused?) for that, but I'm suggesting it be used for mixed ascii-binary data. I don't know that there IS a "right" way to do that -- at least not an efficient or easy to read and write one.
The correct way is to read the interface specification which tells you what should be in the data.
Of course. The debate is about how to generate the data to the specs in an elegant manner. -- ~Ethan~

On 2014-01-10, 12:19 GMT, you wrote:
Using the 'latin-1' to mean unknown encoding can easily result in Mojibake (unreadable text) entering your application with dangerous effects on your other text data.
E.g. "Marc-André" read using 'latin-1' when the string itself is encoded as UTF-8 will give you "Marc-AndrÃ©" in your application. (Yes, I see that a lot in applications and websites I use ;-))
I am afraid that for most people, 'latin-1' is just another attempt to make the complexity of Unicode go away, and a way to ignore it.

Matěj

Now I feel it is a bad thing to encourage using unicode for binary data via the latin-1 encoding or the surrogateescape error handler. Handling binary data in the str type using latin-1 is just a hack. Surrogateescape is just a workaround to keep undecodable bytes in text. Encouraging binary data in the str type with latin-1 or surrogateescape means encouraging the mixing of binary and text data. That is worse than Python 2. So Python should encourage handling binary data in the bytes type.

On Fri, Jan 10, 2014 at 11:28 PM, Matěj Cepl <matej@ceplovi.cz> wrote:
On 2014-01-10, 12:19 GMT, you wrote:
Using the 'latin-1' to mean unknown encoding can easily result in Mojibake (unreadable text) entering your application with dangerous effects on your other text data.
E.g. "Marc-André" read using 'latin-1' when the string itself is encoded as UTF-8 will give you "Marc-AndrÃ©" in your application. (Yes, I see that a lot in applications and websites I use ;-))
I am afraid that for most 'latin-1' is just another attempt to make Unicode complexity go away and the way how to ignore it.
Matěj
-- INADA Naoki <songofacandy@gmail.com>

10.01.14 14:19, M.-A. Lemburg написав(ла):
BTW: Perhaps it would be a good idea to backport the surrogateescape error handler to Python 2.7 to simplify writing code which works in both Python 2 and 3.
You also should change the UTF-8 codec so that it will reject surrogates (i.e. u'\ud880'.encode('utf-8') and '\xed\xa2\x80'.decode('utf-8') should raise exceptions). And this will break much code.
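For contrast, Python 3's utf-8 codec already rejects lone surrogates in both directions; the change Serhiy describes would bring Python 2's codec in line with this strict behaviour:

```python
# Encoding a lone surrogate fails in Python 3:
try:
    '\ud880'.encode('utf-8')
    raise AssertionError('expected UnicodeEncodeError')
except UnicodeEncodeError:
    pass

# ...and so does decoding the UTF-8-style byte sequence for one:
try:
    b'\xed\xa2\x80'.decode('utf-8')   # would be surrogate U+D880
    raise AssertionError('expected UnicodeDecodeError')
except UnicodeDecodeError:
    pass
```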

On Thu, Jan 9, 2014 at 10:06 AM, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Do I speak Chinese to my grocer because china is a growing force in the world? Or start every discussion with my children with a negotiation on what language to use?
No, because your environment has a default language. And Python has a default encoding. You only get problems when some file doesn't use the default encoding. //Lennart

On 10 January 2014 13:32, Lennart Regebro <regebro@gmail.com> wrote:
On Thu, Jan 9, 2014 at 10:06 AM, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Do I speak Chinese to my grocer because china is a growing force in the world? Or start every discussion with my children with a negotiation on what language to use?
No, because your environment has a default language. And Python has a default encoding. You only get problems when some file doesn't use the default encoding.
Putting this here because I found out today it's not in any of the PEPs and folks have to go digging in mailing list archives to find it. I'll add it to my Python 3 Q&A at some point.

The reason Python 3 currently tries to rely on the POSIX locale encoding is that during the Python 3 development process it was pointed out that ShiftJIS, ISO-2022 and various CJK codecs are in widespread use in Asia, since Asian users needed solutions to the problem of representing kana, ideographs and other non-Latin characters long before the Unicode Consortium existed. This creates a problem for Python 3, as assuming utf-8 means we have a high risk of corrupting users' data at least in Asian locales, as well as anywhere else where non-UTF-8 encodings are common (especially when encodings that aren't ASCII compatible are involved).

While the Python 3 status quo on POSIX systems certainly isn't ideal, it at least means our most likely failure mode is an exception rather than silent data corruption. One of the major culprits for that is the antiquated POSIX/C locale, which reports ASCII as the system encoding. One idea we're considering for Python 3.5 is to have a report of "ascii" on a POSIX OS imply the surrogateescape error handler (at least for the standard streams, and perhaps in other contexts), since the OS reporting the POSIX/C locale almost certainly indicates a configuration error rather than intentional behaviour.

Cheers, Nick.
//Lennart
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan <ncoghlan@gmail.com> wrote:
One idea we're considering for Python 3.5 is to have a report of "ascii" on a POSIX OS imply the surrogateescape error handler (at least for the standard streams, and perhaps in other contexts), since the OS reporting the POSIX/C locale almost certainly indicates a configuration error rather than intentional behaviour.
On FreeBSD users apparently get the C locale by default. I don't think I've configured anything special during the install:

freebsd-amd64# adduser
Username: testuser
Full name:
Uid (Leave empty for default):
Login group [testuser]:
Login group is testuser. Invite testuser into other groups? []:
Login class [default]:
Shell (sh csh tcsh bash rbash nologin) [sh]:
Home directory [/home/testuser]:
Home directory permissions (Leave empty for default):
Use password-based authentication? [yes]: no
Lock out the account after creation? [no]:
Username  : testuser
Password  : <disabled>
Full Name :
Uid       : 1003
Class     :
Groups    : testuser
Home      : /home/testuser
Home Mode :
Shell     : /bin/sh
Locked    : no
OK? (yes/no): yes
adduser: INFO: Successfully added (testuser) to the user database.
Add another user? (yes/no): no
Goodbye!
freebsd-amd64# su - testuser
$ locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

Stefan Krah

Le 10/01/2014 16:35, Nick Coghlan a écrit :
One idea we're considering for Python 3.5 is to have a report of "ascii" on a POSIX OS imply the surrogateescape error handler (at least for the standard streams, and perhaps in other contexts), since the OS reporting the POSIX/C locale almost certainly indicates a configuration error rather than intentional behaviour.
would it make sense to be more general, and allow a "lenient mode", where all files implicitly opened with the default encoding would also use the surrogateescape error handler ? That way, applications designed to process text mostly written in the default encoding would just call sys.set_lenient_mode() and be done. Of course, libraries would need to be strongly discouraged to ever use this and encouraged to explicitly set the error handler on appropriate files instead. Cheers, Baptiste

10.01.14 18:27, Baptiste Carvello написав(ла):
would it make sense to be more general, and allow a "lenient mode", where all files implicitly opened with the default encoding would also use the surrogateescape error handler ?
The surrogateescape error handler is compatible only with ASCII-compatible encodings (i.e. no ShiftJIS, no UTF-16). It can't be used by default. But you can set PYTHONIOENCODING=:surrogateescape and get your default locale encoding with surrogateescape.
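The PYTHONIOENCODING trick Serhiy mentions (a leading colon keeps the locale's encoding and only sets the error handler; this empty-encoding form needs Python 3.4 or later) looks like this:

```shell
# Keep the locale's encoding for the standard streams, but add the
# surrogateescape error handler for this one process.
PYTHONIOENCODING=:surrogateescape python3 -c 'import sys; print(sys.stdout.errors)'
```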

On Fri, Jan 10, 2014 at 4:35 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 10 January 2014 13:32, Lennart Regebro <regebro@gmail.com> wrote:
No, because your environment have a default language. And Python has a default encoding. You only get problems when some file doesn't use the default encoding.
The reason Python 3 currently tries to rely on the POSIX locale encoding is that during the Python 3 development process it was pointed out that ShiftJIS, ISO-2022 and various CJK codec are in widespread use in Asia, since Asian users needed solutions to the problem of representing kana, ideographs and other non-Latin characters long before the Unicode Consortium existed.
This creates a problem for Python 3, as assuming utf-8 means we have a high risk of corrupting user's data at least in Asian locales, as well as anywhere else where non-UTF-8 encodings are common (especially when encodings that aren't ASCII compatible are involved).
From my experience, the concept of a default locale is deeply flawed. What if I log into a (Linux) machine using an old latin-1 putty from the Windows XP era, have most file names and contents in UTF-8 encoding, except for one directory where people from eastern Europe upload files via FTP in whatever encoding they choose. What should the "default" encoding be now?
That's why I make it a principle to always unset all LC_* and LANG variables, except when working locally, which happens rather rarely.

On 2014-01-10, 17:34 GMT, you wrote:
From my experience, the concept of a default locale is deeply flawed. What if I log into a (Linux) machine using an old latin-1 putty from the Windows XP era, have most file names and contents in UTF-8 encoding, except for one directory where people from eastern Europe upload files via FTP in whatever encoding they choose. What should the "default" encoding be now?
I know this stuff is really hard, and I only know it because I had to fight with it for years (being Czech, so not blessed by Latin-1 covering my language … actually no living encoding supports it completely, but that's mostly a theoretical issue … Latin-2 used to work for us, and now everybody with a civilized OS uses UTF-8 of course; not sure what the current state of MS Windows is). It seems to me that you have some fundamental principles muddled together.

a) Locale should always be set for the particular system. I.e., in your example above you have two variables only: the locale of your Windows XP and the locale of the Linux box.

b) I know for a fact that putty (even on Windows XP) CAN translate from UTF-8 on the server to whatever Windows has to offer. So, there is no such thing as a "latin-1 putty".

c) Responsibility for filenames on the system rests with whatever actually saves the file. So, in this testcase it is a matter of correctly setting up the FTP server (I see for example http://rhn.redhat.com/errata/RHBA-2012-0187.html and https://bugzilla.redhat.com/show_bug.cgi?id=638873 which seem to indicate that vsftpd, and what else would you use?, should support UTF-8 filenames). If the server locale supports Eastern European filenames and vsftpd supports translation to this encoding (hint, hint: UTF-8 does), then you are all set.
That's why I make it a principle to always unset all LC_* and LANG variables, except when working locally, which happens rather rarely.
That’s a bad idea. Those variables ALWAYS have some value set (perhaps a default, which tends to be something like en_US.ASCII, which is not what you want; fortunately on most Unices these days it would be en_US.UTF8 — the locale(1) command always gives some result).

Matěj

On Jan 10, 2014, at 7:35 AM, Nick Coghlan wrote:
Putting this here because I found out today it's not in any of the PEPs and folks have to go digging in mailing list archives to find it. I'll add it to my Python 3 Q&A at some point.
The reason Python 3 currently tries to rely on the POSIX locale encoding is that during the Python 3 development process it was pointed out that ShiftJIS, ISO-2022 and various CJK codec are in widespread use in Asia, since Asian users needed solutions to the problem of representing kana, ideographs and other non-Latin characters long before the Unicode Consortium existed.
Really? Because PEP 383 doesn't support and discourages the use of some of these codecs as a locale. -- Philip Jenvey

Kristján Valur Jónsson writes:
Still playing the devil's advocate: I didn't used to must. Why must I must now? Did the universe just shift when I fired up python3?
No. Go look at the Economist's tag cloud and notice how big "China" and "India" are most days. The universe has been shifting for 3 decades now, you just noticed it when you fired up Python 3.
Things were demonstrably working just fine before without doing so.
Who elected you General Secretary of the UN? Things were, and are still, demonstrably fucked up for the world at large. Python 3 is a big contribution to un-fucking the rest of us[1], thank you very much to Guido and Company! It's not obvious how to do things right for those of us who have to deal with 8-10 different encodings daily *on our desktops*, and still make things easy for those of you who rarely see ISO 8859/N for N != 1, let alone monstrosities like GB18030 or Shift JIS. That latter is a shame, but we're working on it (and have been all along -- it's not easy). Footnotes: [1] Or will be when my employer adopts it. <sigh/>

On 9 January 2014 10:22, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Still playing the devil's advocate: I didn't used to must. Why must I must now? Did the universe just shift when I fired up python3? Things were demonstrably working just fine before without doing so.
They were working fine for experienced POSIX users that had fully internalised the idiosyncrasies of that platform and didn't need to care about any other environment (like Windows or the JVM). Cheers, Nick.
K
________________________________________
From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com@python.org] on behalf of Ben Finney [ben+python@benfinney.id.au]
Sent: Thursday, January 09, 2014 00:07
To: python-dev@python.org
Subject: Re: [Python-Dev] Python3 "complexity"
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
Yes, you must.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, Jan 9, 2014 at 1:07 AM, Ben Finney <ben+python@benfinney.id.au> wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding.
To be honest, you can define text as "A stream of bytes that are split up in lines separated by a linefeed", and do some basic text processing like that. Just very *basic*, but still. Replacing characters. Extracting certain lines etc. This is harder in Python 3, as bytes does not have all the functionality strings have, like formatting. This can probably be fixed in Python 3.5, if the relevant PEP gets finished.

For the battery analogy, that's like saying:

"I want a battery."
"What kind?"
"It doesn't matter, as long as it's over 5V."

//Lennart
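The "stream of bytes split on linefeeds" style of processing Lennart describes does work in Python 3 without ever choosing an encoding, since bytes supports split/replace/startswith with bytes literals. The data below is made up for illustration:

```python
raw = b'alice\t555-0100\nbob\t555-0199\n'   # hypothetical phone list

# Basic processing directly on bytes -- no codec involved:
lines = raw.split(b'\n')
bobs = [line for line in lines if line.startswith(b'bob')]
fixed = raw.replace(b'555-0199', b'555-0200')

assert bobs == [b'bob\t555-0199']
assert b'555-0200' in fixed
```

What bytes lacks (as of this thread) is the formatting side: neither `%` nor `.format()` works on bytes, which is exactly what the PEP under discussion proposes to add.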

On 09/01/2014 06:50, Lennart Regebro wrote:
On Thu, Jan 9, 2014 at 1:07 AM, Ben Finney <ben+python@benfinney.id.au> wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding.
To be honest, you can define text as "A stream of bytes that are split up in lines separated by a linefeed", and do some basic text processing like that. Just very *basic*, but still. Replacing characters. Extracting certain lines etc.
This is harder in Python 3, as bytes does not have all the functionality strings has, like formatting. This can probably be fixed in Python 3.5, if the relevant PEP gets finished.
For the battery analogy, that's like saying:
"I want a battery."
"What kind?"
"It doesn't matter, as long as it's over 5V."
//Lennart
"That Python 3 battery you sold me blew up when I tried using it". "We've been telling you for years that could happen". "I didn't think you actually meant it". -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

On Thu, Jan 9, 2014 at 10:00 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
On 09/01/2014 06:50, Lennart Regebro wrote:
On Thu, Jan 9, 2014 at 1:07 AM, Ben Finney <ben+python@benfinney.id.au> wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding.
To be honest, you can define text as "A stream of bytes that are split up in lines separated by a linefeed", and do some basic text processing like that. Just very *basic*, but still. Replacing characters. Extracting certain lines etc.
This is harder in Python 3, as bytes does not have all the functionality strings has, like formatting. This can probably be fixed in Python 3.5, if the relevant PEP gets finished.
For the battery analogy, that's like saying:
"I want a battery."
"What kind?"
"It doesn't matter, as long as it's over 5V."
//Lennart
"That Python 3 battery you sold me blew up when I tried using it".
"We've been telling you for years that could happen".
"I didn't think you actually meant it".
"These new nuclear cells are awesome! But can you stop them from leaking on their users?"
A1: "The nuclear power is radioactive. Accept it."
A2: "This is the basic stdlib container. You're supposed to protect yourself."
A3: "The world is changing. Everybody should learn nuclear fission to use things properly."
"..."

And while we are at it: if the battery became more advanced, there is no reason to strip off the simple default interface. This interface is not an abstract discussion here, but a real user experience study (I am going to spread the UX virus), which starts with:

1. expectations
2. experience
3. outcomes

and progressively iterates over 2 to get 3 matching 1 as closely as possible, without trying to change 1. Changing 1 means changing people - a simple and natural solution that people practice every day on children and subordinates. The only problem is that it is an ineffective, hard and useless activity in an open source environment, because most people, by the nature of their neural network processes, become conservative with age. That's why people invented forks.

However, for the encoding problem, there are some good default solutions. You'll have to choose between different interests anyway, but here it is:

1. always open() text files in UTF-8 by default
2. introduce an autodetect mode for open functions
   - read and transform on the fly, maintaining a buffer that stores original bytes and their mapping to letters. The mapping is updated as byte frequencies change. When the buffer is full, you have the best candidate.
3. provide sane error messages
   - messages that users actually understand
   - messages that tell how to fix the problem

If the interface becomes more complicated, the last thing you should do is leave the user 1:1 with the interface problems. And to conclude, I am not saying that people should not learn about unicode, but the learning curve should not be as steep as Python 3 demands.

On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik <techtonik@gmail.com> wrote:
2. introduce autodetect mode to open functions 1. read and transform on the fly, maintaining a buffer that stores original bytes and their mapping to letters. The mapping is updated as bytes frequency changes. When the buffer is full, you have the best candidate.
Bad idea. Bad, bad idea! No biscuit. Sit! This sort of magic is what brings the "bush hid the facts" bug in Windows Notepad. If byte value distribution is used to guess encoding, there's no end to the craziness that can result. How do you know that the byte values 0x41 0x42 0x43 0x44 are supposed to mean upper-case ASCII letters and not a 32-bit integer or floating-point value, or some accented lower-case letter A's in EBCDIC, or anything else? Maybe if you have a whole document, AND you know for sure that it's linguistic text, then maybe - MAYBE - you could guess with reasonable reliability. But even then, how can you be sure? Remember, too, you might have to deal with something that's actually mis-encoded. If you're told this is UTF-8 and you find the byte sequence ED B3 BF, do you decide that it can't possibly be UTF-8 and pick a different encoding to decode with? That would produce no end of trouble, where the actual result you want is (most likely) to throw an error. ChrisA

On Fri, Jan 10, 2014 at 12:22:02PM +1100, Chris Angelico wrote:
On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik <techtonik@gmail.com> wrote:
2. introduce autodetect mode to open functions 1. read and transform on the fly, maintaining a buffer that stores original bytes and their mapping to letters. The mapping is updated as bytes frequency changes. When the buffer is full, you have the best candidate.
Bad idea. Bad, bad idea! No biscuit. Sit!
This sort of magic is what brings the "bush hid the facts" bug in Windows Notepad. If byte value distribution is used to guess encoding, there's no end to the craziness that can result.
I think that heuristics to guess the encoding have their role to play, if the caller understands the risks. For example, an application might give the user the choice of specifying the codec, or having the app guess it. (I dislike the term "Auto detect", since that implies a level of certainty which often doesn't apply to real files.) There is already a third-party library, chardet, which does this. Perhaps the std lib should include this? Perhaps chardet should be considered best-of-breed "atomic reactor", but the std lib could include a "battery" to do something similar. I don't think we ought to dismiss this idea out of hand.
How do you know that the byte values 0x41 0x42 0x43 0x44 are supposed to mean upper-case ASCII letters and not a 32-bit integer or floating-point value,
Presumably if you're reading a file intended to be text, they'll be meant to be text and not arbitrary binary blobs. Given that it is 2014 and not 1974, chances are reasonably good that bytes 0x41 0x42 0x43 0x44 are meant as ASCII letters rather than EBCDIC. But you can't be certain, and even if "ASCII capital A" is the right way to bet with byte 0x41, it's much harder to guess what 0xC9 is intended as:

py> for encoding in "macroman cp1256 latin1 koi8_r".split():
...     print(b'\xC9'.decode(encoding))
...
…
ة
É
и

If you know the encoding via some out-of-band metadata, that's great. If you don't, or if the specified encoding is wrong, an application may not have the luxury of just throwing up its hands and refusing to process the data. Your web browser has to display something even if the web page lies about the encoding used or contains invalid data.

Even though encoding issues are more than 40 years old, making this problem older than most programmers, it's still new to many people. (Perhaps they haven't been paying attention, or they've been in denial that it would ever happen to them, or they've just been lucky enough to live in a pure ASCII world.) So a bit of sympathy to those struggling with this, but on the flip side, they need to HTFU and deal with it. Python 3 did not cause encoding issues, and in these days of code being interchanged all over the world, any programmer who doesn't have at least a basic understanding of this is like a programmer who doesn't understand why "<insert name of language> cannot multiply correctly":

py> 0.7*7 == 4.9
False

-- Steven

On Fri, Jan 10, 2014 at 1:39 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Jan 10, 2014 at 12:22:02PM +1100, Chris Angelico wrote:
On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik <techtonik@gmail.com> wrote:
2. introduce an autodetect mode for the open() function: read and transform on the fly, maintaining a buffer that stores the original bytes and their mapping to letters; the mapping is updated as byte frequencies change, and when the buffer is full, you have the best candidate
Bad idea. Bad, bad idea! No biscuit. Sit!
This sort of magic is what brings the "bush hid the facts" bug in Windows Notepad. If byte value distribution is used to guess encoding, there's no end to the craziness that can result.
I think that heuristics to guess the encoding have their role to play, if the caller understands the risks. For example, an application might give the user the choice of specifying the codec, or having the app guess it. (I dislike the term "Auto detect", since that implies a level of certainty which often doesn't apply to real files.)
There is already a third-party library, chardet, which does this. Perhaps the std lib should include this? Perhaps chardet should be considered best-of-breed "atomic reactor", but the std lib could include a "battery" to do something similar. I don't think we ought to dismiss this idea out of hand.
I don't deny that chardet has its place, but would you use it like this (I'm assuming it works with Py3, the docs seem to imply Py2)?

text = ""
with open("blah", "rb") as f:
    while True:
        data = f.read(256)
        if not data:
            break
        text += data.decode(chardet.detect(data)['encoding'])

Certainly not. But that's how the file-open-mode of "auto detect" sounds. At the very least, it has to do something like this _until_ it has confidence; maybe it can retain the chardet state after the first read, but it's still going to have to decode at least as little as you first read. How can it handle this case?

first_char = open("blah", encoding="auto").read(1)

Somehow it needs to know how many bytes to read (and preferably not read too many more - buffering a line-ish is reasonable, buffering a megabyte not so much) and figure out what's one character.

I see this as similar to the Python 2 input() function. It's not the file-open builtin's job to do something as advanced and foot-shooting as automatic charset detection. If you want that, you should be prepared for its failures and the messes of partial reads, and call on chardet yourself, same as you should use eval(input()) explicitly in Py3 (and, in my opinion, eval(raw_input()) equally explicitly in Py2). I'm not saying that chardet is bad, but I *am* saying, and I stand by this, that an auto-detect option on file open is a bad idea.

Unix comes with a 'file' command which will tell you even more about what something is. (For what it thinks are text files, I believe it uses heuristics similar to chardet to guess an encoding.) Would you want a parameter to the open() builtin that tries to read the file as an image, or an audio file, or a document, or an executable, and automatically decodes it to a PIL.Image, an mm.wave, etc, or executes the code and returns its stdout, all entirely automatically? I don't think so. Not open()'s job.

ChrisA

Chris Angelico writes:
I'm not saying that chardet is bad, but I *am* saying, and I stand by this, that an auto-detect option on file open is a bad idea.
I have used it by default in Emacs and XEmacs since 1990, and I certainly haven't experienced it as a bad idea at *any* time in more than two decades. Of course, it shouldn't be default in Python for two reasons: (1) Emacsen are invariably interactive so very flexible with error recovery, not so for Python, and (2) Emacsen can generally assume that the files they open are more or less text in the first place, which again is not true for Python.
Would you want a parameter to the open() builtin
It's not a parameter, it's a particular value for the encoding parameter.
that tries to read the file as an image, or an audio file, or a document, or an executable, and automatically decodes it to a PIL.Image, an mm.wave, etc,
Emacsen do that, too. It's not the sayonara Grand Slam in the 7th game of the World Series spectacular win that text encoding detection is, but it is very useful much of the time. What it comes down to for all of the above is "consenting adults." Python should *not* do any guessing by default, but if the programmer or user explicitly request a guess with "encoding=chardet", why in the world would you want Python to do anything but give it the old college try? Of course any Python-supplied guesser should take a very pessimistic approach and error unless it's quite certain, but
or execute the code and return its stdout, all entirely automatically?
Now *that* is a really bad idea. You shouldn't mix it with the others. (I'll also concede that many file formats -- Postscript, I'm looking at you -- require special care to avoid arbitrary code execution.)

Steven D'Aprano <steve@pearwood.info> writes:
I think that heuristics to guess the encoding have their role to play, if the caller understands the risks.
I think, for a language whose developers espouse a principle “In the face of ambiguity, refuse the temptation to guess”, heuristics have no role to play in the standard library.
There is already a third-party library, chardet, which does this.
As a third-party library, it's fine and quite useful.
Perhaps the std lib should include this?
In my opinion, content-type guessing heuristics certainly don't belong in the standard library.

--
“Nothing is more sacred than the facts.” —Sam Harris, _The End of Faith_, 2004
Ben Finney

Steven D'Aprano wrote:
I think that heuristics to guess the encoding have their role to play, if the caller understands the risks.
Ben Finney wrote:
In my opinion, content-type guessing heuristics certainly don't belong in the standard library.
It would be great if there were never any need to guess. But in the real world, there is -- and often the user won't know any more than Python does. So when it is time to guess, a source of good guesses is an important battery to include.

The HTML5 specifications go through some fairly extreme contortions to document what browsers actually do, as opposed to what previous standards have mandated. They don't currently specify how to guess (though I think a draft once tried, since the major browsers all do it, and at the time did it similarly), but the specs do explicitly support such a step, and do provide an implementation note encouraging user-agents to do at least minimal auto-detection. http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#det...

My own opinion is therefore that Python SHOULD provide better support for both of the following use cases:

(1) Treat this file like it came from the web -- including autodetection and even overriding explicit charset declarations for certain charsets. We should explicitly treat autodetection like time zone data -- there is no promise that the "right answer" (or at least the "best guess") won't change, even within a release. I offer no opinion on whether chardet in particular is still too volatile, but the docs should warn that the API is driven by possibly changing external data.

(2) Treat this file as "ASCII+", where anything non-ASCII will (at most) be written back out unchanged; it doesn't even need to be converted to text. At this time, I don't know whether the right answer is making it easy to default to surrogate-escape for all error handling, adding more bytes methods, encouraging use of Python's latin-1 variant, offering a dedicated (new?) codec, or some new suggestion. I do know that this use case is important, and that Python 3 currently looks clumsy compared to Python 2.
-jJ -- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ

"Jim J. Jewett" <jimjjewett@gmail.com> writes:
Steven D'Aprano wrote:
I think that heuristics to guess the encoding have their role to play, if the caller understands the risks.
Ben Finney wrote:
In my opinion, content-type guessing heuristics certainly don't belong in the standard library.
It would be great if there were never any need to guess. But in the real world, there is -- and often the user won't know any more than python does.
That's why I think it's great to have heuristic guessing code available as a third-party library.
So when it is time to guess, a source of good guesses is an important battery to include.
Why is it important enough to deserve that privilege, over the thousands of other candidates for the standard library? The barrier for entry to the standard library is higher than mere usefulness.
We should explicitly treat autodetection like time zone data -- there is no promise that the "right answer" (or at least the "best guess") won't change, even within a release.
But there is exactly one set of authoritative time zones at any particular point in time. That's why it makes sense to have that set of authoritative values available in the standard library. Heuristic guesses about content types do not have the property of exactly one authoritative source, so your analogy is not compelling.

--
“Unix is an operating system, OS/2 is half an operating system, Windows is a shell, and DOS is a boot partition virus.” —Peter H. Coffin
Ben Finney

So when it is time to guess [at the character encoding of a file], a source of good guesses is an important battery to include.
The barrier for entry to the standard library is higher than mere usefulness.
Agreed. But "most programs will need it, and people will either include (the same) 3rd-party library themselves, or write their own workaround, or have buggy code" *is* sufficient. The points of contention are:

(1) How many programs have to deal with documents written outside their control -- and probably originating on another system. I'm not ready to say "most" programs in general, but I think that barrier is met for both web clients (for which we already supply several batteries) and quick-and-dirty utilities.

(2) How serious are the bugs / how annoying are the workarounds? As someone who mostly sticks to English, and who tends to manually ignore stray bytes when dealing with a semi-binary file format, the bugs aren't that serious for me personally. So I may well choose to write buggy programs, and the bug may well never get triggered on my own machine. But having a batch process crash one run in ten (where it didn't crash at all under Python 2) is a bad thing. There are environments where (once I knew about it) I would add chardet (if I could get approval for the 3rd-party component).

(3) How clear-cut is the *right* answer? As I said, at one point (several years ago), the w3c and whatwg started to standardize the "right" answer. They backed that out, because vendors wanted the option to improve their detection in the future without violating standards. There are certainly situations where local knowledge can do better than a global solution like chardet, but ... the "right" answer is clear most of the time.

Just ignoring the problem is still a 99% answer, because most text in ASCII-mostly environments really is "close enough". But that is harder (and the One Obvious Way is less reliable) under Python 3 than it was under Python 2. An alias for "open" that defaulted to surrogate-escape (or returned the new "ASCIIstr" bytes hybrid) would probably be sufficient to get back (almost) to Python 2 levels of ease and reliability.
But it would tend to encourage ASCII/English-only assumptions.

You could fix most of the remaining problems by scripting a web browser, except that scripting the browser in a cross-platform manner is slow and problematic, even with webbrowser.py. "Whatever a recent Firefox does" is (almost by definition) good enough, and is available ... but maybe not in a convenient form, which is one reason that chardet was created as a port thereof. Also note that Firefox assumes you will update more often than Python does.

"Whatever chardet said at the time the Python release was cut" is almost certainly good enough too. The browser makers go to great lengths to match each other even in bizarre corner cases. (Which is one reason there aren't more competing solutions.) But that doesn't mean it is *impossible* to construct a test case where they disagree -- or even one where a recent improvement in the algorithms led to regressions for one particular document. That said, such regressions should be limited to documents that were not properly labeled in the first place, and should be rare even there. Think of the changes as obscure bugfixes, akin to a program starting to handle NaN properly, in a place where it "should" not ever see one.

-jJ

On Tue, Jan 14, 2014 at 10:48 AM, Jim J. Jewett <jimjjewett@gmail.com> wrote:
The barrier for entry to the standard library is higher than mere usefulness.
Agreed. But "most programs will need it, and people will either include (the same) 3rd-party library themselves, or write their own workaround, or have buggy code" *is* sufficient.
Well, no, that's not sufficient on its own either. But yes, it's a stronger argument.
But having a batch process crash one run in ten (where it didn't crash at all under Python 2) is a bad thing. There are environments where (once I knew about it) I would add chardet (if I could get approval for the 3rd-party component).
Having it *do the wrong thing* one run in ten is even worse. If you need chardet, then get approval for the third-party component. That's a political issue, not a technical one. "This needs to be in the stdlib because I'm not allowed to install anything else"? I hope not. Also, a PyPI package is free to update independently of the Python version schedule. The stdlib is bound. ChrisA

On 1/13/2014 7:06 PM, Chris Angelico wrote:
On Tue, Jan 14, 2014 at 10:48 AM, Jim J. Jewett <jimjjewett@gmail.com> wrote:
Agreed. But "most programs will need it, and people will either include (the same) 3rd-party library themselves, or write their own workaround, or have buggy code" *is* sufficient.
Well, no, that's not sufficient on its own either. But yes, it's a stronger argument.
But having a batch process crash one run in ten (where it didn't crash at all under Python 2) is a bad thing. There are environments where (once I knew about it) I would add chardet (if I could get approval for the 3rd-party component).
Having it *do the wrong thing* one run in ten is even worse.
If you need chardet, then get approval for the third-party component. That's a political issue, not a technical one. "This needs to be in the stdlib because I'm not allowed to install anything else"? I hope not. Also, a PyPI package is free to update independently of the Python version schedule. The stdlib is bound.
This discussion strikes me as more appropriate for python-ideas. That said, I am leery of a heuristics module in the stdlib. When is a change a 'bug fix', and when is it an 'enhancement'? -- Terry Jan Reedy

On Mon, Jan 13, 2014 at 07:58:43PM -0500, Terry Reedy wrote:
This discussion strikes me as more appropriate for python-ideas. That said, I am leery of a heuristics module in the stdlib. When is a change a 'bug fix'? and when is it an 'enhancement'?
Depends on the nature of the heuristic. For example, there's a simple "guess the encoding of text files" heuristic which uses the presence of a BOM to pick the encoding:

- read the first four bytes in binary mode
- if bytes 0 and 1 are FE FF or FF FE, then the encoding is UTF-16;
- if bytes 0 through 2 are EF BB BF, then the encoding is UTF-8;
- if bytes 0 through 3 are 00 00 FE FF or FF FE 00 00, then the encoding is UTF-32;
- if bytes 0 through 2 are 2B 2F 76 and byte 3 is 38, 39, 2B or 2F, then the encoding is UTF-7;
- otherwise the encoding is unknown.

Here a bug fix versus an enhancement is easy: a bug fix is (say) getting one of the BOMs wrong (suppose it tested for EF FF instead of FE FF; that would be a bug); an enhancement would be adding a new BOM/encoding detector (say, F7 64 4C for UTF-1).

The same would not apply to, for instance, the chardet library, where detection is based on statistics. If the library adjusts a frequency table, does that reflect a bug or an enhancement or both?

-- Steven
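Steven's heuristic is small enough to sketch directly. One detail the prose ordering hides: the UTF-32 checks must run before the UTF-16 ones, because the UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE). A rough version (the function name is mine):

```python
def sniff_bom(raw):
    """Guess an encoding from a leading BOM; return None if no BOM is found.

    UTF-32 is tested before UTF-16 because the UTF-32-LE BOM
    (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
    """
    if raw.startswith((b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00")):
        return "utf-32"
    if raw.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # the -sig codec strips the BOM on decode
    if raw[:3] == b"\x2b\x2f\x76" and raw[3:4] in (b"8", b"9", b"+", b"/"):
        return "utf-7"
    return None

# Sanity check against Python's own encoders:
assert sniff_bom("hi".encode("utf-16")) == "utf-16"
assert sniff_bom("hi".encode("utf-32")) == "utf-32"
assert sniff_bom(b"\xef\xbb\xbfhi") == "utf-8-sig"
assert sniff_bom(b"plain ascii") is None
```

This also illustrates Steven's bug-fix/enhancement split: swapping a byte in one of these constants would be a bug fix; adding a UTF-1 branch would be an enhancement.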

On Thu, Jan 9, 2014 at 5:50 PM, Lennart Regebro <regebro@gmail.com> wrote:
To be honest, you can define text as "A stream of bytes that are split up in lines separated by a linefeed", and do some basic text processing like that. Just very *basic*, but still. Replacing characters. Extracting certain lines etc.
You would have to define it as "A stream of bytes encoded in {ASCII|Latin-1|CP-1252|UTF-8} that" etc etc. Otherwise, those bytes might be EBCDIC, UTF-16, or anything else, and your code will fail. And once you've demanded that, well, you're right back here with clarifying encodings, so you may as well just pass encoding="ascii" and do it honestly. ChrisA

On 9 January 2014 04:50, Lennart Regebro <regebro@gmail.com> wrote:
To be honest, you can define text as "A stream of bytes that are split up in lines separated by a linefeed", and do some basic text processing like that. Just very *basic*, but still. Replacing characters. Extracting certain lines etc.
That is, until you hit a character which contains a byte with the same value as an ASCII newline in the middle of a multi-byte character. So, this approach is broken to start with.

On Fri, Jan 10, 2014 at 2:03 AM, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
On 9 January 2014 04:50, Lennart Regebro <regebro@gmail.com> wrote:
To be honest, you can define text as "A stream of bytes that are split up in lines separated by a linefeed", and do some basic text processing like that. Just very *basic*, but still. Replacing characters. Extracting certain lines etc.
That is, until you hit a character which has a byte with the same value of ASCII newline in the middle of a multi-byte character.
So, this approach is broken to start with.
For a very specific definition of broken, yes - namely, that it will fail with UTF-16 or EBCDIC. Files which, under the above definition of "text files", are not text files. :-) //Lennart

On 9 January 2014 10:07, Ben Finney <ben+python@benfinney.id.au> wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding.
Python should not guess the encoding if it's unknown. Without the right encoding, you don't get text, you get partial or complete gibberish.
So, if what you want is to parse text and not get gibberish, you need to *tell* Python what the encoding is. That's a brute fact of the world of text in computing.
Set the mode to "rb", process it as binary. Done. See http://python-notes.curiousefficiency.org/en/latest/python3/text_file_proces... for details. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
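Nick's "open in 'rb', process as binary" approach in miniature - a hedged sketch of my own (not taken from the linked article), showing line-oriented work done purely on bytes, which behaves sensibly only for ASCII-compatible data:

```python
# Line-oriented processing entirely on bytes: no decode step anywhere.
# Safe only while the underlying encoding is ASCII-compatible.
raw = b"alpha 1\nbeta 2\ngamma 3\n"

matching = [line for line in raw.split(b"\n") if line.startswith(b"beta")]
print(matching)                # [b'beta 2']

# bytes support many of the usual str-style methods:
print(b"beta".upper())         # b'BETA'
print(b"beta 2".split()[1])    # b'2'
```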

Nick Coghlan <ncoghlan@gmail.com> writes:
On 9 January 2014 10:07, Ben Finney <ben+python@benfinney.id.au> wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding. […]
Set the mode to "rb", process it as binary. Done.
Which entails abandoning the stated goal of “just want to parse text files” :-)

--
“All television is educational television. The question is: what is it teaching?” —Nicholas Johnson
Ben Finney

On Thu, Jan 9, 2014 at 8:16 AM, Ben Finney <ben+python@benfinney.id.au> wrote:
Nick Coghlan <ncoghlan@gmail.com> writes:
Set the mode to "rb", process it as binary. Done.
Which entails abandoning the stated goal of “just want to parse text files” :-)
Only if your definition of "text files" means it's unicode.

On Thu, Jan 09, 2014 at 05:11:06PM +1000, Nick Coghlan wrote:
On 9 January 2014 10:07, Ben Finney <ben+python@benfinney.id.au> wrote:
So, if what you want is to parse text and not get gibberish, you need to *tell* Python what the encoding is. That's a brute fact of the world of text in computing.
Set the mode to "rb", process it as binary. Done.
A nice point, but really, you lose a lot by doing so. Even simple things like the ability to write:

if word[0] == 'X'

instead you have to write things like:

if word[0:1] == b'X'
if chr(word[0]) == 'X'
if word[0] == ord('X')
if word[0] == 0x58

(pick the one that annoys you the least). And while bytes objects do have a surprising (to me) number of string-ish methods, like upper(), there are a few missing, like format() and isnumeric(). So it's not quite as straightforward as "done". If it were, we wouldn't need text strings :-)

-- Steven
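The awkwardness Steven describes stems from bytes indexing returning an int in Python 3 rather than a length-1 string; a quick illustration:

```python
word = b"Xyzzy"

# In Python 3, indexing bytes yields an int, not a 1-byte bytes object:
assert word[0] == 88         # the byte value, 0x58
assert word[0] != b"X"       # an int never compares equal to bytes

# Hence the workarounds listed above:
assert word[0:1] == b"X"     # slicing preserves the bytes type
assert chr(word[0]) == "X"
assert word[0] == ord("X")
assert word[0] == 0x58
```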

On 09/01/14 00:07, Ben Finney wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding.
I'm glad someone pointed this out.

On 9 January 2014 09:01, Mark Shannon <mark@hotpy.org> wrote:
On 09/01/14 00:07, Ben Finney wrote:
Kristján Valur Jónsson <kristjan@ccpgames.com> writes:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files.
Files don't contain text, they contain bytes. Bytes only become text when filtered through the correct encoding.
I'm glad someone pointed this out.
Try working on Windows with PowerShell as your default shell for a while. You learn that message *very* fast. You end up with a mix of CP1250 and UTF-16 files, and you can no longer even assume that a file of "simple text" is in an ASCII-compatible encoding. After tools like grep fail to work often enough, you get a really strong sense of why knowing the encoding matters (and you feel this urge to rewrite all the GNU tools in Python 3 ;-)). And that's on a single PC in an English-speaking locale :-( (You also get this fun with the £ sign being encoded differently in the console and the GUI.) So it's not just people that "use funny foreign languages" (apologies to 99% of the globe for that :-)) who are affected.

I assume Kristján knows all this, given the "á" in his name :-) But certainly just using open without specifying an encoding has always served me fine in Python 3, in the sense that it does at least as well as Python 2.

So I think that if this discussion is to be of any real benefit, a specific example is needed. I honestly don't think I've ever encountered a case where "Sometimes [I] just want to parse text files" and code that uses the default encoding (i.e., looks pretty much identical to Python 2) has *failed* to do the job for me. PEP 460 is addressing a very specific use case, and certainly isn't for "just parsing text files" - at least as I understand it.

Paul.

Paul Moore writes:
So I think that if this discussion is to be of any real benefit, a specific example is needed. I honestly don't think I've ever encountered a case where "Sometimes [I] just want to parse text files" and code that uses the default encoding (i.e., looks pretty much identical to Python 2) has *failed* to do the job for me.
I don't understand why it fails for Kristján, but I can tell you why it failed for me: Mac OS X "Snow Leopard" (at least on my box, and perhaps due to my misconfiguration) doesn't set the locale variables and for some reason the fallback for locale.getpreferredencoding() is not UTF-8 (== sys.getfilesystemencoding()) nor some Japanese encoding (Japanese is my system language), but US-ASCII! Naturally, putting LANG=ja_JP.UTF-8 in my shell startup fixed that once and for all, so as I say I don't understand why Kristján has a problem.

On 1/8/2014 5:04 PM, Kristján Valur Jónsson wrote:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files. Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and
I suspect that you would do that by looking for the bytes that can be interpreted as ascii digits. That will work fine as long as the .txt file has an ascii-compatible encoding. As soon as it does not, the little utility fails. It also fails with non-European digits, such as are used in Arabic and Indic writings. Even if you are in an environment where all .txt files are encoded in utf-8, it will be easier to look for non-ascii digits in decoded unicode strings.
merge in some email addresses. What encoding does the file have? Do I care? Must I care?
If the email addresses have non-ascii characters, then you must. ...
All this talk is positive, though. The fact that these topics have finally reached the halls of python-dev are indication that people out there are _trying_ to move to 3.3 :)
That is an interesting observation, worth keeping in mind among the turmoil. -- Terry Jan Reedy
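Terry's point about ASCII-compatible encodings versus non-European digits can be made concrete. In this sketch (the sample data is made up), a bytes-level regex finds only the ASCII digits, while the same pattern on decoded text also matches Arabic-Indic digits:

```python
import re

# Phone-number-ish data containing both ASCII and Arabic-Indic digits:
data = "Call 555-0123 or ٥٥٥-0199".encode("utf-8")

# On bytes, \d means [0-9] only -- and this works at all only because
# the encoding is ASCII-compatible (it would find nothing in UTF-16):
print(re.findall(rb"\d{3}-\d{4}", data))   # [b'555-0123']

# On decoded text, \d is Unicode-aware and matches both numbers:
text = data.decode("utf-8")
print(re.findall(r"\d{3}-\d{4}", text))    # ['555-0123', '٥٥٥-0199']
```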

On Wed, Jan 8, 2014 at 2:04 PM, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Believe it or not, sometimes you really don't care about encodings. Sometimes you just want to parse text files. Python 3 forces you to think about abstract concepts like encodings when all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
If computers had taken off in China before the USA, you'd probably be wondering why some Chinese refuse to care about encodings, when the rest of the world clearly needs them. Yes, you really should care about encodings. No, it's not quite as simple as it once was for English speakers. It was formerly simple (for us) because we were effectively pressing everyone else to read and write English.

If you want to keep things close to what you're used to, use latin-1 as your encoding. It's still a choice, and not a great one for user-facing text, but if you want to be simplistic about it, that's a way to do it. That said, there will be some text that isn't user-facing, EG in a network protocol. This is probably what all the fuss is about. But like I said, this can be done with latin-1.
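The latin-1 trick works because that codec maps every byte value 0x00-0xFF to a code point one-to-one, so any byte sequence decodes without error and re-encodes to exactly the original bytes. A minimal sketch:

```python
# latin-1 decodes any byte sequence without error, and the round trip
# is lossless -- though the resulting "text" may be mojibake.
raw = b"caf\xc3\xa9 \xff\xfe"   # really UTF-8 plus some stray bytes

s = raw.decode("latin-1")       # never raises
assert s.encode("latin-1") == raw

# ASCII-level processing of the mis-decoded text still works:
assert s.startswith("caf")
```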

Kristján Valur Jónsson wrote:
all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
To some extent, yes. If the encoding happens to be an ascii-compatible one, such as latin-1 or utf-8, you can probably extract the phone numbers without caring what the rest of the bytes mean. But not if it's utf-16, for example. If you know that all the files on your system have an ascii-compatible encoding, you can use the surrogateescape error handler to avoid having to know about the exact encoding. Granted, that makes it slightly more complicated than it was in Python 2, but not much. -- Greg

On 9 January 2014 15:22, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Kristján Valur Jónsson wrote:
all you want is to open that .txt file on the drive and extract some phone numbers and merge in some email addresses. What encoding does the file have? Do I care? Must I care?
To some extent, yes. If the encoding happens to be an ascii-compatible one, such as latin-1 or utf-8, you can probably extract the phone numbers without caring what the rest of the bytes mean. But not if it's utf-16, for example.
If you know that all the files on your system have an ascii-compatible encoding, you can use the surrogateescape error handler to avoid having to know about the exact encoding. Granted, that makes it slightly more complicated than it was in Python 2, but not much.
There's also the fact that POSIX folks are used to "r" and "rb" being the same thing. Python 3 chose to make the default behaviour be to open files as text files in the default system encoding. This created two significant user-visible changes:

- POSIX users could no longer ignore the difference between binary mode and text mode when opening files (Windows users have always had to care due to the line ending problem)
- POSIX users could no longer ignore locale configuration errors

We're aiming to resolve the most common locale configuration issue by configuring surrogateescape on the standard streams when the OS claims that the default encoding is ASCII, but ultimately, the long term fix is for POSIX platforms to standardise on and consistently report UTF-8 as the system encoding (as well as configuring ssh environments properly by default).

Python 2 is *very* much a POSIX-first language, with Windows, the JVM and other non-POSIX environments as an afterthought. Python 3 intentionally offers more consistent cross-platform behaviour, which means it no longer aligns as neatly with the sensibilities of experienced users of POSIX systems.

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, 9 Jan 2014 17:09:10 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
There's also the fact that POSIX folks are used to "r" and "rb" being the same thing.
Which fails immediately under Windows :-) Regards Antoine.
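The behavioural difference being discussed is easy to demonstrate (a throwaway temp file, purely for illustration):

```python
import os
import tempfile

# Write raw bytes containing a Windows-style line ending.
fd, path = tempfile.mkstemp()
os.write(fd, b"line1\r\nline2\n")
os.close(fd)

# Text mode ("r"): decodes to str and translates \r\n to \n.
with open(path, "r", encoding="utf-8") as f:
    assert f.read() == "line1\nline2\n"

# Binary mode ("rb"): returns the raw bytes untouched.
with open(path, "rb") as f:
    assert f.read() == b"line1\r\nline2\n"

os.remove(path)
```

On POSIX in Python 2, both modes returned the same raw bytes; in Python 3 (and on Windows in both versions), the distinction matters.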

So the customer you're looking for is the person who cares a lot about encodings, knows how to do Unicode correctly, and has noticed that certain valid cases, not limited to imperialist simpletons (dealing with specific common things invented before 1996, dealing with mixed encodings, doing what Nick describes as "ASCII compatible binary protocols"), are *more complicated to do correctly* in Python 3, because Python 3 undeniably has more complicated, though probably better, *Unicode* support. N.b. WSGI, email, URL parsing, etc.

The same person loves Python, all the other Python 3 features, and probably you personally, but mostly does not write programs in the domains that Python 3 makes easier. They emphatically do not want the Python 2 model, especially not implicit coercion. They only want additional tools for text or string processing in Python 3.

On Thu, 9 Jan 2014 09:03:40 -0500 Daniel Holth <dholth@gmail.com> wrote:
They emphatically do not want the Python 2 model especially not implicit coercion. They only want additional tools for text or string processing in Python 3.
That's a good point. Now it's up to people who need those additional tools to propose them. We can't second-guess everyone's needs. Regards Antoine.

On 9 Jan 2014 22:08, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
On Thu, 9 Jan 2014 09:03:40 -0500 Daniel Holth <dholth@gmail.com> wrote:
They emphatically do not want the Python 2 model especially not implicit coercion. They only want additional tools for text or string processing in Python 3.
That's a good point. Now it's up to people who need those additional tools to propose them. We can't second-guess everyone's needs.
Note that I've tried to find prettier ways to write the standard library's URL parsing code. In addition to the original alternatives I explored, I'm currently experimenting with a generic function based approach, with mixed results. I'm reserving judgement until I see how the completed conversion looks, but currently it doesn't seem any simpler than my current higher order function approach.

However, the implicit conversions are *critical* to sharing constants between the two code paths in Python 2 without coercing bytes to str or vice-versa (disabling the implicit coercion breaks Unicode handling), so I'm still not sure the goal is achievable without creating a new type *specifically* for that task.

Python 3 only code is generally much simpler - you can usually pick binary or text and just support one of them, rather than trying to support both in the same API.

Cheers, Nick.
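The constant-sharing problem Nick describes can be seen directly in Python 3, where bytes and str never mix implicitly (the URL and separator below are made-up examples):

```python
# In Python 2, a single str constant could serve both the bytes and the
# unicode code paths via implicit coercion. Python 3 refuses to mix them.
SEP = "://"  # a str constant

url_text = "https://example.com/path"
assert url_text.split(SEP)[0] == "https"

url_bytes = b"https://example.com/path"
try:
    url_bytes.split(SEP)  # bytes method with a str argument: TypeError
    raise AssertionError("expected a TypeError")
except TypeError:
    pass

# The Python 3 way: encode the constant explicitly for the bytes path.
assert url_bytes.split(SEP.encode("ascii"))[0] == b"https"
```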
Regards
Antoine.

And I think everyone was well intentioned - and python3 covers most of the bases, but working with binary data is not only a "wire-protocol programmer's" problem. Needing a library to wrap bytesthing.format('ascii', 'surrogateescape') or some such thing makes python3 less approachable for those who haven't learned that yet - which was almost all of us at some point when we started programming.
Totally agree with you. -- INADA Naoki <songofacandy@gmail.com>

On 9 Jan 2014 11:29, "INADA Naoki" <songofacandy@gmail.com> wrote:
And I think everyone was well intentioned - and python3 covers most of the bases, but working with binary data is not only a "wire-protocol programmer's" problem.
If you're working with binary data, use the binary API offered by bytes, bytearray and memoryview.
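A quick sketch of that binary API, with made-up sample bytes:

```python
# bytes: immutable; bytearray: mutable; memoryview: zero-copy slicing.
header = bytes([0x89, 0x50, 0x4E, 0x47])  # immutable sample data

buf = bytearray(header)   # mutable copy, suitable for bit twiddling
buf[0] ^= 0xFF            # flip every bit in the first byte
assert buf[0] == 0x76

view = memoryview(buf)    # a view over the buffer, no copy made
assert view[1:4].tobytes() == b"PNG"
```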
Needing a library to wrap bytesthing.format('ascii', 'surrogateescape')
or some such thing makes python3 less approachable for those who haven't learned that yet - which was almost all of us at some point when we started programming.
Totally agree with you.
If you're on a relatively modern OS, everything should be UTF-8 and you should be fine as a beginner. When you start encountering malformed data, Python 3 should throw an error, and provide an opportunity to learn more (by looking up the error message), where Python 2 would silently corrupt the data stream.

Python 2 enshrined a data model eminently suitable for boundary code that dealt with ASCII compatible binary protocols (like web frameworks) as the default text model. Application code then needed to take special steps to get correct behaviour for the full Unicode range. In essence, the Python 2 text model is the POSIX text model with Unicode support bolted on to the side to make it at least *possible* to write correct application code.

This is completely backwards. Web applications vastly outnumber web frameworks, and the same goes for every other domain: applications are vastly more common than the libraries and frameworks that handle data transformations at system boundaries on their behalf, so making the latter easier to write at the expense of the former is a deeply flawed design choice.

So Python 3 reverses the situation: the core text model is now more appropriate for the central application code, *after* the boundary code has cleaned up the murky details of wire protocols and file formats. This is pretty easy to deal with for *new* Python 3 code, since you just write things to deal with either bytes or text as appropriate.

However, there is some code written for Python 2 that relies more heavily on the ability to treat ascii compatible binary data as both binary data *and* as text. This is the use case that Python 3 treats as a more specialised use case (perhaps benefitting from a specialised third party type), whereas Python 2 supports it by default. This is also the use case that relied most heavily on implicit encoding and decoding, since that's the mechanism that allows the 8-bit and Unicode paths to share string literals.

Cheers, Nick.
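The "error instead of silent corruption" point can be shown directly (the sample bytes here are illustrative):

```python
bad = b"caf\xe9"  # latin-1 encoded bytes, not valid UTF-8

# Python 3: strict decoding surfaces the malformed byte immediately,
# where Python 2 would often have passed the bytes through unchanged.
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    pass  # the error message identifies the offending byte

# Opting in to surrogateescape keeps the data recoverable instead.
text = bad.decode("utf-8", errors="surrogateescape")
assert text == "caf\udce9"
assert text.encode("utf-8", errors="surrogateescape") == bad
```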
-- INADA Naoki <songofacandy@gmail.com>
participants (35)
- "Martin v. Löwis"
- anatoly techtonik
- Antoine Pitrou
- Baptiste Carvello
- Ben Finney
- Brett Cannon
- Chris Angelico
- Chris Barker
- Dan Stromberg
- Daniel Holth
- Ethan Furman
- Greg Ewing
- INADA Naoki
- Isaac Morland
- Jim J. Jewett
- Joao S. O. Bueno
- Kristján Valur Jónsson
- Lennart Regebro
- M.-A. Lemburg
- Mark Lawrence
- Mark Shannon
- matej@ceplovi.cz
- Matt Billenstein
- MRAB
- Nick Coghlan
- Paul Moore
- Philip Jenvey
- R. David Murray
- Serhiy Storchaka
- Stefan Krah
- Stefan Ring
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Victor Stinner