Built-in open(), the io classes, os and os.path functions, and some other functions in the stdlib support bytes paths as well as str paths. But many functions don't. There are requests to add this support in some modules ([1], [2]). It is easy (just call os.fsdecode() on the argument), but I'm not sure it is worth doing. Pathlib doesn't support bytes paths, and that looks intentional. What is the general policy about supporting bytes paths in the stdlib?

[1] http://bugs.python.org/issue19997
[2] http://bugs.python.org/issue20797
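(For illustration, a minimal sketch of the os.fsdecode() approach Serhiy mentions; the wrapped function here is hypothetical, not an actual stdlib patch.)

    import os

    def describe(path):
        """Hypothetical str-only helper; stands in for a stdlib function."""
        return "directory" if path.endswith("/") else "file"

    def describe_any(path):
        # os.fsdecode() leaves str alone and decodes bytes with the
        # filesystem encoding (with surrogateescape on POSIX), so the
        # wrapped function only ever sees str.
        return describe(os.fsdecode(path))

    print(describe_any("/tmp/"))    # str path
    print(describe_any(b"/tmp/"))   # bytes path, decoded transparently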
The official policy is that we want them to go away, but reality so far has not budged. We will continue to hold our breath though. :-) On Tue, Aug 19, 2014 at 1:37 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Built-in open(), the io classes, os and os.path functions, and some other functions in the stdlib support bytes paths as well as str paths. But many functions don't. There are requests to add this support in some modules ([1], [2]). It is easy (just call os.fsdecode() on the argument), but I'm not sure it is worth doing. Pathlib doesn't support bytes paths, and that looks intentional. What is the general policy about supporting bytes paths in the stdlib?
[1] http://bugs.python.org/issue19997 [2] http://bugs.python.org/issue20797
-- --Guido van Rossum (python.org/~guido)
The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
Does that mean that new APIs should explicitly not support bytes? I'm thinking of os.scandir() (PEP 471), which I'm implementing at the moment. I was originally going to make it support bytes so it was compatible with listdir, but maybe that's a bad idea. Bytes paths are essentially broken on Windows. -Ben
On Tue, Aug 19, 2014 at 1:37 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Built-in open(), the io classes, os and os.path functions, and some other functions in the stdlib support bytes paths as well as str paths. But many functions don't. There are requests to add this support in some modules ([1], [2]). It is easy (just call os.fsdecode() on the argument), but I'm not sure it is worth doing. Pathlib doesn't support bytes paths, and that looks intentional. What is the general policy about supporting bytes paths in the stdlib?
[1] http://bugs.python.org/issue19997 [2] http://bugs.python.org/issue20797
On Tue, Aug 19, 2014, at 10:31, Ben Hoyt wrote:
The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
Does that mean that new APIs should explicitly not support bytes? I'm thinking of os.scandir() (PEP 471), which I'm implementing at the moment. I was originally going to make it support bytes so it was compatible with listdir, but maybe that's a bad idea. Bytes paths are essentially broken on Windows.
Bytes paths are "essential" on Unix, though, so I don't think we should create new low-level APIs that don't support bytes.
The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
Does that mean that new APIs should explicitly not support bytes? I'm thinking of os.scandir() (PEP 471), which I'm implementing at the moment. I was originally going to make it support bytes so it was compatible with listdir, but maybe that's a bad idea. Bytes paths are essentially broken on Windows.
Bytes paths are "essential" on Unix, though, so I don't think we should create new low-level APIs that don't support bytes.
Fair enough. I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix? -Ben
On 08/19/2014 01:43 PM, Ben Hoyt wrote:
The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
Does that mean that new APIs should explicitly not support bytes? I'm thinking of os.scandir() (PEP 471), which I'm implementing at the moment. I was originally going to make it support bytes so it was compatible with listdir, but maybe that's a bad idea. Bytes paths are essentially broken on Windows.
Bytes paths are "essential" on Unix, though, so I don't think we should create new low-level APIs that don't support bytes.
Fair enough. I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix?
ISTM that the policy is based on a fantasy that "it looks like text to me in my use cases, so therefore it must be text for everyone."

Tres.
--
Tres Seaver  +1 540-429-0999  tseaver@palladion.com
Palladion Software  "Excellence by Design"  http://palladion.com
Tres Seaver <tseaver@palladion.com>:
On 08/19/2014 01:43 PM, Ben Hoyt wrote:
Fair enough. I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix?
ISTM that the policy is based on a fantasy that "it looks like text to me in my use cases, so therefore it must be text for everyone."
What I like about Python is that it allows me to write native linux code without having to make portability compromises that plague, say, Java. I have select.epoll(). I have os.fork(). I have socket.TCP_CORK. The "textualization" of Python3 seems part of a conscious effort to make Python more Java-esque. Marko
On 20 Aug 2014 04:18, "Marko Rauhamaa" <marko@pacujo.net> wrote:
Tres Seaver <tseaver@palladion.com>:
On 08/19/2014 01:43 PM, Ben Hoyt wrote:
Fair enough. I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix?
ISTM that the policy is based on a fantasy that "it looks like text to me in my use cases, so therefore it must be text for everyone."
What I like about Python is that it allows me to write native linux code without having to make portability compromises that plague, say, Java. I have select.epoll(). I have os.fork(). I have socket.TCP_CORK. The "textualization" of Python3 seems part of a conscious effort to make Python more Java-esque.
It's not just the JVM that says text and binary APIs should be separate - it's every widely used operating system services layer except POSIX. The POSIX way works well *if* everyone reliably encodes things as UTF-8 or always uses encoding detection, but its failure mode is unfortunately silent data corruption.

That said, there's a lot of Python software that is POSIX specific, where bytes paths would be the least of the barriers to porting to Windows or Jython. I'm personally +1 on consistently allowing binary paths in lower level APIs, but disallowing them in higher level explicitly cross platform abstractions like pathlib.

Regards,
Nick.
Marko
On 20/08/2014 07:08, Nick Coghlan wrote:
It's not just the JVM that says text and binary APIs should be separate - it's every widely used operating system services layer except POSIX. The POSIX way works well *if* everyone reliably encodes things as UTF-8 or always uses encoding detection, but its failure mode is unfortunately silent data corruption.
That said, there's a lot of Python software that is POSIX specific, where bytes paths would be the least of the barriers to porting to Windows or Jython. I'm personally +1 on consistently allowing binary paths in lower level APIs, but disallowing them in higher level explicitly cross platform abstractions like pathlib.
I fully agree with Nick's position here.

To elaborate specifically about pathlib, it doesn't handle bytes paths but allows you to generate them if desired: https://docs.python.org/3/library/pathlib.html#operators

Adding full bytes support to pathlib would have added a lot of complication and fragility in the implementation *and* in the API (is it allowed to combine str and bytes paths? should they have separate classes?), for arguably little benefit.

I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs.

Regards
Antoine.
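(A small sketch of the escape hatch Antoine points to -- pathlib stays str-based, but the documented operators let you drop down to bytes; the path used here is made up.)

    from pathlib import PurePosixPath
    import os

    p = PurePosixPath("/home/oleg") / "music.mp3"
    raw = bytes(p)               # b'/home/oleg/music.mp3', produced via os.fsencode()
    same = os.fsencode(str(p))   # the long-hand spelling of the same thing
    assert raw == same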
On Wed Aug 20 2014 at 9:02:25 AM Antoine Pitrou <antoine@python.org> wrote:
On 20/08/2014 07:08, Nick Coghlan wrote:
It's not just the JVM that says text and binary APIs should be separate - it's every widely used operating system services layer except POSIX. The POSIX way works well *if* everyone reliably encodes things as UTF-8 or always uses encoding detection, but its failure mode is unfortunately silent data corruption.
That said, there's a lot of Python software that is POSIX specific, where bytes paths would be the least of the barriers to porting to Windows or Jython. I'm personally +1 on consistently allowing binary paths in lower level APIs, but disallowing them in higher level explicitly cross platform abstractions like pathlib.
I fully agree with Nick's position here.
To elaborate specifically about pathlib, it doesn't handle bytes paths but allows you to generate them if desired: https://docs.python.org/3/library/pathlib.html#operators
Adding full bytes support to pathlib would have added a lot of complication and fragility in the implementation *and* in the API (is it allowed to combine str and bytes paths? should they have separate classes?), for arguably little benefit.
I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs.
+1 from me as well. Allowing the low-level stuff to work on bytes but keeping the high-level actually high-level keeps with our consenting adults policy as well as making things possible, but not to the detriment of the common case.
but disallowing them in higher level
explicitly cross platform abstractions like pathlib.
I think the trick here is that POSIX-using folks claim that filenames are just bytes, and indeed they can be passed around with a char*, so they seem to be. But you can't possibly do anything other than pass them around if you REALLY think they are just bytes.

So really, people treat them as "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and maybe a couple others)-is-ascii-compatible". If you assume that, then you could write a pathlib that would work. And in practice, I expect a lot of POSIX-only code works that way.

But of course, this gets ugly if you go to a platform where filenames are not "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and maybe a couple others)-is-ascii-compatible", like Windows.

I'm not sure if it's worth having a pathlib, etc. that uses this assumption -- but it could help us all write code that actually works with this screwy lack of specification.

Antoine Pitrou wrote:
To elaborate specifically about pathlib, it doesn't handle bytes paths but allows you to generate them if desired: https://docs.python.org/3/library/pathlib.html#operators
but that uses os.fsencode: "Encode filename to the filesystem encoding".

As I understand it, the whole problem with some posix systems is that there is NO filesystem encoding -- i.e. you can't know for sure what encoding a filename is in. So you need to be able to pass the bytes through as they are. (At least as I read Armin Ronacher's blog)

-Chris

--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On 21 Aug 2014 09:06, "Chris Barker" <chris.barker@noaa.gov> wrote:
As I understand it, the whole problem with some posix systems is that
there is NO filesystem encoding -- i.e. you can't know for sure what encoding a filename is in. So you need to be able to pass the bytes through as they are.
(At least as I read Armin Ronacher's blog)
Armin lets his astonishment at the idea we'd expect Linux vendors to fix their broken OS get the better of him at times - he thinks the responsibility lies entirely with us to work around its quirks and limitations :)

The "surrogateescape" error handler is our main answer to the unreliability of the POSIX encoding model - fsdecode will squirrel away arbitrary bytes as lone surrogates, and then fsencode will restore them again later. That works for the simple round tripping case, but we currently lack good default tools for "cleaning" strings that may contain surrogates (or even scanning a string to see if surrogates are present).

One idea I had along those lines is a surrogatereplace error handler (http://bugs.python.org/issue22016) that emitted an ASCII question mark for each smuggled byte, rather than propagating the encoding problem.

Cheers,
Nick.
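(A rough sketch of the two behaviours Nick contrasts: the existing surrogateescape round trip, and a hypothetical "surrogatereplace"-style cleaner. The handler name, its registration, and the sample bytes are illustrative, not an existing stdlib feature.)

    import codecs

    raw = b"report-\xe9\xfc.txt"                      # bytes that are not valid UTF-8
    name = raw.decode("utf-8", "surrogateescape")     # what fsdecode does on a UTF-8 POSIX box
    assert name.encode("utf-8", "surrogateescape") == raw   # faithful round trip

    # Hypothetical cleaning handler: one ASCII '?' per smuggled byte.
    def surrogatereplace(exc):
        return ("?" * (exc.end - exc.start), exc.end)

    codecs.register_error("surrogatereplace", surrogatereplace)
    print(name.encode("ascii", "surrogatereplace"))   # b'report-??.txt'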
Nick Coghlan writes:
One idea I had along those lines is a surrogatereplace error handler ( http://bugs.python.org/issue22016) that emitted an ASCII question mark for each smuggled byte, rather than propagating the encoding problem.
Please, don't.

"Smuggled bytes" are not independent events. They tend to be correlated *within* file names, and this handler would generate names whose human semantics get lost (and there *are* human semantics, otherwise the name would be str(some_counter)). They tend to be correlated across file names, and this handler will generate multiple files with the same munged name (and again, the differentiating human semantics get lost).

If you don't know the semantics of the intended file names, you can't generate good replacement names. This has to be an application-level function, and often requires user intervention to get good names.

If you want to provide helper functions that applications can use to clean names explicitly, that might be OK.
On 21 August 2014 12:16, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Nick Coghlan writes:
One idea I had along those lines is a surrogatereplace error handler ( http://bugs.python.org/issue22016) that emitted an ASCII question mark for each smuggled byte, rather than propagating the encoding problem.
Please, don't.
"Smuggled bytes" are not independent events. They tend to be correlated *within* file names, and this handler would generate names whose human semantics get lost (and there *are* human semantics, otherwise the name would be str(some_counter)). They tend to be correlated across file names, and this handler will generate multiple files with the same munged name (and again, the differentiating human semantics get lost).
If you don't know the semantics of the intended file names, you can't generate good replacement names. This has to be an application-level function, and often requires user intervention to get good names.
If you want to provide helper functions that applications can use to clean names explicitly, that might be OK.
Yeah, I was thinking in the context of reproducing sys.stdout's behaviour in Python 2, but that reproduces the bytes faithfully, so 'surrogateescape' already offers exactly the behaviour we want (sys.stdout will have surrogateescape enabled by default in 3.5).

I'll keep pondering the question of possible helper functions in the "string" module.

Cheers,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
but disallowing them in higher level
explicitly cross platform abstractions like pathlib.
I think the trick here is that posix-using folks claim that filenames are just bytes, and indeed they can be passed around with a char*, so they seem to be.
but you can't possible do anything other than pass them around if you REALLY think they are just bytes.
So really, people treat them as "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and maybe a couple others)-is-ascii-compatible"
As someone who fought long and hard in the surrogate-escape listdir() wars, and was won over once the scheme was thoroughly explained to me, I take issue with these assertions: they are bogus or misleading.

Firstly, POSIX filenames _are_ just byte strings. The only forbidden character is the NUL byte, which terminates a C string, and the only special character is the slash, which separates pathname components.

Second, a bare low level program cannot do _much_ more than pass them around. It certainly can do things like compute their basename, or other path related operations.

The "bytes in some arbitrary encoding where at least the slash character (and maybe a couple others) is ascii compatible" notion is completely bogus. There's only one special byte, the slash (code 47). There's no OS-level need that it or anything else be ASCII compatible. I think characterisations such as the one quoted are actively misleading.

The way you get UTF-8 (or some other encoding, fortunately getting less and less common) is by convention: you decide in your environment to work in some encoding (say utf-8) via the locale variables, and all your user-facing text gets used in UTF-8 encoding form when turned into bytes for the filename calls because your text<->bytes methods say to do so.

I think we'd all agree it is nice to have a system where filenames are all Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore the reality for such systems. I certainly think the Windows-side Babel of code pages and multiple code systems is far far worse. (Disclaimer: not a Windows programmer, just based on hearing them complain.) I'm +1000 on systems where the filesystem enforces Unicode (eg Plan 9 or Mac OSX, which forces a specific UTF-8 encoding in the bytes POSIX APIs - the underlying filesystems reject invalid byte sequences).

[...]
Antoine Pitrou wrote:
To elaborate specifically about pathlib, it doesn't handle bytes paths but allows you to generate them if desired: https://docs.python.org/3/library/pathlib.html#operators
but that uses os.fsencode: Encode filename to the filesystem encoding
As I understand it, the whole problem with some posix systems is that there is NO filesystem encoding -- i.e. you can't know for sure what encoding a filename is in. So you need to be able to pass the bytes through as they are.
Yes and no. I made that argument too. There's no _external_ "filesystem encoding" in the sense of something recorded in the filesystem that anyone can inspect. But there are the expressed locale settings, available at runtime to any program that cares to pay attention. It is a workable situation.

Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly internally consistent. It just doesn't match what he wants. (Indeed, what I want, and I'm a long time UNIX fanboy.)

Cheers,
Cameron Simpson <cs@zip.com.au>

God is real, unless declared integer. - Johan Montald, johan@ingres.com
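(For reference, the runtime knobs Cameron is referring to are easy to inspect; the outputs shown are just what a typical UTF-8 Linux box might print.)

    import sys, locale

    print(sys.getfilesystemencoding())         # e.g. 'utf-8' -- what fsencode/fsdecode use
    print(locale.getpreferredencoding(False))  # e.g. 'UTF-8' -- derived from LANG/LC_* settings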
Hi! On Thu, Aug 21, 2014 at 02:52:19PM +1000, Cameron Simpson <cs@zip.com.au> wrote:
Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly internally consistent. It just doesn't match what he wants. (Indeed, what I want, and I'm a long time UNIX fanboy.)
Cheers, Cameron Simpson <cs@zip.com.au>
+1 from another Unix fanboy. Like an old wine, Unix becomes better with years! ;-) Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On 21 August 2014 14:52, Cameron Simpson <cs@zip.com.au> wrote:
Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly internally consistent. It just doesn't match what he wants. (Indeed, what I want, and I'm a long time UNIX fanboy.)
The part that is broken is the idea that locale encodings are a viable solution to conveying the appropriate encoding to use to talk to the operating system. We've tried trusting them with Python 3, and they're reliably wrong in certain situations. systemd is apparently better than upstart at setting them correctly (e.g. for cron jobs), but even it can't defend against an erroneous (or deliberate!) "LANG=C", or ssh environment forwarding pushing a client's locale to the server.

It's worth looking through some of Armin Ronacher's complaints about Python 3 being broken on Linux, and seeing how many of them boil down to "trusting the locale is wrong, Python 3 should just assume UTF-8 on every POSIX system, the same way it does on Mac OS X". (I suspect ShiftJIS, ISO-2022, et al users might object to that approach, but it's at least a more viable choice now than it was back in 2008)

I still think we made the right call at least *trying* the idea of trusting the locale encoding (since that's the officially supported way of getting this information from the OS), and in many, many situations it works fine. But I suspect we may eventually need to resolve the technical issues currently preventing us from deciding to ignore the environmental locale during interpreter startup and try something different (such as always assuming UTF-8, or trying to force C.UTF-8 if we detect the C locale, or looking for the systemd config files and using those to set the OS encoding, rather than the environmental locale).

Regards,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 21/08/2014 00:52, Cameron Simpson wrote:
The "bytes in some arbitrary encoding where at least the slash character (and maybe a couple others) is ascii compatible" notion is completely bogus. There's only one special byte, the slash (code 47). There's no OS-level need that it or anything else be ASCII compatible.
Of course there is. Try to split an UTF-16-encoded file path on the byte 47 and you'll get a lot of garbage. So, yes, POSIX implicitly mandates an ASCII-compatible encoding for file paths. Regards Antoine.
On 21Aug2014 09:20, Antoine Pitrou <antoine@python.org> wrote:
On 21/08/2014 00:52, Cameron Simpson wrote:
The "bytes in some arbitrary encoding where at least the slash character (and maybe a couple others) is ascii compatible" notion is completely bogus. There's only one special byte, the slash (code 47). There's no OS-level need that it or anything else be ASCII compatible.
Of course there is. Try to split an UTF-16-encoded file path on the byte 47 and you'll get a lot of garbage. So, yes, POSIX implicitly mandates an ASCII-compatible encoding for file paths.
[Rolls eyes.]

Looking at the UTF-16 encoding, it looks like it also embeds NUL bytes for various codes below 32768. How are they handled?

As remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX filename byte strings. If you imagine you can embed bare UTF-16 freely even excluding code 47, I think one of us is missing something.

That's not "ASCII compatible". That's "not all byte codes can be freely used without thought", and any multibyte coding will have to consider such things when embedding itself in another coding scheme.

Cheers,
Cameron Simpson <cs@zip.com.au>

Microsoft: Committed to putting the "backward" into "backward compatibility."
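(A quick illustration of both sides of this exchange, as CPython would show it: UTF-16 text is full of NUL bytes, and byte 47 can turn up inside a character that is not a slash at all.)

    "/tmp/data".encode("utf-16-le")   # b'/\x00t\x00m\x00p\x00/\x00d\x00a\x00t\x00a\x00' -- NULs everywhere
    "\u2f00".encode("utf-16-le")      # b'\x00/' -- an innocent character containing byte 47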
On 21 August 2014 23:27, Cameron Simpson <cs@zip.com.au> wrote:
That's not "ASCII compatible". That's "not all byte codes can be freely used without thought", and any multibyte coding will have to consider such things when embedding itself in another coding scheme.
I wonder how badly a Unix system would break if you specified UTF16 as the system encoding...? Paul
On 8/21/2014 3:42 PM, Paul Moore wrote:
I wonder how badly a Unix system would break if you specified UTF16 as the system encoding...? Paul
Does Unix even support UTF-16 as an encoding? I suppose, these days, it probably does, for reading contents of files created on Windows, etc. (Unicode was just gaining traction when I last used Unix in a significant manner; yes, my web host runs Linux, and I know enough to do what can be done there... but haven't experimented with encodings other than ASCII & UTF-8 on the web host, and don't intend to). If it allows configuration of UTF-16 or UTF-32 as system encodings, I would consider that a bug, though, as too much of Unix predates Unicode, and would be likely to fail.
Does Unix even support UTF-16 as an encoding? I suppose, these days, it probably does, for reading contents of files created on Windows, etc.
I don't think Unix supports any encodings at all for the _contents_ of files -- that's up to applications. Of course the command line text processing tools need to know -- I'm guessing those are never going to work w/UTF-16!

"System encoding" is a nice idea, but pretty much worthless. It's only helpful for files created and processed on the same system -- and it's not rare for that not to be the case.

This brings up the other key problem. If file names are (almost) arbitrary bytes, how do you write one to/read one from a text file with a particular encoding? (Or, for that matter, display it on a terminal?)

And people still want to say posix isn't broken in this regard? Sigh.

-Chris
On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
This brings up the other key problem. If file names are (almost) arbitrary bytes, how do you write one to/read one from a text file with a particular encoding? ( or for that matter display it on a terminal)
There is no such thing as an encoding of text files. So we just write those bytes to the file or output them to the terminal. I often do that.

My filesystems are full of files with names and content in at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a terminal with koi8 or utf-8 locale and fonts, and some files always look weird. But however weird they are, it's possible to work with them.

The bigger problem is line feeds. A filename with linefeeds can be put into a text file, but cannot be read back. So one has to transform such names. Usually s/\\/\\\\/g and s/\n/\\n/g is enough. (-:
And people still want to say posix isn't broken in this regard?
Not at all! And broken or not broken, it's what I (for many different reasons) prefer to use for my desktops, servers, notebooks, routers and smartphones, so if Python stood in my way I'd rather switch to different tools. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
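(A tiny sketch of the line-feed escaping Oleg describes above; the scheme -- backslash doubling plus \n -- is his, the helper names are made up.)

    import re

    def escape_name(raw: bytes) -> bytes:
        # Oleg's s/\\/\\\\/g and s/\n/\\n/g: backslashes first, then newlines,
        # so every filename fits on a single line of the text file.
        return raw.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")

    def unescape_name(line: bytes) -> bytes:
        # Undo both escapes in one left-to-right pass (two .replace() calls
        # here would mangle a literal backslash followed by 'n').
        return re.sub(rb"\\(.)", lambda m: b"\n" if m.group(1) == b"n" else m.group(1), line)

    assert unescape_name(escape_name(b"weird\nna\\me.mp3")) == b"weird\nna\\me.mp3"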
On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
This brings up the other key problem. If file names are (almost) arbitrary bytes, how do you write one to/read one from a text file with a particular encoding? ( or for that matter display it on a terminal)
There is no such thing as an encoding of text files.
I don't understand this comment. It seems to me that *text* files have to have an encoding, otherwise you can't interpret the contents as text. Files, of course, only contain bytes, but to be treated as text you need some way of transforming byte N to char C (or multiple bytes to C), which is an encoding.

Perhaps you just mean that encodings are not recorded in the text file itself?

To answer Chris' question, you typically cannot include arbitrary bytes in text files, and displaying them to the user is likewise problematic. The usual solution is to support some form of escaping, like \t #x0A; or %0D, to give a few examples.

--
Steven
Hi! On Sat, Aug 23, 2014 at 01:19:14AM +1000, Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
This brings up the other key problem. If file names are (almost) arbitrary bytes, how do you write one to/read one from a text file with a particular encoding? ( or for that matter display it on a terminal)
There is no such thing as an encoding of text files.
I don't understand this comment. It seems to me that *text* files have to have an encoding, otherwise you can't interpret the contents as text.
What encoding does have a text file (an HTML, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.
Files, of course, only contain bytes, but to be treated as bytes you need some way of transforming byte N to char C (or multiple bytes to C), which is an encoding.
But you don't need to treat the entire file in one encoding. Strange characters are clearly visible so you can interpret them differently. I am very much trained to distinguish koi8, cp1251 and utf-8 texts; I cannot translate them mentally but I can recognize them.
Perhaps you just mean that encodings are not recorded in the text file itself?
Yes, that too.
To answer Chris' question, you typically cannot include arbitrary bytes in text files, and displaying them to the user is likewise problematic
As a person who view utf-8 files in koi8 fonts (and vice versa) every day I'd argue. (-: Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On 8/22/2014 8:51 AM, Oleg Broytman wrote:
What encoding does have a text file (an HTML, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.
That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings. If it is named .html and served by the server as UTF-8, then the server is misconfigured, or the file is incorrectly populated.
On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/22/2014 8:51 AM, Oleg Broytman wrote:
What encoding does have a text file (an HTML, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.

That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings.
Allow me to disagree. For me, this is a text file which I can (and do) view with a pager, edit with a text editor, list on a console, search with grep and so on. If it is not a text file by strict Python3 standards then these standards are too strict for me. Either I find a simple workaround in Python3 to work with such texts or find a different tool. I cannot avoid such files because my reality is much more complex than strict text/binary dichotomy in Python3. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On 8/22/2014 9:52 AM, Oleg Broytman wrote:
On 8/22/2014 8:51 AM, Oleg Broytman wrote:
What encoding does have a text file (an HTML, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.
On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:

That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings.

Allow me to disagree. For me, this is a text file which I can (and do) view with a pager, edit with a text editor, list on a console, search with grep and so on. If it is not a text file by strict Python3 standards then these standards are too strict for me. Either I find a simple workaround in Python3 to work with such texts or find a different tool. I cannot avoid such files because my reality is much more complex than strict text/binary dichotomy in Python3.

Oleg.
I was not declaring your file not to be a "text file" from any definition obtained from Python3 documentation, just from a common sense definition of "text file".

Looking at it from Python3, though, it is clear that when opening a file in "text" mode, an encoding may be specified or will be assumed. That is one encoding, applying to the whole file, not 3 encodings, with declarations on when to switch between them. So I think, in general, Python3 assumes or defines a definition of text file that matches my "common sense" definition. Also, if it is an HTML file, I doubt the browser will use multiple different encodings when interpreting it, so it is not clear that the file is of practical use for its intended purpose if it contains text in multiple different encodings, but is served using only a single encoding, unless there is javascript or some programming in the browser that reencodes the data.

On the other hand, Python3 provides various facilities for working with such files.

The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening files in binary mode, and then decoding various sections of it using whatever encoding you like, using the bytes.decode() operation on various sections of the file. Determination of which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent.

The second is to specify an error handler, that, like you, is trained to recognize the other encodings and convert them appropriately. I'm not aware that such an error handler has been or could be written, myself not having your training.

The third is to specify the UTF-8 with the surrogate escape error handler. This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as smart as you, could perhaps be developed to detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply allow them to ride along without interpretation, and be emitted as the original, into other files.

There may be other techniques that I am not aware of.

Glenn
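(A hedged sketch of the third technique above; the file name and the choice of UTF-8 are assumptions. Opened this way, undecodable bytes survive a read/modify/write cycle untouched.)

    # Read a possibly mixed-encoding "text" file without losing the odd bytes.
    with open("playlist.txt", encoding="utf-8", errors="surrogateescape") as f:
        lines = f.readlines()

    # ... edit only the lines you understand ...

    with open("playlist.txt", "w", encoding="utf-8", errors="surrogateescape") as f:
        f.writelines(lines)   # smuggled bytes are re-emitted exactly as they were read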
On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/22/2014 9:52 AM, Oleg Broytman wrote:
On 8/22/2014 8:51 AM, Oleg Broytman wrote:
What encoding does have a text file (an HTML, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.

On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:

That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings.

Allow me to disagree. For me, this is a text file which I can (and do) view with a pager, edit with a text editor, list on a console, search with grep and so on. If it is not a text file by strict Python3 standards then these standards are too strict for me. Either I find a simple workaround in Python3 to work with such texts or find a different tool. I cannot avoid such files because my reality is much more complex than strict text/binary dichotomy in Python3.
I was not declaring your file not to be a "text file" from any definition obtained from Python3 documentation, just from a common sense definition of "text file".
And in my opinion those files are perfect text. The files consist of lines separated by EOL characters (not necessarily the EOL characters of my OS, because it could be a text file produced in a different OS), lines consist of words and words of characters.
Looking at it from Python3, though, it is clear that when opening a file in "text" mode, an encoding may be specified or will be assumed. That is one encoding, applying to the whole file, not 3 encodings, with declarations on when to switch between them. So I think, in general, Python3 assumes or defines a definition of text file that matches my "common sense" definition.
I don't have problems with Python3 text. I have problems with Python3 trying to get rid of byte strings and treating bytes as strict non-text.
On the other hand, Python3 provides various facilities for working with such files.
The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening files in binary mode, and then decoding various sections of it using whatever encoding you like, using the bytes.decode() operation on various sections of the file. Determination of which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent.
This is perhaps the most promising approach. If I can open a text file in binary mode, iterate it line by line, split every line of non-ascii bytes with .split() and process them, that'd satisfy my needs. But still there are dragons. If I read a filename from such a file I read it as bytes, not str, so I can only use low-level APIs to manipulate those filenames. Pity.

Let's see a perfectly normal situation I am quite often in. A person sent me a directory full of MP3 files. The transport doesn't matter; it could be FTP, or rsync, or a zip file sent by email, or bittorrent. What matters is that filenames and content are in alien encodings. Most often it's cp1251 (the encoding used in Russian Windows) but can be koi8 or utf8. There is a playlist among the files -- a text file that lists MP3 files, every file on a single line; usually with full paths ("C:\Audio\some.mp3"). Now I want to read filenames from the file and process the filenames (strip paths) and files (verify the existence of files, or renumber the files, or extract ID3 tags [Russian ID3 tags, whatever the ID3 standard says, are also in cp1251 of utf-8 encoding]...whatever).

I don't know the encoding of the playlist but I know it corresponds to the encoding of the filenames, so I can expect those files exist on my filesystem; they have strange-looking unreadable names but they exist. Just a small example of why I do want to process filenames from a text file in an alien encoding, without knowing the encoding in advance.
The second is to specify an error handler, that, like you, is trained to recognize the other encodings and convert them appropriately. I'm not aware that such an error handler has been or could be written, myself not having your training.
The third is to specify the UTF-8 with the surrogate escape error handler. This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as smart as you, could perhaps be developed to detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply allow them to ride along without interpretation, and be emitted as the original, into other files.
Yes, these are different workarounds. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
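(For what it's worth, the bytes-only route does already work with the low-level APIs; here is a rough sketch of Oleg's playlist case, with the playlist name and the Windows-style path prefix as assumptions, and everything kept as bytes because the encoding is unknown.)

    import os, sys

    with open(b"playlist.txt", "rb") as f:                      # playlist name is made up
        for line in f:
            name = line.rstrip(b"\r\n").rsplit(b"\\", 1)[-1]     # strip a "C:\Audio\..." prefix
            if name and not os.path.exists(name):                # the low-level APIs accept bytes
                sys.stdout.buffer.write(b"missing: " + name + b"\n")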
On 8/22/2014 11:50 AM, Oleg Broytman wrote:
On 8/22/2014 9:52 AM, Oleg Broytman wrote:
On 8/22/2014 8:51 AM, Oleg Broytman wrote:
What encoding does have a text file (an HTML, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.
On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:

That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings.

Allow me to disagree. For me, this is a text file which I can (and do) view with a pager, edit with a text editor, list on a console, search with grep and so on. If it is not a text file by strict Python3 standards then these standards are too strict for me. Either I find a simple workaround in Python3 to work with such texts or find a different tool. I cannot avoid such files because my reality is much more complex than strict text/binary dichotomy in Python3.

On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:

I was not declaring your file not to be a "text file" from any definition obtained from Python3 documentation, just from a common sense definition of "text file".

And in my opinion those files are perfect text. The files consist of lines separated by EOL characters (not necessarily the EOL characters of my OS, because it could be a text file produced in a different OS), lines consist of words and words of characters.
Until you know or can deduce the encoding of a file, it is binary. If it has multiple, different, embedded encodings of text, it is still binary. In my opinion. So these are just opinions, and naming conventions. If you call it text, you have a different definition of text file than I do.
Looking at it from Python3, though, it is clear that when opening a file in "text" mode, an encoding may be specified or will be assumed. That is one encoding, applying to the whole file, not 3 encodings, with declarations on when to switch between them. So I think, in general, Python3 assumes or defines a definition of text file that matches my "common sense" definition. I don't have problems with Python3 text. I have problems with Python3 trying to get rid of byte strings and treating bytes as strict non-text.
Python3 is not trying to get rid of byte strings. But to some extent, it is wanting to treat bytes as non-text... bytes can be encoded text, but is not text until it is decoded. There is some processing that can be done on encoded text, but it has to be done differently (in many cases) than processing done on (non-encoded) text. One difference is the interpretation of what character is what varies from encoding to encoding, so if the processing requires understanding the characters, then the character code must be known. On the other hand, if it suffices to detect blocks of opaque text delimited by a known set of delimiters codes (EOL: CR, LF, combinations thereof) then that can be done relatively easily on binary, as long as the encoding doesn't have data puns where a multibyte encoded character might contain the code for the delimiter as one of the bytes of the code for the character.
On the other hand, Python3 provides various facilities for working with such files.
The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening files in binary mode, and then decoding various sections of it using whatever encoding you like, using the bytes.decode() operation on various sections of the file. Determination of which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent. This is perhaps the most promising approach. If I can open a text file in binary mode, iterate it line by line, split every line of non-ascii bytes with .split() and process them that'd satisfy my needs. But still there are dragons. If I read a filename from such file I read it as bytes, not str, so I can only use low-level APIs to manipulate with those filenames. Pity.
If the file names are in an unknown encoding, both in the directory and in the encoded text in the file listing, then unless you can deduce the encoding, you would be limited to doing manipulations with file APIs that support bytes, the low-level ones, yes. If you can deduce the encoding, then you are freed from that limitation.
Let see a perfectly normal situation I am quite often in. A person sent me a directory full of MP3 files. The transport doesn't matter; it could be FTP, or rsync, or a zip file sent by email, or bittorrent. What matters is that filenames and content are in alien encodings. Most often it's cp1251 (the encoding used in Russian Windows) but can be koi8 or utf8. There is a playlist among the files -- a text file that lists MP3 files, every file on a single line; usually with full paths ("C:\Audio\some.mp3"). Now I want to read filenames from the file and process the filenames (strip paths) and files (verify existing of files, or renumber the files or extract ID3 tags [Russian ID3 tags, whatever ID3 standard says, are also in cp1251 of utf-8 encoding]...whatever).
"cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is utf-8, but it is not both. Maybe you meant "or" instead of "of".
I don't know the encoding of the playlist but I know it corresponds to the encoding of filenames so I can expect those files exist on my filesystem; they have strangely looking unreadable names but they exist. Just a small example of why I do want to process filenames from a text file in an alien encoding. Without knowing the encoding in advance.
An interesting example, for sure. Life will be easier when everyone converts to Unicode and UTF-8.
The second is to specify an error handler, that, like you, is trained to recognize the other encodings and convert them appropriately. I'm not aware that such an error handler has been or could be written, myself not having your training.
The third is to specify the UTF-8 with the surrogate escape error handler. This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as smart as you, could perhaps be developed to detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply allow them to ride along without interpretation, and be emitted as the original, into other files. Yes, these are different workarounds.
Oleg.
On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
"cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is utf-8, but it is not both. Maybe you meant "or" instead of "of".
I'd assume "or" meant there, rather than "of", it's a common typo. Not sure why 1251, specifically, but it's not uncommon for boundary code to attempt a decode that consists of something like "attempt UTF-8 decode, and if that fails, attempt an eight-bit decode". For my MUD clients, that's pretty much required; one of the servers I frequent is completely bytes-oriented, so whatever encoding one client uses will be dutifully echoed to every other client. There are some that correctly use UTF-8, but others use whatever they feel like; and since those naughty clients are mainly on Windows, I can reasonably guess that they'll be using CP-1252. So that's what I do: UTF-8, fall-back on 1252. (It's also possible some clients will be using Latin-1, but 1252 is a superset of that.) But it's important to note that this is a method of handling junk. It's not a design intention; this is for a situation where I really want to cope with any byte stream and attempt to display it as text. And if I get something that's neither UTF-8 nor CP-1252, I will display it wrongly, and there's nothing can be done about that. ChrisA
On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
"cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is utf-8, but it is not both. Maybe you meant "or" instead of "of".
I'd assume "or" meant there, rather than "of", it's a common typo.
Not sure why 1251, specifically
This is the encoding of Russian Windows. Files and emails in Russia are mostly in cp1251 encoding; something like 60-70%, I think. The second most popular encoding is cp866 (Russian DOS); it's used by Windows as the OEM encoding. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On Sat, Aug 23, 2014 at 8:26 AM, Oleg Broytman <phd@phdru.name> wrote:
On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
"cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is utf-8, but it is not both. Maybe you meant "or" instead of "of".
I'd assume "or" meant there, rather than "of", it's a common typo.
Not sure why 1251, specifically
This is the encoding of Russian Windows. Files and emails in Russia are mostly in cp1251 encoding; something like 60-70%, I think. The second popular encoding is cp866 (Russian DOS); it's used by Windows as OEM encoding.
Yeah, that makes sense. In any case, you pick one "most likely" 8-bit encoding and go with it. ChrisA
Chris Angelico writes:
Not sure why 1251,
All of those codes have repertoires that are Cyrillic supersets, presumably Russian-language content, based on Oleg's top domain.
But it's important to note that this is a method of handling junk. It's not a design intention; this is for a situation where I really want to cope with any byte stream and attempt to display it as text. And if I get something that's neither UTF-8 nor CP-1252, I will display it wrongly, and there's nothing can be done about that.
Of course there is. It just gets more heuristic the more numerous the potential encodings are.
On Fri, Aug 22, 2014 at 01:17:44PM -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
in cp1251 of utf-8 encoding
"cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is utf-8, but it is not both. Maybe you meant "or" instead of "of".
But of course! Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
What encoding does have a text file (an HTML, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.
That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings.
Allow me to disagree. For me, this is a text file which I can (and do) view with a pager, edit with a text editor, list on a console, search with grep and so on. If it is not a text file by strict Python3 standards then these standards are too strict for me. Either I find a simple workaround in Python3 to work with such texts or find a different tool. I cannot avoid such files because my reality is much more complex than strict text/binary dichotomy in Python3.
First -- we're getting OT here -- this thread was about file and path names, not the contents of files. But I suppose I brought that in when I talked about writing file names to files...

The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening files in binary mode, and then decoding various sections of it using whatever encoding you like, using the bytes.decode() operation on various sections of the file. Determination of which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent.

right -- and you would have wanted to open such a file in binary mode with py2 as well, but in that case, you'd have the contents in a py2 string object, which has a few more convenient ways to work with text (at least ascii-compatible) than the py3 bytes object does.

The third is to specify the UTF-8 with the surrogate escape error handler. This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as smart as you, could perhaps be developed to detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply allow them to ride along without interpretation, and be emitted as the original, into other files.

Just so I'm clear here -- if you write that back out, encoded as utf-8 -- you'll get the exact same binary blob out as came in? I wonder if this would make it hard to preserve byte boundaries, though.

By the way, IIUC, you can also use the Python latin-1 decoder -- anything latin-1 will come through correctly, anything not valid latin-1 will come in as garbage, but if you re-encode with latin-1 the original bytes will be preserved. I think this will also preserve a 1:1 relationship between character count and byte count, which could be handy.

-Chris

--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
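(The latin-1 trick Chris describes, spelled out with purely illustrative bytes.)

    raw = b"\xc0\xff arbitrary bytes \x00\x80"
    text = raw.decode("latin-1")            # never fails: every byte maps to one character
    assert len(text) == len(raw)            # the 1:1 byte/character relationship
    assert text.encode("latin-1") == raw    # and the round trip is lossless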
Chris Barker writes:
The third is to specify the UTF-8 with the surrogate escape error handler. This allows non-UTF-8 codes to be loaded into memory.
Read as bytes and incrementally decode. If you hit an Exception, retry from that point.
Just so I'm clear here -- if you write that back out, encoded as utf-8 -- you'll get the exact same binary blob out as came in?
If and only if there are no changes to the content.
I wonder if this would make it hard to preserve byte boundaries, though.
I'm not sure what you mean by "byte boundaries". If you mean after concatenation of such objects, yes, the uninterpretable bytes will be encoded in such a way as to be identifiable as lone bytes; they won't be interpreted as Unicode characters.
By the way, IIUC correctly, you can also use the python latin-1 decoder -- anything latin-1 will come through correctly, anything not valid latin-1 will come in as garbage, but if you re-encode with latin-1 the original bytes will be preserved. I think this will also preserve a 1:1 relationship between character count and byte count, which could be handy.
Bad idea, especially for Oleg's use case -- you can't decode those by codec without reencoding to bytes first. No point in abandoning codecs just because there isn't one designed for his use case exactly. Just read as bytes and decode piecewise in one way or another. For Oleg's HTML case, there's a well-understood structure that can be used to determine retry points and a very few plausible coding systems, which can be fairly well distinguished by the range of bytes used and probably nearly perfectly with additional information from the structure and distribution of apparently decoded characters.
"Stephen J. Turnbull" <stephen@xemacs.org>:
Just read as bytes and decode piecewise in one way or another. For Oleg's HTML case, there's a well-understood structure that can be used to determine retry points
HTML and XML are interesting examples since their encoding is initially unknown:

    <?xml version="1.0"?>
                        ^
                        +--- Now I know it is UTF-8

    <?xml version="1.0" encoding="UTF-16"?>
                                          ^
                                          +--- Now I know it was UTF-16 all along!

Then we have:

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=ISO-8859-1

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you realize the content encoding is UTF-16.

Marko
On Sat, 23 Aug 2014, Marko Rauhamaa wrote:
"Stephen J. Turnbull" <stephen@xemacs.org>:
Just read as bytes and decode piecewise in one way or another. For Oleg's HTML case, there's a well-understood structure that can be used to determine retry points
HTML and XML are interesting examples since their encoding is initially unknown:
<?xml version="1.0"?> ^ +--- Now I know it is UTF-8
<?xml version="1.0" encoding="UTF-16"?> ^ +--- Now I know it was UTF-16 all along!
Then we have:
HTTP/1.1 200 OK Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
See how deep you have to parse the TCP stream before you realize the content encoding is UTF-16.
For HTML it's not quite so bad. According to the HTML 4 standard: http://www.w3.org/TR/html4/charset.html The Content-Type header takes precedence over a <meta> element. I thought I read once that the reason was to allow proxy servers to transcode documents but I don't have a cite for that. Also, the <meta> element "must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters" so the initial UTF-16 example wouldn't be conformant in HTML. In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF (byte order mark) is used: http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration Not sure about XML. Of course this whole area is a bit of an "arms race" between programmers competing to get away with being as sloppy as possible and other programmers who have to deal with their mess. Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist
Isaac Morland <ijmorlan@uwaterloo.ca>:
HTTP/1.1 200 OK Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
For HTML it's not quite so bad. According to the HTML 4 standard: [...]
The Content-Type header takes precedence over a <meta> element. I thought I read once that the reason was to allow proxy servers to transcode documents but I don't have a cite for that. Also, the <meta> element "must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters" so the initial UTF-16 example wouldn't be conformant in HTML.
That's not how I read it: The META declaration must only be used when the character encoding is organized such that ASCII characters stand for themselves (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element. <URL: http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#doc-char-set> IOW, you must obey the HTTP character encoding until you have parsed a conflicting META content-type declaration. The author of the standard keeps a straight face and continues: For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding. Marko
On Sat, 23 Aug 2014, Marko Rauhamaa wrote:
Isaac Morland <ijmorlan@uwaterloo.ca>:
HTTP/1.1 200 OK Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
For HTML it's not quite so bad. According to the HTML 4 standard: [...]
The Content-Type header takes precedence over a <meta> element. I thought I read once that the reason was to allow proxy servers to transcode documents but I don't have a cite for that. Also, the <meta> element "must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters" so the initial UTF-16 example wouldn't be conformant in HTML.
That's not how I read it:
The META declaration must only be used when the character encoding is organized such that ASCII characters stand for themselves (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element.
<URL: http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#doc-char-set>
IOW, you must obey the HTTP character encoding until you have parsed a conflicting META content-type declaration.
From the same document:
--------------------------------------------------------------------------
To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.
--------------------------------------------------------------------------

This is a priority list - if the Content-Type header gives a charset, it takes precedence, and all other sources for the encoding are ignored. The "charset=" on an <img> or similar is only used if it is the only source for the encoding. The "at least until the META element is parsed" bit allows for the use of encodings which make use of shifting. So maybe they start out ASCII-compatible, but after a particular shift byte is seen those bytes now stand for Japanese Kanji characters until another shift byte is seen. This is allowed by the specification, as long as none of the non-ASCII-compatible stuff is seen before the META element.
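A simplified sketch of that priority order (the helper and its regex are invented here for illustration; real-world encoding sniffing, as done by browsers or html5lib, handles far more cases):

    import re

    META_CHARSET = re.compile(
        rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

    def guess_encoding(content_type_header, body_bytes, default="ISO-8859-1"):
        # 1. An HTTP "charset" parameter in the Content-Type field wins outright.
        if content_type_header and "charset=" in content_type_header:
            return content_type_header.split("charset=")[1].split(";")[0].strip()
        # 2. Otherwise look for a META declaration near the start of the body
        #    (only meaningful if the encoding is ASCII-compatible up to that point).
        match = META_CHARSET.search(body_bytes[:2048])
        if match:
            return match.group(1).decode("ascii")
        # 3. Fall back to the protocol default.
        return default

    guess_encoding("text/html; charset=ISO-8859-1",
                   b'<meta http-equiv="Content-Type" content="text/html; charset=utf-16">')
    # -> 'ISO-8859-1' (the header takes precedence over the meta element)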
The author of the standard keeps a straight face and continues:
I like your way of putting this - "straight face" indeed. The third option really is a hack to allow working around nonsensical situations (and even the META tag is pretty questionable). All this complexity because people can't be bothered to do things properly.
For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding.
Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist
Isaac Morland writes:
I like your way of putting this - "straight face" indeed. The third option really is a hack to allow working around nonsensical situations (and even the META tag is pretty questionable). All this complexity because people can't be bothered to do things properly.
At least in Japan and Russia, doing things "properly" in your sense in heterogeneous distributed systems is really hard, requiring use of rather fragile encoding detection heuristics that break at the slightest whiff of encodings that are unusual in the particular locale, and in Japan requiring equally fragile transcoding programs that break on vendor charset variations. The META "charset" attribute is useful in those contexts, and the "charset" attribute for external elements may have been useful in the past as well, although I've never needed it. I agree that an environment where "charset" attributes on META and other elements are needed kinda sucks, but the prerequisite for "doing things properly" is basically Unicode[1], and that just wasn't going to happen until at least the 1990s. To make the transition in less than several decades would have required a degree of monopoly in software production that I shudder to contemplate. Even today there are programmers around the world grumbling about having to deal with the Unicode coded character set. Footnotes: [1] More precisely, a universal coded character set. TRON code or MULE code would have done (but yuck!). ISO 2022 won't do!
Isaac Morland wrote:
In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF (byte order mark) is used:
http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration
Not sure about XML.
According to Appendix F here: http://www.w3.org/TR/xml/#sec-guessing an XML parser needs to be prepared to try all the encodings it supports until it finds one that works well enough to decode the XML declaration, then it can find out the exact encoding used. -- Greg
Am 24.08.14 03:11, schrieb Greg Ewing:
Isaac Morland wrote:
In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF (byte order mark) is used:
http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration
Not sure about XML.
According to Appendix F here:
http://www.w3.org/TR/xml/#sec-guessing
an XML parser needs to be prepared to try all the encodings it supports until it finds one that works well enough to decode the XML declaration, then it can find out the exact encoding used.
That's not what this section says. Instead, it says that you need to auto-detect UCS-4, UTF-16, UTF-8 from the BOM, or guess them or EBCDIC from the encoding of '<?'. This should be enough to actually parse the encoding declaration. Other non-ASCII-compatible encodings can only be used if declared in an upper-level protocol (such as HTTP). The parser is not expected to try out all encodings it supports. Regards, Martin
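A simplified sketch of that detection step (the function name is invented; it covers only a few rows of the Appendix F table, and the provisional answer still has to be refined by reading the encoding declaration itself):

    def detect_xml_family(first4: bytes) -> str:
        if first4.startswith(b"\xef\xbb\xbf"):
            return "utf-8"            # UTF-8 BOM
        if first4.startswith(b"\xff\xfe") or first4.startswith(b"\xfe\xff"):
            return "utf-16"           # UTF-16 BOM (LE/BE)
        if first4 == b"<?xm":
            return "utf-8"            # ASCII-compatible; read the declaration to refine
        if first4 == b"\x00<\x00?":
            return "utf-16-be"        # no BOM, but 00 3C 00 3F means big-endian UTF-16
        if first4 == b"<\x00?\x00":
            return "utf-16-le"
        if first4 == b"\x4c\x6f\xa7\x94":
            return "cp500"            # '<?xm' in some EBCDIC flavor; the declaration
                                      # must then name the exact code page
        return "utf-8"                # the XML default when nothing else matches

    detect_xml_family(b"<?xm")        # -> 'utf-8' (provisional, pending the declaration)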
On Thu, Aug 21, 2014 at 7:42 PM, Oleg Broytman <phd@phdru.name> wrote:
On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:
This brings up the other key problem. If file names are (almost) arbitrary bytes, how do you write one to/read one from a text file with a particular encoding? ( or for that matter display it on a terminal)
There is no such thing as an encoding of text files. So we just write those bytes to the file
So I write bytes that are encoded one way into a text file that's encoded another way, and expect to be able to read that later? You're kidding, right? Only if that's the only thing in the file -- usually not the case with my text files.
or output them to the terminal. I often do that. My filesystems are full of files with names and content in at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a terminal with koi8 or utf-8 locale and fonts and some files always look weird. But however weird they are it's possible to work with them.
Not for me (or many other users) -- terminals are sometimes set with ascii-only encoding, so non-ascii barfs -- or you get some weird control characters that mess up your terminal -- dumping arbitrary bytes to a terminal does not always "just work".
And people still want to say posix isn't broken in this regard?
Not at all! And broken or not broken it's what I (for many different reasons) prefer to use for my desktops, servers, notebooks, routers and smartphones,
Sorry -- that's a Red Herring -- I agree, "broken" or "simple and consistent" is irrelevant, we all want Python to work as well as it can on such systems. The point is that if you are reading a file name from the system, and then passing it back to the system, then you can treat it as just bytes -- who cares? And if you add the byte value of 47 thing, then you can even do basic path manipulations. But once you want to do other things with your file name, then you need to know the encoding. And it is very, very common for users to need to do other things with filenames, and they almost always want them as text that they can read and understand. Python3 supports this case very well. But it does indeed make it hard to work with filenames when you don't know the encoding they are in. And apparently that's pretty common -- or common enough that it would be nice for Python to support it well. This trick is how -- we'd like the "just pass it around and do path manipulations" case to work with (almost) arbitrary bytes, but everything else to work naturally with text (unicode text). Which brings us to the "what APIs should accept bytes" question. I think that's been pretty much answered: All the low-level ones, so that protocol and library programmers can write code that works on systems with undefined filename encodings. But: casual users still need to do the normal things with file names and paths, and ideally those should work the same way on all systems. I think the way to do this is to abstract the path concept, like pathlib does. Back in the day, paths were "just strings", and that worked OK with py2 strings, because you could put arbitrary bytes in them. But the "py2 strings were perfect" folks seem to not acknowledge that while they are nice for matching the posix filename model, they were a pain in the neck when you needed to do something else like write them into a JSON file or something. From my personal experience, non-ascii filenames are much easier to deal with if I use unicode for filenames everywhere (py2). Somehow, I have yet to be bitten by mixed encoding in filenames. So will using surrogate-escape error handling with pathlib make all this just work? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker <chris.barker@noaa.gov> wrote:
Back in the day, paths were "just strings", and that worked OK with py2 strings, because you could put arbitrary bytes in them. But the "py2 strings were perfect" folks seem to not acknowledge that while they are nice for matching the posix filename model, they were a pain in the neck when you needed to do somethign else like write them in to a JSON file or something.
This is the core of the problem. Python2 favors the Unix model but Windows people pay the price. Python3 reverses that, and I'm still thinking about whether I want to pay the new price.
So will using a surrogate-escape error handling with pathlib make all this just work?
I'm involved in developing and maintaining a few big commercial projects that are unlikely ever to be ported to Python3. So I'm stuck with Python2 for many years and I haven't tried Python3. Maybe I should try a small personal project, but certainly not this year. Maybe the next one... Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On Sat, 23 Aug 2014 00:21:18 +0200, Oleg Broytman <phd@phdru.name> wrote:
I'm involved in developing and maintaining a few big commercial projects that are unlikely ever to be ported to Python3. So I'm stuck with Python2 for many years and I haven't tried Python3. Maybe I should try a small personal project, but certainly not this year. Maybe the next one...
Yes, you should try it. Really, it's not the monster you are constructing in your mind. The functions that read filenames and return them as text use surrogate escape to preserve the bytes, and the functions that accept filenames use surrogate escape to recover those bytes before passing them back to the OS. So posix binary filenames just work, as long as the only thing you depend on is being able to split and join them on the / character (and possibly the . character) and otherwise treat the names as black boxes...which is exactly the same situation you are in in python2. If you need to read filenames out of a file, you'll need to specify the surrogate escape error handler so that the bytes will be there to be recovered when you pass them to the file system functions, but it will work. Or, as discussed, you can treat them as binary and use the os level functions that accept binary input (which are exactly the ones you are used to using in python2). This includes os.path.split and os.path.join, which as noted are the only things you can depend on working correctly when you don't know the encoding of the filenames. So, the way to look at this is that python3 is no worse[1] than python2 for handling posix binary filenames, and also provides additional features if you *do* know the correct encoding of the filenames. --David [1] modulo any remaining API bugs, which is exactly where this thread started: trying to figure out which APIs need to be able to handle binary paths and/or surrogate escaped paths so that posix filenames consistently work as well in python3 as they did in python2).
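To make that round trip concrete, a small illustration (it assumes a POSIX system; os.fsencode()/os.fsdecode() apply the filesystem encoding with surrogateescape, which is what the str-based os functions do internally):

    import os

    name_bytes = b"report-\xe9.txt"             # a latin-1 byte, not valid UTF-8

    name_str = os.fsdecode(name_bytes)          # e.g. 'report-\udce9.txt'
    assert os.fsencode(name_str) == name_bytes  # the original bytes come back

    # Filenames read from a text file survive the same way if the file is
    # opened with errors="surrogateescape":
    # with open("playlist.m3u", encoding="utf-8", errors="surrogateescape") as f:
    #     for line in f:
    #         os.stat(line.rstrip("\n"))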
Oleg Broytman writes:
This is the core of the problem. Python2 favors the Unix model but Windows people pay the price. Python3 reverses that
This is certainly not true. What is true is that Python 3 makes no attempt to make it easy to write crappy software in the old Unix style that breaks when unexpected character encodings are encountered. Python 3 is designed to make it easier to write reliable software, even if it will only ever be used on one platform. Nevertheless, it's still a reasonable language for writing byte-shoveling software, with the last piece in place as of the acceptance of PEP 461. As of that PEP, you can use regexps for tokenizing byte streams and %-formatting to conveniently produce them. If you want to treat them piecewise as character streams with different encodings, you have a large library of codecs, which provide an incremental decoder interface. While AFAIK no codec implements a decode-until-error mode, that's not all that much of a loss, as many encodings overlap. E.g., if you start decoding using a latin-1 codec, decoding the whole document will succeed, even if it switches to windows-1251 in the meantime. Oleg, I gather Russian is your native language. That's moderately complicated, I admit. But the Russians are a distant second to the Japanese in self-destructive proliferation of incompatible character coding standards and non-standard variants. After 24 years of dealing with the mess that is East Asian encodings (which is even bound up with the "religion" of Japanese exceptionalism -- some Japanese have argued that there is a spiritual superiority to Japanese JIS codes!), I cannot believe you are going to find a better environment for dealing with these issues than Python 3.
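A sketch of the byte-shoveling toolkit being described (bytes %-formatting needs Python 3.5+, where PEP 461 landed; bytes regexps and incremental decoders have been available throughout Python 3):

    import codecs
    import re

    # PEP 461: build protocol messages without ever decoding them (3.5+).
    request = b"GET %s HTTP/1.1\r\nHost: %s\r\n\r\n" % (b"/index.html", b"example.com")
    assert request.startswith(b"GET /index.html")

    # Tokenize a byte stream with a bytes pattern, no encoding guess needed.
    m = re.match(rb"HTTP/1\.(\d) (\d{3}) (.*)\r\n", b"HTTP/1.1 200 OK\r\n")
    assert m.group(2) == b"200"

    # Piecewise decoding of embedded text via an incremental decoder.
    dec = codecs.getincrementaldecoder("utf-8")("surrogateescape")
    chunks = [b"caf", b"\xc3", b"\xa9"]        # multi-byte char split across chunks
    text = "".join(dec.decode(c) for c in chunks) + dec.decode(b"", final=True)
    assert text == "café"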
On Sat, Aug 23, 2014 at 07:14:47PM +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
I cannot believe you are going to find a better environment for dealing with these issues than Python 3.
Well, that's may be. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
Chris Barker writes:
So I write bytes that are encoded one way into a text file that's encoded another way, and expect to be abel to read that later?
No, not you. Crap software does that. Your MUD server. Oleg's favorite web pages with ads, or more likely the ad servers.
Not for me (or many other users) -- terminals are sometimes set with ascii-only encoding,
So? That means you can't handle text files in general, only those restricted to ASCII. That's a completely different issue.
Python3 supports this case very well. But it does indeed make it hard to work with filenames when you don't know the encoding they are in.
No, it doesn't. Reasonably handling "text streams" in unknown, possibly multiple, encodings is just hard. Python 3 has nothing to do with it, and Oleg should know that very well. It's true that code written in Python 2 to handle these issues needs to be ported to Python 3. Thing is, Oleg says "another tool" -- any non-Python-2 tool will need porting of his code too.
And apparently that's pretty common -- or common enough that it would be nice for Python to support it well. This trick is how -- we'd like the "just pass it around and do path manipulations" case to work with (almost) arbitrary bytes,
It does. That's what os.path is for.
but everything else to work naturally with text (unicode text).
No gloss, please. It's text, period. The internal Unicode encoding is *not exposed*, with a few (important) exceptions such as Han unification.
I think the way to do this is to abstract the path concept, like pathlib does.
You forgot to append the word "well".<wink/>
From my personal experience, non-ascii filenames are much easier to deal with if I use unicode for filenames everywhere (py2). Somehow, I have yet to be bitten by mixed encoding in filenames.
.gov domain? ASCII-only terminal settings? It's not "somehow", it's that you live a sheltered life.<wink/>
So will using a surrogate-escape error handling with pathlib make all this just work?
Not answerable until you define "all this" more precisely. And that's the big problem with Oleg's complaint, too. It's not at all clear what he wants, except that all of his current code should continue to work in Python 3. Just like all of us. The question then is persuading him that it's worth moving to Python 3 despite the effort of porting Python-2-specific code. Maybe he can be persuaded, maybe not. Python 2 is a better than average language.
On Sat, Aug 23, 2014 at 7:02 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Chris Barker writes:
So I write bytes that are encoded one way into a text file that's encoded another way, and expect to be abel to read that later?
No, not you. Crap software does that. Your MUD server. Oleg's favorite web pages with ads, or more likely the ad servers.
Just to clarify: Presumably you're referring to my previous post regarding my MUD client's heuristic handling of broken encodings. It's "my server" in the sense of the one that I'm connecting to, and not in the sense that I control it. I do also run a MUD server, and it guarantees that everything it sends is UTF-8. (Incidentally, that server has the exact same set of heuristics for coping with broken encodings from other clients. There's no escaping it.) Your point is absolutely right: mess like that is to cope with the fact that there's broken stuff out there. ChrisA
On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
And that's the big problem with Oleg's complaint, too. It's not at all clear what he wants
The first thing is I want to understand why people continue to refer to the Unix way as "broken". Better yet, to persuade them it's not. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On 23 August 2014 16:15, Oleg Broytman <phd@phdru.name> wrote:
On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
And that's the big problem with Oleg's complaint, too. It's not at all clear what he wants
The first thing is I want to understand why people continue to refer to the Unix way as "broken". Better yet, to persuade them it's not.
Generally, it seems to be mostly a reaction to the repeated claims that Python, or Windows, or whatever, is "broken". Unix advocates (not yourself) are prone to declaring anything *other* than the Unix model as "broken", so it's tempting to give them a taste of their own medicine. Sorry for that (to the extent that I was one of the people doing so). Rhetoric aside, none of Unix, Windows or Python are "broken". They just react in different ways to fundamentally difficult edge cases. But expecting Python (a cross-platform language) to prefer the Unix model is putting all the pain on non-Unix users of Python, which I don't feel is reasonable. Let's all compromise a little. Paul PS The key thing *I* think is a problem with the Unix behaviour is that it treats filenames as bytes rather than Unicode. People name files using *characters*. So every filename is semantically text, in the mind of the person who created it. Unix enforces a transformation to bytes, but does not retain the encoding of those bytes. So information about the original author's intent is lost. But that's a historical fact, baked into Unix at a low level. Whether that's "broken" or just "something to deal with" is not important to me.
Hi! On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore <p.f.moore@gmail.com> wrote:
On 23 August 2014 16:15, Oleg Broytman <phd@phdru.name> wrote:
On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
And that's the big problem with Oleg's complaint, too. It's not at all clear what he wants
The first thing is I want to understand why people continue to refer to the Unix way as "broken". Better yet, to persuade them it's not.
Generally, it seems to be mostly a reaction to the repeated claims that Python, or Windows, or whatever, is "broken".
Ah, if that's the only problem I certainly can live with that. My problem is that it *seems* this anti-Unix attitude infiltrates Python core development. I very much hope I'm wrong and it really isn't.
Unix advocates (not yourself) are prone to declaring anything *other* than the Unix model as "broken", so it's tempting to give them a taste of their own medicine. Sorry for that (to the extent that I was one of the people doing so).
You didn't see me in my younger years. I surely was one of those Windows bashers. Please accept my apology.
Rhetoric aside, none of Unix, Windows or Python are "broken". They just react in different ways to fundamentally difficult edge cases.
But expecting Python (a cross-platform language) to prefer the Unix model is putting all the pain on non-Unix users of Python, which I don't feel is reasonable. Let's all compromise a little.
Paul
PS The key thing *I* think is a problem with the Unix behaviour is that it treats filenames as bytes rather than Unicode. People name files using *characters*. So every filename is semantically text, in the mind of the person who created it. Unix enforces a transformation to bytes, but does not retain the encoding of those bytes. So information about the original author's intent is lost. But that's a historical fact, baked into Unix at a low level. Whether that's "broken" or just "something to deal with" is not important to me.
The problem is hardly specific to Unix. Despite Joel Spolsky's "There Ain't No Such Thing As Plain Text" people create text files all the time. Without specifying an encoding. And put filenames into those text files (audio playlists, like .m3u and .pls are just text files with pathnames). Unix takes the idea that everything is text and a stream of bytes to its extreme. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On 23 August 2014 19:37, Oleg Broytman <phd@phdru.name> wrote:
Unix takes the idea that everything is text and a stream of bytes to its extreme.
I don't really understand the idea of "text and a stream of bytes". The two are fundamentally different in my view. But I guess that's why we have to agree to differ - our perspectives are just very different. Paul
On 24 August 2014 04:37, Oleg Broytman <phd@phdru.name> wrote:
On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore <p.f.moore@gmail.com> wrote:
Generally, it seems to be mostly a reaction to the repeated claims that Python, or Windows, or whatever, is "broken".
Ah, if that's the only problem I certainly can live with that. My problem is that it *seems* this anti-Unix attitude infiltrates Python core development. I very much hope I'm wrong and it really isn't.
The POSIX locale based approach to handling encodings is genuinely broken - it's almost as broken as code pages are on Windows. The fundamental flaw is that locales encourage *bilingual* computing: handling English plus one other language correctly. Given a global internet, bilingual computing *is a fundamentally broken approach*. We need multilingual computing (any human language, all the time), and that means Unicode. As some examples of where bilingual computing breaks down: * My NFS client and server may have different locale settings * My FTP client and server may have different locale settings * My SSH client and server may have different locale settings * I save a file locally and send it to someone with a different locale setting * I attempt to access a Windows share from a Linux client (or vice-versa) * I clone my POSIX hosted git or Mercurial repository on a Windows client * I have to connect my Linux client to a Windows Active Directory domain (or vice-versa) * I have to interoperate between native code and JVM code The entire computing industry is currently struggling with this monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale encoding/code pages) -> multilingual (Unicode) transition. It's been going on for decades, and it's still going to be quite some time before we're done. The POSIX world is slowly clawing its way towards a multilingual model that actually works: UTF-8 Windows (including the CLR) and the JVM adopted a different multilingual model, but still one that actually works: UTF-16-LE POSIX is hampered by legacy ASCII defaults in various subsystems (most notably the default locale) and the assumption that system metadata is "just bytes" (an assumption that breaks down as soon as you have to hand that metadata over to another machine that may have different locale settings) Windows is hampered by the fact they kept the old 8-bit APIs around for backwards compatibility purposes, so applications using those APIs are still only bilingual (at best) rather than multilingual. JVM and CLR applications will at least handle the Basic Multilingual Plane (UCS-2) correctly, but may not correctly handle code points beyond the 16-bit boundary (this is the "Python narrow builds don't handle Unicode correctly" problem that was resolved for Python 3.3+ by PEP 393) Individual users (including some organisations) may have the luxury of saying "well, all my clients and all my servers are POSIX, so I don't care about interoperability with other platforms". As the providers of a cross-platform runtime environment, we don't have that luxury - we need to figure out how to get *all* the major platforms playing nice with each other, regardless of whether they chose UTF-8 or UTF-16-LE as the basis for their approach towards providing multilingual computing environments. Historically, that question of cross platform interoperability for open source software has been handled in a few different ways: * Don't really interoperate with anybody, reinvent all the wheels (the JVM way) * Emulate POSIX on Windows (the Cygwin/MinGW way) * Let the application developer figure it out (the Python 2 way) The first approach is inordinately expensive - it took the resources of Sun in its heyday to make it possible, and it effectively locks the JVM out of certain kinds of computing (e.g. it's hard to do array oriented programming in JVM languages, because the CPU and GPU vectorisation features aren't readily accessible). 
The second approach prevents the creation of truly native Windows applications, which makes it uncompelling as a way of attracting Windows users - it sends a clear signal that the project doesn't *really* care about supporting Windows as a platform, but instead only grudgingly accepts that there are Windows users out there that might like to use their software. The third approach is the one we tried for a long time with Python 2, and essentially found to be an "experts only" solution. Yes, you can *make* it work, but the runtime isn't set up so it works *by default*. The Unicode changes in Python 3 are a result of the Python core development team saying "it really shouldn't be this hard for application developers to get cross-platform interoperability between correctly configured systems when dealing solely with correctly encoded data and metadata". The idea of Python 3 is that applications should require additional complexity solely to deal with *incorrectly* configured systems and improperly encoded data and metadata (and, ideally, the detection of the need for such handling should be "Python 3 threw an exception" rather than "something further down the line detected corrupted data"). This is software rather than magic, though - these improvements only happen through people actually knuckling down and solving the related problems. When folks complain about Python 3's operating system interface handling causing problems in some situations? They're almost always referring to areas where we're still relying on the locale system on POSIX or the code page system on Windows. Both of those approaches are irredeemably broken - the answer is to stop relying on them, but appropriately updating the affected subsystems generally isn't a trivial task. A lot of the affected code runs before the interpreter is fully initialised, which makes it really hard to test, and a lot of it is incredibly convoluted due to various configuration options and platform specific details, which makes it incredibly hard to modify without breaking anything. One of those areas is the fact that we still use the old 8-bit APIs to interact with the Windows console. Those are just as broken in a multilingual world as the other Windows 8-bit APIs, so Drekin came up with a project to expose the Windows console as a UTF-16-LE stream that uses the 16-bit APIs instead: https://pypi.python.org/pypi/win_unicode_console I personally hope we'll be able to get the issues Drekin references there resolved for Python 3.5 - if other folks hope for the same thing, then one of the best ways to help that happen is to try out the win_unicode_console module and provide feedback on what does and doesn't work. Another was getting exceptions attempting to write OS data to sys.stdout when the locale settings had been scrubbed from the environment. For Python 3.5, we better tolerate that situation by setting "errors=surrogateescape" on sys.stdout when the environment claims "ascii" as a suitable encoding for talking to the operating system (this is our way of saying "we don't actually believe you, but also don't have the data we need to overrule you completely"). 
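For anyone on an earlier 3.x hitting the sys.stdout problem described here, a hypothetical workaround along the same lines (rewrap stdout so surrogate-escaped OS data can be printed instead of raising UnicodeEncodeError):

    import io
    import sys

    if sys.stdout.encoding and sys.stdout.encoding.lower() in ("ascii", "ansi_x3.4-1968"):
        sys.stdout = io.TextIOWrapper(
            sys.stdout.buffer,
            encoding=sys.stdout.encoding,
            errors="surrogateescape",   # pass escaped filename bytes straight through
            line_buffering=True,
        )

This is only a sketch of the idea; the 3.5 behaviour described above is applied when the interpreter sets up sys.stdout, not in user code.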
While I was going to wait for more feedback from Fedora folks before pushing the idea again, this thread also makes me think it would be worth our while to add more tools for dealing with surrogate escapes and latin-1 binary data smuggling just to help make those techniques more discoverable and accessible: http://bugs.python.org/issue18814#msg225791 These various discussions are also giving me plenty of motivation to get back to working on PEP 432 (the rewrite of the interpreter startup sequence) for Python 3.5. A lot of these things are just plain hard to change because of the complexity of the current startup code. Redesigning that to use a cleaner, multiphase startup sequence that gets the core interpreter running *before* configuring the operating system integration should give us several more options when it comes to dealing with some of these challenges. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
I declare this thread irreparably broken. Do not make any decisions in this thread. Tell me (in another thread) when it's time to decide and I will.
-- --Guido van Rossum (python.org/~guido)
Hi! Thank you very much, Nick, for the long and detailed explanation!
Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan <ncoghlan@gmail.com> wrote:
As some examples of where bilingual computing breaks down:
* My NFS client and server may have different locale settings * My FTP client and server may have different locale settings * My SSH client and server may have different locale settings * I save a file locally and send it to someone with a different locale setting * I attempt to access a Windows share from a Linux client (or vice-versa) * I clone my POSIX hosted git or Mercurial repository on a Windows client * I have to connect my Linux client to a Windows Active Directory domain (or vice-versa) * I have to interoperate between native code and JVM code
The entire computing industry is currently struggling with this monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale encoding/code pages) -> multilingual (Unicode) transition. It's been going on for decades, and it's still going to be quite some time before we're done.
The POSIX world is slowly clawing its way towards a multilingual model that actually works: UTF-8 Windows (including the CLR) and the JVM adopted a different multilingual model, but still one that actually works: UTF-16-LE
This kind of puts the "length" of the python2->python3 transition period in perspective, doesn't it? --David
On 8/26/2014 9:11 AM, R. David Murray wrote:
On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan <ncoghlan@gmail.com> wrote:
As some examples of where bilingual computing breaks down:
* My NFS client and server may have different locale settings * My FTP client and server may have different locale settings * My SSH client and server may have different locale settings * I save a file locally and send it to someone with a different locale setting * I attempt to access a Windows share from a Linux client (or vice-versa) * I clone my POSIX hosted git or Mercurial repository on a Windows client * I have to connect my Linux client to a Windows Active Directory domain (or vice-versa) * I have to interoperate between native code and JVM code
The entire computing industry is currently struggling with this monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale encoding/code pages) -> multilingual (Unicode) transition. It's been going on for decades, and it's still going to be quite some time before we're done.
The POSIX world is slowly clawing its way towards a multilingual model that actually works: UTF-8 Windows (including the CLR) and the JVM adopted a different multilingual model, but still one that actually works: UTF-16-LE
Nick, I think the first half of your post is one of the clearest expositions yet of 'why Python 3' (in particular, the str to unicode change). It is worthy of wider distribution and without much change, it would be a great blog post.
This kind of puts the "length" of the python2->python3 transition period in perspective, doesn't it?
-- Terry Jan Reedy
On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy@udel.edu> wrote:
On 8/26/2014 9:11 AM, R. David Murray wrote:
On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan <ncoghlan@gmail.com> wrote:
As some examples of where bilingual computing breaks down:
* My NFS client and server may have different locale settings * My FTP client and server may have different locale settings * My SSH client and server may have different locale settings * I save a file locally and send it to someone with a different locale
setting
* I attempt to access a Windows share from a Linux client (or vice-versa) * I clone my POSIX hosted git or Mercurial repository on a Windows client * I have to connect my Linux client to a Windows Active Directory domain (or vice-versa) * I have to interoperate between native code and JVM code
The entire computing industry is currently struggling with this monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale encoding/code pages) -> multilingual (Unicode) transition. It's been going on for decades, and it's still going to be quite some time before we're done.
The POSIX world is slowly clawing its way towards a multilingual model that actually works: UTF-8 Windows (including the CLR) and the JVM adopted a different multilingual model, but still one that actually works: UTF-16-LE
Nick, I think the first half of your post is one of the clearest expositions yet of 'why Python 3' (in particular, the str to unicode change). It is worthy of wider distribution and without much change, it would be a great blog post.
Indeed, I had the same idea - I had been assuming users already understood this context, which is almost certainly an invalid assumption. The blog post version is already mostly written, but I ran out of weekend. Will hopefully finish it up and post it some time in the next few days :)
This kind of puts the "length" of the python2->python3 transition period in perspective, doesn't it?
I realised in writing the post that ASCII is over 50 years old at this point, while Unicode as an official standard is more than 20. By the time this is done, we'll likely be talking 30+ years for Unicode to displace the confusing mess that is code pages and locale encodings :) Cheers, Nick.
-- Terry Jan Reedy
Nick Coghlan <ncoghlan@gmail.com> writes:
As some examples of where bilingual computing breaks down:
* My NFS client and server may have different locale settings * My FTP client and server may have different locale settings * My SSH client and server may have different locale settings * I save a file locally and send it to someone with a different locale setting * I attempt to access a Windows share from a Linux client (or vice-versa) * I clone my POSIX hosted git or Mercurial repository on a Windows client * I have to connect my Linux client to a Windows Active Directory domain (or vice-versa) * I have to interoperate between native code and JVM code
The entire computing industry is currently struggling with this monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale encoding/code pages) -> multilingual (Unicode) transition. It's been going on for decades, and it's still going to be quite some time before we're done.
The POSIX world is slowly clawing its way towards a multilingual model that actually works: UTF-8. Windows (including the CLR) and the JVM adopted a different multilingual model, but still one that actually works: UTF-16-LE.
Nick, I think the first half of your post is one of the clearest expositions yet of 'why Python 3' (in particular, the str to unicode change). It is worthy of wider distribution and without much change, it would be a great blog post.
Indeed, I had the same idea - I had been assuming users already understood this context, which is almost certainly an invalid assumption.
The blog post version is already mostly written, but I ran out of weekend. Will hopefully finish it up and post it some time in the next few days :)
In that case, maybe it'd be nice to also explain why you use the term "bilingual" for codepage based encoding. At least to me, a codepage/locale is pretty monolingual, or alternatively covering a whole region (e.g. western europe). I figure with bilingual you mean ascii + something, but that's mostly a guess from my side. Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«
Nikolaus Rath writes:
In that case, maybe it'd be nice to also explain why you use the term "bilingual" for codepage based encoding.
Modern computing systems are written in languages which are invariably based on syntax expressed using ASCII, and provide by default functionality for expressing dates etc suitable for rendering American English. Thus ASCII (ie, American English) is always an available language. Code pages provide facilities for rendering one or more languages sharing a common coded character set, but are unsuitable for rendering most of the rest of the world's dozens of language groups (grouping languages by common character set). Multilingual has come to mean "able to express (almost) any set of languages in a single text" (see, for example, Emacs's "HELLO" file), not just "more than two". So code pages are closer in spirit to "bilingual" (two of many) than to "multilingual" (all of many). It's messy, analogical terminology. But then, natural language is messy and analogical.<wink/>
On 27 August 2014 08:52, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy@udel.edu> wrote:
Nick, I think the first half of your post is one of the clearest expositions yet of 'why Python 3' (in particular, the str to unicode change). It is worthy of wider distribution and without much change, it would be a great blog post.
Indeed, I had the same idea - I had been assuming users already understood this context, which is almost certainly an invalid assumption.
The blog post version is already mostly written, but I ran out of weekend. Will hopefully finish it up and post it some time in the next few days :)
Aaand, it's up: http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 8/27/2014 5:16 AM, Nick Coghlan wrote:
On 27 August 2014 08:52, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy@udel.edu> wrote:
Nick, I think the first half of your post is one of the clearest expositions yet of 'why Python 3' (in particular, the str to unicode change). It is worthy of wider distribution and without much change, it would be a great blog post.
Indeed, I had the same idea - I had been assuming users already understood this context, which is almost certainly an invalid assumption.
The blog post version is already mostly written, but I ran out of weekend. Will hopefully finish it up and post it some time in the next few days :)
Aaand, it's up: http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html
Cheers, Nick.
Indeed, I also enjoyed and found enlightening your response to this issue, including the broader historical context. I remember when Unicode was first published back in 1991, and it sounded interesting, but far removed from the reality of implementations of the day. I was intrigued by UTF-8 at the time, and even wrote an encoder and decoder for it for a software package that eventually never reached any real customers. Your blog post says:
Choosing UTF-8 aims to treat formatting text for communication with the user as "just a display issue". It's a low impact design that will "just work" for a lot of software, but it comes at a price:
* because encoding consistency checks are mostly avoided, data in different encodings may be freely concatenated and passed on to other applications. Such data is typically not usable by the receiving application.
I don't believe this is a necessary result of using UTF-8. It is a possible result, and I guess some implementations are using it this way, but a proper language could still provide and/or require proper usage of UTF-8 data through its type system just as Python3 is doing with PEP 393.
In fact, if it were not for the requirement to support passing character strings in other formats (UTF-16, UTF-32) to historical APIs (in CPython add-on packages) and the resulting practical performance considerations of converting to/from UTF-8 repeatedly when calling those APIs, Python3 could have evolved to using UTF-8 as its underlying data format, and obtained equal encoding consistency as it has today.
Of course, nothing can be "required" if the user chooses to continue operating in the encoded domain, and manipulate data using the necessary byte-oriented features of whatever language is in use.
One of the choices of Python3 was to retain character indexing as an underlying arithmetic implementation, citing algorithmic speed, but that is a seldom needed operation, and of limited general applicability when considering grapheme clusters. An iterator based approach can solve both problems, but would have been best introduced as part of Python3.0, although it may have made 2to3 harder, and may have made it less practical to implement six and other "run on both Py2 and Py3" type solutions, without introducing those same iterative solutions into Python 2.6 or 2.7.
Such solutions could still be implemented as options. Even PEP 393 grudgingly supports some use of UTF-8 when requested by the user, as I understand it. Whether such an implementation would be better based on bytes or str is uncertain without further analysis, although type checking would probably be easier if based on str. A high-performance implementation would likely need to be implemented at least partly in C rather than CPython, although it could be prototyped in Python for proof of functionality. The iterators could obviously be implemented to work on top of solutions such as PEP 393, by simply using indexing underneath when fixed-width characters are available, and other techniques when UTF-8 is the only available format (rather than converting from UTF-8 to fixed-width characters because of calling the iterator).
On 28 Aug 2014 04:20, "Glenn Linderman" <v+python@g.nevcal.com> wrote:
On 8/27/2014 5:16 AM, Nick Coghlan wrote:
On 27 August 2014 08:52, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy@udel.edu> wrote:
Nick, I think the first half of your post is one of the clearest expositions yet of 'why Python 3' (in particular, the str to unicode change). It is worthy of wider distribution and without much change, it would be a great blog post.
Indeed, I had the same idea - I had been assuming users already understood this context, which is almost certainly an invalid assumption.
The blog post version is already mostly written, but I ran out of weekend. Will hopefully finish it up and post it some time in the next few days :)
Aaand, it's up:
http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html
Cheers, Nick.
Indeed, I also enjoyed and found enlightening your response to this issue, including the broader historical context. I remember when Unicode was first published back in 1991, and it sounded interesting, but far removed from the reality of implementations of the day. I was intrigued by UTF-8 at the time, and even wrote an encoder and decoder for it for a software package that eventually never reached any real customers.
Your blog post says:
Choosing UTF-8 aims to treat formatting text for communication with the
user as "just a display issue". It's a low impact design that will "just work" for a lot of software, but it comes at a price:
because encoding consistency checks are mostly avoided, data in
different encodings may be freely concatenated and passed on to other applications. Such data is typically not usable by the receiving application.
I don't believe this is a necessary result of using UTF-8. It is a possible result, and I guess some implementations are using it this way, but a proper language could still provide and/or require proper usage of UTF-8 data through its type system just as Python3 is doing with PEP 393.
Yes, Go works that way, for example. I doubt it actually checks for valid UTF-8 at OS boundaries though - that would be a potentially expensive check, and as a network service centric language, Go can afford to place more constraints on the operating environment than we can.
In fact, if it were not for the requirement to support passing character strings in other formats (UTF-16, UTF-32) to historical APIs (in CPython add-on packages) and the resulting practical performance considerations of converting to/from UTF-8 repeatedly when calling those APIs, Python3 could have evolved to using UTF-8 as its underlying data format, and obtained equal encoding consistency as it has today.
We already have string processing algorithms that work for fixed width encodings (and are known not to work for variable width encodings, hence the bugs in Unicode handling on the old narrow builds). It isn't that variable width encodings aren't a viable choice for programming language text modelling, it's that the assumption of a fixed width model is more deeply entrenched in CPython (and especially the C API) than the exact number of bits used per code point.
Of course, nothing can be "required" if the user chooses to continue operating in the encoded domain, and manipulate data using the necessary byte-oriented features of whatever language is in use.
One of the choices of Python3, was to retain character indexing as an underlying arithmetic implementation citing algorithmic speed, but that is a seldom needed operation, and of limited general applicability when considering grapheme clusters.
The choice that was made was to say no to the question "Do we rewrite a Unicode type that we already know works from scratch?". The decisions about how to handle *text* were made way back before the PEP process even existed, and later captured as PEP 100. What changed in Python 3 was dropping the hybrid 8-bit str type with its locale dependent behaviour, and parcelling its responsibilities out to either the existing unicode type (renamed as str, as it was the default choice), or the new locale independent bytes type.
An iterator based approach can solve both problems, but would have been best introduced as part of Python3.0, although it may have made 2to3 harder, and may have made it less practical to implement six and other "run on both Py2 and Py3" type solutions harder, without introducing those same iterative solutions into Python 2.6 or 2.7.
The option of fundamentally changing the text handling design was never on the table. The Python 2 unicode type works fine, it is the Python 2 str type that needed changing.
Such solutions could still be implemented as options. Even PEP 393 grudgingly supports some use of UTF-8 when requested by the user, as I understand it.
Not quite. PEP 393 heavily favours and optimises UTF-8, trading memory for speed by implicitly caching the UTF-8 representation - the support isn't begrudged, it's enthusiastic. We just don't use it for the text processing algorithms, because those assume a fixed width encoding.
Whether such an implementation would be better based on bytes or str is uncertain without further analysis, although type checking would probably be easier if based on str. A high-performance implementation would likely need to be implemented at least partly in C rather than CPython, although it could be prototyped in Python for proof of functionality. The iterators could obviously be implemented to work based on top of solutions such as PEP 393, by simply using indexing underneath, when fixed-width characters are available, and other techniques when UTF-8 is the only available format (rather than converting from UTF-8 to fixed-width characters because of calling the iterator).
For the cost of rewriting every single string manipulation algorithm in CPython to avoid relying on C array access, the only thing you would save over PEP 393 is a bit of memory - we already store the UTF-8 representation when appropriate. There's simply not a sufficient payoff to justify the cost. Cheers, Nick.
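As a side note on the PEP 393 behaviour described above, the flexible storage is easy to observe from Python itself. A minimal sketch, assuming a CPython 3.3+ build; the absolute sizes reported will vary by version and platform, only the per-code-point scaling is the point:

    import sys

    # PEP 393 picks the narrowest fixed-width storage able to hold the
    # widest code point in the string: 1, 2 or 4 bytes per code point.
    ascii_only = 'a' * 100            # fits in the latin-1 range
    bmp_text = '\u0394' * 100         # needs 2 bytes per code point
    astral = '\U0001F600' * 100       # needs 4 bytes per code point

    for s in (ascii_only, bmp_text, astral):
        print(len(s), sys.getsizeof(s))

    # The UTF-8 form is computed (and cached) only when something asks
    # for it, e.g. via s.encode('utf-8') or the C-level APIs.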
Glenn Linderman writes:
On 8/27/2014 5:16 AM, Nick Coghlan wrote:
Choosing UTF-8 aims to treat formatting text for communication with the user as "just a display issue". It's a low impact design that will "just work" for a lot of software, but it comes at a price:
* because encoding consistency checks are mostly avoided, data in different encodings may be freely concatenated and passed on to other applications. Such data is typically not usable by the receiving application.
I don't believe this is a necessary result of using UTF-8.
No, it's not, but if you're going to do the same kind of checks that are necessary for transcoding UTF-8 to abstract Unicode, there's no benefit to using UTF-8 internally, and you lose a lot. The only operations that you can do efficiently are concatenation and iteration. I've worked with a UTF-8-like internal encoding for 20 years now -- it's a huge cost.
Python3 could have evolved to using UTF-8 as its underlying data format, and obtained equal encoding consistency as it has today.
Thank heaven it didn't!
One of the choices of Python3, was to retain character indexing as an underlying arithmetic implementation citing algorithmic speed, but that is a seldom needed operation,
That simply isn't true. The negative effects of algorithmic slowness in Emacsen are visible both as annoying user delays, and as excessive developer concentration on optimizing a fundamentally insufficient data structure.
and of limited general applicability when considering grapheme clusters. An iterator based approach can solve both problems,
On the contrary, grapheme clusters are the relatively rare use case in textual computing, at least currently, that can be optimized for when necessary. There's no problem with creating iterators from arrays, but making an iterator behave like an array ... well, that involves creating the array.
Such solutions could still be implemented as options.
Sure, but the problems to be solved in that implementation are not due to Python 3's internal representation. A lot of painstaking (and possibly hard?) work remains to be done.
A high-performance implementation would likely need to be implemented at least partly in C rather than CPython,
That's how Emacs did it, and (a) over the decades it has involved an inordinate amount of effort compared to rewriting the text-handling functions for an array, (b) is fragile, and (c) performance sucks in practice. Unicode, not UTF-8, is the central component of the solution. The various UTFs are application-specific implementations of Unicode. UTF-8 is an excellent solution for text streams, such as disk files and network communication. Fixed-width representations (ISO-8859-1, UCS-2, UTF-32, PEP-393) are useful for applications of large buffers that need O(1) "random" access, and can trivially be iterated for stream applications. Steve
On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker wrote:
The point is that if you are reading a file name from the system, and then passing it back to the system, then you can treat it as just bytes -- who cares? And if you add the byte value of 47 thing, then you can even do basic path manipulations. But once you want to do other things with your file name, then you need to know the encoding. And it is very, very common for users to need to do other things with filenames, and they almost always want them as text that they can read and understand.
Python3 supports this case very well. But it does indeed make it hard to work with filenames when you don't know the encoding they are in.
Just "not knowing" is not sufficient. In that case, you'll likely get a Unicode string containing moji-bake:
# I write a file name using UTF-8 on my system:
filename = 'music by Наӥв.txt'.encode('utf-8')
# You try to use it assuming ISO-8859-7 (Greek)
filename.decode('iso-8859-7')
=> 'music by Π\x9dΠ°Σ₯Π².txt'
which, even though it looks wrong, still lets you refer to the file (provided you then encode back to bytes with ISO-8859-7 again). This won't always be the case, sometimes the encoding you guess will be wrong.
When I started this email, I originally began to say that the actual problem was with byte file names that cannot be decoded into Unicode using the system encoding (typically UTF-8 on Linux systems). But I've actually had difficulty demonstrating that it actually is a problem. I started with a byte sequence which is invalid UTF-8, namely:
b'ZZ\xdb\xdf\xfa\xff'
created a file with that name, and then tried listing it with os.listdir. Even in Python 3.1 it worked fine. I was able to list the directory and open the file, so I'm not entirely sure where the problem lies exactly. Can somebody demonstrate the failure mode?
-- Steven
On Sat, 23 Aug 2014 21:08:29 +1000, Steven D'Aprano <steve@pearwood.info> wrote:
When I started this email, I originally began to say that the actual problem was with byte file names that cannot be decoded into Unicode using the system encoding (typically UTF-8 on Linux systems). But I've actually had difficulty demonstrating that it actually is a problem. I started with a byte sequence which is invalid UTF-8, namely:
b'ZZ\xdb\xdf\xfa\xff'
created a file with that name, and then tried listing it with os.listdir. Even in Python 3.1 it worked fine. I was able to list the directory and open the file, so I'm not entirely sure where the problem lies exactly. Can somebody demonstrate the failure mode?
The "failure" happens only when you try to cross from the domain of posix binary filenames into the domain of text streams (that is, a stream with a consistent encoding). If you stick with os interfaces that handle filenames, Python3 handles posix bytes filenames just fine (though there may be a few corner-case rough edges yet to be fixed, and the standard streams were one of them).
The difficulty comes if you try to use a filename that contains undecodable bytes in a non-os-interface text context (such as writing it to a text file that you have declared to be utf-8 encoded): there you will get an error...not completely unlike the old "your code works until your user uses unicode" problem we had in python2, but in this case only happening in a very narrow set of circumstances involving trying to translate between one domain (posix binary filenames) and another domain (io streams with a consistent declared encoding). This is not a common operation, but appears to be the one Oleg is concerned about. The old unicode-blowup errors would happen almost any time someone with a non-ascii language tried to use a program written by an ascii-only programmer (which was most of us).
The same problem existed in python2 if your goal was to produce a stream with a consistent encoding, but now python3 treats that as an error. If you really want a stream with an inconsistent encoding, open it as binary and use the surrogate escape error handler to recover the bytes in the filenames. That is, *be explicit* about your intentions.
So yes, we've shifted a burden from those who want non-ascii text to work consistently to those who wanted inconsistently encoded text to "just work" (or rather *appear* to "just work"). The number of people who benefit from the improved text model *greatly* outweighs the number of people inconvenienced by the new strictness when the domain line (posix binary filenames to consistently encoded text stream) is crossed. And the result is more *valid* programs, and fewer unexpected errors overall, with no inconvenience unless that domain line is crossed, and even then the inconvenience is limited to the open call that creates the binary stream.
--David
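To make the failure mode Steven asked about concrete, here is a minimal sketch. It assumes a POSIX system whose filesystem encoding is UTF-8; os.fsdecode() and os.fsencode() use the surrogateescape handler described above:

    import os

    raw = b'ZZ\xdb\xdf\xfa\xff'            # the invalid-UTF-8 name from above

    # os interfaces decode with surrogateescape, so nothing blows up and
    # the original bytes can always be recovered:
    name = os.fsdecode(raw)                # 'ZZ\udcdb\udcdf\udcfa\udcff'
    assert os.fsencode(name) == raw        # lossless round trip

    # The error appears only when crossing into a strictly encoded text
    # domain, e.g. writing the name to a UTF-8 text file or stream:
    try:
        name.encode('utf-8')
    except UnicodeEncodeError as exc:
        print('cannot represent the escaped bytes in strict UTF-8:', exc)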
"R. David Murray" <rdmurray@bitdance.com>:
The same problem existed in python2 if your goal was to produce a stream with a consistent encoding, but now python3 treats that as an error.
I have a different interpretation of the situation: as a rule, use byte strings in Python3. Text strings are a special corner case for applications that have to deal with human languages. If your application has to talk SMTP, use bytes. If your application has to do IPC, use bytes. If your application has to do file I/O, use bytes. If your application is a word processor or an IM client, you have text strings available. You might find, though, that barely any modern GUI application is satisfied with crude text strings. You will need weights, styles, sizes, emoticons, positions, directions, shadows, alignment etc etc so it may be that Python's text strings are only good enough for storing individual characters or short snippets. In sum, Python's text strings might have one sweet spot: Usenet clients. Marko
On Sat, 23 Aug 2014 19:33:06 +0300, Marko Rauhamaa <marko@pacujo.net> wrote:
"R. David Murray" <rdmurray@bitdance.com>:
The same problem existed in python2 if your goal was to produce a stream with a consistent encoding, but now python3 treats that as an error.
I have a different interpretation of the situation: as a rule, use byte strings in Python3. Text strings are a special corner case for applications that have to deal with human languages.
Clearly, then, you are writing unix (or perhaps posix)-only programs. Also, as has been discussed in this thread previously, any program that deals with filenames is dealing with human readable languages, even if posix itself treats the filenames as bytes. --David
R. David Murray writes:
Also, as has been discussed in this thread previously, any program that deals with filenames is dealing with human readable languages, even if posix itself treats the filenames as bytes.
That's a bit extreme. I can name two interesting applications offhand: git's object database and the Coda filesystem's containers. It's true that for debugging purposes bytestrings representing largish numbers are readably encoded (in hexadecimal and decimal, respectively), but they're clearly not "human readable" in the sense you mean. Nevertheless, these are the applications that prove your rule. You don't need the power of pathlib to conveniently (for the programmer) and efficiently handle the file structures these programs use. os.path is plenty.
On Tue, 26 Aug 2014 11:25:19 +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
R. David Murray writes:
Also, as has been discussed in this thread previously, any program that deals with filenames is dealing with human readable languages, even if posix itself treats the filenames as bytes.
That's a bit extreme. I can name two interesting applications offhand: git's object database and the Coda filesystem's containers.
As soon as I hit send I realized there were a few counter examples :) So, replace "any" with "most". --David
Chris Barker - NOAA Federal writes:
This brings up the other key problem. If file names are (almost) arbitrary bytes, how do you write one to/read one from a text file with a particular encoding? ( or for that matter display it on a terminal)
"Very carefully." But this is strictly from need. *Nobody* (with the exception of the crackers who like to name their programs things like "\u0007") *wants* to do this. Real people want to name their files in some human language they understand, and spell it in the usual way, and encode those characters as bytes in the usual way. Decoding those characters in the usual way and getting nonsense is the exceptional case, and it must be the application's or user's problem to decide what to do. They know where they got the file from and usually have some idea of what its name should look like. Python doesn't, so Python cannot solve it for them. For that reason, I believe that Python's "normal"/high-level approach to file handling should treat file names as (human-oriented) text. Of course Python should be able to handle bytes straight from the disk, but most programmers shouldn't have to.
And people still want to say posix isn't broken in this regard?
Deal with it, bro'.<wink/>
On Thu, Aug 21, 2014 at 05:00:02PM -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/21/2014 3:42 PM, Paul Moore wrote:
I wonder how badly a Unix system would break if you specified UTF16 as the system encoding...?
Does Unix even support UTF-16 as an encoding?
As an encoding of file's content? Certainly yes. As a locale encoding? Definitely no. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
Le 21/08/2014 18:27, Cameron Simpson a écrit :
As remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX filename bytes strings.
So you admit that POSIX mandates that file paths are expressed in an ASCII-compatible encoding after all? Good. I've nothing to add to your rant. Antoine.
On 8/21/2014 3:54 PM, Antoine Pitrou wrote:
Le 21/08/2014 18:27, Cameron Simpson a écrit :
As remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX filename bytes strings.
So you admit that POSIX mandates that file paths are expressed in an ASCII-compatible encoding after all? Good. I've nothing to add to your rant.
Antoine.
0 and 47 are certainly originally derived from ASCII. However, there could be lots of encodings that are not ASCII compatible (but in practice, probably very few, since most encodings _are_ ASCII compatible) that could fit those constraints. So while, as a technical matter, Cameron is correct that Unix only treats 0 & 47 as special, and that alone is insufficient to declare that encodings must be ASCII compatible, as a practical matter most encodings are ASCII compatible anyway, and it would be hard to find many encodings that are not ASCII compatible yet could still be used successfully for Unix file names while complying with the 0 & 47 requirements.
Am 22.08.14 01:56, schrieb Glenn Linderman:
0 and 47 are certainly originally derived from ASCII. However, there could be lots of encodings that are not ASCII compatible (but in practice, probably very few, since most encodings _are_ ASCII compatible) that could fit those constraints.
So while, as a technical matter, Cameron is correct that Unix only treats 0 & 47 as special, and that alone is insufficient to declare that encodings must be ASCII compatible, as a practical matter most encodings are ASCII compatible anyway, and it would be hard to find many encodings that are not ASCII compatible yet could still be used successfully for Unix file names while complying with the 0 & 47 requirements.
More importantly, existing encodings that are distinctively *not* ASCII compatible (e.g. the EBCDIC ones) do not put the slash into 47 (instead, it is at 91 in EBCDIC; 47 is the BEL control character). There are boundary cases, of course. VISCII is "mostly ASCII compatible", putting graphic characters into some of the control characters, but using those that aren't used in ASCII, anyway. And then there is the YUSCII family of encodings, which definitely is not ASCII compatible, as it does not contain Latin characters, but still puts the / into 47 (and also keeps the ASCII digits and special characters in their positions). There is also SI 960, which has the slash, the ASCII uppercase letters, digits and special characters, but replaces the lower-case characters with Hebrew. So yes, Unix doesn't mandate ASCII-compatible encodings; but it still mandates ASCII-inspired encodings. I wonder how you would run "gcc", though, on an SI 960 system; you'd have to type חדד. Regards, Martin
On Wed, Aug 20, 2014 at 9:52 PM, Cameron Simpson <cs@zip.com.au> wrote:
On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
So really, people treat them as
"bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and maybe a couple others)-is-ascii-compatible"
As someone who fought long and hard in the surrogate-escape listdir() wars, and was won over once the scheme was thoroughly explained to me, I take issue with these assertions: they are bogus or misleading.
Firstly, POSIX filenames _are_ just byte strings. The only forbidden character is the NUL byte, which terminates a C string, and the only special character is the slash, which separates pathname components.
so they are "just byte strings", oh, except that you can't have a null, and the "slash" had better be code 47 (and vice versa). How is that different than "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-is-ascii-compatible"? (sorry about the "maybe a couple others", I was too lazy to do my research and be sure). But my point is that python users want to be able to work with paths, and paths on posix are not strictly strings with a clearly defined encoding, but they are also not quite "just arbitrary bytes". So it would be nice if we could have a pathlib that would work with these odd beasts. I've lost track a bit as to whether the surrogate-escape solution allows this to all work now. If it does, then great, sorry for the noise.
Second, a bare low level program cannot do _much_ more than pass them around. It certainly can do things like compute their basename, or other path related operations.
only if you assume that pesky slash == 47 thing -- it's not much, but it's not raw bytes either.
The "bytes in some arbitrary encoding where at least the slash character (and maybe a couple others) is ascii compatible" notion is completely bogus. There's only one special byte, the slash (code 47). There's no OS-level need that it or anything else be ASCII compatible. I think characterizations such as the one quoted are actively misleading.
code 47 == "slash" is ascii compatible -- where else did the 47 value come from?
I think we'd all agree it is nice to have a system where filenames are all Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore the reality for such systems.
well, the community could have gone to "if you want anything other than ascii, make it utf-8" -- but alas, we're all a bunch of independent thinkers. But none of this is relevant -- systems in the wild do what they do -- clearly we all want Python to work with them as best it can.
There's no _external_ "filesystem encoding" in the sense of something recorded in the filesystem that anyone can inspect. But there is the expressed locale settings, available at runtime to any program that cares to pay attention. It is a workable situation.
I haven't run into it, but it seems the folks that have don't think relying on the locale setting is the least bit workable. If it were, we wouldn't be having this discussion -- use the locale setting to decide how to decode filenames -- done.
Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly internally consistent. It just doesn't match what he wants. (Indeed, what I want, and I'm a long time UNIX fanboy.)
bug or feature? you decide. Internal consistency is a good start, but it punts the whole encoding issue to the client software, without giving it the tools to do it right. I call that "really hard to work with" if not broken.
-Chris
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Thu, 21 Aug 2014, Chris Barker wrote:
so they are "just byte strings", oh, except that you can't have a null, and the "slash" had better be code 47 (and vice versa). How is that different than "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-is-ascii-compatible"?
Actually, slash doesn't need to be code 47. But no matter what code 47 means outside of the context of a filename, it is the path arc separator byte (not character).
In fact, this isn't even entirely academic. On a Mac OS X machine, go into Finder and try to create a directory called ":". You'll get an error saying 'The name “:” can’t be used.'. Now create a directory called "/". No problem, raising the question of what is going on at the filesystem level? Answer:
$ ls -al
total 0
drwxr-xr-x   3 ijmorlan  staff   102 21 Aug 18:57 ./
drwxr-xr-x+ 80 ijmorlan  staff  2720 21 Aug 18:57 ../
drwxr-xr-x   2 ijmorlan  staff    68 21 Aug 18:57 :/
And of course in shell one would remove the directory with this:
rm -rf :
not:
rm -rf /
So in effect the file system path arc encoding on Mac OS X is UTF-8 *except* that : is outlawed and / is encoded as \x3A rather than the usual \x2F. Of course, the path arc separator byte (not character) remains \x2F as always.
Just for fun, there are contexts in which one can give a full path at the GUI level, where : is used as the path separator. This is for historical reasons and presumably is the reason for the above-noted behaviour.
I think the real tension here is between the POSIX level where filenames are byte strings (except for \x00, which is reserved for string termination) where \x2F has special interpretation, and absolutely every application ever written, in every language, which wants filenames to be character strings.
Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist
On 22 Aug 2014 09:24, "Isaac Morland" <ijmorlan@uwaterloo.ca> wrote:
I think the real tension here is between the POSIX level where filenames are byte strings (except for \x00, which is reserved for string termination) where \x2F has special interpretation, and absolutely every application ever written, in every language, which wants filenames to be character strings.
That's one of the best summaries of the situation I've ever seen :) Most languages (including Python 2) throw up their hands and say this is the developer's problem to deal with. Python 3 says it's *our* problem to deal with on behalf of our developers. The "surrogateescape" error handler allows recalcitrant bytes to be dealt with relatively gracefully in most situations. We don't quite cover *everything* yet (hence the complaints from some of the folks that are experts at dealing with Python 2 Unicode handling on POSIX systems), but the remaining problems are a lot more tractable than the "teach every native English speaker everywhere how to handle Unicode properly" problem. Regards, Nick.
Nick Coghlan <ncoghlan@gmail.com>:
Python 3 says it's *our* problem to deal with on behalf of our developers.
<URL: http://www.imdb.com/title/tt0120623/quotes?item=qt0353406> Flik: I was just trying to help. Mr. Soil: Then help us; *don't* help us. Marko
On 8/20/2014 9:01 AM, Antoine Pitrou wrote:
Le 20/08/2014 07:08, Nick Coghlan a écrit :
It's not just the JVM that says text and binary APIs should be separate - it's every widely used operating system services layer except POSIX. The POSIX way works well *if* everyone reliably encodes things as UTF-8 or always uses encoding detection, but its failure mode is unfortunately silent data corruption.
That said, there's a lot of Python software that is POSIX specific, where bytes paths would be the least of the barriers to porting to Windows or Jython. I'm personally +1 on consistently allowing binary paths in lower level APIs, but disallowing them in higher level explicitly cross platform abstractions like pathlib.
I fully agree with Nick's position here.
To elaborate specifically about pathlib, it doesn't handle bytes paths but allows you to generate them if desired: https://docs.python.org/3/library/pathlib.html#operators
Adding full bytes support to pathlib would have added a lot of complication and fragility in the implementation *and* in the API (is it allowed to combine str and bytes paths? should they have separate classes?), for arguably little benefit.
I am glad you did not recreate the madness of pre 3.0 Python in that regard.
I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs.
Do our docs somewhere explain the idea that file names are conceptually *names*, not arbitrary bytes; explain the concept of low-level versus high-level APIs; and point to the two types of APIs in Python? -- Terry Jan Reedy
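A minimal sketch of the pathlib behaviour Antoine points to above: path objects are str-based, but a bytes form can be derived when needed. This assumes a POSIX system with a UTF-8 filesystem encoding; the exact bytes produced depend on that encoding:

    import os
    from pathlib import PurePosixPath

    p = PurePosixPath('/tmp') / 'données.txt'
    print(str(p))                 # '/tmp/données.txt' - the native str form
    print(bytes(p))               # b'/tmp/donn\xc3\xa9es.txt', via os.fsencode()
    print(os.fsencode(str(p)))    # the same low-level conversion, spelled out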
Antoine Pitrou wrote:
I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs.
The problem with scandir() in particular is that there is currently *no* low-level API exposed that gives the same functionality. If scandir() is not to support bytes paths, I'd suggest exposing the opendir() and readdir() system calls with bytes path support. -- Greg
On 21 Aug 2014 08:19, "Greg Ewing" <greg.ewing@canterbury.ac.nz> wrote:
Antoine Pitrou wrote:
I think if you want low-level features (such as unconverted bytes paths
under POSIX), it is reasonable to point you to low-level APIs.
The problem with scandir() in particular is that there is currently *no* low-level API exposed that gives the same functionality.
If scandir() is not to support bytes paths, I'd suggest exposing the opendir() and readdir() system calls with bytes path support.
scandir is low level (the entire os module is low level). In fact, aside from pathlib, I'd consider pretty much every API we have that deals with paths to be low level - that's a large part of the reason we needed pathlib! Cheers, Nick.
-- Greg
On 08/20/2014 03:31 PM, Nick Coghlan wrote:
On 21 Aug 2014 08:19, "Greg Ewing" <greg.ewing@canterbury.ac.nz <mailto:greg.ewing@canterbury.ac.nz>> wrote:
Antoine Pitrou wrote:
I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs.
The problem with scandir() in particular is that there is currently *no* low-level API exposed that gives the same functionality.
If scandir() is not to support bytes paths, I'd suggest exposing the opendir() and readdir() system calls with bytes path support.
scandir is low level (the entire os module is low level). In fact, aside from pathlib, I'd consider pretty much every API we have that deals with paths to be low level - that's a large part of the reason we needed pathlib!
If scandir is low-level, and the low-level API's are the ones that should support bytes paths, then scandir should support bytes paths. Is that what you meant to say? -- ~Ethan~
On 21 August 2014 09:33, Ethan Furman <ethan@stoneleaf.us> wrote:
On 08/20/2014 03:31 PM, Nick Coghlan wrote:
On 21 Aug 2014 08:19, "Greg Ewing" <greg.ewing@canterbury.ac.nz <mailto:greg.ewing@canterbury.ac.nz>> wrote:
Antoine Pitrou wrote:
I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs.
The problem with scandir() in particular is that there is currently *no* low-level API exposed that gives the same functionality.
If scandir() is not to support bytes paths, I'd suggest exposing the opendir() and readdir() system calls with bytes path support.
scandir is low level (the entire os module is low level). In fact, aside from pathlib, I'd consider pretty much every API we have that deals with paths to be low level - that's a large part of the reason we needed pathlib!
If scandir is low-level, and the low-level API's are the ones that should support bytes paths, then scandir should support bytes paths.
Is that what you meant to say?
Yes. The discussions around PEP 471 *deferred* discussions of bytes and file descriptor support to their own RFEs (not needing a PEP), they didn't decide definitively not to support them. So Serhiy's thread is entirely pertinent to that question. Note that adding bytes support still *should not* hold up the initial PEP 471 implementation - it should be done as a follow on RFE. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 08/20/2014 05:15 PM, Nick Coghlan wrote:
On 21 August 2014 09:33, Ethan Furman <ethan@stoneleaf.us> wrote:
On 08/20/2014 03:31 PM, Nick Coghlan wrote:
scandir is low level (the entire os module is low level). In fact, aside from pathlib, I'd consider pretty much every API we have that deals with paths to be low level - that's a large part of the reason we needed pathlib!
If scandir is low-level, and the low-level API's are the ones that should support bytes paths, then scandir should support bytes paths.
Is that what you meant to say?
Yes. The discussions around PEP 471 *deferred* discussions of bytes and file descriptor support to their own RFEs (not needing a PEP), they didn't decide definitively not to support them. So Serhiy's thread is entirely pertinent to that question.
Thanks for clearing that up. I hate feeling confused. ;)
Note that adding bytes support still *should not* hold up the initial PEP 471 implementation - it should be done as a follow on RFE.
Agreed. -- ~Ethan~
If scandir is low-level, and the low-level API's are the ones that should support bytes paths, then scandir should support bytes paths.
Is that what you meant to say?
Yes. The discussions around PEP 471 *deferred* discussions of bytes and file descriptor support to their own RFEs (not needing a PEP), they didn't decide definitively not to support them. So Serhiy's thread is entirely pertinent to that question.
Note that adding bytes support still *should not* hold up the initial PEP 471 implementation - it should be done as a follow on RFE.
I agree with this (that scandir is low level and should support bytes). As it happens, I'm implementing bytes support as well -- what with the path_t support in posixmodule.c and the listdir implementation to go on, it's not really any harder. So I think we'll have it right off the bat. BTW, the Windows implementation of PEP 471 is basically done, and the POSIX implementation is written but not working yet. And then there's tests and docs. -Ben
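For context on the listdir compatibility mentioned earlier in the thread, os.listdir() is already polymorphic in exactly this way: bytes in, bytes out; str in, str out. A quick sketch, assuming a POSIX system:

    import os

    print(os.listdir('.')[:3])     # str names, decoded with the fs encoding
    print(os.listdir(b'.')[:3])    # bytes names, left undecoded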
On Tue, Aug 19, 2014, at 10:43, Ben Hoyt wrote:
The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
Does that mean that new APIs should explicitly not support bytes? I'm thinking of os.scandir() (PEP 471), which I'm implementing at the moment. I was originally going to make it support bytes so it was compatible with listdir, but maybe that's a bad idea. Bytes paths are essentially broken on Windows.
Bytes paths are "essential" on Unix, though, so I don't think we should create new low-level APIs that don't support bytes.
Fair enough. I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix?
Well, notice the official policy is desperately *wanting* them to go away with the implication that we grudgingly bow to reality. :)
Le 19/08/2014 13:43, Ben Hoyt a écrit :
The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
Does that mean that new APIs should explicitly not support bytes? I'm thinking of os.scandir() (PEP 471), which I'm implementing at the moment. I was originally going to make it support bytes so it was compatible with listdir, but maybe that's a bad idea. Bytes paths are essentially broken on Windows.
Bytes paths are "essential" on Unix, though, so I don't think we should create new low-level APIs that don't support bytes.
Fair enough. I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix?
PEP 383 should actually work on Unix quite well, AFAIR. Regards Antoine.
Ben Hoyt writes:
Fair enough. I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix?
They're not essential on *nix. Unix paths at the OS level are "just bytes" (even on Mac, although the most common Mac filesystem does enforce UTF-8 Unicode NFD). This use case is now perfectly well served by codecs. However, there are a lot of applications that involve reading a file name from a directory, and passing it verbatim to another OS function. This case can be handled now using the surrogateescape error handler, but when these APIs were introduced we didn't even have a reliable way to roundtrip filenames because a Unix filename doesn't need to be a string of characters from *any* character set. And there's the undeniable convenience of treating file names as opaque objects in those applications. Regards,
I'm sorry my moment of levity was taken so seriously. With my serious hat on, I would like to claim that *conceptually* filenames are most definitely text. Due to various historical accidents the UNIX system calls often encoded text as arguments, and we sometimes need to control that encoding. Hence the occasional need for bytes arguments. But most of the time you don't have to think about that, and forcing users to worry about it is mostly as counter-productive as forcing to think about the encoding of every text file. -- --Guido van Rossum (python.org/~guido)
Guido van Rossum <guido@python.org>:
With my serious hat on, I would like to claim that *conceptually* filenames are most definitely text. Due to various historical accidents the UNIX system calls often encoded text as arguments, and we sometimes need to control that encoding.
Due to historical accidents, text (in the Python sense) is not a first-class data type in Unix. Text, machine language, XML, Python etc are interpretations of bytes. Bytes are the first-class data type recognized by the kernel. That reality cannot be wished away.
Hence the occasional need for bytes arguments. But most of the time you don't have to think about that, and forcing users to worry about it is mostly as counter-productive as forcing to think about the encoding of every text file.
The users of Python programs can often be given higher-level facades. Unix programmers, though, shouldn't be shielded from bytes. Marko
"Stephen J. Turnbull" <stephen@xemacs.org> writes:
Marko Rauhamaa writes:
Unix programmers, though, shouldn't be shielded from bytes.
Nobody's trying to do that. But Python users should be shielded from Unix programmers.
+1 QotW -- \ “Intellectual property is to the 21st century what the slave | `\ trade was to the 16th.” —David Mertz | _o__) | Ben Finney
On 20 August 2014 07:53, Ben Finney <ben+python@benfinney.id.au> wrote:
"Stephen J. Turnbull" <stephen@xemacs.org> writes:
Marko Rauhamaa writes:
Unix programmers, though, shouldn't be shielded from bytes.
Nobody's trying to do that. But Python users should be shielded from Unix programmers.
+1 QotW
That quote is actually almost a "hidden extra Zen of Python" IMO :-) Both parts of it. Paul
Greg Ewing writes:
Stephen J. Turnbull wrote:
This case can be handled now using the surrogateescape error handler,
So maybe the way to make bytes paths go away is to always use surrogateescape for paths on unix?
Backward compatibility rules that out, I think. I certainly would recommend that for new code, but even for new code there are many users who vehemently object to using Unicode as an intermediate representation of things they think of as binary blobs. Not worth the hassle to even seriously propose removing those APIs IMO.
On Tuesday, August 19, 2014, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Greg Ewing writes:
Stephen J. Turnbull wrote:
This case can be handled now using the surrogateescape error handler,
So maybe the way to make bytes paths go away is to always use surrogateescape for paths on unix?
Backward compatibility rules that out, I think. I certainly would recommend that for new code, but even for new code there are many users who vehemently object to using Unicode as an intermediate representation of things they think of as binary blobs. Not worth the hassle to even seriously propose removing those APIs IMO.
But maybe we don't have to add new ones? --Guido -- --Guido van Rossum (on iPad)
Guido van Rossum writes:
On Tuesday, August 19, 2014, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Greg Ewing writes:
So maybe the way to make bytes paths go away is to always use surrogateescape for paths on unix?
Backward compatibility rules that out, I think. I certainly would recommend that for new code, but even for new code there are many users who vehemently object to using Unicode as an intermediate representation of things they think of as binary blobs. Not worth the hassle to even seriously propose removing those APIs IMO.
But maybe we don't have to add new ones?
IMO, we should avoid it. There may be some use cases. Serhiy mentions two bug reports:
http://bugs.python.org/issue19997 imghdr.what doesn't accept bytes paths
http://bugs.python.org/issue20797 zipfile.extractall should accept bytes path as parameter
I'm very unsympathetic to these. In both cases the bytes are coming from outside of the module in question. Why are they in bytes? That question should scare you, because from the point of view of end users there are no good answers: they all mean that the end user is going to end up with uninterpretable bytes in their directories, for the convenience of the programmer.
In the case of issue20797, I'd be a *little* sympathetic if the RFE were for the *members* argument. zipfiles evidently have no way to specify the encodings of the name(s) of their members (and the zipfile module doesn't have APIs for it!), so the programmer is kind of stuck, especially if the requirement is that the extraction require no user intervention. But again, this is rarely what the user wants.
I would be sympathetic to an internal, bytes-based, "kids, these stunts are performed by trained professionals, do NOT try this at home" API, with a sane user-oriented str-based API for ordinary use for this module. I suppose it might be useful for such a multi-type API to be polymorphic, but it would have to be a "if there are bytes anywhere, everything must be bytes and return values will be bytes" (and similarly for str) kind of polymorphism. No mixing bytes and strings, period.
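For illustration only, here is the "just call os.fsdecode() on the argument" approach from the start of the thread applied to the zipfile case. This is a hypothetical wrapper, not what the stdlib actually does:

    import os
    import zipfile

    def extractall_anypath(archive, path):
        # Accept either a str or a bytes destination path by normalising
        # to str; bytes are decoded with the surrogateescape handler.
        path = os.fsdecode(path)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(path)

    # Both spellings end up on the same str-based code path:
    # extractall_anypath('backup.zip', '/tmp/out')
    # extractall_anypath('backup.zip', b'/tmp/out')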
Am 19.08.14 19:43, schrieb Ben Hoyt:
The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
Does that mean that new APIs should explicitly not support bytes? I'm thinking of os.scandir() (PEP 471), which I'm implementing at the moment. I was originally going to make it support bytes so it was compatible with listdir, but maybe that's a bad idea. Bytes paths are essentially broken on Windows.
Bytes paths are "essential" on Unix, though, so I don't think we should create new low-level APIs that don't support bytes.
Fair enough. I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix?
I think the people defending the "Unix file names are just bytes" side often miss an important detail: displaying file names to the user, and allowing the user to enter file names.
A script that just needs to traverse a directory tree and look at files by certain criteria can easily do so without worrying about a text interpretation of the file names. When it comes to user interaction, it becomes apparent that, even on Unix, file names are not just bytes. If you do "ls -l" in your shell, the "system" (not just the kernel - but ultimately the terminal program, which might be the console driver, or an X11 application) will interpret the file names as having an encoding, and render them with a font.
So for Python, the question is: which of the use cases (processing all files, vs. showing them to the user) should be better supported? Python 3 took the latter as an answer, under the assumption that this is the more common case.
Regards, Martin
"Martin v. Löwis" <martin@v.loewis.de>:
I think the people defending the "Unix file names are just bytes" side often miss an important detail: displaying file names to the user, and allowing the user to enter file names.
The user interface is a real issue and needs to be addressed. It is separate from the OS interface, though.
A script that just needs to traverse a directory tree and look at files by certain criteria can easily do so with not worrying about a text interpretation of the file names.
A single system often has file names that have been encoded with different schemes. Only today, I have had to deal with the JIS character table (<URL: http://i.msdn.microsoft.com/cc305152.932%28en-us,MSDN.10%29.gif>) -- you will notice that it doesn't have a backslash character. A coworker uses ISO-8859-1. I use UTF-8. UTF-8, of course, will refuse to deal with some byte sequences. My point is that the poor programmer cannot ignore the possibility of "funny" character sets. If Python tried to protect the programmer from that possibility, the result might be even more intractable: how to act on a file with an non-UTF-8 filename if you are unable to express it as a text string? Marko
On 21 August 2014 23:58, Marko Rauhamaa <marko@pacujo.net> wrote:
My point is that the poor programmer cannot ignore the possibility of "funny" character sets. If Python tried to protect the programmer from that possibility, the result might be even more intractable: how to act on a file with an non-UTF-8 filename if you are unable to express it as a text string?
That's what the "surrogateescape" codec is for - we use it by default on most OS interfaces, and it's implicit in the use of "os.fsencode" and "os.fsdecode". Starting with Python 3, it's also enabled on sys.stdout by default, so that "print(os.listdir(dirname))" will pass the original raw bytes through to the terminal the same way Python 2 does. The docs could use additional details as to which interfaces do and don't have surrogateescape enabled by default, but for the time being, the description of the codec error handler just links out to the original definition in PEP 383. It may also be useful to have some tools for detecting and cleaning strings containing surrogate escaped data, but there hasn't been a concrete proposal along those lines as yet. Personally, I'm currently waiting to see if the Fedora or OpenStack folks indicate a need for such tools before proposing any additions. Regards, Nick.
Marko
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
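Regarding the "tools for detecting and cleaning" idea Nick mentions above, no such helper exists in the stdlib; the following is only a rough sketch of what a check could look like. It assumes a UTF-8 filesystem encoding, so undecodable bytes show up as lone surrogates in the U+DC80-U+DCFF range:

    import os

    def has_escaped_bytes(s):
        # True if the str carries bytes smuggled in by surrogateescape.
        return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s)

    clean = os.fsdecode(b'caf\xc3\xa9')   # valid UTF-8 -> 'café'
    dirty = os.fsdecode(b'caf\xe9')       # latin-1 byte -> 'caf\udce9'

    print(has_escaped_bytes(clean))       # False
    print(has_escaped_bytes(dirty))       # True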
On 22 August 2014 00:12, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 21 August 2014 23:58, Marko Rauhamaa <marko@pacujo.net> wrote:
My point is that the poor programmer cannot ignore the possibility of "funny" character sets. If Python tried to protect the programmer from that possibility, the result might be even more intractable: how to act on a file with an non-UTF-8 filename if you are unable to express it as a text string?
That's what the "surrogateescape" codec is for
Oops, that should say "codec error handler" (I got it right later in the post). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Marko Rauhamaa writes:
My point is that the poor programmer cannot ignore the possibility of "funny" character sets.
*Poor* programmers do it all the time. That's why Python codecs raise when they encounter bytes they can't handle.
If Python tried to protect the programmer from that possibility,
I don't understand your point. The existing interfaces aren't going anywhere, and they're enough to do anything you need to do. Although there are a few radicals (like me in a past life :-) who might like to see them go away in favor of opt-in to binary encoding via surrogateescape error handling, nobody in their right mind supports us. The question here is not about going backward, it's about whether to add new bytes APIs, and which ones.
19.08.14 20:02, Guido van Rossum написав(ла):
The official policy is that we want them to go away, but reality so far has not budged. We will continue to hold our breath though. :-)
Does it mean that we should reject all propositions about adding bytes path support in existing functions (in particular issue19997 (imghdr) and issue20797 (zipfile))?
participants (26)
- "Martin v. Löwis"
- Antoine Pitrou
- Ben Finney
- Ben Hoyt
- Benjamin Peterson
- Brett Cannon
- Cameron Simpson
- Chris Angelico
- Chris Barker
- Chris Barker - NOAA Federal
- Ethan Furman
- Glenn Linderman
- Greg Ewing
- Guido van Rossum
- Isaac Morland
- Marko Rauhamaa
- Nick Coghlan
- Nikolaus Rath
- Oleg Broytman
- Paul Moore
- R. David Murray
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tres Seaver