On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:

 > the proposals to embed binary in Unicode by abusing Latin-1
 > encoding.

Those aren't "proposals", they are currently feasible techniques in
Python 3 for *some* use cases.

The question is why infecting Python 3 with the byte/character
confoundance virus is preferable to such techniques, especially if
their (serious!) deficiencies are removed by creating a new type such
as asciistr.
"Smuggled binary" (great term borrowed from a different subthread) muddies the waters of what you are dealing with. As long as the actual data is only Latin-1 text and smuggled binary, the technique probably isn't too bad: you can define the "smuggled binary" as a "decoding" of binary to text, sort of like base64 "decodes" binary to ASCII. And it can be a useful technique.
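To make the technique concrete, here is a minimal sketch of the Latin-1 smuggling trick: Latin-1 maps every byte value 0x00-0xFF to the Unicode code point of the same value, so the round trip is lossless.

```python
# "Smuggling" arbitrary bytes through str via Latin-1.  Latin-1 maps each
# byte N losslessly to code point U+00NN, so no byte value can fail.
raw = bytes(range(256))             # every possible byte value
smuggled = raw.decode('latin-1')    # binary "decoded" into a str
assert len(smuggled) == 256
assert smuggled.encode('latin-1') == raw   # round-trips exactly
```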

As soon as you introduce "smuggled non-ASCII, non-Latin-1 text" encodings into the mix, it gets thoroughly confusing... just as confusing as the Python 2 text model. Smuggling the text takes an encode (to the wire encoding) followed by a Latin-1 decode, plus a Latin-1 encode to push it to the boundary, and you end up with data that you know is text, but, because of the required smuggling techniques, you can't operate on it or view it properly as the text it should be.
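A short sketch of exactly that confusion: UTF-8 encoded text smuggled through Latin-1 no longer reads as the text it represents, and an extra decode is needed to get it back.

```python
# Non-Latin-1 text smuggled through Latin-1 turns into mojibake.
text = "héllo"                        # real text containing non-ASCII
payload = text.encode('utf-8')        # encode for the wire format
smuggled = payload.decode('latin-1')  # smuggle into str
# smuggled is now 'hÃ©llo': it is "text", but not the text it should be
assert smuggled != text
# recovering the real text takes yet another encode/decode pair
assert smuggled.encode('latin-1').decode('utf-8') == text
```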

The "byte/character confoundance virus" is a hobgoblin of paranoid perception.  In another post, I pointed out that

''' b"%d" % 25 '''  is not equivalent to  ''' "%d" % 25 ''' because of the "b" in the first case. So the "implicit" encoding that everyone on that side of the fence was talking about was not at all implicit, but explicit.  The numeric characters produced by %d are clearly in the ASCII subset of text, so having b"%d" % 25 produce pre-encoded ASCII text is explicit and practical.
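For reference, this is the behavior that was eventually standardized by PEP 461 (bytes formatting landed in Python 3.5, after this thread): %d produces ASCII digits directly in a bytes result, with the "encoding" made explicit by the b prefix.

```python
# PEP 461 behavior (Python 3.5+): %d in a bytes format yields ASCII digits.
assert b"%d" % 25 == b"25"   # explicitly bytes, via the b prefix
assert "%d" % 25 == "25"     # the str counterpart: same digits, different type
```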

My only concern was what  b"%s" % 'abc'  should do, because in general, str may not contain only ASCII.  (generalize to  b"%s" % str(...)  ).  Guido solved that one nicely.  Of course, at this point, I could punt the whole argument off to "Guido said so", but since you asked me, I felt it appropriate to respond from my perspective... and I'm not sure Guido specifically addressed your smuggled binary proposal.
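The resolution referred to here, as it eventually shipped in PEP 461 (Python 3.5+): %s in a bytes format accepts bytes, or objects implementing __bytes__, and rejects str outright, so no implicit encoding can ever happen.

```python
# PEP 461 (Python 3.5+): b"%s" refuses str rather than implicitly encoding it.
assert b"%s" % b"abc" == b"abc"   # bytes are accepted as-is
try:
    b"%s" % "abc"                 # str is rejected
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError for str operand")
```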

When the mixture of text and binary is done as encoded text in binary, then it is obvious that only limited text processing can be performed, and getting the text there requires that it was encoded (hopefully properly encoded per the binary specification being created) to become binary. And there are no extra, confusing Latin-1 encode/decode operations required.

From a higher-level perspective, I think it would be great to have a module, perhaps called "boundary" (let's call it that for now), that allows some definition syntax (ABNF? some augmented variant of it?) to describe the format of a binary blob, and then provides methods for generating and parsing it to/from Python objects. Obviously, the ABNF couldn't understand Python objects; instead, Python objects might define the ABNF to which they correspond, along with methods for accepting binary and producing the object (factory method?) and methods for generating the binary. As objects build upon other objects, the ABNF to which they correspond could be constructed, and perhaps even proven capable of parsing all valid blobs corresponding to the specification, and perhaps even proven capable of generating only valid blobs (although I'm not a software proof guru; last I heard there were definite limits on the ability to do proofs, but maybe this is a limited enough domain that it could work).
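A very rough sketch of what such a "boundary" module might feel like; every name here is invented for illustration, and a simple (name, struct-format) field list stands in for the ABNF, from which both a parser and a generator are derived.

```python
# Hypothetical sketch of the "boundary" idea: one declarative field spec
# drives both parsing and generation.  All names are invented.
import struct

class Boundary:
    """Maps a list of (name, struct-format) pairs to parse/generate methods."""
    def __init__(self, *fields):
        self.names = [name for name, _ in fields]
        self.fmt = '>' + ''.join(fmt for _, fmt in fields)  # big-endian

    def parse(self, blob):
        return dict(zip(self.names, struct.unpack(self.fmt, blob)))

    def generate(self, **values):
        return struct.pack(self.fmt, *(values[n] for n in self.names))

header = Boundary(('magic', '4s'), ('version', 'H'), ('length', 'I'))
blob = header.generate(magic=b'BLOB', version=1, length=42)
assert blob[:4] == b'BLOB'
assert header.parse(blob) == {'magic': b'BLOB', 'version': 1, 'length': 42}
```

A real version would of course need variable-length fields, nesting, and alternatives before it could express anything ABNF can, but the shape of the API is the point.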

Then all blobs could be operated on sort of like web browsers operate on the DOM, or like some XML parsing libraries, by defining each blob as a collection of objects for the pieces. XML is far too wordy for practical use (but hey! it is readable), but perhaps it could be practical if tokenized, and then the tokenized representation could be converted to a DOM just as XML and HTML are. (This is mostly to draw the parallel in the parsing and processing techniques; I'm not seriously suggesting a binary version of XML, but there is a strong parallel, and it could be done.) Given a DOM-like structure, a validator could be written to operate on it, providing, if not a proof, at least a sanity check. And, given the DOM-like structure, one call to the top-level object to generate the blob format would walk over all of them, generating the whole blob.
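The "one call to the top-level object" part can be sketched in a few lines; again, the class and method names here are invented for illustration, not an existing API.

```python
# Hypothetical DOM-like sketch: each node emits its own bytes, and
# generating the whole blob is one walk from the root.
class Node:
    def __init__(self, payload=b'', children=()):
        self.payload = payload
        self.children = list(children)

    def to_bytes(self):
        # emit own payload, then recurse over children (a pre-order walk)
        return self.payload + b''.join(c.to_bytes() for c in self.children)

root = Node(b'HDR:', [Node(b'part1;'), Node(b'part2;', [Node(b'nested')])])
assert root.to_bytes() == b'HDR:part1;part2;nested'
```

A validator would be another such walk over the same tree, checking each node against its piece of the specification.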

Off I go, drifting into Python ideas.... but I have a program I want to rewrite that could surely use some of these techniques (and probably will), because it wants to read several legacy formats, and produce several legacy formats, as well as a new, more comprehensive format.  So the objects will be required to parse/generate 4 different blob structures, one of which has its own set of several legacy variations.