[Python-Dev] PEP 460: allowing %d and %f and mojibake
v+python at g.nevcal.com
Mon Jan 13 07:46:14 CET 2014
On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
> > the proposals to embed binary in Unicode by abusing Latin-1
> > encoding.
> Those aren't "proposals", they are currently feasible techniques in
> Python 3 for *some* use cases.
> The question is why infecting Python 3 with the byte/character
> confoundance virus is preferable to such techniques, especially if
> their (serious!) deficiencies are removed by creating a new type such
> as asciistr.
"smuggled binary" (great term borrowed from a different subthread)
muddies the waters of what you are dealing with. As long as the actual
data is only Latin-1 and smuggled binary, the technique probably isn't
too bad... you can define the "smuggled binary" as a "decoding" of
binary to text, sort of like base64 "decodes" binary to ASCII. And it
can be a useful technique.
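A minimal sketch of the technique, for illustration only: Latin-1 maps each byte value 0-255 to the code point with the same number, so any bytes object survives a decode/encode round trip. The payload below is made up.

```python
# Arbitrary binary payload (made up for this example), including bytes >= 0x80.
payload = bytes([0x00, 0x7F, 0x80, 0xFF]) + b"GIF89a"

# "Smuggle" the binary into a str: each byte becomes the code point
# with the same numeric value.
smuggled = payload.decode("latin-1")

# At the boundary, the inverse step recovers the original bytes exactly.
restored = smuggled.encode("latin-1")

print(restored == payload)
print([ord(ch) for ch in smuggled[:4]])
```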
As soon as you introduce "smuggled non-ASCII, non-Latin-1 text"
encodings into the mix, it gets thoroughly confusing... just as
confusing as the Python 2 text model. It takes decode+encode to do the
smuggled text, plus encode to push it to the boundary, plus you have text
that you know is text, but because of the required techniques for
smuggling it, you can't operate on it or view it properly as the text
that it should be.
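The extra steps can be sketched as follows. This assumes UTF-8 as the wire encoding for the text, which is a choice made for the example, not something the technique requires.

```python
text = "日本語"  # real text that is neither ASCII nor Latin-1

# Smuggling takes two steps: encode to the wire encoding (UTF-8 assumed),
# then decode as Latin-1 so it can sit in a str beside smuggled binary.
smuggled = text.encode("utf-8").decode("latin-1")

# The smuggled str no longer behaves like the text it stands for:
# its length is the byte count, and it does not compare equal to the text.
print(len(text), len(smuggled))
print(smuggled == text)

# Viewing it as text again needs the reverse pair of steps.
recovered = smuggled.encode("latin-1").decode("utf-8")

# One more encode pushes the smuggled form out to the boundary.
wire = smuggled.encode("latin-1")
print(recovered == text, wire == text.encode("utf-8"))
```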
The "byte/character confoundance virus" is a hobgoblin of paranoid
perception. In another post, I pointed out that
''' b"%d" % 25 ''' is not equivalent to ''' "%d" % 25 ''' because of
the "b" in the first case. So the "implicit" encoding that everyone on
that side of the fence was talking about was not at all implicit, but
explicit. The numeric characters produced by %d are clearly in the
ASCII subset of text, so having b"%d" % 25 produce pre-encoded ASCII
text is explicit and practical.
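As a concrete check, this is how the two expressions behave in Python 3.5 and later, where bytes formatting was eventually added through PEP 461 (not the PEP 460 draft under discussion here). Earlier Python 3 releases raise TypeError on the first line.

```python
# Requires Python 3.5+ (bytes %-formatting from PEP 461).
as_bytes = b"%d" % 25   # the b prefix asks for bytes output
as_text = "%d" % 25     # no prefix: ordinary str output

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)

# The bytes result is exactly the ASCII encoding of the text result.
print(as_bytes == as_text.encode("ascii"))
```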
My only concern was what b"%s" % 'abc' should do, because in general,
str may not contain only ASCII. (generalize to b"%s" % str(...) ).
Guido solved that one nicely. Of course, at this point, I could punt
the whole argument off to "Guido said so", but since you asked me, I
felt it appropriate to respond from my perspective... and I'm not sure
Guido specifically addressed your smuggled binary proposal.
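For reference, a sketch of how the b"%s" case behaves in Python 3.5 and later under PEP 461. Whether that matches the resolution referred to above is an assumption; this message does not spell it out.

```python
# Requires Python 3.5+ (bytes %-formatting from PEP 461).
try:
    b"%s" % "abc"          # str argument: no implicit encoding is applied
    outcome = "accepted"
except TypeError:
    outcome = "rejected"
print(outcome)

# The caller states the encoding, so non-ASCII text is handled deliberately.
ascii_result = b"%s" % "abc".encode("ascii")
utf8_result = b"name=%s" % "é".encode("utf-8")
print(ascii_result, utf8_result)
```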
When the mixture of text and binary is done as encoded text in binary,
then it is obvious that only limited text processing can be performed,
and getting the text there requires that it was encoded (hopefully
properly encoded per the binary specification being created) to become
binary. And there are no extra, confusing Latin-1 encode/decode steps.
From a higher-level perspective, I think it would be great to have a
module, perhaps called "boundary" (let's call it that for now), that
allows some definition syntax (augmented BNF? augmented ABNF?) to explain
the format of a binary blob. And then provide methods for generating and
parsing it to/from Python objects. Obviously, the ABNF couldn't
understand Python objects; instead, Python objects might define the ABNF
to which they correspond, and methods for accepting binary and producing
the object (factory method?) and methods for generating the binary. As
objects build upon other objects, the ABNF to which they correspond could
be constructed, and perhaps even proven to be capable of parsing all
valid blobs corresponding to the specification, and perhaps even proven
to be capable of generating only valid blobs (although I'm not a
software proof guru; last I heard there were definite limits on the
ability to do proofs, but maybe this is a limited enough domain it could be done).
Then all blobs could be operated on sort of like web browsers operate on
the DOM, or some XML parsing libraries, by defining each blob as a
collection of objects for the pieces. XML is far too wordy for practical
use (but hey! it is readable) but perhaps it could be practical if
tokenized, and then the tokenized representation could be converted to a
DOM just like XML and HTML are. (this is mostly to draw the parallel in
the parsing and processing techniques; I'm not seriously suggesting a
binary version of XML, but there is a strong parallel, and it could be
done). Given a DOM-like structure, a validator could be written to
operate on it, though, to provide, if not a proof, at least a sanity
check. And, given the DOM-like structure, one call to the top-level
object to generate the blob format would walk over all of them,
generating the whole blob.
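A rough sketch of what such a DOM-like structure might look like, with no ABNF involved. Every name here (UInt16, Text, Record, the layout attribute) and the two-field format itself are hypothetical, invented for the example; only the struct module is real.

```python
import struct


class UInt16:
    """Leaf node: unsigned 16-bit big-endian integer."""

    def __init__(self, value):
        self.value = value

    def validate(self):
        return 0 <= self.value <= 0xFFFF

    def to_bytes(self):
        return struct.pack(">H", self.value)

    @classmethod
    def parse(cls, blob, offset):
        (value,) = struct.unpack_from(">H", blob, offset)
        return cls(value), offset + 2


class Text:
    """Leaf node: UTF-8 text with a 1-byte length prefix."""

    def __init__(self, value):
        self.value = value

    def validate(self):
        return len(self.value.encode("utf-8")) <= 0xFF

    def to_bytes(self):
        data = self.value.encode("utf-8")  # explicit encoding at the boundary
        return bytes([len(data)]) + data

    @classmethod
    def parse(cls, blob, offset):
        length = blob[offset]
        start = offset + 1
        return cls(blob[start:start + length].decode("utf-8")), start + length


class Record:
    """Interior node: a fixed sequence of child nodes."""

    layout = (UInt16, Text)  # made-up format: an id, then a name

    def __init__(self, children):
        self.children = list(children)

    def validate(self):
        # Sanity check, not a proof: right shape, and every child is valid.
        return len(self.children) == len(self.layout) and all(
            isinstance(child, kind) and child.validate()
            for child, kind in zip(self.children, self.layout)
        )

    def to_bytes(self):
        # One call at the top walks all children to build the whole blob.
        return b"".join(child.to_bytes() for child in self.children)

    @classmethod
    def parse(cls, blob, offset=0):
        children = []
        for kind in cls.layout:
            child, offset = kind.parse(blob, offset)
            children.append(child)
        return cls(children), offset


record = Record([UInt16(513), Text("héllo")])
blob = record.to_bytes()
parsed, end = Record.parse(blob)
print(blob, parsed.children[1].value, record.validate())
```

Each legacy format would presumably get its own set of node classes (or its own layout), with the text kept as real str inside the tree and encoded only when the blob is generated.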
Off I go, drifting into Python ideas.... but I have a program I want to
rewrite that could surely use some of these techniques (and probably
will), because it wants to read several legacy formats, and produce
several legacy formats, as well as a new, more comprehensive format. So
the objects will be required to parse/generate 4 different blob
structures, one of which has its own set of several legacy variations.