[Python-Dev] PEP 460: allowing %d and %f and mojibake

Glenn Linderman v+python at g.nevcal.com
Mon Jan 13 21:44:06 CET 2014


On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>
>   > On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:
>   >> Glenn Linderman writes:
>   >>> the proposals to embed binary in Unicode by abusing Latin-1
>   >>> encoding.
>
>   >> Those aren't "proposals", they are currently feasible
>   >> techniques in Python 3 for *some* use cases. The question is why
>   >> infecting Python 3 with the byte/character confoundance virus is
>   >> preferable to such techniques, especially if their (serious!)
>   >> deficiencies are removed by creating a new type such as
>   >> asciistr.
>
>   > "smuggled binary" (great term borrowed from a different
>   > subthread) muddies the waters of what you are dealing with.
>
> Not really.  The "mud" is one or more of the serious deficiencies.  It
> can be removed, I believe (and Nick apparently does, too).  "asciistr"
> is one way to try that.

Yes, really. Once a str contains smuggled binary, it can no longer be
treated purely as a str, because some of its code points stand for bytes
rather than characters. That is "muddier" than a str that is only a str.
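
A minimal sketch of the kind of mud I mean. The record layout here (an
ASCII label followed by three raw bytes) is made up purely for
illustration:

```python
# Hypothetical record: an ASCII label followed by three raw bytes.
payload = bytes([0x01, 0xE9, 0x10])
raw = b"name=" + payload

# "Smuggle" the whole record into a str by decoding it as Latin-1.
smuggled = raw.decode("latin-1")

# With no text processing in between, the round trip is lossless.
round_trip = smuggled.encode("latin-1")

# An ordinary str operation meant for the label also rewrites the
# binary part: 0xE9 is seen as a lowercase letter and becomes 0xC9.
damaged = smuggled.upper().encode("latin-1")
```

Nothing in the str type warns that upper() was not safe to apply to the
last three code points.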

>   > When the mixture of text and binary is done as encoded text in
>   > binary, then it is obvious that only limited text processing can be
>   > performed,
>
> Hardly.  After all, that's how all text processing was done for
> decades.  Still is, in some programs, especially C programs.

I disagree, and I think you do too. When a str contains smuggled binary,
text processing has to be limited to the text portions of that str, and
that is a limitation: you can't just apply text searches, scans, and
transformations over the complete str. You know that, but you must not
have considered it a limitation, because you know you can do any text
processing on the text parts. But having to keep track of which parts
are text, and applying the text processing only to those parts, is a
limitation. Yes, it has been done that way, and the limitations of doing
it that way led to the plethora of encodings, each of which was intended
to be sufficient for some problem domain, but most of which were
sufficient only for a smaller problem domain than intended, especially
as communications became more global in nature.
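
To illustrate the bookkeeping, here is a small sketch. The layout (a
one-byte length, that many bytes of binary payload, then ASCII fields
separated by ';') is an assumption invented for this example:

```python
# Hypothetical layout: 1-byte payload length, the binary payload,
# then ASCII text fields separated by ';'.
payload = bytes([0x00, 0x3B, 0x7F])   # 0x3B happens to be ';'
text = b"a=1;b=2"
raw = bytes([len(payload)]) + payload + text

s = raw.decode("latin-1")             # text plus smuggled binary

# Naive text processing over the complete str splits inside the payload.
naive_fields = s.split(";")

# Correct processing has to track where the text actually starts.
n = ord(s[0])
binary_part = s[1:1 + n]
text_part = s[1 + n:]
fields = text_part.split(";")
```

The naive split yields three pieces, two of them contaminated with
payload bytes; only the version that tracks the boundary yields the two
real fields.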


>   > And there are no extra, confusing Latin-1 encode/decode operations
>   > required.
>
> The "extra" encode/decode operations are mostly (perhaps all) due to
> examples that started from bytes and end with bytes.  Of course if you
> assume that API and propose to do the operations using Unicode, you'll
> get "extra" decode/encode operations.

No, the "extra" encode/decode operations come from the requirement that
smuggled binary use Latin-1, whereas the other flavors of binary being
handled are not always Latin-1.
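
A sketch of where the extra steps show up, assuming (for illustration
only) a record whose text field is UTF-8 and whose last byte is raw
binary:

```python
# Hypothetical record: a UTF-8 text field followed by one raw byte.
raw = "na\u00efve".encode("utf-8") + b"\xff"

# Smuggled approach: the whole record becomes a Latin-1-decoded str.
smuggled = raw.decode("latin-1")
text_field = smuggled[:-1]            # mojibake, not the real text

# Getting at the real text needs an extra encode/decode pair...
real_text = text_field.encode("latin-1").decode("utf-8")

# ...and putting changed text back needs the reverse pair.
new_field = real_text.upper().encode("utf-8").decode("latin-1")
via_smuggling = (new_field + smuggled[-1]).encode("latin-1")

# Working on the bytes directly needs only the UTF-8 step.
via_bytes = raw[:-1].decode("utf-8").upper().encode("utf-8") + raw[-1:]
```

Both routes produce the same bytes, but the smuggled route passes
through Latin-1 twice more for the same result.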

>
>   > From a higher-level perspective, I think it would be great to have
>   > a module, perhaps called "boundary" (let's call it that for now),
>   > that allows some definition syntax (augmented BNF? augmented ABNF?)
>   > to explain the format of a binary blob.
>
> We have struct, for one.  I'm not sure why you want more than that.  I
> suppose you could go all the way to ASN.1.

struct is insufficient to capture a whole file format that has optional
parts, although it suffices for fixed-layout fragments.
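
For example, in the sketch below struct describes each fragment, but
the optional part and the repeat count have to be handled by
hand-written Python around it. The format itself (flags byte, item
count, optional timestamp, 16-bit items) and the names HEADER,
TIMESTAMP, HAS_TIMESTAMP, and parse are hypothetical, made up for this
illustration:

```python
import struct

# Hypothetical format: flags (1 byte), item count (2 bytes),
# optional 4-byte timestamp if flags & HAS_TIMESTAMP, then the items.
HEADER = struct.Struct(">BH")
TIMESTAMP = struct.Struct(">I")
HAS_TIMESTAMP = 0x01

def parse(blob):
    """Parse one record; struct handles fragments, Python the structure."""
    flags, count = HEADER.unpack_from(blob, 0)
    offset = HEADER.size
    timestamp = None
    if flags & HAS_TIMESTAMP:         # optional part: not expressible in struct
        (timestamp,) = TIMESTAMP.unpack_from(blob, offset)
        offset += TIMESTAMP.size
    # The repeat count comes from the data, so the format string is built here.
    items = struct.unpack_from(">%dH" % count, blob, offset)
    return timestamp, list(items)

with_ts = struct.pack(">BHIHH", HAS_TIMESTAMP, 2, 1389645846, 10, 20)
without_ts = struct.pack(">BHH", 0, 1, 7)
```

A "boundary"-style definition would let the optional timestamp and the
data-dependent count be declared once, instead of being spread through
the parsing code.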
