Re: [Python-Dev] PEP 461 Final?

19 Jan 2014

On 19 January 2014 12:34, Ethan Furman <ethan@stoneleaf.us> wrote:
...
On 01/18/2014 05:21 PM, Neil Schemenauer wrote:
...
Ethan Furman <ethan@stoneleaf.us> wrote:
...
So, if %a is added it would act like:
---------
    "%a" % some_obj
---------
    tmp = str(some_obj)
    res = b''
    for ch in tmp:
        if ord(ch) < 256:
            res += bytes([ord(ch)])
        else:
            res += unicode_escape(ch)
---------
where 'unicode_escape' would yield something like "\u0440" ?
My patch on the tracker already implements %a, it's simple.
Before one implements a patch it is good to know the specifications.
A very sound engineering principle :)

Neil has the resulting semantics right for what I had in mind, but the
faster path to bytes (rather than going through the ascii() builtin) is
to do the C level equivalent of:

    repr(obj).encode("ascii", errors="backslashreplace")

That's essentially what the ascii() builtin does, but that operates
entirely in the text domain, so (as Neil found) you still need a
separate encode step at the end.

    >>> ascii("è").encode("ascii")
    b"'\\xe8'"
    >>> repr("è").encode("ascii", errors="backslashreplace")
    b"'\\xe8'"

b"%a" % "è" should produce the same result as the two examples above.
(Code points above U+00FF would produce \u and \U escapes as needed,
which should already be handled properly by the backslashreplace
error handler.)
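
As a rough sketch of the three escape widths, assuming %a ends up matching
the repr-then-encode definition above (the helper name ascii_repr_bytes is
made up purely for illustration):

```python
def ascii_repr_bytes(obj):
    """Hypothetical helper: repr() of obj, encoded to ASCII-only bytes."""
    return repr(obj).encode("ascii", errors="backslashreplace")

# Code points up to U+00FF are written as \xNN escapes.
print(ascii_repr_bytes("\u00e8"))      # LATIN SMALL LETTER E WITH GRAVE

# Code points up to U+FFFF are written as \uNNNN escapes.
print(ascii_repr_bytes("\u0440"))      # CYRILLIC SMALL LETTER ER

# Code points beyond the BMP are written as \UNNNNNNNN escapes.
print(ascii_repr_bytes("\U0001f600"))  # an emoji outside the BMP
```

In each case the result should match what ascii(obj).encode("ascii")
gives, since both routes escape every non-ASCII code point.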

One nice thing about this definition is that in the specific case of
text input, the transformation can always be reversed by decoding as
ASCII and then applying ast.literal_eval():

    >>> import ast
    >>> ast.literal_eval(repr("è").encode("ascii",
    ...     "backslashreplace").decode("ascii"))
    'è'
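
A small sketch of that round trip over a few more text inputs. The two
helper names are hypothetical, and to_ascii_bytes only stands in for the
proposed %a behaviour on str input:

```python
import ast

def to_ascii_bytes(text):
    # Hypothetical stand-in for the proposed %a behaviour on str input.
    return repr(text).encode("ascii", errors="backslashreplace")

def from_ascii_bytes(data):
    # Decode as ASCII, then parse the string literal without executing code.
    return ast.literal_eval(data.decode("ascii"))

samples = [
    "plain",
    "\u00e8",               # needs a \x escape
    "\u0440",               # needs a \u escape
    "\U0001f600",           # needs a \U escape
    "quote ' and \" mix",   # repr() picks the quoting and escapes as needed
    "back\\slash\tand\n",   # repr() already escapes these itself
]

for s in samples:
    assert from_ascii_bytes(to_ascii_bytes(s)) == s
```

This only holds for text input; for arbitrary objects the repr is not
generally something ast.literal_eval() can parse.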

(Please don't use eval() to reverse a transformation like this, as
doing so not only makes security engineers cry, it's also likely to
make your code vulnerable to all kinds of interesting attacks)

As noted earlier in the thread, one key purpose of including this
feature is to reduce the likelihood of people inappropriately adding
__bytes__ implementations for %s compatibility that look like:

    def __bytes__(self):
        # This is unlikely to be a good idea!
        return repr(self).encode("ascii", errors="backslashreplace")
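
For contrast, a sketch of the alternative: the class stays free of
__bytes__, and the caller asks for the repr-based form explicitly. Point
and ascii_repr_bytes are hypothetical names, and the helper stands in for
what %a would do under the definition above:

```python
class Point:
    # Hypothetical example class: has __repr__, deliberately no __bytes__.
    def __init__(self, x, label):
        self.x = x
        self.label = label

    def __repr__(self):
        return "Point(%r, %r)" % (self.x, self.label)

def ascii_repr_bytes(obj):
    # Stand-in for the proposed %a: repr(), then ASCII with backslash escapes.
    return repr(obj).encode("ascii", errors="backslashreplace")

p = Point(1, "\u00e8")

# The caller opts in to the debugging-style representation explicitly.
line = b"value=" + ascii_repr_bytes(p)

# bytes(p) still fails, so the repr never leaks into binary data by accident.
try:
    bytes(p)
except TypeError:
    converts_implicitly = False
else:
    converts_implicitly = True
```

The point of keeping the two paths separate is that an object only
converts to bytes implicitly when it really has a binary representation.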

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia