On 19 January 2014 12:34, Ethan Furman <ethan@stoneleaf.us> wrote:
> On 01/18/2014 05:21 PM, Neil Schemenauer wrote:
> > Ethan Furman <ethan@stoneleaf.us> wrote:
> > > So, if %a is added it would act like:
> > >
> > > ---------
> > >    "%a" % some_obj
> > > ---------
> > >    tmp = str(some_obj)
> > >    res = b''
> > >    for ch in tmp:
> > >        if ord(ch) < 256:
> > >            res += bytes([ord(ch)])
> > >        else:
> > >            res += unicode_escape(ch)
> > > ---------
> > >
> > > where 'unicode_escape' would yield something like "\u0440" ?
> >
> > My patch on the tracker already implements %a, it's simple.
>
> Before one implements a patch it is good to know the specifications.

A very sound engineering principle :)

Neil has the resulting semantics right for what I had in mind, but the
faster path to bytes (rather than going through the ascii() builtin) is
to do the C level equivalent of:

    repr(obj).encode("ascii", errors="backslashreplace")

That's essentially what the ascii() builtin does, but ascii() operates
entirely in the text domain, so (as Neil found) you still need a
separate encode step at the end.

    >>> ascii("è").encode("ascii")
    b"'\\xe8'"
    >>> repr("è").encode("ascii", errors="backslashreplace")
    b"'\\xe8'"

b"%a" % "è" should produce the same result as the two examples above.
(Code points higher up in the Unicode code space would produce \u and \U
escapes as needed, which should already be handled properly by the
backslashreplace error handler.)

One nice thing about this definition is that in the specific case of
text input, the transformation can always be reversed by decoding as
ASCII and then applying ast.literal_eval():

    >>> import ast
    >>> ast.literal_eval(repr("è").encode("ascii", "backslashreplace").decode("ascii"))
    'è'

(Please don't use eval() to reverse a transformation like this, as doing
so not only makes security engineers cry, it's also likely to make your
code vulnerable to all kinds of interesting attacks.)

As noted earlier in the thread, one key purpose of including this
feature is to reduce the likelihood of people inappropriately adding
__bytes__ implementations for %s compatibility that look like:

    def __bytes__(self):
        # This is unlikely to be a good idea!
        return repr(self).encode("ascii", errors="backslashreplace")

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia
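
The definition above can be sketched in pure Python as follows. This is
only a rough illustration, not the C level implementation: the helper
names ascii_repr_bytes and roundtrip_text are hypothetical and made up
for this example, and the reverse step is only meant for text input.

```python
import ast


def ascii_repr_bytes(obj):
    """Hypothetical helper: pure Python approximation of the proposed %a."""
    # repr() works in the text domain; the backslashreplace error handler
    # turns each non-ASCII code point into a \x, \u or \U escape while
    # encoding, so the result is always pure ASCII bytes.
    return repr(obj).encode("ascii", errors="backslashreplace")


def roundtrip_text(data):
    """Hypothetical helper: reverse ascii_repr_bytes() for text input."""
    # literal_eval() only accepts Python literals, unlike eval().
    return ast.literal_eval(data.decode("ascii"))


# One sample per escape width (\x, \u, \U), plus a plain ASCII string.
samples = ["è", "\u0440", "\U0001f600", "plain"]
encoded = [ascii_repr_bytes(s) for s in samples]

for original, data in zip(samples, encoded):
    assert roundtrip_text(data) == original
```

The samples cover each of the \x, \u and \U escape forms, which is where
a hand-rolled loop over code points is most likely to go wrong.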