[Python-Dev] PEP 461 updates

Sun Jan 19 07:19:00 CET 2014

On 19 January 2014 00:39, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:
>
> If you want to draw a relevant lesson from that thread in this one
> then the lesson argues against PEP 461: adding back the bytes
> formatting methods helps people who refuse to understand text
> processing and continue implementing dirty hacks instead of doing it
> properly.

Yes, that's why it has taken so long to even *consider* bringing
binary interpolation support back - one of our primary concerns in the
early days of Python 3 was developers (including core developers!)
attempting to translate bad habits from Python 2 into Python 3 by
continuing to treat binary data as text. Making interpolation a purely
text domain operation helped strongly in enforcing this distinction,
as it generally required thinking about encoding issues in order to
get things into the text domain (or hitting them with the "latin-1"
hammer, in which case... *sigh*).

The reason PEP 460/461 came up is that we *do* acknowledge that there
is a legitimate use case for binary interpolation support when dealing
with binary formats that contain ASCII-compatible segments. Now that
people have had a few years to get used to the Python 3 text model,
lowering the barrier to migration from Python 2 and better handling
that use case in Python 3 in general has finally tilted the scales in
favour of providing the feature (assuming Guido is happy with PEP 461
after Ethan finishes the Rationale section).
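
As a rough illustration of that use case, here is a minimal sketch of
interpolating into a binary format with ASCII-compatible framing. It
assumes an interpreter that implements PEP 461 (Python 3.5 or later);
the build_message name and the header layout are hypothetical, chosen
only for the example.

```python
def build_message(body: bytes) -> bytes:
    """Frame a binary payload with an ASCII-compatible header.

    Hypothetical helper: the header layout is made up for illustration.
    """
    # %d formats the integer as ASCII digits; %b inserts the bytes as-is.
    # No text encoding step is involved at any point.
    return b"Content-Length: %d\r\n\r\n%b" % (len(body), body)


message = build_message(b"\x00\x01binary payload\xff")
```

The payload is never decoded, so arbitrary non-ASCII bytes pass
through untouched while the framing stays readable ASCII.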

(Tangent)

I agree it's not relevant to the PEP 460/461 discussions, but so
long as numpy.loadtxt is explicitly documented as only working with
latin-1 encoded files (it currently isn't), there's no problem. If
it's supposed to work with other encodings (but the entire file is
still required to use a consistent encoding), then it just needs
encoding and errors arguments to fit the Python 3 text model (with
"latin-1" documented as the default encoding). If it is intended to
allow S columns to contain text in arbitrary encodings, then that
should also be supported by the current API with an adjustment to the
default behaviour, since passing something like
codecs.getdecoder("utf-8") as a column converter should do the right
thing. However, if you're currently decoding S columns with latin-1
*before* passing the value to the converter, then you'll need to use a
WSGI-style decoding dance instead:

    def fix_encoding(text):
        return text.encode("latin-1").decode("utf-8") # For example

That's more wasteful than just passing the raw bytes through for
decoding, but is the simplest backwards compatible option if you're
doing latin-1 decoding already.
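
To make the comparison concrete, here is a small sketch of both
routes side by side, using only the standard library rather than
numpy itself. The sample value and the decode_utf8 helper are
assumptions for illustration. One detail worth noting: the function
returned by codecs.getdecoder returns a (text, bytes_consumed) pair,
so a converter built on it needs to take the first element.

```python
import codecs

raw = "café".encode("utf-8")  # bytes as they would appear in the file

# Route 1: the converter receives the raw bytes and decodes them directly.
_utf8_decoder = codecs.getdecoder("utf-8")

def decode_utf8(data):
    # The decoder returns (text, bytes_consumed); keep only the text.
    return _utf8_decoder(data)[0]

# Route 2: the value was already decoded as latin-1 before reaching the
# converter, so undo that step and redecode with the real encoding.
def fix_encoding(text):
    return text.encode("latin-1").decode("utf-8")

already_latin1 = raw.decode("latin-1")  # mojibake: 'cafÃ©'

direct = decode_utf8(raw)
round_tripped = fix_encoding(already_latin1)
```

Both routes end up with the same text; the round trip works because
latin-1 maps every byte value to exactly one code point, so encoding
back to latin-1 recovers the original bytes exactly.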

If different rows in the *same* column are allowed to have different
encodings, then that's not a valid use of the operation (since the
column converter has no access to the rest of the row to determine
what encoding should be used for the decode operation).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia