[Python-Dev] PEP 461 updates
oscar.j.benjamin at gmail.com
Sun Jan 19 16:21:25 CET 2014
On 19 January 2014 06:19, Nick Coghlan <ncoghlan at gmail.com> wrote:
> While I agree it's not relevant to the PEP 460/461 discussions, so
> long as numpy.loadtxt is explicitly documented as only working with
> latin-1 encoded files (it currently isn't), there's no problem.
Actually there is a problem. If it explicitly specified the encoding as
latin-1 when opening the file then it could document the fact that it
works for latin-1 encoded files. However it actually uses the system
default encoding to read the file and then converts the strings to
bytes with the as_bytes function that is hard-coded to use latin-1:
So it only works if the system default encoding is latin-1 and the
file content is white-space and newline compatible with latin-1.
Regardless of whether the file itself is in utf-8 or latin-1 it will
only work if the system default encoding is latin-1. I've never used a
system that had latin-1 as the default encoding (unless you count
cp1252 as latin-1).
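
A minimal sketch of that failure mode, for illustration only. The
as_bytes below is a stand-in written for this example (it is not
numpy's actual helper), and roundtrip is a hypothetical name that
imitates "decode with the system default, then encode with latin-1":

```python
def as_bytes(s):
    # Stand-in for a helper that is hard-coded to latin-1 (assumption:
    # the real helper behaves like this for str input).
    return s.encode("latin-1")


def roundtrip(raw, default_encoding):
    # Imitates reading the file in text mode with the system default
    # encoding and then converting the resulting str back to bytes.
    return as_bytes(raw.decode(default_encoding))


raw = "café".encode("utf-8")       # the bytes on disk: b'caf\xc3\xa9'

# Default encoding latin-1: the bytes survive, whatever the file encoding.
same = roundtrip(raw, "latin-1")   # b'caf\xc3\xa9'

# Default encoding utf-8: the bytes silently change.
changed = roundtrip(raw, "utf-8")  # b'caf\xe9'
```

With a utf-8 default and a character outside latin-1 (for example the
euro sign), the same call raises UnicodeEncodeError instead.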
> If it's supposed to work with other encodings (but the entire file is
> still required to use a consistent encoding), then it just needs
> encoding and errors arguments to fit the Python 3 text model (with
> "latin-1" documented as the default encoding).
This is the right solution. Have an encoding argument, document the
fact that it will use the system default encoding if none is
specified, and re-encode using the same encoding to fit any dtype='S'
bytes column. This will then work for any encoding including the ones
that aren't ASCII-compatible (e.g. utf-16).
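
Roughly what that could look like, as a sketch rather than a patch
for loadtxt. read_s_column is a hypothetical name, and it takes an
already-open binary file object to keep the example self-contained:

```python
import io
import locale


def read_s_column(binary_file, encoding=None, errors="strict"):
    # Hypothetical helper: decode the stream with one explicit encoding,
    # then re-encode each value with the *same* encoding for an 'S' column.
    if encoding is None:
        # Documented fallback: the system default encoding.
        encoding = locale.getpreferredencoding(False)
    text = io.TextIOWrapper(binary_file, encoding=encoding, errors=errors)
    # Universal newlines mode has already normalised line endings to "\n".
    return [line.rstrip("\n").encode(encoding, errors) for line in text]


data = "café\nnaïve\n".encode("utf-16")
column = read_s_column(io.BytesIO(data), encoding="utf-16")
```

Note that encoding each value with "utf-16" gives every value its own
BOM; a real implementation might prefer an explicit byte order such
as "utf-16-le" for the re-encoding step.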
Then instead of having a compat module with an as_bytes helper to get
rid of all the unicode strings on Python 3, you can have a compat
module with an open_unicode helper to do the right thing on Python 2.
The as_bytes function is just a way of fighting the Python 3 text
model: "I don't care about mojibake just do whatever it takes to shut
up the interpreter and its error messages and make sure it works for
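
A possible shape for such an open_unicode helper, as a sketch. The
name comes from the paragraph above; the signature is an assumption.
It relies on io.open, which is the built-in open on Python 3 and is
also available on Python 2.6 and later:

```python
import io


def open_unicode(path, mode="r", encoding=None, errors="strict"):
    # Always return a text-mode file that yields unicode strings,
    # on both Python 2 and Python 3.
    if "b" in mode:
        raise ValueError("open_unicode only opens files in text mode")
    # encoding=None falls back to the system default encoding.
    return io.open(path, mode, encoding=encoding, errors=errors)
```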
> If it is intended to
> allow S columns to contain text in arbitrary encodings, then that
> should also be supported by the current API with an adjustment to the
> default behaviour, since passing something like
> codecs.getdecoder("utf-8") as a column converter should do the right
> thing. However, if you're currently decoding S columns with latin-1
> *before* passing the value to the converter, then you'll need to use a
> WSGI style decoding dance instead:
>     def fix_encoding(text):
>         return text.encode("latin-1").decode("utf-8") # For example
That's just getting silly IMO. If the file uses mixed encodings then I
don't consider it a valid "text file" and see no reason for loadtxt to
support reading it.
> That's more wasteful than just passing the raw bytes through for
> decoding, but is the simplest backwards compatible option if you're
> doing latin-1 decoding already.
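
For reference, the round trip in the quoted fix_encoding can be
checked with a short example. The sample data is made up; the last
line also shows that codecs.getdecoder returns a function producing a
(text, bytes_consumed) tuple rather than a bare string:

```python
import codecs


def fix_encoding(text):
    # Undo an earlier latin-1 decode, then decode as utf-8.
    return text.encode("latin-1").decode("utf-8")


raw = "café".encode("utf-8")       # b'caf\xc3\xa9'
mojibake = raw.decode("latin-1")   # 'cafÃ©'
fixed = fix_encoding(mojibake)     # 'café'

# Decoding the raw bytes directly, without the intermediate str.
decoded, consumed = codecs.getdecoder("utf-8")(raw)
```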
> If different rows in the *same* column are allowed to have different
> encodings, then that's not a valid use of the operation (since the
> column converter has no access to the rest of the row to determine
> what encoding should be used for the decode operation).