On Jul 11, 2013, at 19:01, Daniel Rode <dth4h95@gmail.com> wrote:

Since Python3, the python creators removed a lot of encodings from the str.encode() method. They did it because they weren't sure how to implement the feature in Python3. They wanted it to be better.

Only a few encodings were actually removed: the ones that encoded, in 2.x terms, str to str, which means they'd have to encode either bytes to bytes or bytes to str in 3.x. For some of them it's not clear which type the result should be. But more generally, calling "encode" on an already-encoded bytes string instead of a str looks wrong.

In most cases, there are other ways to do it--base64.b64encode(), binascii.hexlify(), etc. It would be nice if there were a convenient and consistent way to do all of them instead of having to hunt around the stdlib, and yours is a good attempt, but I don't think it works.
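For reference, here is a minimal sketch of the existing stdlib spellings for two of the bytes-to-bytes transforms. The variable names are just for illustration.

```python
import base64
import binascii

data = b"hello"

# Hex: bytes in, bytes out.
hex_bytes = binascii.hexlify(data)

# Base64: bytes in, bytes out.
b64_bytes = base64.b64encode(data)

print(hex_bytes)  # b'68656c6c6f'
print(b64_bytes)  # b'aGVsbG8='

# Each transform has a matching inverse in the same module.
restored_from_hex = binascii.unhexlify(hex_bytes)
restored_from_b64 = base64.b64decode(b64_bytes)
```

Both work fine; the complaint is only that they live in different modules under unrelated names.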


I have an idea, add a built in method called "convert".
Usage example:

convert(data, current_state, desired_state)
convert(data, from, to)


Real world examples:

dataBytes = b"hello"
dataUTF8_Str = "Ɠahhhh hi all ̮"

This is a misleading name, because it's not UTF-8, it's just a str, which doesn't have an encoding. (Under the covers, of course, CPython 3.3+ actually stores it as Latin-1, UCS-2, or UCS-4...)

convert(dataUTF8_Str, encodings.UTF8, encodings.BYTES)
Returns: b'\xc6\x93ahhhh hi all \xcc\xae'

That's misleading as well. It's not converting from UTF-8 to bytes, it's converting from str to bytes, encoding _to_, not _from_ UTF-8 to do so.

Plus, we already have a way to write this: dataUTF8_Str.encode('UTF-8'). And it's a little weird to pick one of the encodings that wasn't changed from 2.x to 3.x as your first example of restoring the encodings that were lost.
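A short sketch of that existing spelling, using the same string as the example (written with escapes so it survives mail transport):

```python
# The proposal's example string: U+0193, some ASCII, then U+032E.
text = "\u0193ahhhh hi all \u032e"

# str -> bytes: encode *to* UTF-8.
encoded = text.encode("UTF-8")
print(encoded)  # b'\xc6\x93ahhhh hi all \xcc\xae'

# bytes -> str: decode *from* UTF-8.
decoded = encoded.decode("UTF-8")
```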

convert(dataBytes, encodings.BYTES, encodings.HEX)
Returns: b'c693616868686820686920616c6c20ccae'

Why would converting to hex give you a bytes object instead of a str? More to the point, if you _wanted_ a str, how would you get it? Also, why do you even have to specify BYTES here? The function can already tell that it's a bytes, so there's no extra information there.
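To make the bytes-versus-str question concrete, here is a rough sketch of how you get each result type today. It assumes binascii.hexlify() as the hex step; the names are mine.

```python
import binascii

data = b"hello"

# hexlify() returns bytes...
hex_bytes = binascii.hexlify(data)

# ...so getting a str takes an explicit extra decode step.
hex_str = hex_bytes.decode("ascii")

print(hex_bytes)  # b'68656c6c6f'
print(hex_str)    # 68656c6c6f
```

With a single convert() call there is no obvious place to say which of the two you want.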

Really, it seems like any time BYTES is useful, it will also be insufficient. How can you convert a str to BYTES without also specifying an encoding? (Hopefully not by using sys.getdefaultencoding(), because that would just bring back the same sloppy bugs we had in Python 2.)
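A quick illustration of why "str to BYTES" is underspecified: the same str gives different bytes under different encodings. The sample text is my own.

```python
text = "caf\u00e9"  # 'cafe' with an acute accent on the e

as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
as_utf16 = text.encode("utf-16-le")

print(as_utf8)    # b'caf\xc3\xa9'
print(as_latin1)  # b'caf\xe9'
print(len(as_utf16))  # 8
```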

Anyway, the only situation I can imagine where it's useful to provide both arguments is when you want to decode and immediately re-encode. In every other case, you're either decoding (in which case "to" is useless) or encoding (in which case "from" is useless).
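The decode-and-re-encode case could be sketched like this. Note that transcode() is a hypothetical helper name of mine, not anything in the stdlib or in the proposal.

```python
def transcode(data, source, target):
    """Hypothetical helper: decode bytes from `source`, re-encode to `target`."""
    return data.decode(source).encode(target)

utf8_bytes = b"\xc6\x93"  # U+0193 encoded as UTF-8
utf16_bytes = transcode(utf8_bytes, "utf-8", "utf-16-le")

print(utf16_bytes)  # b'\x93\x01'
```

Here both names carry information, which is not true when either end is a plain str.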

convert(dataUTF8_Str, encodings.UTF8, encodings.ASCII)
Returns: TypeError: can't convert utf8 character "\u0193" to ascii

There is no 'utf8 character "\u0193"'. UTF-8 doesn't have characters, it has bytes, because it's an encoding. If you encode the character '\u0193' as UTF-8, you get b'\xc6\x93'. 

If you actually gave this a UTF-8-encoded bytes object instead of a str, this would be an example of a call that makes use of both parameters. But it would be exactly the same as s.decode('UTF-8').encode('ascii'). Why do we need another way to write that?
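Here is roughly what that existing spelling does with the example character. One detail worth noting: the failure is a UnicodeEncodeError (a ValueError subclass), not a TypeError.

```python
# U+0193 encoded as UTF-8 is two bytes.
utf8_bytes = "\u0193".encode("UTF-8")
print(utf8_bytes)  # b'\xc6\x93'

error = None
try:
    # Decode from UTF-8, then try to re-encode as ASCII.
    utf8_bytes.decode("UTF-8").encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__)  # UnicodeEncodeError
```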

Some other encodings:
BASE64
UTF16
UTF32
BINARY

What is "BINARY"? What happens when you convert that to another encoding, or vice-versa? How is it different from BYTES?

Maybe even INT?

What does that do if I use it?


Feel free to add suggestions!
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
http://mail.python.org/mailman/listinfo/python-ideas