Since Python3, the python creators removed a lot of encodings from the str.encode() method. They did it because they weren't sure how to implement the feature in Python3. They wanted it to be better.

I have an idea, add a built in method called "convert".

Usage example:

    convert(data, current_state, desired_state)
    convert(data, from, to)

Real world examples:

    dataBytes = b"hello"
    dataUTF8_Str = "Ɠahhhh hi all ̮"

    convert(dataUTF8_Str, encodings.UTF8, encodings.BYTES)
    Returns: b'\xc6\x93ahhhh hi all \xcc\xae'

    convert(dataBytes, encodings.BYTES, encodings.HEX)
    Returns: b'c693616868686820686920616c6c20ccae'

    convert(dataUTF8_Str, encodings.UTF8, encodings.ASCII)
    Returns: TypeError: can't convert utf8 character "\u0193" to ascii

Some other encodings:

    BASE64
    UTF16
    UTF32
    BINARY

Maybe even INT?

Feel free to add suggestions!
On Jul 11, 2013, at 19:01, Daniel Rode <dth4h95@gmail.com> wrote:
Since Python3, the python creators removed a lot of encodings from the str.encode() method. They did it because they weren't sure how to implement the feature in Python3. They wanted it to be better.
Only a few encodings were actually removed, the ones that encoded, in 2.x terms, str to str, which means they'd have to encode either bytes to bytes or bytes to str in 3.x. For some of them it's not clear which type the result should be. But more generally, calling "encode" on an encoded bytes string instead of a str looks wrong. In most cases, there are other ways to do it--base64.b64encode(), binascii.hexlify(), etc. It would be nice if there were a convenient and consistent way to do all of them instead of having to hunt around the stdlib, and yours is a good attempt, but I don't think it works.
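As a rough sketch of the "other ways" mentioned above: the standard library already covers these bytes-to-bytes transformations, but the helpers are spread across modules. The sample value is only illustrative.

```python
import base64
import binascii

data = b"hello"

# Each helper lives in a different module, and each returns bytes, not str.
b64 = base64.b64encode(data)
hexed = binascii.hexlify(data)

# The inverse operations are scattered in the same way.
assert base64.b64decode(b64) == data
assert binascii.unhexlify(hexed) == data

print(b64)    # b'aGVsbG8='
print(hexed)  # b'68656c6c6f'
```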
I have an idea, add a built in method called "convert". Usage example:
convert(data, current_state, desired_state)
convert(data, from, to)
Real world examples:
dataBytes = b"hello"
dataUTF8_Str = "Ɠahhhh hi all ̮"
This is a misleading name, because it's not UTF-8, it's just a str, which doesn't have an encoding. (Under the covers, of course, it's actually stored as ASCII, UCS2, or UTF-32...)
convert(dataUTF8_Str, encodings.UTF8, encodings.BYTES)
Returns: b'\xc6\x93ahhhh hi all \xcc\xae'
That's misleading as well. It's not converting from UTF-8 to bytes, it's converting from str to bytes, encoding _to_, not _from_ UTF-8 to do so. Plus, we already have a way to write this: dataUTF8_Str.encode('UTF-8'). And it's a little weird to pick one of the encodings that wasn't changed from 2.x to 3.x as your first example of restoring the encodings that were lost.
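For reference, a minimal sketch of the existing spelling. The string is written with escapes (U+0193 and U+032E) so the snippet does not depend on the source file's encoding; the variable names are illustrative.

```python
# A str is a sequence of code points; it has no encoding attached.
text = "\u0193ahhhh hi all \u032e"

# Encoding goes from str *to* UTF-8 bytes.
encoded = text.encode("utf-8")
print(encoded)  # b'\xc6\x93ahhhh hi all \xcc\xae'

# Decoding goes from UTF-8 bytes back to str.
assert encoded.decode("utf-8") == text
```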
convert(dataBytes, encodings.BYTES, encodings.HEX)
Returns: b'c693616868686820686920616c6c20ccae'
Why would converting to hex give you a bytes object instead of a str? More to the point, if you _wanted_ a str, how would you get it? Also, why do you even have to specify BYTES here? The function can already tell that it's a bytes, so there's no extra information there. Really, it seems like any time BYTES is useful, it will also be insufficient. How can you convert a str to BYTES without also specifying an encoding? (Hopefully not by using sys.getdefaultencoding(), because that would just bring back the same sloppy bugs we had in Python 2.) Anyway, the only situation I can imagine where it's useful to provide both arguments is when you want to decode and immediately re-encode. In every other case, you're either decoding (in which case "to" is useless) or encoding (in which case "from" is useless).
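To make the bytes-versus-str question concrete, here is a small sketch of how hex output can be obtained in either type today. Note that bytes.hex() was only added in Python 3.5, after this thread, so that line is an assumption about the reader's Python version.

```python
import binascii

data = "\u0193ahhhh hi all \u032e".encode("utf-8")

# binascii.hexlify returns bytes...
hex_bytes = binascii.hexlify(data)

# ...so getting a str takes an explicit (ASCII) decode.
hex_str = hex_bytes.decode("ascii")

# On Python 3.5+, bytes.hex() gives the str form directly.
assert data.hex() == hex_str

# And bytes.fromhex() reverses it.
assert bytes.fromhex(hex_str) == data

print(hex_str)  # c693616868686820686920616c6c20ccae
```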
convert(dataUTF8_Str, encodings.UTF8, encodings.ASCII)
Returns: TypeError: can't convert utf8 character "\u0193" to ascii
There is no 'utf8 character "\u0193"'. UTF-8 doesn't have characters, it has bytes, because it's an encoding. If you encode the character '\u0193' as UTF-8, you get b'\xc6\x93'. If you actually gave this a UTF-8 encoded bytes instead of a str, this would be an example of a call that makes use of both parameters. But it would be exactly the same as s.decode('UTF-8').encode('ascii'). Why do we need another way to write that?
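A short sketch of both points, using the sample string from the proposal. Note that the existing spelling raises UnicodeEncodeError (a ValueError subclass), not the TypeError shown in the proposal.

```python
# The character U+0193 encodes to two bytes in UTF-8.
assert "\u0193".encode("utf-8") == b"\xc6\x93"

utf8_bytes = "\u0193ahhhh hi all \u032e".encode("utf-8")

# Decode, then re-encode: the only case that needs both a "from" and a "to".
error = None
try:
    utf8_bytes.decode("utf-8").encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(type(exc).__name__, "at position", exc.start)
```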
Some other encodings: BASE64 UTF16 UTF32 BINARY
What is "BINARY"? What happens when you convert that to another encoding, or vice-versa? How is it different from BYTES?
Maybe even INT?
What does that do if I use it?
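One possible reading of INT, purely as an assumption since the proposal does not define it, is "interpret the bytes as an integer". Even then the result is ambiguous without a byte order, which the existing int.from_bytes() makes explicit:

```python
data = b"\x01\x00"

# The same two bytes give different integers depending on byte order.
big = int.from_bytes(data, byteorder="big")
little = int.from_bytes(data, byteorder="little")
print(big, little)  # 256 1

# Going back also needs a length and a byte order.
assert big.to_bytes(2, byteorder="big") == data
assert little.to_bytes(2, byteorder="little") == data
```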
Feel free to add suggestions!

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
http://mail.python.org/mailman/listinfo/python-ideas
On Jul 11, 2013, at 19:01, Daniel Rode <dth4h95@gmail.com> wrote:
Since Python3, the python creators removed a lot of encodings from the str.encode() method. They did it because they weren't sure how to implement the feature in Python3.
No, they specifically decided not to implement codecs that are not directional (in the sense that they convert str to str or bytes to bytes or both).
I have an idea, add a built in method called "convert".
Only the name is new; the idea has been suggested several times. However, the API proposed has usually been symmetric and polymorphic (that is, either bytes-to-bytes or str-to-str). It's arguable (and I've argued it) that base encoding should be bytes-to-str, but pragmatically base encodings are used mostly for content transfer encodings in wire protocols, and in the relatively rare and comparatively low-throughput cases where they're displayed to people, there's no real cost to decoding from ASCII to Unicode (str), especially since PEP 393.

Since special-case methods already exist and are well known (not to forget easily Googled), there's little benefit to merely providing a bunch of aliases and a registry. So we want to reserve this opportunity for an API that helps users to avoid double-encoding and things like that.

Andrew Barnert writes:
it's just a str, which doesn't have an encoding. (Under the covers, of course, it's actually stored as ASCII, UCS2, or UTF-32...)
Actually, 8-bit str is stored as ISO-8859-1.
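The storage widths under discussion can be observed indirectly. This is a CPython-specific sketch (3.3 or later, where PEP 393 applies): exact sizes vary by version and platform, so only the relative growth per character is meaningful.

```python
import sys

n = 1000
samples = {
    "ASCII": "a" * n,              # code points below 128
    "Latin-1": "\xe9" * n,         # code points below 256
    "BMP": "\u0193" * n,           # needs 2 bytes per character
    "astral": "\U0001F600" * n,    # needs 4 bytes per character
}

# sys.getsizeof includes a fixed header, so compare sizes rather than
# reading them as exact per-character costs.
sizes = {name: sys.getsizeof(s) for name, s in samples.items()}
for name, size in sizes.items():
    print(name, size)
```

On CPython the ASCII and Latin-1 strings should come out at roughly one byte per character, with the other two at roughly two and four.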
On 12/07/13 12:01, Daniel Rode wrote:
Since Python3, the python creators removed a lot of encodings from the str.encode() method. They did it because they weren't sure how to implement the feature in Python3. They wanted it to be better.
That's wrong. They didn't remove them, they are just inaccessible from the string API. And they didn't do it because they weren't sure how to implement the feature, but because the feature was broken. Strings had both an encode and decode method, and people kept using the wrong one and getting weird results.

Python 3 has the right API: you *encode* strings to bytes, and only bytes, and you *decode* bytes to strings, and only strings.

However, the codec machinery is a lot more general than just str <-> bytes. Codecs can transform from bytes to bytes, or from strings to strings, or to other things, and you can still do so using the codecs module:

py> codecs.encode(b"Hello World", "hex_codec")
b'48656c6c6f20576f726c64'
py> codecs.encode("Hello World", "rot_13")
'Uryyb Jbeyq'

although the interface is a bit clunky. There's no way of telling ahead of time whether a codec expects bytes or strings. See also this open bug report:

http://bugs.python.org/issue7475

and this one, pointing out that there's no easy way to know what codecs are available:

http://bugs.python.org/issue17878

So there's a fair bit of improvement needed in the codec machinery.

-- 
Steven
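A self-contained sketch of the codecs-module calls above, extended with the round trip and with what happens when a codec is handed the wrong type. As far as I can tell the failure only shows up at call time, which illustrates the "no way of telling ahead of time" point.

```python
import codecs

# bytes -> bytes, and back again.
hexed = codecs.encode(b"Hello World", "hex_codec")
assert codecs.decode(hexed, "hex_codec") == b"Hello World"

# str -> str, and back again.
rotated = codecs.encode("Hello World", "rot_13")
assert codecs.decode(rotated, "rot_13") == "Hello World"

# Passing a str to a bytes-to-bytes codec fails only when called.
wrong_type_error = None
try:
    codecs.encode("Hello World", "hex_codec")
except TypeError as exc:
    wrong_type_error = exc
    print("TypeError:", exc)
```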
participants (4)
- Andrew Barnert
- Daniel Rode
- Stephen J. Turnbull
- Steven D'Aprano