[Python-Dev] transform() and untransform() methods, and the codec registry
alexander.belopolsky at gmail.com
Tue Dec 7 05:46:54 CET 2010
On Sun, Dec 5, 2010 at 5:25 PM, Victor Stinner
<victor.stinner at haypocalc.com> wrote:
> On Saturday 04 December 2010 09:31:04 you wrote:
>> Alexander Belopolsky writes:
>> > In fact, once the language moratorium is over, I will argue that
>> > str.encode() and bytes.decode() should deprecate the encoding argument and
>> > just do UTF-8 encoding/decoding. Hopefully by that time most people
>> > will forget that other encodings exist. (I can dream, right?)
>> It's just a dream. There's a pile of archival material, often on R/O
>> media, out there that won't be transcoded any more quickly than the
>> inscriptions on Tutankhamun's tomb.
> Not only that: many libraries expect bytes arguments encoded in a specific
> encoding (e.g. the locale encoding). Said differently, only a few libraries
> written in C accept wchar* strings.
My proposal has nothing to do with the C API. It only concerns the Python
API of the builtin str type.
> The Linux kernel (or many, or all, UNIX/BSD kernels) only manipulates byte
> strings. The libc only accepts wide characters for a few operations. I don't
> know how to open a file with a unicode path with the Linux libc: you have to
> encode it...
Yes, but hopefully the encoding used by the filesystem will be UTF-8.
For Python users, however, encoding details will hopefully be hidden
by the open() call. Yes, I am aware of the many problems with
divining the filesystem encoding, but instructing application
developers to supply their own fsencoding in
open(filepath.encode(fsencoding)) calls is not very helpful.
> Alexander: you should first patch all UNIX/BSD kernels to use unicode
> everywhere, then patch all libc implementations, and then all libraries
> (written in C). After that, you can have a break.
As Martin explained later in this thread with respect to the
transform() method, removing the encoding argument from the str.encode()
method does not imply removing the codecs themselves. If I need a method
to encode strings to, say, the koi8_r encoding, I can easily access it:
>>> from encodings import koi8_r
>>> to_koi8_r = koi8_r.Codec().encode
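A minimal sketch of how that direct codec access behaves, assuming the
stateless Codec interface from the encodings package: encode() and
decode() each return a (result, length consumed) pair rather than the
bare result. The sample string and the from_koi8_r name are illustrative
only.

```python
from encodings import koi8_r

# Stateless codec object; its encode/decode methods return
# (result, number_of_input_items_consumed) tuples.
codec = koi8_r.Codec()
to_koi8_r = codec.encode
from_koi8_r = codec.decode

text = "Привет"  # illustrative sample, not from the original message

encoded, consumed = to_koi8_r(text)   # bytes plus count of characters read
decoded, _ = from_koi8_r(encoded)     # str plus count of bytes read

print(encoded, consumed, decoded == text)
```

Note that callers have to unpack the tuple, which is one practical
difference from calling str.encode() with an encoding name.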
More likely, however, I will only need en/decoding to read/write
legacy files, and rather than encoding the strings explicitly before
writing them into a file, I will just open that file with the correct
encoding.
Having all encodings accessible in a str method only promotes a
programming style where bytes objects can contain differently encoded
strings in different parts of the program. Instead, well-written
programs should decode bytes on input, do all processing with the str type,
and encode on output. When strings need to be passed to char* C APIs,
they should be encoded in UTF-8. Many C APIs originally designed for
ASCII actually produce meaningful results when given UTF-8 bytes.
(Supporting such usage was one of the design goals of UTF-8.)
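
The decode-on-input, process-as-str, encode-on-output style described
above can be sketched as follows. This is only an illustration under
stated assumptions: the recode_file helper, the file names, and the
upper-casing step are hypothetical, and the legacy file is created on
the spot so the example is self-contained.

```python
import os
import tempfile


def recode_file(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Hypothetical helper: decode on input, process str, encode on output."""
    # Decoding happens inside open(); the program only ever sees str.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()

    # All processing is done on str, with no encoding in sight.
    processed = text.upper()

    # Encoding happens inside open() again, this time to UTF-8 by default.
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(processed)
    return processed


# Build a small legacy KOI8-R file to work on (illustrative data).
workdir = tempfile.mkdtemp()
legacy_path = os.path.join(workdir, "legacy.txt")
utf8_path = os.path.join(workdir, "modern.txt")
with open(legacy_path, "wb") as f:
    f.write("привет".encode("koi8_r"))

result = recode_file(legacy_path, utf8_path, "koi8_r")

with open(utf8_path, "rb") as f:
    output_bytes = f.read()
```

The only places where an encoding name appears are the two open()
calls at the program's boundaries; no bytes object in a legacy encoding
travels through the rest of the code.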