[Python-3000] string API growth [was: Re: PEP 3138- String representation in Python 3000]

Thu May 15 02:58:15 CEST 2008

Jim Jewett writes:
 > Maybe I'm missing something, but it seems to me that there are only a
 > few logical combinations; 

There are lots of logical combinations, but most of them fall under
"general transform".  Is that what you mean?

 > if the below is wrong, maybe that is one
 > reason unicode seems more complex than it should.
 > 
 > Encoding:  str -> ByteString
 >     (staticmethod) BytesString.encode(my_string, encoding=?)
 >     ==
 >     my_string.encode(encoding=?)
 > 
 > Decoding:  ByteString -> str
 >     my_bytes.decode(encoding=?)
 >     ==
 >     (staticmethod) str.decode(my_bytes, encoding=?)

+1
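
For concreteness, here is a minimal sketch of the two directions.  It
assumes the str/bytes split as it ended up in released Python 3, which
this thread predates; the call through the class stands in for the
"staticmethod" spelling quoted above, and the variable names are only
illustrative.

```python
# Assumption: Python 3 semantics, where str is text and bytes is the
# byte string type (the thread calls it "ByteString").
my_string = "na\u00efve"

# Encoding: str -> bytes, as a method call on the instance ...
my_bytes = my_string.encode(encoding="utf-8")
# ... and the same operation called through the class.
same_bytes = str.encode(my_string, encoding="utf-8")

# Decoding: bytes -> str, again in both spellings.
round_trip = my_bytes.decode(encoding="utf-8")
same_text = bytes.decode(my_bytes, encoding="utf-8")

print(my_bytes == same_bytes, round_trip == same_text == my_string)
```

Note that the encoding is always passed explicitly here, in the spirit
of EIBTI.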

 > General Transforming:
 >     # Why insist on type-preservation?
 >     # Why even make these methods?
 >     my_string.transform(fn) == fn(my_string)
 >     my_bytes.transform(fn) == fn(my_bytes)

Make them methods if they are "like" codecs, by which I mean
(more or less) invertible, stream-oriented transformations.  E.g.,

    my_bytes.gzip()

That's a pretty weak argument, though.
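
To illustrate what "invertible transformation" means here, a small
sketch using the standard gzip module as plain functions rather than
methods.  The wrapper names gzip_transform and gunzip_transform are
hypothetical, not a proposed API.

```python
import gzip


def gzip_transform(data: bytes) -> bytes:
    """Hypothetical wrapper: bytes -> gzip-compressed bytes."""
    return gzip.compress(data)


def gunzip_transform(data: bytes) -> bytes:
    """Hypothetical inverse of gzip_transform."""
    return gzip.decompress(data)


payload = b"spam " * 100
packed = gzip_transform(payload)
restored = gunzip_transform(packed)

# The transformation is type-preserving (bytes -> bytes) and invertible.
print(restored == payload, len(packed) < len(payload))
```

Written this way, fn(my_bytes) already does everything a
my_bytes.gzip() method would.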

 > Transcoding:  ByteString -> ByteString
 >     # If you care how it is represented, it is no longer unicode;
 >     # it is a specific (ByteString) representation
 >     mybytes.recode(old_encoding=?, new_encoding)
 > 
 >     # Can the old encoding often be inferred?
 >     # Or should it always be written because of EIBTI?

(1) I agree this is the obvious connotation of "transcode" in the
    codec context.

(2) This usage is too special to deserve treatment at this level,
    especially since for most purposes

    my_bytes.decode(old_encoding).encode(new_encoding)

    will be perfectly sufficient.

(3) old_encoding should not be inferred as part of .decode() or
    .recode(): such inference is unreliable, and domain-specific
    heuristics often improve it greatly.  A separate method/function
    should be used for the inference.
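
Points (2) and (3) can be sketched as two separate functions.  Both
recode() and guess_encoding() are hypothetical helpers written for
this illustration, and the "first candidate that decodes cleanly"
rule is only a stand-in for whatever domain-specific heuristic an
application would really use.

```python
from typing import Iterable, Optional


def recode(data: bytes, old_encoding: str, new_encoding: str) -> bytes:
    """Hypothetical helper: transcode by round-tripping through str."""
    return data.decode(old_encoding).encode(new_encoding)


def guess_encoding(data: bytes, candidates: Iterable[str]) -> Optional[str]:
    """Hypothetical heuristic, kept apart from decode()/recode().

    Returns the first candidate that decodes without error, or None.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


latin1_bytes = "caf\u00e9".encode("latin-1")
utf8_bytes = recode(latin1_bytes, "latin-1", "utf-8")

# The caller chooses the candidates and their order; that choice is
# exactly the domain-specific knowledge.
guessed = guess_encoding(latin1_bytes, ["utf-8", "latin-1"])
print(utf8_bytes, guessed)
```

The guess depends entirely on the candidate list: latin-1 accepts any
byte sequence, so putting it first would "succeed" on every input.
That is the kind of unreliability that argues for keeping inference
out of .decode() and .recode().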