[Python-3000] PEP 3138- String representation in Python 3000

Mon May 19 17:53:11 CEST 2008

Stephen J. Turnbull wrote:

> But why be verbose *and* ignore the vernacular?
> 
>     gzipped = plaintext.transform('gzip')
>     plaintext = gzipped.transform('gunzip')

I'm generally resistant to a registry; none of my applications are so 
general that they would take advantage of a 
string-key-to-dictionary-to-function-pointer lookup.  If they did, they 
would have to have some pretty severe constraints on what functions can be 
selected, so I would end up building my own context-sensitive dictionary 
of available functions.  I'm in favor of:

     gzipped = plaintext.transform(zlib.compress)
     plaintext = gzipped.transform(zlib.decompress)

So, you may ask, why would that be any better than this...

     gzipped = zlib.compress(plaintext)

...and the answer is that it depends on what you consider the most 
appropriate design pattern to follow.
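
To make the two spellings concrete, here is a minimal sketch.  The 
built-in bytes type has no transform() method, so the TransformableBytes 
subclass below is hypothetical and exists only for illustration.  Note 
that zlib.compress produces a zlib stream rather than a gzip file; the 
variable names simply follow the examples above.

```python
import zlib


class TransformableBytes(bytes):
    """Hypothetical bytes subclass; transform() is not a real bytes method."""

    def transform(self, func):
        # Apply any callable and wrap the result so calls can be chained.
        return TransformableBytes(func(self))


plaintext = TransformableBytes(b"spam and eggs " * 10)

# Method style: the callable is passed in explicitly.
gzipped = plaintext.transform(zlib.compress)
restored = gzipped.transform(zlib.decompress)

# Function style: the same work, spelled as a direct call.
gzipped_direct = zlib.compress(plaintext)
```

Both styles do the same work here; the difference is only in how the 
call reads and in which object appears to own the operation.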

> I think the style should be EIBTI for "private" protocols, and TOOWDTI
> for transforms that wrap well-known libraries.

I've been around socket libraries and protocol encoding/decoding stacks 
too long I guess, or I'm just jaded, but TOOWDTI is a pipe dream. 
"There's Only One Blessed Way To Do It" I can understand and appreciate.

EIBTI trumps TOOWDTI when it has to go through a registry.  I would be 
-1 on this design:

     In module codecs:

         from gzip import compress as _gzip_compress
         ...
         _registry['gzip'] = _gzip_compress

Where there is a great deal of code that enforces TOOWDTI, effectively 
obfuscating the fact that all you're passing to transform() is nothing 
more magical than a reference to a function.
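
A rough sketch of that point, assuming a hypothetical module-level 
transform() helper and a hypothetical _registry dictionary (neither is 
an existing API).  zlib stands in for gzip here so that the two results 
can be compared byte for byte.

```python
import zlib

# Hypothetical registry, roughly what a codecs-like module might keep.
_registry = {"gzip": zlib.compress, "gunzip": zlib.decompress}


def transform(data, how):
    """Hypothetical helper: accept a registered name or a callable."""
    # The string form is only an extra dictionary lookup that ends in
    # the same function reference the caller could have passed directly.
    func = _registry[how] if isinstance(how, str) else how
    return func(data)


data = b"spam" * 100
by_name = transform(data, "gzip")
by_reference = transform(data, zlib.compress)
```

The two calls end up invoking the same function object; the registry 
adds an indirection but no new capability.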

> This is a non-starter, because you don't know what the representation
> of strings is.

If you're working on that kind of application.  My applications have to 
know what the items in the sequence are, or they have to figure it out, 
but when it comes time to do the transformation, they know.

> We could be right-thinking and mandate that in the
> .transform() context the string representation is considered
> big-endian (and for little-endian platforms the bytes are swabbed
> before applying the transformation).

Yuck.

> But that would annoy all the Wintel users because string.transform('zip')
> would produce gobbledygook when unzipped from the command line.  And
> of course assuming a little-endian representation is un-right-thinkable.

It would annoy me because mandating the format of the input is up to the 
transformation function, not transform() itself.

     y = x.transform(f)

If there is some endian restriction on f, it should detect it and 
enforce it, or if it can't, document it.  If there is some platform 
strangeness, it should take that into account.
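
As an illustration of a transformation function that owns its format 
rules, here is a sketch under assumptions: pack_u16_be and the 
free-standing transform() below are hypothetical names, not part of any 
proposal.  The function fixes its own byte order and validates its own 
input, so the caller needs no knowledge of either.

```python
import struct


def pack_u16_be(values):
    """Hypothetical transformation: pack ints as big-endian 16-bit words."""
    for v in values:
        # The function enforces its own input constraint.
        if not 0 <= v <= 0xFFFF:
            raise ValueError(f"{v!r} does not fit in 16 bits")
    # '>' fixes the byte order regardless of the host platform.
    return struct.pack(f">{len(values)}H", *values)


def transform(x, f):
    # Hypothetical stand-in for x.transform(f): it knows nothing about
    # the format f expects or produces.
    return f(x)


y = transform([1, 258], pack_u16_be)  # b"\x00\x01\x01\x02"
```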

> In this sense string-to-string and byte-to-byte *must* be kept
> separate from "true" codecs.

I don't know of any codecs that aren't true.  Some may be more popular or 
common than others, and the more popular ones may be blessed by being 
presented as easily accessible, just like your gunzip === gzip_to_plaintext.

> I think it would be a very bad idea to allow names to be shared
> for, say, byte-to-byte and string-to-byte "gzip" for the reason
> given above.

I don't agree, only because I've written plenty of functions that can 
take a variety of different kinds of inputs as a convenience.  If 
zlib.compress can take bytes or strings I would be fine with that, and 
if I could be more explicit, e.g.,

     gzipped = plainbytes.transform(zlib.compress_bytes)

I would be even happier.  What is not available in Python that is in 
C++, and believe me, I don't miss it all THAT much, is a way to select 
the appropriate function based on both the input and output. 
Annotations would have been a way to do it, but there are far too many 
people who don't like it, for very good reasons.
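
For reference, a partial version of this can be sketched with 
functools.singledispatch, which was added to the standard library in 
Python 3.4, well after this thread.  It selects an implementation from 
the type of the first argument only, not from the desired output type, 
so it covers just half of the C++-style selection described above.  The 
compress() wrapper and the choice of UTF-8 for text are assumptions 
made for the example.

```python
import zlib
from functools import singledispatch


@singledispatch
def compress(data):
    # Fallback for input types with no registered implementation.
    raise TypeError(f"cannot compress {type(data).__name__}")


@compress.register(bytes)
def _compress_bytes(data):
    return zlib.compress(data)


@compress.register(str)
def _compress_str(data):
    # Assumption: text is encoded as UTF-8 before compression.
    return zlib.compress(data.encode("utf-8"))


from_bytes = compress(b"plain bytes")
from_text = compress("plain text")
```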

> Whether string-to-string and byte-to-byte need to share a namespace is
> another question, but since we already need three (string->byte,
> byte->string, byte->byte) that should be forced not to collide, I
> don't think that there's that big a loss in requiring that
> .transform('pig_latin') (string to string) be spelled differently from
> .transform('pig_latin1') (byte to byte assuming ISO 8859/1 data).

I agree, and I don't think there's an advantage to passing string names.

     import piglatin as pig
     piggy = mytext.transform(pig.latin1_encode)

I'm -1 on transform.register('pig_latin1', pig.latin1_encode).

> Do you have use cases where byte-to-byte and string-to-string
> transformations should share the same name?

Not in the same module.

Joel