[Python-Dev] transform() and untransform() methods, and the codec registry

Alexander Belopolsky alexander.belopolsky at gmail.com
Thu Dec 9 21:29:35 CET 2010

On Thu, Dec 9, 2010 at 2:17 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thu, 9 Dec 2010 13:55:08 -0500
> Alexander Belopolsky <alexander.belopolsky at gmail.com> wrote:
>> This is actually *very* misleading because
>> >>> 'abc'.transform('rot13')
>> 'nop'
>> works even though 'abc' is not "an object with the buffer interface".
> Agreed. It was already pointed out in the parent thread.
> I would say my opinion on keeping transform()/untransform() is +0 if
> these error messages (and preferably the exception type as well) get
> improved. Otherwise we'd better revert them and add a more polished
> version in 3.3.

Error messages are only one of the problems.  User confusion over which
codec supports which types is another.  Why, for example, does rot13
work on str and not on bytes?  It only affects the ASCII range, so
isn't it natural to expect b'abc'.transform('rot13') to work?  Well,
presumably this is so because Caesar did not know about bytes and his
"cypher" was about character shuffling.  In this case, shouldn't it
also shuffle other code points assigned to Latin letters?  Given how
"useful" rot13 is in practice, I feel that it was only added to
justify adding str.transform().
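
To make the str-versus-bytes asymmetry concrete, here is a minimal
sketch.  It assumes a Python 3 release where the proposed
str.transform() method is not available, so it goes through
codecs.encode() with the standard library's 'rot_13' codec instead;
the variable names are only illustrative.

```python
import codecs

# The rot_13 codec is a str-to-str codec: ASCII letters are rotated.
scrambled = codecs.encode('abc', 'rot_13')
print(scrambled)

# Code points outside A-Z/a-z are passed through unchanged, so the
# accented Latin letter below is not shuffled.
mixed = codecs.encode('caf\u00e9', 'rot_13')
print(mixed)

# The same codec rejects bytes input, even for pure-ASCII data.
try:
    codecs.encode(b'abc', 'rot_13')
    bytes_supported = True
except TypeError as exc:
    bytes_supported = False
    print('bytes rejected:', exc)
```

The second call shows the inconsistency raised above: the codec treats
its input as characters, yet only the ASCII letters take part in the
rotation.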

There are other problems raised on the issue and not addressed in the
tracker discussion.  For example, both Victor and I expressed concern
about having builtin methods that do imports behind the scenes.
Granted, this issue already exists with encode/decode methods, but
these are usable without providing an explicit encoding and in this
form do not have side-effects.
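
The hidden import can be observed with a short sketch.  It assumes
CPython's standard codec registry, where each codec lives in a module
of the encodings package, and it uses codecs.lookup() as a stand-in
for what a transform() call would have to do internally.

```python
import codecs
import sys

MODULE = 'encodings.rot_13'

# Whether the codec module is already loaded depends on what has run
# earlier in the process, so this value is reported, not assumed.
loaded_before = MODULE in sys.modules

# Looking a codec up by name makes the registry import its module.
codecs.lookup('rot_13')

loaded_after = MODULE in sys.modules
print(loaded_before, loaded_after)
```

In a fresh interpreter the first value is normally False and the
second True: the lookup by name is what pulls the module in.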

Another problem is that with str.transform(), users are encouraged to
write programs in which data stored in strings is not always
interpreted as Unicode.  For example, when I see an 'n' in a string
variable, it may actually mean 'a' because it has been scrambled by
rot13.  Again, rot13 is not a realistic example, but as users are
encouraged to create their own string to string codecs, we may soon
find ourselves in the same mess as we have with 2.x programs trying to
support multiple locales.

As far as realistic examples go, Unicode transformations such as case
folding, normalization or decimal to ASCII translation have not been
considered in the str.transform() design.  The
str.transform/str.untransform pair may or may not be a good solution
for these cases.  One obvious issue is that these transformations
are often not invertible.
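
A short sketch of why an untransform() counterpart is problematic for
these cases.  It uses str.casefold() (available from Python 3.3) and
the unicodedata module rather than any codec, so it illustrates the
transformations themselves, not the proposed API.

```python
import unicodedata

# Case folding: two different inputs collapse to the same output, so
# no inverse can tell which one was the original.
folded_a = 'Stra\u00dfe'.casefold()   # 'Straße' with U+00DF
folded_b = 'STRASSE'.casefold()
print(folded_a == folded_b)

# Compatibility normalization: the single ligature code point U+FB01
# becomes the two letters 'f' and 'i'.
ligature = unicodedata.normalize('NFKC', '\ufb01')
print(ligature)

# Decimal to ASCII: Arabic-Indic digits map to '1', '2', '3', and the
# original script is lost in the result.
arabic_indic = '\u0661\u0662\u0663'
ascii_digits = ''.join(str(unicodedata.decimal(ch)) for ch in arabic_indic)
print(ascii_digits)
```

In each case the output alone does not determine the input, which is
what an inverse method would need.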

I admit I have more questions than answers at this point, but a design
that adds the same two methods to three builtin types with very
different usage patterns (str, bytes and bytearray) does not seem to
be well thought out.

The str type already has 40+ methods, many of which are not well-suited
to handle the complexities inherent in Unicode.  Rather than rushing
in two more methods that will prove to be about as useful as
str.swapcase(), let's consider actual use cases and come up with a
design that will properly address them.
