[Python-Dev] Which direction is UnTransform? / Unicode is different

Wed Nov 20 02:28:48 CET 2013

(Fri Nov 15 16:57:00 CET 2013) Stephen J. Turnbull wrote:

 > Serhiy Storchaka wrote:

 > > If the transform() method will be added, I prefer to have only
 > > one transformation method and specify a direction by the
 > > transformation name ("bzip2"/"unbzip2").

Me too.  Until I consider special cases like "compress", or "lower",
and realize that there are enough special cases to become a major wart
if generic transforms ever became popular.  

> People think about these transformations as "en- or de-coding", not
> "transforming", most of the time.  Even for a transformation that is
> an involution (eg, rot13), people have an very clear idea of what's
> encoded and what's not, and they are going to prefer the names
> "encode" and "decode" for these (generic) operations in many cases.

I think this is one of the major stumbling blocks with unicode.

I originally disagreed strongly with what Stephen wrote -- but then
I realized that all my counterexamples involved unicode text.

I can tell whether something is tarred or untarred, zipped or unzipped.

But an 8-bit (even Latin-1, let alone ASCII) bytestring really doesn't
seem "encoded", and it doesn't make sense to "decode" a perfectly
readable (ASCII) string into a sequence of "code units".

Nor does it help that http://www.unicode.org/glossary/#code_unit
defines "code unit" as "The minimal bit combination that can represent
a unit of encoded text for processing or interchange. The Unicode
Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code
units in the UTF-16 encoding form, and 32-bit code units in the UTF-32
encoding form. (See definition D77 in Section 3.9, Unicode Encoding
Forms.)"

I have to read that very carefully to avoid mentally translating it
into "Code Units are *en*coded, and there are lots of different
complicated encodings that I wouldn't use unless I were doing special
processing or interchange."  If I'm not using the network, or if my
"interchange format" already looks like readable ASCII, then unicode
sure sounds like a complication.  I *will* get confused over which
direction is encoding and which is decoding. (Removing .decode()
from the (unicode) str type in 3 does help a lot, if I have a Python 3
interpreter running to check against.)

I'm not sure exactly what implications the above has, but it certainly
supports separating the Text Processing from the generic codecs, both
in the documentation and in any potential new methods.

Instead of relying on introspection of .decodes_to and .encodes_to, it
would be useful to have charsetcodecs and tranformcodecs as entirely
different modules, with their own separate registries.  I will even
note that the existing help(codecs) seems more appropriate for
charsetcodecs than it does for the current conjoined module.

-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ