[Python-Dev] Which direction is UnTransform? / Unicode is different

Steven D'Aprano steve at pearwood.info
Wed Nov 20 13:03:03 CET 2013


On Tue, Nov 19, 2013 at 05:28:48PM -0800, Jim J. Jewett wrote:
> 
>  
> (Fri Nov 15 16:57:00 CET 2013) Stephen J. Turnbull wrote:
> 
>  > Serhiy Storchaka wrote:
> 
>  > > If the transform() method will be added, I prefer to have only
>  > > one transformation method and specify a direction by the
>  > > transformation name ("bzip2"/"unbzip2").
> 
> Me too.  Until I consider special cases like "compress", or "lower",
> and realize that there are enough special cases to become a major wart
> if generic transforms ever became popular.  

I'm not sure I understand this comment. Why are "compress" and "lower" 
special cases? If there's a "compress" codec, presumably there'll be an 
"uncompress" or "expand" that reverses it. In the case of "lower", it's 
not losslessly reversible, but there's certainly a reverse 
transformation, "upper".
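
For concreteness, the bytes-to-bytes codecs that already ship with 
Python 3 work like this through codecs.encode()/decode() -- a sketch of 
the status quo, not of the proposed transform() API:

    >>> import codecs
    >>> data = b"spam and eggs" * 100
    >>> squeezed = codecs.encode(data, "zlib_codec")   # the "compress" direction
    >>> codecs.decode(squeezed, "zlib_codec") == data  # same name, other direction
    True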

Some transformations are their own reverse, e.g. "rot13". In that case, 
there's no need for an unrot13 codec, since applying it twice undoes 
it.
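
To illustrate with the machinery we already have (rot_13 is one of the 
str-to-str transforms reachable through the codecs module in Python 3):

    >>> import codecs
    >>> codecs.encode("Hello, world", "rot_13")
    'Uryyb, jbeyq'
    >>> codecs.encode("Uryyb, jbeyq", "rot_13")  # applying it again undoes it
    'Hello, world'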


> > People think about these transformations as "en- or de-coding", not
> > "transforming", most of the time.  Even for a transformation that is
> > an involution (eg, rot13), people have a very clear idea of what's
> > encoded and what's not, and they are going to prefer the names
> > "encode" and "decode" for these (generic) operations in many cases.
> 
> I think this is one of the major stumbling blocks with unicode.
> 
> I originally disagreed strongly with what Stephen wrote -- but then
> I realized that all my counterexamples involved unicode text.

Counterexamples to what? Again, I'm afraid I can't really understand 
what point you're trying to make here. Perhaps an explicit 
counterexample, and an explicit statement of what you're disagreeing 
with (e.g. "I disagree that people have a clear idea of what's 
encoded and what's not") will help.

 
[...]
> But an 8-bit (even Latin-1, let alone ASCII) bytestring really doesn't
> seem "encoded", and it doesn't make sense to "decode" a perfectly
> readable (ASCII) string into a sequence of "code units".

Of course it is encoded. There's nothing "a"-like about the byte 0x61, 
byte 0x2E is nothing like a period, and there is nothing about the byte 
0x0A that forces text editors to start a new line -- or should that be 
0x0D, or even possibly 0x85?

There's nothing that distinguishes the text "spam" from the four-byte 
integer 1936744813 (0x7370616d in hex) except the semantics that we 
grant it, and that includes an implicit transformation 0x73 <-> "s", 
etc.
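
You can see at the interpreter that only the chosen interpretation 
differs (int.from_bytes has been available since Python 3.2):

    >>> data = b"spam"
    >>> int.from_bytes(data, "big")  # the same four bytes as a big-endian int
    1936744813
    >>> data.decode("ascii")         # the same four bytes as ASCII text
    'spam'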

Reading this may help:

www.joelonsoftware.com/articles/Unicode.html


> Nor does it help that http://www.unicode.org/glossary/#code_unit
> defines "code unit" as "The minimal bit combination that can represent
> a unit of encoded text for processing or interchange. The Unicode
> Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code
> units in the UTF-16 encoding form, and 32-bit code units in the UTF-32
> encoding form. (See definition D77 in Section 3.9, Unicode Encoding
> Forms.)"

I agree that the official Unicode glossary is unfortunately confusing. 
It has a huge amount of information, often with confusingly similar 
terminology (code points and code units are, in a sense, opposites), and 
it's quite hard for anyone new to Unicode to make sense of it all.

 
> I have to read that very carefully to avoid mentally translating it
> into "Code Units are *en*coded, 

Code units *are* encoded, in the sense that we say a burger is cooked. 
Take a raw meat patty and cook it, and you get a burger. Similarly, code 
units are the product of an encoding process, hence have been encoded.

Code points (think of them as characters, modulo a few technicalities) 
are encoded *into* code units, which are then serialized as bytes. Which 
code units you get depends on the encoding form you use, i.e. the codec.

If you start with the character "a", and apply the UTF-8 encoding, 
you get a single 8-bit (one byte) code unit, 0x61. If you apply the 
UTF-16 (big endian) encoding, you get a single 16-bit (two bytes) 
code unit, 0x0061. If you apply the UTF-32 (big endian) codec, you get a single 32-bit 
(four bytes) code unit, 0x00000061.
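
Spelled out at the interpreter, with the padding visible in the repr of 
the bytes:

    >>> "a".encode("utf-8")      # one 8-bit code unit
    b'a'
    >>> "a".encode("utf-16-be")  # one 16-bit code unit
    b'\x00a'
    >>> "a".encode("utf-32-be")  # one 32-bit code unit
    b'\x00\x00\x00a'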


> and there are lots of different
> complicated encodings that I wouldn't use unless I were doing special
> processing or interchange."

Very few of those encodings are Unicode. With the exception of a small 
handful of UTF-* codecs, and maybe one or two others, the vast majority 
are legacy encodings from the Bad Old Days when just about every 
computer had its own distinct character set, or sets. The non-UTF codecs 
(all the Latin-whatever codecs, Big5, cp-whatever, koi8-whatever, there 
are dozens of them) are basically old Windows code pages and their 
equivalents from other computer systems.

And yes, it is best to avoid them like the plague except when you need 
them for interoperability with legacy data.


> If I'm not using the network, or if my
> "interchange format" already looks like readable ASCII, then unicode
> sure sounds like a complication.

It's not, not compared to the Bad Old Days. If you're like me, you 
remember when you couldn't exchange text files from Macintosh to Windows 
and vice versa without data being corrupted. Now, so long as both sides 
use Unicode, data corruption ought to be a thing of the past.

It's not, but only because some operating systems still insist on using 
non-Unicode encodings by default.


> I *will* get confused over which
> direction is encoding and which is decoding. (Removing .decode()
> from the (unicode) str type in 3 does help a lot, if I have a Python 3
> interpreter running to check against.)

It took me a long time to learn that text encodes to bytes, and bytes 
decode back to text. Using Python 3 really helped with that.
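
In Python 3 the direction is baked into the types: only str has an 
encode method and only bytes has a decode method, so the interpreter 
keeps you honest:

    >>> text = "café"
    >>> data = text.encode("utf-8")   # str -> bytes
    >>> data
    b'caf\xc3\xa9'
    >>> data.decode("utf-8")          # bytes -> str
    'café'
    >>> text.decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'decode'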



-- 
Steven

