[Python-Dev] bytes.from_hex()

Fri Feb 24 23:46:00 CET 2006

* The following reply is a longer explanation than I intended of why 
codings like 'rot' (and how they differ) aren't the same thing as pure 
unicode codecs and probably should be treated differently.
If you already understand that, then I suggest skipping this.  But if 
you like detailed logical analysis, it might be of some interest even if 
it's reviewing the obvious to those who already know.

(And hopefully I didn't make any really obvious errors myself.)

Stephen J. Turnbull wrote:
>>>>>> "Ron" == Ron Adam <rrr at ronadam.com> writes:
> 
>     Ron> We could call it transform or translate if needed.
> 
> You're still losing the directionality, which is my primary objection
> to "recode".  The absence of directionality is precisely why "recode"
> is used in that sense for i18n work.

I think you're not understanding what I suggested.  It might help if we 
could agree on some points and then go from there.

So, let's consider a "codec" and a "coding" as being two different things, 
where a codec is a subset of the unicode characters expressed in a native 
format, and a coding is *not* a subset of the unicode character set but 
an _operation_ performed on text.  So you would have the following 
properties.

    codec ->  text is always in *one_codec* at any time.

    coding ->  operation performed on text.

Let's add a special default coding called 'none' to represent a 
do-nothing coding (figuratively, for explanation purposes).

    'none' -> return the input as is, or the uncoded text

Given the above relationships we have the following possible 
transformations.

   1. codec to like codec:   'ascii' to 'ascii'
   2. codec to unlike codec:   'ascii' to 'latin1'

And we have coding relationships of:

   a. coding to like coding      # Unchanged, do nothing
   b. coding to unlike coding

Then we can express all the possible combinations as...

    [1.a, 1.b, 2.a, 2.b]

    1.a -> coding in codec to like coding in like codec:

        'none' in 'ascii' to 'none' in 'ascii'

    1.b -> coding in codec to diff coding in like codec:

        'none' in 'ascii' to 'base64' in 'ascii'

    2.a -> coding in codec to same coding in diff codec:

        'none' in 'ascii' to 'none' in 'latin1'

    2.b -> coding in codec to diff coding in diff codec:

        'none' in 'latin1' to 'base64' in 'ascii'

This last one is a problem, as some codecs combine coding with character 
set encoding and return text in a different encoding than they received. 
The line is also blurred between types and encodings.  Is unicode an 
encoding?  Will bytes also be an encoding?

Using the above combinations:

(1.a) is just creating a new copy of an object.

    s = str(s)

(1.b) is recoding an object; it returns a copy of the object in the same 
encoding.

    s = s.encode('hex-codec')  # ascii str -> ascii str coded in hex
    s = s.decode('hex-codec')  # ascii str coded in hex -> ascii str

* These are really two different operations, and encoding repeatedly 
results in nested codings.  Codecs (as a pure subset of unicode) don't 
have that property.

* The hex-codec also fits the 2.b pattern below if the source string is 
of a different type than ascii (or the default string?).
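
The nesting property can be sketched as follows. This is only an 
illustration, and it assumes a current Python 3 interpreter rather than 
the Python 2 str methods used in this message: codecs.encode() with the 
'hex_codec' name stands in for s.encode('hex-codec'), and the variable 
names are made up for the example.

```python
import codecs

data = b"abc"

# Apply the hex coding twice: each pass codes the output of the previous one.
once = codecs.encode(data, "hex_codec")
twice = codecs.encode(once, "hex_codec")

print(once)    # b'616263'
print(twice)   # b'363136323633'

# Undoing it takes one decode per layer of coding.
back = codecs.decode(codecs.decode(twice, "hex_codec"), "hex_codec")
assert back == data
```

Two encodes need two decodes, which is the property a pure character 
set codec does not have.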

(2.a) creates a copy encoded in a new codec.

    s = s.encode('latin1')

* I believe string constructors should have an encoding argument for use 
with unicode strings.

    s = str(u, 'latin1')   # This would match the bytes constructor.

(2.b) is a combination of the above.

   s = u.encode('base64')
      # unicode to ascii string as base64 coded characters

   u = unicode(s.decode('base64'))
      # ascii string coded in base64 to unicode characters

or

>>> u = unicode(s, 'base64')
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  TypeError: decoder did not return an unicode object (type=str)

Oops...  ;)

So is a coding the same as a codec?  I think they have different 
properties and should be treated differently, except when the 
practicality over purity rule is needed.  And in those cases maybe the 
names could clearly state the result.

    u.decode('base64ascii')  # name indicates coding to codec

> A string. -> QSBzdHJpbmcu -> UVNCemRISnBibWN1

Looks like the underlying sequence is:

      native string -> unicode -> unicode coded base64 -> coded ascii str

And the decode operation would be...

      coded ascii str -> unicode coded base64 -> unicode -> ascii str

Except it may combine some of these steps to speed it up.

Since it's a hybrid codec that includes a coding operation, we have to 
treat it as a codec.
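
The chain quoted above ('A string.' coded in base64 twice) can be 
reproduced with a short sketch. It assumes Python 3 and the standard 
base64 module instead of the Python 2 'base64' codec, so the step from 
text to bytes is written out explicitly; the variable names are only 
illustrative.

```python
import base64

text = "A string."

# text -> bytes in a codec (ASCII here) -> base64-coded bytes
raw = text.encode("ascii")
coded_once = base64.b64encode(raw)
coded_twice = base64.b64encode(coded_once)

print(coded_once.decode("ascii"))   # QSBzdHJpbmcu
print(coded_twice.decode("ascii"))  # UVNCemRISnBibWN1

# Decoding walks the same chain in reverse, one layer at a time.
restored = base64.b64decode(base64.b64decode(coded_twice)).decode("ascii")
assert restored == text
```

Written this way, the character set step (encode/decode with 'ascii') 
and the coding step (base64) stay visibly separate.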

>     Ron>     * Given that the string type gains a __codec__ attribute
>     Ron> to handle automatic decoding when needed.  (is there a reason
>     Ron> not to?)
> 
>     Ron>        str(object[,codec][,error]) -> string coded with codec
> 
>     Ron>        unicode(object[,error]) -> unicode
> 
>     Ron>        bytes(object) -> bytes
> 
> str == unicode in Py3k, so this is a non-starter.  What do you want to
> say?
> 
>     Ron>      * a recode() method is used for transformations that
>     Ron> *do_not* change the current codec.
> 
> I'm not sure what you mean by the "current codec".  If it's attached
> to an "encoded object", it should be the codec needed to decode the
> object.  And it should be allowed to be a "codec stack".  

I wasn't thinking in terms of stacks, but in that case the current codec 
would be the top of the stack.  For the record, I think stackable codecs 
are a very bad idea.

Back to recode vs encode/decode, the example used above might be useful.

    s = s.encode('hex-codec')  # ascii str -> ascii str coded in hex
    s = s.decode('hex-codec')  # ascii str coded in hex -> ascii str

In my opinion these are actually two very different (although related) 
operations that would be better expressed with different names.

Currently it's a hybrid codec that converts its input to an ascii string 
(or default encoding?), but when decoding you end up with an ascii 
encoding even if you started with something else.  So the decode isn't a 
true inverse of encode in some cases.

As a coding operation it would be.

    u = u.recode('to_hex')
    u = u.recode('from_hex')

Where this would work with both unicode and strings without changing the 
codec.

It also keeps the 'if I do it again, it will *recode* the coded text' 
relationship.  So I think the name is appropriate, IMHO.
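
A rough sketch of what such a recode operation might look like. The 
recode() function here is hypothetical (no such method exists on str or 
bytes), it is written for Python 3, and it assumes text input is limited 
to Latin-1 characters so that each character maps to exactly one byte.

```python
import binascii

def recode(value, coding):
    """Hypothetical recode(): apply a coding, return the same type as the input."""
    is_text = isinstance(value, str)
    # Assumption: text is restricted to Latin-1, one byte per character.
    raw = value.encode("latin-1") if is_text else bytes(value)
    if coding == "to_hex":
        out = binascii.hexlify(raw)
    elif coding == "from_hex":
        out = binascii.unhexlify(raw)
    else:
        raise ValueError("unknown coding: %r" % (coding,))
    # Hand back text for text input and bytes for bytes input.
    return out.decode("latin-1") if is_text else out

u = "abc"
hexed = recode(u, "to_hex")
print(hexed)                      # 616263
print(recode(hexed, "to_hex"))    # 363136323633
print(recode(hexed, "from_hex"))  # abc
```

The type going in is the type coming out, and 'from_hex' is a true 
inverse of 'to_hex' for both text and bytes.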

Pure codecs such as latin-1 can be invoked over and over and you can 
always get back what you put in, in a single step.

>>> s = 'abc'
>>> for n in range(100):
...     s = s.encode('latin-1')
...
>>> print s, type(s)
abc <type 'str'>

Supposedly a lot of these issues will go away in Python 3000. And we can 
probably live with the current state of things.  But even after Python 
3000 it seems to me we will still need access to codecs as we may run 
across encoded text input from various sources.
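
For comparison, here is a sketch of the same distinction on a current 
Python 3 interpreter. It describes how CPython 3.4 and later behave and 
is an assumption about the reader's environment, not something settled 
in this thread: str.encode() accepts only text encodings, while codings 
such as hex remain reachable through the codecs module as bytes-to-bytes 
transforms.

```python
import codecs

# A pure text codec round-trips in a single step.
text = "abc"
assert text.encode("latin-1").decode("latin-1") == text

# str.encode() accepts only text encodings, so a coding such as hex is rejected.
try:
    text.encode("hex_codec")
except LookupError as exc:
    rejected = True
    print(exc)
else:
    rejected = False

# The coding is still available through the codecs module, bytes to bytes.
hexed = codecs.encode(b"abc", "hex_codec")
assert codecs.decode(hexed, "hex_codec") == b"abc"
```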

Cheers,
    Ron