[Python-Dev] bytes.from_hex()

Sun Feb 19 04:54:44 CET 2006

Josiah Carlson wrote:
> Ron Adam <rrr at ronadam.com> wrote:

> Except that ambiguates it even further.
>
> Is encodings.tounicode() encoding, or decoding?  According to everything
> you have said so far, it would be decoding.  But if I am decoding binary
> data, why should it be spending any time as a unicode string?  What do I
> mean?

Encoding and decoding are relative concepts.  It's all encoding from one
thing to another.  Whether it's "decoding" or "encoding" depends on the
relationship of the current encoding to a standard encoding.
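
As a rough illustration of that relativity, here is a minimal sketch in
modern Python 3 (an assumption; it is not the API under discussion in this
thread).  The same codec name drives both directions, and which direction
gets called "encode" is only a convention relative to the raw bytes.  The
variable names are made up for the example.

```python
import codecs

raw = b"binary \x00\xff data"

# "Encoding" relative to the raw bytes: bytes -> base64-armored bytes.
armored = codecs.encode(raw, "base64")

# "Decoding" is the same codec name run in the other direction.
restored = codecs.decode(armored, "base64")

# The round trip gives back exactly what we started with.
assert restored == raw
```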

The confusion with "decode" arises when the 'default_encoding' changes,
will change, or is unknown.

>     x = f.read() #x contains base-64 encoded binary data
>     y = encodings.to_unicode(x, 'base64')
>     
> y now contains BINARY DATA, except that it is a unicode string

No, that wasn't what I was describing.  You get a Unicode string object
as the result, not a bytes object with binary data.  See the toy example
at the bottom.

>     z = encodings.to_str(y, 'latin-1')
> 
> Later you define a str_to_str function, which I (or someone else) would
> use like:
> 
>     z = str_to_str(x, 'base64', 'latin-1')
> 
> But the trick is that I don't want some unicode string encoded in
> latin-1, I want my binary data unencoded.  They may happen to be the
> same in this particular example, but that doesn't mean that it makes any
> sense to the user.

If you want bytes, then you would use the bytes() type to get them
directly, not encode or decode.

     binary_unicode = bytes(unicode_string)

The exact byte order and representation would need to be decided by the
Python developers in this case.  The internal representation,
'unicode-internal', is UCS-2, I believe.
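
To show why the byte order and representation have to be pinned down, here
is a small sketch using the bytes type as it later shipped in Python 3 (an
assumption relative to this thread, where bytes() is still a proposal).
There, bytes() refuses to guess and requires an explicit encoding, and the
same text yields different byte layouts depending on the choice.

```python
text = "h\u00e9"  # 'h' followed by e-acute

# Python 3 refuses to guess: bytes(str) without an encoding raises TypeError.
try:
    bytes(text)
    guessed = True
except TypeError:
    guessed = False

# The same two code points, three different byte layouts.
little = bytes(text, "utf-16-le")  # b'h\x00\xe9\x00'
big = bytes(text, "utf-16-be")     # b'\x00h\x00\xe9'
utf8 = bytes(text, "utf-8")        # b'h\xc3\xa9'
```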

>> It's no more ambiguous than any math
>> operation where you can do it one way with one operation and get your
>> original value back with the same operation by using an inverse value.
>>
>>     n2=n+1; n3=n2+(-1); n==n3
>>     n2=n*2; n3=n2*(.5); n==n3
> 
> Ahh, so you are saying 'to_base64' and 'from_base64'.  There is one
> major reason why I don't like that kind of a system: I can't just say
> encoding='base64' and use str.encode(encoding) and str.decode(encoding),
> I necessarily have to use str.recode('to_'+encoding) and
> str.recode('from_'+encoding).  Seems a bit awkward.

Yes, but the encodings API could abstract out the 'to_base64' and
'from_base64' so you can just say 'base64' and have it work either way.

Maybe a toy "incomplete" example might help.

    # in module bytes.py or someplace else.
    class bytes(list):
       """
       bytes methods defined here
       """

    # in module encodings.py

    # using a dict of tuples, but other solutions would
    # work just as well.  Each tuple holds the names of the
    # (decoding, encoding) operations for that codec.
    unicode_codecs = {
        'base64': ('from_base64', 'to_base64'),
        }

    def tounicode(obj, from_codec):
        b = bytes(obj)
        b = b.recode(unicode_codecs[from_codec][0])
        return unicode(b)

    def tostr(obj, to_codec):
        b = bytes(obj)
        b = b.recode(unicode_codecs[to_codec][1])
        return str(b)

    # in your application

    import encodings

    ... a bunch of code ...

    u = encodings.tounicode(s, 'base64')

    # or if going the other way

    s = encodings.tostr(u, 'base64')
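
For anyone who wants to run something, the toy above can be approximated in
modern Python 3 as the sketch below.  This is only one possible reading of
it: the CODECS registry, the charset parameter, and the use of the standard
base64 module in place of a bytes.recode() method are all assumptions made
for the example, not part of the proposal.

```python
import base64

# Hypothetical registry: one codec name maps to a (from, to) pair of
# functions, so callers never spell out 'from_base64' or 'to_base64'.
CODECS = {
    "base64": (base64.b64decode, base64.b64encode),
}

def tounicode(data, from_codec, charset="utf-8"):
    """Undo the named codec, then interpret the resulting bytes as text."""
    from_func, _ = CODECS[from_codec]
    return from_func(data).decode(charset)

def tostr(text, to_codec, charset="utf-8"):
    """Turn text into bytes, then apply the named codec."""
    _, to_func = CODECS[to_codec]
    return to_func(text.encode(charset))

s = b"aGVsbG8="
u = tounicode(s, "base64")  # 'hello'

# Going the other way with the same codec name recovers the input.
assert tostr(u, "base64") == s
```

The caller only ever names 'base64'; the direction is chosen by which
function is called, which is the point of the abstraction.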

Does this help?  Is the relationship between the bytes object and the
encodings API clearer here?  If not, maybe we should discuss it further
offline.

Cheers,
    Ronald Adam