[Python-Dev] bytes.from_hex()

Ron Adam rrr at ronadam.com
Sun Feb 19 04:54:44 CET 2006


Josiah Carlson wrote:
> Ron Adam <rrr at ronadam.com> wrote:


> Except that ambiguates it even further.
>
> Is encodings.tounicode() encoding, or decoding?  According to everything
> you have said so far, it would be decoding.  But if I am decoding binary
> data, why should it be spending any time as a unicode string?  What do I
> mean?

Encoding and decoding are relative concepts.  It's all encoding from one
thing to another.  Whether it's "decoding" or "encoding" depends on the
relationship of the current encoding to a standard encoding.

The confusion with "decode" arises when the 'default_encoding'
changes, will change, or is unknown.
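
To make the point concrete, here is a minimal sketch. It assumes the
codecs module of present-day Python 3, which is not the API proposed in
this thread.  The same base64 transform is called "encode" in one
direction and "decode" in the other only because raw bytes are taken as
the reference form.

```python
import codecs

raw = b"hello world"

# Moving away from the reference form (raw bytes) is called "encoding"...
b64 = codecs.encode(raw, "base64")

# ...and moving back toward the reference form is called "decoding".
back = codecs.decode(b64, "base64")

assert back == raw
print(b64)
```

If base64 text were treated as the reference form instead, the two
labels would swap while the operations stayed the same.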


>     x = f.read() #x contains base-64 encoded binary data
>     y = encodings.to_unicode(x, 'base64')
>     
> y now contains BINARY DATA, except that it is a unicode string

No, that wasn't what I was describing.  You get a Unicode string object
as the result, not a bytes object with binary data.  See the toy example
at the bottom.


>     z = encodings.to_str(y, 'latin-1')
> 
> Later you define a str_to_str function, which I (or someone else) would
> use like:
> 
>     z = str_to_str(x, 'base64', 'latin-1')
> 
> But the trick is that I don't want some unicode string encoded in
> latin-1, I want my binary data unencoded.  They may happen to be the
> same in this particular example, but that doesn't mean that it makes any
> sense to the user.

If you want bytes then you would use the bytes() type to get bytes
directly.  Not encode or decode.

     binary_unicode = bytes(unicode_string)

The exact byte order and representation would need to be decided by the
Python developers in this case.  The internal representation,
'unicode-internal', is UCS-2, I believe.
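
As an illustration of why the byte order has to be pinned down, here is
a small sketch using the explicit UTF-16 codecs of present-day Python 3.
It does not implement the bare bytes(unicode_string) call above, which
remains hypothetical; it only shows that one string yields different
bytes under the two byte orders.

```python
text = "A\u00e9"  # 'A' followed by e-acute, both in the UCS-2 range

little = text.encode("utf-16-le")  # low byte of each code unit first
big = text.encode("utf-16-be")     # high byte of each code unit first

print(little.hex())
print(big.hex())

# Same text, different bytes, so the representation must be specified.
assert little != big
assert little.decode("utf-16-le") == big.decode("utf-16-be") == text
```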



>> It's no more ambiguous than any math 
>> operation where you can do it one way with one operation and get your 
>> original value back with the same operation by using an inverse value.
>>
>>     n2=n+1; n3=n+(-1); n==n3
>>     n2=n*2; n3=n*(.5); n==n3
> 
> Ahh, so you are saying 'to_base64' and 'from_base64'.  There is one
> major reason why I don't like that kind of a system: I can't just say
> encoding='base64' and use str.encode(encoding) and str.decode(encoding),
> I necessarily have to use, str.recode('to_'+encoding) and
> str.recode('from_'+encoding) .  Seems a bit awkward.

Yes, but the encodings API could abstract out the 'to_base64' and
'from_base64' so you can just say 'base64' and have it work either way.

Maybe a toy "incomplete" example might help.



    # in module bytes.py or someplace else.
    class bytes(list):
       """
       bytes methods defined here
       """



    # in module encodings.py

    # using a dict of tuples, but other solutions would
    # work just as well.
    unicode_codecs = {
       'base64': ('from_base64', 'to_base64'),
       }

    def tounicode(obj, from_codec):
        b = bytes(obj)
        b = b.recode(unicode_codecs[from_codec][0])
        return unicode(b)

    def tostr(obj, to_codec):
        b = bytes(obj)
        b = b.recode(unicode_codecs[to_codec][1])
        return str(b)



    # in your application

    import encodings

    # ... a bunch of code ...

    u = encodings.tounicode(s, 'base64')

    # or if going the other way

    s = encodings.tostr(u, 'base64')
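
For comparison, a runnable sketch of the same idea, under stated
assumptions: it uses the standard base64 module instead of the proposed
bytes.recode(), it works on bytes in both directions, and the names
CODECS, from_codec and to_codec are hypothetical.  It only shows how one
codec name can select either direction.

```python
import base64

# Hypothetical registry: one public codec name maps to a
# (from_function, to_function) pair, so callers never spell out
# 'from_base64' or 'to_base64' themselves.
CODECS = {
    "base64": (base64.b64decode, base64.b64encode),
}

def from_codec(data: bytes, codec: str) -> bytes:
    """Undo the named codec, returning the raw bytes."""
    return CODECS[codec][0](data)

def to_codec(data: bytes, codec: str) -> bytes:
    """Apply the named codec to raw bytes."""
    return CODECS[codec][1](data)

raw = b"binary \x00\xff data"
packed = to_codec(raw, "base64")

# Round trip: the caller only ever says 'base64'.
assert from_codec(packed, "base64") == raw
```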



Does this help?  Is the relationship between the bytes object and the
encodings API clearer here?  If not, maybe we should discuss it further
offline.


Cheers,
    Ronald Adam