[Python-Dev] bytes.from_hex()

Thu Mar 2 18:32:40 CET 2006

Just van Rossum <just at letterror.com> wrote:
> 
> Ron Adam wrote:
> 
> > Josiah Carlson wrote:
> > > Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> > >>    u = unicode(b)
> > >>    u = unicode(b, 'utf8')
> > >>    b = bytes['utf8'](u)
> > >>    u = unicode['base64'](b)   # encoding
> > >>    b = bytes(u, 'base64')     # decoding
> > >>    u2 = unicode['piglatin'](u1)   # encoding
> > >>    u1 = unicode(u2, 'piglatin')   # decoding
> > > 
> > > Your provided semantics feel cumbersome and confusing to me, as
> > > compared with str/unicode.encode/decode() .
> > > 
> > >  - Josiah
> > 
> > This uses syntax to determine the direction of encoding.  It would be 
> > easier and clearer to just require two arguments or a tuple.
> > 
> >       u = unicode(b, 'encode', 'base64')
> >       b = bytes(u, 'decode', 'base64')
> > 
> >       b = bytes(u, 'encode', 'utf-8')
> >       u = unicode(b, 'decode', 'utf-8')
> > 
> >       u2 = unicode(u1, 'encode', 'piglatin')
> >       u1 = unicode(u2, 'decode', 'piglatin')
> > 
> > 
> > 
> > It looks somewhat cleaner if you combine them in a path style string.
> > 
> >       b = bytes(u, 'encode/utf-8')
> >       u = unicode(b, 'decode/utf-8')
> 
> It gets from bad to worse :(
> 
> I always liked the asymmetry between
> 
>     u = unicode(s, "utf8")
> 
> and
> 
>     s = u.encode("utf8")
> 
> which I think was the original design of the unicode API. Kudos to
> whoever came up with that.

I personally have never used that mechanism.  I always used
s.decode('utf8') and u.encode('utf8').  I prefer the symmetry that
.encode() and .decode() offer.
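
For illustration, a minimal sketch of the two spellings being compared,
written for Python 3 (an assumption: this thread predates it, and there
bytes plays the role of the old str while str plays the role of unicode).
The sample text is made up.

```python
# Method form: symmetric .encode() / .decode() pair.
s = "caf\u00e9".encode("utf8")   # text -> bytes: b'caf\xc3\xa9'
u = s.decode("utf8")             # bytes -> text

# Constructor form: the type object takes the encoding as a second argument.
u_ctor = str(s, "utf8")

# Both decoding spellings give the same text.
assert u == u_ctor
```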

> When I saw
> 
>     b = bytes(u, "utf8")
> 
> mentioned for the first time, I thought: why on earth must the bytes
> constructor be coupled to the unicode API?!?! It makes no sense to me
> whatsoever.

It's not a 'unicode API'.  See integers for another example where a
second argument to a type object defines how to interpret the other
argument, or even arrays/structs where the first argument defines the
interpretation.
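
A short sketch of the precedents referred to above, using only standard
library calls. The values are arbitrary examples.

```python
import struct
from array import array

# int: the second argument says how to interpret the first (base 16 here).
n = int("ff", 16)                 # 255

# struct: the first argument (the format) says how to interpret the rest.
packed = struct.pack("<H", 258)   # little-endian unsigned short: b'\x02\x01'

# array: the first argument (the typecode) says how to store the items.
arr = array("B", [1, 2, 3])       # unsigned bytes
```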

> Bytes have so much more use besides encoded text.

Agreed.

> I believe (please correct me if I'm wrong) that the encoding argument of
> bytes() was invented to make it easier to write byte literals. Perhaps a
> true bytes literal notation is in order after all?

Maybe, but I think the other earlier use-case was for using:
    s2 = bytes(s1, 'base64')
if bytes objects received an .encode() method, or even a .tobytes()
method.  I could be misremembering.

> My preference for bytes -> unicode -> bytes API would be this:
> 
>     u = unicode(b, "utf8")  # just like we have now
>     b = u.tobytes("utf8")   # like u.encode(), but being explicit
>                             # about the resulting type
> 
> As to base64, while it works as a codec ("Why a base64 codec? Because we
> can!"), I don't find it a natural API at all, for such conversions.

Depending on whose definition of codec you listen to (is it a
compressor/decompressor, or a coder/decoder?), either very little of
what we have as 'codecs' are actual codecs (only zlib, etc.), or all of
them are.

I would imagine that base64, etc., were made into codecs, or really
encodings, because base64 is an 'encoding' of binary data in base64
format, similar to the way you can think of utf8 as an 'encoding' of
textual data in utf8 format.  I would argue, per the "one obvious way
to do it" principle, that using encodings/codecs should be preferred to
one-shot encoding/decoding functions in various modules (with some
exceptions).
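
To make the contrast concrete, here is a sketch in Python 3 (an
assumption; the thread itself concerns Python 2 and the proposed bytes
type) of the codec route next to a one-shot module function. The input
b"hello" is an arbitrary example.

```python
import base64
import codecs

raw = b"hello"

# Codec route: base64 looked up by name, like any other encoding.
via_codec = codecs.encode(raw, "base64")     # b'aGVsbG8=\n' (trailing newline)
round_trip = codecs.decode(via_codec, "base64")

# One-shot route: a dedicated function in its own module.
via_module = base64.b64encode(raw)           # b'aGVsbG8='
```

The two results differ only in the trailing newline that the codec
appends.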

These exceptions are things like pickle, marshal, struct, etc., which
may take a non-basestring object and convert it into a byte string,
which is arguably an encoding of the object in a particular format.
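
A small sketch of that exception, again in Python 3 and with made-up
data: the input is an arbitrary object rather than a string, and the
output is a byte string in a module-specific format.

```python
import marshal
import pickle

record = {"name": "spam", "count": 3}   # hypothetical example object

# Each module turns a non-string object into bytes in its own format.
pickled = pickle.dumps(record)
marshalled = marshal.dumps([1, 2, 3])

# The matching loads() call recovers an equal object.
restored = pickle.loads(pickled)
restored_list = marshal.loads(marshalled)
```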

 - Josiah