[Python-Dev] bytes.from_hex()

Sat Feb 18 20:46:40 CET 2006

Ron Adam <rrr at ronadam.com> wrote:
> Josiah Carlson wrote:
[snip]
> > Again, the problem is ambiguity; what does bytes.recode(something) mean?
> > Are we encoding _to_ something, or are we decoding _from_ something? 
> 
> This was just an example of one way that might work, but here are my 
> thoughts on why I think it might be good.
> 
> In this case, the ambiguity is reduced as far as the encoding and 
> decoding operations are concerned.
> 
>       somestring = encodings.tostr( someunicodestr, 'latin-1')
> 
> It's pretty clear what is happening to me.
> 
>      It will encode an object named someunicodestr to a string, using 
> the 'latin-1' encoder.

But now how do you get it back?  encodings.tounicode(..., 'latin-1')?,
unicode(..., 'latin-1')?

What about string transformations:
    somestring = encodings.tostr(somestr, 'base64')

How do we get that back?  Calling encodings.tostr() again is completely
ambiguous, and str(somestring, 'base64') seems a bit awkward (switching
namespaces).
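
To make the round trip concrete, here is a minimal sketch using the
existing codecs module. It assumes a Python 3 interpreter, where bytes
stands in for the 8-bit str discussed in this thread. The direction is
carried by the function name (encode or decode), not by the codec name,
which is the same in both calls.

```python
import codecs

# Assumption: Python 3, where bytes plays the role of the 8-bit str.
original = b"some string"

# Same codec name in both calls; only the function name gives the direction.
encoded = codecs.encode(original, "base64")
decoded = codecs.decode(encoded, "base64")

# The round trip only works because two distinct operations exist.
assert encoded != original
assert decoded == original
```

A single same-type entry point would have to recover that direction
from somewhere else.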

> And also result in clear errors if the specified encoding is 
> unavailable or, if it is available, not compatible with the given 
> *someunicodestr* object type.
> 
> Further hints could be gained by:
> 
>      help(encodings.tostr)
> 
> Which could result in... something like...
>      """
>      encodings.tostr( <string|unicode>, <encoder> ) -> string
> 
>      Encode a unicode string using an encoder codec to a
>      non-unicode string or transform a non-unicode string
>      to another non-unicode string using an encoder codec.
>      """
> 
> And if that's not enough, then help(encodings) could give more clues. 
> These steps would be what I would do. And then the next thing would be 
> to find the python docs entry on encodings.
> 
> Placing them in encodings seems like a fairly good place to look for 
> these functions if you are working with encodings.  So I find that just 
> as convenient as having them be string methods.
> 
> There is no intermediate default encoding involved above, (the bytes 
> object is used instead), so you wouldn't get some of the messages the 
> present system results in when ascii is the default.
> 
> (Yes, I know it won't when P3K is here also)
> 
> > Are we going to need to embed the direction in the encoding/decoding
> > name (to_base64, from_base64, etc.)?  That doesn't seem any better than
> > binascii.b2a_base64.
> 
> No, that's why I suggested two separate lists (or dictionaries might be 
> better).  They can contain the same names, but the lists they are in 
> determine the context and point to the needed codec.  And that step is 
> abstracted out by putting it inside the encodings.tostr() and 
> encodings.tounicode() functions.
> 
> So either function would call 'base64' from the correct codec list and 
> get the correct encoding or decoding codec it needs.

Either the API you have described is incomplete, you haven't noticed the
directional ambiguity you are describing, or I have completely lost it.
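
A minimal sketch of the two-list lookup may show where the direction
gets lost. Everything here is hypothetical: to_bytes() and to_text()
stand in for the proposed encodings.tostr() and encodings.tounicode(),
the two dictionaries stand in for the two codec lists, and the code
assumes Python 3 (bytes for str, str for unicode).

```python
import codecs

# Hypothetical registries, one per target type. A codec name may appear
# in both, and the registry picks the direction.
TO_BYTES = {
    "latin-1": lambda text: text.encode("latin-1"),
    "base64": lambda data: codecs.encode(data, "base64"),
}
TO_TEXT = {
    "latin-1": lambda data: data.decode("latin-1"),
}


def to_bytes(obj, name):
    """Hypothetical stand-in for encodings.tostr()."""
    return TO_BYTES[name](obj)


def to_text(obj, name):
    """Hypothetical stand-in for encodings.tounicode()."""
    return TO_TEXT[name](obj)


# Cross-type codecs are fine: the target type implies the direction.
raw = to_bytes("caf\xe9", "latin-1")
assert to_text(raw, "latin-1") == "caf\xe9"

# Same-type codecs are not: undoing base64 must also return bytes, so it
# would have to go through to_bytes(), whose 'base64' slot is taken.
packed = to_bytes(b"payload", "base64")
repacked = to_bytes(packed, "base64")
assert repacked != b"payload"  # encoded twice, not decoded
```

Under these assumptions, the target type settles the direction only
when the source and target types differ.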

> > What about .reencode and .redecode?  It seems as
> > though the 're' added as a prefix to .encode and .decode makes it
> > clearer that you get the same type back as you put in, and it is also
> > unambiguous to direction.
> 
> But then wouldn't we end up with a multitude of ways to do things?
> 
>      s.encode(codec) == s.redecode(codec)
>      s.decode(codec) == s.reencode(codec)
>      unicode(s, codec) == s.decode(codec)
>      str(u, codec) == u.encode(codec)
>      str(s, codec) == s.encode(codec)
>      unicode(s, codec) == s.reencode(codec)
>      str(u, codec) == s.redecode(codec)
>      str(s, codec) == s.redecode(codec)
> 
> Umm .. did I miss any?  Which ones would you remove?
> 
> Which ones of those will succeed with which codecs?

I must not be expressing myself very well.

Right now:
    s.encode() -> s
    s.decode() -> s, u
    u.encode() -> s, u
    u.decode() -> u

Martin et al's desired change to encode/decode:
    s.decode() -> u
    u.encode() -> s

No others.

What my proposed .reencode() and .redecode() would get you, given
Martin et al's desired change:
    s.reencode() -> s (you get encoded strings as strings)
    s.redecode() -> s (you get decoded strings as strings)
    u.reencode() -> u (you get encoded unicode as unicode)
    u.redecode() -> u (you get decoded unicode as unicode)

If one wants to go from unicode to string, one uses .encode(). If one
wants to go from string to unicode, one uses .decode().  If one wants to
keep their type unchanged, but encode or decode the data/text, one would
use .reencode() and .redecode(), depending on whether their source is an
encoded block of data, or the original data they want to encode.

The other bonus is that, given .reencode() and .redecode(), one can
quite easily verify that the source is valid as a source, and that
you would get back the proper type.  How this would occur behind the
scenes is beyond the scope of this discussion, but it seems to me to be
easy, given what I've read about the current mechanism.
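
As a rough sketch only, the type-preserving pair could be written as
plain functions on top of the existing codecs module. The names
reencode() and redecode() are the hypothetical ones from this message,
not an existing API, and the code assumes Python 3 (bytes for str, str
for unicode). The type check is one possible way to verify that the
proper type comes back.

```python
import codecs


def _same_type(obj, result, codec):
    # Reject codecs that change the type; those belong to
    # .encode()/.decode() under the proposed split.
    if type(result) is not type(obj):
        raise TypeError(
            "codec %r maps %s to %s; use encode()/decode() instead"
            % (codec, type(obj).__name__, type(result).__name__)
        )
    return result


def reencode(obj, codec):
    """Hypothetical: encode obj, returning the same type that was passed in."""
    return _same_type(obj, codecs.encode(obj, codec), codec)


def redecode(obj, codec):
    """Hypothetical: decode obj, returning the same type that was passed in."""
    return _same_type(obj, codecs.decode(obj, codec), codec)


# bytes -> bytes
packed = reencode(b"data", "base64")
assert redecode(packed, "base64") == b"data"

# text -> text
scrambled = reencode("hello", "rot_13")
assert redecode(scrambled, "rot_13") == "hello"

# A type-changing codec is refused.
try:
    reencode("hello", "latin-1")
except TypeError:
    refused = True
else:
    refused = False
assert refused
```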

Whether the str and unicode constructors do their own codec
transformations is beside the point.

> The method bytes.recode(), always does a byte transformation which can 
> be almost anything.  It's the context bytes.recode() is used in that 
> determines what's happening.  In the above cases, it's using an encoding 
> transformation, so what it's doing is precisely what you would expect from 
> its context.

Indeed, there is a translation going on, but it is not clear as to
whether you are encoding _to_ something or _from_ something.  What does
s.recode('base64') mean?  Are you encoding _to_ base64 or _from_ base64? 
That's where the ambiguity lies.

> There isn't a bytes.decode(), since that's just another transformation. 
> So only the one method is needed.  Which makes it easier to learn.

But ambiguous.

> > The question remains: is str.decode() returning a string or unicode
> > depending on the argument passed, when the argument quite literally
> > names the codec involved, difficult to understand?  I don't believe so;
> > am I the only one?
> 
> Using help(str.decode) and help(str.encode) gives:
> 
>       S.decode([encoding[,errors]]) -> object
> 
>       S.encode([encoding[,errors]]) -> object
> 
> These look an awful lot alike.  The descriptions are nearly identical as 
> well.  The Python docs just reproduce the doc strings (or come close), 
> adding only a few words.
> 
> Learning how the current system works comes awfully close to reverse 
> engineering.  Maybe I'm overstating it a bit, but I suspect many end up 
> doing exactly that in order to learn how Python does it.

Again, we _need_ better documentation, regardless of whether or when the
removal of some or all .encode()/.decode() methods happens.

 - Josiah