[Python-Dev] bytes / unicode

Sun Jun 27 16:03:06 CEST 2010

P.J. Eby writes:
 > At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote:
 > >What I'm saying here is that if bytes are the signal of validity, and
 > >the stdlib functions preserve validity, then it's better to have the
 > >stdlib functions object to unicode data as an argument.  Compare the
 > >alternative: it returns a unicode object which might get passed around
 > >for a while before one of your functions receives it and identifies it
 > >as unvalidated data.
 > 
 > I still don't follow,

OK, I give up, since it was your use case that concerned me.  I
obviously misunderstood.  Sorry for the confusion.

    Sign me,
    +1 on polymorphic functions in Tsukuba, Japan

 > >In general this is a hard problem, though.  Polymorphism, OK, one-way
 > >tainting OK, but in general combining related types is pretty
 > >arbitrary, and as in the encoded-bytes case, the result type often
 > >varies depending on expectations of callers, not the types of the
 > >data.
 > 
 > But the caller can enforce those expectations by passing in arguments 
 > whose types do what they want in such cases, as long as the string 
 > literals used by the function don't get to override the relevant 
 > parts of the string protocol(s).

This simply isn't true for encoded bytes as proposed.  For encoded
text, the encoding the data currently has bears no deterministic
relationship to the encoding the caller wants (at the level of
generality of the stdlib; of course, in specific applications the
encoding may be mandated by a standard or private convention).
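
To make that concrete, here is a rough sketch, not anything from the
stdlib.  The function join_path is a made-up example of a polymorphic
function: the caller picks the result type by passing bytes or str.
What the caller cannot pass in through the argument types is the
encoding, so the bytes that come back may or may not decode the way
the next consumer expects.

```python
def join_path(base, name):
    """Hypothetical polymorphic helper: the result type follows the arguments."""
    # Pick the separator literal to match the argument type, so the
    # function's own literals do not override the caller's choice.
    sep = b"/" if isinstance(base, bytes) else "/"
    return base.rstrip(sep) + sep + name


# The caller controls the *type* of the result ...
text_result = join_path("menu", "café")

# ... but plain bytes carry no record of their encoding.
raw = "café".encode("latin-1")
joined = join_path(b"menu", raw)

try:
    joined.decode("utf-8")        # the encoding a later consumer might want
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False       # the bytes were Latin-1 all along
```

The polymorphism works as advertised in both calls.  The failure only
shows up later, at the point where somebody has to guess the encoding.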

I will have to pass on your other user-defined string types.  I've
never tried to implement one.  I only wanted to point out that a
user-controllable tainted string type would be preferable to
confounding "unicode" with "tainted".
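
For what it's worth, the kind of thing I have in mind looks roughly
like the sketch below.  The names Tainted and untaint are invented for
illustration, and only concatenation is handled; every other str
method (slicing, join, format, and so on) would still hand back a
plain str, so a real implementation would need a good deal more.

```python
class Tainted(str):
    """Hypothetical marker type for text that has not been validated."""

    def __add__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # One-way tainting: anything combined with tainted text is tainted.
        return Tainted(str.__add__(self, other))

    def __radd__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return Tainted(str.__add__(other, self))


def untaint(value, is_valid):
    """Return a plain str if the caller-supplied check accepts the value."""
    if not is_valid(value):
        raise ValueError("validation failed")
    return str(value)             # str() of a subclass instance gives a plain str


user_input = Tainted("index.html")
path = "/srv/www/" + user_input             # still Tainted
clean = untaint(path, lambda s: ".." not in s)
```

Here the taint is a property the application chooses to attach, and
it is independent of whether the data happens to be text or bytes.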