Re: [Python-Dev] bytes / unicode

June 25, 2010

      At 01:18 AM 6/26/2010 +0900, Stephen J. Turnbull wrote:
...
It seems to me what is wanted here is something like Perl's taint
mechanism, for *both* kinds of strings.  Am I missing something?
You could certainly view it as a kind of tainting.  The part where 
the type would be bytes-based is indeed somewhat incidental to the 
actual use case -- it's just that if you already have the bytes, and 
all you want to do is tag them (e.g. the WSGI headers case), the 
extra encoding step seems pointless.

A string coercion protocol (that would be used by .join(), .format(), 
__contains__, __mod__, etc.) would allow you to do whatever sort of 
tainted-string or tainted-bytes implementations one might wish to 
have.  I suppose that tainting user inputs (as in Perl) would be just 
as useful of an application of the same coercion protocol.

Actually, I have another use case for this custom string coercion, 
which is that I once wrote a string subclass whose purpose was to 
track the original file and line number of some text.  Even though 
only my code was manipulating the strings, it was very difficult to 
get the tainting to work correctly without extreme care as to the 
string methods used.  (For example, I had to use string addition 
rather than %-formatting.)
...
But with your architecture, it seems to me that you actually don't
want polymorphic functions in the stdlib.  You want the stdlib
functions to be bytes-oriented if and only if they are reliable.  (This
is what I was saying to Guido elsewhere.)
I'm not sure I follow you.  What I want is for the stdlib to create 
stringlike objects of a type determined by the types of the inputs -- 
where the logic for deciding this coercion can be controlled by the 
input objects' types, rather than putting this in the hands of the 
stdlib function.

And of course, this applies to non-stdlib functions, too -- anything 
that simply manipulates user-defined string classes, should allow the 
user-defined classes to determine the coercion of the result.
...
BTW, this was a little unclear to me:
...
[Collisions will] be with other *unicode* strings.  Ones coming
from other code, and literals embedded in the stdlib.
What about the literals in the stdlib?  Are you saying they contain
invalid code points for your known output encoding?  Or are you saying
that with non-polymorphic unicode stdlib, you get lots of false
positives when combining with your validated bytes?
No, I mean that the current string coercion rules cause everything to 
be converted to unicode, thereby discarding the tainting information, 
so to speak.  This applies equally to other tainting use cases, and 
other uses for custom stringlike objects.

Re: [Python-Dev] bytes / unicode

P.J. Eby