
At 01:18 AM 6/26/2010 +0900, Stephen J. Turnbull wrote:
It seems to me what is wanted here is something like Perl's taint mechanism, for *both* kinds of strings. Am I missing something?
You could certainly view it as a kind of tainting. The part where the type would be bytes-based is indeed somewhat incidental to the actual use case -- it's just that if you already have the bytes, and all you want to do is tag them (e.g. the WSGI headers case), the extra encoding step seems pointless. A string coercion protocol (that would be used by .join(), .format(), __contains__, __mod__, etc.) would allow you to do whatever sort of tainted-string or tainted-bytes implementations one might wish to have. I suppose that tainting user inputs (as in Perl) would be just as useful of an application of the same coercion protocol. Actually, I have another use case for this custom string coercion, which is that I once wrote a string subclass whose purpose was to track the original file and line number of some text. Even though only my code was manipulating the strings, it was very difficult to get the tainting to work correctly without extreme care as to the string methods used. (For example, I had to use string addition rather than %-formatting.)
But with your architecture, it seems to me that you actually don't want polymorphic functions in the stdlib. You want the stdlib functions to be bytes-oriented if and only if they are reliable. (This is what I was saying to Guido elsewhere.)
I'm not sure I follow you. What I want is for the stdlib to create stringlike objects of a type determined by the types of the inputs -- where the logic for deciding this coercion can be controlled by the input objects' types, rather than putting this in the hands of the stdlib function. And of course, this applies to non-stdlib functions, too -- anything that simply manipulates user-defined string classes, should allow the user-defined classes to determine the coercion of the result.
BTW, this was a little unclear to me:
[Collisions will] be with other *unicode* strings. Ones coming from other code, and literals embedded in the stdlib.
What about the literals in the stdlib? Are you saying they contain invalid code points for your known output encoding? Or are you saying that with non-polymorphic unicode stdlib, you get lots of false positives when combining with your validated bytes?
No, I mean that the current string coercion rules cause everything to be converted to unicode, thereby discarding the tainting information, so to speak. This applies equally to other tainting use cases, and other uses for custom stringlike objects.