
At 04:49 PM 6/25/2010 +0900, Stephen J. Turnbull wrote:
P.J. Eby writes:
This doesn't have to be in the functions; it can be in the *types*. Mixed-type string operations have to do type checking and upcasting already, but if the protocol were open, you could make an encoded-bytes type that would handle the error checking.
Don't you realize that "encoded-bytes" is equivalent to use of a very limited profile of ISO 2022 coding extensions? Such as Emacs/MULE internal encoding or TRON code? It has been tried. It does not work.
I understand how types can do such checking; my point is that the encoded-bytes type doesn't have enough information to do it in the cases where you think it is better than converting to str. There are *no useful operations* that can be done on two encoded-bytes with different encodings unless you know the ultimate target codec.
I do know the ultimate target codec -- that's the point. IOW, I want to be able to do all my operations by passing target-encoded strings to polymorphic functions. Then, the moment something creeps in that won't go to the target codec, I'll be able to track down the hole in the legacy code that's letting bad data creep in.
The only sensible way to define the concatenation of ('ascii', 'English') with ('euc-jp', '日本語') is something like ('ascii', 'English', 'euc-jp', '日本語'), and *not* ('euc-jp', 'English日本語'), because you don't know that the ultimate target codec is 'euc-jp'-compatible. Worse, you need to build in all the information about which codecs are mutually compatible into the encoded-bytes type. For example, if the ultimate target is known to be 'shift_jis', it's trivially compatible with 'ascii' and 'euc-jp' requires a conversion, but latin-9 you can't have.
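To make the dependence on the target codec concrete, here is a small illustrative sketch (the coerce_to_target helper and its name are inventions for this example, not part of any proposal): two tagged byte segments can only be merged once the ultimate target codec is known, and whether each segment passes through untouched, needs conversion, or fails outright depends on that target.

    # Illustrative sketch only: tagged (encoding, data) segments can be
    # merged safely only once the ultimate target codec is known.
    def coerce_to_target(segments, target):
        """Re-encode tagged byte segments into the target codec."""
        out = b""
        for encoding, data in segments:
            if encoding == target:
                out += data                                  # already compatible
            else:
                # May raise UnicodeEncodeError if the text has no
                # representation in the target codec (e.g. latin-9
                # accented characters going to shift_jis).
                out += data.decode(encoding).encode(target)
        return out

    # ASCII passes through to shift_jis unchanged; EUC-JP needs a conversion.
    coerce_to_target([("ascii", b"English"),
                      ("euc-jp", "日本語".encode("euc-jp"))], "shift_jis")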
The interaction won't be with other encoded bytes, it'll be with other *unicode* strings. Ones coming from other code, and literals embedded in the stdlib.
No, the problem is not with the Unicode, it is with the code that allows characters not encodable with the target codec.
And which code that is, precisely, is the thing that may be very difficult to find, unless I can identify it at the first point it enters (and corrupts) my output data. When dealing with a large code base, this may be a nontrivial problem.

P.J. Eby writes:
I do know the ultimate target codec -- that's the point.
IOW, I want to be able to do all my operations by passing target-encoded strings to polymorphic functions.
IOW, you *do* have text and (ignoring efficiency issues) could just as well use str. But That Other Code is unreliable, so you need a marker for your own internal strings indicating that they are validated, while other strings are not. This has nothing to do with bytes vs. str as string types, then; it's all about validated (which your architecture indicates by using the bytes type) vs. unvalidated (which your architecture indicates with unicode).

E.g., in the case of your USPS vs. e-commerce example, you can't even handle all bytes, so not all possible bytes objects are valid. And other applications might not be able to handle all Japanese, but only a subset, so having valid EUC-JP wouldn't be enough; you'd have to check the repertoire -- might as well use str.

It seems to me what is wanted here is something like Perl's taint mechanism, for *both* kinds of strings. Am I missing something?

But with your architecture, it seems to me that you actually don't want polymorphic functions in the stdlib. You want the stdlib functions to be bytes-oriented if and only if they are reliable. (This is what I was saying to Guido elsewhere.)

BTW, this was a little unclear to me:
[Collisions will] be with other *unicode* strings. Ones coming from other code, and literals embedded in the stdlib.
What about the literals in the stdlib? Are you saying they contain invalid code points for your known output encoding? Or are you saying that with non-polymorphic unicode stdlib, you get lots of false positives when combining with your validated bytes?

At 01:18 AM 6/26/2010 +0900, Stephen J. Turnbull wrote:
It seems to me what is wanted here is something like Perl's taint mechanism, for *both* kinds of strings. Am I missing something?
You could certainly view it as a kind of tainting. The part where the type would be bytes-based is indeed somewhat incidental to the actual use case -- it's just that if you already have the bytes, and all you want to do is tag them (e.g. the WSGI headers case), the extra encoding step seems pointless.

A string coercion protocol (one that would be used by .join(), .format(), __contains__, __mod__, etc.) would allow you to implement whatever sort of tainted-string or tainted-bytes types one might wish to have. I suppose that tainting user inputs (as in Perl) would be just as useful an application of the same coercion protocol.

Actually, I have another use case for this custom string coercion: I once wrote a string subclass whose purpose was to track the original file and line number of some text. Even though only my code was manipulating the strings, it was very difficult to get the tainting to work correctly without extreme care as to the string methods used. (For example, I had to use string addition rather than %-formatting.)
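For concreteness, here is a minimal reconstruction (mine, not the original code) of that kind of file/line-tracking subclass. Addition can be made to preserve the subclass, because a subclass's reflected operator gets priority even when a plain literal is on the left; %-formatting on a plain literal offers no equivalent hook, so the result silently collapses to a plain str.

    class SourceString(str):
        """Sketch of a str subclass that remembers where its text came from."""

        def __new__(cls, text, filename="<unknown>", lineno=0):
            self = super().__new__(cls, text)
            self.filename = filename
            self.lineno = lineno
            return self

        def __add__(self, other):
            # Propagate the origin of the left-hand operand.
            return SourceString(str(self) + str(other), self.filename, self.lineno)

        def __radd__(self, other):
            # Called for "literal" + SourceString(...); the origin survives.
            return SourceString(str(other) + str(self), self.filename, self.lineno)

    s = SourceString("spam", "conf.txt", 3)
    print(type("x = " + s))    # SourceString -- file/line info preserved
    print(type("x = %s" % s))  # plain str -- the tracking is silently lost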
But with your architecture, it seems to me that you actually don't want polymorphic functions in the stdlib. You want the stdlib functions to be bytes-oriented if and only if they are reliable. (This is what I was saying to Guido elsewhere.)
I'm not sure I follow you. What I want is for the stdlib to create stringlike objects of a type determined by the types of the inputs -- where the logic for deciding this coercion can be controlled by the input objects' types, rather than putting this in the hands of the stdlib function. And of course, this applies to non-stdlib functions, too -- anything that simply manipulates user-defined string classes, should allow the user-defined classes to determine the coercion of the result.
BTW, this was a little unclear to me:
[Collisions will] be with other *unicode* strings. Ones coming from other code, and literals embedded in the stdlib.
What about the literals in the stdlib? Are you saying they contain invalid code points for your known output encoding? Or are you saying that with non-polymorphic unicode stdlib, you get lots of false positives when combining with your validated bytes?
No, I mean that the current string coercion rules cause everything to be converted to unicode, thereby discarding the tainting information, so to speak. This applies equally to other tainting use cases, and other uses for custom stringlike objects.

P.J. Eby writes:
it's just that if you already have the bytes, and all you want to do is tag them (e.g. the WSGI headers case), the extra encoding step seems pointless.
Well, I'll have to concede that unless and until I get involved in the WSGI development effort.<wink>
But with your architecture, it seems to me that you actually don't want polymorphic functions in the stdlib. You want the stdlib functions to be bytes-oriented if and only if they are reliable. (This is what I was saying to Guido elsewhere.)
I'm not sure I follow you.
What I'm saying here is that if bytes are the signal of validity, and the stdlib functions preserve validity, then it's better to have the stdlib functions object to unicode data as an argument. Compare the alternative: it returns a unicode object which might get passed around for a while before one of your functions receives it and identifies it as unvalidated data. But you agree that there are better mechanisms for validation (although not available in Python yet), so I don't see this as a potential obstacle to polymorphism now.
What I want is for the stdlib to create stringlike objects of a type determined by the types of the inputs --
In general this is a hard problem, though. Polymorphism, OK, one-way tainting OK, but in general combining related types is pretty arbitrary, and as in the encoded-bytes case, the result type often varies depending on expectations of callers, not the types of the data.

At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote:
What I'm saying here is that if bytes are the signal of validity, and the stdlib functions preserve validity, then it's better to have the stdlib functions object to unicode data as an argument. Compare the alternative: it returns a unicode object which might get passed around for a while before one of your functions receives it and identifies it as unvalidated data.
I still don't follow, since passing in bytes should return bytes. Returning unicode would be an error, in the case of a "polymorphic" function (per Guido).
But you agree that there are better mechanisms for validation (although not available in Python yet), so I don't see this as a potential obstacle to polymorphism now.
Nope. I'm just saying that, given two bytestrings to url-join or path join or whatever, a polymorph should hand back a bytestring. This seems pretty uncontroversial.
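For instance, this is already how os.path behaves in Python 3 (posixpath shown here so the separator is predictable): bytes in gives bytes out, str in gives str out, and mixing the two is rejected.

    import posixpath

    posixpath.join("usr", "lib")      # -> 'usr/lib'   (str in, str out)
    posixpath.join(b"usr", b"lib")    # -> b'usr/lib'  (bytes in, bytes out)
    # posixpath.join(b"usr", "lib")   # mixing str and bytes raises TypeError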
What I want is for the stdlib to create stringlike objects of a type determined by the types of the inputs --
In general this is a hard problem, though. Polymorphism, OK, one-way tainting OK, but in general combining related types is pretty arbitrary, and as in the encoded-bytes case, the result type often varies depending on expectations of callers, not the types of the data.
But the caller can enforce those expectations by passing in arguments whose types do what they want in such cases, as long as the string literals used by the function don't get to override the relevant parts of the string protocol(s). The idea that I'm proposing is that the basic string and byte types should defer to "user-defined" string types for mixed type operations, so that polymorphism of string-manipulation functions is the *default* case, rather than a *special* case. This makes tainting easier to implement, as well as optimizing and other special cases (like my "source string w/file and line info", or a string with font/formatting attributes).
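Here is a rough sketch (my own, using only the hook that exists today -- reflected operators on a subclass) of what that default polymorphism would buy: a function written purely against plain literals keeps the caller's type for +, but join() and friends still collapse everything to plain str, which is exactly the gap the proposed deferral would close.

    class Tainted(str):
        """Sketch: a str whose taint marker should survive manipulation."""

        def __add__(self, other):
            return Tainted(str(self) + str(other))

        def __radd__(self, other):
            # "literal" + Tainted(...) lands here, so the taint survives.
            return Tainted(str(other) + str(self))

    def make_url(host, path):
        # An ordinary function, written with no knowledge of Tainted.
        return "http://" + host + "/" + path

    print(type(make_url(Tainted("example.com"), "index.html")))  # Tainted
    print(type("/".join(["a", Tainted("b")])))                   # plain str today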

On Sun, Jun 27, 2010 at 4:17 AM, P.J. Eby <pje@telecommunity.com> wrote:
The idea that I'm proposing is that the basic string and byte types should defer to "user-defined" string types for mixed type operations, so that polymorphism of string-manipulation functions is the *default* case, rather than a *special* case. This makes tainting easier to implement, as well as optimizing and other special cases (like my "source string w/file and line info", or a string with font/formatting attributes).
Rather than building this into the base string type, perhaps it would be better (at least initially) to add in a polymorphic str subtype that worked along the following lines:

1. Has an encoded argument in the constructor (e.g. poly_str("/", encoded=b"/"))
2. If given objects with an encode() method, assumes they're strings and uses its own parent class methods
3. If given objects with a decode() method, assumes they're encoded and delegates to the encoded attribute

str/bytes agnostic functions would need to invoke poly_str deliberately, while bytes-only and text-only algorithms could just use the appropriate literals. Third party types would be supported to some degree (by having either encode or decode methods), although they could still run into trouble with some operations. (While full support for third party strings and byte sequence implementations is an interesting idea, I think it's overkill for the specific problem of making it easier to write str/bytes agnostic functions for tasks like URL parsing.)

Regards,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
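A minimal sketch of how such a poly_str subtype might look; only join() is shown, and both the method set and the ASCII default for the encoded form are assumptions of the sketch rather than part of the outline above.

    class poly_str(str):
        """Sketch: a str that also carries a pre-encoded bytes equivalent."""

        def __new__(cls, text, encoded=None):
            self = super().__new__(cls, text)
            # Assumption for this sketch: default the encoded form to ASCII.
            self.encoded = text.encode("ascii") if encoded is None else encoded
            return self

        def join(self, items):
            items = list(items)
            if items and hasattr(items[0], "decode"):
                # Items look like encoded bytes: delegate to the encoded form.
                return self.encoded.join(items)
            # Otherwise assume text and use the normal str machinery.
            return str.join(self, items)

    sep = poly_str("/", encoded=b"/")
    sep.join(["a", "b"])      # -> 'a/b'
    sep.join([b"a", b"b"])    # -> b'a/b'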

P.J. Eby writes:
At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote:
What I'm saying here is that if bytes are the signal of validity, and the stdlib functions preserve validity, then it's better to have the stdlib functions object to unicode data as an argument. Compare the alternative: it returns a unicode object which might get passed around for a while before one of your functions receives it and identifies it as unvalidated data.
I still don't follow,
OK, I give up, since it was your use case that concerned me. I obviously misunderstood. Sorry for the confusion.

Sign me,
+1 on polymorphic functions in Tsukuba, Japan
In general this is a hard problem, though. Polymorphism, OK, one-way tainting OK, but in general combining related types is pretty arbitrary, and as in the encoded-bytes case, the result type often varies depending on expectations of callers, not the types of the data.
But the caller can enforce those expectations by passing in arguments whose types do what they want in such cases, as long as the string literals used by the function don't get to override the relevant parts of the string protocol(s).
This simply isn't true for encoded bytes as proposed. For encoded text, the current encoding has no deterministic relationship to the desired encoding (at the level of generality of the stdlib; of course in specific applications it may be mandated by a standard or private convention). I will have to pass on your other user-defined string types. I've never tried to implement one. I only wanted to point out that a user-controllable tainted string type would be preferable to confounding "unicode" with "tainted".