[Python-Dev] bytes / unicode

Stephen J. Turnbull stephen at xemacs.org
Fri Jun 25 18:18:33 CEST 2010


P.J. Eby writes:

 > I do know the ultimate target codec -- that's the point.
 > 
 > IOW, I want to be able to do all my operations by passing
 > target-encoded strings to polymorphic functions.

IOW, you *do* have text and (ignoring efficiency issues) could just as
well use str.  But That Other Code is unreliable, so you need a marker
for your own internal strings indicating that they are validated,
while other strings are not.

This has nothing to do with bytes vs. str as string types, then; it's
all about validated (which your architecture indicates by using the
bytes type) vs. unvalidated (which your architecture indicates with
unicode).  E.g., in the case of your USPS vs. e-commerce example, you
can't even handle all bytes, so not all possible bytes objects are
valid.  And other applications might not be able to handle all of
Japanese, but only a subset, so having valid EUC-JP wouldn't be
enough; you'd also have to check the repertoire -- at which point you
might as well use str.
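
To make "check repertoire" concrete, here's a minimal sketch (the
codec and the optional allowed set are illustrative assumptions, not
anybody's real code):

    def in_repertoire(s, codec='euc-jp', allowed=None):
        # The str is acceptable only if it encodes cleanly to the
        # target codec *and*, optionally, stays inside an
        # application-specific subset of characters.
        try:
            s.encode(codec, errors='strict')
        except UnicodeEncodeError:
            return False
        return allowed is None or all(ch in allowed for ch in s)

Note that str supports this check just as well as bytes would, which
is part of why I say "might as well use str".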

It seems to me what is wanted here is something like Perl's taint
mechanism, for *both* kinds of strings.  Am I missing something?
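
For illustration only (the class names and the codec are mine, not
part of your architecture), a taint-style marker covering both string
types could look like:

    class TaintedStr(str):
        """Marks text that has not been validated yet."""

    class TaintedBytes(bytes):
        """Marks byte strings that have not been validated yet."""

    def untaint(value, codec='euc-jp'):
        # Drop the taint marker only after validation succeeds,
        # analogous to Perl's untaint-by-regexp idiom.
        if isinstance(value, TaintedStr):
            value.encode(codec)     # UnicodeEncodeError if unencodable
            return str(value)       # plain str == validated
        if isinstance(value, TaintedBytes):
            value.decode(codec)     # UnicodeDecodeError if invalid
            return bytes(value)     # plain bytes == validated
        return value                # already plain, hence already validated

The point is that the validated/unvalidated distinction is orthogonal
to the bytes/str distinction.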

But with your architecture, it seems to me that you actually don't
want polymorphic functions in the stdlib.  You want the stdlib
functions to be bytes-oriented if and only if they are reliable.  (This
is what I was saying to Guido elsewhere.)
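
By "polymorphic" I mean a function that accepts bytes or str and
answers in kind, picking its literals to match the argument -- roughly
what the os.path functions do.  A made-up example:

    def ensure_trailing_sep(path):
        # Accept either bytes or str and return the same type,
        # choosing the separator literal from the argument's type.
        sep = b'/' if isinstance(path, bytes) else '/'
        return path if path.endswith(sep) else path + sep

My point above is that your architecture only wants such a function
where its bytes branch is actually reliable for your target-encoded
strings.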

BTW, this was a little unclear to me:

 > [Collisions will] be with other *unicode* strings.  Ones coming
 > from other code, and literals embedded in the stdlib.

What about the literals in the stdlib?  Are you saying they contain
invalid code points for your known output encoding?  Or are you saying
that, with a non-polymorphic unicode stdlib, you get lots of false
positives when combining with your validated bytes?


