[Python-3000] setup.py fails in the py3k-struni branch

"Martin v. Löwis" martin at v.loewis.de
Fri Jun 15 08:03:41 CEST 2007


>>> Then bytes can be bytes, and unicode can be unicode, and str8 can be
>>> encoded strings for interfacing with the outside non-unicode world.  Or
>>> something like that. <shrug>
>>
>> Hm... Requiring each str8 instance to have an encoding might be a
>> problem -- it means you can't just create one from a bytes object.
>> What would be the use of this information? What would happen on
>> concatenation? On slicing? (Slicing can break the encoding!)
> 
> Round trips to and from bytes should work just fine.  Why would that be
> a problem?

I'm strongly opposed to adding encoding information to str8 objects.
I think they will eventually go away anyway; adding that kind of
overhead now wastes both developer time and memory, and it has
all the semantic issues that Guido points out.

As for creating str8 objects from bytes objects: If you want
the str8 object to carry an encoding, you would have to *specify*
the encoding when creating the str8 object, since the bytes object
does not have that information. This is *very* hard, as you may
not know what the encoding is when you need to create the str8
object.
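
To illustrate why the encoding would have to be supplied from outside,
here is a minimal sketch. It assumes today's Python 3 bytes type as a
stand-in for a str8/bytes value, and the variable names are only
illustrative. The same bytes decode without error under more than one
encoding, so the bytes alone cannot say which one was meant.

```python
# Assumption: modern Python 3 `bytes` stands in for a str8/bytes value.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3, U+00A9

# Both decodes succeed, yet they produce different text.
print(len(as_utf8), len(as_latin1))
```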

> There really is no safety in concatenation and slicing of encoded 8bit
> strings now.  If by accident two strings of different encodings are
> combined, then all bets are off.  And since there is no way to ask a
> string what its current encoding is, it becomes an easy-to-make and
> hard-to-find silent error.  So we have to be very careful not to mix
> encoded strings with different encodings.

Please answer the question: what would happen on concatenation? In
particular, what is the value of the encoding for the result
of the concatenated string if one input is "latin-1", and the
other one is "utf-8"?

It's easy to tell what happens now: the bytes of those input
strings are just appended; the result string no longer follows
a consistent character encoding. That answer does not carry over
to your proposed modification, as it does not say what the value
of the .encoding attribute of the str8 would be after
concatenation (likewise for slicing).
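
A small sketch of the current behaviour, again assuming Python 3 bytes
as a stand-in for str8 (the helper decodes_as is hypothetical and exists
only for this illustration): the bytes are appended, and the result is
no longer valid in one of the two input encodings.

```python
# Assumption: Python 3 `bytes` stands in for str8 values.
latin1_part = "\u00e9".encode("latin-1")  # b'\xe9'
utf8_part = "\u00e9".encode("utf-8")      # b'\xc3\xa9'

# Concatenation just appends the bytes; nothing checks the encodings.
combined = latin1_part + utf8_part


def decodes_as(data, encoding):
    """Hypothetical helper: True if data decodes cleanly as encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(combined)
print(decodes_as(combined, "utf-8"))
```

Note that latin-1 accepts any byte sequence, so decoding the result as
latin-1 raises no error; it silently yields three characters instead of
two.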

> It's not too different from trying to find the current unicode and str8
> issues in the py3k-struni branch.

I do not understand this sentence. What is it that is not too
different from trying to find those issues?

> Concatenating str8 and str types is a bit safer, as long as the str8 is
> in "the" default encoding, but it may still be an unintended implicit
> conversion.  And if it's not in the default encoding, then all bets are
> off again.

Sure. However, the str8 type will go away, and along with it all these
issues.

> The use would be in ensuring the integrity of encoded strings.
> Concatenating strings with different encodings could then produce
> errors.

Ok. What about slicing?
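
To make the slicing concern concrete, a minimal sketch (assuming Python 3
bytes in place of str8, with illustrative names): a byte-level slice can
cut a multi-byte UTF-8 character in half, so the slice is no longer
valid in the encoding of the original.

```python
# Assumption: Python 3 `bytes` stands in for a UTF-8 encoded str8.
encoded = "na\u00efve".encode("utf-8")  # b'na\xc3\xafve', six bytes

# Slice by byte index; this cuts the two-byte character in the middle.
head = encoded[:3]

try:
    head.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

print(head, slice_is_valid)
```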

Regards,
Martin


