[Python-3000] setup.py fails in the py3k-struni branch

Ron Adam rrr at ronadam.com
Sat Jun 16 00:38:58 CEST 2007



Martin v. Löwis wrote:
>>>> Then bytes can be bytes, and unicode can be unicode, and str8 can be
>>>> encoded strings for interfacing with the outside non-unicode world.  Or
>>>> something like that. <shrug>
>>> Hm... Requiring each str8 instance to have an encoding might be a
>>> problem -- it means you can't just create one from a bytes object.
>>> What would be the use of this information? What would happen on
>>> concatenation? On slicing? (Slicing can break the encoding!)
>> Round trips to and from bytes should work just fine.  Why would that be
>> a problem?
> 
> I'm strongly opposed to adding encoding information to str8 objects.
> I think they will eventually go away, anyway; adding that kind of
> overhead now is both a waste of developer's time and of memory
> resources; plus it has all the semantic issues that Guido points
> out.

This was in the context of the community deciding that a str8 type 
is needed and that it does not go away.

The alternative is for str8 to be replaced by bytes objects, which I believe 
was, and still is, the plan if possible.

The same semantic issues will also be present in bytes objects in one form 
or another when handling data acquired from sources that use encoded 
strings.  They don't go away even if str8 does go away.

It sort of depends on how someone wants to handle situations where encoded 
strings are encountered.  Do they decode them and convert everything to 
unicode, converting back as needed for any output?  Or do they keep the 
data in the encoded form for the duration?  I expect different people 
will feel differently on this.
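The first style might be sketched like this in Python 3 terms (the byte 
values are just an illustrative UTF-8 sample, not from any real program):

```python
# "Decode at the boundary" style: bytes come in, all processing happens
# on unicode text, and bytes go back out.
raw = b'caf\xc3\xa9'            # UTF-8 encoded input from the outside world
text = raw.decode('utf-8')      # decode once on the way in  -> 'café'
text = text.upper()             # process as unicode text    -> 'CAFÉ'
out = text.encode('utf-8')      # encode once on the way out
```

The alternative style keeps `raw` as bytes throughout and only decodes 
pieces as they are actually needed.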


> As for creating str8 objects from bytes objects: If you want
> the str8 object to carry an encoding, you would have to *specify*
> the encoding when creating the str8 object, since the bytes object
> does not have that information. This is *very* hard, as you may
> not know what the encoding is when you need to create the str8
> object.

True, and this also applies if you want to convert an already encoded bytes 
object to unicode.

>> There really is no safety in concatenation and slicing of encoded 8bit
>> strings now.  If by accident two strings of different encodings are
>> combined, then all bets are off.  And since there is no way to ask a
>> string what it's current encoding is, it becomes an easy to make and
>> hard to find silent error.  So we have to be very careful not to mix
>> encoded strings with different encodings.
> 
> Please answer the question: what would happen on concatenation? In
> particular, what is the value of the encoding for the result
> of the concatenated string if one input is "latin-1", and the
> other one is "utf-8"?

I was trying to avoid this becoming a long thread.  If these ideas seem 
worth discussing, maybe we can move the reply to python-ideas and work out 
the details there.


But to not avoid your questions...

Concatenation of unlike encoded objects should of course raise an error. 
It's just not possible to do presently.

I agree that putting an attribute on a str8 object instance is only a 
partial solution and does waste some space.  (I changed my mind on this 
yesterday morning after thinking about it some more.)

So I offered an alternative suggestion that it may be possible to use 
dynamically created encoded str types, which avoids putting an attribute on 
every instance, and can handle the problems of slicing, concatenation, and 
conversion.  I didn't go into the details because it was, and is, only a 
general suggestion or thought.

One approach is to use a factory function that uses metaclasses or mixins 
to create these types based either on a str base type or a bytes object.

      Latin1 = get_encoded_str_type('latin-1')

      s1 = Latin1('Hello ')

      Utf8 = get_encoded_str_type('utf-8')

      s2 = Utf8('World')

      s = s1 + s2                 -> Exception Raised

      s = s1 + type(s1)(s2)       -> latin-1 string

      s = type(s2)(s1) + s2       -> utf-8 string

      lines = [s1, s2, ..., sn]
      s = Utf8.join([Utf8(s) for s in lines])

In this last case the strings in lines can even be of arbitrary encoded 
types and they would still all get re-encoded to utf-8 correctly.  Chances 
are you would never have a list of strings with many different encodings, 
but you may have a list of strings with types unknown to a local function.

There can probably be various ways of creating these types that do not 
require them to be built in.  The advantage is they can be smarter about 
concatenation, slicing, and transforming to bytes and unicode and back. 
It's really just a higher level API.
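As a rough illustration of what such a factory might look like in Python 3 
terms: `get_encoded_str_type` and the classes it builds are hypothetical 
names, not an existing API.  This sketch tags each class (not each 
instance) with its encoding, refuses mixed-encoding concatenation, 
transcodes on conversion, and slices by character so multibyte sequences 
are never split:

```python
# One class per encoding is created and cached, so the per-instance
# attribute overhead Martin objected to is avoided.
_cache = {}

def get_encoded_str_type(encoding):
    """Return a bytes subclass that remembers its encoding."""
    enc = encoding.lower()
    if enc in _cache:
        return _cache[enc]

    class EncodedStr(bytes):
        encoding = enc

        def __new__(cls, value):
            if isinstance(value, str):
                # Plain text: encode it into this type's encoding.
                return super().__new__(cls, value.encode(enc))
            if isinstance(value, bytes):
                src = getattr(value, 'encoding', None)
                if src is not None and src != enc:
                    # Another encoded type: transcode via unicode.
                    value = bytes(value).decode(src).encode(enc)
                return super().__new__(cls, bytes(value))
            raise TypeError('expected str or bytes')

        def __add__(self, other):
            # Concatenating unlike encodings raises instead of
            # silently corrupting the data.
            if getattr(other, 'encoding', None) != enc:
                raise TypeError('cannot concatenate %s data with %r'
                                % (enc, other))
            return type(self)(bytes(self) + bytes(other))

        def __getitem__(self, index):
            if isinstance(index, slice):
                # Slice by character, not by byte, so the result is
                # always valid in this encoding.
                return type(self)(str(self)[index])
            return super().__getitem__(index)

        def __str__(self):
            return bytes(self).decode(enc)

    _cache[enc] = EncodedStr
    return EncodedStr
```

With this sketch, `s1 + s2` from the example above raises TypeError, while 
`s1 + type(s1)(s2)` transcodes s2 to latin-1 first and succeeds.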

Whether it's a waste of time and effort, <shrug>, I suppose that depends on 
who is doing it and whether or not they think so.  It could also start as a 
third-party module.  Then, if it becomes popular, it can be included in 
Python in a future version.


> It's easy to tell what happens now: the bytes of those input
> strings are just appended; the result string does not follow
> a consistent character encoding anymore. This answer does
> not apply to your proposed modification, as it does not answer
> what the value of the .encoding attribute of the str8 would be
> after concatenation (likewise for slicing).

And what is the use of appending unlike encoded str8 types?  Almost 
anything I can think of is a hack.

I disagree about it being easy to tell what happens.  That's only true on a 
micro level.  On a macro level, it may work out OK, or it may cause an 
error to be raised at some point, or it may be completely silent and the 
data you send out is corrupted.  In that case, something even worse may 
happen when the data is used.  Like missing Mars orbiters or crashed 
landers.  That does not sound like "easy to tell what happens" to me.

I think what Guido is thinking is that we may need to keep str8 around (for 
a while) as a C-compatible string type for purposes of interfacing to C code.

What I was thinking about was simplifying encoding and decoding and 
avoiding issues that are caused by mismatched strings of *any* type.  A 
different problem set, which may need a different solution.


>> It's not too different from trying to find the current unicode and str8
>> issues in the py3k-struni branch.
> 
> This sentence I do not understand. What is not too different from
> trying to find issues?

It was a general statement reflecting on the process of converting the 
py3k-struni branch to unicode.

As I said above:

 >> ... it becomes an easy to make and
 >> hard to find silent errors ...

In this case the errors are expected, but finding them is still difficult. 
It's not quite the same thing, but I did say "not too different", meaning 
there are some differences.


>> Concatenating str8 and str types is a bit safer, as long as the str8 is
>> in in "the" default encoding, but it may still be an unintended implicit
>> conversion.  And if it's not in the default encoding, then all bets are
>> off again.
> 
> Sure. However, the str8 type will go away, and along with it all these
> issues.

Yes, hopefully it will eventually, along with encoded strings in the wild. 
But probably not immediately.

>> The use would be in ensuring the integrity of encoded strings.
>> Concatenating strings with different encodings could then produce
>> errors.
> 
> Ok. What about slicing?

Details... all of which can be solved.  Encoded string types as I described 
above can also know how to slice themselves correctly.

Cheers and Regards,
    Ron


More information about the Python-3000 mailing list