[Python-3000] setup.py fails in the py3k-struni branch

Sat Jun 16 02:31:02 CEST 2007

Martin v. Löwis wrote:
>> This was in the context that it is decided by the community that a str8
>> type is needed and it does not go away.
> 
> I think *that* context has not occurred. People wanted a read-only
> bytes type, not a byte-oriented character string type.
> 
>> The alternative is for str8 to be replaced by byte objects which I
>> believe was, and still is, the plan if possible.
> 
> That type is already implemented.

But the actual replacing of str8 by bytes is still a work in progress.

>> The same semantic issues will also be present in bytes objects in one
>> form or another when handling data acquired from sources that use
>> encoded strings.  They don't go away even if str8 does go away.
> 
> No they don't. The bytes type doesn't have an encoding associated
> with it, and it shouldn't. Values may not even represent text,
> but, say, image data.

Right, and in the cases where the bytes are an encoded form of string data, 
you will need to be very careful about how it is sliced.  But this isn't 
any different from any other byte-type data.  It's a low level interface 
meant to do low level things.  Which is good, we need that.
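
A minimal sketch of the slicing hazard, assuming a current Python 3 
interpreter rather than the py3k-struni branch under discussion.  The 
sample text and the slice offsets are illustrative choices only.

```python
# Slicing encoded bytes at an arbitrary offset can cut a multi-byte
# character in half, so the slice no longer decodes.
text = "naïve"
data = text.encode("utf-8")      # 'ï' occupies two bytes in UTF-8

head = data[:3]                  # ends in the middle of 'ï'

try:
    head.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False             # the truncated sequence is rejected

# Slicing on a character boundary decodes without trouble.
whole = data[:4].decode("utf-8")
```

The bytes object itself gives no hint about where the boundaries are; the 
caller has to know the encoding to slice safely.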

>> It sort of depends on how someone wants to handle situations where
>> encoded strings are encountered.  Do they decode them and convert
>> everything to unicode and then convert back as needed for any output?
>> Or can they keep the data in the encoded form for the duration?  I
>> expect different people will feel differently on this.
> 
> In Py3k, they will use the string type, because anything else will
> just be too tedious.

I agree, this will be the preferred way, and should be.

>>> As for creating str8 objects from bytes objects: If you want
>>> the str8 object to carry an encoding, you would have to *specify*
>>> the encoding when creating the str8 object, since the bytes object
>>> does not have that information. This is *very* hard, as you may
>>> not know what the encoding is when you need to create the str8
>>> object.
>> True, and this also applies if you want to convert an already encoded
>> bytes object to unicode as well.
> 
> Right, and therefore it can never be automatic - whereas the conversion
> between a bytes object and a str8 object *could* be automatic otherwise
> (assuming the str8 type survives at all).

But conversion between different encodings won't be automatic.  It will 
still be as tedious and confusing as it always has been.  The improvement 
that python3000 makes here is that maybe it won't be needed as often with 
unicode strings being the default.
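
As a rough illustration of why this stays manual, here is a sketch that 
assumes a current Python 3 interpreter.  The sample data and the choice of 
latin-1 and utf-8 are assumptions made for the example.

```python
# Converting between encodings always goes through an explicit decode
# with the *known* source encoding, then an explicit encode.
raw = "café".encode("latin-1")       # pretend this arrived from a file
text = raw.decode("latin-1")         # caller must supply the encoding
utf8 = text.encode("utf-8")

# Guessing the wrong source encoding fails (or, worse, silently
# produces the wrong characters for some encoding pairs).
try:
    raw.decode("utf-8")
    wrong_guess_ok = True
except UnicodeDecodeError:
    wrong_guess_ok = False
```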

>> One approach is to possibly use a factory function that uses metaclasses
>> or mixins to create these based either on a str base type or a bytes
>> object.
>>
>>      Latin1 = get_encoded_str_type('latin-1')
>>
>>      s1 = Latin1('Hello ')
> [snip]
> 
> I think I lost track now what problem precisely you are trying to solve.

A case of abstract motivation, prompting a very general idea, which 
elicits subjective responses, which prompts even more concrete examples, 
etc...

The original motivation wasn't explicitly stated at the beginning and got 
lost.  ;-)

My primary reason for the suggestion is that maybe it can increase string 
data integrity and make finding errors easier.

This was just a thought in that direction.
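
To make the idea concrete, one possible shape for such a factory is 
sketched below.  Only the name get_encoded_str_type and the Latin1 usage 
come from the earlier suggestion; the class body, the encoding attribute, 
and the rule that mixing encodings raises TypeError are assumptions for 
illustration, written against a current Python 3 interpreter.

```python
def get_encoded_str_type(encoding):
    """Build a bytes subclass that remembers its encoding (hypothetical)."""

    class EncodedStr(bytes):
        def __new__(cls, text):
            # Accept either text (encoded here) or already-encoded bytes.
            if isinstance(text, str):
                raw = text.encode(encoding)
            else:
                raw = bytes(text)
            return super().__new__(cls, raw)

        def __add__(self, other):
            # Refuse to combine data whose encoding differs or is unknown.
            if getattr(other, "encoding", None) != encoding:
                raise TypeError("cannot mix %r data with %r"
                                % (encoding, type(other).__name__))
            return type(self)(bytes(self) + bytes(other))

    EncodedStr.encoding = encoding
    return EncodedStr


Latin1 = get_encoded_str_type('latin-1')
Utf8 = get_encoded_str_type('utf-8')

s1 = Latin1('Hello ') + Latin1('wörld')

try:
    Latin1('Hello ') + Utf8('wörld')
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```

The point of the sketch is only that an accidental mix of encodings fails 
at the place where it happens, instead of producing inconsistent bytes.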

A more specific example or issue that is much more relevant at this time 
might be, should the conversion to bytes be automatic when combining str8 
and bytes?  (str and bytes in python 2.6+)

The first answer might be yes since it's a one to one conversion.  But it's 
implicit.

 >>> str8('hello ') + b'world'
b'hello world'

 >>> b'hello ' + str8('world')
b'hello world'

That's clear enough, but what about...

 >>> ''.join(slist)
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string or Unicode, bytes found

And so starts yet another tedious session of tracing variables back to find 
where the bytes type actually occurred.  Which may not be obvious since it 
could have been an unintentional and implicit conversion.
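
A small sketch of the kind of check that shortens that tracing session, 
assuming a current Python 3 interpreter.  The helper name find_non_str and 
the contents of slist are hypothetical.

```python
def find_non_str(items):
    """Return (index, type name) for every item that is not a str."""
    return [(i, type(x).__name__)
            for i, x in enumerate(items)
            if not isinstance(x, str)]


slist = ["hello ", b"world", "!"]    # one bytes item slipped in

bad = find_non_str(slist)            # locate the offenders up front

try:
    "".join(slist)
    joined = True
except TypeError:
    joined = False                   # join refuses the bytes item
```

This only reports where the bytes object sits in the list; finding where 
it was created still means following the data back to its source.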

>>> It's easy to tell what happens now: the bytes of those input
>>> strings are just appended; the result string does not follow
>>> a consistent character encoding anymore. This answer does
>>> not apply to your proposed modification, as it does not answer
>>> what the value of the .encoding attribute of the str8 would be
>>> after concatenation (likewise for slicing).
>> And what is the use of appending unlike encoded str8 types?
> 
> You may need to put encoded text into binary data, e.g. putting
> a file name into a zip file. Some of the bytes will be utf-8
> encoded, others will be cp437 encoded, others will be data structures
> of the zip file, and the rest will be compressed bytes.
> 
> Likewise for constructing MIME messages: different pieces will
> use different encodings.

Wouldn't you need some sort of wrapper in these cases to indicate what the 
encoding is and where it starts and stops?

So even in binary data, extracting it to bytes and then decoding each 
section using its particular encoding should not be a problem.  Same goes 
for the other way around.

For text encoded data within other text encoded data, it's a nested encoding 
that needs to be decoded in the correct sequence.  Not a sequential 
encoding that is done and appended together as is.  Is that correct?  And 
it still needs headers to indicate its encoding, start, and length.  Or 
something equivalent.  What am I missing?
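
Here is a minimal sketch of what I mean by headers, assuming a current 
Python 3 interpreter.  The framing layout (one byte for the length of the 
encoding name, four bytes for the payload length) and the function names 
are invented for this example; it is not the zip or MIME format.

```python
import struct

HEADER = struct.Struct(">BI")   # encoding-name length, payload length


def pack_section(text, encoding):
    """Frame one piece of text with its encoding name and length."""
    name = encoding.encode("ascii")
    payload = text.encode(encoding)
    return HEADER.pack(len(name), len(payload)) + name + payload


def unpack_sections(data):
    """Walk the framed sections and decode each with its own encoding."""
    out = []
    pos = 0
    while pos < len(data):
        name_len, payload_len = HEADER.unpack_from(data, pos)
        pos += HEADER.size
        encoding = data[pos:pos + name_len].decode("ascii")
        pos += name_len
        out.append(data[pos:pos + payload_len].decode(encoding))
        pos += payload_len
    return out


# Two sections with different encodings sit side by side in one blob.
blob = pack_section("naïve", "utf-8") + pack_section("café", "cp437")
texts = unpack_sections(blob)
```

The container as a whole is plain bytes with no single encoding; each 
section is only text once its header says how to decode it.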

Cheers,
   Ron

>> I think what Guido is thinking is we may need keep str8 around (for a
>> while) as a 'C' compatible string type for purposes of interfacing to
>> 'C' code.
> 
> That might be. I hope not, and I have plans to eliminate the need for
> many such places (providing Unicode-oriented APIs in some cases,
> and using the bytes type in other cases).
> 
> In cases where we still have char*, I think the API should specify that
> this must be ASCII most of the time, with UTF-8 in selected other
> cases; arbitrary binary data only when interfacing to the bytes
> type.
> 
> Regards,
> Martin