[Python-Dev] Generalised String Coercion

Mon Aug 8 02:24:34 CEST 2005

[Guido]
> > My first response to the PEP, however, is that instead of a new
> > built-in function, I'd rather relax the requirement that str() return
> > an 8-bit string -- after all, int() is allowed to return a long, so
> > why couldn't str() be allowed to return a Unicode string?

[MAL]
> The problem here is that strings and Unicode are used in different
> ways, whereas integers and longs are very similar. Strings are used
> for both arbitrary data and text data, Unicode can only be used
> for text data.

Yes, that is the case in Python 2.x. In Python 3.x, I'd like to use a
separate "bytes" array type for non-text and for encoded text data,
just like Java; strings should always be considered text data.

We might be able to get there halfway in Python 2.x: we could
introduce the bytes type now, and provide separate APIs to read and
write them. (In fact, the array module and the f.readinto() method
make this possible today, but it's too klunky so nobody uses it.
Perhaps a better API would be a new file-open mode ("B"?) to indicate
that a file's read* operations should return bytes instead of strings.)
The bytes type could just be a very thin wrapper around array('b').
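
For reference, the array/readinto() route mentioned above looks roughly
like this. It is only a sketch: the helper name read_into_array is made
up for illustration, and it assumes the whole file fits in memory.

```python
import os
import tempfile
from array import array


def read_into_array(path):
    """Read a whole file into an array('b') without building a str first.

    Hypothetical helper for illustration; not an existing API.
    """
    size = os.path.getsize(path)
    buf = array('b', [0]) * size      # preallocate a writable buffer
    f = open(path, 'rb')
    try:
        count = f.readinto(buf)       # fills buf in place, returns byte count
    finally:
        f.close()
    return buf[:count]


# Small demonstration with a temporary file.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b'\x01\x02\x7f\xff')
os.close(fd)
try:
    data = read_into_array(demo_path)
finally:
    os.remove(demo_path)

print(data.tolist())
```

Note that array('b') holds signed chars, so byte values of 0x80 and
above come back as negative numbers, which is part of why this route
feels klunky.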

> The new text() built-in would help make a clear distinction
> between "convert this object to a string of bytes" and
> "please convert this to a text representation". We need to
> start making the separation somewhere and I think this is
> a good non-invasive start.

I agree with the latter, but I would prefer that any new APIs we
introduce use a 'bytes' data type to represent non-text data, rather
than having two different sets of APIs to differentiate between the use
of 8-bit strings as text vs. data -- while we *currently* use 8-bit
strings for both text and data, in Python 3.0 we won't, so then the
interim APIs would have to change again. I'd rather introduce a new
data type and new APIs that work with it.

> Furthermore, the text() built-in could be used to only
> allow 8-bit strings with ASCII content to pass through
> and require that all non-ASCII content be returned as
> Unicode.
> 
> We wouldn't be able to enforce this in str().
> 
> I'm +1 on adding text().

I'm still -1.

> I would also like to suggest a new formatting marker '%t'
> to have the same semantics as text() - instead of changing
> the semantics of %s as Neil suggests in the PEP. Again,
> the reason is to make the difference between text and
> arbitrary data explicit and visible in the code.

Hm. What would be the use case for using %s with binary, non-text data?

> > The main problem for a smooth Unicode transition remains I/O, in my
> > opinion; I'd like to see a PEP describing a way to attach an encoding
> > to text files, and a way to decide on a default encoding for stdin,
> > stdout, stderr.
> 
> Hmm, not sure why you need PEPs for this:

I'd forgotten how far we've come. I'm still unsure how the default
encoding on stdin/stdout works.

But it still needs to be simpler; IMO the built-in open() function
should have an encoding keyword. (But it could return something whose
type is not 'file' -- once again making a distinction between open and
file.) Do these files support universal newlines? IMO they should.
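
As a rough sketch of the kind of call meant here: codecs.open() in the
standard library already accepts an encoding argument, so a thin wrapper
can stand in for an open() with an encoding keyword. The name open_text
and its utf-8 default are assumptions for illustration only.

```python
import codecs
import os
import tempfile


def open_text(path, mode="r", encoding="utf-8"):
    """Hypothetical stand-in for an open() that takes an encoding keyword.

    Returns a codecs stream object rather than a plain file object.
    """
    return codecs.open(path, mode, encoding=encoding)


fd, demo_path = tempfile.mkstemp()
os.close(fd)
try:
    # Write Unicode text; it is encoded to UTF-8 on the way out.
    f = open_text(demo_path, "w", encoding="utf-8")
    f.write(u"caf\u00e9\n")
    f.close()

    # Read it back as text; bytes are decoded on the way in.
    f = open_text(demo_path, "r", encoding="utf-8")
    text = f.read()
    f.close()

    # Look at the encoded bytes actually stored on disk.
    raw_file = open(demo_path, "rb")
    raw = raw_file.read()
    raw_file.close()
finally:
    os.remove(demo_path)
```

codecs.open() always opens the underlying file in binary mode, so no
newline translation takes place; universal newlines would have to be
layered on top.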

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)