[Python-Dev] Generalised String Coercion

M.-A. Lemburg mal at egenix.com
Mon Aug 8 12:42:26 CEST 2005


Guido van Rossum wrote:
> [Guido]
> 
>>>My first response to the PEP, however, is that instead of a new
>>>built-in function, I'd rather relax the requirement that str() return
>>>an 8-bit string -- after all, int() is allowed to return a long, so
>>>why couldn't str() be allowed to return a Unicode string?
> 
> 
> [MAL]
> 
>>The problem here is that strings and Unicode are used in different
>>ways, whereas integers and longs are very similar. Strings are used
>>for both arbitrary data and text data, Unicode can only be used
>>for text data.
> 
> Yes, that is the case in Python 2.x. In Python 3.x, I'd like to use a
> separate "bytes" array type for non-text and for encoded text data,
> just like Java; strings should always be considered text data.
>
> We might be able to get there halfway in Python 2.x: we could
> introduce the bytes type now, and provide separate APIs to read and
> write them.
>
> (In fact, the array module and the f.readinto()  method
> make this possible today, but it's too klunky so nobody uses it.
> Perhaps a better API would be a new file-open mode ("B"?) to indicate
> that a file's read* operations should return bytes instead of strings.
> The bytes type could just be a very thin wrapper around array('b').

I'd prefer to keep such bytes type immutable (arrays are mutable),
otherwise, as Martin already mentioned, they wouldn't be usable
as dictionary keys and the transition from the current string
implementation would be made more difficult than necessary.

Since we won't have any use for the string type in Py3k,
why not simply strip it down to a plain bytes type ?

(I wouldn't want to lose or have to reinvent all the
optimizations that went into its implementation and which
are missing in the array implementation.)

About the file-type idea:

We already have text mode and binary mode - with their implementation
being platform dependent. I don't think that this is particularly
good area to add new functionality.

If you use codecs.open() to open a file, you could easily
write a codec which implements what you have in mind.

>>The new text() built-in would help make a clear distinction
>>between "convert this object to a string of bytes" and
>>"please convert this to a text representation". We need to
>>start making the separation somewhere and I think this is
>>a good non-invasive start.
> 
> 
> I agree with the latter, but I would prefer that any new APIs we use
> use a 'bytes' data type to represent non-text data, rather than having
> two different sets of APIs to differentiate between the use of 8-bit
> strings as text vs. data -- while we *currently* use 8-bit strings for
> both text and data, in Python 3.0 we won't, so then the interim APIs
> would have to change again. I'd rather intrduce a new data type and
> new APIs that work with it.

Well, let's put it this way: it all really depends on
what str() should mean in Py3k.

Given that str() is used for mixed content data strings,
simply aliasing str() to unicode() in Py3k would cause a
lot of breakage, due to changed semantics.

Aliasing str() to bytes() would also cause breakage, due
to the fact that bytes types wouldn't have string method
like e.g. .lower(), .upper(), etc.

Perhaps str() in Py3k should become a helper that
converts bytes() to Unicode, provided the content is
ASCII-only.

In any case, Py3k would only have unicode() for text
and bytes() for data, so there's no real need to continue
using str().

If we add the text() API in Py2k and with the above
meaning, then we could rename unicode() to text()
in Py3k - only a cosmetical change, but one that I would
find useful: text() and bytes() are more intuitive to
understand than unicode() and bytes().

>>Furthermore, the text() built-in could be used to only
>>allow 8-bit strings with ASCII content to pass through
>>and require that all non-ASCII content be returned as
>>Unicode.
>>
>>We wouldn't be able to enforce this in str().
>>
>>I'm +1 on adding text().
> 
> 
> I'm still -1.
> 
> 
>>I would also like to suggest a new formatting marker '%t'
>>to have the same semantics as text() - instead of changing
>>the semantics of %s as the Neil suggests in the PEP. Again,
>>the reason is to make the difference between text and
>>arbitrary data explicit and visible in the code.
> 
> 
> Hm. What would be the use case for using %s with binary, non-text data?

I guess we'd only keep it for backwards compatibility and
map it to the str() helper.

>>>The main problem for a smooth Unicode transition remains I/O, in my
>>>opinion; I'd like to see a PEP describing a way to attach an encoding
>>>to text files, and a way to decide on a default encoding for stdin,
>>>stdout, stderr.
>>
>>Hmm, not sure why you need PEPs for this:
> 
> 
> I'd forgotten how far we've come. I'm still unsure how the default
> encoding on stdin/stdout works.

Codecs in general work like this: they take an existing file-like
object and wrap it with new versions of .read(), .write(),
.readline(), etc. which filter the data through encoding and/or
decoding functions.

Once a file is wrapped with a codec StreamWriter/Reader,
you can continue using it as if it were a standard file-like
object.

> But it still needs to be simpler; IMO the built-in open() function
> should have an encoding keyword. (But it could return something whose
> type is not 'file' -- once again making a distinction between open and
> file.) 

Right, because it would then return a wrapped file object.

> Do these files support universal newlines? IMO they should. 

Since the codecs wrap the underlying file object which does
support universal newlines, this should be the case.

However, you should be aware of the fact that Unicode
defines a lot more line break characters than just \r,
\r\n, \n.

The codecs use the .splitlines() methods of strings and
Unicode - which support all of them transparently, so you
don't need to enable universal newlines support at all -
it's sort-of enabled per default.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 08 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::


More information about the Python-Dev mailing list