[Python-Dev] str object going in Py3K

Bengt Richter bokr at oz.net
Fri Feb 17 00:15:04 CET 2006

On Wed, 15 Feb 2006 21:59:55 -0800, Alex Martelli <aleaxit at gmail.com> wrote:

>On Feb 15, 2006, at 9:51 AM, Barry Warsaw wrote:
>> On Wed, 2006-02-15 at 09:17 -0800, Guido van Rossum wrote:
>>> Regarding open vs. opentext, I'm still not sure. I don't want to
>>> generalize from the openbytes precedent to openstr or openunicode
>>> (especially since the former is wrong in 2.x and the latter is wrong
>in 3.0). I'm tempted to hold out for open() since it's most
>>> compatible.
>> If we go with two functions, I'd much rather hang them off of the file
>type object than add two new builtins.  I really do think file.bytes()
>> and file.text() (a.k.a. open.bytes() and open.text()) is better than
>> opentext() or openbytes().
>I agree, or, MAL's idea of bytes.open() and unicode.open() is also  
>good.  My fondest dream is that we do NOT have an 'open' builtin  
>which has proven to be very error-prone when used in Windows by  
>newbies (as evidenced by beginner errors as seen on c.l.py, the  
>python-help lists, and other venues) -- defaulting 'open' to text is  
>error-prone, defaulting it to binary doesn't seem the greatest idea  
>either, principle "when in doubt, resist the temptation to guess"  
>strongly suggests not having 'open' as a built-in at all.  (And  
>namemangling into openthis and openthat seems less Pythonic to me  
>than exploiting namespaces by making structured names, either  
>this.open and that.open or open.this and open.that).  IOW, I entirely  
>agree with Barry and Marc Andre.
FWIW, I'd vote for file.text and file.bytes

I don't like bytes.open or unicode.open because I think
types in general should not know about I/O (IIRC Guido said that, so pay attention ;-)
Especially unicode.

E.g., why should unicode pull in a whole wad of I/O-related code if the user
is only using it as an intermediary in some encoding change between low-level binary
input and low-level binary output? E.g., consider what you could do with one statement like (untested)

    s_str.translate(table, delch).encode('utf-8')

especially if you didn't have to introduce a phony latin-1 decoding and write it as (untested)

    s_str.translate(table, delch).decode('latin-1').encode('utf-8')     # use str.translate
    s_str.decode('latin-1').translate(mapping).encode('utf-8')          # use unicode.translate also for delch
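(To make that workaround concrete, here is a py3k-flavoured sketch of the second pipeline, with made-up input bytes and a made-up mapping -- in py3k terms s_str would be a bytes object:)

```python
# Hypothetical illustration of the latin-1 round-trip workaround:
# decode the bytes 1:1 to unicode, translate (here deleting '-'
# characters, playing the role of delchars), then encode to UTF-8.
s_str = b'caf\xe9 -- ok'        # made-up Latin-1 input
mapping = {0x2d: None}          # ord('-') -> None deletes the character
out = s_str.decode('latin-1').translate(mapping).encode('utf-8')
```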

to avoid exceptions when your s_str translation involves non-ASCII characters.

It seems s_str.translate(table, delchars) wants to convert the s_str to unicode
if table is unicode, and then use unicode.translate (which bombs on delchars!) instead
of just effectively defining str.translate as

    def translate(self, table, deletechars=None):
        return ''.join(table[ord(x)] for x in self
                       if deletechars is None or x not in deletechars)
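(As a sanity check, that sketch runs fine as a standalone function -- here with a made-up identity table that upcases only 'a':)

```python
# Standalone rendering of the proposed str.translate semantics above.
def translate(s, table, deletechars=None):
    return ''.join(table[ord(x)] for x in s
                   if deletechars is None or x not in deletechars)

table = [chr(i) for i in range(256)]   # identity table...
table[ord('a')] = 'A'                  # ...except map 'a' -> 'A'
out = translate('abc-abc', table, deletechars='-')
```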

IMO, if you want unicode.translate, then write unicode(s_str).translate and use that.
Let str.translate just use the str ords, so simple custom decodes can be written without
the annoyance of

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 3: ordinal not in range(128)

Can we change this? Or what am I missing? I certainly would like to miss
the above message for str.translate :-(

BTW, this would also allow taking advantage of features of both translates if desired, e.g. by
    s_str.translate(unichartable256, strdelchrs).translate(uniord_to_ustr_mapping).
(e.g., the latter permits single to multiple-character substitution)
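(For the single-to-multiple case: unicode.translate's mapping may return a whole string per ordinal, as in this sketch -- the mapping is invented for illustration:)

```python
# unicode-style translate maps ordinals to strings, so a single
# character can expand to several (mapping invented for illustration).
uniord_to_ustr_mapping = {ord('ö'): 'oe', ord('ß'): 'ss'}
out = 'Schöne Grüße'.translate(uniord_to_ustr_mapping)
```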

This makes me think a translate method for bytes would be good for py3k (on topic ;-)
It is just too handy a high-speed conversion goodie to forgo, IMO.
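(Such a bytes.translate could be sketched in pure Python along these lines -- an untested sketch assuming the 2.x str.translate signature of a 256-entry table plus delete-chars; all the names here are made up:)

```python
# Sketch of a possible bytes.translate: map each byte through a
# 256-entry table, dropping bytes listed in delete (names hypothetical).
def bytes_translate(data, table, delete=b''):
    return bytes(table[b] for b in data if b not in delete)

table = bytes(range(256)).upper()      # table that upcases ASCII letters
out = bytes_translate(b'abc-xyz', table, delete=b'-')
```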

BTW, ISTM that it would be nice to have a chunking-iterator-wrapper-returning method
(as opposed to a buffering specification) for file.bytes, so you could plug in

    file.bytes('path').chunk(1)  # maybe keyword opts for simple common record chunking also?

in places where you might now have to have (untested)

    (ord(x) for x in iter(lambda f=open('path','rb'): f.read(1), ''))  # iter() needs the '' sentinel to stop at EOF

or write a helper like

    def by_byte_ords(path, bufsize=8192):
        # yield the ordinal of each byte in the file, reading bufsize bytes at a time
        f = open(path, 'rb')
        buf = f.read(bufsize)
        while buf:
            for x in buf: yield ord(x)
            buf = f.read(bufsize)

and plug in by_byte_ords('path').
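(A py3k-flavoured sketch of the same helper, assuming iterating bytes yields ints there; io.BytesIO stands in for the file so it is self-contained:)

```python
import io
from functools import partial

def by_byte_ords3(f, bufsize=8192):
    # yield byte ordinals from a binary stream, bufsize bytes at a time
    for buf in iter(partial(f.read, bufsize), b''):
        for b in buf:            # iterating bytes yields ints in py3k
            yield b

ords = list(by_byte_ords3(io.BytesIO(b'\x00A\xff'), bufsize=2))
```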

BTW, bytes([]) would presumably be the file.bytes EOF?
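(A quick check of that EOF behaviour against an in-memory stream: reading past the end yields an empty bytes object, which is falsy, so while-loops terminate naturally:)

```python
import io

f = io.BytesIO(b'ab')
data = f.read(2)
eof = f.read(1)      # reading past the end returns empty bytes
```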
Bengt Richter
