[Python-Dev] bytes / unicode

Tue Jun 22 23:41:51 CEST 2010

On Wed, Jun 23, 2010 at 2:17 AM, Guido van Rossum <guido at python.org> wrote:
> (1) Literals.
>
> If you write something like x.split('&') you are implicitly assuming x
> is text. I don't see a very clean way to overcome this; you'll have to
> implement some kind of type check e.g.
>
>    x.split('&') if isinstance(x, str) else x.split(b'&')
>
> A handy helper function can be written:
>
>  def literal_as(constant, variable):
>      if isinstance(variable, str):
>          return constant
>      else:
>          return constant.encode('utf-8')
>
> So now you can write x.split(literal_as('&', x)).

I think this is a key point. In checking the behaviour of the os
module bytes APIs (see below), I used a simple filter along the lines
of:

  [x for x in seq if x.endswith("b")]

It would be nice if code along those lines could easily be made polymorphic.

Maybe what we want is a new class method on bytes and str (this idea
is similar to what MAL suggests later in the thread):

  def coerce(cls, obj, encoding=None, errors='surrogateescape'):
    if isinstance(obj, cls):
        return existing
    if encoding is None:
        encoding = sys.getdefaultencoding()
    # This is the str version, bytes,coerce would use obj.encode() instead
    return obj.decode(encoding, errors)

Then my example above could be made polymorphic (for ASCII compatible
encodings) by writing:

  [x for x in seq if x.endswith(x.coerce("b"))]

I'm trying to see downsides to this idea, and I'm not really seeing
any (well, other than 2.7 being almost out the door and the fact we'd
have to grant ourselves an exception to the language moratorium)

> (2) Data sources.
>
> These can be functions that produce new data from non-string data,
> e.g. str(<int>), read it from a named file, etc. An example is read()
> vs. write(): it's easy to create a (hypothetical) polymorphic stream
> object that accepts both f.write('booh') and f.write(b'booh'); but you
> need some other hack to make read() return something that matches a
> desired return type. I don't have a generic suggestion for a solution;
> for streams in particular, the existing distinction between binary and
> text streams works, of course, but there are other situations where
> this doesn't generalize (I think some XML interfaces have this
> awkwardness in their API for converting a tree to a string).

We may need to use the os and io modules as the precedents here:

os: normal API is text using the surrogateescape error handler,
parallel bytes API exposes raw bytes. Parallel API is polymorphic if
possible (e.g. os.listdir), but appends a 'b' to the name if the
polymorphic approach isn't practical (e.g. os.environb, os.getcwdb,
os.getenvb).
io. layered API, where both the raw bytes of the wire protocol and the
decoded bytes of the text layer are available

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia