[Python-Dev] Unicode support in getargs.c

Martin v. Loewis martin@v.loewis.de
Fri, 4 Jan 2002 19:40:34 +0100


> No, it doesn't, that is the whole point of why I started this
> thread!!!!

Oops, right. I was thinking the other way around: passing u"a.out"
where "a.out" is expected works fine; for this case, the memory
management issues come into play.

> > > Using Python StringObjects as binary buffers is also far less common
> > > than using StringObjects to store plain old strings, so if either of
> > > these uses bites the other it's the binary buffer that needs to
> > > suffer.
> > 
> > This is a conclusion I cannot agree with. Most strings are really
> > binary, if you look at them closely enough :-)
> 
> I'm not sure I understand this remark. If you made it just for the
> smiley: never mind. If you really don't agree: please explain why.

When the discussion of tagging binary strings in source code came up,
I started to look through the standard library to see which string
literals would have to be tagged as byte strings, and which are really
character strings.

I found that the overwhelming majority of string literals in the
standard Python library really denote byte strings, if you ignore doc
strings. Sometimes, it isn't obvious that they are binary strings,
hence the smiley. Look at httplib.py:

__all__ = ["HTTP", ...

Not sure: Are Python function names byte strings or character strings?
Probably doesn't matter either way. Python source code is definitely
byte-oriented, explicitly without any assumed encoding, so I'd lean
towards byte strings here.

_UNKNOWN = 'UNKNOWN'

Looks like a character string. However, it is used in

        self.version = _UNKNOWN # HTTP-Version

self.version is later sent on the byte-oriented HTTP protocol, so
_UNKNOWN *is* a byte string.

_CS_IDLE = 'Idle'

These are enumerators; let's say they are character strings.

        self.fp = sock.makefile('rb', 0)

Not sure. Could be a character string.

            print "reply:", repr(line)

Definitely a character string.

                version = "HTTP/0.9"
                status = "200"
                reason = ""

Protocol elements, thus byte string.

So, I'm arguing that byte strings are far more common than you may
think at first sight. In particular, everything returned by .read(),
either of a file, or of a socket, is a byte string, since files and
network connections are byte-oriented. In the particular case of
network connections, applying system conventions for narrow strings
would be foolish.
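
To illustrate the point about .read(), here is a minimal sketch. It
assumes a Python version that has a distinct bytes type and the io
module; the BytesIO object and its contents are made up and merely
stand in for a socket or a file opened in binary mode:

```python
import io

# Hypothetical stand-in for sock.makefile('rb'): a byte-oriented stream.
raw = io.BytesIO(b"HTTP/1.0 200 OK\r\n\r\ncaf\xc3\xa9")

status_line = raw.readline()   # protocol element, delivered as bytes
raw.readline()                 # skip the blank line that ends the headers
body = raw.read()              # payload, still bytes

# Going from bytes to characters requires naming an encoding explicitly;
# "utf-8" is an assumption about this made-up payload, not a system default.
text = body.decode("utf-8")
```

The stream never guesses an encoding: the five payload bytes become
four characters only because the caller said they were UTF-8.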

Regards,
Martin