[I18n-sig] Pre-PEP: Proposed Python Character Model

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 8 Feb 2001 20:58:20 +0100

> print u"hello world"
> rather than the easier
> print "hello world"
> even though the message is clearly text.

You can easily have the latter being Unicode by invoking Python with
the -U option. If the pragma PEP is ever implemented, one pragma
should be reserved to declare the source file encoding, and another
one to declare all strings as Unicode in this file.

> I think we agree that, eventually, we would like the simple notation
> for a string literal to create a unicode string. What I'm not sure
> about is whether we can make that change soon. How often are string
> literals used to create what is logically just binary data?

Let's have a look. Excluding __doc__ strings (which can be recognized
syntactically), performing grep '"' in the Python library, I get

BaseHTTPServer.py:__version__ = "0.2"
BaseHTTPServer.py:__all__ = ["HTTPServer", "BaseHTTPRequestHandler"] 

Both are "protocol" in some sense, i.e. not meant to be
human-readable. +2 for binary data


This is text, which brings the running tally down to +1 for binary
data. Actually, it is HTML, so when transferring it, it needs to be
encoded in some encoding; so it *could* be considered as the encoded
message instead.
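
To illustrate the point about encoding, here is a minimal sketch in
current Python, where str is text and bytes is binary data (a split
that postdates this message). The page content is a hypothetical
stand-in, not the actual string from BaseHTTPServer.py.

```python
# Hypothetical error page, standing in for an HTML template in the module.
page = "<h1>Error: caf\u00e9 not found</h1>"

# The same text becomes different byte sequences depending on the encoding
# chosen for the transfer.
latin1_bytes = page.encode("iso-8859-1")
utf8_bytes = page.encode("utf-8")

assert latin1_bytes != utf8_bytes

# Decoding each byte sequence with its own encoding recovers the same text.
assert latin1_bytes.decode("iso-8859-1") == page
assert utf8_bytes.decode("utf-8") == page
```

If the literal in the source is taken to be the text, an encoding step
is needed before sending; if it is taken to be the encoded message, the
source file has already fixed the encoding.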

BaseHTTPServer.py:    sys_version = "Python/" + string.split(sys.version)[0]
BaseHTTPServer.py:    server_version = "BaseHTTP/" + __version__ 
BaseHTTPServer.py:        self.request_version = version = "HTTP/0.9" # Default
BaseHTTPServer.py:                self.send_error(400, "Bad request version (%s)
BaseHTTPServer.py:                                "Bad HTTP/0.9 request type (%s
BaseHTTPServer.py:            self.send_error(400, "Bad request syntax (%s)" % `
BaseHTTPServer.py:            self.send_error(501, "Unsupported method (%s)" % `

Part of the HTTP protocol, thus binary data. +9

BaseHTTPServer.py:        self.log_error("code %d, message %s", code, message) 

Log file; this is text, so +8

            self.wfile.write("%s %s %s\r\n" %

HTTP protocol, +9

There are a few more. In total, BaseHTTPServer.py contains more binary
strings than text strings.
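
The survey above was done with grep and by eye. As a rough sketch of
how the listing step could be automated, the following uses the ast
module from current Python (an assumption; it is not what was run
here) to collect string literals while skipping docstrings. The
function name and the sample source are hypothetical, and deciding
whether each literal is "binary" or "text" remains a human judgement.

```python
import ast

def non_docstring_literals(source):
    """Return (line, value) pairs for string literals that are not docstrings."""
    tree = ast.parse(source)

    # First pass: remember the constant nodes that serve as docstrings.
    docstrings = set()
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, scopes) and node.body:
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                docstrings.add(id(first.value))

    # Second pass: collect every other string constant, in source order.
    found = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, str)
                and id(node) not in docstrings):
            found.append((node.lineno, node.col_offset, node.value))
    found.sort()
    return [(line, value) for line, _col, value in found]

# Hypothetical sample, loosely modelled on the lines quoted above.
SAMPLE = '''\
"""Module docstring, skipped."""
__version__ = "0.2"
__all__ = ["HTTPServer", "BaseHTTPRequestHandler"]

def log(code):
    """Function docstring, skipped."""
    return "code %d" % code
'''

for line, value in non_docstring_literals(SAMPLE):
    print(line, repr(value))
```

Each reported literal still has to be classified by hand, as was done
for BaseHTTPServer.py above.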

For other files, the ratio may vary. In general, I believe "binary"
strings are more common in source code, as many of the strings are
typically processed by some other program which expects a specific
byte sequence, rather than a character string.

Human-readable strings are probably more common in GUI
applications. One should think about i18n here, which means that the
actual localized message catalogs must be separate from the program