[I18n-sig] Pre-PEP: Proposed Python Character Model
Martin v. Loewis
Thu, 8 Feb 2001 20:58:20 +0100
> print u"hello world"
> rather than the easier
> print "hello world"
> even though the message is clearly text.
You can easily have the latter being Unicode by invoking Python with
the -U option. If the pragma PEP is ever implemented, one pragma
should be reserved to declare the source file encoding, and another
one to declare all strings as Unicode in this file.
> I think we agree that, eventually, we would like the simple notation
> for a string literal to create a unicode string. What I'm not sure
> about is whether we can make that change soon. How often are string
> literals used to create what is logically just binary data?
Let's have a look. Excluding __doc__ strings (which can be recognized
syntactically), performing grep '"' in the Python library, I get
BaseHTTPServer.py:__version__ = "0.2"
BaseHTTPServer.py:__all__ = ["HTTPServer", "BaseHTTPRequestHandler"]
Both are "protocol" in some sense, i.e. not meant to be
human-readable. +2 for binary data
BaseHTTPServer.py:DEFAULT_ERROR_MESSAGE = """\
This is text, which lowers the running tally to +1 for binary data.
Actually, it is HTML, so when transferring it, it needs to be encoded
in some encoding; so it *could* be considered as the encoded message
instead.
BaseHTTPServer.py: sys_version = "Python/" + string.split(sys.version)
BaseHTTPServer.py: server_version = "BaseHTTP/" + __version__
BaseHTTPServer.py: self.request_version = version = "HTTP/0.9" # Default
BaseHTTPServer.py: self.send_error(400, "Bad request version (%s)
BaseHTTPServer.py: "Bad HTTP/0.9 request type (%s
BaseHTTPServer.py: self.send_error(400, "Bad request syntax (%s)" % `
BaseHTTPServer.py: self.send_error(501, "Unsupported method (%s)" % `
Part of the HTTP protocol, thus binary data. +9
BaseHTTPServer.py: self.log_error("code %d, message %s", code, message)
Log file; this is text, so +8
self.wfile.write("%s %s %s\r\n" %
HTTP protocol, +9
There are a few more. In total, BaseHTTPServer.py contains more binary
strings than text strings.
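
The survey above was done with grep and by eye. As a rough sketch of
the same idea, the following collects the string literals of a module
while skipping docstrings, which can be recognized syntactically. It is
written for a current Python 3 with the standard ast module, which is
an assumption and not what was used for the counts above; the function
name and the sample source are made up for illustration.

```python
import ast

def non_doc_string_literals(source):
    """Return the string literals in `source`, skipping docstrings."""
    tree = ast.parse(source)
    docstring_ids = set()
    for node in ast.walk(tree):
        # A docstring is a bare string expression that opens a module,
        # class, or function body.
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                docstring_ids.add(id(body[0].value))
    return [n.value for n in ast.walk(tree)
            if isinstance(n, ast.Constant)
            and isinstance(n.value, str)
            and id(n) not in docstring_ids]

# Hypothetical sample, loosely modelled on the grep output above.
sample = '''
"""Module docstring."""
__version__ = "0.2"
def handler():
    "Function docstring."
    return "HTTP/0.9"
'''
literals = non_doc_string_literals(sample)
```

Deciding whether each collected literal is "binary" or "text" still
has to be done by a human reader, as in the tally above.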
For other files, the ratio may vary. In general, I believe "binary"
strings are more common in source code, as many of the strings are
typically processed by some other program which expects a specific
byte sequence, rather than a character string.
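
To illustrate the difference between a specific byte sequence and a
character string, here is a small sketch. It uses the separate bytes
and str types of Python 3, which did not exist when this was written,
so take it as an assumption-laden illustration only; the status line
and the message text are invented for the example.

```python
# Protocol data: the receiving program expects exactly these bytes.
status_line = b"HTTP/1.0 501 Unsupported method\r\n"

# Human-readable text: a sequence of characters with no single byte form.
message = "Méthode non supportée"

# The same text yields different byte sequences under different encodings.
utf8 = message.encode("utf-8")
latin1 = message.encode("latin-1")
```

The status line has only one correct byte representation, while the
text has one per encoding, and something has to pick an encoding
before the text can be written to a file or a socket.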
Human-readable strings are probably more common in GUI
applications. One should think about i18n here, which means that the
actual localized message catalogs must be separate from the program