String prefix question

Benjamin Kaplan benjamin.kaplan at case.edu
Sun Nov 8 22:04:04 EST 2009


On Sun, Nov 8, 2009 at 9:38 PM, Alan Harris-Reid
<alan at baselinedata.co.uk> wrote:
> In the Python.org 3.1 documentation (section 20.4.6), there is a simple
> “Hello World” WSGI application which includes the following method...
>
> def hello_world_app(environ, start_response):
>     status = b'200 OK'  # HTTP Status
>     headers = [(b'Content-type', b'text/plain; charset=utf-8')]  # HTTP Headers
>     start_response(status, headers)
>
>     # The returned object is going to be printed
>     return [b"Hello World"]
>
> Question - Can anyone tell me why the 'b' prefix is present before each
> string? The method seems to work equally well with and without the prefix.
> From what I can gather from the documentation the b prefix represents a
> bytes literal, but can anyone explain (in simple english) what this means?
>
> Many thanks,
> Alan

The rather long version:
read http://www.joelonsoftware.com/articles/Unicode.html

A somewhat shorter summary, along with how Python deals with this:

Once upon a time, someone decided to allocate 1 byte for each
character. Since everything the Americans who made the computers
needed fit into 7 bits, this was all right. And they called this the
American Standard Code for Information Interchange (ASCII). Later,
device manufacturers realized that they had 128 byte values left over
that didn't mean anything, so they all made up their own characters
for the upper 128. And when they started selling computers
internationally, they used the upper 128 to store the characters they
needed for the local language. This had two problems.

1) Files made on one computer in one country wouldn't display right
on a computer made by a different manufacturer or for a different
country.

2) The 256 characters were enough for most Western languages, but
Chinese and Japanese need a whole lot more.
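
A small illustration of problem 1, assuming Python 3. The byte value
0xE9 and the three ISO-8859 code pages are my own picks for the
example, not something from the original question:

```python
# One byte value, three legacy 8-bit encodings, three different characters.
raw = b'\xe9'

as_latin1 = raw.decode('iso-8859-1')    # Western European code page
as_cyrillic = raw.decode('iso-8859-5')  # Cyrillic code page
as_greek = raw.decode('iso-8859-7')     # Greek code page

for name, ch in [('iso-8859-1', as_latin1),
                 ('iso-8859-5', as_cyrillic),
                 ('iso-8859-7', as_greek)]:
    # Show the code point each code page assigns to the same byte.
    print(name, hex(ord(ch)))
```

The bytes in the file never change; only the table used to read them
does, which is why the same file looked different from machine to
machine.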

To solve this problem, Unicode was created. Rather than thinking of
each character as a distinct set of bits, it just assigns a number to
each one (a code point). The bottom 128 code points are the original
ASCII set, and everything else you could think of was added on top of
that: other alphabets, mathematical symbols, music notes, cuneiform,
dominoes, mahjong tiles, and more. Unicode is harder to implement than
a simple byte array, but it means strings are universal: every program
will interpret them exactly the same. Unicode strings are the default
('') in Python 3.x and are created in 2.x by putting a u in front of
the string literal (u'').
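
To make "a number for each character" concrete, here is a quick
sketch using the built-in ord() and chr(). It assumes Python 3.3 or
later, where every code point is a single string element; the sample
characters are arbitrary choices:

```python
# Each character is just a number (a code point), independent of any bytes.
samples = ['A', '\u00e9', '\u20ac', '\U0001D11E']  # A, e-acute, euro sign, G clef

code_points = [ord(ch) for ch in samples]
for ch, cp in zip(samples, code_points):
    print(repr(ch), 'is code point', hex(cp))

# The bottom 128 code points are plain ASCII, so 'A' is still 65.
ascii_a = ord('A')

# chr() goes the other way: number in, character out.
round_trip = [chr(cp) for cp in code_points]
```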

Unicode, however, is a concept, and a concept can't be sent through
the network or stored on the hard drive; only bytes can. So instead we
deal with strings internally as Unicode and then give them an encoding
when we send them back out. Some encodings, such as UTF-8, can use
multiple bytes per character and, as such, can deal with the full
range of Unicode characters. Other times, programs still expect the
old 8-bit encodings like ISO-8859-1 or the Windows ANSI code pages. In
Python, to declare that a string is a literal sequence of bytes that
the program should not try to interpret, you use b'' in Python 3.x, or
just declare it normally in Python 2.x ('').
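
Roughly, the round trip between text and bytes looks like this in
Python 3. The sample word is made up for the example:

```python
text = 'caf\u00e9'  # a str: four Unicode characters

# Encoding turns the abstract characters into concrete bytes.
as_utf8 = text.encode('utf-8')         # e-acute becomes two bytes
as_latin1 = text.encode('iso-8859-1')  # e-acute becomes one byte

print(len(text), len(as_utf8), len(as_latin1))

# Decoding with the matching encoding gets the original text back.
back_from_utf8 = as_utf8.decode('utf-8')
back_from_latin1 = as_latin1.decode('iso-8859-1')

# A b'' literal is already bytes; Python will not interpret it as text.
literal = b'Hello World'
print(type(text).__name__, type(literal).__name__)
```

The same four characters come out as five bytes in UTF-8 and four in
ISO-8859-1, so whoever reads the bytes has to know which encoding was
used.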

------------------------------------------------------
What happens in your program:

When you print a Unicode string, Python has to decide what encoding to
use. If you're printing to a terminal, Python looks for the terminal's
encoding and uses that. In the event that it doesn't know what
encoding to use, Python defaults to ASCII because that's compatible
with almost everything. Since the string you're sending to the web
page only contains ASCII characters, the automatic conversion works
fine if you don't specify the b''. The resulting page is declared as
UTF-8 in the header, and UTF-8 is compatible with ASCII, so the output
looks fine. If you try sending a string that has non-ASCII characters,
the program might raise a UnicodeEncodeError because it doesn't know
what bytes to use for those characters. It may be able to guess, but
since I haven't used WSGI directly before, I can't say for sure.
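
A sketch of both halves of that, not tested against a real server. It
assumes a current Python 3, where WSGI follows PEP 3333 (status and
headers are ordinary str, the body must be bytes). The sample text and
the fake_start_response helper are hypothetical, there only so the app
can be called directly:

```python
body_text = 'na\u00efve caf\u00e9'  # contains two non-ASCII characters

# ASCII has no bytes for these characters, so encoding to ASCII fails.
try:
    body_text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

def hello_world_app(environ, start_response):
    # Encode explicitly, matching the charset declared in the header.
    body = body_text.encode('utf-8')
    status = '200 OK'
    headers = [('Content-Type', 'text/plain; charset=utf-8'),
               ('Content-Length', str(len(body)))]
    start_response(status, headers)
    return [body]

# Stand-in for the server's start_response callable (hypothetical helper).
captured = {}
def fake_start_response(status, headers):
    captured['status'] = status
    captured['headers'] = headers

result = hello_world_app({}, fake_start_response)
print(ascii_failed, captured['status'], result)
```

Encoding the body yourself means nothing has to guess: the bytes that
go out are exactly the ones the Content-Type header promises.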


