String prefix question

Mon Nov 9 14:18:56 EST 2009

Benjamin Kaplan wrote:
> On Sun, Nov 8, 2009 at 9:38 PM, Alan Harris-Reid
> <alan at baselinedata.co.uk> wrote:
>   
>> In the Python.org 3.1 documentation (section 20.4.6), there is a simple
>> "Hello World" WSGI application which includes the following method...
>>
>> def hello_world_app(environ, start_response):
>>     status = '200 OK' # HTTP Status
>>     headers = [(b'Content-type', b'text/plain; charset=utf-8')] # HTTP Headers
>>     start_response(status, headers)
>>
>>     # The returned object is going to be printed
>>     return [b"Hello World"]
>>
>> Question - Can anyone tell me why the 'b' prefix is present before each
>> string? The method seems to work equally well with and without the prefix.
>> From what I can gather from the documentation, the b prefix represents a
>> bytes literal, but can anyone explain (in simple English) what this means?
>>
>> Many thanks,
>> Alan
>>     
>
> The rather long version:
> read http://www.joelonsoftware.com/articles/Unicode.html
>
> A somewhat shorter summary, along with how Python deals with this:
>
> Once upon a time, someone decided to allocate 1 byte for each
> character. Since everything the Americans who made the computers
> needed fit into 7 bits, this was alright. And they called this the
> American Standard Code for Information Interchange (ASCII). When
> computers came along, device manufacturers realized that they had 128
> characters that didn't mean anything, so they all made their own
> characters to show for the upper 128. And when they started selling
> computers internationally, they used the upper 128 to store the
> characters they needed for the local language. This had several
> problems.
>
> 1) Files made on one computer in one country wouldn't display right
> on a computer made by a different manufacturer or for a different
> country.
>
> 2) The 256 characters were enough for most Western languages, but
> Chinese and Japanese need a whole lot more.
>
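To make problem 1 concrete, here is a minimal sketch (an added illustration, not part of Benjamin's message) of one byte from the "upper 128" being read under two different 8-bit code pages. The variable names are made up for the example; "latin-1" and "cp1251" are standard Python codec names.

```python
# One byte with a value in the "upper 128" range.
raw = b"\xe9"

# The same byte means a different character under each 8-bit code page.
as_western = raw.decode("latin-1")   # ISO-8859-1, Western European
as_cyrillic = raw.decode("cp1251")   # Windows Cyrillic code page

# ascii() escapes non-ASCII characters, so this prints safely on any terminal.
print(ascii(as_western), ascii(as_cyrillic))
```

The bytes on disk never change; only the table used to read them does, which is why the same file could look different from one machine to the next.
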
> To solve this problem, Unicode was created. Rather than thinking of
> each character as a distinct set of bits, it just assigns a number to
> each one (a code point). The bottom 128 characters are the original
> ASCII set, and everything else you could think of was added on top of
> that - other alphabets, mathematical symbols, music notes, cuneiform,
> dominoes, mahjong tiles, and more. Unicode is harder to implement than
> a simple byte array, but it means strings are universal: every program
> will interpret them the same way. Unicode strings are the default ('')
> in Python 3.x and are created in 2.x by putting a u in front of the
> string literal (u'').
>
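The "number per character" idea can be seen directly in Python 3 with the built-ins ord() and chr(). This is only a small sketch added for illustration; the sample string and variable names are arbitrary.

```python
# Every character in a Python 3 str is a Unicode code point (just a number).
text = "A\u00e9\u4e2d"   # 'A', e-acute, and a Chinese character

# ord() gives the code point of a single character.
code_points = [ord(ch) for ch in text]

# chr() goes the other way: from a code point back to a one-character string.
rebuilt = "".join(chr(n) for n in code_points)

print(code_points)   # [65, 233, 20013]
```

Only the first value is below 128, i.e. inside the original ASCII range.
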
> Unicode, however, is a concept, and concepts can't be mapped to bits
> that can be sent through the network or stored on the hard drive. So
> instead we deal with strings internally as Unicode and then give them
> an encoding when we send them back out. Some encodings, such as UTF-8,
> can have multiple bytes per character and, as such, can deal with the
> full range of Unicode characters. Other times, programs still expect
> the old 8-bit encodings like ISO-8859-1 or the Windows ANSI code
> pages. In Python, to declare that a string is a literal sequence of
> bytes that the program should not try to interpret, you use b'' in
> Python 3.x, or just declare it normally in Python 2.x ('').
>
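A short sketch of the difference between text and its encoded bytes in Python 3, added for illustration. It assumes nothing beyond the built-in str.encode() and bytes.decode() methods; the sample word is arbitrary.

```python
text = "caf\u00e9"                     # 4 characters; the last one is e-acute

utf8_bytes = text.encode("utf-8")      # e-acute becomes two bytes
latin1_bytes = text.encode("latin-1")  # e-acute becomes one byte

# A b'' literal is already bytes; no encoding step is involved.
literal = b"caf\xc3\xa9"

print(len(text), len(utf8_bytes), len(latin1_bytes))   # 4 5 4
print(utf8_bytes == literal)                           # True
```

The same four characters give different byte sequences depending on the encoding, which is why the receiving side has to be told which one was used (the charset in the Content-type header, in the WSGI example).
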
> ------------------------------------------------------
> What happens in your program:
>
> When you print a Unicode string, Python has to decide what encoding to
> use. If you're printing to a terminal, Python looks for the terminal's
> encoding and uses that. In the event that it doesn't know what
> encoding to use, Python defaults to ASCII because that's compatible
> with almost everything. Since the string you're sending to the web
> page only contains ASCII characters, the automatic conversion works
> fine if you don't specify the b''. Since the resulting page uses UTF-8
> (which you declare in the header), which is compatible with ASCII, the
> output looks fine. If you try sending a string that has non-ASCII
> characters, the program might throw a UnicodeEncodeError because it
> doesn't know what bytes to use for those characters. It may be able to
> guess, but since I haven't used WSGI directly before, I can't say for
> sure.
>   
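The failure mode described above can be reproduced without any web server. The helper below is hypothetical (to_wire is not a WSGI or standard-library function); it only shows what str.encode() does when the chosen encoding has no bytes for a character. How a particular WSGI server handles this is not covered here.

```python
def to_wire(text, encoding):
    """Hypothetical helper: encode text, or return None if the encoding cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

ascii_only = "Hello World"
with_accent = "Hello W\u00f6rld"       # contains o-umlaut, which is not ASCII

print(to_wire(ascii_only, "ascii"))    # b'Hello World'
print(to_wire(with_accent, "ascii"))   # None (encoding failed)
print(to_wire(with_accent, "utf-8"))   # b'Hello W\xc3\xb6rld'
```

For ASCII-only text the ASCII and UTF-8 bytes are identical, which matches the observation that "Hello World" looks fine either way.
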

Thanks Benjamin - great 'history' lesson - explains it well.

Regards,
Alan