
alan at baselinedata
Nov 9, 2009, 11:18 AM
Post #6 of 6
Benjamin Kaplan wrote:
> On Sun, Nov 8, 2009 at 9:38 PM, Alan Harris-Reid
> <alan [at] baselinedata> wrote:
>
>> In the Python.org 3.1 documentation (section 20.4.6), there is a simple
>> "Hello World" WSGI application which includes the following method...
>>
>> def hello_world_app(environ, start_response):
>>     status = '200 OK' # HTTP Status
>>     headers = [(b'Content-type', b'text/plain; charset=utf-8')] # HTTP Headers
>>     start_response(status, headers)
>>
>>     # The returned object is going to be printed
>>     return [b"Hello World"]
>>
>> Question - Can anyone tell me why the 'b' prefix is present before each
>> string? The method seems to work equally well with and without the prefix.
>> From what I can gather from the documentation the b prefix represents a
>> bytes literal, but can anyone explain (in simple English) what this means?
>>
>> Many thanks,
>> Alan
>>
>
> The rather long version:
> read http://www.joelonsoftware.com/articles/Unicode.html
>
> A somewhat shorter summary, along with how Python deals with this:
>
> Once upon a time, someone decided to allocate 1 byte for each
> character. Since everything the Americans who made the computers
> needed fit into 7 bits, this was alright. And they called this the
> American Standard Code for Information Interchange (ASCII). When
> computers came along, device manufacturers realized that they had 128
> characters that didn't mean anything, so they all made their own
> characters to show for the upper 128. And when they started selling
> computers internationally, they used the upper 128 to store the
> characters they needed for the local language. This had several
> problems.
>
> 1) Files made on one computer in one country wouldn't display right
> on a computer made by a different manufacturer or for a different
> country.
>
> 2) The 256 characters were enough for most Western languages, but
> Chinese and Japanese need a whole lot more.
>
> To solve this problem, Unicode was created. Rather than thinking of
> each character as a distinct set of bits, it just assigns a number to
> each one (a code point). The bottom 128 characters are the original
> ASCII set, and everything else you could think of was added on top of
> that - other alphabets, mathematical symbols, music notes, cuneiform,
> dominoes, mahjong tiles, and more. Unicode is harder to implement than
> a simple byte array, but it means strings are universal: every program
> will interpret them exactly the same. Unicode strings are the default
> ('') in Python 3.x and are created in 2.x by putting a u in front of
> the string literal (u'').
>
> Unicode, however, is a concept, and concepts can't be mapped to bits
> that can be sent through the network or stored on the hard drive. So
> instead we deal with strings internally as Unicode and then give them
> an encoding when we send them back out. Some encodings, such as UTF-8,
> can have multiple bytes per character and, as such, can deal with the
> full range of Unicode characters. Other times, programs still expect
> the old 8-bit encodings like ISO-8859-1 or the Windows ANSI code
> pages. In Python, to declare that the string is a literal set of bytes
> and the program should not try to interpret it, you use b'' in Python
> 3.x, or just declare it normally in Python 2.x ('').
>
> ------------------------------------------------------
> What happens in your program:
>
> When you print a Unicode string, Python has to decide what encoding to
> use. If you're printing to a terminal, Python looks for the terminal's
> encoding and uses that. In the event that it doesn't know what
> encoding to use, Python defaults to ASCII because that's compatible
> with almost everything. Since the string you're sending to the web
> page only contains ASCII characters, the automatic conversion works
> fine if you don't specify the b''. Since the resulting page uses UTF-8
> (which you declare in the header), which is compatible with ASCII, the
> output looks fine. If you try sending a string that has non-ASCII
> characters, the program might throw a UnicodeEncodeError because it
> doesn't know what bytes to use for those characters. It may be able to
> guess, but since I haven't used WSGI directly before, I can't say for
> sure.
>

Thanks Benjamin - great 'history' lesson - explains it well.

Regards,
Alan
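
For anyone reading this thread later, here is a minimal sketch of the str versus bytes distinction Benjamin describes, assuming Python 3. The sample text and variable names are made up for illustration and are not taken from the documentation example; it only shows how one Unicode string turns into different bytes under different encodings, and why ASCII-only text is a safe special case.

```python
# Illustrative sketch only; the sample strings are invented for this example.
text = "caf\u00e9"  # a str: four Unicode code points, the last is U+00E9

# Encoding turns code points into concrete bytes.
utf8_bytes = text.encode("utf-8")      # U+00E9 becomes two bytes
latin1_bytes = text.encode("latin-1")  # U+00E9 becomes one byte

print(len(text), len(utf8_bytes), len(latin1_bytes))  # 4 5 4
print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding with the matching encoding recovers the original str.
assert utf8_bytes.decode("utf-8") == text

# ASCII has no byte for U+00E9, so encoding to ASCII raises an error.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
    print("ASCII cannot encode:", repr(exc.object[exc.start:exc.end]))

# Pure-ASCII text yields the same bytes in ASCII and in UTF-8, which is
# why a string like "Hello World" survives either encoding unchanged.
hello = "Hello World"
assert hello.encode("ascii") == hello.encode("utf-8") == b"Hello World"
```

The b'' literal in the question is simply a way to write the already-encoded form directly, so no encoding decision has to be made at output time.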