String prefix question


alan at baselinedata

Nov 8, 2009, 6:38 PM

Post #1 of 6 (335 views)
String prefix question

In the Python.org 3.1 documentation (section 20.4.6), there is a simple
“Hello World” WSGI application which includes the following method...

def hello_world_app(environ, start_response):
    status = b'200 OK'  # HTTP Status
    headers = [(b'Content-type', b'text/plain; charset=utf-8')]  # HTTP Headers
    start_response(status, headers)

    # The returned object is going to be printed
    return [b"Hello World"]

Question - Can anyone tell me why the 'b' prefix is present before each
string? The function seems to work equally well with and without the
prefix. From what I can gather from the documentation, the b prefix
represents a bytes literal, but can anyone explain (in simple English)
what this means?

Many thanks,
Alan
--
http://mail.python.org/mailman/listinfo/python-list


ben+python at benfinney

Nov 8, 2009, 7:01 PM

Post #2 of 6 (299 views)
Re: String prefix question [In reply to]

Alan Harris-Reid <alan [at] baselinedata> writes:

> From what I can gather from the documentation the b prefix represents
> a bytes literal

Yes. In Python 3 there are two types with similar-looking literal
syntax: ‘str’ and ‘bytes’. The types are mutually incompatible (though
they can be explicitly converted).

<URL:http://docs.python.org/3.1/library/stdtypes.html#typesseq>
<URL:http://docs.python.org/3.1/reference/lexical_analysis.html#strings>

> but can anyone explain (in simple english) what this means?

It means the difference between “a sequence of bytes” and “a sequence of
characters”. The two are not the same, have not ever been the same
despite a long history in computing of handwaving the differences, and
Python 3 finally makes them unambiguously distinct.

A general solution wasn't even feasible for a long time, but now we have
Unicode, a mature standard for uniformly representing all the world's
writing systems in software. So Python 3 made ‘str’ the Unicode “string
of characters” type, and the ‘'foo'’ literal syntax creates objects of
this type.
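
The distinction can be sketched in a few lines. This is only an
illustration, assuming a Python 3 interpreter; the variable names are
made up for the example and do not come from the documentation.

```python
text = 'foo'    # str: a sequence of Unicode characters
data = b'foo'   # bytes: a sequence of integers in range(256)

# The two types never compare equal, even when they look alike.
same = (text == data)

# Indexing shows the difference: a character versus a byte value.
first_char = text[0]
first_byte = data[0]

# Mixing the types is an error rather than a silent guess.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Conversion has to be explicit, and it goes through an encoding.
encoded = text.encode('utf-8')
decoded = data.decode('utf-8')
```

The encoding is named explicitly in both directions here, so nothing
depends on a platform default.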

The Python 3.1 documentation has a Unicode HOWTO that you should read
<URL:http://docs.python.org/3.1/howto/unicode.html>.

--
\ “We must respect the other fellow's religion, but only in the |
`\ sense and to the extent that we respect his theory that his |
_o__) wife is beautiful and his children smart.” —Henry L. Mencken |
Ben Finney


benjamin.kaplan at case

Nov 8, 2009, 7:04 PM

Post #3 of 6 (308 views)
Re: String prefix question [In reply to]

The rather long version:
read http://www.joelonsoftware.com/articles/Unicode.html

A somewhat shorter summary, along with how Python deals with this:

Once upon a time, someone decided to allocate 1 byte for each
character. Since everything the Americans who made the computers
needed fit into 7 bits, this was alright. And they called this the
American Standard Code for Information Interchange (ASCII). When
computers came along, device manufacturers realized that they had 128
characters that didn't mean anything, so they all made their own
characters to show for the upper 128. And when they started selling
computers internationally, they used the upper 128 to store the
characters they needed for the local language. This had several
problems.

1) Files made on one computer in one country wouldn't display right
on a computer made by a different manufacturer or for a different
country.

2) The 256 characters were enough for most Western languages, but
Chinese and Japanese need a whole lot more.

To solve this problem, Unicode was created. Rather than thinking of
each character as a distinct set of bits, it just assigns a number to
each one (a code point). The bottom 128 characters are the original
ASCII set, and everything else you could think of was added on top of
that - other alphabets, mathematical symbols, music notes, cuneiform,
dominoes, mahjong tiles, and more. Unicode is harder to implement than
a simple byte array, but it means strings are universal: every program
will interpret them exactly the same. Unicode strings are the default
('') in Python 3.x and are created in 2.x by putting a u in front of
the string literal (u'').
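
As a rough illustration of code points, assuming Python 3 (the sample
characters are arbitrary picks for the example), ord() and chr()
convert between a character and its number:

```python
# Every character has a code point, independent of how it is stored.
ascii_a = ord('A')                        # same value as in ASCII
e_acute = ord('\u00e9')                   # e with acute accent, beyond 7-bit ASCII
clef = ord('\N{MUSICAL SYMBOL G CLEF}')   # far beyond a single byte

# chr() goes the other way, from code point to character.
letter = chr(65)

# The length of a str is counted in characters, not bytes.
word = 'na\u00efve'
length = len(word)

# Plain English text stays inside the bottom 128 code points.
all_ascii = all(ord(c) < 128 for c in 'Hello World')
```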

Unicode, however, is a concept, and a concept by itself doesn't say
which bits get sent through the network or stored on the hard drive. So
instead we deal with strings internally as Unicode and then give them
an encoding when we send them back out. Some encodings, such as UTF-8,
can have multiple bytes per character and, as such, can deal with the
full range of Unicode characters. Other times, programs still expect
the old 8-bit encodings like ISO-8859-1 or the Windows ANSI code
pages. In Python, to declare that the string is a literal set of bytes
and the program should not try to interpret it, you use b'' in Python
3.x, or just declare it normally in Python 2.x ('').
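
A small sketch of what "giving a string an encoding" looks like,
assuming Python 3. The sample text is invented for the example; the
codec names are the standard ones shipped with Python.

```python
text = 'caf\u00e9'   # four characters; the last one is U+00E9

utf8 = text.encode('utf-8')          # the accented letter takes two bytes
latin1 = text.encode('iso-8859-1')   # the accented letter takes one byte

# The same characters give different byte sequences per encoding.
sizes = (len(text), len(utf8), len(latin1))

# Decoding with the wrong codec does not fail here; it quietly
# produces the wrong characters.
misread = utf8.decode('iso-8859-1')

# A character outside an 8-bit code page cannot be encoded at all.
try:
    '\u4e2d'.encode('iso-8859-1')
    fits = True
except UnicodeEncodeError:
    fits = False
```

This is why bytes on their own are ambiguous: without knowing the
encoding, there is no way to tell which characters they stand for.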

------------------------------------------------------
What happens in your program:

When you print a Unicode string, Python has to decide what encoding to
use. If you're printing to a terminal, Python looks for the terminal's
encoding and uses that. In the event that it doesn't know what
encoding to use, Python defaults to ASCII because that's compatible
with almost everything. Since the string you're sending to the web
page only contains ASCII characters, the automatic conversion works
fine if you don't specify the b''. The resulting page uses UTF-8
(which you declare in the header), and UTF-8 is compatible with ASCII,
so the output looks fine. If you try sending a string that has
non-ASCII characters, the program might throw a UnicodeEncodeError
because it doesn't know what bytes to use for those characters. It may
be able to guess, but since I haven't used WSGI directly before, I
can't say for sure.
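
One way to avoid relying on any automatic conversion is to encode the
body yourself. The sketch below assumes the later PEP 3333 convention
(Python 3.2 and newer), which differs from the 3.1 documentation
example quoted in this thread: the status and header values are plain
str, and only the response body is bytes. The fake_start_response
helper is hypothetical and exists only so the app can be called
without starting a server.

```python
from wsgiref.util import setup_testing_defaults


def hello_world_app(environ, start_response):
    # Assumption: PEP 3333 style, so status and headers are str.
    status = '200 OK'
    headers = [('Content-type', 'text/plain; charset=utf-8')]
    start_response(status, headers)

    # Encode explicitly, matching the charset declared in the header.
    body = 'H\u00e9llo World'
    return [body.encode('utf-8')]


captured = {}


def fake_start_response(status, headers, exc_info=None):
    # Hypothetical stand-in for the server's callback; it only records
    # what the application passed in.
    captured['status'] = status
    captured['headers'] = headers


environ = {}
setup_testing_defaults(environ)   # fill in a minimal WSGI environ
response = b''.join(hello_world_app(environ, fake_start_response))

# The same text cannot be represented in ASCII.
try:
    'H\u00e9llo World'.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

With the encoding spelled out in the application, the bytes on the
wire always agree with the charset in the Content-type header.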


grflanagan at gmail

Nov 9, 2009, 8:53 AM

Post #4 of 6 (295 views)
Re: String prefix question [In reply to]

Another link:

http://www.stereoplex.com/two-voices/python-unicode-and-unicodedecodeerror




alan at baselinedata

Nov 9, 2009, 9:37 AM

Post #5 of 6 (295 views)
Re: Re: String prefix question [In reply to]

Gerard Flanagan wrote:
> Another link:
>
> http://www.stereoplex.com/two-voices/python-unicode-and-unicodedecodeerror

Gerard - thanks for the link - explains it well.

Many thanks,
Alan



alan at baselinedata

Nov 9, 2009, 11:18 AM

Post #6 of 6 (290 views)
Re: Re: String prefix question [In reply to]

Thanks Benjamin - great 'history' lesson - explains it well.

Regards,
Alan
