
Mailing List Archive: Python: Python

unicode and hashlib

 

 



dundeemt at gmail

Nov 28, 2008, 8:11 AM

Post #1 of 16
unicode and hashlib

hashlib.md5 does not appear to like unicode,
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
position 1650: ordinal not in range(128)

After googling, I've found BDFL and others on Py3K talking about the
problems of hashing non-bytes (i.e. buffers)
http://www.mail-archive.com/python-3000@python.org/msg09824.html

So what is the canonical way to hash unicode?
* convert unicode to the local encoding
* hash in the current locale's encoding
???
but what if the locale has ordinals outside of 128?

Is this just a problem for md5 hashes that I would not encounter using
a different method? i.e. Should I just use the built-in hash function?
--
http://mail.python.org/mailman/listinfo/python-list


Scott.Daniels at Acm

Nov 28, 2008, 11:24 AM

Post #2 of 16
Re: unicode and hashlib [In reply to]

Jeff H wrote:
> hashlib.md5 does not appear to like unicode,
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> position 1650: ordinal not in range(128)
>
> After googling, I've found BDFL and others on Py3K talking about the
> problems of hashing non-bytes (i.e. buffers) ...
Unicode is characters, not a character encoding.
You could hash on a utf-8 encoding of the Unicode.

> So what is the canonical way to hash unicode?
> * convert unicode to local
> * hash in current local
> ???
There is no _the_ way to hash Unicode, any more than
there is _the_ way to hash vectors. You need to
convert the abstract entity to something concrete with
a well-defined representation in bytes, and hash that.

> Is this just a problem for md5 hashes that I would not encounter using
> a different method? i.e. Should I just use the built-in hash function?
No, it is a definitional problem. Perhaps you could explain how you
want to use the hash. If the internal hash is acceptable (e.g. for
grouping in dictionaries within a single run), use that. If you intend
to store and compare on the same system, say that. If you want cross-
platform execution of your code to produce the same hashes, say that.
A hash is a means to an end, and it is hard to give advice without
knowing the goal.

--Scott David Daniels
Scott.Daniels [at] Acm
--
http://mail.python.org/mailman/listinfo/python-list


google at mrabarnett

Nov 28, 2008, 11:25 AM

Post #3 of 16
Re: unicode and hashlib [In reply to]

Jeff H wrote:
> hashlib.md5 does not appear to like unicode,
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> position 1650: ordinal not in range(128)
>
> After googling, I've found BDFL and others on Py3K talking about the
> problems of hashing non-bytes (i.e. buffers)
> http://www.mail-archive.com/python-3000 [at] python/msg09824.html
>
> So what is the canonical way to hash unicode?
> * convert unicode to local
> * hash in current local
> ???
> but what if local has ordinals outside of 128?
>
> Is this just a problem for md5 hashes that I would not encounter using
> a different method? i.e. Should I just use the built-in hash function?
>
It can handle bytestrings, but if you give it unicode it performs a
default encoding to ASCII, which fails if there's a codepoint >=
U+0080. Personally, I'd recommend encoding unicode to UTF-8.
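[Editor's note: a minimal sketch of this advice, written in Python 3 syntax where the str/bytes distinction is explicit; the sample string containing U+00A6 (the character from the original traceback) is illustrative.]

```python
import hashlib

# A character string containing U+00A6, the non-ASCII broken bar
# from the original UnicodeEncodeError.
text = u"abc\xa6def"

# Encode the character string to a well-defined byte sequence first;
# hashlib then hashes those bytes and never has to guess an encoding.
digest = hashlib.md5(text.encode("utf-8")).hexdigest()
print(digest)
```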
--
http://mail.python.org/mailman/listinfo/python-list


tjreedy at udel

Nov 28, 2008, 12:03 PM

Post #4 of 16
Re: unicode and hashlib [In reply to]

Jeff H wrote:
> hashlib.md5 does not appear to like unicode,
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> position 1650: ordinal not in range(128)

It is the (default) ascii encoder that does not like non-ascii chars.
I suspect that if you encode to bytes first with an encoder that does
work (latin-???), md5 will be happy.
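[Editor's note: Terry's suggestion, sketched in Python 3 syntax; latin-1 is chosen as the illustrative encoder since it covers the u'\xa6' from the traceback.]

```python
import hashlib

text = u"abc\xa6def"

# latin-1 maps code points U+0000..U+00FF one-to-one onto bytes, so any
# string restricted to that range encodes without a UnicodeEncodeError.
latin_bytes = text.encode("latin-1")
print(hashlib.md5(latin_bytes).hexdigest())
```

Note that a different encoder yields different bytes, and therefore a different digest, than UTF-8 would for the same string.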

Reports like this should include Python version.

> After googling, I've found BDFL and others on Py3K talking about the
> problems of hashing non-bytes (i.e. buffers)
> http://www.mail-archive.com/python-3000 [at] python/msg09824.html
>
> So what is the canonical way to hash unicode?
> * convert unicode to local
> * hash in current local
> ???
> but what if local has ordinals outside of 128?
>
> Is this just a problem for md5 hashes that I would not encounter using
> a different method? i.e. Should I just use the built-in hash function?
> --
> http://mail.python.org/mailman/listinfo/python-list
>

--
http://mail.python.org/mailman/listinfo/python-list


paul at boddie

Nov 28, 2008, 12:23 PM

Post #5 of 16
Re: unicode and hashlib [In reply to]

On 28 Nov, 21:03, Terry Reedy <tjre...@udel.edu> wrote:
>
> It is the (default) ascii encoder that does not like non-ascii chars.
> I suspect that is you encode to bytes first with an encoder that does
> work (latin-???), md5 will be happy.

I know that the "Python roadmap" answer to such questions might refer
to Python 3.0 and its "strings are Unicode" features, and having seen
this mentioned a lot recently, I'm surprised that no-one has done so
here at the time of writing. But I do wonder whether good old Python
2.x wouldn't benefit from a more explicit error message in these
situations.

Since the introduction of Unicode in Python 1.6/2.0, I've always tried
to make the distinction between what I call "plain strings" or "byte
strings" and "Unicode objects" or "character strings", and perhaps the
UnicodeEncodeError message should be enhanced to say what is actually
going on: that an attempt is being made to convert characters into
byte values and that the chosen way of doing so (which often involves
the default, ASCII encoding) cannot manage the job.

Paul
--
http://mail.python.org/mailman/listinfo/python-list


dundeemt at gmail

Nov 29, 2008, 6:23 AM

Post #6 of 16
Re: unicode and hashlib [In reply to]

On Nov 28, 1:24 pm, Scott David Daniels <Scott.Dani...@Acm.Org> wrote:
> Jeff H wrote:
> > hashlib.md5 does not appear to like unicode,
> >   UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > position 1650: ordinal not in range(128)
>
> > After googling, I've found BDFL and others on Py3K talking about the
> > problems of hashing non-bytes (i.e. buffers) ...
>
> Unicode is characters, not a character encoding.
> You could hash on a utf-8 encoding of the Unicode.
>
> > So what is the canonical way to hash unicode?
> >  * convert unicode to local
> >  * hash in current local
> > ???
>
> There is no _the_ way to hash Unicode, any more than
> there is _the_ way to hash vectors.  You need to
> convert the abstract entity to something concrete with
> a well-defined representation in bytes, and hash that.
>
> > Is this just a problem for md5 hashes that I would not encounter using
> > a different method?  i.e. Should I just use the built-in hash function?
>
> No, it is a definitional problem.  Perhaps you could explain how you
> want to use the hash.  If the internal hash is acceptable (e.g. for
> grouping in dictionaries within a single run), use that.  If you intend
> to store and compare on the same system, say that.  If you want cross-
> platform execution of your code to produce the same hashes, say that.
> A hash is a means to an end, and it is hard to give advice without
> knowing the goal.
>
I am checking for changes to large text objects stored in a database
against outside sources. So the hash needs to be reproducible/stable.
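[Editor's note: for that goal of reproducible/stable digests, one hedged sketch (Python 3 syntax; the function name is the editor's, not from the thread) is to pin the encoding explicitly so nothing depends on a platform default.]

```python
import hashlib

def stable_digest(text):
    # Pinning the encoding explicitly keeps the digest identical across
    # runs, machines, and Python versions -- nothing is left to a default.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

print(stable_digest(u"Andr\xe9"))
```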

> --Scott David Daniels
> Scott.Dani...@Acm.Org

--
http://mail.python.org/mailman/listinfo/python-list


dundeemt at gmail

Nov 29, 2008, 6:27 AM

Post #7 of 16
Re: unicode and hashlib [In reply to]

On Nov 28, 2:03 pm, Terry Reedy <tjre...@udel.edu> wrote:
> Jeff H wrote:
> > hashlib.md5 does not appear to like unicode,
> >   UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > position 1650: ordinal not in range(128)
>
> It is the (default) ascii encoder that does not like non-ascii chars.
> I suspect that if you encode to bytes first with an encoder that does
> work (latin-???), md5 will be happy.
>
> Reports like this should include Python version.
>
> > After googling, I've found BDFL and others on Py3K talking about the
> > problems of hashing non-bytes (i.e. buffers)
> > http://www.mail-archive.com/python-3...@python.org/msg09824.html
>
> > So what is the canonical way to hash unicode?
> >  * convert unicode to local
> >  * hash in current local
> > ???
> > but what if local has ordinals outside of 128?
>
> > Is this just a problem for md5 hashes that I would not encounter using
> > a different method?  i.e. Should I just use the built-in hash function?
> > --
> >http://mail.python.org/mailman/listinfo/python-list
>
>

Python 2.5.2 -- however, this is not really a bug report because your
analysis is correct. I am converting cp1252 strings to unicode before
I persist them in a database. I am looking for advice/direction/
wisdom on how to sling these strings <g>

-Jeff
--
http://mail.python.org/mailman/listinfo/python-list


dundeemt at gmail

Nov 29, 2008, 6:51 AM

Post #8 of 16
Re: unicode and hashlib [In reply to]

On Nov 29, 8:27 am, Jeff H <dunde...@gmail.com> wrote:
> On Nov 28, 2:03 pm, Terry Reedy <tjre...@udel.edu> wrote:
>
>
>
> > Jeff H wrote:
> > > hashlib.md5 does not appear to like unicode,
> > >   UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > > position 1650: ordinal not in range(128)
>
> > It is the (default) ascii encoder that does not like non-ascii chars.
> > I suspect that is you encode to bytes first with an encoder that does
> > work (latin-???), md5 will be happy.
>
> > Reports like this should include Python version.
>
> > > After googling, I've found BDFL and others on Py3K talking about the
> > > problems of hashing non-bytes (i.e. buffers)
> > > http://www.mail-archive.com/python-3...@python.org/msg09824.html
>
> > > So what is the canonical way to hash unicode?
> > >  * convert unicode to local
> > >  * hash in current local
> > > ???
> > > but what if local has ordinals outside of 128?
>
> > > Is this just a problem for md5 hashes that I would not encounter using
> > > a different method?  i.e. Should I just use the built-in hash function?
> > > --
> > >http://mail.python.org/mailman/listinfo/python-list
>
> Python v2.52 -- however, this is not really a bug report because your
> analysis is correct. I am converting cp1252 strings to unicode before
> I persist them in a database.  I am looking for advice/direction/
> wisdom on how to sling these strings<g>
>
> -Jeff

Actually, what I am surprised by is the fact that hashlib cares at
all about the encoding. An md5 hash can be produced for an .iso file,
which means it can handle bytes, so why does it care what it is being
fed, as long as there are bytes? I would have assumed that it would
take whatever was fed to it, view it as a byte array, and then hash
it. You can read a binary file and hash it:
    print md5.new(file('foo.iso').read()).hexdigest()
What do I need to do to tell hashlib not to try and decode, just treat
the data as binary?

--
http://mail.python.org/mailman/listinfo/python-list


bj_666 at gmx

Nov 29, 2008, 8:29 AM

Post #9 of 16
Re: unicode and hashlib [In reply to]

On Sat, 29 Nov 2008 06:51:33 -0800, Jeff H wrote:

> Actually, what I am surprised by, is the fact that hashlib cares at all
> about the encoding. A md5 hash can be produced for an .iso file which
> means it can handle bytes, why does it care what it is being fed, as
> long as there are bytes.

But you don't have bytes, you have a `unicode` object. The internal byte
representation is implementation specific and not your business.

> I would have assumed that it would take
> whatever was feed to it and view it as a byte array and then hash it.

How? There is no (sane) way to get at the internal byte representation.
And that byte representation might contain things like pointers to memory
locations that are different for two `unicode` objects which compare
equal, so you would get different hash values for objects that otherwise
look the same from the Python level. Not very useful.

> You can read a binary file and hash it
> print md5.new(file('foo.iso').read()).hexdigest()
> What do I need to do to tell hashlib not to try and decode, just treat
> the data as binary?

It's not about *de*coding, it is about *en*coding your `unicode` object
so you get bytes to feed to the MD5 algorithm.
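[Editor's note: Marc's point, sketched in Python 3 syntax; the sample strings are illustrative. Encoding first yields a deterministic byte sequence, so equal strings always produce equal digests no matter how the objects were built internally.]

```python
import hashlib

# Two equal strings, built differently; encoding each produces
# identical bytes, so no internal representation is ever involved.
parts = [u"caf", u"\xe9"]
s1 = u"caf\xe9"
s2 = u"".join(parts)

d1 = hashlib.md5(s1.encode("utf-8")).hexdigest()
d2 = hashlib.md5(s2.encode("utf-8")).hexdigest()
print(d1 == d2)
```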

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Scott.Daniels at Acm

Nov 29, 2008, 9:27 AM

Post #10 of 16
Re: unicode and hashlib [In reply to]

Jeff H wrote:
> ...
> Actually, what I am surprised by, is the fact that hashlib cares at
> all about the encoding. A md5 hash can be produced for an .iso file
> which means it can handle bytes, why does it care what it is being
> fed, as long as there are bytes. I would have assumed that it would
> take whatever was feed to it and view it as a byte array and then hash
> it. You can read a binary file and hash it
> print md5.new(file('foo.iso').read()).hexdigest()
> What do I need to do to tell hashlib not to try and decode, just treat
> the data as binary?

If you do not care about portability or reproducibility, you can just
go with the bytes you get most easily.

To take your example:

    with open('foo.iso', 'r') as src:
        print hashlib.md5(src.read()).hexdigest()

will print different things on Linux and Windows.

    with open('foo.iso', 'rb') as src:
        print hashlib.md5(src.read()).hexdigest()

should print the same thing on both; hashing does not magically allow
you to stop thinking.

If you now, and for all time, decide that the only source you will take
is cp1252, perhaps you should decode to cp1252 before hashing.

Even if you have Unicode, you can have alternative Unicode expressions
of the same "characters," so you may want to convert the Unicode to a
"Normalized Form" of Unicode before decoding to bytes. The major
candidates for that are NFC, NFD, NFKC, and NFKD; see:
http://unicode.org/reports/tr15/
Again, once you have chosen your normalized form (or decided to skip
the normalization step), I'd suggest going to UTF-8 (which is pretty
unambiguous) and then hashing the result. The problem with another
choice is that UTF-16 comes in two flavors (UTF-16BE and UTF-16LE);
UTF-32 also has two flavors (UTF-32BE and UTF-32LE), and whatever your
current Python, you may well switch between UTF-16 and UTF-32
internally at some point as you do regular upgrades (or BE vs. LE if
you switch CPUs).
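[Editor's note: the normalization point can be sketched with the standard unicodedata module (Python 3 syntax; NFC is chosen here purely as an illustration, per the forms listed above).]

```python
import hashlib
import unicodedata

composed = u"Andr\xe9"        # 'e-acute' as one code point, U+00E9
decomposed = u"Andre\u0301"   # 'e' followed by a combining acute accent

# Without normalization, the two equal-looking strings hash differently...
raw = [hashlib.md5(s.encode("utf-8")).hexdigest()
       for s in (composed, decomposed)]

# ...but normalizing both to NFC first makes the digests agree.
nfc = [hashlib.md5(unicodedata.normalize("NFC", s).encode("utf-8")).hexdigest()
       for s in (composed, decomposed)]
print(raw[0] != raw[1], nfc[0] == nfc[1])
```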

--Scott David Daniels
Scott.Daniels [at] Acm

--
http://mail.python.org/mailman/listinfo/python-list


Scott.Daniels at Acm

Nov 29, 2008, 10:23 AM

Post #11 of 16
Re: unicode and hashlib [In reply to]

Scott David Daniels wrote:
...
> If you now, and for all time, decide that the only source you will take
> is cp1252, perhaps you should decode to cp1252 before hashing.

Of course my dyslexia sticks out here as I get encode and decode exactly
backwards -- Marc 'BlackJack' Rintsch has it right.

Characters (a concept) are "encoded" to a byte format (representation).
Bytes (a precise representation) are "decoded" to characters (a format
with semantics).

--Scott David Daniels
Scott.Daniels [at] Acm
--
http://mail.python.org/mailman/listinfo/python-list


dundeemt at gmail

Nov 29, 2008, 6:54 PM

Post #12 of 16
Re: unicode and hashlib [In reply to]

On Nov 29, 12:23 pm, Scott David Daniels <Scott.Dani...@Acm.Org>
wrote:
> Scott David Daniels wrote:
>
> ...
>
> > If you now, and for all time, decide that the only source you will take
> > is cp1252, perhaps you should decode to cp1252 before hashing.
>
> Of course my dyslexia sticks out here as I get encode and decode exactly
> backwards -- Marc 'BlackJack' Rintsch has it right.
>
> Characters (a concept) are "encoded" to a byte format (representation).
> Bytes (a precise representation) are "decoded" to characters (a format
> with semantics).
>
> --Scott David Daniels
> Scott.Dani...@Acm.Org

Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that hashlib was trying to encode (not decode) my unicode object as
'ascii' (my default encoding), and since that resulted in characters
above 128 -- shhh'boom. So once I have character strings transformed
internally to unicode objects, I should encode them in 'utf-8' before
feeding them to things that guess at the proper way to encode them for
further processing (i.e. hashlib).

>>> a='André'
>>> b=unicode(a,'cp1252')
>>> b
u'Andr\xc3\xa9'
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'b4e5418a36bc4badfc47deb657a2b50c'

Scott then points out that utf-8 is probably superior (for use within
the code I control) to utf-16 and utf-32, which both have two variants,
and which one is used sometimes depends on installed software and/or
processors. utf-8, unlike -16/-32, stays reliable and reproducible
irrespective of software or hardware.

decode vs encode
You decode from one character set to a unicode object
You encode from a unicode object to a specified character set
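[Editor's note: that summary as a round trip, in Python 3 syntax where bytes and str are separate types; the cp1252 sample is the thread's own 'André' example.]

```python
# decode: bytes in a known character set -> character string
raw = b"Andr\xe9"             # 'Andre-acute' as cp1252 bytes
text = raw.decode("cp1252")

# encode: character string -> bytes in a chosen character set
utf8 = text.encode("utf-8")
print(repr(text), utf8)
```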

Please correct me if you see something wrong and thank you for your
advice and direction.

u'unicordial-ly yours. ;)'
Jeff
--
http://mail.python.org/mailman/listinfo/python-list


Scott.Daniels at Acm

Nov 30, 2008, 10:16 AM

Post #13 of 16
Re: unicode and hashlib [In reply to]

Jeff H wrote:
> ...
> decode vs encode
> You decode from one character set to a unicode object
> You encode from a unicode object to a specified character set

Pretty close:

encode:
Think of characters a "conceptual" -- you encode a character
string into a bunch of bytes (unicode -> bytes) in order to send
the characters along a wire, into an e-mail, or put in a database.

decode:
You got the bytes from the wire, database, Morse code, whatever.
You decode the byte stream into characters, and now you really have
characters. Thinking of it this way makes it clear which name is
which, unless (as I did once in this thread) you switch opposite
concepts carelessly.


Characters are content (understood by humans), bytes are gibberish
carried by hardware which likes that kind of thing. You encode a
message into nonsense for your carrier to carry to your recipient,
and the recipient decodes the nonsense back into the message.

--Scott David Daniels
Scott.Daniels [at] Acm
--
http://mail.python.org/mailman/listinfo/python-list


fakeaddress at nowhere

Dec 1, 2008, 5:53 AM

Post #14 of 16
Re: unicode and hashlib [In reply to]

Jeff H wrote:
> [...] So once I have character strings transformed
> internally to unicode objects, I should encode them in 'utf-8' before
> attempting to do things that guess at the proper way to encode them
> for further processing.(i.e. hashlib)

It looks like hashlib in Python 3 will not even attempt to digest a
unicode object. Trying to hash 'abcdefg' in Python 3.0rc3 I get:

TypeError: object supporting the buffer API required

I think that's good behavior, except that the error message is likely to
send beginners to look up the obscure buffer interface before they find
they just need mystring.decode('utf8') or bytes(mystring, 'utf8').
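[Editor's note: this Python 3 behavior can be checked directly; the exact wording of the message has changed across 3.x releases, so only the exception type matters here, and the fix shown uses encode() as the thread's final post confirms.]

```python
import hashlib

# Feeding a str (characters, not bytes) is rejected outright in Python 3.
try:
    hashlib.md5("abcdefg")
    rejected = False
except TypeError:
    rejected = True
print(rejected)

# The fix: encode to bytes explicitly, then hash those bytes.
digest = hashlib.md5("abcdefg".encode("utf-8")).hexdigest()
print(digest)
```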

> >>> a='André'
> >>> b=unicode(a,'cp1252')
> >>> b
> u'Andr\xc3\xa9'
> >>> hashlib.md5(b.encode('utf-8')).hexdigest()
> 'b4e5418a36bc4badfc47deb657a2b50c'

Incidentally, MD5 has fallen and SHA-1 is falling. Python's hashlib also
includes the stronger SHA-2 family.
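[Editor's note: the SHA-2 suggestion is a drop-in change in hashlib; a sketch in Python 3 syntax, reusing the thread's UTF-8 bytes.]

```python
import hashlib

data = u"Andr\xe9".encode("utf-8")

# Same calling convention as md5; only the constructor name changes.
print(hashlib.sha256(data).hexdigest())
```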


--
--Bryan
--
http://mail.python.org/mailman/listinfo/python-list


Scott.Daniels at Acm

Dec 1, 2008, 4:35 PM

Post #15 of 16
Re: unicode and hashlib [In reply to]

Bryan Olson wrote:
> ... I think that's good behavior, except that the error message is likely
> to send beginners to look up the obscure buffer interface before they find
> they just need mystring.decode('utf8') or bytes(mystring, 'utf8').
Oops, careful here (I made this mistake once in this thread as well).
You _encode_ from unicode to bytes. The code you quoted doesn't run.
This does:

>>> a = 'Andr\xe9'
>>> b = unicode(a, 'cp1252')
>>> b.encode('utf-8')
'Andr\xc3\xa9'
>>> b.decode('utf-8')

Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
b.decode('utf-8')
File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 4: ordinal not in range(128)

>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'45f1deffb45a5f6c2380a4cee9b3e452'

>>> hashlib.md5(b.decode('utf-8')).hexdigest()

Traceback (most recent call last):
File "<pyshell#21>", line 1, in <module>
hashlib.md5(b.decode('utf-8')).hexdigest()
File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 4: ordinal not in range(128)


> Incidentally, MD5 has fallen and SHA-1 is falling. Python's hashlib also
> includes the stronger SHA-2 family.

Well, the choice of hash always depends on the app.


-Scott
--
http://mail.python.org/mailman/listinfo/python-list


fakeaddress at nowhere

Dec 2, 2008, 4:34 PM

Post #16 of 16
Re: unicode and hashlib [In reply to]

Scott David Daniels wrote:
> Bryan Olson wrote:
>> ... I think that's good behavior, except that the error message is likely
>> to send beginners to look up the obscure buffer interface before they
>> find they just need mystring.decode('utf8') or bytes(mystring, 'utf8').

> Oops, careful here (I made this mistake once in this thread as well).
> You _encode_ from unicode to bytes. The code you quoted doesn't run.

Doh! I even tested it with .encode(), then wrote it wrong.

Just in case anyone Googles the error message and lands here: If you are
working with a Python str (string) object and get,

TypeError: object supporting the buffer API required

Then you probably want to encode the string to a bytes object, and
UTF-8 is likely the encoding of choice, as in:

mystring.encode('utf8')

or

bytes(mystring, 'utf8')


Thanks for the correction.
--
--Bryan
--
http://mail.python.org/mailman/listinfo/python-list
