Scott.Daniels at Acm
Nov 29, 2008, 9:27 AM
Post #10 of 16
Jeff H wrote:
> Actually, what I am surprised by, is the fact that hashlib cares at
> all about the encoding. A md5 hash can be produced for an .iso file
> which means it can handle bytes, why does it care what it is being
> fed, as long as there are bytes. I would have assumed that it would
> take whatever was feed to it and view it as a byte array and then hash
> it. You can read a binary file and hash it
> print md5.new(file('foo.iso').read()).hexdigest()
> What do I need to do to tell hashlib not to try and decode, just treat
> the data as binary?
If you do not care about portability or reproducibility, you can just go
with whatever bytes you get most easily.
To take your example:
    open('foo.iso', 'r').read()
will read different things on Linux and Windows (text mode translates
line endings), while:
    open('foo.iso', 'rb').read()
should read the same bytes on both; hashing does not magically allow
you to stop thinking.
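A minimal sketch of hashing a file opened in binary mode (the file name
and contents here are just illustrative; the example writes its own
sample file so it is self-contained):

```python
import hashlib

# Create a small sample file so the example runs anywhere;
# 'foo.iso' is only an illustrative name.
with open('foo.iso', 'wb') as f:
    f.write(b'hello world\n')

# Opening with 'rb' yields the same bytes on every platform,
# so the digest is reproducible.
with open('foo.iso', 'rb') as f:
    digest = hashlib.md5(f.read()).hexdigest()
```

In binary mode no decoding happens at all, which is exactly what you
want when the input is already arbitrary bytes.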
If you now, and for all time, decide that the only source you will take
is cp1252, perhaps you should decode from cp1252 before hashing.
Even if you have Unicode, you can have alternative Unicode expressions
of the same "characters," so you may want to convert the Unicode to a
"Normalized Form" of Unicode before encoding to bytes. The major
candidates for that are NFC, NFD, NFKC, and NFKD.
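A small sketch of why normalization matters before hashing: the same
visible character can be spelled with different code points, and the
two spellings hash differently unless you normalize first.

```python
import hashlib
import unicodedata

# 'é' as one code point (NFC form) vs. 'e' + a combining accent (NFD form).
composed = '\u00e9'
decomposed = 'e\u0301'

# Different code-point sequences, but the same character after NFC.
nfc1 = unicodedata.normalize('NFC', composed)
nfc2 = unicodedata.normalize('NFC', decomposed)

# Normalize, then encode to UTF-8, then hash: both spellings agree.
h1 = hashlib.md5(nfc1.encode('utf-8')).hexdigest()
h2 = hashlib.md5(nfc2.encode('utf-8')).hexdigest()
```

Without the normalize step, `composed` and `decomposed` would produce
different digests even though they render identically.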
Again, once you have chosen your normalized form (or decided to skip the
normalization step), I'd suggest going to UTF-8 (which is pretty
unambiguous) and then hashing the result. The problem with another choice
is that UTF-16 comes in two flavors (UTF-16BE and UTF-16LE); UTF-32 also
has two flavors (UTF-32BE and UTF-32LE), and whatever your current
Python, you may well switch between UTF-16 and UTF-32 internally at some
point as you do regular upgrades (or BE vs. LE if you switch CPUs).
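A quick sketch of the byte-order problem (the sample text is arbitrary):
the same characters encoded as UTF-16LE and UTF-16BE are different byte
sequences, so they hash differently, while UTF-8 has only one form.

```python
import hashlib

text = 'abc'

# The same three characters in the two UTF-16 byte orders.
le = text.encode('utf-16-le')   # b'a\x00b\x00c\x00'
be = text.encode('utf-16-be')   # b'\x00a\x00b\x00c'

# Identical text, different bytes -- and therefore different digests.
digest_le = hashlib.md5(le).hexdigest()
digest_be = hashlib.md5(be).hexdigest()

# UTF-8 has a single flavor, so it sidesteps the ambiguity.
digest_utf8 = hashlib.md5(text.encode('utf-8')).hexdigest()
```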
--Scott David Daniels
Scott.Daniels [at] Acm