MD5. Hashes. Just what is that all about?
This article is to explore the subject a little and highlight some of the recent discoveries and issues. If you know all about MD5, hash collisions, SHA-1, … then this isn’t for you!
I’m writing this for someone who is technically aware, possibly heard of something called MD5 but has no real idea what it is, why it’s used nor why they should care. I’m going to avoid the detailed maths and programming, so jump on in, there’s nothing to be scared of here…
I know I’ve seen it somewhere…
A true geeky approach to MD5 hashes would kick off with a discussion about cryptography, public keys, authentication, and other subjects sure to send the nervous running in the other direction. So let’s forget that stuff for now…
MD5 is one of those odd-looking terms that you might recall seeing on some web page one day, but you can’t remember where, and nor did you pay any attention to it. Let me jog your memory a bit… One of those download pages on a site, where the funky colourful stuff suddenly gives way to a dry looking list of files to download…
Take a look at some Mozilla downloads here for example. See that file called MD5SUMS? And, related, the one called SHA1SUMS? That’s where you’ve seen the term, isn’t it?!!! But you always paid it no attention, right? And your download seemed to work just fine, no?
OK, what is in there?
Let’s take a look at a bit of that file MD5SUMS:
5ea1b8182d5bc57f8072a2db5426c0b6 ./win32/de-DE/Firefox 1.0rc2.zip
1596c4a9fc3c3e4db09bb9e306f8d16b ./win32/de-DE/Firefox Setup 1.0rc2.exe
44861d76895f0be59ae386aa9fce7d1f ./win32/en-US/Firefox 1.0rc2.zip
25b9ec7ae77c0df706e13eba5bccef80 ./win32/en-US/Firefox Setup 1.0rc2.exe
2527d4c149cadefe323f82887ca46549 ./win32/nb-NO/Firefox 1.0rc2.zip
7aaad62290030cd6e67379ba0033126d ./win32/nb-NO/Firefox Setup 1.0rc2.exe
692e1f904bc488bdaf678037d6bd94ca ./win32/nl-NL/Firefox 1.0rc2.zip
dc5fff712281cabf60a89f39389748ac ./win32/nl-NL/Firefox Setup 1.0rc2.exe
.
.
Don’t panic…
All we have here are some (rather long) numbers, followed by file names. The numbers are large and, to add to the possible visual shock, are hexadecimal, so they have letters in them as well as digits. But they are just numbers. Don’t be put off by them. And we can see that each file name seems to have a different number next to it. In fact it might strike you that the numbers are not just a little different (a digit or two) but very different indeed. This is something we’ll go in to in more detail.
Each number is obtained by “passing” the actual byte by byte contents of the file concerned “through” a function which acts upon it in a pre-defined way. Sounds a little dry and formal, but let’s look at some examples.
Functions, functions, functions…
Forget MD5 for a moment, and forget the files above. Let’s look are a really simple example.
I’ve got a file made up of 4 bytes, with the following values:
108 202 7 22
Those numbers are simply made-up by me – they have no significance at all. Oh, and for the moment we’re doing this in good’ol decimal, and not hexadecimal (digits and letters…)
I then define a hashing function called myFunc which says that you do this to the file being hashed:
Step 1: Start with a counter value of 10.
Step 2: Add the next byte of the file to the counter.
Step 3: If the result is greater than 3 digits in length, drop the last digit.
Step 4: If the file is finished, stop, otherwise go back to Step 2 and do this for the next byte in the file.
Very simple. Let’s do it for my test file:
Step 1: counter = 10
Step 2: counter = 10 + 108 = 118
Step 3: Drop the last digit (8) leaving counter = 11
Step 4: More to do… Yes, so back to Step 2
Step 2: counter = 11 + 202 = 213
Step 3: Drop the 3, leaving counter = 21
Step 4: Back to Step 2 again…!
Step 2: counter = 21 + 7 = 28
Step 3: counter = 28 (since only two digits)
Step 4: Go to Step 2 for the last time…
Step 2: counter = 28 + 22 = 40
Steps 3 and 4: Only two digits, leave at 40 and finish.
So we can say that myFunc produces, for my test file, a final value of 40.
Let’s change the test file now to:
208 202 7 22
(just one digit changed: 108 changes to 208)
Running the same steps, but written in a more compact form, we get:
10 + 208 = 218. Drop 8, leave 21.
21 + 202 = 223. Drop 3, leave 22.
22 + 7 = 229. Drop 9, leave 22.
22 + 22 = 44. Result: 44
So what do we see? A final result of 44. Which is different from the 40 which we got for the original file.
At the very simplest level, what does this tell us? Well, suppose I had not shown you the two input files. But I told you that the results, when passed though myFunc, were different, one being 40, the other one 44. What could you have told me with total certainty? You could say that, without any doubt, the two input files had been different. You could not say in what way they were different, but you could say that they differed in some way or other.
It’s kind of obvious, but if you imagine performing myFunc over and over again on one of these two files, could you ever get a different result than that which we got above? No. It’s just plain-old common-sense – if the same numbers go in, the same result comes out.
If I now told you that I performed myFunc on two files and got the same result, what could you then tell me with certainty? Could you tell me that they must therefore be the same file?
You can probably see this coming from a mile off, but let’s do it anyway. If we leave the first file the same, but now have a second file which, like our other “second file” only differs by one digit:
118 202 7 22
10 + 118 = 128, Drop the 8, leaving 12.
12 + 202 = 214. Drop the 4, leaving 21.
21 + 7 = 28. Drop nothing, leaving 28.
28 + 22 = 40. Drop nothing, result is 40.
So although the input file was different, the result produced is the same. 40. Different input file, same result from myFunc. So we could not have said that the input files must be the same. About as far as we can go is to say that they could have been the same. Which is not particularly useful it seems.
So what possible use is myFunc and the examples we performed with it? Well if you followed the simple steps and deductions we made above, then you understand a lot, in fact most, of what you need to know about MD5 Hashes and their associated uses and problems! The maths might be more complicated and lengthy with MD5, but the concepts are the same. Really, it’s that simple.
So back to MD5 – Just what is it then?
Many others have defined MD5 in formal terms better than I could. Pop on over to Wikipedia and scare yourself with the details of the maths, algorithm, pseudo-code and cryptic references to, er, cryptography. Then come back here and let’s see what it’s really used for.
(EDIT: 9 Nov 2005. I’ve just set up a page where you can repeat these hashes yourself, and try others too. You can click here and try out MD5 hashing!)
Let’s take another test file. This time I’m going to make it text (although remember that, underneath, that’s really just a string of bytes). My file contains:
Hello
Yup. One short, simple word. Written as numbers, for those who care, the contents are, in decimal:
72 101 108 108 111
Now instead of performing my myFunc on it, let’s instead pass it though the MD5 function itself (and let’s not worry how exactly we do that – we’ll just let the computer do it and trust the results…a later article will suggest methods to calculate MD5 hashes) What do we get?
8b1a9953c4611296a827abf8c47804d7
Yikes! That’s a lot longer than the file we fed in! Well that’s because the MD5 function always produces an output with the same length, irrespective of the input.
So now let’s edit the first file, which now contains:
Hallo
Note that we have changed one single letter (the ‘e’ becomes an ‘a’). What result do we now get?
d1bf93299de1b68e6d382c893bf1215f
That’s a whole lot different, no? Let’s make the change even smaller, and change the ‘e’ of “Hello” into an ‘f’, which is just a single character away, both alphabetically and numerically. So with a file containing:
Hfllo
we get an MD5 sum of:
dd930074b6bb892cfe5af7dfed0d15ca
Still very different, no?
So three extremely short and strikingly similar input files have produced three strikingly different output values from MD5. Now you can perhaps see one of the defining characteristics of the MD5 hash (and indeed hashes in general): for a trivially small change in the input they produce a marked change in output.
Returning to the question we looked at when evaluating the usefulness (or lack thereof) of myFunc: if I told you that I had three files which I performed MD5 hashes on, and the results were the three hexadecimal numbers you see above, what could you tell me about the contents of those files? As before, you could say that the files were not the same as each other.
Now imagine that two files produced the same MD5 hash, though. What then could you determine? Well, if you take on trust (or understand the maths!) that MD5 is a “good” hashing function, you could say with an extraordinary degree of confidence that the files were identical.
“Extraordinary degree of confidence”. Hmm. Why not “With total certainty”? Well that’s where things get interesting, and we’ll return to that problem in a short while.
And those Mozilla files?
So tying up loose ends, why did we start out with MD5 hashes for those Mozilla files we could download? What use are they? To verify that we’ve downloaded what we think we’ve downloaded.
If we download a copy of one of those quite large Mozilla installer files we might want to make sure that the download completed successfully. If I download one of these files along with the file containing the MD5 hash result (as supplied by the person making it available) then I can myself perform an MD5 hash upon the received file and see if my MD5 hash result is the same as the originator’s. If it is, then I can be comfortable in the knowledge that the file on my hard disk is almost certainly a good copy of the original, and no corruption has occurred.
As our example above with the “Hello” file showed, MD5 produces (it seems) a huge change in result for only a tiny change in input, so even if only one byte had been corrupted in one bit position in my downloaded file, the MD5 hash I calculate would be very different from the original one.
Now the more creative of you might, at this stage, be thinking “Ah! But how did he know he could trust the MD5 file he downloaded in the first place?” Fair question, but outside the scope of a discussion of simple hashing. We need to assume (and in practice generally can anyway) that the MD5 file posted is correct and that we can safe and reliably download it to our PCs over a secure and reliable channel. And no, we don’t want a hash file for the hash file, which in turn needs a hash file, and… for ever and ever! (Note: having said that, there is no technical reason why you cannot calculate the MD5 hash for the MD5SUMS file itself – it just starts getting a little silly, that’s all…)
Why no certainity?
Here we get to an issue which while actually very simple and obvious, is often overlooked by people. We said earlier that if two files hash to the same MD5 value, they are the same, Almost certainly. But not definitely.
And without further ado we come to the aspect of hashes that gets rather interesting! Can two different files hash, using MD5, to the same value? The simple one-word answer is: Yes.
How often do they do this? Well, in “the real world” the answer is “practically never”. In fact it’s so close to “never” that you really don’t normally need to worry about it. (If two different streams of data hash to the same result, we say that a collision has occurred.)
The result of MD5 is, always, a 128-bit number. That’s a big number indeed. The maximum value is a ‘3’ followed by 38 other numbers. Don’t believe me? Here it is:
3,402,823,669,209,384,634,633,746,074,300,000,000,000,000,000,000,000,000,000,000,000,000
Now that is a very big number. If we assume, for now, that the results of MD5 hashes are “spread out evenly” within this very large range, then we can see that the chances of two files you encounter in your day to day life having the same MD5 hash are incredibly low.
In betting terms, we’re saying that the chances of two randomly selected files hashing to the same value are 1 in 3,402,…etc etc.etc…000.
Kinda low.
End of part 1 – What’s in part2?
So we’ve seen what hashing functions are in general, and looked at MD5 in particular. We’ve seen how it produces very different outputs for small changes of input. We’ve also touched upon why MD5 might be useful: to verify that what we think we’ve got is, actually, what we think it is. And we’ve seen that while MD5 hashes can collide, the likelihood of this occurring is mindbogglingly small.
So what more to say in Part 2? The main topic there will be the problem of those collisions. Although we have already seen how very unlikely they are to occur “naturally”, what if nature is given a helping hand and we the unscrupulous can coerce collisions into occurring? And, if we can, why care?
Finally, in Part 2, we will also look at the general use of hashes such as MD5 in cryptography and dispel the belief that MD5 is itself an encryption technique.
Time permitting I will also put up a small interactive workshop where you can generate MD5 hashes yourself, and also a list of software to permit you to do the same on your PC for larger files.
(EDIT: 9 Nov 2005. I’ve just set up a page where you can repeat these hashes yourself, and try others too. You can click here and try out MD5 hashing!)
Comments and questions most welcome.
Recent Comments