Date: Fri, 30 Jun 95 15:49:32 EDT From: hugo@watson.ibm.com To: IPSEC@ans.net, iesg@CNRI.Reston.VA.US, ietf@CNRI.Reston.VA.US Subject: Rationale behind draft-krawczyk-keyed-md5-00.txt I am appending here a high-level and informal presentation of the rationale and analysis of keyed-MD5 that makes the basis for the choice of the particular mode of keyed-MD5 (for message authentication) presented in draft-krawczyk-keyed-md5-00.txt. I assume the reader is familiar with the iterative structure of MD5 but not necessarily with the details of the compression function. All the information that follows is based on the upcoming paper [BCK] by Bellare, Canetti and Krawczyk (as referenced in the above draft). Hugo ANALYSIS OF KEYED MD5 (sketch) ****************************** The basic issue with analyzing keyed-MD5 as a message authentication function is that it builds on MD5, a function that was originally designed for a different goal, namely, as a *keyless* collision-resistant hash function. What follows is a high-level and informal description of the analysis in [BCK] which is intended to identify the cryptographic requirements from a function as MD5 (more precisely, its compression function) that would guarantee that keyed-MD5 be a good MAC. The specific mode of keyed-MD5 presented in draft-krawczyk-keyed-md5-00.txt is an adaptation of the results of this analysis. KEYED COMPRESSION FUNCTIONS AND PSEUDORANDOM FUNCTIONS: Our approach for the analysis of keyed-MD5 (or keying any other iterative hash function) is to look at the compression function as a function *keyed via its IV* (the initial registers of MD5). We consider the set of the resultant (compression) functions (one function for each 128 bit key) as a "family of pseudorandom functions". This is a well-known notion in cryptography which captures the "proximity" of such functions to a set of truly random functions, and generalizes the more common concept of pseudorandom generators. The notion of security of a pseudorandom function (more precisely, a family of such functions) can be related to the famous Turing test for "intelligent computers". In Turing's test a human asks questions and gets responses; then this human has to decide if he was talking to another human being or to a machine. A computer that passes such a test, i.e., is not distinguished from a human being, is proven "intelligent". With pseudorandom functions the game is similar. But the one that asks the questions (inputs to the function) is the adversary and the responses may come from a truly random function or from a pseudorandom function (whose key is unknown to the adversary). The adversary wins if he decided correctly (with probability better than half!) on whether the function used was truly random or pseudorandom. The adversary could distinguish the pseudorandom function from random by recovering the secret key that defines the function, or just by being able to predict its output, etc. The parameters that define the adversary success are the resources it uses: time and number of queries, and the probability (beyond half) with which it successfully distinguishes. This probability is denoted by eps (this eps is a function of the adversary resources, the family of pseudorandom functions in use, the length of key, etc). An example of a (conjectured) good family of pseudorandom functions (or permutations) is DES. In this example each DES key determines a function in the family. If you know the key then you can compute the function in any point, but if not you can hardly predict a single bit of output. DES in MAC-CBC mode is also an example of pseudorandom functions (and the reason why it makes a good MAC). MAC FUNCTION: In general, a good pseudorandom function (i.e., one with small eps for reasonable resources by the adversary) makes a good MAC. A MAC (for message authentication code) is a function that uses a secret key and computes tags or checksums on information that only parties possessing the secret key can generate. The game for the adversary (that doesn't know the key) is that he can request the result of the checksum computed on a sequence of messages of his choice. At the end, after asking enough queries, the adversary needs to come with a message that he didn't ask before, but for which he can generate the checksum by himself. (This represents a "chosen message attack", and the new message for which the adversary can generate the checksum is a "forged message"). The more time and/or queries an adversary needs to reach a certain probability of forging a message the better the MAC function is. The goal of the design of keyed-MD5 is to come with a scheme that makes a good MAC. FROM PSEUDORANDOM COMPRESSION FUNCTION TO SECURE MAC: What we [BCK] essentially prove is that *if the keyed compression function* of MD5 makes a good pseudorandom function, then its iterations preserve this condition, and then the whole keyed-MD5 function makes a good MAC. Moreover, we give a precise reduction of the form: if you give me a forgery attack on keyed-MD5 we construct from it an explicit distinguishing attack on the basic compression function, where the success parameters (eps, etc) of both attacks are precisely related in our analysis. This approach is a formalization of the intuitive reasoning that assumes MD5 and similar functions to "behave as random functions". The pure intuitive approach, if not taken carefully, can be very misleading. In particular, one can prove (with different levels of sophistication) that the MD5 function does not behave as a random function (e.g., through extension attacks, statistical collisions, etc). However, in our formal analysis we prove that if the *keyed* compression function is close enough to random then the iterations are not far from random. This methodology of relating the security of the iterated function to that of the compression function is analogous to the traditional approach to constructing collision-resistant hash functions. For the later, the resistance of the compression function to collision search guarantees that property for the iterated function. In our case, it is the pseudorandom property that is propagated through the iterations, to provide a secure MAC. THE PSEUDORANDOMNESS ASSUMPTION: To this date we do not know of significant weaknesses of MD5's keyed compression function as a pseudorandom function. Still this property is a conjecture and not something that can be expected to be proven (soon). It is important to remark that it is the same property that is implicitly assumed by everybody that uses these functions to generate (pseudo) random keys (this is done "everywhere" these days: in Photuris, SSL, MKMP, the Preneel-van OOrschot paper [PV], just to mention some places that come to my mind). Interestingly, in the case of SHA the pseudorandomness of the function is claimed by the designers who recommend (in the DSS standard) the use of SHA for pseudorandom key generation. (In other words, if you consider your keys generated using MD5 as "random" then you can safely assume keyed-MD5 is a good MAC.) On the other hand, we do not prove the *necessity* of the pseudorandomness condition on the compression function in order for keyed-MD5 to be a secure MAC. Thus, it may be the case that even if the compression function is not pseudorandom still keyed-MD5 is a good MAC. Indeed, we have a more subtle analysis of some of the modes of keyed-MD5 in which pseudorandomness can be traded by different assumptions on the compression function. (I won't get into these details here). MAC AND COLLISION-RESISTANCE: It is important to remark that our approach reflects a significant distinction between keyed and keyless hash functions, namely, that the *traditional* security requirement of collision-resistance for keyless hash functions does not apply to a (keyed) MAC. Namely, the feasibility of finding (traditional) collisions in a given function is *not* (by itself) a sign that the function does not make a good MAC. The reason being that collision search (for MD5 and similar functions) concentrates on finding collision when the IV is *known* (or even chosen). In keyed-MD5 the IV is the key and then *random and secret*. A good search engine designed for known IV's (e.g. Van Oorschot-Wiener) may be useless to attack keyed-MD5. A good example for the independence between the collision resistance aspect and being a good MAC is (again) DES. If you know the key for DES-CBC-MAC you can find as many collisions as you want (using the invertibilty of DES with known key), but if you do not know the key, then collisions are hard to find (except for the fact that 64 bits of output allows for attacks in the order of 2^32 known/chosen message attacks). In other words, even if you hear tomorrow of (real) collisions in your favorite hash function it does not mean the function becomes insecure as a keyed function. (Though, that could be the case depending on the effect of that analysis on the pseudorandomness of the compression function). On the other hand, if collisions can be found (more than statistically expected) even if the IV is secret, that will have a significant negative effect on the security of the MAC. However, this kind of collisions are very different than the traditional ones. In particular, the adversary will need the "assistance" of the legal user to collect MAC results on known and/or chosen messages; this is in sharp contrast to known-IV collisions that can be searched by the adversary off-line and independently of any user or secret key. THE PROPOSED MODE OF KEYED-MD5: Now back to the keyed-MD5 function as described in my draft, namely, MD5(K,pad,text,K). As you can see this mode does not directly use a keyed-IV. Indeed, in order *not to change* the MD5 code, we are applying regular MD5 (with its regular "magic IV") to the actual key (which is padded to complete a full block), and the result of this first application becomes the "random secret IV" for the rest of the function. The appended key has two roles: to prevent extension attacks, and to serve as yet another (implicit) transformation on the output of MD5. Notice that the appended key is not padded (see below for its rationale). On the other hand, the padding of the prepended key is done due to *security reasons*. One is to achieve a transformation of the key into a pseudorandom IV (as explained above), the other is to avoid, in the case of short messages having a single iteration of MD5 (Kaliski and Robshaw pointed out that a single iteration may be more vulnerable to linear cryptanalysis - though, no such weakness is known today). The use of keys of length less that 128 bits is not recommended. Keys longer than 128 bits may not add any significant security given that the actual strength is given by the resultant IV which is 128 bit long. As said in an note I sent a few days ago, this particular choice of keyed-MD5 is intended to satisfy the following properties: * it is based on MD5 (or similar functions) * no change to the MD5 code required * no degradation of MD5 speed * simple key requirements Departing from these conditions, other constructions with better analytical properties can be suggested . For example, directly keying the MD5 function via its initial registers as proposed in [BCK] (and [PV]) (this requires less assumptions on the compression function of MD5 at the cost of a minimal change in MD5 code). However, the proposed scheme does not suffer of any practical weakness known today. On the other hand, if the above properties are to be relaxed then other designs which may prove more robust to future attacks are possible. This includes functions that may be completely unrelated to MD5 or similar hash functions; indeed such an approach may prove more secure and/or more efficient than basing the MAC on iterated hash functions. THE BIRTHDAY ATTACK: The best attack on keyed-MD5 that we know is a 2^64 chosen/known message attack that we sketch next. For simplicity, consider the application of keyed-MD5 (denoted KMD5) on messages of the form AB where A is a 512-bit block and B a 320-bit block. KMD5(AB) results on three iterations of the compression function of MD5 on the information (K, pad, A , B, K, length), where K is the 128-bit key, pad is '1' followed by 383 0's, and length is the 64 bit encoding of the total length (i.e, 960) as added by MD5. An attacker fixes a block B, and asks for the KMD5 of (Ai,B), for i=1,2,...,2^64, where the Ai's are 2^64 (arbitrary and different) blocks of 512 bits each. With good probability (by the birthday paradox), there exist i<>j with KMD5(Ai,B)=KMD5(Aj,B). In particular,with good probability it happens in this case that MD5(K,Ai)=MD5(K,Aj) (this is the result of the first iteration of MD5, i.e. no length appended). The adversary then asks for KMD5(Ai,C) for any value of C<>B. It then "guesses" the value of KMD5(Aj,C) to be the same as KMD5(Ai,C). If, indeed, it happened that (K,Ai) collided with (K,Aj) after the first iteration of MD5 then the adversary is right. That is, with high probability the adversary correctly forged KMD5(Aj,C). (Clearly, improvements for the adversary are possible such as testing whether the latter collision happened by using some other value of C, but the above is enough for a forgery with good probability). Strictly speaking the above attack requires 2^64 *chosen* messages since it requires keeping the last block unchanged. The blocks Ai, however, can be arbitrary. They need to be known to the adversary (at least those that generate collisions) but not to be chosen by him. In some scenarios a common trailing block may be available as part of standard encoding of information in which case the attack may become a known-only attack. (This is a reason not to pad the appended key in keyed-MD5 to a full block and for keeping the appended length of MD5). In addition, the attack works with messages of any length (even variable length) and the more common trailing blocks in the messages being queried the better the success probability of the attack. However, these attacks require, in any case, a huge amount of information (2^64 blocks) being MAC-ed with the *same key*. To be avoided one just needs to change keys before processing that much information. We note that the same type of attack applies to all proposed modes for keying MD5, and more generally it applies to all iterative constructions of MAC (including CBC-MAC). As said, it is the best attack we know on keyed-MD5. In [BCK] we show that this attack is optimal if the compression function is a truly random function. However, for functions like keyed-MD5 this may very well not be the case. It is important to note the essential difference between the above attack and the traditional birthday attacks on collision-resistant hash functions as MD5. In the latter, the processing time for the attack is also 2^64 but is independent of any secret information and requires no interaction with the legitimate party. Thus, it is far more realistic than against keyed-MD5. The birthday attack on iterative MAC described above was independently discovered by [BCK] and by Preneel and Van Oorschot. A detailed description of the attack can be found in the upcoming Crypto'95 paper by the later authors [PV]. REFERENCES [BCK] M. Bellare, R. Canetti, and H. Krawczyk, "Keying MD5 -- Message Authentication via Iterated Pseudo-randomness", manuscript. [PV] B. Preneel and P. Van Oorschot, "Building fast MACs from hash functions", to appear Crypto'95.