Date: Fri, 30 Jun 95 15:49:32 EDT
From: hugo@watson.ibm.com
To: IPSEC@ans.net, iesg@CNRI.Reston.VA.US, ietf@CNRI.Reston.VA.US
Subject: Rationale behind draft-krawczyk-keyed-md5-00.txt

I am appending here a high-level and informal presentation of the rationale
and analysis of keyed-MD5 that makes the basis for the choice of the
particular mode of keyed-MD5 (for message authentication) presented in
draft-krawczyk-keyed-md5-00.txt.

I assume the reader is familiar with the iterative structure of MD5 but not
necessarily with the details of the compression function.

All the information that follows is based on the upcoming paper [BCK]
by Bellare, Canetti and Krawczyk (as referenced in the above draft).

Hugo


ANALYSIS OF KEYED MD5 (sketch)
******************************

The basic issue with analyzing keyed-MD5 as a message authentication
function is that it builds on MD5, a function that was originally designed
for a different goal, namely, as a *keyless* collision-resistant hash function.
What follows is a high-level and informal description of the analysis in
[BCK] which is intended to identify the cryptographic requirements from a
function as MD5 (more precisely, its compression function) that would
guarantee that keyed-MD5 be a good MAC. The specific mode of keyed-MD5
presented in draft-krawczyk-keyed-md5-00.txt is an adaptation of the results
of this analysis.

KEYED COMPRESSION FUNCTIONS AND PSEUDORANDOM FUNCTIONS: Our approach
for the analysis of keyed-MD5 (or keying any other iterative hash function)
is to look at the compression function as a function *keyed via its IV*
(the initial registers of MD5). We consider the set of the resultant
(compression) functions (one function for each 128 bit key) as a
"family of pseudorandom functions". This is a well-known notion in cryptography
which captures the "proximity" of such functions to a set of truly random
functions, and generalizes the more common concept of pseudorandom generators.

The notion of security of a pseudorandom function (more precisely, a
family of such functions) can be related to the famous Turing test for
"intelligent computers". In Turing's test a human asks questions and
gets responses; then this human has to decide if he was talking to another
human being or to a machine.  A computer that passes such a test, i.e.,
is not distinguished from a human being, is proven "intelligent".
With pseudorandom functions the game is similar. But the one that asks
the questions (inputs to the function) is the adversary and the responses
may come from a truly random function or from a pseudorandom function
(whose key is unknown to the adversary). The adversary wins if he decided
correctly (with probability better than half!) on whether the function
used was truly random or pseudorandom.

The adversary could distinguish  the pseudorandom function from random
by recovering the secret key that defines the function, or just by being
able to predict its output, etc. The parameters that define the adversary
success are the resources it uses: time and number of queries, and the
probability (beyond half) with which it successfully distinguishes.
This probability is denoted by eps (this eps is a function of the adversary
resources, the family of pseudorandom functions in use, the length of key, etc).

An example of a (conjectured) good family of pseudorandom functions (or
permutations) is DES.  In this example  each DES key determines a function
in the family. If you know the key then you can compute the function in any
point, but if not you can hardly predict a single bit of output.
DES in MAC-CBC mode is also an example of pseudorandom functions (and
the reason why it makes a good MAC).

MAC FUNCTION: In general, a good pseudorandom function (i.e., one with
small eps for reasonable resources by the adversary) makes a good MAC.
A MAC (for message authentication code) is a function that uses a secret
key and computes tags or checksums on information that only parties possessing
the secret key can generate.  The game for the adversary (that doesn't
know the key) is that he can request the result of the checksum computed
on a sequence of messages of his choice. At the end, after asking enough
queries, the adversary needs to come with a message that he didn't ask
before, but for which he can generate the checksum by himself.
(This represents a "chosen message attack", and the new message for which
the adversary can generate the checksum is a "forged message").
The more time and/or queries an adversary needs to reach a certain probability
of forging a message the better the MAC function is.

The goal of the design of keyed-MD5 is to come with a scheme that makes a
good MAC.

FROM PSEUDORANDOM COMPRESSION FUNCTION TO SECURE MAC: What we [BCK]
essentially prove is that *if the keyed compression function* of MD5
makes a good pseudorandom function, then its iterations preserve this
condition, and then the whole keyed-MD5 function makes a good MAC.
Moreover, we give a precise reduction of the form: if you give me a forgery
attack on keyed-MD5 we construct from it an explicit distinguishing attack
on the basic compression function, where the success parameters
(eps, etc) of both attacks are precisely related in our analysis.

This approach is a formalization of the intuitive reasoning that assumes
MD5 and similar functions to "behave as random functions". The pure intuitive
approach, if not taken carefully, can be very misleading.  In particular,
one can prove (with different levels of sophistication) that the MD5
function does not behave as a random function (e.g., through extension
attacks, statistical collisions, etc). However, in our formal analysis
we prove that if the *keyed* compression function is close enough to
random then the iterations are not far from random.

This methodology of relating the security of the iterated function to that
of the compression function is analogous to the traditional approach to
constructing collision-resistant hash functions. For the later, the
resistance of the compression function to collision search guarantees that
property for the iterated function. In our case, it is the pseudorandom
property that is propagated through the iterations, to provide a secure MAC.

THE PSEUDORANDOMNESS ASSUMPTION: To this date we do not know of significant
weaknesses of MD5's keyed compression function as a pseudorandom function.
Still this property is a conjecture and not something that can be expected
to be proven (soon).  It is important to remark that it is the same property
that is implicitly assumed by everybody that uses these functions to
generate (pseudo) random keys (this is done "everywhere" these days:
in Photuris, SSL, MKMP, the Preneel-van OOrschot paper [PV], just to
mention some places that come to my mind). Interestingly, in the case of SHA
the pseudorandomness of the function is claimed by the designers who
recommend (in the DSS standard) the use of SHA for pseudorandom key generation.
(In other words, if you consider your keys generated using MD5 as "random"
then you can safely assume keyed-MD5 is a good MAC.)

On the other hand, we do not prove the *necessity* of the pseudorandomness
condition on the compression function in order for keyed-MD5 to be a secure
MAC. Thus, it may be the case that even if the compression function is not
pseudorandom still keyed-MD5 is a good MAC.
Indeed, we have a more subtle analysis of some of the modes of keyed-MD5
in which pseudorandomness can be traded by different assumptions on the
compression function. (I won't get into these details here).

MAC AND COLLISION-RESISTANCE: It is important to remark that our approach
reflects a significant distinction between keyed and keyless hash functions,
namely, that the *traditional* security requirement of collision-resistance
for keyless hash functions does not apply to  a (keyed) MAC.
Namely, the feasibility of finding (traditional) collisions in a given
function is *not* (by itself) a sign that the function does not make
a good MAC. The reason being that collision search (for MD5 and similar
functions) concentrates on finding collision when the IV is *known* (or
even chosen). In keyed-MD5 the IV is the key and then *random and secret*.
A good search engine designed for known IV's (e.g. Van Oorschot-Wiener)
may be useless to attack keyed-MD5.

A good example for the independence between the collision resistance
aspect and being a good MAC is (again) DES. If you know the key for
DES-CBC-MAC you can find as many collisions as you want
(using the invertibilty of DES with known key), but if you do not know the
key, then collisions are hard to find (except for the fact that 64 bits of
output allows for attacks in the order of 2^32 known/chosen message attacks).

In other words, even if you hear tomorrow of (real) collisions in your
favorite hash function it does not mean the function becomes insecure
as a keyed function. (Though, that could be the case depending on the
effect of that analysis on the pseudorandomness of the compression function).
On the other hand, if collisions can be found (more than statistically
expected) even if the IV is secret, that will have a significant negative
effect on the security of the MAC. However, this kind of collisions are
very different than the traditional ones. In particular, the adversary will
need the "assistance" of the legal user to collect MAC results on known
and/or chosen messages; this is in sharp contrast to known-IV collisions
that can be searched by the adversary off-line and  independently of
any user or secret key.

THE PROPOSED MODE OF KEYED-MD5: Now back to the keyed-MD5 function as
described in my draft, namely,  MD5(K,pad,text,K).
As you can see this mode does not directly use a keyed-IV. Indeed, in
order *not to change* the MD5 code, we are applying regular MD5 (with
its regular "magic IV") to the actual key (which is padded to complete
a full block), and the result of this first application becomes the "random
secret IV" for the rest of the function.  The appended key has two roles:
to prevent extension attacks, and to serve as yet another (implicit)
transformation on the output of MD5.  Notice that the appended key is
not padded (see below for its rationale).

On the other hand, the padding of the prepended key is done due to
*security reasons*. One is to achieve a transformation of the key into
a pseudorandom IV (as explained above), the other is to avoid, in the
case of short messages having a single iteration of MD5 (Kaliski and
Robshaw pointed out that a single iteration may be more vulnerable
to linear cryptanalysis - though, no such weakness is known today).

The use of keys of length less that 128 bits is not recommended.
Keys longer than 128 bits may not add any significant security given that
the actual strength is given by the resultant IV which is 128 bit long.

As said in an note I sent a few days ago, this particular choice of
keyed-MD5 is intended to satisfy the following properties:

* it is based on MD5 (or similar functions)
* no change to the MD5 code required
* no degradation of MD5 speed
* simple key requirements

Departing from these conditions, other constructions with better analytical
properties can be suggested . For example, directly keying the MD5 function
via its initial registers as proposed in [BCK] (and [PV])
(this requires less assumptions on the compression function of MD5
at the cost of a minimal change in MD5 code).

However, the proposed scheme does not suffer of any practical weakness
known today. On the other hand, if the above properties are to be relaxed
then other designs which may prove more robust to future attacks are
possible. This includes functions that may be completely unrelated to MD5 or
similar hash functions; indeed such an approach may prove more secure and/or
more efficient than basing the MAC on iterated hash functions.


THE BIRTHDAY ATTACK: The best attack on keyed-MD5 that we know is a 2^64
chosen/known message attack that we sketch next.
For simplicity, consider the application of keyed-MD5 (denoted KMD5)
on messages of the form AB where A is a 512-bit block and B a 320-bit block.
KMD5(AB) results on three iterations of the compression function of MD5
on the information (K, pad, A , B, K, length), where K is the 128-bit
key, pad is '1' followed by 383 0's, and length is the 64 bit encoding
of the total length (i.e, 960) as added by MD5.

An attacker fixes a block B, and asks for the KMD5 of (Ai,B),
for i=1,2,...,2^64, where the Ai's are 2^64 (arbitrary and different)
blocks of 512 bits each.

With good probability (by the birthday paradox), there exist i<>j with
KMD5(Ai,B)=KMD5(Aj,B). In particular,with good probability it happens
in this case that MD5(K,Ai)=MD5(K,Aj) (this is the result of the first
iteration of MD5, i.e. no length appended). The adversary then asks for
KMD5(Ai,C) for any value of C<>B. It then "guesses" the value of  KMD5(Aj,C)
to be the same as KMD5(Ai,C). If, indeed, it happened that (K,Ai) collided
with (K,Aj) after the first iteration of MD5 then the adversary is right.
That is, with high probability the adversary correctly forged KMD5(Aj,C).
(Clearly, improvements for the adversary are possible such as testing
whether the latter collision happened by using some other value
of C, but the above is enough for a forgery with good probability).

Strictly speaking the above attack requires 2^64 *chosen* messages since
it requires keeping the last block unchanged. The blocks Ai, however, can be
arbitrary. They need to be known to the adversary (at least those that
generate collisions) but not to be chosen by him. In some scenarios a common
trailing block may be available as part of standard encoding of information
in which case the attack may become a known-only attack.
(This is a reason not to pad the appended key in keyed-MD5 to a full block
and for keeping the appended length of MD5).

In addition, the attack works with messages of any length (even variable
length) and the more common trailing blocks in the messages being queried
the better the success probability of the attack.
However, these attacks require, in any case, a huge amount of information
(2^64 blocks) being MAC-ed with the *same key*. To be avoided one just
needs to change keys before processing that much information.
We note that the same type of attack applies to all proposed modes for
keying MD5, and more generally it applies to all iterative constructions
of MAC (including CBC-MAC).

As said, it is the best attack we know on keyed-MD5. In [BCK] we show that
this attack is optimal if the compression function is a truly random
function. However, for functions like keyed-MD5 this may very well not be
the case.

It is important to note the essential difference between the above
attack and the traditional birthday attacks on collision-resistant hash
functions as MD5. In the latter, the processing time for the attack is
also 2^64 but is independent of any secret information and requires no
interaction with the legitimate party. Thus, it is far more realistic than
against keyed-MD5.

The birthday attack on iterative MAC described above was independently
discovered by [BCK] and by Preneel and Van Oorschot. A detailed description
of the attack can be found in the upcoming Crypto'95 paper by the later
authors [PV].

REFERENCES

[BCK] M. Bellare, R. Canetti, and H. Krawczyk, "Keying MD5 -- Message
      Authentication via Iterated Pseudo-randomness", manuscript.

[PV]  B. Preneel and P. Van Oorschot, "Building fast MACs from hash
      functions", to appear Crypto'95.