Eco514 - Game Theory
Lecture 13: Repeated Games (2)
Marciano Siniscalchi
October 28, 1999

Introduction

[Again, by and large, I will follow OR, Chap. 8, so I will keep these notes to a minimum.]

Review of key definitions

Recall our three payoff aggregation criteria:

discounting: $(u_i^t)_{t \geq 1} \succ_i (w_i^t)_{t \geq 1} \iff \sum_{t \geq 1} \delta^{t-1}(u_i^t - w_i^t) > 0$ (also recall that the payoff profile corresponding to a stream $(u^t)$ is taken to be $(1-\delta)\sum_{t \geq 1} \delta^{t-1} u(a^t)$);

limit of means: $(u_i^t)_{t \geq 1} \succ_i (w_i^t)_{t \geq 1} \iff \liminf_{T \to \infty} \sum_{t=1}^T \frac{u_i^t - w_i^t}{T} > 0$;

and overtaking: $(u_i^t)_{t \geq 1} \succ_i (w_i^t)_{t \geq 1} \iff \liminf_{T \to \infty} \sum_{t=1}^T (u_i^t - w_i^t) > 0$.

Also recall the definition of machines:

Definition 1 Fix a normal-form game G. A machine for Player $i \in N$ is a tuple $M_i = (Q_i, q_i^0, f_i, \tau_i)$, where: (i) $Q_i$ is a finite set (whose elements should be thought of as labels); (ii) $q_i^0 \in Q_i$ is the initial state of the machine; (iii) $f_i : Q_i \to A_i$ is the action function: it specifies what Player i does at each state; and (iv) $\tau_i : Q_i \times A \to Q_i$ is the transition function: if the action profile $a \in A$ is played and Player i's machine state is $q_i \in Q_i$, then at the next stage Player i's machine state will be $\tau_i(q_i, a)$.

Note: OR actually allows for arbitrary state spaces. For most proofs, finite state spaces are enough. There is one exception, which I will note below.

Definition 2 Fix a normal-form game G. The minmax payoff for player i, $v_i$, is defined by $v_i = \min_{a_{-i} \in A_{-i}} \max_{a_i \in A_i} u_i(a_i, a_{-i})$.

We are not going to worry about implementing mixtures of action profiles in this lecture, so I just remind you of the definition of enforceability:

Definition 3 Fix a normal-form game G. A payoff profile $u \in \mathbb{R}^N$ is enforceable (resp. strictly enforceable) if $u_i \geq v_i$ (resp. $u_i > v_i$) for all $i \in N$.

Perfect Folk Theorems

The strategies in the proof of the Nash folk theorem (say, with discounting) call for indefinite punishment following a deviation from the equilibrium. This is fine in games such as the Prisoner's Dilemma, in which the minmax actions are actually an equilibrium (indeed, in PD, they are the only equilibrium!).
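As an aside, the minmax values of Definition 2 are easy to compute by brute force over pure action profiles. Here is a minimal Python sketch; the Prisoner's Dilemma payoff numbers below are one standard parametrization, chosen for illustration rather than taken from the text:

```python
def minmax(u, A1, A2):
    """Pure-action minmax values (Definition 2) for a two-player game.
    u[(a1, a2)] = (payoff to player 1, payoff to player 2)."""
    # v1: player 2 chooses a2 to minimize player 1's best-response payoff
    v1 = min(max(u[(a1, a2)][0] for a1 in A1) for a2 in A2)
    # v2: player 1 chooses a1 to minimize player 2's best-response payoff
    v2 = min(max(u[(a1, a2)][1] for a2 in A2) for a1 in A1)
    return v1, v2

# Prisoner's Dilemma with assumed payoffs (C = cooperate, D = defect)
A = ["C", "D"]
u = {("C", "C"): (3, 3), ("C", "D"): (0, 4),
     ("D", "C"): (4, 0), ("D", "D"): (1, 1)}
print(minmax(u, A, A))  # -> (1, 1), the mutual-defection payoffs
```

Consistent with the remark above, in the PD each player's minmax value coincides with her payoff in the unique equilibrium (D,D).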
In such games, the threat of indefinite punishment is itself credible.

       A      D
A     2,3    1,6
D     0,1    0,1

Figure 1: Indefinite punishment is not credible

However, consider the game in Figure 1. The Row player can hold the Column player down to his minimum payoff by choosing D, but D is strictly dominated for her. Hence, while (2,3) can be enforced as a Nash equilibrium payoff profile in the infinitely repeated version of the game by assuming that players switch to (D,D) following a deviation, this would not work if we required subgame perfection, regardless of the payoff aggregation criterion.

A warm-up exercise: Perfect Folk Theorems for Limit-of-Means

Thus, we must be smarter than that. A key intuition is that, after all, a deviator need not be punished forever, but only long enough to wipe out any gains from his deviation. Formally, fix a game $G = (N, (A_i, u_i)_{i \in N})$ and let $M = \max_{i \in N, a \in A} u_i(a)$. Fix an outcome $a^*$ of G; clearly, a one-time deviation by Player i cannot yield more than $M - u_i(a^*)$. Thus, if Player i deviates, we need only punish him so as to wipe out this payoff differential; clearly, how long this takes depends on the payoff aggregation criterion.

Intuitively, limit-of-means should make things very simple: following a deviation, punishers face only a finite number of periods in which they must forgo their equilibrium payoff stream in order to punish Player i. Under limit-of-means aggregation, finitely many periods do not matter, so punishers are indifferent between punishing and not punishing.

There is only one subtlety: of course, the same argument applies to Player i herself: no one-time deviation can be profitable for her! So, what exactly are we trying to deter? The answer is that, although no finite deviation from the infinite repetition of $a^*$ is profitable for Player i, the following strategy could be. Suppose that, in the putative equilibrium we wish to construct, a deviation is followed by L rounds of punishment (you can think of L = 1 if you wish).
Thus, if Player i deviates once, she gets an extra payoff of (at most) $M - u_i(a^*)$, but then loses $u_i(a^*) - v_i$ utils in each of the subsequent L periods. Now suppose L is small, so that $M - u_i(a^*) > L[u_i(a^*) - v_i]$. For example, in the game of Figure 1, suppose that i is the Column player, that $a^* = (A,A)$, and that L = 1. Then a deviation yields 3 utils, whereas one round of punishment costs only 2 utils. Then, Player i can adopt the following strategy: deviate, then play a best response to $p_{-i}$ (in Figure 1, play D) for L periods; then, as soon as play is supposed to return to $a^*$, deviate and best-respond to the minmax profile again, and so on. This is a profitable deviation. [Observe that it is also a neat example of a game in which the one-deviation property fails!]

Thus, we must choose L large enough so that

$\forall i \in N: \quad M - u_i(a^*) < L[u_i(a^*) - v_i].$

In Figure 1, it is enough to choose L = 2.

To complete the argument, we must specify what happens if more than one player deviates, or if somebody deviates from the punishment stage. As in the proof of the Nash Folk Theorem, multiple simultaneous deviations lead to the lowest-index deviator being punished (they will not occur in equilibrium anyway, so players cannot hope to get away with deviating because somebody else will deviate, too).

Finally, if one or more punishers fail to punish, we simply disregard these further (unprofitable) deviations: again, in equilibrium they are not going to occur, so players cannot count on them to improve their predicament after a first deviation. This proves:

Proposition 0.1 [OR, Proposition 146.2] Fix a game G. Any feasible, strictly enforceable payoff profile of G is a subgame-perfect equilibrium payoff profile of the limit-of-means infinitely repeated version of G.

Machines

You will undoubtedly notice that describing these strategies verbally is awkward; doing so formally (as we tried to do in the proof of the Nash Folk theorem) is even worse. Machines can help.
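For concreteness, here is a minimal Python sketch of such a machine for the game of Figure 1, with L = 2; the state labels and encoding are my own choices, and the punishment profiles hard-coded below (Row minmaxes Column with D while Column best-replies D; Column minmaxes Row with D while Row best-replies A) are derived from the Figure 1 payoffs:

```python
L = 2  # punishment length, enough to deter deviations in Figure 1

def f(state):
    """Action profile prescribed at each machine state."""
    if state == "N":
        return ("A", "A")  # the normal state: play the target outcome a*
    j = state[1]           # state ("P", j, t): punish player j, t rounds left
    # j == 1 (Column): Row plays D, Column best-replies D -> ("D", "D")
    # j == 0 (Row):    Column plays D, Row best-replies A -> ("A", "D")
    return ("D", "D") if j == 1 else ("A", "D")

def tau(state, played):
    """Common transition function: start, count down, and end punishments."""
    if state == "N":
        deviators = [i for i in (0, 1) if played[i] != f("N")[i]]
        return ("P", min(deviators), L) if deviators else "N"
    j, t = state[1], state[2]
    return ("P", j, t - 1) if t != 0 else "N"

# Simulate: Column deviates from a* once, then everyone follows the machine.
state, history = "N", []
for period in range(6):
    profile = f(state)
    if period == 0:
        profile = (profile[0], "D")  # Column's one-shot deviation
    history.append(profile)
    state = tau(state, profile)
print(history)
```

Running this shows the deviation triggering the punishment states and play returning to (A,A) afterwards, exactly as the verbal description below prescribes.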
For instance, here is a set of machines that players can use to implement the strategies in the proof of Proposition 0.1: Player i uses machine $M_i = (Q, q^0, f_i, \tau)$ (where $Q$, $q^0$ and $\tau$ are common to all players) defined as follows.

First, $Q = \{N\} \cup \{P(j,t) : j \in N,\ t \in \{0, \ldots, L\}\}$. N is the normal state, in which $a^*$ is played. P(j,t) is the state in which j is punished, with t more rounds of punishment required. Second, $q^0 = N$. Third, $f_i(N) = a_i^*$, $f_i(P(j,t)) = p_{-j,i}$ if $j \neq i$, and $f_j(P(j,t)) = r_j(p_{-j})$. This should be obvious. Finally, $\tau(\cdot, \cdot)$ is such that we remain in N if nobody deviates, we switch from N to P(j,L) if j is the lowest-index deviator, we always move from P(j,t) to P(j,t-1) if $t \neq 0$, and we always move from P(j,0) back to N. Easy!

Discounting

The strategy profile thus constructed is not a subgame-perfect equilibrium of the game in Figure 1 if payoffs are aggregated via discounting, for any discount factor. Suppose the Column player deviates: then the Row player is supposed to choose D for 2 periods, and hence receive a flow payoff of 0 utils; she will then receive 2 forever after. However, if she deviates to A, nothing happens: the Column player continues to choose D, so again the Row player can choose A. Obviously, she prefers (1,1,2,2,...) to (0,0,2,2,...)!

Thus, the key intuition is that we must somehow ensure that punishers are willing to punish. In principle, one could think of punishing punishers, but this may fail to work with discounting: essentially, second deviations might require longer punishment periods (because the burden of carrying out the first punishment lasts for L periods, and not just one), third deviations might require even longer punishments, and so on. This is certainly the case in the game of Figure 1.

The alternative is to reward punishers. This leads to OR's Proposition 151.1. I am only going to offer a few comments on the proof.

The "conciliation" states C(j) serve to reward punishers, in a way. However, this is subtle: we never go back to the Nash state C(0).
What happens is, if j deviates and i punishes him, then after the punishment we move to C(j), which i prefers to C(i). Otherwise, we go to C(i) and stay there until somebody else deviates. [Let me repeat this, because I first got it wrong in class (but then you guys caught it!): whenever we are in state C(j), we remain there if nobody deviates, and we move to the state P(k,L), in which Player k's punishment begins, if k deviates from a(j). Thus, what supports a(j) as a continuation equilibrium after a deviation by Player j is the threat of further punishment; a(j) need not be a Nash equilibrium per se.] In my opinion, this is also a punishment of sorts, but of course you are welcome to differ.

Also: the remark about the first condition on $\delta$ being sufficient has to do with the fact that, after L periods of punishment, we move to C(j); since $u_i(a(j)) < u_i(a(0))$ by assumption, this is actually a further punishment (which, er, actually reinforces my interpretation, but never mind that). The point is that this punishment may not be enough, or may come too late, so we make sure that Player j is hit really hard for L periods before switching to the conciliation phase.

Finally, the second condition on $\delta$ has the following interpretation: by punishing Player j, Player i potentially loses $M - u_i(p_{-j}, r_j(p_{-j}))$ in each of the L periods following Player j's deviation (be it a deviation from a(0) or from whatever j was supposed to play). On the other hand, after L rounds of punishment, the game will switch to the C(j) conciliation stage. Now, Player i prefers to be in the C(j) state rather than in the C(i) state, by assumption, so if the discount factor is large enough, she will not deviate. [The only subtle point is that she may actually deviate at any $t \in \{1, \ldots, L\}$ (where time is measured starting from the first stage after Player j's deviation).
If she deviates at t, she will actually be held down to $v_i$ starting from $t+1$ and until $t + L \geq L + 1$; the condition in the text is thus stronger than it needs to be (it assumes a payoff of M from $t+1$ to L, and a punishment payoff of $u_i(a(i))$ from $L+1$ to $t+L$).]
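As a final check, the discounting problem identified earlier — the Row player prefers the payoff stream (1,1,2,2,...) to (0,0,2,2,...) at every discount factor — is easy to verify numerically. A minimal sketch (the helper function and its truncation horizon are my own):

```python
def discounted_value(head, tail_payoff, delta, horizon=10_000):
    """Normalized discounted value (1 - delta) * sum_t delta^(t-1) u_t of a
    stream that plays out `head` and then `tail_payoff` forever (truncated)."""
    total, weight = 0.0, 1.0
    for t in range(horizon):
        u = head[t] if t < len(head) else tail_payoff
        total += weight * u
        weight *= delta
    return (1 - delta) * total

# Row's payoffs during Column's punishment in Figure 1:
# punishing with D yields (0,0,2,2,...); deviating to A yields (1,1,2,2,...)
for delta in (0.1, 0.5, 0.9, 0.99):
    punish = discounted_value([0, 0], 2, delta)
    shirk = discounted_value([1, 1], 2, delta)
    assert shirk > punish  # abandoning the punishment pays at every delta
```

The difference between the two values is $(1-\delta)(1+\delta) > 0$, which is why no discount factor, however large, rescues the naive construction.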