Introduction to Algorithms 6.046J/18.401J/SMA5503 Lecture 8 Prof. Charles E. Leiserson Introduction to Algorithms Day 12 L8.2? 2001 by Charles E. Leiserson A weakness of hashing Problem: For any hash function h, a set of keys exists that can cause the average access time of a hash table to skyrocket. IDEA: Choose the hash function at random, independently of the keys. ? Even if an adversary can see your code, he or she cannot find a bad set of keys, since he or she doesn’t know exactly which hash function will be chosen. ? An adversary can pick all keys from {k ∈ U : h(k) = i} for some slot i. Introduction to Algorithms Day 12 L8.3? 2001 by Charles E. Leiserson Universal hashing Definition. Let U be a universe of keys, and let H be a finite collection of hash functions, each mapping U to {0, 1, …, m–1}. We say H is universal if for all x, y ∈ U, where x ≠ y, we have |{h ∈H: h(x) = h(y)}| = |H|/m. That is, the chance of a collision between x and y is 1/m if we choose h randomly from H. H {h : h(x) = h(y)} |H| m Introduction to Algorithms Day 12 L8.4? 2001 by Charles E. Leiserson Universality is good Theorem. Let h be a hash function chosen (uniformly) at random from a universal set H of hash functions. Suppose h is used to hash n arbitrary keys into the m slots of a table T. Then, for a given key x, we have E[#collisions with x] < n/m. Introduction to Algorithms Day 12 L8.5? 2001 by Charles E. Leiserson Proof of theorem Proof. Let C x be the random variable denoting the total number of collisions of keys in T with x, and let c xy = 1 if h(x) = h(y), 0 otherwise. Note: E[c xy ] = 1/m and ∑ ?∈ = }{xTy xyx cC . Introduction to Algorithms Day 12 L8.6? 2001 by Charles E. Leiserson Proof (continued) ? ? ? ? ? ? ? ? = ∑ ?∈ }{ ][ xTy xyx cECE ? Take expectation of both sides. Introduction to Algorithms Day 12 L8.7? 2001 by Charles E. Leiserson Proof (continued) ∑ ∑ ?∈ ?∈ = ? ? ? ? ? ? ? ? = }{ }{ ][ ][ xTy xy xTy xyx cE cECE ? Linearity of expectation. ? Take expectation of both sides. Introduction to Algorithms Day 12 L8.8? 2001 by Charles E. Leiserson Proof (continued) ∑ ∑ ∑ ?∈ ?∈ ?∈ = = ? ? ? ? ? ? ? ? = }{ }{ }{ /1 ][ ][ xTy xTy xy xTy xyx m cE cECE ? E[c xy ] = 1/m. ? Linearity of expectation. ? Take expectation of both sides. Introduction to Algorithms Day 12 L8.9? 2001 by Charles E. Leiserson Proof (continued) m n m cE cECE xTy xTy xy xTy xyx 1 /1 ][ ][ }{ }{ }{ ? = = = ? ? ? ? ? ? ? ? = ∑ ∑ ∑ ?∈ ?∈ ?∈ ? Take expectation of both sides. ? Linearity of expectation. ? E[c xy ] = 1/m. ? Algebra.. Introduction to Algorithms Day 12 L8.10? 2001 by Charles E. Leiserson REMEMBER THIS! Constructing a set of universal hash functions Let m be prime. Decompose key k into r + 1 digits, each with value in the set {0, 1, …, m–1}. That is, let k = ?k 0 , k 1 , …, k r ?, where 0 ≤ k i < m. Randomized strategy: Pick a = ?a 0 , a 1 , …, a r ? where each a i is chosen randomly from {0, 1, …, m–1}. mkakh r i iia mod)( 0 ∑ = =Define . How big is H = {h a }? |H| = m r + 1 . Dot product, modulo m Introduction to Algorithms Day 12 L8.11? 2001 by Charles E. Leiserson Universality of dot-product hash functions Theorem. The set H = {h a } is universal. Proof. Suppose that x = ?x 0 , x 1 , …, x r ? and y = ?y 0 , y 1 , …, y r ? be distinct keys. Thus, they differ in at least one digit position, wlog position 0. For how many h a ∈Hdo x and y collide? )(mod 00 myaxa r i ii r i ii ∑∑ == ≡ . We must have h a (x) = h a (y), which implies that Introduction to Algorithms Day 12 L8.12? 2001 by Charles E. Leiserson Proof (continued) Equivalently, we have )(mod0)( 0 myxa r i iii ≡? ∑ = or )(mod0)()( 1 000 myxayxa r i iii ≡?+? ∑ = )(mod)()( 1 000 myxayxa r i iii ∑ = ??≡? which implies that , . Introduction to Algorithms Day 12 L8.13? 2001 by Charles E. Leiserson Fact from number theory Theorem. Let m be prime. For any z ∈Z m such that z ≠ 0, there exists a unique z –1 ∈Z m such that z · z –1 ≡ 1 (mod m). Example: m = 7. z z –1 1 2 3 4 5 6 1 4 5 2 3 6 Introduction to Algorithms Day 12 L8.14? 2001 by Charles E. Leiserson Back to the proof )(mod)()( 1 000 myxayxa r i iii ∑ = ??≡? We have and since x 0 ≠ y 0 , an inverse (x 0 – y 0 ) –1 must exist, which implies that , )(mod)()( 1 00 1 0 myxyxaa r i iii ? = ?? ? ? ? ? ? ? ? ? ??≡ ∑ . Thus, for any choices of a 1 , a 2 , …, a r , exactly one choice of a 0 causes x and y to collide. Introduction to Algorithms Day 12 L8.15? 2001 by Charles E. Leiserson Proof (completed) Q. How many h a ’s cause x and y to collide? A. There are m choices for each of a 1 , a 2 , …, a r , but once these are chosen, exactly one choice for a 0 causes x and y to collide, namely myxyxaa r i iii mod)()( 1 00 1 0 ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ??= ? = ∑ . Thus, the number of h a ’s that cause x and y to collide is m r ·1 = m r = |H|/m. Introduction to Algorithms Day 12 L8.16? 2001 by Charles E. Leiserson Perfect hashing Given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case. IDEA: Two- level scheme with universal hashing at both levels. No collisions at level 2! 40 40 37 37 22 22 0 1 2 3 4 5 6 26 26 ma 012345678 14 14 27 27 S 4 S 6 S 1 4 4 31 31 1 1 00 00 9 9 86 86 T h 31 (14) = h 31 (27) = 1 Introduction to Algorithms Day 12 L8.17? 2001 by Charles E. Leiserson Collisions at level 2 Theorem. Let H be a class of universal hash functions for a table of size m = n 2 . Then, if we use a random h ∈H to hash n keys into the table, the expected number of collisions is at most 1/2. Proof. By the definition of universality, the probability that 2 given keys in the table collide under h is 1/m = 1/n 2 . Since there are pairs of keys that can possibly collide, the expected number of collisions is ( ) 2 n 2 11 2 )1( 1 2 22 <? ? =? ? ? ? ? ? ? n nn n n . Introduction to Algorithms Day 12 L8.18? 2001 by Charles E. Leiserson No collisions at level 2 Corollary. The probability of no collisions is at least 1/2. Thus, just by testing random hash functions in H, we’ll quickly find one that works. Proof. Markov’s inequality says that for any nonnegative random variable X, we have Pr{X ≥ t} ≤ E[X]/t. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2. Introduction to Algorithms Day 12 L8.19? 2001 by Charles E. Leiserson Analysis of storage For the level-1 hash table T, choose m = n, and let n i be random variable for the number of keys that hash to slot i in T. By using n i 2 slots for the level-2 hash table S i , the expected total storage required for the two-level scheme is therefore () )( 1 0 2 nnE m i i Θ= ? ? ? ? ? ? Θ ∑ ? = , since the analysis is identical to the analysis from recitation of the expected running time of bucket sort. (For a probability bound, apply Markov.)