Introduction to Algorithms
6.046J/18.401J/SMA5503
Lecture 7
Prof. Charles E. Leiserson
© 2001 by Charles E. Leiserson

Symbol-table problem
Symbol table T holding n records. Each record x has a key, key[x], along with other fields containing satellite data. Operations on T:
• INSERT(T, x)
• DELETE(T, x)
• SEARCH(T, k)
How should the data structure T be organized?

Direct-access table
IDEA: Suppose that the set of keys is K ⊆ {0, 1, …, m–1}, and keys are distinct. Set up an array T[0 .. m–1]:
T[k] = x if k ∈ K and key[x] = k, NIL otherwise.
Then, operations take Θ(1) time.
Problem: The range of keys can be large:
• 64-bit numbers (which represent 18,446,744,073,709,551,616 different keys),
• character strings (even larger!).

Hash functions
Solution: Use a hash function h to map the universe U of all keys into {0, 1, …, m–1}. As each key is inserted, h maps it to a slot of T.
[Figure: h maps keys k1, …, k5 of K ⊆ U to slots h(k1), …, h(k5) of the table T.]
When a record to be inserted maps to an already occupied slot in T, a collision occurs.

Resolving collisions by chaining
Records in the same slot are linked into a list.
[Figure: keys 49, 86, and 52, with h(49) = h(86) = h(52) = i, are chained together in slot i of T.]

Analysis of chaining
We make the assumption of simple uniform hashing:
• Each key k ∈ K is equally likely to be hashed to any slot of table T, independent of where other keys are hashed.
Let n be the number of keys in the table, and let m be the number of slots. Define the load factor of T to be
α = n/m = average number of keys per slot.

Search cost
Expected time to search for a record with a given key = Θ(1 + α).
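Chaining can be sketched in a few lines of code. This is a minimal illustration, not the lecture's pseudocode: the class name, the default table size m = 8, and the use of the division-method hash h(k) = k mod m are assumptions, and the demo keys 49, 57, 65 were chosen because they collide modulo 8.

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining (illustrative sketch)."""

    def __init__(self, m=8):
        self.m = m                            # number of slots
        self.slots = [[] for _ in range(m)]   # each slot holds a chain (list)

    def _h(self, k):
        return k % self.m                     # division method: h(k) = k mod m

    def insert(self, k, record):
        # Append the record to the chain for slot h(k).
        self.slots[self._h(k)].append((k, record))

    def search(self, k):
        # Theta(1) to hash and reach the slot, plus Theta(chain length) to scan.
        for key, record in self.slots[self._h(k)]:
            if key == k:
                return record
        return None                           # unsuccessful search

    def delete(self, k):
        chain = self.slots[self._h(k)]
        self.slots[self._h(k)] = [(key, r) for key, r in chain if key != k]

t = ChainedHashTable(m=8)
for k in (49, 57, 65):                        # all congruent to 1 mod 8: one chain
    t.insert(k, f"record-{k}")
```

After these inserts, slot 1 holds the chain 49 → 57 → 65, so a search for 57 costs one hash plus a short list scan, matching the Θ(1 + α) analysis.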
The 1 accounts for applying the hash function and accessing the slot; the α accounts for searching the list. Expected search time = Θ(1) if α = O(1), or equivalently, if n = O(m).

Choosing a hash function
The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice as long as their deficiencies can be avoided.
Desiderata:
• A good hash function should distribute the keys uniformly into the slots of the table.
• Regularity in the key distribution should not affect this uniformity.

Division method
Assume all keys are integers, and define
h(k) = k mod m.
Extreme deficiency: If m = 2^r, then the hash doesn't even depend on all the bits of k:
• If k = 1011000111011010₂ and r = 6, then h(k) = 011010₂.
Deficiency: Don't pick an m that has a small divisor d. A preponderance of keys that are congruent modulo d can adversely affect uniformity.

Division method (continued)
h(k) = k mod m.
Pick m to be a prime not too close to a power of 2 or 10 and not otherwise used prominently in the computing environment.
Annoyance:
• Sometimes, making the table size a prime is inconvenient.
But, this method is popular, although the next method we'll see is usually superior.

Multiplication method
Assume that all keys are integers, m = 2^r, and our computer has w-bit words. Define
h(k) = (A·k mod 2^w) rsh (w – r),
where rsh is the "bitwise right-shift" operator and A is an odd integer in the range 2^(w–1) < A < 2^w.
• Don't pick A too close to 2^w.
• Multiplication modulo 2^w is fast.
• The rsh operator is fast.

Multiplication method example
h(k) = (A·k mod 2^w) rsh (w – r)
Suppose that m = 8 = 2³ and that our computer has w = 7-bit words. With A = 1011001₂ and k = 1101011₂, the product is A·k = 10010100110011₂. Keeping the low w = 7 bits (0110011₂) and shifting right by w – r = 4 gives h(k) = 011₂.
[Figure: "modular wheel" with 2³ = 8 spokes, showing the multiples A, 2A, 3A, … advancing around the wheel.]

Dot-product method
Randomized strategy: Let m be prime. Decompose key k into r + 1 digits, each with value in the set {0, 1, …, m–1}. That is, let k = ⟨k₀, k₁, …, k_r⟩, where 0 ≤ kᵢ < m. Pick a = ⟨a₀, a₁, …, a_r⟩, where each aᵢ is chosen randomly from {0, 1, …, m–1}. Define
h_a(k) = (Σᵢ₌₀..r aᵢ·kᵢ) mod m.
• Excellent in practice, but expensive to compute.

Resolving collisions by open addressing
No storage is used outside of the hash table itself.
• Insertion systematically probes the table until an empty slot is found.
• The hash function depends on both the key and the probe number:
h : U × {0, 1, …, m–1} → {0, 1, …, m–1}.
• The probe sequence ⟨h(k,0), h(k,1), …, h(k,m–1)⟩ should be a permutation of {0, 1, …, m–1}.
• The table may fill up, and deletion is difficult (but not impossible).

Example of open addressing
Insert key k = 496 into a table T[0 .. m–1] already holding the keys 586, 133, 204, and 481:
0. Probe h(496,0) — collision.
1. Probe h(496,1) — collision.
2. Probe h(496,2) — empty slot; insert 496 there.
Now search for key k = 496:
0. Probe h(496,0).
1. Probe h(496,1).
2.
Probe h(496,2) — the key 496 is found.
Search uses the same probe sequence, terminating successfully if it finds the key and unsuccessfully if it encounters an empty slot.

Probing strategies
Linear probing: Given an ordinary hash function h′(k), linear probing uses the hash function
h(k,i) = (h′(k) + i) mod m.
This method, though simple, suffers from primary clustering, where long runs of occupied slots build up, increasing the average search time. Moreover, the long runs of occupied slots tend to get longer.

Double hashing: Given two ordinary hash functions h₁(k) and h₂(k), double hashing uses the hash function
h(k,i) = (h₁(k) + i·h₂(k)) mod m.
This method generally produces excellent results, but h₂(k) must be relatively prime to m. One way is to make m a power of 2 and design h₂(k) to produce only odd numbers.

Analysis of open addressing
We make the assumption of uniform hashing:
• Each key is equally likely to have any one of the m! permutations as its probe sequence.
Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1–α).

Proof of the theorem
Proof.
• At least one probe is always necessary.
• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.
• With probability (n–2)/(m–2), the third probe hits an occupied slot, etc.
Observe that
(n–i)/(m–i) < n/m = α for i = 1, 2, …, n.

Proof (continued)
Therefore, the expected number of probes is
1 + (n/m)·(1 + ((n–1)/(m–1))·(1 + ((n–2)/(m–2))·(⋯(1 + 1/(m–n+1))⋯)))
≤ 1 + α·(1 + α·(1 + α·(⋯(1 + α)⋯)))
≤ 1 + α + α² + α³ + ⋯
= Σᵢ₌₀..∞ αⁱ
= 1/(1–α).
The textbook has a more rigorous proof.

Implications of the theorem
• If α is constant, then accessing an open-addressed hash table takes constant time.
• If the table is half full, then the expected number of probes is 1/(1–0.5) = 2.
• If the table is 90% full, then the expected number of probes is 1/(1–0.9) = 10.
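The theorem's bound can be checked empirically with a small open-addressing sketch using double hashing, with m a power of 2 and h₂ producing only odd values, as suggested above. This is an illustration under assumed choices: the table size M = 1024, the particular h1/h2 formulas, and the helper names are mine, not the lecture's.

```python
import random

M = 1024  # table size m = 2^10, so h2 need only produce odd values

def h1(k):
    return k % M

def h2(k):
    return (k % (M - 1)) | 1       # forcing the low bit makes h2(k) odd,
                                   # hence relatively prime to M = 2^10

def probe(k, i):
    # Double hashing: h(k,i) = (h1(k) + i*h2(k)) mod m
    return (h1(k) + i * h2(k)) % M

def insert(table, k):
    """Probe until an empty slot is found; return the number of probes used."""
    for i in range(M):
        j = probe(k, i)
        if table[j] is None:
            table[j] = k
            return i + 1
    raise RuntimeError("table is full")

def unsuccessful_search_probes(table, k):
    """Probe until an empty slot is hit (assumes k is not in the table)."""
    for i in range(M):
        j = probe(k, i)
        if table[j] is None or table[j] == k:
            return i + 1
    return M

random.seed(0)
table = [None] * M
n = M // 2                          # fill to load factor alpha = 0.5
for key in random.sample(range(10**6), n):
    insert(table, key)

# Average probes over unsuccessful searches for keys known to be absent.
trials = [unsuccessful_search_probes(table, 10**6 + k) for k in range(2000)]
avg = sum(trials) / len(trials)
# Theorem: expected probes <= 1/(1 - 0.5) = 2; avg should come out near 2.
```

Because h₂(k) is odd and m is a power of 2, each key's probe sequence really is a permutation of {0, 1, …, m–1}, so the table can always be filled; the measured average probe count at α = 0.5 lands close to the theorem's bound of 2.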