Chapter 9 Tables And Information
Retrieval
1,Introduction,Breaking the lgn Barrier
2,Rectangular Arrays
3,Tables of Various Shapes
4,Tables,A New Abstract Data Type
5,Application,Radix Sort
Chapter 9 Tables And Information
Retrieval
6,Hashing
7,Analysis of Hashing
8,Conclusions,Comparison of
Methods
9,Application,The Life Game
Revisited
10,Pointers and Pitfalls
9.1 Introduction,Breaking the lgn Barrier
◆ By use of key comparisons alone,it is
impossible to complete a search of n items in
fewer than lg n comparisons,on average(lower
bound_search_thm).
◆ Ordinary table lookup or array access
requires only constant time O(1).
◆ Both table lookup and searching share the
same essential purpose,that of information
retrieval,The key used for searching and the
index used for table lookup have the same
essential purpose,one piece of information that
is used to locate further information.
◆ Both table lookup and searching algorithms
provide functions from a set of keys or indices
to locations in a list or array.
◆ In this chapter we study ways to implement
and access various kinds of tables in
contiguous storage.
◆ Several steps may be needed to retrieve an
entry from some kinds of tables,but the time
required remains O(1).It is bounded by a
constant that does not depend on the size of
the table,Thus table lookup can be more
efficient than any searching method.
◆ We shall implement abstractly defined
tables with arrays,In order to distinguish
between the abstract concept and its
implementation,we introduce:
Convention
The index defining an entry of an abstractly
defined table is enclosed in parentheses,
whereas the index of an entry of an array
is enclosed in square brackets.
9.2 Rectangular Arrays
1 Row-major and Column-major Ordering:
Fig 9.1 pg.381
通常行主序
存储方式较普遍
2 indexing Rectangular table Pg.382
Suppose position of Entry(0,0) is Loc(0,0),every entry
store
c cell,then Entry(i,j) position Loc(i,j)
Loc(i,j)=Loc(0,0)+(i*3+j)*c ( Row-major ordering)
Loc(i,j)=Loc(0,0)+(i+2*j)*c ( Column-major ordering )
In row-major ordering,entry [i,j ] goes to position
ni+j,
存储位置
(地址 )
序号
index
3 Variation,A Access Arrays Pg.382
Fig 9.2 pg.383
9.3 Tables of Various Shapes
Fig 9.3 pg.384
1 triangular table
Fig 9.4 pg.384
Lower triangular matrix
我们以矩阵的下三角部分为例,来分析矩阵元素 对应的
存储空间的地址。在下三角部分,第 k行有 k个元素,而
元素之前共有 i-1 行,再加上第 i行的 j个元素,所以 排在
下三角部分的位序 (index) 是:
2,Jagged Table with Access
Table
Fig 9.5 pg.385
3,Inverted Table
Fig 9.6 pg.387
倒排表
◆ In mathematics a function is defined in terms of
two sets and a correspondence from elements of the
first set to elements of the second,If f is a function
from a set A to a set B,then f assigns to each
element of A a unique element of B.
◆ The set A is called the domain of f,and the set B
is called the codomain of f,
9.4 Tables,A New Abstract Data Type
1 Functions
◆ The subset of B containing just those elements that
occur as values of f is called the range of f,
DEFINITION A table with index set I and base type T is a
function from I to T together with the following operations.
1,Table access,Evaluate the function at any index in I,
2,Table assignment,Modify the function by changing its
value at a specified index in I to the new value specified
in the assignment.
3,Creation,Set up a new function.
4,Clearing,Remove all elements from the index set I,so
there is no remaining domain.
5,Insertion,Adjoin a new element x to the index set I and
define a corresponding value of the function at x.
6,Deletion,Delete an element x from the index set I and
restrict the function to the resulting smaller domain.
See pg,390-391
9.5 Application,Radix Sort
1 The Idea
基数排序和上一章中所述各类排序方法均不同,它是
一种借助于多排序码排序的思想对单逻辑排序码进行排序
的方法 。 例如,
Fig 9.10 pg.392
前面我们介绍的各种排序方法都是以关键字的比较与数
据的移动为主, 但 基数排序却 主要借助, 分配, 和, 收
集, 两种操作来完成排序任务 。
Fig 9.11 pg.393
2 Linked Implementation of Radix Sort
◆ List definition:
template <class Record>
class Sortable_list,public List<Record>
{ public,
void radix_sort( );
private,
// auxiliary functions
void rethread(Queue queues[]);
};
Inherit from
linked Listdiffer fromabove RecordContiguous Queue
(chapter 3)
◆ Record definition:
class Record
{ public:
char_key letter(int position) const;
Record( ); //default constructor
operator Key( ) const; // cast to Key
// Add other methods and data members for the class.
};
◆ Sorting Method,Linked Radix Sort
const int max_chars = 28;
template <class Record>
void Sortable_list<Record>,,radix_sort( )
{ Record data; Queue queues[max_chars];
for(int position= key_size-1; position >= 0; position--)
{while(remove(0,data) == success)
{ int
queue_number=alphabetic_order(data.key_letter(position));
queues[queue_number].append(data); // Queue operation.
}
rethread(queues); //Reassemble the list.
}
}
◆ Auxiliary Functions,Linked Radix
Sort
Selecting a queue:int alphabetic_order(char c)
/* Post,The function returns the alphabetic position of
character c,
or it returns 0 if the character is blank,*/
{ if(c==' ') return 0;
if('a'<=c && c<='z') return c-'a'+1;
if('A'<=c && c<='Z') return c-'A'+1;
return 27;
}该例子许多部分没有实现,请同学自己练习。
也可以不按该例给出的这些类及方法,自己实现之。
3,Analysis of Radix Sort
◆ The time used by radix sort is ?(nk),where n is the
number of items being sorted and k is the number of
characters in a key.
◆ The relative performance of radix sort to other
methods will relate to the relative sizes of nk and nlgn;
that is,of k and lgn.
◆ If the keys are long but there are relatively few of
them,then k is large and lg n relatively small,and
other methods (such as mergesort) will outperform
radix sort.
◆ If k is small (the keys are short) and there are a
large number of keys,then radix sort will be faster
than any other method we have studied.
9.6 Hash Tables
在记录的存储位置和它的关键字之间建立一个确定的对
应关系 h,使每个关键字和结构中一个唯一的存储位置相
对应 。 因而在查找时, 只要根据这个对应关系 h找到给定
值 k的映象 h(k),若结构中存在关键字和 k相等的记录, 则
必定在 h(k)的存储位置上, 因此不需要进行比较便可直接
得到所查记录 。 我们称这个对应关系 h为散列 (Hash)函数
( 也译为杂凑函数 ), 用散列函数建立的表称为散列表 。
1 Sparse Table See Pg.397
稀疏表
通常,关键字集合比较大,它的元素包括所有可能
的关键字,而地址集合的元素仅为散列表中的地址值
(设表长为 n,则地址为 0~ n-1)。 例如在 C编译程序中
需对源程序中的标识符建立一张散列表。在设定散列表函
数时要考虑的关键字字集合应包含所有可能产生的关键字,
按 C语言的语法规定,标识符可定义为以字母开头的最多
8个字符的字母、数字、下划线组成的字符串,且大小写
字母是可区分的。因此,C语言中标识的集合大小为:
14763152263152163152152 101, 57!CC2!CCCCC ??????? ?
实际上任何源程序都不会出现这么多的标识符,设表长
为 1,000就足够了,地址集合中的元素为 0~ 999。
In practice,however,only a small fraction of the keys
Will actually occur.That is spare.
Hash Table See Pg.398
◆ Start with an array that holds the hash table.
◆ Use a hash function to take a key and map it to some
index in the array,This function will generally map
several different keys to the same index.
◆ If the desired record is in the location given by the
index,then we are finished; otherwise we must use
some method to resolve the collision that may have
occurred between two records wanting to go to the
same location.
◆ To use hashing we must (a) find good hash
functions and (b) determine how to resolve collisions.
2,Choosing a Hash Function
◆ A hash function should be easy and quick to
compute.
◆ A hash function should achieve an even distribution
of the
keys that actually occur across the range of indices.
◆ The usual way to make a hash function is to take the
冲突
thereby obtain an index that will be uniformly
distributed over the range of indices.
◆ Note that there is nothing random about a hash
function,If
the function is evaluated more than once on the same
key,
then it must give the same result every time,so the key
can
be retrieved without fail.
◆ Truncation,Sometimes we ignore part of the key,and
use the remaining part as the index.
◆ Folding,We may partition the key into several parts
and combine the parts in a convenient way.
◆ Modular arithmetic,We may convert the key to an
integer,
divide by the size of the index range,and take the
均匀分布
C++ Example of a Hash Function pg.400-pg.401
3,Collision Resolution with Open Addressing
Linear probing:
Linear probing starts with the hash address and
searches sequentially for the target key or an empty
position,The array should be considered circular,so
that when the last location is reached,the search
proceeds to the first location of the array.
线性探测聚集
Clustering,See pg.401
当表中 i,i+1,i+2位置上已填有记录时,下一个散列地址为 i,
i+1,i+2和 i+3的记录都将填入 i+3的位置,这种散列地址
不同的记录争夺同一个后继散列地址的现象称作“聚集”,
即在处理 同义词 的冲突过程中又增加了非同义词的冲突,
显然,这种现象对查找不利。
Quadratic Probing:
If there is a collision at hash address h,quadratic
probing goes to locations h+1,h+4,h+9,?,that is,at
locations h+i2 (mod hashsize) for i= 1,2,?,二次探 测
Other methods:
◆ Key-dependent increments;
◆ Random probing.
Hash Table Specifications
const int hash_size=997;
class Hash_table
{ public:
Hash_table( );
void clear( );
Error_code insert(const Record &new entry);
Error_code retrieve(const Key &target,Record &found)
const;
a prime number of
appropriate size
private:
Record table[hash_size];
};
Hash table,,Hash_table( );
Post,The hash table has been created and initialized to be
empty.
void Hash table,,clear( );
Post,The hash table has been cleared and is empty.
Error_code Hash_table,,retrieve(const Key &target,
Record &found) const;
Post,If an entry in the hash table has key equal to
target,then found takes on the value of such an entry,and
success is returned,Otherwise,not_present is returned.
Hash Table Insertion
Error_code Hash_table::insert(const Record &new_entry)
/* Post:If the Hash_table is full,a code of overflow is
returned,
If the table already contains an item with the key
of
new_entry a code of duplicate error is returned.
Otherwise,The Record new_entry is inserted into
the
Hash_table and success is returned.
Uses:Methods for classes Key,and Record,The function
hash, */
{ Error_code result=success;
int probe_count,increment,probe;
Key null;
null.make_blank( );
Counter to be sure
that table is not full
Increment used for
quadratic probing
Position currently
probed in the
hash tableNull key forcomparison purposes
while( table[probe]!=null && table[probe]!=new_entry
&&
probe_count<(hash_size+1)/2)
{ probe_count++;
probe=(probe+increment)%hash_size;
increment += 2;
}
if(table[probe]==null) table[probe]=new_entry;
else if(table[probe]==new_entry) result=duplicate_error;
else result=overflow;
return result;
}
Is the location empty?Duplicate key?Has overflow occurred?
Prepare increment
for next iterationInsert new_entryhe table is full
4,Collision Resolution by chaining
◆ The linked lists from the hash table are called chains.
◆ If the records are large,a chained hash table can
save space.
◆ Collision resolution with chaining is simple;
clustering is no problem.
◆ The hash table itself can be smaller than the number
of records; overflow is no problem.
◆ Deletion is quick and easy in a chained hash table.
◆ If the records are very small and the table nearly full,
chaining may take more space.
Code for Chained Hash Tables
class Hash_table
{
public:
// Specify methods here.
private:
List<Record> table[hash_size];
};
Constructor:
The implementation of the constructor simply calls
the constructor for each list in the array.
Clear:
To clear the table,we must clear the linked list in each
of the table positions,using the List method clear( ).
Retrieval:
Sequential_search(table[hash(target)],target,position);
Insertion:
table[hash(new entry)].insert(0,new entry);
Deletion:
Error_code Hash table,,remove(const Key type
&target,Record &x);
Example:
设一个散列表包含 hash_Size = 13个表项, 其下标从 0到 12,
散列函数采用除留余数法, 用 %hash_Size将各关键码映像
到表中, 采用线性探查法解决冲突 。 给定的关键码集合是:
{400,126,45,32,58,100,3,29,200,10,36 }。
假设每一个关键码被查找的概率相同,请计算出 ASL=?
1,4 54511163)3226(1111A S L ????????
9.7 Analysis of Hashing
1,The Birthday Surprise
How many randomly chosen people need to be in a room
before it becomes likely that two people will have the same
birthday (month and day)?
The probability that m people all have different
birthdays is
This expression becomes less than 0.5 whenever m ?23.
For hashing,the birthday surprise says that for any
problem of reasonable size,collisions will almost
certainly occur.
A probe is one comparison of a key with the target.
The load factor of the table is ?= n /t,where n positions
are occupied out of a total of t positions in the table.
Retrieval from a chained hash table with load factor ?
requires approximately 1+ ?/2 probes in the successful
case and probes in the unsuccessful case.
Retrieval from a hash table with open addressing,random
probing,and load factor ? requires approximately
probes in the successful case and probes in the
unsuccessful case,respectively.
Retrieval from a hash table with open addressing,linear
probing,and load factor requires approximately
probes in the successful case and in the unsuccessful
case,respectively.
Theoretical comparisons:
Empirical comparisons:
经验主义的
9.8 Conclusions,Comparison of Methods
We have studied four principal methods of
information retrieval,the first two for lists and the
second two for tables,Often we can choose either
lists or tables for our data structures.
◆ Sequential search is ?(n).
Sequential search is the most flexible method,
The data may be stored in any order,with either
contiguous or linked representation.
◆ Binary search is ?(㏒ n).
Binary search demands more,but is faster,The
keys must be in order,and the data must be in
random-access representation (contiguous storage).
可随机访问的
◆ Table lookup is ?(1).
Ordinary lookup in contiguous tables is best,both
in speed and convenience,unless a list is preferred,
or the set of keys is sparse,or insertions or deletions
are frequent.
◆ Hash-table retrieval is ?(1).
Hashing requires the most structure,a peculiar
ordering of the keys well suited to retrieval from the
hash table,but generally useless for any other
purpose,If the data are to be available for human
inspection,then some kind of order is needed,and a
hash table is inappropriate.
9.9 Application,The Life Game Revisited
◆ The Life cells are supposed to be on an unbounded
grid,not a finite array as used in Chapter 1.
◆ In the class Life,we would like to have the
declaration class Life
{ public:
// methods
private:
bool map[int][int];
// other data and auxiliary functions
};
which is,of course,illegal.
See Pg,418
◆ Use a hash table to represent the data member map
(sparse array).
◆ The main function remains unchanged from
Chapter 1.
◆ Rewrite the method update so that it uses a hash
table to look up the status of cells.
◆ For any given cell in a configuration,we can
determine the number of living neighbors by
looking up the status of each neighboring cell.
◆ Candidates that might live in the coming generation
are those that are alive and their (dead) neighbors,
Method update will traverse these cells,determine
their neighbor counts by using the hash table,and
select those cells that will live in the next generation.
See Pg,420 - 426
有 100个待排序的整数, 其值在区间 [0,999]而且
互不相同,试利用散列表结构 (设表长为 100,处理冲
突采用链地址法 )实现对上述整数集合的排序 。
要求自行设计散列函数,并分析最好情况及最坏
情况下的时间复杂度。
答:由于是排序, 将散列函数设计为 i / 10(必须
按大小排, 因此要根据高位数排列 ),发生冲突时按整
数值从小到大插入链表 。
最好情况 出现在这 100个待排序的整数均匀分布
在区间 [0,999];散列表每项仅一个整数;
最坏情况 是:每项有 10个数在链表中, 共 10项不
空 。
Pointers and Pitfalls
◆ Use top-down design for your data structures,just
as you do for your algorithms.
◆ Before considering detailed structures,decide what
operations on the data will be required.
◆ For the design and programming of lists,see
Chapter 6.
◆ Use the logical structure of the data to decide what
kind of table to use.
◆ Let the structure of the data help you decide
whether an index function or an access table is better
for accessing a table of data.
◆ In using a hash table,let the nature of the data and
the required operations help you decide between
chaining and open addressing.
◆ Hash functions must usually be custom-designed
for the kind of keys used for accessing the hash table.
◆ Recall from the analysis of hashing that some
collisions will almost inevitably occur.
◆ For open addressing,clustering is unlikely to be a
problem until the hash table is more than half full.
◆ Sequential search is slow but robust,Use it for
short lists or if there is any doubt that the keys in
the list are properly ordered.
◆ Be extremely careful if you must reprogram
binary search,Verify that your algorithm is
correct and test it on all the extreme cases.
◆ Drawing trees is an excellent way both to trace
the action of an algorithm and to analyze its
behavior.
◆ Rely on the big-O analysis of algorithms for
large applications but not for small applications.
Retrieval
1,Introduction,Breaking the lgn Barrier
2,Rectangular Arrays
3,Tables of Various Shapes
4,Tables,A New Abstract Data Type
5,Application,Radix Sort
Chapter 9 Tables And Information
Retrieval
6,Hashing
7,Analysis of Hashing
8,Conclusions,Comparison of
Methods
9,Application,The Life Game
Revisited
10,Pointers and Pitfalls
9.1 Introduction,Breaking the lgn Barrier
◆ By use of key comparisons alone,it is
impossible to complete a search of n items in
fewer than lg n comparisons,on average(lower
bound_search_thm).
◆ Ordinary table lookup or array access
requires only constant time O(1).
◆ Both table lookup and searching share the
same essential purpose,that of information
retrieval,The key used for searching and the
index used for table lookup have the same
essential purpose,one piece of information that
is used to locate further information.
◆ Both table lookup and searching algorithms
provide functions from a set of keys or indices
to locations in a list or array.
◆ In this chapter we study ways to implement
and access various kinds of tables in
contiguous storage.
◆ Several steps may be needed to retrieve an
entry from some kinds of tables,but the time
required remains O(1).It is bounded by a
constant that does not depend on the size of
the table,Thus table lookup can be more
efficient than any searching method.
◆ We shall implement abstractly defined
tables with arrays,In order to distinguish
between the abstract concept and its
implementation,we introduce:
Convention
The index defining an entry of an abstractly
defined table is enclosed in parentheses,
whereas the index of an entry of an array
is enclosed in square brackets.
9.2 Rectangular Arrays
1 Row-major and Column-major Ordering:
Fig 9.1 pg.381
通常行主序
存储方式较普遍
2 indexing Rectangular table Pg.382
Suppose position of Entry(0,0) is Loc(0,0),every entry
store
c cell,then Entry(i,j) position Loc(i,j)
Loc(i,j)=Loc(0,0)+(i*3+j)*c ( Row-major ordering)
Loc(i,j)=Loc(0,0)+(i+2*j)*c ( Column-major ordering )
In row-major ordering,entry [i,j ] goes to position
ni+j,
存储位置
(地址 )
序号
index
3 Variation,A Access Arrays Pg.382
Fig 9.2 pg.383
9.3 Tables of Various Shapes
Fig 9.3 pg.384
1 triangular table
Fig 9.4 pg.384
Lower triangular matrix
我们以矩阵的下三角部分为例,来分析矩阵元素 对应的
存储空间的地址。在下三角部分,第 k行有 k个元素,而
元素之前共有 i-1 行,再加上第 i行的 j个元素,所以 排在
下三角部分的位序 (index) 是:
2,Jagged Table with Access
Table
Fig 9.5 pg.385
3,Inverted Table
Fig 9.6 pg.387
倒排表
◆ In mathematics a function is defined in terms of
two sets and a correspondence from elements of the
first set to elements of the second,If f is a function
from a set A to a set B,then f assigns to each
element of A a unique element of B.
◆ The set A is called the domain of f,and the set B
is called the codomain of f,
9.4 Tables,A New Abstract Data Type
1 Functions
◆ The subset of B containing just those elements that
occur as values of f is called the range of f,
DEFINITION A table with index set I and base type T is a
function from I to T together with the following operations.
1,Table access,Evaluate the function at any index in I,
2,Table assignment,Modify the function by changing its
value at a specified index in I to the new value specified
in the assignment.
3,Creation,Set up a new function.
4,Clearing,Remove all elements from the index set I,so
there is no remaining domain.
5,Insertion,Adjoin a new element x to the index set I and
define a corresponding value of the function at x.
6,Deletion,Delete an element x from the index set I and
restrict the function to the resulting smaller domain.
See pg,390-391
9.5 Application,Radix Sort
1 The Idea
基数排序和上一章中所述各类排序方法均不同,它是
一种借助于多排序码排序的思想对单逻辑排序码进行排序
的方法 。 例如,
Fig 9.10 pg.392
前面我们介绍的各种排序方法都是以关键字的比较与数
据的移动为主, 但 基数排序却 主要借助, 分配, 和, 收
集, 两种操作来完成排序任务 。
Fig 9.11 pg.393
2 Linked Implementation of Radix Sort
◆ List definition:
template <class Record>
class Sortable_list,public List<Record>
{ public,
void radix_sort( );
private,
// auxiliary functions
void rethread(Queue queues[]);
};
Inherit from
linked Listdiffer fromabove RecordContiguous Queue
(chapter 3)
◆ Record definition:
class Record
{ public:
char_key letter(int position) const;
Record( ); //default constructor
operator Key( ) const; // cast to Key
// Add other methods and data members for the class.
};
◆ Sorting Method,Linked Radix Sort
const int max_chars = 28;
template <class Record>
void Sortable_list<Record>,,radix_sort( )
{ Record data; Queue queues[max_chars];
for(int position= key_size-1; position >= 0; position--)
{while(remove(0,data) == success)
{ int
queue_number=alphabetic_order(data.key_letter(position));
queues[queue_number].append(data); // Queue operation.
}
rethread(queues); //Reassemble the list.
}
}
◆ Auxiliary Functions,Linked Radix
Sort
Selecting a queue:int alphabetic_order(char c)
/* Post,The function returns the alphabetic position of
character c,
or it returns 0 if the character is blank,*/
{ if(c==' ') return 0;
if('a'<=c && c<='z') return c-'a'+1;
if('A'<=c && c<='Z') return c-'A'+1;
return 27;
}该例子许多部分没有实现,请同学自己练习。
也可以不按该例给出的这些类及方法,自己实现之。
3,Analysis of Radix Sort
◆ The time used by radix sort is ?(nk),where n is the
number of items being sorted and k is the number of
characters in a key.
◆ The relative performance of radix sort to other
methods will relate to the relative sizes of nk and nlgn;
that is,of k and lgn.
◆ If the keys are long but there are relatively few of
them,then k is large and lg n relatively small,and
other methods (such as mergesort) will outperform
radix sort.
◆ If k is small (the keys are short) and there are a
large number of keys,then radix sort will be faster
than any other method we have studied.
9.6 Hash Tables
在记录的存储位置和它的关键字之间建立一个确定的对
应关系 h,使每个关键字和结构中一个唯一的存储位置相
对应 。 因而在查找时, 只要根据这个对应关系 h找到给定
值 k的映象 h(k),若结构中存在关键字和 k相等的记录, 则
必定在 h(k)的存储位置上, 因此不需要进行比较便可直接
得到所查记录 。 我们称这个对应关系 h为散列 (Hash)函数
( 也译为杂凑函数 ), 用散列函数建立的表称为散列表 。
1 Sparse Table See Pg.397
稀疏表
通常,关键字集合比较大,它的元素包括所有可能
的关键字,而地址集合的元素仅为散列表中的地址值
(设表长为 n,则地址为 0~ n-1)。 例如在 C编译程序中
需对源程序中的标识符建立一张散列表。在设定散列表函
数时要考虑的关键字字集合应包含所有可能产生的关键字,
按 C语言的语法规定,标识符可定义为以字母开头的最多
8个字符的字母、数字、下划线组成的字符串,且大小写
字母是可区分的。因此,C语言中标识的集合大小为:
14763152263152163152152 101, 57!CC2!CCCCC ??????? ?
实际上任何源程序都不会出现这么多的标识符,设表长
为 1,000就足够了,地址集合中的元素为 0~ 999。
In practice,however,only a small fraction of the keys
Will actually occur.That is spare.
Hash Table See Pg.398
◆ Start with an array that holds the hash table.
◆ Use a hash function to take a key and map it to some
index in the array,This function will generally map
several different keys to the same index.
◆ If the desired record is in the location given by the
index,then we are finished; otherwise we must use
some method to resolve the collision that may have
occurred between two records wanting to go to the
same location.
◆ To use hashing we must (a) find good hash
functions and (b) determine how to resolve collisions.
2,Choosing a Hash Function
◆ A hash function should be easy and quick to
compute.
◆ A hash function should achieve an even distribution
of the
keys that actually occur across the range of indices.
◆ The usual way to make a hash function is to take the
冲突
thereby obtain an index that will be uniformly
distributed over the range of indices.
◆ Note that there is nothing random about a hash
function,If
the function is evaluated more than once on the same
key,
then it must give the same result every time,so the key
can
be retrieved without fail.
◆ Truncation,Sometimes we ignore part of the key,and
use the remaining part as the index.
◆ Folding,We may partition the key into several parts
and combine the parts in a convenient way.
◆ Modular arithmetic,We may convert the key to an
integer,
divide by the size of the index range,and take the
均匀分布
C++ Example of a Hash Function pg.400-pg.401
3,Collision Resolution with Open Addressing
Linear probing:
Linear probing starts with the hash address and
searches sequentially for the target key or an empty
position,The array should be considered circular,so
that when the last location is reached,the search
proceeds to the first location of the array.
线性探测聚集
Clustering,See pg.401
当表中 i,i+1,i+2位置上已填有记录时,下一个散列地址为 i,
i+1,i+2和 i+3的记录都将填入 i+3的位置,这种散列地址
不同的记录争夺同一个后继散列地址的现象称作“聚集”,
即在处理 同义词 的冲突过程中又增加了非同义词的冲突,
显然,这种现象对查找不利。
Quadratic Probing:
If there is a collision at hash address h,quadratic
probing goes to locations h+1,h+4,h+9,?,that is,at
locations h+i2 (mod hashsize) for i= 1,2,?,二次探 测
Other methods:
◆ Key-dependent increments;
◆ Random probing.
Hash Table Specifications
const int hash_size=997;
class Hash_table
{ public:
Hash_table( );
void clear( );
Error_code insert(const Record &new entry);
Error_code retrieve(const Key &target,Record &found)
const;
a prime number of
appropriate size
private:
Record table[hash_size];
};
Hash table,,Hash_table( );
Post,The hash table has been created and initialized to be
empty.
void Hash table,,clear( );
Post,The hash table has been cleared and is empty.
Error_code Hash_table,,retrieve(const Key &target,
Record &found) const;
Post,If an entry in the hash table has key equal to
target,then found takes on the value of such an entry,and
success is returned,Otherwise,not_present is returned.
Hash Table Insertion
Error_code Hash_table::insert(const Record &new_entry)
/* Post:If the Hash_table is full,a code of overflow is
returned,
If the table already contains an item with the key
of
new_entry a code of duplicate error is returned.
Otherwise,The Record new_entry is inserted into
the
Hash_table and success is returned.
Uses:Methods for classes Key,and Record,The function
hash, */
{ Error_code result=success;
int probe_count,increment,probe;
Key null;
null.make_blank( );
Counter to be sure
that table is not full
Increment used for
quadratic probing
Position currently
probed in the
hash tableNull key forcomparison purposes
while( table[probe]!=null && table[probe]!=new_entry
&&
probe_count<(hash_size+1)/2)
{ probe_count++;
probe=(probe+increment)%hash_size;
increment += 2;
}
if(table[probe]==null) table[probe]=new_entry;
else if(table[probe]==new_entry) result=duplicate_error;
else result=overflow;
return result;
}
Is the location empty?Duplicate key?Has overflow occurred?
Prepare increment
for next iterationInsert new_entryhe table is full
4,Collision Resolution by chaining
◆ The linked lists from the hash table are called chains.
◆ If the records are large,a chained hash table can
save space.
◆ Collision resolution with chaining is simple;
clustering is no problem.
◆ The hash table itself can be smaller than the number
of records; overflow is no problem.
◆ Deletion is quick and easy in a chained hash table.
◆ If the records are very small and the table nearly full,
chaining may take more space.
Code for Chained Hash Tables
class Hash_table
{
public:
// Specify methods here.
private:
List<Record> table[hash_size];
};
Constructor:
The implementation of the constructor simply calls
the constructor for each list in the array.
Clear:
To clear the table,we must clear the linked list in each
of the table positions,using the List method clear( ).
Retrieval:
Sequential_search(table[hash(target)],target,position);
Insertion:
table[hash(new entry)].insert(0,new entry);
Deletion:
Error_code Hash table,,remove(const Key type
&target,Record &x);
Example:
设一个散列表包含 hash_Size = 13个表项, 其下标从 0到 12,
散列函数采用除留余数法, 用 %hash_Size将各关键码映像
到表中, 采用线性探查法解决冲突 。 给定的关键码集合是:
{400,126,45,32,58,100,3,29,200,10,36 }。
假设每一个关键码被查找的概率相同,请计算出 ASL=?
1,4 54511163)3226(1111A S L ????????
9.7 Analysis of Hashing
1,The Birthday Surprise
How many randomly chosen people need to be in a room
before it becomes likely that two people will have the same
birthday (month and day)?
The probability that m people all have different
birthdays is
This expression becomes less than 0.5 whenever m ?23.
For hashing,the birthday surprise says that for any
problem of reasonable size,collisions will almost
certainly occur.
A probe is one comparison of a key with the target.
The load factor of the table is ?= n /t,where n positions
are occupied out of a total of t positions in the table.
Retrieval from a chained hash table with load factor ?
requires approximately 1+ ?/2 probes in the successful
case and probes in the unsuccessful case.
Retrieval from a hash table with open addressing,random
probing,and load factor ? requires approximately
probes in the successful case and probes in the
unsuccessful case,respectively.
Retrieval from a hash table with open addressing,linear
probing,and load factor requires approximately
probes in the successful case and in the unsuccessful
case,respectively.
Theoretical comparisons:
Empirical comparisons:
经验主义的
9.8 Conclusions,Comparison of Methods
We have studied four principal methods of
information retrieval,the first two for lists and the
second two for tables,Often we can choose either
lists or tables for our data structures.
◆ Sequential search is ?(n).
Sequential search is the most flexible method,
The data may be stored in any order,with either
contiguous or linked representation.
◆ Binary search is ?(㏒ n).
Binary search demands more,but is faster,The
keys must be in order,and the data must be in
random-access representation (contiguous storage).
可随机访问的
◆ Table lookup is ?(1).
Ordinary lookup in contiguous tables is best,both
in speed and convenience,unless a list is preferred,
or the set of keys is sparse,or insertions or deletions
are frequent.
◆ Hash-table retrieval is ?(1).
Hashing requires the most structure,a peculiar
ordering of the keys well suited to retrieval from the
hash table,but generally useless for any other
purpose,If the data are to be available for human
inspection,then some kind of order is needed,and a
hash table is inappropriate.
9.9 Application,The Life Game Revisited
◆ The Life cells are supposed to be on an unbounded
grid,not a finite array as used in Chapter 1.
◆ In the class Life,we would like to have the
declaration class Life
{ public:
// methods
private:
bool map[int][int];
// other data and auxiliary functions
};
which is,of course,illegal.
See Pg,418
◆ Use a hash table to represent the data member map
(sparse array).
◆ The main function remains unchanged from
Chapter 1.
◆ Rewrite the method update so that it uses a hash
table to look up the status of cells.
◆ For any given cell in a configuration,we can
determine the number of living neighbors by
looking up the status of each neighboring cell.
◆ Candidates that might live in the coming generation
are those that are alive and their (dead) neighbors,
Method update will traverse these cells,determine
their neighbor counts by using the hash table,and
select those cells that will live in the next generation.
See Pg,420 - 426
有 100个待排序的整数, 其值在区间 [0,999]而且
互不相同,试利用散列表结构 (设表长为 100,处理冲
突采用链地址法 )实现对上述整数集合的排序 。
要求自行设计散列函数,并分析最好情况及最坏
情况下的时间复杂度。
答:由于是排序, 将散列函数设计为 i / 10(必须
按大小排, 因此要根据高位数排列 ),发生冲突时按整
数值从小到大插入链表 。
最好情况 出现在这 100个待排序的整数均匀分布
在区间 [0,999];散列表每项仅一个整数;
最坏情况 是:每项有 10个数在链表中, 共 10项不
空 。
Pointers and Pitfalls
◆ Use top-down design for your data structures,just
as you do for your algorithms.
◆ Before considering detailed structures,decide what
operations on the data will be required.
◆ For the design and programming of lists,see
Chapter 6.
◆ Use the logical structure of the data to decide what
kind of table to use.
◆ Let the structure of the data help you decide
whether an index function or an access table is better
for accessing a table of data.
◆ In using a hash table,let the nature of the data and
the required operations help you decide between
chaining and open addressing.
◆ Hash functions must usually be custom-designed
for the kind of keys used for accessing the hash table.
◆ Recall from the analysis of hashing that some
collisions will almost inevitably occur.
◆ For open addressing,clustering is unlikely to be a
problem until the hash table is more than half full.
◆ Sequential search is slow but robust,Use it for
short lists or if there is any doubt that the keys in
the list are properly ordered.
◆ Be extremely careful if you must reprogram
binary search,Verify that your algorithm is
correct and test it on all the extreme cases.
◆ Drawing trees is an excellent way both to trace
the action of an algorithm and to analyze its
behavior.
◆ Rely on the big-O analysis of algorithms for
large applications but not for small applications.