HASH TABLES


 

An element with key k hashes to slot h(k). Thus, h(k) is the hash value of key k. The animation above illustrates this basic idea. The point of the hash function is to reduce the range of array indices that need to be handled. Instead of |U| values, only m values are handled. Storage requirements are correspondingly reduced.

 

The fly in the ointment of this creative idea is that two keys may hash to the same slot--a collision. Fortunately, there are effective techniques for resolving the conflict created by collisions.

 

METHODS FOR CREATING A HASH FUNCTION

The division method

The division method  involves mapping a key k into one of m slots by taking the remainder of k divided by m as expressed in the hash function

h(k) = k mod m .

For example, if the hash table has size m = 12 and the key is k = 100, then h(k) = 4. Since it requires only a single division operation, hashing by division is quite fast.

When using the division method, certain values of m are usually avoided. For example, m should not be a power of 2, since if m = 2p, then h(k) is just the p lowest-order bits of k. Unless it is known a priori that the probability distribution on keys makes all low-order
p
-bit patterns equally likely, it is better that the hash function depends on all the bits of the key. Powers of 10 should be avoided if the application deals with decimal numbers as keys, since the hash function does not depend on all the decimal digits of k. Finally, it can be shown that when m = 2p - 1 and k is a character string interpreted in radix 2p, two strings that are identical except for a transposition of two adjacent characters will hash to the same value.

Good values for m are primes not too close to exact powers of 2. An example would be to allocate a hash table using collisions resolved by chaining for n = 2000 character strings of 8 bits. An average of 3 elements are examined in an unsuccessful search, and a hash table of size m = 701 is allocated. The number 701 is chosen because it is a prime near a= 2000/3 but not near any power of 2. Treating each key k as an integer, the hash function would be

h(k) = k mod 701.

A precautionary measure would involve checking how evenly this hash function distributes sets of keys among the slots, where the keys are chosen from "real" data.

The multiplication method

The multiplication method for creating hash functions operates in two steps. First the key k is multiplied by a constant A in the range 0 < A < 1 and the fractional part of kA is extracted. Then, this value is multiplied by m and the floor of the result is taken. In short, the hash function is

h(k) = m (k A mod 1).

COLLISION RESOLUTION METHODS

Linear probing

Given an ordinary hash function h': U -> {0, 1, . . . , m - 1}, the method of linear probing uses the hash function

h(k,i) = (h'(k) + i) mod m

for i = 0,1,...,m - 1. Given key k, the first slot probed is T[h'(k)]. The next slot T[h'(k) + 1] is probed, and so on up to slot T[m - 1]. Slots T[0], T[1], . . . ,  follow until a final probe slot T[h'(k) - 1] is found. Since the initial probe position determines the entire probe sequence, only m distinct probe sequences are used with linear probing.

Linear probing is simple to implement, but it suffers from a problem known as primary clustering. Long runs of occupied slots build up, increasing the average search time. For example, if n = m/2 keys in the table, where every even-indexed slot is occupied and every odd-indexed slot is empty, then the average unsuccessful search takes 1.5 probes. If the first n = m/2 locations are occupied, however, the average number of probes increases to about n/4 = m/8. Clusters are likely to arise, since if an empty slot is preceded by i full slots, then the probability that the empty slot is the next one filled is (i + 1)/m, compared with a probability of 1/m if the preceding slot were empty. Thus, runs of occupied slots tend to get longer, and linear probing is not an accurate approximation of uniform hashing.

Double hashing

Double hashing is one of the best methods available for open addressing because the permutations produced have many of the characteristics of randomly chosen permutations. Double hashing uses a hash function of the form

h(k, i) = (h1(k) + ih2(k)) mod m,

where h1 and h2 are auxiliary hash functions. The initial position probed is T[h1 (k)]; successive probe positions are offset from previous positions by the amount h2(k), modulo m. Thus, unlike the case of linear or quadratic probing, the probe sequence here depends in two ways upon the key k, since the initial probe position, the offset, or both, may vary.

Double hashing represents an improvement over linear or quadratic probing. As a result, the performance of double hashing appears to be very close to the performance of the "ideal" scheme of uniform hashing.

Chaining

In chaining, all the elements that hash to the same slot are placed in a linked list. Slot j contains a pointer to the head of the list of all stored elements that hash to j; if there are no such elements, slot j contains NIL.

The dictionary operations on a hash table T are easy to implement when collisions are resolved by chaining.

CHAINED-HASH-INSERT(T,x)
insert x at the head of list T[h(key[x])]

CHAINED-HASH-SEARCH(T,k)
search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T,x)
delete x from the list T[h(key[x])]

The worst-case running time for insertion is O(1). For searching, the worst-case running time is proportional to the length of the list; this will be analyzed in greater detail below. Deletion of an element x can be accomplished in O(1) time if the lists are doubly linked. (If the lists are singly linked, the first x in the list T[h(key[x])] must be found, so that the next link of x's predecessor can be properly set to splice x out. In this case, deletion and searching have essentially the same running time.)