An element with
key k hashes to slot h(k). Thus, h(k) is the hash value of
key k. The animation above illustrates this basic idea. The point of
the
hash function is to reduce the range of array indices that need to be
handled. Instead of |U| values, only m values are handled.
Storage requirements are correspondingly reduced.
The fly in the ointment of
this creative idea is that two keys may hash to the same slot--a
collision. Fortunately, there are effective techniques for resolving
the conflict created by collisions.
METHODS FOR CREATING A HASH
FUNCTION
The division method
The division
method involves mapping a key k into one of m
slots by taking
the remainder of k divided by m as expressed in the
hash
function
h(k) = k mod m .
For
example, if the hash table has size m = 12 and the key is k
= 100, then h(k) = 4. Since it requires only a single division
operation, hashing by division is quite fast.
When
using the division method, certain values
of m are usually avoided. For
example, m should not be a power of 2, since if m = 2p,
then h(k) is just the p lowest-order bits of k.
Unless it is known a priori that the probability distribution on keys
makes all low-order
p-bit patterns equally likely, it is better
that the hash function depends on all the bits of the key. Powers of
10 should be avoided if the application deals with decimal numbers as
keys, since the hash function does not depend on all the decimal
digits of k. Finally, it can be shown that when m = 2p
- 1 and k is a character string interpreted in radix 2p,
two strings that are identical except for a transposition of two
adjacent characters will hash to the same value.
Good
values for m are primes not too close to exact powers of 2. An
example would be to allocate a hash table using collisions
resolved by chaining for n = 2000 character
strings of 8 bits. An
average of 3 elements are examined in an unsuccessful search, and a
hash
table of size m = 701 is allocated. The number 701 is chosen
because it is a
prime near a= 2000/3 but not near any power of 2. Treating each key k
as an integer, the hash function would be
h(k) = k mod 701.
A precautionary measure
would involve checking how evenly this hash function distributes sets
of keys
among the slots, where the keys are chosen from "real" data.
The multiplication method
The multiplication
method for creating hash functions operates in two steps.
First the key k is multiplied by a constant A in the
range 0 < A < 1 and the fractional part of kA is extracted.
Then, this value is multiplied by m and the floor of the
result is taken.
In short, the hash function is
h(k) = m (k A mod 1).
COLLISION RESOLUTION METHODS
Linear probing
Given an ordinary hash
function h': U -> {0, 1, . . . , m - 1}, the
method of linear probing uses the hash function
h(k,i) = (h'(k) + i) mod m
for i = 0,1,...,m
- 1. Given key k, the first slot probed is T[h'(k)].
The next slot T[h'(k) + 1] is probed, and so on
up to
slot T[m - 1]. Slots T[0], T[1], . . .
, follow until a final probe
slot T[h'(k) - 1] is found. Since the initial
probe
position determines the entire probe sequence, only m distinct
probe sequences are used with linear probing.
Linear probing is simple to
implement, but it suffers from a problem known as primary
clustering. Long runs of occupied slots build up, increasing
the average search time. For example, if n = m/2
keys in the table, where every even-indexed slot is occupied and every
odd-indexed slot is empty, then the average unsuccessful search takes
1.5 probes. If the first n = m/2 locations are
occupied, however, the average number of probes increases to about n/4
= m/8. Clusters are likely to arise, since if an empty slot is
preceded by i full slots, then the probability that the empty
slot is the next one filled is (i + 1)/m, compared with a
probability of 1/m if the preceding slot were empty. Thus, runs
of occupied slots tend to get longer, and linear probing is not an
accurate approximation of uniform hashing.
Double hashing
Double
hashing is one of the best methods available for open addressing
because the permutations produced have many of the characteristics of
randomly chosen permutations. Double hashing uses a hash
function of the form
h(k, i) = (h1(k) + ih2(k)) mod m,
where h1
and h2 are auxiliary hash functions. The initial
position probed is T[h1 (k)];
successive probe positions are offset from previous positions by the
amount h2(k), modulo m. Thus, unlike
the case of linear or quadratic probing, the probe sequence here
depends in two ways upon the key k, since the initial probe
position, the offset, or both, may vary.
Double
hashing represents an improvement over linear or quadratic probing. As
a result, the performance of double hashing appears to be very close to
the performance of the "ideal" scheme of uniform hashing.
Chaining
In chaining, all the
elements that hash to the same slot are placed in a linked list. Slot j
contains a
pointer to the head of the list of all stored elements that hash to j;
if there are no such elements, slot j contains NIL.
The dictionary operations on
a hash table T are easy to implement when collisions are resolved by
chaining.
CHAINED-HASH-INSERT(T,x)
insert x at the head of list T[h(key[x])]
CHAINED-HASH-SEARCH(T,k)
search for an element with key k in list T[h(k)]
CHAINED-HASH-DELETE(T,x)
delete x from the list T[h(key[x])]
The worst-case running time
for insertion is O(1). For searching, the worst-case running time is
proportional to the length of the list; this will be analyzed in
greater detail below. Deletion of an element x can be accomplished in
O(1)
time if the lists are doubly linked. (If the lists are singly linked,
the first x in the list T[h(key[x])] must be found, so that the next
link of
x's predecessor can be properly set to splice x out. In this case,
deletion and searching have essentially the same running time.)
|