2015-10-07

Hash

Purpose

    To translate an extremely large key space into a reasonably small range of integers.

hashCode = hashFunction(key)

    Typically, the hash table is maintained in an array H[0, ..., h-1].
    This is handy for implementing a dictionary ADT.
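
A minimal C sketch of this layout, assuming string keys and an integer value type (both illustrative choices, not fixed by these notes):

    #include <stddef.h>

    /* One (key, value) pair of the dictionary ADT; the field types are illustrative. */
    typedef struct Entry {
        const char *key;
        int         value;
    } Entry;

    /* The hash table: an array H[0..h-1] of slots, plus n, the number of keys
     * currently stored (used below for the load factor n/h). */
    typedef struct HashTable {
        Entry **H;    /* H[i] is NULL if slot i is empty */
        size_t  h;    /* number of slots */
        size_t  n;    /* number of keys stored */
    } HashTable;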

Two Issues

  • choose a hash function - the hash code should have a uniform distribution.
  • handle collisions - necessary because a hash function is many-to-one.

Hash Functions

  • a good hash function spreads the keys around fairly uniformly (there is some theoretical work on this subject)
  • in practice
    • choose h as a power of 2
    • the mod operator can then be implemented as num & (h - 1), because h is a power of 2.
    • choose the multiplier a = 8(h/23) + 5
    • hashCode = (a·K) mod h    (if K is an integer)
    • hashCode = (a^2·K[1] + a·K[2]) mod h    (if K is a pair of integers)
    • hashCode = (a^len·K[1] + a^(len-1)·K[2] + ... + a·K[len]) mod h    (if K is a string; see the C sketch after this list)
    • double the array size (the hash table) if the load factor (n/h) is greater than 50% (to reduce the probability of collisions; typically we choose 2n for the range of the hash code).
    • after doubling, go through the old array and reinsert each key into the new array using the new hash function.
    • to reduce the amount of computation, the new hash function can be simpler, e.g. hashCode = 2·K[1] + 1 for strings.
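
A sketch in C of the string hash above, assuming h is a power of 2 (so mod h is the bitmask h - 1) and reading h/23 in the multiplier formula as integer division (an assumption); Horner's rule evaluates the polynomial:

    #include <stddef.h>

    /* hashCode = (a^len·K[1] + a^(len-1)·K[2] + ... + a·K[len]) mod h,
     * evaluated with Horner's rule. Letting the unsigned accumulator wrap
     * is harmless: overflow is arithmetic modulo a power of two, and h
     * divides that power of two, so the final mask gives the right residue. */
    size_t hash_string(const char *key, size_t h)
    {
        size_t a = 8 * (h / 23) + 5;   /* multiplier from the notes; integer division assumed */
        size_t acc = 0;
        for (const char *p = key; *p != '\0'; ++p)
            acc = (acc + (unsigned char)*p) * a;
        return acc & (h - 1);          /* mod h, valid because h is a power of 2 */
    }

With h = 1024, for example, hash_string("key", 1024) always yields an index in [0, 1023].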

Collision policies

  • Closed address hashing (also called chained hashing) - maintain a linked list of the keys that share the same hash code.
    • the running time for searching within a linked list is O(n) in the worst case
    • why not use a balanced BST instead? It has theoretical advantages, but this is rarely done in practice: since load factors are kept small, people rely more on a uniformly distributed hash code than on sophisticated data structures.
  • Open address hashing - store all elements in the hash table array itself rather than in linked lists.
    • more space efficient than closed address hashing.
    • searching takes place right in the hash table, so it is likely to be more time efficient as well.
    • choosing an alternative location is called rehashing.
    • the simplest rehashing policy is linear probing: check the next cell, and the next after that, until one is found that is not occupied by some other key.
    • when the table is nearly full, the average cost of a successful search is about sqrt(n), which is not good performance. So why consider this policy?
    • one reason is that the performance is quite good if the load factor is kept low, say 50%, and with array doubling that is doable.
    • another reason is that a more sophisticated rehashing scheme can alleviate the problems of the linear probing strategy.
    • deletion problem: simply emptying a cell could cut other keys off from their probe path. If instead we mark the cell with a special "obsolete" value, then 1) the obsolete cell can be reused for a new element, and 2) when judging the load factor, "obsolete" cells still count as elements. See the C sketch after this list.
    • in conclusion, this strategy for handling collisions seems problematic.
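
A minimal open-addressing sketch in C, with linear probing and an OBSOLETE marker for deleted cells; it reuses the hash_string sketch above, and the sentinel values and struct layout are illustrative assumptions:

    #include <stddef.h>
    #include <string.h>

    size_t hash_string(const char *key, size_t h);    /* from the sketch above */

    #define EMPTY    ((const char *)0)    /* cell never used */
    #define OBSOLETE ((const char *)1)    /* cell's key was deleted; may be reused */

    typedef struct {
        const char **keys;   /* keys[0..h-1]: EMPTY, OBSOLETE, or a stored key */
        size_t       h;      /* number of cells, a power of 2 */
        size_t       n;      /* cells counted toward the load factor, including OBSOLETE ones */
    } OpenTable;

    /* Linear probing search: start at the hash cell and scan forward, wrapping
     * around, until the key or an EMPTY cell is found. OBSOLETE cells are
     * stepped over, so deleted keys do not cut off the probe path. */
    long find(const OpenTable *t, const char *key)
    {
        size_t i = hash_string(key, t->h);
        for (size_t step = 0; step < t->h; ++step) {
            const char *k = t->keys[i];
            if (k == EMPTY)
                return -1;                     /* not present */
            if (k != OBSOLETE && strcmp(k, key) == 0)
                return (long)i;                /* found at cell i */
            i = (i + 1) & (t->h - 1);          /* next cell, mod h */
        }
        return -1;                             /* scanned the whole table */
    }

    /* Deletion only marks the cell; it still counts toward n and the load factor. */
    void remove_key(OpenTable *t, const char *key)
    {
        long i = find(t, key);
        if (i >= 0)
            t->keys[i] = OBSOLETE;
    }

Insertion (not shown) may reuse an OBSOLETE cell for a new key, but the doubling test still treats that cell as occupied, as noted above.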

Implementation 


  • my C code is here.

Reference

  • Sara Baase, Computer Algorithms: Introduction to Design and Analysis, 3rd ed., Section 6.5

