Introduction

A hash table is one of the most effective ways of implementing a dictionary data type. The dictionary is one of the three fundamental abstract data types (containers, dictionaries, and priority queues). It allows access to data items by content. For example, you can access the definition of a word by entering the word:

>>> dc_definition = {"hello": "used as a greeting or to begin a telephone conversation."}
>>> print(dc_definition["hello"])
used as a greeting or to begin a telephone conversation.

will return the definition of the word "hello". Here the word "hello" is the search key and the definition is the value. The primary operations of a dictionary are:

  • self.search(k) – Given a search key k, return a pointer to the element with key k in the dictionary
  • self.insert(x) – Given a data item x, add it to the dictionary
  • self.delete(x) – Given a pointer to a data item x, remove it from the dictionary
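As a quick sketch, the three operations map directly onto Python's built-in dict (the names and values here are illustrative):

```python
# Python's built-in dict as a dictionary ADT: the three primary
# operations correspond to indexing, assignment, and del.
d = {}

d["hello"] = "a greeting"   # insert(x): add a key/value pair
definition = d["hello"]     # search(k): retrieve the value for key "hello"
del d["hello"]              # delete(x): remove the pair from the dictionary

print(definition)           # -> a greeting
print("hello" in d)         # -> False (the key was deleted)
```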

From the primary operations listed above, it is natural to use a direct-address table or a hash table to maintain a dictionary.

Direct-Address Table

A direct-address table is the most trivial way to represent a dictionary. If the keys are small non-negative integers, you can simply use each key as the index of an array, as shown below.

[Figure: a direct-address table]
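A minimal sketch of a direct-address table, assuming keys are integers in a known, fixed range (the class and method names here are illustrative):

```python
# Direct-address table: the key itself is the array index.
# Assumes keys are integers in the range [0, size).
class DirectAddressTable:
    def __init__(self, size):
        self.slots = [None] * size  # one slot per possible key

    def insert(self, key, value):
        self.slots[key] = value

    def search(self, key):
        return self.slots[key]      # None if the key is absent

    def delete(self, key):
        self.slots[key] = None

t = DirectAddressTable(100)
t.insert(42, "answer")
print(t.search(42))  # -> answer
```

Note that the table allocates one slot for every *possible* key, which is exactly the "gigantic memory hog" problem described below.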

However, this approach is terrible for two reasons:

  1. Keys may not be non-negative integers
  2. You can end up with a gigantic memory hog

Solution to 1: Prehash. Map keys to non-negative integers.

Solution to 2: Hashing. Reduce the universe U of all keys (integers) down to a reasonable size m for the table.

Hash Tables

Hash tables resolve the pitfalls of the direct-address table using the ideas of prehashing and hashing. Prehashing maps keys to non-negative integers. Hashing reduces the universe of all keys down to a reasonable size m for the table. At the core of a hash table is a "hash function": a mathematical function that maps keys to the integers {0, 1, …, m-1}. The output of the hash function is the index of the array slot in which the item will be stored:

H(S) : U -> {0, 1, …, m-1}

where S is the key. The hash function plays a critical role in distributing the keys evenly so that the average complexity stays low. If the hash value is too big, you can take the remainder H(S) mod m. That's all good, but what happens when H(X) = H(Y) for X != Y? This results in a collision.
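As a concrete illustration, here is a simple polynomial string hash reduced mod m; the base constant 31 is a conventional choice, not something from the text:

```python
def h(s, m, base=31):
    """Map string s to an index in {0, 1, ..., m-1}.

    The prehash treats s as a number written in base `base`;
    taking the remainder mod m at each step keeps the running
    value small and the final result inside the table.
    """
    value = 0
    for ch in s:
        value = (value * base + ord(ch)) % m
    return value

m = 8
print(h("hello", m))            # some index in 0..7
print(h("hello", m) == h("hello", m))  # -> True: same key, same slot
```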

Collision Resolution

No matter how good the hash function is, collisions will always occur. A collision is when two distinct keys hash to the same value. Chaining is often the approach to collision resolution.

Here is how Chaining works:

  1. Represent the hash table as an array of m linked lists
  2. The ith list contains all the items that hash to the value i
  3. Search, insert, and delete reduce to the corresponding operations on linked lists
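The steps above can be sketched as follows; this is a minimal chaining table in which Python lists stand in for the linked lists, and the names are illustrative:

```python
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # m "linked lists"

    def _index(self, key):
        return hash(key) % self.m  # the ith list holds keys hashing to i

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self._index(key)
        self.buckets[i] = [(k, v) for k, v in self.buckets[i] if k != key]

t = ChainedHashTable()
t.insert("hello", "a greeting")
print(t.search("hello"))  # -> a greeting
t.delete("hello")
print(t.search("hello"))  # -> None
```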

[Figure 1. Collision resolution by chaining: an array of linked lists. Assuming n items and m buckets, the average time complexities of search, insert, and delete are O(n/m), O(1), and O(1), respectively. (http://slideplayer.com/slide/4245924/)]

Chaining is a very natural way of dealing with collisions, but it devotes a considerable amount of memory to pointers. The worst case for chaining arises when all the keys hash to the same slot, leaving a single linked list of n items that you must iterate through to insert, search, and delete. So why not use that pointer memory to make the table itself bigger?

An alternative approach is called open addressing. The hash table is maintained as an array of elements (not buckets, as in chaining), each initialized to null.

  1. Initialize an array with null elements
  2. Upon insertion, check whether the desired position is empty
  3. If empty, insert
  4. Otherwise, find another place to insert it instead
    1. Sequential (linear) probing: insert the item in the next open slot in the table. If the table is not too full, each contiguous run should be fairly small
  5. For searching (assume sequential probing for now), first check the slot the key hashes to; if it holds the key we want, return its value; otherwise, probe forward until we find the key or hit an empty slot
  6. For deletion, you have to be careful: deleting an element can break a chain of insertions and make some elements inaccessible. In that case you have to reinsert the elements that come after the new hole.
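The steps above can be sketched as follows; this is a minimal open-addressing table with sequential probing and delete-by-reinsertion, and the names are illustrative:

```python
class OpenAddressTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [None] * m     # each slot holds (key, value) or None

    def _probe(self, key):
        # Sequential (linear) probing: start at hash(key) % m and
        # scan forward, wrapping around the end of the array.
        i = hash(key) % self.m
        for _ in range(self.m):
            yield i
            i = (i + 1) % self.m

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def search(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None         # hit a hole: the key is absent
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return              # key was never inserted
            if self.slots[i][0] == key:
                self.slots[i] = None
                # Reinsert the run that follows, so no element
                # becomes unreachable across the new hole.
                j = (i + 1) % self.m
                while self.slots[j] is not None:
                    k, v = self.slots[j]
                    self.slots[j] = None
                    self.insert(k, v)
                    j = (j + 1) % self.m
                return

t = OpenAddressTable()
t.insert("John Smith", "521-1234")
t.insert("Sandra Dee", "521-9655")
print(t.search("Sandra Dee"))  # -> 521-9655
```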

[Figure 2. Collision resolution by open addressing. John Smith and Sandra Dee hash to the same value, so Sandra Dee probes to the next available slot and is inserted there. (https://en.wikipedia.org/wiki/Data_structure)]

Analysis of Hashing

For the analysis of hashing, people often use the assumption of simple uniform hashing: any given element is equally likely to hash into any of the m slots, independent of where any other element has hashed. Under this admittedly unrealistic assumption, one can prove that the average cost of hash table operations depends only on the load factor, the ratio n/m of items to slots.
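A quick simulation (illustrative, not from the text) makes the role of the load factor concrete: hashing n items into m slots uniformly at random gives an average bucket load of exactly n/m, with the fullest bucket only modestly above that, not anywhere near n:

```python
import random

def simulate(n=10_000, m=100, seed=0):
    """Throw n keys into m buckets uniformly at random and report
    the average and maximum bucket load. The average is exactly
    n/m (the load factor); under uniform hashing the maximum
    stays close to it."""
    rng = random.Random(seed)
    loads = [0] * m
    for _ in range(n):
        loads[rng.randrange(m)] += 1
    return sum(loads) / m, max(loads)

avg, worst = simulate()
print(avg)    # -> 100.0 (the load factor n/m)
print(worst)  # fullest bucket: modestly above the average, far below n
```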

Hash Functions

Up to this point, we have covered hash tables and collision resolution. As mentioned before, a good hash function is critical for performance. So what makes a good hash function? A simple answer is one that satisfies the simple uniform hashing assumption stated above. In reality, however, this condition is hard to check, since we rarely know the probability distribution from which the keys are drawn, nor whether the keys are drawn independently.

TBC..