Overview

Arrays are superfast. Simply becasue the exact location (index) of the element is provided.

What if we could tie that index to the actual value of the data being stored? If there was a way to UNIQUELY map each value being stored to an index, that would allow array-speed lookups on virtually ALL data.

That’s what a hash function is. While 100% uniqueness will not be guaranteed, it is still a worthwhile function to try and find for each data set.

Consider the simplest example of the birthday problem – given a day and a month – you need to look up the name of a person who has a birthday on that date.

For simplicity, let us assume that there are only 365 people in our data set and each one has a unique birthday.

The LinkedList Solution:

If you were to use a linked list to store each person’s name and birthdate, you would have to visit an average of 183 nodes to perform this simple lookup.

The Hashtable Solution:

With a hashtable, this average is reduced to ONE!

That’s right – just ONE direct lookup retrieves the appropriate person’s name – all the lookup does is take the input (day and month) – computes its hash (which gives it a number between 1 and 365) – and uses that as an index into an array containing the names. So – if my birthday is on the 14th of Feb  (which happens to be the 45th day of the year) – the hashing function should correctly return the number 45. My storage array’s 45th element – Array[45] – should contain my name.  No traversing down any list – just one simple, straight to the point, lookup.

What is the catch? The trick, of course, lies in finding a way to map all those days and months to a unique number.

Somehow – we need to go from two numbers (the MONTH – running from 1 through 12 – and the DAY – running from 1 through 31) – and come up with a single number running from 1 through 365 (or 0 through 364 to appease arrays in most languages).

Here is a simple way to accomplish this feat.

Step 1 : Store a simple static array containing 12 entries – for the days elapsed  when you reach the FIRST of each month. So for January 1st – this entry would be 0 for February 1st, it would be 31 (since 31 days have elapsed before you get to Feb 1st), for March 1st, it would be 60. And so on for each of the 12 months.

Step 2: Given a birthday (month and day) – say Feb 14th  – all you do is look up the February days elapsed from step 1  (which is 31) – and add 14 to it – to give you 45.  For March 10th, you would take 60 and add 10 to it – to give you 70. All we are saying here is that March 10th is the 70th day of the year and Feb 14th is the 45th day of the year.

int CreateHashUsingDayandMonth(int day, int month)
{
  int[12] days_elapsed = {0, 31, 60, 121, 152, 182, 213, 244, 274, 305, 335};

  return days_elapsed[month] + day
}

This is an example of a perfect hashing function – since we assumed our data set was nice enough that there were no duplicates.  If we have duplicates, we run into ‘collisions’ which are the norm rather than the exception.  More on those in a future post.

For now,  the beauty of the simple hashing function above is that we have reduced the average number of lookups from 183 to 1 !

Don’t you think that hashes (hash functions) are under-appreciated? When was the last time you thanked a hash function?

Anuj holds professional certifications in Google Cloud, AWS as well as certifications in Docker and App Performance Tools such as New Relic. He specializes in Cloud Security, Data Encryption and Container Technologies.

Initial Consultation

Anuj Varma – who has written posts on Anuj Varma, Hands-On Technology Architect, Clean Air Activist.


Series NavigationSwapping without a temp variable