Hash Functions & Collisions

Designing hash functions and handling collisions. Learn chaining and open addressing strategies.

A hash function's quality directly impacts hash table performance. Understanding how to design hash functions and handle collisions is essential for implementing and analyzing hash-based data structures.

What Makes a Good Hash Function?

A good hash function should:

Be deterministic: Same input always produces same output
Distribute uniformly: Spread keys evenly across the array
Be fast to compute: O(1) for fixed-size keys
Minimize collisions: Different keys should rarely hash to same index

// Bad hash function - causes clustering
fn bad_hash(key: &str, size: usize) -> usize {
    key.len() % size  // Many words have same length!
}

// Better hash function - uses all characters
fn better_hash(key: &str, size: usize) -> usize {
    let mut hash_val: u64 = 0;
    for ch in key.chars() {
        hash_val = hash_val.wrapping_mul(31).wrapping_add(ch as u64);
    }
    (hash_val as usize) % size
}

Common Hash Functions

Division Method

fn division_hash(key: usize, size: usize) -> usize {
    key % size
}

// For strings, first convert to number
fn string_hash_division(s: &str, size: usize) -> usize {
    let mut hash_val: u64 = 0;
    for c in s.chars() {
        hash_val = hash_val.wrapping_mul(31).wrapping_add(c as u64);
    }
    (hash_val as usize) % size
}

Why Prime Size?

Using a prime number for table size improves distribution. If size has common factors with the hash values, keys cluster at certain indices.

Multiplication Method

fn multiplication_hash(key: usize, size: usize) -> usize {
    const A: f64 = 0.6180339887;  // (sqrt(5) - 1) / 2
    ((size as f64) * (((key as f64) * A) % 1.0)) as usize
}

Polynomial Rolling Hash

Common for strings:

fn polynomial_hash(s: &str, base: u64, modulus: u64) -> u64 {
    let mut hash_val: u64 = 0;
    let mut base_power: u64 = 1;

    for c in s.chars() {
        hash_val = (hash_val + (c as u64) * base_power) % modulus;
        base_power = (base_power * base) % modulus;
    }

    hash_val
}

// Example usage with base=31 and mod=10^9+7
fn main() {
    let hash = polynomial_hash("hello", 31, 1_000_000_007);
    println!("{}", hash);
}

Collision Resolution Strategies

1. Separate Chaining

Each slot contains a linked list of colliding elements:

class HashTableChaining:
    def __init__(self, size=101):  # Prime size
        self.size = size
        self.buckets = [[] for _ in range(size)]
        self.count = 0

    def _hash(self, key):
        if isinstance(key, str):
            h = 0
            for c in key:
                h = h * 31 + ord(c)
            return h % self.size
        return key % self.size

    def put(self, key, value):
        idx = self._hash(key)

        for i, (k, v) in enumerate(self.buckets[idx]):
            if k == key:
                self.buckets[idx][i] = (key, value)
                return

        self.buckets[idx].append((key, value))
        self.count += 1
        self._check_load_factor()

    def get(self, key):
        idx = self._hash(key)
        for k, v in self.buckets[idx]:
            if k == key:
                return v
        return None

    def _check_load_factor(self):
        if self.count / self.size > 0.75:
            self._resize()

    def _resize(self):
        old_buckets = self.buckets
        self.size = self.size * 2 + 1  # Keep it odd (closer to prime)
        self.buckets = [[] for _ in range(self.size)]
        self.count = 0

        for bucket in old_buckets:
            for key, value in bucket:
                self.put(key, value)

2. Linear Probing

On collision, check the next slot:

class HashTableLinear:
    def __init__(self, size=101):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size
        self.DELETED = object()  # Tombstone marker

    def _hash(self, key):
        return hash(key) % self.size

    def put(self, key, value):
        idx = self._hash(key)
        start = idx

        while self.keys[idx] is not None and self.keys[idx] is not self.DELETED:
            if self.keys[idx] == key:
                self.values[idx] = value
                return
            idx = (idx + 1) % self.size
            if idx == start:
                raise Exception("Table full")

        self.keys[idx] = key
        self.values[idx] = value

    def get(self, key):
        idx = self._hash(key)
        start = idx

        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                return self.values[idx]
            idx = (idx + 1) % self.size
            if idx == start:
                break

        return None

    def remove(self, key):
        idx = self._hash(key)
        start = idx

        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                self.keys[idx] = self.DELETED
                self.values[idx] = None
                return True
            idx = (idx + 1) % self.size
            if idx == start:
                break

        return False

Primary Clustering

Linear probing causes "primary clustering" where consecutive filled slots form long chains, degrading performance. Quadratic probing and double hashing reduce this.

3. Quadratic Probing

Probe at h(k) + 1², h(k) + 2², h(k) + 3², ...

fn quadratic_probe(&self, key: &K, attempt: usize) -> usize {
    let base = self.hash(key);
    (base + attempt * attempt) % self.size
}

4. Double Hashing

Use a second hash function to determine the probe step:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn double_hash_probe<K: Hash>(&self, key: &K, attempt: usize) -> usize {
    let h1 = self.hash(key);

    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    let h2 = 1 + ((hasher.finish() as usize) % (self.size - 1));  // Never 0

    (h1 + attempt * h2) % self.size
}

Comparison of Strategies

| Strategy | Pros | Cons | |----------|------|------| | Chaining | Simple, handles high load | Extra memory for pointers | | Linear Probing | Cache-friendly | Primary clustering | | Quadratic Probing | Reduces clustering | May not probe all slots | | Double Hashing | Best distribution | More computation |

Load Factor Analysis

Load factor α = n / m (items / slots)

Chaining:

Average probes for successful search: 1 + α/2
Average probes for unsuccessful search: 1 + α

Open Addressing (Linear Probing):

Average probes for successful: (1/2)(1 + 1/(1-α))
Average probes for unsuccessful: (1/2)(1 + 1/(1-α)²)

Load Factor vs Expected Probes (Linear Probing):

α = 0.5: ~1.5 probes (successful), ~2.5 probes (unsuccessful)
α = 0.7: ~2.2 probes (successful), ~6.1 probes (unsuccessful)
α = 0.9: ~5.5 probes (successful), ~50 probes (unsuccessful)

Keep α < 0.7 for good performance!

Robin Hood Hashing

A clever variant of linear probing that reduces variance:

class RobinHoodHashTable:
    def __init__(self, size=16):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size
        self.probe_lengths = [0] * size
        self.count = 0

    def _hash(self, key):
        return hash(key) % self.size

    def put(self, key, value):
        idx = self._hash(key)
        probe_len = 0

        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                self.values[idx] = value
                return

            # Robin Hood: steal from the rich
            if probe_len > self.probe_lengths[idx]:
                # Swap and continue inserting displaced element
                key, self.keys[idx] = self.keys[idx], key
                value, self.values[idx] = self.values[idx], value
                probe_len, self.probe_lengths[idx] = self.probe_lengths[idx], probe_len

            idx = (idx + 1) % self.size
            probe_len += 1

        self.keys[idx] = key
        self.values[idx] = value
        self.probe_lengths[idx] = probe_len
        self.count += 1

Cryptographic vs Non-Cryptographic

| Type | Example | Properties | Use Case | |------|---------|------------|----------| | Non-crypto | MurmurHash, FNV | Fast, good distribution | Hash tables | | Cryptographic | SHA-256, MD5 | Secure, slow | Passwords, signatures |

For hash tables, use fast non-cryptographic hashes. Cryptographic hashes are overkill.

Real-World Implementation Tips

Use prime table sizes for division method
Keep load factor below 0.75
Use power-of-2 sizes with good hash functions (allows bitwise AND)
Handle deletion carefully in open addressing (use tombstones)
Consider Robin Hood for better worst-case behavior

Key Takeaways

Good hash functions distribute uniformly and compute quickly
Collisions are inevitable—handle them gracefully
Chaining is simple but uses more memory
Open addressing is cache-friendly but needs careful deletion
Load factor directly affects performance—resize when needed
Double hashing provides best distribution for open addressing

Understanding collision resolution helps you choose the right hash table implementation and predict performance characteristics. Most languages use sophisticated implementations, but knowing the fundamentals helps when performance tuning or implementing custom solutions.

Hash Tables

Introduction to Graphs