Hash Functions & Collisions
Designing hash functions and handling collisions. Learn chaining and open addressing strategies.
A hash function's quality directly impacts hash table performance. Understanding how to design hash functions and handle collisions is essential for implementing and analyzing hash-based data structures.
What Makes a Good Hash Function?
A good hash function should:
- Be deterministic: Same input always produces same output
- Distribute uniformly: Spread keys evenly across the array
- Be fast to compute: O(1) for fixed-size keys
- Minimize collisions: Different keys should rarely hash to same index
// Bad hash function - causes clustering
fn bad_hash(key: &str, size: usize) -> usize {
key.len() % size // Many words have same length!
}
// Better hash function - uses all characters
fn better_hash(key: &str, size: usize) -> usize {
let mut hash_val: u64 = 0;
for ch in key.chars() {
hash_val = hash_val.wrapping_mul(31).wrapping_add(ch as u64);
}
(hash_val as usize) % size
}
Common Hash Functions
Division Method
fn division_hash(key: usize, size: usize) -> usize {
key % size
}
// For strings, first convert to number
fn string_hash_division(s: &str, size: usize) -> usize {
let mut hash_val: u64 = 0;
for c in s.chars() {
hash_val = hash_val.wrapping_mul(31).wrapping_add(c as u64);
}
(hash_val as usize) % size
}
Why Prime Size?
Using a prime number for table size improves distribution. If size has common factors with the hash values, keys cluster at certain indices.
Multiplication Method
fn multiplication_hash(key: usize, size: usize) -> usize {
const A: f64 = 0.6180339887; // (sqrt(5) - 1) / 2
((size as f64) * (((key as f64) * A) % 1.0)) as usize
}
Polynomial Rolling Hash
Common for strings:
fn polynomial_hash(s: &str, base: u64, modulus: u64) -> u64 {
let mut hash_val: u64 = 0;
let mut base_power: u64 = 1;
for c in s.chars() {
hash_val = (hash_val + (c as u64) * base_power) % modulus;
base_power = (base_power * base) % modulus;
}
hash_val
}
// Example usage with base=31 and mod=10^9+7
fn main() {
let hash = polynomial_hash("hello", 31, 1_000_000_007);
println!("{}", hash);
}
Collision Resolution Strategies
1. Separate Chaining
Each slot contains a linked list of colliding elements:
class HashTableChaining:
def __init__(self, size=101): # Prime size
self.size = size
self.buckets = [[] for _ in range(size)]
self.count = 0
def _hash(self, key):
if isinstance(key, str):
h = 0
for c in key:
h = h * 31 + ord(c)
return h % self.size
return key % self.size
def put(self, key, value):
idx = self._hash(key)
for i, (k, v) in enumerate(self.buckets[idx]):
if k == key:
self.buckets[idx][i] = (key, value)
return
self.buckets[idx].append((key, value))
self.count += 1
self._check_load_factor()
def get(self, key):
idx = self._hash(key)
for k, v in self.buckets[idx]:
if k == key:
return v
return None
def _check_load_factor(self):
if self.count / self.size > 0.75:
self._resize()
def _resize(self):
old_buckets = self.buckets
self.size = self.size * 2 + 1 # Keep it odd (closer to prime)
self.buckets = [[] for _ in range(self.size)]
self.count = 0
for bucket in old_buckets:
for key, value in bucket:
self.put(key, value)
2. Linear Probing
On collision, check the next slot:
class HashTableLinear:
def __init__(self, size=101):
self.size = size
self.keys = [None] * size
self.values = [None] * size
self.DELETED = object() # Tombstone marker
def _hash(self, key):
return hash(key) % self.size
def put(self, key, value):
idx = self._hash(key)
start = idx
while self.keys[idx] is not None and self.keys[idx] is not self.DELETED:
if self.keys[idx] == key:
self.values[idx] = value
return
idx = (idx + 1) % self.size
if idx == start:
raise Exception("Table full")
self.keys[idx] = key
self.values[idx] = value
def get(self, key):
idx = self._hash(key)
start = idx
while self.keys[idx] is not None:
if self.keys[idx] == key:
return self.values[idx]
idx = (idx + 1) % self.size
if idx == start:
break
return None
def remove(self, key):
idx = self._hash(key)
start = idx
while self.keys[idx] is not None:
if self.keys[idx] == key:
self.keys[idx] = self.DELETED
self.values[idx] = None
return True
idx = (idx + 1) % self.size
if idx == start:
break
return False
Primary Clustering
Linear probing causes "primary clustering" where consecutive filled slots form long chains, degrading performance. Quadratic probing and double hashing reduce this.
3. Quadratic Probing
Probe at h(k) + 1², h(k) + 2², h(k) + 3², ...
fn quadratic_probe(&self, key: &K, attempt: usize) -> usize {
let base = self.hash(key);
(base + attempt * attempt) % self.size
}
4. Double Hashing
Use a second hash function to determine the probe step:
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
fn double_hash_probe<K: Hash>(&self, key: &K, attempt: usize) -> usize {
let h1 = self.hash(key);
let mut hasher = DefaultHasher::new();
key.hash(&mut hasher);
let h2 = 1 + ((hasher.finish() as usize) % (self.size - 1)); // Never 0
(h1 + attempt * h2) % self.size
}
Comparison of Strategies
| Strategy | Pros | Cons | |----------|------|------| | Chaining | Simple, handles high load | Extra memory for pointers | | Linear Probing | Cache-friendly | Primary clustering | | Quadratic Probing | Reduces clustering | May not probe all slots | | Double Hashing | Best distribution | More computation |
Load Factor Analysis
Load factor α = n / m (items / slots)
Chaining:
- Average probes for successful search: 1 + α/2
- Average probes for unsuccessful search: 1 + α
Open Addressing (Linear Probing):
- Average probes for successful: (1/2)(1 + 1/(1-α))
- Average probes for unsuccessful: (1/2)(1 + 1/(1-α)²)
Load Factor vs Expected Probes (Linear Probing):
α = 0.5: ~1.5 probes (successful), ~2.5 probes (unsuccessful)
α = 0.7: ~2.2 probes (successful), ~6.1 probes (unsuccessful)
α = 0.9: ~5.5 probes (successful), ~50 probes (unsuccessful)
Keep α < 0.7 for good performance!
Robin Hood Hashing
A clever variant of linear probing that reduces variance:
class RobinHoodHashTable:
def __init__(self, size=16):
self.size = size
self.keys = [None] * size
self.values = [None] * size
self.probe_lengths = [0] * size
self.count = 0
def _hash(self, key):
return hash(key) % self.size
def put(self, key, value):
idx = self._hash(key)
probe_len = 0
while self.keys[idx] is not None:
if self.keys[idx] == key:
self.values[idx] = value
return
# Robin Hood: steal from the rich
if probe_len > self.probe_lengths[idx]:
# Swap and continue inserting displaced element
key, self.keys[idx] = self.keys[idx], key
value, self.values[idx] = self.values[idx], value
probe_len, self.probe_lengths[idx] = self.probe_lengths[idx], probe_len
idx = (idx + 1) % self.size
probe_len += 1
self.keys[idx] = key
self.values[idx] = value
self.probe_lengths[idx] = probe_len
self.count += 1
Cryptographic vs Non-Cryptographic
| Type | Example | Properties | Use Case | |------|---------|------------|----------| | Non-crypto | MurmurHash, FNV | Fast, good distribution | Hash tables | | Cryptographic | SHA-256, MD5 | Secure, slow | Passwords, signatures |
For hash tables, use fast non-cryptographic hashes. Cryptographic hashes are overkill.
Real-World Implementation Tips
- Use prime table sizes for division method
- Keep load factor below 0.75
- Use power-of-2 sizes with good hash functions (allows bitwise AND)
- Handle deletion carefully in open addressing (use tombstones)
- Consider Robin Hood for better worst-case behavior
Key Takeaways
- Good hash functions distribute uniformly and compute quickly
- Collisions are inevitable—handle them gracefully
- Chaining is simple but uses more memory
- Open addressing is cache-friendly but needs careful deletion
- Load factor directly affects performance—resize when needed
- Double hashing provides best distribution for open addressing
Understanding collision resolution helps you choose the right hash table implementation and predict performance characteristics. Most languages use sophisticated implementations, but knowing the fundamentals helps when performance tuning or implementing custom solutions.
