mtdowling

Favor Hash Lookups Over Array Searches

A common programming requirement is to match a string against a set of known strings. For example, let’s say you were iterating over the words in a forum post and testing to see if a word is in a list of prohibited words. A common approach to this problem is to create an array of the known prohibited words and then use PHP’s in_array() function to test if the string is found in the list. However, there’s a simple optimization you can make to significantly improve the performance of the algorithm.

Searching an Array

<?php

$words = get_all_words_in_text($text);
$badWords = ['$$$$', '@#$%', 'crud' /** ... */ ];

foreach ($words as $word) {
    if (in_array(strtolower($word), $badWords)) {
        echo 'Found bad word: ' . $word . "\n";
    }
}

This approach solves the problem, but it’s inefficient. As we iterate over each word, we then test each word against each word in the bad word list using in_array(). PHP’s in_array() function runs in linear time, meaning that as the size of the bad words array increases, the amount of time it takes to search the array proportionally increases. We can do much better than this.

Hash Lookups

Because the list of bad words is known up front, we can easily restructure our program to run in constant time by utilizing a PHP associative array. Like most hash tables, a key lookup on a PHP associative array is O(1) (except when a collision occurs which isn’t a concern for our use case).

If we restructure our PHP array to become an associative array where the words we are looking for are the keys and each value is true, then we can use PHP’s isset() function and greatly speed things up:

<?php

$words = get_all_words_in_text($text);
$badWords = [
    '$$$$' => true,
    '@#$%' => true,
    'crud' => true
    // ...
];

foreach ($words as $word) {
    if (isset($badWords[strtolower($word)])) {
        echo 'Found bad word: ' . $word . "\n";
    }
}

Benchmark

Let’s put this to the test. I’ve written up a simple test to show the amount of time it takes to run each of the above implementations using some canned data.

<?php

$total = 10000;
$paragraph = 'this is a sentence. Crud! $$$$!';
$words = explode(' ', $paragraph);
$badWordList = ['$$$$', '@#$%', 'crud', 'fud', 'fudd', 'dud'];

$s = microtime(true);
for ($j = 0; $j < $total; $j++) {
    foreach ($words as $word) {
        in_array(strtolower($word), $badWordList);
    }
}
echo "in_array: " . (microtime(true) - $s) . "\n";

$badWordHash = [
    '$$$$' => true,
    '@#$%' => true,
    'crud' => true,
    'fud'  => true,
    'fudd' => true,
    'dud'  => true
];

$s = microtime(true);
for ($j = 0; $j < $total; $j++) {
    foreach ($words as $word) {
        isset($badWordHash[strtolower($word)]);
    }
}

echo "hash:     " . (microtime(true) - $s) . "\n";

Here are the results of running the perf script 10,000 times:

in_array: 0.033491134643555
hash:     0.0069370269775391

As you can see, converting the algorithm to use a hash lookup is 480% faster than using in_array(). And that was using a small list of words.

It’s important to understand that as the size of the array of prohibited words grows, so does the amount of time it takes to call in_array(). Conversely, the amount of time taken to run the isset() implementation will remain at a constant time. Let me show you what I mean. The following example uses a list of 10,000 words in our prohibited word list, and the results are quite different.

<?php

$total = 10000;
$paragraph = 'this is a sentence. Crud! $$$$!';
$words = explode(' ', $paragraph);

// Create a bunch of letter entries
$sequence = [];
for ($j = 0; $j < 10000; $j++) {
    $sequence[] = 'a' . $j;
}

$s = microtime(true);
for ($j = 0; $j < $total; $j++) {
    foreach ($words as $word) {
        in_array(strtolower($word), $sequence);
    }
}
echo "in_array: " . (microtime(true) - $s) . "\n";

// Convert the array to a hash
$hash = array_fill_keys($sequence, true);
$s = microtime(true);
for ($j = 0; $j < $total; $j++) {
    foreach ($words as $word) {
        isset($hash[strtolower($word)]);
    }
}

echo "hash:     " . (microtime(true) - $s) . "\n";

The results for this test are astronomically different between the in_array and hash implementation. The hash implementation is a whopping 3,162% faster!

in_array: 20.464313983917
hash:     0.0064699649810791

This Isn’t Anything New

This isn’t a new idea. It’s a pretty common idiom in various languages. I was recently reminded of the fact that I constantly use hash lookups (no pun intended) for this sort of thing when I was reading “Programming in Lua”.

Next time you’re using PHP’s in_array() to test a string against a known list of strings, see if it’s at all possible to instead use a hash and isset(). It’ll make your programs faster.

Comments