
The Search Problem


Needle in a haystack

Needle in a haystack's easy - just bring a magnet.
~ Alec "The Hacker" Hardison, Leverage: The Zoo Job

Searching has been a problem since long before the advent of computers, and it has more often than not required us to adapt to the specifics of the problem we are trying to solve. In other words, there has never been a single, generally optimal solution to this class of problems. That has carried over to the world of computers: the optimal search strategy is not always apparent and is almost never the same across different use-cases.

The One Search Algorithm

There’s a folklore in my country about the great King Akbar and his wise advisor Birbal. So it was that from time to time Akbar would try and test the intellectual prowess of Birbal by presenting him with seemingly impossible tasks. On one such occasion, Akbar gave one of his precious emerald rings to a trusted court minister and asked him to hide it somewhere no one could find it. Soon afterwards, he sent for Birbal to appear before him. When Birbal arrived, Akbar exclaimed, “Birbal! I seem to have lost my priceless jewel that was gifted to me by my father! You must find out who stole it! You must.” Hearing this, Birbal calmly asked, “Do you remember keeping it somewhere or giving it to anyone?” Akbar, trying hard to ensure Birbal didn’t catch on to the scheme, replied, “No, why would I give it to someone? And I never keep it anywhere but on my person at all times! You must find out who stole it!” Birbal thought for a minute, then realized the ring had only been stolen a while ago, so the thief could still be within the castle’s premises. He asked the chief of guards, “Has anyone left the castle yet since morning?” to which the chief replied, “No, my lord”. Birbal said, “Lock the gates and let no one leave or enter”. Then he calmly shifted his gaze to the King’s feet and asked with a mournful face - “My king, I’ll try my best to catch the thief. But for now, we must organize a feast for everyone within this castle right now. A death feast to mourn the loss of your father’s last gift to you, since it feels as if he has died a second time today.” The King, curious to see where Birbal was leading with this, replied “Very well, let there be a feast by lunch”. And so it was, everyone within the castle walls was invited into the dining hall where they sat in a circle around a large wooden table with the King and Birbal’s chairs at one end. Several dozen bowls of food were brought in and everyone began to eat.
Halfway through the feast and at a moment’s glance, Birbal stood up hastily from his chair, pointed his arm out to seemingly nowhere in particular and screamed, “Thief! Thief! Catch him! He has a bit of straw in his beard!” Now usually a man would not fall into such an obvious trap. But it was noon already with seemingly no one on his tail, he was enjoying a glorious buffet of food, and for a moment the “appointed thief” (the minister from earlier) had let his guard down. Startled, he raised his palm and searched through his beard frantically for any straw. By the time he realized what was happening, it was too late. Birbal bowed to the king as he closed with, “My king, your thief.”

I guess the point I’m trying to make is that when it comes to search algorithms, there is no one-size-fits-all. No algorithm is universally applicable, and even when multiple algorithms apply, there is always a trade-off involved in selecting one over another. So we have to be just as creative in finding an algorithm that fits our requirements as Birbal was in catching the ring thief. Let’s take a look at some examples that illustrate this.

Under the microscope

Every weekday morning when I need my socks, I have to go through a dozen pieces of clothing to get them. And every time, I wonder how much easier this would have been if only I had sorted my clothes the day before. But the fact of life is we are not always prepared, and we don’t always have the time to prepare. When that happens, a simple algorithm of rifling through everything we have, one item at a time, is the best we can do. In programming, Linear Search is the baseline algorithm we fall back on when all else fails. It’s basic, but it’s effective, and it works in all cases.

def linear_search(arr: list[int], key: int) -> int:
    # Check every element in order: O(n) comparisons in the worst case.
    N = len(arr)
    for i in range(N):
        if arr[i] == key:
            return i
    return -1

But what if we do have the time to sort our data beforehand? Binary Search allows us to utilize the order of our sorted data in our favour to search for what we want faster than we can do with linear search.

def binary_search(arr: list[int], key: int) -> int:
    N = len(arr)
    start = 0
    end = N - 1
    while start <= end:
        # Midpoint written as start + (end - start) // 2 to avoid
        # overflow in languages with fixed-width integers.
        current = start + (end - start) // 2
        if key < arr[current]:
            end = current - 1
        elif key > arr[current]:
            start = current + 1
        else:
            return current
    return -1

So sorted data always means binary search, right? Well, not exactly. Suppose we don’t have a fixed length of data but an infinite amount of it. As you can see, Binary Search requires both a lower bound and an upper bound to be defined on the data we are working with. For an infinitely long sequence there is no defined upper bound; what would we even use as the mid-point without one? We could resort to Linear Search and it would work. But let’s see if we can’t figure out a way to modify Binary Search for this problem.

The only problem with applying the Binary Search algorithm to an infinite sorted sequence is that we cannot define an upper bound. So all we have to do is find one. Crucially, we don’t need a tight upper bound, since Binary Search works on any range as long as it is sorted. So instead of searching for an upper bound by incrementing the candidate index by 1 each time (at which point we might as well do Linear Search), we increase it exponentially until the element at that index is at least as large as our search key. When we get there, we have a defined upper bound, over which we can apply the ordinary binary search algorithm.

Now the question is how we can define an infinite sequence to test our theory. What we need is a non-trivial (to make it interesting, of course) increasing sequence of numbers. As it turns out, the Fibonacci numbers serve our purpose very well. And since we can very easily calculate the n-th Fibonacci number in O(1) time (see the article on Fast Fibonacci for more details), they are an even better candidate for our experiment.

from typing import TypeVar, Generic
import math

T = TypeVar('T')

class Stream(Generic[T]):
    def at(self, index: int) -> T:
        raise NotImplementedError

class FiboStream(Stream[int]):
    def at(self, index: int) -> int:
        # Closed-form (Binet's) formula; index 0 yields the first
        # Fibonacci number, 1. Floating-point precision limits this
        # to roughly the first 70 Fibonacci numbers.
        sqrt5 = math.sqrt(5)
        phi = (1 + sqrt5) / 2
        psi = (1 - sqrt5) / 2
        return math.floor((phi ** (index + 1) - psi ** (index + 1)) / sqrt5)
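As a quick sanity check on the closed form (an illustrative snippet, not part of the stream code above), we can compare it against a plain iterative computation. Here `round` is used instead of `floor`, purely as a defensive choice against floating-point error nudging the value just below the true integer:

```python
import math

def fib_closed_form(index: int) -> int:
    # Binet's formula for the (index + 1)-th Fibonacci number:
    # index 0 -> 1, 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 5, ...
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    psi = (1 - sqrt5) / 2
    return round((phi ** (index + 1) - psi ** (index + 1)) / sqrt5)

def fib_iterative(index: int) -> int:
    # Ground truth: simple O(index) iteration.
    a, b = 1, 1
    for _ in range(index):
        a, b = b, a + b
    return a

# The two agree comfortably within double-precision range.
assert all(fib_closed_form(i) == fib_iterative(i) for i in range(40))
```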

We implement the Unbounded Binary Search algorithm in two steps -

  1. Starting from a certain index we exponentially increase the index until we find the index of an element larger than the search key (let’s call this index ubound_index).
  2. We perform Binary Search on the range (0, ubound_index) to find the location of the search key.

def binary_search(
    stream: Stream[int],
    start: int,
    end: int,
    key: int
) -> int:
    while start <= end:
        current = start + (end - start) // 2
        if key < stream.at(current):
            end = current - 1
        elif key > stream.at(current):
            start = current + 1
        else:
            return current
    return -1

def find_ubound_index(stream: Stream[int], key: int) -> int:
    # Start at 1 so that doubling makes progress (doubling 0 would
    # loop forever); index 0 is still covered by the final binary
    # search over (0, ubound_index).
    ubound_index = 1
    while stream.at(ubound_index) < key:
        ubound_index *= 2
    return ubound_index

def unbounded_binary_search(stream: Stream[int], key: int) -> int:
    ubound_index = find_ubound_index(stream, key)
    return binary_search(stream, 0, ubound_index, key)
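Putting the pieces together, here is a self-contained run of the same two-step scheme. An iterative Fibonacci function stands in for the stream to sidestep floating-point concerns, and the helpers take a plain callable rather than a `Stream`; otherwise the shapes mirror the code above:

```python
from typing import Callable

def fib_at(index: int) -> int:
    # Iterative Fibonacci: index 0 -> 1, 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 5, ...
    a, b = 1, 1
    for _ in range(index):
        a, b = b, a + b
    return a

def find_ubound_index(at: Callable[[int], int], key: int) -> int:
    # Step 1: double the index until the element there is >= key.
    ubound_index = 1
    while at(ubound_index) < key:
        ubound_index *= 2
    return ubound_index

def binary_search(at: Callable[[int], int], start: int, end: int, key: int) -> int:
    # Step 2: ordinary binary search over the now-bounded range.
    while start <= end:
        current = start + (end - start) // 2
        value = at(current)
        if key < value:
            end = current - 1
        elif key > value:
            start = current + 1
        else:
            return current
    return -1

def unbounded_binary_search(at: Callable[[int], int], key: int) -> int:
    return binary_search(at, 0, find_ubound_index(at, key), key)

print(unbounded_binary_search(fib_at, 21))  # 21 sits at index 7 of this sequence
```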

Another name for Unbounded Binary Search is the Exponential Search algorithm. But all this raises the question - given the option, is sorting in ascending or descending order always the best idea? Well, as it turns out, there are other ways of arranging data that prove more useful in certain scenarios.

Trees

In our earlier experiments, we discussed how sorting the data in ascending or descending order can help us search faster. But we glossed over a freedom we had in both of the cases we explored - we could jump to any index within the bounds of our search range at any point in time, without any additional overhead. This is a valid premise when dealing with structures such as arrays, which occupy a fixed, contiguous block of memory, so an element’s address can be calculated just by adding its index to the base address. Surprisingly enough, that’s not always the case.

Let us take the example of our file system. As many readers know, a file cannot be bound to a fixed length of memory; if that were the case, we could only have files of the same fixed size, and we clearly have files of varying sizes on our computers. So of course, the computer is not using an array to store a file’s contents. What it uses instead is heap memory (as opposed to stack memory), which is abundant, flexible and expandable. One thing it is not, however, is easily traversable. To draw a comparison: if you were to store the same sorted numbers in such linked, heap-allocated storage, you could not tell from the i-th number’s position where the (i+5)-th number is stored. You’d have to traverse it one element at a time (i.e., go to i+1, then i+2, and so on). Therefore, sorting alone is not of much help here. Is there another way to store these files that could help us search faster? Yes there is, and it’s called a Tree; more specifically, a Binary Search Tree (BST).
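The pointer-chasing nature of such storage can be sketched with a minimal BST where each node knows only its own children, mirroring heap-allocated records that cannot be indexed like an array:

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: int
    left: Optional[Node] = None
    right: Optional[Node] = None

def insert(root: Optional[Node], value: int) -> Node:
    # Smaller values go left, larger (or equal) go right.
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def search(root: Optional[Node], key: int) -> bool:
    # Each comparison follows exactly one pointer, so a balanced tree
    # of n nodes needs only about log2(n) traversals.
    while root is not None:
        if key == root.value:
            return True
        root = root.left if key < root.value else root.right
    return False

root = None
for v in [4, 2, 6, 1, 3, 5, 7]:
    root = insert(root, v)
print(search(root, 5))  # True
```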

As an example to illustrate how search works in a BST, let us take the numbers 1 through 7 (instead of files) stored in heap memory. If they were stored in sorted ascending order, they’d look something like this -

As you can see, performing Binary Search on this to find 5 -

  1. Would take us first to 4, which requires 3 traversals.
  2. Then it takes us to 6, which requires another 2 traversals.
  3. Finally, it takes us to 5, which requires 1 traversal in the opposite direction.

So the search algorithm requires a total of 6 traversals to find 5.

Now let’s arrange the numbers within the heap itself, but as a Binary Search Tree instead of a sorted array. Constructed from these elements, the Binary Search Tree looks like this -

For comparison, we can also flatten this representation into an array structure as follows - starting from the root element, 4, at index 0, we put the left child of the element at index i at position 2i+1 and its right child at position 2i+2.
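The flattening rule just described can be sketched directly, here with the tree written as nested `(value, left, right)` tuples:

```python
def flatten(tree, index=0, out=None):
    # tree is a nested (value, left, right) tuple, or None.
    # A node at index i places its left child at 2*i + 1 and its
    # right child at 2*i + 2; the root lands at index 0.
    if out is None:
        out = {}
    if tree is not None:
        value, left, right = tree
        out[index] = value
        flatten(left, 2 * index + 1, out)
        flatten(right, 2 * index + 2, out)
    return out

bst = (4, (2, (1, None, None), (3, None, None)),
          (6, (5, None, None), (7, None, None)))
print([flatten(bst)[i] for i in range(7)])  # [4, 2, 6, 1, 3, 5, 7]
```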

However, we must note that in heap memory the data can be, and is, stored in the former (pointer-based) structure, which makes it easy to traverse from a parent to its child nodes.

This time, our search begins from the root which is 4.

  1. We check and see that 5 is greater than 4. So we do 1 traversal down the right node to find 6.
  2. Since 5 is less than 6 we do 1 traversal down the left node to find 5.

And that takes a total of 2 traversals. As we can extrapolate from this small example, for the massive number of files in a typical computer, this small change in how we structure our data makes searching significantly more efficient. So when our data elements are of fixed size, arrays do fine, but when our data is dynamic in size, we consider using trees to make searching easier. Great!

Now, we started with random arrays, and then we learnt that sorting an array makes it easier for us to search. More recently we learnt that our data cannot always be represented as an array, in which case a tree performs better. But does that mean trees are always the best way to represent data in heap memory? Nope.

More Tries

Let us consider another real-world use-case - autocomplete. You must have seen how search engines such as Google can predict your next word or letter based on what you have typed so far into the search bar. How do they do that today? Short answer: machine learning and complex algorithms far beyond the scope of this article. But even with our current understanding, we can try to come up with our own autocomplete solution. Let’s consider an example where we have a dictionary of words and we want to predict the word the user is typing based on what they have typed so far. We can create a Binary Search Tree of all the letters in all the words and then try to predict the closest matching valid word. For the sake of our sanity, let’s consider a small 5-word dictionary in which the words are -

We can construct the Binary Search Tree for the letters in these words by using the index of each letter in the alphabet. When we cannot accommodate a word in any existing binary tree, we create a new Binary Search Tree and track its root node.

So, after adding the first 4 words, we have these 2 binary trees -

Now when we try to add the last word “Gain”, we see there’s no way to accommodate it in the existing tree starting with node G. So we have to create a new Binary Search Tree -

Now when the user types in “Ga”, we first search in the first tree and find “Game” as the shortest matching word. In the second tree we stop at the root node “D” itself. In the third tree we find “Gain” as the shortest matching word. Since both “Game” and “Gain” are of the same length, we output the first one, i.e., “Game”, as the autocomplete suggestion. This manner of structuring our data works fine for such a small dictionary, but the English dictionary is massive (with well over 170,000 words). Maintaining so many trees is impractical and also makes search much slower. Can we find a way to compress the data into a more memory-efficient structure? Yes we can, by using Tries.

With Tries, our data is structured in memory as follows -

Right off the bat, we notice that our structure is now more tightly packed, since it uses a combination of arrays and pointers to better represent the data. It also loses no performance in searching. In fact, it’s actually better: we need not start from the top when the user types a new letter into our input. We can simply resume downwards from the last node visited.
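A minimal trie can be sketched as follows. Each node maps a letter to a child node and records whether a word ends there; a breadth-first walk below the typed prefix finds the shortest completion first. The sample words are hypothetical stand-ins for the article’s dictionary, and ties are broken alphabetically here rather than by insertion order:

```python
from collections import deque
from typing import Optional

class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.is_word = False

def insert(root: TrieNode, word: str) -> None:
    # Walk/create one node per letter; mark the last node as a word end.
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    node.is_word = True

def autocomplete(root: TrieNode, prefix: str) -> Optional[str]:
    # First walk down to the node for the typed prefix...
    node = root
    for letter in prefix:
        if letter not in node.children:
            return None
        node = node.children[letter]
    # ...then breadth-first search below it, so the shortest
    # completion is found first (ties broken alphabetically).
    queue = deque([(node, prefix)])
    while queue:
        node, word = queue.popleft()
        if node.is_word:
            return word
        for letter, child in sorted(node.children.items()):
            queue.append((child, word + letter))
    return None

root = TrieNode()
for w in ["game", "gain", "dog"]:  # hypothetical sample words
    insert(root, w)
print(autocomplete(root, "ga"))  # gain
```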

Conclusion

We have seen how the Binary Search algorithm is better for sorted data, while Linear Search is the best we can do on random or unsorted data. We have evolved our humble Binary Search to work on unbounded sequences. We have evaluated the Binary Search Tree as a replacement for the Binary Search algorithm to reduce traversals when the data is dynamically stored in heap-like memory. Finally, we have seen how Tries give us a much more memory-efficient structure for prefix search over large datasets. After all these endeavours, I think we can finally agree that whatever the different challenges of today’s search engineering are, one thing is almost certain: there’s no one search algorithm that can solve them all. So we have to keep inventing and we have to keep improving. The search is always on!
