BOYER-MOORE SUBSTRING SEARCH
In the previous article we have talked about brute-force substring search algorithm. We came to the conclusion that it is working fine, but it is too slow. So here is a better approach: the Boyer-Moore substring search algorithm. What was the problem with brute-force search? We keep considering too many bad options as well. We should find a way to reduce the number of unnecessary checking (whether two characters match or not) in order to end up with a better running time.
In 1977 Robert S. Boyer and J. Strother Moore constructed a substring search approach. Unlike other string searching algorithms though, Boyer-Moore compares the pattern against a possible match from right to left. Why is it good? We can skip multiple characters at the same time rather than considering every single character in the text. We have to preprocess the text and construct the so-called bad match table.
The bad match table contains the characters in the pattern and an integer accordingly. We could use a hashtable as an underlying data structre. When a mismatch occurs we have to shift the pattern to the right corresponding to the value in the table. Why is it good? Unlike brute-force search, we can skip several characters and ultimately the algorithm will be faster.
BAD MATCH TABLE
So we know why is it good to use a bad match table. But how to construct the table itself? We have to make sure the table does not contain repetitive characters. If there is several ‘a’ letters in the pattern, the table only contains one ‘a’ letter.
max( 1 ; lengthOfPattern – actualCharacterIndex – 1 )
This is the formula we have to use. We iterate over the pattern and compute the values for the bad match table. We keep updating the old values for the same characters.
Let’s see a concrete example to get a good grasp of Boyer-Moore substring search algorithm. We have a text ‘This is a test’ and of course a pattern we are looking for: ‘test’.
First, we have to construct the bad match table. We have to consider the characters in the pattern and have to use the formula we have discussed. If we have repetitive characters (such as character ‘t’) we have to update the table accordingly.
So we know how many characters to shift if there is a mismatch. The * means any other character. As we have discussed, we start comparing characters from right to left. If there is a mismatch we look up the given character in the table. In this case we look up ‘s’: the value is 1. So we shift the pattern 1 step to the right.
Again, there is a mismatch. So we look up character ‘a’ in the table. Character ‘a’ is not present in the table: so we have to shift as many characters to the right as the length of pattern (basically this is the * case).
Again, there is a mismatch. We use the lookup table to now how many steps to shift.
There is a match. So we keep checking whether the other characters are matching or not. We go from right to left as you can see. The algorithm ends when we have considered every single character in the pattern.
Let’s see the source code. First we have to compute the bad match table values. We can use the formula discussed above:
After setting up the table, we can focus on the algorithm implementation. As we have discussed earlier, we have to compare the characters from right to left. Thats why the inner for loop starts with the lengthOfPattern. If there is a mismatch: we have two options. If the given character is present in the table: we know how many characters to shift the pattern. Anyways (this is the * case) we shift as many characters as the lengthOfPattern. This is how Boyer-Moore algorithm works:
We have seen the running time complexity for brute-force search. It has O(N*M) running time. We are able to reduce this running time to O(N+M) with Boyer-Moore substring search approach, which is quite favourable.