Boyer-Moore string search

Povilas Balciunas

2017-08-10 10:05

Boyer-Moore is a string search algorithm. The first time I heard about it was on FreeBSD mailing list when someone was explaining "why GNU grep is fast". [2] Although, it was invented in 1977 [3], turns out this algorithm is considered one of the most efficient and very widely used even nowadays. [1]

So I thought I have to at least take a brief overview of this classic algorithm.

Overview

Similarly to naive string search. the algorithm scans text from left to right searching for the given pattern. Although, the pattern is matched from right to left:

Text:    X X X A B C X X X D B C
Pattern:       D B C - - -> pattern shift
              <- - - pattern matching

What makes Boyer-Moore algorithm faster than naive string search is the heuristics it uses to shift pattern right. Rather than shifting one symbol at a time, this algorithm uses some heuristics to skip multiple symbols. We can optionally choose which heuristics to use. In order to go fastest, we should use them all.

Each heuristic calculates how many positions we can skip. Then we choose the maximum number of skips and repeat the matching.

At first glance the algorithm looks pretty easy. But it has multiple cases which might make it a little cofusing.

Bad character heuristic

It states that when text does not match the pattern at a given position, we can shift the pattern right until it does. [4]

   0 1 2 3 4 5 6 7
T: X X X A B C A C
P:     B C A C
       0 1 2 3

In this case the algorithm starts matching the pattern at text position 5 and pattern position 3. The mismatch occurs at text_pos 4 and pattern_pos 2. The mismatched symbol is 'B'. 'B' exists in the pattern_pos 0. Thus we can skip 2 symbols: 2 - 0 = 2:

   0 1 2 3 4 5 6 7
T: X X X A B C A C
P:         B C A C
           0 1 2 3

In case the pattern does not have the mismatched symbol, we can move it past that symbol:

   0 1 2 3 4 5 6 7 8 9
T: X X X A B C A C D C
P:     D C A C

Mismatched symbol is B which does not exist in the pattern. Thus the pattern is shifted to:

   0 1 2 3 4 5 6 7 8 9
T: X X X A B C A C D C
P:           D C A C

Preprocessing

When mismatch happens we could search for a matching symbol traversing the pattern backwards. That would take O(n) comparisons. Instead we can achieve O(1) with a lookup table. This table is constructed once for the given pattern and it holds the last position for a given character. E.g.:

pattern = 'abca'
table['a'] = 3
table['b'] = 1
table['c'] = 2

def preproc_bad_character(pattern: str) -> list:
    last_char_pos = [-1] * 256
    for i, c in enumerate(pattern):
        last_char_pos[ord(c)] = i
    return last_char_pos

Good suffix heuristic

This heuristic is nicely explained in this video.

It heuristic has 3 cases:

the matching pattern suffix exists in another pattern place:
```
   0 1 2 3 4 5 6 7 8 9
T: X X X A B C A C D C
P: A B C D B C
```
BC matches the text and another BC is in the pattern at position 1. Thus we could shift the pattern by 3 positions right:
```
   0 1 2 3 4 5 6 7 8 9
T: X X X A B C A C D C
P:       A B C D B C
```

the matching pattern suffix is also the prefix of a pattern:

   0 1 2 3 4 5 6 7 8 9
T: X X X A B C A C D C
P: B C D A B C

In this case we can safely shift the pattern by 4 positions:

   0 1 2 3 4 5 6 7 8 9
T: X X X A B C A C D C
P:         B C D A B C

When none of the above cases are satisfied the pattern is shifted past the matched text part:

   0 1 2 3 4 5 6 7 8 9
T: X X X A B C A C D C
P:     D A B C

   0 1 2 3 4 5 6 7 8 9
T: X X X A B C A C D C
P:             D A B C

This heuristic also has a preprocessing step to construct a lookup table. The algorithm is a bit more complicated and better explained in [6] [5].

References