Data Structures and Algorithms

String Matching Algorithms

Strings are commonly implemented as arrays of characters.

Definition of the pattern matching problem:

Given a string s of length n and a pattern string p of length m (usually m <= n), p is a substring of s (or p is in s) if there exists an index i with 0 <= i <= n-m such that s[i+j] = p[j] for all j from 0 to m-1.

Usually all symbols in string s and pattern string p are from an alphabet Σ. Σ is usually finite, but not necessarily so. |Σ| denotes the size of the alphabet. Some string matching algorithms need to assume that |Σ| is a finite constant number.

Brute Force Algorithm:

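int BFSearch(const string & s, const string & p)
{
    int lens = s.length();
    int lenp = p.length();
    // Try every alignment i of p against s, from left to right
    for (int i = 0; i <= lens-lenp; i++) {
        int j = 0;
        while (j < lenp && s[i+j] == p[j])
            j++;
        if (j == lenp) return i;   // all lenp characters matched
    }
    return -1;                     // no alignment matched
}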

Correct and easy to understand
Works even with a potentially unbounded and unknown alphabet Σ
Run-time efficiency: best case Θ(m), worst case Θ(n*m)
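
The two bounds above can be observed by counting character comparisons. The following is a minimal sketch, not part of the original notes: the function name bfSearchCounted and its output parameter comparisons are hypothetical additions for illustration.

```cpp
#include <cassert>
#include <string>

// Same scan as BFSearch, but also counts character comparisons.
// 'comparisons' is an output parameter added only for this illustration.
int bfSearchCounted(const std::string& s, const std::string& p, long& comparisons) {
    const int lens = static_cast<int>(s.length());
    const int lenp = static_cast<int>(p.length());
    comparisons = 0;
    for (int i = 0; i <= lens - lenp; i++) {
        int j = 0;
        while (j < lenp) {
            comparisons++;
            if (s[i + j] != p[j]) break;   // mismatch: give up on this alignment
            j++;
        }
        if (j == lenp) return i;
    }
    return -1;
}
```

With s consisting of ten 'a' characters and p = "aab", each of the n-m+1 = 8 alignments fails only on its third comparison, which gives 8*3 = 24 comparisons. With s = "aabaaaaaaa" the pattern is found at index 0 after m = 3 comparisons.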

BM (Boyer-Moore) Algorithm:

Heuristics: don't throw away the information gathered in a mismatch.

Caveat: the algorithm assumes a fixed alphabet Σ of finite size.

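#include <algorithm>   // min

void calculateLast(int last[], int size, const string & p)
{
    // last[c] = index of the last occurrence of c in p, or -1 if c is not in p
    for (int i = 0; i < size; i++)
        last[i] = -1;
    for (int i = 0; i < (int)p.length(); i++)
        last[(unsigned char)p[i]] = i;
}
int BMSearch(const string & s, const string & p)
{
    const int size = 256;   // size of the alphabet (8-bit characters assumed)
    int lens = s.length();
    int lenp = p.length();
    if (lenp == 0) return 0;       // the empty pattern matches at index 0
    if (lenp > lens) return -1;    // the pattern cannot fit in the text
    int last[size];
    calculateLast(last, size, p);
    int i = lenp - 1, j = lenp - 1;
    do {
        if (s[i] == p[j]) {
            if (j == 0)
                return i;
            else {
                i--; j--;
            }
        } else {
            // realign p using the last occurrence of s[i] in p
            i = i + lenp - min(j, last[(unsigned char)s[i]] + 1);
            j = lenp - 1;
        }
    } while (i < lens);
    return -1;
}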

Pre-process the pattern string to use the mismatch information
Run-time efficiency: best case Θ(n/m), worst case Θ(n*m)
But the worst case rarely occurs when matching patterns in English text.
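
To make the preprocessing step concrete, the sketch below builds the last-occurrence table for a pattern and evaluates the jump formula used in BMSearch. It assumes single-byte characters (an alphabet of size 256); the names lastOccurrence and jump are hypothetical and do not appear in the notes.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>

// last[c] = index of the rightmost occurrence of c in p, or -1 if c is not in p.
// Assumes an alphabet of 256 single-byte characters.
std::array<int, 256> lastOccurrence(const std::string& p) {
    std::array<int, 256> last;
    last.fill(-1);
    for (std::size_t i = 0; i < p.length(); i++)
        last[static_cast<unsigned char>(p[i])] = static_cast<int>(i);
    return last;
}

// How far the text index i moves after a mismatch at pattern index j
// against text character c: lenp - min(j, last[c] + 1), as in BMSearch.
int jump(const std::array<int, 256>& last, int lenp, int j, char c) {
    return lenp - std::min(j, last[static_cast<unsigned char>(c)] + 1);
}
```

For p = "abacab" the table holds last['a'] = 4, last['b'] = 5, last['c'] = 3, and -1 for every other character. A mismatch at j = 5 against a character that does not occur in p moves i by the full pattern length 6, which is where the Θ(n/m) best case comes from.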

KMP (Knuth-Morris-Pratt) Algorithm:

Heuristics: When a mismatch happens, we have already done a lot of the matching comparisons. Don't throw away the information gathered in the matching comparisons.

The failure function preprocesses the pattern string p to compute the proper shift, so that previously performed (matching) comparisons can be reused.

// f[j] is the length of the longest prefix of p
// that is a suffix of p[1..j], f[0] = 0
// Bootstrapping Method
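void failure(int f[], const string & p)
{
    int lenp = p.length();
    if (lenp == 0) return;      // nothing to compute for an empty pattern
    f[0] = 0;
    int i = 1, j = 0;
    while (i < lenp) {
        if (p[j] == p[i]) {
            // j+1 characters of the prefix match a suffix ending at i
            f[i] = j+1;
            i++; j++;
        } else if (j > 0) {
            // fall back to the next shorter candidate prefix
            j = f[j-1];
        } else {
            f[i] = 0;
            i++;
        }
    }
}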
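
As a worked example of the bootstrapping computation above, here is a vector-returning variant that can be checked against small patterns. The name failureTable is hypothetical; the loop body is the same as in failure.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Vector-returning variant of the bootstrapping computation above.
// f[i] = length of the longest proper prefix of p[0..i] that is also a suffix of it.
std::vector<int> failureTable(const std::string& p) {
    const int lenp = static_cast<int>(p.length());
    std::vector<int> f(lenp, 0);
    int i = 1, j = 0;
    while (i < lenp) {
        if (p[j] == p[i]) { f[i] = j + 1; i++; j++; }
        else if (j > 0)   { j = f[j - 1]; }
        else              { f[i] = 0; i++; }
    }
    return f;
}
```

For p = "abacab" this yields f = 0 0 1 0 1 2, and for p = "aaaa" it yields f = 0 1 2 3.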

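#include <vector>

int KMPSearch(const string & s, const string & p)
{
    int lens = s.length();
    int lenp = p.length();
    if (lenp == 0) return 0;    // the empty pattern matches at index 0
    vector<int> f(lenp);        // replaces the non-standard variable-length array
    failure(&f[0], p);
    int i = 0, j = 0;
    while (i < lens) {
        if (s[i] == p[j]) {
            if (j == lenp - 1)
                return i-lenp+1;
            else {
                i++; j++;
            }
        } else if (j > 0) {
            j = f[j-1];         // reuse the j characters already matched
        } else {
            i++;
        }
    }
    return -1;
}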

The alphabet can be unknown or even infinite.
Pre-process the pattern string to use the matching information.
Run-time efficiency: Θ(n+m) in the worst case, including the computation of the failure function
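
The linear bound can be checked empirically by counting comparisons: the text index i never moves backward, so the search loop performs at most 2n comparisons. The sketch below is an illustration only; kmpSearchCounted and its output parameter comparisons are hypothetical names, and the failure function is inlined to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <vector>

// KMP search that also reports how many text/pattern comparisons it made.
// 'comparisons' is an output parameter added only for this illustration.
int kmpSearchCounted(const std::string& s, const std::string& p, long& comparisons) {
    const int lens = static_cast<int>(s.length());
    const int lenp = static_cast<int>(p.length());
    comparisons = 0;
    if (lenp == 0) return 0;

    // Failure function, computed as in the notes.
    std::vector<int> f(lenp, 0);
    for (int i = 1, j = 0; i < lenp; ) {
        if (p[j] == p[i]) { f[i] = j + 1; i++; j++; }
        else if (j > 0)   { j = f[j - 1]; }
        else              { f[i] = 0; i++; }
    }

    int i = 0, j = 0;
    while (i < lens) {
        comparisons++;
        if (s[i] == p[j]) {
            if (j == lenp - 1) return i - lenp + 1;
            i++; j++;
        } else if (j > 0) {
            j = f[j - 1];
        } else {
            i++;
        }
    }
    return -1;
}
```

On a text of ten 'a' characters with p = "aab", the counter stays below 2n = 20, whereas trying all 8 alignments with 3 comparisons each would cost 24.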

Combined KMP and BM Algorithm:

Heuristics: use both the matching comparison information and the mismatched character in the text string.

When a mismatch happens at position k of the pattern string, we know the character (c) in the text string in the mismatched position, and we know that we have matches at all the positions before k.

Based on this information, we can calculate the skip distance skip[k, c] for a mismatch at position k against text character c: it is the smallest d in the range 0 <= d <= k such that p[k-d] = c and p[j] = p[j+d] for 0 <= j <= k-d-1, or k+1 if no such d exists.

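int KMP_BMSearch(const string & s, const string & p)
{
    const int size = 256;   // size of the alphabet (8-bit characters assumed)
    int lens = s.length();
    int lenp = p.length();
    if (lenp == 0) return 0;
    // skip[k][c] as defined above; skipArray (not shown) fills the table
    vector<vector<int> > skip(lenp, vector<int>(size));
    skipArray(skip, p);
    int i = 0, j = 0;
    while (i <= lens - lenp) {
        if (s[i+j] == p[j]) {
            if (j == lenp - 1)      // last pattern character matched
                return i;
            else
                j++;
        } else {
            int d = skip[j][(unsigned char)s[i+j]];
            i = i + d;
            j = j + 1 - d;
        }
    }
    return -1;
}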

The alphabet must be fixed and finite.
Pre-process the pattern string to use the matching and mismatching information.
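Run-time efficiency: the search loop is O(n), because i+j grows by exactly one with every comparison; preprocessing must additionally fill a skip table of m*|Σ| entries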
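
The notes do not show how the skip table is filled. One possible construction, given here only as a sketch under stated assumptions, applies the definition of skip[k, c] literally by trying d = 0, 1, ..., k. The names buildSkip, skipSearch and kAlphabet are hypothetical, single-byte characters are assumed, and the naive construction costs O(m^3 * |Σ|) time, so it illustrates the definition rather than an efficient preprocessing step.

```cpp
#include <cassert>
#include <string>
#include <vector>

const int kAlphabet = 256;  // assumed alphabet: single-byte characters

// Builds skip[k][c] straight from the definition by trying d = 0, 1, ..., k.
std::vector<std::vector<int>> buildSkip(const std::string& p) {
    const int m = static_cast<int>(p.length());
    std::vector<std::vector<int>> skip(m, std::vector<int>(kAlphabet));
    for (int k = 0; k < m; k++) {
        for (int c = 0; c < kAlphabet; c++) {
            int d = 0;
            for (; d <= k; d++) {
                // p[k-d] must equal c, and p[0..k-d-1] must equal p[d..k-1]
                bool ok = static_cast<unsigned char>(p[k - d]) == c;
                for (int j = 0; ok && j <= k - d - 1; j++)
                    ok = (p[j] == p[j + d]);
                if (ok) break;
            }
            skip[k][c] = d;  // d == k+1 when no d in 0..k qualifies
        }
    }
    return skip;
}

// Search loop in the style of KMP_BMSearch, using the table built above.
int skipSearch(const std::string& s, const std::string& p) {
    const int lens = static_cast<int>(s.length());
    const int lenp = static_cast<int>(p.length());
    if (lenp == 0) return 0;
    const std::vector<std::vector<int>> skip = buildSkip(p);
    int i = 0, j = 0;
    while (i <= lens - lenp) {
        if (s[i + j] == p[j]) {
            if (j == lenp - 1) return i;
            j++;
        } else {
            const int d = skip[j][static_cast<unsigned char>(s[i + j])];
            i += d;
            j = j + 1 - d;
        }
    }
    return -1;
}
```

For p = "abacab", a mismatch at k = 5 against 'a' gives skip 5 (only the first pattern character can be aligned with that 'a'), while a mismatch at k = 5 against 'c' gives skip 6 (no alignment works, so the pattern moves past the mismatched character).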

KR (Karp-Rabin) Algorithm:

Heuristics: if two strings' hash codes are different, the two strings can't be identical.

A practically useful hash function H must be efficiently computable and highly discriminating, and H(s[j+1..j+m]) must be easily computable from H(s[j..j+m-1]), s[j], and s[j+m].
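
The third requirement is what makes the hash "rolling". For a polynomial hash with base x, the hash of the next window is obtained in constant time by removing the contribution of the outgoing character and appending the incoming one. The sketch below checks this identity; polyHash and roll are hypothetical names, and arithmetic is assumed to be modulo 2^32 through unsigned wraparound.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

const std::uint32_t X = 33;  // base of the polynomial hash

// Polynomial hash of t[start..end], computed modulo 2^32 by unsigned wraparound.
std::uint32_t polyHash(const std::string& t, std::size_t start, std::size_t end) {
    std::uint32_t h = 0;
    for (std::size_t i = start; i <= end; i++)
        h = h * X + static_cast<unsigned char>(t[i]);
    return h;
}

// Given h = H(s[j..j+m-1]) and xPow = X^(m-1), returns H(s[j+1..j+m]) in O(1):
// subtract the outgoing character's term, shift by X, add the incoming character.
std::uint32_t roll(std::uint32_t h, std::uint32_t xPow, char outgoing, char incoming) {
    return (h - static_cast<unsigned char>(outgoing) * xPow) * X
           + static_cast<unsigned char>(incoming);
}
```

For any window of length m, roll(polyHash(s, j, j+m-1), X^(m-1), s[j], s[j+m]) equals polyHash(s, j+1, j+m), so sliding the window costs O(1) instead of O(m).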

This algorithm is especially useful when there is more than one pattern string, e.g., when catching plagiarism.

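const unsigned int x = 33;
// Polynomial hash of t[start..end]; unsigned arithmetic wraps around on overflow.
// Named hashCode because the name hash clashes with std::hash.
unsigned int hashCode(const string & t, int start, int end)
{
    unsigned int ht = 0;
    for (int i = start; i <= end; i++)
        ht = ht * x + (unsigned char)t[i];
    return ht;
}
int KRSearch(const string & s, const string & p)
{
    int lenp = p.length();
    int lens = s.length();
    if (lenp > lens) return -1;
    unsigned int hp = hashCode(p, 0, lenp-1);
    unsigned int hs = hashCode(s, 0, lenp-1);
    unsigned int d = 1;              // d = x^(lenp-1), with the same wraparound as the hash
    for (int k = 1; k < lenp; k++)
        d = d * x;
    for (int i = 0; i <= lens-lenp; i++) {
        // equal hashes only suggest a match, so confirm it character by character
        if (hp == hs && s.compare(i, lenp, p) == 0)
            return i;
        if (i < lens-lenp)           // roll the hash: drop s[i], append s[i+lenp]
            hs = (hs - (unsigned char)s[i]*d) * x + (unsigned char)s[i+lenp];
    }
    return -1;
}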
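
The multi-pattern use mentioned earlier can be sketched as follows: hash every pattern once, then compare each window hash of the text against the whole set with a single table lookup. This is an illustration under stated assumptions, not a method given in the notes: all patterns are assumed to have the same length m, and the name krSearchAny is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns (text index, pattern index) of the first window of s that equals
// one of the patterns, or (-1, -1). Assumes all patterns have the same length.
std::pair<int, int> krSearchAny(const std::string& s,
                                const std::vector<std::string>& patterns) {
    const std::uint32_t x = 33;
    if (patterns.empty()) return {-1, -1};
    const std::size_t m = patterns[0].length();
    if (m == 0 || m > s.length()) return {-1, -1};

    // Polynomial hash of the first m characters of t (modulo 2^32).
    auto hashOf = [&](const std::string& t) {
        std::uint32_t h = 0;
        for (std::size_t i = 0; i < m; i++)
            h = h * x + static_cast<unsigned char>(t[i]);
        return h;
    };

    // Map each pattern hash to the indices of the patterns that have it.
    std::unordered_multimap<std::uint32_t, int> byHash;
    for (std::size_t k = 0; k < patterns.size(); k++) {
        assert(patterns[k].length() == m);  // equal-length assumption
        byHash.emplace(hashOf(patterns[k]), static_cast<int>(k));
    }

    std::uint32_t d = 1;                    // d = x^(m-1)
    for (std::size_t k = 1; k < m; k++) d *= x;

    std::uint32_t hs = hashOf(s);
    for (std::size_t i = 0; i + m <= s.length(); i++) {
        auto range = byHash.equal_range(hs);
        for (auto it = range.first; it != range.second; ++it)
            if (s.compare(i, m, patterns[it->second]) == 0)  // confirm the hit
                return {static_cast<int>(i), it->second};
        if (i + m < s.length())             // roll the window by one character
            hs = (hs - static_cast<unsigned char>(s[i]) * d) * x
                 + static_cast<unsigned char>(s[i + m]);
    }
    return {-1, -1};
}
```

Each text window costs one expected O(1) lookup regardless of how many patterns there are; the character-by-character comparison runs only when a hash matches.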