2015-11-15

String Matching

Naive Algorithm


  • shift the pattern by 1 position on each mismatch (a sketch follows this list)
  • running time O(n·m)
  • n - the length of the text
  • m - the length of the pattern
  • the goal of the following 3 string-matching algorithms is to achieve O(n + m) running time (KMP guarantees it in the worst case; BMH and Rabin-Karp reach it in the typical case).
  • Obviously, if m is small, there isn't much difference among these algorithms. (It's a bit like asking when dynamic programming becomes less efficient than brute force for the knapsack problem: when O(nk) >= O(2^k).)
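
To make the baseline concrete, here is a minimal sketch of the naive scan in Python (the name naive_search is my own choice): check every alignment and slide the pattern one position to the right on a mismatch.

    def naive_search(s, p):
        """Return the start indices of all occurrences of p in s (O(n*m) worst case)."""
        n, m = len(s), len(p)
        matches = []
        for i in range(n - m + 1):      # every candidate alignment of p against s
            if s[i:i + m] == p:         # O(m) comparison per alignment
                matches.append(i)
        return matches

    # naive_search("abcabcabc", "abc") -> [0, 3, 6]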

KMP Algorithm


  • idea: we have to scan the text, which takes O(n), but can we shift by more than 1 position on each mismatch?
  • if s[i] != p[j], we already know s[i-j ... i-1] == p[0 ... j-1]; if p[0 ... j-1] has a proper prefix that is also a suffix of it (a border), we can reuse it instead of rescanning. So we pre-process the pattern to compute, for every prefix p[0 ... j-1], the length of its longest border (a sketch follows this list).
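
A minimal Python sketch of the idea above (the names build_failure and kmp_search are my own): the failure table stores, for every prefix of the pattern, the length of its longest border, and the search reuses it on each mismatch.

    def build_failure(p):
        """fail[j] = length of the longest proper prefix of p[0..j] that is also its suffix."""
        fail = [0] * len(p)
        k = 0
        for j in range(1, len(p)):
            while k > 0 and p[j] != p[k]:
                k = fail[k - 1]
            if p[j] == p[k]:
                k += 1
            fail[j] = k
        return fail

    def kmp_search(s, p):
        """Return start indices of all occurrences of p in s in O(n + m) time."""
        if not p:
            return []
        fail = build_failure(p)
        matches, j = [], 0
        for i, c in enumerate(s):
            while j > 0 and c != p[j]:
                j = fail[j - 1]              # reuse the border of p[0..j-1] instead of restarting
            if c == p[j]:
                j += 1
            if j == len(p):
                matches.append(i - len(p) + 1)
                j = fail[j - 1]              # keep scanning for overlapping occurrences
        return matches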

Boyer-Moore-Horspool Algorithm


  • compare letters from the rightmost position of the pattern to the left (instead of comparing from left to right like KMP does)
  • what is the difference when comparing from RIGHT to LEFT? 
    • let's assume s[i] != p[j] (for some j in 0 ... m-1) and s[i] does not appear in the pattern at all; then we can directly shift (j + 1) positions. 
    • e.g. if j = m - 1, we next compare s[i + m] with p[m - 1], without missing any match.
    • the case above is the best one, allowing us to shift m positions, the maximum possible; but what if s[i] does appear in the pattern? how many positions can we shift?
    • let's assume s[i] exists in the pattern, say s[i] = p[k] (the rightmost such k with k < j, since k must be on the left side of j); then we can shift (j - k) positions, which aligns s[i] with p[k].
    • from the above analysis, we check whether s[i] occurs in the pattern or not, and then decide how many positions we can shift. KMP works differently: it uses the longest border of p[0...j-1] (proper prefix that is also a suffix), aligns that prefix with the end of s[... i-1], and continues comparing s[i] with p[border-length(p[0...j-1])].
  • like KMP, we do a little pre-processing of the pattern, producing a "bad match table"; its value for a character is how many positions we can shift based on s[i], the character where comparison starts (not necessarily the mismatch point). A sketch follows this list.
    • to calculate the "bad match table"
      • possible shift = length - 1 - index (for every position except the last)
      • we can see the possible shift gets smaller when a character is duplicated (the later occurrence overwrites the earlier one). 
      • the last letter of the pattern gets a shift of length if not already defined, as does any character that does not appear in the pattern.
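
A minimal Python sketch of Boyer-Moore-Horspool as described above (the names are my own): build the bad match table, then scan each window right to left and shift by the table value of the text character at the start-comparison point.

    def build_bad_match_table(p):
        """shift[c] = len(p) - 1 - index of c; duplicates keep the smaller (later) value."""
        m = len(p)
        table = {}
        for i in range(m - 1):                   # every position except the last
            table[p[i]] = m - 1 - i
        return table                             # anything missing (incl. the last letter) defaults to m

    def bmh_search(s, p):
        """Return start indices of all occurrences of p in s."""
        n, m = len(s), len(p)
        if m == 0 or n < m:
            return []
        table = build_bad_match_table(p)
        matches, i = [], 0
        while i <= n - m:
            j = m - 1
            while j >= 0 and s[i + j] == p[j]:   # compare right to left
                j -= 1
            if j < 0:
                matches.append(i)
            i += table.get(s[i + m - 1], m)      # shift by the text char at the start-comparison point
        return matches

    # build_bad_match_table("tooth") -> {'t': 1, 'o': 2}; every other character shifts by 5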

Rabin-Karp Algorithm


  • this algorithm compares the hash value of the pattern with the hash value of each length-m substring of the text, instead of comparing the substrings character by character (when the hashes match, the substring still has to be verified to rule out collisions).
  • if we recomputed the hash from scratch for every substring, it would be no better than the naive algorithm, possibly even worse, since each hash computation takes O(m) running time.
  • the key idea is to compute the hash of each substring of the text in O(1) time, i.e. a rolling hash: it updates the previous hash value instead of recomputing it.
  • then how does it work? (a sketch follows this list)
    • let x = b^(m-1)·s[i] + b^(m-2)·s[i+1] + ... + b^0·s[i+m-1]
    • let y = b^(m-1)·s[i+1] + ... + b^0·s[i+m]
    • then y = (x - s[i]·b^(m-1))·b + s[i+m] = x·b - s[i]·b^m + s[i+m], where b = |alphabet|
    • h(y) = (h(x)·b - (s[i] % q)·(b^m % q) + s[i+m] % q) % q
    • b % q and b^m % q are constants that can be precomputed, so h(y) can be updated in O(1) time.
  • single pattern searching vs. multiple pattern searching
    • single pattern searching: KMP, BMH
    • multiple pattern searching: a rolling hash is a good choice (e.g. Leetcode - Repeated DNA Sequences)
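
A minimal Python sketch of Rabin-Karp using the rolling-hash update above (the name rabin_karp_search and the choices b = 256, q = 1_000_003 are my own; any base and large prime would do).

    def rabin_karp_search(s, p, b=256, q=1_000_003):
        """Return start indices of all occurrences of p in s, expected O(n + m) time."""
        n, m = len(s), len(p)
        if m == 0 or n < m:
            return []
        bm = pow(b, m, q)                       # b^m % q, precomputed once
        hp = hx = 0                             # hash of the pattern / of the current text window
        for k in range(m):
            hp = (hp * b + ord(p[k])) % q
            hx = (hx * b + ord(s[k])) % q
        matches = []
        for i in range(n - m + 1):
            if hx == hp and s[i:i + m] == p:    # verify on hash match to rule out collisions
                matches.append(i)
            if i + m < n:
                # h(y) = (h(x)*b - s[i]*b^m + s[i+m]) % q
                hx = (hx * b - ord(s[i]) * bm + ord(s[i + m])) % q
        return matches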
