When it comes to string processing, the Suffix Automaton (SAM) is often considered the most powerful data structure. It is a minimal Deterministic Finite Automaton (DFA) that recognizes all suffixes of a given string.
While Suffix Trees and Suffix Arrays are more common, SAM can be built in linear time and space and is often easier to implement for complex substring queries.
The Core Concept: End-Equivalent Classes
The SAM groups substrings into "end-equivalent" classes. Two substrings are equivalent if they appear at the same set of end positions in the original string.
- States: Each state in the SAM represents a set of substrings that share the same end positions.
- Transitions: Edges between states represent adding a character to the substrings.
- Suffix Links: Links that point to the longest suffix of a state's substrings that belongs to a different end-equivalent class.
Suffix Automaton Implementation in Java
import java.util.*;
public class SuffixAutomaton {
static class State {
int len, link;
Map<Character, Integer> next = new HashMap<>();
State(int len, int link) {
this.len = len;
this.link = link;
}
}
private List<State> st = new ArrayList<>();
private int last;
public SuffixAutomaton(String s) {
st.add(new State(0, -1));
last = 0;
for (char c : s.toCharArray()) {
extend(c);
}
}
private void extend(char c) {
int cur = st.size();
st.add(new State(st.get(last).len + 1, 0));
int p = last;
while (p != -1 && !st.get(p).next.containsKey(c)) {
st.get(p).next.put(c, cur);
p = st.get(p).link;
}
if (p == -1) {
st.get(cur).link = 0;
} else {
int q = st.get(p).next.get(c);
if (st.get(p).len + 1 == st.get(q).len) {
st.get(cur).link = q;
} else {
int clone = st.size();
State qState = st.get(q);
State cloneState = new State(st.get(p).len + 1, qState.link);
cloneState.next.putAll(qState.next);
st.add(cloneState);
while (p != -1 && st.get(p).next.get(c) == q) {
st.get(p).next.put(c, clone);
p = st.get(p).link;
}
st.get(q).link = st.get(cur).link = clone;
}
}
last = cur;
}
public boolean contains(String pattern) {
int cur = 0;
for (char c : pattern.toCharArray()) {
if (!st.get(cur).next.containsKey(c)) return false;
cur = st.get(cur).next.get(c);
}
return true;
}
}
Why use Suffix Automaton?
| Feature | Suffix Automaton | Suffix Tree | Suffix Array |
|---|---|---|---|
| Construction | $O(N)$ | $O(N)$ | $O(N \log N)$ or $O(N)$ |
| Space | $O(N \times \Sigma)$ | $O(N \times \Sigma)$ | $O(N)$ |
| Substring Search | $O(M)$ | $O(M)$ | $O(M \log N)$ |
| Implementation | Concise | Complex | Moderate |
Real-World Applications
- Bioinformatics: Efficiently finding long common substrings in genomic sequences.
- Data Compression: Used in algorithms like LZ77 to find the longest previous match.
- Plagiarism Detection: Identifying overlapping segments between large documents.
Summary
The Suffix Automaton is a masterclass in efficient string representation. By collapsing all suffixes into a minimal DFA, it provides a robust foundation for solving almost any substring-related problem in linear time. While the construction logic is subtle, its power and versatility make it an essential tool for advanced algorithmic challenges.
