Struct aho_corasick::AhoCorasickBuilder
source · pub struct AhoCorasickBuilder { /* private fields */ }
Expand description
A builder for configuring an Aho-Corasick automaton.
Quick advice
- Use
AhoCorasickBuilder::match_kind
to configure your searcher withMatchKind::LeftmostFirst
if you want to match how backtracking regex engines execute searches forpat1|pat2|..|patN
. UseMatchKind::LeftmostLongest
if you want to match how POSIX regex engines do it. - If you need an anchored search, use
AhoCorasickBuilder::start_kind
to set theStartKind::Anchored
mode sinceStartKind::Unanchored
is the default. Or just useStartKind::Both
to support both types of searches. - You might want to use
AhoCorasickBuilder::kind
to set your searcher to always use aAhoCorasickKind::DFA
if search speed is critical and memory usage isn’t a concern. Otherwise, not setting a kind will probably make the right choice for you. Beware that if you useStartKind::Both
to build a searcher that supports both unanchored and anchored searches and you setAhoCorasickKind::DFA
, then the DFA will essentially be duplicated to support both simultaneously. This results in very high memory usage. - For all other options, their defaults are almost certainly what you want.
Implementations§
source§impl AhoCorasickBuilder
impl AhoCorasickBuilder
sourcepub fn new() -> AhoCorasickBuilder
pub fn new() -> AhoCorasickBuilder
Create a new builder for configuring an Aho-Corasick automaton.
The builder provides a way to configure a number of things, including ASCII case insensitivity and what kind of match semantics are used.
sourcepub fn build<I, P>(&self, patterns: I) -> Result<AhoCorasick, BuildError>where
I: IntoIterator<Item = P>,
P: AsRef<[u8]>,
pub fn build<I, P>(&self, patterns: I) -> Result<AhoCorasick, BuildError>where I: IntoIterator<Item = P>, P: AsRef<[u8]>,
Build an Aho-Corasick automaton using the configuration set on this builder.
A builder may be reused to create more automatons.
Examples
Basic usage:
use aho_corasick::{AhoCorasickBuilder, PatternID};
let patterns = &["foo", "bar", "baz"];
let ac = AhoCorasickBuilder::new().build(patterns).unwrap();
assert_eq!(
Some(PatternID::must(1)),
ac.find("xxx bar xxx").map(|m| m.pattern()),
);
sourcepub fn match_kind(&mut self, kind: MatchKind) -> &mut AhoCorasickBuilder
pub fn match_kind(&mut self, kind: MatchKind) -> &mut AhoCorasickBuilder
Set the desired match semantics.
The default is MatchKind::Standard
, which corresponds to the match
semantics supported by the standard textbook description of the
Aho-Corasick algorithm. Namely, matches are reported as soon as they
are found. Moreover, this is the only way to get overlapping matches
or do stream searching.
The other kinds of match semantics that are supported are
MatchKind::LeftmostFirst
and MatchKind::LeftmostLongest
. The
former corresponds to the match you would get if you were to try to
match each pattern at each position in the haystack in the same order
that you give to the automaton. That is, it returns the leftmost match
corresponding to the earliest pattern given to the automaton. The
latter corresponds to finding the longest possible match among all
leftmost matches.
For more details on match semantics, see the documentation for
MatchKind
.
Note that setting this to MatchKind::LeftmostFirst
or
MatchKind::LeftmostLongest
will cause some search routines on
AhoCorasick
to return an error (or panic if you’re using the
infallible API). Notably, this includes stream and overlapping
searches.
Examples
In these examples, we demonstrate the differences between match
semantics for a particular set of patterns in a specific order:
b
, abc
, abcd
.
Standard semantics:
use aho_corasick::{AhoCorasick, MatchKind};
let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";
let ac = AhoCorasick::builder()
.match_kind(MatchKind::Standard) // default, not necessary
.build(patterns)
.unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("b", &haystack[mat.start()..mat.end()]);
Leftmost-first semantics:
use aho_corasick::{AhoCorasick, MatchKind};
let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";
let ac = AhoCorasick::builder()
.match_kind(MatchKind::LeftmostFirst)
.build(patterns)
.unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("abc", &haystack[mat.start()..mat.end()]);
Leftmost-longest semantics:
use aho_corasick::{AhoCorasick, MatchKind};
let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";
let ac = AhoCorasick::builder()
.match_kind(MatchKind::LeftmostLongest)
.build(patterns)
.unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("abcd", &haystack[mat.start()..mat.end()]);
sourcepub fn start_kind(&mut self, kind: StartKind) -> &mut AhoCorasickBuilder
pub fn start_kind(&mut self, kind: StartKind) -> &mut AhoCorasickBuilder
Sets the starting state configuration for the automaton.
Every Aho-Corasick automaton is capable of having two start states: one that is used for unanchored searches and one that is used for anchored searches. Some automatons, like the NFAs, support this with almost zero additional cost. Other automatons, like the DFA, require two copies of the underlying transition table to support both simultaneously.
Because there may be an added non-trivial cost to supporting both, it is possible to configure which starting state configuration is needed.
Indeed, since anchored searches tend to be somewhat more rare,
only unanchored searches are supported by default. Thus,
StartKind::Unanchored
is the default.
Note that when this is set to StartKind::Unanchored
, then
running an anchored search will result in an error (or a panic
if using the infallible APIs). Similarly, when this is set to
StartKind::Anchored
, then running an unanchored search will
result in an error (or a panic if using the infallible APIs). When
StartKind::Both
is used, then both unanchored and anchored searches
are always supported.
Also note that even if an AhoCorasick
searcher is using an NFA
internally (which always supports both unanchored and anchored
searches), an error will still be reported for a search that isn’t
supported by the configuration set via this method. This means,
for example, that an error is never dependent on which internal
implementation of Aho-Corasick is used.
Example: anchored search
This shows how to build a searcher that only supports anchored searches:
use aho_corasick::{
AhoCorasick, Anchored, Input, Match, MatchKind, StartKind,
};
let ac = AhoCorasick::builder()
.match_kind(MatchKind::LeftmostFirst)
.start_kind(StartKind::Anchored)
.build(&["b", "abc", "abcd"])
.unwrap();
// An unanchored search is not supported! An error here is guaranteed
// given the configuration above regardless of which kind of
// Aho-Corasick implementation ends up being used internally.
let input = Input::new("foo abcd").anchored(Anchored::No);
assert!(ac.try_find(input).is_err());
let input = Input::new("foo abcd").anchored(Anchored::Yes);
assert_eq!(None, ac.try_find(input)?);
let input = Input::new("abcd").anchored(Anchored::Yes);
assert_eq!(Some(Match::must(1, 0..3)), ac.try_find(input)?);
Example: unanchored and anchored searches
This shows how to build a searcher that supports both unanchored and anchored searches:
use aho_corasick::{
AhoCorasick, Anchored, Input, Match, MatchKind, StartKind,
};
let ac = AhoCorasick::builder()
.match_kind(MatchKind::LeftmostFirst)
.start_kind(StartKind::Both)
.build(&["b", "abc", "abcd"])
.unwrap();
let input = Input::new("foo abcd").anchored(Anchored::No);
assert_eq!(Some(Match::must(1, 4..7)), ac.try_find(input)?);
let input = Input::new("foo abcd").anchored(Anchored::Yes);
assert_eq!(None, ac.try_find(input)?);
let input = Input::new("abcd").anchored(Anchored::Yes);
assert_eq!(Some(Match::must(1, 0..3)), ac.try_find(input)?);
sourcepub fn ascii_case_insensitive(&mut self, yes: bool) -> &mut AhoCorasickBuilder
pub fn ascii_case_insensitive(&mut self, yes: bool) -> &mut AhoCorasickBuilder
Enable ASCII-aware case insensitive matching.
When this option is enabled, searching will be performed without
respect to case for ASCII letters (a-z
and A-Z
) only.
Enabling this option does not change the search algorithm, but it may increase the size of the automaton.
NOTE: It is unlikely that support for Unicode case folding will
be added in the future. The ASCII case works via a simple hack to the
underlying automaton, but full Unicode handling requires a fair bit of
sophistication. If you do need Unicode handling, you might consider
using the regex
crate or the lower level
regex-automata
crate.
Examples
Basic usage:
use aho_corasick::AhoCorasick;
let patterns = &["FOO", "bAr", "BaZ"];
let haystack = "foo bar baz";
let ac = AhoCorasick::builder()
.ascii_case_insensitive(true)
.build(patterns)
.unwrap();
assert_eq!(3, ac.find_iter(haystack).count());
sourcepub fn kind(&mut self, kind: Option<AhoCorasickKind>) -> &mut AhoCorasickBuilder
pub fn kind(&mut self, kind: Option<AhoCorasickKind>) -> &mut AhoCorasickBuilder
Choose the type of underlying automaton to use.
Currently, there are four choices:
AhoCorasickKind::NoncontiguousNFA
instructs the searcher to use anoncontiguous::NFA
. A noncontiguous NFA is the fastest to be built, has moderate memory usage and is typically the slowest to execute a search.AhoCorasickKind::ContiguousNFA
instructs the searcher to use acontiguous::NFA
. A contiguous NFA is a little slower to build than a noncontiguous NFA, has excellent memory usage and is typically a little slower than a DFA for a search.AhoCorasickKind::DFA
instructs the searcher to use adfa::DFA
. A DFA is very slow to build, uses exorbitant amounts of memory, but will typically execute searches the fastest.None
(the default) instructs the searcher to choose the “best” Aho-Corasick implementation. This choice is typically based primarily on the number of patterns.
Setting this configuration does not change the time complexity for
constructing the Aho-Corasick automaton (which is O(p)
where p
is the total number of patterns being compiled). Setting this to
AhoCorasickKind::DFA
does however reduce the time complexity of
non-overlapping searches from O(n + p)
to O(n)
, where n
is the
length of the haystack.
In general, you should probably stick to the default unless you have
some kind of reason to use a specific Aho-Corasick implementation. For
example, you might choose AhoCorasickKind::DFA
if you don’t care
about memory usage and want the fastest possible search times.
Setting this guarantees that the searcher returned uses the chosen
implementation. If that implementation could not be constructed, then
an error will be returned. In contrast, when None
is used, it is
possible for it to attempt to construct, for example, a contiguous
NFA and have it fail. In which case, it will fall back to using a
noncontiguous NFA.
If None
is given, then one may use AhoCorasick::kind
to determine
which Aho-Corasick implementation was chosen.
Note that the heuristics used for choosing which AhoCorasickKind
may be changed in a semver compatible release.
sourcepub fn prefilter(&mut self, yes: bool) -> &mut AhoCorasickBuilder
pub fn prefilter(&mut self, yes: bool) -> &mut AhoCorasickBuilder
Enable heuristic prefilter optimizations.
When enabled, searching will attempt to quickly skip to match candidates using specialized literal search routines. A prefilter cannot always be used, and is generally treated as a heuristic. It can be useful to disable this if the prefilter is observed to be sub-optimal for a particular workload.
Currently, prefilters are typically only active when building searchers with a small (less than 100) number of patterns.
This is enabled by default.
sourcepub fn dense_depth(&mut self, depth: usize) -> &mut AhoCorasickBuilder
pub fn dense_depth(&mut self, depth: usize) -> &mut AhoCorasickBuilder
Set the limit on how many states use a dense representation for their transitions. Other states will generally use a sparse representation.
A dense representation uses more memory but is generally faster, since the next transition in a dense representation can be computed in a constant number of instructions. A sparse representation uses less memory but is generally slower, since the next transition in a sparse representation requires executing a variable number of instructions.
This setting is only used when an Aho-Corasick implementation is used that supports the dense versus sparse representation trade off. Not all do.
This limit is expressed in terms of the depth of a state, i.e., the number of transitions from the starting state of the automaton. The idea is that most of the time searching will be spent near the starting state of the automaton, so states near the start state should use a dense representation. States further away from the start state would then use a sparse representation.
By default, this is set to a low but non-zero number. Setting this to
0
is almost never what you want, since it is likely to make searches
very slow due to the start state itself being forced to use a sparse
representation. However, it is unlikely that increasing this number
will help things much, since the most active states have a small depth.
More to the point, the memory usage increases superlinearly as this
number increases.
sourcepub fn byte_classes(&mut self, yes: bool) -> &mut AhoCorasickBuilder
pub fn byte_classes(&mut self, yes: bool) -> &mut AhoCorasickBuilder
A debug settting for whether to attempt to shrink the size of the automaton’s alphabet or not.
This option is enabled by default and should never be disabled unless one is debugging the underlying automaton.
When enabled, some (but not all) Aho-Corasick automatons will use a map from all possible bytes to their corresponding equivalence class. Each equivalence class represents a set of bytes that does not discriminate between a match and a non-match in the automaton.
The advantage of this map is that the size of the transition table can
be reduced drastically from #states * 256 * sizeof(u32)
to
#states * k * sizeof(u32)
where k
is the number of equivalence
classes (rounded up to the nearest power of 2). As a result, total
space usage can decrease substantially. Moreover, since a smaller
alphabet is used, automaton compilation becomes faster as well.
WARNING: This is only useful for debugging automatons. Disabling this does not yield any speed advantages. Namely, even when this is disabled, a byte class map is still used while searching. The only difference is that every byte will be forced into its own distinct equivalence class. This is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.
Trait Implementations§
source§impl Clone for AhoCorasickBuilder
impl Clone for AhoCorasickBuilder
source§fn clone(&self) -> AhoCorasickBuilder
fn clone(&self) -> AhoCorasickBuilder
1.0.0 · source§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source
. Read more