# Struct fst::raw::Fst

pub struct Fst { /* fields omitted */ }

An acyclic deterministic finite state transducer.

# How does it work?

The short answer: it's just like a prefix trie, which compresses keys based only on their prefixes, except that an automaton/transducer also compresses suffixes.

The longer answer is that keys in an automaton are stored only in the transitions from one state to another. A key can be acquired by tracing a path from the root of the automaton to any match state. The inputs along each transition are concatenated. Once a match state is reached, the concatenation of inputs up until that point corresponds to a single key.

But why is it called a transducer instead of an automaton? A finite state transducer is just like a finite state automaton, except that it has output transitions in addition to input transitions. Namely, the value associated with any particular key is determined by summing the outputs along every input transition that leads to the key's corresponding match state.

This is best demonstrated with a couple of images. First, let's ignore the "transducer" aspect and focus on a plain automaton.

Consider that your keys are abbreviations of some of the months in the Gregorian calendar:

jan feb mar apr may jun jul

The corresponding automaton that stores all of these as keys looks like this:

Notice here how the prefix and suffix of `jan` and `jun` are shared. Similarly, the prefixes of `jun` and `jul` are shared and the prefixes of `mar` and `may` are shared.

All of the keys from this automaton can be enumerated in lexicographic order by following every transition from each node in lexicographic order. Since it is acyclic, the procedure will terminate.
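The enumeration procedure can be sketched with a hand-built table for the month keys. This is only an illustration: the `State` alias, the state numbering, and the helper names are assumptions made for this example and are not part of this crate. Every key ends in the same final state (state 10), which is the suffix sharing described above.

```rust
/// One state: (is_match, transitions sorted by input byte).
/// Each transition is (input byte, target state id).
type State = (bool, Vec<(u8, usize)>);

/// Hand-built automaton for: apr feb jan jul jun mar may.
/// The numbering is arbitrary; state 0 is the root.
fn month_automaton() -> Vec<State> {
    vec![
        /* 0: root  */ (false, vec![(b'a', 1), (b'f', 2), (b'j', 3), (b'm', 4)]),
        /* 1: a     */ (false, vec![(b'p', 5)]),
        /* 2: f     */ (false, vec![(b'e', 6)]),
        /* 3: j     */ (false, vec![(b'a', 7), (b'u', 8)]),
        /* 4: m     */ (false, vec![(b'a', 9)]),
        /* 5: ap    */ (false, vec![(b'r', 10)]),
        /* 6: fe    */ (false, vec![(b'b', 10)]),
        /* 7: ja    */ (false, vec![(b'n', 10)]),
        /* 8: ju    */ (false, vec![(b'l', 10), (b'n', 10)]),
        /* 9: ma    */ (false, vec![(b'r', 10), (b'y', 10)]),
        /* 10: final */ (true, vec![]),
    ]
}

/// Depth-first walk that follows transitions in byte order, so keys
/// are emitted in lexicographic order. Terminates because the graph
/// is acyclic.
fn collect_keys(states: &[State], id: usize, prefix: &mut Vec<u8>, out: &mut Vec<String>) {
    let (is_match, transitions) = &states[id];
    if *is_match {
        out.push(String::from_utf8_lossy(prefix).into_owned());
    }
    for &(byte, next) in transitions {
        prefix.push(byte);
        collect_keys(states, next, prefix, out);
        prefix.pop();
    }
}

/// Enumerates every key, starting from the root state.
fn all_keys(states: &[State]) -> Vec<String> {
    let mut out = Vec::new();
    collect_keys(states, 0, &mut Vec::new(), &mut out);
    out
}
```

Calling `all_keys(&month_automaton())` yields the seven keys in sorted order without ever storing a key as a whole string in the table.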

A key can be found by tracing it through the transitions in the automaton. For example, the key `dec` is known not to be in the automaton by only visiting the root state (because there is no `d` transition). For another example, the key `jax` is known not to be in the set only after moving through the transitions for `j` and `a`. Namely, after those transitions are followed, there are no transitions for `x`.

Notice here that looking up a key takes time proportional to the length of the key itself. Namely, lookup time is not affected by the number of keys in the automaton!
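A minimal sketch of this lookup, using only the standard library. The `Automaton` type and its `from_keys` and `trace` methods are hypothetical names for this example, not this crate's API, and the structure is a plain trie (prefix sharing only) rather than the compressed representation used by `Fst`. `trace` reports how many input bytes were consumed before the walk stopped.

```rust
use std::collections::BTreeMap;

/// One state: outgoing transitions keyed by input byte, plus a match flag.
#[derive(Default)]
struct State {
    transitions: BTreeMap<u8, usize>,
    is_match: bool,
}

/// Toy automaton (illustrative only); state 0 is the root.
struct Automaton {
    states: Vec<State>,
}

impl Automaton {
    /// Builds a plain trie by inserting each key byte by byte.
    fn from_keys(keys: &[&str]) -> Automaton {
        let mut states = vec![State::default()];
        for key in keys {
            let mut cur = 0;
            for &b in key.as_bytes() {
                let next = match states[cur].transitions.get(&b).copied() {
                    Some(id) => id,
                    None => {
                        // No transition yet: allocate a fresh state.
                        states.push(State::default());
                        let id = states.len() - 1;
                        states[cur].transitions.insert(b, id);
                        id
                    }
                };
                cur = next;
            }
            states[cur].is_match = true;
        }
        Automaton { states }
    }

    /// Follows one transition per input byte. Returns the number of
    /// bytes consumed and whether the walk ended in a match state.
    fn trace(&self, key: &str) -> (usize, bool) {
        let mut cur = 0;
        for (i, b) in key.bytes().enumerate() {
            match self.states[cur].transitions.get(&b) {
                Some(&next) => cur = next,
                None => return (i, false),
            }
        }
        (key.len(), self.states[cur].is_match)
    }
}
```

With the month keys loaded, `trace("jax")` stops after consuming 2 bytes, and a key whose first byte has no transition from the root, such as `dec`, stops after 0. In every case the loop runs at most once per byte of the key, regardless of how many keys were inserted.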

Additionally, notice that the automaton exploits the fact that many keys share common prefixes and suffixes. For example, `jun` and `jul` are represented with no more states than would be required to represent either one on its own. Instead, the only change is a single extra transition. This is a form of compression and is key to how the automatons produced by this crate are so small.

Let's move on to finite state transducers. Consider the same set of keys as above, but let's assign their numeric month values:

jan,1 feb,2 mar,3 apr,4 may,5 jun,6 jul,7

The corresponding transducer looks very similar to the automaton above, except outputs have been added to some of the transitions:

All of the operations with a transducer are the same as described above for automatons. Additionally, the same compression techniques are used: common prefixes and suffixes in keys are exploited.

The key difference is that some transitions have been given an output. As one follows input transitions, one must sum the outputs as they are seen. (A transition with no output represents the additive identity, or `0` in this case.) For example, when looking up `feb`, the transition `f` has output `2`, the transition `e` has output `0`, and the transition `b` also has output `0`. The sum of these is `2`, which is exactly the value we associated with `feb`.

For another more interesting example, consider `jul`. The `j` transition has output `1`, the `u` transition has output `5` and the `l` transition has output `1`. Summing these together gets us `7`, which is again the correct value associated with `jul`. Notice that if we instead looked up the `jun` key, then the `n` transition, which has no output, would be followed instead of the `l` transition. Therefore, the `jun` key equals `1+5+0=6`.
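The summation can be sketched with a hand-built table, again using only the standard library. The outputs on the `f`, `e`, `b`, `j`, `u`, `l` and `n` transitions follow the description above; the outputs on the remaining transitions, the state numbering, and the names `month_transducer` and `lookup` are assumptions chosen so that every key sums to its month number.

```rust
/// A transition: (input byte, output, target state id).
type Transition = (u8, u64, usize);
/// One state: (is_match, transitions).
type State = (bool, Vec<Transition>);

/// Hand-built transducer for jan,1 feb,2 mar,3 apr,4 may,5 jun,6 jul,7.
/// State 0 is the root; state 10 is the shared final state.
fn month_transducer() -> Vec<State> {
    vec![
        /* 0: root  */ (false, vec![(b'a', 4, 1), (b'f', 2, 2), (b'j', 1, 3), (b'm', 3, 4)]),
        /* 1: a     */ (false, vec![(b'p', 0, 5)]),
        /* 2: f     */ (false, vec![(b'e', 0, 6)]),
        /* 3: j     */ (false, vec![(b'a', 0, 7), (b'u', 5, 8)]),
        /* 4: m     */ (false, vec![(b'a', 0, 9)]),
        /* 5: ap    */ (false, vec![(b'r', 0, 10)]),
        /* 6: fe    */ (false, vec![(b'b', 0, 10)]),
        /* 7: ja    */ (false, vec![(b'n', 0, 10)]),
        /* 8: ju    */ (false, vec![(b'l', 1, 10), (b'n', 0, 10)]),
        /* 9: ma    */ (false, vec![(b'r', 0, 10), (b'y', 2, 10)]),
        /* 10: final */ (true, vec![]),
    ]
}

/// Follows the transitions for `key`, summing outputs along the way.
/// Returns `None` if a transition is missing or the walk does not end
/// in a match state.
fn lookup(states: &[State], key: &[u8]) -> Option<u64> {
    let mut id = 0;
    let mut sum = 0u64;
    for &b in key {
        let &(_, out, next) = states[id].1.iter().find(|t| t.0 == b)?;
        sum += out;
        id = next;
    }
    if states[id].0 { Some(sum) } else { None }
}
```

Here `lookup(&month_transducer(), b"jul")` sums `1 + 5 + 1` and `b"jun"` sums `1 + 5 + 0`; the two keys share everything except their last transition, including the partial value `6` accumulated on the shared prefix.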

The trick to transducers is that there exists a unique path through the transducer for every key, and its outputs are stored appropriately along this path such that the correct value is returned when they are all summed together. This process also enables the data that makes up each value to be shared across many values in the transducer in exactly the same way that keys are shared. This is yet another form of compression!

# Bonus: a billion strings

The amount of compression one can get from automata can be absolutely ridiculous. Consider the particular case of storing all billion strings in the range `0000000001-1000000000`, e.g.,

0000000001 0000000002 ... 0000000100 0000000101 ... 0999999999 1000000000

The corresponding automaton looks like this:

Indeed, the on disk size of this automaton is a mere **251 bytes**.

Of course, this is a bit of a pathological best case, but it does serve to show how good compression can be in the optimal case.

Also, check out the
corresponding transducer
that maps each string to its integer value. It's a bit bigger, but still
only takes up **896 bytes** of space on disk. This demonstrates that
output values are also compressible.

# Does this crate produce minimal transducers?

For any non-trivial sized set of keys, it is unlikely that this crate will produce a minimal transducer. As far as this author knows, guaranteeing a minimal transducer requires working memory proportional to the number of states. This can be quite costly and is anathema to the main design goal of this crate: provide the ability to work with gigantic sets of strings with constant memory overhead.

Instead, construction of a finite state transducer uses a cache of states. More frequently used states are cached and reused, which provides reasonably good compression ratios. (No comprehensive benchmarks exist to back up this claim.)
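To make the trade-off concrete, here is a rough sketch of a bounded registry of compiled states. It is not this crate's implementation: the `Registry` type, its `compile` method, and the "stop remembering once full" policy are assumptions made purely for illustration, and the real cache and its eviction policy may differ. The point is only that a state found in the cache is reused, while a state that could not be remembered is compiled again, which is how a non-minimal transducer can arise under a fixed memory budget.

```rust
use std::collections::HashMap;

/// A finished state: match flag plus (input byte, target id) transitions.
#[derive(Clone, PartialEq, Eq, Hash)]
struct State {
    is_match: bool,
    transitions: Vec<(u8, usize)>,
}

/// Hypothetical registry that remembers at most `capacity` states.
struct Registry {
    capacity: usize,
    seen: HashMap<State, usize>,
    compiled: Vec<State>,
}

impl Registry {
    fn new(capacity: usize) -> Registry {
        Registry { capacity, seen: HashMap::new(), compiled: Vec::new() }
    }

    /// Returns the id of an equivalent remembered state if there is one,
    /// otherwise compiles `state` as a new state and returns its id.
    fn compile(&mut self, state: State) -> usize {
        if let Some(&id) = self.seen.get(&state) {
            return id; // reuse: equivalent states are shared
        }
        let id = self.compiled.len();
        self.compiled.push(state.clone());
        // Remember the state only while there is room. Once the cache
        // is full, later duplicates are compiled again.
        if self.seen.len() < self.capacity {
            self.seen.insert(state, id);
        }
        id
    }
}
```

With a capacity of 1, compiling the same final state twice yields the same id, but a second distinct state compiled twice yields two different ids: memory stays bounded and the result is larger than the minimal one.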

It is possible that this crate may expose a way to guarantee minimal construction of transducers at the expense of exorbitant memory requirements.

# Bibliography

I initially got the idea to use finite state transducers to represent ordered sets/maps from Michael McCandless' work on incorporating transducers in Lucene.

However, my work would also not have been possible without the hard work of many academics, especially Jan Daciuk.

- Incremental construction of minimal acyclic finite-state automata (Section 3 provides a decent overview of the algorithm used to construct transducers in this crate, assuming all outputs are `0`.)
- Direct Construction of Minimal Acyclic Subsequential Transducers (The whole thing. The proof is dense but illuminating. The algorithm at the end is the money shot, namely, it incorporates output values.)
- Experiments with Automata Compression, Smaller Representation of Finite State Automata (various compression techniques for representing states/transitions)
- Jan Daciuk's dissertation (excellent for in depth overview)
- Comparison of Construction Algorithms for Minimal, Acyclic, Deterministic, Finite-State Automata from Sets of Strings (excellent for surface level overview)

## Methods

`impl Fst`


`fn from_path<P: AsRef<Path>>(path: P) -> Result<Self>`

Opens a transducer stored at the given file path via a memory map.

The fst must have been written with a compatible finite state transducer builder (`Builder` qualifies). If the format is invalid or if there is a mismatch between the API version of this library and the fst, then an error is returned.

`fn from_mmap(mmap: MmapReadOnly) -> Result<Self>`

Opens a transducer from a `MmapReadOnly`.

This is useful if a transducer is serialized to only a part of a file. A `MmapReadOnly` lets one control which region of the file is used for the transducer.

`fn from_bytes(bytes: Vec<u8>) -> Result<Self>`

Creates a transducer from its representation as a raw byte sequence.

Note that this operation is very cheap (no allocations and no copies).

The fst must have been written with a compatible finite state transducer builder (`Builder` qualifies). If the format is invalid or if there is a mismatch between the API version of this library and the fst, then an error is returned.

`fn from_static_slice(bytes: &'static [u8]) -> Result<Self>`

Creates a transducer from its representation as a raw byte sequence.

This accepts a static byte slice, which may be useful if the Fst is embedded into source code.

Creates a transducer from a shared vector at the given offset and length.

This permits creating multiple transducers from a single region of owned memory.

`fn get<B: AsRef<[u8]>>(&self, key: B) -> Option<Output>`

Retrieves the value associated with a key.

If the key does not exist, then `None` is returned.

`fn contains_key<B: AsRef<[u8]>>(&self, key: B) -> bool`

Returns true if and only if the given key is in this FST.

`fn stream(&self) -> Stream`

Return a lexicographically ordered stream of all key-value pairs in this fst.

`fn range(&self) -> StreamBuilder`

Return a builder for range queries.

A range query returns a subset of key-value pairs in this fst in a range given in lexicographic order.

`fn search<A: Automaton>(&self, aut: A) -> StreamBuilder<A>`

Executes an automaton on the keys of this map.

`fn len(&self) -> usize`

Returns the number of keys in this fst.

`fn is_empty(&self) -> bool`

Returns true if and only if this fst has no keys.

`fn size(&self) -> usize`

Returns the number of bytes used by this fst.

`fn op(&self) -> OpBuilder`

Creates a new fst operation with this fst added to it.

The `OpBuilder` type can be used to add additional fst streams and perform set operations like union, intersection, difference and symmetric difference on the keys of the fst. These set operations also allow one to specify how conflicting values are merged in the stream.
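As an illustration of what a union with value merging means, the sketch below merges two lexicographically sorted key-value slices. It does not use `OpBuilder` or any other type from this crate; the function `union_with` and its `merge` parameter are hypothetical names, and the inputs are assumed to be sorted with no duplicate keys.

```rust
use std::cmp::Ordering;

/// Merges two sorted (key, value) slices into one sorted vector.
/// When a key occurs in both inputs, `merge` decides the resulting value.
fn union_with<F>(a: &[(&str, u64)], b: &[(&str, u64)], merge: F) -> Vec<(String, u64)>
where
    F: Fn(u64, u64) -> u64,
{
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(b[j].0) {
            Ordering::Less => {
                out.push((a[i].0.to_string(), a[i].1));
                i += 1;
            }
            Ordering::Greater => {
                out.push((b[j].0.to_string(), b[j].1));
                j += 1;
            }
            Ordering::Equal => {
                // Conflict: the same key carries a value in both inputs.
                out.push((a[i].0.to_string(), merge(a[i].1, b[j].1)));
                i += 1;
                j += 1;
            }
        }
    }
    // At most one of the inputs still has elements left.
    for &(k, v) in &a[i..] {
        out.push((k.to_string(), v));
    }
    for &(k, v) in &b[j..] {
        out.push((k.to_string(), v));
    }
    out
}
```

For example, merging `[("apr", 4), ("jan", 1)]` with `[("feb", 2), ("jan", 10)]` using `|x, y| x + y` keeps `apr` and `feb` unchanged and maps `jan` to `11`. Because both inputs are consumed in order, the result is produced in a single pass and stays sorted.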

`fn is_disjoint<'f, I, S>(&self, stream: I) -> bool where I: for<'a> IntoStreamer<'a, Into=S, Item=(&'a [u8], Output)>, S: 'f + for<'a> Streamer<'a, Item=(&'a [u8], Output)>`

Returns true if and only if the `self` fst is disjoint with the fst `stream`.

`stream` must be a lexicographically ordered sequence of byte strings with associated values.

`fn is_subset<'f, I, S>(&self, stream: I) -> bool where I: for<'a> IntoStreamer<'a, Into=S, Item=(&'a [u8], Output)>, S: 'f + for<'a> Streamer<'a, Item=(&'a [u8], Output)>`

Returns true if and only if the `self` fst is a subset of the fst `stream`.

`stream` must be a lexicographically ordered sequence of byte strings with associated values.

`fn is_superset<'f, I, S>(&self, stream: I) -> bool where I: for<'a> IntoStreamer<'a, Into=S, Item=(&'a [u8], Output)>, S: 'f + for<'a> Streamer<'a, Item=(&'a [u8], Output)>`

Returns true if and only if the `self` fst is a superset of the fst `stream`.

`stream` must be a lexicographically ordered sequence of byte strings with associated values.
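The ordering requirement on `stream` is what lets these three checks run in a single pass. The sketch below shows the idea on plain sorted slices of keys; it is not how this crate implements the methods, the free functions are hypothetical stand-ins, and both inputs are assumed to be strictly increasing (sorted, no duplicates).

```rust
use std::cmp::Ordering;

/// True if the two sorted key sequences have no key in common.
fn is_disjoint(a: &[&str], b: &[&str]) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => return false,
        }
    }
    true
}

/// True if every key in sorted `a` also appears in sorted `b`.
fn is_subset(a: &[&str], b: &[&str]) -> bool {
    let mut j = 0;
    for key in a {
        // Skip keys of `b` that sort before the key we are looking for.
        while j < b.len() && b[j] < *key {
            j += 1;
        }
        if j == b.len() || b[j] != *key {
            return false;
        }
        j += 1;
    }
    true
}

/// True if every key in sorted `b` also appears in sorted `a`.
fn is_superset(a: &[&str], b: &[&str]) -> bool {
    is_subset(b, a)
}
```

Each function advances through both inputs at most once, so the cost is linear in the combined number of keys. If either input were unsorted, a key could be skipped past and the answer would be wrong.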

`fn fst_type(&self) -> FstType`

Returns the underlying type of this fst.

FstType is a convention used to indicate the type of the underlying transducer.

This crate reserves the range 0-255 (inclusive) but currently leaves the meaning of 0-255 unspecified.

`fn root(&self) -> Node`

Returns the root node of this fst.

`fn node(&self, addr: CompiledAddr) -> Node`

Returns the node at the given address.

Node addresses can be obtained by reading transitions on `Node` values.