We're using the raft.MemoryStorage
struct already (01-single-node-cluster and 02-single-node-proposals), and are going to continue using it.
So, let's take a closer look at the actual interface and implementation. We're also going to cover it with a few test cases to be extra sure of the analysis.
raft.Storage
is an interface with the following methods:
InitialState() (pb.HardState, pb.ConfState, error)
Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
Term(i uint64) (uint64, error)
LastIndex() (uint64, error)
FirstIndex() (uint64, error)
Snapshot() (pb.Snapshot, error)
Storage
maintains a list of entries, each Entry
has a Term
and Index
, which are fundamental properties of log entries in the raft algorithm.Storage
enforces a set of rules:
Storage
what are the smallest and largest index for entries it stores, via FirstIndex()
and LastIndex()
methods.Storage
is able to provide an Entry
for any index in the range [FirstIndex(), LastIndex()]
(inclusive).
N
, call Entries(N, N+1, maxSize)
.LastIndex() < FirstIndex()
, the storage is effectively empty.maxSize
parameter of Entries()
is specified in bytes.FirstIndex()
are "compacted" into a pb.Snapshot
, which represents the result of applying all entries with indexes up to FirstIndex() - 1
to the state machine.
FirstIndex()
results in ErrCompacted
error.Snapshot
stores the "last compacted" index in its metadata, so Snapshot().Metadata.Index
is always equal to FirstIndex() - 1
.*raft.MemoryStorage
is an implementation of this interface, adding more methods:
SetHardState(st pb.HardState) error
ApplySnapshot(snap pb.Snapshot) error
Append(entries []pb.Entry)
CreateSnapshot(i uint64, cs *pb.ConfState, data []byte) (pb.Snapshot, error)
Compact(compactIndex uint64) error
The implementation stores 3 things:
pb.HardState
— a tuple of 3 integers (term, vote, commit)
.pb.Snapshot
— a protobuf with []byte
(arbitrary data) + metadata: (index, term, pb.ConfState)
.[]pb.Entry
— a slice of entries, where each Entry
has Index
and Term
.MemoryStorage
When a new MemoryStorage
is constructed via NewMemoryStorage()
, it contains the default values for hardState
and snapshot
, as well as a single "dummy" Entry
.
For a newly constructed MemoryStorage
, we have firstIndex = 1
and lastIndex = 0
. In practice, this means that the storage contains no entries, because in order to read them, we would need to call Entries(lo, hi, ...)
, setting hi <= lastIndex + 1
. The returned slice would be empty, because the half-open interval [lo, hi) = [1, 1)
is empty.
Also note that for the new storage, snapshot.Metadata.Index == 0 == firstIndex - 1
, which is another implicit invariant of the Storage
interface.
FirstIndex() and LastIndex()
As entries are coming in, they're added to MemoryStorage
, and its internal representation looks like this:
slice index: 0 1 2 3 4 ... 10
entry index: 0 1 2 3 4 ... 10
dummy entry ^
For this storage, firstIndex = 1
and lastIndex = 10
. So the full list of entries can be read via Entries(1, 11, maxSize)
.
CreateSnapshot()
and Compact()
A MemoryStorage
can be instructed to overwrite its snapshot
from the outside. A caller provides pb.Snapshot
representing the FSM state (and raft cluster configuration state) for which all log entries up to a given index are applied. For example:
¸---- index=11
|
existing: [snapshot until index 10]xxxxxxxxxxxxxxxxxxxxxxxx
^ first ^ last
storage.CreateSnapshot(20, config, data) ¸------- index=21
result: [ snapshot until index 20]|
xxxxxxxxxxxxxxxxxxxxxxxx
^ first ^ last
This doesn't remove "compacted" entries from the underlying storage, so Entries()
call will still return the entries for indexes [11, 20]
.
To remove the "compacted" entries from the storage, use Compact()
:
¸---- index = 21
existing: [ snapshot until index 20]|
xxxxxxxxxxxxxxxxxxxxxxxx
^ first ^ last
¸---- index = 21
storage.Compact(20) |
result: [ snapshot until index 20]xxxxxxxxxxxxxx
^ first ^ last
Append()
The Append()
method expects the passed entries' indexes to be ordered and consecutive. It also expects there to be no "gaps" between LastIndex()
and the first index of the passed entries.
Despite its name, Append()
doesn't append the entries, but rather merges them.
Imagine MemoryStorage
which looks like this (it has all entries up to index = 10
compacted into a snapshot):
index: 0123456789012345678901234567
entries: [ snapshot]xxxxxxxxxxxxxxxxx
^ ^
| `--- lastIndex = 27
`------------------- firstIndex = 11
So MemoryStorage::Apply()
has to handle all of these possible cases:
existing: [ snapshot]xxxxxxxxxxxxxxxxx
new: yyyyyyyyy
result: xxxxxxxxxxxxxxxxx
(no new entries need to be appended)
etcd-io/raft
readme):existing: [ snapshot]xxxxxxxxxxxxxxxxx
new: yyyyyyyyyyyyy |
result: yyyyyyy `----- obsolete newer entries are discarded
[ ] <-------- existing entries are overwritten
existing: [ snapshot]xxxxxxxxxxxxxxxxx
new: yyyyyyyyyy |
result: xxyyyyyyyyyy `-- obsolete newer entries are discarded
| [ ] <--- existing entries are overwritten
some entries stay ---´
lastIndex
: ,--- some new entries are appended
existing: [ snapshot]xxxxxxxxxxxxxxxxx |
new: yyyyyyyyyy
result: xxxxxxxxxxxyyyyyyyyyy
| [ ] <----- existing entries are overwritten
some entries stay ---´
existing: [ snapshot]xxxxxxxxxxxxxxxxx
new: yyyyyyyyyy
result: xxxxxxxxxxxxxxxxxyyyyyyyyyy
| [ ] <--- new entries are appended
some entries stay ---´
lastIndex
, but with a gap:existing: [ snapshot]xxxxxxxxxxxxxxxxx
new: yyyyyyyyyy
|
result: error, no gaps are allowed ---´
ApplySnapshot()
Overwrites the snapshot stored in a MemoryStorage
, and replaces its entries with a single dummy entry, making the storage effectively empty.
However, if the Snapshot
's index (this is the index of log entry up until which the log was compacted) is less than the index of the Snapshot
already stored in the MemoryStorage
(in other words, the passed snapshot is old), then Append()
call is a no-op.
Next: 04-detour-node-refactor.