Using etcd-io/raft library

03-detour-memory-storage

We're using the raft.MemoryStorage struct already (01-single-node-cluster and 02-single-node-proposals), and are going to continue using it.

So, let's take a closer look at the actual interface and implementation. We're also going to cover it with a few test cases to be extra sure of the analysis.

Interface overview

raft.Storage is an interface with the following methods:

InitialState() (pb.HardState, pb.ConfState, error)
Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
Term(i uint64) (uint64, error)
LastIndex() (uint64, error)
FirstIndex() (uint64, error)
Snapshot() (pb.Snapshot, error)

The Storage maintains a list of entries, each Entry has a Term and Index, which are fundamental properties of log entries in the raft algorithm.
A valid Storage enforces a set of rules:
- In the entries list, the terms are monotonically increasing.
- The indexes are increasing and consecutive (no gaps are allowed).
You can ask the Storage what are the smallest and largest index for entries it stores, via FirstIndex() and LastIndex() methods.
The Storage is able to provide an Entry for any index in the range [FirstIndex(), LastIndex()] (inclusive).
- For example, to read the entry with index N, call Entries(N, N+1, maxSize).
- This implies that when LastIndex() < FirstIndex(), the storage is effectively empty.
- The maxSize parameter of Entries() is specified in bytes.
The entries before FirstIndex() are "compacted" into a pb.Snapshot, which represents the result of applying all entries with indexes up to FirstIndex() - 1 to the state machine.
- Attempting to read entries at index less than FirstIndex() results in ErrCompacted error.
- The Snapshot stores the "last compacted" index in its metadata, so Snapshot().Metadata.Index is always equal to FirstIndex() - 1.

Implementation overview

*raft.MemoryStorage is an implementation of this interface, adding more methods:

SetHardState(st pb.HardState) error
ApplySnapshot(snap pb.Snapshot) error
Append(entries []pb.Entry)
CreateSnapshot(i uint64, cs *pb.ConfState, data []byte) (pb.Snapshot, error)
Compact(compactIndex uint64) error

The implementation stores 3 things:

pb.HardState — a tuple of 3 integers (term, vote, commit).
pb.Snapshot — a protobuf with []byte (arbitrary data) + metadata: (index, term, pb.ConfState).
[]pb.Entry — a slice of entries, where each Entry has Index and Term.

New `MemoryStorage`

When a new MemoryStorage is constructed via NewMemoryStorage(), it contains the default values for hardState and snapshot, as well as a single "dummy" Entry.

For a newly constructed MemoryStorage, we have firstIndex = 1 and lastIndex = 0. In practice, this means that the storage contains no entries, because in order to read them, we would need to call Entries(lo, hi, ...), setting hi <= lastIndex + 1. The returned slice would be empty, because the half-open interval [lo, hi) = [1, 1) is empty.

Also note that for the new storage, snapshot.Metadata.Index == 0 == firstIndex - 1, which is another implicit invariant of the Storage interface.

`FirstIndex() and LastIndex()`

As entries are coming in, they're added to MemoryStorage, and its internal representation looks like this:

slice index:     0      1  2  3  4  ...   10
entry index:     0      1  2  3  4  ...   10
     dummy entry ^

For this storage, firstIndex = 1 and lastIndex = 10. So the full list of entries can be read via Entries(1, 11, maxSize).

`CreateSnapshot()` and `Compact()`

A MemoryStorage can be instructed to overwrite its snapshot from the outside. A caller provides pb.Snapshot representing the FSM state (and raft cluster configuration state) for which all log entries up to a given index are applied. For example:

                                         ¸---- index=11
                                        |
existing:      [snapshot until index 10]xxxxxxxxxxxxxxxxxxxxxxxx
                                        ^ first                ^ last

storage.CreateSnapshot(20, config, data)           ¸------- index=21
result:        [          snapshot until index 20]|
                                        xxxxxxxxxxxxxxxxxxxxxxxx
                                        ^ first                ^ last

This doesn't remove "compacted" entries from the underlying storage, so Entries() call will still return the entries for indexes [11, 20].

To remove the "compacted" entries from the storage, use Compact():

                                                   ¸---- index = 21
existing:      [          snapshot until index 20]|
                                        xxxxxxxxxxxxxxxxxxxxxxxx
                                        ^ first                ^ last
                                                   ¸---- index = 21
storage.Compact(20)                               |
result:        [          snapshot until index 20]xxxxxxxxxxxxxx
                                                  ^ first      ^ last

`Append()`

The Append() method expects the passed entries' indexes to be ordered and consecutive. It also expects there to be no "gaps" between LastIndex() and the first index of the passed entries.

Despite its name, Append() doesn't append the entries, but rather merges them.

Imagine MemoryStorage which looks like this (it has all entries up to index = 10 compacted into a snapshot):

index:      0123456789012345678901234567
entries:    [ snapshot]xxxxxxxxxxxxxxxxx
                       ^               ^
                       |               `---  lastIndex = 27
                       `------------------- firstIndex = 11

So MemoryStorage::Apply() has to handle all of these possible cases:

New entries fall completely into the snapshot.

existing:   [ snapshot]xxxxxxxxxxxxxxxxx
new:        yyyyyyyyy
result:                xxxxxxxxxxxxxxxxx
         (no new entries need to be appended)

New entries start withing the snapshot, and partially overlap stored entries. Note how this discards any stored entries with indexes that are larger than any passed entry (this requirement is mentioned in the etcd-io/raft readme):

existing:   [ snapshot]xxxxxxxxxxxxxxxxx
new:             yyyyyyyyyyyyy    |
result:                yyyyyyy    `----- obsolete newer entries are discarded
                       [     ] <-------- existing entries are overwritten

New entries fall entirely within the stored entries:

existing:   [ snapshot]xxxxxxxxxxxxxxxxx
new:                     yyyyyyyyyy  |
result:                xxyyyyyyyyyy  `-- obsolete newer entries are discarded
                       | [        ] <--- existing entries are overwritten
  some entries stay ---´

New entries start within the stored entries and continue past lastIndex:

                                           ,--- some new entries are appended
existing:   [ snapshot]xxxxxxxxxxxxxxxxx  |
new:                              yyyyyyyyyy
result:                xxxxxxxxxxxyyyyyyyyyy
                            |     [    ] <----- existing entries are overwritten
       some entries stay ---´

New entries start immediately after stored entries:

existing:   [ snapshot]xxxxxxxxxxxxxxxxx
new:                                    yyyyyyyyyy
result:                xxxxxxxxxxxxxxxxxyyyyyyyyyy
                            |           [        ] <--- new entries are appended
       some entries stay ---´

New entries start after lastIndex, but with a gap:

existing:   [ snapshot]xxxxxxxxxxxxxxxxx
new:                                       yyyyyyyyyy
                                         |
result:    error, no gaps are allowed ---´

`ApplySnapshot()`

Overwrites the snapshot stored in a MemoryStorage, and replaces its entries with a single dummy entry, making the storage effectively empty.

However, if the Snapshot's index (this is the index of log entry up until which the log was compacted) is less than the index of the Snapshot already stored in the MemoryStorage (in other words, the passed snapshot is old), then Append() call is a no-op.

Next: 04-detour-node-refactor.

This site is open source. Improve this page.

Using etcd-io/raft library

03-detour-memory-storage

Interface overview

Implementation overview

New MemoryStorage

FirstIndex() and LastIndex()

CreateSnapshot() and Compact()

Append()

ApplySnapshot()

New `MemoryStorage`

`FirstIndex() and LastIndex()`

`CreateSnapshot()` and `Compact()`

`Append()`

`ApplySnapshot()`