Using etcd-io/raft library

03-detour-memory-storage

We're using the raft.MemoryStorage struct already (01-single-node-cluster and 02-single-node-proposals), and are going to continue using it.

So, let's take a closer look at the actual interface and implementation. We're also going to cover it with a few test cases to be extra sure of the analysis.

Interface overview

raft.Storage is an interface with the following methods:

InitialState() (pb.HardState, pb.ConfState, error)
Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
Term(i uint64) (uint64, error)
LastIndex() (uint64, error)
FirstIndex() (uint64, error)
Snapshot() (pb.Snapshot, error)

Implementation overview

*raft.MemoryStorage is an implementation of this interface, adding more methods:

SetHardState(st pb.HardState) error
ApplySnapshot(snap pb.Snapshot) error
Append(entries []pb.Entry)
CreateSnapshot(i uint64, cs *pb.ConfState, data []byte) (pb.Snapshot, error)
Compact(compactIndex uint64) error

The implementation stores 3 things:

  1. pb.HardState — a tuple of 3 integers (term, vote, commit).
  2. pb.Snapshot — a protobuf with []byte (arbitrary data) + metadata: (index, term, pb.ConfState).
  3. []pb.Entry — a slice of entries, where each Entry has Index and Term.

New MemoryStorage

When a new MemoryStorage is constructed via NewMemoryStorage(), it contains the default values for hardState and snapshot, as well as a single "dummy" Entry.

For a newly constructed MemoryStorage, we have firstIndex = 1 and lastIndex = 0. In practice, this means that the storage contains no entries, because in order to read them, we would need to call Entries(lo, hi, ...), setting hi <= lastIndex + 1. The returned slice would be empty, because the half-open interval [lo, hi) = [1, 1) is empty.

Also note that for the new storage, snapshot.Metadata.Index == 0 == firstIndex - 1, which is another implicit invariant of the Storage interface.

FirstIndex() and LastIndex()

As entries are coming in, they're added to MemoryStorage, and its internal representation looks like this:

slice index:     0      1  2  3  4  ...   10
entry index:     0      1  2  3  4  ...   10
     dummy entry ^

For this storage, firstIndex = 1 and lastIndex = 10. So the full list of entries can be read via Entries(1, 11, maxSize).

CreateSnapshot() and Compact()

A MemoryStorage can be instructed to overwrite its snapshot from the outside. A caller provides pb.Snapshot representing the FSM state (and raft cluster configuration state) for which all log entries up to a given index are applied. For example:

                                         ¸---- index=11
                                        |
existing:      [snapshot until index 10]xxxxxxxxxxxxxxxxxxxxxxxx
                                        ^ first                ^ last

storage.CreateSnapshot(20, config, data)           ¸------- index=21
result:        [          snapshot until index 20]|
                                        xxxxxxxxxxxxxxxxxxxxxxxx
                                        ^ first                ^ last

This doesn't remove "compacted" entries from the underlying storage, so Entries() call will still return the entries for indexes [11, 20].

To remove the "compacted" entries from the storage, use Compact():

                                                   ¸---- index = 21
existing:      [          snapshot until index 20]|
                                        xxxxxxxxxxxxxxxxxxxxxxxx
                                        ^ first                ^ last
                                                   ¸---- index = 21
storage.Compact(20)                               |
result:        [          snapshot until index 20]xxxxxxxxxxxxxx
                                                  ^ first      ^ last

Append()

The Append() method expects the passed entries' indexes to be ordered and consecutive. It also expects there to be no "gaps" between LastIndex() and the first index of the passed entries.

Despite its name, Append() doesn't append the entries, but rather merges them.

Imagine MemoryStorage which looks like this (it has all entries up to index = 10 compacted into a snapshot):

index:      0123456789012345678901234567
entries:    [ snapshot]xxxxxxxxxxxxxxxxx
                       ^               ^
                       |               `---  lastIndex = 27
                       `------------------- firstIndex = 11

So MemoryStorage::Apply() has to handle all of these possible cases:

  1. New entries fall completely into the snapshot.
    existing:   [ snapshot]xxxxxxxxxxxxxxxxx
    new:        yyyyyyyyy
    result:                xxxxxxxxxxxxxxxxx
             (no new entries need to be appended)
    
  2. New entries start withing the snapshot, and partially overlap stored entries. Note how this discards any stored entries with indexes that are larger than any passed entry (this requirement is mentioned in the etcd-io/raft readme):
    existing:   [ snapshot]xxxxxxxxxxxxxxxxx
    new:             yyyyyyyyyyyyy    |
    result:                yyyyyyy    `----- obsolete newer entries are discarded
                           [     ] <-------- existing entries are overwritten
    
  3. New entries fall entirely within the stored entries:
    existing:   [ snapshot]xxxxxxxxxxxxxxxxx
    new:                     yyyyyyyyyy  |
    result:                xxyyyyyyyyyy  `-- obsolete newer entries are discarded
                           | [        ] <--- existing entries are overwritten
      some entries stay ---´
    
  4. New entries start within the stored entries and continue past lastIndex:
                                               ,--- some new entries are appended
    existing:   [ snapshot]xxxxxxxxxxxxxxxxx  |
    new:                              yyyyyyyyyy
    result:                xxxxxxxxxxxyyyyyyyyyy
                                |     [    ] <----- existing entries are overwritten
           some entries stay ---´
    
  5. New entries start immediately after stored entries:
    existing:   [ snapshot]xxxxxxxxxxxxxxxxx
    new:                                    yyyyyyyyyy
    result:                xxxxxxxxxxxxxxxxxyyyyyyyyyy
                                |           [        ] <--- new entries are appended
           some entries stay ---´
    
  6. New entries start after lastIndex, but with a gap:
    existing:   [ snapshot]xxxxxxxxxxxxxxxxx
    new:                                       yyyyyyyyyy
                                             |
    result:    error, no gaps are allowed ---´
    

ApplySnapshot()

Overwrites the snapshot stored in a MemoryStorage, and replaces its entries with a single dummy entry, making the storage effectively empty.

However, if the Snapshot's index (this is the index of log entry up until which the log was compacted) is less than the index of the Snapshot already stored in the MemoryStorage (in other words, the passed snapshot is old), then Append() call is a no-op.

Next: 04-detour-node-refactor.