RocksDB Storage

weige2980

Last modified: 2025-01-07 10:15:56

License: CC 4.0 BY-SA

Category: myrocksdb. Tags: db

First published: 2025-01-07 09:44:40

Article link: https://blog.csdn.net/weige2980/article/details/144977735


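1 Overview

A key-value pair written by the user is first appended to the WAL (Write Ahead Log) file on disk and then inserted into the in-memory MemTable, at which point the write returns. Random updates, inserts, and deletes are thus turned into sequential writes to the WAL file, which yields higher write performance. RocksDB is widely used in high-performance servers, distributed storage, and databases.

The figure above illustrates the LSM-tree structure. Its characteristics are:

Tree-like structure
Random writes are converted into sequential writes
Updates are not done in place
Hot and cold data are kept on separate levels
Periodic merging (compaction)

Pros and cons:

Pros:

Inserts, deletes, and updates are very fast, so write performance is excellent.

Data is tiered by temperature, so recently written, modified, or deleted data can be read quickly.

Cons:

Because of the levels, there is some read amplification and space amplification.

Periodic compaction consumes hardware resources: CPU, disk, and memory.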
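
The write path described at the start of this section can be sketched as a toy model. This is not RocksDB code: ToyDB, its text log format, and the std::map standing in for the MemTable are all assumptions made for illustration. It only shows the ordering (log first, memory second) and that a delete is recorded as a new entry rather than applied in place.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Toy model only: ToyDB and its members are hypothetical, not RocksDB APIs.
class ToyDB {
 public:
  explicit ToyDB(std::ostream& wal) : wal_(wal) {}

  // Step 1: append the operation to the log (a sequential write).
  // Step 2: update the in-memory sorted table, then return.
  void Put(const std::string& key, const std::string& value) {
    wal_ << "PUT " << key << ' ' << value << '\n';
    memtable_[key] = std::make_pair(true, value);
  }

  // A delete is also a write: it logs the operation and stores a tombstone
  // instead of erasing anything in place.
  void Delete(const std::string& key) {
    wal_ << "DEL " << key << '\n';
    memtable_[key] = std::make_pair(false, std::string());
  }

  // Returns true and fills *value if the key is present and not deleted.
  bool Get(const std::string& key, std::string* value) const {
    auto it = memtable_.find(key);
    if (it == memtable_.end() || !it->second.first) return false;
    *value = it->second.second;
    return true;
  }

 private:
  std::ostream& wal_;
  // key -> (is_live, value); std::map keeps keys sorted, as a memtable does.
  std::map<std::string, std::pair<bool, std::string>> memtable_;
};
```

Passing a std::ostringstream as the log makes the sequence of appended records easy to inspect; a real WAL would be a file that is appended to and synced.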

2 Build and Usage

# Clone over HTTPS or over SSH (either one is enough)
git clone https://github.com/facebook/rocksdb.git
git clone git@github.com:facebook/rocksdb.git
cd rocksdb
git checkout v6.8.1

mkdir build
cd build
cmake ..
make

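CMake may fail with an error, for example when gflags is missing. In that case install it:

sudo apt-get install libgflags-dev

Demo

Append one line to the end of the root CMakeLists.txt: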

add_subdirectory(application)

application/
├── CMakeLists.txt
├── include
└── src
    └── test1.cpp

2 directories, 2 files

application/CMakeLists.txt, which builds the demo executable, contains:

set(MYDB_NAME myrocksdb)  # name of the demo target

include_directories(include) # the demo's own headers
include_directories(${PROJECT_SOURCE_DIR})
include_directories(${PROJECT_SOURCE_DIR}/include)

message("Building my rocksdb demo")
message(${PROJECT_SOURCE_DIR}/include)
file(GLOB ROCKSDB_WRAPPER_SRC src/*.cpp)  # collect the source files
add_executable(${MYDB_NAME} ${ROCKSDB_WRAPPER_SRC}) # the executable

# Find the RocksDB library
#find_package(${ROCKSDB_LIB} REQUIRED)

# Link against the RocksDB library
target_link_libraries(${MYDB_NAME} PRIVATE ${ROCKSDB_STATIC_LIB}) # link the static library

#target_link_libraries(${ROCKSDB_LIB}) # link the library

application/src/test1.cpp is the demo test program:

#include <iostream>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;

  // Open the database
  rocksdb::Status status = rocksdb::DB::Open(options, "./testdb", &db);
  if (!status.ok()) {
    std::cerr << "Failed to open database: " << status.ToString() << std::endl;
    return -1;
  }

  // Insert a key-value pair
  status = db->Put(rocksdb::WriteOptions(), "key1", "value1");
  if (!status.ok()) {
    std::cerr << "Put failed: " << status.ToString() << std::endl;
    delete db;
    return -1;
  }

  // Read the key-value pair back
  std::string value;
  status = db->Get(rocksdb::ReadOptions(), "key1", &value);
  if (status.ok()) {
    std::cout << "Get succeeded: key1 -> " << value << std::endl;
  } else {
    std::cerr << "Get failed: " << status.ToString() << std::endl;
  }

  // Close the database
  delete db;
  return 0;
}

Build and run, method 1:

Run the build steps above; the myrocksdb demo executable is produced under build/application. Run it directly:

./myrocksdb

Build and run, method 2:

Open the project in CLion and run the demo from there.

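3 MemTable and SkipList

A write to RocksDB is considered successful once it has been written to the memtable. When a memtable fills up (or some other condition is met), it becomes an immutable memtable and is replaced by a new memtable. A background thread flushes the contents of the immutable memtable to an SST file, after which the immutable memtable can be destroyed.

The memtable can be backed by several data structures; the most common is the skip list. A skip-list memtable performs well in most cases for reads, writes, random access, and sequential scans, and it supports concurrent writes.

3.1 Important MemTable members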

class MemTable {
 private:
  KeyComparator comparator_; // comparator that orders keys
  const ImmutableMemTableOptions moptions_;

  std::unique_ptr<MemTableRep> table_; // the memtable implementation, e.g. a skip list
  std::unique_ptr<MemTableRep> range_del_table_; // skip list for kTypeRangeDeletion entries

  // Total data size of all data inserted
  std::atomic<uint64_t> data_size_;  // total data size
  std::atomic<uint64_t> num_entries_; // number of entries written
  std::atomic<uint64_t> num_deletes_; // number of deletions

  // These are used to manage memtable flushes to storage
  bool flush_in_progress_; // started the flush
  bool flush_completed_;   // finished the flush
  uint64_t file_number_;    // filled up after flush is complete

  std::unique_ptr<DynamicBloom> bloom_filter_; // Bloom filter: quickly rules out keys that are not in the memtable
  // ... (other members omitted)
};

3.2 Format of one KV entry inserted into the memtable:

|-internal_key_size-|---key---|--seq--type|--value_size--|--value--|

internal_key_size: varint; the number of bytes taken by key, seq, and type together
key: string; the key passed to Put
seq: sequence number, 7 bytes
type: operation type, 1 byte
value_size: varint; the length of value
value: string; the value passed to Put
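
The entry format above can be reproduced with a small standalone encoder. This is a sketch under assumptions: PutVarint32, PutFixed64, and EncodeEntry are helper names written here for illustration (they are not RocksDB's own functions), the output goes into a std::string rather than an arena buffer, and the 8-byte field is written little-endian with seq in the upper 7 bytes and type in the lowest byte.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v as a base-128 varint: 7 payload bits per byte, low bits first,
// high bit set on every byte except the last.
inline void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Append v as 8 little-endian bytes.
inline void PutFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
  }
}

// Build one entry:
// internal_key_size | key | (seq << 8 | type) | value_size | value
inline std::string EncodeEntry(const std::string& key, const std::string& value,
                               uint64_t seq, uint8_t type) {
  std::string out;
  PutVarint32(&out, static_cast<uint32_t>(key.size() + 8));  // key + seq + type
  out.append(key);
  PutFixed64(&out, (seq << 8) | type);  // 7 bytes of seq, 1 byte of type
  PutVarint32(&out, static_cast<uint32_t>(value.size()));
  out.append(value);
  return out;
}
```

For a 4-byte key and a 6-byte value the entry is 1 + 4 + 8 + 1 + 6 = 20 bytes, since both lengths fit in a one-byte varint.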

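3.3 MemTable KV insert flow

Allocate a block of memory, pack the key and value into it, then call the skip list interface to insert the entry into the skip list, and finally update the statistics and the Bloom filter.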

bool MemTable::Add(SequenceNumber s, ValueType type,
                   const Slice& key, /* user key */
                   const Slice& value, bool allow_concurrent,
                   MemTablePostProcessInfo* post_process_info, void** hint) {
  // Format of an entry is concatenation of:
  //  key_size     : varint32 of internal_key.size()
  //  key bytes    : char[internal_key.size()]
  //  value_size   : varint32 of value.size()
  //  value bytes  : char[value.size()]
  uint32_t key_size = static_cast<uint32_t>(key.size()); // key length
  uint32_t val_size = static_cast<uint32_t>(value.size()); // value length
  uint32_t internal_key_size = key_size + 8;  // user key + 8 bytes of seq + type
  const uint32_t encoded_len = VarintLength(internal_key_size) +
                               internal_key_size + VarintLength(val_size) +
                               val_size;
  char* buf = nullptr;
  std::unique_ptr<MemTableRep>& table =
      type == kTypeRangeDeletion ? range_del_table_ : table_;
  KeyHandle handle = table->Allocate(encoded_len, &buf); // allocate the buffer for this entry

  char* p = EncodeVarint32(buf, internal_key_size); // write internal_key_size into buf
  memcpy(p, key.data(), key_size); // write the key bytes into buf
  Slice key_slice(p, key_size);
  p += key_size;
  uint64_t packed = PackSequenceAndType(s, type); // seq + type
  EncodeFixed64(p, packed);       // write the seq + type field into buf
  p += 8;
  p = EncodeVarint32(p, val_size); // write value_size into buf
  memcpy(p, value.data(), val_size); // write the value bytes; the whole entry is now in buf
  assert((unsigned)(p + val_size - buf) == (unsigned)encoded_len);
  size_t ts_sz = GetInternalKeyComparator().user_comparator()->timestamp_size();

  if (!allow_concurrent) { // non-concurrent insert path
    // Extract prefix for insert with hint.
    if (insert_with_hint_prefix_extractor_ != nullptr &&
        insert_with_hint_prefix_extractor_->InDomain(key_slice)) {
      Slice prefix = insert_with_hint_prefix_extractor_->Transform(key_slice);
      // Insert with hint: a map remembers where keys with this prefix were last
      // inserted, so the next key with the same prefix finds its position quickly
      bool res = table->InsertKeyWithHint(handle, &insert_hints_[prefix]);
      if (UNLIKELY(!res)) {
        return res;
      }
    } else {  // the usual path when no hint prefix extractor is configured
      bool res = table->InsertKey(handle); // insert into the skip list; handle points at the encoded key/value
      if (UNLIKELY(!res)) {
        return res;
      }
    }

    // this is a bit ugly, but is the way to avoid locked instructions
    // when incrementing an atomic
    // Update the statistics
    num_entries_.store(num_entries_.load(std::memory_order_relaxed) + 1, // entry count + 1
                       std::memory_order_relaxed);
    data_size_.store(data_size_.load(std::memory_order_relaxed) + encoded_len, // data size + size of this entry
                     std::memory_order_relaxed);
    if (type == kTypeDeletion) {
      num_deletes_.store(num_deletes_.load(std::memory_order_relaxed) + 1,
                         std::memory_order_relaxed);
    }
    // Update the Bloom filter
    if (bloom_filter_ && prefix_extractor_ &&
        prefix_extractor_->InDomain(key)) {
      bloom_filter_->Add(prefix_extractor_->Transform(key));
    }
    if (bloom_filter_ && moptions_.memtable_whole_key_filtering) {
      bloom_filter_->Add(StripTimestampFromUserKey(key, ts_sz));
    }

    // The first sequence number inserted into the memtable
    assert(first_seqno_ == 0 || s >= first_seqno_);
    if (first_seqno_ == 0) {
      first_seqno_.store(s, std::memory_order_relaxed);

      if (earliest_seqno_ == kMaxSequenceNumber) {
        earliest_seqno_.store(GetFirstSequenceNumber(),
                              std::memory_order_relaxed);
      }
      assert(first_seqno_.load() >= earliest_seqno_.load());
    }
    assert(post_process_info == nullptr);
    UpdateFlushState();
  } else {
    ...
  }
  if (type == kTypeRangeDeletion) {
    is_range_del_table_empty_.store(false, std::memory_order_relaxed);
  }
  UpdateOldestKeyTime();
  return true;
}

3.4 Skip list node structure

template <class Comparator>
struct InlineSkipList<Comparator>::Node {
  ......
  const char* Key() const { return reinterpret_cast<const char*>(&next_[1]); } // the key bytes sit right after next_[0], i.e. at &next_[1]

  // Accessors/mutators for links.  Wrapped in methods so we can add
  // the appropriate barriers as necessary, and perform the necessary
  // addressing trickery for storing links below the Node in memory.
  Node* Next(int n) { // get the next node at level n
    assert(n >= 0);
    // Use an 'acquire load' so that we observe a fully initialized
    // version of the returned Node.
    return ((&next_[0] - n)->load(std::memory_order_acquire));
  }

  void SetNext(int n, Node* x) { // set the level-n link to x
    assert(n >= 0);
    // Use a 'release store' so that anybody who reads through this
    // pointer observes a fully initialized version of the inserted node.
    (&next_[0] - n)->store(x, std::memory_order_release);
  }

  ......
 private:
  // next_[0] is the lowest level link (level 0).  Higher levels are
  // stored _earlier_, so level 1 is at next_[-1].
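  std::atomic<Node*> next_[1];   // Node's only member is a one-element array of Node*, so sizeof(Node) equals the size of one pointer
};

As shown, Node has a single member, a one-element array of Node*, so a Node is exactly one pointer in size.

There is an interesting implicit point here: for a node Node x, the address of x equals the address of its next_ array, i.e. &x == &(x.next_). This fact is used when fetching a node's link pointer for a given level.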

3.5 Skip list node memory layout


/*
 *
 * Raw memory layout: {Node*, Node*, Node*, Node*, Node{<Node*> next_[1]}, [Key]}
 * The function returns x, the address of the Node element.
 * From x, the level-0 link is x.next_[0], the level-1 link is x.next_[-1], and the key starts at x.next_[1].
 *
 * */
template <class Comparator>
typename InlineSkipList<Comparator>::Node*
InlineSkipList<Comparator>::AllocateNode(size_t key_size, int height) {
  auto prefix = sizeof(std::atomic<Node*>) * (height - 1); // for height 5, prefix is the size of 4 Node pointers

  // prefix is space for the height - 1 pointers that we store before
  // the Node instance (next_[-(height - 1) .. -1]).  Node starts at
  // raw + prefix, and holds the bottom-mode (level 0) skip list pointer
  // next_[0].  key_size is the bytes for the key, which comes just after
  // the Node.
  // height - 1 Node pointers; each one points to the next node at its level
  char* raw = allocator_->AllocateAligned(prefix + sizeof(Node) + key_size);  // 4 Node pointers + 1 Node + key_size bytes (the key is stored here)
  Node* x = reinterpret_cast<Node*>(raw + prefix); // address of the Node element

  // Once we've linked the node into the skip list we don't actually need
  // to know its height, because we can implicitly use the fact that we
  // traversed into a node at level h to known that h is a valid level
  // for that node.  We need to convey the height to the Insert step,
  // however, so that it can perform the proper links.  Since we're not
  // using the pointers at the moment, StashHeight temporarily borrow
  // storage from next_[0] for that purpose.
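  // Temporarily store the node height in next_[0]; once the insert is done the
  // height is no longer needed and that slot holds the level-0 next pointer
  x->StashHeight(height);
  return x; // return the Node, whose private member is the one-element link array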
}

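AllocateNode allocates one block of memory for a skip list node. As in the figure above, for a height of 5 the raw buffer is laid out as {Node*, Node*, Node*, Node*, Node{<Node*> next_[1]}, [Key]}: four Node* pointers, one Node, and the key bytes. The function returns x, the address of the fifth element (the Node). The first four elements are the next pointers of x for levels 4, 3, 2, and 1; the fifth element is node x itself, and its member next_[0] is the level-0 next pointer of x. The sixth element is the key of x.

Given this layout, here is how to reach the next pointer of node x at each level.

The level-0 pointer next0 is x.next_[0], element [0] of the next_ array in Node.

The level-1 pointer next1 is the fourth element, the Node* just before x, so it is found by subtracting one slot from the address of x: next1 = &x - 1. Since &x == &(x.next_), this is next1 = &(x.next_[0]) - 1.

With that in mind, the Next function of Node, which returns the next node pointer at level n, becomes easy to read: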

return ((&next_[0] - n)->load(std::memory_order_acquire));
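
The layout and the negative indexing can be tried out in isolation. The sketch below is a cut-down imitation, not the RocksDB source: it assumes plain ::operator new instead of the arena allocator, omits StashHeight, and hands the raw pointer back through a hypothetical raw_out parameter so the caller can free it. Like the original, it relies on addressing link slots that sit before next_[0] in the same allocation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

struct Node {
  // The link for level n lives n slots *before* next_[0].
  Node* Next(int n) { return (&next_[0] - n)->load(std::memory_order_acquire); }
  void SetNext(int n, Node* x) {
    (&next_[0] - n)->store(x, std::memory_order_release);
  }
  // The key bytes start right after next_[0].
  const char* Key() const { return reinterpret_cast<const char*>(&next_[1]); }

  std::atomic<Node*> next_[1];
};

// Allocate [height - 1 links][Node holding the level-0 link][key bytes] as one
// block and return the address of the Node part. *raw_out receives the start
// of the block so that it can be released with ::operator delete.
inline Node* AllocateNode(const char* key, size_t key_size, int height,
                          char** raw_out) {
  const size_t slot = sizeof(std::atomic<Node*>);
  const size_t prefix = slot * (height - 1);
  char* raw = static_cast<char*>(::operator new(prefix + sizeof(Node) + key_size));
  for (int i = 0; i < height; ++i) {
    new (raw + i * slot) std::atomic<Node*>(nullptr);  // construct every link slot
  }
  Node* x = reinterpret_cast<Node*>(raw + prefix);
  std::memcpy(const_cast<char*>(x->Key()), key, key_size);
  *raw_out = raw;
  return x;
}
```

With height 5, x->SetNext(4, p) writes to the very first slot of the block and x->SetNext(0, p) writes to the Node itself, which matches the {Node*, Node*, Node*, Node*, Node, Key} picture.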

3.6 Random height of a skip list node

The code is as follows:

/*
 * The height is drawn at random: each additional level is added with
 * probability 1 / kBranching_ (see the explanation after the function).
 * */
template <class Comparator>
int InlineSkipList<Comparator>::RandomHeight() { // pick a random height
  auto rnd = Random::GetTLSInstance();  // get a random number generator instance
  // The instance comes from thread-local storage (TLS), so every thread has its own
  // generator and the threads do not interfere with each other
  // Increase height with probability 1 in kBranching
  int height = 1; // a node is at least 1 level high
  // rnd->Next() returns the next random number; the height grows only while
  // that number is below the threshold kScaledInverseBranching_
  while (height < kMaxHeight_ && height < kMaxPossibleHeight &&
         rnd->Next() < kScaledInverseBranching_) {
    height++;
  }
  assert(height > 0);
  assert(height <= kMaxHeight_);
  assert(height <= kMaxPossibleHeight);
  return height;
}

How is the random height controlled?

1 The skip list initializes kScaledInverseBranching_((Random::kMaxNext + 1) / kBranching_).

2 kBranching_ is a control factor passed in; the default is 4.

3 kScaledInverseBranching_ is therefore 1/4 of the random number range.

4 So rnd->Next() returns a value below kScaledInverseBranching_ with probability 1/4.

5 It follows that, within the configured maximum height, a node gets height h with probability (1/4)^(h-1) * 3/4:

Height 1: 3/4

Height 2: 1/4 * 3/4 = 3/16

Height 3: 1/4 * 1/4 * 3/4 = 3/64

Height 4: 1/4 * 1/4 * 1/4 * 3/4 = 3/256

Height 5: 1/4 * 1/4 * 1/4 * 1/4 * 3/4 = 3/1024
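
These probabilities are easy to check empirically. The sketch below is an assumption-laden stand-in: it uses std::mt19937 instead of RocksDB's Random class, so the threshold is computed against the 32-bit output range of that generator, and HeightHistogram is a helper invented here for counting.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Pick a height in [1, max_height]. Every extra level is kept with
// probability 1 / branching, mirroring the loop in RandomHeight().
inline int RandomHeight(std::mt19937& rng, int max_height, uint32_t branching) {
  // mt19937 produces uniform values in [0, 2^32), so values below this
  // threshold occur with probability 1 / branching.
  const uint64_t threshold = (uint64_t{1} << 32) / branching;
  int height = 1;
  while (height < max_height && rng() < threshold) {
    ++height;
  }
  return height;
}

// Draw `samples` heights and count how often each height occurs.
// Index h of the result holds the count for height h; index 0 stays 0.
inline std::vector<int> HeightHistogram(int samples, int max_height,
                                        uint32_t branching, uint32_t seed) {
  std::mt19937 rng(seed);
  std::vector<int> hist(max_height + 1, 0);
  for (int i = 0; i < samples; ++i) {
    ++hist[RandomHeight(rng, max_height, branching)];
  }
  return hist;
}
```

With branching 4 and a large sample, roughly 75% of the draws should land on height 1 and roughly 18.75% on height 2, in line with 3/4 and 3/16 above.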

3.7 Checking whether a key comes after a node

Check whether key comes after node n, i.e. whether n.key < key.

The code is as follows:

// Check whether key comes after node n, i.e. n.key < key
template <class Comparator>
bool InlineSkipList<Comparator>::KeyIsAfterNode(const char* key,
                                                Node* n) const {
  // nullptr n is considered infinite
  assert(n != head_);
  return (n != nullptr) && (compare_(n->Key(), key) < 0); // true when n is to the left of key
}

3.8 Finding the first node greater than or equal to key

1 Start at the head node on the highest level and walk the list node by node (cur).
2 If cur.key == key, or key < cur.key on level 0, return cur.
3 If cur.key < key, advance along the list.
4 Otherwise (key < cur.key and not on level 0), go down one level and keep walking.


// Find the first node whose key is greater than or equal to key
// 1   3   5
// 1 2 3 4 5   ,  find 4.5 returns 5
/*
 * The steps are listed before this function.
 * */
template <class Comparator>
typename InlineSkipList<Comparator>::Node*
InlineSkipList<Comparator>::FindGreaterOrEqual(const char* key) const {
  // Note: It looks like we could reduce duplication by implementing
  // this function as FindLessThan(key)->Next(0), but we wouldn't be able
  // to exit early on equality and the result wouldn't even be correct.
  // A concurrent insert might occur after FindLessThan(key) but before
  // we get a chance to call Next(0).
  Node* x = head_; // head node
  int level = GetMaxHeight() - 1; // highest level
  Node* last_bigger = nullptr;
  const DecodedKey key_decoded = compare_.decode_key(key);
  while (true) {
    Node* next = x->Next(level);
    if (next != nullptr) {
      PREFETCH(next->Next(level), 0, 1);
    }
    // Make sure the lists are sorted
    assert(x == head_ || next == nullptr || KeyIsAfterNode(next->Key(), x)); // next is to the right of x
    // Make sure we haven't overshot during our search
    assert(x == head_ || KeyIsAfterNode(key_decoded, x));
    int cmp = (next == nullptr || next == last_bigger)
                  ? 1
                  : compare_(next->Key(), key_decoded); // next is to the right of key --> positive
    if (cmp == 0 || (cmp > 0 && level == 0)) { // exact match, or first larger node on level 0: return next
      return next;
    } else if (cmp < 0) { // next is to the left of key --> negative; advance along the list
      // Keep searching in this list
      x = next;
    } else { // level is not 0: go down one level and keep searching
      // Switch to next list, reuse compare_() result
      last_bigger = next;
      level--;
    }
  }
}

3.9 Finding the last node smaller than key

1 Start at the head node on the highest level and walk the list node by node (x).

2 If x.key < key, advance along the list.

3 If key == x.key or key < x.key:

3.1 On level 0, return the prev node.

3.2 Otherwise go down one level and keep walking.


// Return the last node whose key is smaller than key
// 1   3   5
// 1 2 3 4 5   ,  find 4.5 returns 4; find 4 returns 3
/*
 * The steps are listed before this function.
 * */
template <class Comparator>
typename InlineSkipList<Comparator>::Node*
InlineSkipList<Comparator>::FindLessThan(const char* key, Node** prev,
                                         Node* root, int top_level,
                                         int bottom_level) const {
  assert(top_level > bottom_level);
  int level = top_level - 1; // highest level
  Node* x = root; // root node
  // KeyIsAfter(key, last_not_after) is definitely false
  Node* last_not_after = nullptr;
  const DecodedKey key_decoded = compare_.decode_key(key); // decode the key
  while (true) {
    assert(x != nullptr);
    Node* next = x->Next(level); // next node on this level
    if (next != nullptr) {
      PREFETCH(next->Next(level), 0, 1);
    }
    assert(x == head_ || next == nullptr || KeyIsAfterNode(next->Key(), x));
    assert(x == head_ || KeyIsAfterNode(key_decoded, x));
    if (next != last_not_after && KeyIsAfterNode(key_decoded, next)) { // next.key < key
      // Keep searching in this list
      assert(next != nullptr);
      x = next; // advance: x = next
    } else {  // otherwise key is not to the right of next:  x < key <= next
      if (prev != nullptr) {
        prev[level] = x; // record x as the prev node on this level
      }
      if (level == bottom_level) { // on the bottom level, return x, the last node smaller than key
        return x;
      } else { // not on the bottom level: go down one level and keep searching
        // Switch to next list, reuse KeyIsAfterNode() result
        last_not_after = next; // remember next, then go down one level
        level--;
      }
    }
  }
}
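
The two searches in sections 3.8 and 3.9 share the same walk: move right while the next key is smaller, otherwise drop one level. A minimal single-threaded sketch follows. It is a simplification, not the RocksDB code: SNode with a std::vector of links is an assumed stand-in for the inline node layout, keys are doubles, and there is no early exit on equality, no prefetching, and no handling of concurrent inserts.

```cpp
#include <cassert>
#include <vector>

// Simplified node: next[i] is the successor on level i.
struct SNode {
  double key;
  std::vector<SNode*> next;
  SNode(double k, int height) : key(k), next(height, nullptr) {}
};

// First node with key >= target, or nullptr if there is none.
inline SNode* FindGreaterOrEqual(SNode* head, double target) {
  SNode* x = head;
  int level = static_cast<int>(head->next.size()) - 1;
  while (true) {
    SNode* nxt = x->next[level];
    if (nxt != nullptr && nxt->key < target) {
      x = nxt;      // next is still smaller: keep moving right on this level
    } else if (level == 0) {
      return nxt;   // bottom level: nxt is the first key >= target
    } else {
      --level;      // go down one level
    }
  }
}

// Last node with key < target; returns head when no such node exists.
inline SNode* FindLessThan(SNode* head, double target) {
  SNode* x = head;
  int level = static_cast<int>(head->next.size()) - 1;
  while (true) {
    SNode* nxt = x->next[level];
    if (nxt != nullptr && nxt->key < target) {
      x = nxt;      // advance
    } else if (level == 0) {
      return x;     // bottom level: x is the last key < target
    } else {
      --level;      // go down one level
    }
  }
}
```

On the two-level list from the comments (level 1: 1 3 5, level 0: 1 2 3 4 5), FindGreaterOrEqual for 4.5 ends on node 5, while FindLessThan ends on node 4 for 4.5 and on node 3 for 4. The head node carries no key of its own; its key field is never compared.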

3.9 Getting the last node

1 Start at the head node on the highest level and walk the list node by node (x).

2 If the next node of x is null:

2.1 On level 0, return x.

2.2 Otherwise go down one level and keep walking.

3 If the next node is not null, advance along the list.

// Get the last node
// 1   3   5
// 1 2 3 4 5   ,  returns 5
/*
 * The steps are listed before this function.
 * */
template <class Comparator>
typename InlineSkipList<Comparator>::Node*
InlineSkipList<Comparator>::FindLast() const {
  Node* x = head_; // head node
  int level = GetMaxHeight() - 1; // highest level
  while (true) {
    Node* next = x->Next(level); // walk the list
    if (next == nullptr) { // next is null
      if (level == 0) { // on the bottom level, return the current node
        return x;
      } else {
        // Switch to next list: not on the bottom level, so go down one level
        level--;
      }
    } else { // advance along the list
      x = next;
    }
  }
}

3.10 Estimating how many entries the skip list holds

EstimateCount returns an estimate of the number of entries whose key is smaller than key:

1 Start at the head node on the highest level, walk the list node by node, and count every node passed in count.

2 When key <= next.key (or the level ends) and the level is not 0, set count = count * kBranching_ and go down one level.

3 On level 0, return count.

kBranching_ defaults to 4, i.e. each level is expected to hold about 4 times as many nodes as the level above it.

// Estimate how many entries are smaller than key
// 1   3   5
// 1 2 3 4 5
/*
 * The steps are listed before this function.
 * */
template <class Comparator>
uint64_t InlineSkipList<Comparator>::EstimateCount(const char* key) const {
  uint64_t count = 0;

  Node* x = head_; // head node
  int level = GetMaxHeight() - 1; // highest level
  const DecodedKey key_decoded = compare_.decode_key(key); // decoded key
  while (true) {
    assert(x == head_ || compare_(x->Key(), key_decoded) < 0);
    Node* next = x->Next(level); // next node on this level
    if (next != nullptr) {
      PREFETCH(next->Next(level), 0, 1);
    }
    if (next == nullptr || compare_(next->Key(), key_decoded) >= 0) { // key <= next.key, or end of this level
      if (level == 0) { // on level 0, return count
        return count;
      } else {
        // Switch to next list
        count *= kBranching_; // not on level 0: scale count by the branching factor
        level--; // go down one level
      }
    } else { // next is to the left of key: advance along the list
      x = next;
      count++;
    }
  }
}
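
The counting loop can be exercised on a small hand-built list. As before this is only a sketch under assumptions: SNode is a simplified node with a vector of links, keys are doubles, and the branching factor is passed in as a parameter instead of being the member kBranching_.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified node: next[i] is the successor on level i.
struct SNode {
  double key;
  std::vector<SNode*> next;
  SNode(double k, int height) : key(k), next(height, nullptr) {}
};

// Estimate how many keys are smaller than target. Every node passed on a
// level counts once; each step down multiplies the running count by the
// branching factor, because one node on a level is expected to stand for
// `branching` nodes on the level below.
inline uint64_t EstimateCount(SNode* head, double target, uint64_t branching) {
  uint64_t count = 0;
  SNode* x = head;
  int level = static_cast<int>(head->next.size()) - 1;
  while (true) {
    SNode* nxt = x->next[level];
    if (nxt == nullptr || nxt->key >= target) {
      if (level == 0) {
        return count;
      }
      count *= branching;  // scale up before descending
      --level;
    } else {
      x = nxt;             // pass one more node on this level
      ++count;
    }
  }
}
```

The result is an estimate, not an exact count: on the list with level 1 = 1 3 5 and level 0 = 1 2 3 4 5 and a branching factor of 2, the estimate for keys below 4.5 is 5 although only 4 keys are smaller.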

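3.11 Skip list insert

Inserting into the skip list uses a helper structure called a splice. The name means "to join together", which fits nicely.

The code is as follows:

// Memory layout: [Node{Node*[1]}, key]. Given the address of key, how do we get the start of the Node to its left?
// A Node is sizeof(Node*) bytes, so the Node starts one Node* before the key address:
// Node* x = reinterpret_cast<Node*>(const_cast<char*>(key)) - 1;
// To link the Node into the skip list, the splice helps: for a node of height 2, say, it holds
// the node preceding the insert position on each level, and the Node is linked in after each of them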
template <class Comparator>
template <bool UseCAS>
bool InlineSkipList<Comparator>::Insert(const char* key, Splice* splice,
                                        bool allow_partial_splice_fix) {
  Node* x = reinterpret_cast<Node*>(const_cast<char*>(key)) - 1;  // recover the Node pointer from the key address

The memory layout is [Node{Node*[1]}, key]. Given the address of key, how do we get the start address of the Node to its left?

A Node occupies sizeof(Node*) bytes, so the Node starts one Node* before the key address.

The node address is therefore Node* x = reinterpret_cast<Node*>(const_cast<char*>(key)) - 1;

   if (splice->height_ < max_height) {
    // Either splice has never been used or max_height has grown since
    // last use.  We could potentially fix it in the latter case, but
    // that is tricky.
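    splice->prev_[max_height] = head_; // sentinel slot above the top level: prev is the head node
    splice->next_[max_height] = nullptr; // sentinel slot above the top level: next is null
    splice->height_ = max_height; // the splice now covers max_height levels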
    recompute_height = max_height;
    }
  if (recompute_height > 0) {
    RecomputeSpliceLevels(key_decoded, splice, recompute_height);
  }

If the splice does not hold enough information and has to be recomputed, RecomputeSpliceLevels does it.

/*
 *
 * Starting from splice->prev_[max] (the head node), search level max-1 for the prev and next of key
 * and store them in splice->prev_[max-1] and splice->next_[max-1].
 * Then start from splice->prev_[max-1] to search level max-2, store the result at index max-2, and so on.
 *
 * 3                  18    level 2
 * 3    4     10      18    level 1
 * 3    4  7  10      18    level 0
 * The key to insert is 11:
 * on level 2: prev = 3, next = 18
 * on level 1: prev = 10, next = 18
 * on level 0: prev = 10, next = 18
 * */
template <class Comparator>
void InlineSkipList<Comparator>::RecomputeSpliceLevels(const DecodedKey& key,
                                                       Splice* splice,
                                                       int recompute_level) {
  assert(recompute_level > 0);
  assert(recompute_level <= splice->height_);
  for (int i = recompute_level - 1; i >= 0; --i) { // for each level below recompute_level: start from the prev/next found on level i + 1 and store the result for level i
    FindSpliceForLevel<true>(key, splice->prev_[i + 1], splice->next_[i + 1], i,
                       &splice->prev_[i], &splice->next_[i]);
  }
}

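The splice holds two pointer arrays, prev_[] and next_[], which record where the key is to be inserted: the prev node and the next node of the insert position on every level.

For example, when inserting 11 into the list below, the splice's prev array is [10, 10, 3] and its next array is [18, 18, 18] (index 0 is level 0):

3 18 (level 2)

3 4 10 18 (level 1)

3 4 7 10 18 (level 0)

On level 2: prev = 3, next = 18

On level 1: prev = 10, next = 18

On level 0: prev = 10, next = 18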

for (int i = 0; i < height; ++i) {
      if (i >= recompute_height &&
          splice->prev_[i]->Next(i) != splice->next_[i]) {
        FindSpliceForLevel<false>(key_decoded, splice->prev_[i], nullptr, i,
                                  &splice->prev_[i], &splice->next_[i]);
      }
      // Checking for duplicate keys on the level 0 is sufficient
      if (UNLIKELY(i == 0 && splice->next_[i] != nullptr &&
                   compare_(x->Key(), splice->next_[i]->Key()) >= 0)) {  // level 0: next.key <= x.key, e.g. inserting 5 into [3,5]
        // duplicate key
        return false;
      }
      if (UNLIKELY(i == 0 && splice->prev_[i] != head_ &&
                   compare_(splice->prev_[i]->Key(), x->Key()) >= 0)) { // level 0: x.key <= prev.key, e.g. inserting 3 into [3,5]
        // duplicate key
        return false;
      }
      assert(splice->next_[i] == nullptr ||
             compare_(x->Key(), splice->next_[i]->Key()) < 0);
      assert(splice->prev_[i] == head_ ||
             compare_(splice->prev_[i]->Key(), x->Key()) < 0);
      assert(splice->prev_[i]->Next(i) == splice->next_[i]);
      x->NoBarrier_SetNext(i, splice->next_[i]); // on each of the node's levels, point x at splice->next_[i], the first node greater than key
      splice->prev_[i]->SetNext(i, x);               // then link prev to x, e.g. inserting 4 into [3,5]
    }

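The loop goes from level 0 up to the node's top level and links node x into each level. For a level i at or above recompute_height whose cached splice entry is stale (the next of prev_[i] is no longer next_[i]), the insert position for that level is recomputed first.

Duplicate keys are detected on level 0 in two cases. The first is splice.next.key <= x.key, e.g. inserting 5 into [3,5]. The second is x.key <= splice.prev.key, e.g. inserting 3 into [3,5].

Then x's next pointer on each level i is set to the splice's next_[i]. splice.prev_ holds the node to the left of the insert position, so setting splice.prev_[i].next[i] = x links x into that level at the right place.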
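
The role of the prev and next arrays can be shown with a small single-threaded insert. This is a sketch under assumptions and not the RocksDB implementation: SNode is a simplified node with a vector of links, keys are ints, the prev/next arrays are recomputed from the head on every call instead of being cached in a reusable Splice, and there is no CAS or memory ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified node: next[i] is the successor on level i.
struct SNode {
  int key;
  std::vector<SNode*> next;
  SNode(int k, int height) : key(k), next(height, nullptr) {}
};

// Link x into the list. x must not be taller than head.
// Returns false and leaves the list unchanged if the key already exists.
inline bool Insert(SNode* head, SNode* x) {
  const int max_level = static_cast<int>(head->next.size());
  assert(static_cast<int>(x->next.size()) <= max_level);

  // prev[i] / next[i]: the nodes on each side of the insert position on
  // level i. The search on level i resumes from the prev found on level i + 1.
  std::vector<SNode*> prev(max_level), next(max_level);
  SNode* p = head;
  for (int i = max_level - 1; i >= 0; --i) {
    while (p->next[i] != nullptr && p->next[i]->key < x->key) {
      p = p->next[i];
    }
    prev[i] = p;
    next[i] = p->next[i];
  }

  // Checking level 0 is enough to detect a duplicate key.
  if (next[0] != nullptr && next[0]->key == x->key) {
    return false;
  }

  for (std::size_t i = 0; i < x->next.size(); ++i) {
    x->next[i] = next[i];   // first point the new node forward
    prev[i]->next[i] = x;   // then make the predecessor point at it
  }
  return true;
}
```

Inserting a node with key 11 and height 2 into the three-level list 3, 4, 7, 10, 18 yields prev = [10, 10, 3] and next = [18, 18, 18], and only levels 0 and 1 are relinked. The head node's own key is never compared.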

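3.12 Insert with hint

  // Insert hints for each prefix.
  std::unordered_map<Slice, void*, SliceHasher> insert_hints_;
    // Insert with hint: the map remembers where keys with a given prefix were inserted,
    // so the next key with the same prefix finds its position quickly

  bool res = table->InsertKeyWithHint(handle, &insert_hints_[prefix]);

With hint-based insert, a map records where keys with certain prefixes were inserted into the skip list, so a later key with the same prefix finds its position quickly. The void* holds the splice structure introduced in section 3.11, which stores the position in the skip list of the keys with that prefix.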

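3.13 CPU prefetch

The FindLessThan function in section 3.9 contains the statement PREFETCH(next->Next(level), 0, 1). This is the CPU prefetch. PREFETCH is a macro: #define PREFETCH(addr, rw, locality) __builtin_prefetch(addr, rw, locality).

__builtin_prefetch is a GCC (GNU Compiler Collection) built-in function. Its purpose is data prefetching: bringing data from main memory into the cache before it is needed.

addr is the memory address to prefetch.

rw says what the prefetch prepares for: 0 (the default) for a read, 1 for a write.

locality is the degree of temporal locality, from 0 to 3. 0 means the data has no temporal locality and need not stay in the cache after the access; 3 (the default) means it should be kept in all cache levels; 1 and 2 mean low and moderate locality.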
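
A minimal use of the macro on an ordinary linked list is sketched below. ListNode and SumList are names assumed for this example only, and the fallback branch of the macro (for compilers without __builtin_prefetch) is an addition made here. The prefetch is purely a hint: the function returns the same result with or without it.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr, rw, locality) __builtin_prefetch(addr, rw, locality)
#else
#define PREFETCH(addr, rw, locality) ((void)0)  // no-op on other compilers
#endif

struct ListNode {
  long value;
  ListNode* next;
};

// Sum the list while asking the CPU to start loading the node two steps
// ahead. rw = 0 prepares for a read, locality = 1 marks low temporal locality.
inline long SumList(const ListNode* head) {
  long sum = 0;
  for (const ListNode* n = head; n != nullptr; n = n->next) {
    if (n->next != nullptr) {
      PREFETCH(n->next->next, 0, 1);  // may be null; a prefetch does not fault
    }
    sum += n->value;
  }
  return sum;
}
```

The pattern is the same as in the skip list search: while the current node is being compared, the memory for a later node is already on its way into the cache.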

Let's look again at FindLessThan from section 3.9, which issues PREFETCH(next->Next(level), 0, 1) on every step of its search loop.

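Example

As in the figure above, FindLessThan(10.5) returns the node with key=10. The search proceeds as follows:

step1: start at level 4 of node 1;

step2: move to level 4 of node 3;

step3: the level-4 next of node 3 is node 5, key=18, which is greater than 10.5, so go down to level 3 of node 3;

step4: the level-3 next of node 3 is node 5, key=18, greater than 10.5, so go down to level 2 of node 3;

step5: the level-2 next of node 3 is node 4, key=11, greater than 10.5, so go down to level 1 of node 3;

step6: the level-1 next of node 3 is node 4, key=11, greater than 10.5, so go down to level 0 of node 3;

step7: the level-0 next of node 3 is node 4, key=11, greater than 10.5, and the level is 0, so return the prev node, which is node 3.

The whole search used 5 pointers of node 3. Fetching each one from main memory on demand is slow. Issuing PREFETCH(next->Next(level), 0, 1) around step2, when node 3 is reached, brings those pointers into the CPU cache ahead of time and speeds up the reads that follow.

A prefetch loads a whole cache line (for example 64 bytes) around addr, and the skip list is traversed from high levels to low levels.

That is why the node is laid out with higher levels at lower addresses and lower levels at higher addresses: the pointers are contiguous and suit CPU prefetching.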

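3.14 Optimizations

1 kTypeRangeDeletion is used for range deletes. Suppose there are 5 deletes: delete 1, delete 2, delete 3, delete 4, delete 5. Inserting 5 delete entries into the memtable causes noticeable space amplification; a single delete entry for the range (1-5) does the job and reduces it.

2 RocksDB places the highest-level Node* at the lowest address to make full use of CPU prefetch.

3 RocksDB has InsertWithHint: a map records where keys with certain prefixes were inserted into the skip list, so a later key with the same prefix finds its position quickly.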