I've found that using RocksDB's merge operator is very slow when a key accumulates a large number of merge operands.
My program uses a simple associative merge operator (based on the upstream StringAppendOperator) that concatenates the values for a given key with a delimiter. Merging all the keys and finishing the run takes a very long time.
PS: I built RocksDB from source (latest master). I'm not sure whether I'm missing something obvious.
Below is a minimal reproducible example that issues 5 million merge operations; the count can be adjusted by changing the limit of the for loop.
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <string>
#include <utility>

#include <rocksdb/db.h>
#include "rocksdb/merge_operator.h"
// Based on: https://github.com/facebook/rocksdb/blob/main/utilities/merge_operators/string_append/stringappend.h#L13
class StringAppendOperator : public rocksdb::AssociativeMergeOperator
{
 public:
  // Constructor: specify the delimiter
  explicit StringAppendOperator(std::string delim) : delim_(std::move(delim)) {}

  bool Merge(const rocksdb::Slice &key, const rocksdb::Slice *existing_value,
             const rocksdb::Slice &value, std::string *new_value,
             rocksdb::Logger *logger) const override;

  static const char *kClassName() { return "StringAppendOperator"; }
  static const char *kNickName() { return "stringappend"; }
  [[nodiscard]] const char *Name() const override { return kClassName(); }
  [[nodiscard]] const char *NickName() const override { return kNickName(); }

 private:
  std::string delim_;  // The delimiter is inserted between elements
};
// Implementation of the merge operation (concatenates two strings)
bool StringAppendOperator::Merge(const rocksdb::Slice & /*key*/,
                                 const rocksdb::Slice *existing_value,
                                 const rocksdb::Slice &value, std::string *new_value,
                                 rocksdb::Logger * /*logger*/) const
{
  // Clear *new_value for writing.
  assert(new_value);
  new_value->clear();
  if (!existing_value)
  {
    // No existing_value. Set *new_value = value
    new_value->assign(value.data(), value.size());
  }
  else
  {
    // Generic append (existing_value != nullptr).
    // Reserve *new_value to the correct size, then concatenate.
    new_value->reserve(existing_value->size() + delim_.size() + value.size());
    new_value->assign(existing_value->data(), existing_value->size());
    new_value->append(delim_);
    new_value->append(value.data(), value.size());
    // Slice::data() is not null-terminated, so copy it before streaming it out.
    std::cout << "Merging " << value.ToString() << "\n";
  }
  return true;
}
int main()
{
  rocksdb::Options options;
  options.create_if_missing = true;
  options.merge_operator.reset(new StringAppendOperator(","));

  // Tried a variety of settings
  options.max_background_compactions = 16;
  options.max_background_flushes = 16;
  options.max_background_jobs = 16;
  options.max_subcompactions = 16;

  rocksdb::DB *db{};
  auto s = rocksdb::DB::Open(options, "/tmp/test", &db);
  assert(s.ok());

  rocksdb::WriteBatch wb;
  for (uint64_t i = 0; i < 2500000; i++)
  {
    wb.Merge("a:b", std::to_string(i));
    wb.Merge("c:d", std::to_string(i));
  }
  s = db->Write(rocksdb::WriteOptions(), &wb);
  assert(s.ok());
  s = db->Flush(rocksdb::FlushOptions());
  assert(s.ok());

  rocksdb::ReadOptions read_options;
  rocksdb::Iterator *it = db->NewIterator(read_options);
  for (it->SeekToFirst(); it->Valid(); it->Next())
  {
    std::cout << it->key().ToString() << " --> " << it->value().ToString() << "\n";
  }

  delete it;
  delete db;
  std::filesystem::remove_all("/tmp/test");
  return 0;
}
1 Answer
I shared your question on the Speedb Hive Discord. The reply is from our founder and chief scientist, Hilik:
"The merge operator is very useful for getting fast write response times; alas, reads need to read the original data and apply the merges in order. This operation can be very expensive, especially for strings that have to be copied and appended on every merge. The simplest way around this is eventually to do a read-modify-write. Doing it at the application level is possible, but can be problematic (if two threads may do it at the same time). We are thinking about ways to resolve this during compaction, and would be happy to work with you on a PR..."
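To make that workaround concrete, here is a minimal sketch of what an application-level read-modify-write could look like. This is my illustration, not code from the question or from RocksDB's utilities: AppendValue is a hypothetical helper, and it assumes only one thread writes a given key at a time.

#include <string>

#include <rocksdb/db.h>

// Hypothetical helper (not part of RocksDB): replaces a Merge() call with
// Get + append + Put. Assumes a single writer per key; with concurrent
// writers the read-modify-write would need external synchronization.
rocksdb::Status AppendValue(rocksdb::DB *db, const std::string &key,
                            const std::string &value, const std::string &delim)
{
  std::string existing;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key, &existing);
  if (s.ok())
  {
    existing.append(delim);
    existing.append(value);
    return db->Put(rocksdb::WriteOptions(), key, existing);
  }
  if (s.IsNotFound())
  {
    // First write for this key: store the value directly.
    return db->Put(rocksdb::WriteOptions(), key, value);
  }
  return s;  // Propagate any other error.
}

This moves the concatenation cost to write time, so point reads and iteration stay cheap; the merge-operator version makes the opposite trade-off. Another thing worth trying (not benchmarked here) is to fold the operands before reading, for example with db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr), so the iterator does not have to apply millions of merge operands at read time.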
Hope this helps. Join the Discord server to take part in this discussion and many other interesting related topics. Here is a link to the discussion on your topic.