From 3f45a0d3374233fb5d1bbab2bb9c4e74bbd9c22a Mon Sep 17 00:00:00 2001 From: "Lei, HUANG" <6406592+v0y4g3r@users.noreply.github.com> Date: Mon, 6 Feb 2023 22:15:53 +0800 Subject: [PATCH] docs: rfc for table compaction (#939) * doc: rfc for table compaction * docs: update compaction rfc --- docs/rfcs/2023-02-01-table-compaction.md | 92 ++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 docs/rfcs/2023-02-01-table-compaction.md diff --git a/docs/rfcs/2023-02-01-table-compaction.md b/docs/rfcs/2023-02-01-table-compaction.md new file mode 100644 index 0000000000..585151f5e1 --- /dev/null +++ b/docs/rfcs/2023-02-01-table-compaction.md @@ -0,0 +1,92 @@ +--- +Feature Name: "table-compaction" +Tracking Issue: https://github.com/GreptimeTeam/greptimedb/issues/930 +Date: 2023-02-01 +Author: "Lei, HUANG " +--- + +# Table Compaction + +--- + +## Background + +GreptimeDB uses an LSM-tree based storage engine that flushes memtables to SSTs for persistence. +But currently it only supports level 0. SST files in level 0 does not guarantee to contain only rows with disjoint time ranges. +That is to say, different SST files in level 0 may contain overlapped timestamps. +The consequence is, in order to retrieve rows in some time range, all files need to be scanned, which brings a lot of IO overhead. + +Also, just like other LSMT engines, delete/update to existing primary keys are converted to new rows with delete/update mark and appended to SSTs on flushing. +We need to merge the operations to same primary keys so that we don't have to go through all SST files to find the final state of these primary keys. + +## Goal + +Implement a compaction framework to: +- maintain SSTs in timestamp order to accelerate queries with timestamp condition; +- merge rows with same primary key; +- purge expired SSTs; +- accommodate other tasks like data rollup/indexing. + + +## Overview + +Table compaction involves following components: +- Compaction scheduler: run compaction tasks, limit the consumed resources; +- Compaction strategy: find the SSTs to compact and determine the output files of compaction. +- Compaction task: read the rows from input SSTs and write to the output files. + +## Implementation + +### Compaction scheduler + +`CompactionScheduler` is an executor that continuously polls and executes compaction request from a task queue. + +```rust +#[async_trait] +pub trait CompactionScheduler { + /// Schedules a compaction task. + async fn schedule(&self, task: CompactionRequest) -> Result<()>; + + /// Stops compaction scheduler. + async fn stop(&self) -> Result<()>; +} +``` + + + +### Compaction triggering + +Currently, we can check whether to compact tables when memtable is flushed to SST. + +https://github.com/GreptimeTeam/greptimedb/blob/4015dd80752e1e6aaa3d7cacc3203cb67ed9be6d/src/storage/src/flush.rs#L245 + + +### Compaction strategy + +`CompactionStrategy` defines how to pick SSTs in all levels for compaction. + +```rust +pub trait CompactionStrategy { + fn pick( + &self, + ctx: CompactionContext, + levels: &LevelMetas, + ) -> Result; +} +``` + +The most suitable compaction strategy for time-series scenario would be +a hybrid strategy that combines time window compaction with size-tired compaction, just like [Cassandra](https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/twcs.html) and [ScyllaDB](https://docs.scylladb.com/stable/architecture/compaction/compaction-strategies.html#time-window-compaction-strategy-twcs) does. + +We can first group SSTs in level n into buckets according to some predefined time window. Within that window, +SSTs are compacted in a size-tired manner (find SSTs with similar size and compact them to level n+1). +SSTs from different time windows are neven compacted together. +That strategy guarantees SSTs in each level are mainly sorted in timestamp order which boosts queries with +explict timestamp condition, while size-tired compaction minimizes the impact to foreground writes. + +### Alternatives + +Currently, GreptimeDB's storage engine [only support two levels](https://github.com/GreptimeTeam/greptimedb/blob/43aefc5d74dfa73b7819cae77b7eb546d8534a41/src/storage/src/sst.rs#L32). +For level 0, we can start with a simple time-window based leveled compaction, which reads from all SSTs in level 0, +align them to time windows with a fixed duration, merge them with SSTs in level 1 within the same time window +to ensure there is only one sorted run in level 1. \ No newline at end of file