| Feature Name | Tracking Issue | Date | Author |
|---|---|---|---|
| Remote WAL Purge | https://github.com/GreptimeTeam/greptimedb/issues/5474 | 2025-02-06 | Yuhan Wang <profsyb@gmail.com> |
Summary
This RFC proposes a method for purging remote WAL in the database.
Motivation
Currently, only local WAL entries are purged when a region flushes; remote WAL entries are never purged.
Details
sequenceDiagram
Region0->>Kafka: Last entry id of the topic in use
Region0->>WALPruner: Heartbeat with last entry id
WALPruner->>+WALPruner: Time Loop
WALPruner->>+ProcedureManager: Submit purge procedure
ProcedureManager->>Region0: Flush request
ProcedureManager->>Kafka: Prune WAL entries
Region0->>Region0: Flush
Steps
Before purge
Before purging remote WAL, metasrv needs to know:
- `last_entry_id` of each region.
- `kafka_topic_last_entry_id`, which is the last entry id of the topic in use. It can be updated lazily and is needed when a region has an empty memtable.
- Kafka topics that each region uses.
These states are maintained through:
- Heartbeat: the datanode sends `last_entry_id` to metasrv in the heartbeat. For regions with an empty memtable, `last_entry_id` should equal `kafka_topic_last_entry_id`.
- Metasrv maintains a topic-region map to know which region uses which topic.
`kafka_topic_last_entry_id` is maintained by the region itself. The region updates the value after k heartbeats if its memtable is empty.
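The bookkeeping above can be sketched roughly as follows. This is a minimal illustration, not the actual metasrv code: `WalState`, `register_region`, `on_heartbeat`, and `reported_last_entry_id` are hypothetical names, and entry ids and region ids are assumed to be plain `u64` values.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical aliases; the real types in the codebase may differ.
type RegionId = u64;
type EntryId = u64;

/// Value a datanode would put into the heartbeat for one region.
/// A region with an empty memtable reports the topic's last entry id.
fn reported_last_entry_id(
    region_last_entry_id: EntryId,
    kafka_topic_last_entry_id: EntryId,
    memtable_is_empty: bool,
) -> EntryId {
    if memtable_is_empty {
        kafka_topic_last_entry_id
    } else {
        region_last_entry_id
    }
}

/// In-memory view of the states metasrv needs before a purge (assumed shape).
#[derive(Default)]
struct WalState {
    /// `last_entry_id` of each region, refreshed by heartbeats.
    last_entry_ids: HashMap<RegionId, EntryId>,
    /// Topic-region map: which regions use which Kafka topic.
    topic_regions: HashMap<String, HashSet<RegionId>>,
}

impl WalState {
    /// Records that `region_id` writes its WAL to `topic`.
    fn register_region(&mut self, topic: &str, region_id: RegionId) {
        self.topic_regions
            .entry(topic.to_string())
            .or_default()
            .insert(region_id);
    }

    /// Applies the `last_entry_id` carried by a heartbeat.
    fn on_heartbeat(&mut self, region_id: RegionId, last_entry_id: EntryId) {
        self.last_entry_ids.insert(region_id, last_entry_id);
    }
}
```

In this sketch a region with an empty memtable never holds back the purge, because the id it reports tracks the end of the topic.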
Purge procedure
Running the purge as a procedure lets us reuse the existing procedure framework to handle locks. The purge procedure is quite similar to the region migration procedure.
After a period of time, metasrv will submit a purge procedure to ProcedureManager. The purge will apply to all topics.
The procedure is divided into the following stages:
- Preparation:
  - Retrieve `last_entry_id` of each region from kvbackend.
  - Choose regions that have a relatively small `last_entry_id` as candidate regions, which means we need to send a flush request to these regions.
- Communication:
  - Send flush requests to candidate regions.
- Purge:
  - Choose the proper entry id to delete for each topic. The entry should be the smallest `last_entry_id - 1` among all regions.
  - Delete legacy entries in Kafka.
  - Store the `last_purged_entry_id` in kvbackend. It should be locked to prevent other regions from replaying the purged entries.
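The two selection steps in these stages can be sketched as below. This is only an illustration under stated assumptions: `flush_candidates`, `prunable_entry_id`, and the `max_lag` threshold are hypothetical. The RFC does not define "relatively small", so the sketch interprets it as lagging the newest `last_entry_id` in the topic by more than `max_lag`. It also assumes, conservatively, that a topic is skipped when any of its regions has not reported yet.

```rust
use std::collections::HashMap;

// Hypothetical aliases; the real types in the codebase may differ.
type RegionId = u64;
type EntryId = u64;

/// Preparation: regions whose `last_entry_id` lags the newest one in the topic
/// by more than `max_lag` become flush candidates (assumed interpretation).
fn flush_candidates(
    last_entry_ids: &HashMap<RegionId, EntryId>,
    topic_regions: &[RegionId],
    max_lag: u64,
) -> Vec<RegionId> {
    let newest = topic_regions
        .iter()
        .filter_map(|region| last_entry_ids.get(region))
        .copied()
        .max()
        .unwrap_or(0);
    let mut candidates: Vec<RegionId> = topic_regions
        .iter()
        .copied()
        .filter(|region| match last_entry_ids.get(region) {
            // `newest` is the maximum, so the subtraction cannot underflow.
            Some(id) => newest - *id > max_lag,
            None => false,
        })
        .collect();
    candidates.sort_unstable();
    candidates
}

/// Purge: the entry id to delete up to is the smallest `last_entry_id - 1`
/// among all regions of the topic. Returns `None` if the topic has no regions,
/// a region has not reported, or there is nothing to delete.
fn prunable_entry_id(
    last_entry_ids: &HashMap<RegionId, EntryId>,
    topic_regions: &[RegionId],
) -> Option<EntryId> {
    let mut smallest: Option<EntryId> = None;
    for region in topic_regions {
        let id = *last_entry_ids.get(region)?;
        smallest = Some(smallest.map_or(id, |current| current.min(id)));
    }
    smallest?.checked_sub(1)
}
```

Flushing the candidates first matters because the slowest region bounds what can be deleted: once it flushes and reports a larger `last_entry_id`, the prunable entry id for the whole topic moves forward.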
After purge
After a purge, some regions may have a `last_entry_id` smaller than the entry we just deleted. This is legal, since we only delete entries that are no longer needed.
When a region restarts, it should query the `last_purged_entry_id` from metasrv and replay from `min(last_entry_id, last_purged_entry_id)`.
Error handling
No persisted states are needed since all states are maintained in kvbackend.
Retry when retrieving metadata from kvbackend fails.
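A bounded retry around a metadata read might look like the sketch below. The `retry` helper and the `max_attempts` limit are hypothetical and only illustrate the idea; the procedure framework may already provide its own retry mechanism, in which case that should be preferred.

```rust
/// Calls `op` until it succeeds or `max_attempts` calls have been made,
/// returning the first success or the last error.
fn retry<T, E>(max_attempts: usize, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    assert!(max_attempts > 0, "at least one attempt is required");
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise try again.
            Err(_) => attempt += 1,
        }
    }
}
```

Because no state is persisted outside kvbackend, repeating the read is safe: a retried attempt simply observes whatever kvbackend currently holds.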
Alternatives
The purge could be triggered by the size of the WAL entries instead of a fixed period of time, which may be more efficient.